Clinical biostatistics PDF

Author: Alvan Feinstein | Alvan R. Feinstein | Howard DeLong


CLINICAL BIOSTATISTICS Alvan R. Feinstein


CLINICAL BIOSTATISTICS

Alvan R. Feinstein, M.D.
Professor of Medicine and Epidemiology,
Yale University School of Medicine,
New Haven, Conn.

The C. V. Mosby Company
Saint Louis 1977

Copyright © 1977 by The C. V. Mosby Company

All rights reserved. No part of this book may be reproduced in any manner without written permission of the publisher.

Printed in the United States of America

Distributed in Great Britain by Henry Kimpton, London

The C. V. Mosby Company
11830 Westline Industrial Drive, St. Louis, Missouri 63141

Library of Congress Cataloging in Publication Data

Feinstein, Alvan R
Clinical biostatistics.

Originally appeared as essays in the journal Clinical pharmacology and therapeutics.
Includes bibliographies.
1. Medical statistics -- Addresses, essays, lectures. 2. Medical research -- Statistical methods -- Addresses, essays, lectures. I. Title. [DNLM: 1. Biometry. 2. Epidemiologic methods. HA29 F299c]
RA409.F36 610'.1'5195 77-3703
ISBN 0-8016-1563-1

CB/CB/CB 9 8 7 6 5 4 3 2 1

For my mother, Bella, and my wife, Linda, who have given me roots, wings, and love

PREFACE

When I began writing the series of essays called "Clinical biostatistics" in 1970, I thought I would run out of material in about a year. I knew I had some unorthodox things to say and some unconventional viewpoints to develop, but I believed the development would require only six or seven essays. As I became more deeply immersed in biostatistical ideas, however, I constantly found more to do. At every level of contemplation, ranging from the massive logistics of large-scale clinical trials, to the elaborate complexities of "retrospective case-control" studies, to such apparently simple questions as how (and why) to calculate a standard deviation, the world of biostatistics seemed beset with scientific problems that had received unsatisfactory solutions because the -statistics had been given more attention than the bio-. As I began grappling with those challenges, the essays continued to proliferate, until now almost 40 of them have appeared.

While the essays were making their bimonthly and later trimonthly appearances in Clinical Pharmacology and Therapeutics, readers of the journal were highly complimentary; and many urged me to publish the series as a book. At first I resisted this suggestion, mainly because I have never liked this type of autoanthology. The author of a book, I thought, should prepare a suitably renovated text, not just an unaltered collection of previously published papers.

My resistance to the suggestion eventually collapsed, however, under two sets of pressures: time and audience. As I kept wanting to prepare that new text while never finding the necessary time to do so, I realized that the only hope for achieving a book in the imminent future was to preserve the original essays. Furthermore, many readers kept assuring me that the original essays should remain intact, that their "spirit" might be lost in a revision, and that the book would be more enjoyable to read if it preserved the informality of the original prose.

Adding to these incentives for an "anthology" format were the fiscal concerns of The C. V. Mosby Company, publishers of the journal where the "Clinical biostatistics" essays have appeared. The cost of publishing the book could be substantially reduced if the essays were maintained in their original form, with each text and bibliography unchanged.

Accordingly, this book contains a collection of original essays from the "Clinical biostatistics" series. They have been rearranged as chapters, into a logical pattern that differs from the chronologic sequence in which they first appeared. A few of the original titles and many of the identifying numbers of the essays have been changed to conform with the current sequence of chapters. Otherwise, the texts and lists of references for each essay remain intact.

To keep the book from being too large, I have omitted a few of the essays that contained quantitative surveys of the medical and statistical literature, digressions into the ethics of clinical research and the teaching of statistics, or specific critiques of individual investigations. The remaining 29 essays have been divided into an introductory chapter that is followed by five major sections, each preceded by a brief commentary.

The essays are intended for people who have already developed an interest in biostatistical issues. The interest may have arisen spontaneously; it may have been necessitated by the demands or comments of a manuscript reviewer; or it may have been provoked by the efforts needed to understand the many mathematical machinations that are used in published reports of current research. I assume that the reader is aware of rudimentary statistical tactics, but is otherwise not particularly adept in mathematics and is possibly frightened by it. With that assumption, the goal is to enlighten and perhaps to entertain with the style of an essay, not to educate with the formality of a textbook.

Conventional textbooks and courses in biostatistics are usually devoted to the theoretical processes that produce such mathematical calculations as P values, confidence intervals, correlation coefficients, and regression equations. Amid the mathematical emphasis, almost no attention has been given to the basic scientific procedures used for planning research, obtaining data, and analyzing results. The aim of these essays is to provide supplemental reading for the many important topics that are omitted from conventional textbooks, and also some remedial reading for topics that usually receive inadequate consideration.

Because of the way the book has been assembled, it contains three features for which I apologize. The first is that the text regularly contains references to previous or forthcoming essays in the originally published series. Although useful liaisons for essays that were dispersed in time, many of the references will now appear in the wrong places in the rearranged series. Since the current text is identical to what originally appeared in the journal publications, these references could not be changed. The second flaw is that the original bibliographic citations have also, of necessity, been preserved at the end of each essay. This process, while making the citations easy to find, produces frequent redundancy in some of the listings. The third infelicitous feature is that certain ideas are mentioned repeatedly in different locations of the text. The repetition seemed desirable in a succession of individual essays spread over a 6-year period, but may be less appealing if the essays are read contiguously. I hope that readers will find these repetitions instructive rather than irritating.

In discussing the various challenges and imperfections of biostatistics, I have tried to keep the prose lively and have occasionally made it deliberately provocative. Most readers have said they enjoy this approach, but it has sometimes led to the accusation that I am antistatistical. This accusation has probably been [...] by anyone who has ever been discontent with the defects of any status quo. Like the established tenets of clinical medicine and epidemiology, the established creeds of statistics contain many infirmities. In pointing out the infirmities, I have always tried to offer constructive suggestions for improvement; I would hardly want to spend so much effort working in the domain of clinical biostatistics if I did not respect both the clinical bio- and the -statistics portions.

To do the kind of thinking and writing that have produced these essays, I have had many sources of support for which I want to express thanks. The Veterans Administration, at its hospital in West Haven, provided research aid for many years while I was Chief of the Eastern Research Support Center and, later, of the Cooperative Studies Program Support Center. For my activities at the Yale University School of Medicine, the National Center for Health Services Research and Development supplied grants for several projects from which many of these essays emerged as by-products. During a highly productive period from 1971-1973, as a visiting professor, I received professional hospitality and illuminating stimulation from the Department of Clinical Epidemiology and Biostatistics at McMaster University Medical Center in Hamilton, Ontario, Canada. For the past few years, the essays have been composed in my work as Director of the Yale Clinical Scholar Program, which is sponsored by the Robert Wood Johnson Foundation.

In addition to this institutional aid, I have been greatly helped by human talents and contributions. Before submitting the prepared essays for publication, I have relied on thoughtful appraisal and stringent evaluation from critics who are clinicians, epidemiologists, statisticians, or computer experts. In acknowledging my gratitude for their valuable help, I also herewith absolve them of any responsibility for the contents. They are Linda Marean Feinstein, Michael Gent, Charles A. Goldsmith, Moreson H. Kaplan, Donald Mainland, Walter A. Ramshaw, David L. Sackett, Helen L. Smits, Walter O. Spitzer, and Carolyn K. Wells.

I am also especially grateful to Dr. Walter Modell, Editor of Clinical Pharmacology and Therapeutics, for his constant encouragement and for the editorial freedom he has provided. For excellent performance in the tasks of typing the difficult combinations of prose and mathematical symbols, I thank Elizabeth Tartagni, Carrol Ludington, and Pamela Rowe.

Finally, I want to thank my wife, Linda, and our children, Miriam and Daniel, who have gently tolerated the many hours in which I was absent or secluded while working on these essays, and who have filled the nonwriting hours with warmth, affection, and joy.

Alvan R. Feinstein
New Haven, 1977

CONTENTS

1 Introduction and rationale

SECTION ONE
THE ARCHITECTURE OF COHORT RESEARCH

2 Statistics versus science in the design of experiments
3 Components of the research objective
4 Intake, maintenance, and identification
5 Subsequent implementation of the objective
6 Sources of 'transition bias'
7 Sources of 'chronology bias'
8 Credulous idolatry and randomized allocation
9 Consequences of 'compliance bias'

SECTION TWO
OTHER ARCHITECTURAL PROBLEMS

10 Statistical malpractice, and the responsibility of a consultant
11 Random sampling and medical reality
12 The rancid sample, the tilted target, and the medical poll-bearer
13 Ambiguity and abuse in the twelve different concepts of 'control'
14 The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research
15 On the sensitivity, specificity, and discrimination of diagnostic tests

SECTION THREE
PROBLEMS IN MEASUREMENT

16 On exorcizing the ghost of Gauss and the curse of Kelvin
17 The derangements of the 'range of normal'
18 How do we measure 'safety' and 'efficacy'?
19 The difficulties of pharmaceutical surveillance

SECTION FOUR
MATHEMATICAL MYSTIQUES AND STATISTICAL STRATEGIES

20 Permutation tests and 'statistical significance'
21 The direction of relationships, hypotheses, and probabilities
22 Sample size and the other side of 'statistical significance'
23 Problems in the summary and display of statistical data

SECTION FIVE
THE ANALYSIS OF MULTIPLE VARIABLES

24 On homogeneity, taxonomy, and nosography
25 A primer of multivariate analysis
26 The purposes of prognostic stratification
27 The process of prognostic stratification
28 Evaluation of a prognostic stratification
29 Additional tactics in prognostic stratification

INDEXES

Index of Authors
Index of Methodologic Topics
Index of Clinical and Other Practical Examples

CLINICAL BIOSTATISTICS

CHAPTER 1

Introduction and rationale

The "Clinical biostatistics" series began when I was invited to succeed Dr. Donald Mainland, who had retired from writing a bimonthly "column" on statistics for Clinical Pharmacology and Therapeutics. The first essay in the series contains my tribute to Dr. Mainland's many previous contributions to biostatistics and also describes the background philosophy with which the new series would be approached. The text was as follows.

ARF

Donald Mainland can be succeeded but not replaced. His training, timing, and temperament have made him a unique figure in the domain of medical statistics, and a tough act to follow.

In training, he was graduated in medicine with honors in 1925 at Edinburgh, where he was later awarded the Doctor of Science degree for his research in embryology and histology. After finishing medical school, he taught anatomy at Edinburgh for several years and then went to Canada. He worked at Manitoba from 1927 to 1930, when he left to become Professor and Chairman of the Department of Anatomy at Dalhousie University. His first publication in 1927, dealing with an uncommon abnormality in a muscle 13, was a harbinger of his subsequent concern with frequency distributions in biology. Within the next two years, he was evaluating the accuracy of techniques for estimating irregular anatomic areas. 14 Later on, during various embryologic investigations, he contemplated methods for assessing the size and volume of cellular structures. 19 Over the next few years, he applied his quantitative interests to measuring the forces of muscles 20,21; and then, in 1934, with a paper on "Chance and the blood count," 22 the quantitative anatomist started his metamorphosis into medical statistician. By 1936, after some additional research on blood cells and blood counts, he had begun to write on "Problems of chance in clinical work." 23 In 1938, he produced his first book on quantitative medicine, 24 and twelve years later, after continuous productivity in both biologic research and medical statistics, he accepted New York University's invitation to become Professor of Medical Statistics. From that position, with persistent intellectual growth and enormous experience, he has continued to provide practical enlightenment to his students, consultees, colleagues, and readers.

In timing, Dr. Mainland became interested in biologic statistics during an era when the analytic techniques were in primitive stages of conception and dissemination. He knew many of the early heroes in the contemporary statistical pantheon, and he became a pioneer physician in developing the modern relationship between statistics and medicine. After the first edition of his classic book, Elementary Medical Statistics, in 1952, he continued to produce a powerful array of creative, didactic, expository, and polemic publications on the use of statistics in medicine. With his textbook, now in its second edition, 26 and his many other writings, he has probably contributed as much as any single person to the statistical sensibility of clinical investigators in North America today.

This chapter originally appeared as "Clinical biostatistics I. A new name and some other changes of the guard." In Clin. Pharmacol. Ther. 11:135, 1970.

In temperament, he has managed to preserve the extraordinary virtue of common sense, despite his constant exposure to the abstract concepts, arcane models, and intellectual folderol that lurk in the statistician's world. Part of this virtue is attributable to Mainland's firm rooting in the realities of medical biology. He has not merely preached about biostatistical research; he has practiced it. During the development of his statistical interests, he constantly maintained his activities in biologic investigation, contributing, among other items, a textbook on anatomy 27, and he currently continues an active role in several large-scale clinical research projects, chiefly in rheumatoid arthritis.

But the greater part of Mainland's virtue is probably attributable to the man himself. Now near the age of retirement, he remains young in mind, in spirit, and in outlook. What other "older man," venerated and respected as he nears completion of his major work, is ready to recognize that "repetition of this theme during two or three decades, by others as well as myself, has had very little effect" 31; to confess that he is "technically unsophisticated" 32; to solicit disagreement and rebuttals to all of his comments; and to be receptive to new approaches for old problems. How many established "authorities" are brave enough to appraise their previous work with comments like these: "Grading all four . . . items together today, I would award a C, or perhaps a C+, but nothing higher," 28 or "I sometimes wonder how many more instances of stupidity I might dig up from the days when I was hypnotized by statistical techniques applied to pooled data." 30

My own first encounter with Mainland came in about 1960, when I discovered his publications entitled "Notes from a Laboratory of Medical Statistics," a group of documents still cherished by the recipients who were lucky enough to learn about the "Notes," and to satisfy Mainland's hardy standards for the mailing list. ("There is a limit of 3,000 to the number of 'Notes' that can be issued. . . . We are sorry that we can no longer replace 'Notes' that have disappeared after they have been received. . . . Agencies that require formal invoicing are also too much trouble to deal with.") These "Notes," which Mainland issued periodically whenever he found time to do so, were the ancestors of the more recent "Statistical ward rounds" in this Journal, and the "Notes on Biometry in Medical Research," which have appeared under the sponsorship of the Veterans Administration.

I still remember the enchantment of discovering those early "Notes." For several years previously, my own clinical research had brought me increasingly in contact with statistical procedures, and my manuscripts were being frequently sent to statisticians for review. Since many of the reviewers' comments were either clinically absurd or statistically incomprehensible, I had begun, in self-defense, to read textbooks on statistics. Like Mainland's, my education in statistics is largely self-acquired; but unlike most physicians, I was not intimidated by the arithmetic, since I had done graduate work in pure mathematics before entering medical school. What I found in the textbooks was sometimes enlightening, but more often appalling.

From my previous activities in pure mathematics and in biologic science, I had become accustomed to a rigorous type of either logical or empirical documentation for any assertion. In pure mathematics, such an assertion was called a theorem, and the rigorous documentation was a sequence of logically cohesive statements called a proof. In biologic science, the assertion was called a hypothesis, and the rigorous documentation was a collection of empirical data called observed evidence. But most of the statistical textbooks seemed to contain neither a logical nor an empirical documentation for the assertions. The texts were often like cookbooks, containing a series of instructive recipes on how to tabulate data and perform certain "tests of significance." These instructions were seldom accompanied by a proof of their validity, by any references to where a proof might be found, or by any empirical data to demonstrate that the procedures would remain valid when their prerequisite conditions were violated.

From time to time, I would explore the literature of mathematical statistics, looking for either the rational logic or the scientific evidence to support what appeared in the "cookbooks," but I was seldom successful. Even with a mathematical background, I could not understand many of the esoteric formulations; and my biologic background made me wary of the unrealistic assumptions that underlay many of the mathematical arguments. I did not know at the time that some of these mathematical defects were so commonplace as to arouse public lament by distinguished statisticians. Said Harold Hotelling 11 in 1960:

The custom of omitting proofs, which would not be tolerated in pure mathematics beyond a very limited extent, is common in the teaching of statistics, and is excused on the grounds that the students do not know enough mathematics to understand the proofs. Perhaps a better reason is that the teachers, in some cases and the authors of the textbooks, do not understand the proofs. In some instances no proofs exist, and in some instances no genuine proofs can exist, because the methods taught are demonstrably wrong.

Aware of some of the many problems that pervade work in clinical medicine, I had expected to find that the cerebral grass would be greener in the statistician's yard. To my dismay, I found many weeds being cultivated and labeled as flowers. Apart from my dissatisfactions with the absence of proofs for didactic assertions, I was disturbed by the lack of real attention to the intellectual consequences of the biologic component of "biostatistics." Here were men of high professional and intellectual competence. How could they so blithely ignore the effects of their erroneous assumptions that most clinical data came from "random samples" with "normal distributions" and "continuous variables"? How could they discuss the design of clinical experiments by extrapolating from a brewery vat or an agricultural field to a human population? How could they give so much emphasis to procedures for purely statistical analysis, while showing so little rigorous concern for such basic issues in scientific logic as specifying the fundamental question, determining whether the research would answer that question, choosing an appropriate control group, checking the reliability of the data, establishing reproducible criteria for subjective evaluations, and ascertaining whether the investigated population was both homogeneous enough for everyone to be "lumped" together and selected in a manner that justified the idea of "randomness"?

Wandering among statistical doctrines that often seemed neither mathematically validated, biologically cogent, nor intellectually challenged, I came upon Mainland's "Notes." The man seemed to know that biostatistics ought to pertain to biology, and he seemed to know about biology. He sounded like someone who had learned about research not by agglomerating theories of probability, or massaging data whose origins he had never observed, but by actually feeling a tissue, handling an animal, calibrating an instrument, looking through a microscope, or talking to a patient.

An effective stimulant for the intellectual torpor of the textbooks, Mainland's "Notes" made biostatistics vivid, vital, and exciting. He brought into open view many of the critical issues that lay hidden beneath glib traditional preconceptions; he helped demonstrate that many statistical models were inappropriate and misleading for biology; and he provided a medium in which biologic scientists, insecure and anxious in their heretical suspicions about conventional statistical dogmas, could take comfort from seeing that other scientists and statisticians shared the same heresies. Here, at last, was a statistician who could talk sensibly about clinical research. (He could also sometimes talk too long, but verbosity is an accepted occupational hazard of biostatisticians. Mainland was readily forgiven for occasional ventures into prolix prose, and his successor in these columns hopes that future readers will be equally tolerant.)

Mainland was the first medical statistician I had encountered who acted as though "bio" were an integral part of "biostatistics," instead of a prefix attached casually to "statistics" for the sake of an occasional teaching exercise or a book intended for graduate students in biologic domains. Since that time, I have met a few other statisticians who have truly become biostatisticians, but Mainland remains a pioneer both in migrating in the unusual professional direction from biology to statistics, and in exemplifying the modern fusion of biology with statistics. Clinical investigators today owe him an inestimable debt of gratitude for the contributions he has made to our domain by preserving thoughtful realism in our statistical outlook. Many people believe that his book, Elementary Medical Statistics, could benefit from tighter organization and greater succinctness, but it is still the only such publication that gives at least as much attention to the medical issues as to the statistical maneuvers of medical statistics.

As clinical investigators become increasingly involved in biostatistics, as we begin to appreciate its scope, accept its challenges, educate our statistical co-workers in its problems, and contribute creative solutions to those problems, Donald Mainland will remain one of our honored "founding fathers." He is a physician who helped establish the basic concept on which we must now build: the concept that biostatistics can best be developed neither from abstract theory in statistics nor from imprecise anecdotage in biology, but from a coordinated integration of perceptive observation and thinking in both.

In succeeding Dr. Mainland as master of ceremonies for these columns, I hope to preserve his basic outlook and philosophic standards, although I shall undoubtedly introduce some deviations of my own, because our specialized interests and training have been so different. His prestatistical domain was anatomy; mine has been clinical medicine. His basic work for almost two decades has been centered in a department of medical statistics; mine has been (and remains) centered in a department of internal medicine. He became intimately familiar with many basic aspects of the mathematical precepts of statistical tactics; my acquaintance with some of these precepts is tenuous, and I shall regularly ask my statistical colleagues for help when the discussions get into issues with which I am relatively unfamiliar.

Since Dr. Mainland's packs of cards and barrels of discs have not been transmitted as a legacy of this job, I shall probably use a computer for many of the exercises in random number selection that he would have consigned to his trusty manual companions. I shall probably also call upon the computer for certain new activities that it now makes possible in modern biostatistics.

One of the main challenges will be to keep the column at least as interesting, informative, and provocative as Mainland made it. Connoisseurs of the Mainland style will recall that he often goads his readers deliberately, hoping to elicit further discussion. (Example: "I hope that I have trodden on some . . . corns hard enough to initiate a foot-to-brain-to-hand reflex that will . . . produce a defense of their official methodology." 29) My own tactics in provocation may be somewhat different and occasionally inadvertent, but I hope to preserve the principle that all of us need vigilant prodding to avoid or destroy complacency. The pace of science and technology has become too rapid for anyone to maintain prolonged intellectual comfort about established axioms, concepts, or other beliefs that have not been regularly subjected to intensive scrutiny and skeptical reappraisal.

To help augment the role of this column as a medium of vigorous communicative exchange and intellectual growth, I plan to invite various guests, either as proponents of their views or in rebuttal of mine, to become the "columnist" from time to time. The columns will be titled in a numbered sequence for my own papers, but a different designation will be used to accommodate other authors. I also hope that readers will frequently write to express their agreement or dissent about anything that appears here, and I would plan to have the letters (with the author's name omitted, if so desired) become a source of lively discourse in future columns. The statistical problems with which we struggle in medical work are too numerous and too important to be resolved without an abundance of argument. I hope that the arguments from all the people who contribute to these proceedings will be responsible, thoughtful, clearly written, and prepared in an atmosphere of light rather than heat, but arguments nonetheless.

Readers are invited not only to express opinions about what has appeared, but also to make suggestions about topics for future discussion. My ideas about choice of topics will come from several sources: (1) personal adventures during my own research activities; (2) review and occasional revival of ideas expressed previously by Dr. Mainland and by other people who have written about biostatistics; (3) new stimulation from clinical projects encountered at the Veterans Administration Research Support Centers, which are currently asked to help prevent or remedy biostatistical maladies in hundreds of biomedical research tasks each year; and (4) comments from readers. Those first three sources of input have already provided the topics planned for discussion in the next few columns, but the slots are open thereafter, and suggestions will be happily received.

One immediately obvious change is in the title of this column. Many leaders are "done in" by their successors, and Dr. Mainland should not have to worry about being blamed for my mistakes, misconceptions, or mischief. To give him that freedom of responsibility, and also to allow him to use "Statistical ward rounds" as a title for a possible book, the name of these essays has been changed to "Clinical biostatistics." I know that many readers will prefer the older title; the new one seems more formal, somber, and sesquipedalian, but it was the least of the available evils in nomenclature, and I hope that these new "rounds" will retain the appealing informality and free-wheeling intellectual fun of their ancestors.

There are more profound reasons, however, for a change that brings clinical into juxtaposition with biostatistics. In quantitative nomenclature, the statistics part of biostatistics occupies ten letters and the bio only three; the addition of eight clinical letters to the total phrase may help restore nominal as well as conceptual balance. More importantly, however, the domain of biostatistics is currently beset with many intellectual maladies that I believe can be remedied only if clinical biologists begin to make active contributions to the domain. These maladies, which arise not in the contents of statistical thinking but in the way statistical concepts are applied to other disciplines, have recently become subjects of public comment by leading statisticians:

John W. Tukey38

—

what was going on I was have been both ignorant and extremely superficial. It is this many-times-repeated experience that has led me to assert that mathematics has often chosen to ignore the careful examination and exposition of the methods it

explicitly to a machine

revealed

uses.

A teacher of biochemistry does not find it intolerable to say, "I don't know." Nor does a physicist. . . . Why should not statisticians do the same? . . . Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made more precise. . . .

J.

to

G. Skellam

7

.

.

have recently been

discontents

have been playing

role they

40

ever

years,

since

R.

more than

for

Fisher's 6

A.

epochal book made biologists begin the frequent search for statistical advice.

—

Valuable information which affects common-sense judgments tends to be ignored when formal statistical tools are employed along con.

.

Other

expressed by statisticians about the methods used to prepare consultants for the

Among

comments have been the

the

fol-

.

ventional lines.

.

.

Surprisingly

.

attention

little

lowing:

is

what is often a much more serious source of error and deception, the defects of the model itself. There is an important difference of emphasis between the application of mathematics to biology, and the mathematization of biology, and it is the latter which needs the

J.

most encouragement, for

or

normally given

G. Skellam' 7

—

to

.

.

...

difficulties lie.

.

I

it

here that the real

is

am somewhat

attribute

I

to the

way

(undesirable)

this

that statistics

is

largely

attitude

usually taught

mathematical discipline of great

—

as a

intrinsic interest

imparted to talented students who unfortunately have rarely had proper training in natural science

hand experience of

first

research.

scientific

disturbed by

thought

that the exalted status of mathemight possibly exercise ( an unintentional brand of tyranny over other ways

M. Zelen '—

of thinking.

most schools, seems so far removed from reality, that a heavy dose may be too toxic

the

matics

.

.

.

.

do not

.

!;

.

—

The

criticize statistical .

theory as such or .

.

The

trouble

is

methods are often used thoughtlessly

experimental work, for instance, a common abuse is to use a statistical test to try to "prove" .

a hypothesis.

.

.

.

The

scandal

nificant" results are published as

meaning. ... pose all kinds poor biologist although they

is

.

at

an institution or laboratory where active scienwork is being conducted.

tific

that the "sig-

though they had

on the or psychologist conditions which, produce unequivocal statistical reactually hinder him in his research. nonsensical

C. P.

Cox

1

—

conditions

—

If

the

identity

of

discipline .

.

and research

the

.

in

statistics

is

Complaints about the status quo have come from computer experts trying to implement some of the existing mathematical approaches and models. Said

also

of

its

statistics

"user" disciplines must be con-

tinually developed.

.

.

.

Besides training in statistics, an aspirant statistical consultant should receive complementary and systematized, as distinct from casually acquired, training in the disciplines in which he is expected to consult. All the scientists concerned may be advantageously encouraged to scrutinize and clarify their ideas on scientific method and to challenge purely statistical inferences whenever these are unconvincing.

.

.

The direction of communication

W. Hammi,

I have been repeatedly shocked to find out how often I thought I knew what I was talking about; but that in the acid test of describing

retain

to

inter-connections

.

R.

as

.

All too frequently statisticians imof

...

design of experiments

in

part of his training as a biometrician-in-residence

and routinely by researchers for purposes for which they were not intended. ... In biologic

sults,

statistical

with regard to future applications. ... It is absolutely vital that the future biometrician spend

the proper uses of statistics. that these

.

taught

William Feller I

.

.

I

What is surprising about all these critical comments is that they have come from statisticians rather than from clinicians or other recipients of the statistical consultations. In an era in which patients have been increasingly vocal in complaints about the services received from their clinical consultants in problems of medicine, clinicians have been notably silent in commenting on the "quality of care" in the interchange that occurs when the "patient" is a clinician coming to a "doctor" who is a statistical consultant in problems of research.

Clinicians have had many of these consultative encounters. Courses on statistics are now offered in the curricula of most medical schools. Editors of many medical, psychologic, and other journals will routinely request that suitable manuscripts be passed by a statistical censor. Proposals for large-scale clinical trials are not only often designed by statisticians, but also must be approved by statisticians before the project is funded.

During these many interplays of statistics and medicine, however, the path of consultative enlightenment has remained unidirectional. Thus, although the manuscripts and contents of medical journals are regularly subjected to critical statistical appraisal, almost no evidence of exposure to clinical reviewers can be found in the many medically oriented papers that regularly appear in such periodicals as Biometrics, Biometrika, and the Journal of the American Statistical Association. Many critiques have been published on the unsuitable or incorrect statistical methods used for papers that have appeared in the medical literature, 24 8 33 3i but I am not aware of any comparable critiques of the inappropriate or sometimes bizarre medical assumptions contained in papers that appear in the statistical literature. Innumerable books have been written on the general topic of elementary statistics for clinicians, but no one has written a clinical primer for statisticians.

It is time that clinicians began to widen this narrow path of communication and to inform our colleagues of the statistically pertinent knowledge that was learned during those many years in hospital wards and clinical practice. Although the statistician has spent many years in graduate school getting his Ph.D. degree, and can tell us a great deal about what he discovered during that time and afterward, the clinician has spent many years getting not only his M.D. degree, but also such additional postdoctoral "degrees" as F.A.C.P. or F.A.C.S. If clinicians need to know about the mystic statistic, statisticians might benefit from discovering the clinical pinnacle. We all have much to teach each other.

*To avoid ambiguity, let me define a clinician as a member of one of the healing professions (such as medicine, osteopathy, and clinical psychology) who takes direct responsibility for the care of living patients, or who has spent substantial amounts of postgraduate time (more than an internship) in developing his skillful knowledge of such activities. The clinician may be in private practice, academic research, or administrative work, but his distinguishing characteristic is a background of observational and therapeutic experience in dealing with sick people. Although an M.D. degree is sometimes regarded as the hallmark of a clinician, many M.D.'s (such as anatomists, biochemists, "clinical" pathologists, epidemiologists, microbiologists, pathologists, pharmacologists, and physiologists) may have neither the training nor the functional responsibilities of clinicians. This definition is intended only to clarify what I am talking about, and has no pejorative connotations in any direction.

The composition of "statistics"

One of the first main steps in this process of mutual education is to recognize that "statistics" is a composite domain, containing at least two distinctly different intellectual activities: (1) the acquisition, logical organization, and numerical presentation of data, and (2) the analysis of the data to arrive at decisions about degrees of variation, interrelation, and difference. The first type of activity is often called descriptive statistics; it produces the collections of data that appear in baseball batting averages, in financial charts, in the birth rates and death rates of "vital statistics," and in the many graphs, tables, and other numerical expressions of biomedical projects ranging from molecular explorations to therapeutic surveys. The second type of activity,
sometimes cited as "inferential statistics," will here be called inductive statistics; it is responsible for such calculations as correlation coefficients, linear regression, and multivariate analysis of variance; and it provides the t tests, chi square techniques, confidence intervals, P values, and other procedures used to test "statistical significance."

Although these two activities are regularly intermingled in biostatistics, there are drastic differences in the content of each activity, and in the prerequisite background for its performance. In content, descriptive statistics deals with actual observations of substances and phenomena, classified with concepts derived from previous observations, and prepared for the sake of direct information, comparison, and application. In this real world, the descriptive statistician organizes his tabulations, histograms, rates, averages, and other numerical statements that summarize the observed facts. Inductive statistics, in contrast, is based on idealized assumptions about continuity of variables and linearity of relationships; on theoretical concepts of probability; and often on the abstract idea of taking random samples an indefinite number of times from a vast uninspected population whose distribution is deemed "normal," but not demonstrated to be, and whose true "parameters" are estimated but never known. In this imaginary world, the inductive statistician develops models that are used in the world of reality to express the association of events and to quantify the contributions of chance.

In prerequisite educational background, descriptive statistics requires no particular scholastic training in "statistics" and is regularly performed by intelligent people whose main statistical skill is only the ability to keep careful records. For example, automobile drivers engage in descriptive statistics whenever they check their gasoline mileage. For a complex scientific domain such as clinical biology, the procedures of descriptive statistics have become an integral part of the domain and are often learned concomitantly, without conscious effort, as one learns the contents of the domain. Inductive statistics, on the other hand, requires intensive deliberate study; and for most people who take courses in "statistics," the main motive is to learn the various analytic maneuvers and tests. Most elementary courses in statistics contain a brief homage to the medians, means, standard deviations, frequency polygons, and other numerical tactics of descriptive statistics, but the instructor thereafter usually concentrates on the theories of probability, inferential strategies, and various analytic procedures that are considered the major domain of the trained statistician. Thus, most 12-year-old boys in the United States could state the descriptive statistical methods for determining baseball batting averages, but few boys (or grown men) would know how to decide statistically that one player's average is significantly higher than another's.

The antipodal differences between these two types of activity (the apparent ease of learning and understanding the real world data of descriptive statistics, in contrast to the theoretical training and mathematical efforts necessary to comprehend the methods of inductive statistics) can be held responsible both for the consultative role of statisticians in modern biologic science, and also for the many intellectual failings of contemporary biostatistics. Familiar with methods of producing descriptive statistics in biology, the clinical investigator tends to disregard their importance and take them for granted. When he seeks statistical advice, he usually wants management of the analytic procedures that he does not comprehend. Unfamiliar with the basic logic and data of the descriptive medical statistics, mathematically unchallenged by their contents, and somewhat awed by the associated mystique, the statistician often accepts the clinician's request, performs the tests, and hopes the clinician will be sensible enough to ignore results that are inappropriate.

During this collaboration, both men engage in the self-delusions that create the major fallacies of biostatistics. The clinician, forgetting the importance of his own contribution to the logic and data of the research, becomes mesmerized by what he does not understand: the statistical analyses. He assumes that the statistical computations will somehow validate the more basic activities, rectifying errors in observation and correcting distorted logic. The statistician, believing that problems in the basic logic and data have already been resolved or are unresolvable, becomes oblivious to what he does not understand: the clinical background of the descriptive statistics. He accepts the data as presented, and he concentrates on the way he will fit them into his array of analytic statistical maneuvers. The ensuing results may satisfy the clinician, the statistician, the editor, the granting agency, and the reader, but what really emerges is often an elaborately analyzed "statistically significant" collection of bad logic and bad data whose scientific deficiencies are not merely neglected but actually embellished and convoluted amid the mass of numbers and statistical tests.

Let us consider a nonmedical illustration. Suppose the administrative advisor of a major league baseball team must decide what salaries should be offered to the team next year. He searches for a way of evaluating the worth of the players and chooses the batting average as a useful index. An enlightened man, he knows that chance may enter into a batter's success, and so he performs statistical analyses of the differences among players. He finds that Player A's average was higher than Player B's, with a P value of < 0.001, and that Player B's average was higher than Player C's, but with a P value of only < 0.2. The first of these statistical differences is "highly significant"; the second is not. What should the advisor conclude about the relative value of these three players to the team?

The answer, of course, is that he should draw no conclusions. The statistical analysis was silly, because batting averages alone are not an appropriate test of a player's worth, no matter what the P values show. To make a really sound decision would require attention to the many other features of the players and of their batting activities. How much does each player contribute to the general spirit and morale of the team? Are they "colorful" players who will lure spectators to the box office? Is B a relief pitcher who has saved many important games, whereas A is an outfielder? These and many other questions would immediately be contemplated by any connoisseur of baseball who is asked to evaluate players. The P values produced by the statistical analysis would be dismissed as unimportant, and the main judgment would depend on many items of descriptive data that are statistically omitted from the batting averages.

Suppose, however, that our administrative advisor is not a connoisseur of baseball and, in fact, does not understand the game at all. His reputation as a consultative wizard in athletics may have been based on his previous success in a sport like horse racing, where the horse's performance can always be quantitatively evaluated from such numerical indexes as time and finishing position in races. Upon asking a connoisseur of baseball for "instructive" (i.e., numerical) help about evaluating players, the advisor may be greatly disappointed to receive such nonquantitative phrases as "spirit," "colorful," and "saving games." Knowing that the analysis of dimensional information has worked so well in his previous sporting activities, the advisor decides to ignore the "soft" verbal descriptions and to base his evaluations on the "hard" numerical data of the batting averages. The owner and manager of the team may be somewhat queasy about the idea, but they go along with it and say nothing because
they do not want to seem ignorant about P values, and besides, they are reluctant to doubt the word of an eminent consultant who has been widely acclaimed for his success in applying statistical analysis to athletics.

The situation just described would probably never happen in baseball. The game is too well known to too many people, and even if the owner and manager were foolish enough to entrust their evaluation of players to this type of statistical analysis, they would probably soon rescind the action because of the roars of outrage, laughter, and protest that would ensue from sportswriters and baseball fans. But the activities of biomedical research, alas, are not so well known, and counterparts of the foolishness just described constantly occur, although less obviously, in the activities of modern biostatistics.

Violations of scientific principles

The situation just cited contained violations of three elementary principles of scientific analysis: (1) the neglect of heterogeneity of a population during the process of comparing its members; (2) the use of a single property as the main index for assessing a phenomenon expressed in multiple manifestations; and (3) the application of a statistical test to prove a "significance" that may be meaningless or unimportant. These and many other departures from sound scientific reasoning constantly occur during biostatistical work because an infatuation with inductive statistics has overwhelmed a careful appraisal of descriptive statistics. Examples of these violations are so abundant in biomedical statistics that the singling out of any specific situation would be unfair to the many outstanding candidates for selection. The illustrations here will be confined to problems that often occur in the evaluation of therapeutic trials.

1. Neglected heterogeneity. In many statistically designed trials of therapy for major chronic diseases, all the patients with that disease are regularly "lumped" together for the allocation of treatment. When the results are later reported for the total group of patients, the clinician has no way of knowing whether the compared therapeutic agents had the same effects in the good prognostic risks as in the bad, or whether patients with different degrees of clinical severity responded differently. Because heterogeneous patients have been statistically managed as homogeneous, the results of an elaborate, expensive trial may have little or no value for future clinical application.

2. Univariate responses. Since the performance of statistical tests requires that a specific index be chosen for evaluation, the response to treatment in a clinical trial is regularly expressed in terms of a single property (or "variate"), such as survival time or white blood count. When treatment is evaluated in the real world by a clinician, of course, the assessment of response is based on a multitude of properties, including the patient's pain or incapacitation and the treatment's side effects and inconvenience. These properties, however, are often unattractive to a statistician because they represent subjective clinical appraisals expressed in qualitative verbal phrases. Even if these "soft" clinical data were to be accepted for analysis, however, the importance of each property could not easily be "weighted" for combination into a single index. If the properties were not combined, and were analyzed separately as individual indexes of therapeutic response, the statistical results would contain an array of many tabulations, tests, and P values, instead of a single neat answer. The reader would have to review the reports of the diverse responses, and might have to make his own decision about which variates are important, based on clinical and biologic values, rather than statistical tests. To avoid this type of messy imprecision, the statistician opts for the neatness of analyzing a single univariate response; the investigator agrees;
and another clinical trial produces simplistic results that have little pertinence to the complex world of reality.

3. Spurious significance. The biostatistical malpractice committed with tests of "significance" has become, as noted earlier, a scandal,5 and the phrase "statistical significance" has become such a malignant mental pathogen that major efforts to excise it will be undertaken here in a future discussion. Let it suffice for the moment to note that a test of statistical significance tells nothing about the quality of thought, planning, or execution in the work; nothing about the biologic or clinical meaning of the difference in numbers; and nothing about whatever has allegedly caused the difference. Too often, however, the inappropriate word significance has served to intoxicate a research worker with the belief that his preconceptions have been confirmed, to divert an editor or reviewer from carefully contemplating the logic and data, and to delude a reader into believing that the value, importance, and meaning of the research have somehow been authenticated because the results are "statistically significant." In the planning of clinical trials, the number of patients needed for "statistical significance" has currently become a major focus in the thoughts devoted to "design," and the calculation of this number is sometimes the acme of the biostatistician's contribution. To manipulate the levels of "significance," of course, the biostatistician must often neglect the heterogeneity of the patients and must base his calculations on univariate appraisals of response, but the disregard of science and of clinical applicability seems unimportant as long as the number that emerges from the calculations holds the glittering promise of "statistical significance."

Sources of difficulty

No single simple cause can be blamed for the lamentable state of affairs that produces the cited violations and many other biostatistical infractions of scientific principles of research. The statistician is not really responsible for the difficulties. He is doing the best he can with what he knows, but he often makes the false assumption that what he does not know about the subtleties of clinical biology will be amply managed by his clinical colleagues. The clinical investigator is not really responsible, either. He has been told that his complex clinical problems will be solved by statistical help, so he gets it, but he often makes the false assumption that inductive statistical sagacity eliminates the need for interpretive clinical wisdom. The consequence of this folie à deux is a mutual belief that the main targets of clinicostatistical research are the rigid digits of statistical analysis, rather than the valid facts of biologic science. Instead of giving basic scientific attention to the logic and data of the research, the collaborators become obsessed with the quantitative lure of the statistics.

How can the situation be improved? The problems have obviously not been solved by statisticians' many efforts to educate clinicians. As a result of these and other efforts, clinical investigators now regularly use such statistically approved procedures as control groups, random allocations, and double-blind techniques. But the "controls" are often chosen inadequately; the random allocations are often meaningless; and the double-blind techniques often yield amaurotic results from which no one can discern what has been accomplished. Perhaps the educational efforts of the past few decades have been too unidirectional. Perhaps clinicians, instead of being purely passive recipients of statistical consultation, should now become active contributors in a process of intellectual exchange that enlightens both the statistical and clinical participants.

What contribution can clinicians make to help convert statistical art into biostatistical science? Since the greatest knowledge at the clinician's disposal is his familiarity with the
data and patterns of events that describe the biologic realities of nature, the clinician can teach the artful inductive statistician about the science of descriptive statistics.

Most statisticians and mathematicians properly regard themselves as creative artists, since they seek their challenges not in nature but in abstract systems or "spaces" conceived as acts of artistic imagination. The theories created in this statistical art are not a part of science unless they relate to nature, with concepts and tactics that either emerge from nature, or that are compatible with the events of nature. As discussed earlier, the statistician's customary education has given him little or no direct observational experience with which to develop his perception of these natural realities. His postgraduate activities may also not expand his scientific vision, because he may perform by playing the Procrustean role in which he was trained, a role of choosing and altering the data to fit his tests, instead of choosing or altering his tests to fit the data.

Anyone who doubts that last remark should review the taxonomic arrangement of topics in textbooks on statistics. If an investigator has a problem that requires testing the difference between two collections of nominal, ordinal, or metric data, he will be unable, with rare exception,7,36 to find a statistical textbook that arranges its topics according to the types of data. The topics are almost all arranged according to the available statistical tests, not according to the characteristics of the data. Thus, if the investigator already knows what test to choose for a particular problem, he can find the test. (An analogous type of backward logic occurs in the organization of many textbooks on "physical diagnosis" and other aspects of diagnostic medicine. Students often find the textbooks frustrating because they are organized according to "disease," although a student begins his diagnostic search by observing symptoms and signs. To read about the diagnostic significance of the symptoms and signs, the student is forced to begin by knowing what diagnoses to look under.)

Since the statistician has seldom been trained or provoked to enter the natural world of scientific observation, he has had no way of becoming familiar with the complex data and intricate logic that describe that world. He has had to rely on clinical biologists to make the observations, to arrange their logic, and to explain their significance. And we have often not done so. If the statistician has built castles in the air, he is not really to blame; clinicians have not only failed to put the castles on a sound foundation, but have moved into the castles and extolled the illusive architecture.

Plans for the future

In future columns of "Clinical biostatistics," I hope we can provide some solid ground on which clinicians and statisticians can join to build a better structure for the future. If an inductively oriented statistician (or a perceptively realistic clinician) believes that descriptive statistics is an unsatisfactory title for this activity, let us call it something else. One of the most distinguished contemporary statisticians, John W. Tukey, has proposed an excellent new term, data analysis. Tukey 57 has also proposed some excellent rules and plans for the new meeting ground:

We should seek out wholly new questions to be answered. . . . We need to tackle old problems in more realistic frameworks. . . . We should seek out unfamiliar summaries of observational material, and establish their useful properties. . . . It can help, throughout this process, to admit that our first concern is with "data analysis". . . . To the extent that pieces of mathematical statistics fail to contribute, or are not intended to contribute, even by a long and tortuous chain, to the practice of data analysis, they must be judged as pieces of pure mathematics, and criticized according to its purest standards. Individual parts of mathematical statistics must look for their justification toward either data analysis or pure mathematics. Work which obeys neither
master . . . cannot fail to be transient, to be doomed to sink out of sight. And we must be careful that, in its sinking, it does not take with it work of continuing value. . . . Finally, we need to give up the vain hope that data analysis can be founded upon a logico-deductive system like Euclidean plane geometry and to face up to the fact that data analysis is intrinsically an empirical science. . . . It will still be true that there will be aspects of data analysis well called technology, but there will also be the hallmarks of stimulating science: intellectual adventure, demanding calls upon insight, and a need to find out "how things really are" by investigation and the confrontation of insights with experience. . . . The future of data analysis . . . (depends on) our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachment.

At our next meeting, two months from now, the topic for discussion will be the design of experiments. In this domain of experimental design, which has long been regarded as the province of inductive statistics, clinical biologists can make major contributions to the scientific future of

10. Hamming, R. W.: Numerical analysis vs. mathematics, Science 148:473-475, 1965.
11. Hotelling, H.: The teaching of statistics, chap. 3, in Olkin, I., editor: Contributions to probability and statistics (Essays in honor of Harold Hotelling), Stanford, Calif., 1960, Stanford University Press.
12. Mahon, W. A., and Daniel, E. E.: A method for the assessment of reports of drug trials, Canad. Med. Ass. J. 90:565-569, 1964.
13. Mainland, D.: An uncommon abnormality of the flexor digitorum sublimis muscle, J. Anat. 62:86-89, 1927.
14. Mainland, D.: The technique of estimating small irregular areas in biological research, with notes on the tests of accuracy, J. Anat. 63:345-351, 1929.
15. Mainland, D.: A study of the sizes of nuclei in ovarian stroma, Anat. Rec. 48:323-340, 1931.
16. Mainland, D.: The measurement of ferret pronuclei, Trans. Roy. Soc. Canad., 3rd series, 25: Section V, 9 pages, 1931.
17.

Mainland,

study of the with a note on the second polar spindle, Amer. J. Anat. 47:195-

19.

Some

Cox, C. P.:

Anticoagulant therapy, Phila-

S.:

A.

Feinstein,

and

The problem acute

in

A.

Feinstein,

and

R.,

fever,

H.:

Spitz, I.

Res. 14:107-124, 1934.

The

epi-

D.: Chance and the blood count, Canad. Med. Ass. J. 31:656-658, 1934. (Edit.) 23. Mainland, D.: Problems of chance in clinical

work,

Clinical prob-

surveys, Arch. Intern.

statistical

W.

Are

:

Med.

Research

4:24-29,

methods

1925,

1969.

for research

Oliver

& Boyd,

applied

statistics,

Freeman, L.

New 8.

York,

Gifford,

tique

of

C: Elementary

1965, John Wiley

&

Sons, Inc.

and Feinstein, A. R.: methodology in studies of

R.

H.,

A

cri-

antico-

agulant therapy for acute myocardial infarction, 9.

New

Eng.

J.

An

Med. 280:351-357, 1969.

Halmos, P. R.: Mathematics as a creative Amer. Scientist 56:375-389, 1968.

art,

introduction to statistical

and methods for medical and dental workers, Edinburgh and London, 1938, Oliver & Boyd, Ltd. Mainland, D.: Elementary medical statistics:

25.

The

Ltd. (Ed. 13, 1963.) 7.

Med. J. 2:221-224, 1936. D.: The treatment of clinical and

ideas

overawed by

scientists

R. A.: Statistical

Edinburgh,

Brit.

24. Mainland,

laboratory data:

life

Scientific

statistics?

workers,

in-

22. Mainland,

123:171-186, 1969.

6. Fisher,

by

D.: Forces exerted on the human mandible by the muscles of occlusion. J. Dent.

of evalu-

rheumatic

demiology of cancer therapy.

5. Feller,

exerted

21. Mainland,

stethoscopes,

Standards,

R.:

statistics.

treatment

Company.

Pediatrics 27:819-828, 1961.

lems of

forces

muscles,

1933.

Douglas, A.

delphia, 1962, F. A. Davis

4.

The

D.:

human

with special reference to errors in muscle measurement, Trans. Roy. Soc. Canad., Section V, pp. 265-276, dividual

observations on the teach-

801, 1968.

ating

of ferret pronuclei,

Anat. Rec. 50:53-83, 1931.

20. Mainland,

ing of statistical consulting, Biometrics 24:789-

steroids

sizes

Mainland, D.: The volumes of ferret ova, with methods of determina-

tion,

3.

The

Mainland, D.:

Anat. Rec. 49:103-120, 1931. special reference to the

References

2.

quantitative

ferret,

240, 1931. 18.

data analysis.

1.

A

D.:

body of the

polar

principles of quantitative medicine, Phila-

delphia,

1952,

26. Mainland,

ed.

2,

D.:

W.

B. Saunders

Company.

Elementary medical

Philadelphia,

1963,

W.

B.

statistics,

Saunders

Company. 27. Mainland, D.:

Anatomy

and dental education, B.

Hoeber,

Inc.,

as a basis for medical

New

Medical

Harper & Row, Publishers.

York, 1945, Paul

Book Division

of

Introduction and rationale

14

28.

Mainland, D.: Notes from a laboratory of medical statistics. Note 20, pp. 8-9, October

34. Schor,

Mainland, medical

Notes

D.:

statistics.

from

laboratory

p.

3.

Note 104,

a

March

of

35. Schor, S.,

program

for

21:28-31,

Statis.

I.:

Statistical evaluation

J.

A. M. A. 195:1123-

1128, 1966.

30. Mainland,

medical

Notes

D.:

from

laboratory

a

Note 116,

statistics.

p.

4,

May

of

36. Siegel,

32.

Mainland,

10-1. Suppl. 7. p.

n.:

1,

June. 1969.

Pharmacol. Ther.

Clin. 33. Saiger,

M.

G.

L.:

Errors

of

10:576-586,

medical

A. 173:678-681, 1960.

37. Skellani,

1969.

studies,

New

statistics

York,

for

1956.

the

McStrat-

Tukey, J. W.: The future of data analysis, Ann. Math. Statist. 33:1-67, 1962. 39. Zelen. M.: The education of bioinetricians, Amer. Statis. 23:14-15, 1969.

38.

Ward Rounds—16,

Statistical

Nonparametric sciences,

Graw-Hill Book Company. Inc. C: Models, inference, and |. egy, Biometrics 25:457-475, 1969.

1965.

Mainland, D. Notes on Biometry in Medical Research. Veterans Administration Monograph

S.:

behavioral

13,

31.

A.

and Karten,

of journal manuscripts,

16,

1965.

J.

Amer.

1967.

24, 1961. 29.

reviewing

Statistical

S.:

medical manuscripts,

SECTION ONE

THE ARCHITECTURE OF COHORT RESEARCH

Although most statistical interpretations of research depend on the idea of an experiment, most medical research is not experimental. The people assembled for the research are almost never chosen randomly from the groups they allegedly represent; the "causal" agents under comparison are seldom contrasted concurrently; and the agents are not assigned according to a suitable prearranged plan. Because most clinical and epidemiologic investigations are conducted as surveys, not experiments, major problems of bias or distortion can occur when the compared groups are assembled and when their results are analyzed. Even when the research is experimental, however, such problems can still appear because of flaws in chronologic aspects of design or because of vicissitudes in the action of fate during random choices or random assignments.

Although diverse mathematical models have been developed for the statistical analysis of research data, no corresponding models have been available for the scientific architecture of the research itself. The first few essays in this section are concerned with the establishment of an appropriate scientific model for showing the research structure of a cause-effect relationship. An entity in some defined baseline condition or initial state is exposed to a principal maneuver, which is the causal agent or effector. The outcome is then observed in the subsequent state. For contrast, another entity, in a similar initial state, is exposed to a comparative (or "control") maneuver.
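
The cause-effect structure just described can be written down as a small data model. The sketch below is only an illustration under stated assumptions: the class name `CohortMember`, the helper `outcome_rate`, and the eight invented records are hypothetical and do not come from the essays. It shows each entity carrying an initial state, a maneuver, and a subsequent state, with the principal and comparative maneuvers contrasted by the outcome rate in each group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CohortMember:
    """One entity in the cohort model (hypothetical representation)."""
    initial_state: str     # baseline condition, e.g. "diseased"
    maneuver: str          # "principal" or "comparative" (control)
    subsequent_state: str  # outcome observed after the maneuver


def outcome_rate(members: List[CohortMember], maneuver: str, outcome: str) -> float:
    """Proportion of members exposed to `maneuver` whose subsequent state is `outcome`."""
    exposed = [m for m in members if m.maneuver == maneuver]
    if not exposed:
        raise ValueError(f"no members were exposed to maneuver {maneuver!r}")
    return sum(m.subsequent_state == outcome for m in exposed) / len(exposed)


# Invented records, used only to exercise the structure.
cohort = [
    CohortMember("diseased", "principal", "improved"),
    CohortMember("diseased", "principal", "improved"),
    CohortMember("diseased", "principal", "not improved"),
    CohortMember("diseased", "principal", "improved"),
    CohortMember("diseased", "comparative", "improved"),
    CohortMember("diseased", "comparative", "not improved"),
    CohortMember("diseased", "comparative", "not improved"),
    CohortMember("diseased", "comparative", "not improved"),
]

principal_rate = outcome_rate(cohort, "principal", "improved")
comparative_rate = outcome_rate(cohort, "comparative", "improved")
```

The comparison is meaningful only if the two groups share a similar initial state, which is why that field is recorded for every member rather than assumed.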

The model is quite simple and direct, since it takes the exact form of the normal "anatomy" and "physiology" of an experiment. The model is directly applicable to any form of cohort research (whether etiologic or therapeutic, survey or experiment) in which the observed groups are followed forward in time (or "longitudinally"), being investigatively pursued in the customary scientific direction from imposition of a "cause" to occurrence of an "effect."

If the groups consist of people, however, diverse features of human and medical life can produce biases that create "pathology" in the research structure. As people traverse the pathway from anonymity at home to immortality as biostatistical units, the biases may distort the comparison of the maneuvers by altering the baseline equality of the compared groups, the performance of the maneuvers, the detection of the outcome events, or the chronologic duration of persistent observation. The act of allocating the compared maneuvers by a randomization process is often helpful, but it provides no guarantee that these difficulties will be avoided. In fact, randomized allocation can sometimes increase the problems by lulling the investigator into a state of false serenity.
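
The remark that randomization gives no guarantee of baseline equality can be illustrated with a short simulation. This is a sketch under assumptions, not a method taken from the essays: the 40 hypothetical patients, the two prognostic labels, the coin-toss allocation, and the within-stratum alternative are all invented for illustration. It counts how unevenly the poor-prognosis patients are split between two arms under each allocation scheme.

```python
import random


def simple_randomization(prognoses, rng):
    """Assign each patient to arm 'A' or 'B' by an independent coin toss."""
    return [rng.choice("AB") for _ in prognoses]


def stratified_randomization(prognoses, rng):
    """Within each prognostic stratum, shuffle the patients and alternate A/B."""
    assignment = [None] * len(prognoses)
    for stratum in sorted(set(prognoses)):
        indices = [i for i, p in enumerate(prognoses) if p == stratum]
        rng.shuffle(indices)
        for k, i in enumerate(indices):
            assignment[i] = "AB"[k % 2]
    return assignment


def poor_prognosis_gap(prognoses, assignment):
    """Absolute difference between arms in the number of poor-prognosis patients."""
    in_a = sum(1 for p, arm in zip(prognoses, assignment) if p == "poor" and arm == "A")
    in_b = sum(1 for p, arm in zip(prognoses, assignment) if p == "poor" and arm == "B")
    return abs(in_a - in_b)


rng = random.Random(0)                           # fixed seed for repeatability
prognoses = ["poor"] * 10 + ["good"] * 30        # hypothetical cohort of 40 patients

simple_gaps = [
    poor_prognosis_gap(prognoses, simple_randomization(prognoses, rng))
    for _ in range(1000)
]
stratified_gaps = [
    poor_prognosis_gap(prognoses, stratified_randomization(prognoses, rng))
    for _ in range(1000)
]
```

Under the coin-toss scheme some simulated trials place noticeably more poor-prognosis patients in one arm than in the other, whereas allocating within strata keeps the split even by construction. The comparison says nothing about the other sources of bias listed above, which allocation alone cannot remove.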

All of the cited hazards of cohort research can occur when the research format is altered, so that the groups are pursued in a "cross-sectional" or "retrospective" rather than forward direction. The changes in direction, however, also introduce some separate additional problems that will be discussed in Section Two.

CHAPTER 2

Statistics versus science in the design of experiments

With the same name, this chapter originally appeared as "Clinical biostatistics II" in Clin. Pharmacol. Ther. 11:282, 1970.

The design of experiments has been a favorite concern of statisticians ever since R. A. Fisher, 10 after emphasizing the importance of statistics in research, published a second book in 1935, entitled The Design of Experiments, 17 and stated that "a clear grasp of simple and standardized statistical procedures will ... go far to elucidate the principles of experimentation." Fisher's book, now in its eighth edition, 18 has been followed by many other statistical texts and papers on this topic. With minimal effort, I was able to find 16 current statistical books [reference numbers illegible], written in English, with "experimental design" or some congener in their titles. In a recent bibliographic review 23 of the topic, the authors cited about 800 items published since 1957.

After inspecting the names of these 800 publications and the tables of contents in the textbooks, a clinical investigator might wonder about the kind of knowledge necessary to plan scientific experiments dealing with sick people and human populations. The topics in the statistical texts are generally devoted to randomized blocks, Latin squares, Graeco-Latin squares, Youden squares, lattice arrangements, partial confounding, analysis of variance, analysis of covariance, and tests of "significance." The papers in the bibliographic review are classified in such groups as factorial designs, response surface designs, designs for nonlinear models, and serial designs. The books and papers are thus concerned mainly with statistical tactics in analyzing data, and little or no attention is given to such clinical biologic problems as defining the goal of the experiment, choosing the experimental material, and validating the results.

Amid the intensely statistical discussions contained in this literature on experimental design, there occasionally appears a modest scientific warning, such as "uniformity is the only requisite between the objects whose response is to be contrasted," 20 but the warning is not accompanied by a description of scientific methods for assessing "uniformity" or appraising "response." One writer even makes the experimenter's role wholly subservient to the statistician's by stating that "the purpose of an experiment is to produce a sample of observations which will furnish estimates of the parameters of the population together with measures of the uncertainty of these estimates." 34

Many years of exposure to these purely mathematical ideas about experimental work may have obscured the realization that for most scientists the purpose of experiments is to get answers to questions. A scientist's main concern is not the idealized elegance of a mathematical design, but a realistic plan for asking an important question in a way that will yield a reliable answer. The object is to get scientific validity, not just statistical "significance"; and meaningful answers, not just magisterial numbers.

The designs discussed in statistical literature provide splendid examples of mathematical art in the imaginary world for which they were created. They may even be scientifically satisfactory for the agricultural fields or chemical vats in which the models have often been applied. But they do not focus on the basic problems of design in most activities of clinical biostatistics. The statistical concepts do not provide methods for attaining the documentation, precision, validity, and reproducibility that characterize scientific research; and the statistical designers have seldom had either the training or the experience needed to discern subtle scientific distinctions about human populations. Because a group of people has neither the homogeneous uniformity nor the simple responses of a field or vat, investigators cannot properly design and comprehend experiments on people merely by contemplating mathematical principles or by extrapolating statistical theories that may have worked for nonhuman material. Such scientific issues as the choice of hypothesis, appropriateness of sampling, suitability of control, selection of maneuver, stratification of prognosis, and criteria for response are critical principles of design in clinical biostatistics, but these principles are generally overlooked, neglected, or glossed over in most statistical discussions of "design."

When Sir Ronald Fisher inaugurated the campaign for statistical attention to the planning of experiments, clinical biologists needed to be persuaded that the anecdotal doctrines of the past were no longer acceptable, and that medical science required the confirmation of quantitative experiment. After more than three decades, the campaign has generally been quite successful; clinical investigators now use control groups, randomization, and quantification in designs of biomedical research, and the medical literature contains an abundance of statistical surveys and experimental trials of therapy. Unfortunately, however, many of these projects have been designed according to abstract strategies of statistical principles, rather than realistic methods of clinical science. The control groups are often chosen improperly, and are not truly comparable; the randomization is usually performed promiscuously, with no concern for prognostic heterogeneity; and the quantification is often spurious, because the wrong variables were measured. Fisher's complaint about the clinical practices of 35 years ago was that "The liberation of the human intellect must ... remain incomplete so long as it is free only to work out the consequences of a prescribed body of dogmatic data, and is denied the access to unsuspected truths, which only direct observation can give." 19 The pendulum has now swung far enough so that the same complaint is once again cogent, but its target today would be the dogmas of current statistical theories.

As direct observers of experimental phenomena in people, clinical investigators owe statistical colleagues an access to a methodologic description of important biologic principles that are unsuspected, ignored, or minimized in the current infatuation with statistical designs. My object in this paper is to outline the diverse structures that are considered or created in the experiments of clinical research, and to indicate some of the reasons why these structures cannot be adequately designed with current statistical models. Using these structures as a new basis for the architecture of clinical research, subsequent papers in this series will be devoted to operational principles and other details of scientific methods for the constructions.

The structure of clinical experiments

The basic structure of any experiment consists of a temporal sequence in which a preparation is exposed to a maneuver, and undergoes a response. The preparation is described according to its initial
state, and the response is determined by noting the subsequent state either alone or in comparison to the initial state. For the clinical biologic experiments that will be considered here, the "preparation" is a person,* who is healthy or diseased in either the initial state, or the subsequent state, or both. The maneuver performed in the experiment can be chosen by nature, by the investigator, or by the person who acts as the experimental preparation.

*In many contemporary activities of "clinical research," the basic material is an animal, a substance derived from a person or animal, or an inanimate system. [reference number illegible] The discussion here is limited to the types of clinical research in which the "material" under surveillance is a person, or group of persons.

In this sequence of

    Initial State --(Maneuver)--> Subsequent State

the crucial scientific issues in designing and analyzing the experiment depend on why the experiment was done, what maneuver was used, who chose the maneuver, and what was the state of the preparation before and after the maneuver.

The experiments of nature. Nature performs at least three different experiments that are of clinical interest. The structure of these experiments is shown in Table I. As an ontogenetic activity, nature allows a healthy person to change as he grows older. As a pathogenetic activity, nature makes a healthy person diseased, or creates a diseased newborn. As a pathogressive activity, nature takes a diseased person through the clinical course of the disease. In most of these events, nature chooses the experimental maneuver, but in other activities, the choice is made by the affected person. Such choices occur, for example, when someone drinks polluted water, or when two people with sickle cell trait decide to marry and produce children.

Table I. The experiments of nature

Initial state   Maneuver                     Subsequent state             Type of experiment
Healthy         Normal growth                Healthy                      Ontogenetic
Healthy         Development of disease       Diseased                     Pathogenetic
Diseased        Clinical course of disease   Healthy, diseased, or dead   Pathogressive

These experiments of nature receive scant attention in statistical concepts of design, probably because the investigator does not choose the maneuver, and hence engages in no "experimental design." Nevertheless, these experimental activities of nature are constantly studied with statistical methods. Tabulations of these natural experiments are the basis for birth rates, death rates, geographic distributions of disease, and other data of "vital statistics"; and for all of the statistical epidemiologic studies of causes of disease. 10 Such tabulations are also used to determine the "range of normal" for various types of clinical and laboratory data in the conditions created by nature, and to establish the diagnostic concepts for identifying diverse diseases. Moreover, statistical accounts of the clinical course of disease are the basis for the extraordinary activities, described in the next section, in which clinicians impose therapeutic intervention on the events begun by nature.

For investigating these natural phenomena, the researcher's main challenge is not to create an experimental design, but to devise a plan for discerning the design created by nature. He engages in observation rather than planned experimentation, and his activity is usually given the "passive" name of survey, rather than the "active" name of trial or experiment.

The experiments of man. In the observational activities just described, the investigator wants to determine what nature has done, and he accepts nature's "maneuver" as the basic force that connects the initial and subsequent state of each person who forms a statistical unit in the research. In what is usually regarded as an "experiment," however, the main maneuver is
chosen by man. For the activities of clinical investigation, the purpose of this maneuver can be either to change the course of nature, to explain the way nature works, or to provide better methods for collecting and interpreting the investigative data.

Interventional experiments. Clinical therapy is a unique type of experimental activity because it contains the events of two simultaneous experiments, one imposed on the other: an act of man intervening in an act of nature. The purpose of ordinary clinical therapy is to change the course of nature by preventing what nature may do, or altering what nature has already done. According to the patient's initial state and the target chosen to be altered or prevented, therapeutic activities can be classified as remedial, contrapathic, or contratrophic, 9 as shown in Table II. In remedial treatment, such as the relief of pain, the clinician tries to modify or remove a symptom, lesion, or other target that already exists in a diseased patient. In contrapathic treatment, such as immunization against poliomyelitis, the clinician tries to prevent a healthy person from becoming diseased. In contratrophic treatment, such as rigorous regulation of diet and blood sugar for diabetes mellitus, the object is to prevent adverse progress of an established disease.

Table II. The interventional therapeutic experiments of man

Initial state   Maneuver                         Planned subsequent state   Type of experiment
Healthy         Prevention of disease            Healthy                    Contrapathic
Diseased        Alteration of disease            Improved or cured          Remedial
Diseased        Prevention of adverse progress   Not worse                  Contratrophic

Although each act of ordinary clinical therapy contains the temporal sequence of an experiment, the procedure is not usually regarded as an "experiment" because the maneuver for each patient is chosen in an arbitrary ad hoc manner by the attending clinician. No fixed protocol is established for the allocation of treatment within a group of individual patients, and the comparative "controls" depend on results obtained in similar situations of the past. When the results of such individual adventures in ordinary treatment are collected and analyzed, the research is called a therapeutic survey. Unlike the activities studied in an observational survey, routine treatment involves a maneuver of man imposed on a maneuver of nature, and the purpose of a therapeutic survey is to discern what man's intervention has done to the course of nature. When this same experimental sequence and purpose are carried out with a prearranged plan for choosing comparative groups and for allocating treatment to each patient, the activity is regarded as truly experimental and is called a therapeutic trial.

Explanatory experiments. In the therapeutic activities just described, the maneuver was planned to create a permanent change in the patient's condition. An existing initial state was to be remedied, or an expected subsequent state was to be prevented. In many other clinical experiments, however, the motive is not to change the course of nature, but to explain the way in which natural phenomena are created or altered. In such experiments, the maneuver is used as a stimulus for transient reactions that are analyzed as the "response," but the patient returns to the initial state after the experiment is completed.

Probative experiments. The response to an experimental maneuver is often used for identifying the patient's capacities in his initial state, or for differentiating his condition from the initial state encountered in other patients. Thus, the electrocardiographic changes after a burst of exercise will sometimes distinguish healthy people from those who have coronary artery disease; the response of serum and urine in a
Statistics versus science in the

glucose tolerance test

is

often used for the

and various

diagnosis of diabetes mellitus;

types of climatic

and other physical

may be employed

stimuli

determine the range of physiologic response in normal people. These procedures are an active experito

mental counterpart of the more passive observational activities, described earlier, that

design uf experiments

21

is conducted when healthy and diseased patients are exposed to varying doses of a pharmaceutical agent to determine whether the patients with different clinical

ment

states

respond

differently,

and whether the

responses depend on the doses employed

ver and uses control groups, each experi-

maneuver. Methodologic experiments. The last main type of clinical experiment is intended neither to change the course of nature nor to explain it, but rather to investigate the methods used for the other types of activities. Examples of such methodologic explorations would be a studv of observer

ment

variability in the interpretation of roent-

are used to determine the range of normal

and diagnostic boundaries for various conditions of health and disease. Although the procedures can be called experiments, because the investigator chooses the maneureally

is

conducted as a "probe" to

what was created during the

identify

ante-

cedent "experimental" activities of nature.

A

design

is

prepared for these

clinical ex-

periments, but the basic aim of the design is

what happened in the origiby nature. MANEUVERAL EXPERIMENTS. In this type to discern

nal "design" prepared

in the

genograms, or a comparison of a patient's obtained by a computer "interview" versus that obtained by a physician. In such procedures, the "material" inhistory

vestigated in the "experiment"

man

is

the hu-

or mechanized observational appara-

"maneuver"

tus; the

the exposure of this

is

determine whether a particular maneuver can elicit a

"apparatus" to the film or patient; and the

specified response, or to assess the effects

pretation or history that emerges from the

of different degrees of the maneuver. An example of such an experiment is the attempt to induce disease by inhalation or

exposure. These methodologic explorations

of

experiment, the goal

injection

microbial

of

is

to

More

substances.

"response"

are seldom regarded as truly experimental because thev are based on "passive" observations of the phenomena used in ob-

and interpreting

examples of these experiments are the procedures performed in

tive"

"phase II" therapeutic investigations for

the state

characteristic

appraising the

mode

of action

and optimal

the roentgenographic inter-

is

taining

and no "acprobe or treat of the patients used in the "ma-

attempt

made

is

data,

to

neuver." Nevertheless, investigations of this

dosage of pharmaceutical agents. Unlike the probative type of explanatory experi-

tific

ments, which are intended to elucidate a

other form of clinical research.

stimulus

the

as

critical prerequisites for the scien-

validity of the data obtained in

first

part of the total experi-

ment, a healthv volunteer

may have ma-

The inadequacies of

Most clinical

these

of

squares,

lattice

statistical

in

has

testing the antimalarial therapeutic proper-

signed.

agent.

A

com-

bined probative and maneuveral experi-

structures

be

arrangements,

strategems

architecture

new pharmaceutical

scientific

design in

adequately

planned with the concepts described in current statistical writings about experimental design. Randomized blocks, Latin

successfully

a

statistical

cannot

research

induced so that the second part of the experiment can be used for

laria deliberately

ties of

any

and response created by nature,

maneuveral experiments are intended to clarify the operation of an agent of man. conjunctive experiments. Certain experiments are prepared as a conjunction of one or more of the two types of experimental activities just described. For example,

type are

But the

pertinent for

research

already statistical

many

and other

can often be used

whose scientific been well demodels are not

of the tvpes of clinical

investigation just cited, and,

when

perti-

The architecture

22

of cohort research

nent, are too superficial for the tal

demands

of scientific rigor.

cal tactics of "design"

fundamen-

The

statisti-

depend on the basic

assumptions that the research

is

being per-

formed as an experiment, that the experimental material and its responses can be reproduciblv identified, that "random samples" can be readily obtained, and that "random allocations" will provide satisfactory solutions to problems in "control" but

all

of these assumptions are either too

tabulations of a disease represent

tistical

that disease.

What

sent, of course,

these tabulations reprethe rate of diagnosis of

is

that disease, rather than

its

actual occur-

Because "disease" is a wholly intellectual concept of nosology, and because both the nosographv and diagnostic technology of disease are constantly changing, rence.

enumerations of rates of

statistical

"dis-

ease" cannot be scientifically satisfactory

accompanied by a

unless

satisfactory as-

naive or too erroneous for scientific design

sessment of the diagnostic procedures ex-

in clinical investigation.

tant at the time and geographic locale for each item of the reported statistics. 10 Another common fallacy in epidemiologic research based on "vital statistics" is

The concept cal

of an experiment. Statisti-

principles of experimental design are

many

observational

surveys, therapeutic surveys,

and methodo-

not pertinent for the

the

belief

that

the occurrence rate of a

becomes

logic explorations that provide the funda-

particular

mental data of clinical science. Since these

validated

survevs and explorations are neither de-

been reviewed to confirm the diagnosis whenever that disease was recorded on a

conducted

nor

signed their

as

"experiments,"

intellectual construction

sidered

in

statistical

is

descriptions

not conof "ex-

perimental design." trials

and probative experiments that can truly be regarded as experiments, statistical tactics

in

design are scientifically superficial

because they are based on a model that is oversimplified. These clinical investigations contain the profound complexity of a simultaneous dual experiment, in which a design of man is imposed on a "design" of nature for the purpose of explaining or changing nature's activities, but the statis-

models are planned to manage only one major experimental activity, not two. Reproducible identification of material. Statistical models also begin with the assumption that the experimental material can be reproducibly identified, but this elementary necessity of science cannot be tical

taken for granted in clinical epidemiologic research,

most

and

difficult

its

achievement

challenges in

one of the planning the is

research.

In epidemiologic surveys that depend on occurrence rates of disease at different times and places, a widespread contemporary fallacy is the belief that statistics are scientifically [valid] if the available evidence has [confirmed the diagnosis on] the death certificate. This type of confirmation can only eliminate "false positive" diagnoses; it does not detect the "false negative" situations in which the disease occurred without being diagnostically reported. These errors are so ubiquitous and their scientific effects are so virulent that they cast major doubt on the validity of any of the massive epidemiologic tabulations dealing with changing rates of incidence, prevalence, and mortality for diverse diseases, and with statistical conclusions about causes of disease. The main scientific challenge in the design of this type of research is not in using the statistics, but in discerning the changes in "disease" created by changes in the standards, dissemination, and application of nosologic and technologic principles of diagnosis.10

In surveys and trials of therapy, the clinical problems of diagnosis are different from the epidemiologic difficulties just cited. In treatment, the main diagnostic sources of variability are not the diverse techniques used to identify "disease" in different eras and geographic locales, but the diverse ways in which a particular set of techniques is applied by different physicians. Consequently, a scientific necessity in any investigation of therapy is a clear, precise statement of the criteria used for diagnosis of the disease under treatment.
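The point about death-certificate confirmation can be made concrete with a small numerical sketch. The counts below are hypothetical and chosen only for illustration, and the class and function names are assumptions rather than anything taken from the cited studies. The sketch shows that confirming the reported diagnoses strikes out the false positives but cannot recover the cases that were never reported, so the confirmed rate can still understate the actual occurrence.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticCounts:
    """Hypothetical tallies for one disease in one population."""
    population: int
    true_positives: int   # disease occurred and was reported
    false_positives: int  # disease was reported but did not occur
    false_negatives: int  # disease occurred but was never reported


def reported_rate(c: DiagnosticCounts) -> float:
    """Rate computed from everything written on the certificates."""
    return (c.true_positives + c.false_positives) / c.population


def confirmed_rate(c: DiagnosticCounts) -> float:
    """Rate after confirmation: false positives are removed, but
    re-checking reports cannot find cases that were never reported."""
    return c.true_positives / c.population


def actual_rate(c: DiagnosticCounts) -> float:
    """Rate that would be seen if every occurrence had been detected."""
    return (c.true_positives + c.false_negatives) / c.population


# Illustrative numbers only (assumed, not from any survey).
counts = DiagnosticCounts(population=10_000, true_positives=80,
                          false_positives=20, false_negatives=40)

print(f"reported:  {reported_rate(counts):.4f}")
print(f"confirmed: {confirmed_rate(counts):.4f}")
print(f"actual:    {actual_rate(counts):.4f}")
```

With these assumed counts, confirmation lowers the reported rate, yet the confirmed figure remains below the actual rate because the 40 unreported cases are invisible to any review of the existing reports.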

Despite all the recent attention given to statistical methods of therapeutic design, such criteria are frequently absent from the clinical literature. For example, adequate diagnostic criteria were omitted in 24 of 32 prominent clinicostatistical studies of anticoagulant therapy for acute myocardial infarction22; the pretherapeutic criteria of "operability" were described inconsistently and nonreproducibly in 26 prominent clinicostatistical reports14 of surgery for carcinoma of the lung.

The concept of "random sampling." One of the most pernicious scientific delusions now prevalent in the world of medical research is the idea that concepts of "random sampling" can be readily applied to clinical populations. This idea is completely vitiated by the use of patients as the "material" of clinical investigation, because a patient (unlike an agricultural field, chemical vat, or the material of any other type of experimentation) chooses the investigator, rather than vice versa. Before a person can become a statistical unit in a study of disease, he must traverse a long and intricate series of directional transitions that lead him from his medical anonymity at home to his statistical inclusion in a collection of data. After nature has created the disease, the diseased person must be provoked, by symptoms or by other events, to see a doctor and to become a patient. According to the iatrotropic8,10 and other stimuli, the doctor may or may not suspect the existence of that disease in the patient. According to the intensity of the doctor's suspicions and his available diagnostic procedures and criteria, he may or may not be able to diagnose that disease. If the disease is suspected or diagnosed, the patient may then be referred to various other consultant doctors, who in turn may select the hospital to which he is referred for further "work-up." According to the consultants, the hospital, the diagnostic facilities of the hospital, a variety of socioeconomic features, the presence of a suitable investigator at that hospital, and the patient's willingness to participate in both the initial plans of his doctors and the subsequent plans of the investigator, the patient may then appear as a unit of data in a statistical series.

None of these activities can possibly be regarded as "random." Each of them is strongly motivated (or biased) by the many decisions made at each transition that moved the patient from one part of the spectrum8,10 of the disease into another. Unless these decisions are carefully identified and classified, the statistical collection of patients with that disease will be scientifically meaningless because the results cannot be extrapolated. The patients represent no one except themselves; they are a "sample" of a larger population that cannot be specified because of the many alterations created by such determinant features as iatrotropic stimuli, fashions in medical "work-ups," diagnostic criteria, pretherapeutic criteria, the effects of co-existing diseases, the patients' acceptance of the doctors' proposals, and the chronologic "date-marks"13 used for tabulating statistics.

Nevertheless, clinicians and statisticians seldom classify or tabulate these determinant features, and persist in analyzing the statistical data with concepts based on "random sampling." In most enumerated surveys or probative experiments designed to demonstrate the characteristics of hospitalized patients with a particular disease, these determinant features are rarely cited. Upon this scientifically defective description of the collection of patients, there is then imposed the statistical absurdity of calculating "standard errors" and "95 per cent confidence intervals" to estimate the "true parameters" of the mythical "base population" of the disease, of which this polygenous collection of people is assumed to be a "random sample." Worse yet, a "control" group of people who do not have the disease is

generally selected, by even more defective methods, from the rest of the patients in the hospital, and the "parameters" are then estimated for the mythical "base population" of the "controls."

The problem of choosing suitable control groups for this type of "retrospective" or "cross-sectional" study of etiology and pathogenesis of disease will be considered in a later paper of this series. The main point to be noted here is that if hospitalized patients with a particular disease are "random" representatives of nothing, hospitalized "controls" represent even less. Nevertheless, statistical "tests of significance" are constantly performed upon the differences in the estimated "parameters" of the fallacious populational science and bizarre mathematical imagery of these "random samples." The statistical problems of inappropriate "parametric estimations" can be avoided with nonparametric tests, quantile methods, and other suitable techniques,2,27,33 but these statistical alternatives do not affect the basic scientific error of using the poorly identified "disease" and "control" groups for extrapolations to a more general population.

The concept of "random allocation." A different facet of the random problem occurs in allocating treatment for a therapeutic trial. In the types of research described in the previous section, the investigator wants to compare a group of diseased patients with other people who do not have the disease, and one of his main scientific problems is to ascertain that the compared groups are properly representative of the external population of diseased or nondiseased people. In a trial of treatment, however, the investigator deals only with people who have the disease. His choice of a "control" group can depend on his own maneuver in assigning therapeutic agents, rather than on nature's maneuver in creating disease. In this situation, one of the investigator's main scientific concerns is that the "control" group and the "treated" groups be selected without bias. Statistical techniques of "random allocation" seem to fulfill this requirement because they offer the opportunity to assign treatment randomly (as well as to estimate the "parameters" of the treated populations).

Unfortunately, however, in order for the therapeutic maneuver to be the main variable in the "design," the statistical concepts require that the initial experimental material be "homogeneous." This requirement cannot be achieved in the many clinical trials, particularly in chronic disease, in which the patients are extremely heterogeneous in their prognosis for whatever target is under treatment. By conducting a pathogressive "maneuver" in the outcome of disease concomitant with the clinician's maneuver in therapy, nature creates the complex dual experiment that invalidates simplistic statistical designs based on a single maneuver alone. Unless the prognostic differences of the patients are appropriately identified and classified, the treatment will be allocated indiscriminately to "good risk" and "bad risk" cases whose basic differences are left unrecognized.

This type of promiscuous randomization may remove bias in the assignment of treatment, but it also removes clinical sense in the evaluation of results. The results will be clinically meaningless because a clinician will not know how to apply them in the future; he cannot determine whether "good risk" and "poor risk" patients responded the same way to each therapeutic agent. The results may even be clinically misleading because, as discussed elsewhere,11 agent A and agent B may have exactly the reverse therapeutic effects in "good risk" and "bad risk" patients, but the differences will be obscured when all the results are added up in a grand statistical conglomeration that ignores prognostic differences.

Aware of prognostic heterogeneity in patients, some statisticians attempt to stratify the treated population into "comparable" groups. The trouble with most of these stratifications is that the "comparable" groups are usually selected according to

the demographic features of age, race, and sex, instead of the clinical and paraclinical phenomena that are harbingers of prognosis. The problems of a suitably correlated prognostic stratification have been described elsewhere11 and are beyond the scope of the discussion here; they currently constitute one of the major obstacles that impede scientific, statistical, and clinical progress in the design of therapeutic trials.

Reproducible identification of response and other pertinent data. The last item to be contemplated here is a common type of statistical advice that actually hinders the clinical investigator's quest for science. Because of the concepts that underlie most statistical theories, a statistician generally likes to deal with continuous rather than categorical variables, and because of his desire for precise information, he generally prefers "hard" rather than "soft" data. A continuous variable (such as age, height, and serum cholesterol) can be expressed numerically as a dimension on an established scale that permits "continuous" gradations. A categorical variable (such as occupation, ability to work, and severity of chest pain) is expressed either as a titular name or an ordinal rank in a category that does not have continuous values on a numerical scale. "Hard" data consist of objective facts like age, sex, serum cholesterol, and death; whereas "soft" data contain subjective statements and judgments such as severity of pain and functional status.

To avoid categorical expressions and subjective information, a statistician giving advice about the design of clinical surveys and trials will often concentrate on "hard" data, and particularly on data expressed in continuous variables. This choice of data seems desirable because it avoids the scientific problems of using subjective information that is unstandardized and possibly unreliable, and it avoids the statistical problems of developing suitable methods for analyzing categorical rather than continuous data. Unfortunately, however, this choice also avoids the clinical and scientific necessity for getting meaningful results. For the sake of this apparent scientific and statistical convenience, the data used to assess post-therapeutic response may be "hard" and continuous, but inappropriate. For example, the treatment of disabling angina pectoris is often expressed not in terms of what happened to the patient's chest pain or incapacitation, but in terms of his serum cholesterol or electrocardiographic waves; the "palliation" of cancer is usually reported according to survival time, white blood count, and roentgenographic abnormalities instead of the true palliation of the patient's discomfort, distress, and functional quality of life.

The persistent focus on these inappropriate choices of crucial variables serves to enhance and perpetuate scientific deficiencies in clinical procedures. Clinicians could improve the quality of clinical science by preserving their attention to the important information contained in "soft data," establishing rigorous criteria for "dissecting" decisions into component judgmental parts,8 and performing suitable studies to remove or minimize observer variability in collecting the data needed for those decisions. Instead, however, clinicians and statisticians often accept the reasoning of "clinical judgment" as a mystique that is beyond the reach of analytic science, and ignore crucial judgmental information in favor of "hard" data that may be statistically satisfying but clinically and scientifically useless.

The architecture of clinical biostatistics

Not all statisticians have become obsessed with a purely statistical approach to experimental design, and thoughtful biostatisticians have pointed out its follies25 and dangers.38 When experienced biostatisticians analyze the plan of a research project, they generally concentrate on judging the basic scientific principles that must be managed before any statistical theories of design can be applied. In a recent paper describing "misleading medical research," Stanley Schor32 emphasized the need for

careful statistical judgment in at least five basic prerequisites to scientific research:

Did the researcher choose the right people to question or experiment on?
Did the researcher choose a statistical unit that made his problem solvable?
Did the researcher use a control group and choose and use it properly?
Are the groups compared truly comparable?
Did the researcher guard against a probable bias in the people he was testing?

Each of these five issues as well as many of the other scientific principles cited here and elsewhere1 are constantly mismanaged in clinical investigation, because a distinct methodologic discipline has not been established for this type of research. A clinical or epidemiologic investigator who wants advice about suitable ways to choose control groups, define statistical units, avoid bias, establish criteria, arrange comparability, and perform many of the other prerequisites to scientific research cannot find published specifications for the activities. The statistical literature contains many precise instructions for tests that quantify the operations of chance, but almost no instructions for the judgment needed to make the decisions of science.

Clinical biostatistics has thus been obscured in the mystiques of two types of nondescript "judgment": an artful clinical "judgment" whose rational characteristics are omitted from clinical publications dealing with medical science, and a scientific-statistical "judgment" whose operational details are omitted from statistical publications dealing with artful design. An able clinician constantly uses clinical judgment in his investigative decisions but does not describe its components; and an able statistician engages in an analogous occultation of the statistical judgment used to plan research.

"Mathematics," said Prof. P. G. H. Gell,21 "has now matured into a sacred cow which as often as not gets in the way of scientific traffic." The current collection of abstract statistical models for the design of experiments has not been and cannot become satisfactory for the realistic scientific architecture of clinical biostatistics. The difficulty has been well summarized by a mathematician29:

A frequent cause of incompatibility between the model and reality is the neglect or confusion among influential variables . . . or the imposition of unrealistic bounds on variables. . . . A proof within a mathematical model proves nothing in biology. The choice or design of a valid mathematical model is dictated by what is needed to describe the real situation. . . . This requires a well-structured imagery, based on deep knowledge and understanding of the real situation, and it is precisely here that the major influence of workers in biology must be felt. My choice of the term imagery was not accidental. It was chosen to connote the imaginative, interpretive, even poetic outlook which biologists must provide. . . .

A methodologic research discipline in clinical biostatistics will require a "well-structured imagery" that represents the fusion of realistic science and poetic art. Since this fusion is better suggested by the term architecture than by the existing name design, the surveys, trials, experiments, and explorations discussed in this paper can be regarded as an outline of the structures produced in the architecture of clinical research. The next few papers in this series will be devoted to operational principles and other details of this biostatistical architecture.

References

1. Baird, D. C.: Experimentation: An introduction to measurement theory and experiment design, Englewood Cliffs, N. J., 1962, Prentice-Hall, Inc.
2. Bradley, J. V.: Distribution-free statistical tests, Englewood Cliffs, N. J., 1968, Prentice-Hall, Inc.
3. Chapin, F. S.: Experimental designs in sociological research, revised ed., New York, 1955, Harper & Row, Publishers.
4. Cochran, W. G., and Cox, G. M.: Experimental designs, New York, 1950, John Wiley & Sons, Inc.
5. Cox, D. R.: Planning of experiments, New York, 1958, John Wiley & Sons, Inc.
6. Edwards, A. L.: Experimental design in psychological research, ed. 3, New York, 1968, Holt, Rinehart & Winston, Inc.
7. Federer, W. T.: Experimental design, New York, 1955, The Macmillan Company.
8. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
9. Feinstein, A. R.: Clinical epidemiology: I. The populational experiments of nature and of man in human illness, Ann. Intern. Med. 69:807-820, 1968.
10. Feinstein, A. R.: Clinical epidemiology: II. The identification rates of disease, Ann. Intern. Med. 69:1037-1061, 1968.
11. Feinstein, A. R.: Clinical epidemiology: III. The clinical design of statistics in therapy, Ann. Intern. Med. 69:1287-1312, 1968.
12. Feinstein, A. R., Koss, N., and Austin, J. H. M.: The changing emphasis in clinical research. I. Topics under investigation. An analysis of the submitted abstracts and selected programs at the annual "Atlantic City Meetings" during 1953-1965, Ann. Intern. Med. 66:396-419, 1967.
13. Feinstein, A. R., Pritchett, J. A., and Schimpff, C. R.: The epidemiology of cancer therapy. II. The clinical course: Data, decisions, and temporal demarcations, Arch. Intern. Med. 123:323-344, 1969.
14. Feinstein, A. R., and Spitz, H.: The epidemiology of cancer therapy. I. Clinical problems of statistical surveys, Arch. Intern. Med. 123:171-186, 1969.
15. Finney, D. J.: Experimental design and its statistical basis, Chicago, 1955, The University of Chicago Press.
16. Fisher, R. A.: Statistical methods for research workers, Edinburgh, 1925, Oliver & Boyd, Ltd.
17. Fisher, R. A.: The design of experiments, Edinburgh, 1935, Oliver & Boyd, Ltd.
18. Fisher, R. A.: The design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd.
19. Fisher, R. A.: The design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd., p. 9.
20. Fisher, R. A.: The design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd., p. 33.
21. Gell, P. G. H.: quoted in Research and imagination, Lancet 1:273, 1969.
22. Gifford, R. H., and Feinstein, A. R.: A critique of methodology in studies of anticoagulant therapy for acute myocardial infarction, New Eng. J. Med. 280:351-357, 1969.
23. Herzberg, A. M., and Cox, D. R.: Recent work on the design of experiments: A bibliography and a review, J. Royal Statist. Soc. (Series A) 132:29-67, 1969.
24. Hicks, C. R.: Fundamental concepts in the design of experiments, New York, 1964, Holt, Rinehart & Winston, Inc.
25. Hogben, L.: Statistical theory: The relationship of probability, credibility and error, London, 1957, George Allen & Unwin.
26. Kempthorne, O.: The design and analysis of experiments, New York, 1952, John Wiley & Sons, Inc.
27. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
28. Mendenhall, W.: Introduction to linear models and the design and analysis of experiments, Belmont, Calif., 1968, Wadsworth Publishing Company, Inc.
29. Nooney, G. C.: Mathematical models, reality, and results, J. Theor. Biol. 9:239-252, 1965.
30. Peng, K. C.: The design and analysis of scientific experiments, Reading, Mass., 1967, Addison-Wesley International Division.
31. Quenouille, M. H.: The design and analysis of experiment, London, 1953, Charles Griffin & Company, Ltd.
32. Schor, S. S.: How to evaluate medical research reports, Hosp. Physician 5:95-109, 1969.
33. Siegel, S.: Nonparametric statistics for the behavioral sciences, New York, 1956, McGraw-Hill Book Company, Inc.
34. Snedecor, G. W.: The statistical part of the scientific method, Ann. N. Y. Acad. Sci. 52:792-799, 1950.
35. Vajda, S.: The mathematics of experimental design: Incomplete block designs and Latin squares, London, 1967, Charles Griffin & Company, Ltd.
36. Winer, B. J.: Statistical principles in experimental design, New York, 1962, McGraw-Hill Book Company, Inc.
37. Wortham, A. W., and Smith, T. E.: Practical statistics in experimental design, Dallas, 1960, Dallas Publishing House.
38. Zelen, M.: The education of biometricians, Amer. Statis. 23:14-15, 1969.

CHAPTER 3

Components of the research objective

Because the scientific fundamentals of clinical research are omitted in most statistical discussions of the "design of experiments," a new scientific architecture must be developed as a methodologic discipline for planning clinical investigations. These investigations can be constructed in several different ways, according to the events, problems, and challenges contained in the clinical experiments of nature and of man. In the previous paper of this series, I outlined those diverse investigative structures, which include the following: surveys of nature's activities in preserving health, or in creating and evolving human disease; surveys and experimental trials of man's therapeutic attempts to intervene in the course of nature; explanatory experiments that probe a maneuver of nature or delineate a maneuver of man; and exploratory efforts to assess and improve the methodology used in the foregoing research activities.

This paper and its two successors in this series are concerned with the "architectural" operations used in constructing those investigative projects. The last of the research structures just cited, the methodologic exploration, is arranged quite differently from the other activities and will not be considered further in this discussion, which is confined to the construction of clinical surveys and experiments.

The operational concepts that are described here are not original. They have been expressed, in whole or part, in various treatises on scientific methods of research,1,3,7 ... 18,20 and although given scant attention in conventional statistical texts, the principles have received extensive discussion in the first half of Donald Mainland's Elementary Medical Statistics.12 My own contribution is to arrange these principles in a different way, based on the sequential events of an experiment. This sequence occurs as

Initial State --Maneuver--> Subsequent State

and its architecture is designed during ten operations that are introduced or combined at various stages in the contemplation of the sequence. The first operation is outlined in this paper; the other nine operations will be outlined in the next two papers of this series; and further details of the operational structures and processes will be discussed subsequently.

This chapter originally appeared as "Clinical biostatistics. III. The architecture of clinical research." In Clin. Pharmacol. Ther. 11:432, 1970.

The first operational principle in planning a research project is to stipulate the objective of the investigation. Although this principle seems obvious, or perhaps because it is so obvious, it is often ineffectually managed in proposals or reports of clinical research. In meeting with a biostatistical consultant, an investigator may expansively describe the background for his proposed research, the reasons he wants to do it, and its presumptive importance, but he often does not specify the components or the logic of what he wants to do. The specifications are also often omitted when a completed project is reported in the literature. The investigator's failure to make a clear, precise statement of the objective of the research is detrimental both to the biostatistician who helps design a project, and to the reader-reviewer who later decides whether the work is worth doing or who appraises its completed results, because all of the detailed planning and analysis of the research depend on the original stipulation of the objective.

An adequate description of the initial state, the [purpose] and logic for the maneuver, and the subsequent state of the observed population are prerequisite to both the scientific methods and the artistic taste that enter into the architecture of a research project. The scientific methods are used to achieve the investigator's objective, and the work will inevitably become misdirected or incomplete if the objective is not suitably specified. The artistic taste is needed for evaluating the general importance and worth of the research, and for making the many intermediary decisions during the choice of methods and procedures. This discussion is concerned mainly with the scientific issues in method, because the logical issues can be readily defined and described. The crucial roles of "artistic" judgment will be noted when they appear, but the performance of the judgments involves complexities that are beyond the scope of this discussion. Among such judgments are the "common sense" used for intermediate decisions during the architectural construction, and the basic original decision that the research project is worth doing. If we agree that the work should be done, we can then proceed to its plans.

At the onset of design for a clinical investigation, the objective is usually stated in the form of a question. "What happens," the investigator may say, "when something X is done to something Y?" In this original statement of the research question, the investigator almost always indicates the principal maneuver under surveillance, and he usually also mentions or implies the initial state of the population. Regardless of how well the question has been stated, however, it almost never contains all the details needed to implement the objective of a research project. Consequently, the main principle of biostatistical architecture is to expand the investigator's original statement into adequate specifications of the initial state and subsequent state of the population, and to establish the necessary distinctions of the principal and subsidiary maneuvers that may be employed.

A. Initial state. The initial state of a population requires two different specifications: a diagnostic account of the existing conditions of health or disease, and a prognostic anticipation of the likelihood of achieving the subsequent state. For the illustrations that follow, let us consider a mythical new drug, Excellitol.

1. Diagnostic demarcation. Suppose an investigator asks the question, "Is Excellitol good for angina pectoris?" From the wording of this question, we do not know the initial state of the population. We cannot tell whether the purpose of the therapeutic maneuver is remedial, to relieve angina pectoris whenever the pain occurs; or contrapathic, to prevent angina in people who have never had it; or contratrophic, to prevent an adverse course of coronary artery disease in patients who have had angina.
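The components named above (initial state, maneuver, purpose, and target) can be pictured as fields of a record that must all be filled in before planning begins. The following is only an illustrative sketch: the class names, field names, and example wording are assumptions introduced here, and the text itself does not prescribe any such data structure. It shows how the question "Is Excellitol good for angina pectoris?" names a maneuver but leaves the other components unspecified.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class Purpose(Enum):
    """The three therapeutic purposes distinguished in the text."""
    REMEDIAL = "relieve the condition whenever it occurs"
    CONTRAPATHIC = "prevent the condition in people who never had it"
    CONTRATROPHIC = "prevent an adverse course in people who have it"


@dataclass
class ResearchObjective:
    """Hypothetical record of what a research question must specify."""
    maneuver: str
    initial_state: Optional[str] = None
    purpose: Optional[Purpose] = None
    primary_target: Optional[str] = None


def unspecified(objective: ResearchObjective) -> List[str]:
    """Return the names of the components still left undefined."""
    return [f.name for f in fields(objective)
            if getattr(objective, f.name) is None]


# "Is Excellitol good for angina pectoris?" names only the maneuver.
vague = ResearchObjective(maneuver="Excellitol")

# One possible (assumed) way of completing the same question.
specific = ResearchObjective(
    maneuver="Excellitol",
    initial_state="patients with established angina pectoris",
    purpose=Purpose.REMEDIAL,
    primary_target="relief of chest pain during an attack",
)

print(unspecified(vague))
print(unspecified(specific))
```

A different choice of purpose (contrapathic or contratrophic) would call for a different initial population and a different target, which is the reason the question as worded cannot be planned.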

Another unsatisfactory wording of a research question would be: "Does Excellitol retard the growth of normal children?" Aside from the later problems of defining what is meant by "retard" and "growth," this question does not indicate what kind of normality is sought in the initial state of the children. Are they to be "normal" in general physical health, or must they also be free of any mental, psychic, or biochemical abnormalities?

As the first step in architectural specifications, therefore, the initial state must be diagnostically demarcated according to the types of healthy or diseased people that are to be assembled for the investigation.

2. Prognostic stratification. In the type of problem just cited, the state of the initial population was diagnostically imprecise. A more subtle and more common difficulty occurs when an investigator asks the question, "Does Excellitol lower the fatality rate in patients with acute myocardial infarction?" In this situation, the initial diagnostic state of the population is specified and the purpose of the therapeutic maneuver is clearly noted as contratrophic, since it is intended to prevent death in diseased patients. Nevertheless, the question is imprecise because it does not specify the types of patients to be considered within the broad clinical spectrum of acute myocardial infarction.

Because of nature's underlying "maneuvers" in producing and evolving the course of a myocardial infarction, the disease has a diverse clinical spectrum of patients with major differences in their prognostic anticipations. The fatality rate, although quite low in "good-risk" patients who have had no symptoms other than transitory chest pain, is quite high in "bad-risk" patients with shock, pulmonary edema, and/or significant arrhythmias.15,17 Although both types of patients share the diagnosis of "acute myocardial infarction," their anticipated outcomes are so disparate that the patients are not comparable. They represent two different "experiments" instigated by nature, and they must be separated into two different groups within which therapy can be assessed.3,4 The stratification performed to divide these prognostically disparate groups is thus used not to create different "factorial blocks" in the same experiment, but to establish the initial populations for different experiments, within each of which the same therapeutic maneuvers may then be imposed. Because of these distinctions, groups of patients with contrasting prognoses should not be statistically mingled in an analysis of variance as though they had "interactive variables." The "good-risk" population forms one experiment; the "bad-risk" forms another; and their statistical appraisals must be planned accordingly.

B. Subsequent state. The subsequent state of the population serves as the target of the experimental maneuver. In a survey of the ontogenetic, pathogenetic, or pathogressive activities of nature, this target represents nature's production of various states of health or disease. In a survey or trial of man's therapeutic activities, the target represents the remedial or prophylactic goal to be achieved by the treatment. In an explanatory experiment, the target is selected as a particular response to be evoked by the chosen maneuver. To plan the architecture of a clinical research project, the targets in the subsequent state must be identified, differentiated, and prognostically correlated.

1. Identification of targets. Since a person can change or react in innumerable ways while under observation, and since an investigator could not assemble detailed information about all of these possible responses, the targets of the research must be identified to denote the basic observations that are to be performed and analyzed. In all of the examples that follow, the primary target of the research maneuver is inadequately cited in the expressed question, and the reasons are noted in the associated parenthetical question:

"Is Excellitol hazardous to health?" (What kind of hazard?)

Components

"What mal

for

is

the optimal dose of Excellitol?" (Opti-

what goal?)

"Is Excellitol a good analgesic agent?" (For what type of pain?) "Is Excellitol good for angina pectoris?" (To

prevent angina or to relieve

of the research objective

another way according to their risk of developing thrombophlebitis; similarly, in patients with acute myocardial infarction,

the prognostic classifications for the duration of chest pain are different

it?)

In addition to a clear identification of all

that

so

maneuver must be specified, suitable plans can be made for

observation of the subsequent state. Differentiation

2.

of

Although

targets.

maneuver of an experiment has a primary target, many other events can be studied as ancillary targets, and additional phenomena may be noted as incidental targets. For example, although the primary target of Excellitol therapy may be the

may

angina pectoris, the investigator

want

examine such ancillary medication, the convenience of administration, and the occurrence of symptomatic side effects. Furthermore, although Excellitol may not be expected to affect the white blood count or blood urea nitrogen, these entities may be chosen for observation to ensure that they have not incidentally received an adverse effect from the also

to

targets as the palatability of the

drug.

Thus, in

listing

all

the properties that

be observed as possible targets in the subsequent state, the investigator must

are to

differentiate

incidental

the

role

of

primary,

ancillary,

and

the targets in the re-

search. 3.

Correlation of prognostic strata. One main reasons for differentiating the

targets of

an investigated maneuver

is

to

provide an appropriate correlation for performing the prognostic stratification of the state.

Since

members

population will have for

stratification

distinctions,

cannot

prog-

a

be designed

arbitrarily according to age, race, sex, etc.,

instead,

be planned

in

direct

correlation with the particular target under surveillance. 4

particular

If

the primary target of a

maneuver

is

prevent event A,

to

the initial population should be stratified

according to their risk of achieving event A. If a different event, B,

is

encountered

a "side effect" of the maneuver,

as

the

proper evaluation of the results requires that the population receive a completely

separate

stratification

according

to

the

risk of achieving event B.

C. Maneuver. All of the features just

and classification of the population before and after the maneuver. The next step in architectural design deals with the logic needed to ensure that the maneuver is suitably applied and evaluated. This logic is based on the maneuver's potency and on the comparison, multiplicity, and concurrence cited refer to a suitable identification

of additional maneuvers. 1.

Potency.

refers to

its

The potency

of a

maneuver

capacity to achieve the desired

target state. This capacity

may depend on

the dose or intensity of the maneuver, and

of the

initial

Because of these nostic

and must,

the

relief of

vival.

of the other possible

targets of the

from the

that demarcate 30-day sur-

classifications

the primary target,

31

different

classification

targets,

of the

different

the

same

prognoses

selection

and

of the variables used for a

particular prognostic stratification will de-

pend upon the chosen target. Thus, the same group of healthy women might be classified in one way according to their likelihood of becoming pregnant, and in

on

its

ample,

manner if

For exwhether cigarette

of administration.

we wish

to test

smoking causes a particular disease, what be the acceptable quantity of cigarettes for someone to be regarded as a "smoker"? Should the classification of nonsmoker or stopped-smoker be used for a person who has not smoked for decades after having consumed a total of several packages at diverse times during adolescence? Should we distinguish cigarette smokers who "inhale" the smoke from those who do not? A more subtle problem in potency of will

the

maneuver

deals with pharmaceutical

The

32

agents

architecture of cohort research

—such

as

and

coagulants,

aspirin,

digitalis,

insulin

— that

may

anti-

require

a different dosage in different people to

produce the desired effect.

these agents

If

are given at a fixed dosage to

the pa-

all

an investigation of therapy, the clinical activity is improper because some patients will receive too much of the drug and others will receive not enough. On the in

tients

hand,

other

agents

the

if

are

given in

variable dosage, the statistical analysis

seem

may

be confounded by the necessity

to

to evaluate the

many

different dosages of

same treatment.

the

An analogous type of problem arises if we want to compare surgical versus meditherapy of a particular disease, but each surgeon may want to per-

cal

find that

form the operative procedure

in a slightly

different way, according to his

ular skills

and judgment.

If

are denied the opportunity to

own minor

variations

in

own

partic-

employ

their

performing the

not be carried out in an optimum

manner by each

of the individual opera-

In the

first

type of situation (choice of

an appropriate "dose" and "inhalation" of cigarettes), there is no scientific solution for the problem. The decision must be made according to whatever seems sensible to the investigator

and acceptable

to the

people who review the results. In the second and third type of situation (dealing with variable doses of drugs or minor modifications of a surgical procedure), the most scientifically appropriate decision is to recognize that

the

surgical

we want operation

its

individual

in

the

surgical

way

of performing the selected

maneuvers. The statistical appraisals are not confounded by this decision, because our objective is to test a particular type of man> 'iver, rather than each of its minor variatioi

"potency." 2. Comparison. When an investigation is performed for purely descriptive pur-

poses

—as

for example, in surveys of the

growth, development, or treatment of a state of health or disease

Accordingly,

—no

comparative

maneuver need be considered. Examples of research containing no comparative maneuvers are the investigations conducted to answer the following questions: "Does blood pressure increase as healthy people grow older?"; "What is the incidence of pulmonary embolism in hospitalized patients?"; and "What have been the results

of cardiac transplantation?"

many

In

other types of clinical research,

however, a comparative maneuver is employed as part of the necessary scientific logic of the experiment. We could not demonstrate, for example, that Excellitol is

hazardous to health unless

ing

of

Excellitol

pectoris unless

was

compared.

in

data

of not tak-

Nor could we

Excellitol.

value

we had

maneuver

assess

relieving

the

angina

some other mode of relief For many investigative

the complete archi"maneuver" involves the choice of comparative maneuver(s) as well as the particular one under prime contherefore,

situations,

tecture

of

the

sideration.

The

scientific

logic

used in

choosing

careful consideration of the relativity, con-

procedure, because this flexibility provides the best

gard the administered spectrum of the drug or operation as a single form of treatment that has been given in optimal

drug

we would

modifications

or re-

optimal

to test the at

dosages

and we would

different

comparative maneuvers is a crucial feature in the design of research, and requires

choose a dosage for the drug and allow

"potency"; therefore, flexible

the

of

for the comparative

tors.

or

results

operative procedures,

the surgeons

operation, the general surgical procedure

may

the

we would combine

stituents,

and environment

of the compari-

son. a.

a. Relativity. The comparative maneuver must be specifically related to the question to be answered in the research. If the question deals with efficacy of the experimental maneuver, the appropriate comparison should contain essentially no maneuver or a placebo, but if the question deals with efficiency, the comparative procedure should contain a deliberately comparable maneuver. For example, if we want to know whether Excellitol is capable of relieving headache, its efficacy is under question, and it should be compared against a placebo; but if we want to know whether Excellitol works better than existing anti-headache preparations, its efficiency should be compared against an established agent, such as aspirin.

b. Constituents. The internal surroundings of the main maneuver represent its "constituents." For a surgical procedure, the constituents include such features as the associated anesthesia; for a pharmaceutical agent, the constituents include the medium in which the active ingredient is conveyed; for a maneuver such as cigarette smoking, the constituents include the paper in which the tobacco is wrapped.

Since the scientific purpose of comparison is to expose the compared groups to maneuvers that are identical in every way except for the "active ingredient," the constituents of the main maneuver must be considered when its comparative maneuver is planned. For example, a sham surgical procedure is not adequately comparable to the real surgical procedure if the sham is performed with a different anesthetic agent; if Excellitol is dissolved in some special solution before being administered intravenously, the comparative placebo should also be dissolved in that same solution.

c. Environment. The external surroundings of the main maneuver represent its "environment." In a maneuver of clinical therapy, the environment would include: the home, hospital, or other setting in which the treatment is given; the ancillary treatment that may be employed in addition to the main maneuver; the personal efforts necessary for the patient to adhere to the treatment; and the frequency and intensity of the interchange between the doctor and patient. Thus, even if the comparative maneuver has a composition essentially identical to that of the main maneuver, the two maneuvers are not truly comparable unless they are administered in similar environmental surroundings.

For example, one of the main roles of placebo treatment is to ensure that both patient and physician are exposed to the medicinal ritual, personal meetings, and therapeutic expectations that would not occur if the comparative maneuver were simply "no treatment" rather than a placebo. An example of inappropriate comparisons that ignored the environmental differences of ancillary therapy occurred in surveys of anticoagulant treatment for myocardial infarction.⁸ The patients who received "no treatment" in the pre-anticoagulant era were usually kept at prolonged bed rest, whereas the patients who received anticoagulants also often received chair rest, early ambulation, and elastic stockings.

A more subtle problem in the analysis of environment occurs when an extraordinary personal effort is necessary for a patient to adhere to the assigned maneuver. Thus, a patient's maintenance of oral medication ordinarily creates no psychic difficulties or discomfort except for his remembering to take the medication, but adherence to an unappetizing or displeasing diet may require a heroic resolution that is seldom found in most people. When such heroic efforts are required to comply with a maneuver, the persons who are willing or able to make the special effort may have other distinct characteristics that will make their postmaneuver state different from the state of people who cannot adhere to the prescribed maneuver, but the different results in the adherers and nonadherers may then be fallaciously ascribed to the maneuver. For example, suppose it is true that life is prolonged by such "healthful" patterns of living as regular amounts of sleep and exercise, and suppose that people who are compulsive enough to adhere to an unappealing diet will also assiduously maintain "healthful" patterns of living. Now suppose we want to test whether continued use of a low-sulfate diet, although extremely distasteful, can lower mortality rates.

The results of the research may show that people who maintain this diet live longer than those who do not adhere to the diet, but what may really have happened is that the compulsively "healthful" people, who would live longer anyhow, were the only persons capable of adhering to the diet over long periods of time. To avoid this problem, a strictly comparable maneuver would involve the use of an equally distasteful diet that is not low in sulfate.

3. Multiplicity. The next issue to be considered is the number of maneuvers that will be contrasted. In most research projects, the investigator wants to compare two maneuvers (a presumably active agent versus a presumably inert one, or a new agent versus an old one), but in certain types of research, more than two maneuvers are examined. Such situations occur when the investigator seeks to find a "best" or "worst" among several agents; when the experiment is performed to explain the action of a particular maneuver; or when the research project is particularly difficult to execute, so that the investigator wants it to answer as many questions as possible. An example of the search for a best agent would be the comparison of Excellitol, aspirin, buffered aspirin, and placebo in the relief of headache; another such example would be a test of Excellitol given at five different dose levels to determine which dose produces the best results. An example in which multiple maneuvers are employed to explain a mechanism of action would be the comparison of various ingredients of Ringer's solution, given alone or in diverse combinations, to determine which of the ingredients is necessary to produce an effect noted after administration of the intact composite solution. In a large-scale cooperative study of the treatment of a chronic disease, involving many years of observation for hundreds of patients at different hospitals, the investigators might want to test several different diets and drug regimens in order to gain as much information as possible from the efforts needed to carry out the execution of the long, complicated project.

The use of multiple maneuvers creates subtle problems in the scientific logic used for assessing the results of the comparison. If we are interested in only one subsequent target, the compared maneuvers are relatively easy to assess, no matter how many are used. This ease is due to the simplicity of ratings when only a single target is considered. Without any subtle decisions, the effect of each maneuver on the single target can be ranked as highest down to lowest, or "best" through "worst."

When more than one target is under scrutiny, however, the analysis of results becomes complicated by the different ratings that each maneuver can achieve for its individual effect on the different targets. Suppose that the primary target of our maneuver is to relieve headache, but we also intend to assess such ancillary targets as speed of relief, cost of drug, palatability of drug, and adverse side effects. If we are comparing only two drugs, Excellitol versus aspirin, we might find that Excellitol acts somewhat more rapidly and tastes somewhat better than aspirin, but costs ten times as much and has many more adverse side effects. Since neither science nor statistics provides a method for deciding which of these results is most desirable or how they counterbalance one another, we must make the decision as an act of judgment or clinical "common sense." Because only two maneuvers are involved, the judgment is not particularly difficult. We can readily compare and "weigh" the effects of the two maneuvers on each of the multiple targets, and, in this case, we might conclude that aspirin is the better agent because the increased costs and side effects of Excellitol outweigh its slight advantages in other features.

On the other hand, if the research involved a simultaneous comparison of five different antiheadache preparations rather than two, we would have much greater difficulty in making a similar judgment because each agent might rank differently for each of the four targets under consideration. In trying to make decisions that involve multiple concomitant "weighings," the standard judgmental procedures we use for "common sense" are confronted by a complexity seldom encountered in our ordinary acts of thinking, where we usually restrict our simultaneous comparisons to two alternatives rather than a multitude. For this reason, multiple maneuvers can be used most advantageously only when a single target is to be assessed. In large-scale clinical surveys and trials of therapy, where many different effects and side effects must be evaluated in the target state, the use of multiple maneuvers may often be unavoidable, but will greatly increase the difficulty and complexity of the subsequent analysis of data.

Another problem in multiple maneuvers arises because the comparisons may be either equivalent or additive. Equivalent maneuvers are expected to act in essentially the same way or to produce the same type of effect, and can be used as replacements for one another. Thus, in the previous example of a four-way test of Excellitol, aspirin, buffered aspirin, and placebo in the relief of headache, each of the four maneuvers might be regarded as equivalent, and the "winner" of the test would then be preferred instead of the others. On the other hand, additive maneuvers contain agents that are expected to act differently or to produce different results, so that the maneuvers may be combined. Thus, if we want to compare Excellitol versus whisky in the relief of headache, we might assume that these two agents act differently, and we might also want to assess the effects of a combination of Excellitol and whisky. When additive maneuvers are employed, the choice of appropriate comparisons is made difficult by the need to establish a suitable comparison for each of the different maneuvers, alone or in combination. For example, in the Excellitol-whisky test that was just described, each of the following maneuvers would require consideration if all possible comparisons were to be checked:

Excellitol alone
Whisky alone
Excellitol and whisky
Excellitol and whisky placebo
Excellitol placebo and whisky
Excellitol placebo and whisky placebo
Excellitol placebo only
Whisky placebo only
No treatment

Since the above array contains too many maneuvers for the practical realities of clinical research, the number of comparisons must be sharply reduced, and, since no rules of science or statistics are available for the reducing procedure, another act of logic is needed to choose the maneuvers that are most cogent for the specific question of interest to the investigators. If the objective is to determine which of the two agents or both is the most effective, the comparison in the above example might be reduced to:

Excellitol and whisky
Excellitol and whisky placebo
Excellitol placebo and whisky

4. Concurrency. The last issue to be considered here is the concurrency of the compared maneuvers. In most experimental situations, the maneuvers are tested concurrently in people who receive the compared maneuvers during essentially the same period of time. In certain circumstances, however, the maneuvers may be administered serially. The serial situations occur in "crossover" therapeutic trials, where the same person is successively exposed to different maneuvers, or in "conditional" experiments where maneuver B will follow maneuver A provided that maneuver A has elicited a certain result. An example of a "conditional" situation is the use of preoperative radiotherapy for certain cancers: the surgery is performed afterward only if the patient has remained "operable" at the conclusion of the radiotherapy. The most common type of clinical research involving nonconcurrent maneuvers is in the "dystemporal" surveys of treatment that compare the results of a new agent with the results obtained many years earlier in a group of patients treated with older methods. For the scientific validity of comparison, all of these nonconcurrent maneuvers create hazards that arise not from the maneuvers, but from differences in the initial population or in the ancillary maneuvers employed at the different times when the compared maneuvers were actually administered.

a. Serial maneuvers. A "crossover" trial of therapy is based on the assumption that the patient has the same initial state each time he is exposed to the next maneuver. This assumption is seldom valid in clinical reality because of "carry-over" therapeutic effects created in the target by the agents used in the previous maneuver, or because the patient's clinical state may have been changed by the preceding treatment or by the evolving course of his disease. For this reason, the various Latin, Graeco-Latin, or other "squares" that are so popular in statistical discussions of "experimental design" are frequently not applicable in clinical research. A "crossover" trial can be valid only when a long enough "wash-out" period elapses between the successive therapeutic agents, and when the patient returns to the same initial clinical state after each treatment is completed. Examples of possibly suitable populations for crossover trials are essentially healthy people who are persistently asymptomatic, or who have a predictively recurrent symptom, such as dysmenorrhea.

b. Conditional maneuvers. In studies of conditional maneuvers, major bias may be created by the demand that the patient attain a particular state before the second maneuver can be given. Because of this bias, the combination of two maneuvers may be given mainly to patients whose prognostic anticipations are significantly better or worse than the patients who receive only one of the two. For example, in surveys of the treatment of lung cancer, the maneuver of surgery alone may be compared against the two maneuvers of surgery followed by radiotherapy. This comparison is obviously not valid because radiotherapy is usually ordered after surgery only when the operation disclosed intrathoracic metastases. The patients without metastases are seldom given postoperative irradiation, but also have better prognoses than those who receive it. A converse example is provided by trials of preoperative irradiation for lung cancer. The patients who can go through the time required for the radiotherapy, while still keeping their tumors anatomically operable, may have slower growing cancers, and hence better prognoses, than the patients with diverse rates of growth who receive surgery immediately. A more suitable comparison for preoperative radiotherapy might be a group of patients who remain "operable" and receive surgery after a time delay equal to that consumed by radiotherapy.

c. Dystemporal maneuvers. When a treatment used during one period of time is compared, in a therapeutic survey, with another treatment used at a later era, the dystemporal comparison may create two different types of bias, unless the two populations exposed to the maneuvers are carefully checked for prognostic stratification and environmental effects. Because the passage of time may bring improvements both in diagnosis and in ancillary aspects of therapy, these temporal improvements, rather than the differences in the main treatment, may be responsible for the better results found in the group treated at the later date. In comparison to the earlier population, the later group may contain milder cases of the disease (with better prognoses) and may also be exposed to better methods of ancillary treatments that were disregarded in the environmental assessment. Examples of this type of dystemporal comparison were cited earlier during the discussion of therapeutic surveys of anticoagulants in myocardial infarction.⁸
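The prognostic hazard of a dystemporal comparison can be sketched with a small calculation. The counts below are hypothetical, invented only for illustration and not taken from any study cited here; the stratum names reuse the "good-risk" and "bad-risk" labels of the earlier discussion, and the function names are likewise assumptions. The sketch shows two eras with identical fatality rates inside each prognostic stratum, yet a lower crude fatality rate in the later era simply because that era contains more good-risk patients.

```python
# Hypothetical counts, invented purely for illustration (not data from the text).
# Each entry maps a prognostic stratum to (number of patients, number of deaths).
earlier_era = {"good-risk": (100, 10), "bad-risk": (100, 50)}
later_era = {"good-risk": (160, 16), "bad-risk": (40, 20)}


def crude_fatality(cohort):
    """Fatality rate with all prognostic strata pooled together."""
    patients = sum(n for n, _ in cohort.values())
    deaths = sum(d for _, d in cohort.values())
    return deaths / patients


def stratum_fatality(cohort):
    """Fatality rate computed separately within each prognostic stratum."""
    return {stratum: d / n for stratum, (n, d) in cohort.items()}


for label, cohort in (("earlier era", earlier_era), ("later era", later_era)):
    print(label, "crude:", round(crude_fatality(cohort), 2),
          "by stratum:", stratum_fatality(cohort))
```

In this invented example the crude rates differ (0.30 versus 0.18) although the treatment results within each stratum are the same, which is why the two populations need to be checked stratum by stratum before any difference is ascribed to the main treatment.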

• • •
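As a supplementary illustration of the earlier discussion of multiplicity, the nine-way array of Excellitol and whisky maneuvers can be generated mechanically. The sketch below assumes that each agent can be given in one of three ways (active, placebo, or not at all); that reading is an interpretation of the list, and the names `LEVELS`, `AGENTS`, and `maneuver_array` are hypothetical. The labels differ slightly in wording from the printed list (for example, "Excellitol" rather than "Excellitol alone").

```python
from itertools import product

# Assumed levels at which each agent can appear in a maneuver.
LEVELS = ("active", "placebo", "none")
AGENTS = ("Excellitol", "whisky")


def describe(agent, level):
    """Label for one agent given in active or placebo form."""
    return agent if level == "active" else f"{agent} placebo"


def maneuver_array(agents=AGENTS):
    """Return (levels, label) for every combination of levels across the agents."""
    array = []
    for levels in product(LEVELS, repeat=len(agents)):
        parts = [describe(a, lv) for a, lv in zip(agents, levels) if lv != "none"]
        array.append((levels, " and ".join(parts) if parts else "No treatment"))
    return array


full = maneuver_array()

# Reduced array: every agent is given in some form, and at least one is active.
reduced = [label for levels, label in full
           if "none" not in levels and "active" in levels]

for _, label in full:
    print(label)
print("Reduced:", reduced)
```

The filter used for the reduced array is only one way of expressing the choice described in the text; the selection itself remains an act of judgment about which comparisons are most cogent for the question at hand.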

In his introduction to the Design of Experiments, Sir Ronald Fisher⁶ makes the following remarks:

The authoritative assertion, "His controls are totally inadequate" must have temporarily discredited many a promising line of work. . . . This type of criticism is usually made by what I might call a heavyweight authority (who has) . . . prolonged experience, or at least the long possession of a scientific reputation. . . . Technical details are seldom in evidence. . . . Such an authoritarian method of judgment must surely continue, human nature being what it is, so long as theoretical notions of the principles of experimental design are lacking.

Fisher's scorn for imprecise standards of scientific criticism was unfortunately not followed by the development of technical details about the logical structure of scientific research. Although stating that "statistical procedure and experimental design are only two different aspects of the same whole," Fisher created a magnificent set of strategies for mathematical analysis of experiments, but neither he nor his statistical successors established a rigorous set of principles to be used for analyzing the logic of the design.

The procedures just described for delineating the objective of a clinical research project require a thorough knowledge of clinical science rather than theoretical statistics, and the cited operational distinctions require much more attention than any of the subsequent operations in planning the architecture of a clinical investigation. The initial state of the population must be diagnostically demarcated and prognostically stratified; the targets of the subsequent state must be identified, differentiated, and correlated with the prognostic stratification; and the maneuver(s) must be properly chosen for potency, comparison, multiplicity, and concurrency. After these principles of scientific logic and precision have been fulfilled in specifying the objective of the research, the rest of the design can continue.

The next two papers of this series will be concerned with the nine remaining principles that are used to complete the basic architecture of a clinical investigation.

References

1. Beveridge, W. I. B.: The art of scientific investigation, New York, 1950, Random House, Inc.

2. Cox, K. R.: Planning clinical experiments, Springfield, Ill., 1968, Charles C Thomas, Publisher.

3. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.

4. Feinstein, A. R.: Clinical epidemiology. III. The clinical design of statistics in therapy, Ann. Intern. Med. 69:1287-1312, 1968.

5. Feinstein, A. R.: Clinical biostatistics. II. Statistics versus science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.

6. Fisher, R. A.: Design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd.

7. Freedman, P.: The principles of scientific research, London, 1949, MacDonald & Co. (Publishers), Ltd.

8. Gifford, R. H., and Feinstein, A. R.: A critique of methodology in studies of anticoagulant therapy for acute myocardial infarction, New Eng. J. Med. 280:351-357, 1969.

9. Hamilton, M.: Lectures on the methodology of clinical research, Edinburgh, 1961, E & S. Livingstone, Ltd.

10. Handy, R.: Methodology of the behavioral sciences, Springfield, Ill., 1964, Charles C Thomas, Publisher.

11. Jevons, W. S.: The principles of science, London, 1900, The Macmillan Company.

12. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.

13. Northrop, F. S. C.: The logic of the sciences and the humanities, New York, 1959, Meridian Books, Inc.

14. Plutchik, R.: Foundations of experimental research, New York, 1968, Harper & Row, Publishers.

15. Schnur, S.: Mortality and other studies questioning evidence for and value of routine anticoagulant therapy in acute myocardial infarction, Circulation 7:855-868, 1953.

16. Stone, K.: Evidence in science, Bristol, 1966, John Wright & Sons, Ltd.

17. Tillman, C.: Acute myocardial infarction: Ten-year study of consecutive cases managed and evaluated by same physician, Arch. Intern. Med. 111:77-82, 1963.

18. Wilson, E. B.: An introduction to scientific research, New York, 1952, McGraw-Hill Book Company, Inc.

19. Witts, L. J., editor: Medical surveys and clinical trials, London, 1964, Oxford University Press.

20. Wolf, A.: Essentials of scientific method, London, 1925, The Macmillan Company.

CHAPTER

4

Intake, maintenance,

and

identification

In the biostatistical architecture of clinical research, the first operational principle is to specify the components and choose the logic of the objective of the research. As discussed in the previous paper of this series,11 the components consist of a sequence of initial state, maneuver, and subsequent state. The logic consists of suitable scientific judgment in the decisions made to demarcate the diagnostic and prognostic conditions of the initial state of the population; to identify, differentiate, and prognostically correlate the diverse targets of the subsequent state; and to choose maneuvers that are satisfactory in potency, comparison, multiplicity, and concurrency.

After the objective of a research project has been specified, its implementation is planned in nine subsequent architectural operations whose principles are outlined here and in the next paper of this series.

2. Intake

Before a group of people can be studied in a clinical investigation, each person must undergo a series of transfers that bring him from his home to his position as a unit in the research. The diverse decisions that determine the transfers can greatly alter the population available to form the "initial state." Consequently, the second operation of biostatistical architecture is to delineate the role of these decisions in providing the populational "intake" for the research. Because the people under investigation make so many of the crucial choices, the word intake seems more appropriate for this procedure than the conventional term, sampling, which does not connote the important effects of self-selection in the populational "samples" studied in clinical research.

The judgments that govern these transfers are the product of decisions made by the people who are studied, by various sources of referral, and by the investigators or their professional colleagues.
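The difference between an "intake" and a random "sample" can be illustrated with a small arithmetic sketch. Every number below (the size of each severity stratum and the proportion of its members who decide to seek medical attention) is hypothetical and chosen only for illustration; none comes from the studies cited in this chapter.

```python
from fractions import Fraction

# Hypothetical community population with the "same" disease, by severity.
population = {"mild": 600, "moderate": 300, "severe": 100}

# Hypothetical proportion of each stratum whose members seek medical attention.
seeks_care = {
    "mild": Fraction(1, 10),
    "moderate": Fraction(1, 2),
    "severe": Fraction(9, 10),
}


def spectrum(counts):
    """Return each stratum's share of the group as an exact fraction."""
    total = sum(counts.values())
    return {k: Fraction(v) / Fraction(total) for k, v in counts.items()}


# People who transfer themselves into the medical setting.
intake = {k: population[k] * seeks_care[k] for k in population}

population_spectrum = spectrum(population)  # mild 3/5, moderate 3/10, severe 1/10
intake_spectrum = spectrum(intake)          # mild 1/5, moderate 1/2, severe 3/10

for stratum in population:
    print(stratum, population_spectrum[stratum], intake_spectrum[stratum])
```

Under these assumed proportions, severe cases form one tenth of the population but three tenths of the intake, although no one was "sampled" at all; the composition was determined by the people themselves.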

[Figure: diagram of INITIAL STATE, MANEUVER, and SUBSEQUENT STATE, labeled "Intake."]

This chapter originally appeared as "Clinical biostatistics IV. The architecture of clinical research (continued)." In Clin. Pharmacol. Ther. 3:432, 1970.

A. Personal decisions. The individual members of a population under investigation must make several types of personal decision before becoming units in the research.

1. Iatrotropic stimuli. The iatrotropic stimulus has been defined as the reason that a patient seeks medical attention.5,6 This attention must be solicited as the first step in the process that ultimately brings the patient into a research project conducted at a medical setting. The stimulus can arise from symptoms of a disease or can come from such features as anxiety over death of a friend, a pre-employment physical examination, or the incentives created by public health campaigns urging various types of "checkups." Because of differences in iatrotropic stimuli, a group of patients with the "same" disease may be found at different stages of the disease, and may therefore have different prognostic anticipations.5

In certain types of research projects, the motivation for the subjects' entry is not the desire to visit a doctor for specific medical attention. Instead, the subjects are solicited directly by the investigator via a mailed questionnaire or a public call for volunteers. The populational response elicited by a mailed questionnaire will be greatly affected by the recipient's attitudes toward the topic, the kinds of questions that are asked, and the difficulty or inconvenience encountered in answering the questions.3 The composition of a group of "volunteers" may strongly depend on the monetary or other incentives offered by the investigator or on the psychic state of the potential "volunteers."17

2. Iatric attractions. After a person decides to become a patient, his choice of a doctor or hospital will be affected by his reactions to such features as reputation, geographic location, technologic facilities, cost, and ethnic or religious considerations. A particular doctor or hospital may therefore receive a collection of patients with the same "disease" seen by other doctors and hospitals, but the prognostic severity and other aspects of the population may be strikingly different.

3. Acceptance of proposals. After reaching medical attention, the patient may accept or reject the various referrals, diagnostic "work-ups," or therapeutic procedures that are offered to him. If he rejects one of these alternative offers, he may receive an alternative proposal, and he must then decide about accepting the alternative. None of these personal decisions is made randomly, and all of their effects must be carefully considered when the results of a research project are analyzed. The effects may greatly alter the initial state of the population either prognostically, confounding the assessment of the maneuvers, or diagnostically, impeding attempts to extrapolate the results or to employ statistical concepts based on "random sampling."

B. Referral decisions. The investigator is seldom the first physician to see the patients whom he studies, and the population available for research at a medical setting will have been altered by many antecedent decisions made by the investigator's medical colleagues.

1. Interiatric referrals. Just as a patient can have diverse reasons for his original choice of a medical setting, doctors or hospitals can have varied patterns of practice that lead to transfer of the patient from one doctor to another, from one hospital to another, or from one service to another within the same hospital. In addition to these interiatric referrals, another aspect of iatric decision depends on the willingness of an attending physician to "release" his patient for entry into a therapeutic trial in which treatment for a particular disease is to be assigned by randomization. If the attending physicians at an individual medical setting are unenthusiastic about one or more of the therapeutic agents in the trial, the physicians may not allow certain types of patients to be subjected to randomization. The collection of patients "referred" for the therapeutic trial may thus become a distorted version of the general spectrum of the disease seen in regular clinical practice.
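The consequence of such selective "release" can be sketched numerically. The example below is only an illustration under stated assumptions: the three prognostic strata, their counts, the share of each stratum released for randomization, and the survival proportions are all hypothetical values invented for the sketch.

```python
from fractions import Fraction

# Hypothetical spectrum of one disease in regular clinical practice.
practice = {"good risk": 50, "intermediate risk": 30, "poor risk": 20}

# Hypothetical share of each stratum that attending physicians release
# for randomization.
released = {
    "good risk": Fraction(1),
    "intermediate risk": Fraction(1, 2),
    "poor risk": Fraction(0),
}

# Hypothetical survival proportion within each stratum (same in both groups).
survival = {
    "good risk": Fraction(4, 5),
    "intermediate risk": Fraction(1, 2),
    "poor risk": Fraction(1, 5),
}


def overall_survival(counts):
    """Weighted average of stratum survival for a group of this composition."""
    total = sum(counts.values())
    survivors = sum(counts[k] * survival[k] for k in counts)
    return Fraction(survivors) / Fraction(total)


referred = {k: practice[k] * released[k] for k in practice}

practice_rate = overall_survival(practice)  # 59/100
referred_rate = overall_survival(referred)  # 19/26

print(practice_rate, referred_rate)
```

With these assumed figures, the referred group shows about 73 per cent survival against 59 per cent in general practice, although survival within every stratum is identical. The difference arises entirely from the composition of the referred group, which is why results from such a group cannot be extrapolated directly to the general spectrum of the disease.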

2. Retrieval of medical records. When a survey is performed by reviewing the records of patients at a medical center, the investigator must beware of bias that can enter during the retrieval of the records from their storage locations.10 For example, an investigator may not receive all the records he requests from the medical record librarian because some other investigator may have sequestered some of the records for a research project of his own; and the absence of those records may create a significant bias in the population available for the first investigator's scrutiny.

Another kind of bias may arise if the investigator uses the medical record library as his only source of diagnostic retrieval. The diagnoses that are made in specialized out-patient settings (such as the emergency room, the radiotherapy and chemotherapy clinics, or even the general medical and surgical clinics) may not be recorded in the diagnostic files maintained in the library; and, in many hospitals, information about out-patients is not included among the in-patient records.10 By confining his population to the in-patient group available from the files of a medical record library, the investigator may thus limit his spectrum of a disease to the more severe instances in which the patients required hospitalization, ignoring the many other patients who were treated in ambulatory settings by doctors in other parts of the hospital or in the community. In both of the instances just cited (sequestration of records and limitations of source) the investigator may receive a distorted picture of the patients associated with the therapeutic maneuvers in a survey of treatment for a particular disease.

C. Eligibility decisions. After the basic populational intake has been brought to the site of an investigation, a series of eligibility decisions will reduce the population to the group of people who will constitute the initial state. These decisions (which may be made by the investigator, by his colleagues, or by both) involve the choice of diagnostic criteria, co-morbidity criteria, and pretherapeutic criteria8,14 that determine a patient's eligibility for a particular activity of ordinary clinical management or of research.

1. Diagnostic criteria. A population may be greatly altered by the criteria used for diagnosis of "health" or "disease." For example, if protracted chest pain is demanded as a criterion for "acute myocardial infarction," the population will exclude patients who had only milder forms of thoracic pain or more severely ill patients whose infarction was manifested only by sudden onset of arrhythmias or congestive heart failure.

Outside of a medical setting, the population may have reached the investigator directly by "volunteering" for a project or by answering a questionnaire. Nevertheless, the investigator must still decide which questionnaires or volunteers will be used for the research. He may exclude questionnaires that have been smudged or filled out improperly, or he may reject volunteers who seem unattractive in appearance or in personality. He must then be concerned about the way that these selections have altered the population that becomes the group of research subjects.

2. Co-morbidity criteria. The presence of major associated diseases may create problems in establishing diagnosis for a particular disease14 and in choosing or excluding patients for therapy. If too many co-morbid patients are excluded from an investigation, the resultant "pure" population with the main disease may not represent the true state of that disease in clinical reality. In a study of therapy, co-morbid diseases may have major prognostic effects that create unrecognized sources of bias in the post-therapeutic results.

3. Pretherapeutic criteria. In addition to co-morbidity, other kinds of pretherapeutic criteria may be used to include or exclude patients from an investigation of therapy. These criteria may depend on such features as age, economic and geographic status, and severity or extent of the main disease under treatment.
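The successive reduction of an intake by diagnostic, co-morbidity, and pretherapeutic criteria can be sketched as a chain of explicit rules. The sketch below is illustrative only: the `Candidate` fields, the half-hour threshold for "protracted" chest pain, the age limit of 70, and the five candidates are all hypothetical and are not criteria proposed in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    age: int
    chest_pain_hours: float      # duration of chest pain before admission
    has_major_comorbidity: bool


# Each rule is a hypothetical stand-in for one kind of eligibility criterion.
CRITERIA = [
    ("diagnostic", lambda c: c.chest_pain_hours >= 0.5),
    ("co-morbidity", lambda c: not c.has_major_comorbidity),
    ("pretherapeutic", lambda c: c.age < 70),
]


def apply_eligibility(intake):
    """Return the initial-state group and the number excluded by each criterion."""
    remaining = list(intake)
    excluded = {}
    for name, rule in CRITERIA:
        kept = [c for c in remaining if rule(c)]
        excluded[name] = len(remaining) - len(kept)
        remaining = kept
    return remaining, excluded


intake = [
    Candidate(55, 2.0, False),
    Candidate(62, 0.0, False),  # e.g., infarction shown only by arrhythmia
    Candidate(48, 1.5, True),
    Candidate(74, 3.0, False),
    Candidate(66, 0.2, True),
]

initial_state, excluded = apply_eligibility(intake)
print(len(initial_state), excluded)
```

Recording the number excluded at each step, rather than only the final group, lets a reader judge how far the "initial state" has departed from the original intake under whatever criteria were actually chosen.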

For example, in order to be regarded as having "operable cancer," a patient must have suitable diagnostic evidence of the disease; he must be free of associated co-morbid ailments that might impair his tolerance of the operative procedure or his subsequent survival; and the cancer must be appropriately localized in an anatomic position that permits resection. Thus, the collection of two groups of patients with "operable cancer" can vary greatly according to the criteria chosen to define the concepts of "suitable" diagnostic evidence, "impairing" co-morbid ailments, and "appropriate" anatomic localization. Moreover, a patient with "inoperable" cancer may be "inoperable" because he refused a proposed operative resection, or because the physician decided that the tumor was unresectable, or because death occurred before any of these decisions could be made. Since these three kinds of "inoperable" patients have major differences in prognostic anticipation,4 they cannot be regarded as comparable in their initial state.

An additional problem in pretherapeutic criteria is the investigator's frequent decision (particularly in long-term trials) to restrict admission to "compliant" people who appear particularly likely to maintain the assigned maneuver and to continue under investigation. The absence of the "noncompliant" candidates, however, may create a serious bias in the spectrum of the observed disease.

As a result of these diverse decisions by patients and doctors about iatrotropy, acceptance of proposals, iatric referrals, eligibility, and other transfer mechanisms, two groups of people who apparently have the same "initial state" may differ in many characteristics that can greatly affect the outcome in the subsequent state. Major differences in people with an apparently similar initial state may thus arise if the medical setting is urban or rural, large or small, American or British. Within the same academic hospital center, great diversities may exist in populations chosen from private or ward services, surgical or medical services, and "university" or "VA" services. A population that is not at a medical setting may be significantly altered by its selection through replies to a mailed questionnaire rather than through personal interviews.

The neglect of populational bias caused by these transfer decisions is one of the main sources of nonreproducible data and defective scientific logic in contemporary biostatistics.

3. Maintenance

[Figure: diagram of INITIAL STATE, MANEUVER, and SUBSEQUENT STATE, labeled "Maintenance."]

The next step in planning a research project is to decide about the methods that will be used to maintain the population long enough for the subsequent state to occur and to be observed. When the objective of a research project is specified, the investigator creates the concepts that determine whether the project is worth doing; when the maintenance of the population is contemplated, the investigator determines whether the project is feasible and can be carried out. The ability to maintain the population is particularly important in research dealing with the cause, course, or treatment of chronic disease, since a long time may elapse between the initial and subsequent states. No matter how well the rest of the research project is planned, the plans may be wasted if the population cannot be adequately preserved and examined.

Many different issues in maintenance* must be considered: duration and frequency of examinations; opportunity for detection of target; adherence to maneuvers; and preservation of protocol, data, and investigators.

*The complex issues of ethics in clinical research are beyond the scope of this outline.

A. Duration and frequency of examinations. Members of an investigated population must be encouraged to remain in the project for a duration suitable for the conditions and maneuvers under investigation. For example, a much longer duration of observation will usually be required when the maneuver provides treatment for cancer than for the common cold. The encouragement to a patient's continued participation, particularly in a long-term study, will depend on many aspects of the interchange among patient, doctor, and investigative setting. For example, to achieve persistent attendance of patients at a clinic, the investigator may need to make many special arrangements for such features as efficiency of staff, location and timing of clinics, and transportation of patients.

The frequency of examination will also depend on the circumstances of the investigation. Routine periodic examinations may be needed to obtain evidence about phenomena that would be missed if not sought often enough. For example, if one of the main targets in the subsequent state is the occurrence of a streptococcal infection, the patient may need to be examined at monthly or bimonthly intervals for the performance of suitable tests. On the other hand, monthly observations would not be needed to determine the occurrence of death.

Other important reasons for periodic examinations are the regulation of medications (when appropriate), the preservation of the patient's interest and morale, and the maintenance of communication to learn about incidental events and changes in the patient's clinical or geographic status.

In experiments conducted by double-blind techniques (in which neither the patient nor the examining physician knows the identity of the maneuver), a "fail-safe" procedure must be established for the potential occurrence of disastrous clinical events. The procedure should include suitable arrangements for learning the identity of the maneuver in emergency situations, for discontinuing or modifying maneuvers that have led to major adversities, and for instituting appropriate types of ancillary or replacement therapy.

B. Opportunity for detection of target. If the "target" in the subsequent state is the development of an event that was not present in the initial state, the compared populations should have equal opportunities for that target to be detected when it occurs. The opportunities will be equal if all the people exposed to the different maneuvers are also exposed to the same thoroughness and frequency of subsequent examinations, but inequalities will commonly be produced by various characteristics of the initial state, the maneuver, or the target. The subsequent comparison of data may then be fallacious because different target conditions may be more likely to be detected in one group of people than in another.

One obvious source of such disparities is the population's access to medical examination. People who can see doctors easily and conveniently may appear to have higher rates of minor or even major diseases than people for whom medical supervision is less prevalent. Once a treatment has been suspected of causing adverse side effects, the occurrence rate of those effects may increase spuriously because they are sought with increased attention in the people receiving the treatment. The people themselves may become fearful, so that they solicit medical attention more often and for lesser reasons than usual, or the doctors may become particularly diligent about detecting the side effects. For example, the high rate of phlebitis found in nurses (with or without "the pill") may merely reflect their propensity both to note the condition and to seek medical attention for it.

A second cause of difficulty arises when a maneuver itself requires techniques of administration or observation that differ from the procedures used in maintaining another maneuver. For example, patients receiving long-term anticoagulants must receive periodic examinations for the dosage of anticoagulants to be regulated,

and at these examinations, the physician also has the opportunity to detect minor intercurrent illnesses. The subsequent rate of such illnesses may then be falsely higher in the anticoagulant group than in a "placebo" or untreated group who did not receive equally frequent examinations.

Another source of disparity is the population's access to medical technology. If certain technologic procedures (such as electrocardiography, roentgenography, and laboratory tests) are more consistently available for one population than for another, the two populations will have different rates of identification for diseases whose diagnoses require these technologic procedures for confirmation or exclusion. For example, facilities for prompt, inexpensive identification of streptococcal infections have recently been made available to practitioners by the health departments of several states. The rates of streptococcal infection in those states have subsequently soared. Similarly, a major "epidemic" of lupus erythematosus has occurred in the past two decades after the introduction of effective laboratory tests for identifying the disease.

The last source of disparity to be noted here is the performance of routine "screening" examinations in the two populations. The rates of disease in a certain region of the body will inevitably differ for two populations if one population has that region examined only when appropriate symptoms appear, whereas the other population receives routine periodic examinations regardless of symptoms.

The neglect of these problems has been a major source of intellectual blunders in contemporary biostatistics, and has greatly impaired the validity of many epidemiologic surveys of the occurrence rates and therapy of disease. Berkson2 has demonstrated the statistical fallacies of studying the rates of concomitant diseases in hospital populations, but egregious statistical fallacies are just as likely to occur for single diseases in non-hospital populations with unequal opportunities for detection of the target. Some of the fallacies can be illustrated as follows:

1. If women receiving contraceptive pills are examined more closely or more often than women who use mechanical types of contraception, the "pill" group may have a rate of subsequent thrombophlebitis or cervical cancer that is spuriously higher than the rate found in the "controls," whose medical state was not checked with equally careful attention.

2. The executive personnel in an industrial population may be found to have a higher rate of coronary artery disease than the laborers, but the difference may not be due to any characteristics inherent in the executives. They may have been subjected to routine cardiac "screening" examinations that enabled all of their symptomatic and asymptomatic coronary disease to be diagnosed before death, whereas much of the asymptomatic coronary disease in the "unscreened" laborers may not have been detected during life.

3. In comparison to people who have not had chronic constipation, people with this symptom are more likely both to have used suppositories and to have received barium enema examinations. If an epidemiologic statistician discovers that suppository users have a higher rate of colonic polyps than non-suppository users, he may then regard the suppositories as a probable cause for the polyps, with no attention given to the unequal exposure of the two populations to the barium enemas needed for detecting the polyps.

4. A different variant of this problem occurs in epidemiologic research in which the subsequent state is described by entries on a death certificate. If the development of disease X is under survey, the investigators will often check the patient's premortem data to eliminate false positive diagnoses of disease X, but the false negative side of the diagnostic issue is almost never explored to see whether disease X might have occurred without being listed on the death certificate.7 In this way, many epidemiologic studies based on death certificate diagnoses of arteriosclerosis or cancer provide inaccurate results that differ significantly from the data found by necropsy or by thorough methods of premortem examination.

5. To illustrate misleading therapeutic results produced by a maneuver's distortion of equal opportunities for target detection, consider the following example. Suppose we want to test the hypothesis that large daily doses of aspirin will be useful prophylaxis for preventing streptococcal infections, and suppose our method of detecting the infections is to perform a throat culture whenever a febrile illness occurs. In this experimental situation, with these techniques of target detection, the "treated" group, which has its fever suppressed by the aspirin, will have a much lower

rate of streptococcal detection than the untreated "controls," and we might then falsely conclude that aspirin is an effective antistreptococcal agent.

6. A final example of "detectional bias" occurs in a situation, quite different from those just cited, in which detection of the target creates an additional clinical hazard for one populational group but not for the other. Suppose we are testing medical therapy versus vascular-implant surgery for coronary artery disease. Both groups of patients receive coronary arteriography to delineate their initial state, but the surgical group later receives repeat arteriograms to test the patency of the coronary vasculature after the operation. The morbidity and mortality rates associated with the second arteriographic procedure for the surgical group create an additional investigative hazard that is not present in the medically treated patients. Either the experimental plan or the subsequent analysis would require suitable modifications to avoid the bias thus imposed, after the main experimental maneuver, on the patients treated surgically.

C. Adherence to maneuvers. When an oral drug, diet, or other activity (such as exercise) is to be maintained by the patient at home, away from the site of the investigation, the patient must be encouraged to adhere to the assigned maneuver. Moreover, regardless of whether the maneuver was assigned by the investigator or chosen by the patient, the assessment of the patient's adherence to the maneuver is a critical feature of the subsequent evaluation, and depends on suitable information about adherence, obtained by means of direct questions to the patient or with more objective procedures. The evaluation of such data will be discussed at a later stage of the architecture.

D. Preservation of protocol, data, and investigators. When a long period elapses between the maneuver and the subsequent state, or when many examinations, patients, and investigators are involved in a research project, the principal investigator must establish satisfactory methods for ascertaining that the research protocol is properly carried out by the diverse participants, for coordinating the collection and analysis of data, and for suitable replacement of collaborating investigators who have discontinued participation in the project or who participate ineffectually. These issues have been thoroughly discussed by Mainland18 and will not be further considered here.

4. Initial identification

[Figure: diagram of INITIAL STATE, MANEUVER, and SUBSEQUENT STATE, labeled "Initial Identification."]

Identification is a core problem in scientific research. Unless the entities under investigation have been suitably identified, the work cannot be reproduced; the data cannot be assessed; the results cannot be validated. In clinical research, the problems of identification are much greater than in any other form of investigation, because the clinician must cope with many human attributes that are not encountered in animals, animate fragments, or inanimate substances.5 These human attributes are not only different from the isolated variables studied in nonclinical research; they are also much more abundant and they must be assessed before, during, and after the experimental maneuver. Despite these distinctions in identification, most traditional discussions of "experimental design" contain little or no attention to the problems of choosing, observing, and classifying the basic evidence.

The identification stage of research architecture is intended to provide implementation for the ideas expressed in the objective of the project. All of the general concepts that described the objective of the research, and the intake and maintenance of the population, must now be expressed operationally in terms of the observed evidence. For example, the diagnostic and prognostic requirements of the population may have been stated in such categorical phrases as "healthy," "anemic," "angina pectoris," or "good risk," but none of these phrases indicates the actual items

of evidence that will be examined and interpreted.

A. Types of evidence. Each property (or "variable") to be used in an investigation depends on the observation of certain basic evidence. Thus, the presence of anemia may be assessed from measurements of blood hemoglobin or hematocrit; the presence of angina pectoris may be noted from history taking; and anatomic metastases of a cancer may be sought with clinical examination, roentgenography, endoscopy, cytology, biopsy, surgical exploration, or necropsy. The characteristics of the basic evidence in an investigation must be thoughtfully considered, because the quality of the raw data can not only affect the elemental facts but also distort the variables chosen to represent the elemental facts.

The raw data obtained in a clinical investigation can be verbal or numerical. Thus, a particular person can be described verbally as a red-haired American laborer with chest pain, and numerically as 68 inches tall, and the father of 4 children, with a serum cholesterol of 260 mg. per cent. Regardless of whether the basic evidence is verbal or numerical, the term "hard" is often applied to data (for such variables as age, sex, weight, serum cholesterol, and death) whose observation and interpretation require few or no subjective judgments. The term "soft" is often applied to data (such as statements about the existence of angina pectoris, severity of dyspnea, or ability to work) that require many subjective activities in observation and interpretation. Since hard data are obviously more reliable than soft data, the clinico-statistical collaborators who design a research project will usually prefer, whenever possible, to work with hard data as the main source of evidence.

B. Types of variables. The raw data of an investigation can be preserved intact or converted into the values for four different types of variables. A metric variable is expressed in dimensionally ranked values that have measurably equal intervals between adjacent ranks. Examples of such values are 0, 1, 2, 3 . . . for number of children, and . . . 67, 68, 69, 70 . . . for inches of height.

The other three types of variables are expressed with categorical rather than dimensional values. An ordinal variable has semiquantitative values that can be ranked in a graded order, but the intervals between any two adjacent ranks are not measurably equal. Examples of such values are 0, 1+, 2+, 3+, and 4+ for briskness of reflexes, and none, mild, moderate, and severe for severity of dyspnea. A nominal variable has values that cannot be ranked in a graded order. Examples of such values are red for color of hair, American for nationality, and laborer for occupation. For an existential variable, the scale of values consists of present and absent, or yes and no. Examples of existential variables are presence of chest pain and survival for at least 6 months. The values for existential variables can sometimes be semiordered in a scale such as definitely absent, probably absent, uncertain, probably present, and definitely present.

The procedures used for converting raw data into values for these different types of variables have been described elsewhere.12

C. Selection of indexes. From the diverse data collected in an investigation, certain variables or combinations of variables may be used as indexes to delineate particularly important properties in the research. These are the properties used for any of the graphs, tabulations, or other decisions in the "admission" of people to the project, or in the analysis of results. Although the investigators may have assembled information about a great many variables, the index variables are the ones that actually become used in including or excluding people from the project and in the appraisal of the data. Thus, data for such topics as name, address, occupation, height, and serum calcium value might have been obtained during the research, but might never be used

thereafter in the subsequent analyses, whereas information about such topics as existence of disease X, severity of cardiac decompensation, prognostic risk, and serum cholesterol value might become index variables. As the "key" data from which critical investigative conclusions will be drawn, the index variables require careful selection.

1. Form of expression: Dimensional versus categorical. Despite the mathematical appeal of dimensional data, many variables may become scientifically or clinically more meaningful when their numerical values are converted into categorical expressions. For example, although a person's hematocrit can be measured quantitatively, the result is often more meaningful when stated in one of the ordinal categories: anemic, normal, or polycythemic. Similarly, such measurements as temperature, serum cholesterol, and streptococcal antibody titer are often best classified not in their original dimensions but in the converted ordinal categories of high, normal, and low.

The conversion of dimensions to categories may be particularly valuable for assessing the importance or meaning of a change in a variable. For example, suppose "statistical significance" has been noted for the comparison of an average decrement of 5 in one group of patients, with an average of 2 in the other group. Regardless of the "statistical significance," this difference would not be meaningful clinically if the changes were "falls" of 5 mm. per hour and 2 mm. per hour from an initial Westergren sedimentation rate that averaged 180 mm. per hour in both groups. In this situation, the common sense judgment would depend on the categorical decision that both 5 mm. per hour and 2 mm. per hour were too small a change for the transition to be regarded as a significant fall. In many

tions, categorical

ral versus multitemporal. The value used for an index variable at a particular single state in time may be chosen from one or more temporal values. A unitemporal index is based on the patient's state at a single point in time, whereas a multitemporal index involves consideration of more than one temporal state.

For example, before administering an agent intended to lower blood pressure, we might obtain a series of "base-line" or "control" values by measuring the patient's blood pressure daily for two weeks. When we later analyzed the patient's response to

we would use

before the therapeutic agent was given; and the index would be multitemporal if the single value for initial state depended on a mean, median, or mode of the two-week

more, if

were initially

analogous situa-

distinctions

are

invalu-

able for making decisions about whether

a

"statistically

clim 2.

significant"

difference

Uy meaningful. l ironologic components:

is

Unitempo-

the antihypertensive agent,

unitemporal index for the initial state of blood pressure if the value depended only on a single reading just a

series of "base-line" readings. Further-

the values in the base-line series

high

and then

fell

to

a

"plateau" just before the onset of therapy,

we would have

whether to choose from all of the base-line readings, or only from those conthe

initial

tained

in

choice

is

to decide

state

index

the "plateau." The way this made might have major effects

on the magnitude of the blood pressure by the therapeu-

available for "lowering" tic

agent.
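The choice among a single pre-treatment reading, a summary of the whole base-line series, and a summary of the "plateau" alone can be made concrete with a small sketch. The readings below are invented for illustration, and the rule used to find the plateau (a trailing run of readings within 3 mm Hg of the last one) is an arbitrary assumption, not a rule stated in the text.

```python
from statistics import mean, median

# Hypothetical daily systolic readings (mm Hg) over a two-week base-line period.
baseline = [182, 178, 175, 171, 166, 160, 157, 155, 154, 156, 155, 153, 154, 155]


def unitemporal_index(readings):
    """Initial state taken from the single reading just before therapy."""
    return readings[-1]


def multitemporal_index(readings, summary=mean):
    """Initial state summarized from more than one reading (mean by default)."""
    return summary(readings)


def plateau(readings, tolerance=3):
    """Trailing run of readings within `tolerance` of the last reading.

    The tolerance is an arbitrary, illustrative choice.
    """
    last = readings[-1]
    run = []
    for value in reversed(readings):
        if abs(value - last) > tolerance:
            break
        run.append(value)
    return run[::-1]


print("single reading:      ", unitemporal_index(baseline))
print("mean of all readings:", round(multitemporal_index(baseline), 1))
print("median of all:       ", multitemporal_index(baseline, median))
print("mean of plateau:     ", multitemporal_index(plateau(baseline)))
```

With data of this shape, the index taken from all fourteen readings is noticeably higher than the one taken from the plateau, so the same post-treatment pressure would appear as a larger or smaller "fall" depending on which initial-state index was chosen.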

Another type of multitemporal index consists of a sum of units rather than an average of measurements. Thus, as an index of response to pharmaceutical treatment of angina pectoris, we might count the total number of episodes of angina or the number of nitroglycerine tablets consumed during the period of treatment.

If the index consists of a discrete event, rather than a measurement, the investigator may need to distinguish repetitive or persistent events from those that are single or sporadic. For example, in their initial state before treatment of primary lung cancer, two patients may both have had hemoptysis that first occurred three months previously, but one patient may have had a single episode of hemoptysis, with no repetition, whereas the other may have had recurrent daily episodes. Similarly, a patient who had a severe bout of chest pain that lasted for one day, two weeks previously, is different from a patient whose chest pain has persisted unchanged for two weeks, but the difference would not be stipulated if the index variable were expressed merely as existence of chest pain during present illness.

3. Number of constituents: Elemental versus composite. An elemental index is based on a single variable in the investigation, whereas a composite index contains data from two or more variables. A particular index may be elemental or composite according to the types of variables used for its creation. For example, severity of dyspnea would be an elemental index if based on a single "global" assessment of dyspnea, and composite if it includes specific contributions from such variables as the metric respiratory rate, the existential presence of orthopnea, and the ordinal amount of exertion needed to produce respiratory distress.

For reproducibility of results, an index that contains many constituents should be prepared in a specifically composite manner, so that each constituent can be identified. If an index with many complex ingredients is derived in an elemental manner, as a global act of "judgment," the procedure cannot be reproducible, because the ingredients will not be stipulated. Thus, as a single variable, existence of thromboembolic phenomenon can be rated as present or absent, but the ingredients of the rating will depend on the presence or absence of such constituent variables as pain in leg, circumference of calf, hemoptysis, and roentgenographic abnormalities. If thromboembolic phenomenon were to be used as an index variable, its existence would require specific definition with diagnostic criteria established for the appraisal of the constituent evidence.

4. Form of aggregation: Boolean clusters versus additive scores. The constituents of a composite index can be aggregated in at least two different ways. A "Boolean cluster" consists of a group of individual categories that are present or absent together in various combinations. For example, the condition of a patient with acute myocardial infarction could be called good, if he has neither shock nor pulmonary edema; fair, if he has one of these complications but not both; and poor, if he has both. An "additive score" is prepared by assigning an arbitrary score (or "weight") to each constituent variable; and the value of the composite index is the sum of these scores. For example, dyspnea might be given 20 points, and tachycardia, peripheral edema, or a large liver, might each be given 10 points, so that a patient who has all four of these manifestations would receive a score of 50 points.

The two methods of aggregation are as different as logic and arithmetic. In the "Boolean cluster" procedure, the categories of each constituent variable are either present or absent, and the values of the index itself are chosen as arbitrary names for simultaneous combinations of categories. In the "additive score" procedure, each category is assigned an arbitrary dimensional value, and the index is the sum of those values. Both the Boolean and the additive techniques may sometimes be combined in a single index. For example, to fulfill the modified Jones criteria 1 for diagnosis of rheumatic fever, a patient is required to have a combination of "major" and "minor" features. The combination is satisfied by any two features of the "major" group together with one of the "minor" features, or vice versa.
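The contrast between the two forms of aggregation can be sketched in a few lines. The category names and point values are the ones given in the examples above; the function names are hypothetical and chosen only for this illustration.

```python
def boolean_cluster(shock: bool, pulmonary_edema: bool) -> str:
    """Name the combination of two present/absent categories.

    Follows the myocardial infarction example: the result is a label for a
    logical combination, not a quantity.
    """
    if shock and pulmonary_edema:
        return "poor"
    if shock or pulmonary_edema:
        return "fair"
    return "good"


# Arbitrary weights taken from the additive-score example in the text.
WEIGHTS = {
    "dyspnea": 20,
    "tachycardia": 10,
    "peripheral edema": 10,
    "large liver": 10,
}


def additive_score(findings) -> int:
    """Sum the arbitrary weights of the manifestations that are present."""
    return sum(WEIGHTS[name] for name in set(findings))


print(boolean_cluster(shock=True, pulmonary_edema=False))
print(additive_score(["dyspnea", "tachycardia"]))
print(additive_score(WEIGHTS))  # all four manifestations present
```

The cluster returns a name for a combination, whereas the score returns a number whose size depends entirely on the weights that were assigned; changing a weight changes the index without any change in the patient.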

5. Degree of artifice: Natural versus contrived. The collection of variables in a composite index is "natural" if the variables have a reasonably homogeneous frame of physiologic or clinical reference, and are ordinarily joined in clinical reasoning. Thus, in an earlier example, the index for thromboembolic phenomenon was based on a natural combination of variables, such as symptoms and physical signs in the legs and chest. On the other hand, if we established a "thromboembolism index" that included such features as age, height, weight, serum sodium, serum potassium, and prothrombin time, as well as the features already cited, the new index would be contrived, because its heterogeneous components do not have a common physiologic frame of reference. They are not combined in ordinary clinical reasoning, and their conjunction in the index was purely arbitrary.

Some of the problems of heterogeneously contrived indexes have been discussed by Mainland. One of the main problems of such contrivances is that diverse entities may be combined in a manner that makes the individual entities unrecognizable, particularly when the indexes are used to assess responses to an experimental maneuver. Thus, in some of the contrived indexes used in rheumatoid arthritis, a patient who remains physically incapacitated while his sedimentation rate declines could not be distinguished from a patient who has a major improvement in joint symptoms while his sedimentation rate remains persistently elevated.

A heterogeneous contrived index may sometimes be highly effective for purposes of diagnostic identification, although it may not be applicable to the evaluation of therapy and may obscure the diagnostic constituents. For example, the Jones criteria have provided a highly successful heterogeneous contrived index for the diagnosis of rheumatic fever. 1 The criteria cannot be used, however, for assessing a patient's response to treatment, and moreover, after learning that a patient fulfills these criteria, a physician would not know whether the patient had congestive heart failure alone, chorea alone, or only arthritis.

Two other types of contrived indexes, created in a somewhat less arbitrary and less heterogeneous manner, are commonly used in clinical research. A homologous index is one in which the main variable that is to be assessed is deliberately replaced by a second variable with which the main one appears to be correlated, for example, the hematocrit measurement might replace the hemoglobin value, or thyroid uptake of radioactive iodine might replace the measurement of serum protein-bound iodine. A substantive index is a score or test created to give substantive identification to an intellectual concept that does not have a tangible identity. The various psychologic tests for intelligence or anxiety are examples of substantive indexes. Another example is the use of central venous pressure as an index of congestive heart failure.

6. Application of indexes. The many complex combinations of variables that are used as indexes can be employed at different stages of investigative architecture. In the stage under general consideration here, the indexes are used for identifying the initial state of the population, and they provide the data for such necessities of research as diagnostic criteria, prognostic strata, and eligibility requirements. At later stages of the research architecture, other indexes will be used to identify targets in the subsequent state, and transitions from the initial to the subsequent state.

D. Displacement of Objective. Perhaps the greatest intellectual hazard in choosing indexes for a research project is that the investigator may displace the objective of the research. He may elect to use indexes for phenomena that are scientifically attractive because they can be assessed easily and reliably, but the selected phenomena may not accurately indicate the original goals of the research.

For example, suppose we wanted to study the effect of Excellitol in improving the respiratory distress of patients with chronic pulmonary disease. One of the variables we would want to identify in the initial state is severity of dyspnea, but its rating depends on a patient's subjective perception of dyspnea and on a doctor's subjective classification of degrees of severity. Because of this double "softness" in the data, we might decide to replace severity of dyspnea with "harder" information, such as the patient's timed vital capacity. We have now achieved a more "reliable" measurement, but we have also displaced the phenomenon we wanted to assess. Although a good general correlation may exist between severity of dyspnea and timed vital capacity, an individual patient's ad hoc performance on a test of vital capacity does not necessarily reflect the respiratory distress he experiences outside the laboratory in the conditions of his daily life.

1. Reliability versus relevance. Analogous displacements of the objective occur when an investigator, while giving treatment to prevent vascular complications of diabetes mellitus, studies change in glucose tolerance test or mortality rate, instead of clinical evidence of vascular complications; or when the relief of angina pectoris is assessed from serum cholesterol, electrocardiographic evidence, or arteriographic evidence, but not from an evaluation of the clinical severity of the angina. The displacements may greatly increase the reliability of the indexes used in the investigation, but they may also displace the objective so that the results do not answer the main question that was the reason for doing the research. These tactics in index displacement are the source of the "substitution game" that Yerushalmy 20 has so forcefully indicted in modern biostatistics.

The problem of reliability versus relevance of data is a constant cause of inadequate design in clinical investigation, and the problem occurs in choosing variables that define not only the initial state of the population but also (as noted later) the targets of the subsequent state. When the quest for science makes the designers displace critically important data with data that are more reliable but less cogent, the research project may emerge with the right answers to the wrong questions. The solution to this problem is not to continue rejecting crucial soft data, but to "harden" the soft data. When the variables that are necessary to a well-designed investigation are based on soft data, an additional aspect of proper design for the research is the development of better methods of observation and classification to improve the reliability of the essential data.

2. Validation of contrived indexes. In the examples just cited, the indexes chosen for assessment of a particular entity were displaced into another, obviously different entity. The solution to this problem is simple, and requires only that appropriate attention be given to the correct entity. When a contrived index contains a motley mixture of heterogeneous elements, it can be "decomposed" into a series of separate indexes that are individually homogeneous and meaningful. In other circumstances, however, a contrived index that seems plausibly or directly related to the desired entity may be acceptable if its effectiveness has been validated by thoughtful judgment or by actual data.

Examples of such contrivance are the substantive indexes often used in psychologic research. Since no contrived test can provide an exact assessment of intelligence, anxiety, or personality, the investigator must always decide whether the test really measures what he wants it to measure. Is a conventional I.Q. test a "true" measurement of the type of intelligence needed for creativity? Are "anxiety" and "personality" really well measured by some of the tests or "inventories" used for these purposes? In nonpsychologic research, does the substantive index of central venous pressure provide an adequate assessment of congestive heart failure? In all of the situations just cited, there is no statistical method for "validating" the index, and its acceptance or rejection depends on the scientific judgment of the investigator.

When the contrived index is created by homologous substitution, the index can sometimes be validated from actual data showing the correlation between the entity under consideration and its homologous index. The investigator must always beware, however, of a correlation that seems satisfactory in an abstract sense, but that may be misleading for the objective of the research. For example, although there is a generally good correlation between height and weight at different ages in a human population, the repeated measurement of height would be a silly index of progress in a study of dieting to reduce obesity. Similarly, the enumerated consumption of nitroglycerine tablets would be an analogous but unsatisfactory index of severity for angina pectoris. The total number of tablets used by a patient may generally correlate with the severity of the angina, but does not indicate other important functional aspects of "severity," such as the patient's ability to walk, work, or engage in other acts of physical exertion.

3. Computer distortions. As computers become used increasingly for storage and analysis of data in modern research, an additional new cause of the "displaced-objective" problem is not the goal of scientific reliability, but the convenience of computer compatibility. For processing by computer, the basic evidence must first be converted into "machine-readable language." In designing the formats needed for these conversions, the computer personnel may omit or displace many kinds of crucial information for which a suitable format cannot be easily prepared at the beginning of the research project, because an appropriate taxonomy does not exist for the general topics and specific categories of the coding. 12, 13 A suitable format could be prepared later, after enough raw data have been collected for the topics and categories to be discerned and coded, but by that time the investigators may have become infatuated with the analysis of what is already in the computer. They may thus ignore crucial raw evidence that remains uncoded and unanalyzed, or they confine their assessments to the constricted collection of data that could readily be expressed in "machine-readable language."

E. Methods of observation. After all these decisions have been made about the indexes, the variables, and the basic evidence, the investigator can contemplate the acquisition of the basic evidence. If he can establish his own methods of observation for each item of evidence, he must ascertain that the observational procedure is standardized and reliable. The ascertainment may require calibration of equipment, appraisal of observer variability, attempts to remove observer bias by "double-blind" procedures, and other techniques that deal with quality control of the primary data. If the investigator cannot establish his own methods, he must contemplate the flaws that might have been present in the procedures by which the available data were collected.

The methodologic issues of quality control are too numerous and complex for the scope of this outline of biostatistical architecture. They have been discussed in greater detail elsewhere and will be considered again in subsequent papers of this series.

F. Methods of classification. After the primary data have been obtained, they must often receive further classification before they can perform their role as variables and indexes. For example, if a variable is to be specified as anemia present or absent, a description of the method for measuring hematocrit does not indicate "anemia" until the hematocrit values receive classifications that assign ranges of values for anemia and nonanemia. The taxonomic methods of classifying data are also beyond the scope of this discussion, but two aspects of the problems can be briefly cited with regard to imperfect data and intermediate criteria.
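The step from a measured hematocrit to the category "anemia" can be sketched as an explicit classification rule. The cut points below are hypothetical values chosen only to make the example run; they are assumptions of this sketch, not clinical reference ranges and not values given in the text.

```python
def classify_hematocrit(value_pct: float, low: float = 37.0, high: float = 52.0) -> str:
    """Convert a dimensional hematocrit (per cent) into an ordinal category.

    `low` and `high` are illustrative cut points, not reference values.
    """
    if value_pct < low:
        return "anemic"
    if value_pct > high:
        return "polycythemic"
    return "normal"


def anemia_present(value_pct: float, low: float = 37.0) -> bool:
    """Existential variable 'anemia present or absent' derived from the same rule."""
    return classify_hematocrit(value_pct, low=low) == "anemic"


for hematocrit in (30.0, 38.0, 45.0, 60.0):
    print(hematocrit, classify_hematocrit(hematocrit), anemia_present(hematocrit))

# The same measurement changes category when the stated range changes.
print(classify_hematocrit(38.0), classify_hematocrit(38.0, low=40.0))
```

Writing the ranges down as parameters is what makes the classification reproducible: two investigators with the same hematocrit values but different unstated cut points would report different rates of "anemia."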

1. Management of imperfect data. The management of imperfect data requires decisions about information that is missing or that creates ambiguity because of disagreements or contradictions. For example, will angina pectoris be diagnosed as present if the patient is said to have a "chest pain" that is not further described, or if the patient tells one examiner that the symptom is present but denies it in discussion with another examiner? Will ane-

and

statistical architecture,

methods

the

be used The methods

targets.

to

our tunc tin will

the decisions require scientific judgment,

in the

and the main issue

subsequent identification

establish repro-

ducible consistency in the methods used for

making the

decisions.

Intermediate

2.

servational evidence in data of the initial

cri-

may state

terms that enable their specific appli-

For example, criteria for a diagnosis of myocardial infarction may require "abnormal Q waves" in the electrocardiogram but may not indicate what is meant by an "abnormal Q wave." The role of intermediate criteria is to provide cation.

of the specifications for the decisions transfer

that

from

the

to

the

evidence

observational

expression in the primary data

its

categories

various

often unspecified in these for

intermediate

the

scientific

They

results.

methodologic important

are

many

necessary

are

reproducibility

a

data

part

crucial

used

process

soft

research projects,

criteria

and

of

the

of the

"harden" make hard

to

to

data meaningful.

may

Many

Subsequent identification

and

require

will

initial state contains no need examination, but symp-

dosage, their

its

symptoms

that

toms may occur as "side effects" in the subsequent state. Additional examination procedures may be needed for laboratory

and other first

new phenomena

of

tests

B. Additional

and other state

criteria.

The

SUBSEQUENT^ STATE

Subsequent

may

not provide for

target that

is

to

many

be prevented by a ma-

stipulation as part of the subsequent identifications.

When

a particular disease

project

was


un-

is

diagnostic-

will often be necessary untoward "side effects" of maneuver from the expected or desired criteria

to separate the

a

effects.


In the types of criteria that have just

of

the

research

specified, a series of targets in

subsequent state differentiated. At this the

is

der treatment, special criteria of "diagnostic co-morbidity" 14 may be necessary to decide

been described, an index objective

situations

For example, a

neuver will not be present in the initial state, and its diagnostic criteria will need

Identification

the

diagnostic

criteria established for the initial

that occur subsequently.

distinct

When

that

appear after the maneuver.

the maneuver, or to the development of a co-morbid disease. Another group of

MANEUVER

© STATE ®

initial

arrange-

suitable

ally attributable to the original disease, to

© ^..INITIAL

new

of the vari-

not have been present in the

whether a new manifestation 5.

include

ments for observation and interpretation. For example, if asymptomatic healthy people are being given a new drug to assay

designation,

of

and inference that constitute the major criteria 12 for the variables and indexes used in the investigation. Although appraisal,

however, the

identification,

ables encountered in the subsequent state

teria established for diagnostic, prognostic,

all

initial

kinds of data, criteria, and problems.

or eligibility decisions are often not stated in

of

same methodologic challenges present

A. Additional data.

The major

many

In addition to containing

state-.

utc

still 8'

a

in

35 ' 3G

'

39

'

state

of

controversial

dis-

B. Experiments. If an investigator can choose the maneuvers, as in a therapeutic or explanatory experiment, the main challenge of allocation is to develop a suitable method of assigning patients to the selected maneuvers. The principal decisions involve judgments about premaneuver stratification, numerical equality of populations, size of total population, and allocation procedure.

1. Pre-maneuver stratification. If m different maneuvers are to be studied, the people entered into the experiment must obviously be divided into m groups. Before the m maneuvers are assigned, how-

prognostic strata for the main target are seldom clearly known before the experiment begins, and, besides, as discussed, 42, 48, 50 many different stratifications might be needed for each of the diverse ancillary targets of the maneuver. The omission of such a pre-allocation stratification, however, does not absolve an investigator of the need for appropriate prognostic divisions afterward. Unless the patients are suitably divided into groups with different "risks" for each target, the investigator may ignore fundamental differences in the natural events upon which his experiment was imposed. He may mix moribund and asymptomatic patients improperly in evaluating the outcome and in performing analyses of variance or other statistical procedures. 17, 19 These analyses may produce misleading results that remain uncorrected because the investigator has failed to discover that certain maneuvers may have had antipodal effects on different prognostic groups 1, 17 or that the allocation procedures, even though "randomized," may have created major prognostic disproportions among the patients assigned to the compared maneuvers.

Nevertheless, appropriate prognostic stratifications are constantly neglected in the analysis of clinical experiments. The consultant statistician and principal investigator may convince each other that the stratification is either unnecessary, because the randomized allocation "took care of everything," or inappropriate, because of the post hoc timing of the analysis. In other instances, investigators attempting a post hoc stratification may discover that some of the necessary data were not obtained as part of the description of the initial state. For example, many computerized collections of information about the treatment of cancer, diabetes, or other major chronic diseases cannot now be thoroughly analyzed because the coded information about the initial state does not include enough details of symptoms, chronometry, co-morbidity, or other important prognostic features. 22, 23

As discussed later, the process of randomization is an excellent way of allocating treatment in an unbiased experimental manner. But if an appropriate prognostic analysis is not performed either beforehand or afterward, the experimenter may find that he has removed statistical bias with the randomization, but has also removed clinical sense. To avoid making this paper unduly long, the strategy and tactics of post hoc prognostic stratification will be reserved for discussion at a later date.

2. Number of patients per maneuver.

In most experimental plans, equal numbers of patients are assigned to each of the maneuvers under comparison, since equal divisions usually provide the best statistical opportunity for demonstrating a significant numerical difference among the maneuvers. 37

In at least two circumstances, however, the numerical assignments to different maneuvers may be deliberately unequal. In one such circumstance, the investigators intend to test two separate maneuvers whose results will later be combined for comparison against a third maneuver. For example, placebo might be compared against two dosage levels of Excellitol, with the Excellitol results later consolidated if desired. In such a situation, each of the two "combinatorial" maneuvers could be assigned half the number of patients assigned to the third maneuver; and when the results of the two maneuvers are later combined, the numbers will equal those of the third.

An unequal number of patients may be assigned in a different type of circumstance when a maneuver suspected of being distinctly inferior must be tested in a therapeutic trial in order for the suspicion to be proved. In this situation, the suspectedly inferior maneuver is sometimes allocated to a smaller number of people than the other maneuvers.

3. Size of the population. At least three different methods can be used for determining the total size of the population entered into an experiment. In the "calendar" method, a fixed period of time is used for populational intake, and the ultimate size of the population depends on the number of people assembled during that calendrical interval. In the "fixed-size" method, a fixed number of people is chosen for admission, and the intake of population continues until that number is reached. The number can be chosen arbitrarily or from statistical formulas based on the magnitude of difference that the investigator hopes to demonstrate. 14, 46 In the "sequential" method, 2 pairs of patients for the two compared maneuvers continue to be successively entered into the experiment until the results reach boundaries of statistical "significance" or "nonsignificance."

Each of these three methods of choosing populational size has its own advantages and disadvantages. In the "calendar" method, the number of patients admitted during the chosen time interval may turn out to be too small for "statistically significant" results. The project might have to be either concluded without attaining statistical proof or extended over a longer time interval to get more patients.
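As a rough illustration of the "statistical formulas" mentioned for the "fixed-size" method, the sketch below uses a common normal-approximation formula for comparing two proportions. The text does not say which formula or table its own figures came from, so the formula, the one-sided α, and the function name are all assumptions of this sketch, and the totals it prints should not be expected to reproduce numbers quoted elsewhere in the essay. The two pairs of rates are those of the essay's angina pectoris illustration: a "soft" endpoint of clinical improvement (70 per cent versus 91 per cent) and a "hard" endpoint of fatality (10 per cent versus 7 per cent).

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(p_control, p_treated, alpha=0.05, beta=0.05):
    """Approximate number of patients needed in each of two groups.

    Assumed formula (unpooled normal approximation, one-sided alpha):
        n = (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2


soft = patients_per_group(0.70, 0.91)  # "clinical improvement" endpoint
hard = patients_per_group(0.10, 0.07)  # "fatality rate" endpoint

print("soft endpoint, total patients:", 2 * ceil(soft))
print("hard endpoint, total patients:", 2 * ceil(hard))
print("hard / soft ratio:", round(hard / soft, 1))
```

In this formula the α and β terms cancel when the two requirements are divided, so the ratio of roughly 26 to 1 between the "hard" and "soft" endpoints depends only on the event rates. That ratio is close to the ratio of the 2,240 and 86 patients cited in the essay's illustration, even though the absolute totals from this sketch differ.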

In contrast to these hazards of the "calendar" method, both the "fixed-size" and "sequential" methods offer the attraction of assuring "statistical significance." Both of the latter two methods, however, also have the logical handicap of being based on a single target of response. Consequently, the calculated numbers that might yield "significant" differences for the single target may not be adequate for analyzing the many other targets that often require attention in a clinical experiment. In addition, the "sequential" procedure is limited to a comparison of two maneuvers, and the target state must be an entity that can be assessed promptly after the maneuvers, to enable decisions about admitting the next pair of patients.

An additional hazard of the "fixed-size" method occurs when the target of the investigation has been displaced, as described previously, 20 into a hard data "endpoint" that has a much lower rate of occurrence than the "endpoint" for the appropriate entity of soft data. In this situation, the sample size estimated to give "statistical significance" may be a number that is massively higher than what would have been required with the "softer" target.

Suppose, for example, that our objective is to test a treatment intended to improve the severity of angina pectoris, and we want the incremental results (or θ value) in the treated group to be at least 30 per cent better than in the "controls." For our "endpoint" we decide to reject the soft data target of "clinical improvement of severity" and instead we choose the hard data of "fatality rate." This decision will greatly alter the number of patients needed for "significant" results in the experiment. If "clinical improvement" can be expected in 70 per cent of the control group, we would have demanded (according to the θ specification of 30 per cent) that 91 per cent of the treated group "improve"; but if the expected fatality rate in the control group is 10 per cent, we shall demand that this rate be reduced to 7 per cent in the treated group. With these specifications, and with α = 0.05 and β = 0.05, we would have needed a total of 86 patients to get "statistical significance" with the "soft data" endpoint, but with the "hard data" endpoint we shall need 2,240 patients in the trial. Even if we became more "liberal" and raised the β level to 0.10 while preserving α at 0.05, the "hard data" endpoint would still require 1,811 patients while the "soft" endpoint would need only 70.* Confronted with the huge amount of extra patients and associated labor required to get numerical "significance" with the "hard" endpoint, the investigator may be so discouraged that he abandons his plans to conduct the experiment.

Despite the problems cited in both the numbers and the logic, calculations of "sample size" according to "fixed" or "sequential" methods have become a prerequisite ritual of "statistical design" for contemporary clinical trials. 10, 43 The maneuvers chosen for comparison may be inadequate or illogical; the investigators may not know how to perform a suitable prognostic stratification beforehand and may neglect it afterward; the eligibility criteria may vitiate any numerical anticipations based on previous experience and may preclude admission of the diverse types of patients needed for the trial to be clinically meaningful; and the initial variables or targets of response may be chosen inappropriately and assessed improperly; but the statistician, undaunted by these imprecisions, may marshal his α, β, and θ values and calculate the exact number of patients needed for "statistical significance."

Although an estimate of the number of patients and length of time is always desirable for a large-scale investigation, particularly for cooperative studies at multiple institutions, the lamentable aspect of the current "numbers game" is that the intensive planning given to these statistical desiderata is often the main focus of the activities in "statistical design." While the numbers are being meticulously calculated, however, many of the fundamental necessities of scientific logic and data may be ignored.

4. Method of assignment. Regardless of how the numbers of groups and patients are determined, the investigator must choose a method of assigning patients to each maneuver. The main scientific ob-

to

"These calculations were based on the formulas cited 221-222 of reference 46. I am indebted to Mrs. Elizabeth C. Wright for checking the calculations. in pp.

Subsequent implementation

method

ject

of this

the

assignment.

which

dure,

A

is

to

avoid bias in

"double-blind"

proce-

helpful in preventing bias

is

when subsequent examinations

are

per-

when when

formed, would also help avoid bias

maneuvers are allocated. In addition, one manuever is initially suspected to be distinctly better than another, a doubleblind allocation has the further advantage of freeing the investigator from "moral" qualms that may arise if he knows which patients have been assigned to the "inferior" maneuver. For example, in the

first large-scale field trials of a poliomyelitis vaccine, 26 a double-blind technique helped avert the ethical quandaries that arose about assignment of patients to the vaccine or to the "control" preparation.

Although a double-blind method of allocation would solve the scientific and ethical problems just cited, it does not solve the practical problem of establishing an order for the assignments. A specific sequence must be chosen for allocating the maneuvers, even though the person who later administers each maneuver may not know its identity. This sequence is best selected by the statistical procedure of randomization. A full description of randomization is beyond the scope of this outline, but its value can be summarized as follows: randomization averts the bias that may arise when double-blind techniques cannot be used; it guards against the bias that may occur when maneuvers are assigned in an alternate manner or in any other "systematic" manner (such as patients' unit numbers); and it provides mathematical legitimacy for subsequent assessment of results with statistical tests based on "random sampling."

The technique of randomization is one of the major contributions made by inductive statistics to modern clinical research, and the frequent scientific abuse of the procedure, as described here and elsewhere, 17 should not detract from its great value when properly employed. Despite the importance of randomization, however, it does not appear as a cogent principle of research design until this sixth step in the architecture, and then only for research projects that are experiments, rather than surveys. Although many clinicians have been inordinately slow to accept the need for randomization in planning clinical trials and experiments, many statisticians have been overly zealous in promulgating randomization as a panacea for flaws in "experimental design." As an adjunct to well-planned experimental architecture, randomization makes a powerful contribution to modern science; as a substitute for suitable scientific logic, randomization serves to perpetuate defective research, performed and often accepted in the complacent delusion that a satisfactory "statistical design" is alone adequate for valid experimental science.

7. Intrusion

[Diagram: INITIAL STATE, MANEUVER, SUBSEQUENT STATE, with this stage labeled "Intrusion"]

After a maneuver has begun, the pathway from initial state to subsequent state may be interrupted in several ways. The maneuver itself may be abandoned by the patient; the maneuver or the observation period may be altered by an untoward event; the necessary subsequent examinations may not be performed; or the patient may stop participating in the project. The various intrusions that can appear between maneuver and subsequent state are considered in this stage of the architecture.

A. Nonadherence to maneuver. A person may not faithfully adhere to a maneuver that must be maintained over a long period of time. Oral or injectable medication, or such maneuvers as cigarette smoking, may be taken erratically or discontinued entirely by patients who nevertheless continue to be observed in the project. In stage three of the research architecture (dealing with maintenance of the population 20 ), arrangements were made to obtain data about adherence to maneuvers. In this seventh stage of the architectural design, the main challenges are to classify the fidelity of the adherence and to decide about the way in which the results obtained in poor adherers will be ascribed (or not ascribed) to the associated maneuver.

Suppose a patient is assigned to take one oral tablet twice daily, on awakening and at bedtime. How many and what kind of deviations from this prescription will constitute different degrees of fidelity to the regimen, and how many classes of fidelity should be established? Should we simply rate the adherence for all patients as either good or not good, or should there be four categories of adherence, such as excellent, good, fair, and poor? What will be the specific criteria for classifying deviations in the intra- and interdiurnal patterns? Will we be satisfied if the patient takes his two tablets each day, but both at one time, or if he ingests one tablet at lunch and the other at supper? Suppose he forgets one tablet on one day and takes three the next? If he omits the tablets for three days in the course of a month, does it make a difference whether the three days occur consecutively or sporadically? No statistical answers are available for any of these questions, which require arbitrary judgments individualized for each research project.

Another tricky problem occurs in ascribing the results obtained in persons who have not maintained the maneuver faithfully. Suppose daily doses of Excellitol were given to lower blood cholesterol, and suppose the adherence to the regimen can be classified as good, fair, and poor. At the end of the project, we would not be surprised if the cholesterol were significantly lower in the good adherers than in the poor, but how would we interpret an even lower result in patients with only fair adherence? Suppose the placebo patients with good and fair adherence had substantially lower values for cholesterol than the poor adherence Excellitol group? Again, these questions cannot be answered with statistical or scientific theories, and each decision must be made with a logic appropriate to the problem.

B. Untoward events. The patient's ability to maintain an on-going maneuver, such as a diet or medication, may be impaired by the development of an adverse clinical condition not present in the initial state. Beyond any effect on maintenance of the maneuver, each subsequent clinical event or other manifestation must be analyzed for its diagnostic attribution. For example, during long-term treatment of a chronic disease, the attending physician (or the investigator) must decide whether each posttherapeutic manifestation is due to the treatment itself, or to features associated with either the evolving main disease or co-morbid ailments. Thus, for appropriate classification of anorexia that develops after chemotherapy of a cancer, the investigator must attribute the anorexia to functional effects of the cancer, to a reaction (i.e., a "side effect") produced by the chemotherapeutic agent, to an associated co-morbid disease that was present before or after treatment, to a psychic depression evoked by the patient's emotional response to his illness, to combinations of these factors, or to other causes.

The way that these "co-morbidity" decisions are made can profoundly affect statistics about the target of a research project. For example, in pathogressive surveys of patients with cancer, fatality rates are usually based on all patients who died, regardless of cause of death. Thus, a patient in whom a cancer had been successfully removed but who later died because of a myocardial infarction or in an automobile accident is often statistically counted in the same way as a patient who died of disseminated cancer. Another example of this problem is the misleading rates of disease created from using mortality data of the Bureau of Vital Statistics, where each patient's death is attributed to a single cause, regardless of how many major diseases were present. 1,;

The results of two recent large-scale investigations also illustrate some of the difficulties that can occur in evaluating deaths and associated treatments. In comparison with the "control" group, patients with prostatic cancer who were treated with estrogens had fewer deaths "due to cancer" but more deaths ascribed to "other causes," so that the total fatality rates in the two groups were essentially the same. 49 Domiciliary patients receiving a low fat diet had fewer cardiovascular deaths than the "control" group but more noncardiovascular deaths, so that the total fatality rates in the two groups were similar. 13

The problems of "attribution" in "diagnostic co-morbidity" have been discussed elsewhere, 21 and are beyond the scope of this outline. The main point to be noted here is that no thoroughly satisfactory statistical procedures have been developed for these problems. Their management currently depends on the sensible use of scientific judgment.
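The effect of attribution on fatality statistics, as in the two investigations just cited, can be illustrated with a small arithmetic sketch. The counts below are invented for illustration and are not data from either study; the names `GROUPS` and `fatality_rates` are likewise hypothetical. The sketch shows only that a cause-specific rate can differ between groups while the total fatality rate does not.

```python
# Hypothetical counts for illustration only; they are not data from the cited studies.
GROUPS = {
    "control": {"patients": 1000, "cancer_deaths": 150, "other_deaths": 100},
    "treated": {"patients": 1000, "cancer_deaths": 100, "other_deaths": 150},
}


def fatality_rates(group):
    """Return (cause-specific rate, total rate) as percentages of the group."""
    n = group["patients"]
    cause_specific = 100 * group["cancer_deaths"] / n
    total = 100 * (group["cancer_deaths"] + group["other_deaths"]) / n
    return cause_specific, total


for name, group in GROUPS.items():
    cancer_rate, total_rate = fatality_rates(group)
    print(f"{name}: deaths attributed to cancer {cancer_rate:.1f}%, all deaths {total_rate:.1f}%")
```

With these invented counts the treated group looks better on deaths attributed to cancer, yet the two groups have the same total fatality rate, so the comparison depends entirely on how each death was attributed.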

C. Displacement of examinations. A third type of intrusive problem occurs in the patient's or the investigator's adherence to the planned examination procedures. A particular examination that was scheduled for repetition at specified intervals may not have been done; it may have been performed on dates other than the ones planned for it, or it may have been "displaced" by some other test.

For example, if serum antibodies are to be tested every two months for detection of intercurrent group A streptococcal infections, should we regard a particular patient as adequately tested if a particular examination was delayed so that a three-month interval elapsed between specimens? If the test after a three-month interval shows that an infection has occurred, in what period does the infection get counted: the previous two-month period, or the next one? What about the situation in which the timing of the test was satisfactory, but a different test was done? Thus, would we regard a patient as adequately examined for group A streptococci if the tests were based on throat cultures rather than serum antibodies? And what about the type of "displacement" in which a particular test produces an "outlyer" result, a value that is extraordinarily higher or lower than the general range of expected values for the test? Should this "outlyer" be rejected from consideration or accepted and included among the other data?

Conventional statistical tactics 37-40, 50 for dealing with missing data and with "outlyers" are seldom suitable for the circumstances just described. The tactics usually apply to situations in which a group of objects have all received the same test at the same time in a single performance. No provision is made for the logical problem of interpreting repetitive tests and for managing the other difficulties described as displacement of data.

If the compared maneuvers have been allocated by randomization, the investigator may take false comfort from the belief that subsequent displacements of data will also occur randomly and that almost any rational method of analyzing them will be satisfactory, since the randomization prevented earlier bias. The flaw of this belief is that the maneuver 20 may in itself create unequal opportunities for subsequent detection of the target state. The displacements of data may thus reflect some of the bias caused, after randomization, by administration of the procedure used in the maneuver.

Suppose daily oral penicillin and monthly injections of long-acting penicillin have been randomly allocated in an experimental trial of their capacity to prevent subsequent streptococcal infections in young patients who have had rheumatic fever, and suppose streptococcal infections will be detected via comparison of antibody titers in bimonthly specimens of sera. If the patients assigned to monthly injections must appear at the research clinic in order to receive the injections, the bimonthly specimens of sera can be obtained as part of the circumstances surrounding the administration of the prophylactic therapy. On the other hand, if the patients taking the oral medication can receive their monthly supply by mail, the routine acquisition of serum specimens will require a special visit to the clinic for an examination that is therapeutically unnecessary. According to the way this inequality of "maintenance" is managed by the investigators, the serologic specimens may be obtained with greater diligence and regularity in the "injection" group than in the "oral" group, or vice versa. A difference in the rate of streptococcal infections in the two groups may thus arise from these problems of intrusion, rather than from the therapeutic maneuvers.

The difficulties just described, which can occur despite randomized allocation of maneuvers in an experimental trial, must be especially guarded against when the research is conducted as a survey. As noted previously, 20 unequal opportunities for detection of the target may cause women taking oral contraceptive pills to have a falsely higher incidence of thrombophlebitis or cervical cancer than women using mechanical devices. Similarly, since people with a chronic cough are more likely to have chest x-rays taken than people without a cough, and since smokers are more likely to have a chronic cough than nonsmokers, some of the higher rate of lung cancer in smokers may be due to their greater opportunity for having lung cancer detected when it occurs.
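The arithmetic behind unequal opportunities for detection can be sketched as follows. All figures are hypothetical and chosen only for illustration: two groups are given the same true number of cases, but one group is examined more often, so a larger share of its cases is found. The function name `detected_cases` and the detection percentages are assumptions, not values from the text.

```python
# Hypothetical figures for illustration; none of them come from the text.
def detected_cases(true_cases, detection_percent):
    """Number of true cases that are actually found at a given detection percentage."""
    return true_cases * detection_percent // 100


TRUE_CASES = 200  # assumed identical true number of cases in each group

closely_watched = detected_cases(TRUE_CASES, 90)  # group examined often
seldom_examined = detected_cases(TRUE_CASES, 50)  # group examined rarely
apparent_ratio = closely_watched / seldom_examined

print(closely_watched, seldom_examined, apparent_ratio)
```

Under these assumptions the closely watched group appears to have nearly twice the incidence of the other group, although the true incidence was set to be identical.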

D. Displacement of patients. The last intrusive problem to be cited here is the loss of patients during an investigation in which the target state occurs long after the initial state. Suppose the main target of investigation is a discrete event, such as death. When we tabulate the frequency of the target event (in such expressions as "5-year survival rates"), what shall be done about patients who were known to be alive at an earlier date, but later "lost to follow-up" at the 5-year interval selected for analysis?

The usual statistical approach to this problem is the use of an "actuarial" or "life-table" analytic procedure. 11, 38

In this procedure, a numerator and denominator population are counted at periodic intervals, such as 1 year, from the onset of the maneuver. The denominator for each interval consists of an appropriate count of the people who began that interval; the numerator consists of all people in whom the target event occurred during that interval. A rate of the target event is calculated from the numerator and denominator for each successive interval, and the "final" rate for the most advanced time period is the product of the rates calculated for each antecedent interval. Thus, the five-year survival rate would be obtained by first finding the one-year survival rate in people who have been followed for one year; this rate is then multiplied by the survival rate during the second year in people who were followed for a second year; the product of the first two yearly rates is then multiplied by the survival rate during the third year; and so on, until the fifth-year rates are multiplied by the product of the four preceding rates to yield the final result.

The life-table approach is particularly valuable for analyzing data in a research project in which populational intake occurs serially over an extensive calendar interval. Since not all the patients will enter the study at the same time, they will have different lengths of follow-up at the calendar dates on which the investigator prepares "progress reports." Thus, about four years after the project has begun, the investigator may want to calculate 3-year survival rates, but many patients who entered the project only 18 months ago have not yet had the opportunity to survive for three years. For this kind of "serial intake" problem, the life-table method offers an excellent adjustment for duration of follow-up.

The life-table method is much less satisfactory, however, if the problem is one of "serial drop-out" rather than "serial intake." When "losses to follow-up" occur during a particular interval in the life-table calculations, the denominator for the interval is created as the sum of all the patients who were followed throughout the interval, together with half the number of those who were lost; and the "drop-outs" do not appear in the numerator. For example, suppose that death is the target event and occurs in 5 of 90 people who were followed throughout an interval during which 20 other people were "lost." The denominator would be 100 [= 90 + 10], and the death rate for the interval would be 5 per cent [= 5/100].

Regardless of the theoretical statistical support for this "half-life" procedure, a scientist will feel uneasy that unproved assumptions have been made about the fate of the "lost" patients. A more effective scientific approach to this problem would be to obviate any guesswork based on statistical theories or scientific logic, and to establish adequate epidemiologic methods for following the population carefully enough to provide information about the actual state of all members, thus eliminating the need for conjectural assumptions. If the investigators make vigorous efforts to "trace lost persons," the logical hazard of a "serial drop-out" life-table can be managed by avoiding it.
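The life-table arithmetic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of the cited references: the function names are hypothetical, the yearly figures in `YEARS` are invented, and only the single interval of 5 deaths, 90 people followed throughout, and 20 people lost is taken from the text.

```python
def interval_event_rate(events, followed_throughout, lost):
    """Event rate for one interval, counting half of the lost people in the denominator."""
    denominator = followed_throughout + lost / 2
    return events / denominator


def cumulative_survival(intervals):
    """Multiply the survival rates (1 - event rate) of successive intervals."""
    survival = 1.0
    for events, followed_throughout, lost in intervals:
        survival *= 1.0 - interval_event_rate(events, followed_throughout, lost)
    return survival


# The interval quoted in the text: 5 deaths, 90 followed throughout, 20 lost.
print(interval_event_rate(5, 90, 20))  # 5 / (90 + 10)

# Hypothetical two-year follow-up, one (deaths, followed throughout, lost) tuple per year.
YEARS = [(19, 90, 10), (11, 61, 10)]
print(round(cumulative_survival(YEARS), 3))
```

The half-count for lost patients is exactly the unproved assumption criticized in the text: changing `lost / 2` to `lost` or to `0` alters every interval rate and therefore the cumulative product.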

8. Transition

[Diagram: INITIAL STATE, MANEUVER, SUBSEQUENT STATE, with this stage labeled "Transition"]

After suitable provision has been made for the types of intrusion just described, standards must be established for assessing the transition that has occurred from the initial to the subsequent state of each person in the population and for the transition in each population group.

A. Types of change. The assessment of change involves subtle issues in chronology. As noted previously, 20 the single value of an index for a particular state in time can depend on unitemporal or multitemporal contributions. (For example, the initial-state or subsequent-state value for blood pressure can each be a single reading or the average of a series of readings.) Regardless of the number of temporal contributions included in the index value for a single state, the assessment of a change may involve values for one, two, or more states. According to the number of values that are compared for the delineation, a change can be monadic, polyadic, or dyadic.

1. Monadic changes. In a monadic change, the person's initial state is not really a constituent of the data evaluated for delineating the "transition." There is no actual comparison of a "before" and "after" condition, because the "after" event may not have been present initially, or the subsequent state may be assessed exclusively on the basis of what happened during or after the maneuver. For example, the development of poliomyelitis is the monadic change used as the target to be prevented in contrapathic vaccination of healthy people; death is the monadic change to be prevented as a target of contratrophic treatment for many people with cancer. Monadic indexes are commonly established as the score of a particular test used to describe a person's condition after the main "maneuver" has already occurred. For example, an I.Q. test may be given to a group of well-nourished and poorly-nourished children, and the investigator, on the basis of the monadic scores, may attempt "retrospectively" to compare the influence of nutrition on intelligence.

Another type of monadic index is used for situations in which the target in the "subsequent state" is an entity that can occur repetitively during the course of an on-going maneuver. In this circumstance, the entire period of time that the maneuver was maintained represents a single monadic "subsequent state." Examples of such indexes are the number of attacks of angina pectoris or the number of nitroglycerin tablets ingested during maintenance of an antianginal drug regimen. In the examples just cited, the indexes were individually multitemporal, because they contained contributions from several points in time; but their sum constituted a monadic description of "change," because the pretreatment state of the patients did not enter into the calculations.

2. Polyadic changes. In contrast to monadic changes, polyadic and dyadic changes represent distinct "transitions" because a value for the initial state is contained in the assessment of the "change." A polyadic change is based on the value of a quantitative variable at three or more different points in time, and the index of change depends on the line or curve that connects those points. For example, if a person initially weighed 130 pounds at age 15, and then weighed 160 pounds at age 18, 190 pounds at age 21, and 220 pounds at age 24, the weight curve has been linear, with an upward slope of 10 pounds per year. A polyadic change is commonly expressed as a "trend," and determined with statistical procedures that find the best-fitting straight line (or non-linear curve) for the collection of different temporal points. When a linear model is used for "fitting" the line, the trend is usually expressed as the "slope" of the line. Quadratic or other models can be used to fit trends that have distinctly curved shapes or that go up and then down, or vice versa.

3. Dyadic changes. Dyadic transitions are constantly used in therapeutic research for evaluating either remedial treatment, intended to make a patient's condition "better," or contratrophic treatment, intended to keep him from becoming "worse." Unlike the monadic and polyadic changes just described, a dyadic change contains a direct comparison of two temporal states: before and after the maneuver. The variable used for determining a dyadic change must be expressed in values that have

graded ranks. The ranks can be measured or counted dimensions (such as 13, 14, 15, . . .) or semiquantitative ordinal ratings (such as high, medium, or low). For dimensional ranks, the transition between the initial and the subsequent state can be expressed as a subtracted increment (or decrement) or as a percentage increase (or reduction). Thus, if the sedimentation rate was 50 mm. per hour before the maneuver and 30 mm. per hour afterward, the change can be expressed as a fall of 20 mm. per hour or as a 40 per cent reduction. For ordinal ranks, the transition is expressed in such comparative categories as higher, same, or lower. An interesting aspect of dyadic transitions is their capacity for converting qualitative or quantitative data into semiquantitative categories based on concepts of clinical or biologic desirability. Thus, the ordinal rating scale of better, same, or worse for dyadic changes can be applied both to quantitative phenomena, such as a reduction in temperature, or to such qualitative alterations as a disappearance of symptoms, or a change in color of urine.

To manage the problems of dyadic transition, the research architect, who has already established sets of criteria for identifying entities in the initial state and subsequent state, must now establish a special new set of criteria for the changes between the two states. These "transition criteria" will require many scientific judgments beyond those that have already been necessary. One set of judgments deals with the conversion of dimensional data into transitional categories. Thus, an initial Westergren sedimentation rate of 180 mm. per hour and a subsequent value of 175 mm. per hour might each be called markedly elevated in their single states, but would the decrement of 5 mm. per hour in the two states warrant being called a fall? A second set of judgments deals with the magnitude of transition in semiquantitative ordinal categories. For example, if cardiac enlargement is graded as none, slight, moderate, and extreme, what pairings of categories will be used to denote such transitions as much smaller, smaller, unchanged, larger, and much larger?

B. Population indexes. Using one of the procedures just cited for describing a change, the investigator can establish an individual index of transition for each important variable in each member of a population. Although a change in the selected variable can thereby be expressed for individual people in the group, additional decisions become necessary in choosing a populational index that can summarize the results of the individual indexes for the entire group.

1. Types of populational indexes. A populational index can be derived from individual indexes in at least two ways. If the individual index is expressed in categories, the proportionate frequency of categories can be enumerated in the population; if expressed dimensionally, its average value can be calculated. For example, the categorical response to treatment of congestive heart failure in a group of 78 patients can be cited proportionately as excellent in 9 per cent, good in 65 per cent, fair in 14 per cent, and poor in 12 per cent; the survival time in 112 patients with cancer can be averaged as a median value of 8.3 months, or as a mean of 9.1 months. When the individual index has been expressed dimensionally, the populational index can still be expressed categorically. Thus, in the 112 patients just cited, the populational survival could have been categorically stated as a rate of 42 per cent at 6 months. (In the latter expression, the "categories" of survival were alive at 6 months and dead at 6 months.)

When individual transitions have been expressed in a dimension that was measured repetitively, rather than only at the two times of initial state and subsequent state, the populational performance can be calculated with a "trend" equation, using some of the mathematical models described previously for determining trend in an individual person.

2. Problems in denominators. A group index expressed in proportionate categories consists of a ratio: the numerator is the frequency of occurrence of the selected category within the population of people or events enumerated in the denominator. (Thus, in the earlier example, the 9 per cent excellent response rate in congestive

heart failure was based on 7 such responses among 78 treated patients.) The denominator of such expressions may sometimes be displaced inappropriately into what Mainland has called "wrong sampling units" or "spurious replication." 37 As an example of the problem, Mainland cites the attempt to measure personal resistance to new caries by counting the number of carious teeth in a group of people, and dividing this number by the total number of teeth that were counted, instead of dividing the number of people with carious teeth by the total number of people.

The choice of an appropriate denominator is particularly subtle when the numerator consists of k events and when the data available for the denominator consist of m episodes in n people. For example, suppose we know that evidence of new carditis occurred 54 times in 105 recurrent attacks of rheumatic fever in 78 patients. Is the rate of new carditis 54/105 or 54/78? The answer to this question depends on whether we want to know the risk of new carditis in a recurrent attack of rheumatic fever, or in a patient who has a recurrent attack. The risk per attack would be 54/105, but the risk per patient cannot be determined from these data because we would need to know, as numerator, the number of patients who had had new carditis rather than the 54 recurrences in which new carditis had been observed.

When the numerator consists of an event that can occur repetitively, such as streptococcal infections or episodes of angina pectoris, the denominator is often converted to the total time period of observation for the persons in the population. Thus, the occurrence of 40 streptococcal infections in 100 patients observed for a total of 200 patient-years is often reported as an attack rate of "20 per cent per patient-year" rather than "40 per cent per patient." This type of person-time denominator can be useful for many situations, but it carries the hazard of possible distortion by the way the unit of time is chosen and by major disparities in length of the observation period for individual members of the population. Thus, the 20 per cent per patient-year just cited would have seemed minuscule if expressed as 0.055 per cent per patient-day and gigantic if expressed as 200 per cent per patient-decade. Furthermore, 200 patient-years could be obtained by observing each of the 100 patients for about 2 years, or by observing 9 people for about 20 years and the remaining 91 people for about 2 months each. In the latter circumstance, a denominator based on "200 patient-years" would be misleading.

3. Strategies in transstratification. There are no statistical criteria for deciding whether a populational index is best expressed as a curvilinear trend, a categorical ratio, or a dimensional average. For averages, there are also no criteria for preferring the mean or the median as the choice of expression. All of these decisions must be made judgmentally according to the logic of each situation. An example of the considerations is presented elsewhere, 22 particularly in the expression of survival in cancer, where the median is usually preferable to the mean and where the categorical "rate" may be preferable to both the "average" measurements.

A critical aspect of these indexes is not just their selection but their correlation with the strata of population to which they refer. Suppose a population contains 50 people whose value for a particular index in the initial state is high and 50 people whose value is low. Suppose further, in the transition results, that the value has become higher for 50 people and lower for 50 people. Did these changes occur throughout both groups or did the high results become higher and the low ones lower, or vice versa? Unless the results are transstratified, we might fail to detect the major differences between one treatment that raises high values and lowers low values and another treatment that has just the reverse effects, making high values low and low values high.

STATE

>-© nduction

®

of cohort research

The architecture

66
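The denominator arithmetic of the preceding section can be made concrete with a minimal sketch in Python. The function name `rate_per_unit` and the two follow-up patterns are assumptions introduced only for illustration; the counts (40 infections, 200 patient-years, 54 episodes of new carditis in 105 attacks) are the ones cited in the text.

```python
from fractions import Fraction
from statistics import median


def rate_per_unit(events, patient_years, years_per_unit=1):
    """Attack rate in per cent per patient-unit of time.

    years_per_unit is the length of the chosen unit in years
    (1 for a patient-year, 10 for a patient-decade, 1/365 for a patient-day).
    """
    return 100 * Fraction(events) * Fraction(years_per_unit) / Fraction(patient_years)


# The same 40 infections in 200 patient-years, expressed in three time units.
per_year = rate_per_unit(40, 200)
per_decade = rate_per_unit(40, 200, 10)
per_day = rate_per_unit(40, 200, Fraction(1, 365))

# Two hypothetical follow-up patterns (in years) with about the same total.
uniform = [2.0] * 100                    # everyone observed for 2 years
skewed = [20.0] * 9 + [2 / 12] * 91      # 9 people for 20 years, 91 for 2 months

# Risk of new carditis per recurrent attack; the risk per patient is not
# computable because the number of affected patients is unknown.
risk_per_attack = Fraction(54, 105)

print(float(per_year), float(per_decade), round(float(per_day), 3))
print(sum(uniform), round(sum(skewed), 1))
print(median(uniform), round(median(skewed), 2))
print(round(float(risk_per_attack), 3))
```

The totals of the two follow-up patterns are nearly identical, but the median duration of observation is 2 years in one and about 2 months in the other, which is the disparity that a single "patient-years" denominator conceals.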

At this next-to-last stage in the architectural operations, we have finally reached a major activity that requires knowledge of what is generally regarded as "statistics." The data having been expressed for the populations exposed to each maneuver, we can now use inductive statistical procedures for analyzing the results and making inferential decisions.

A. Analytic procedures. In analysis, various statistical tests are employed to determine distinctions in trends (or correlations) and to compare the numerical differences among groups or events. The strategy of selecting these tests is beyond the scope of this outline and will depend on such characteristics of the research project as the types of data, the distributions of data, the number of groups, and the number of events. In seeking guides to the choice of analytic statistical procedures, clinical investigators may find certain books or articles6, 28, 32, 33, 37, 45 particularly valuable because the characteristics of the research data, rather than the innate distinctions of the statistical tests, are used as a basis for taxonomic arrangement of the text.

Despite the many excellent procedures available for statistical analyses, certain fundamental problems in statistical logic remain unresolved and are generally ignored or glossed over in most textbooks. One of the problems is the validity of applying, to populations that are not at all randomly chosen, statistical tests based on "random sampling." Since the intake of people in a survey, as discussed previously,20 is seldom if ever "random," investigators could not apply most statistical tests to survey data if we insisted on "random sampling." To allow statistical tests, therefore, we conveniently ignore our constant violations of the basis for the test.

Another problem is created by the modern inversion of the concept of variance so that it can refer to the measured objects as well as to the system of measurement. The idea of "variance" was originally created in reference to repeated measurements, and each deviation from the mean contributed to the "error variance."34 In many modern applications, however, variance refers not to mensurational precision in multiple measurements of a single object, but to populational dispersion in single measurements of multiple objects. "Quality control" and "range of normal" can thus become admixed in the same conceptual procedure, so that either the measuring devices or the people who deviate from a mean become associated with different magnitudes of "error variance."

Two other traditional problems in analytic statistical procedures have acquired new solutions in recent years:

1. Parametric violations. Since most of the dimensional data encountered in clinical research do not have a Gaussian (or "normal") distribution, we may not be justified in using t tests, F ratios, or other "parametric" tests that depend on Gaussian distributions. The effect of violating the assumptions of "normality," although seldom mentioned in most statistical books aimed at biologists, has recently been thoroughly discussed in texts devoted to "distribution-free"6 or "nonparametric"45 statistics.

2. Validity of transformations. Attempts are often made to "normalize" a non-Gaussian distribution by using logarithms, square roots, or other transformations of the original variables. Since investigators generally decide about meaningful importance by comparing the data they have observed, rather than arbitrary transformations of the data, the scientific logic of these changes is uncertain. Are they preferable to using the alternative "nonparametric" or "distribution-free" tests that rank the observed data without altering the basic values? Suppose we have found that the cube root of the arc tangent of serum cholesterol is "significantly" lower in one group of patients than in another? Is a "significance" based on such a peculiar conversion of data really more meaningful than what could be found by ranking the results in a "nonparametric" or "distribution-free" test?

B. Inferential decision. In all of the procedures just cited, a statistical strategy had to be adopted for choosing the test that would be followed by an inferential decision about "significance." The strategy depends on the following act of contrapuntal logic: We assume, according to the "null hypothesis," that there is no difference among the groups being compared. With this assumption, we determine the probability (or P value) that the observed difference arose by chance. If this probability is at or below a selected α level, such as 0.05 or 0.01, we reject the null hypothesis and proclaim the observed difference to be "significant" at the calculated value for P. This statistical strategy of "hypothesis testing," although sanctified by tradition and worshipped by investigators and editors, is not entirely trouble-free, however.

1. Uncertainties of P value strategies. The use of P values has been severely criticized in recent years.3, 44 Some writers have proposed that P values and "significance" concepts be dropped entirely and replaced by estimations based on "confidence intervals"5, 25 or "maximum likelihood ratios."25, 31 In the avant garde today, many statisticians advocate procedures of Bayesian analysis, which replaces conventional quantifications with subjective estimates of numerical probabilities.9 Partisan theorists are still debating the respective merits of all these proposals,7, 31, 34 and a "winner" has not yet emerged.

Clinical investigators, beset with insecurities about their ignorance of statistical procedures, may be relieved to learn that statisticians are beset with doubts about the basic "security" of the procedures themselves. For example, in one recent "biometrics seminar,"12 conducted after almost three decades of dissemination and general acceptance of the doctrine that clinical trials required statistical "hypothesis testing," eight statisticians spent 25 pages discussing the charge, made by another statistician,1 that "the concept of error probabilities . . . has no direct relevance to experimentation."

2. The restrictions of univariate targets. Regardless of the strategy used for a statistical decision about "significance," the decision must be based on a univariate target. Although able to manage multiple variables in the initial state of the patient, statistical techniques of "multivariate analysis" are not suitable for multiple targets in the subsequent state. In some of the "multivariate" techniques, the target must be a single variable; in others, multiple target variables are accepted but are given the same relative "weights" or importance. Thus, a thoughtful clinician, appraising the results of treatment, might want to give separate evaluation to such multiple targets as symptomatic response, functional changes, laboratory data, side effects, and cost and convenience of treatment. The statistical decision procedure, however, is geared either to regarding all targets as having equal importance or to using only one target variable. To get a single answer about "statistical significance," clinical investigators usually abandon a judgmental evaluation of separate decisions for each target, and, instead, compress multiple targets into a single variable. This compression is the source of many of the "contrived indexes" whose scientific defects were described previously.

3. Problems of the null hypothesis. Although the two sides of a coin are often used to illustrate concepts of probability, only one side of a two-sided biologic issue is explored in inference based on the null hypothesis, which is used to "prove" a difference but never a similarity. If we wanted to show that two drugs were essentially the same, rather than different, we could not do so directly with the α levels and P value statistics of the "null hypothesis." We would have to modify the basic concepts by using confidence intervals and by applying the ideas of β error introduced by Neyman and Pearson41 to supplement the unilateral scope of null-hypothesis errors. The modifications in procedure may be serviceable but do not alter the basic intellectual restrictions that arise in the absence of a specifically formulated procedure for testing "non-null" hypotheses.

4. Procrustean categories of decision. The method of making statistical decisions about "significance" creates one of the most devastating ironies in modern biologic science. To avoid using categorical data, a clinical investigator will usually go to enormous efforts in mensuration. He will get special machines and elaborate technologic devices to supplant his old categorical statements with new measurements of "continuous" dimensional data. After all this work in getting "continuous" data, however, and after calculating all the statistical tests of the data, the investigator then makes the final decision about his results on the basis of a completely arbitrary pair of dichotomous categories. These categories, which are called "significant" and "nonsignificant," are usually demarcated by a P value of either 0.05 or 0.01, chosen according to the capricious dictates of the statistician, the editor, the reviewer, or the granting agency. If the level demanded for "significance" is 0.05 or lower and the P value that emerges is 0.06, the investigator may be ready to discard a well-designed, excellently conducted, thoughtfully analyzed, and scientifically important experiment, because it failed to cross the Procrustean boundary demanded for statistical approbation.

The widespread acceptance of these rigid categories of "statistical significance" is a lamentable demonstration of the credulity with which modern scientists will abandon biologic wisdom in favor of any quantitative ideology that offers the specious allure of a mathematical replacement for sensible thought.
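The question raised above about transformations can be checked with a small sketch. Assuming two hypothetical groups of serum cholesterol values (the numbers are invented for illustration), the Python block below computes the Mann-Whitney U count, a rank-based statistic of the kind used in "nonparametric" tests, for the raw values, for their logarithms, and for the cube root of the arc tangent. The helper names `mann_whitney_u` and `peculiar` are assumptions of this sketch, not terms from the text.

```python
import math


def mann_whitney_u(a, b):
    """Count the pairs (x, y), x from a and y from b, with x < y; ties count 1/2."""
    u = 0.0
    for x in a:
        for y in b:
            if x < y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def peculiar(value):
    """Cube root of the arc tangent: a monotonically increasing conversion."""
    return math.atan(value) ** (1.0 / 3.0)


# Hypothetical serum cholesterol values (mg per 100 ml) in two groups.
group_a = [182, 195, 201, 210, 226]
group_b = [205, 231, 240, 258, 263]

u_raw = mann_whitney_u(group_a, group_b)
u_log = mann_whitney_u([math.log(v) for v in group_a],
                       [math.log(v) for v in group_b])
u_odd = mann_whitney_u([peculiar(v) for v in group_a],
                       [peculiar(v) for v in group_b])

print(u_raw, u_log, u_odd)
```

Because each conversion used here is monotonically increasing, the ordering of the observations, and hence the rank-based count, is the same in all three cases; a statistic computed from means and variances would instead shift with every conversion.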

10. Extrapolation

[Figure: the sequence INITIAL STATE, MANEUVER, SUBSEQUENT STATE, with the extrapolation stage marked]

Now we have reached the last stage. We have made decisions about the numerical distinctions of the data, and we would like to draw conclusions that will enable the results to be extrapolated and used elsewhere. This is the time for reviewing all the previous steps: Was the objective of the research clearly stated and specified? Was the objective implemented by procedures that carried out those specifications? Were the comparative maneuvers properly chosen and allocated? Was the population adequately identified and assessed for bias in intake? Were the data obtained by suitable methods, classified with stipulated criteria, and analyzed appropriately? Were the results suitably stratified for prognostic distinctions of the population? Were the various forms of "intrusion" recognized and accounted for in the analyses? What else might have gone wrong?

If all of these questions receive satisfactory answers, we have probably performed a valid, reproducible act of scientific research. If not, we may have performed yet another exercise in statistical numerology, and the next stage in the work should be not an extrapolation, but an architectural return "to the drawing board."

• • •

Although this outline of biostatistical architecture has now been concluded, many loose ends still remain. No attention has been given to research projects that are performed as methodologic explorations, rather than as clinical surveys and experiments. Such projects include studies of observer variability, establishment of criteria, choice and validation of indexes, and implementation of "double-blind" techniques in circumstances where "blindness" seems either difficult or impossible to attain.

Important methodologic issues in logic or in management of data have received only minimal attention. Such issues include: tactics in prognostic stratification, chronologic arrangement of "spection," appraisals of "intrusion," and strategies of transstratification. The many uncertainties about inductive statistical procedures have been but briefly mentioned, and only scant discussion has been given to the many intellectual "pollutants" contained in statistical vocabulary. Although some of the linguistic inadequacies of significance have already been cited, many others have been omitted. Among other statistical terms or phrases that are often designated improperly in name or concept are: the words sample, error variance, normal, control, regression, negative result, standard error, and sampling error.

All of these topics are available for consideration and may ultimately appear in our future sessions here.

References

1. Anscombe, F. J.: Sequential medical trials; book review, J. Amer. Statist. Ass. 58:365-383, 1963.
2. Armitage, P.: Sequential medical trials, Oxford, 1960, Blackwell Scientific Publications.
3. Bakan, D.: The test of significance in psychological research, Psychol. Bull. 66:423-437, 1966.
4. Berkson, J., and Gage, R. P.: Calculation of survival rates for cancer, Mayo Clin. Proc. 25:270-286, 1950.
5. Boen, J. R.: P values versus means and standard errors in reporting data, J. A. M. A. 208:535-536, 1969. (Letter to editor.)
6. Bradley, J. V.: Distribution-free statistical tests, Englewood Cliffs, N. J., 1968, Prentice-Hall, Inc.
7. Bross, I. D. J.: Applications of probability: Science vs. pseudoscience, J. Amer. Statist. Ass. 64:51-57, 1969.
8. Cochran, W. G.: Matching in analytical studies, Amer. J. Public Health 43:684-691, 1953.
9. Cornfield, J.: The Bayesian outlook and its application. (Discussions by Geisser, S., Hartley, H. O., Kempthorne, O., and Rubin, H.), Biometrics 25:617-657, 1969.
10. Clark, C. J., and Downie, C. C.: A method for the rapid determination of the number of patients to include in a controlled clinical trial, Lancet 2:1357-1358, 1966.
11. Cutler, S. J., and Ederer, F.: Maximum utilization of the life table method in analyzing survival, J. Chron. Dis. 8:699-712, 1958.
12. Cutler, S. J., Greenhouse, S. W., Cornfield, J., and Schneiderman, M. A. (with discussion by Ederer, F., Zelen, M., Shaw, L. W., and Beebe, G. W.): Biometrics seminar: The role of hypothesis testing in clinical trials, J. Chron. Dis. 19:857-882, 1966.
13. Dayton, S., Pearce, M. L., Hashimoto, S., Dixon, W. J., and Tomiyasu, U.: A controlled clinical trial of a diet high in unsaturated fat in preventing complications of atherosclerosis, Circulation 40:(Suppl. 2) 1-63, 1969.
14. Dixon, W. J., and Massey, F. J., Jr.: Introduction to statistical analysis, ed. 3, New York, 1969, McGraw-Hill Book Co., Inc.
15. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
16. Feinstein, A. R.: Clinical epidemiology. II. The identification of rates of disease, Ann. Intern. Med. 69:1037-1061, 1968.
17. Feinstein, A. R.: Clinical epidemiology. III. The clinical design of statistics in therapy, Ann. Intern. Med. 69:1287-1312, 1968.
18. Feinstein, A. R.: Clinical biostatistics. II. Statistics versus science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.
19. Feinstein, A. R.: Clinical biostatistics. III. The architecture of clinical research, Clin. Pharmacol. Ther. 11:432-441, 1970.
20. Feinstein, A. R.: Clinical biostatistics. IV. The architecture of clinical research (continued), Clin. Pharmacol. Ther. 11:595-610, 1970.
21. Feinstein, A. R.: The pre-therapeutic classification of co-morbidity in chronic disease, J. Chron. Dis., 1970. (In press.)
22. Feinstein, A. R., Pritchett, J. A., and Schimpff, C. R.: The epidemiology of cancer therapy. II. The clinical course: Data, decisions, and temporal demarcations, Arch. Intern. Med. 123:323-344, 1969.
23. Feinstein, A. R., and Spitz, H.: The epidemiology of cancer therapy. I. Clinical problems of statistical surveys, Arch. Intern. Med. 123:171-186, 1969.
24. Feinstein, A. R., and Stern, E.: Clinical effects of recurrent attacks of acute rheumatic fever: A prospective epidemiologic study of 105 episodes, J. Chron. Dis. 20:13-27, 1967.
25. Fisher, R. A.: Statistical methods and scientific inference, ed. 2, Edinburgh, 1959, Oliver & Boyd, Ltd.
26. Francis, T., Korns, R. F., Voight, R. B., Boisen, M., Hemphill, F. M., Napier, J. A., and Tolchinsky, E.: An evaluation of the 1954 poliomyelitis vaccine trials: summary report, Amer. J. Public Health 45: (Part 2) 1-63 (Suppl.), 1955.
27. Fredrickson, D. S.: The field trial: Some thoughts on the indispensable ordeal, Bull. N. Y. Acad. Med. 44:985-993, 1968.
28. Freeman, L. C.: Elementary applied statistics, New York, 1965, John Wiley & Sons, Inc.
29. Good, I. J.: A subjective evaluation of Bode's law and an 'objective' test for approximate numerical rationality, J. Amer. Statist. Ass. 64:23-49, 1969.
30. Grubbs, F. E.: Sample criteria for testing outlying observations, Ann. Math. Stat. 21:27-58, 1950.
31. Hacking, I.: Logic of statistical inference, Cambridge, England, 1965, Cambridge University Press.
32. Heath, H.: Elementary statistics, Springfield, Ill., 1968, Charles C Thomas, Publisher.
33. Hill, G. B.: The statistical analysis of clinical trials, Brit. J. Anaesth. 39:294-310, 1967.
34. Hogben, L.: Statistical theory: The relationship of probability, credibility and error, London, 1957, George Allen & Unwin.
35. Lilienfeld, A. M., Pedersen, E., and Dowd, J. E.: Cancer epidemiology: Methods of study, Baltimore, 1967, The Johns Hopkins Press.
36. MacMahon, B., Pugh, T. F., and Ipsen, J.: Epidemiologic methods, Boston, 1960, Little, Brown & Company.
37. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
38. Merrell, M., and Shulman, L. E.: Determination of prognosis in chronic disease, illustrated by systemic lupus erythematosus, J. Chron. Dis. 1:12-32, 1955.
39. Miettinen, O. S.: Matching and design efficiency in retrospective studies, Amer. J. Epidem. 91:111-118, 1970.
40. Mosteller, F., and Tukey, J. W.: Data analysis, including statistics, in Lindzey, G., and Aronson, E., editors: Handbook of social psychology, ed. 2, Reading, Mass., 1968, Addison-Wesley, Inc.
41. Neyman, J., and Pearson, E. S.: On the problem of the most efficient tests of statistical hypotheses, Philos. Trans. Roy. Soc. A 231:289-337, 1933.
42. Pike, M. C., and Morrow, R. H.: Statistical analysis of patient-control studies in epidemiology, Brit. J. Prev. Soc. Med. 24:42-44, 1970.
43. Remington, R. D.: How many experimental subjects? Or, one good question deserves another, Circulation 39:431-434, 1969.
44. Rozeboom, W. W.: The fallacy of the null-hypothesis significance test, Psychol. Bull. 57:416-428, 1960.
45. Siegel, S.: Nonparametric statistics for the behavioral sciences, New York, 1956, McGraw-Hill Book Co., Inc.
46. Snedecor, G. W., and Cochran, W. G.: Statistical methods, ed. 6, Ames, Iowa, 1967, Iowa State University Press.
47. Taranta, A., Weinberg, E., Feinstein, A. R., Wood, H. F., Tursky, E., and Simpson, R.: Rheumatic fever in children and adolescents: A long-term epidemiologic study of subsequent prophylaxis, streptococcal infections, and clinical sequelae. V. Relationship of the rheumatic fever recurrence rate per streptococcal infection to pre-existing clinical features of the patients, Ann. Intern. Med. 60:(Suppl. 5) 58-67, 1964.
48. Taube, A.: Matching in retrospective studies; sampling via the dependent variable, Acta Soc. Med. Upsal. 73:187-196, 1968.
49. Veterans Administration Co-Operative Urological Research Group: Treatment and survival of patients with cancer of the prostate, Surg. Gynec. Obstet. 124:1011-1017, 1967.
50. Yerushalmy, J., and Palmer, C. E.: On the methodology of investigations of etiologic factors in chronic diseases, J. Chron. Dis. 10:27-40, 1959.

CHAPTER 6

Sources of 'transition bias' in cohort statistics

This chapter originally appeared as "Clinical biostatistics. X. Sources of 'transition bias' in cohort statistics." In Clin. Pharmacol. Ther. 12:704, 1971.

The idea of a cohort is constantly used as a basic tactic in biostatistics. When we contemplate statistics derived from epidemiologic studies of cause of disease, from clinical studies of the course and treatment of disease, or from ontogenetic studies of normal growth, the fundamental scientific logic depends on the investigation of a cohort. A cohort is presumably the source of statistical data dealing with such phenomena as the appearance of lung cancer in cigarette smokers, the occurrence of thrombophlebitis in recipients of the "pill," the survival rates after different forms of treatment for acute myocardial infarction, the development of vascular complications in diabetes mellitus, or the anticipated weight gain in the first year of life. In all of these circumstances, we follow a group of people to see what happens to them after they have been exposed to something: an alleged cause of disease, the action of an established disease, the intervention of a therapeutic agent, or the course of time itself.

Despite the many concepts that depend on this type of scientific reasoning and biostatistical interpretation, the idea of a cohort is usually undefined or defined inconsistently in most textbooks. In many books1, 6, 10, 23, 27, 28, 32, 36, 37 that are often used for teaching or for reference in medical activities, the word cohort does not appear in the index, and seems absent from the text. In two new biostatistical textbooks, cohorts are described with verbal illustrations but are undefined, and the illustrations are based, in one instance,34 on a life-table analysis of mortality, and in the other instance,35 on a study of weight gain. During a nonexhaustive search of the literature, I was able to find the term cohort defined in only two books that had "statistics" (or some congener) in the title. An additional set of definitions was available (as might be expected) in several textbooks of epidemiology. The diverse concepts expressed in these six sources of definition are noted in Table I.

Table I

Author | Definition of a cohort
Doll | "In the prospective or cohort study, a record is made of the occurrence of an event, suspected of being the cause of the lesion, and the subject is followed forward to see whether the effect is or is not produced." (p. 73)
Fox, Hall, and Elveback | Three definitions are offered or implied: "Groups of persons born in the same year or five-year period" (p. 191); "Related persons (comprising two or three generations) whose past disease experience can be well documented and where future disease experience can be observed" (p. 197); "Population segments born at the same time" (p. 261)
MacMahon, Pugh, and Ipsen29 | "The investigation over time of an identified group of individuals" (p. 16)
MacMahon and Pugh30 | "(The cohorts) are defined in terms of characteristics manifest prior to the appearance of the disease under investigation (and) are observed over a period of time to determine the frequency of the disease among them." (p. 207)
Mainland31 | "Groups of individuals born in the same year or . . . within a few years of each other" (p. 142)
Morris33 | At least two definitions are offered: "The clinical follow-up, over many years if necessary, of whole 'populations' of suitable individuals" (p. 131); "A group of people born in a defined period is called a cohort; and study of the future history of such a group, of what happens to them, is called 'cohort analysis'." (p. 232)

What seems to be consistent in all these definitions is the principle of longitudinal observation of a group of people, followed forward (or "prospectively") in time. What seems inconsistent are the purposes for which the group is being followed, the specifications for inclusion in the group, and the reference date from which the follow-up period begins. In two definitions, the use of a cohort is limited to studies of cause of disease, but in the other definitions the purpose of the research is unrestricted. Several definitions require that the cohort members be similar in age, but other definitions contain no such chronologic demands.

Since the textbooks of statistics and epidemiology have either failed to define a cohort or have offered discordant definitions, we should not be surprised to find that the cohort concept has been used with many logical discrepancies. My object in this discussion is to consider the principles of scientific logic that are suitable for cohort statistics, and to note some of the scientific errors that may occur when the principles are violated.

A. Objectives in cohort research

Since the validity of biostatistical data may be destroyed by bias in the initial selection or subsequent examination of a population, a crucial feature of a cohort is the way it is chosen and maintained. When a single cohort is appraised for general extrapolation, or when two cohorts are compared for the effects of the maneuvers to which they were exposed, the investigator (or reader) must constantly beware of unrecognized bias in the procedures used to assemble the members and to obtain the data of the cohorts.

We can contemplate some of the necessary logic and potential sources of bias if we recall the sequence of

INITIAL STATE → MANEUVER → SUBSEQUENT STATE

that was previously described17 as a basic guide to the architecture of clinical research. With this sequence in mind, we can consider the purposes for which cohorts are assembled, and the ways in which bias can enter the results.

A cohort can be assembled to investigate at least four different types of maneuver: ontogenetic, pathogenetic, pathogressive, and interventional. In an ontogenetic survey, the maneuver is time itself, and the effect to be noted is the growth or development of a normal person during a selected period of time. In a pathogenetic survey, the maneuver is exposure to an agent that allegedly causes a disease, and the effect is development of that disease. The maneuver can be selected by the exposed person (as in cigarette smoking), imposed by "nature" (as in airborne infection, natural disasters, and "normal" or degenerative changes with age), or derived from medical experiences (as in diverse "iatrogenic" ailments). In a pathogressive survey, the maneuver is exposure to an established disease, and the effect is the change in the affected person's condition as the disease takes its course.

The activity that is called therapy consists of an interventional maneuver intended to change what might happen after exposure to a causal agent or during the course of a disease. The effect is determined by noting, in prophylactic therapy, whether an anticipated occurrence is prevented, and in remedial therapy, whether an observed manifestation is altered. Thus, in a survey of therapy, the investigator determines what happened to a cohort of patients who have already been treated according to the ad hoc decisions of individual clinicians and patients; and in a trial of therapy, the investigator can assign the treatment according to a prearranged experimental plan.

One of the main purposes of therapeutic surveys is for investigators to discern the appropriate types of data, indexes, and stratifications needed for satisfactory design and evaluation of experimental therapeutic trials.12, 15 Although an ideal cohort for such surveys would consist of "untreated" people in whom "natural history" could be observed, such "untreated" cohorts are seldom attainable, particularly for the major chronic diseases that constantly receive diverse forms of treatment, including the decision to give "no treatment." Accordingly, a modern investigator studies natural history by making a nil hypothesis. He assumes that no mode of treatment may have been effective; he then combines the results of all patients, regardless of treatment; and he studies the "clinical course" in the cohort obtained by the combination.18 Having determined indexes of accomplishment and demarcations of different prognostic strata for the entire group of patients, regardless of treatment, the investigator can then appraise the results of treatment within the strata, and can use the results for designing future trials.

A cohort can thus be studied for the observation of cause, course, or intervention. For a cause-cohort, the objective is to determine whether a particular disease develops after exposure to an agent allegedly causing the disease. For a course-cohort, the objective is to determine the path of nature in the normal development of a healthy person or in the outcome of an established disease, with or without treatment. For an intervention-cohort, the objective is to determine whether a particular maneuver alters what would otherwise occur in the course of normal growth, exposure to disease, or course of disease.

The key issue of scientific logic in cohort research is the validity of the comparisons that are performed internally and externally. In studies of cause or intervention, where the effects of different maneuvers are contrasted, the main comparison takes place "internally" among the cohorts who receive those maneuvers. At least two cohorts must be assembled: one that is exposed to the particular causal or interventional maneuver under consideration, and another cohort that remains either unexposed or exposed to some alternative maneuver. Thus, in patients with disease D, if treatment T_A for cohort C_A gives better results than treatment T_B for cohort C_B, we would first want to be sure that the two cohorts were similar enough so that the treatment, rather than a difference in cohorts, could be held responsible for the difference in the results. After this "internal" comparison is completed, we would want to perform an "external" comparison to extrapolate the results to the larger population from which the cohorts were derived. For this purpose, we would want to ascertain the similarity of cohorts C_A and

however, the people whose target events are counted in the numerators are a temporally delayed subset of the same people who were previously counted in the denominators.

the de-nominators. hort statistics provides a role not merely

CB

for the

ulation from

to other patients with disease D. If the

cohorts were sufficiently representative of

we might he

the diseased population, to conclude that

TA

better treatment for

is

people with disease D, and not

group

CA

of patients

who

just for the

constituted cohort

This type of external projection

.

the onlv form of "comparison"' course,

is

often

in studies oi

where maneuvers are not compared

and onlv

a single cohort

Sources of bias

B.

able

in

is

observed.

The numerators

"evolve" from

This temporal-subset distinction of co-

numerator and denominator of the but also for the virgule (or

statistical ratio,

that separates numerator from denominator. The virgule "contains" the time interval that elapses from initiation of the maneuver in the denominator population until the appearance of the target events counted in the numerator. "fraction

The

line")

virgule thus "includes" several crucial

features of a cohort: the initiation of the

maneuver, the performance of the maneuand the observation of the population

cohort statistics

ver,

The

of an observed cohort are expressed usually statistically as a ratio, or

thereafter.

The denominator of this ratio contains the number of people initially in the cohort. The numerator contains the num-

tinctions arc violated, the temporal

results

rate."

ber of those people in

occurred

event

target

For

example,

velops the

3

in

if

of

the

"pill,"

whom

the chosen

a

at

date.

later

thrombophlebitis 1,500

women

de-

receiving

thrombophlebitis

rate

3/1,500 or 0.2%; if 60 of 200 people with a particular cancer survive for three is

years, the three-year survival rate

30%. As a general

or

the

number

expression,

of people in cohort

60 200

is

,

RA

Because of the forward way cohort

,

which can be called "chronology

bias,

arises

when

the cohort

cross-section of people in

may have begun

ver

their

clinical

course,

This type of bias

if

ceive

e A /n A

in

which a

.

is

assembled

bias," as a

whom the maneu-

at different times in

so that the virgule

does not represent the inception of the maneuver for each member of the cohort.

and

is

changes can become disfundamental type of

statistics

One

torted or biased.

spread in cohort

followed, this statistical ratio has

is

implied by cohort

is

the target event later occurs in e A people, the rate of the target event,

these biological and logical dis-

nA

if

CA

When

a

paper of

so important

is

statistics

separate

that

discussion

this series.

A

and wideit

will re-

the next

in

second fundamental

type of bias, which can be called "transition bias," arises

aspects of the

from problems

way

in various

that people are trans-

unique logical distinctions that are not

ferred from their anonvmity in a general

present in any other types

population to their prominence in the data

data.

For the

statistical

of statistical

ratios in all other types of

"samples," the numerators

and

denominators can be assembled in diverse manners and can be related to each other in a variety of ways. In cohort statistics, °I

shall

not attempt to differentiate

among

the terms

and proportion. The distinctions are seldom honored, even by people who know the differences, and ratio,

rate,

the three terms are used interchangeably, with rate being most popular.

for a statistical cohort.

and

difficulties

The

diverse sources

of transition bias

are the

topic for the rest of the discussion here. C.

The problems of transition bias

In order to contribute to the statistics of a cohort, each of its members must undergo six major populational transfers: from the general population to the base population of people with the particular condition under study in the denominator; from that base population to the parent population from which the cohort (or cohorts) is (are) derived; from the parent population to the candidate population that is eligible for "admission" to the cohort; from the candidate population to the particular cohort that has been exposed to a maneuver and that becomes counted in the denominator; from the state before inception of the maneuver to a postmaneuveral state in which the target event may occur; and from occurrence of the target event to the detected target that is counted in the numerator.

The first three of these transfers occur before the cohort is assembled, and are not expressed in the cohort data. The members of a general population must have or must develop the particular condition that brings them into the base population under consideration. If the condition is a state of health, as in studies of cause of disease, then the base population consists of those members of the general population who are in the appropriate state of health. If the condition is a state of disease, as in studies of pathogressive course or therapeutic intervention, the base population consists of those members of the general population who have developed the appropriate state of disease. Thus, if the maneuver under study is the effect of cigarette smoking on healthy adults, the base population consists of those people in the general population who are healthy adults. If the maneuver under study is the effect of surgical treatment for angina pectoris, the base population consists of the anginal members of the general population.

Since an entire base population can seldom be studied, investigators usually choose cohorts from a parent population. The parent population contains those members of the base population who reach a milieu that provides the opportunity for the members to be investigatively observed. This milieu is often a clinical setting, such as a hospital, but it also can be the type of epidemiologic setting achieved with a field survey or with mailed questionnaires. Thus, if the cohort statistics are based on treatment of a disease at a particular hospital, the parent population consists of the hospitalized patients with that disease; if the cohort is obtained by mailing questionnaires to a group of healthy adults to ask about their smoking habits, the parent population consists of the people who have been sent the questionnaire.

Not all members of a parent population enter into cohort statistics. Because of various types of eligibility criteria, some or many members of a parent population may be excluded from the candidate population that enters the cohort(s). Thus, in choosing treatment for a particular disease, physicians may exclude patients who are too sick, too uncooperative, or otherwise unsuitable for the proposed therapy. In studies of cause of disease, the people who have received a questionnaire determine their own eligibility for the candidate population by deciding to complete and to return the questionnaire. After these exclusions have been completed, the members of the general population have successively negotiated the various transfers that lead them to the candidate population that enters a cohort.

The maneuvers associated with an intervention-cohort are usually assigned ad hoc by a physician or planned by an investigator; the maneuvers associated with a cause-cohort have often been chosen by the members of the cohort and are noted from their responses in a questionnaire or interview.

Since the denominator of cohort statistics is not demarcated until the candidate population has been associated with the maneuver, the transfers from general population to base population, to parent population, to candidate population are not accounted for in cohort data. Nevertheless, these transfers profoundly affect the extrapolation of the results noted in cohort statistics. Unless the candidate population adequately represents its antecedent populations, the cohort data may not be validly applicable to any group of people beyond the particular persons who were observed. The "internal" comparison of the cohorts may be valid within the candidate population, but a bias in the assembly of the candidate population may prevent the results from being extrapolated to any members of the "external" population in the world beyond the immediate cohorts. For this reason, the final interpretation of cohort statistics depends on crucial features of populational transfers that are not noted or contemplated in any of the numerical data that constitute the statistical results. The problems entailed in detecting and reducing the bias created by these pre-cohort transfers will be reserved for discussion later in this essay.

After the candidate population has been assembled, the remaining types of populational transfer are cited in the statistical ratios, and can be contemplated directly. These transitions can be biased by factors affecting the denominator, the numerator, or the virgule of the statistical ratios. In the denominators, bias is created unless the cohorts compared to assess the effects of a maneuver are shown to have equal susceptibility to the target event before the maneuver begins. In the virgule, bias is created unless the cohorts have equal performance of the compared maneuvers. In the numerators, bias is created unless the cohorts have equal opportunity for the target event to be detected when it occurs. Major disparities in any of these three features (susceptibility to target, performance of maneuver, and detection of target) can produce a transition bias that will impair the scientific validity of the comparisons.

1. Equal susceptibility to the target event. Suppose the expected three-year survival rate for patients with a particular cancer is 70% in those with a localized tumor, and 20% in those with metastatic disease. Now suppose we wanted to compare the results of surgical treatment versus radiotherapy. If the surgical cohort consisted exclusively of patients with localized cancer, and if the radiotherapy cohort consisted exclusively of patients with metastases, we would immediately recognize the comparison as unfair. The contrast of treatment would be biased because the two cohorts had unequal prognostic expectations before treatment was applied.

Now suppose we had a mixture of localized and metastatic patients in the two cohorts. The cohorts would still be biased unless the mixture was proportionately equal in the two strata (or subgroups) that compose the cohorts. For example, let us assume that neither surgery nor radiotherapy has any effect on the natural course of the cancer, but suppose the surgical cohort contained 85% localized patients and 15% metastatic patients, whereas the radiotherapy cohort contained 40% localized and 60% metastatic patients. The survival rates in the two cohorts would be as follows:

Surgery cohort: (.85)(.70) + (.15)(.20) = .595 + .030 = 62.5%

Radiotherapy cohort: (.40)(.70) + (.60)(.20) = .28 + .12 = 40%.

Thus, although the two treatments had no effect on the course of the cancer, the disproportionate mixture of localized and metastatic patients would make the survival rate in the surgery cohort seem more than 1½ times higher than that in the radiotherapy cohort. If we were unaware of the prognostic differences between localized and metastatic patients, and if we were unaware of the disproportionate mixture of these two groups of patients in the cohorts, we would conclude erroneously that surgery was substantially better treatment than radiotherapy.

The hazard of this type of error is the reason for recognizing prognostic stratification as a crucial feature of cohort research. If a cohort can be divided into strata of people with distinctly different prognoses (i.e., susceptibility to the target event), then the compared cohorts will be biased unless they have similar proportions of the prognostically disparate strata.
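The arithmetic of the surgery and radiotherapy cohorts can be checked with a few lines of Python. This is only an illustrative sketch: the function name `cohort_rate` and the variable names are assumptions made for the example and do not come from the text, whereas the stratum survival rates (70% and 20%) and the cohort mixtures (85/15 and 40/60) are the ones stated above.

```python
def cohort_rate(strata):
    """Expected target-event rate of a cohort.

    `strata` is a list of (proportion_of_cohort, stratum_rate) pairs;
    the proportions must add up to 1. (Hypothetical helper for illustration.)
    """
    total = sum(proportion for proportion, _ in strata)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    return sum(proportion * rate for proportion, rate in strata)


# Three-year survival assumed in the text for each prognostic stratum.
LOCALIZED = 0.70
METASTATIC = 0.20

# Neither treatment changes survival; only the mixture of strata differs.
surgery = cohort_rate([(0.85, LOCALIZED), (0.15, METASTATIC)])
radiotherapy = cohort_rate([(0.40, LOCALIZED), (0.60, METASTATIC)])

print(f"surgery cohort:      {surgery:.1%}")
print(f"radiotherapy cohort: {radiotherapy:.1%}")
print(f"ratio of the rates:  {surgery / radiotherapy:.2f}")
```

Because the helper takes any list of strata, the same call can be reused for other mixtures, which makes it easy to see how the apparent difference between two equally ineffective treatments shrinks as the two cohorts become more alike in composition.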

The mathematics of this situation can be demonstrated, with the aid of some simple algebra, as follows: Suppose that people with a condition or disease, D, can be divided into two strata, $S_1$ and $S_2$, with corresponding expected rates, $r_1$ and $r_2$, for the target event. (An example would be the division of cancer patients into a localized group, with expected 70% survival, and a metastatic group, with expected 20% survival.) Now suppose that we assemble cohorts A and B as a mixture of patients from the two strata, $S_1$ and $S_2$. In cohort A, let $k_A$ be the proportion of patients from $S_1$; $1-k_A$ will then be the proportion of patients from $S_2$. For cohort B, the corresponding proportions for $S_1$ and $S_2$ will be $k_B$ and $1-k_B$. The expected rate of occurrence of the target event in cohort A will then be

$R_A = k_A r_1 + (1-k_A) r_2 = k_A (r_1 - r_2) + r_2$.

For cohort B, the rate will be

$R_B = k_B r_1 + (1-k_B) r_2 = k_B (r_1 - r_2) + r_2$.

If the proportionate distribution of the two strata were the same in both cohorts, so that $k_A = k_B$, the two cohorts would have similar expectations, with $R_A = R_B$ regardless of the individual values for $r_1$ and $r_2$. Alternatively, if the two strata, $S_1$ and $S_2$, were prognostically similar, with identical target rates of $r_1 = r_2$, the terms containing $(r_1 - r_2)$ would be zero, and the values for $R_A$ and $R_B$ would be equal, regardless of any disproportions in the values for $k_A$ and $k_B$. But if the two strata have unequal target rates, so that $r_1 \neq r_2$, and if the strata are disproportionately distributed, so that $k_A \neq k_B$, the results for $R_A$ and $R_B$ will differ. By subtracting the entities noted in the preceding equations, we get

$R_A - R_B = (k_A - k_B)(r_1 - r_2)$.

Thus, in the preceding example for cancer patients, $r_1 - r_2$ = .70 - .20 = .50 and $k_A - k_B$ = .85 - .40 = .45. The difference $R_A - R_B$ = (.50)(.45) = .225 = 22.5%, which is the difference between the 62.5% and 40% noted earlier for the two cohorts.

The general principle here can be stated as follows. If $\Delta k = |k_A - k_B|$ is the difference in proportional distribution of the two strata in two main cohorts, and if $\Delta r = |r_1 - r_2|$ is the difference in the rates of the target event for those two strata, then the baseline difference in expected rates for the two main cohorts will be the product $\Delta k \Delta r$. The value $\Delta k \Delta r$ is the bias introduced when two cohorts contain disproportionate amounts of patients from two distinctively different prognostic strata.*

*For two cohorts comprising people from three distinctively different prognostic strata, with target rates $r_1$, $r_2$, and $r_3$, the calculations would be as follows: Let $k_A$ be the proportion of $r_1$-type patients and $m_A$ be the proportion of $r_2$-type patients in Cohort A. The proportion of $r_3$-type patients will be $1 - k_A - m_A$. For Cohort B, the corresponding values will be $k_B$, $m_B$, and $1 - k_B - m_B$. The expected target rates would be: For Cohort A, $R_A = k_A r_1 + m_A r_2 + (1 - k_A - m_A) r_3$. For Cohort B, $R_B = k_B r_1 + m_B r_2 + (1 - k_B - m_B) r_3$. By subtraction, the difference in cohort rates is $R_A - R_B = (k_A - k_B)(r_1 - r_3) + (m_A - m_B)(r_2 - r_3)$. Thus, if $\Delta k$ and $\Delta m$, respectively, represent $|k_A - k_B|$ and $|m_A - m_B|$, and if $\Delta r_1$ and $\Delta r_2$, respectively, represent $|r_1 - r_3|$ and $|r_2 - r_3|$, the bias can be specified as $\Delta k \Delta r_1 + \Delta m \Delta r_2$. If only two prognostic strata are under consideration, the bias reduces to the $\Delta k \Delta r$ noted previously.

The consideration of different prognostic strata is important not just for the issue of bias in compared cohorts, but also to avoid misleading conclusions about the results of maneuvers that have opposite effects on different prognostic strata within the same cohort. 15 For example, returning to the cancer population described earlier, suppose the surgery cohort and the radiotherapy cohort each contained 50% localized and 50% metastatic patients. Suppose, however, that surgery raises the survival rate by an increment of 15% in the localized patients and lowers survival by a decrement of 15% in the metastatic group; whereas radiotherapy has exactly the reverse effect. Thus, the survival rate after surgery would be 85% in patients with localized cancer and 5% for metastatic cancer; whereas radiotherapy would reduce survival to 55% in localized patients and raise survival to 35% in metastatic patients. The survival rate in the surgery cohort would then be

(.85)(.50) + (.05)(.50) = (.425) + (.025) = 45%.

The survival rate in the radiotherapy cohort would be

(.55)(.50) + (.35)(.50) = (.275) + (.175) = 45%.

Consequently, if we looked only at the total survival rates, ignoring a prognostic stratification, we would conclude erroneously that surgery and radiotherapy had similar effects in the treatment of cancer, although the actual effects were completely opposite in the two prognostically different strata.

These problems in prognostic stratification can affect any type of cohort research, regardless of whether the study deals with intervention, cause, or course. In surveys of intervention, a prognostic stratification of cohorts is needed to avoid the bias that can arise when physicians assign treatment selectively. For example, if surgery is preferentially chosen for "operable" patients with localized cancer and is seldom offered when the cancer is metastatic, a surgical cohort will inevitably be biased for comparison with a cohort that receives radiotherapy or any other mode of therapy in which patients are not preselected for their "operability."

In pathogenetic surveys of cause of disease, the issues of prognostic stratification are more difficult to discern and to identify, mainly because epidemiologists have generally obtained so little information on the subject. The bias to beware is produced by patients' self-selection of the "causal" maneuver. Suppose that people who are "tense" have a lower life expectancy than people who are not. Suppose also that tense people are more likely to become cigarette smokers than people who are not "tense." Compared to a cohort of nonsmokers, a cohort of cigarette smokers may thus be disproportionately large in its quota of "tense" people. The subsequent comparative reduction in life expectancy of the smokers might then be due to the reduction caused by "tension" rather than by smoking.

In surveys of course of disease, a prognostic stratification is necessary to avoid misleading results in comparison of target rates from one era to the next. Thus, with better techniques of "early" cancer detection, a cohort of patients in 1970 might contain many more localized cancers than a cohort found in 1940. The better survival rates in the 1970 cohort might then reflect earlier stages of detection rather than improved methods of treatment.

2. Equal performance of maneuvers. Another type of bias that is also commonly overlooked in cohort statistics is caused by inequalities in the performance of compared maneuvers. For example, in many surveys of treatment for acute myocardial infarction, the patients who received anticoagulants were also treated with elastic stockings, early ambulation, and chair rest; whereas the "control" group generally was treated with bed rest alone. Consequently, the differences in the physical therapy, rather than the presence or absence of anticoagulants, may have been responsible for differences in the outcome of the two cohorts. Another example of this type of bias would be the comparison of two different surgical operations, of which the first is performed by a highly skilled surgeon working with excellent anesthesiology support, whereas the other operation is done by a less skillful surgeon working in a suboptimal anesthesiology environment. In such circumstances, the better short-term outcome in patients who received the first operation might have little to do with the particular distinctions of the surgery itself.

A particularly important problem in treatment that must be self-administered by a patient (such as diet or oral medication) is caused by selectivity in the decisions of patients to maintain or reject the prescribed treatment. Suppose the natural rate of improvement for a particular condition is 30%, and suppose this rate is raised to 70% when patients receive either treatment A or treatment B. Suppose further, however, that treatment A is difficult to maintain (because of discomfort, poor taste, or some other feature) and is abandoned by 50% of the people who start it, whereas treatment B is more easily maintained and is continued faithfully by 90% of the patients.

Suppose finally that we have not investigated the fidelity of maintenance for the two treatments, and that we are unaware of these distinctions. Our results would show the following rates of improvement:

For Cohort A, $R_A$ = (.30)(.50) + (.70)(.50) = .15 + .35 = 50%.

For Cohort B, $R_B$ = (.30)(.10) + (.70)(.90) = .03 + .63 = 66%.

If we considered only the total outcome in these two rates, we would falsely conclude that treatment B is inherently more effective than treatment A, when in fact the two treatments had equal potency. The difference in total outcome was due to differences in the maintenance of treatment, not in the inherent efficacy. If treatment A were reorganized in a more convenient manner that made its maintenance easier, the results might become just as good as those obtained with treatment B.

A practical example of this problem occurred during studies of antibiotic prophylaxis to prevent streptococcal infections and recurrences of rheumatic fever in patients with previous attacks of rheumatic fever. 28 For patients receiving monthly injections of long-acting benzathine penicillin, the attack rates of both infections and recurrences were substantially lower than those for patients who received daily doses of oral penicillin. Since the injection of penicillin ensured its receipt, and was almost always accomplished in the research clinic or in the patient's home, the fidelity of prophylaxis could be assured in the "injection" cohort. For the "oral" cohort, however, the daily ingestion of medication could not be supervised. Accordingly, in order to determine whether the superior results obtained with the injections were due to pharmacology or to maintenance, special procedures were established to classify the fidelity of maintenance for the oral regimen as excellent and not excellent. When the injection group was found to have better results than the excellent oral group, the distinction could be attributed to pharmacology.

Even when the maneuver is assigned randomly by the investigator, rather than self-selected by the cohort, the patients who choose to maintain the maneuver (or who are able to maintain it) are a self-selected group, and the results may be biased by unrecognized distinctions that were involved in the selection process. For example, let us again assume that people who are "tense" are likely to die earlier than people who are "relaxed." Let us further assume that "relaxed" people will be much more likely to maintain a special dietary program than people who are "tense." Now suppose we perform a clinical trial of a special dietary program intended to reduce atherosclerosis. Having learned about the bias involved in failing to assess fidelity of a maneuver's maintenance, we carefully perform such assessments, and we then exclude from consideration the people who did not maintain the diet faithfully. Our group of faithful dieters will now be composed mainly of "relaxed" people, because the "tense" people will not have adhered to the dietary program. Our "control" group, which did not have to adhere to a diet, will be composed of a mixture of "tense" and "relaxed" people. Even if the diet has no effects on mortality, the resultant death rate will nevertheless be lower in the faithful dieters than in the "control" group, because the dieting cohort will contain a predominance of "relaxed" people, with low death rates.

3. Equal detection of target. Another type of transition bias, discussed earlier 20 in this series of papers, has been almost totally ignored in epidemiologic research, and acts as a major flaw in the validity of cohort statistics dealing with causes of disease. Many acute ailments, such as streptococcal infections and thrombophlebitis, and many chronic diseases, such as lung cancer and coronary artery disease, are not detected unless the patient receives regular medical surveillance and/or special diagnostic testing. For example, as noted in necropsy examinations, about 20% of patients with lung cancer 25 and about 50% of patients with coronary artery disease 2 had not received the appropriate diagnosis during life.

Because so many major ailments can escape diagnostic identification, an important source of bias in compared cohorts is an unequal opportunity to have the target event detected. To avoid such bias, the two cohorts should be relatively similar in the frequency, intensity, and scope of medical examinations, particularly those techniques by which the target event is detected. For example, if lung cancer is the target event sought in a comparison of smokers and nonsmokers, the results may be biased if chest x-ray examinations were routinely performed more frequently in the smokers. If coronary artery disease is the target sought in a comparison of laborers and executives, the results may be biased if electrocardiograms were routinely performed more frequently in the executives.

To illustrate the quantitative aspects of this type of bias, let us assume that thrombophlebitis is detected when it occurs in 80% of a stratum that consists of women who have frequent medical examinations, but is detected in only 40% of a stratum of women examined infrequently. Let us assume further that a cohort of women taking the "pill" contains 75% of women from the first stratum and only 25% from the second, whereas a cohort of women using other forms of contraception contains 45% of women from the first stratum and 55% from the second. Finally, let us assume that r, the rate of development of thrombophlebitis, is the same for the "pill" as for other contraceptive agents. The reported rates of thrombophlebitis in the two cohorts would be as follows:

In the "Pill" cohort, (.75)(.80)r + (.25)(.40)r = (.60)r + (.10)r = 70% r.

In the "Non-pill" cohort, (.45)(.80)r + (.55)(.40)r = (.36)r + (.22)r = 58% r.

Thus, the reported rate of thrombophlebitis in the "pill" cohort would be substantially higher than the rate in the "non-pill" cohort, even though the "pill" had no differential effect on thrombophlebitis.

The algebra of this "tilted target" situation can be shown as follows: Let $d_1$ and $d_2$ be the different rates of detection for the target in strata 1 and 2. Let $k_A$ and $k_B$ be the proportions with which stratum 1 occurs in Cohorts A and B, so that stratum 2 occurs in proportions $1-k_A$ and $1-k_B$. Let r be the true rate of the target event, which is the same in strata 1 and 2. We would then find the following.

For Cohort A, $R_A = k_A d_1 r + (1-k_A) d_2 r = k_A r (d_1 - d_2) + d_2 r$;

and for Cohort B, $R_B = k_B d_1 r + (1-k_B) d_2 r = k_B r (d_1 - d_2) + d_2 r$.

By subtraction, $R_A - R_B = r[(k_A - k_B)(d_1 - d_2)]$.

If $\Delta k$ is $k_A - k_B$ and $\Delta d$ is $d_1 - d_2$, this difference can be expressed as $r \Delta k \Delta d$, a value which indicates the bias produced by disparate detection of the target event in two strata distributed disproportionately in two cohorts. Thus, in the immediately preceding example, $\Delta k$ = .75 - .45 = .30; $\Delta d$ = .80 - .40 = .40; and $\Delta k \Delta d$ = (.30)(.40) = 12%, which was the difference between 70% and 58%.

These problems are an "incidence" counterpart of the fallacies discussed in the "prevalence" bias noted many years ago by Berkson. 1 Berkson pointed out that when the prevalence of certain manifestations was compared in patients hospitalized with different diseases, the results could be biased by disparities in the rates at which people with the compared diseases had been admitted to the hospital. In the fallacy under discussion here, the incidence of target events in cohort groups can be biased by disparate rates of examination procedures for detecting the occurrence of the target. Although both Berkson's point and the one cited here have generally been neglected in epidemiologic research, the lack of scientific attention to such bias does not make it disappear and cannot provide assurance that it is absent.

Such attention to bias in target detection is generally not a major necessity in therapeutic trials, because the investigator can usually make advance plans for the target event to be assessed equally in the different cohorts subjected to treatment.
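The "pill" arithmetic and the $r \Delta k \Delta d$ expression given earlier in this section can be checked with a short Python sketch. The helper name `reported_rate` and the numerical value chosen for the true rate r are assumptions made purely for illustration (the text leaves r unspecified); the detection rates (80% and 40%) and cohort mixtures (75/25 and 45/55) are those of the example.

```python
def reported_rate(true_rate, strata):
    """Rate of the target event that would actually be reported.

    `strata` is a list of (proportion_of_cohort, detection_rate) pairs.
    The true rate is taken to be identical in every stratum, as in the text.
    (Hypothetical helper for illustration.)
    """
    total = sum(proportion for proportion, _ in strata)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    detected_fraction = sum(proportion * detection for proportion, detection in strata)
    return true_rate * detected_fraction


# Assumed value for illustration only; the text does not give a number for r.
TRUE_RATE = 0.01

FREQUENT_EXAMS = 0.80    # detection rate in frequently examined women
INFREQUENT_EXAMS = 0.40  # detection rate in infrequently examined women

pill = reported_rate(TRUE_RATE, [(0.75, FREQUENT_EXAMS), (0.25, INFREQUENT_EXAMS)])
non_pill = reported_rate(TRUE_RATE, [(0.45, FREQUENT_EXAMS), (0.55, INFREQUENT_EXAMS)])

# Bias predicted by the algebra: r * (k_A - k_B) * (d_1 - d_2).
predicted_bias = TRUE_RATE * (0.75 - 0.45) * (FREQUENT_EXAMS - INFREQUENT_EXAMS)

print(f"pill cohort:     {pill / TRUE_RATE:.0%} of r")
print(f"non-pill cohort: {non_pill / TRUE_RATE:.0%} of r")
print(f"difference:      {(pill - non_pill) / TRUE_RATE:.0%} of r")
```

Whatever value is assumed for r, the two reported rates stay in the same 70-to-58 proportion, since r multiplies both; only the size of the absolute difference changes.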

When the research is performed as a survey, however, the investigator cannot arrange for the target event to have been equally detected, and must beware that bias has occurred in the rates of detection for different cohorts. In intervention-cohorts, the therapeutic circumstances may have made the target event more likely to be detected with one form of treatment than with another. Thus, in an example cited elsewhere¹⁸ for a clinical trial of antistreptococcal prophylaxis, streptococcal infections were more likely to be detected in patients receiving injected penicillin than in those receiving oral penicillin. In cause-cohorts, phenomena associated with a "causal agent" may also have led to biased detection of the target disease. Thus, the frequent examination of "pill"-takers may produce a spuriously higher rate of thrombophlebitis than in the less frequently examined women who use other means of contraception. In course-cohorts, the different rates of disease¹⁴ noted from one era to the next or in different geographic regions may be a reflection merely of the detection bias due to differences in diagnostic criteria and methods of case finding.

In certain research circumstances (as described later), some of the hazards just cited can be eliminated or minimized. In most forms of cohort statistics, however, various forms of bias are inevitable, and an investigator's main hope for scientific validity is to attempt to reduce or "adjust" the bias.

D. The reduction of transition bias

1. Allocation of maneuver. In the experimental circumstances of an interventional trial, the investigator begins with a defined candidate population. He creates the "treated" and "control" groups as he assigns the successive patients to be exposed or nonexposed to the maneuver under study. By allocating the maneuver according to principles of randomization, the investigator can attain three important goals: he can eliminate systematic bias in the allocation of maneuvers; and he can ensure that the compared cohorts will be concurrent in both populational source and calendrical time. The randomization procedure serves to remove systematic bias in allocation while assigning the compared maneuvers to cohorts derived from the same candidate population, during the same calendar interval.

The main systematic bias to be feared in these circumstances is the effect of previous transfer decisions that may render the candidate population unrepresentative of the patients to whom the results will be extrapolated. For example, when a parent population at a hospital is "screened" for admission to a therapeutic trial for a particular disease, certain patients may be excluded because they are "uncooperative," too ill to participate, or afflicted with major co-morbid ailments. As a result of these exclusions from the parent population, the candidate population will become disproportionately altered in a manner that impairs the validity of extrapolation to other patients with the "same" disease, despite the random allocation of treatment that formed cohorts from the candidate population. The magnitude of this problem can be reduced if the candidate population is carefully identified, so that the extrapolation can be suitably cautious, and if a "screening log"²⁴ is kept to record the characteristics and outcome of the members of the parent population who were excluded from the trial. Thus, in a trial of porta-caval shunt for patients with esophageal varices, perhaps the most striking result was that the candidate population, regardless of therapy, had a much better outcome than the members of the parent population who were not admitted to the trial.
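The two procedures discussed in this part of the section, screening a parent population while keeping a log of the exclusions and then randomly allocating the maneuver within the candidate population, can be sketched in a few lines. This is a hedged illustration, not the method of any trial cited here: the function names `screen` and `allocate_randomly`, the patient records, the eligibility rule, and the seed are all hypothetical.

```python
import random


def screen(parent_population, is_eligible):
    """Split a parent population into candidates and a screening log of exclusions."""
    candidates = [p for p in parent_population if is_eligible(p)]
    screening_log = [p for p in parent_population if not is_eligible(p)]
    return candidates, screening_log


def allocate_randomly(candidates, seed=None):
    """Randomly divide the candidate population into treated and control cohorts."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical parent population: (patient id, has a major co-morbid ailment)
parent = [(i, i % 4 == 0) for i in range(20)]

# Hypothetical eligibility rule: exclude patients with a major co-morbid ailment.
candidates, screening_log = screen(parent, lambda patient: not patient[1])
treated, control = allocate_randomly(candidates, seed=7)
```

Keeping `screening_log` alongside the cohorts is what would later allow the outcomes of excluded patients to be compared with those of the admitted candidates.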

In contrast to a therapeutic trial, the maneuver in a survey is chosen not by the investigator, but by nature, patients, or the patients' doctors. The investigator's inability to assign the maneuver is, in fact, the main feature that differentiates an experiment from a survey. In both types of research, the investigator can establish "control" groups for comparative purposes, but in a survey the investigator cannot control (i.e., govern, assign, allocate) the maneuver.

Because the exposure or nonexposure to the maneuver is not determined by the investigator, the cohorts of surveys are chosen quite differently from those of experiments. In an experiment, a candidate population is assembled and then divided into cohorts to be exposed or nonexposed to the maneuver. In a survey, the cohorts are determined, before the research begins, by their members having already been previously "assigned" to whatever maneuver they received. Since this assignment was not carried out by random allocation to a candidate population, the investigator must beware that the cohorts were possibly rendered unequal in target susceptibility because of decisions made when the maneuver was selected.

In situations where the maneuver is chosen by doctors, a bias may arise because of pretherapeutic criteria that create prognostic disproportions among patients assigned to different modes of therapy. Thus, in cancer therapy, the criteria for "operability" usually provide a surgeon with patients who have better prognoses than those who are deemed "inoperable." Consequently, even when the cancer is left unresected, the "operable" patients have better outcomes than those who were "inoperable."¹⁹

In situations where the maneuver is chosen by patients, prognostic differences may exist in people who select one type of maneuver in preference to another. Thus, people who choose to smoke cigarettes, to eat low-fat diets, or to use the "pill" may have distinctly different prognostic characteristics from people who do not choose these maneuvers.

Although such bias cannot be avoided in surveys, a scientific investigator will try to reduce its effects by "adjusting" the data. These adjustments consist of performing a prognostic stratification, and then comparing results of different maneuvers in the same prognostic strata. Thus, in a survey of treatment of cancer, the results might be examined in patients who are in "good" or "bad" stages of disease. In a survey designed to determine whether the "pill" predisposes to cervical cancer, the initial gynecologic condition of the cohort groups can be determined. In one recent study,³⁹ for example, the women who chose to use the "pill" were found to have substantially more cervical dysplasia than non-pill users before contraceptive agents were begun. For surveys of cigarette smoking or high-fat diets, the personality, psychic status, familial, and ethnic features of the cohorts can be assessed, and the populations can be divided into strata according to such features as ethnic background and longevity of parents. The effects of the maneuvers can then be compared within similar prognostic strata of the cohorts.

The crucial issue in this type of adjustment is a prognostic stratification for the members of a cohort. The creation of the strata depends on the investigator's ability to recognize the particular initial-state properties or "risk factors" that affect prognosis for the target event. These properties cannot be chosen arbitrarily, and must be shown to correlate predictively with the target event.¹⁵ Although certain demographic distinctions, such as age, race, and sex, are often used to establish strata in cohorts, these strata are not necessarily prognostic unless the demographic differences have been shown to affect susceptibility to the target event. Thus, if the prognosis for survival in cancer depended mainly on whether the cancer was localized or metastatic, and if the prognosis was otherwise the same for patients who were old or young, black or white, men or women, our search for bias would be concerned mainly with a possible disproportion between localized and metastatic strata. We would not be particularly concerned about demographic disproportions.

2. Ascertainment of maneuver. In a preplanned experimental trial or in a longitudinal survey, the investigator can
make specific advance arrangements to ascertain the fidelity with which patients adhered to the assigned maneuver. He can then divide the patients according to good or not good compliance, and determine the outcomes in the different strata of "fidelity." These procedures can also be attempted when the survey is based on existing records compiled by other people, although the investigator is less likely to find that the necessary information was obtained and recorded.

In many modern cohorts of trials or preplanned surveys where such ascertainment could readily be achieved, however, investigators have seldom made the required efforts. Adherence to maneuvers has been either omitted from assessment, or assessed only via questionnaires that were not rechecked or evaluated for the reliability of answers. Thus, reports of success in the pharmaceutical treatment of angina pectoris, or in diet-plus-drugs treatment of obesity, are seldom accompanied by data indicating the results according to the fidelity with which the prescribed regimen was maintained. In epidemiologic studies of the outcome of cigarette smoking or different types of dietary or exercise patterns, the investigators have seldom employed repeat questionnaires or other appraisals to check the accuracy of the history of the smoking, dietary, or exercise maneuvers in which the cohorts had engaged.²⁰ In the absence of such ascertainments and subsequent stratifications, the investigators have failed to "control" their cohorts for the hazard of an important source of transition bias.

3. Ascertainment of target detection. A preplanned longitudinal survey also allows the investigator to check the intensity of procedures for target detection in compared cohorts. Although such scientific assessments are often incorporated into the protocol of therapeutic trials, the attempt to discern or exclude bias in equality of opportunity for target detection has been almost wholly absent in epidemiologic research. In almost all of the modern epidemiologic surveys of the etiology of disease in cause-cohorts, no effort has been made to search for possible bias in target detection. The investigators in many of these "poll-bearing"²⁰ surveys have usually determined the occurrence of the target event not by an active research procedure, but by the passive analysis of death certificates.

This type of analysis cannot remove two types of bias that create "tilted targets." The first is that the target event may occur without being recorded on the death certificate. Thus, such ailments as coronary artery disease or lung cancer are target events that can often occur without being detected during life.¹⁴,²³ Unless the disease has been diagnosed and the patient has been specifically noted to have died because of that disease, the death certificate will contain a "false negative" report of the target event. Although the investigators of etiologic cohorts have sometimes checked death certificates for "false positive" diagnoses, no reports have been published of efforts to detect and adjust the bias introduced by "false negative" reports.¹⁴

A second source of bias, even when attempts are made to check "false negative" diagnoses, is an unequal intensity of diagnostic procedures in compared cohorts. The data from death certificates do not indicate whether the compared cohorts were examined in equally assiduous manners. Since most etiologic cohorts engaged in self-selected maneuvers and received diverse forms of medical attention thereafter, the death certificate provides no assurance that these unplanned, nonrandom activities were performed without bias. The assurance can be provided only with specific investigations of the intensity of target detection procedures. Such investigations, unfortunately, have not been performed or considered in contemporary epidemiologic strategies.

Although a scientific investigator establishes various types of "control" techniques
to eliminate bias in experimental procedures, epidemiologists have given almost no attention to analogous forms of "control" for many of the major sources of transition bias in the cause-cohorts of etiologic research. A scientific effort to investigate possible inequalities in susceptibility of targets, performance of maneuvers, and detection of targets has been strikingly absent from modern epidemiologic statistics.

4. Representation of antecedent populations. If the compared maneuvers have been randomly allocated, the cohorts will usually be representative of the candidate populations. If the maneuvers were not randomly assigned, the investigator can use prognostic stratification (as described earlier) to help adjust for possible bias in the assignment. Even when the maneuvers were randomly allocated, however, a prognostic stratification is important to discern the bias that may occur as a chance event during randomization, and also to avoid the problems cited earlier when maneuvers have opposing effects in different prognostic strata.

In addition to these comparisons, however, an investigator would like to extrapolate his results beyond the candidate populations from which the cohorts were derived. As soon as he contemplates this extrapolation, however, the investigator is confronted by all of the populational transfers that occurred between the general population and the candidate population. In ideal circumstances, each of these transfers (from general to base to parent to candidate population) should have been performed via random sampling from the antecedent population. In practical reality, however, these goals are so difficult to achieve that random samples are almost never used in clinical or epidemiologic research.²⁰ For example, none of the cohorts investigated in research on cigarette smoking or the "pill" was obtained by random sampling from a base population of smokers and nonsmokers, or pill users and non-pill users. In the statistically designed therapeutic trials of the past few decades, none of the cohorts represented a random sample of people with the particular disease or clinical condition under treatment.

In the absence of random sampling, many types of bias can enter these populational transfers. For diseased cohorts, the transfer from general population to base population will be affected by medical standards of practice with regard to "early" detection of lanthanic disease in patients who are asymptomatic or who have no complaints related to the disease. The transfer from base population to the parent population found at a particular medical setting will be affected by patients' iatrotropic stimuli,¹² by inter-iatric referral patterns, and by diverse personal, ethnic, social, or economic influences. The transfer from parent population to candidate population will be affected by the stage of severity of the disease, by the severity of co-morbid ailments, and by the patients' willingness to accept the proposed diagnostic and therapeutic procedures. All of these features can alter the way in which a cohort represents its antecedent populations in proportions of people from different strata within the disease.

For healthy cohorts, these transfers can also be associated with various forms of bias. For example, the "healthy people" who form a candidate population by responding to a questionnaire may not all be healthy. Consequently, unless the initial state of each member of the candidate population is checked appropriately, the compared cohorts may contain disproportionate amounts of people who are already ill, and who may even have developed the disease or target event that was to be sought later on. Berkson⁵ has described the fallacy that can occur when cohorts are created from disproportionate numbers of sick and healthy people who have returned questionnaires submitted to a general population. After a "healthy" candidate population has been assembled by interviews or questionnaires, some of the people may be rejected because of improper or inadequate replies, and if the
rejectees form a distinctive stratum, the cohort populations may be proportionately altered by the absence of members from that stratum.

In all of these transfers of diseased or healthy people, the main hazard of bias is an unequal susceptibility to the target event when the transferred people enter the candidate population that forms the compared cohorts. Consequently, an "adjustment" based on prognostic stratification is the investigator's best hope for reducing the bias that may arise during the "intake" of cohorts, and for increasing the validity of subsequent extrapolations. In order to perform the stratifications, however, the investigator must be aware of the possible sources of bias and must have suitable data for testing the outcome in different strata. Unfortunately, because prognostic distinctions have generally received so little quantification in modern clinical biology, the necessary data are seldom available, and the necessary stratifications are seldom performed.

Such stratifications are particularly important for the epidemiologic cohorts investigated for causes of disease. In clinical cohorts studied for the course or treatment of disease, an investigator can usually state the specific diagnostic or pretherapeutic criteria that differentiate his cohorts from other people who have the same clinical disease. With these criteria for identification and stratification, the investigator can often clearly define the type of populations to which his results would apply. In therapeutic trials, he can often use previous experiences to arrive at a suitable prognostic stratification even if existing survey data are not available to denote the strata.

In the cause-cohorts of epidemiologic research, however, an investigator begins with a generally healthy group of people and cannot classify them on the basis of an existing disease or previous experience with the course of that disease. Consequently, an epidemiologist must be particularly alert for the possibility that such "psychogenetic" features as psychic state, ethnic background, and parental longevity may be predisposing factors to such target events as disease and death. Nevertheless, the most common form of prognostic "adjustment" in epidemiologic cohorts is to stratify on the purely demographic basis of age, race, or sex. (The frequency of stratifications according to age is responsible for one of the uses of the word cohort in reference to a particular age group.) With such procedures, the results of a cohort may be inspected separately in such demographic strata as old people and young, whites and blacks, men and women, but the psychogenetic risk factors are seldom analyzed, probably because the necessary data are not readily available and would require rigorous investigative efforts to be obtained.

An alternative type of demographic "stratification" consists of performing the same investigation in an additional cohort that has occupational or geographic properties different from those of a group that was studied previously. Thus, the results of a cohort containing people with the same occupation or in the same geographic region may seem more extrapolatable when similar results are found in other cohorts with different occupations or geographic regions.

These demographic stratifications, however, are based on arbitrarily chosen, conveniently available data. The stratifications are not necessarily prognostic unless the demographic factors have been shown to affect susceptibility to the target event. Even when age, race, sex, occupation, or geographic region have prognostic correlations, the role of psychic, ethnic, and familial factors cannot be ignored as even greater sources of prognostic bias. Thus, in a correlated prognostic stratification for mortality of healthy people, the longevity of parents, ethnic background, or psychic anxiety may be much more important as "risk factors" than the demographic aspects of sex, race, and occupation. The persistent neglect of these psychic and genetic features is an additional impairment to valid
extrapolation for the cause-cohorts reported in modern epidemiologic statistics.

In the clinical populations studied as intervention-cohorts, many important phenomena are also regularly neglected during "staging" procedures or other attempts at prognostic stratification. Among the omitted clinical features that may delineate prognosis are the cluster and sequence of symptoms, the chronometry of clinical events, and the co-morbidity of associated ailments. Among the "decisional data" that may affect prognosis but that are usually ignored in statistical assessments are the iatrotropic stimuli, inter-iatric referral patterns, diagnostic criteria, pretherapeutic criteria, and patient acquiescence that have been discussed previously.

Since prognostic stratification is necessary to adjust for bias in the sampling procedure, in allocation of maneuver, and in differential effects of maneuvers, the existing defects in techniques of prognostic stratification constitute a major impediment to scientific progress in both epidemiologic and clinical forms of cohort research.

E. "Migration bias" and the unstable cohort

In all of the cohort problems just described, the members of the cohorts may have been unrepresentative of their antecedent populations and may have been compared unfairly, but each cohort was stable. Once its members had been noted, they persisted in their role as the denominator of any subsequent statistics. In one particular form of epidemiologic statistics, however, the data are presented as though they were obtained from a cohort, but the basic principle of cohort stability has been violated because of migration to and from the cohort.

This type of "migration bias" can occur when the rates of death are cited in "vital statistics" for a particular geographic region. For example, the death rate in Connecticut in 1967 for people aged 45 to 64 years was about 1%, according to tabulations of national vital statistics.⁴⁰ (In those listings, the death rate per 100,000 population in this age-regional group is cited as 1003.9. This rate is based on 6,405 counted deaths having occurred in an estimated 638,000 people.) Although this manner of citation suggests that the death rate was obtained in a stable cohort, neither the numerator nor the denominator of the rate came from a stable cohort. At the beginning of 1967, no attempt was made to count the number of people in Connecticut who were 45 to 64 years old; and at the end of 1967, no attempt was made to determine how many of those same people had died. Instead, since the last census count was taken in 1960, the denominator of 638,000 people in 1967 represents an intercensal estimation, obtained with diverse "adjustments," from people who had been counted seven years earlier in the age group 38 to 57. The numerator represents the counted number of deaths reported in Connecticut in 1967 at ages 45 to 64.

Regardless of the accuracy with which the denominator was estimated, the cited death rate does not refer to a stable cohort. During the seven years since the 1960 census, many people who were originally in the 38 to 57 age group had moved away, and other people, who were 45 to 64 years old during 1967, had entered Connecticut from other geographic regions. If the death rate was the same among the emigrants as among the immigrants, the total rate for the apparent cohort would be unaffected. But if the emigrants were substantially healthier than the immigrants, or vice versa, the result for the apparent cohort would be distorted.

This type of bias must be considered when death rates are compared for different geographic regions at different eras. For example, if elderly people with a high death rate move to a region that is regarded as "good" for "old age," the death rate may soar for elderly people in that region. Instead of being regarded as a healthy locality, the region may then become regarded as unhealthy.
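The Connecticut figure and the effect of migration on an apparent cohort can be illustrated with a small calculation. In the sketch below, only the 6,405 deaths and the estimated 638,000 people come from the text; the second part uses invented numbers (a region of 100,000 people, 10,000 healthier emigrants, 10,000 sicker immigrants) purely to show the direction of the distortion. The function name `rate_per_100k` is likewise an assumption made for illustration.

```python
def rate_per_100k(deaths, population):
    """Death rate expressed per 100,000 population."""
    return 100_000 * deaths / population


# Figures cited in the text for Connecticut, 1967, ages 45 to 64.
connecticut_1967 = rate_per_100k(6_405, 638_000)
print(round(connecticut_1967, 1))  # 1003.9, the cited rate

# Hypothetical region of 100,000 people (all numbers below are invented).
stable_deaths = 1_000    # deaths if everyone counted at the start had stayed
emigrant_deaths = 50     # 10,000 healthier emigrants (0.5%), who die elsewhere
immigrant_deaths = 300   # 10,000 sicker immigrants (3%), whose deaths are counted locally

stable_rate = rate_per_100k(stable_deaths, 100_000)
apparent_rate = rate_per_100k(
    stable_deaths - emigrant_deaths + immigrant_deaths, 100_000
)
```

With these invented numbers the population size is unchanged, yet the apparent rate rises from 1,000 to 1,250 per 100,000 solely because the people who left and the people who arrived had different death rates.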

There is no readily available method of adjusting for this type of "migration bias." Even if the decennial census of geographic regions could be performed annually, the problem of intra-annual migration would still create difficulties. The migratory shifts in the denominators can be approximated by various methods,³⁸ but migration is not accounted for in the numerators. One approach would be to arrange for standard death certificates to include data for not only the location of death and the "usual residence" of the deceased, but also an indication of how long the person had lived at the "usual residence" and the duration of time at the place where the person had lived previously. With appropriate analysis of such data, the numerator events could be more accurately demarcated for the denominator cohorts in whom the events presumably occurred.

• • •

Almost all of the problems just described in the "transition" bias of cohort research are caused by an investigator's inability to govern and regulate the human lives of free-living people. The additional data that the investigator obtains for a cohort and the additional stratifications that he performs are intended to compensate for the bias introduced by his lack of "control" over the procedures of sampling, allocation, maintenance, and target detection in human populations.

Another activity performed in cohort research, however, is purely intellectual and is completely "controlled" by the investigator. As the investigator reviews the assembled data for each potential member of the cohort, he must make certain critical decisions about the timing and chronologic sequence of events in that person's life. If the investigator makes these chronologic decisions improperly, he can create a type of bias quite different from the kinds just cited.

The problems of "transition bias" are the result of selections made by nature, patients, and other doctors. The problems can be reduced or minimized by suitable identifications and stratifications, and, when possible, by appropriate randomization and advance planning. The problems of "chronology bias," however, are created by the investigator himself; and the problems cannot be resolved by any of the tactics just noted. There is no way, in fact, to get rid of "chronology bias" because it is irremediable. When it occurs, the investigator has chosen the wrong cohort.

The problems of these distorted cohorts and the difficulties created by several other logical errors in statistical procedures for cohort analysis will be the topic of our next discussion two months from now.

References

1. Bailey, N. T.: Statistical methods for biologists, New York, 1959, John Wiley & Sons, Inc.
2. Beadenkopf, W. G., Abrams, M., Daoud, A., and Marks, R. V.: An assessment of certain medical aspects of death certificate data for epidemiologic study of arteriosclerotic heart disease, J. Chron. Dis. 16:249, 1963.
3. Benjamin, B.: Health and vital statistics, London, 1968, George Allen & Unwin, Ltd.
4. Berkson, J.: Limitations of the application of fourfold table analysis to hospital data, Biomet. Bull. 2:47-53, 1946.
5. Berkson, J.: The statistical study of association between smoking and lung cancer, Mayo Clin. Proc. 30:319-348, 1955.
6. Bradford Hill, A.: Statistical methods in clinical and preventive medicine, Edinburgh, 1962, E. & S. Livingstone, Ltd.
7. Bradford Hill, A.: Principles of medical statistics, ed. 8, New York, 1966, Oxford University Press.
8. Campbell, R. C.: Statistics for biologists, Cambridge, 1967, Cambridge University Press.
9. Cochran, W. G., and Cox, G. M.: Experimental designs, New York, 1950, John Wiley & Sons, Inc.
10. Dixon, W. J., and Massey, F. J.: Introduction to statistical analysis, ed. 3, New York, 1969, McGraw-Hill Book Co., Inc.
11. Doll, R.: Retrospective and prospective studies, chap. 4, in Witts, L. J., editor: Medical surveys and clinical trials, ed. 2, London, 1964, Oxford University Press.
12. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
13. Feinstein, A. R., Spagnuolo, M., Jonas, S., Kloth, H., Tursky, E., and Levitt, M.: Prophylaxis of recurrent rheumatic fever. Therapeutic-continuous oral penicillin vs. monthly injections, J. A. M. A. 206:565-568, 1968.
14. Feinstein, A. R.: Clinical epidemiology. II.

The


demiologic

Ann. Intem.

identification rates of disease,

A.

15. Feinstein,

The

epidemiology.

Clinical

R.:

clinical design of statistics in therap)

The epidemiology

C. R.:

clinical course:

demarcations,

poral 2

IT

of cancer therapy. II. Data, decisions, and tem-

Arch.


CHAPTER 7

Sources of 'chronology bias'

This chapter originally appeared as "Clinical biostatistics XI. Sources of 'chronology bias' in cohort statistics." In Clin. Pharmacol. Ther. 12:864, 1971.

In the previous paper of this series, 13 a cohort was defined as a group of people who are followed forward in time to observe the effects of a maneuver to which they have been exposed. The results are usually expressed statistically as a ratio, or rate, in which the denominator of the cohort contains the number of people exposed to the maneuver, and the numerator contains the number of those people who later developed a particular target event.

The type of target event that is sought in the "subsequent state" and counted in the numerator depends on the maneuver under investigation. In a cause-cohort, the maneuver is exposure to an alleged cause of disease, and the target event is the development of that disease. In an intervention-cohort, when the maneuver is an agent of prophylactic therapy, the target event is a condition to be prevented (such as a disease or death); and when the maneuver consists of remedial therapy, the target is a change in an existing condition. In a course-cohort, the maneuver is exposure to time or to the course of an established disease, and the target event may be a change in an existing condition or the development of a new one.

The "initial state" of the people contained in the denominator of a cohort also depends on the maneuver. In studies of cause of disease or of certain types of prophylactic intervention, the cohort is deliberately chosen to be initially healthy, or at least free from the disease that is the target event. In studies of course of a disease or of interventional maneuvers that may alter the course, the cohort members must be shown to have that disease. In certain studies of ontogenetic development, the cohort members must be initially healthy, whereas in other studies of effects associated with the passage of time, the cohort is taken from the members of a general population in a particular geographic region.

For scientific conclusions about the results, cohorts are regularly compared both "internally" to assess effects of the different maneuvers to which they were exposed, and "externally" to extrapolate the results to the antecedent populations from which the cohorts were derived. As discussed previously, 13 these comparisons are often made defective by "transition bias" in the formation and subsequent observation of the cohorts.

The validity of external extrapolations may be impaired if the compared cohorts received disproportionate quantities of members from different prognostic strata during the transfer of each member from general population to base population to parent population to the candidate population from which the cohorts were selected. After the candidate populations are associated with the maneuvers that determine the cohorts, the validity of internal comparisons may be impaired if the contrasted cohorts are unequal in their initial prognostic susceptibility to the target event, in the performance of the compared maneuvers, or in opportunity for detection of the target event.

These diverse sources of transition bias can be adjusted or avoided if suitable data are obtained to permit analysis of the cohorts within different strata of people who are similar in prognosis, in performance of maneuvers, and in detection of targets. For prognostic analysis, the stratifications should be chosen according to clinical, psychic, genetic, or other factors that have been found to be distinctly predictive for the target event. The strata may be ineffectual if selected arbitrarily according to demographic data that are conveniently available but that have no cogent prognostic correlations.

Another type of transition problem is the "migration bias" that can occur in rates of target events for the general population of a particular geographic region. The effects of immigration and emigration make the population an unstable cohort, and the subsequent total rates may be biased unless they can be suitably adjusted for exchanges of migration, or unless the rates in the two migrant strata are shown to be similar to those of the stable stratum.
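The arithmetic of migration bias can be illustrated with a minimal sketch. All figures are invented for illustration, and the function name `rate_per_1000` is a hypothetical helper, not something taken from the text. The sketch assumes an equal exchange of emigrants and immigrants, so the denominator stays the same while the numerator changes.

```python
def rate_per_1000(deaths, population):
    """Target-event rate expressed per 1,000 members of the denominator."""
    return 1000 * deaths / population

# Hypothetical region with no migration (assumed figures).
population = 10_000
deaths = 100

# Equal-sized exchange (assumed figures): the denominator does not change.
# emigrant_deaths = deaths that would have been counted had the leavers stayed.
emigrants, emigrant_deaths = 1_000, 10    # leavers assumed as healthy as the stable stratum
immigrants, immigrant_deaths = 1_000, 40  # arrivals assumed to be sicker

rate_without_migration = rate_per_1000(deaths, population)
rate_with_migration = rate_per_1000(
    deaths - emigrant_deaths + immigrant_deaths,
    population - emigrants + immigrants,
)

print(rate_without_migration, rate_with_migration)
```

Under these assumptions the total rate rises although nothing has changed in the stable stratum, which is why the rates in the migrant strata would need to be examined separately.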

These different problems in transition bias are an inevitable concomitant of the effort to study free human populations whose patterns of movement and personal decisions cannot be governed by the investigator. The investigator's best hope in these circumstances is to be aware of the potential for bias, to assemble appropriate data for checking the bias, and to analyze the results with suitable adjustments.

A quite different type of bias in cohort research, however, is not an inevitable feature of human life, and is entirely an intellectual creation of the investigator. This type of bias can be produced as the investigator deals with the complex subtleties of the diverse chronologic features that must be considered when cohorts are assembled and analyzed.

A. Chronologic issues in cohort research

At least four different aspects of time are involved in cohort research. One of these, which depends on the investigator, is the chronologic relationship between the events that transpire in the cohort and the time when the investigator collects the data used for research. The other three features of time depend on the members of the cohort. These features refer to a person's position in relation to the calendar, to his own life, and to his exposure to whatever maneuver is under investigation.

1. Investigative time. In studying a cohort, an investigator can begin his research before or after the members of the cohort undergo their exposure to the maneuver and its consequences. With either type of temporal approach, the cohort is followed forward from its initial state to its subsequent state, but if the investigator begins the research beforehand, he can collect the data according to an advance research plan, whereas if he begins the research afterward, he must collect his data from information observed and recorded by other people. Thus, if an investigator in 1950 decides to follow a cohort assembled in 1951-1952, he can try to devise special techniques for examining the members of the cohort and for obtaining the data. On the other hand, if an investigator in 1970 decides to follow that same 1951-1952 cohort, he must get his research data from existing records or from whatever other information had been noted during 1951-1952 and thereafter.

The temporal distinction refers to the time when the phenomena observed in the cohort are converted from their original descriptive data into the data collected for the research. This conversion can be planned before or after those phenomena take place. In both instances, the same cohort is followed in the same forward direction, and if the original data have been well observed and recorded, the investigator who does the research afterward should find the same results as the one who plans beforehand. For example, if an investigator wants to know the one-month mortality rate for premature babies at a particular hospital during 1951-1952, and if the hospital records are satisfactory, the investigator should presumably get the same results regardless of whether he planned his survey during 1950 or 1970.

The words prospective and retrospective are sometimes applied to this temporal distinction in data collection, but these same words have also been applied, with quite different scientific and statistical connotations, to the forward or backward direction in which a population is followed. Such distinctions are particularly important in studies of etiology of disease. The longitudinal investigations are often called "prospective" if the investigator performs the type of research described here, by following cohorts who are exposed or not exposed to a cause of disease, and by determining the rate at which the disease develops in the two cohorts. The etiologic research is often called "retrospective" if the investigator does not study cohorts. Instead, he begins with a group of diseased people and a contrived "control group" and he then looks backward to ascertain the rate at which the two groups had been exposed to the alleged cause.

The words prospective and retrospective thus have two distinctly different connotations according to the way in which they are used for investigative time or for investigative direction. In reference to collection of data, the -spective words deal with the order of timing for two different sets of events: whether the plan for assembling the research data precedes or follows the occurrence of the phenomena studied in the research. In reference to the direction of a temporal sequence, the -spective words deal with the forward investigation of members of a cohort toward the subsequent target event, or the backward investigation of a group of people in whom the target event has already occurred.

In the first usage of -spective, the main investigative hazard of "retrospection" is in scientific quality of data. Because the investigator could not "control" the original methods by which the population was observed and described, he may find that the recorded data contain diverse omissions, imprecisions, or other inadequacies. In the second usage of -spective, the main investigative hazard of "retrospection" is in scientific logic. The sequential direction of phenomena has been completely reversed, so that the investigator looks not from cause toward effect, but vice versa. 8 The problem of scientific quality in data is always present in human research, regardless of the direction in which a population is followed, but when the population is followed backward the investigator inverts the logic of science and creates new problems in bias that transcend any of the "transitional" difficulties discussed previously, or the "chronologic" issues to be discussed here.

These additional types of "inversion bias" are too complex for further consideration now, and will be reserved for a later paper in this series. The main point to be noted here is that something must be done about the incompatible disparities in the two different uses of the words prospective and retrospective. The ambiguities created by the disparate connotations are confusing and destructive to clear thought. To eliminate these ambiguities, I propose that the -spective terms be eliminated from both types of usage. We will then need two new sets of words to denote the timing of data collection and the directional pursuit of a population.

Several alternate terms have already been suggested 1 for differentiating prior versus later plans for collection of research data. My own preference is to base this distinction on the Latin legere (to gather, collect, assemble) with its past participle lectus. If the research data are collected with advance planning before the events take place, the project can be called prolective; if the research data are collected from information that was observed and recorded before the project began, the project can be called retrolective.
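The prolective/retrolective distinction depends only on whether the plan for data collection precedes the events being studied. A minimal sketch follows, assuming that each project can be summarized by a single plan date and a single date for the first cohort event. The function name `classify_data_collection` is hypothetical, and projects planned partway through the cohort's events are not handled.

```python
from datetime import date

def classify_data_collection(plan_date, first_event_date):
    """Label a project by whether its research plan precedes the cohort's events."""
    return "prolective" if plan_date < first_event_date else "retrolective"

# The 1951-1952 cohort from the example in the text; exact days are assumed.
cohort_start = date(1951, 1, 1)

planned_in_1950 = classify_data_collection(date(1950, 6, 1), cohort_start)
planned_in_1970 = classify_data_collection(date(1970, 6, 1), cohort_start)

print(planned_in_1950, planned_in_1970)
```

In both cases the cohort itself is followed forward; only the timing of the research plan differs.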

As for the directional pursuit of a population, the name cohort already exists and seems satisfactory for describing a group of people who are followed forward. An appropriate name for a group of people who are followed backward will be considered later in this series of papers, during a more extended essay on the hazards of this type of inverted logic. The remainder of the current discussion is reserved for problems in the chronology of cohorts, regardless of whether the research is prolective or retrolective.

2. Secular time. Secular time, or calendrical time, refers to a particular date on a calendar, or to the interval between such dates. In some usages, the term epoch refers to an individual calendrical date, such as a particular day or year; and the term era is used to refer to the calendrical interval that spans two epochs. Thus, the decade between the epochs of 1920 and 1929 could be called the era of the 1920's. A change that is noted between epochs or between eras is called a secular change, and, if the change shows a monotonic direction, a secular trend. Thus, the annual mortality rate for tuberculosis has shown a downward secular trend during the 20th century. The survival rate for patients with Hodgkin's disease 19 allegedly shows a rising secular trend during the past few decades, but the survival trend during the same era for breast cancer 18 has allegedly shown little or no rise.

A single cohort is usually uni-secular in that the pre-maneuver initial state of all of its members (as counted in the denominator of the statistics) occurred during a limited calendrical interval. To search for secular change in the effects of the same maneuver, results are often compared in uni-secular cohorts from different epochs or eras. Such comparisons are particularly common for course-cohorts and intervention cohorts. Thus, a uni-secular cohort from 1940-1949 might be compared with a uni-secular cohort from 1960-1969 to determine whether mortality rate has risen for coronary artery disease in the general population of a geographic region, or whether survival rate has improved for treatment of a particular cancer.

The main hazard of such comparisons arises from the different forms of transition bias described previously. 13 Although the investigator may believe that time itself is the only feature that has changed, other secular changes may have affected the proportions of different prognostic strata assembled in the denominator, the detection of target events in the numerator, and many important aspects of ancillary performance in whatever maneuver is under investigation.

Thus, the mortality rate for a particular disease at a particular region might be altered not by the progress of secular time but by different aspects of migration in the regional cohorts and by different techniques of detecting the disease. The survival rate for the treatment of cancer may have changed not because of secular improvements in anti-neoplastic therapy, but because newer techniques of diagnosis have altered the prognostic strata exposed to treatment, and also because newer methods of treating co-morbid ailments have prevented the deaths caused by diseases other than cancer.

When the "same" maneuver is compared in cohorts of different secularity, therefore, the comparison may seem overtly "fair" in its chronologic aspects, but subtle sources of bias may arise from the transition problems cited earlier in transmigration, in prognostic strata, and in detection of target.

When cohorts of different secularity are used to compare different maneuvers, rather than the same maneuver, the comparison is usually promptly recognized as unfair because a change associated with secular transitions might be attributed erroneously to a change in the corresponding maneuvers. For example, because of temporal improvements in diagnosis and in ancillary therapy, patients with acute myocardial infarction today generally have better survival rates than such patients treated before the introduction of anticoagulants in the early 1930s. Consequently, when the outcome of a non-anticoagulant-treated pre-1930 cohort is compared with that of an anticoagulant-treated post-1950 cohort, the more recent cohort will usually show better results, but the difference might have little or nothing to do with the anticoagulants. 10 For this reason, compared cohorts should always be uni-secular unless the maneuver under consideration is secularity itself.
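The way a secular shift in prognostic strata can mimic a therapeutic improvement can be sketched numerically. The counts below are invented, and the helper name `overall_survival` is hypothetical. Survival within each prognostic stratum is held identical in the two eras, and only the proportions of the strata are changed, as might happen if newer diagnostic techniques brought more good-prognosis patients to treatment.

```python
def overall_survival(strata):
    """strata: list of (patients, survivors) pairs; returns the pooled survival proportion."""
    patients = sum(n for n, _ in strata)
    survivors = sum(s for _, s in strata)
    return survivors / patients

# (patients, survivors) per stratum, in the order [good prognosis, poor prognosis].
era_1940s = [(200, 160), (800, 160)]  # assumed: mostly poor-prognosis patients
era_1960s = [(600, 480), (400, 80)]   # assumed: mostly good-prognosis patients

stratum_rates_1940s = [s / n for n, s in era_1940s]
stratum_rates_1960s = [s / n for n, s in era_1960s]
pooled_1940s = overall_survival(era_1940s)
pooled_1960s = overall_survival(era_1960s)

print(stratum_rates_1940s, pooled_1940s)
print(stratum_rates_1960s, pooled_1960s)
```

With these assumed counts the pooled survival rate rises between eras even though survival within each stratum is unchanged, so the apparent secular improvement is produced entirely by the altered composition of the denominator.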

3. Life time. The age of each member of a cohort may be regarded as life time or generational time. A group of people who were born in the same year, or within a few years of each other, may be called a generation cohort or an age-specific cohort. When an ontogenetic occurrence, such as weight gain in newborn infants, is measured from the date of birth, the cohort obviously contains members whose initial-state age is similar. In many other circumstances, however, a cohort containing people of different ages may be divided into strata (or sub-cohorts) that are "age-specific." This type of age-stratification can be performed for at least three different purposes.

One type of age-stratification is used mainly for descriptive separation, as a way of statistically "matching" the age of cohorts that are being compared for a maneuver other than age. For example, if we want to know whether people who lived in Connecticut in 1967 had better survival rates than those who lived in Florida, the maneuver under comparison is living in Florida versus living in Connecticut. To help get a fair comparison and to reduce the effects of transmigration, we might then contrast the results in different age groups (such as strata of people below age 45, 45-64, and above 64) in those two regions during 1967.

There are two main hazards of bias in such comparisons. The first is that the size of each cohort (or stratified sub-cohort) in the denominator is estimated rather than counted from decennial census data. The estimate for an intercensal year might be quite inaccurate if a substantial imbalance existed between immigration and emigration. A second hazard is not the numerical accuracy of the denominator, but the bias caused by disparate death rates in the two migrant populations. 13 Thus, if many healthy people in the age group from 45-64 moved to Connecticut from Florida, and if the same number of sick people in that age group moved to Florida from Connecticut, the denominators of the age-cohorts would be unaltered, but a major change could take place in the numerator data for deaths. Connecticut might then appear "healthier" than Florida, but only because of "migration" bias in the rates of death for the migrant members of the cohorts.

Another type of age-stratification is deliberately prognostic, to separate distinctions of target susceptibility among people of different ages. Thus, if young and elderly adults with maturity-onset diabetes mellitus have disparate survival rates, an age-stratification would be needed to evaluate the mortality effects of different forms of treatment. The main problem in this type of age-stratification is that it may not be cogently prognostic. 10 As discussed previously, 13 age is often a favorite variable for statistical stratifications because the data about age are usually reliable and easily available, but age is often not as prognostically important as certain clinical and co-morbid features of a diseased cohort, or psychic and genetic features of a healthy cohort. In such circumstances, a stratification based on age alone may be statistically impressive but biologically trivial. Unless the rate of the target event shows cogent changes with the different age strata, the "age-specific" rates may produce statistical fertilization without scientific fruit.

A particularly common type of age-stratification is performed for circumstances, such as the determination of general mortality rates, in which the cohort is a general population rather than a selected group of people who are either healthy or diseased. Since age is an important prognostic factor for mortality in a general population, the crude death rate (i.e., the number of observed deaths per number of people in that population) may be misleading because it contains no provision for the proportionate age distribution of the population. Consequently, although populations A and B have the same age-specific death rates in young and old people, population A, which contains a large proportion of elderly people, might have a substantially higher crude death rate than population B, which contains predominantly younger people. To avoid the bias contained in the crude death rates, we would therefore want to compare the rates for A and B within younger and older strata of the population.
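The contrast between crude and age-specific death rates for populations A and B can be made concrete with a small sketch. The population sizes and death counts are invented for illustration, and the helper name `per_1000` is hypothetical. Both populations are given identical age-specific rates and differ only in their age distributions.

```python
def per_1000(deaths, people):
    """Death rate per 1,000 people."""
    return 1000 * deaths / people

# Assumed counts: {stratum: (people, deaths)}.
population_a = {"young": (2_000, 4), "old": (8_000, 400)}   # mostly elderly
population_b = {"young": (8_000, 16), "old": (2_000, 100)}  # mostly young

def age_specific_rates(pop):
    return {age: per_1000(deaths, people) for age, (people, deaths) in pop.items()}

def crude_rate(pop):
    people = sum(n for n, _ in pop.values())
    deaths = sum(d for _, d in pop.values())
    return per_1000(deaths, people)

specific_a, specific_b = age_specific_rates(population_a), age_specific_rates(population_b)
crude_a, crude_b = crude_rate(population_a), crude_rate(population_b)

print(specific_a, crude_a)
print(specific_b, crude_b)
```

With these figures the age-specific rates agree exactly, yet the crude rate for A is several times the crude rate for B, purely because of the difference in age composition.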

Despite scientific clarity, these comparisons of multiple strata do not have the statistical virtue of assessments based on a single index. Accordingly, in quest of such an index, epidemiologic statisticians regularly "correct" the age-specific rates with an "adjustment" that will allow the individual rates to be added together to form a single value. This type of "correction" is responsible for the "age adjusted standard rates" that are a mainstay of epidemiologic statistics.

The adjustment depends on the distribution of ages and deaths in a selected "standard" population, and can be direct or indirect according to the data available for the observed population that is to be "corrected." The details of these two methods are beyond the scope of this discussion, but they can be outlined as follows: In the direct method, the death rates are known for the age-strata of the observed population; these rates are then multiplied by the proportionate distribution for each corresponding age-stratum of the standard population; and the sum of these products is the corrected death rate. In the indirect method, only the crude death rate and proportionate distribution of ages are known for the observed population: the observed age-proportions are then multiplied by the death rates for each corresponding age stratum of the standard population; the sum of these products is divided into the crude death rate of the standard population; and the result is used as a multiplicative factor that "corrects" the crude death rate of the observed population. Bradford Hill 1 has provided a particularly clear description of the rationale for these procedures and good numerical illustrations of the calculations.

The main problem to be noted here is the bias that is inevitable in both the direct and indirect corrections. In our previous consideration of "transition bias," we noted the difficulties that can arise when disproportions occur in the multiplication of ratios whose products yield the individual values added together to produce a "rate." The same type of sum of products of ratios occurs in these death-rate "corrections." Since the "corrective" ratios depend entirely on the particular population that is chosen to be the "standard," the final results may vary dramatically according to the contents of that population. As Fox, Hall, and Elveback 13 have pointed out, "One of the major weaknesses of the adjustment procedure lies in its lack of uniqueness, or in the influence which the choice of a standard population exerts on the outcome" of the desired comparison. These authors suggest that "we do not want to compare summary rates at all" and that "no adjustment for age or other characteristic should be undertaken until the age- or other subgroup-specific rates have been studied carefully."

Despite these caveats, the adjustment procedure has not only maintained its current popularity in epidemiologic reports, but seems so well accepted that the authors often do not cite, and are not asked to indicate, which population was used for standardization and whether the method of standardization was "direct" or "indirect." In these circumstances, a scientific reader who searches for the documentation of observations is unable to determine what was actually found in the observed age strata. Despite the unskeptical enthusiasm of the investigators who present such transmuted data and the remarkable credulity of the editors who accept the material for publication, the absence of suitable descriptions makes the published results neither interpretable nor reproducible.

4. Serial time. The fourth type of time to be considered in cohort research can be called serial time or post-exposure time. It is the length of time that has elapsed serially since "zero time," which is the inception of each member's exposure to the maneuver under surveillance. Unlike secular and generational time, which can be readily identified and classified, serial time is often difficult to ascertain.
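As a numerical aside on the age adjustment outlined before the discussion of serial time, the direct and indirect methods can be sketched as follows. This is an illustrative sketch only: the function names, the two observed populations P and Q, and every rate and proportion are hypothetical, chosen so that the dependence on the "standard" population is visible.

```python
def direct_adjusted_rate(observed_rates, standard_props):
    """Direct method: observed age-specific rates weighted by the standard age proportions."""
    return sum(observed_rates[age] * standard_props[age] for age in observed_rates)

def indirect_adjusted_rate(observed_crude, observed_props, standard_rates, standard_crude):
    """Indirect method: observed crude rate times (standard crude rate / expected rate)."""
    expected = sum(observed_props[age] * standard_rates[age] for age in observed_props)
    return observed_crude * standard_crude / expected

# Hypothetical age-specific death rates (deaths per person) in two observed populations.
rates_p = {"young": 0.004, "old": 0.040}
rates_q = {"young": 0.002, "old": 0.050}

# Two candidate "standard" age distributions (assumed).
balanced = {"young": 0.5, "old": 0.5}
youthful = {"young": 0.9, "old": 0.1}

p_balanced = direct_adjusted_rate(rates_p, balanced)
q_balanced = direct_adjusted_rate(rates_q, balanced)
p_youthful = direct_adjusted_rate(rates_p, youthful)
q_youthful = direct_adjusted_rate(rates_q, youthful)
print(p_balanced, q_balanced, p_youthful, q_youthful)

# Indirect method for Q, assuming only its crude rate and age proportions are known.
q_props = {"young": 0.2, "old": 0.8}
q_crude = sum(q_props[age] * rates_q[age] for age in q_props)
standard_rates = {"young": 0.003, "old": 0.040}
standard_crude = sum(balanced[age] * standard_rates[age] for age in balanced)
q_indirect = indirect_adjusted_rate(q_crude, q_props, standard_rates, standard_crude)
print(q_indirect)
```

With these assumed figures, Q has the higher adjusted rate under the balanced standard but the lower one under the youthful standard, so the ranking of the two populations depends on which standard is chosen. This is one way of seeing why the age-specific rates themselves should be examined first.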

The ascertainment presents no major problems in an experiment, since the investigator assigns the maneuver to the members of the cohort and can readily determine when the maneuver was applied, what was the initial state of each member before the exposure, and how much time elapsed until the occurrence of subsequent events. When the experimental data are assembled to form the statistical rates of the target event, the investigator can be sure that the virgule (or "fraction line") between numerator and denominator signifies the inception and subsequent performance of the maneuver for each member of the population.

In surveys, however, the investigator does not assign the maneuver, and may have many problems in deciding either what event to regard as the initiation of exposure to the maneuver, or when that event took place. In cause-cohorts, zero time may be the onset of a maneuver such as cigarette smoking whose exact date of initiation can seldom be accurately recalled. In course-cohorts, zero time may be the date of development of a disease whose specific date of onset can seldom be discerned. In intervention-cohorts, a particular mode of treatment may or may not be the first course of therapy for the selected disease.

The problems of defining and recognizing serial time, and the absence of suitable methods for analyzing this crucial chronologic feature of cohort research are responsible for many major scientific defects in the validity of contemporary cohort statistics. These problems will occupy the remainder of this discussion.

B. Zero time and the inception cohort

Since cohort statistics depend on the transition from an initial state to a subsequent state, each member of a cohort must have a chronologic reference point at which the "initial state" is identified, and from which the subsequent follow-up period begins. The characteristics of the initial state at that reference point are used to demarcate the people who are counted in the denominators of the cohort, or in the denominators of any stratifications of the cohort. The chronologic reference point is also used to measure the duration of time elapsed before occurrence of any target events cited in the numerators. Since the reference point has so central a role in the statistical data, its improper choice for the members of a cohort can create major, irremediable sources of bias.

1. The choice of zero time. Because the purpose of cohort research is to observe the effects of a maneuver, the logical choice of a reference date is zero time: the inception of the maneuver. Such a date, which is oriented neither to calendar nor birthday, would depend on the particular time in a person's life span or clinical course at which he became exposed to the maneuver under surveillance. Because of the problems noted earlier, the choice of zero time will differ with different types of maneuver.

If the survey is an ontogenetic examination of natural events that occur in postnatal life, the choice of zero time is obvious. It is the date of birth for the members of a generation cohort. For surveys in which the maneuver is exposure to a cause of disease, a suitable zero time would be the beginning of exposure to the alleged causal agent. For surveys or trials in which the maneuver is an intervention with prophylactic or remedial therapy, zero time would be the date of that intervention.

For pathogressive surveys of "clinical course," 11 the choice of zero time is complicated by the examination of "natural history" in a group of diseased patients whose results are combined, regardless of therapy. Since some patients will have been "untreated" and since others will have received several courses of therapy, the choice of zero time should enable the patients to be mutually comparable at a date when therapy is first allowed an opportunity to affect "natural history." Accordingly, an appropriate zero time for each patient in such surveys would be either the date of the first treatment directed at the disease, or the date of the decision to give no specific therapy.

Thus, in a study of therapeutic intervention for a particular condition, zero time could be the date at which the intervention began, but in a study of "clinical course" for the same condition, zero time might be either the date of the first therapeutic intervention or of the decision against specific therapeutic intervention. For example, if we wanted to study the effects of radiotherapy in lung cancer, zero time would be the onset of the radiotherapy for each patient.

To evaluate the results properly, we might then have to stratify the initial state of the patients according to surgical or other previous therapy, and also according to the "stage" of prognostic anticipations before radiotherapy began. On the other hand, if we wanted to study the clinical course of lung cancer, 11 zero time would be the first mode of treatment (selected among surgical exploration, radiotherapy, chemotherapy, or no anti-neoplastic therapy). We would then need to stratify the initial state mainly according to prognostic stages, since therapy had not yet occurred.

each person's exposure to the maneuver under study. The choices did not depend on age or on secular aspects of the calendrical interval in which the cohort was chosen. These secular boundaries, however, determine the initial "parent population" from which the investigator selects the "candidate population" for his cohorts.

In any type of cohort research, the

appropriate

For the

course of certain

clinical

ments, the date of diagnosis

same

as the date of the

the

since

cision,

first

is

ail-

often the

therapeutic de-

two events mav occur

concomitantly. Thus, a diagnosis of rheumatoid arthritis is usually associated with a simultaneous decision about

its

therapeu-

members

are assembled because they appeared at an site

during a particular ealen-

drical interval in

interested.

If

which the investigator

is

the research deals with the

D found during 1955-1959 H, the cohort will be assembled from people noted to have the disease during that secular interval at the cited course

of

disease

at Hospital

hospital. If the research deals with the ef-

after diag-

smoking in people who were healthy adults during 1950-1954, the cohort will be assembled from the healthy adult cigarette smokers who answered ap-

nosis until specific pre-thcrapeutic require-

propriate questionnaires during 1950-1954.

tic

management, and

so the date of diag-

nosis might be an appropriate zero time. In

other circumstances,

however, the choice

may be delayed

of treatment

ments are evaluated. Thus, the treatment chronic infections

of

sults

may

re-

of tests for the infecting organism's

sensitivity to antibiotics,

of a cancer bility

await the

and the treatment

may be delayed

until the possi-

and extensiveness of metastasis are

fects of cigarette

Despite this

first

therapeu-

calendrical secularity

major calendrical disparities

in

the

date

which the individual zero times

at

oc-

curred.

The people who

determined. In the case of the infection

and the cancer, the date of

common

manner of assembly, however, the members of the cohorts may have had in their

particular

are uni-secular for a

maneuver are not

necessarily uni-

may not begin until an acute ailment has led to the detection of the disease. For example, in dia-

maneuver. Thus, in a causecohort, the adults who were cigarette smokers during 1950-1954 had not all begun smoking during that time. Some people had started smoking many years before 1950-1954, whereas other people had

betes

begun more

tic

decision

is

often the best choice of zero

time for pathogressive surveys of "clinical course." For certain chronic diseases, the study of "chronicity" after

nosis

mellitus,

the patient's

initial

and treatment may occur

diag-

in the hos-

during an episode of acidosis, and the conventional plans of daily regulation pital

at

home may not be

established

until

afterward. Consequently, for a survey of

the clinical course of diabetic patients

first

diagnosed in a hospital, 17 a good choice of zero time might be the date of discharge all

the c

time

of the situations just described, es

\\

of zero time to begin serial

oriented

to

the

recently. Similarly, in a course-

cohort, the people

who

are admitted to the

same hospital with the same disease during the same calendar interval have nevertheless had the disease for varying lengths of time. For some patients, the therapeutic maneuver assigned during that admission was the first course of treatment for the disease, but for other patients, the treat-

ment may have been a second,

after that hospital admission.

In

serial for that

inception of

third,

or

later episode of therapy.

In addition to this problem of uni-seriality for the maneuver, a problem of uni-zonality exists for a group of patients assembled at a hospital or other medical center. The bias of multi-zonality was considered previously in reference to transmigration for a general population in a particular region. The rate of subsequent deaths or other events in the apparent cohort could be distorted by disproportions in the rates with which healthy and sick people enter or leave the region. Analogously, the rates of post-therapeutic events for the diseased population at a particular hospital may be biased by the patterns of admission and referral to that hospital. For example, a large medical center that provides special types of radiotherapy or chemotherapy for cancer patients may receive many pre-moribund patients who are referred after having had "unsuccessful" surgical treatment at a neighboring "satellite" hospital. If data for these referred post-surgical patients are added to the data for patients whose surgery was first performed at the medical center, the combined post-surgical results may seem worse at the medical center than at the "satellite."

2. The inception cohort. The diverse forms of bias that can result from these different patterns of serial exposure and zonal migration after zero time can be avoided if the group under study is an inception cohort. An inception cohort consists of a group of people in whom "zero time" for the investigated maneuver occurred at the same investigative locale and during the same secular interval.

Thus, if the maneuver is the ontogenetic risk of death during 1967 for the general population of people in Connecticut who were 45-64 years of age on their preceding birthday, an inception cohort would consist of the people with that age and in that region on January 1, 1967, regardless of whether any members of the cohort subsequently moved away or became older than 64. The inception cohort would exclude any appropriately aged "immigrants" who moved to Connecticut during the year, or people already living in Connecticut who became 45 years old after January 1, 1967. If the maneuver is the first course of treatment during 1953-1959 for patients with lung cancer at either of two "index" hospitals,11 an inception cohort would consist of all those people whose first course of treatment took place at either one of those two hospitals during the cited secular interval. The inception cohort would exclude people with lung cancer who were studied in one or both of the two "index" hospitals during the cited interval, but whose first treatment had taken place at some other hospital or at a date before 1953 or after 1959.

According to these specifications, an inception cohort is uni-serial, uni-zonal, and uni-secular. The members are uni-serial in that the initial state cited for each member in the denominator refers to his condition at zero time, when the maneuver began; the cohort is uni-zonal in that the maneuver for each member was initiated in the particular zone (city, state, hospital, or other medical setting) at which the cohort was assembled; and the cohort is uni-secular in that the uni-serial, uni-zonal zero time took place within the secular interval of the assembly.

The bias that can occur when cohorts are multi-zonal or multi-secular has already been discussed. Such bias is caused by the "transition" problems described earlier, and the bias can be reduced or eliminated with appropriate attention to stratifications and other suitable procedures that can separate zonal or secular prognostic distinctions within the observed cohort. The remainder of this discussion is devoted to the bias that can occur when a cohort is multi-serial rather than uni-serial, or when a uni-serial cohort is "aborted" by subsequent elimination of some of its starting members. Such bias cannot be removed by stratification or by any other procedure of taxonomy or mathematics. The bias is irremediable if the observed cohort has been permanently distorted by the omission of unknown numbers of people whose data are disparate from the reported results.
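The three requirements of an inception cohort can be sketched as a simple membership test. The sketch below is only an illustration under stated assumptions: the record fields (`first_treatment_date`, `first_treatment_site`), the hospital names, and the sample patients are hypothetical and do not come from the text. Uni-seriality is represented by using each patient's first course of treatment as zero time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Patient:
    patient_id: str
    first_treatment_date: date   # zero time: start of the FIRST course (hypothetical field)
    first_treatment_site: str    # where that first course was given (hypothetical field)


def in_inception_cohort(patient, index_sites, start, end):
    """Return True if the patient's zero time is uni-zonal and uni-secular.

    Uni-serial: zero time is taken as the first course of treatment.
    Uni-zonal: that first course took place at one of the index sites.
    Uni-secular: it took place within the secular interval [start, end].
    """
    uni_zonal = patient.first_treatment_site in index_sites
    uni_secular = start <= patient.first_treatment_date <= end
    return uni_zonal and uni_secular


# Hypothetical patients, all seen at an index hospital during 1953-1959.
patients = [
    Patient("P1", date(1955, 3, 1), "Hospital X"),
    Patient("P2", date(1951, 6, 15), "Hospital X"),   # first treated before the interval
    Patient("P3", date(1957, 9, 30), "Hospital Z"),   # first treated elsewhere
    Patient("P4", date(1959, 12, 31), "Hospital Y"),
]
index_sites = {"Hospital X", "Hospital Y"}

cohort = [
    p.patient_id
    for p in patients
    if in_inception_cohort(p, index_sites, date(1953, 1, 1), date(1959, 12, 31))
]
print(cohort)
```

Patients P2 and P3 are excluded because they are survivors or referrals from another cohort, even though they might have been observed at an index hospital during the study interval.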

C. The bias of "survival" cohorts

Unless a uni-secular cohort is restricted to members whose zero time occurred during that secular interval, many of the members will actually be survivors of other cohorts with zero times that occurred secularly at an earlier date. For example, if we decide to study the subsequent clinical course of a group of patients noted to have lung cancer at a particular hospital during 1953-1958, any such patients whose first treatment occurred during 1949-1952 are survivors of the earlier cohort. If the survival rate for the newly treated patients is the same as that for those who have already survived several years after treatment, then we need not attempt to distinguish the "old" and the "new" groups. But if the survival rates are different at different serial times after zero time, failure to make the distinctions can be disastrous.

Let n be the number of people in whom the disease began, let q be the short-term survival rate, and let r be the long-term survival rate. The number of people who die early is (1 - q)n. Since the short-term deaths have already occurred among the people with disease D, the assembled cohort will be drawn from qn people, rather than n people in whom the disease began. Of those qn people, (q - r)n will die during the long-term period, leaving rn survivors. When the investigator calculates the long-term survival rate for his observed cohort, he divides the survivors by the number of people with which he began, and he obtains rn/qn = r/q. If he had started with an inception cohort, however, the correct survival rate would have been r, not r/q. By using a "survival cohort," he has biased the true long-term survival rate by a factor of 1/q, where q is the short-term survival rate.

If an investigator compares two cohorts that have the same long-term survival rate of r, but different short-term survival rates, q_1 and q_2, the spurious long-term rates that will be calculated for the two cohorts are r/q_1 for one cohort and r/q_2 for the other cohort, and the ratio of these spurious rates will be q_2/q_1. Thus, if both cohorts have a true long-term survival rate of 20%, but the short-term survival rate is 40% for one cohort and 80% for the other, the observed long-term rates will be .20/.40 = 50% for the first cohort and .20/.80 = 25% for the second. The observed long-term rate for the first cohort will therefore seem twice as high as that for the second cohort, even though the true rates for the two cohorts are actually the same.

Now suppose two cohorts contain a mixture of two subgroups with different short-term survival rates, q_1 and q_2, and the same long-term rate, r. Suppose the proportion of people from the first subgroup is k_A in Cohort A; the proportion of people from the second subgroup in Cohort A will be 1 - k_A. Suppose the corresponding proportions in Cohort B are k_B and 1 - k_B. Because the long-term rates in the two subgroups are the same, the long-term results in Cohorts A and B should be the same if they were selected as inception cohorts, regardless of the proportions k_A and k_B. If the cohorts were selected after a post-inception short-term interval, however, the observed survival rates will be as follows:

For Cohort A, R_A = k_A(r/q_1) + (1 - k_A)(r/q_2)
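The arithmetic of the "survival cohort" bias can be checked with a short sketch. The function names below are hypothetical. The mixed-cohort function assumes that k is the proportion of the first subgroup among the people actually assembled after the short-term interval; that reading is an assumption, since the formula is damaged in the source.

```python
def observed_long_term_rate(r, q):
    """Long-term survival rate seen in a cohort assembled from short-term survivors.

    r: true long-term survival rate of the inception cohort
    q: short-term survival rate (fraction still alive when the cohort is assembled)
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not 0 <= r <= q:
        raise ValueError("r must satisfy 0 <= r <= q")
    # rn survivors are divided by the qn people with whom the investigator began.
    return r / q


def mixed_cohort_rate(k, r, q1, q2):
    """Observed rate for a survival cohort mixing two subgroups.

    Assumption: k is the proportion of the first subgroup in the assembled
    cohort, and both subgroups share the same true long-term rate r.
    """
    return k * observed_long_term_rate(r, q1) + (1 - k) * observed_long_term_rate(r, q2)


# Numbers from the text: true long-term rate 20%, short-term rates 40% and 80%.
first = observed_long_term_rate(0.20, 0.40)
second = observed_long_term_rate(0.20, 0.80)
print(first, second, first / second)
```

With these inputs the first cohort appears to have twice the long-term survival of the second, although the true rate is 20% in both.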

\

Credulous idolatry and randomized allocation

[...] trial.

The statistical point that has just been discussed leads us to the clinical point that is the unalterable, cardinal flaw of all trials performed with randomized allocation of therapy. The results are highly limited. They may not pertain to anyone other than the group of people who participated in the trial. If these people are not a random sample of the disease or condition under treatment, the results cannot be validly extrapolated for statistical purposes, but a clinician may not care about this issue. The extrapolation that concerns him is clinical. If the treated population was not a random sample, whom did the patients represent? And to what kind of patients can the results apply?

Amid the many randomized therapeutic trials that have been performed in the past few decades, I do not know of a single investigation in which the patients under study were chosen randomly. The participating patients have been the "chunk samples" of people found in hospitals, out-patient clinics, doctors' offices, and other sites convenient to the investigators. The participants have never been chosen randomly from the parent population of patients who [...]

This difficulty regularly occurs for patients in the diverse clinical spectrums found during therapy for many acute medical ailments as well as for all chronic diseases, such as cancer, diabetes mellitus, pulmonary emphysema, and coronary artery disease. We cannot be sure that the clinical characteristics and spectral distribution of patients with such ailments as pneumonia, functional bowel distress, stroke, or inoperable cancer are the same for a private referral clinic as for a municipal ward service. We cannot be sure that the types of patient treated by Dr. A, who performed the trial, are the same as those of Dr. B, who plans to apply the results. In some circumstances, we cannot even be sure that the results found in one subset of patients are pertinent to what was noted in another subset of patients within the same trial. (The major contrast that can arise for patients in the same multi-center trial was demonstrated in the UGDP study, where the fatality rates, regardless of therapy, ranged from 1% (1/87) in one participating center to 26% (23/90) in another.)

To be able to apply the conclusions of a clinical trial reproducibly, clinicians must have satisfactory data to identify the post-therapeutic results found in the different kinds of patients under treatment. The patients cannot be suitably identified if they are described only according to a diagnostic label that does not denote important clinical distinctions of their condition before and after treatment. The crucial role of the cogent stratifications described earlier and previously12 is to permit these distinctions to be designated. When such stratifications are absent, a clinical reader has no way of knowing whether and how the results of a trial can be extrapolated. Even when the stratifications are well performed, the results may still be difficult to apply if the trial was designed with overly rigid admission criteria that created a "pure-disease sample," a "compliance sample," or some other highly restricted group that differs greatly from the patients encountered in ordinary clinical practice.

For both these reasons (the clinically constricted spectrum of the admitted patients and the failure to perform suitable stratification of those who were admitted) many randomized trials have been performed as a type of in vitro experiment that must be followed by extensive in vivo tests. The trial may show the relative efficacy and safety of the therapy for the group of people under scrutiny, but the results may not be readily interpretable or applicable to patients elsewhere. Even when the data are suitably stratified, the number of patients in the strata may be too small for satisfactory generalization. Consequently, to make appropriate interpretations and applications, clinicians must rely either on additional clinical trials or, more commonly, on surveys of the results noted when the treatment becomes used in the diverse types of patients who may have been excluded from the trial or inadequately identified during it.

These intellectual and clinical problems are regularly ignored when sweeping conclusions are zealously drawn from either a single trial or a relatively small number of trials. The careful formation of scientific conclusions is particularly important for the regulatory agencies and other influential members of the medical community who make decisions about the efficacy, safety, and applicability of new (or old) forms of treatment.

1. Issues in efficacy. A single trial may demonstrate that an agent is efficacious, but cannot prove the absence of efficacy. As shown in the earlier example of an agent that worked well for 1 of every 25 people, the trial may not have included enough members of the strata of "responsive" patients. A trial may also give a falsely "negative" result because of the "cancellation" created by opposing effects in different strata, or because of other forms of insensitive design, as discussed by Cromie.7 Even if efficacy was demonstrated, the results of the investigated group or of other patients must be carefully analyzed to determine whether the efficacy pertains to all the treated patients or only to certain strata.

2. Issues in safety. No clinical trial can prove that an agent is safe. The trial can show only that the agent brought no harm to the people who received it. The agent may not have been maintained long enough for toxic effects to develop, or the trial may not have included the particular kinds of patients for whom the agent is unsafe. For example, in a clinical trial confined to men and post-menopausal women, thalidomide might readily be shown to be an effective, harmless sedative. Conversely, when lack of safety is implied in a clinical trial, the data must be carefully examined to determine whether the "adverse reactions" were actually due to the indicated agent and whether the reactions occurred in all the treated patients or just in certain susceptible strata.

3. Issues in applicability. For all the foregoing reasons, the ultimate safety, efficacy, and applicability of therapeutic agents cannot be established only with randomized clinical trials and must be determined from supplementary data. The results of a single trial or of a few trials cannot possibly encompass the types of patients and reactions that will

be encountered when a therapeutic agent comes into general use. The trials can serve only to indicate that a new agent should be allowed into the clinical "market" or that an old agent should be reappraised, but the trials do not necessarily allow definitive conclusions. Only the additional information of general clinical experience (with or without randomized trials) can provide a satisfactory view of the broad scope of human illness that is merely glimpsed, and sometimes distorted, by the investigative artefacts imposed during the trials.

The ability to generalize the results of randomized trials would be greatly enhanced if better techniques were developed for identifying and stratifying clusters of patients with distinctive prognostic characteristics. Many current arguments about safety and efficacy could be readily resolved if the anecdotal opinions reported as clinical experience, the non-random evidence assembled in surveys, and the imprecise mixtures tabulated in randomized clinical trials were all improved to allow a specification of distinctive clinical strata and an analysis of results within those strata.

In the meantime, however, randomized trials will continue to be an essential ingredient in therapeutic progress. Assuming that all other plans for a trial have been made in a satisfactory manner, randomization (preferably within well-demarcated prognostic strata) offers the best current method for allocating the sequence of therapy. The idea of randomized allocation will retard therapeutic progress, however, if randomization continues to be regarded as the principal ingredient in that progress, and if all of the more important clinical and scientific components continue to be neglected.

References

1. Chalmers, T. C., Block, J. B., and Lee, S.: Controlled studies in clinical cancer research, N. Engl. J. Med. 287:75-78, 1972.
2. Cobb, L. A., Thomas, G. I., Dillard, D. H., Merendino, K. A., and Bruce, R. A.: An evaluation of internal-mammary-artery ligation by a double-blind technic, N. Engl. J. Med. 260:1115-1118, 1959.
3. Colton, T.: A model for selecting one of two medical treatments, J. Amer. Stat. Assn. 58:388-400, 1963.
4. Cornfield, J.: The University Group Diabetes Program. A further statistical analysis of the mortality findings, J. A. M. A. 217:1676-1687, 1971.
5. Cornfield, J., Halperin, M., and Greenhouse, S.: An adaptive procedure for sequential clinical trials, J. Am. Stat. Assn. 64:759-770, 1969.
6. Cox, D. R.: Randomization, Biometrics 27:1114, 1971. (Abst.)
7. Cromie, B. W.: Errors in providing data for analysis, Proc. Roy. Soc. Med. 59 (Suppl.):64-68, 1966.
8. Feinstein, A. R.: Clinical biostatistics. III-V. The architecture of clinical research, Clin. Pharmacol. Ther. 11:432-441, 595-610, and 755-771, 1970.
9. Feinstein, A. R.: Clinical biostatistics. VIII. An analytic appraisal of the University Group Diabetes Program (UGDP) study, Clin. Pharmacol. Ther. 12:167-191, 1971.
10. Feinstein, A. R.: Clinical biostatistics. IX. How do we measure "safety and efficacy"? Clin. Pharmacol. Ther. 12:544-558, 1971.
11. Feinstein, A. R.: Clinical biostatistics. XII. On exorcising the ghost of Gauss and the curse of Kelvin, Clin. Pharmacol. Ther. 12:1003-1016, 1971.
12. Feinstein, A. R.: Prognostic stratification series, Clin. Pharmacol. Ther. 13:285-297, 442-457, 609-624, and 755-768, 1972.
13. Feinstein, A. R.: Clinical biostatistics. XIX. Ambiguity and abuse in the twelve different concepts of 'control,' Clin. Pharmacol. Ther. 14:112-122, 1973.
14. Feinstein, A. R.: The need for humanised science in evaluating medication, Lancet 2:421-423, 1972.
15. Feinstein, A. R.: Clinical biostatistics. XX. The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research, Clin. Pharmacol. Ther. 14:291-307, 1973.
16. Feinstein, A. R.: Clinical biostatistics. XXII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 1), Clin. Pharmacol. Ther. 14:601-615, 1973.
17. Feinstein, A. R.: Clinical biostatistics. XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2), Clin. Pharmacol. Ther. 14:898-915, 1973.
18. Fisher, R. A.: The arrangement of field experiments, J. Ministry Agriculture 33:503-513, 1926.
19. Fisher, R. A.: The design of experiments, ed. 8, New York, 1966, Hafner Publishing Co.
20. Heisenberg, W.: The physical principles of the quantum theory, Chicago, 1930, University of Chicago Press.
21. Landis, J. R., and Feinstein, A. R.: An empirical comparison of random numbers acquired by computer-generation and from the Rand tables, Comp. Biomed. Res. 6:322-326, 1973.
22. Lindley, D. V.: Is randomization necessary? Biometrics 27:1114, 1971. (Abst.)
23. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Co.
24. Modell, W., and Houde, R. W.: Factors influencing clinical evaluation of drugs. With special reference to the double-blind technique, J. A. M. A. 167:2190-2199, 1958.
25. Nebenzahl, E., and Sobel, M.: Play-the-winner sampling for a fixed-sample-size binomial selection problem, Biometrika 59:1-8, 1972.
26. Radhakrishna, S., and Sutherland, I.: The chance occurrence of substantial initial differences between groups in studies based on random allocation, Appl. Stat. 11:47-54, 1962.
27. Seaman, A. J., Griswold, H. E., Reaume, R. B., and Ritzmann, L.: Long-term anticoagulant prophylaxis after myocardial infarction. Final report, N. Engl. J. Med. 281:115-119, 1969.
28. Student (W. S. Gosset): Comparison between balanced and random arrangements of field plots, Biometrika 29:363-379, 1937.
29. University Group Diabetes Program: A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. I. Design, methods and baseline results; and II. Mortality results, Diabetes 19:747-830 (suppl. 2), 1970.
30. Youden, W. J.: Randomization and experimentation, Technometrics 14:13-22, 1972.
31. Zelen, M.: Play the winner rule and the controlled clinical trial, J. Amer. Stat. Assn. 64:131-146, 1969.

CHAPTER 9

Consequences of 'compliance bias'

Most discussions of compliance are concerned with a patient's maintenance of a drug, diet, or other agent of an assigned therapeutic regimen. Since no therapy can work unless it is taken, one of the main clinical reasons for studying compliance is to increase it. By finding out why patients fail to comply and how we can encourage compliance, we hope to develop better ways of enabling a presumably beneficial therapeutic regimen to accomplish its benefits. With these goals in mind, we may investigate the various clinico-socio-behavioral features that are determinants of compliance and the educational-communicational-packaging features that may enhance it.

A substantial literature has begun to develop on compliance with therapeutic regimens. The references cited here are only a few of the many studies specifically devoted to this topic, and a detailed compendium15 of the literature was recently assembled for a major "Workshop/Symposium" conducted at the McMaster University Medical Center in Hamilton, Ontario, Canada. The organizers of that meeting, Drs. David L. Sackett and R. Brian Haynes, have also begun to issue a newsletter that contains an ongoing account of new developments in the field.

In trying to achieve or increase compliance with a therapeutic regimen, we begin with the basic assumption that the therapy is desirable: that it is safe, effective, and worth using. Once these virtues have been established, the regimen will warrant the efforts by both medical personnel and patients to ensure that it is maintained as prescribed.

On the other hand, the complex spectrum of compliance has effects that can alter the basic data analyzed to determine the virtues of a regimen. These ramifications of compliance are the focus of discussion in this essay. Unless suitable biostatistical attention is given to the diverse patterns and effects of compliance, it can act as a source of confusion and distortion in the original therapeutic data. If various forms of "compliance bias" are not properly recognized and adjusted, a valuable therapeutic regimen may be dismissed as worthless or an ineffectual treatment may be promulgated as good. In this essay, I should like to discuss six features of compliance that can affect the biostatistical data and interpretations. These features relate to issues in regimen compliance, the evaluation of compliance, the non-compliance "control", protocol compliance, the "compliance sample", and the "compliance-confounded cohort".

This chapter originally appeared as "Clinical biostatistics. XXX. Biostatistical problems in 'compliance bias.'" In Clin. Pharmacol. Ther. 16:846, 1974.

A. Regimen compliance

The first point to be considered is the way that the results of a therapeutic regimen can be distorted by the compliance it receives. Let us assume that an index of success has been established in a randomized, double-blind therapeutic trial, comparing Drug A vs. Drug B. Let us further assume that the two drugs are equally effective. Despite this equivalence in efficacy, compliance bias can alter the results of the trial so that a major difference can falsely occur in the success rates of the two drugs. The observed success rate might be 50% for Drug A and 66% for Drug B.

A false difference of this magnitude could occur as follows: Suppose that the condition under treatment has a 70% success rate when either drug is faithfully maintained and a 30% success rate when either drug is abandoned. Now suppose that Drug A has an unappealing taste, appearance, or schedule of administration so that it is faithfully maintained by only 50% of the patients assigned to it, whereas Drug B receives excellent compliance by 90% of the patients. With these distinctions, if 200 patients were assigned to Drug A, 100 patients would maintain the drug faithfully and 70 of these 100 would have a successful outcome. Of the 100 patients who abandon the drug, 30 would be successful. The net result for Drug A would be 100 successes per 200 patients, an overall success rate of 50%. With Drug B, 180 of the 200 assigned patients would maintain excellent compliance and 126 (70%) of them would have a successful outcome. Of the 20 patients who abandon the drug, 6 (30%) would be successful. The total success rate for Drug B would thus be 132/200, or 66%.

The difference between the two success rates would be large enough (16%) to be clinically significant and the magnitude of the sample sizes cited here would also make the difference "statistically significant" at P < .005 (χ² = 10.5). If we had not attempted to investigate and analyze compliance, however, we would be unaware of its role in causing the difference. We would then conclude erroneously that Drug B is pharmacologically more effective than Drug A, although the actual cause of the difference arose as a matter of compliance, rather than pharmacologic efficacy.

The mathematical calculations just cited can be simplified by the following algebraic procedure. Let p_D be the proportion of patients who maintain good compliance for Treatment D. Let q_D, which is 1 - p_D, be the proportion of patients who do not maintain good compliance. Let r_1 be the rate of an outcome event (such as "success" or a cardiovascular complication) in patients who maintain good compliance. Let r_2 be the rate of this outcome event in patients whose compliance was not good. With these conventions, the rate of the outcome event for the cohort of patients receiving Treatment D is p_D r_1 + q_D r_2. Thus, in the foregoing example, the success rate for Drug A is (.50 x .70) + (.50 x .30) = .35 + .15 = .50 = 50%. The success rate for Drug B is (.90 x .70) + (.10 x .30) = .63 + .03 = .66 = 66%.

The effect of compliance bias in the situation just described was to cause a false difference in the apparent efficacy of two drugs that actually had equal pharmacologic action. An analogous set of problems might work in the opposite direction to produce a false difference in the adverse-reaction rates of two drugs with equal toxicity.
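The Drug A and Drug B example can be reproduced with a brief sketch. The function names are hypothetical, and the chi-square value is computed with the ordinary uncorrected formula for a 2 x 2 table, which is an assumption about how the quoted figure of 10.5 was obtained.

```python
def cohort_rate(p, r1, r2):
    """Outcome rate for a treated cohort: p*r1 + (1 - p)*r2.

    p:  proportion of patients who maintain good compliance
    r1: outcome rate in patients with good compliance
    r2: outcome rate in patients without good compliance
    """
    return p * r1 + (1 - p) * r2


def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Values from the text: 70% success when maintained, 30% when abandoned.
rate_a = cohort_rate(0.50, 0.70, 0.30)   # Drug A: 50% of patients comply
rate_b = cohort_rate(0.90, 0.70, 0.30)   # Drug B: 90% of patients comply

n_per_drug = 200
successes_a = round(rate_a * n_per_drug)
successes_b = round(rate_b * n_per_drug)

chi2 = chi_square_2x2(
    successes_a, n_per_drug - successes_a,
    successes_b, n_per_drug - successes_b,
)
print(rate_a, rate_b, successes_a, successes_b, round(chi2, 1))
```

Both drugs are given identical pharmacologic parameters here, so the whole of the apparent 16% difference comes from the difference in compliance.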

B. Evaluation of compliance

Since distinctions in compliance can affect the appraisal of a therapeutic regimen's efficacy and safety, the evaluation of compliance is an important, although often neglected, aspect of clinical biostatistics. Probably the main reason for this neglect is that compliance is an entirely subjective and human phenomenon. Its data are extremely "soft". The degree of compliance is determined by the patient; the act of compliance usually occurs in circumstances where it cannot be directly observed by the investigator; and its appraisal depends on what the patient decides to do and report about it. In an era devoted to the analysis of "hard" data, compliance is a variable that lacks scientific appeal, no matter how important the phenomenon may be. The prejudice against this type of information has been well summarized by the recent Nobel laureate, Konrad Lorenz15:

"If the subject of investigation happens to be human, he or she is being literally dehumanized by being prevented from showing any response which a guinea pig or a pigeon might not show as well (in fact, the same experimental set-up is often applicable to animal and human subjects). Worse, in that kind of experimentation, the experimenter himself is not permitted to be quite human, as he is strictly prevented from using most of the cognitive mechanisms with which nature endowed our species. ... The worst of this widespread contempt for description is that it discourages people from even trying to analyze really complicated systems."

1. Types of data. For investigators who are willing to cope with complicated human systems, three different methods can be used for getting data to describe and evaluate compliance. Perhaps the most objective technique is to measure the presence of the drug (or one of its metabolic products) in the urine. The main disadvantage of this method is that it pertains only to the one specimen for which the test was made. The result does not indicate the patient's compliance during all the other times for which no urine tests were performed. Furthermore, the single urine test may not be able to indicate whether the drug was correctly taken in the prescribed pattern even on the day of the test.

A second quantitative technique is based on dosage unit counts. At each visit, the patient is given a fixed number of doses in a container that is to be returned for its remaining doses to be counted at the next visit. This technique provides quantitative data, but it will fail if the patient forgets to bring the container, and the "pill count" cannot demonstrate whether the medication was taken in the desired pattern or disposed in various unprescribed manners.

The best way of finding out what a patient has done is to ask the patient directly. From the reply, an investigator can learn the qualitative and quantitative information that is not provided by the other two techniques. For getting the totality of data needed to evaluate compliance, a well-constructed interview technique cannot be replaced by any of the available "objective" methods. Furthermore, a patient who deliberately wants to disguise the truth can subvert the "objective" procedures as well as the information provided in the interview.

The interview technique, however, is patently subjective. It depends completely on the patient's recall and reliability and also on the skill of the interviewer. An interviewer who is punitive or whose manner is otherwise unacceptable to the patient may not elicit accurate data. For these reasons, the interview technique is often used as the prime source of compliance data, but one of the "objective" techniques is added to verify the patient's reliability. A patient would be regarded as truly compliant only if the information stated in the interview is consistent with the results of the objective test.

2. Types of rating. Regardless of what method is used for assembling data about compliance, the results must be cited in a manner that allows the data to be analyzed. Since compliance cannot be expressed in dimensional terms (which might be used for such variables as height or serum cholesterol), the investigator must choose a scale that provides a rating. The scale that is chosen can be a dichotomous partition (such as good and not good) or a set of ordinal ranks, containing such categories as poor, fair, good, and excellent. For many analytic purposes, a dichotomous partition will suffice, since the investigator may want to engage only in a simple comparison in which the outcomes of the good group or not-good group are contrasted against all others.

The choice of criteria for these ratings will obviously be affected by the type of regimen under study and the purpose for which it has been prescribed. For example, the criteria for good compliance may differ

if a daily antibiotic is being taken to prevent rather than to eradicate an infection. To illustrate this distinction, I shall list here two sets of criteria used during investigations of daily oral antibiotics in the prevention of Group A streptococcal infections and rheumatic recurrences in a population of children and adolescents who had all had at least one previous episode of acute rheumatic fever:

Purpose of regimen              Continuous prophylaxis against    Eradication of streptococcal
                                streptococcal infection           infection
Drug                            Oral Penicillin G                 Oral Penicillin G
Dose of drug                    200,000 units daily               400,000 units 3 times daily
                                                                  for 10 days
Criterion of "good" compliance  Reliable history and no more      Reliable history and no more
                                than 5 days missed per month      than 1 dose missed during the
                                and no 2 days missed              10 days
                                consecutively

Regardless of whether the reader agrees or disagrees with the details of these criteria, they demonstrate the fact that different types of therapeutic regimens will require different criteria for ratings of compliance.

C. The non-compliance "control"

Although the scientific prejudice against talking to patients is one of the main reasons that compliance has been so neglected as an important variable in statistical analyses, another type of prejudice has caused clinical investigators to lose other valuable data related to compliance. This prejudice is the tendency of doctors to dismiss patients who reject the doctors' recommendations.

In ordinary clinical practice, a patient who fails to carry out the doctor's recommendations is performing an important experiment that the doctor was unwilling to undertake. The patient has decided, in effect, to test a counter-hypothesis. If the experiment fails, the failure helps support the propriety of the doctor's original therapeutic decision. If the experiment succeeds, the doctor has learned that success does not always require the original plan of action. Consequently, a patient who refuses to comply with an offered treatment becomes a type of "control" whose results can be compared with those of patients who received the treatment.

Nevertheless, in ordinary clinical practice, a patient who appears to reject the doctor's recommendations is often rejected by the doctor. Because the patient may then be actively or passively urged to seek medical attention elsewhere, the doctor may miss the opportunity to learn the results of the counter-experiment. This type of loss may be a necessary event in circumstances where a busy practitioner wants to use his time "efficiently" and has no intention of ever tabulating his therapeutic results. The loss is highly undesirable, however, if the practitioner's data ever become items of biostatistics.

An illustration of the problem has regularly appeared in surveys reporting the outcome of therapy for cancer. In such surveys, the results of surgically treated patients have generally been compared against those of patients who received radiotherapy or chemotherapy. This comparison is unfair because the non-surgical patients were not an "operable" group; they were usually deemed "inoperable" and referred for other modes of treatment. For an unbiased comparison, the group of "operable" patients who received surgery should be contrasted with a group of "operable" patients who received some other treatment. This type of contrast would be deliberately arranged in a randomized therapeutic trial, but very few such trials have been conducted for "operable" patients. Consequently, the only source of "operable" non-surgical patients in ordinary clinical practice is the people who were deemed "operable" and who refused the offered surgical treatment. These "non-compliant" patients would be a reasonable control group for the surgically treated patients,

but the non-compliant patients are usually rejected by their surgeons and seldom receive follow-up examinations for their outcomes to be noted, recorded, and analyzed.

Non-compliant patients can also serve as an important "control" group in circumstances where the results of placebo therapy are not available. For example, when several antibacterial agents were receiving randomized clinical trials to compare their value in preventing streptococcal infections, ethical considerations militated against the examination of results in groups treated with placebo. In the absence of a placebo-treated group, however, major problems arose when two of the "active" drugs were found to yield essentially similar results. Was either drug really more effective than placebo? The issue was resolved when the streptococcal attack rate in compliant patients (who maintained "good prophylaxis") was found to be substantially lower than in patients who failed to comply.

Another opportunity to make analytic use of compliance distinctions occurred during an investigation of the role that tonsil size might play in predisposing rheumatic children and adolescents to streptococcal infections. Among the patients who maintained good continuity of antibiotic prophylaxis, the attack rate of streptococcal infections was unaffected by tonsil size. Among patients who did not maintain good prophylaxis, the streptococcal attack rate increased with increasing size of the tonsils.

D. Protocol compliance

In all the issues discussed so far, the idea of compliance referred only to the patient's acceptance or maintenance of an assigned therapeutic regimen. Another important aspect of compliance refers to the maintenance of a research protocol. During the course of a therapeutic investigation, many planned procedures must be carried out by both investigator and patient. The compliance or non-compliance given to these protocol procedures can affect the results of the research.

1. Compliance by investigator. The likelihood of violating a research protocol is particularly high when multiple investigators are collaborating. The violations usually arise because of poor communication among the collaborators or inadequate attention by individual investigators.

One frequent violation occurs in the criteria for admission to the trial. For example, in the UGDP cooperative study of therapy for diabetes mellitus, some of the admitted patients did not fulfill the minimum standards of glucose intolerance that had been established as diagnostic criteria for diabetes mellitus. Aside from the ethical problems produced by this type of protocol violation, it can create a major statistical problem if the results in the "ineligible" and "eligible" patients are substantially different.

Another type of violation, which is often inadvertent, is the investigator's development of the ability to discern the identities of the drugs being studied in a double-blind trial. This "unmasking" is particularly likely to occur if the active drug can be recognized from a physiologic side effect such as the bradycardia that often occurs with beta-blocking agents. The consequence of this type of protocol violation is the delusion that symptoms and other subjective data have been determined with the presumptive "objectivity" of an effectively maintained double-blind technique. If doctors (or patients) become successfully able to differentiate the active drug from the placebo, the clinical trial is converted into a pseudo-double-blind exercise, having all the logistic disadvantages of double-blind research and none of the scientific advantages.

A third protocol problem in therapeutic trials is the need to exclude supplementary or other drugs that might affect the results of the main drug under investigation. The violation of this specification of a protocol is often overlooked if the violations have occurred with equal frequency in each group of patients receiving the compared

therapeutic regimens. Since the qualitative characteristics of the "ineligible" medications may not be equivalent for each group of patients, the differences may be responsible for distinctions that become erroneously attributed to the principal therapeutic agents.

Of the many other potential issues in investigator non-compliance, the only one to be cited here is the problem of preserving the "letter" of a research protocol while ignoring its "spirit". For example, suppose we are conducting a therapeutic trial to determine whether the maintenance of normoglycemia will prevent vascular complications in adults with diabetes mellitus. If we prescribe a fixed dosage of an oral hypoglycemic drug and determine whether the patient complies with the prescribed regimen, we have adhered to the specifications of the protocol. On the other hand, if we fail to check whether the patient's blood sugar is actually being maintained in a normal range or if we fail to adjust the dose of the drug so that it produces normoglycemia, we have not complied with the basic idea of the research.

There are many other ways in which an investigator's non-compliance with protocol can distort the research data. Nevertheless, the published reports of a research project often contain no indication of efforts made to check whether protocol compliance has occurred. An interesting aspect of the peculiar "double standard" used on the current research scene is that pharmaceutical companies are expected to monitor the compliance of clinical investigators who perform trials of new drugs, but an analogous monitoring may not be demanded when a federal agency sponsors a multi-center therapeutic trial involving investigators at several academic institutions.

2. Compliance by patient. While complying with the prescribed medication, a patient may violate the prescribed protocol in different ways. One statistical violation consists of breaking the double-blind code. A recent example of this problem was provided by Chalmers. He described the way in which a sophisticated group of patients (employees of the National Institutes of Health), who were participating in a double-blind clinical trial, tasted the contents of the capsules to distinguish vitamin C from placebo. According to Chalmers, the rates of the outcome event in the trial were substantially different for patients who did or did not correctly identify their medication.

Another important form of patient non-compliance is improper attendance for repeated examinations after treatment is initiated. If a patient scheduled to have a particular test done at 4 weeks and at 8 weeks after treatment appears only at 6 weeks, where do the results of the 6-week test get counted? Suppose the patient does not appear often enough to have all the periodic tests that are needed to rule out asymptomatic episodic events, such as streptococcal infections or anicteric hepatitis. How are the incomplete data to be analyzed? There are no simple answers to these questions. Each decision requires subtle judgments according to the particular circumstances that are involved.

The ultimate act of non-compliance, of course, is the patient's decision to drop out of a study altogether. Except for one issue to be cited later, the problems of analyzing data for drop-out patients are beyond the scope of this discussion, and will be reserved for a later installment in this series. The problems are difficult, complex, and not always well managed by the actuarial ("life-table") analyses that are usually proposed as a solution.

E. The "compliance sample"

A different type of biostatistical problem arises when a therapeutic trial is conducted with a "compliance sample" of patients. Such a sample arises in the following way: Before admission to the trial, the patients who are otherwise eligible are screened to determine their ability and willingness

to comply with both the protocol and the therapeutic regimens under investigation. Patients whom the investigators regard as non-compliant are then excluded from admission, so that the trial is conducted with the group of seemingly cooperative patients who constitute the "compliance sample".

A clue to the existence of such a sample can be noted from an account of the criteria used for excluding patients from a trial. These criteria customarily depend on various features of diagnosis, prognosis, co-morbidity, or co-medication. If the criteria also include a statement about "willingness to cooperate", the investigators have used a compliance sample.

To choose patients this way seems perfectly reasonable. After all, in conducting a therapeutic trial, the investigators do not wish to expend major amounts of time and vigorous research efforts on patients who are not likely to maintain the proposed medication or appear for the proposed examination procedures. By screening out the non-compliant patients, the investigators would eliminate wasted energy and increase the efficiency of the research activities.

In attaining this efficiency, however, the investigators take a substantial risk. The risk is that the compliant patients may not properly represent the people who have the condition under treatment. The risk is tiny if the excluded non-compliant patients constituted only a small proportion of the total group of otherwise eligible patients. If the excluded group occupied a large fraction of the eligible cases, however, the results of the trial may be seriously compromised, particularly if compliance and therapeutic responsiveness are inter-related. The results of the trial may be pertinent for compliant but not for other patients with the same clinical condition.

An example of this type of problem occurred in the recent Veterans Administration Cooperative Study20 of the treatment of hypertension. Because the results provided "hard" (randomized) evidence of the value of treating many patients with asymptomatic hypertension, the trial has received justified praises for the excellence with which it was designed and conducted. Nevertheless, the group under study contained a highly restricted compliance sample of hypertensive patients. The selection procedure was described as follows:

"Since an appreciable number of dropouts would jeopardize the study, we wished to minimize their occurrence as much as possible. "Skid row" alcoholics, vagrants, psychopaths, antagonistic personalities, mentally incompetent persons who are not properly cared for at home, and those who for one reason or another could not return regularly to the clinic are therefore excluded from the trial. In addition, the pre-randomization trial period serves to eliminate all other potential drop-outs that are missed during the initial evaluation."

The VA investigators have not published data on the number of otherwise eligible patients whose demonstrated (or anticipated) non-compliance kept them from being admitted to the trial. It has been estimated10 that between one half to two thirds of the patients with eligible blood pressures were excluded from entry.

The exclusion of this large proportion of patients would not affect the results found in the compliant patients who were treated in the trial, but would impair the ability to draw general conclusions about the treatment of other patients with hypertension. Docile hypertensive patients who are willing to comply in such a trial may have their vascular systems benefitted by therapeutic agents that lower blood pressure; but these agents may not work as well on the many non-docile hypertensives who are non-compliant. As public campaigns are mounted to deliver appropriate treatment to all patients with hypertension, the results (if noted and evaluated) may be somewhat disappointing. If the rate of vascular complications is not reduced as much as was expected, the disparity may arise from the unresponsiveness of the non-compliant patients whose therapeutic refractoriness had not previously been discovered.
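The possible disappointment just described can be put in rough numerical terms. The following is only a sketch under loudly stated assumptions: the two effect sizes (a 40% reduction in complications for compliant patients and 10% for non-compliant patients) are hypothetical and do not come from the VA study; the two groups are assumed to have the same untreated complication rate; and the excluded fraction of the trial candidates is assumed to match the non-compliant fraction of the general hypertensive population. Only the excluded fractions (one half and two thirds) are taken from the estimate cited above.

```python
def population_reduction(excluded_fraction, reduction_compliant, reduction_noncompliant):
    """Weighted-average reduction in complications over all eligible patients.

    Assumes both groups share the same untreated complication rate.
    """
    included_fraction = 1 - excluded_fraction
    return (included_fraction * reduction_compliant
            + excluded_fraction * reduction_noncompliant)


# Hypothetical effect sizes, chosen only for illustration.
REDUCTION_COMPLIANT = 0.40
REDUCTION_NONCOMPLIANT = 0.10

overall = {
    label: population_reduction(fraction, REDUCTION_COMPLIANT, REDUCTION_NONCOMPLIANT)
    for label, fraction in (("one half excluded", 1 / 2), ("two thirds excluded", 2 / 3))
}
for label, value in overall.items():
    print(f"{label}: over-all reduction {value:.0%} vs. {REDUCTION_COMPLIANT:.0%} in the trial")
```

Under these invented figures, the benefit seen in the whole population would be roughly half of what the compliance sample showed, which is the kind of disparity the paragraph above anticipates.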

F. The "compliance confounded cohort"

The last problem to be cited here is particularly subtle and complex. It can arise if the ability to comply with a therapeutic regimen is also related to the event that is to be noted as the main outcome of treatment. If compliance ability and outcome event are closely related, the results will be distorted by a confounding variable. Regardless of treatment, the cohort of people who can comply with treatment will be destined to have an outcome-event rate that differs substantially from the corresponding rate in the people who do not maintain compliance. Consequently, an ineffectual regimen may falsely appear to be distinctly beneficial (or detrimental) to the people who maintain it.

To illustrate this point, suppose that people who have a high degree of the particular kind of inner drive or stress that might be called psychic tension are more likely to develop cardiovascular disease than people who are not psychically tense. Let us now suppose that an unappealing and difficult-to-maintain new diet has been proposed as an agent that prevents cardiovascular disease. Let us further assume that the diet is actually ineffectual. Finally, let us assume that the tense people have great difficulty in complying with this new diet, whereas non-tense people are much more able to comply. Under these conditions, when the diet is prescribed for a large population, the rate of cardiovascular disease will be lower in people who maintain the diet than in people who do not. The false conclusion may then be that the diet effectively prevents cardiovascular disease.

Since this type of problem may arise in the multiple risk factor intervention trial (MRFIT) now being launched11 throughout the United States, I shall cite some contrived numerical data to illustrate the possibilities.

Let us assume that we can identify a tense group of people who will have a 30% rate of cardiovascular events during the interval under study. In non-tense people, the corresponding rate is 5%. Let us further assume that the population under study consists of 70% non-tense people and 30% tense people. If nothing were done to this population, we would expect the overall rate of cardiovascular events to be (.70) (.05) + (.30) (.30) = .035 + .09 = .125 = 12.5%.

Now suppose that the entire population enters a randomized clinical trial in which the action of a special new diet is being tested. Half the population is assigned this new diet and the other half continues to maintain its usual dietary pattern.

The compliance problem might now occur as follows. For the people who are not receiving a special diet, there is no difficult regimen acting as a provocation to drop out. The only drop-out incentive is the "nuisance" of participating in the clinical trial itself. Consequently, the drop-out rates in the no-diet patients would be the usual attrition to be expected in any trial, and the rates would be similar in the tense and non-tense patients. Let us assume that these drop-out rates are 5% in each group. The remaining population in the no-diet cohort will thus be composed of 95% of the starting members of each psychic group, and will be 95% of its original size. [This figure can be verified as (.95) (.70) + (.95) (.30) = .665 + .285 = .950 = 95%.]

For the patients in the cohort assigned to receive the special diet, the difficulties of maintaining the diet will create a strong stimulus toward dropping out. Among non-tense patients, let us assume that the drop-out rate is 10%, twice as high as the rate in similar patients not receiving the diet. Among tense patients, the problems of maintaining the special diet are formidable, so that 80% of these patients drop out.
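The assumptions stated so far can be collected in a short sketch that computes the untreated event rate and the fraction of each randomized cohort that stays under observation. The dictionary and function names are hypothetical; every number is one of the contrived figures given in the text.

```python
# Contrived figures from the text.
SHARE = {"non-tense": 0.70, "tense": 0.30}        # composition of the population
EVENT_RATE = {"non-tense": 0.05, "tense": 0.30}   # cardiovascular events, any regimen
DROP_OUT = {
    "no diet": {"non-tense": 0.05, "tense": 0.05},
    "special diet": {"non-tense": 0.10, "tense": 0.80},
}


def baseline_event_rate():
    """Expected event rate if nothing were done to the population."""
    return sum(SHARE[group] * EVENT_RATE[group] for group in SHARE)


def retained_fraction(arm):
    """Fraction of a randomized cohort that does not drop out."""
    return sum(SHARE[group] * (1 - DROP_OUT[arm][group]) for group in SHARE)


baseline = baseline_event_rate()
retained = {arm: retained_fraction(arm) for arm in DROP_OUT}

print(f"untreated event rate: {baseline:.1%}")
for arm, fraction in retained.items():
    print(f"{arm}: {fraction:.0%} of the cohort remains under observation")
```

Keeping the event rates independent of the assigned regimen is the sketch's way of encoding the premise that the diet is ineffectual.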

The total group of people who maintain the special diet will thus be reduced to 69% of the original cohort. [This figure can be verified as (.90) (.70) + (.20) (.30) = .63 + .06 = .69 = 69%.] This rate of compliance will not seem unusually low, because a relatively high drop-out rate would be anticipated for people assigned to the special diet.

Now let us consider what will be observed as the outcome rates of cardiovascular events in this study. For the cohort of people who continued to participate in the no-diet group, the event rate would be 5% in the 66.5% of non-tense people and 30% in the 28.5% of tense people. The total event rate would be (.05) (.665) + (.30) (.285) = (.03325) + (.0855) = .11875. When adjusted for the population of no-diet people who actually completed the trial, this rate would be (.11875) / (.95) = .125 = 12.5%. For the cohort who continued to maintain the special diet, the event rate would be 5% in the 63% of non-tense people and 30% in the 6% of tense people. The total event rate in this group will therefore be (.05) (.63) + (.30) (.06) = .0315 + .018 = .0495 = 4.95%. When adjusted for the special-diet people who actually completed the trial, this rate would be (.0495) / (.69) = .072 = 7.2%.

If we knew nothing about the relationship of personality, compliance, and cardiovascular rates in tense vs. non-tense people, we would observe only the outcome rates. Without a stratification for psychic state, we would not be aware of the differential drop-out distinctions that had produced the differences in outcome, and we might draw conclusions based only on the gross outcomes. Thus, in a randomized clinical trial comparing a special diet vs. no diet, we would have noted that the people who maintained the special diet had a cardiovascular event rate of 7.2%; and that the people who maintained no special diet had a corresponding rate of 12.5%. The special diet would appear to have reduced the cardiovascular event rate by (12.5 - 7.2) / 12.5 = 42.4%. This magnitude of reduction would obviously seem clinically significant; and furthermore, with the large numbers of patients entered into the trial, the difference would also be "statistically significant".*

*With calculations that are too extensive to be repeated here, it can be shown that this difference in the two groups of compliant patients will have a P value below .05 (by χ² test) if as few as 636 patients are initially enrolled in the trial.

The obvious conclusion would seem to be that the special diet had reduced the rate of cardiovascular disease by more than 40%, and yet the conclusion would be totally wrong. The apparent benefits of the diet, which we know was actually ineffectual, would have arisen only from the fact that it received compliance mainly from people destined to have a low rate of cardiovascular events.

At this point in the discussion, a student of clinical trials would immediately note that the analysis presented thus far is incomplete. We have not yet looked at the results of the drop-out patients. Under the conditions noted earlier, we should find a substantial difference in cardiovascular rates in the two groups of patients who dropped out. These rates would be 12.5% in the no-diet group and 24.4% in the special-diet group.

The explanatory calculations are as follows. In the no-diet group, the non-tense drop-outs would contain (.70) (.05) = .035 and the tense drop-outs would contain (.30) (.05) = .015 of the original population. The cardiovascular rate in this group of drop-outs would be [(.035) (.05) + (.015) (.30)] / .050 = [.00175 + .00450] / .050 = .00625 / .05 = .125 = 12.5%. In the special-diet group, the non-tense drop-outs would contain (.70) (.10) = .07 and the tense drop-outs would contain (.30) (.80) = .24 of the original population. The cardiovascular rate in this group of drop-outs would be [(.07) (.05) + (.24) (.30)] / .31 = [.0035 + .072] / .31 = .0755 / .31 = .244 = 24.4%. The finding that one drop-out group had a cardiovascular rate twice as high as the other drop-out group should immediately alert our suspicions that something extremely peculiar has happened.

Furthermore, if we look at the results of all patients who were randomized, regardless of those who dropped out, we would find that the cardiovascular rates are the same. In the no-diet group, the rate would be [(.11875) + (.00625)] / [(.95) + (.05)] = [.12500] / 1.00 = 12.5%. In the special-diet group, the rate would be [(.0495) + (.0755)] / [(.69) + (.31)] = [.1250] / [1.00] = 12.5%. This similarity in cardiovascular rates for the two randomized groups, regardless of drop-outs, would help confirm the existence of some strange phenomenon among the drop-out cases.

To get all this additional information, however, would require that the therapeutic trial be conducted with an extraordinary passion for getting complete, detailed follow-up data for all patients who have dropped out. This intensity of follow-up surveillance almost never occurs in a therapeutic trial. If the outcome event is death, the investigators can usually learn about its occurrence in drop-out patients who have otherwise been "lost to follow-up"; but if the outcome event is a non-fatal cardiovascular event (such as angina pectoris, myocardial infarction, intermittent claudication, or stroke), the occurrence or non-occurrence of this event is difficult to document in a standardized manner for living patients who have been lost to follow-up. Because of these difficulties in the follow-up of dropped patients, the investigators may be strongly tempted to confine their main analyses to the patients who, complying with the research protocol, continued under observation. If this temptation is accepted, the investigators will reach the erroneous conclusion described earlier.

The possibility of this type of error is a major hazard in the work of the MRFIT study. An abundance of evidence has now been assembled to suggest that a distinctive relationship exists between certain personality types (or psychic states) and subsequent cardiovascular disease. The main issue is no longer whether such a relationship exists, but how to identify it: by which particular psychologic test or other psychiatric examining instrument can we best discern the people who are especially susceptible to cardiovascular disease. Another reasonable belief is that the people who have this particular psychic constitution may be unwilling or unable to maintain the particular forms of special dieting or other interventions that are prescribed as "active therapy" in the MRFIT trial. Since the "control" group will receive no diet, rather than an equally unpalatable diet, the MRFIT investigation thus possesses all the ingredients needed for a major scientific error in the interpretation of results.

The most cogent way of avoiding this error is for the patients' psychic condition to be examined with standard test procedures that are as thorough as those used for examining the condition of serum lipids. With such data, the investigators could identify the different degrees of psychic "risk", and could use the results for the analyses needed to demonstrate that compliance bias is absent. This approach may not be scientifically appealing, however, because the questionnaire and other written instruments used for examining psychic status have not received intensive attention from epidemiologists.

The

ideologic

belief

of

most contemporary epidemiologists has been that "risk factors" arise from nurture but not from nature from such environmental features as food, water, tobacco smoking, and exercise; but not from such

—

constitutional

features

as

heredity

and

Because of this ideologic belief, both heredity and psyche have been generally ignored in epidemiologic re

psychic status.


search,

and suitable

by using age, race, sex or other

instruments

for bias"

have not been developed or applied for

variables

obtaining the necessary data

instead

scientific

large co-

in

absence of suitable psychic exami-

nations and correlated data

way

other for

is

even

drop

formed

men compliance, are

different

the

in

torted

l>v a

Equalit) not

will

have basic

received

a

major

may be

dis-

however, because the unwilling to con-

MRFIT

clinics

for

To cope with investigators may need clinics, home visits, or

examinations.

night

.special

may

surveillance

may be

problem, the

drop-out

the

regimens,

data

diagnostic

necessar)

other

patients

compliance-confounded cohort.

of

arrange

to

regi-

procedures

patients

that

will

allow

receive suitable

to

low-up examinations

fol-

•

•

six features of

to

potential

indicate for

to

scientific

does

not

its

bias,

7

intricacies

and

contemplate

If

the

and

they

rule

may

out

selection, detection,

i.se

due to inequities in and chronology, com-

merely an investigator hopes or wishes will

appear

jf

on

focus

restricted

"hard"

In addition to the cited scientific defects, however, therapeutic investigations that depend only on "hard data" create an important humanistic hazard.

The

may be

idea

analyses

biostatistical

established

that

unable to

are

dis-

between an act of patient care

tinguish

and an exercise

disappear

not

Boyd,

and

medicine.

in veterinary

J.

exist.

It

Covington, T.

R.,

Coussons,

also will not dis-

the data analyst tries to "adjust

1. Blackwell, B.: The drug defaulter, Clin. Pharmacol. Ther. 13:841-848, 1972.

2. Boyd, J. R., Covington, T. R., Stanaszek, W. F., and Coussons, R. T.: Drug defaulting. I. Determinants of compliance; II. Analysis of noncompliance patterns, Am. J. Hosp. Pharm. 31:362-367; 485-491, 1974.

3. Chalmers, T. C.: Quoted in Internal Medicine News, p. 4, Oct. 11, 1973.

4. Feinstein, A. R., Wood, H. F., Epstein, J. A., Taranta, A., Simpson, R., and Tursky, E.: A controlled study of three methods of prophylaxis against streptococcal infection in a population of rheumatic children. II. Results of the first three years of the study, including methods for evaluating the maintenance of oral prophylaxis, N. Engl. J. Med. 260:697-702, 1959.

5. Feinstein, A. R., Spagnuolo, M., Jonas, S., Kloth, H., Tursky, E., and Levitt, M.: Prophylaxis of recurrent rheumatic fever. Therapeutic-continuous oral penicillin vs. monthly injections,

does not

"hard"

obtained by direct conversation

are

methods

search. Like the biases

it

a

the malefaction.

the

re-

that

is

with patients, biostatisticians have abetted

investigator

chosen hypothesis and invalidate his

beet,

that

bias,

his

bias

petuating

its

vitiate

tee

clinical science that

Determinants

bias can

detection

research.

counter-hypotheses,

ph

have created the

data while ignoring important "soft" data

1.

1-

of

"soft"

J

because

information

this

scientifically

neglected

but often irrelevant or erroneous. By per-

creating major biostatistical

Compliance

selection

is

compliance should

be added 7 and chronology bias as another prime source of the confounding variables that produce fundamental errors in biostatistical analysis. Confounding variables in biostatistics are like counter-hypotheses in any other form

delusions.

abandoned

or

pa-

the

to

References

suffice

talking

who have

for detecting cardio-

vascular events.

These

medicine:

clinical

tient. The- investigators

hazard of a

tinue to return to the

this

of

to restore attention to a traditional activity

of

non-COmplUmi

therapeutic

drop-OUt patients the

per-

To acquire the data needed for analyzing compliance and ruling out the existence of compliance bias, investigators will have

it

achievable,

In

equally

is

regardless

their

investigators thai

on the "incon-

ing.

the cardiovascular rates

If

for

several

signal

events

everyone,

for

surveillance,

out, so thai the detection

cardiovascular

oi

patients to continue

medical

intensive

the)

it

MRFIT

the

all

an-

analyses,

of trying to avoid the cited error

receive

to

conveniently available,

concentrating

venient" variables that create the confound-

hort studies. In the

are

that

of

J. A. M. A. 206:565-568, 1968.

6. Feinstein, A. R., and Levitt, M.: The role of tonsils in predisposing to streptococcal infections and recurrences of rheumatic fever, N. Engl. J. Med. 282:285-291, 1970.

7. Feinstein, A. R.: Clinical biostatistics. X. Sources of 'transition bias' in cohort statistics, Clin. Pharmacol. Ther. 12:704-721, 1971.

8. Feinstein, A. R.: Clinical biostatistics. XI. Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971.

9. Freis, E. D.: Organization of a long-term multiclinic therapeutic trial in hypertension, in Gross, F., editor, with the assistance of Naegeli, S. R., and Kirkwood, A. H.: Antihypertensive therapy. Principles and practice. An international symposium, New York, 1966, Springer-Verlag, pp. 345-354.

10. Freis, E. D.: Personal communication.

11. Gillum, R. F., and Barsky, A. J.: Diagnosis and management of patient noncompliance, J. A. M. A. 228:1563-1567, 1974.

12. Gordis, L., Markowitz, M., and Lilienfeld, A. M.: Why patients don't follow medical advice: A study of children on long-term antistreptococcal prophylaxis, J. Pediatr. 75:957-968, 1969.

13. Haynes, R. B., and Sackett, D. L.: An annotated bibliography on the compliance of patients with therapeutic regimens, Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Centre, Hamilton, Ontario, Canada, 1974. Mimeographed pamphlet.

14. Kaelber, C. T.: Quoted in Medical News, J. A. M. A. 227:1243-1244, 1974.

15. Lorenz, K. Z.: The fashionable fallacy of dispensing with description, Naturwissenschaften 60:1-9, 1973.

16. Mazzullo, J. M., Lasagna, L., and Griner, P. F.: Variations in interpretation of prescription instructions. The need for improved prescribing habits, J. A. M. A. 227:929-931, 1974.

17. Roth, H. P., Caron, H. S., and Hsi, B. P.: Measuring intake of a prescribed medication. A bottle count and a tracer technique compared, Clin. Pharmacol. Ther. 11:228-237, 1970.

18. Stewart, R. B., and Cluff, L. E.: A review of medication errors and compliance in ambulant patients, Clin. Pharmacol. Ther. 13:463-468, 1972.

19. University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. Part I: Design, methods, and baseline characteristics. Part II: Mortality results, Diabetes 19(Suppl. 2): 747-830, 1970.

20. Veterans Administration Cooperative Study Group on Antihypertensive Agents. Effects of treatment on morbidity in hypertension. II. Results in patients with diastolic blood pressure averaging 90 through 114 mm Hg, J. A. M. A. 213:1143-1152, 1970.

21. Wilson, J. T.: Compliance with instructions in the evaluation of therapeutic efficacy. A common but frequently unrecognized major variable, Clin. Pediatr. (Phila.) 12:333-340, 1973.
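The pooled-rate arithmetic in this chapter's compliance-bias illustration (the 12.5% cardiovascular rate found in both randomized groups) can be checked with a short script. The sketch below is in Python and is not part of the original essay. Its function and variable names are hypothetical, and it assumes one particular reading of the bracketed figures: each numerator term is the fraction of the whole randomized group that belonged to a stratum (patients who continued, or patients who dropped out) and had a cardiovascular event, and each denominator term is that stratum's share of the group.

```python
from fractions import Fraction

# Assumed reading of the text's figures, per randomized group:
# stratum -> (events as a fraction of the whole group, stratum's share of the group)
NO_DIET = {
    "continued": (Fraction("0.11875"), Fraction("0.95")),
    "dropped_out": (Fraction("0.00625"), Fraction("0.05")),
}
SPECIAL_DIET = {
    "continued": (Fraction("0.0495"), Fraction("0.69")),
    "dropped_out": (Fraction("0.0755"), Fraction("0.31")),
}


def pooled_rate(group):
    """Event rate for everyone randomized, regardless of drop-out status."""
    events = sum(e for e, _ in group.values())
    share = sum(s for _, s in group.values())
    return events / share


def stratum_rate(group, stratum):
    """Event rate inside one stratum (for example, only those who continued)."""
    events, share = group[stratum]
    return events / share


if __name__ == "__main__":
    for name, group in (("no diet", NO_DIET), ("special diet", SPECIAL_DIET)):
        print(
            f"{name}: pooled {float(pooled_rate(group)):.1%}, "
            f"continued only {float(stratum_rate(group, 'continued')):.1%}, "
            f"dropped out {float(stratum_rate(group, 'dropped_out')):.1%}"
        )
```

Under that reading, both pooled rates come out at exactly 12.5%, as in the text, while the rate among special-diet patients who continued under observation is about 7.2%. An analysis confined to compliant patients would therefore compare roughly 7.2% against 12.5% and could credit the diet with a benefit that disappears once the drop-outs are counted.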

SECTION TWO

OTHER ARCHITECTURAL PROBLEMS

The difficulties noted in the preceding section can be magnified or embellished in the several ways that are discussed in the next few chapters. The first problem arises from a misplaced confidence in the prophylactic or remedial powers of statistical consultation. The statistician usually has many valuable tricks up his sleeve, but the sleeve may sometimes be empty or the trick may be a delusion.

The second problem arises as another misplaced confidence, caused by the disparity between mathematical ideas based on random sampling and the medical reality of "samples" that are never selected randomly. Beyond the potential bias of a "rancid" sample, an investigator can add further distortion to the data by using the unequal examination procedures that create "tilted targets."

A prominent source of confusion is the ambiguity with which the concept of "control" is used and abused in the design of research. In most scientific plans and in statistical courses on "experimental design," the control is the comparative maneuver, but neither scientific nor statistical instruction contains specific attention to important issues in choosing either the comparative maneuver or the control group of people who receive it. To confound the confusion, the idea of control also has been applied to at least ten additional ideas in medical research.

A "case-control" study is the most frequent situation in which the idea of control is diverted from its customary scientific connotation and is applied to an effect rather than a cause. In case-control research, a group of diseased people is compared against a group of controls who do not have the disease. The comparison can be used in a "cross-sectional" study to examine the diagnostic utility of a particular marker or test in discriminating between diseased and nondiseased people. Alternatively, the case-control study can be "retrospective," aimed at examining an etiologic suspicion. In both types of case-control arrangement, the standard forward architecture of scientific research is drastically altered and the choice of a suitable control group becomes the crucial feature that determines the value of the results. Since mathematical principles again offer no help in making this choice, the decision requires careful scientific strategies.

Although diverse mathematical tactics have been developing for manipulating the quantitative results of both types of case-control study, rigorous standards have not yet been established for the scientific principles of a satisfactory research architecture. In diagnostic case-control studies, the key issue is the degree of discrimination that is sought within the diverse spectrum of diseased and nondiseased people. In etiologic case-control studies, the key issue is the development of suitable procedures to avoid the many biases that are inevitable when cause-effect reasoning is conducted in a logically backward temporal direction.

CHAPTER 10

Statistical malpractice—and the responsibility of a consultant

With the same name, this chapter originally appeared as "Clinical biostatistics. VI." in Clin. Pharmacol. Ther. 11:898, 1970.

A pathologist has often been called a "doctor's doctor." In situations in which necropsy, biopsy, or cytology can provide confirmation for diagnostic reasoning, the pathologist is the consultant to whom clinicians have traditionally turned for verification or occasional refutation of the decisions reached during the preceding diagnostic activities.

The statistician has now become the "researcher's doctor." He is the consultant to whom investigators regularly turn for advice about decisions made during research. A statistician is the consultant who is often asked to check the design of a research project, to plan the analysis of the data, and to help interpret the results.

Statisticians have not always held this crucial role in the world of medical research. Until the past few decades, clinical investigators were strongly distrustful and sometimes actively antagonistic about the application of statistics to medicine and the scientific value of statistics. When Pierre Louis, more than a century ago, was developing and advocating his "numerical method" for the appraisal of therapy,19 his clinical espousal of statistics was opposed by most of the leading clinicians of the day.11, 32 The sentiment was summarized in the remarks of the renowned Armand Trousseau:34

"If statistics were not rated too high, if it were not considered as the very keystone of the arch of all ... science, I should only praise it, for I believe it really useful; but so much noise made about such poor results. ... I reproach (the statisticians) for counting too much and for declining to put any mind into the facts. ... This mathematical exactitude ... is only a relative precision, because it changes under the observation of the same man, according to the year, the season, and the reigning medical constitution."

A century ago, when Claude Bernard was establishing experimental discipline in medical research, he denounced "mathematicians (who) simplify too much and reason about phenomena as they construct them in their minds, but not as they exist in nature." Saying that statistics "can never yield scientific truth (or) establish any final scientific method," Bernard urged clinical investigators to "reject statistics as a foundation for experimental therapeutic and pathological science."1

From beyond the clinical world, the fortune of statistics has borne such recent taunts as Darrell Huff's20 provocative book, How to lie with statistics, and Arthur Koestler's remark that "Statistics are like a bikini bathing suit: what they reveal is interesting; what they conceal is vital." From within the statistical fraternity, an "anthology" of diverse misuses and abuses

of statistics has been provided in the text by Wallis and Roberts.

Despite these caveats, statisticians have managed not only to endure in the world of clinical research, but more recently to prevail. The editors of good medical journals now insist on appropriate statistical reviews before manuscripts are accepted for publication, and in one leading journal the word "significance" has been removed from general circulation and reserved for use only in a statistical context. At the extremes of mathematical obeisance, the editor of a psychologic journal has decided to accept only manuscripts that have the "super-significance" demonstrated by P values of less than 0.01.* In concepts about etiology of disease, statistical "proofs" for the causes of chronic diseases are now accorded the respect that was once reserved for Koch's experimental postulates about acute infectious disease. Statistical validation has been emphasized generally by the FDA in a recently issued series of guidelines for sanctioning claims about new drugs. Advance approval and concomitant participation by statisticians have become a prerequisite demand before large-scale clinical trials will be funded by suitable agencies. And courses in statistics have become standard parts of the curriculum leading to doctoral degrees in either medicine or biology.

To achieve this status, the statistician may have had to fight an uphill battle and may still bear many scars of the conflict, but he has now clearly arrived. Furthermore, his consultative authority has been expanded in recent years to include prevention, and not just diagnosis and treatment, for the "ailments" of clinical research projects. His assistance, which was formerly sought remedially after a project was completed, is now often solicited prophylactically before the project begins. His ideas, which were formerly applied mainly to the choice and application of statistical tests after the investigator had planned the research, are now often solicited early enough to affect and sometimes to govern the basic design of the project.

Statisticians have welcomed the new challenges and have recommended this expansion of their concepts and roles. After a generation of exposure to books and courses on "experimental design," the statistician has begun to feel relatively confident about his statistical skill in planning research, and, for several decades, statisticians have urged that they be consulted "before you begin the project, not afterward." In receiving the constantly increasing flow of investigators who come as "patients," the statistical consultant has willingly accepted the authority, prestige, and other rewards that go along with his role as the researcher's doctor.

As every practicing medical consultant knows, of course, a patient's doctor cannot accept authority without also accepting responsibility. A physician who takes on the problems brought to him by a patient also becomes responsible for what he does in their management. He must be fully attentive to the subtleties as well as the grossly overt aspects of the problems; he must not perform procedures for which he is untrained or unqualified; he must guard against breaches of accepted ethical standards; and he must be ready, when necessary, to defend his actions if they are questioned by a jury of his peers or in a court of law.

In accepting authority as a consultant in clinical research, how well has the statistician accepted the responsibility? What sort of "licensure" or "boards" are used to test his qualifications, and what kind of "pathologist" or "review panel" serves to detect his failings? When he performs experiments by applying untested theories or unproved models to a research project, does he obtain informed consent from the investigator who is his "patient"?

*The hegemony of statistical doctrines has not been confined to biomedical publications. In the literature of social science, the traditional deference to a statistician's imprimatur has recently been subjected to the "radical" proposal that primary attention be given to the concepts and methods of the research, rather than to the data and the statistical analysis.38

By what kind of Pythagorean or other oath does he pledge the quality and ethics of his practice? How often does he commit malpractice? How are his instances of malpractice detected, guarded against, or compensated for? These issues have rarely been discussed in the literature of statistics and are seldom considered during a statistician's education.

Sporadic papers have appeared about the errors committed during consultative activities,29, 36 and the retiring president of a statistical society may occasionally use his farewell address to warn his colleagues about certain intellectual or practical blunders.33, 30 In such an address two years ago, Frank Yates39 complained that "the standard of much day-to-day statistical work is regrettably low. ... These matters are primarily the responsibility of the university statistical departments and are a direct consequence of the present-day obsession with advanced theory, largely divorced from practice." Yates also quoted an earlier concern of Sir Ronald Fisher: "We are quite in danger of sending highly trained and highly intelligent young men out into the world ... with a dense fog in the place where their brains ought to be."

These random censures by leaders of the profession have not brought a tradition of introspective review and constant self-criticism to statistical consultants. In the educational processes of statistics, there is no counterpart of a clinicopathologic conference, in which a consultant's errors are regularly sought, identified, and revealed for public discussion. In the meetings and publications of biostatistical societies, the parochial contents may prevent statisticians who work with clinical topics from receiving regular exposure to comments or suggestions from connoisseurs of the topics. Although statisticians are constantly asked to speak at clinical meetings, to write instructive papers for readers of clinical journals, and even to referee the papers submitted to those journals, the converse phenomenon does not occur in the world of biometry. The editors of biostatistical journals do not invite clinicians either to write or to review appropriate papers, and the people who plan statistical seminars on topics in clinical research almost never solicit the attendance of expert clinicians who might contribute occasional insights and touches of reality to the proceedings.

The need for vigilant critical self-appraisal by biometricians has been accentuated in recent years because of the increased opportunity both for statistical consultants to commit malpractice and for the malpractice to have catastrophic effects. In the days when a statistician worked mainly post hoc to perform statistical analysis for a completed research project, his choice of the wrong analytic procedures might create an intellectual nuisance, but would not affect either the basic design or the primary data of the project. After the statistical errors were discovered, they could always be rectified later with a better set of analyses. If the statistician gives improper advice before a project begins, however, he can distort both the design and the data, so that the project may not be salvageable later. After the misdeeds in planning are recognized and corrected, the entire project would have to be repeated. The repetition may be not too difficult for relatively small projects, but such projects are seldom the ones for which statisticians are asked to provide advance consultation. The type of research that is most difficult to repeat is a massive project in which many observations are performed during many years. This type of project, as exemplified by a large-scale cooperative clinical trial, is the type of research that has been increasingly delegated to statistical authorities for design and that offers them the greatest opportunity for transgressions that cannot be remedied easily, if at all.

The complex logistics of a large-scale clinical trial create major difficulties for every aspect of the review, appraisal, and

verification that research must receive to be accepted in the scientific community. For purposes of review, the primary data of a modest-sized project are readily available for inspection, but the raw data of a mammoth clinical trial may be diffused among a vast array of coded pages or transformed into a dense bulk of computerized conversions. For purposes of analysis, the numerical summaries of a relatively small project can generally be shown entirely in simple tabulations; but in a large clinical trial, even the summaries may be too abundant for complete citation. In the material selected for publication, crucial information may be omitted, dispersed among a plethora of tables, or obscured by conversion into percentages, regressions, and other statistical "adjustments." And the clinical trial can seldom be repeated. The repetition would require assembly of another group of investigators who are willing to devote many years of labor in a new effort and who are able to acquire the necessary new funds for its support. Even if new investigators and funds can be obtained, however, the new data may still not settle an existing argument, because the previous workers may claim that the populations under investigation were different.

For all these reasons, the basic strategies of design and analysis in a clinical trial must be particularly circumspect and above suspicion. When disputes arise about the conclusions, deficiencies in planning the basic statistical strategies can have devastating consequences. The investigators will have worked prodigiously for results that cannot be either reliably analyzed or confidently utilized. In addition to the effort already invested in the project, large amounts of energy may be consumed in the controversies that follow as the participating investigators and statisticians, committed emotionally to their years of devoted but possibly misdirected labor, become even more fervently committed to defending their results against the inevitable criticism. The sponsoring agency, reluctant to recognize or admit that it has expended large sums of money for a badly flawed product, may leave its established position "above the battle" and become actively embroiled as a partisan in the disputes. The patients whose future therapy should have been enlightened by the results are left instead to the uncertainties of medical dissension about whatever treatment has emerged as extolled or condemned.

Although misdesigned clinical trials have not yet been given specific attention, some of the problems in meeting the general challenges of statistical consultation have recently begun to receive public comment among biometricians. Within the past few years, these problems have been contemplated in major papers, abstracts, and letters to the editors of such journals as Biometrics, Technometrics, and The American Statistician. A particularly extensive presentation and discussion of the issues recently appeared in the Journal of the Royal Statistical Society. My object in the remainder of this discussion is to augment this established foundation of constructive criticism.

Having worked bilaterally as a clinical investigator and biostatistical consultant for many years, I have had an unusual opportunity to observe defects in activities that occur on both sides of the street. In several recent publications, I have described what seem to be some of the main failings of my fellow clinicians.10-12 I would be less than fair to my statistical colleagues if I neglected to note some of the apparent imperfections on their side of the street.

To illustrate the problems, I shall contrast the way that statisticians and clinicians are prepared for and perform their activities as consultants. The clinical analogies seem appropriate for these illustrations, because clinical consultants have a tradition and heritage that is more than 2,500 years old. Statisticians may not always admire the way that clinicians recite experiential anecdotes, make unquantified judgments, and deliver personal care, but

Statistical malpractice

—and

the responsibility of a consultant

we cannot deny

lent

deliberately

comes

that clinicians have been prepared for their role as consultants, and that they have had enormous experience and frequent success. As statisticians, we might be able to profit by studying their methods.

Formulating a problem

1.

In consulting a medical doctor, a pais not expected to express his prob-

consultant,

statistical

connoisseur

a

scientific

of

141

be-

course,

of

necessary

the

and constantly recog-

concepts

nizes their intellectual priority in the de-

But he must depend on and ability to make these perceptions after he begins his consultative activities. The details and importance sign of research. his

willingness

of the scientific principles are not usually

tient

transmitted as part of his undergraduate

lem in a clear, articulate, well-organized manner. The patient believes that the doctor will know what questions to ask and

or postgraduate instruction.

know how

will

manner

a

in

to organize the information

suitable

and planning

the problems

characterizing

for

their

ment. Consequently, one of the

manage-

first

that a clinical consultant learns

is

things

an

in-

and arrang-

tellectual structure for getting

ing the information used to formulate pa-

The

problems.

tients'

reflected

structure,

The preparation tant thus

for

work

as a consul-

contains antipodal contrasts in

the education of clinicians and statisticians.

A

taught to identify and formuproblems in a carefully structured manner; but he is then left to develop diverse tactics of "judgment" for clinician

is

late patients'

A

managing the outlined problems. tician

is

statis-

taught a carefully organized set

managing

of mathematical structures for

and physical examination, contains such components as the chief complaint, present illness, and review of systems, and the formulation is

an outlined problem; but he is left to develop diverse judgmental methods for

expressed in such terms as diagnosis, patho-

press

in the contents of the history

genesis, prognosis,

A

and therapy.

The

clinician

the

the

find

however, has not been specifically trained in an intelstatistical

consultant,

The

scientific

statistical

courses in

"experimental design" are inadequate both for

the

of

search.

and for the "dework done in clinical re-

"experiments"

the

sign"

Many

clinical

projects

are

con-

methodologic explorations, not as experiments, and even ducted for

as

projects

surveys that

or

are

truly

experiments,

right

fortunately,

if

he could rely on

provide

to

often unable to provide a well-organized account of the "chief complaint" or other intellectual maladies of his problems in

The clinician may have learned be a medical consultant for pabut not how to be a "patient" for

research. to

statistical consultants.

the clinician has also

plan the basic architecture of clinical research, and a specialized knowledge of statistics does not become cogent until the

major parts of the architectural

design have been completed. 15

An

excel-

his

Un-

however, in appearing as a

do not contain details fundamental scientific concepts

to

outline.

is

tients,

the

the

"patient" for the statistician, the clinician

and explanatory experiments. 14 In "design," of

of

details

architecture in clinical research

"patients"

critical

needed

answers but un-

would not be a major occupational hazard

how

statistical principles

may

statistician

outlining

in

difficulty

for a statistician

statistical principles

do not distinguish the differences between interventional

the

able to select the questions.

necessary data and formulating the logical

problems of a research project. As noted

but unable to

questions

right

answers;

emerge with the

lectual discipline suitable for acquiring the

earlier in this series,

and formulating the problem. may emerge able to ex-

identifying

Like the statistician, the clinician has also had no rigorous instruction in the scientific methodology of investigation. The clinician's "basic science" courses in medical school taught him an array of facts and laboratory procedures that are not readily applicable in clinical investigation 17; his courses in statistics,

i

Other architectural problems

142

were not concerned with

any,

scientific

architecture in research and usually dwelt

mainly on theories of probability and on techniques

training

anecdotage and

unquantified precision

performing statistical tests; exposed him to the

l

clinical

his

he often

that

he

how

about

any

receive

and nowhere

formal

express

to

objective in his research and late

sequence

a

oi

instructions

clear,

a

how

succinct

formuprocedures

scientific

for attaining the objects

im-

escape

to

seeks

getting consultative help;

1>y

did

logical

to

has

about

ideas

know how

not

from

about

comes

"doctor"

statistical

a

nomenon he

not

objective of

He may

become- reticent about expressing

anything significant

got

he says "I'm

nomenon

X

purpose.

assessed

be-

may become

He-

refer only

to

is

when

studying phe-

in

what aswhat

X." without indicating

of

"If I've

or

for

imprecise" and

a nebulous "judgment" and

to

"experience"

statements of criteria, evidence of valida-

to seek

help

who knows

tion,

or

when he

tests

reproducibility

of

phenomena under

A

asked to provide

is

who was

consultant

clinical

with

the

for

investigation.

such

con-

poor history-giver would know what to do about the situafronted

The

a

knowledge

clinician's

of

the

necessary medical outline would help him

taking a history, any good clinician

cessivel)

here,"

pre-

and quantification, hut know how to express ideas

to deal

interested

arrives with

know

a pile ol data and wants to

tion.

In

when he

a specific objective

project.

who

Taking a history

knows how

current

the-

but

about research. 2.

studying, without ever stat-

is

ing the

precision

who may

interminably about

talk

them

research

to express

cisely or quantitative!)

may

I It-

work he has done for the past 10 years and may discuss a variety of conjectures about the mechanism of the phe-

pect

e.

The biostatistical consultation may thus become a peculiar paradox. A "patient"

who may

search.

the

with a patient

garrulous, reticently

who

is e.\-

uncommunicative, or flagrantly imprecise. The clinical consultant will usually interrupt garrulosity, probe reticence, and delineate imprecision. Moreover, if the patient is unable to provide a satisfactory history, the clinician knows how to improve it by enlisting the aid of family, friends, or interpreters. The clinician becomes adept at developing these interrogational skills not only because he has learned a basic intellectual architecture to indicate what information is necessary, but also because he knows that he is obligated to get the information. The standards and traditions of his consultational craft demand that he do so.

As a statistical "patient," however, the clinician may, for the reasons cited previously, give a poor history. Aside from knowing that he ought to have controls and that randomization and double-blind are good things, the clinician "patient" may be garrulous, reticent, or imprecise in trying to describe the "ailment" of his research.

determine what information to get, and his traditional obligation to get the information would evoke the necessary effort to do so. The statistician, however, may have neither the knowledge nor the obligation appropriate for his scientific task. Having no specific outline as a guide that might ease the history-taking and having no professional traditions that create a responsibility for extracting the crucial information, the statistical consultant may meet the challenge with several varieties of evasion.

One type of evasion is used for an investigator who comes with a collection of assembled data but who has difficulty describing what the research was about and what the data are supposed to show. The statistician gets rid of the "patient" by saying, "Let me have your data." In isolated comfort, the statistician can then process the data with a series of analytic statistical tests.

Statistical malpractice - and the responsibility of a consultant

A somewhat different evasion is used for an investigator who comes with an imprecise proposal for a clinical research project. After a few questions about confounding variables and hard endpoints, the consultant says, "Let me have your protocol," which can then be contemplated and manipulated in the privacy of the statistician's theories about "experimental design."

In both types of evasion, the statistician deprives himself of an adequate history and formulation of the scientific problems. Working in a Procrustean manner, he may take his knowledge of statistical tactics and adapt the research problems to fit those tactics, instead of choosing or adapting his tactics to fit the problems.

3. The management of therapy

With advances in technology, clinicians may sometimes order myriads of laboratory tests rather than take a history or think. The availability of digital computers has had similar effects on statistical consultants. In the "ancient" days when the statistical computations all had to be done with a desk calculator, a statistician had the incentive to be selective in choosing analytic tests, if for no other reason than to spare himself the labor of the calculations. If data were available for 14 variables, only two of which were deemed important, the statistician would analyze those two variables.

Today, with almost no effort, a statistician can get a computer to massage all 14 variables in diverse ways, to perform "factor analyses" on the lot, and to prepare a matrix of correlations for each variable with every other variable. Receiving the enormous pile of print-out, the investigator may have to spend days or weeks searching for what he wanted. He may not find it at all, but the magnitude of the computations will usually convince him that the consultant has done a "thorough" job.

A different statistical counterpart of defective clinical practice occurs when the statistician "treats" the protocol of a clinical trial without having established a careful "diagnosis" of what is wrong or what needs to be done. Clinicians may sometimes, in "shotgun" manner, dispense antibiotics for fever or tranquilizers for anxiety, without probing deeply into the sources of the fever or the anxiety. But statistical consultants may also, in an ex cathedra pronouncement, prescribe fixed dosages, "pure" populations, randomization, double-blind procedures, dimensional measurements, and hard endpoints, without determining whether these standard agents of the statistician's "therapeutic armamentarium" will distort the objective of the research or invalidate the results.

The practitioner of this type of statistical shotgun therapy may be as difficult to convince of his malefactions as is the clinician who uses antibiotics and tranquilizers indiscriminately. As justification for the treatment, the clinician can usually point to the patient's subsequent improvement, and the clinician may become incensed if his judgment is later questioned for having caused possibly needless expense and risk to a patient who might have either improved without the chosen drugs or improved more rapidly with others. As justification for shotgun statistical treatment, the statistician can usually offer certain obvious improvements of a faulty protocol, and he may become incensed if his efforts in bringing well-established statistical principles to the project are decried as being so statistical that the results are consummately not meaningful in clinical science.

A shotgun statistical design of a clinical trial can create many problems that have been discussed in detail elsewhere. 11,15 The basic difficulty is that the results of the trial become useless for scientific clinical reality, because crucial clinical activities were either omitted or distorted in order to fit the demands of statistical approbation. A drug that ordinarily requires flexible dosage has not been properly tested if given in fixed dosage. If patients with comorbid diseases are excluded from a trial in order to test a "pure" population with the "main" disease, the results may be free of confounding variables but also free of any realistic significance for the "impurities" constantly encountered in patients. The results of a trial cannot be related to clinical practice if prognostically heterogeneous patients are combined, randomized, and analyzed as a single group without previous or subsequent division into homogeneous prognos-

The

the risk of producing meaningless science.

This type of consultative advice

mount

who

patient

a

to

physical

exercise

therapy.

A good

allow

not

and

to giving only pills

is

tanta-

injections

needs

vigorous

intensive

psycho-

really

or

consultant will

clinical

pharmaceutical

substitutes

to

wasted are used

need to work out his own problems, and a good statistical con-

and everyone's time is wasted when double-blind tactics are needed but omitted in circumstances in which their application might have been

sultant should not allow clinicians to es-

tic-

Strata.

investigator's time

is

when double-blind procedures unnecessarily,

requiring

difficult,

second

a

ticipating observers.

The

set

of

par-

objective of the

research becomes distorted

when

the quest

for dimensional

measurements creates the in which the index variables are transferred from the qualitative facts that the clinician really needed to know into unimportant data that he can measure. The conduct of the trial becomes an "ordeal " when the desired clinical targets that are abundant in "soft "substitution

game."'"

1

uncommon may require

data" are displaced into

data" lessly

"hard

endpoints that needprolonged observation of needlessly

huge numbers of patients. Aside from the practical difficulties created by shotgun designs, the continued acceptance of such procedures is detrimental to the progress of clinical science.

One

main reasons

of the

for the neglected

prognostic heterogeneity, displacement of

and other defective

targets,

cited

using data.

is

tactics

just

that the statistician wants to avoid

imprecise

He

concepts

and unreliable

eliminates patients with comorbid

diseases because the clinician has not ade-

replace a

cape

patient's

the

inertia.

and prognostic

strata

make

that clinicians

If subjective

so.

are

"harden" the important "soft" data while

analyzed because clinicians have not de-

and

the

"soft

specifications

data"

variables

of

the strata;

are

displaced

because the data are neither observed nor interpreted in a reproducible manner. Rather than accept information that has little

tician tific^.!

or no

may lv

scientific

validity,

prefer to do

and

the statis-

what seem

statistically

valid

scien-

—even

at

clinician

has not yet developed the intellectual

"muscles" that will enable him to "walk"

he

if

develop

will never

scientifically

avoids

constantly

by being pushed

effort

in

this

skill

necessary

the a

statistical

"wheelchair."

By absolving

the clinician of the need

own important data and judgmental activities, statistical consultants perpetuate the atmosphere that evoked such scientific and clinical distress about statistics more than a century ago. Said improve

to

Trousseau, 33

Armand method it

his

is

the

"This

scourge

(statistical)

the

of

intellect:

transforms the physician into a calcu-

machine, making him the passive which he has massed

lating

lineated

A

preserving them for analysis.

who

slave of the figures

not

are

do

"soft" data

they should not

important,

critically

and

be replaced by what is objectively measured, "hard," and irrelevant. The statistician should demand, instead, that the clinician develop the observational procedures, indexes, and criteria 10 that will

quately classified the diseases or their constrata

own

their

suitable efforts to

variables

sequences;

prognostic

of

comorbid diseases have not been proper-

If

the statistician should insist

classified,

ly

defects

scientific

intellectual

up.

.

.

.

You wish

the pupil to see only

crude facts and to stifle his intellect: and when, by means of this dismal labor, his mind has been to some extent mutilated, you will ask him to show mental vigour

.

Claude scientific

our

.

.

and

Bernard, 2 ideas

power,

thought."

prolific

"Against

we must

because

these

Said anti-

protest with

they

help

to

all

hold


medicine back in the lowly state in which it has been held so long."

4. The arrogance of power

Because so many human illnesses are self-limited and will subside no matter what is done clinically, every medical doctor will usually have a high "success" rate in his therapeutic activities. This natural rate is augmented by the dramatic cures achieved with antibiotics, surgery, and other modern therapeutic agents, and even when his therapy fails to cure, the doctor's personal concern and compassion can provide comfort. The frequency of these successes makes most patients devoted to their doctors, and the emotional overtones associated with human sickness may make many patients devoutly worshipful of the consultant whose work seemed to provide relief or remedy.

This type of deification is one of the main intellectual hazards of a clinician's occupation. A thoughtful clinician, recognizing the deifying propensity of his patients, may often use it therapeutically, but he must constantly beware that he does not begin, consciously or unconsciously, to believe in all the power, glory, and omniscience that have been attributed to him. Not all clinicians, however, are able to resist this belief, which may lead to various forms of arrogance in the way a clinician thinks about his activities, receives critical appraisal, and interacts with other people.

Although a clinician's basic personality and character are his main prophylaxis against this behavioral malady, certain professional safeguards can help protect him and the public. He must pass standard examinations to be licensed as a practitioner and additional "boards" to be certified as a specialist. An internist receives frequent diagnostic "exposure" at clinicopathologic conferences, and a surgeon's removal of tissue is constantly reviewed by special committees. The procedures for accreditation of hospitals provide a means of reviewing medical record-keeping and for ascertaining that all deaths are checked either at necropsy or with other types of evaluation. A clinician who participates in the teaching activities of a hospital will receive frequent interrogation and intellectual provocation during his "rounds" with house staff and students. The various recent demands for suitable instruction and "disclosure" to patients have improved the ethical tactics of clinical investigators, and the spreading epidemic of legal suits for malpractice has made all practitioners more careful of their decisions and explanations. These activities alone are obviously not adequate to thwart the development of unrestrained arrogance in a clinician's exercise of the power delegated to him, but they can help. And although a clinician may be able to "bury his mistakes," he is seldom able to hide them from himself or his colleagues.

A statistical consultant, by contrast, has no such restraints in his professional activities. There are no types of licensure or accreditation procedures after he receives his academic degree; there are no routine reviews by outside critics of his professional performance; there are no younger minds regularly available or assigned to probe and question his consultative judgments in each of his tasks; there is no code of ethics to keep him from "experimenting" without adequate explanation to his "patients"; there is no threat of legal action to make him worry about engaging in malpractice. The statistician need not even bury his mistakes. If adequately glorified, they can be published.

Moreover, although patients have become increasingly wary of their clinical consultants and are less likely to engage in abject, mute worship, the statistician's clientele now seems more credulous, taciturn, and reverent than ever before. In 1932, Major Greenwood 20 identified the early stages of this trend:

Even as recently as 20 years ago, medical writers would still peremptorily challenge the demand that their data should be treated


at

istically

all

.

seeks for

some

the

application

of

"significant"

(but) the writer

.

.

now

which

result--.

.

believing

to

trillers

patentees of more

This

that

< i

improvement

the

qualifii atJons

and

inference,

but

area

the

in

sideration

reflection

too

all

of

his

The

clinician has

many

.

in

his

to

reasons for this re-

consultant.

statistical

Some of the reasons are the external demands imposed by the contemporary fashions of editors

the

ol

the

senting

clinician's

and granting agencies, but reasons

are

insi entities

absence

repre-

internal,

caused

by

the

training in scientific

of

maintain

to

or

may

process

large

be delighted delegate the entire planning of these of data,

he-

also

statisti-

own ignorance about methodology and data processing, impressed by the statistician's credentials in "experimental design," and obsessed by the hope that the statistical procedures will offer panaceas for all ailments in research, the clinician "patient" is often deeply grateful merely to be admitted to the office of a statistical "doctor." Anything given as help thereafter is received humbly and reverently, particularly if it leads to the acceptance of a

methodology and by misdirected training

paper

in

posal for funding. Since the clinician

statistics.

The

likelihood

a statistical consultant in

homage

gullible

of

is

to

particularly high

populational research such as clinical

trials.

A

laboratory

clinical

who

inyestigator

research

usually

has

ideas about experimental design

does

needed

that to

epidemiolo-

may

trials,

to

sel-

done

seldom critical of it, be critical, his only lament max be about the statistician's oche

statistically,

and,

is

when asked

to

r

may be due to the clinician's insecurity about statistics, but he may also fear any comments that might make him seem an "ungrateful patient."

Receiving this awed idolatry from clinicians who are themselves regularly idolized,

the

ticularly

consultant's

susceptible

xeloping the

consultant to

the

same forms

of

risk

is

of

par-

de-

intellectual

arrogance that may occur in any authority whose pronouncements are regularly accorded an unquestioned omniscience.

controlled

remedy the de-

fects of earlier therapeutic studies,

knowing how

been

casional delays in gixing help. This reluc-

have been obtained in laboratory activities, in which precise objective data can be obtained with equipment constructed by an engineer, requiring no demands on the clinician's ability in clinical observation and

Knowing

what has

understands

tance to appraise the quality of the help

gy. His previous research experience

clinical trials are

dom

definite

and usually asks the statistician afterwards only for technical assistance in performing the analytic tests. The clinician who does populational research, however, may have had little or no exposure to

interpretation.

for publication or of a research pro-

and analy-

sis,

scientific principles of clinical

for

statistician's

scientific

.

delegate authority and

to

sponsibility

.

the-

has had almost no training

clinician

how amounts

cal-computer group. Thus, aware of his

the

today

to

chores to the statistician or to the

con-

experimental

responsibility

tiiese ate. is to the statistician.

willingness

careful

into

frequently

abdicates

investigator

many

the

to

holds no special

the statistician

defective

readily accede,

science,

of

in

:

rtainly

clinician's

may

removal or denigration of "soft" clinical data and judgment. Furthermore, since

powerful magic. to

sake

the

are

statisticians

the

to

methodology, he

by Remington and

current state described

Schotk

of "experimental design." Because laboratory investigation brought no

were

statisticians

now advanced

has

trend

knowledge

.

no! thinking at

to

less

f census data about the number of people live in 1920 and in 1960. Our two numeratoi contain the number of people go;

life.

For example, accord-

of people with

while

dom

alive.

Since thrombophlebitis

a fatal disease,

mildlv

and

escape

diagnostic

victim

does

it

transiently

is

sel-

can often occur in

detection

episodes

that

because the

not seek medical attention.

Consequently, if we were planning to use any one of these three diseases as the numerators in a scientific survey, we should ascertain that the compared denominator populations were exposed to the same frequency and intensity of such procedures as electrocardiograms, chest x-rays, routine physical examinations, and other appropriate diagnostic tests.

The rancid sample, the tilted target, and the medical poll-bearer

Within the past 25 years, political poll-takers have twice been badly stung by neglect of the bias caused by "tilted targets." Although reasonably random samples were used for the denominator populations in polls taken before the United States national election of 1948 and the British national election of 1970, the careful sampling procedures were sabotaged with errors caused by two different types of tilting in the target data.

In the recent British election, the polled potential voters seemed more inclined to vote Labor than Conservative, but the pollsters did not ascertain whether the two groups of voters had the same intention of carrying their beliefs to the target of expression at the voting booths. When Election Day came, the Conservatives won because their supporters were more

diligent than those of Labor in reaching the target ballot box. As a result of the spectacular error in the predictions for this election, we can expect future political pollsters to inquire not merely about the opinions of the electorate, but also about the likelihood that the potential voters will actually vote.

In the American election of 1948, the target was tilted in a different manner. Because one of the candidates (Thomas E. Dewey) seemed to hold a commanding lead over his rival, the pollsters did not continue testing beyond late September and early October. By the time Election Day arrived in early November, however, many voters apparently changed their minds, and cast ballots contrary to opinions previously expressed in the polls. 20 In a stunning "upset," Harry S Truman won the election. Consequently, political pollsters (at least in the United States) now continue their activities as close to Election Day as possible.

Perhaps the single greatest flaw in contemporary epidemiologic concepts and methods is the neglect of possible problems caused by a tilted target. Although the idea of random sampling has been developed and promulgated as a method for removing bias from the denominators, no substantial methodologic attention or investigation has been given to the possible distortions caused by bias in the numerators. Both of the cited problems in the tilted targets of political polls (the unequal registration of target and the temporal change of opinion) have many counterparts in epidemiologic surveys. These two scientific hazards have been discussed in greater detail elsewhere 12 and will be only outlined here.

The problems of unequal registration occur in groups of people followed concurrently whenever an initial feature of one of the groups or of the "causal agent" produces a disparate intensity of medical examination in the two groups. For example, an unequal frequency and scope of the examinations used to detect disease in compared groups may be responsible for some of the differences in the rates of coronary artery disease found in executives versus laborers, in the rates of lung cancer in smokers versus nonsmokers, and in the rates of thrombophlebitis for women who use "the pill" versus other forms of contraception. Executives are generally more likely than laborers to receive routine electrocardiograms and other periodic "check-ups." Because of the development of chronic cough, smokers may be more likely to receive chest x-rays than noncoughing nonsmokers. The increased medical attention needed for renewing prescriptions or checking other hormonal problems may create more routine examinations for women taking the "pill" than for women using other forms of contraception. Although all of the tilted targets in disease detection may be in these groups for the reasons cited, the existence of the tilts has never been investigated, and the possible magnitude of the bias in these and other epidemiologic surveys is currently unknown.
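The arithmetic of unequal registration can be illustrated with a small simulation. The sketch below is not drawn from any survey: the true disease rate, the examination probabilities, and the function name `observed_rate` are all hypothetical values chosen only to show how two groups with an identical true rate of disease can yield different recorded rates when one group is examined more often than the other.

```python
import random


def observed_rate(n_people, true_rate, exam_probability, rng):
    """Fraction of a group recorded as diseased when a case is
    counted only if the person was also examined (hypothetical model)."""
    detected = 0
    for _ in range(n_people):
        has_disease = rng.random() < true_rate
        was_examined = rng.random() < exam_probability
        if has_disease and was_examined:
            detected += 1
    return detected / n_people


rng = random.Random(0)   # fixed seed so the sketch is repeatable
TRUE_RATE = 0.05         # assumed: identical true rate in both groups

# Assumed examination probabilities: one group is examined far more often.
rate_often_examined = observed_rate(100_000, TRUE_RATE, 0.9, rng)
rate_seldom_examined = observed_rate(100_000, TRUE_RATE, 0.4, rng)

# The apparent excess of disease comes entirely from the detection procedure.
apparent_ratio = rate_often_examined / rate_seldom_examined
print(rate_often_examined, rate_seldom_examined, apparent_ratio)
```

Under these assumptions the recorded rates differ by roughly the ratio of the examination probabilities, although nothing distinguishes the two groups biologically. A real survey would have to measure the examination frequencies rather than assume them.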


The

temporal changes

effects of

populations,

The

ing the past half century, for identifying

different

problem

The

problem

occurs

targets.

in opin-

in

tilted

whenever

mortality rates for a particular disease are

compared

nonconcurrent

in

observed in different years or mortality rates

for

different

eras.

diseases

are

based on the data reported in death certificates, but a death certificate does not list diseases;

it

clinicians.

The

the diagnoses

lists

"vital statistics"

names

an alteration solely in the nomenclature, but many other diagnostic changes are due to the new technologic agents that have been developed, with increasing frequency dur-

ion create a

made by

derived each

illustrate

fashion

of

"disease."

diagnostic

Such procedures

as roentgenog-

raphy, biopsy, endoscopy, exploratory surgery, electrocardiography, electroencepha-

lography, chemical measurements,

immuno-

logic reactions, the various assessments of

year from death certificates do not indicate

microbial agents, and the entire phantas-

annual changes

magoria of contemporary laboratory tests are almost all a product of the past 50

in

rates

of disease;

they

indicate a changing rate of diagnosis.

Since

a

clinician

examines not a

"dis-

and increasfew decades.

increasingly available

years,

ease" but the clinical condition of a pa-

ingly used during the past

most contemporary names of "disease" depend not on observed evidence, but on

These new diagnostic adjuncts can be expected to have both an immediate and a delayed effect on the occurrence rates of

tient,

interpretations, or technologic expansions of the actual bedside evidence.

inferences,

Consequently, the "diseases" reported on death certificates will vary both with the diagnostic fashions popular among clinicians during any particular era in medicine, and with the diagnostic tools and tactics available for assigning names of "disease" to the observed clinical conditions. If the criteria and techniques of diagnosis can be shown to remain the same as time progresses, then the annual statistical tabulations may actually reflect an altered natural occurrence of the disease. But if the concepts and technology of diagnosis should change with time, the changing rates of many "diseases" noted on death certificates may be an artifact of target tilting, due to temporal changes of diagnostic opinion, rather than to new conditions of nature.

A simple example of this distinction is the disappearance of the disease dropsy, which was held responsible for so many deaths a century ago, but which seldom seems to kill anyone today, according to modern mortality data. The clinical condition of dropsy is still present in abundance, of course, but it is now called by other names, and its fatalities are usually attributed to some form of cardiac, renal, or hepatic disease. The modern changes from dropsy to congestive heart failure or other

the "diseases" that they identify. The immediate effect is on the "diseases" diagnosed at the medical centers where these tests are usually first developed and adopted; the delayed effect takes place over many years as the new adjuncts are slowly disseminated into use by practicing physicians in the communities beyond the medical centers. The availability and dissemination of these new diagnostic procedures will often differ from one year to the next, and certainly from one decade to the next, creating major changes in the clinical opinions that are offered as names of "diseases."

The temporal changes in diagnostic tactics are particularly important in death certificate data, since the clinician's final statement on the certificate is seldom subjected to revision or confirmation by necropsy. Necropsy is performed for only about 20 per cent of deaths in the United States today 22; it does not always yield a clear diagnostic answer when performed; and its results do not always appear on the death certificate, which is usually filled out before the necropsy is done. Consequently, the "vital statistics" used in so much of epidemiologic reasoning depend mainly on the diagnoses made by practicing clinicians.
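The delayed effect of a diagnostic test can be put into numbers with a short sketch. Everything in it is assumed for illustration: the constant true incidence, the yearly adoption fractions, the two detection probabilities, and the function name `diagnosed_incidence` are hypothetical and are not taken from any mortality data. The point is only that the recorded rate of a "disease" can rise year by year while its natural occurrence stays fixed.

```python
def diagnosed_incidence(true_incidence, adoption_by_year,
                        baseline_detection, test_detection):
    """Recorded cases per 100,000 for each year, given the share of
    physicians who have adopted a new diagnostic test (hypothetical model)."""
    rates = []
    for adoption in adoption_by_year:
        # Overall detection is a blend of old and new diagnostic practice.
        detection = (1 - adoption) * baseline_detection + adoption * test_detection
        rates.append(true_incidence * detection)
    return rates


TRUE_INCIDENCE = 40.0                    # assumed constant, per 100,000 per year
adoption = [0.0, 0.1, 0.3, 0.6, 0.9]     # assumed spread of the test over time

# Assumed: half of cases are found without the test, nine tenths with it.
recorded = diagnosed_incidence(TRUE_INCIDENCE, adoption, 0.5, 0.9)
for year, rate in enumerate(recorded):
    print(year, round(rate, 1))
```

In this sketch the recorded incidence climbs steadily as the test is disseminated, yet the true incidence never changes. A tabulation that compared only the first and last years would show a rising "disease" produced entirely by the changing rate of diagnosis.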

Although temporal alterations in standards can greatly affect the way that clinicians "vote" when identifying "disease," no major surveys have been done to study the changing incidence and prevalence of the intellectual fashions, paraclinical tests, and diagnostic criteria that so greatly influence which "diseases" will be designated, when, and where. Even when epidemiologists acknowledge that the availability of new diagnostic tests may change the occurrence of certain "diseases," the temporal dissemination of the tests from academia to community is not considered.

The Surgeon General's report 27 provides another useful example of the limited awareness of some of these hazards. According to that report, the rising incidence of lung cancer between 1947 and 1960 could not be due to any new identification tests because "there were no significant advances in diagnostic methods" during that period. 28 The presumptive reason for this statement is that exfoliative cytology techniques were first described in 1945. What is ignored in the statement, however, is not the mere existence of a test, but its dissemination. Exfoliative cytologic studies of sputum were not in use at many hospitals until at least a decade after Papanicolaou introduced the test. As the sputum "pap" smear became disseminated to practicing doctors, it would help identify many cases of lung cancer that had previously escaped detection. The dissemination of a new test, rather than its mere existence, could thus lead to a rising annual incidence of a disease. This possibility, and many other aspects of changing diagnostic rates for disease, are regularly overlooked in the type of reasoning used in the Surgeon General's report.

members

stones as

of the hospitalized diabetic population?

Another classic error, based on neglect of Berkson's fallacy and on failure to check for a tilted target, occurred during a major statistical study of a necropsy population. 24 Patients who died with and without cancer at necropsy were carefully matched for sex, race,

tive tuberculosis

cent of the cancer patients but in 16.3 per cent of the noncancer group, the investigators

rather than general populations. In a celebrated account of the error that is now

known

as

showed

that differential bias in the rates of

hospital

admission

fallacy",

could

cause

concluded that tuberculosis was antag-

onistic

Berkson 4 spurious

seems to have been forgotten by the many epidemiologic investigators today who engage in similar acts of retrospective "matching." As Mainland 21 describes the denoue-

ment 23 Then

of the situation, a serious flaw

did

many

of

the

other

strate

equality

members

of

target

detection:

did

of the general (or the "control")

population have the same prevalence of opportunities

and

tests for

detecting gall-

—the

pos-

diseases,

and therefore

trating thinkers in his field.

The

fallacies

political

created by tilted targets

become

pollsters,

painfully

evident

to

and

to

to pathologists,

Although epidemiologists may privately acknowledge the hazard, it is not given major attention in epidemiologic clinicians.

textbooks, in the literature of epidemiologic research,

and

in

such analytic reviews as

those contained in the Surgeon General's

A

century ago, convinced of the

found more diabetic patients than in a "control" hospital population, we can not conclude that gallstones and diabetes are associated. We would first have to demongallstones are

hospitalized

of

victims quicker than

statistician responsible for the research method and the conclusions was one of the most pene-

therapeutic value

often in

its

gave a person less time to develop florid tuberculosis such as was found in members of the noncancerous group. The concept of a cancer-tuberculosis antagonism was quickly dropped; but we ought not to forget the story, because the bio-

report.

if

was thought

cancer killed

sibility that

associations in the concurrence of diseases.

Thus,

Extensive efforts were

cancer.

to

then instituted to use tuberculin in treating neoplastic disease. This old "blooper"

have thus

The hazards of tilted targets are not unknown to people who work with clinical

"Berkson's

and date of death. Because acwas recorded in 6.6 per

of blood-letting,

many

argued that therapeutic trials or other proofs of the procedure were not necessary. Epidemiologists today seem equally convinced that no tests or precautions need be taken to ensure against clinicians

the distortions introduced

by a

get. In clinical medicine, the

tilted tar-

value (or

lac

of value) of the old therapeutic doctrn

1

was demonstrated as soon as satisfactory scientific studies were done. In epidemiology, some of the current doctrines may also become altered when chronic diseases are investigated with satisfactory scientific methods that arrange for the effects of bias to be assessed in the numerators as well as in the denominators of the compared statistical ratios.

3. Reliability of data

An additional defect of the polling techniques used in the Dewey-Truman presidential election of 1948 was the pollsters' assumption that the "undecided" vote would be split for each candidate in the same proportion as the "decided" ballots. Since Dewey held the lead among "decided" voters in the poll, his share of the "undecided" group would presumably maintain a margin of victory. On Election Day, however, Truman was the choice of most of the "undecided" voters, who had apparently been reluctant earlier to express what seemed to be an unpopular belief. As a result of this problem, the interviewers in political polls now use many additional questions to assess the reliability of the replies they receive.

The issue of reliability in data is the last of the major scientific defects that beset contemporary epidemiologic statistics. Although most scientific investigators will ordinarily go to great lengths to check the reliability of the data with which they work, a corresponding concern has not been evident in epidemiologic surveys. When the data are obtained from mailed questionnaires, the investigators seldom determine whether the questions were understood or correctly answered by the recipients. A second set of identical questionnaires is rarely mailed to determine whether the repeat responses are consistent with those received in the first set. When the data are obtained from death certificates, such target variables as the diagnoses of disease are not thoroughly checked for accuracy. From time to time, an effort is made to verify certain diagnoses recorded on death certificates, but the efforts have always been biased toward the detection of false positive diagnoses only. 10 Thus, when lung cancer appears on a death certificate, investigators may seek substantiating evidence to ensure against a false positive diagnosis, but the investigators do not explore the problem of false negative diagnoses. As noted earlier, about 20% of lung cancers found at necropsy have not been diagnosed during life. Nevertheless, epidemiologists have not performed careful tests to note the way that their data may be distorted by the occurrence of lung cancer in people for whom the diagnosis was not recorded on the death certificate.

These problems in reliability of data are major scientific hazards for the many epidemiologists and statisticians whose analytic work depends on the mortality data assembled by the Bureau of Vital Statistics. An epidemiologist who performs direct populational surveys may not always be as scientifically careful as Deming 6 recommends, but at least he does active research. He must design a questionnaire, find a population, get the questions to the people, receive their responses, and evaluate the responses. By contrast, an analyst of mortality data need not make any plans for collecting information and need not even leave his desk. All he has to do is wait for the Bureau of Vital Statistics to receive and tabulate death certificates, whose design is standardized and whose preparation is a traditional obligation of clinicians. Passively receiving the Bureau's tabulations about rates of death, the epidemiologist becomes a poll-bearer, rather than a poll-taker.

One obvious source of unreliable data in these poll-bearing activities is the unequal skills of the diverse clinicians who fill out the certificates. Instead of carefully checking this source of unreliability, epidemiologic statisticians often regard it as being unimportant. The belief is based on the assumption that the random vicissitudes of the clinical data-recorders will create errors, with cancelling effects in all directions. This assumption might be acceptable

if the data were free from any systematic sources of general bias. At least three main sources of systematic bias, however, destroy the blithe assumption that the only problem to be considered (or ignored) is the diverse diagnostic skills of clinicians. One such source has been discussed earlier in this essay and elsewhere 10: the changing standards of technology, criteria, and concepts of disease from one era to the next. Another source of bias is the bizarre oscillations in disease categories imposed by changes in the hierarchical ratings and coding rubrics assigned to the diagnosed diseases each decade by statisticians. 7, 10, 16, 29 The third major source of systematic bias is the excessive rigidity and general inadequacy of the death certificate as a "questionnaire."

William Farr, who is generally regarded as having founded modern "vital statistics" a century ago, was vigorously opposed to inflexible classification procedures that did not allow for adequate description of doubt. Said Farr, 9 "The refusal to recognize terms that express imperfect knowledge has an obvious tendency to encourage reckless conjecture." Farr's advice seems to have been unheeded by his successors.

In its current format, the death certificate appears to have been deliberately designed to produce a specious specificity. It makes no request for evidence or cited criteria for the diagnoses; it allows no room for honest doubt; and it demands a statement not only of the disease regarded as the cause of death, but also a list of the contributing diseases, arranged in their sequential order of lethality. Every clinician who has ever filled out a death certificate is thus familiar with the problems of the "undecided voter." In many instances, a definite disease has not been identified, and, in many other instances, a single cause of death cannot be selected from the array of diagnoses.

Even when necropsy is performed, the issue of "cause of death" is seldom easily resolved. Although newspapers and magazines often describe isolated cases that were brilliantly "solved" in the morgue, no exciting articles are published about the much more frequent situations in which a specific single cause of death cannot be identified. Sometimes no cause is evident; sometimes there are too many candidates; sometimes death was due to a condition, such as cardiac arrhythmia or diabetic acidosis, requiring detection with special functional or chemical tests that cannot be applied in the routine morphologic examinations at necropsy; sometimes a leading suspect is present, but the actual mechanism of death is uncertain.

The selection of the cause of death is an extraordinarily difficult task, particularly for patients with chronic diseases. A good clinician or pathologist can often identify most of the diseases that were present, but he can seldom say exactly what was the precise cause of death. He often does not know; nor does anyone else. A clinician (or a pathologist) may thus be unable to decide which disease caused death in a patient who had widespread cancer of the colon, advanced cerebral arteriosclerosis, multiple myocardial infarctions, and diffuse bronchopneumonia. Reluctant to make a choice, the clinician is not permitted by the death certificate to be an undecided voter. He is obligated to choose a cause of death, so he picks one, and, depending on his arbitrary choice, the patient enters the statistical lists as a case of either heart disease, stroke, cancer, or something else.

Difficult as the problem may be for a pathologist who has the open body before him, the situation is much worse when the death certificate must be filled out, as most of them are, in circumstances distant from the diagnostic paraphernalia of a major medical center. The clinician is often not exactly certain of any diagnosis, let alone an exact cause of death. Yet he is not permitted to state, when necessary, that the patient died of uncertain but apparently natural causes, nor can the clinician make any other allowances for doubt in listing the sequence of lethal diseases. The

form of the death certificate is there, with all its structured spaces, and each space must be filled in as demanded, with a particular disease cited as the cause of death, and with an order of succession for the "contributory" diseases. If the clinician fails to fill in all the exact demands and does not precisely state a cause of death, the surviving family may think him stupid, the insurance company may delay payment, and the funeral director may even refuse to accept the body.

So the clinician often makes a guess. Generally an intelligent, educated, thoughtful guess; a guess that seems as reasonable and as consistent as possible with the diagnostic concepts of the era; a guess that may often be right or often wrong, but a guess. Anyone who has ever experienced the realities of clinical medicine knows about these guesses, and about the way they may be biased by the fashions of the doctor's training and his environment. 1, 10 The results of this guesswork continue, however, to be assiduously collected, industriously tabulated, and avidly analyzed, and the product is called "vital statistics."

A death certificate today is a documentation of the name, age, and sex of a person, and the time and place of an event; the certificate provides a civilized passport to the disposal of a dead body, but not a scientific identification of disease. Until major improvements are made in the collection, organization, and standardization of the data, the information about individual diseases cannot be used in serious scientific activities. Nevertheless, without careful assessment of the multiple human and technologic fashions that can distort the cited diagnoses and "causes" of death, the certificates have been confidently used as the basis for gauging the changing frequency of diseases, for major statistical studies of various "causal associations," and for new epidemiologic surveys seeking etiologic explanations for different geographic rates of chronic disease. As the mortality rates for various diseases rise and fall in different countries at different years, it is impossible to decide how much of the change is due to nature, to technologic changes, to preventive measures, to therapeutic advances, to fashions of opinion, or to inconsistent guesses and conclusions by that undecided voter: the clinician.

Not all modern epidemiologists perform "research" in the manner described here, and many new surveys have been carried out with methods that improve or remove the cited defects. Some of the new "active" investigators have even openly acknowledged the scientific disparity between preaching and practice in the older, "passive" epidemiologic research. According to Wright and Acheson, 31

If the field epidemiologist is to maintain his integrity, he must be prepared to admit that, as often as not, the circumstances underlying the judgments made by him and his colleagues do not quite come up to the standards which he himself may indicate to be desirable in his teaching.

Regardless of the efforts being made in new investigations, the vast bulk of existing epidemiologic statistics about incidence, prevalence and causes of chronic disease has been assembled with the apparent philosophy of "better bad data than none." For many major activities in medical science, epidemiologic statisticians have preferred to work with quantitative data rather than with unquantified recollections alone, and have hoped that "good judgment" would be used in recognizing and evaluating the defects of the data.

The hope that good judgment will remedy poor science has been a wishful dream throughout all of medical history, and the annals of medical history are filled with the ruins of the dream. In the evaluation of clinical therapy, clinicians for centuries used a fallacious philosophy of post hoc ergo propter hoc to justify all the blood-letting, blistering, purging, and puking that were once regarded as the basic staples of treatment. In concepts of the etiology of disease, clinicians used the same post hoc or other defective philosophies to

create beliefs in the angry gods, unfriendly demons, deranged humors, contagious miasmas, visceral inflammations, dystonic blood vessels, defective teeth, and toxic organs that have incorrectly (from ancient to modern times) been held responsible for causing disease.

During the past few decades, the fallacies of post hoc reasoning in therapy have come under scientific reappraisal. Clinicians have recognized the need for appropriate comparison and quantification, and have begun to apply scientific methods in both surveys and trials of therapy. The methods still require many clinical and statistical improvements, but the defects of the old approaches have been amply recognized, and the need for better scientific methods is well accepted.

The fallacies of post hoc etiologic reasoning have received neither the same recognition nor commensurate efforts at improvement. Although Koch's postulates could be used several generations ago to provide experimental evidence for causes of infectious disease, a different methodology has been needed for the chronic diseases that are prime targets of contemporary research. This methodologic vacuum has been filled by statistical procedures.

What is at stake here is not such specific issues as whether coronary disease is caused by physical inactivity, lung cancer by cigarette smoking, or thrombophlebitis by the "pill." The main issue is whether clinicians, mindful of their long history of erroneous therapeutic and etiologic reasoning, are to be led uncritically into a new philosophy in which diverse characterizations of disease and causal "proofs" are based on elaborate numerical analyses of bad scientific data.

In diagnostic activities, clinicians have now begun to use extraordinary precision for identifying diseases. In therapeutic activities, clinicians have begun to use scientific procedures for assembling suitable evidence, and have begun to recognize that in treatment, as in guerrilla warfare, success cannot be gauged from body counts alone. But in epidemiologic activities, clinicians have accepted the results of body counts assembled without scientific verification of either the procedures or the data.

To obtain satisfactory medical data from a valid random sampling of a human population is an extraordinarily difficult, almost heroic epidemiologic task. Because the task is so formidable, it has been approached with convenient but defective techniques of sampling and data collecting. Although the results have been widely accepted, thoughtful medical scientists will be uneasy about the wisdom of allowing the heroic diagnostic and therapeutic advances of the past few decades to be accompanied by less than heroic epidemiologic research.

The acquisition of a truly random sample is almost impossible for a large general population. With appropriate ingenuity and effort, however, satisfactory samples can be attained from random choices within suitably selected strata and clusters of the population. 6 The problems of death certificates cannot be solved merely by statistical manipulations of the data and by decennial shufflings of the terms used in coding. The certificate itself must be changed to correspond to clinical realities, and connoisseurs of modern clinical and pathologic diagnosis should be enlisted to help explain those realities to the coders and other statistical personnel. 10 Perhaps the most easily removed of the existing difficulties are the defects due to target tilting and unverified data. These problems can be solved as soon as epidemiologic investigators recall that valid evidence, rather than statistical theory, magisterial computation, or imaginative conjecture, is the primary principle of scientific research.

The obliteration of science during the collection of numbers was deplored a half century ago by Major Greenwood, 14 one of the founders of modern epidemiology:

Because the results of their labour are useful, the compilers and analysers of these statistics are no more entitled to rank as scientific investigators

than are the equally useful artisans who manufacture our laboratory apparatus.

Modern investigators who regard scientific method as a fundamental necessity in epidemiology will also note the recommendation of Bradford Hill 5:

One need not accept as final what some third party can give or chooses to give. . . . One must go seek more facts, paying less attention to technique of handling the data and far more to the development and perfection of methods of obtaining them.

For such investigators, the main challenges of scientific epidemiology will be found not in the speculator's arm chair, the questionnaire-collector's mail room, the tabulator's annual volumes, the statistician's desk calculator, or the computer aficionado's terminal. The challenges will be to develop suitable strategies for choosing a proper sampling of population, persuading and examining the selected people, validating the tests and other procedures for gathering data, maintaining liaison with the population during the follow-up period, and performing satisfactory examinations of the subsequent state. Additional incentive can come from the classical exhortation of Francis Galton 13:

It is the triumph of scientific men to rise superior to . . . superstitions, to desire tests by which the value of beliefs may be ascertained, and to feel sufficiently masters of themselves to discard . . . whatever may be found untrue.

When Galton made that remark many years ago, he was urging the use of statistical methods in epidemiologic science. His remark still pertains today; but the need is for scientific methods in epidemiologic statistics.

References

1. Anderson, D. O.: Geographic variations in deaths due to emphysema and bronchitis in Canada, Canad. Med. Ass. J. 98:231-241, 1968.
2. Barrett-Connor, E.: The etiology of pellagra and its significance for modern medicine, Amer. J. Med. 42:859-867, 1967. (Edit.)
3. Beadenkopf, W. C., Abrams, M., Daoud, A., and Marks, R. V.: An assessment of certain medical aspects of death certificate data for epidemiologic study of arteriosclerotic heart disease, J. Chron. Dis. 16:249, 1963.
4. Berkson, J.: Limitations of the application of fourfold table analysis to hospital data, Biometrics Bull. 2:47-53, 1946.
5. Bradford Hill, A.: Observation and experiment, New Eng. J. Med. 248:995, 1953.
6. Deming, W. E.: Some theory of sampling, New York, 1966, Dover Publications. (Paperback.)
7. Dorn, H. F.: Mortality, in Lilienfeld, A. M., and Gifford, A. J., editors: Chronic diseases and public health, Baltimore, 1966, Johns Hopkins Press, pp. 23-54.
8. Farr, W.: Influence of elevation on the fatality of cholera, Stat. Soc. J. 15:155-183, 1852.
9. Farr, W.: Quoted in Greenwood, M.: The medical dictator, and other biographical studies, London, 1936, Williams and Norgate, Ltd.
10. Feinstein, A. R.: Clinical epidemiology. II. The identification rates of disease, Ann. Intern. Med. 69:1037-1061, 1968.
11. Feinstein, A. R.: Clinical biostatistics. II. Statistics versus science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.
12. Feinstein, A. R.: Clinical biostatistics. IV. The architecture of clinical research (continued), Clin. Pharmacol. Ther. 11:595-610, 1970.
13. Galton, F.: Quoted in: Inside cover of each issue of Ann. Hum. Genet.
14. Greenwood, M.: Is statistical method of any value in medical research? Lancet 2:153-158, 1924.
15. Heasman, M. A.: Accuracy of death certification, Proc. Roy. Soc. Med. 55:733, 1962.
16. Israel, R. A., and Klebba, A. J.: A preliminary report on the effect of eighth revision ICDA on cause of death statistics, Amer. J. Public Health 59:1651-1660, 1969.
17. James, G., Patton, R. E., and Heslin, A. S.: Accuracy of cause-of-death statements on death certificates, Public Health Rep. 70:39, 1955.
18. Kendall, M. G., and Buckland, W. R.: A dictionary of statistical terms, ed. 2, New York, 1960, Hafner Publishing Co.
19. Lasagna, L., and von Felsinger, J. M.: The volunteer subject in research, Science 120:359-361, 1954.
20. Likert, R.: Public opinion polls, Scientific American 179:7-11, 1948.
21. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
22. National Center for Health Statistics, Division of Vital Statistics, United States Department of Health, Education, and Welfare: Vital Statistics of the United States, vol. I: Mortality, Washington, D. C., 1963, United States Government Printing Office.
23. National Center for Health Statistics: Cycle I of the Health Examination Survey; sample and response. Vital and Health Statistics. PHS Pub. No. 1000, Series 11, No. 1. Public Health Service, Washington, April, 1964, United States Government Printing Office.
24. Pearl, R.: Cancer and tuberculosis, Amer. J. Hyg. 9:97-159, 1929.
25. Pearl, R.: A note on the association of diseases, Science 70:191-192 (August 23), 1929.
26. Rogers, L.: The pollsters, New York, 1949, Alfred A. Knopf.
27. Surgeon General's Advisory Committee on Smoking and Health. Smoking and Health 1964. United States Department of Health, Education and Welfare, Public Health Service Publication No. 1103.
28. Surgeon General's Advisory Committee on Smoking and Health. Smoking and Health 1964. United States Department of Health, Education and Welfare, Public Health Service Publication No. 1103. p. 140.
29. Van Buren, G. H.: Some things you can't prove by mortality statistics, Vital Statistics - Special Reports, Department of Commerce, Bureau of the Census, 12:191, 1940.
30. Wallis, W. A., and Roberts, H. V.: The nature of statistics, New York, 1965, The Free Press. (Paperback edition.)
31. Wright, E. C., and Acheson, R. M.: New Haven survey of joint diseases. XI. Observer variability in the assessment of x-rays for osteoarthrosis of the hands, Amer. J. Epidem. 91:378-392, 1970.
32. Yerushalmy, J.: On inferring causality from observed associations, in Ingelfinger, F. J., Relman, A. S., and Finland, M., editors: Controversy in Internal Medicine, Philadelphia, 1966, W. B. Saunders Company, pp. 659-668.

CHAPTER 13

Ambiguity and abuse in the twelve different concepts of 'control'

Under the same name, this chapter originally appeared as "Clinical biostatistics. XIX." In Clin. Pharmacol. Ther. 14:112, 1973.

Sir Ronald Fisher began his classic book, 7 The Design of Experiments, by pointing out that a critic who refuses to accept the conclusions of an experiment can take "two lines of attack." In one approach, the critic believes that the results have received a faulty interpretation. In the second approach, the experiment itself is regarded as "ill designed." After deciding that the first approach (criticism of interpretation) belonged to "the domain of statistics," Fisher seemed to regret that the second approach (criticism of design) was often employed by people who were not "professed statisticians" and whose main qualification was only "prolonged experience or at least the long possession of a scientific reputation." Fisher complained that "technical details are seldom in evidence" when a "heavyweight authority" attempts to discredit the design of a research project by making assertions such as "his controls are totally inadequate."

Fisher may have been quite justified in disparaging an expert who makes disparaging remarks without citing the justification of "technical details," but where can either an expert or a novice find an account of the details that define an adequate control? Such details are obviously crucial for scientific plans and interpretations, but the details are not presented in the writings of "professed statisticians." The concept and choice of a control receive almost no discussion in Fisher's work, and the word control does not appear in either the index or the table of contents for any of his three epochal books 7-9 on statistics and research.

The challenge has also not been accepted by Fisher's successors. Despite his warning that "the statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference," and despite the many subsequent statistical publications devoted to the "design of experiments," a clearheaded account of the principles of control has not yet appeared in the statistical literature. The word control constantly occurs in diverse statistical discussions, but the concept of control is still defined ambiguously, and the concept is often applied imprecisely or erroneously.

According to Kendall and Buckland's dictionary, 10 which usually serves as an arbiter of statistical terms, control can be employed in at least two different statistical senses. One of these ideas is stated as follows: "If a process produces a set of data under what are essentially the same conditions and the internal variations are found to be random, then the process is said to be statistically under control. The separate observations are, in fact, equivalent to random drawings from a

population distributed according to some fixed probability law." This statistical idea refers to the regulation of a process, and is applied most often as part of the reasoning (to be discussed later) that laboratory workers usually use for an entity called Quality Control.

The second statistical idea of control is ". . . for testing a new method, process, or factor against an accepted standard. That part of the test which involves the standard of comparison is known as the control." This idea refers to comparison, and is the concept that most research workers would use in describing the role of a control when different investigative maneuvers are contrasted.
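The first dictionary sense, a process that is "statistically under control," can be illustrated with a short Python sketch. It is not taken from Kendall and Buckland or from the essay: the measurements are invented, the function names are hypothetical, and the limits of three standard deviations around the mean are a common quality-control convention that is assumed here and not stated in the text.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from baseline measurements.

    The multiplier k=3.0 is an assumed convention, not a value from the text.
    """
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - k * spread, center, center + k * spread


def out_of_control(values, lower, upper):
    """Return the positions of values that fall outside the limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Invented repeat measurements of one reference specimen under stable conditions.
baseline = [100.2, 99.8, 100.5, 99.6, 100.1, 99.9, 100.3, 99.7]
lower, center, upper = control_limits(baseline)

# Invented later runs; the third one is deliberately far from the baseline.
new_runs = [100.0, 99.5, 103.0, 100.4]
flagged = out_of_control(new_runs, lower, upper)

print(f"limits: {lower:.2f} to {upper:.2f} around {center:.2f}")
print("runs outside the limits:", flagged)
```

In this sense of the word, nothing is being compared with an alternative maneuver; the limits merely show whether the process itself is behaving like random drawings from one fixed distribution.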

In addition to these two definitions, the dictionary lists the terms control chart and control limits, both of which are applied during procedures used for testing quality control. The third additional idea, called control of substrata, is used in ". . . inquiries to create . . . multiple stratification." According to this idea, if we divide survey data that have been cited for two baseline (or "initial state") variables, we would control the substrata when we express the results according to groups demarcated with both variables. For example, if the baseline variables were age and sex, our controlled substrata might be expressed as young men, young women, old men, and old women.

The statistical dictionary thus provides at least three distinctly different ideas for control: the quality with which a process is performed; the maneuver that is compared in an experiment; and the characteristics of the objects compared in a survey. With these three different ideas creating ambiguity in the use of control, it is not surprising that the term is applied with so much confusion. Unfortunately, however, Kendall and Buckland have [...] In clinical science and biostatistical analysis, the idea of control is actually applied in at least twelve different ways rather than three.

In the rest of this discussion, I shall describe the [...] I shall cite some of the points that weighty authorities might make about its abuse. A few of the points are statistical, but most of them rest on straightforward concepts in biology and clinical science. Perhaps the most direct way to demonstrate these ideas is in reference to the "architectural" model that was proposed earlier in this series 3 to describe a research project, and to develop certain [...] choice of "controls." In that model, the initial state of an entity is exposed to a maneuver and undergoes a response observed in the subsequent state. With this model in mind, we can proceed to the different ways in which the idea of control appears in the architecture of clinical research.

A. The idea of regulation

Four of the twelve different uses of control occur in reference to the idea of regulating, governing, or choosing a particular entity. The concept of comparison is not included in these applications of the term.

1. Control of the maneuver. This type of control

control

"...

a term used in sampling denote the employment of factors which are being used in a scheme is

word and

scientific principles in the

Kendall and Buckland describe three other usages of control. Two of these occur in

strata,

many

omitted

diverse uses of the term control, and

definition of control

experimentation

cerns a

187

the twelve different concepts of 'control'

in

is

what

distinguishes an experi-

a survey. In an experiment, the

investigator

"controls"

— governs, de—the maneuver i.e.,

cides, assigns, or chooses

to

which each investigated entity is exposed. In a survey, the maneuver is chosen by nature or by man, but not by the investigator. Thus, a therapeutic trial

is

an experi-

ment, because the clinical investigator signs the treatment to each patient.

A

as-

sur-

vey of therapy is not an experiment, because the treatments were chosen ad hoc by other doctors or by patients, or by both. Almost all of the contemporary epidemiologic investigations of alleged causes of disease such as diet, urban living, smoking, and oral contraceptive pills have

—

—

been performed not as experiments, but as surveys in which the investigated maneuver was usually self-selected.

If the investigator controls the assignment of the maneuver, he can make the assignment by randomization, thereby allowing subsequent application of statistical inference based on random allocation. If the investigator does not control the maneuver, its allocation occurs with all of the possible bias that can enter human decisions. The differences found thereafter may be caused by the initial bias, rather than by the effects of the compared maneuvers. The differences found in these circumstances cannot be properly assessed with tests based on probabilistic statistical inference, and one of the scientist's main jobs in analyzing survey data is to seek out the bias and to make suitable adjustments for it. This potential bias in allocation of the maneuver is not removed by selecting a "random sample" of population, since this aspect of randomness refers to choice of the population, rather than assignment of the maneuvers. Thus, if tense people are likely to die young and also to move selectively to urban regions, we shall find a higher death rate in urban rather than rural locations, regardless of whether or not the urban-rural sample of population was chosen randomly.

Many of the fallacious statistics contained in so many epidemiologic and clinical surveys today are due to confusion about this type of control. The statistician who learned about research from courses in experimental design may not be familiar with the many potential sources of major bias that can occur in a non-experimental survey, in which the investigator did not control the assignment of the maneuvers. Instead of urging the investigator to search for bias and to correct the data accordingly, the statistician may accept the distorted results as if they were experimental data. He may then massage the data with tests of significance, relative risk rates, logistic function analyses, and other tactics that are statistically inappropriate and scientifically misleading if the basic information has been distorted by unrecognized bias.

2. Quality control. The regulation of a process evokes three different uses of the word control. In the procedures of "quality control" (which refers to the first of the definitions cited by Kendall and Buckland), the process is used to produce a particular entity, such as a barrel of beer or a measurement of serum cholesterol. To determine how well the process is being performed, the investigator uses certain well-defined statistical tactics.

For a tangible product, such as beer, the assessment depends on measuring the size, shape, or some other dimensional property of the finished product. If the individual products have similar values for this measurement, the results are regarded as consistent, and the process is regarded as having high quality. An analogous procedure is used if the "product" is a measurement itself.

Such measurements are the principal event that occurs in modern laboratory medicine, where quality control of the numerous chemical, microbiologic, and other technologic tests has become a paramount scientific challenge. The approaches used in meeting this challenge are heavily dependent on statistical principles, and a thoughtful account of the procedures has recently been published by Roy Barnett. From a repeated measurement of the same specimen of material, the laboratory personnel can determine the mean and standard deviation, and can prepare control charts that show the upper and lower control limits for the measurement. The personnel can also develop procedures for determining "external quality control" when the same specimen is measured at different laboratories.

This is the type of activity for which statistical principles are applied magnificently, since the process of measurement was the stimulus for the original ideas that led to development of theories about the dispersion of "errors" in measurement, and that produced the shape of the frequency

distribution now known as a "normal" or Gaussian curve. Although developed for the variations found in the process of measuring objects, these statistical principles were later applied to describe the variability of the objects themselves. The new application of the principles has now become traditional, but has also become a source of major problems in clinical epidemiologic research. We can safely assume that multiple measurements of a single specimen of serum glucose will have a "normal distribution" and we can safely accept the mean of those values as the "correct" measurement. We cannot safely assume, however, that single measurements of multiple specimens of serum glucose from a group of people will be "normally distributed," nor can we safely assume that a "normal range" for the people who provided those specimens will be determined by Gaussian statistics. These problems, which are beyond the scope of this discussion, create major difficulties in clinical activities. In trying to interpret a single measurement for a patient, the clinician must contemplate the variability of the laboratory's quality control, as well as the variability encountered in human populations. To cope with the latter forms of variability, the clinician must consider some of the many other types of control that are not expressible in statistical formulations.

3. Control of the disease process. This type of control refers to the adequacy with which a program of therapy produces certain desired effects in the "activity" of a patient's disease. In hypoglycemic treatment of diabetes mellitus, for example, the purpose is to keep the blood sugar or urine sugar regulated within certain boundaries. When such regulation occurs, the sugar (or the patient) is said to be in "good control." A similar phrase is used in referring to the maintenance of a "remission" in a patient with acute leukemia, or to lowering of serum cholesterol in a patient with hyperlipidemia.

The description of this type of control often occupies an unusual position in the architecture of a research project. The control is expressed as a variable having such categories as good or bad, rise or fall, and in remission or active, but the variable is not a baseline characteristic noted in the initial state, and it is often not the principal target event noted in the subsequent state. Thus, in a study of hypoglycemic agents for patients with diabetes mellitus, the main target events might be vascular complications, and the concomitant regulation of blood sugar would be an ancillary variable in the subsequent state. In a previous discussion, I suggested the term synchronous variable for the citation of this type of regulation. It refers to an entity that is noted after onset of the main maneuver (such as therapy), and before or during the occurrence (or nonoccurrence) of the events that are the main target variables.

In other situations, the regulation of the disease process may be the principal target of the maneuver. Thus, the production (rather than maintenance) of a remission is often the main goal of treatment for acute leukemia, and the reduction of "disease activity" may be the prime target in patients with rheumatic fever or rheumatoid arthritis.

4. Control of the environment. This topic refers to the investigator's ability to govern the environment in which the research maneuvers are administered. The experimenter in the laboratory can "control" the cages in which the animals live, can ascertain that they are given and ingest the assigned medication, and can make them appear for all of the planned examinations and procedures. The investigator who deals with human populations has no such options, and must contend with the possible bias caused by migration or loss of patients, and by noncompliance with either the prescribed maneuver or with the schedule of follow-up examinations.
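As a brief aside on the quality-control procedure described under item 2 of this section, the control limits of a control chart can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the replicate readings are hypothetical, the function names are invented for the example, and the multiplier `k = 2` for the limits is an assumed convention that the text does not specify.

```python
from statistics import mean, stdev


def control_limits(replicates, k=2.0):
    """Return (lower, center, upper) limits from repeated measurements
    of the same specimen. k is an assumed multiplier, not from the text."""
    center = mean(replicates)
    spread = stdev(replicates)  # sample standard deviation
    return center - k * spread, center, center + k * spread


def out_of_control(values, limits):
    """Return the values that fall outside the lower/upper control limits."""
    lower, _, upper = limits
    return [v for v in values if v < lower or v > upper]


# Hypothetical replicate readings of one control specimen.
replicates = [98.0, 101.0, 100.0, 99.0, 102.0, 100.0]
limits = control_limits(replicates)

# Hypothetical later runs of the same specimen, checked against the limits.
later_runs = [100.5, 97.0, 104.5, 99.0]
flagged = out_of_control(later_runs, limits)
print(limits, flagged)
```

The sketch addresses only the variability of the measuring process. It says nothing about the variability of values among different people, which is the separate problem raised in the discussion of the "normal range."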

The difficulties that can occur between the inception of the maneuver and the observation of the subsequent state have been discussed elsewhere as problems in "intrusion." Among the most striking of these problems are those that arise when an effective oral medication is spuriously deemed ineffective because patients failed to take it; when compliance requires an unusual psychic resolution that may distort the characteristics and outcome of the self-selected group of people who are able to comply; and when the detection of the target event is distorted by inequalities in the frequency and intensity of the diagnostic procedures used to detect the target in compared groups of people.

The performance of compliance and of target detectability can also often be expressed as synchronous variables. Although control of these environmental features of research is another aspect of "regulation," the synchronous variables cited here or described in the previous section can later become a member of the "control strata" that are used for comparing stratified results. Many crucial analyses in clinical or epidemiologic research may depend on appropriate division of the population according to compliance with the maneuver; changes in sugar, cholesterol, or white blood count; or the frequency and intensity of diagnostic tests. Nevertheless, as discussed previously, these synchronous variables are generally ignored in most statistical "models" of experimental design for clinical research. The many gradations of this type of "control" may be completely neglected in statistical reports, or the synchronous results may be analyzed in an unsatisfactory retrograde manner.

B. The idea of comparison

The next five types of control all depend on the concept of comparison. The confusions in usage arise because the comparisons are based on different elements in the research architecture. Some of the comparisons refer to the maneuver; others, to the initial state of the objects under investigation; others, to the subsequent state; and yet others to the transitions between initial and subsequent state.

1. The control maneuver. This is the sense in which most scientific investigators think about control. It refers to the maneuvers that are compared with the principal maneuver: the placebo vs. the active drug; the saline injection vs. ACTH; the high dose vs. the low; non-smoking vs. smoking; urban vs. rural life.

The choice of an appropriate control maneuver is a sophisticated, subtle act of scientific reasoning. As described previously, proper choice requires suitable attention to a complex array of "technical details." These details include: the potency of the dosage or other procedures with which the maneuvers are administered; the relativity with which comparative maneuvers are chosen to demonstrate efficacy or efficiency; the internal accompaniment provided by the ancillary ingredients or excipients of a pharmaceutical agent, or the anesthesia and other concomitants of surgery; the external accompaniment provided by postoperative recovery rooms and solicitous medical personnel; and the concurrency with which comparative maneuvers precede, parallel, or follow the principal maneuver. In view of the scientific complexity, it is not surprising that investigators sometimes choose control maneuvers that are "totally inadequate" and that statisticians may not recognize the defects.

A classic example of an unsatisfactory control maneuver occurred in a highly publicized recent experiment 1 designed by a pathologist collaborating with a statistician. Warm air containing cigarette smoke was pumped into the tracheostomy of beagle dogs, who were then examined for pulmonary lesions. Every intelligent layman to whom I have described this experiment has promptly selected the appropriate control maneuver. Warm air (devoid of cigarette smoke) should have been pumped into the tracheostomy of the control dogs. Nevertheless, in the actual experiment, the comparative maneuver consisted of insertion of a tracheostomy tube alone, without anything else.

The decision that a particular comparative maneuver is inadequate cannot be made with any statistical reasoning and is sometimes not evident to apparently

experienced investigators. By considering the stated principles 4 of potency, relativity, accompaniment, and concurrency, a reviewer can often quickly discern what may be wrong, but the application and interpretation of those principles will usually rest on a mixture of judgment, wisdom, and common sense.

2. Size of the control group. The "control group" consists of the entities who receive the comparative (or control) maneuver. These entities might be the rats who get the saline injection, the patients who receive placebo, or the dogs who get (or failed to get) warm air blown into a tracheostomy.

An important scientific principle of research is that the groups receiving the control maneuvers and the investigative maneuvers be qualitatively similar before the maneuvers are instigated. This qualitative aspect of similarity depends on the control of strata, as described in the next section.

The quantitative similarity of groups is a different issue, however. Ordinarily, we would like the compared groups to have the same number of members. This equal-size principle is intuitively appealing and has the numerical virtue of being most likely, ceteris paribus, to produce "statistical significance" in the results.

On certain occasions, however, particularly when multiple maneuvers are being compared in the same experiment, some of the groups may be made proportionately larger or smaller than others. For example, if a single placebo group acts as the "control" for four treatment groups, we might make the placebo group substantially larger than any of the others. If a single medication is being given at two dosage levels, we might make each dosage group half the size of the other groups so that the results for the two dosages can be combined into a single full-size group for that medication. The planning of deliberate inequalities in the size of compared groups is one of the statistical subtleties of experimental design. Regardless of why the inequalities are chosen, the investigators who make such plans are scientifically obligated to describe their reasons and justifications. Without such descriptions, the readers of a published report cannot discern whether a gross imbalance in the size of compared groups is due to perceptive planning, capricious plodding, or an unforeseen disaster that the investigators hope will pass unnoticed.

An interesting example of an unexplained gross imbalance in sample sizes occurred in the aforementioned investigation 10 of smoking dogs. The group that "smoked" through a tracheostomy contained 86 dogs. The "control" group that had a tracheostomy alone contained 8 dogs. The investigators provided no statement of how and why this bizarre disparity occurred.

3. Control of the strata. This type of control refers to the last meaning cited by Kendall and Buckland. If we were comparing treatment X vs. treatment Y in patients with diabetes mellitus, we would "control" for sex if we divided the patients into groups of men and women, and analyzed the results of the two treatments separately in each sex group. Readers of previous papers in this series should immediately recognize that this type of "control" is achieved by stratifying the patients according to characteristics that are present in the baseline initial state, before the maneuver is imposed. The purpose of prognostically predictive or other forms of stratification is to achieve qualitative similarity in the compared groups. This type of stratification is also used to avoid or reduce the bias that may occur when maneuvers are not assigned randomly or when the randomization does not produce equal proportions of important strata in the people assigned to the compared maneuvers.

Attention to this type of "control" has been strikingly absent from most general statistical textbooks and works devoted to "experimental design." One reason is that a statistician who thinks only about experiments, but not surveys, will obviously have no incentive to contemplate the sources of bias that enter a survey. A second reason is the basic fallacy of the currently accepted statistical "model" of experiments.

This model, which depends on the effects of only a single maneuver, is inadequate for the common clinical and epidemiologic situations in which a maneuver of human intervention or pathogenesis is imposed on an underlying maneuver of nature (or of man). The purpose of the stratification is to identify what was done by the underlying maneuver to separate the people according to their different degrees of baseline susceptibility for the target event that may later follow the imposed maneuver. If the stress, parental longevity, severity of clinical condition, or other features that create this susceptibility are not appropriately identified and separated, the investigator courts an intellectual disaster. He may attribute, to the imposed maneuver, results that are really due to the unrecognized underlying maneuver. This error is the data analyst's equivalent of the post hoc ergo propter hoc folly into which clinical reasoning has so often been seduced throughout the centuries.

Because the current statistical "model" of "experimental design" is concerned with one maneuver, rather than a dual maneuver, statisticians have had no major intellectual incentive to consider the need for two types of "control": a control that compares the overt imposed maneuvers, and a control that creates strata to "equalize" the previous effects produced by the antecedent underlying maneuvers. The literature of modern "science" thus becomes filled with large amounts of clinical and epidemiologic data that are misleading or worthless, because no attempt was made to perform the control stratifications that could help remove the diverse forms of selection bias, susceptibility bias, detection bias, compliance bias, chronology bias, and other forms of bias that permeate contemporary statistics.

An excellent example of this problem appeared in a recent survey 11 published in the British Medical Journal. A stellar array of investigators reported that "the children known to have been treated by physicians specializing in the treatment of leukaemia in childhood have survived considerably longer than others" who were treated "in the country as a whole." For this type of conclusion, a reader would surely want to know about the bias introduced by interiatric referral patterns. Were the compared patients equal in their clinical severity of illness? Were the children who were "healthier" and more likely to tolerate both the travel and the chemotherapy generally referred to the specialist physicians, whereas the sicker or moribund children were kept at home to be treated "in the country as a whole"?

A stratification according to clinical severity of illness would therefore be expected in this type of survey. In other forms of cancer, such stratifications are regularly attempted with the "staging systems" that are used in an effort to distinguish severity of illness or other major prognostic differences among patients with the same age and the same neoplastic cell type.

The leukemia "working party" stated that they "had taken care to eliminate bias so far as possible from the comparisons." This "care" consisted of stratifying the children with acute leukemia according to age and cell type. The patients were not staged according to the clinical severity of illness or according to any of the prognostic features that might militate for or against referral to a specialist.

Unfettered by the absence of randomization and by neglect of a major source of selection bias, the investigators nevertheless concluded that "there is good reason to believe that the improvements in survival" were due to "the availability of special facilities and expertize." The investigators also recommended that "it would seem desirable that children with acute leukaemia should be referred, where this is feasible, . . . to a centre specializing in the treatment of the disease."

The chairman of the "working party," in a previous publication, 12 made the remark that "practising clinicians have not always taken kindly to the statistical approach to medical problems in which patients are considered as units in a more or less homogeneous whole." The absence of suitable clinical stratifications is one of the main reasons for the scientifically unacceptable conclusions that emerge when the prognostic heterogeneity of patients is left ignored and uncontrolled.
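The effect of such a missing stratification can be illustrated with a small numerical sketch. Everything in it is hypothetical: the counts are invented for the illustration and are not taken from the leukaemia survey, and the field names (`setting`, `severity`, `survived`) and helper functions are assumptions of the example. The sketch shows how a crude comparison can favor one setting even when the results are identical within each severity stratum, simply because the milder cases were preferentially referred.

```python
from collections import defaultdict


def rates(records, by):
    """Proportion of survivors in each group; `by` maps a record to its group key."""
    alive = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = by(r)
        total[key] += 1
        alive[key] += r["survived"]
    return {k: alive[k] / total[k] for k in total}


def make(setting, severity, n, survivors):
    """Build n hypothetical patient records, the first `survivors` of whom survive."""
    return [
        {"setting": setting, "severity": severity, "survived": int(i < survivors)}
        for i in range(n)
    ]


# Invented counts: mild cases are mostly referred to specialists.
records = (
    make("specialist", "mild", 80, 64)
    + make("specialist", "severe", 20, 6)
    + make("general", "mild", 20, 16)
    + make("general", "severe", 80, 24)
)

crude = rates(records, by=lambda r: r["setting"])
stratified = rates(records, by=lambda r: (r["severity"], r["setting"]))
print(crude)
print(stratified)
```

In these invented data the crude survival proportions differ (0.70 vs. 0.40), but within each severity stratum the two settings have the same proportion (0.80 for mild, 0.30 for severe). The crude difference is produced entirely by the unequal distribution of severity.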

4. The epidemiologic "case-control." In all of the comparisons discussed so far, the scientific reasoning proceeded in a forward direction, from cause toward effect. The comparisons dealt with the initial condition, the maneuvers, and synchronous changes that accompanied the maneuvers, but the populations were all being followed from their initial state toward their subsequent state.

A quaint tradition among chronic disease epidemiologists calls for a total reversal of the direction of scientific logic. Instead of pursuing a cohort group in the customary sequence of cause → effect, the epidemiologic investigator may begin with a population in whom the effect has already occurred. By historical or other methods of inquiry, the epidemiologist then follows the "cases" backward toward the putative cause. Thus, to determine whether contraceptive pills lead to thrombophlebitis, most scientists would assemble a population of pill-takers and non-pill-takers, follow them forward, and determine the rate with which the people in each group develop thrombophlebitis. Many epidemiologists, however, would start with people who already have thrombophlebitis, and would then determine the proportion who had taken the pill.

In this situation, the epidemiologist is confronted with the problem of getting a group of people for comparison. The people cannot be chosen in the customary scientific manner, on the basis of receiving or not receiving the investigated maneuver, because the investigation is being conducted in a backward direction. The patients are chosen according to the "effect," not the "cause," but the control groups used in scientific research are selected from the "initial state," not from the "subsequent state." Nevertheless, with a majestic display of antipodal scientific logic, epidemiologists will choose a contrast group from "subsequent-state" people in whom the effect has not occurred. For example, in a study of the allegedly evil effects of oral contraceptive pills, the eligible contenders for this contrast group would be people who do not have thrombophlebitis. To add a semblance of scientific veneer to this retroverted scientific procedure, epidemiologists usually engage in various forms of "matching." Members of the contrast group are usually matched to the case group according to features of age, sex, and other data that were conveniently available to the investigator. The result is then called a "case-control" group.

The tortuous reasoning used for this unique form of "control," and the many biases it ignores or creates, will be discussed in greater detail in a later installment of this series.

5. The control value. We now turn to a different type of "comparison": the before value of a before-and-after pair of values that form a transition variable. For example, we can measure the baseline level of serum cholesterol and call it the "control value." We then treat the patient with a lipid-lowering agent, and, after a period of time, we measure cholesterol again. The change between the two values of cholesterol then becomes the transition variable that is analyzed. The change is expressed as an increment (or decrement) if we subtract the control value from the new value; or as a proportion or percentage if we divide the new value by the control value. In this situation, the control value is not used for comparison in a scientific sense. It serves merely as a numerical ingredient in the subtraction or division by which we calculated the change that is the value of the transition variable.

Many investigators seem to believe, however, that this type of control value has a stronger role in comparison, and can substitute for the "control" provided by a comparative maneuver. I remember one instance several years ago when an august professor wanted to test the effect of a certain surgical procedure on a particular chemical in blood. He planned to measure the level of the chemical before surgery and after surgery, and he intended to hold the surgery responsible for any changes. When members of the institution's research committee insisted that he examine a "control group," he claimed he did not need one because the pre-surgical values were alone satisfactory. The patients would act as their "own controls."
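The two forms of the transition variable described in this section can be written out as a short sketch. The numbers and function names below are hypothetical and serve only to illustrate the arithmetic. The sketch also shows why a control value alone is not a scientific comparison: the change in the treated group becomes interpretable only when it is set against the change in a group that received a comparative maneuver.

```python
def increment(control_value, new_value):
    """Transition expressed as an increment (negative means a decrement)."""
    return new_value - control_value


def proportion(control_value, new_value):
    """Transition expressed as the new value divided by the control value."""
    if control_value == 0:
        raise ValueError("control value must be nonzero to form a proportion")
    return new_value / control_value


def mean_change(pairs):
    """Average increment over (control_value, new_value) pairs."""
    changes = [increment(before, after) for before, after in pairs]
    return sum(changes) / len(changes)


# Hypothetical serum cholesterol for one patient, before and after treatment.
change = increment(300.0, 240.0)
ratio = proportion(300.0, 240.0)

# Hypothetical before-and-after pairs in a treated and a comparative group.
treated = [(300.0, 240.0), (280.0, 230.0)]
comparative = [(290.0, 250.0), (310.0, 260.0)]
difference = mean_change(treated) - mean_change(comparative)
print(change, ratio, difference)
```

In these invented figures the treated group falls by an average of 55 units, but the comparative group also falls by 45, so only a difference of 10 units remains to be considered in relation to the treatment.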

prominent

clinical

micians to understand what

is

acade

meant

1

Other architectural problems

194

a controlled comparison might be excused

"control" maneuver, not merely a "control"

have had by the Fact that many It research in methods. so little training

value.

clinicians

more

is

a suitable excuse.

difficult to find

same confusion statisticians, and when this confusion becomes the- instructions offered however, when appears among

textbooks devoted to the de-

statistical

in

exactly the

and analysis

sign

research data.

ol

The quotation below From a prominent, textbook

of

analogous example biostatisticS.

not

book by name here because tion

the

of

literature

either

cite

my

text-

Had

done the research more thoroughly, might have found similar examples

I

many

other leading statistical books.

exact

text

is

the "example"

t

in

of

how

The

to

be tested

in

The

that book

for

its

believe

to

average,

the

raise

that

the

the

stimulus

mean

systolic

would,

Mood

three types of control refer to

last

regulation

table,

on

pres-

patient,

that

t

Stabilization.

1.

many experimental

is

imposed. The main purpose of to allow the observed

stabilize"

this interval

of data

values

into the result that will

be

called the baseline or "control" value. For

example, just

the

the

cited,

been

a

blood pressure experiment

in the

single

baseline

may have

value

reading taken just before

stimulus was

imposed. Alternatively,

may have been observed

during

a control period of several days or weeks.

From

the

many blood

pressure values ob-

then chose (or calculated) what was used

for

level of "statistical significance,"

state

that

"we do not have

evidence to conclude that the

stimulus

increases

implication to

is

scientific

blood

pressure."

The

were only high be "significant," we would have

satisfactory

to

In

a time interval of observation before the research maneuver is used

tained during that interval, the investigators

values

sufficient

This

same phrase:

perform a paired t-test. and find is only 1.09. Because this value is

authors

enough

and

comparison,

each

betorc-and-after

below the the

contained in the cited

the authors calculate the difference

the

in

nor

situations,

the patient

the data

a paired t-test.

control period.

sure?

From

been supexample

effect

on blood pressure. Twelve men have their blood pressure measured before and after the stimulus. The results are shown in Table 8-3. Is there reason

perform

are produced by exactly the

to is

to

neither

is

certain stimulus

in

statistics, this bla-

Tbe control period

C.

I

as follows: \

of

tant error in research design has

an

investiga-

incomplete.

is

observed "stimulus." Nevertheless,

renowned textbook

known

found

have

I

the .1

another textbook of

in

shall

1

directly

internationally

statistics.

experimental setting, a flaw in the equipment, or many other causes other than

plied to students as a fundamental

taken

is

The change observed by the statisticians may have been due to anxiety, the

is

that

if

t

evidence for that conclusion.

precisely the type of defective reasoning that clinicians are urged to avoid by getting consultative help from statisticians. The paired t-test might allow the investigator to conclude that a "significant" change had or had not occurred in the two pairs of readings, but no scientist would conclude that the imposed stimulus was responsible for the change. To reach the latter conclusion, we would require a

as the baseline value. The calculation may have produced a mean, median, mode, or some other derivative of the data observed during the entire control period, during a period of apparent equilibrium, or during the few readings that preceded the imposition of the stimulus.

This sound scientific procedure is marred only by the frequent failure of investigators to report the details of how the "baseline" value was chosen from the values observed during the control period.

2. Qualification. A second usage of control period is in reference to a time interval during which the investigator tests a person's eligibility or qualification for admission to a research study. During this period the patient may be called upon to demonstrate such features as compliance with proposed protocol, ability to remain free of diabetic ketosis on diet alone, etc.

This type of "qualification trial" is a reasonable procedure in many research projects. Its main hazard is that it can create substantial bias in the ultimate results. The people who fit the qualifications for compliance or other standards may not be a representative sample of the general population of people who have the condition that is under investigation.

3. Washout. The last usage of control period is in reference to the "washout" interval that is often necessary between the successive therapeutic agents administered in a cross-over trial. The patient may receive placebo or no treatment during this interval. This type of control period is somewhat analogous to the first type in that its purpose is to restore stabilization. The main hazard of the washout period is that it is sometimes too short, or omitted when necessary. A frequent problem, particularly when subjective symptoms are an important variable, is the decision about whether the patient should receive placebo or no treatment during the washout.

The main purpose of this essay has been to point out the confusion that exists about the term control, and to specify some of the "technical details" whose absence was so distressing to R. A. Fisher. The lack of specific information about what constitutes an "adequate control," however, does not seem to have inhibited statistical dissertations devoted to the "design of experiments." Nor has the absence restrained "heavyweight authorities" in the domain of statistics from delivering oracular conclusions about the interpretation of epidemiologic surveys, clinical trials, or laboratory experiments in which the crucial controls were either omitted, malconceived, or otherwise scientifically unsatisfactory.

Of the twelve different types of control cited here, eleven are best discerned, noted, chosen, and evaluated with principles that are inherently scientific and that require no knowledge of statistical theory. Even in the twelfth type, the statistical principles used in quality control require no mathematical sagacity more profound than familiarity with a mean, a standard deviation, and a Gaussian distribution. Because statisticians are obviously confused about the topic of control, clinical and epidemiologic investigators are obligated to clarify the issues and to educate our colleagues about these arcane aspects of scientific research.

The statistician who was "raised" to think only about experiments will not appreciate the bias that can occur in surveys where the populations were self-referred and the maneuvers were self-selected. The statistician who regards experimental design as an exercise in analysis of variance and Latin squares cannot appreciate the issues of potency, relativity, accompaniment, and concurrency that are needed to choose a control maneuver. The statistician whose research model makes provision only for a single maneuver cannot appreciate the need for controlling strata in the common research situations that involve a double maneuver. The statistician who mistakes a control value for a control group may have difficulty supplying helpful advice to the people who consult him for guidance in designing research.

Sir Ronald Fisher⁸ said that "the statistician cannot evade the responsibility for understanding the processes he applies or recommends." In exchange for the many useful tactics that statisticians have brought to science, the least that scientific investigators can do is to help statisticians understand the processes and meet the responsibility.

References

1. Auerbach, O., Hammond, E. C., Kirman, D., and Garfinkel, L.: Effects of cigarette smoking on dogs. I. Design of experiment, mortality, and findings in lung parenchyma, Arch. Environ. Health 21:740-753, 1970.

2. Barnett, R. N.: Clinical laboratory statistics, Boston, 1971, Little, Brown & Co.

3. Feinstein, A. R.: Clinical biostatistics. II. Statistics vs. science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.

4. Feinstein, A. R.: Clinical biostatistics. III-V. The architecture of clinical research, Clin. Pharmacol. Ther. 11:432-441, 595-610, and 755-771, 1970.

5. Feinstein, A. R.: Clinical biostatistics. XII. On exorcising the ghost of Gauss and the curse of Kelvin, Clin. Pharmacol. Ther. 12:1003-1016, 1971.

6. Feinstein, A. R.: Clinical biostatistics. XVII. Synchronous partition and bivariate evaluation in predictive stratification, Clin. Pharmacol. Ther. 13(Part 1):755-768, 1972.

7. Fisher, R. A.: Statistical methods and scientific inference, ed. 2, Edinburgh, 1959, Oliver & Boyd, Ltd.

8. Fisher, R. A.: The design of experiments, ed. 8, New York, 1966, Hafner Publishing Co.

9. Fisher, R. A.: Statistical methods for research workers, ed. 14, Edinburgh, 1970, Oliver & Boyd, Ltd.

10. Kendall, M. G., and Buckland, W. R.: A dictionary of statistical terms, ed. 3, Edinburgh, 1971, Oliver & Boyd, Ltd.

11. Report to the Medical Research Council from the Committee on Leukaemia and the Working Party on Leukaemia in Childhood. Duration of survival of children with acute leukaemia, Br. Med. J. 4:7-9, 1971.

12. Witts, L. J., editor: Medical surveys and clinical trials, ed. 2, London, 1964, Oxford University Press, p. 5.

CHAPTER 14

The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research

This chapter originally appeared as "Clinical biostatistics. XX." In Clin. Pharmacol. Ther. 14:291, 1973.

About two years ago, in a journal that may seldom be seen by readers of this series, there appeared¹¹ a remarkable public rebuke to the epidemiologic fraternity. Dr. John P. Fox, who is co-author of a new textbook of epidemiology,¹⁴ wrote a note "to call attention to and correct a surprising error in a report . . . (that) presumably was produced by highly distinguished persons . . . as several epidemiologists." Dr. Fox pointed out that the authors of the cited report had made incorrect use of the terms prospective and retrospective.

Like every domain of science, epidemiology contains certain basic conceptual ideas, and the terms that express these ideas become the fundamental vocabulary of the domain. Investigators trained in other forms of research may not fully understand those ideas when applying them in another domain, and may misuse either the words or the associated concepts. For example, in reporting the results of therapy, clinicians regularly confuse the epidemiologic nuances that distinguish incidence from prevalence, and mortality rates from fatality rates. Conversely, epidemiologists analyzing general mortality rates for different causes of "disease" regularly confuse the clinical nuances that distinguish a diagnosis listed on a death certificate from the true occurrence of a disease.

The misuse of prospective and retrospective by epidemiologists themselves is unexpected, however, because these terms are fundamental to any scientific reasoning about populations, a subject on which epidemiologists are regarded as experts. Furthermore, the ideas of "looking forward" and "looking backward" are expressed directly in those words, in contrast to the arbitrary meanings that have been created to distinguish incidence from prevalence, and mortality from fatality. The confusion about prospective and retrospective seems to arise not from any arbitrarily created distinctions, but from the unfortunate custom of using the terms to describe two different and sometimes conflicting concepts of research. One concept refers to the directional pursuit of a population; the other refers to the method used for collecting the research data.

The basic unit of epidemiologic research is a person who has a particular condition. When groups of the persons are under investigation, the prospective-retrospective terminology has been applied, under the same name, for two different research activities: (1) the chronologic direction in which each person is followed in relation to that condition, and (2) the way that the investigator gets the data that describe each person's observed and other conditions.

In an earlier paper in this series, I discussed the conflicting ambiguity and scientific confusion caused by the two sets of concepts, and I suggested that the best way to remove the difficulties was to discard the words prospective and retrospective. We could then use two new sets of terms to describe the two different sets of concepts.

A. The two sets of terms

1. The collection of data.

An investigator can assemble research data in two basic ways. In the first way, the person under study was originally observed by people who were not performing a specific investigation, and who reported the observations in routine records. Afterward, to get the research data, the investigator extracts the information available in those records. In the second way, the investigator makes special plans beforehand for the techniques with which each person is to be examined and the data recorded. The first way of collecting research data can be called retrolective, and the second way, prolective.

The ability to use special procedures for acquiring the primary data will depend on whether the research was planned before or after the persons under surveillance reached the locale at which their condition was to be observed. For example, suppose we want to study the growth achieved in the first year of life for premature babies born in our hospital during 1965. If we decide in 1964 to begin this research the following year, we can make advance plans for performing suitable examinations to collect the data from birth onward for all appropriate children. If, however, we decide in 1973 to do this same research project, we cannot make any advance plans, because the conditions we want to study were noted at our hospital eight years earlier in 1965. Therefore, to do the research, we would have to rely on the information entered in the hospital's medical records. From the hospital's diagnostic registry, we would find the names of all premature babies born in 1965. From each baby's medical record and from supplemental ad hoc communications, we would collect the necessary data about birth weight and subsequent growth.

Regardless of whether the research was done prolectively or retrolectively, we would attempt to assemble the same group of patients and to follow that group in the same chronologic direction from birth onward. The main tactical difference would be in the methods used for collecting the research data.

2. The direction of populational pursuit. The other set of terms would refer to the temporal direction in which a group of people is followed. The direction can be either forward or backward. For example, suppose we want to know whether oral contraceptive pills predispose to thromboembolism. In forward research, we would assemble a group of women who use the pill and another group who use other forms of contraception. We would then follow the two groups forward and note the rate at which they develop thrombophlebitis. In backward research, we would assemble a group of women who have developed thrombophlebitis and another group who do not have thrombophlebitis. We would then note the proportionate frequency with which the two groups had previously used oral contraceptives.

Both types of research could be done prolectively or retrolectively, according to the methods used for assembling data. The distinguishing feature of the two cited projects would be the direction of populational pursuit, rather than the way in which the research data were collected. In the first project, the population is followed forward, from "cause" toward "effect." In the second project, the population is followed backward, from "effect" toward cause.

The word cohort has been established to describe a group of people who are pursued in the forward (or "prospective") direction that characterizes scientific research. Thus, to assess the growth rate of premature babies in the projects described earlier, our first project would have involved a prolective cohort, and the second, a retrolective cohort. In studying the development of thrombophlebitis in pill-takers or non-pill-takers, we would perform cohort research, regardless of whether the data are obtained prolectively or retrolectively. In studying the antecedent use of contraceptive pills in people with or without thrombophlebitis, however, we would not be examining a cohort.

Regardless of how the data are collected, a cohort population is usually divided into two or more groups. The main division is done to compare the principal maneuvers under investigation. If the principal maneuver is an alleged cause of disease, the cohort may be divided into an "exposed" and "non-exposed" group, and the exposed group may be further divided into degrees of exposure. Thus, if cigarette smoking is the putative pathogen under study, the cohort may be divided into non-smokers, light smokers, moderate smokers, and heavy smokers. If the investigated maneuver is a therapeutic intervention in the course of a disease, the cohort is divided into a treated and a "control" group, or into groups who have received different forms of treatment. We might then refer to an exposed cohort and a non-exposed cohort, or to a treated cohort and a control cohort. A second form of division for a cohort is the separation into strata that may be different in prognosis, compliance, ancillary response, or detectability of target event. Regardless of how the first or subsequent divisions are created, the key feature of cohort research is that the groups are followed forward, in a scientific direction, from "cause" toward "effect."

There is currently no word to describe the population that is followed backward in the opposite of cohort research. The term "retrospective" may describe the direction of the research, but it does not provide a name for the different groups under study. The term cases or case-group is sometimes used for the people in whom the "effect" has already occurred, and the term case-control is applied for the people who are in the "non-effect" group. These terms are not particularly specific, since the word case can imply almost anything, rather than the occurrence of an "effect." A good substitute for the cases or "effect" group is probands, a term that has been employed by Taube in several excellent analyses.

The only difficulty with probands (or its alternative, propositi) is that the word has been almost wholly preempted by geneticists. Almost 40 years ago, Sir Ronald Fisher¹¹ was using proband during statistical analyses of genetic phenomena, and the word has often been applied in a genetic sense since that time. Despite the genetic connotation, however, probands refers to a group of people with an identified disease, and the term could be quite satisfactory for our purposes here. It might still be chosen by epidemiologists who prefer it to the alternatives that follow.

A new word that might be created to describe a group of diseased people is morbery, formed from the Latin morb- (disease) and -ery (collection). This word has no genetic associations, but it has the disadvantage of implying the existence of a disease, and this type of research may not always begin with an "effect" that is a distinctive disease. (We might be studying the "causes" of unemployment by backward pursuit of antecedent characteristics in a group of unemployed people.)

Regardless of how we label the people who do or do not have the "effect," none of the terms just cited conveys the idea of a backward direction, analogous to the forward direction connoted by cohort. In quest of a word for a group of people followed in a chronologically backward direction, I have searched unsuccessfully through several dictionaries, a few varieties of Roget's Thesaurus, and some Latin lexicons. The best I can do at the moment, until some philologic connoisseur comes up with a better idea, is to reverse the word cohort, and to propose trohoc as the name for a group of people who are followed backward from "effect" or "non-effect" toward "cause." The "effect" group would form a case, proband, or morbery trohoc, and the "non-effect" group would be the control or contrast trohoc.

Thus, in the timing of data collection for a research project, the words "prospective" and "retrospective" would be replaced by prolective and retrolective. In the directional pursuit of a population, the words "prospective" and "retrospective" would be replaced by cohort and trohoc.

B. The problems of trohoc research

Clinicians often have difficulty in understanding what may be wrong with the scientific direction of trohoc research. From the errors that have prevailed at different times in medical history, we know the problems doctors have had merely in distinguishing a cause from an effect, without the added burden of distinguishing the direction of reasoning. A more substantive source of difficulty, however, is that clinicians are unfamiliar with trohoc research because it seldom occurs as part of clinical investigation. The clinical investigator will usually either do experiments confined to the laboratory or will perform surveys and trials of clinical therapy. With either of these two forms of research, the clinical investigator follows a cohort forward from an imposed "cause" to an observed "effect." Accustomed to this standard direction of scientific thinking, the clinician may become uncomfortable, uncertain, or logically confused when he encounters a complete reversal of that direction. If the retroversion is accompanied by a barrage of mathematical formulas and statistical tabulations, the clinician may promptly withdraw from the epidemiologic melee, returning to the security of more familiar forms of science while hoping that the epidemiologic experts will know what they are doing and will do it right.

And yet, as Fox pointed out, the experts can be wrong. In the instance about which Fox complained, the term retrospective was erroneously applied by epidemiologists to a "prospective" study of a cohort of people whose data were obtained retrolectively from medical records. This type of error in scientific thinking can be eliminated as a younger generation begins to outgrow or avoid the confusion transmitted by its elders. A much greater source of real or potential scientific error, however, is in the trohoc research, investigations to which the term retrospective might be correctly applied.

These potential errors seldom receive sufficient emphasis when students are taught about the trohocs investigated in "analytic epidemiologic" studies. The main conceptual difficulties of a backward direction may be mentioned in passing, but the instructor or the textbook usually dwells mainly on such things as relative risk ratios and other statistical manipulations of the data. The possibility that the data might be immensely distorted or wholly erroneous does not receive intensive discussion, perhaps because retroverted trohoc research has been the standard practice of leaders in the field.

At a time when all other basic aspects of medical education have come into question, however, a reconsideration of scientific validity in epidemiologic trohocs would not be out of place. A prime challenge is to arrive at a way of illustrating what might be wrong with the way that the backward procedure works, and with the data it provides. For this purpose, I shall draw an analogy from the game of baseball, with apologies to readers in countries where the game is unfamiliar.

Suppose we suspected that right-handed batters among professional baseball players were worse hitters than left-handed batters, and we wanted to get data to test this suspicion. In the prospective cohort approach that has been used for more than a century in the science of baseball statistics, we would determine the number of times that each type of batter came to bat, and the number of times that were followed by a hit or an out.* We would then calculate each batter's "batting average" as a rate of hits per times at bat. To make our overall decision about left-handed versus right-handed batters, we would compare the batting averages in the two groups.
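The cohort comparison described here can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical, invented for the example, and the names `BattingRecord` and `average` are not from the text. Each batting average is simply hits divided by times at bat (hits plus outs).

```python
from dataclasses import dataclass


@dataclass
class BattingRecord:
    """Cohort counts for one group of batters (hypothetical data)."""
    hits: int  # times at bat that were followed by a hit
    outs: int  # times at bat that were followed by an out

    def average(self) -> float:
        """Batting average as a rate: hits per times at bat."""
        at_bats = self.hits + self.outs
        if at_bats == 0:
            raise ValueError("no times at bat recorded")
        return self.hits / at_bats


# Hypothetical counts, invented for illustration only.
right_handed = BattingRecord(hits=260, outs=740)
left_handed = BattingRecord(hits=280, outs=720)

right_avg = right_handed.average()
left_avg = left_handed.average()
print(f"R.-handed: {right_avg:.3f}   L.-handed: {left_avg:.3f}")
```

Because every time at bat is counted in the denominator, the two rates are directly comparable, which is the property the cohort direction provides.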

To construct a table showing the results, we would put the batters in the rows, as the independent variable used for the "sampling," and the hits and outs in the columns, as the subsequently observed phenomena. The table would contain the following numbers:

                         Number of hits    Number of outs
R.-handed batters              a                 b
L.-handed batters              c                 d

The "batting average" percentages to be compared would then be a/(a+b) vs. c/(c+d).

In the trohoc approach, we would not consider the times at bat or even what occurred during the game. Instead, we might focus on a particular location in the ball park, such as center field. Whenever a batted ball came into center field, we would determine whether it was a hit or an out. We would then ask the center fielder to inquire and let us know whether the batter had been right- or left-handed. We would then construct the following table:

                                R.-handed batters    L.-handed batters
Number of center field hits            a'                   c'
Number of center field outs            b'                   d'

The rows of this table contain the center field hits and outs that were the basis for our "sampling"; the columns contain the subsequently observed "handedness" of the batters; and the interior letters are chosen to correspond with their presumptive counterparts in the preceding table.

Because we derived our data from center field hits and outs, rather than from all the batting events, we cannot use a'+b' and c'+d' to represent times at bat. We can contemplate only the proportions of hits and outs that were associated with the two types of batters. Thus, we could calculate the proportion of R.-handed hits as a'/(a'+c') and compare the result with b'/(b'+d'), which is the proportion of R.-handed outs. If the first proportion was substantially lower than the second, we would conclude that R.-handed batters are worse than L.-handed batters.

Anyone who understands the game of baseball should immediately recognize what is wrong with these trohoc statistics. We have not compared true batting averages (or rates); we have looked only at the proportion of hits and outs in balls batted to the center field region. More importantly, our station in center field gave us a highly restricted view of what was going on; we have no idea about the number of times at bat that culminated as strikeouts, walks, bunts, infield singles, line-drive infield outs, pop-ups, foul-outs, or batted balls that become either hits or outs in left or right field. Our view was limited entirely to events in center field. If we are sure that the events occurring in R.-handed and L.-handed batters are equally represented by what takes place in center field, our restricted observational focus may be valid, but the only way we can decide about such equal representation and validity would be to get results with a cohort approach, which we have not used.

To translate this baseball analogy into trohoc tactics for studying cause of disease, we need merely substitute a medical center for center field; a person with disease D as a "hit" and a person without disease D as an "out"; exposure to the putative cause of disease as a R.-handed batter and non-exposure as a L.-handed batter. With these translations, an epidemiologist using medical-center data, assembling a diseased trohoc and a "case-control" group, and noting exposure or non-exposure to the alleged "cause" in both groups, performs the same type of analysis just cited for baseball.

*In accordance with statistical custom in these matters, a walk or a sacrifice hit would not count as a time at bat.

If a statistician ever tried to analyze baseball batting in this manner, the sports pages of the nation's newspapers would be filled with howls of derision. Suppose the statistician, stung by public laughter, decides to improve the analysis, but he remains unwilling or unable to leave his observation post in center field. His "improvement" consists of keeping center field as the source of data, but he will try to [...] the problems by "matching" [...] also separated [...] from non-pitchers); [...] some [...] the corresponding left- or right-handedness of the pitcher who threw to each batter.

Since age and height are not known to have particularly strong effects on the batting ability of an active professional baseball player, the choice of age and height as matching variables, rather than the more important features just cited, served to betray the statistician's unfamiliarity with subtle but important nuances in the game of baseball. Consequently, the statistician's proposed matching would be regarded as a futile [...]
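The way a center-field vantage point can mislead may be easier to see with numbers. The sketch below is an illustration only: every count is hypothetical, and the variable names are not from the text. It assumes two groups of batters with identical cohort batting averages whose hits and outs reach center field in different proportions, then computes the two trohoc proportions a'/(a'+c') and b'/(b'+d').

```python
# Hypothetical cohort data: both groups have the same true batting average.
cohort = {
    "R": {"hits": 270, "outs": 730},
    "L": {"hits": 270, "outs": 730},
}

# Hypothetical subset of those events that happened to reach center field.
# (Assumed for illustration: the two groups place their hits and outs
# differently around the park.)
center_field = {
    "R": {"hits": 54, "outs": 219},    # a', b'
    "L": {"hits": 108, "outs": 146},   # c', d'
}


def batting_average(record: dict) -> float:
    """Cohort rate: hits per times at bat."""
    return record["hits"] / (record["hits"] + record["outs"])


cohort_avg = {hand: batting_average(rec) for hand, rec in cohort.items()}

a1, b1 = center_field["R"]["hits"], center_field["R"]["outs"]
c1, d1 = center_field["L"]["hits"], center_field["L"]["outs"]

# Trohoc proportions: share of R.-handed batters among center field hits
# and among center field outs.
r_share_of_hits = a1 / (a1 + c1)
r_share_of_outs = b1 / (b1 + d1)

print("cohort averages:", cohort_avg)
print(f"R.-handed share of center field hits: {r_share_of_hits:.3f}")
print(f"R.-handed share of center field outs: {r_share_of_outs:.3f}")
```

With these invented counts the trohoc comparison suggests that R.-handed batters are worse, although the cohort averages are equal. The discrepancy comes entirely from the assumed difference in where the balls land, which the center-field observer has no way to detect.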

whenever a hit or out appears in center [...]

[...] he will no longer pay any attention to the (a + b) and (c + d) terms that were meaningless in the trohoc. These terms do not appear in the odds ratio, which depends only on the individual values noted for a, b, c, and d in the trohoc population.

The odds ratio thus used for "relative risk" in a trohoc is quite pleasing statistically, since it provides a single number that can be subjected to diverse mathematical manipulations with theories about [...] variance, and "significance." The result is less pleasing medically because it fails to show the individual risk rates. The result is quite unpleasing scientifically, however, because its validity is so open [...]

[...] the individual risk rates, which were also not determined. [...] estimation. To [...] general population to the trohoc group, we had to make [...] assumption [...] For this [...] of non-exposed people will be 1-e. Let us further assume that the rate of disease in the exposed people is p1, and the rate in non-exposed people is p2. We then have the following numbers of people in the general population: exposed and diseased = p1e; exposed and non-diseased = (1-p1)e; non-exposed and diseased = p2(1-e); and non-exposed and non-diseased = (1-p2)(1-e).

Suppose now, however, that the rate of detection, d1, [...] disease [...] who are both exposed and diseased, [...] higher than the rate, d2, [...] who are non-exposed and diseased. The numbers of people with detected disease in the general population will therefore be as follows: exposed and diseased = p1d1e; and non-exposed and diseased = p2d2(1-e). For non-diseased people, the numbers will remain: exposed and non-diseased = (1-p1)e; and non-exposed and non-diseased = (1-p2)(1-e).

If a trohoc population is assembled, the relative risk (or odds) ratio will be

p1d1e(1-p2)(1-e) / [...]
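The effect of unequal detection on the odds ratio can be checked numerically. The sketch below is a reconstruction under stated assumptions, not the author's own calculation: it takes e as the exposed fraction of the general population, p1 and p2 as the disease rates in exposed and non-exposed people, and d1 and d2 as the detection rates of disease in those two groups, and it leaves the non-diseased counts unchanged. The function name and all numerical values are hypothetical.

```python
def trohoc_odds_ratio(e: float, p1: float, p2: float,
                      d1: float = 1.0, d2: float = 1.0) -> float:
    """Odds ratio from the four cell proportions of a trohoc.

    Assumed meanings (illustrative): e = exposed fraction of the
    population; p1, p2 = disease rates in exposed / non-exposed people;
    d1, d2 = detection rates of disease in exposed / non-exposed people.
    """
    exposed_diseased = p1 * d1 * e
    nonexposed_diseased = p2 * d2 * (1 - e)
    exposed_nondiseased = (1 - p1) * e
    nonexposed_nondiseased = (1 - p2) * (1 - e)
    return (exposed_diseased * nonexposed_nondiseased) / (
        nonexposed_diseased * exposed_nondiseased
    )


# Hypothetical values: the exposure has no effect on the disease rate.
e, p1, p2 = 0.30, 0.02, 0.02

or_unbiased = trohoc_odds_ratio(e, p1, p2)                  # equal detection
or_biased = trohoc_odds_ratio(e, p1, p2, d1=0.90, d2=0.45)  # unequal detection

print(f"odds ratio with equal detection:   {or_unbiased:.2f}")
print(f"odds ratio with unequal detection: {or_biased:.2f}")
```

Under these assumptions the e and (1-e) terms cancel, so the observed odds ratio equals the odds ratio with equal detection multiplied by d1/d2. In this invented example an exposure with no effect appears to double the risk solely because disease is detected twice as often in exposed people.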

p= and spec-

tions

the

can now

specificity

data.

similar circumstances, however, such calcula-

at

We

3). re-

result.

The investigator laudably made no to calculate a

+

3

the palpation method gave a falsel)

both directions

in

so that the results were arranged in a 4

even larger pattern of for disagreement

cells

to

original table had

—

x 4

oi

the opportunities

would be even greater when

I

the data

were doubly dichotomized.

In addition to this difficulty, a separate prob-

ACTUAL CONDITION

SULTOF

Rl

\m

Major

PALPATION

fever

\fitlor

lem

fever

15

6

22

1106

cells

it

contains

—

table is

of a single

test.

Many

is

assumption

that

results

found

in several

exam-

different variables, not just in one. For

A

third

arrangement depends on a previous

decision about clinical tactics. that

we

will

Let us decide

always use a thermometer

the patient's temperature

if

to take

palpation indicates

the intermediate condition of

minor

fever. Fur-

thermore, for purposes of using palpation as a

"screening test",

let

us assume that

we

are

not really interested in circumstances where the

ple, in acute

tions of

symptoms, electrocardiographic

and laboratory

merated collection of entries from certain "ma-

sensitivity

palpation in "screening" for major fever. With

diagnoses.

is

these assumptions, five cells are removed from the original nine-fold table, and it reduces to the following four-fold table:

                            ACTUAL CONDITION
RESULT OF PALPATION      Major fever     No fever
Major fever                   15             3
No fever                       3           993
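As a minimal sketch, the four counts in the four-fold palpation table can be converted into sensitivity and specificity in the customary way. The function name `fourfold_indexes` and the cell labels a, b, c, d are assumptions introduced here for illustration; rows are taken as the palpation result and columns as the actual condition.

```python
def fourfold_indexes(a, b, c, d):
    """Return (sensitivity, specificity) for a four-fold table.

    a: test positive, condition present
    b: test positive, condition absent
    c: test negative, condition present
    d: test negative, condition absent
    """
    sensitivity = a / (a + c)  # proportion of actual cases found by the test
    specificity = d / (b + d)  # proportion of non-cases correctly called negative
    return sensitivity, specificity


# Counts taken from the four-fold palpation table (major fever vs. no fever).
sens, spec = fourfold_indexes(a=15, b=3, c=3, d=993)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
# prints: sensitivity = 0.833, specificity = 0.997
```

These values apply only to this reduced arrangement, after the "minor fever" cells have been removed; a different way of collapsing the nine-fold table would give different values.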

test

pro-

obviously inadequate for determining the

and specificity of these complex

We

would need

to use

an expression

that contains multivariate constituents.

An example of such a variable would be fulfillment of composite criteria for diagnosis of acute myocardial infarction. The categories of this variable could be expressed in terms such as yes or no (or uncertain). This method of citing the result of a multivariate diagnostic procedure

fever

A

cedure based on input from just one variable is

know

data,

acute rheumatic fever,

jor" and "minor" manifestations.

the reliability of

to

tests. In

the Jones diagnostic criteria call for an enu-

we

want

myocardial infarction, the clinical

diagnosis would depend on certain combina-

thermometer shows only minor fever. What really

the

the univariate result

medical diagnoses depend

on an aggregate of the

fever

any "two-

—no matter how many

the

entity being evaluated

fever

Not major

that arises in the construction of

way" contingency

major

would allow us to use a 2-way table for comparing the enumerated data of whatever method was employed to confirm the patients' correct diagnoses.

On the other hand, because the constituent multivariate elements are lost in a single expression such as yes or no, we would have no direct way of determining the causes of erroneous results when they occur. To track down the sources of false positive and false negative diagnostic errors, we would have to go back and start with each of the multivariate constituents.

C. Relationship of index and purpose

Both of the statistical difficulties that have just been mentioned could be overcome (or at least reduced) with a more sophisticated set of mathematical indexes for expressing the relationships. Instead of using Youden's J, or the "index of validity", or any other indexes that depend on doubly dichotomous data in a four-fold table, we could use indexes of association that allow the variables to have polytomous (more than two) categories. Such indexes would include Kendall's tau, Goodman and Kruskal's G, Cicchetti's statistic,⁴ and some of the various "kappa" statistics described by Fleiss⁹ or the "lambda" statistics described by Hartwig.¹² (If worst came to worst, or perhaps to best, we could simply enumerate the results according to the proportions that were too high, correct, and too low).

To consider the correlation between multivariate constituents of data and the patient's confirmed condition, we could use some of the diverse correlation coefficients that can be derived from multiple linear regression or discriminant function analysis.

These statistical improvements in managing multicategory or multivariate data, however, will not solve a more fundamental problem in describing the effectiveness of a test. What seems to be almost wholly overlooked in clinicostatistical strategies for calculating a test's effectiveness is the purpose for which the test is used.

1. The three types of diagnostic test. Diagnostic tests are employed for at least three different purposes: discovery, confirmation, and exclusion. During various types of "screening" procedures, we use a discovery test. Examining people who seem healthy, with no clinical complaints to suggest the presence of a particular disease, we often search for that disease in a clinically "silent" form. Examples of such tests in lanthanic patients are the discovery uses of a serum calcium measurement for hyperparathyroidism, a fasting blood sugar for diabetes mellitus, or a rectal examination for rectal cancer.

A confirmation test is employed in situations where we have strong suspicions that the disease is present. The purpose of the test is to verify this suspicion. The performance of bronchoscopy with biopsy and microscopic examination of tissue is a confirmation test for lung cancer; and a glucose tolerance test provides confirmation for diabetes mellitus.

An exclusion test is usually employed to "rule out" the presence of a disease when it is suspected. Such a test is usually too expensive or inconvenient to be employed merely for discovery purposes during routine "screening". For example, a stool guaiac examination might be used for the screening discovery of colonic cancer, but a more elaborate roentgenographic or colonoscopic examination would be needed to "rule out" the disease if its presence is suspected. Certain exclusion tests are cheap enough and convenient enough to be used for screening purposes. Thus, when an appropriate skin test for tuberculosis is negative, the presence of active disease can usually be excluded, although a positive test will neither discover nor confirm active tuberculosis.

Some tests are good for only one of these three purposes. Some are good for two. Some can be used for all three. For example, the performance of sigmoidoscopy, together with biopsy and histologic examination when appropriate, can generally be used to discover, confirm, and exclude cancer of the rectum. A glucose tolerance test can be used to confirm and exclude the presence of diabetes mellitus, but is generally too inconvenient for purposes of screening discovery. The histologic examination of tissue from a bronchoscopic biopsy is an excellent way to confirm lung cancer, but cannot be used to exclude the disease or to discover it during routine screening.

Since diagnostic tests are employed for these different purposes, the statistical indexes of efficiency should be arranged accordingly.

2. Requirements of detection and confirmation. In a discovery test, we want reasonably high sensitivity. If the disease is present, it


should be found, even at the risk of getting a high rate of false positive results. [We are willing to take this risk because a discovery test, when positive, is usually followed by a confirmation test]. In an exclusion test, we want the sensitivity to be even higher than in a discovery test. Unless the sensitivity is close to 1, the risk of a false negative result would keep us from being confident that a negative test has excluded the disease. The discovery and exclusion tests are thus both intended to have a high sensitivity for detecting the disease when it is present.

To get the particularly high sensitivity that is sought in an exclusion test, we must be willing to pay the appropriate clinical price. Thus, to test urine for sugar is a good, cheap, convenient way of "screening" for the discovery of diabetes mellitus, but the urine test will regularly give some false negative results. To measure fasting blood sugar is a more expensive and less convenient discovery procedure, but it is more diagnostically effective because it has a lower false-negative rate than the urine test. If we want to rule out diabetes mellitus with certainty, we would have to use the much more expensive and cumbersome mechanism of the glucose tolerance test, which,

identifying lung can-

positive

false

results,

miss lung cancers

but

regularly

it

that are located at inac-

For these reasons, many diagnostic tests are regularly used in tandem. A high sensitivity test is used to find the disease; and a positive result is followed by a high specificity test that will confirm the diagnosis by "excluding" its possible falsehood. Because of these tandem arrangements, the best statistical appraisal of the results will depend on a suitable arrangement of the paired tests. In such an arrangement, the result of the pair might be called negative if the detection test is negative; and the paired result would be called positive only if both the detection test and the confirmation test are positive. The positive and negative results of this kind of paired arrangement would have both high specificity and high sensitivity.
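The pairing rule just described can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the helper names `paired_result` and `sensitivity_specificity`, and above all the patient counts, which are invented and not taken from any study.

```python
def paired_result(detection_positive, confirmation_positive):
    """Pair is negative if detection is negative; positive only if both are positive."""
    return bool(detection_positive) and bool(confirmation_positive)


def sensitivity_specificity(results, diseased):
    """Compute (sensitivity, specificity) from parallel lists of booleans."""
    tp = sum(r and d for r, d in zip(results, diseased))
    fn = sum((not r) and d for r, d in zip(results, diseased))
    fp = sum(r and (not d) for r, d in zip(results, diseased))
    tn = sum((not r) and (not d) for r, d in zip(results, diseased))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical patients: (detection positive, confirmation positive, diseased).
patients = (
    [(True, True, True)] * 9        # diseased, found and confirmed
    + [(False, False, True)] * 1    # diseased, missed by the detection test
    + [(True, False, False)] * 4    # non-diseased, false alarm removed by confirmation
    + [(False, False, False)] * 6   # non-diseased, negative detection test
)

diseased = [p[2] for p in patients]
detection_only = [p[0] for p in patients]
paired = [paired_result(p[0], p[1]) for p in patients]

det_sens, det_spec = sensitivity_specificity(detection_only, diseased)
pair_sens, pair_spec = sensitivity_specificity(paired, diseased)
print(det_sens, det_spec)    # 0.9 0.6
print(pair_sens, pair_spec)  # 0.9 1.0
```

In these invented data the confirmation step raises specificity without lowering sensitivity. Under this rule the paired sensitivity can never exceed that of the detection test alone, so the pair is only as sensitive as its first member.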

we want

however, we cannot rely on either of these procedures.

gives

test

tests

is

The bronchoscopic biopsy almost never

cessible sites.

nega-

way of

but non-sensitive cer.

or

test.

the risk of a false negative result

1

a quite specific

will

close to

is

Conversely, a

positive bronchoscopies biopsy

in

a discovery 1,

in this instance, would be both an exclusion and a confirmation test.

By contrast, in a confirmation test, we want extremely high specificity, with few or no false positive results. If the test shows that the disease is present, we want to be sure that it is present. We would have no real objection to occasional false negative results, since the confirmation test will probably be ordered after an exclusion test was used to find any cases that might otherwise be missed as false negatives.

3. Combinations of tests. A single test can seldom be excellent for the goals of both detection and confirmation. With rare exception, the same procedure cannot be sensitive enough to find all cases of the disease while simultaneously being specific enough to avoid false positive identifications. For example, the chest X-ray is a quite sensitive but non-specific way of finding lung cancer. Almost all patients with lung cancer have abnormal roentgenograms, but not all people with positive roentgenograms turn out to have lung cancer.

D. Choice of the tested populations

There are important clinical reasons for trying to solve just some of the problems that have been discussed. Perhaps the most important reason is that this form of correlation between the result of a test and the patient's actual condition is the best way of making clinical sense out of the statistical chaos that now exists in demarcating the "range of normal".⁸ If "normality" is determined purely on a univariate basis, according to arbitrary statistical boundaries for a distribution of data, the demarcation will indicate the zone of customary values for the test, but not their clinical connotations in health or disease. If the demarcated zone is to have these clinical connotations, the demarcations must be established in direct correlation with an actual condition of health or disease. This type of correlation can be achieved and evaluated only through the type of bivariate arrangements we have been discussing.

The discussion so far has been concerned, however, only with the defects of existing clinico-statistical strategies and with ways of improving the defects. Unfortunately, these mathematical improvements will not solve the really fundamental biostatistical problems of diagnostic tests. Like so many other sophisticated statistical procedures, the complex indexes of association produce elegant but superficial algebra. The indexes can provide useful methods of quantitative expression for what has been observed, but the calculations are totally dependent on what is submitted as the observed data. And the fundamental biostatistical problem lies in the choice of the populations that

are the sources of the data.

1. The role of clinical suspicion. If we are going to use a test for different diagnostic purposes, it must be evaluated in groups of people who suitably represent the different diagnostic challenges. These people cannot be chosen merely according to whether or not they were demonstrated to have the disease in question.

Since clinical suspicions must affect the choice of a test and the evaluation of the test's performance, the tested population will, at least, be divided according to the existence of clinical suspicions. We would thus choose one group of people who would constitute the ordinarily healthy population for whom the test would be used, during "screening", as a detection test. The second group of people would have medical conditions that aroused our suspicion of the disease and that made us want to confirm or exclude it. The customary fourfold diagnostic table would thus be converted into the following "eightfold" table:

                            ACTUAL CONDITION
RESULTS OF TEST           Positive     Negative
Screened population:
  Positive                   a'           b'
  Negative                   c'           d'
Suspected population:
  Positive                   a"           b"
  Negative                   c"           d"

If these populations are going to approximate reality, we would want P, the prevalence of the actual disease, to be low in the screened population and high in the suspected population. When the test results are correlated with the patients' actual condition, we would calculate at least two sets of values for sensitivity and specificity: one set for the screened population and another set for the suspected population. Thus, instead of a single value for sensitivity [which would be (a' + a")/(a' + a" + c' + c")], we would determine two separate values: a'/(a' + c') for the screened population and a"/(a" + c") in the suspected population. Two analogous calculations would be done for specificity, using the respective b and d values in the screened and suspected populations.

2. The role of pathologic derangement. By inspecting the preceding eightfold arrangement of data, we can begin to see why a particular test might have not one set of values for sensitivity and specificity, but several different sets.

Suppose a positive result in the test depends on the disease having produced a certain level of pathologic derangement. When this level of derangement occurs, the diseased persons almost always develop symptoms that arouse suspicions of the disease. In such suspected patients, the test will therefore have high sensitivity. On the other hand, if the disease is present without having reached the prerequisite level of pathologic derangement, the patient may be asymptomatic and part of a screened population. In such a population, the diagnostic test may have low sensitivity.

Once we begin to contemplate a pathologic derangement⁷ rather than the particular entity that is called a "disease", we can also recognize the causes of many false positive or disproportionately positive results that can destroy the value of a diagnostic test. For example, suppose the positive result of a particular diagnostic test really depends on a derangement in the patient's nutritional status, but suppose we want to employ this test for the diagnosis of cancer. For the evaluated population, we choose the diseased group from hospitalized patients with cancer, and the non-diseased group from healthy technicians, secretaries, and other staff personnel. Since patients whose cancer is severe enough to require hospitalization are often malnourished, the results of their test are usually positive. Since the staff personnel are well nourished, their test results are negative. We emerge from the evaluation process with the belief that we have found an excellent new diagnostic test for cancer: the sensitivity and specificity values are quite high.

After the test begins to be applied, we will be chagrined to discover that it really

has low sensitivity and low specificity. The test fails to detect the neoplasms of asymptomatic well-nourished patients with cancer; and it gives false positive diagnoses of cancer for malnourished patients with stroke, chronic cardiopulmonary disease, or certain enteropathies. Because we failed to include such patients in the original test population, we did not discover the inefficiency of the test until after it became clinically popular.

3. Surrogate and pathognomonic tests. The term pathognomonic is usually applied to a clinical manifestation that uniquely indicates a particular condition. For example, the palpation of spontaneous movement within a suitable sized suprapubic mass in a woman would be pathognomonic of pregnancy. This term can also be used for paraclinical procedures that demonstrate, delineate, or otherwise identify a particular disease. For example, the histologic findings in an appropriate tissue specimen will be pathognomonic of cancer or hepatitis; a specified set of values in a glucose tolerance test will be pathognomonic of diabetes mellitus.

In a surrogate test, we examine an entity that will be used to represent or approximate the disease we want to identify. Examples of surrogate tests are pap smears for cancer, serum glutamic oxalic transaminase (SGOT) for hepatitis, chest X-ray for tuberculosis, electrocardiogram for myocardial infarction, or urine sugar for diabetes mellitus.

A pathognomonic test is seldom evaluated for sensitivity and specificity. We may worry about observer variability when a pathologist interprets a tissue specimen; or about the standards of glucose ingestion, specimen timing, and chemical measurement when a laboratory performs a glucose tolerance test; but we are not concerned that the test itself may be misleading.

It is the surrogate tests that create the main problems of sensitivity and specificity. A surrogate test does not identify the disease; it identifies something else that we hope will denote the disease. We often use surrogate tests because they are simpler, cheaper, and more convenient than the corresponding pathognomonic test. A surrogate test may also be more sensitive. For example, a measurement of serum alkaline phosphatase may detect metastatic cancer that has been missed by a liver biopsy; and a positive chest X-ray can detect tuberculosis that has not shown tubercle bacilli in the microscopic examination of sputum.

To compensate for these advantages, surrogate tests often produce false results and cause the problem of evaluating sensitivity and specificity. Because the procedure is a surrogate test, it depends on a pathologic entity that is different from the one we are trying to diagnose. To contemplate sources of false positive and false negative results, we must therefore contemplate the mechanisms that might "trigger" a test into errors of omission or commission. These mechanisms will consist of alternative pathologic derangements or clinical conditions. Thus, inflammation may create a false positive pap smear for cancer; and inaccessibility of the desquamated cells may make the pap smear falsely negative. The electrocardiogram may fail to show a myocardial infarction if taken too early after the acute attack and may give false positive results because of some other myocardopathy. Many chemical tests give falsely high results in response to alternative diseases and drugs; and the results can be falsely lowered under other appropriate clinical conditions.

4. The process of discrimination. For all these reasons, a proper evaluation of the surrogate procedures that are called diagnostic tests would require them to receive several different challenges in discrimination. The test must be able to discriminate among a variety of pathologic derangements that might simulate either the target disease or an entity in the clinical and paraclinical spectrum of that disease. The various groups of patients who enter the evaluated population must be selected according to their suitability for providing these challenges. If patients are chosen merely because they do or do not have the target disease, the discrimination of the test will not be adequately evaluated.
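The separate calculations proposed earlier for the "eightfold" table show how a pooled index can hide this kind of inadequate challenge. The sketch below is an illustration only: the cell labels a, b, c, d follow the eightfold arrangement, but every count is hypothetical and the helper names are introduced here, not taken from the text.

```python
def sensitivity(a, c):
    """a = test positive and diseased, c = test negative and diseased."""
    return a / (a + c)


def specificity(b, d):
    """b = test positive and non-diseased, d = test negative and non-diseased."""
    return d / (b + d)


# Hypothetical counts: low prevalence when screened, high prevalence when suspected.
screened = {"a": 4, "b": 30, "c": 16, "d": 950}   # a', b', c', d'
suspected = {"a": 90, "b": 20, "c": 10, "d": 80}  # a", b", c", d"

sens_screened = sensitivity(screened["a"], screened["c"])     # a'/(a' + c')
sens_suspected = sensitivity(suspected["a"], suspected["c"])  # a"/(a" + c")
sens_pooled = sensitivity(screened["a"] + suspected["a"],
                          screened["c"] + suspected["c"])

spec_screened = specificity(screened["b"], screened["d"])
spec_suspected = specificity(suspected["b"], suspected["d"])

print(f"sensitivity: screened {sens_screened:.2f}, "
      f"suspected {sens_suspected:.2f}, pooled {sens_pooled:.2f}")
print(f"specificity: screened {spec_screened:.2f}, suspected {spec_suspected:.2f}")
```

With these invented counts the pooled sensitivity (about 0.78) looks respectable even though the test finds only one in five of the diseased people in the screened group, which is the situation a single fourfold table would conceal.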

The choice of patients to provide appropriate challenges will depend on both the medical spectrum of the disease and its diagnostic co-morbidity. The medical spectrum⁵ of the disease refers to the array of clinical and paraclinical laboratory abnormalities that it can produce. The diagnostic co-morbidity of the disease consists of other diseases that might be mistaken for it. Diagnostically co-morbid diseases are usually entities occurring in the same topographic location of the body or producing somewhat similar morphologic or other paraclinical abnormalities.

For example, the medical spectrum of primary lung cancer would include patients with hemoptysis, with major weight loss, and with abnormal chest roentgenograms. The spectrum of diagnostic co-morbidity for lung cancer would include patients with non-neoplastic pulmonary diseases (such as tuberculosis and chronic bronchitis) and with metastatically neoplastic pulmonary lesions. To evaluate the discrimination of a proposed new test for lung cancer, we would therefore want to challenge the test with patients who represent different parts of the medical and co-morbid spectrum. Our investigated population might thus include the following groups of people: asymptomatic patients with lung cancer; patients with lung cancer and only primary symptoms, such as hemoptysis; patients whose lung cancer symptoms include such systemic effects as major weight loss; patients whose lung cancer manifestations include such metastatic effects as hepatomegaly or bone pain; asymptomatic patients with other causes of pulmonary disease; hemoptytic patients with other pulmonary disease; patients with major weight loss due to other diseases; and patients with hepatomegaly or bone pain due to other diseases.

In a more general statement of principles, the populations used to evaluate the discrimination of a diagnostic test for Disease X should consist of representatives from the following groups of people:

1. Patients with Disease X who are asymptomatic.

2. Patients with Disease X who are symptomatic with a diverse collection of manifestations that cover the medical spectrum of the disease.

3. Patients without Disease X who have other diseases that have produced overt manifestations similar to those in the medical spectrum noted in Group 2.

4. Patients without Disease X who have other diseases that can mimic Disease X's pathologic derangement by occurring in a similar location or by having similar paraclinical dysfunctions.

The sensitivity of a test used for discovery purposes in "screening" will depend on its performance in Group 1. The test's sensitivity for exclusion purposes will depend on its capacity to identify patients in Groups 1 and 2. The specificity of the test results will depend on its avoidance of false positive results in identifying members of Groups 3 and 4. These four groups would seem to be a minimum demarcation of the necessary comparisons, but additional subgroupings would be needed in appropriate circumstances.

The complexity of these arrangements may seem distressing, but they are ultimately less distressing than the continued proliferation of diagnostic tests whose inadequacies escape initial evaluations because the initial evaluations did not contain suitable challenges. The oversimplification of the existing tactics for getting "control" groups and calculating statistical indexes has led to the spawning of many tests that are grossly unsatisfactory for clinical purposes.

To deal with clinical reality requires a confrontation with clinical complexity. The new arrangements proposed here are both feasible and analyzable after the appropriate data have been assembled. The performance of such complex analyses is not at all a novel idea. It has been, in fact, performed for many years during a generally unquantified procedure called clinical judgment.⁵ With increasing advances in technology, clinicians will increasingly have to evaluate the costs, risks, and diagnostic discrimination of new diagnostic tests. If these evaluations are to provide sensible science, the clinical subtleties and complexities of clinical judgment must be acknowledged, adapted, and incorporated into the plans for choosing the patients who are tested and for quantitatively expressing the results.

References

1. Bennett, B. M.: On comparisons of sensitivity, specificity and predictive value of a number of diagnostic procedures, Biometrics 28:793-800, 1972.
2. Bergeson, P. S., and Steinfeld, H. J.: How dependable is palpation as a screening method for fever? Clin. Pediatr. (Phila.) 13:350-351, 1974.
3. Berkson, J.: "Cost-utility" as a measure of the efficiency of a test, J. Amer. Stat. Assn. 47:246-255, 1947.
4. Cicchetti, D. V.: A new measure of agreement between rank ordered variables, Proc. Am. Psychol. Assoc. 7:17-18, 1972.
5. Feinstein, A. R.: Clinical judgment (reprinted edition), Huntington, N. Y., 1974, Robert E. Krieger Publishing Co.
6. Feinstein, A. R.: The pre-therapeutic classification of co-morbidity in chronic disease, J. Chronic Dis. 23:455-469, 1970.
7. Feinstein, A. R.: An analysis of diagnostic reasoning. I. The domains and disorders of clinical macrobiology, Yale J. Biol. Med. 46:212-232, 1973.
8. Feinstein, A. R.: Clinical biostatistics. XXVII. The derangements of the range of normal, Clin. Pharmacol. Ther. 15:528-540, 1974.
9. Fleiss, J. L.: Statistical methods for rates and proportions, New York, 1973, John Wiley & Sons, Inc.
10. Freeman, L. C.: Elementary applied statistics: For students in behavioral science, New York, 1965, John Wiley & Sons, Inc.
11. Greenhouse, S. W., and Mantel, N.: The evaluation of diagnostic tests, Biometrics 6:399-412, 1950.
12. Hartwig, F.: Statistical significance of the Lambda coefficients, Behav. Sci. 18:307-310, 1973.
13. Mantel, N.: Evaluation of a class of diagnostic tests, Biometrics 7:240-246, 1951.
14. Muic, V., Petres, J., and Telisman, Z.: Validity of a diagnostic test designated by a single function, Methods Inf. Med. 12:244-248, 1973.
15. Nissen-Meyer, S.: Evaluation of screening tests in medical diagnosis, Biometrics 20:730-755, 1964.
16. Sunderman, F. W., and Van Soestbergen, A. A.: Laboratory suggestions: Probability computations for clinical interpretations of screening tests, Am. J. Clin. Pathol. 55:105-111, 1971.
17. Vecchio, T. J.: Predictive value of a single diagnostic test in unselected populations, N. Engl. J. Med. 274:1171-1173, 1966.
18. Yerushalmy, J.: Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques, Pub. Health Rep. 62:1432-1449, 1947.
19. Youden, W. J.: Index for rating diagnostic tests, Cancer 3:32-35, 1950.

SECTION THREE

PROBLEMS IN MEASUREMENT

In discussing the methods used for assembling and analyzing data, we begin with the assumption that the basic information is (or will be) there, awaiting the statistical manipulation. For many activities in medical research, however, the main challenges are in the data, not in the analysis. This aspect of research architecture requires decisions about what data to get and what methods to use for the process of measurement that converts an observed entity into an item of data. Statistical strategies are seldom pertinent for these challenges, because the mathematical models are concerned with numerical methods of analysis, not with scientific methods of mensuration.

The four papers in this section deal with a few of the many important problems in clinical measurement that require clinical rather than statistical solutions. For one set of solutions, clinical investigators need an intellectual liberation from the entrenched but erroneous doctrine that data can be "hard" only if expressed in dimensional numbers. The persistence of this fallacious doctrine has been abetted by an unbalanced emphasis on parametric forms of statistical analysis and by the scientific aberration that occurred many years ago when Gaussian ideas about the variance of different measurements for a single entity became applied to the distribution of single measurements for different entities. A separate set of solutions will involve remedial action for the inappropriate use of the word "normal" to describe the shape of a statistical curve, and for misguided attempts to establish medical normality solely according to statistical locations.

Statistical proposals for the assessment of safety and efficacy have been scientifically unsatisfactory because of the failure to delineate what is meant by "safety" or by "efficacy." The many published reports of adverse drug reactions have not been accompanied by a reproducible operational identification of adverse drug reactions; and efficacy is constantly appraised with techniques that do not encompass the wide spectrum of phenomena requiring consideration. To accomplish a large-scale surveillance of the effects of therapeutic agents, careful attention will be needed for many complex issues in clinical specifications and clinical analyses for the populations to be observed, the methods of observation, and the differentiation of effects. These issues are usually disregarded or oversimplified when the investigators hope that the problems will be solved mainly with mathematical inspirations or computerized extravaganzas.

CHAPTER

On

16

exorcizing the ghost of Gauss

and the curse of Kelvin

Under the same name, this chapter originally appeared as "Clinical biostatistics. XII." in Clin. Pharmacol. Ther. 12:1003, 1971.

At any era in the history of science, the further advance of science has been retarded by certain fundamental concepts that were enlightening when introduced, but that later, after persisting too long, became barriers to future progress. In his perceptive book, The Structure of Scientific Revolutions, Thomas S. Kuhn 27 has used the term paradigms for these fundamental concepts, which he defines as "universally recognized scientific achievements that for a time provide model problems and solutions to a community of practitioners (of science)." As examples of change in such paradigms, Kuhn cites the transitions from Ptolemaic to Copernican to Keplerian theories of astronomy; and from Aristotelian to Newtonian to Einsteinian concepts of dynamics. An example in the history of biomedical science would be the successive basic alterations in ideas of cardiac physiology from pre-Galenic times to Galen to Harvey to modern beliefs. Kuhn notes that contemporary paradigms have important roles as stimuli to the growth and development of the "normal science" of an era, but he also points out the way that creative research becomes stultified when adherence to an outdated paradigm produces "a strenuous and devoted attempt to force nature into the conceptual boxes supplied by professional education." 29 When nature refuses to remain in those boxes, "anomalies, or violations of expectations, attract the increasing attention of (the) scientific community . . . (to) . . . the emergence of crises that may be induced by repeated failure to make an anomaly conform." 28 The initial response to a crisis is resistance, as the defenders of the old paradigm devise "numerous articulations and ad hoc modifications of their theory in order to eliminate any apparent conflict." 32 Eventually, however, after "normal science repeatedly goes astray . . . the profession can no longer evade anomalies that subvert the existing tradition of scientific practice." 32 The crisis becomes intolerable and a "scientific revolution" occurs, producing a new paradigm, "a new set of commitments, a new basis for the practice of science." 30

My object in this essay is to call attention to two outdated paradigmatic concepts that have been stifling the intellectual growth of clinical biostatistics. One of these concepts, an extension of ideas often attributed to Carl Gauss, is the belief that the observed data of clinical medicine can usually be expressed with "normal" distributions for "continuous variables" having a "variance" that can be calculated from the observations. The second concept, an extension of beliefs stated by Lord Kelvin, is the idea that scientific data must be expressed objectively in the form of dimensional measurements. Both of these concepts provided major enlightenment when they first became accepted as paradigms; both have now led to major intellectual crises that remain unsolved by various ad hoc modifications of the basic paradigms; and both are now being used to substitute for enlightened thought or to thwart it.

1. The ghost of Gauss

As technologic devices of measurement began to proliferate during the 19th century, scientists became confronted by a phenomenon that we now call "observer variability." Repeated measurements of the same object did not yield identical results. The disagreements immediately provoked the question of how to choose a correct value from among the diverse measurements. The obvious answer to this question was to designate the mean of the measurements as correct. With this assumption, the amounts by which values differed from the mean would be regarded as errors. The magnitude of individual deviation from the mean could be calculated for each of the n measurements; the individual deviations could be squared (to remove negative signs); the squared deviations could be added together; and the sum could be divided by n to provide a "mean" of the squared deviations that was called the error variance. The square root of this error variance was designated as the standard deviation.

One of the great statistical contributions of C. Friedrich Gauss was to notice that these errors in measurement were usually distributed in a specific pattern of symmetry. The most commonly repeated single value was usually the mean itself; small errors on either side of the mean were next most frequent; large errors were uncommon, although present in about equal frequency on both sides of the mean; and extremely large errors were rare. When the frequency of the individual measurements was graphically plotted against their values, the result was a symmetrical "bell-shaped" or "cocked-hat" curve, with its apex at the value for the mean. The name "normal" was given to the pattern of frequencies shown in this curve, and it has become the familiar "normal distribution" on which so much statistical reasoning has depended during the past century. The important intellectual advances that have come from this reasoning are too well known to need recapitulation here. What is less apparent, however, is the enormous obstacle that this reasoning now imposes to progress in clinical biostatistical research.

A. The ambiguity of 'normal.' The decision to call this type of curvilinear symmetry a "normal distribution" was a reasonable act of mathematical nomenclature. The word "normal" has often been applied in various aspects of mathematics and the natural sciences. In clinical medicine, however, the word normal had already been established, long before its statistical usage, in reference to a quite different connotation: the distinction between health and disease. The definition of this distinction is an issue too fundamental and complex for further discussion here. Suffice it to say that the word normal has two entirely different meanings in its two types of usage. Statisticians refer to the shape of a curve; clinicians refer to a state of well-being.

The extensive medical problems created by the confusion between these two ideas about normal will be reserved for a later paper in this series. For the moment, however, we can note that clarity of thought in clinical biostatistics would best be served if the bell-shaped curve had another name. The adjective campaniform, which has been proposed 40 for the curve's configuration, might be a satisfactory "generic" designation, but a commemorative eponym seems more appealing. The phrase Gaussian curve is readily understood and has already been used by many writers, although it perpetuates the same type of historical "injustice" created by the many medical eponyms in which the commemorated person is the man who first popularized a phenomenon, rather than the man who first described it. The earliest account of the bell-shaped curve was by Abraham DeMoivre, not by Gauss. 45

B. The choice of a 'normal range.' Another major problem in normality is the choice of a "normal range." The problem has at least three components: (1) whether the idea of normality should depend on the characteristics of a state of health or on the value of a numerical measurement; (2) what is the appropriate population whose medical characteristics or numerical measurements (or both) will be used to determine normality; and (3) by what method should boundaries be demarcated to create a range? These questions have been increasingly debated in recent years, particularly as a "range of normal" has been sought for the abundance of new paraclinical data produced by the increasing medical use of chemical technology. My own contributions to this debate will be deferred for a later paper, but I have borrowed part of my title here from the title of an excellent recent discussion ("Health, Normality, and the Ghost of Gauss") by Elveback, Guillier, and Keating. 10

C. The logical alteration of 'variance.' In the original development of Gaussian statistics, the term variance referred to inconsistencies in observation. When the same object had received a series of nonidentical measurements, the deviations from the "correct" mean would be the source of error variance. The term error was thus attached to variance in the 19th century in order to connote disparities in the act of mensuration, and the term error variance referred to the size of the discrepancies. Thus, in the sequence of OBJECT → ACT OF MEASUREMENT → VALUE, the mean and the variance referred to the values per act of measurement. Since only a single object had been measured, the variance clearly did not refer to the object itself.

During the 20th century, however, the original logic of variance has been altered, so that variance may now refer to the objects themselves, as well as to the measured values. We can therefore talk about the variance of the data regardless of whether we get 30 repeated measurements of the same object, or a single measurement for 30 different objects. In the first kind of variance, however, we refer to the process of measurement, whereas in the second, we usually ignore the mensurational activity and refer to a characteristic of the measured object. From variance per measured value, the term has been altered into variance per individual measured object. For example, when we repeatedly measure the value of sodium in a single specimen of serum, we get the variance of the laboratory procedure; but when we look at the single values of serum sodium for a group (or "sample") of people, we get the variance of the people.

One obvious problem of ambiguity in this double usage of the same term is caused by the persistence of the phrase error variance in some statistical textbooks. The phrase may be appropriate in reference to a process of measurement, but the term error is improper and confusing when it prefixes a variance that refers to a group of people. The reader may be misled into believing either that the measured values are themselves incorrect (a point that is seldom at issue), or that something was wrong about the people who deviated from the mean. Aside from this semantic difficulty, however, the logical alteration of variance has had profound effects on the use and abuse of modern statistics.

(1) The distinctions of quality control and populational variance. When the many chemical tests performed in modern clinical laboratories are checked for "quality control," a traditional Gaussian calculation is applied to values obtained in repeated measurements of the same specimen. The variance of these multiple values helps indicate the reproducibility and precision of the test.

When an array of results is assembled for specimens from a group of patients or for successive specimens from the same patient, however, the calculated variance contains an admixture of variances arising not only from the difference in the specimens but also from the vicissitudes of the test. Nevertheless, when the data for two such specimens are compared, most clinicians usually decide that one value is higher or lower than the other without considering the amount of variation attributable to the test. The necessary adaptations, which are beyond the scope of this essay, are particularly well discussed in an intriguing new book by Roy Barnett. The main point to be noted here is that the variations in people, as discerned from a test, cannot be properly interpreted without considering the variations in the test itself.

(2) Homogeneity of sample. In the days when variance depended on the measurements of a single object, statisticians did not have to worry about the homogeneity of the objects being measured. The sampled object was always "homogeneous" unto itself. When measurements are performed for a group of objects, however, the issue of homogeneity becomes fundamental to the scientific validity of the group. In order to apply any statistical calculations that depend on a group of objects, we must first decide that the objects can be considered collectively as a group. If the objects are sufficiently homogeneous, they can readily be combined for the calculation of a collective mean and variance. If the objects are too heterogeneous, however, the collective values of mean and variance may produce statistical gibberish. For example, we might have no objection to calculating the mean and variance for the height of a group of people, but we might greatly demur from such calculations if the group contained people, mice, elephants, and gerbils, or if the group was confined to people but comprised a mixture of newborn infants, adult pygmies, and professional basketball players.

The complex problems of defining and determining homogeneity will be reserved for a later paper in this series. The only point to be noted now is that a "chicken-and-egg" type of problem has been created by the frequent use of variance, in its newer, logically altered meaning, as a measure of homogeneity. Do we first decide that a group is homogeneous and do we then determine the variance, or do we first measure the variance and then decide that the group is homogeneous? The classical statistical approach to this question is to determine homogeneity from the calculated variance. If the co-efficient of variation, which is the ratio of the standard deviation divided by the mean, is below 10%, the group is often regarded as reasonably homogeneous. Statistical strategies and tactics are permeated with the idea that homogeneity is determined from post hoc calculation rather than from a priori classification. In most textbooks and other statistical literature, the term "homogeneity" rarely appears alone, and is generally used in the context of homogeneity of variance. Diverse "corrections" and other adaptations have been developed for statistical inferences and tests in circumstances where the variance is not "homogeneous."

The classical scientific approach to this question, however, is to regard homogeneity as an issue in taxonomy, not in statistics. Scientists usually make an a priori judgment about homogeneity before we measure, not afterward. We decide, as an act of taxonomic classification and diagnostic identification, whether our observed objects are mice or elephants, newborn babies or professional basketball players. We do not rely on measurements and calculations for these issues in identification. For example, if a group of objects had a mean weight of 15.1 pounds with a standard deviation of 0.2 pounds, a classical statistician might conclude that the objects were quite homogeneous because their co-efficient of variation is only 1.3%.* This conclusion might be entirely justified with respect to the weight of the objects, but a classical scientist would demand more description of the objects before drawing any conclusions about their homogeneity.

He might then discover that the group consisted of littermates chosen from kennels of small dogs, large cats, and huge birds.

*1.3% = (0.2 / 15.1) × 100.

In certain forms of scientific classification, the available categories of taxonomy may not be satisfactory for decisions about homogeneity. Thus, the classical taxonomy of biology is adequate for distinguishing different four-legged animals as mice, dogs, cats, cows, horses, or elephants, and for distinguishing different species among these animals. The available taxonomy of human chronology and occupations is adequate for distinguishing newborn babies from professional basketball players. But classical taxonomy has not been equally satisfactory for distinguishing different species of such entities as bacteria or worms. A form of numerical taxonomy 6, 38 has now been developed to make such distinctions by calculations based on the measured values of different physical and chemical properties of these entities.

In the classification of clinical phenomena, many important problems of heterogeneity have been overlooked during the statistical attention to variance as an exclusive index of homogeneity. Among such problems are the inadequacies of the current taxonomy of disease, which cannot be remedied merely by calculations or by reshufflings of categories of coding for the diseases. 11 The prognostic inadequacy of diagnostic nomenclature also cannot be improved by statistical tactics alone, and the improvements will require careful consideration of important variables that have been omitted from current classifications. 11 These improvements will not occur, however, if the intellectual problems of medical taxonomy remain neglected, and if Gaussian variance continues to be regarded not only as a description of a group, but also as a primary index of homogeneity.

D. The idea of continuous variables. Perhaps the most intellectually pernicious current residue of Gaussian statistics is the abstract mathematical concept of continuous variables. One of the basic tenets of the mathematical expression for the Gaussian curve and of the statistical reasoning based on this expression is that the variables are continuous. A variable is regarded as continuous if it can take on additional values that lie within any defined interval of values, no matter how small the interval becomes. For example, serum sodium can be regarded as a continuous variable. If we could measure it as finely as 136.75 or 136.76 meq./L., it can still assume the values of 136.752 or 136.753 within that interval. If our measuring device were precise enough to identify the latter two values, serum sodium could still be 136.7528 or 136.7529, and so on.

All the mathematical concepts of analytic geometry and the calculus depend on continuous variables, as do most of the concepts of mathematical statistics. As soon as statistics leaves its theoretical origins and enters the world of reality, however, a new symbol is necessary to indicate that the real world does not permit such niceties in measurement. This symbol is the Σ sign that is the traditional emblem of statistical activities. If the measurements used in statistics were minute enough to be truly continuous, we would not need the Σ sign.

To calculate the sum of values for a continuous variable, x, that had different frequencies at each value, we would say that x had a "density function," f(x), which represented the frequency of x at each point as it "moves" continuously along its curve. The sum of the values of x would then be expressed as ∫xf(x)dx, using the integral sign (∫) for sums of continuous variables.
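In conventional notation, the continuous expressions just described can be sketched as follows. The symbols N (the total frequency) and x̄ (the mean) are introduced here only for illustration and do not appear in the text; f(x) is assumed to be an unnormalized frequency density.

```latex
\[
\text{sum of values} \;=\; \int x\, f(x)\, dx ,
\qquad
N \;=\; \int f(x)\, dx ,
\qquad
\bar{x} \;=\; \frac{\int x\, f(x)\, dx}{\int f(x)\, dx} .
\]
```

Under these assumptions the mean is simply the continuous sum divided by the total frequency, which is the idealized counterpart of the ordinary arithmetic mean.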

In both biologic and statistical reality, however, we cannot determine f(x) for the idealized continuous values of x. Instead, we note the actual metric values of x at a series of points, and we count the corresponding frequencies. We express those metric values as xᵢ, the frequencies as fᵢ, and the sum of the values as Σfᵢxᵢ. Furthermore, when we draw a graph to show the pattern of these frequencies, we do not draw a continuous curve. Instead, we draw a "frequency polygon" that connects the observed points with straight lines. Alternatively, we do not draw a line at all, and we construct a bar-graph called a "histogram."

This distinction between the ideal and real metric values of a continuous variable would be merely pedantic were it not for another crucial feature that differentiates clinical biology from the statistical models that have been proposed for it. Many important clinical variables are discrete rather than continuous. They are expressed in specific categorical terms that may be numerical or verbal. Some of these discrete variables are also metric, in that they can be expressed on a ranked scale with equal intervals between adjacent ranks. Illustrations of discrete metric variables are number of children or number of previous myocardial infarctions.

Many other important clinical variables, however, are not even metric. They are expressed as discrete categories in "scales" of values that are either ordinal, nominal, or existential. An ordinal scale contains such values as none, mild, moderate, and extreme for the variable, severity of chest pain, or such values as 0, 1+, 2+, 3+, and 4+ for the variable, briskness of patellar reflex. Although an ordinal scale is demarcated into ranked values, the interval between any two adjacent ranks is not measurably equal. A nominal scale contains unranked values such as male and female for the variable, sex; or doctor, lawyer, and other for the variable, occupation. An existential scale contains such values as yes and no, or present and absent for the variable, presence of chest pain, or for the variable, presence of rheumatic fever. An existential scale can be converted to ordinal rankings with such gradations as definitely present, probably present, uncertain, probably absent, and definitely absent.

A statistical paradigm that depends on continuous metric variables will frequently produce biologically peculiar results when the paradigm is applied to variables that are either metrically discrete or non-metric. Some of the peculiarities have occurred so often that the classical mode of calculation may be abandoned in favor of an alternative expression. Thus, rather than saying that the average American family has 2.3 children, we would avoid calculating a mean that creates the strange 0.3 child, and we would state, instead, that American families have a median of 2 children. When a variable is expressed nominally, neither a mean nor a median can be used for calculating a summary of the values, and the result would be stated as a mode or proportion. Thus, we might note that chest pain is the most common (i.e., the mode) presenting complaint among patients in a coronary care unit, or that chest pain appeared in 76% of such patients.

In many other instances, however, a suitable alternative expression has not yet been developed, and ad hoc modifications are still being proposed for the anomalies produced by an unsatisfactory statistical paradigm. For example, to predict survival from the many variables that describe a patient's initial state, the main paradigm offered by modern statistics is a technique called multiple linear regression. For clinical reasoning, this technique has many defects that have been described elsewhere. 25 I shall here cite only two of the statistical difficulties. Since the regression technique depends on numerical data, it cannot accommodate existential or nominal variables, which are not expressed in numerical values. Accordingly, the ad hoc modification is to assign arbitrary numbers as "values" for these variables, so that absent may become 0 and present, 1; or male may become 2 and female, 3. An additional problem of the multiple regression procedure is that its array of numerical co-efficients, when applied to the data for a particular patient, may sometimes yield a negative value for the predicted survival. The ad hoc modification for this anomaly is to apply a "logistic" correction that makes the regression procedure always yield a positive value.
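The negative prediction mentioned above is easy to reproduce in a minimal sketch. Everything in it is an assumption made for illustration: the five patients and their survival times are invented, a single ordinal predictor stands in for the "many variables" of a real analysis, and the helper names fit_line and logistic are hypothetical. The logistic function is shown only as the kind of transformation that confines a linear score to positive values; it is not claimed to be the exact correction used in any particular study.

```python
import math

# Hypothetical data, invented for illustration: an ordinal severity grade
# (0 to 4) coded as a number, and observed survival in months.
severity = [0, 1, 2, 3, 4]
survival_months = [40.0, 22.0, 10.0, 3.0, 1.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def logistic(score):
    """Map any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


intercept, slope = fit_line(severity, survival_months)

# Straight-line prediction for the most severe grade actually observed.
predicted = intercept + slope * 4
print("linear prediction (months):", round(predicted, 1))

# The same linear score passed through the logistic function stays positive.
print("logistic of the score:", logistic(predicted))
```

With these invented numbers the fitted line falls below zero at the highest severity grade, a biologically meaningless "survival," whereas the logistic transformation of the same score remains between 0 and 1.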

As long as the tactics are otherwise successful, a gerrymandered statistical procedure need not create any major biologic difficulties. The main scientific hazard of the continuous-variable paradigm occurs when a devotion to such variables becomes the basis for rejecting or disdaining data that are discrete, rather than continuous. Such attitudes can arise from either intellectual inertia, intellectual prejudice, or both. In the customary educational background of a statistician, the contemplated variables are usually continuous, so that he becomes most comfortable intellectually when dealing with continuous data. To avoid the "discomfort" of discrete data, the statistician may then advise his clinical consultees to eliminate discrete variables in favor of continuous metric data. With these substitutions of variables, the research may become altered from its true objectives. The easily measured continuous variables may yield statistically comfortable and numerically precise results that answer the wrong questions.

The intellectual inertia that leads to these inappropriate substitutions could not have been sustained for so long a time, however, without an additional source of support. No scientist would persistently discard or fail to analyze important data merely for the convenience of statistical calculations. The attitude that encourages a displacement of discrete data by continuous variables has required intellectual reinforcement not just from statistical ideology, but from the paradigms of science itself. The prejudice against discrete data has been incorporated into scientific thinking for almost a century, and is epitomized in the doctrine of William Thomson, later knighted as Baron Kelvin.

2. The curse of Kelvin

About 90 years ago, at a time when the work of Charles Darwin was drastically altering some fundamental paradigms of human biology, when the concepts of Rudolf Virchow had led to the new paradigms of medical histopathology, and when Francis Galton was performing the magnificent analyses that would culminate in the development of biometry as a distinctive new intellectual discipline, a recurrent exhortation was being offered to physicists. The exact words of Lord Kelvin's theme, as Kuhn 26 has noted, appear in diverse forms, but the basic sentiment is: "When you cannot express it in numbers, your knowledge is meagre and unsatisfactory." This theme, which was later repeated by many other eminent researchers, including Galton, has become one of the paradigms of modern science.

For physicists, chemists, astronomers, and other workers in the contemporary "natural science," Kelvin's exhortation seemed entirely appropriate. During that era, a burgeoning technology had begun to produce many new instruments with which to measure physical and chemical phenomena. Scientists were avidly working to create those instruments and to apply them in measurements. Kelvin's exhortation was also promptly accepted and implemented by biologists who could apply the new instruments to study similar phenomena in the fluids, excreta, tissues, and cells of biologic organisms. The exhortation was also happily received by biometricians who were seeking numerical measurements with which to develop the principles and data of their infant discipline.

An important feature of Kelvin's doctrine was that it demanded numbers, but did not discriminate among three different ways in which numbers could be obtained. The two principal sources of numbers are mensuration and enumeration. In mensuration, the number is observed as a dimension on an established continuous scale of values. In enumeration, the number is obtained by counting a group of identified entities. In mensuration, the number is "continuous," and in enumeration "discrete." A third way of getting numbers is to divide one of the principal types of numbers by another. Thus, a mean is usually created by dividing a mensuration by an enumeration; and a proportion, by dividing one enumeration by another enumeration. For example, we can use numbers to say that a particular man weighs 70.21 kg., has 4 children, and belongs to a club of which 20% of the members are board-eligible doctors. Each of these citations contains a numerical statement, but the first number is a dimensional mensuration, the second is a counted enumeration, and the third is a proportionate ratio or percentage.
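The three routes to a number can be made concrete in a small sketch. The club below is hypothetical: its five members, their weights, and their occupations are invented for illustration, and the variable names are not drawn from the text.

```python
# Hypothetical club of five members; every value is invented for illustration.
weights_kg = [70.21, 82.5, 64.0, 91.3, 77.0]  # mensuration: dimensions read from a scale
is_board_eligible_doctor = [True, False, False, False, False]  # verbal description per member

# Enumeration: counting identified entities.
member_count = len(weights_kg)
doctor_count = sum(is_board_eligible_doctor)

# A mean divides a (summed) mensuration by an enumeration.
mean_weight_kg = sum(weights_kg) / member_count

# A proportion divides one enumeration by another enumeration.
doctor_proportion = doctor_count / member_count

print("mean weight (kg):", round(mean_weight_kg, 3))
print("proportion of board-eligible doctors:", doctor_proportion)
```

The proportion here is a quantification of purely verbal observations: no member had to be placed on a dimensional scale for the count of doctors, or the resulting 20%, to be obtained.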

All three of these numbers provide quantification in whatever setting they are used, but the numbers arise from distinctively different basic forms of description. In one form of citation, the basic observation is itself a number on a dimensional scale. In the other form of citation, the basic observation is a verbal description whose "units" are nouns, adjectives, verbs, or adverbs. We use such words to stipulate that an entity is a child or a board-eligible doctor, and we then quantify that stipulation by counting a group of verbally described entities, or by expressing the counted group in proportionate percentages of some other counted group.

An individual object can therefore be described in dimensions or in words. A group of objects can be quantified by calculating a mean for the dimensions, or by determining a count or ratio for what is represented by the descriptive words. Statistical data could thus be achieved from either a route of dimensional mensurations and calculated means, or a route of verbal descriptions, counted enumerations, and fractional proportions.

In the general interpretation given to the Kelvin doctrine, however, the latter route was rejected. A populational quantification by counts and ratios of verbal data would have provided numbers that were just as "numerical" as dimensional means and standard deviations, but the desideratum chosen for "science" was to achieve a dimension at the basic level of observation. The concept of quantification was limited to the concept of metrification.

With this constricted interpretation of Kelvin's "numbers," the rush toward dimensional measurement began and has continued. During the stampede, many clinical biologists and biometricians apparently forgot that Darwin's and Virchow's major scientific advances had not required the use of numbers, and that Galton's biometry was based mainly on describing phenomena and counting them, not on dimensional mensuration. Furthermore, during Kelvin's lifetime, bacteriology and radiography* were introduced as new disciplines of observation, producing data that were scientifically fundamental, precise, and verbal. Although these triumphs of verbal description might have been expected to reduce the alacrity with which mensurational pursuits became a prime goal of biomedical research, the allure of measured dimensions was too great. Technology was available to provide the measurements, and Gaussian-type "continuous" statistics were available to manipulate the numbers. The current interpretation of Kelvin's doctrine has now become so well established as a scientific paradigm that its fundamental basis is almost never questioned, despite the major problems and anomalies it has caused.

*The arrival of x-rays was greeted with surprise and shock by many members of the scientific community. Lord Kelvin at first regarded them as an elaborate hoax. 31, 42

For clinicians and for social scientists, the consequences of Kelvin's exhortation have been more of a curse than a comfort. The variables that are most important to practicing clinicians and sociologists are usually expressed in nominal, existential, or (at best) ordinal form and cannot readily be converted to dimensional metric measurements. There is no technologic device or natural scale of dimensions 22, 40 with which a physician can metrically measure such entities as chest pain, abdominal cramps, back pain, dyspepsia, dyspnea, dysuria, or


other discomforts encountered in the daily

work of

technologic

device

natural

or

is

no

scale

of

There

medicine.

clinical

dimensions that can be used by a psychiato metrically

trist

measure

love, fear, an-

or by a sociolmeasure the diverse at-

sured what it purported to measure? the previous verbal descriptions were

adequate so that intelligence and anxiety themselves ambiguously specified, with what could the investigator correlate his scale to validate it?

ogist to metrically

current

and exchanges that occur

interactions.

To

numbers

approval by

for

achieve

scientific

creativity?

col-

with variables that could be cited in metric effect of these

replacements has

been wryly summarized by Frank Knight in the remark, "If you cannot measure, measure anyhow." 24

The new metric

The

to use laboratory technology,

first

when

ticular scale for "anxiety" or for "personality

inventory"

better or

more accurate than

The

of validating

difficulty

scales

phenomena has been

for

a major

scientific handicap for workers in psychology and sociology. Some of the contrived scales have been highly effective (or at least widely accepted), but most of them have never received a primary validation,

measome other

pulmonary function might replace a verbal account of the severity of dyspnea.

many new

The second method was

dated)

test of

to contrive a numerical values that would provide a seemingly metric expression for a "scale" of

non-physical quality, such as intelligence,

The

is

another scale?

pos-

cited in verbal descriptions. Thus, a

had no counterpart

assessed

is

framework of multiple-choice answers to a question, do we neglect the importance of an ability in logical synthesis that might be demonstrated only with an "essay" type of response? How do we decide that one par-

non-physical

dimensional appraisal of entities that might correspond to those previously

that

intelligence

was

sible, for

vital capacity or

When

exclusively in the analytic

variables could be ob-

tained in two different ways.

surement of

place too

mensurational

would have to abandon his precision in the use of words, and would have to replace the discrete data of verbal description

The

For example, do

"intelligence"

of

tests

high a premium on vocabulary and arithmetic, at the expense of imagination and

in social

leagues, a clinical or sociologic investigator

units.

If

in-

were

xiety, hostility, or depression,

titudes

237

exorcizing llw ghost of Gauss and the curse of Kelvin

in technologic tests.

are "validated" only in

scales

correlation with an accepted (but unvali-

old scale, and many others have been non-productive exercises of the "quantophrenia" described by Sorokin. 39 (The passion for contrived but unvalidated measurements was tellingly satirized by R. E. Dickinson, 9

who

psychology and sociology. Since laboratory technology offered almost no dimensional variables that corresponded to the non-physical phenomena

"The unit proposed is the milli-helen, the quantity of beauty required to launch exactly one

of behavioral or sociologic research, psy-

ship.")

A.

'metrics' of

and

chologists

sociologists

new numerical

"scales"

began for

Even

to create

"measuring"

suggested a scale for

assessing feminine beauty:

if

all

metrification

the problems of contrived

had been

such entities as intelligence, anxiety, and family inter-relations. By assigning arbitrary

scientific progress in

grades to the answers received in question-

still

naires

or

interviews,

and by combining

and clinical psychology would be handicapped by almost insurmount-

able difficulties in selection of the population to

could express the results in ranked numbers. 3 5 "• 41 43

ceding

-

-

>

The main difficulty with these scales has been their validation. How could the investigator determine that a scale actually mea-

however,

sociology

these grades in diverse manners, the investigators

solved,

the quantification of

be measured. As indicated in prepapers 15

17 '

'

18

of

this

series,

the

an investigated population may be destroyed if it consists of a non-randc validity of

1

or chronologically diffuse cross-sectioif

it

is

examined

in a

backward

r

direction instead of being pursued forward as a cohort. Although clinicians can constantly observe cohorts of patients who are offered therapy and followed thereafter, sociologists do not ordinarily "treat" their subjects, and psychologists (or psychiatrists) have not studied therapy with the same fervor that has been devoted to the etiology of psychic ailments. Consequently, many sociologic populations have consisted of "cross-sections" that could not be chosen randomly, and psychologic populations have often consisted of psychically ill people who were followed in a backward direction toward etiology, pathogenesis, and "psychodynamics." When psychologists have attempted to perform forward studies, the members of the cohort have usually consisted of "healthy" volunteers, "chunk groups," or other "rancid" samples. 15 These epidemiologic difficulties have outweighed any major advances obtained from the creation of metric scales by psychologists and sociologists. The full scientific potential of the scales cannot be exploited until satisfactory methods have been developed for getting suitable cohorts and other appropriate populations for investigation.

B. The 'metrics' of clinical medicine. In ordinary medical activities, a clinical investigator is spared many of the problems that his psycho-social colleagues encounter in non-physical phenomena and in cohort availability. The symptoms and signs observed as clinical data can generally be related to "physical" entities, and clinicians constantly treat patients who become cohorts for studies of the course of disease and therapy. With satisfactory techniques of classification and investigation, the treated cohorts could be analyzed in a manner that would remove or adjust the bias accrued during their non-random collection. 17, 18 Given the opportunity to study the clinical and therapeutic cohorts of "organic" disease, physicians could have been expected to make major intellectual improvements in the scientific assessment of therapy. The major therapeutic advances of the past few decades, however, have been technologic, not intellectual. The advances have occurred in the surgical tactics and pharmaceutical agents of treatment, not in the scientific design and analysis of therapy. Clinicians have finally learned that therapy must be "controlled" and quantified, but have not yet developed satisfactory methods for choosing suitable "controls" and for quantifying the appropriate variables. Clinicians have begun to accept and utilize the statistical concept that planned therapeutic experiments require a randomized allocation of treatment, but equal attention has not been given to the scientific concept that the treated patients and their clinical responses must be adequately identified. The consequences of the unsatisfactory methods of investigation are evident in the therapeutic controversies that exist today at every level of treatment from such minor ailments as the sprained back and common cold to major ailments such as diabetes mellitus, myocardial infarction, fracture of the hip, and peptic ulcer, to catastrophic ailments such as cancer. The abundance of statistical data and even randomized trials has produced many numerical "confidence intervals," but few therapeutic numbers about which thoughtful physicians could feel clinically confident.

Many of the problems have been due to the clinician's abdication of his own responsibilities by delegating the design of research to statisticians whose abstract "models" have been inadequate for the realities of clinical biology. 14 But an equally important source of problems has been the clinician's failure to emulate the metric creativity of his psycho-social colleagues. A discipline of clinimetrics has not been established to correspond to psychometrics and sociometry.

Clinicians have become metrically oriented, but not while observing patients. The metric data have come from paraclinical tests of patients' blood, urine, and other substances observed in the laboratory. If clinicians had respected their clinical
observations, a collection of suitable "scales" would have been created, tested, and validated for identifying the existence or assessing the severity of such phenomena as pain, digestive distress, dyspnea, and all the other "physical" symptoms encountered in medical practice. The scales, by now, could have been standardized in direct clinical activities, and would be available today for therapeutic surveys and trials.

Instead, however, clinicians have evaded the scientific challenges of their own observational data. Rather than confronting and solving the problems of clinical data, clinicians have substituted the available plethora of metric data from paraclinical technology. An array of dimensional numbers far beyond the fondest expectations of any Kelvinistic dream has been provided by the technologic ability to measure the constituents of diverse fluids; to assess the physiologic magnitudes of structural volume, mechanical pressure, electrical conduction, and muscular contraction; and to count the uptake of radioactive substances. All of these phenomena are generally observed, however, not in the patient, but in substances derived from the patient and examined by methods other than clinical skills. To achieve the phantasmagoria of metric data in modern medicine, clinicians have shifted their medium of observation from the patient in the bed to the substance in the lab.

This deliberate dehumanization of the observed variables was inspired by the quest for "better science" and it has produced results that are often scientifically excellent. Nevertheless, the absence of suitable attention to patients has produced major defects in both therapeutic science and therapeutic care. Some of the most crucial aspects of clinical therapy can be discerned only in the patient and cannot be appropriately measured with paraclinical data. Neither the initial state nor the subsequent state of a treated patient are properly identified if the analyzed variables do not include such clinical features as iatrotropy, co-morbidity, and the cluster, sequence, and chronometry of symptoms. 11 A patient's performance in an isolated laboratory test of exercise does not indicate his dyspnea or angina pectoris in the circumstances of daily life. The assessment of roentgenographic tumor size, white blood count, and survival time does not indicate whether a patient with cancer is alive and vibrant, or miserable and vegetating. The assessment of electrocardiographic changes, angiographic anatomy, or digitalis intake does not indicate a patient's cardiac function.

In these examples and in numerous other instances of the data needed for evaluating modern therapy, the necessary verbal variables of clinical observation have been replaced by numerical data from paraclinical variables that provide the right measurement for the wrong entity. The consequences of this "substitution game" have been inadequate clinical science, because neither the pre-therapeutic nor post-therapeutic state of the patient is sufficiently specified; and dehumanized clinical care, because the patient himself is deliberately ignored as an important source of information for analysis.

The basic reason for avoiding the verbal descriptions of clinical data has been that they are subjective, "soft," and non-reproducible. Much of this difficulty could be promptly removed, however, if clinicians gave serious attention to improving the methods of observation and classification for the data. By shunning the intellectual seduction of inadequate metric substitution, by establishing suitable taxonoric criteria for interpretation of the "soft" data, 12 by identifying the observational details necessary for the interpretations, and by analyzing and reducing the inconsistencies with which these details are observed, clinicians could preserve the important "soft" data while "hardening" their scientific quality. The various nominal, existential, or ordinal categories of classification for these data would then produce the necessary scales of clinical medicine.
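As a rough illustration of how ordinal categories of clinical observation might be handled as a scale, the sketch below grades dyspnea in ordered verbal categories, summarizes a small group of patients as proportions, and computes the raw proportion of agreement between two observers. The grade names, the sample observations, and the function names are all invented for this example; they are assumptions, not a validated clinical scale or a method taken from the text.

```python
from collections import Counter
from enum import IntEnum


class DyspneaGrade(IntEnum):
    """Hypothetical ordinal grades, invented purely for illustration."""
    NONE = 0
    ON_EXERTION = 1
    ON_ORDINARY_ACTIVITY = 2
    AT_REST = 3


def proportions(grades):
    """Return the proportion of observations falling in each grade."""
    counts = Counter(grades)
    total = len(grades)
    return {grade: counts.get(grade, 0) / total for grade in DyspneaGrade}


def observed_agreement(first, second):
    """Raw proportion of patients given the same grade by two observers."""
    if len(first) != len(second):
        raise ValueError("both observers must grade the same patients")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


# Invented observations of the same four patients by two observers.
observer_a = [
    DyspneaGrade.NONE,
    DyspneaGrade.ON_EXERTION,
    DyspneaGrade.AT_REST,
    DyspneaGrade.ON_EXERTION,
]
observer_b = [
    DyspneaGrade.NONE,
    DyspneaGrade.ON_ORDINARY_ACTIVITY,
    DyspneaGrade.AT_REST,
    DyspneaGrade.ON_EXERTION,
]

summary = proportions(observer_a)
agreement = observed_agreement(observer_a, observer_b)
```

The raw agreement figure is only the simplest possible index of observer inconsistency; it makes no correction for agreement expected by chance, and a real validation would need a more careful strategy.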

Many such scales have been created in recent years. Some of them are used for diagnostic (or "screening") identification; and others deal with grading the severity of an entity either for prognostic or therapeutic purposes. Among such scales are: the various ratings of mental and physical impairment issued by committees of the American Medical Association; 7 the "Apgar score" of the condition of newborn babies; 1 and the "Katz index" of independence in activities of daily living. 23 Other recent scales have dealt with the severity of such entities as tetanus, 34 asthma, 4 the neonatal respiratory distress syndrome, 20 osteoarthritis of the knee, 21 and alcoholism. 37 After many previous "statistical" investigations of clinical course and treatment, scales (or "criteria") have finally been proposed for such scientific fundamentals as the diagnosis of lupus erythematosus, 8 the severity of renal disease, 35 and the quality of survival in patients with breast cancer. 36

These and other new scales are all based on categorical arrangements of information that is primarily verbal. Although the proposed scales represent important scientific progress, the progress is scant and overdue. Almost none of the scales has been directly validated, either in construction or in application, because no satisfactory scientific strategies have been developed for the validations. For example, no penetrating analyses have been given to the issue of whether a composite scale is best created by "additive weights" or by "Boolean clusters." In addition, almost all the scales refer to the patient's condition at a single point in time. Satisfactory scales have not been established for grading transitions from one state of severity to another.

The scope of the scales is still greatly restricted. Many important clinical phenomena, including the assessment of what is really accomplished in the care of patients, 19 have not yet received appropriate investigation and scales for evaluation. A further significant problem is that many major therapeutic trials have been (or are now being) performed with the necessary scales either omitted or abjured. For example, in the celebrated UGDP study, 44 the conventional clinical scale for diabetic neuropathy was replaced by a metric measurement performed with a biothesiometer. In exchange for the statistical satisfaction of these measured dimensions, however, the investigators had to accept the scientific chagrin that later came when the biothesiometric baseline data could not be obtained for more than half the cohort, and when the measured results in the toes were found to be non-reproducible. 16

All of the necessary work in "clinimetrics" will require intensive attention and thought from clinicians who respect the basic scientific importance of the subject and who have the trained observational skills to study it. The work cannot be done by statisticians or other personnel who lack familiarity with the clinical nuances of both the observational procedures and the data. The process of analyzing and validating the results can be greatly aided, however, by statisticians who are enlightened enough to welcome the intellectual challenges of categorical data, perceptive enough to encourage the clinical investigators to grapple with the problems, and wise enough to avoid misleading the investigators into the specious allure of metric substitutions.

In using verbal data to yield discrete categories that can be enumerated and quantified with proportionate ratios, clinicians can fulfill the larger meaning of Kelvin's doctrine about the need for numerical expressions. But a sublime paradox of modern technology can enable the demand for numbers to be satisfied even at the fundamental level of description. At about the same time that Kelvin's exhortation was achieving wide popularity, Herman Hollerith introduced a punch card system with which data could be coded for mechanical processing. First used for the United States census of 1890, the coding system has now evolved into the familiar Hollerith (IBM) cards that are used today for managing data with a digital computer. The tactics of expressing discrete data with numerical "co-ordinates" and coding digits have been described elsewhere, 13 and will not be further discussed here. The main point to be noted now is that such verbal data as "substernal chest pain, provoked by exertion, relieved by rest" can be faithfully and precisely represented when a coding number like 137406 is entered appropriately into the columns of a Hollerith card. The sublime paradox of computer automation, therefore, is that its system for coding data offers clinicians the opportunity to re-humanize clinical science instead of using the inanimate technology to potentiate further dehumanization. By restoring crucial attention and standardization to clinical data that have hitherto been neglected because they could not be expressed numerically, the Hollerith coding system can permit the patient rather than the laboratory to gain supremacy as the center of attention in clinical science. As a group of coding digits, the number 137406 has no more dimensional connotation than a telephone number or a zip code, but it

provides a numerical expression for a precise verbal description, digits

it

that

usually

Gaussian curve, and the term normal can then be liberated and returned to its primordial medical meaning. This transfer, however, will not remove the poltergeist of neo-Gaussian variance, an ambiguity that

in the

confounded the fundamental

has

and an observational process measurement. The scientific distinctions that separate homogeneity and homogeneity of variance will be discussed later in this classification

of

series.
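The idea of representing a verbal description by coding digits, as in the example of "substernal chest pain, provoked by exertion, relieved by rest" and the number 137406, can be sketched in a few lines. The column layout and digit assignments below are hypothetical, chosen only so that the example reproduces that number; they are not the coding scheme described in the cited Taxonorics papers, and the names `CODEBOOK`, `encode`, and `decode` are invented for this illustration.

```python
# Hypothetical codebook: two card columns per attribute. The digit
# assignments are invented for illustration, not taken from the source.
CODEBOOK = {
    "location": {"substernal": "13", "left precordial": "14", "epigastric": "21"},
    "provoked_by": {"exertion": "74", "emotion": "75", "meals": "76"},
    "relieved_by": {"rest": "06", "nitroglycerin": "07", "antacids": "08"},
}
FIELD_ORDER = ("location", "provoked_by", "relieved_by")
FIELD_WIDTH = 2


def encode(description):
    """Turn a dict of verbal categories into a fixed-width digit string."""
    return "".join(CODEBOOK[field][description[field]] for field in FIELD_ORDER)


def decode(code):
    """Recover the verbal categories from a digit string."""
    if len(code) != FIELD_WIDTH * len(FIELD_ORDER):
        raise ValueError("code has the wrong number of digits")
    description = {}
    for index, field in enumerate(FIELD_ORDER):
        digits = code[FIELD_WIDTH * index: FIELD_WIDTH * (index + 1)]
        reverse = {value: label for label, value in CODEBOOK[field].items()}
        description[field] = reverse[digits]
    return description


chest_pain = {
    "location": "substernal",
    "provoked_by": "exertion",
    "relieved_by": "rest",
}
code = encode(chest_pain)
round_trip = decode(code)
```

The code is kept as a string rather than an integer on purpose: the digits are labels for categories, so arithmetic on them (averaging two codes, for instance) would be meaningless, which is the sense in which such a number has no dimensional connotation.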

The computer's capacity also

another

neo-Gaussian

stract inferences

ance.

An

entirely

tical decisions

for

suffice

biostatistical

the

available,

digmatic

allegiance

made

has been

Apgar, V.: Proposal for

new method

infant.

R.

N.:

Clinical

Boston, 1971, Little, 3.

phon

1970,

The Gry-

Chai, H., Purcell, K., Brady, K., and Falliers,

Therapeutic and investigational evaluaAllergy 41:23-36, J.

tion of asthmatic children,

J.:

1968.

the necessary changes. 5.

Church, C. N., and Ratoosh, P., editors: Measurement: Definitions and theories, New York,

6.

Cole, A.

•

The curse of Kelvin can be exorcised, therefore, by recognizing the paramount importance of clinical observations, by augmenting the precision with which the observations are made and verbally described, by developing operational

1959, John Wiley J.,

of

St.

Andrew's,

7.

September,

1968,

London,

Committee on Rating of Mental and Physical Impairment (American Medical Association).

The committee

ing with different

home

Numerical taxonomy. Pro-

1969, Academic Press, Inc.

and by using computer coding techniques to express those categories in numerical digits. The ghost of Gauss can be elimspiritual

Sons, Inc.

ceedings of a colloquium held in the University

criteria to con-

new

&

editor:

vert the descriptions into precise categories,

to a

J.,

Press.

C.

by transfer

statistics,

Buros, O. K., editor: Personality tests and reviews, Highland Park, N.

4.

laboratory

J.

Brown & Company.

tunity,

inated

of evalua-

Anesthes. Analg. 32:

260-267, 1953. (Further details reported in A. M. A. 168:1985, 1958.)

only real

prevent clinical

that

newborn

tion of

from recognizing the opporgrasping its challenge, and creating •

by

References

investigators

•

possible

for discussion at a later date.

and constricted para-

inertia

based on observed varinew approach to statis-

metric

problems that remain are the entrenched intellectual

ap-

zations of data in a Monte Carlo procedure. The new statistical paradigms provided by Monte Carlo concepts will also be reserved

2. Barnett,

chronic

this

remove the

the computer's ability to perform randomi-

remedy malady

now

to

eidolon:

praisal of "significance" according to ab-

1.

for

for calculation

opportunity

the

offers

measurements), and it can serve as a prophylactic agent for avoiding the transmogrified data of "metric madness." Since the is

differ-

ences between an intellectual process of

has six meaningful

(in contrast to the three "significant

figures"


body systems.

A

list

appears

most recent publication in: J. A. M. 213:1314-1324, 1970. Diagnostic and Therapeutic Criteria Co tee of the American Rheumatism Ass in the

8.

has issued 13 publications deal-

A


modern physical

Section of the Arthritis Foundation. Preliminary lupus

Quantification, Indianapolis, 1961,

erythematosus, Bull. Rheum. Dis. 21:643-648,

rill

27.

1971. 9.

Dickinson, R.

A

E.:

letter

Feb. 23, 1958. Quoted

The J.

10.

The Observer,

in

Atkins, H.

in:

J.

three pillars of clinical research, Br.

R.,

J.

M.

A.

L.,

and Keating.

and the ghost of

A. 211:69-75, 1970.

31. Ibid. p. 59.

consultant,

— and

34. Phillips,

poll-bearer, Clin.

of

Program (UGDP) study, MACOL. Ther. 12:167-191, 1971.

betes

A.

Clinical

R.:

Clin

Clinical

R.:

20.

X.

39. Sorokin,

1971.

Urgently needed: A way to meareally help patients, Resi-

syndrome, Lancet 1:808-810, 1969. 21. Gresham, G. E.: A method for the evaluation and classification of symptomatic and functional distress

22.

in

osteoarthritis of the

knees, Arthritis

Rheum. 13:320, 1970. Hamilton, M.: Measurement for what? Roy. Soc. Med. 63:1315-1319, 1970.

Proc.

Jaffe, M. W.: Studies of illness The index of ADL: A standardized

measure of biological and psychosocial function, J.

A.

M. A. 185:914-919, 1963. Quoted in footnote, p. 34,

24. Knight, F.: 25.

k

.

alg<

26.

and Feinstein, A. R.: Computer-aided -;is: II. Development of a prognostic Arch. Intern. Med. 127:448-459,

1971. 26.

Kuhn,

The

P.

H. A.: Principles 1963,

in modern Henry Regnery.

foibles

1956,

41.

J.,

42.

so-

measurement, and

Handbook of experimental psychology. New York, 1951, John Wiley & Sons, Inc., pp. 1-49. Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., and Clausen, J. A.: Measurement and prediction, Princeton, N. S. S., editor:

1950, Princeton University Press.

S. P.: The Life of Sir William Thomson Baron Kelvin of Largs, London, 1910, The Macmillan Company. (Cited on

Thompson,

p.

59 of

27.)

ref.

43. Torgerson,

W.

New

S.:

Theory and methods

York, 1958, John Wiley

&

of

Sons,

Inc.

Group Diabetes Program. A study

of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. Part I: Design, methods, characteristics. Part II:

45. Walker,

function of measurement in

H.:

statistical

and baseline

Mortality results, Dia-

liams

&

Studies

in

the

history

of

the

method, Baltimore, 1929, The Wil-

Wilkins Company. D. B.: The impact of rigid

defi-

on

Biol.

46. Zilversmit,

nitions :

Quantifying

betes 19:(SuppI. 2) 747-830, 1970. ref.

N.,

pi

Fads and

P.:

44. University

and

in the aged.

and Sneath,

psychophysics. in Stevens,

scaling,

23. Katz, S., Ford, A. B., Moskowitz, R. W., Jackson, B. A.,

L. E.:

Mod. Med. 37:188-189,

40. Stevens, S. S.: Mathematics,

XI.

biostatistics.

prognostic score for use in the respiratory-

status

Cancer 26:650-655, 1970.

Hollister,

ciology, Chicago,

dent Staff Physician 17:139-145, 1971. Gomez, P. C. W., Noakes, M., and Barrie, H.:

A

Quality

of numerical taxonomy, San Francisco,

how much we

sure

J.,

F.:

who have had

patients

1969.

Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971. 19. Gilson, J. S.:

and

38. Sokal, R. R.,

statistics,

Pharmacol. Ther. 12:704-721,

tetanus,

W. H. Freeman & Co.

biostatistics.

Sources of 'transition bias' in cohort 18. Feinstein,

and Robbins, G.

among

radical mastectomy,

An

Croup DiaCl.iv. PHAR-

of

classification

alcoholic impairment,

analytic appraisal of the University

17. Feinstein,

A

A.:

survival

37. Shelton,

134-150, 1971. 16. Feinstein, A. R.: Clinical biostatistics. VIII.

Mc-

1959,

307, 1971.

The

Pharmacol. Ther. 12:

York,

tion). Criteria for the evaluation of the severity

rancid sample, the tilted target, and the medical

L.

36. Schottenfeld, D.,

15. Feinstein, A. R.: Clinical biostatistics. VII.

New

of established renal disease, Circulation 44:306-

11:

898-914, 1970.

Tests and measurements: Assess-

prediction,

Lancet 1:1216-1217, 1967. 35. Report of the Council on the Kidney in Cardiovascular Disease ( American Heart Associa-

the responsibility of

Ther.

I.:

Graw-Hill Book Co., Inc.

I.

Clin. Pharmacol.

Ibid. p. 78.

33. Nunnally,

Clinical biostatistics. VI. Sta-

"malpractice"

tistical

a

Taxonorics.

Med. 126:1053-1067, 1970.

tern.

14. Feinstein, A. R.:

p. viii.

ix.

ment and

Formulation ol criteria, Arch. Intern. Med. 126:679-693, 1970. Feinstein, A. R.: Taxonorics. II. Formats ami coding s\ stems fur data processing, Arch. InR..-

structure of scientific revolu-

Chicago, 1970, University of Chi-

30. Ibid. p. 6.

The Williams & Wilkins Company.

12. Feinstein, A.

The

S.:

cago Press, 28. Ibid. p.

.52.

11. Feinstein, A. R.: Clinical judgment, Baltimore.

1967,

Kuhn, T.

29. Ibid. p. 5.

Guillier, C.

normality,

Health,

R.:

Gauss,

Bobbs-Mer-

Co., pp. 31-63.

tions, ed. 2,

B.:

Mod.

2:1547-1553, 1958.

Elveback, L. F.

I

science, in Woolf, H., editor:

criteria for the classification of systemic

scientific

thinking,

Med. 7:227-247, 1964.

Perspect.

CHAPTER 17

The derangements of the 'range of normal'

Under the same name, this chapter originally appeared as "Clinical biostatistics. XXVII." In Clin. Pharmacol. Ther. 15:528, 1974.

According to certain contemporary standards, the basketball player Wilt Chamberlain is grossly abnormal. At more than seven feet, his height is at least five standard deviations away from the mean. Yet no one, except possibly a basketball opponent, has proposed that Chamberlain's normality be restored by amputating his legs.

This outrageous vignette helps illustrate the confusion that can arise from the current chaos of medical publications dealing with the "range of normal". Despite all the proposed new units of measurement and all the calculations spawned by computers, the concept of a "range" has not been suitably defined in either data or population, and the medical meaning of "normal" has been lost in the shuffle of statistics.

A. The meaning of 'normal'

In a series of masterly papers, E. A. Murphy has demonstrated the ambiguity and inconsistency with which the concept of "normal" is applied in modern medicine. Murphy has noted at least seven different meanings for the word normal. For most practical purposes, these meanings can be divided into two groups: the isolated and the correlated. In the isolated meanings, the idea of normal is univariate, emerging from the direct partition of an array of numbers that are the values of a single variable, such as height, age, or serum cholesterol. In the correlated meanings, the idea of normal is bivariate. The values for the single variable (height, age, or serum cholesterol) are regarded as normal or abnormal according to the way they relate to some other variable, such as state of health, genetic fitness, or prognostic expectations.

In the isolated approach, the zone of normality is delineated according to what is found to be conventional, customary, average, or habitual in the array of numerical values. With this approach, we might use such expressions as "he normally begins work at 9 am" or "families these days normally have two children". When the isolated approach receives rigorous quantification, the demarcation of normal becomes an act of pure statistics. After the array of numbers is assembled, a statistical principle is used to choose the boundaries of the groups that will be included or excluded as normal.

In the correlated approach, the idea of normal is medically referred to some innocuous, harmless, or ideal situation of health.
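The purely statistical demarcation used in the isolated approach can be made concrete with a short sketch. It computes how many standard deviations a value lies from a mean and derives a "range of normal" as the mean plus or minus a chosen multiple of the standard deviation. The multiple of 2, the population mean of 69 inches, the standard deviation of 3 inches, and the small sample of heights are all assumptions made for illustration; only the height of 7 feet 1 inch (85 inches) is a commonly cited figure for Chamberlain.

```python
import statistics


def z_score(value, mean, sd):
    """Number of standard deviations by which value differs from mean."""
    return (value - mean) / sd


def isolated_normal_range(values, k=2.0):
    """Mean +/- k sample standard deviations: one possible statistical
    principle for drawing the boundaries, assumed here for illustration."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    return center - k * spread, center + k * spread


# Assumed population parameters for adult male height, in inches.
ASSUMED_MEAN_IN = 69.0
ASSUMED_SD_IN = 3.0
CHAMBERLAIN_IN = 85.0  # 7 feet 1 inch

z = z_score(CHAMBERLAIN_IN, ASSUMED_MEAN_IN, ASSUMED_SD_IN)

# Invented sample of heights used to derive a univariate "range of normal".
sample_heights = [66, 68, 69, 70, 72]
low, high = isolated_normal_range(sample_heights)
is_statistically_abnormal = not (low <= CHAMBERLAIN_IN <= high)
```

Under these assumed figures the height falls far outside the computed range, which is exactly the point of the vignette: the label "abnormal" here comes from the arithmetic alone and says nothing about health.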

a

If

the

particular

current state of

called abnormal;

if

ill

it

number

health,

it

is

implies the

impli usi 1


some

of

number ma)

future ailment, the

be called a

For example, the normalit) in blood

factor.

risk

standards

current


ol

pressure were established

jtoss

or microscopic examination of anatomic structures, not by dimensional measurements of chemical substances. The

correlated

clinician gets prime- help in these diagnoses

univariate

manner. The decision that a diastolic blood pressure ol mon than 90 01

To conclude action"

that an "adverse drug re-

occurred

has

EVENT.

requires

a

careful

dissection of the intricate mixture of constituents contained in

each of these three

entities. 1.

The

after

event.

The events

that are noted

use of a drug can be classified as

desirable,

negligible

(or

innocuous),

or

— that

undesirable. Certain undesirable events can

were not discovered during the pre-marketing investigations? In this essay, I should

pain associated with an injection. Other

rhythmias or methotrexate in psoriasis

like to

culties

note some of the fundamental

and

to

discuss

some

of

the

diffi-

bio-

This chapter originally appeared as "Clinical biostatistics

XXVIII. The

—

problems of pharmaceutical surveillance." In Clin. Pharmacol. Ther. 16:110, 1974. hiostatistical

be intended or anticipated, such

as

the

undesirable events, which are not anticipated,

are

the

contenders

for

receiving

the designation of "adverse drug reaction,"

Before

any other planning can begin, a consistent mechanism must

standardized,


be established


For

designating the desir-

observed events and for ability deciding which ones are the adverse-reaction candidates. The current absence ol such a mechanism, as noted later, is a the

of

any clinical or aimed at designing an

obstacle

Fundamental

biostatistical efforts

to

effective process of surveillan<

over-

simplified aspect ol the surveillance process

the idea that the events occurring alter

is

the

use

a

i

drug can

regularlx.

phenomena drug

that

The initial baseline state of the patient's ailment must be included among the possible sources of an undesirable event that occurs after the drug is given, that was not present when the "index" drug was begun, and that may be associated with that drug. Thus, when an arrhythmia develops in a patient receiving digitalis treatment for congestive heart failure, a possible cause of the arrhythmia is a worsening of the underlying congestive heart failure rather than a reaction to digitalis. Similarly, when dementia develops in an elderly patient receiving digitalis for cardiac disease, a possible cause of the dementia is cerebral arterial insufficiency rather than digitalis toxicity.

2. The drug. Many concomitant phenomena that occur while or after the drug is taken can produce effects whose cause must be differentiated from the effects ascribed to the drug. Among those phenomena are the interaction of another drug or the superimposition of a separate exposure to diagnostic procedures or non-pharmaceutical therapy that may be followed by untoward consequences. To evaluate the concomitant phenomena that may cause an adverse reaction, intricate judgments are needed in diagnostic and therapeutic reasoning. As noted later, no standardized, consistent mechanisms exist for performing these judgments.

3. The person. The human use of drugs also does not take place in standardized situations. Even if people have the same diagnosed disease, they may differ in their basic clinical states. Thus, two patients may both have acute myocardial infarction, but one may have chest pain alone whereas the other may have chest pain, respiratory distress, and a major cardiac arrhythmia. Even if two people have identical diagnoses, such as acute myocardial infarction, and identical clinical states, such as chest pain and respiratory distress, the drugs may be given for different clinical indications. One patient may be given the drug for chest pain; the other patient may receive the same drug for respiratory distress.

In addition to diseases, clinical states, and clinical indications, other important variables are demography and co-morbidity. Demography refers to such personal features as age, race, sex, occupation, and family status. These characteristics distinguish the person who takes the drug. They may also be prognostic harbingers of future events, and help differentiate the person from a different patient receiving the same drug. Co-morbidity refers to the associated diseases that are present in addition to the principal disease, as in a patient with myocardial infarction who also has hypertension, diabetes mellitus, and gout.

B. The statistical rate

After these different aspects of the data have been disentangled, we can contemplate a statistical rate as

(ADVERSE REACTIONS) / (USE OF DRUG).

The numerator of this rate must be carefully determined since it represents the identification, association, and causal evaluation of the events just cited. The denominator of the rate must also be carefully determined since it represents the risk taken in the exposed patients. Another reason for precise identification of the denominator is to permit valid association of the numerator events with the denominator and valid extrapolation of the results. Unless we know the particular types of patients in whom adverse reactions occurred, we cannot draw clinically useful conclusions about what to do in future prescription of the drug.

When we consider the process of surveillance, however, we are not talking about either the numerator or denominator of this rate. Surveillance is what happens in the virgule (the slash mark) that separates the numerator and denominator. The virgule represents the time interval, between use of a drug and occurrence of subsequent events, during which we can impose the process of observing, detecting, evaluating, and recording those events.
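As a rough illustration of the rate just described, the sketch below divides a count of patients with adverse reactions by a count of patients who used the drug, and shows why the kinds of patients in the denominator matter. The `ExposureGroup` structure, the group labels, and all counts are hypothetical and are not taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExposureGroup:
    """A clinically defined group of drug users (hypothetical structure)."""
    label: str
    patients_exposed: int        # denominator: people who used the drug
    patients_with_reaction: int  # numerator: people judged to have an adverse reaction


def reaction_rate(group: ExposureGroup) -> float:
    """Return (adverse reactions) / (use of drug) for one group."""
    if group.patients_exposed <= 0:
        raise ValueError("the denominator needs at least one exposed patient")
    if not 0 <= group.patients_with_reaction <= group.patients_exposed:
        raise ValueError("the numerator must be a subset of the denominator")
    return group.patients_with_reaction / group.patients_exposed


def pool(label: str, groups: list[ExposureGroup]) -> ExposureGroup:
    """Combine groups into one, discarding the clinical distinctions."""
    return ExposureGroup(
        label,
        sum(g.patients_exposed for g in groups),
        sum(g.patients_with_reaction for g in groups),
    )


# Invented counts for two clinical states within the same diagnosis.
pain_only = ExposureGroup("chest pain alone", 200, 4)
pain_arrhythmia = ExposureGroup("chest pain with arrhythmia", 50, 6)
everyone = pool("all treated patients", [pain_only, pain_arrhythmia])

for g in (pain_only, pain_arrhythmia, everyone):
    print(f"{g.label}: {reaction_rate(g):.3f}")
```

With these invented counts the pooled rate lies between the two group rates, so a single overall figure would not tell a prescriber which type of patient carried the higher risk.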

Many discussions of surveillance are devoted mainly to these "virgule activities," with emphasis on the various investigators, administrative mechanisms, and computer systems used for the acquisition and storage of data. These activities in the process of surveillance constitute its technology: the arrangements of medical setting, personnel, and data-gathering tactics that allow the desired information to be acquired. The more basic scientific strategies of surveillance, however, depend on decisions about the numerators and denominators: what kind of data to gather, what kind of patients to observe, and what principles to use for evaluation.

C. The technology of surveillance

A major accomplishment in the pharmaceutical research of the past decade has been the development of "surveillance systems" at several different institutions. These systems, which are intended to provide a continuous scrutiny of the use and effects of pharmaceutical preparations, have obtained quantitative data and have led to the development of technologic tactics that will be invaluable background for any future work in surveillance activities.

Despite some major shortcomings that will be noted later, these systems represent an enormous achievement in the technology of pharmaceutical surveillance. One valuable demonstration has been the contributions that can be made by nurses, pharmacists, 17 and other non-physician personnel in collecting data. Another noteworthy feature has been the documented importance of making planned, continuous observations 28 rather than relying on ad hoc, sporadic reports. A third substantial contribution has been the creation of suitable data formats for coding and storing the information, and the development of computerized or other analytic methods necessary for the data to be tabulated and interpreted. In addition, of course, the results obtained during the surveillance have provided much information that was hitherto unavailable about the frequency of institutional prescription of different drugs and the frequency of associated adverse reactions. In some instances, a surveillance system has also been employed as a mechanism for noting demographic or medical factors that may influence the action of drugs, for performing an analog of Phase III clinical trials of possible efficacy, and for providing useful warnings and appraisals of toxicity for a variety of different drugs and drug-interactions.

Regardless of the ultimate role of these systems, the scope of any institution-based surveillance system, however, is institution-based. The investigators can observe only the kinds of patients who are referred or self-referred to the institution and can study only the kinds of pharmaceutical preparations that are authorized or ordered at the institution. For the surveillance systems that depend on a population of hospital in-patients, the duration of observation usually lasts only as long as the patient is in the hospital. These drawbacks would not affect a system's ability to draw conclusions about what is happening in the institutional setting, but they would create serious limitations in answering more general questions about the surveillance of a wide variety of people using a wide variety of drugs for a wide variety of durations.

The fundamental problem of contemporary surveillance, however, is not the restricted scope of patients, drugs, and durations observed in the institution-based surveillance systems. The fundamental problem is the absence of a standardized, consistent, reproducible procedure for making the decision that a particular event is an "adverse drug reaction."

D. The identification of adverse reactions

In all of the various surveillance systems and registries that will be mentioned later, each of the tabulated "adverse reactions" was diagnosed as an act of individual clinical judgment. No clearly delineated and overtly specified method has been developed for the operational procedures of making the decisions. The investigators may regularly supply definitions and classifications of "adverse reactions," but a defined classification is quite different from an operational identification.

We can define blood sugar as the amount of a monosaccharide, C6H12O6, that is present in 100 ml of whole blood, but this definition does not provide an operational identification of blood sugar. For an operational identification, we would need a set of sequential instructions, such as those that might be found in a laboratory manual describing a chemical method of measuring blood sugar. The directions would tell us to add a certain volume of blood to certain volumes of designated reagents under certain physical conditions; to place the resultant mixture in a spectrophotometer; and to convert the reading, using certain pre-determined numerical factors. The result would be the value of blood sugar, operationally identified.

The medical and pharmaceutical literature currently contains many definitions that classify adverse drug reactions according to mechanism, existence, and severity. Their presumptive mechanism is usually classified as pharmacologic, allergic, or idiosyncratic. The pharmacologic reactions are further subclassified as primary excess effects, due to overdosage or over-effectiveness; and secondary effects, at non-target sites. The causal relation of a drug to an adverse reaction is often classified as definite (or "causative"), probable, possible, or doubtful (or "coincidental"). The severity of an adverse reaction can be cited as fatal, severe nonfatal, moderate, or mild.

But none of these definitions and classifications provides an operational identification or diagnostic algorithm for deciding whether a particular undesirable event was due to the accused drug, to an interaction of drugs, to some other drug, to the evolving course of the main clinical state, to the evolving course of a baseline co-morbid state, to a superimposed new co-morbid state, or to some intervening diagnostic or other therapeutic procedure. A report from a Registry of Tissue Reactions to Drugs is the only reference I have been able to find in which deliberate attention has been given to the formal diagnostic reasoning used in this activity. Among the main data that might be included in the criteria or flow chart to justify the decisions for making an operational identification are the known characteristics of the main clinical state and co-morbid state, the known patterns of response to the index and associated other drugs, the laboratory evidence of the drug's concentration in the patient, the time relationships between the adverse event and the intake of drugs, and the results of such additional tests as cessation of the drug (or "de-challenge") and its resumption (or "re-challenge").

In the absence of stated operational methods for identification of adverse reactions, investigative work in surveillance will continue to lack a fundamental ingredient of scientific evidence. The "adverse reaction," as the basic element under investigation, is not identified with precise, distinctive, reproducible criteria. We have no idea of the amount of variability among the physicians whose nondescript judgment is used to decide whether an observed event is or is not an adverse drug reaction. There have been no procedures for standardization; no tests of consistency; and no assessments of reproducibility. After we work our way through all the majesty of the computer print-out and the glory of the statistics, we find that the decision-making mechanism for identifying adverse reactions (for making diagnoses such as whether an episode of vomiting is due to digitalis, underlying heart disease, psychic tension, or spoiled food) depends on the vagaries of clinical judgment of an array of unstandardized physicians. The main first step towards developing a valid biostatistical science of pharmaceutical surveillance will not be in creating additional technologies for surveillance. The first step is to arrive at reproducible methods of identifying an adverse drug reaction.

E. Epidemiologic strategies of surveillance

Like other epidemiologic challenges in the discernment of cause-and-effect relationships, pharmaceutical surveillance requires a comparative assessment of rates. The possible epidemiologic strategies contain a complex, almost bewildering, array of different numerators and denominators, "control" groups, and methods of getting data. The reader is hereby warned that this section of the text, as befits the intricacies of the subject, is neither simple nor easy to follow.

1. Cohort procedures. The cohort approach is the most scientifically logical way of answering a question about cause and effect; and the incidence rates that emerge from a cohort study are direct and simple to understand. The denominator consists of the number of people exposed to the "cause," which is the drug under investigation. The numerator consists of the number of those people who developed the "effect," which is the cited adverse reaction(s). The ratio of this numerator and denominator is an incidence rate, denoting the incidence of adverse reactions per people taking the drug. This rate is then compared with the incidence of similar adverse reactions in a group of people treated in some other manner. Several different procedures can be used for acquiring data about incidence rates in a cohort.

a. Prolective cohort. In a prolective study, the investigators begin the research before treatment is given and before any adverse reactions have occurred. Groups of people taking the investigated drug or being treated in some other way are assembled and followed to note the occurrence of the adverse reactions. This is the type of procedure used in the hospital surveillance systems mentioned earlier.

b. Retrolective cohort. In this approach, the numerators and denominators consist of the same type of data used for a cohort (the incidence of adverse reactions in a defined population of exposed or nonexposed people) but the data for the compared groups are assembled retrolectively, after the adverse reactions have occurred. Having noted some events that are believed to be adverse drug reactions, the investigator begins the research by collecting existing data (usually from patients' medical records) for the clinical course of a group of people who have received the suspected drug. By inspecting the data for the post-drug course of those people, the investigator determines the rate at which the suspected adverse reactions occurred. For the comparison, the investigator assembles the recorded data of a similar group of people, treated in some other manner, and notes the rates of the same reactions in this control group. The data for the control group can be "concurrent," derived from people treated in some other way during the same calendar time interval in which the suspected drug was taken; or "historical," derived from an earlier group of patients with the same medical condition.
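The cohort comparison described above can be sketched in a few lines: one incidence rate for the people who took the drug, one for a group treated in some other manner, and then a direct comparison of the two. The function names and all counts are hypothetical; the rate difference and rate ratio are offered only as two common ways of making the comparison, not as the author's prescribed method.

```python
def incidence_rate(reactions: int, cohort_size: int) -> float:
    """Incidence of adverse reactions per person in a cohort."""
    if cohort_size <= 0:
        raise ValueError("a cohort must contain at least one person")
    if not 0 <= reactions <= cohort_size:
        raise ValueError("reactions must be between 0 and the cohort size")
    return reactions / cohort_size


def compare_cohorts(exposed_reactions: int, exposed_size: int,
                    control_reactions: int, control_size: int) -> dict:
    """Compare the drug-treated cohort with a control cohort."""
    exposed = incidence_rate(exposed_reactions, exposed_size)
    control = incidence_rate(control_reactions, control_size)
    return {
        "exposed_rate": exposed,
        "control_rate": control,
        "rate_difference": exposed - control,
        # The ratio is undefined when no control patient had the event.
        "rate_ratio": exposed / control if control > 0 else None,
    }


# Invented example: 12 reactions among 400 treated people,
# 6 similar events among 600 people treated in some other manner.
result = compare_cohorts(12, 400, 6, 600)
for name, value in result.items():
    print(name, value)
```

The same arithmetic applies whether the cohort is assembled prolectively or retrolectively, and whether the control group is concurrent or historical; what changes is how trustworthy the counts are.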

A retrolective cohort with a "historical control" group was the procedure employed when the hazards of thalidomide were first pointed out in the English language literature by W. G. McBride. Without any elaborate research grants or epidemiologic flamboyance, McBride described his work in a letter, written to The Lancet, that is a model for thinking and reporting:

Congenital abnormalities are present in approximately 1.5% of babies. In recent months I have observed that the incidence of multiple severe abnormalities in babies delivered of women who were given the drug thalidomide . . . during pregnancy . . . (was) almost 20%.

c. Estimated cohort. In this circumstance, a group of patients receiving the suspected drug is not actually assembled and counted as the denominator of a cohort. Instead, the number of people taking the drug is estimated from its sales figures, supplied by the manufacturer. The numerator for this denominator is the number of suspected adverse reactions that have been reported in the geographic region to which the estimated denominator pertains. The data about the adverse reactions can be collected in a registry (as described later) or directly by the manufacturer of the drug. The rate in a comparative group is determined from "historical controls" or from analogous data assembled for some other drug that has the same type of clinical usage.

2. Other procedures. For all of the three methods just described, the frequency of adverse reactions was calculated as an incidence rate, with appropriately exposed (or non-exposed) people in the denominators and the corresponding adverse reactions in the numerators. In all of the methods that follow, the logic of cohort data is altered. The number of people receiving the drug is not determined and cohort incidence rates are not calculated. Instead, the decisions rest on several other types of epidemiologic rates.

a. Secular mortality associations. In this procedure, the basic research data are issued by the Bureau of Vital Statistics (or some appropriate counterpart), when it publishes the annual rate of death attributed to a specific disease in the general population. The pattern noted in a calendar sequence of these annual mortality rates is called a secular trend. Observing an apparently inappropriate change in the secular trend, an investigator may suspect that the change is due to an alteration in treatment for the disease. After assembling data about sales curves or other usage of the agent, the investigator notes the time when a suspected pharmaceutical agent was introduced. The concomitant secular trends in both mortality and drug usage are then associated for a cause-effect conjecture. The inference drawn from such a secular mortality association was the basis for beliefs that aerosol bronchodilators had led to an increase in deaths from asthma in England and Wales.

b. "Event" trohoc. This approach consists of a modification of the classical trohoc (or "case-control") study. The investigators begin by assembling a group of people (or "cases") who are known to have had the particular event that is suspected of being an adverse reaction to the drug. In this group of "cases," the investigators then determine the number of people who had previously taken the accused drug. For comparison, the investigators determine the previous usage of that drug in a "control" group of people selected because they have not had the event under scrutiny. The prevalence rates of the drug's usage in the "cases" and the "controls" are then compared directly or compressed into a "risk ratio." The "cases" are usually selected from in-patients at a hospital, and the "controls" can be chosen from other patients in the hospital or from suitable sources outside the hospital. The population studied in this approach is called an event trohoc (rather than a disease trohoc) because the cases and controls are chosen according to the presence or absence of the event that is suspected of being an adverse drug reaction.

An event trohoc was surveyed for the type of research used to suggest that oral contraceptive pills might lead to thrombophlebitis and that the use of certain estrogens in pregnancy might lead to vaginal carcinoma in the offspring.
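The arithmetic of an event trohoc can be sketched as follows: the prevalence of previous drug usage is computed separately in the "cases" and the "controls," and the two are then compared directly or compressed into one number. The text does not say how the "risk ratio" is computed; the sketch assumes the usual case-control practice of using the odds ratio for that purpose. All names and counts are hypothetical.

```python
def usage_prevalence(previous_users: int, group_size: int) -> float:
    """Proportion of a group that had previously taken the accused drug."""
    if group_size <= 0:
        raise ValueError("a group must contain at least one person")
    if not 0 <= previous_users <= group_size:
        raise ValueError("users must be between 0 and the group size")
    return previous_users / group_size


def event_trohoc(case_users: int, case_total: int,
                 control_users: int, control_total: int) -> dict:
    """Summarize drug usage in people with the event (cases) and without it (controls)."""
    case_prev = usage_prevalence(case_users, case_total)
    control_prev = usage_prevalence(control_users, control_total)
    case_nonusers = case_total - case_users
    control_nonusers = control_total - control_users
    # Assumption: the "risk ratio" is approximated by the odds ratio
    # (users x non-users cross product); undefined if a cross term is zero.
    denominator = case_nonusers * control_users
    odds_ratio = (case_users * control_nonusers) / denominator if denominator else None
    return {
        "case_prevalence": case_prev,
        "control_prevalence": control_prev,
        "odds_ratio": odds_ratio,
    }


# Invented example: 40 of 100 cases and 20 of 100 controls had taken the drug.
summary = event_trohoc(40, 100, 20, 100)
for name, value in summary.items():
    print(name, value)
```

Note that neither prevalence is an incidence rate of adverse reactions: the denominators are people with or without the event, not people who took the drug.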

Although modern-day hospital surveillance systems were established for performing cohort studies of drug use and effects, the investigators can also use the collected data for trohoc research. Because of the positive relationship found in a trohoc study performed with data of a pharmaceutical surveillance system, 27 an interesting current controversy has arisen about the association between coffee-drinking and myocardial infarction. The relationship was denied in a cohort study performed by the Framingham epidemiologic team. 7

c. Registry collections. All of the other epidemiologic procedures to be described here are based on the analysis of data assembled at the institutional, municipal, national, or international registries that have been established to solicit, receive, and store the adverse-reaction case reports voluntarily submitted by physicians or by other qualified reporters. Because of the way the data are assembled, they contain numerators without denominators. The data can be used to count the frequency of the adverse reactions associated with specific drugs and to provide warnings based on changes in frequency, 41 but the registry data alone provide no indication of the total usage of the drugs or the number of users who were reaction-free.

An appealing tactic with registry data is to convert the frequency counts of drug reactions into crude incidence rates for an estimated cohort of drug users. The number of drug reactions is the numerator; the denominator is estimated as noted earlier; and the comparison is with historical or other suitable control groups.

Other analyses of the registry data are based exclusively on information stored at the registries. From the many possible strategies, which have received thorough descriptions by Finney, 14, 15 only three will be mentioned here.

1. Secular trends. The investigator notes the frequency count of the number of reactions reported during successive calendar intervals. Suspicions are aroused by an unexplained increase in the secular trend of frequencies of the reports. (These secular "reaction" rates differ from the secular mortality rates mentioned earlier. A mortality rate contains three ingredients: the number of deaths per specified population per specified time interval. A secular "reaction" rate contains only two ingredients: the number of reactions reported per specified time interval.)

2. Registry trohoc. A diverse group of clinical entities will have been reported in the registry as adverse reactions for any drug, D. Suppose we are interested in a particular type of reaction, X. Using the trohoc principle, we can choose the "case" group to consist of all the reactions reported for D. We then note the prevalence of X among those reactions. For comparison, our "control" consists of the prevalence of X among the reactions reported for some other drug or drugs.

3. Accrual rate. At any point in time, the registry will contain a certain total number of reported adverse reactions for the drug. During an ensuing interval of time, new reports are acquired. The size of this increment, when divided by the previous number of reports, produces an accrual rate of new reactions for that drug during the cited time interval. This rate can be compared against the analogously calculated accrual rate for some other drug or for all drugs.
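Two of the registry analyses described above reduce to simple proportions, sketched below under stated assumptions. The registry is represented as a plain mapping from drug name to a list of reported reaction types; that layout, the drug names "D" and "E", the reaction labels, and all counts are invented for illustration and do not describe any real registry.

```python
from collections import Counter


def registry_trohoc(reports: dict, drug: str, reaction: str) -> tuple:
    """Prevalence of `reaction` among reports for `drug` versus reports for all other drugs."""
    case_reports = reports[drug]
    control_reports = [r for d, rs in reports.items() if d != drug for r in rs]
    if not case_reports or not control_reports:
        raise ValueError("both the case and control groups need at least one report")
    case_prev = Counter(case_reports)[reaction] / len(case_reports)
    control_prev = Counter(control_reports)[reaction] / len(control_reports)
    return case_prev, control_prev


def accrual_rate(previous_total: int, new_reports: int) -> float:
    """New reports in an interval divided by the number of reports held beforehand."""
    if previous_total <= 0:
        raise ValueError("the registry must already hold at least one report")
    if new_reports < 0:
        raise ValueError("new reports cannot be negative")
    return new_reports / previous_total


# Invented registry contents.
registry = {
    "D": ["X", "X", "rash", "X", "nausea"],
    "E": ["rash", "nausea", "X", "rash", "rash", "nausea", "rash", "nausea", "rash", "rash"],
}
prev_d, prev_others = registry_trohoc(registry, "D", "X")
print("prevalence of X among reports for D:", prev_d)
print("prevalence of X among reports for other drugs:", prev_others)

# Invented accrual: 30 new reports on top of 120 for drug D, 50 on top of 1000 for all drugs.
print("accrual rate for D:", accrual_rate(120, 30))
print("accrual rate for all drugs:", accrual_rate(1000, 50))
```

Both calculations use numerators only. Neither says anything about how many people took the drug, which is why such figures can act as signals but cannot stand in for incidence rates.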

F. Cause signals, causes, rate signals, and rates

The different epidemiologic strategies that have just been described can be employed in at least four different ways. In one circumstance (the cause signal) we have no idea that a particular drug and a particular undesirable event are related. The purpose of the signal is to raise suspicions of this possibility. Once our suspicions have been alerted, we may then want to get additional evidence to decide whether the relationship between the drug and the event is actually causal, and whether the event should be regarded as an adverse reaction to that drug. In a different circumstance, we begin with the belief that the event is an adverse reaction, and we want to know the frequency of its occurrence. We also want to have a mechanism (the rate signal) to alert us to changes in frequency that might suggest the occurrence of new or unforeseen toxicity for the drug. The mechanism that produces a cause signal may not necessarily be the mechanism for proving cause; and the mechanism that produces a rate signal may not always indicate the true rates of adverse reactions.

1. Cause signals. If an undesirable event is clinically unique, such as the sudden growth of feathers on a person's forearm, we may have no difficulty deciding that it is an extraordinary reaction and we will immediately suspect the associated drug. Our main quest then will be to decide the frequency rate of the reaction and its acceptability in exchange for the benefits of the drug. If the undesirable event is clinically rare, such as phocomelia or retrolental fibroplasia, a sudden increase in its usual frequency may suffice to trigger our suspicions that something unsatisfactory is happening. If the undesirable event, however, is a relatively frequent clinical occurrence (such as cataracts, jaundice, peptic ulcer, or thrombophlebitis), doctors may not be ready at first to suspect that the event is an adverse drug reaction. The event may be initially ascribed to the main condition under treatment or to a co-morbid condition, rather than to a pharmaceutical agent. Lasagna 29 has described a series of situations (including the bleeding associated with aspirin) in which adverse drug effects remained unrecognized for a long time.

Neither the trohoc nor the registry procedures can be used to generate such signals. Trohoc studies begin only after a suspicion exists; they are intended to explore the intensity of the possible causal relationship, not to signal a suspicion. Studies of registry data begin with data that were submitted by doctors whose suspicions have already been aroused and who have already made causal decisions. Secular mortality associations can suggest that an undesirable event is happening but cannot per se signal that a particular drug (or class of drugs) is involved. Consequently, the principal method of generating cause signals for a particular drug is with cohort procedures that compare events occurring in the drug-treated group and in a control group.

2. Causes. After a signal of suspicion has been raised, the next step is to decide whether the drug can be held causally responsible for the associated event. If the indictment was based on cohort data, the available evidence may be suitable not only for suspecting an association but also for suggesting that the association is causal. Quite often, however, the cohort data may be both retrolective and unqualified, based on clinical hunches and recollections that are quite satisfactory for arousing suspicions, but inadequate for providing proof. The investigators may then look for more cogent evidence elsewhere.

Since conclusions about a causal relationship have been drawn before a registry report is submitted, the data of a registry cannot be used for assessing causal relationships. The epidemiologic pursuit of causal investigations must therefore depend on the use of cohorts, event trohocs, or secular mortality associations. The secular associations are the scientifically weakest 5 of these activities, containing many uncertain elements that may produce misleading or distorted results. Event-trohoc studies are also scientifically weak 12 but can be strengthened when special precautions (which are seldom employed) are taken to eliminate or adjust for major sources of potential bias. Prolective cohort studies, although the most powerful of the scientific approaches, are often difficult (or extremely expensive) to conduct in suitable populations.

A well designed project using a retrolective cohort is a potentially effective although often overlooked research strategy for these goals. For example, the many trohoc investigations performed to study the association between oral contraceptive pills and thrombophlebitis could also have been done (with probably no more effort or cost) as studies on retrolective cohorts, appropriately chosen from practicing physicians' rosters of patients receiving different forms of contraception. The potential bias involved in

the non-random assignment of therapy for the cohorts could probably be identified and adjusted more readily than the analogous and additional forms of bias that are present in trohoc research. Nevertheless, the epidemiologic data routinely collected by practicing physicians (inside or outside of academia) are usually rejected as potential sources of valuable research information. A retrolective cohort can yield data about the same kind of population, followed in the same clinical and chronologic direction, as a prolective cohort. 11 Because the data are not collected with the defined scientific standards that are possible in prolective research, however, the use of retrolective cohorts is usually spurned in favor of the standardized information available from the inverted logic of trohoc research. The resolution of this issue depends on an important point in the philosophic orientation of science: do we want an imprecise answer to the right question or a precise answer to the wrong question?

3. Rate signals. If the "adverse reactions" have been thoroughly and reliably diagnosed, the techniques used with registry data (examining secular trends, drug trohocs, and accrual rates) can provide useful signals of changes in the incidence rate of reactions. The true rates, of course, cannot be determined from registry data, but could be approximated by the previously described technique of cohort estimation. (Another valuable source of rate signals is the adverse reaction case reports assembled by the manufacturers of the drug. These reports can be used in a manner analogous to that of registry data.) Event trohocs are used to explore cause, rather than rates of incidence; and secular mortality associations provide data about rates of death per disease, not rates of adverse reaction per drug. The prolective surveillance of a cohort would obviously provide signals about a change in rate; but retrolective cohort studies would usually not begin until a change in rate has been signalled by other sources of data.

4. Rates. Regardless of whether or not a rate signal has been received, the only way of determining the incidence rate of adverse drug reactions is to study a cohort of people receiving the drug. None of the registry procedures, secular associations, or trohoc investigations can provide an incidence rate. Even when the frequency of reactions is acquired from registry data or when a "risk ratio" has been calculated in trohoc research, the conversion of the results to an approximation of incidence has required an estimation of the size of the exposed cohort, using data from sales curves of the drugs or other suitable sources.

G. The goals of surveillance

After wading through the diverse terms, concepts, and tactics of the preceding two sections, the reader is probably ready to return to the main issue. Here, too, however, we encounter another strategy that has been imprecisely specified. What is the scientific purpose of pharmaceutical surveillance? Is it intended to provide an "early warning system" or a set of clinical guidelines for use of the drug?

1. Early warning system. Almost everyone would agree that a prime goal of surveillance is to provide early warning about a dangerous drug. In designing a system of surveillance, however, we would have to know what kind of danger. We would obviously want to be warned about clinical catastrophes, such as death, phocomelia, or retrolental fibroplasia. Do we also want to know, however, about less catastrophic events? What about clinical nuisances, such as transient skin rashes or jaundice? And what about paraclinical abnormalities (such as reversible elevations and depressions of white blood count, serum calcium, or blood urea nitrogen) that have

Problems

280

measurement

in

no manifestations in clinical signs or symptoms? To construct a system of surveillance, we would need to know whether it is supposed to detect the first, the first two, or all three of these dangers. The techniques of observation, the population to be observed, and the data to be acquired will obviously depend on what we want to accomplish.

Nevertheless, these distinctions are seldom specified in discussions of surveillance. Several years ago, for example, when a federal agency asked for bids on a contract to provide "surveillance," the instructions to the potential bidders contained no statements about which goals, if any, were to be covered in the proposal. Because the general methodologic problems are so enormous, a successful approach to surveillance might begin by detecting clinical catastrophes. If an effective system can be constructed for providing an early warning about catastrophes, the system might later be amplified to include clinical nuisances and paraclinical abnormalities.

2. Clinical guidance. A quite different approach to the surveillance issue is to view the process not as a warning system but as an evaluation system. The objective is to find out not merely what is bad but also what is good about a drug. The clinical trials that were done before the drug was approved for marketing will have provided preliminary evidence about its usage and consequences. This information will pertain, however, only to the cohorts of patients and dosages that were studied in the trials. When the drug enters the general market, it will be used in a greatly expanded spectrum of people, clinical states, and durations. The purpose of the surveillance would be to cover the new spectrum of anticipated or unanticipated benefits and the new spectrum of adverse effects. With this information available, we could compare the qualitative characteristics and quantitative frequency of the events that occur in the spectrum of risks and benefits. This type of comparison would allow us to assess the "risk-benefit ratio" that is so frequently discussed but so seldom documented or quantified.

What emerges would presumably be a balanced evaluation of the drug, providing effective guidance for its future clinical usage. This is the kind of clinical guidance that has been informally developed (and often anecdotally) after many years of pragmatic experience with such drugs as aspirin, digitalis, and insulin. The purpose of a formal surveillance system for new (or old) drugs would be to attain the guidance information more promptly, effectively, and quantitatively than might be done if we waited for the appropriate "distillation" of clinical experience alone.

B. Evaluation of epidemiologic strategies

Having contemplated the available epidemiologic methods and having noted the desired goals, we can now begin to evaluate the respective merits of different methods for different goals.

One point about the methods should be noted immediately: in order for any method to work, it must be made to work. Whatever be the procedure used for surveillance, the procedure itself must receive surveillance (or monitoring) to ensure that its planned activities are being carried out. The supervisors of a cohort surveillance system must check that all appropriate members of the cohort are entered into it and that the necessary data are being collected. The supervisors of a registry of adverse reactions must encourage the submission of reports by any physician who observes a reportable reaction. Furthermore, with either a cohort or a registry system, none of the available "signals" will be detected unless the collected data are frequently analyzed in search of signals.

As we consider the two main goals of pharmaceutical surveillance, however, certain procedural methods may seem preferable to others.

1. Early warning system. In an early warning system for clinical catastrophes, we would want to be sure that all possible


users of the drug (and all of the possible reactions to it) have the opportunity to enter the system. This desideratum cannot be achieved with the restricted focus of an institution-based procedure. An institutional surveillance system is confined to the cohort of people treated at the institution. In technologic operation, the system may attain statistical magnificence by acquiring numerator and denominator data, coding standardized forms, and applying computerized analyses; but the institutional system cannot be relied upon to observe the totality of drug usage and reactions that can provide early warnings about clinical catastrophes.

For example, pregnant ambulatory women taking thalidomide would not have been part of a hospitalized cohort and thalidomide might not have been part of the pharmacopoeia used in a large out-patient clinic. Consequently, the phocomelia that followed thalidomide might not have been detected by institution-based prolective procedures. (The thalidomide disaster was actually detected by analysis of retrolective and estimated cohorts.)

If we want to learn about any clinical catastrophe that may occur in a person using a drug, we cannot limit our surveillance to the enumerated group of people who are followed in a cohort. A catastrophic event can occur in anyone who receives the drug, and the event must have a good chance of being reported. For this reason, the best way of routinely getting warnings about catastrophes that can emerge from the entire group of drug users is to collect adverse reaction reports in a registry maintained either by the manufacturer or by some other governmental or medical organization. The registry technique becomes desirable here for the very reason that makes it epidemiologically unappealing: there are no defined populations to constrain the denominators. By placing no restrictions on the denominator, i.e., by not demarcating an observed cohort, the registry allows a "total sampling," because anyone receiving the drug has an opportunity to be included when an adverse reaction occurs.

With improvements in the thoroughness of reporting and with greater care in analyzing the cited adverse reactions, the case report procedure can provide a relatively inexpensive and satisfactory approach to the goal of an early warning system.

The fundamental problem in relying on case report data in a registry is the thoroughness of reporting. This problem can be solved or minimized if the act of reporting is made especially easy with such procedures as a toll-free telephone call to the collecting agency, by total reassurance to reporting physicians that the information will be kept suitably confidential, and by encouraging physicians to trust the motives of the collecting agency. Thus, if practicing physicians (for whatever reason) develop apathy, hostility, or distrust toward federal agencies, the likelihood of an agency's receiving case reports will be reduced, and its registry data may be too incomplete to be valuable.

An additional problem in relying on registry data is the vigilance with which the data are scrutinized and analyzed. For an early warning signal to be noted, someone must be constantly looking out for it. Certain aspects of this watchman role may be better fulfilled by a "commercial" registry, maintained by the manufacturer of the drug, than by a "non-commercial" registry maintained by a national or international health agency. For example, the first public warning about thalidomide in the U.K. was issued not by a "non-commercial" registry, but by the drug's manufacturer, who announced its withdrawal from the market two weeks before any adverse reactions had been reported in the English language literature. Had the "commercial" registries been even more vigilant on the European continent and in the U.K., the thalidomide catastrophe might have been noted sooner and its magnitude reduced. Since a drug's manufacturer can be (and has been) held financially liable for compensating the victims, the manufacturer would be expected to exert a concerned self-interest in maintaining registries that

Problems

detect

iromptly

before

jgs

dangerous

overtly

become

consequences

the

well

as

11)

measurement

in

as

catastrophic.

clinically

While supplementing the manufacturer's vigilance for clinical disasters, the non-commercial registries can also be expected to provide signals, of the type noted earlier, about less dramatic clinical reactions that a manufacturer might not avidly scrutinize.

Another way of getting warnings about major clinical nuisances is through an institutional surveillance system. The extensive data assembled and analyzed in such a system could provide cause signals and other evidence for detecting non-dramatic reactions that might otherwise be overlooked. The main difficulty in using current surveillance systems for this purpose is the clinical imprecision with which the drug recipients and reactions have been identified. In many analyses, the reactions and drugs have been associated without further subdivision according to the clinical distinctions of the patients and conditions. For example, in surveillance studies suggesting adverse cardiac effects of amitriptyline, no analyses were done to indicate whether the amitriptyline-treated patients and the "controls" received the drug for the same clinical indication; and whether the compared groups were similar in the prognostic severity of their baseline cardiac condition. In order for causal indictments to be convincing, the patients receiving the compared treatments must be shown to have comparable clinical conditions at baseline. With improvements in clinical specification and comparison, the institutional surveillance systems can make invaluable contributions to the screening process that provides "early warning" of clinical nuisances and laboratory abnormalities.

An alternative to the case report system or the institutional cohort is the use of representative cohorts as described in the next section. Such a procedure is probably too expensive to be used only for delivering early warnings. Furthermore, the size of the cohorts followed would have to be enormous to allow surveillance of ample numbers of patients with each type of clinical condition, treated with each corresponding pharmaceutical agent.

2. Clinical guidance. An adequate early warning system could be attained with the methods that have just been described, but a clinical guidance system creates a quite different challenge. As noted earlier, we would want to learn all about the clinical usage of the drug: its good effects as well as its bad ones. For thorough evaluation of the total spectrum and frequencies of both risks and benefits, we would have to investigate a prolective cohort. We could not use registries, trohocs, or secular mortality tactics because they do not include the appropriate population or data. A retrolective cohort technique would also be unsatisfactory because the patients' original medical records may not contain all the information needed to document the diverse facets of risk and benefit.

Since a prolective cohort study is the only scientifically precise method of getting data about clinical usage of a drug, the main issue in research architecture is the choice of the cohort. What people are to be examined and how are they to be chosen?

The most straightforward answer to this question is to take a random sample of users of the drug. The procedure can be arranged by following the fate of a randomly selected fraction of batches of the drug prepared by the manufacturer. Alternatively, a random sample of pharmacists can be enlisted as a collaborating research team. The cohort of patients can then be chosen as a random sample of the people to whom the pharmacists dispense the drug.

This type of "drug-cohort" investigation would not be easy to perform. It would require the creation of new types of research teams, working outside of institutional settings and enlisting suitable cooperation from manufacturers, pharmacists, physicians, and patients. Despite the obvious problems, such research may not be too difficult to do if the participants are properly approached. Pharmacists, practicing physicians, and their patients may welcome the opportunity to contribute to research surveys that are neither experimental nor esoteric, and that can provide better therapeutic knowledge for everyone.

The main disadvantage of studying such a drug cohort is that we would lack a control group. No analogous information would be available for the outcome of people with similar clinical conditions that were treated in some other way. To complete the picture, therefore, we would need to perform a separate investigation. For this purpose, we would want to take a random sample of practicing physicians from whom we could

get suitable data about the selected clinical conditions and their post-therapeutic outcomes. Here, too, the performance of the research would entail many logistic difficulties, but the difficulties may become minor if medical societies enthusiastically endorse the plans and if practicing physicians accept the opportunity to work in a new form of clinical investigation that arises not from isolated academic or federal cloisters, but from the real world of medical care.

With these two representative samples of drug users and patient conditions, we could cover the necessary spectrum of [...] and clinical phenomena. The methods of implementing the basic design are beyond the scope of this discussion. The main point is that if we want a pharmaceutical surveillance system to provide effective clinical guidance about the use of drugs, the job will not be easy. We shall have to study what is valid rather than what is convenient, and we shall have to develop an appropriate coterie of investigative personnel rather than to rely merely on restricted applications of institutional technology.

This new challenge in clinical epidemiology would allow the people who are most involved in the use of drugs (manufacturers, pharmacists, practicing physicians, and patients) to become collaborating investigators who help solve the problems that the drugs create. The questions that originate in the extensive world of medical care would be answered not from data assembled by armchair epidemiologists or institution-bound academicians, but from accounts of the events that actually occur in that world, as witnessed by representative members of the people who live in it.

Regardless of whether pharmaceutical surveillance is intended to provide early warnings or clinical guidance, we shall have to give more attention to the patient who gets the drug than to the technology of the surveillance or the manipulation of the statistics. The technology and the statistics are already well developed. What still remains as undeveloped scientific territory is the basic identification of an adverse reaction and of the clinical complexity that surrounds it.

References

1. Borda, I., Jick, H., Slone, D., Dinan, B., Gilman, B., and Chalmers, T. C.: Studies of drug usage in five Boston hospitals, J. A. M. A. 202:506-510, 1967.
2. Boston Collaborative Drug Surveillance Program: Adverse reactions to the tricyclic-antidepressant drugs, Lancet 1:529-531, 1972.
3. Boston Collaborative Drug Surveillance Program: Interaction between chloral hydrate and warfarin, N. Engl. J. Med. 286:53-55, 1972.
4. Boston Collaborative Drug Surveillance Program: Decreased clinical efficacy of propoxyphene in cigarette smokers, Clin. Pharmacol. Ther. 14:259-263, 1973.
5. Bradford Hill, A.: Principles of medical statistics, ed. 9, New York, 1971, Oxford University Press.
6. Cluff, L. E., Thornton, G. F., and Seidl, L. G.: Studies on the epidemiology of adverse drug reactions. I. Methods of surveillance, J. A. M. A. 188:976-983, 1964.
7. Dawber, T. R., Kannel, W. B., and Gordon, T.: Coffee and coronary heart disease, Am. J. Cardiol. 33:133, 1974. (Abst.)
8. DeNosaquo, N.: The registry on adverse reactions of the American Medical Association, Methods Inf. Med. 4:15-21, 1965.
9. Editorial: Reporting of adverse reactions to drugs, Can. Med. Assoc. J. 92:476-477, 1965.
10. Feinstein, A. R.: Clinical biostatistics. IX. How do we measure "safety and efficacy"? Clin. Pharmacol. Ther. 12:544-558, 1971.
11. Feinstein, A. R.: Clinical biostatistics. XI. Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971.
12. Feinstein, A. R.: Clinical biostatistics. XX. The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research, Clin. Pharmacol. Ther. 14:291-307, 1973.

i

An analysis ol diagnostic The construction ol clinical algorithms, Vale J. Biol. Med. 47:5-32, 197 Finney, D. [.: The design and logic ! a A.

einstein,

reasoning.

R.:

monitor

drug

ol

use.

Chroni(

|.

1S:7T

Dis.

'is

Kanarek,

l>

signalling ol adverse

Systemati<

|

Methods

reactions to drugs,

29 Lasagna,

1974 16

G.

Nan Brunl

I

in

patients.

The P.,

inn-..

tern,

<

and

I

Harris,

1

l>.ivis.

S.:

Ex

in

out-

I.

A. M. A and Watson.

I.

:

J.

Adverse drug

Pharmacoi

Ther,

k

I

Vdams, L

.

I

Short-term intense surveillance reactions,

Clin.

|.

\. Ulfelder, L Adenocarcinoma

Herbst,

maternal

ol

13:61-67,

and

II..

.

C.:

ciation

the

ol

stilbestrol

Poskanzer, Asso-

therap)

with

monitoring

hospital

Med.

Br.

drill's,

J.

adverse

ol

to

1:531-536, 1969.

Inman, W. H. W., and Adelstein, A. M.: Rise and fall of asthma mortality in England and Wales in relation to use of pressurized

W.

Inman.

in J.

Med

Br.

Institute of Pathology.

24. Jick, H., Slone, D., Borda,

Efficacy

S.:

tion

and

toxicity

age and

to

N.

sex,

I.

of

and Shapiro,

T.,

heparin

Engl.

rela-

in

Med. 279:

J.

284-286, 1968. 25. Jick,

G.

H..

V.,

Shapiro.

S.,

and Slone,

drug surveillance, 1455-1460, 1970

hensive 26. Jick,

H.,

the

effects

clinical

Bo

Lewis, G.

oral

Miettinen, O.

Heinonen, O.

P.,

S.,

analgesic

drugs,

Collaborative

1971.

Neff, R. K., Shapiro,

and Slone,

myocardial infarction. i

P.,

for assessing

Pharmacol. Ther. 12:456-463,

fick, H.,

a

of

Lewis,

Compre-

M. A. 213:

A.

S.,

S.,

D.:

A new method

V.:

>.,

|.

Slone, D., Shapiro,

and Siskind, Clin.

ab-

B.

R.:

Ad-

Experience ol Marv during 1962, |. A. M. A.

Thalidomide and congenital Preventable drug reactions

—

Med. 284:1361-

|.

W.

Cornwell,

[.,

B.,

Dingwall-Fordyce, I., Turnbull, Weir, R. D.: Cardiotoxicity of

B.

|.

I.,

and Huedv

Adverse drug

[.:

,

during hospitalization, Canad. Med.

97:1450-1457, 1967.

Pfeiffer, R. A., and Kosenow, W.: Thalidomide and congenital abnormalities, Lancet 1:

1962.

15-16,

37. Reidenberg, M. M.: Registry of adverse drug reactions, J. A. M. A. 203:31-34, 1968.
38. Royall, B. W.: Monitoring adverse reactions to drugs, WHO Chronicle 27:469-475, 1973.

39. Sartwell,

P.

and

embolism

Masi,

E., B.,

demiologic

A.

Arthes,

T.,

and Smith, H.

E.:

contraceptives:

oral

case-control

Am.

study,

F.

G,

Thrombo-

An J.

epi-

Epi-

demiol. 90:365-380, 1969. 40. Slone,

D.,

Jick,

Feinleib,

Bellotti,

Borda,

H.,

C, and Gilman,

Chalmers, T.

I.,

Muench,

M.,

IL,

Lipworth,

B.:

Drug

L.,

surveil-

lance utilizing nurse monitors, Lancet 2:901903, 41

1966.

Smidt, N. A., and to

inpatient

survev,

C:

Adverse

A comprehensive

hospital

McQueen,

drugs:

reactions

N.

Z.

Med.

E.

J.

76:397-401,

1972.

42. Vessey, M. P., and Doll, R.: Investigation of relation between use of oral contraceptives and thromboembolic disease, Br. Med. J. 2:199-

Miettinen, O.

Siskind,

P.,

congenital

1961.

G.:

k.,

and

J.,

Assoc.

36

Diagnostic problems and methods

Armed Forces

and

Per-

1.

1:45, 1962.

C. Crooks,

I).

reactions

C,

in drug-induced diseases. Parts I, II, and III, Washington, D. C, 1966, 1967, and 1968, respectively, American Registry of Pathology,

196

7: 157-170.

L.:

Greene, C.

age.

1969.

1971.

35. Ogilvie,

deaths

women ot child-bearing 2:193-199, 1968.

23. Irey, \. S.:

i

M.

W,

II.

of

adverse drug

ol

amitriptyline, Lancet 2:561-564, 1972.

and Vessey, M. P.: Infrom pulmonary, coronary, and cerebral thrombosis, and embolism vestigation

1,

W.

O'Mallev,

1969

aerosols, Lancet 2:279-285,

22.

reactions

H.,

de-

reactions.

Melmon, k 1368,

L973

vagina.

R.

Factors

drugs cause.

causes and cures, \. Engl.

J.:

tumor appearance in young women. \ Engl. Med. 284:878-881, 1971. J. .md Wade, 20. Hurwitz \ I..: Intensive

21.

33.

adverse ding

ol

Pharmacol.

II.

N.

abnormalities. Lancet 2:1358, 1961.

11:802-807,

and Fallon,

.

J.

diseases

hospital

McBride,

pharmacist-based monitoring sys

\

E.:

Med. 280:20-26,

Thalidomide

190:1071-107

34, Moir,

is Gra)

1).

Med.

drug

Fletcher

217:567-572, 1971.

[.

1970.

19.

The

Sweet.

"..

A.

McDonald, M. C, and Mackav, verse

I..

Drug Moni-

Permanente

Kaisei

i\

i

\1

monitoring drug reactions

perience

Cardnei

Collen

D.,

toring System, 17.

1..:

normalities, Lancet 31.

Friedman, E.,

Engl.

W.:

30 Lenz,

Med. 13:1-10,

Inf.

Eaton,

Y

reactions,

spect. Biol.

Finne)

and

W

V.

Sidel,

J.,

P.,

Center,

1973.

termining physician reporting

1963. 15.

Medical

Cniveisitv

Med. 289:63-67,

J.

28 Koch-Weser,

III.

1

14.

Boston

gram, Engl.

Drug

A

35-37, 1968.

44 Wintrobe, M. M.: The therapeutic millennium and its price: Adverse reactions to drug, in Talalav, P.: Drugs in our society. Based on a conference sponsored by

Coffee

University,

from the

kins Press.

D.:

report

205, 1968.

43 Weston, J. K.: The present status of adverse drug reaction reporting, J. A. M. A. 203:

Surveillance

Pro-

Baltimore,

The fohns Hopkins The Johns Hop-

1964,

SECTION FOUR

MATHEMATICAL MYSTIQUES AND STATISTICAL STRATEGIES

Mathematical theories often produce fears or fascinations that divert medical readers and investigators from the many scientific problems discussed in the previous three sections. Readers who get flustered by parametric estimators, pooled variances, α levels, standard errors, confidence intervals, regression coefficients, β errors, and logistic transformations may be too confused to look behind the statistical facades and scrutinize the quality of the scientific architecture and data. Alternatively, after becoming infatuated by the idea that statistical methods are panaceas for the intellectual ailments of research, investigators may neglect basic scientific challenges in structure, bias, or data and may assume that any problems will be solved by multiple regressions, discriminant functions, analysis of covariance, or partitionings of chi-square.

These inappropriate attitudes about the power of statistical techniques are often encouraged by the way the techniques are taught. They may be presented for investigators as a set of "cookbook" instructions to be followed obediently, even if the "cook" has no idea of what is being done or why. For statisticians, the presentation may receive so inherently mathematical a focus that the cook becomes a connoisseur of ovens, stoves, and kitchen maneuvers, while remaining oblivious to the fundamental culinary objectives in texture, flavor, and taste.

The four essays in this section are concerned with basic statistical ideas that have been confusing, distracting, or scientifically hazardous for both investigators and statisticians. The first set of ideas is concerned with the major scientific distinctions between estimating a parameter and contrasting a difference. In most tests of "statistical significance," the investigator evaluates the likelihood that chance is responsible for the difference found when two groups are contrasted. The t-test, chi-square, and other conventional procedures used for this purpose are derived from mathematical theories that were developed for a different purpose: to estimate a parameter for a single group. The conversion from a one-group estimation to a two-group contrast was desirable because it offered major advantages in computation, but it has made the statistical process indirect, convoluted, and hard to understand. With the ease of computation that is offered by modern electronics, investigators may prefer to use an alternative method: the permutation tests that are direct, straightforward, and easy to understand.

The second essay deals with problems in the mathematical duplexity of certain directions (the "one-way" dependence or "two-way" interdependence of a relationship) and the choice of a unilateral or bilateral zone of probability for "rejection" of the celebrated null hypothesis. The third essay is devoted to an issue that is seldom considered in conventional instruction. The decision that a difference is "statistically significant" is somewhat like making a positive diagnosis, and the P value or probability zone established as an α level denotes the possibility that the diagnosis is falsely positive. If the difference is regarded as not statistically significant, however, the investigator takes the chance of making a false negative diagnosis. This latter problem, which rarely receives adequate attention when investigators study statistics, involves the contemplation of β levels and a different type of P value. The α and β levels of probability are the bases for a statistically popular indoor sport: the calculation of sample size in a research project.
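The permutation approach mentioned for the first essay can be illustrated with a small sketch. It is not taken from the essays themselves: the function name `permutation_p_value` and the tiny data sets are hypothetical, and the choice of the absolute difference in means as the contrasted statistic (a two-sided contrast) is an assumption made for illustration. The idea is to enumerate every way the pooled observations could have been divided into two groups of the observed sizes, and to count how often the rearranged difference is at least as large as the one actually found.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P value for a difference in means.

    Hypothetical helper for illustration: every division of the pooled
    values into groups of the original sizes is treated as equally likely
    under chance alone.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    total = 0      # number of possible divisions examined
    extreme = 0    # divisions with a difference at least as large as observed
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point round-off.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Illustrative (made-up) data: two groups of three observations each.
p = permutation_p_value([1, 2, 3], [4, 5, 6])
```

With only six observations there are 20 possible divisions, so the enumeration is complete; for larger groups the same logic is usually applied to a random sample of divisions rather than to all of them.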

The fourth essay contains a catalog of defects in the way investigators employ statistical tactics to summarize or display the data reported in scientific literature. The standard error is regularly used improperly in two ways: either instead of a standard deviation to show the spread of data or instead of a confidence interval to show the estimated zone of location for the mean. A ± sign frequently appears between a mean and a standard something, but the something is not identified. The graphical portrait that is needed to illustrate the relationship of two variables is often omitted and, instead, the results are inefficiently or misleadingly summarized with correlation coefficients or regression coefficients. Finally, and perhaps worst of all, the only reported results may be the statistical calculations for P values or F values; and the actual data may be neither presented nor summarized.
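The distinction among the standard deviation, the standard error, and a confidence interval can be made concrete with a short sketch. The function name `summarize` and the sample values are hypothetical, and the interval uses a normal (z) approximation as a simplifying assumption; for small samples a t multiplier would ordinarily be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(values, confidence=0.95):
    """Return the three summaries that are often confused with one another.

    sd : spread of the individual observations
    se : sd / sqrt(n), the uncertainty in the location of the mean
    ci : mean +/- z * se, a zone of location for the mean
         (normal approximation; an assumption of this sketch)
    """
    n = len(values)
    m = mean(values)
    sd = stdev(values)              # sample standard deviation (n - 1 divisor)
    se = sd / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {"n": n, "mean": m, "sd": sd, "se": se, "ci": (m - z * se, m + z * se)}


# Illustrative (made-up) measurements.
summary = summarize([4, 6, 8, 10, 12])
```

Reporting the result as "mean ± something" is unambiguous only when the something is named: the standard deviation describes the data, whereas the standard error and the interval derived from it describe the estimate of the mean.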

CHAPTER 20

Permutation tests and 'statistical significance'

This chapter originally appeared as "Clinical biostatistics XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2)." In Clin. Pharmacol. Ther. 14:898, 1973.

The previous paper in this series was concerned with the advantages and pitfalls of using a randomization process in sampling from a parent population. By depending on chance alone, the randomized selection removes any element of human choice in the sample, and allows the results to be interpreted with inferences based on statistical probability.

These statistical inferences are particularly useful when the parent population is being sampled to estimate the value of an unknown parameter, such as the mean of a selected variable. From the results found in the random sample, the investigator can demarcate a zone of values, called a "confidence interval," and can have a specified "confidence level" for the probability that the true value of the parameter lies within the demarcated zone. The main pitfall of random sampling is that it does not indicate the reliability of the particular sample that was selected in the research. The randomization process provides no safeguards against the 5% of chance occasions in which a 95% confidence interval will be erroneous because the "luck of the draw" produced a substantially distorted sample.

An antidote for this hazard is stratification. If cogent subgroups (or strata) can be defined beforehand, the members of the sample are selected randomly from within each stratum. If this type of pre-stratification is not feasible, the sample is drawn with a non-specific randomization, the cogent strata are demarcated afterward, and the data are analyzed within strata. The results found in the individual strata can then, if desired, receive a "standardized adjustment" to create a single value that estimates the desired parameter appropriately for the entire parent population.

As noted previously,11,13 the selection of cogent strata and the analysis of potential bias are crucial scientific requirements that have seldom been adequately fulfilled in modern clinical and epidemiologic research. The requirement is particularly necessary in these forms of research because the groups of people under investigation have almost never been assembled by random sampling. Almost all of the existing surveys of the causes, occurrence, and treatment of disease, and all of the existing experimental trials of therapy, have been based on groups that were not


chosen randomly. The groups have consisted of people conveniently available to the investigators at such locations as hospitals, industrial plants,

physicians' regis-

search.

the

noted

A. The two main uses of statistical inference

The discovery that

a surprising shock to investigators who work in other domains.

With inanimate materials, chemists achieve random samples routinely and easily as an aliquot of a homogeneous mass. With general human populations, social and political scientists give careful attention to methods of sampling and getting random selections. With medical populations, however, the investigative samples are almost never random. Why are medical researchers so delinquent?

The answer to this question is based on the two different purposes of statistical inference. A socio-political scientist often wants to estimate a populational parameter, whereas a medical investigator usually wants to contrast a difference in two groups.

a

goal

A random sample is mandatory for estimating a parameter, but has not been regarded as equally imperative for

dom

most

of

epidemiologic

re-

investigator usually wants to

maneuvers under study the results exposed vs. non-exposed

effects of different

groups or

major

and

the

—

"case-control"

in

vs.

or in "treated" vs. "untreated."

medical researchers seldom use random samples often comes as

The

in

peoplc\

not

is

clinical

compare the in

schools, etc.

tries,

parameter forms of

diseased,

He

is

sel-

interested in populational parameters or their confidence intervals. He wants to know whether the magnitude of the difference observed in the results of a particular set of data is more than can readily be expected from numerical chance alone.

The contrast of the numerical difference associated with different maneuvers is a fundamental purpose in many research activities, and is the primary objective of most clinical and epidemiologic investigations. This goal does not appear to be well understood, however, by some of the leading statisticians of our era. For example, so excellent a statistician as G. W. Snedecor has stated that "the purpose of an experiment is to produce a sample of observations which will furnish estimates of the parameters of the population together with measures of the uncertainty of these estimates." This misconception of medical research is relatively widespread among statisticians, and may be responsible for many current problems

contrasting a difference.

some

Estimation of parameters. In polling political beliefs, our main concern is to

consultants in biomedical investigation.

get an estimate of the quantitative parti-

For analytic contrast, an investigator seldom cares about estimating the parameter of some unexamined theoretical parent population, and seldom wants to measure the uncertainties expressed with such calculations as "standard error of the mean." He has performed a single act of research on a single collection of people, who were divided into two groups. He has noted a

1.

tion

of public opinion.

There

is

no con-

any imposed experimental or conmaneuvers. We simply want to find out what the people think politically, and trast of

trol

we want sample large.

to

know how

We

accurately

the

views of the public at

reflects the

therefore

want

to estimate the

values and confidence intervals of populational parameters.

A random

sample being

crucial for these activities, a political scientist

pays

close

attention

to

getting

a

suitable collection of people for the survey. 2.

i

contra^

umerical evaluation of an analytic The estimation of a populational

31, 31,

of their

as ''•

,;

'

.'.:>

difference in

the numerical data for the

contrasted groups, and he wants to

know

whether this numerical difference might arise by chance alone. Suppose an investigator has reported success rales of 75% in a treated group and 25% in a control group. Before draw-

Permutation

tests

and

'statistical significance'

289

any further conclusions, we immediately want to know whether enough people

sible

were included

what Mainland'-' has called the "eye test." We could draw a reasonable conclusion by just looking at the data. The decision would be more difficult if the 75% and 25% came from such numbers as 9/12 vs. 5/20, or 3/4 vs. 10/40. For these and for the manv other contrasts whose numerical distinctions cannot be immediately judged with an "eye test," we want some cerebral mechanisms to supplement the ocular per-

in;

z, y, x, w; and so on. The number of different permutations for a group of four different objects is 24 (= 4 × 3 × 2 × 1). Thus, each distinctive quartet will appear in 24 ways among the 1680 arrangements. We therefore divide 1680 by 24 and get 70 as the number of distinctive quartets that could be formed by dividing eight objects into two groups of four each.
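The counting argument above can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted in the text (1680 ordered selections, 24 orderings per quartet, 70 distinctive quartets); the variable names are illustrative and do not come from the original.

```python
import math

n_objects = 8    # eight observations in all
group_size = 4   # four go into one group; the remaining four form the other

# Ordered ways of picking four of the eight objects: 8 x 7 x 6 x 5
ordered_quartets = math.perm(n_objects, group_size)

# Each distinctive quartet is counted once for every ordering of its members: 4!
orderings_per_quartet = math.factorial(group_size)

# Dividing removes the repeated orderings
distinct_quartets = ordered_quartets // orderings_per_quartet

print(ordered_quartets, orderings_per_quartet, distinct_quartets)  # 1680 24 70
```

The same figure of 70 is the binomial coefficient for choosing 4 objects out of 8, which `math.comb(8, 4)` returns directly.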

a. Data expressed in means. To illustrate this procedure, let us consider the same data that we examined earlier to perform the t test. The results in Group A are 20, 15, 12, and 6 units, with a mean of 13.25. The results in Group B are 7, 3, 2, and 1 units, with a mean of 3.25. If these eight people were arbitrarily divided into two pairs of quartets, how [...]

Half of these 70 arrangements are similar but opposite. Thus, the quartet w,x,y,z might be in Group A and the quartet s,t,u,v might be in Group B; or vice versa. Consequently, there are really 35 distinctively different pairs of quartets to be considered. Those arrangements for the numbers 20, 15, 12, 7, 6, 3, 2, and 1 are shown in Table II, together with the means and difference in means for each of the 35 arrangements. Exchanging the contents of Groups A and B, and reversing signs for the differences in means, we could obtain the remaining 35 arrangements that complete the 70 possibilities for those eight numbers.

*The exclamation point in the formula refers to the "factorial" of a number. Thus 5! = 5 × 4 × 3 × 2 × 1. By convention, 0! = 1.

Table II. Means and differences in means for 35 arrangements of eight observations divided into two groups

Group A          Group B          Mean, Group A   Mean, Group B   Difference in means, A-B
20, 15, 12, 7    6, 3, 2, 1       13.5            3.0             10.5
20, 15, 12, 6    7, 3, 2, 1       13.25           3.25            10.0
20, 15, 12, 3    7, 6, 2, 1       12.5            4.0             8.5
20, 15, 12, 2    7, 6, 3, 1       12.25           4.25            8.0
20, 15, 12, 1    7, 6, 3, 2       12.0            4.5             7.5
20, 15, 7, 6     12, 3, 2, 1      12.0            4.5             7.5
20, 15, 7, 3     12, 6, 2, 1      11.25           5.25            6.0
20, 15, 7, 2     12, 6, 3, 1      11.0            5.5             5.5
20, 15, 7, 1     12, 6, 3, 2      10.75           5.75            5.0
20, 15, 6, 3     12, 7, 2, 1      11.0            5.5             5.5
20, 15, 6, 2     12, 7, 3, 1      10.75           5.75            5.0
20, 15, 6, 1     12, 7, 3, 2      10.5            6.0             4.5
20, 15, 3, 2     12, 7, 6, 1      10.0            6.5             3.5
20, 15, 3, 1     12, 7, 6, 2      9.75            6.75            3.0
20, 15, 2, 1     12, 7, 6, 3      9.5             7.0             2.5
20, 12, 7, 6     15, 3, 2, 1      11.25           5.25            6.0
20, 12, 7, 3     15, 6, 2, 1      10.5            6.0             4.5
20, 12, 7, 2     15, 6, 3, 1      10.25           6.25            4.0
20, 12, 7, 1     15, 6, 3, 2      10.0            6.5             3.5
20, 12, 6, 3     15, 7, 2, 1      10.25           6.25            4.0
20, 12, 6, 2     15, 7, 3, 1      10.0            6.5             3.5
20, 12, 6, 1     15, 7, 3, 2      9.75            6.75            3.0
20, 12, 3, 2     15, 7, 6, 1      9.25            7.25            2.0
20, 12, 3, 1     15, 7, 6, 2      9.0             7.5             1.5
20, 12, 2, 1     15, 7, 6, 3      8.75            7.75            1.0
20, 7, 6, 3      15, 12, 2, 1     9.0             7.5             1.5
20, 7, 6, 2      15, 12, 3, 1     8.75            7.75            1.0
20, 7, 6, 1      15, 12, 3, 2     8.5             8.0             0.5
20, 7, 3, 2      15, 12, 6, 1     8.0             8.5             -0.5
20, 7, 3, 1      15, 12, 6, 2     7.75            8.75            -1.0
20, 7, 2, 1      15, 12, 6, 3     7.5             9.0             -1.5
20, 6, 3, 2      15, 12, 7, 1     7.75            8.75            -1.0
20, 6, 3, 1      15, 12, 7, 2     7.5             9.0             -1.5
20, 6, 2, 1      15, 12, 7, 3     7.25            9.25            -2.0
20, 3, 2, 1      15, 12, 7, 6     6.5             10.0            -3.5

By inspecting the results of Table II, we can get the answers to the questions asked earlier. An "exceptional" difference in means of 10 units or more in favor of Group A occurs twice in the 35 arrangements shown in Table II. If we completed the rest of the table to form 70 arrangements, we would find that a difference of means of 10 units or more in favor of B would also occur twice. Thus, at a two-sided level of interpretation, for any difference of 10 units or more, the probability is 4/70 = 0.05714. At a one-sided level of interpretation, for the difference of 10 units in favor of A, the probability is 2/70 = 0.02857.

These results are different from what we found earlier with the t test. At a two-sided level of interpretation, P was between 0.025 and 0.02 with the t test, but was 0.057 with the exact random permutation test. Thus, if we were adhering to a strict α level of below 0.05 for "statistical significance," the result would be "significant" with the t test but not with the permutation test. At a one-sided level, however, both results would be "significant": P was below 0.0125 with the t test and was 0.029 with the permutation test.
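The permutation probabilities quoted for the eight observations (20, 15, 12, 6 in Group A and 7, 3, 2, 1 in Group B) can be reproduced by brute-force enumeration. The sketch below is an illustration rather than the author's own procedure; it lists all 70 divisions into two quartets and counts how many give a difference in means at least as large as the observed 10 units. The helper name `mean` and the variable names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

group_a = [20, 15, 12, 6]
group_b = [7, 3, 2, 1]
pooled = group_a + group_b
size_a = len(group_a)


def mean(values):
    return sum(values) / len(values)


observed = mean(group_a) - mean(group_b)  # 13.25 - 3.25

# Every way of choosing which four of the eight values form Group A
differences = []
for chosen in combinations(range(len(pooled)), size_a):
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    differences.append(mean(a) - mean(b))

total = len(differences)

# One-sided: difference of 10 units or more in favor of Group A
one_sided = Fraction(sum(d >= observed for d in differences), total)

# Two-sided: difference of 10 units or more in either direction
two_sided = Fraction(sum(abs(d) >= observed for d in differences), total)

print(total, float(one_sided), float(two_sided))
```

Enumeration of this kind is practical only for small groups; the number of arrangements grows very quickly as the groups get larger.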

From Table {

II.

According

to

in

the

2 15 II,

calculated

means

in

observed

12

ot

of

should

distribution

V and L7.85 none howevci mkI

confidence ')">',

Foi

noted

empiricallj

the

differences

ol

35

tin

each

in

pairs

ol

The

table

of the

many different arrangements to be considered. [The number of possible arrangements is 15!/(8! × 7!) = 6435.]

Fortunately, an easier way of determining the desired information was developed by R. A. Fisher. If the table has the construction shown earlier with the letters a, b, ..., F, N, Fisher showed that the proportion of the permutations that would yield that table is

$$\frac{A!\,B!\,S!\,F!}{N!\,a!\,b!\,c!\,d!}$$

For the particular table we observed, this value would be

$$\frac{8!\,7!\,9!\,6!}{15!\,7!\,1!\,2!\,5!}$$
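The factorial expression above can be evaluated exactly. The sketch below assumes, from the factorials quoted in the text, that the observed fourfold table has cells 7, 1, 2, and 5 with marginal totals 8, 7, 9, and 6; the cell layout and the variable names are assumptions made for illustration. The result is the probability of this one table alone, not a complete P value, which would also add the probabilities of any more extreme tables.

```python
from fractions import Fraction
from math import comb, factorial

# Assumed layout of the observed 2 x 2 table (inferred from 7! 1! 2! 5!)
a, b = 7, 1
c, d = 2, 5

row1, row2 = a + b, c + d   # 8 and 7
col1, col2 = a + c, b + d   # 9 and 6
n = row1 + row2             # 15

# Number of ways to divide 15 people into groups of 8 and 7
arrangements = comb(n, row1)

# Fisher's expression: product of marginal factorials over N! a! b! c! d!
numerator = factorial(row1) * factorial(row2) * factorial(col1) * factorial(col2)
denominator = factorial(n) * factorial(a) * factorial(b) * factorial(c) * factorial(d)
p_table = Fraction(numerator, denominator)

print(arrangements, p_table, float(p_table))
```

Using `Fraction` keeps the arithmetic exact, so the large factorials cancel without rounding error.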

li.nl

that

They include both

the pos-

pf and the possibility

test

draw

the sampling distribution for

statistic at

each degree of freedom,

the curves of each array of relative fre-

quencies, and to calculate the areas that

der the curve beyond each value of the tistic

that

Pi-

To determine

con-

100 degrees of freedom

1

ference in the two proportions, p, and p 2 Regardless of whether we are interested in p t

P2

which shows the curve of the

x

which the chi-square

plied.

For the one-sided

the particular chi-square curve that

is

pertinent for the 2

calculation of

relative frequencies of

P

degree of free-

1

example of the sampling

Fig. 2 gives an

culate the value of

chi-square,

at

0.025.

tribution of chi-square for

X

or chi-square.

2

contemplate only the right hand

this distribution.

of those individual imaginary samples, t

x

dom.

"sampling distribution" of the

"test statistic"

number of "degrees of freedom"

6

5

Every-

theoretical parent population in the sky. find

0.05

the

is

thing else takes us into the world of mathe-

we

=

Values of X 2

statisti-

were actually observed. This

the data that

thing

such as

we

First

4

3

2

the results that

of significance, however, everything

becomes more complicated.

last

area

test,

gives us the actual probability values

random rearrangements of

for

permutation

a

such as the Fisher exact probability

test,

lie

un-

test sta-

would be a formidable chore. Fortunately, we can be spared all this work because our statistical colleagues have already done it for us. They have worked out the complete details of each theoretical sampling distribution for each test statistic at each "degree of freedom". The details have been organized so that we need not even look at the distribution curve to erect a perpendicular line and measure an "external" area of probability. This area has already been measured for us. It becomes the P values that are listed in the familiar tripartite tables that show a confluence of three items: the value of t (or whatever test statistic is being used); the number of degrees of freedom; and the associated value for P.

In using these tables, however, we still have to decide whether the P value is unilateral ("one-sided") or bilateral ("two-sided"). If the alternative hypothesis is bilateral, such as μ1 ≠ μ2, we want to know the possibility that μ1 is either greater than or less than μ2. If the alternative hypothesis is unidirectional, such as μ1 < μ2, we seek a difference in only one direction, and the probability should correspondingly be unilateral.

To illustrate this point, suppose we have found a mean value of 11.3 in Group A and a mean of 14.4 in Group B. According to the null hypothesis, we want to know the likelihood of finding a difference of 3.1 units by chance. Suppose each group contains 51 members, with the standard error estimated as 1.7 units for the population. The associated value of t would be 3.1/1.7 = 1.82. If our test is two-sided, we are really asking for the probabilities that are found on both sides of the symmetrical t-distribution: the probability of getting a value of t greater than 1.82 and the probability of getting a value of t less than -1.82. The sum of those two probabilities is what emerges in the conventional bilateral test. When we look in a table of two-sided probabilities for t at 100 degrees of freedom, we see that P = .1 for t = 1.661 and P = .05 for t = 1.982. Consequently, for our observed t of 1.82, we find that .05 < P < .1, so that the result is not "statistically significant".

On the other hand, if we expected Group B to have a larger value than Group A, so that our null hypothesis was μA ≥ μB, the subsequent test would be one-sided. We would want to know only the probability of encountering a value of t greater than 1.82. To find this information, we can either go to a set of tables that show the one-sided probabilities or we can take half the probability values listed in a table of two-sided results. For the situation just described, we would get .025 < P < .05, and the result would now be "statistically significant". Thus, according to the way we formulate the null hypothesis and use the t-tables, the same set of data may or may not be "statistically significant".

When the procedure is a chi-square test, rather than a t-test, the situation is somewhat trickier. On seeing the asymmetrical "one-tail" curve of the chi-square distribution, users of the test may become confused and may conclude erroneously that the associated P value refers to a unilateral probability. The fact is, however, that the idea of two-sidedness is built into the calculation of chi-square itself. In a t-test for x̄A = 10 vs. x̄B = 12, the t value is opposite to what we would get if x̄A = 12 and x̄B = 10. If pA = .10 and pB = .12, however, the chi-square value is identical to what we would get for pA = .12 and pB = .10. Because the probabilities associated with the asymmetrical chi-square curves are always two-sided, we cannot find a one-sided result by looking for a value on the "other side" of the curve. We simply take half the associated value of probability. For example, in the customary table of chi-square values at one degree of freedom, P is 0.10 when χ² is 2.706 and P is 0.05 when χ² is 3.841. Consequently, a χ² value of 3.12 would not be "significant" in a two-sided test because 0.05 < P < 0.1. If the test were one-sided, however, this same value of χ² would become "significant" at P < .05 because the associated probability values would be halved.

These illustrations of both the permutation and the "test statistic" methods of getting a P value have shown why an investigator will usually prefer the unilateral tests of probability. They are more consistent with the unidirectional design of a scientific hypothesis and they also offer a "better" P value for the same investment in research data. The results are always more likely to be "statistically significant" if the probabilities are evaluated in a one-sided manner.

Many statisticians, however, are reluctant to accept the idea of allowing a directional scientific decision to affect the operation of a duplex statistical procedure. One of the most vigorous statements of the statistical ideology was provided by Langley 17 as follows:

There is a lot of muddled thinking about this particular aspect of statistical inference, even in high places. . . . The whole aim of statistical tests is to eliminate guesswork and to put inductive logic on a mathematical footing, and this means that these tests must remain completely objective. This impartiality, this freedom from human foible, is only possible if we stick to two-sided probabilities. . . . (The problem) is solved simply by using two-sided probabilities as a routine for all significance tests.

This adamant opposition to the use of one-sided probabilities is particularly likely to arise among statisticians who have consulted extensively for research in education, psychology, and the social sciences. In these domains, the investigator is seldom able to study the effects of interventional maneuvers and may have difficulty specifying a dependent relationship or a directional hypothesis. Consequently, certain investigators sometimes try to generate their scientific hypotheses after the data have been analyzed rather than beforehand, and the investigators may often want to juggle the statistical hypothesis into a unilateral direction in order to get the P values down below the magic marker level of 0.05. These research strategies have been thoroughly debated in an extensive series of publications of which many statisticians seem lamentably unaware. Anthologies of the controversy over hypothesis testing and number of "sides" for probability are presented in references 21 and 23.

In the domain of biologic and clinical science, the ability to study interventional maneuvers (with either experiments or surveys) often allows the investigator to formulate a unidirectional scientific hypothesis. Accordingly, one might expect contemporary biometricians to be reasonably flexible about whether the statistical decision should be one-sided or two-sided. To check this expectation, I went through a collection of 18 statistical books that are devoted to medical or biologic statistics. I found, to my surprise, that two of the most famous books, by Fisher 14 and by Bradford Hill 3, contained no mention (or at least none that I could find) of the distinction between a unilateral or bilateral direction for probability. A medical reader who relied on these two books might never discover that such decisions have to be made whenever he uses a table of probability values.

The remaining textbooks could be divided into three categories. The describe-but-don't-prescribe books contain an account of the two kinds of decisions, but do not seem to provide any instructions on how to make the decision. The even-handed-approach books, as exemplified by Schor 22 and by Huntsberger and Leaverton 15, describe the two different kinds of hypotheses and indicate the circumstances in which each hypothesis should be applied. The other books containing a two-sides-are-almost-always-better-than-one argument range from Campbell's view 4 that "one-sided tests are justified only rarely but ... we ought always to consider their possible relevance" to Armitage's assertion 1 that "it will be safe to assume that significance tests should almost always be two-sided". Mainland 19, who must have met a trauma of investigators who wanted to alter their hypotheses after the data were analyzed, prefers two-sided tests because "we should beware of the temptation to lower our standards by giving ourselves a better chance of obtaining a 'positive' result".

My own view is in accord with the even-handed school. Since the decision should always be based on scientific rather than statistical considerations, there is no reason for a statistician to approach the decision with mathematically authoritative pontifications or preconceptions. The investigator should be obligated to state, before the analysis, whether his scientific hypothesis is unidirectional or bidirectional; the statistical test should then be appropriately one-sided or two-sided; and the unilateral or bilateral aspects of probability should follow accordingly.

From time to time, a statistical consultant will encounter an investigator who cannot state any form of direction. This difficulty usually arises not in deciding whether to go one way or two ways in the statistical null hypothesis, but because the scientific hypothesis itself is either malformed or amorphous. The research is directionless because it is aimless. In this situation, the statistical consultant can do everyone a service not by conservatively applying two-sided "objective" tests of statistical "significance", but by advising the investigator to re-think the goals and aims of the research itself.
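The one-sided versus two-sided illustrations in this section (a t of 3.1/1.7 = 1.82, and a chi-square of 3.12 at one degree of freedom) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the identity that chi-square with one degree of freedom is the square of a standard Gaussian variate, so its upper-tail probability equals erfc(√(χ²/2)); the function name is hypothetical. The t probabilities at 100 degrees of freedom are not computed here, because the Python standard library has no t distribution; the table values quoted in the text are used for that part.

```python
import math


def chi_square_p_1df(x2):
    """Upper-tail probability for chi-square with 1 degree of freedom.

    Because chi-square(1) = Z**2, P(chi-square > x2) = P(|Z| > sqrt(x2)),
    which equals erfc(sqrt(x2 / 2)).
    """
    return math.erfc(math.sqrt(x2 / 2.0))


# t example from the text: observed difference divided by its standard error
t_value = (14.4 - 11.3) / 1.7

# Chi-square example from the text
x2 = 3.12
p_two_sided = chi_square_p_1df(x2)   # the tabulated value is already two-sided
p_one_sided = p_two_sided / 2        # halved for a unidirectional hypothesis

print(round(t_value, 2))
print(round(p_two_sided, 4), round(p_one_sided, 4))
print(p_two_sided < 0.05, p_one_sided < 0.05)
```

The same data thus fall on different sides of the .05 boundary depending only on whether the hypothesis was stated as directional.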

F. The size of the rejection zone

We can now turn our attention to a magnitude, rather than a direction, that is an integral part of statistical hypothesis testing. At the end of either a one-sided or two-sided test is a statistically blessed region called the rejection zone. It is the place where the investigator can happily set aside the null hypothesis and proclaim those magical words, sacred to editors, reviewers, and granting agencies: "statistical significance". The rejection zone is demarcated by the choice of a fixed probability level, called α, which is usually taken to be either .05, .01, or .001. If the P value in the research data is below the chosen level of α, we reject the null hypothesis, state that the results are "statistically significant", and conclude that the observed difference is real. This is the "positive" conclusion that an investigator usually wants to attain in doing a test of "significance".

The level of α establishes the proportionate risk that the conclusion will be falsely positive: that a correct null hypothesis has been erroneously rejected. If the null hypothesis is true, we would want to accept it and draw the "negative" conclusion that the results are "not significant". The likelihood of being correct in this decision will be 1 - α. Consequently, if α = .05, and if we make decisions for 100 instances in which the null hypothesis is true, we will be right 95 times and wrong 5 times. The value of 1 - α, which is thus equivalent to the specificity of a diagnostic test, is also the source of statistical concepts of "confidence". If α = .05, our confidence level is 1 - α, or 95%, that we will be right when we accept the null hypothesis and conclude that a difference is "not significant". As α gets smaller, our "confidence" rises in the specificity of the negative decision.

The smaller the magnitude of α, the greater is the exuberance with which "significance" is proclaimed when the P value falls below α. For some investigators, if P is below .05, .01, or .001, the upward verbal declension of significance ascends from "statistically significant" to "highly statistically significant" to "very highly statistically significant". For others, the sequence is significant, highly significant, and conclusive; or significant, decisive, and determinant. I have sometimes heard the sequence of super, wow, and eureka. In computer print-outs, the machine's Guide-Michelin expression of grandiloquence may be restrained, respectively, to the symbolic *, **, or ***.

A sober scientist, encountering all this verbal or symbolic jubilation about a P value, may wonder how these sacred boundaries were ordained and why the sanctification should arise from the test of a statistical hypothesis, in contrast to the analysis of a scientific hypothesis and its associated results. In a later essay in this series, I shall give detailed discussion to the perversion of the word "significance" as one of the many intellectual pollutants that an inappropriate use of statistical theory has brought to modern medical science. For the moment, however, we can focus on the choice of the magic level of α = .05.

The role of this number in demarcating "significance" has become so widely accepted and worshipped that one might expect to find a record of the time and place when the apotheosis occurred. No such record, however, seems to exist. I have looked through a series of books devoted to the history of probability, statistics, and science, but the historians do not seem to have taken note of an event so prominent in the development of a statistical hegemony over scientific decisions.

According to Donald Mainland 18, the statistical use of the term "significance", although usually ascribed to R. A. Fisher, was probably introduced in 1896 by Karl Pearson. Pearson's boundary for dividing "non-significance" from "significance" was an entity called the probable error, an idea that has since become archaic. Fisher was probably the person who moved the boundary to .05. This choice seems to have arisen from a mathematical phenomenon that could be easily converted into a mnemonically convenient statistical tool. In a Gaussian distribution of data, 95% of the values are in a zone spanned by the mean plus or minus 1.96 standard deviations. Because 1.96 was so close to 2, the number 2 could be easily remembered and the expression μ ± 2σ (or x̄ ± 2s) became a quick, simple shorthand for denoting the zone in which the "common" values of a distribution would be encountered. The 5% of values that lay outside this zone would be regarded as uncommon or "rare".

The next step in the reasoning was easy to take. An event that occurred in the outside 5% zone could be regarded as different enough from the other 95% of events to be called "significant", and to be deemed unlikely to have happened by chance. The fact was, of course, that such events could occur regularly by chance, with an occurrence rate of about 1 in every 20 chance occasions. Nevertheless, a boundary was needed for making individual decisions and .05 seemed as good a boundary as any other, particularly since it had the pleasant relationship with 2σ.

Some of Fisher's language elsewhere is quoted by Mainland 19 as follows:

If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 and consider that (lower) values . . . indicate a real discrepancy. . . . It is convenient to take this (.05) point as a limit in judging whether a deviation is to be considered significant or not.

Fisher did not succeed in getting these concepts universally accepted by the statistical establishment. In what is generally accepted today as the "bible" of theoretical statistics, Kendall and Stuart 16 make no mention of .05, .01, or any other specific number as the boundary between what they call the critical region and the acceptance region. Furthermore, the idea of "statistical significance" does not appear anywhere in the three volumes of the Kendall and Stuart text. In a footnote, the authors explain that they "shall not use" the term because it "can be misleading".

For the world of psychologic and social science research, however, and for many biometric priests and their clinical acolytes, Fisher's words and boundaries have now become sacred writ. Research publications are accepted or rejected; clinical trials are maintained or abruptly terminated; pharmaceutical agents are allowed on the market or withdrawn; and scientific reputations are made or lost, all on the basis of the magisterial phrase and number: statistical significance at P ≤ .05.

G. The size of the substantive difference

The historians who ultimately review the course of twentieth century clinical science will probably not be surprised that an act of critical scientific judgment was converted into an arbitrary numerical shrine. So many judgmental procedures have been obliterated during the twentieth century worship of technology, so many crucial human attributes have been neglected or deliberately omitted during the collection of "hard" data, so many inadequately designed and inadequately analyzed enumerations have become established as the statistical standards of epidemiologic research, and so much effort has been made to arrive at objectivity, even at the sacrifice of sensibility, that the conversion of scientific hypotheses to P values will probably be regarded as merely a minor note in the dominant intellectual cacophony of the era.

It is also not surprising that scientists confronted with complex decision-making would have searched for a reliable and effective guideline. Anyone confronted with difficult decisions would seek whatever good advice might be available. What the historians will probably regard as astonishing, however, is that the "enlightened" scientists of the twentieth century would have been so obsequious in accepting a guideline that was neither reliable nor effective.

The P values that arise from statistical "tests of significance" are unreliable guidelines to scientific decisions because P values are totally dependent on the size of the groups under investigation. No matter how trivial or foolish the scientific hypothesis and no matter how petty or inconsequential the difference that is being analyzed, the results will be "statistically significant" if the size of the sample is large enough.
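The dependence of "statistical significance" on group size can be sketched numerically. The snippet below is only an illustration: it assumes two groups of equal size n and the usual uncorrected forms t = √n(x̄1 - x̄2)/√(s1² + s2²) and χ² = 2n(p1 - p2)²/[(p1 + p2)(2 - p1 - p2)]. The rates of 28% and 26%, the means and standard deviations, and the function names are hypothetical choices made for the example.

```python
import math


def t_equal_groups(n, mean1, mean2, sd1, sd2):
    """t for a difference in means between two groups, each of size n."""
    return math.sqrt(n) * (mean1 - mean2) / math.sqrt(sd1 ** 2 + sd2 ** 2)


def chi_square_equal_groups(n, p1, p2):
    """Uncorrected chi-square for two proportions, each based on n people."""
    return 2 * n * (p1 - p2) ** 2 / ((p1 + p2) * (2 - p1 - p2))


CRITICAL_X2_05 = 3.841  # chi-square at 1 degree of freedom for two-sided P = .05

p1, p2 = 0.28, 0.26  # hypothetical rates; the 2% difference never changes
for n in (100, 1000, 10000):
    x2 = chi_square_equal_groups(n, p1, p2)
    print(n, round(x2, 3), x2 > CRITICAL_X2_05)

# With fixed means and standard deviations, quadrupling n doubles t
ratio = t_equal_groups(400, 10.5, 10.0, 5.0, 5.0) / t_equal_groups(100, 10.5, 10.0, 5.0, 5.0)
print(ratio)
```

The same 2% difference is far from the .05 boundary at n = 100 and well beyond it at n = 10,000, although nothing about its clinical importance has changed.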

For example, in a test of the difference in means between two groups each of size n, with means $\bar{x}_1$ and $\bar{x}_2$ and variances $s_1^2$ and $s_2^2$, the value of t is

$$t = \frac{\sqrt{n}\,(\bar{x}_1 - \bar{x}_2)}{\sqrt{s_1^2 + s_2^2}}$$

Because of the multiplication by $\sqrt{n}$, if $\bar{x}_1$, $\bar{x}_2$, $s_1$, and $s_2$ remain the same, the value of t will rise or fall (and the P value will get correspondingly smaller or larger) merely with changes in the size of n. Analogously, in a test of the difference in percentages $p_1$ and $p_2$ between two groups each of size n, the value of chi-square is

$$\chi^2 = \frac{2n\,(p_1 - p_2)^2}{(p_1 + p_2)(2 - p_1 - p_2)}$$

Chi-square thus equals 2n times a particular "function" of $p_1$ and $p_2$. Regardless of the individual values found for $p_1$ and $p_2$ or their difference, chi-square will rise or fall and the P value will change correspondingly, according to the size of n.

For this reason, an investigator who observes an impressive difference in $\bar{x}_1 - \bar{x}_2$ or in $p_1 - p_2$ is entirely justified in performing a t-test or a chi-square test to determine whether n is large enough to indicate that the result is unlikely to arise by chance. Beyond that probabilistic assessment, however, the investigator engages in a weird distortion of scientific reasoning if he makes decisions that depend on degrees of "significance" in the value of P itself, or if he allows trivial differences to become magnified into "statistical significance" merely because the size of n was large enough to make anything become "significant". The use of P values in this manner has been an especially unfortunate aspect of modern "biostatistics", and the trap has been particularly treacherous for analysts of the huge numbers that may sometimes appear in epidemiologic data.

In addition to being unreliable, P values are not an effective guideline to scientific decisions. The magnitude of P has nothing to do with the quantity that should ordinarily concern a scientist: the size of the difference. In the days before investigators became infatuated with statistical tests, a scientist was obligated to establish more than a hypothesis and more than a direction for the hypothesis. The scientist also had to decide about an increment. This increment (the difference in the results of the contrasted groups, as expressed by $\bar{x}_1 - \bar{x}_2$ or by $p_1 - p_2$) was what might be called substantive significance or clinical importance. An investigator who observed a 28% improvement rate in the treated group, and 26% in the controls, might have no interest whatsoever in testing the null hypothesis that $p_1 = p_2$ or even that $p_1 < p_2$. To decide that the observed difference was important, the investigator would want $p_1$ (the value in the treated group) to exceed $p_2$ by a substantial increment such as 10%. The scientific hypothesis in the research would then be $(p_1 - p_2) \geq .10$, or $p_1 \geq (p_2 + .10)$. For this purpose, the opposite (or null) hypothesis would be that $p_1 < (p_2 + .10)$.

With this statement of a scientifically important increment, the simple algebra of the statistical hypotheses has been invaded by another symbol. It is Δ, the magnitude of the increment that is substantively significant. This increment seldom receives any major attention in statistical textbooks or discussions, possibly because the size of Δ cannot be determined with any act of statistical conjecture or computational prestidigitation. To pay attention to Δ, the statistician must acknowledge the primacy of scientific judgment.

Despite the neglect by statisticians, the size of Δ is usually what makes an investigator decide to use tests of statistical significance. If the observed difference, d, exceeds the size of a chosen increment, Δ, an astute investigator will want assurance that the result is "statistically significant". If d is too small, the investigator may not bother with any probabilistic calculations.

Once we recognize that traditional scientific inference depends on Δ, whereas traditional statistical inference depends on α, we are ready to contemplate yet another aspect of direction in the differences between scientific and statistical reasoning. This new [...] is usually labelled with the Greek [...]. Expressed in diagnostic terms, 1 - β refers to the [...] a significance test. In scientific terms,

to the directional aspect of

when conclusions

are

and

sta-

directional activity letter /3. In

"power"

terms,

/3

of

refers

being right or wrong

drawn about a.

Ex-

/3 is the rate of false

negative decisions about accepting the null hypothesis, and 1-/3 itive decision.

is

the "sensitivity" of a pos

The

direction of relationships, hypotheses,

These conclusions are regularly discussed

319

probabilities

primer of concepts, phrases, and procedures the

(seldom with any real attention to A) in statistical

and

textbooks, but the associated reasoning

statistical

analysis

of multiple

in

variables,

Clin. Pharmacol. Ther. 14:462-477, 1973. A. R.: Clinical biostatistics. XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2), Clin. Pharmacol. Ther. 14:898-915, 1973. Feinstein, A. R.: Clinical biostatistics. XXV. A survey of the statistical procedures in general medical journals, Clin. Pharmacol. Ther. 15:97-107, 1974. Fisher, R. A.: Design of experiments, ed. 8, New York, 1966, Hafner Publishing Co., p. 2. Fisher, R. A.: Statistical methods for research workers, ed. 14, Edinburgh, 1970, Oliver &

11. Feinstein,

frequently in another one of

now appears most

the twentieth century's favorite applications of statistical

numerology: "the estimation of sam-

ple size" for a clinical tics,

and abuse of

trial.

this

The

strategy, tac-

12.

procedure will be de-

ferred for separate discussion in a future in-

stallment of this series.

13.

14.

References 1.

Statistical methods in medical York, 1971, John Wiley & Sons, Inc., pp. 159 and 104. Atkins, H.: Conduct of a controlled clinical trial, Br. Med. J. 2:377-379, 1966. Bradford Hill, A.: Principles of medical statistics, ed. 9, New York, 1971, Oxford Uni-

Armitage,

2.

3.

Boyd. Ltd.,

P.:

New

research.

15.

tistical

6.

Feinstein,

A.

R.:

Clinical

Kendall,

M. G., and

Stuart, A.:

The advanced

theory of statistics (in three volumes), Longon,

17.

Langley,

R.:

&

Practical

Co. statistics

(paperback),

London, 1968, Pan Books Ltd., pp. 143 and 18.

X.

biostatistics.

Sources of 'transition bias' in cohort

A.

R.:

Clinical

biostatistics.

Sources of 'chronology bias' in cohort

Mainland,

D.:

The significance of "nonsigPharmacol. Ther. 4:580-

nificance", Clin.

586, 1963.

statistics,

Clin. Pharmacol. Ther. 12:704-721, 1971.

19.

Mainland,

D.

Elementary medical statistics, W. B. Saunders Co.,

ed. 2, Philadelphia, 1964,

XI.

pp. 222 and 330.

statistics,

Clin. Pharmacol. Ther. 12:864-879, 1971.

20.

Miller, D. A.: Significant and highly significant.

A. R.: Clinical biostatistics. XIX. Ambiguity and abuse in the twelve different

21.

Nature 210:1190, 1966. Morrison, D. E., and Henkel, R. E., editors:

Feinstein,

of 'control', Clin. Pharmacol. Ther. 14:112-122, 1973. Feinstein, A. R.: Clinical biostatistics. XX. The

concepts

epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research, Clin. Phar-

macol. Ther. 14:291-307, 1973. 10.

sciences,

Bacon, Inc.. pp. 150-

146.

Cornish, E. A.: Preface to Reference 14.

9.

&

151. 16.

p. 61. 5.

8.

inference in the biomedical

1963, Charles Griffin

Campbell, R. C: Statistics for biologists, Cambridge, 1967, Cambridge University Press,

Feinstein,

177.

Boston, 1970, Allyn

versity Press. 4.

p.

Huntsberger, D. V., and Leaverton, P. E.: Sta-

Feinstein, A. R.: Clinical biostatistics.

XXI.

A

The

—

a reader, significance test controversy Chicago, 1970, Aldine Publishing Co. 22. Schor, S. Fundamentals of biostatistics, New York, 1968, G. P. Putnam's Sons, Inc., p. 157. 23. Steger, J. A., editor: Readings in statistics for the behavioral scientist, New York, 1971, Holt, Rinehart and Winston, Inc.

CHAPTER 22

Sample size and the other side of statistical significance

This chapter originally appeared as "Clinical biostatistics. XXXIV. The other side of 'statistical significance': alpha, beta, delta, and the calculation of sample size." In Clin. Pharmacol. Ther. 18:491, 1975.

'Statistical significance' is commonly tested in biologic research when the investigator has found an impressive difference in two groups of animals or people. If the groups are relatively small, the investigator (or a critical reviewer) becomes worried about a statistical problem. Although the observed difference in the means or percentiles is large enough to be biologically (or clinically) significant, do the groups contain enough members for the numerical differences to be 'statistically significant'?

For example, if Group A has a mean of 9.8 units and Group B has a mean of 17.3 units, the difference may be biologically impressive because the second mean is almost twice as large as the first. On the other hand, if the two groups each contain only a few members, or if the data are widely dispersed around the mean values, our biologic impression may not be sustained numerically. The statistical assessments may show that the observed difference could quite easily have arisen by chance alone.

The statistical procedures used to test the numerical 'significance' of an observed difference between two groups have been discussed in several previous installments of this series.⁴,⁶ The calculations used for the procedures depend on the kind of basic data in which the results were expressed. For dimensional data, the results would be cited as means and the usual statistical procedure would be a t test. For nominal or existential data, the results are expressed as frequency counts that are converted to proportions, percentages, or rates; and the usual statistical procedure would be a chi-square test. (To avoid making unproved assumptions about the distribution of a hypothetical parent population, we can replace the t test by a Pitman permutation test and the chi-square test by a Fisher exact probability test.) If the data are expressed in ranked ordinal values, the usual statistical procedure would be the Wilcoxon rank sum test or the Mann-Whitney U test.

Although each of these statistical tests is chosen according to the type of data under examination, the underlying statistical strategy is identical. It follows the same principle that was used to prove theorems in elementary school geometry. We assume that a particular conjecture is true. We then determine the consequences of that conjecture. If the consequences produce an obvious absurdity or impossibility, we conclude that the original conjecture cannot be true, and we reject it as false.

When this reasoning is used for the statistical strategy that is called "hypothesis testing", the argument proceeds as follows. We have observed a difference, called $\delta$ (delta), between Groups A and B. To test its 'statistical significance', we assume, as a conjecture, that Groups A and B are actually not different. This conjecture is called the null hypothesis. With this assumption, we then determine how often a difference as large as $\delta$, or even larger, would arise by chance from data for two groups having the same number of members as A and B. The result of this determination is the P value that emerges from the statistical test procedure.
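The determination described above (how often a difference as large as $\delta$, or larger, arises by chance in two groups of the same sizes) can be sketched with a random-relabelling approximation to a permutation test. This is only an illustrative sketch, not the procedure used in the series: the function name, the number of resamples, and the data values are hypothetical, with the data chosen so that the group means equal the 9.8 and 17.3 units of the example.

```python
import random


def permutation_p_value(group_a, group_b, n_resamples=10000, seed=0):
    """Estimate how often a mean difference at least as large as the observed
    one appears when the group labels are reassigned at random, i.e. when the
    null hypothesis of 'no difference' is assumed to be true."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    observed = abs(sum(group_b) / len(group_b) - sum(group_a) / n_a)

    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random reassignment of members to the two groups
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point rounding in the sums.
        if abs(mean_b - mean_a) >= observed - 1e-12:
            hits += 1
    return hits / n_resamples


# Hypothetical data: Group A has mean 9.8 units, Group B has mean 17.3 units.
group_a = [8, 9, 10, 11, 11]
group_b = [15, 16, 17, 18, 20.5]
p_value = permutation_p_value(group_a, group_b)
```

The sketch counts differences in either direction, so the estimate corresponds to a two-sided P value. With groups this small, every possible relabelling could be enumerated instead of sampled.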

At this point in the reasoning, the statistical strategy departs from what was used to prove theorems in grade school geometry. In geometry, there were no problems in deciding whether or not to reject the assumed conjecture, because the geometrical logic regularly brought us to a situation that was impossible, i.e., the P value was zero. In such a circumstance, the original conjecture could not be maintained. It had to be wrong because it could not possibly be right.

With statistical inference, however, the P value can seldom, if ever, be so conclusive. The P value that emerges from the calculations in the statistical test may be as small as .000001, or even smaller, but it never becomes zero. There is always a possibility, however infinitesimal, that the observed difference arose by chance alone. Accordingly, unlike the situation in geometry, we cannot use a statistical test to prove with total certainty that the original conjecture is wrong. There is always a chance of magnitude P, that the original conjecture (i.e., the null hypothesis) is right.

To draw statistical conclusions, therefore, we must establish a concept that was not necessary for the inferential reasoning of grade school geometry. This concept is called an $\alpha$ (alpha) level of "significance". It is used to demarcate the rejection zone. If the P value that emerges from the statistical test is equal to or smaller than $\alpha$, we decide that we shall reject the null hypothesis. In doing so, we demarcate $\alpha$ as the risk of being wrong in this conclusion, but it is a risk we must take in order to have a statistical mechanism for drawing conclusions. In geometrical inference, $\alpha$ was always zero. In statistical inference, $\alpha$ is customarily chosen to be .05, i.e., 1 in 20, although some investigators (or editors) may select other boundaries such as .1 or .01.

A previous paper⁶ of this series contained a discussion of the arbitrary way in which .05 became designated as the customary level of $\alpha$. The designation came, not as a pronouncement of the Deity or from the deliberations of an international committee, but from a habit of R. A. Fisher. Noting that a conclusion had to be drawn after a statistical test was performed, and knowing that an $\alpha$ level was necessary to draw the conclusion, Fisher chose $\alpha$ to be .05. The rest of the statistical world followed.

In using .05 or whatever other $\alpha$ level is selected as the boundary for the rejection zone, an investigator specifies the deliberate chance that he wants to allow of being wrong when he decides that the observed difference in his two groups is real. There will exist, however, a probability of 1 in 20, or 1 in 50, or whatever the $\alpha$ level is, that the null hypothesis is correct, that the observed difference in the groups has arisen simply by chance, and that the conclusion is wrong.

A. Statistical reasoning and diagnostic analogies

1. $\alpha$ level and 'diagnostic specificity'. This reasoning is regularly applied in a way that makes the statistical appraisal of 'significance' resemble a clinical diagnostic test. The $\alpha$ level is thus analogous to the risk of getting a false positive result in a diagnostic test.³ Suppose we make a diagnosis of lung cancer after finding a positive result in the Pap smear of a patient's sputum. If the patient does in fact have lung cancer, the diagnostic decision is correct: a true positive. If the patient does not have lung cancer, the diagnosis is wrong: a false positive conclusion. In the situation of hypothesis testing, we want to make a positive decision, rejecting the null hypothesis and concluding that the observed difference is real. The $\alpha$ level indicates the statistical risk that this decision may be wrong and that there is actually no difference between the groups. The value $1 - \alpha$ can therefore be likened to the specificity of a diagnostic test, which is the likelihood that the test will have a negative result when the disease is absent. The value of $1 - \alpha$ denotes the likelihood of being correct when we do not reject the null hypothesis and thereby conclude that the observed difference is not 'statistically significant'.

The kind of reasoning used in forming a 'null hypothesis' and in establishing levels of $\alpha$ and $1 - \alpha$ is based on the idea that the 'disease' we are looking for, i.e., a real difference between the groups, is absent. The chance of a false positive diagnosis is $\alpha$; and the chance of a true negative diagnosis is $1 - \alpha$. Consequently, if we set $\alpha$ at .05, we take a 5% chance of being wrong if we reject the null hypothesis (i.e., draw a positive conclusion) and a 95% chance of being right if we concede the null hypothesis (i.e., fail to draw a positive conclusion).

ent population, these

This analogy to a false positive result and to the specificity of a diagnostic test can

q and

make

does not appear

in

phrase b) which a level

b.

nomenclature

cal

distribution of a

I

logic that customarily goes into hypothesis test-

the

To an

investigator, the test

logicall)

reason—

to

is

I

a positive event, analogous result in a diagnostic test

tance or rejection of the null hypothesis

intellectual

Thus,

virtues.

Type

nitions, a

I

for statistical

defi-

error consists of rejecting the

when it is actually true. The calculation of PA To apply these principles requires the calculation of a P value

null

hypothesis

2.

ence.

observed data and the observed differ-

8.

This particular P value, which

conventional one usually cited ture, will

in

medical

is

the

litera-

be designated here as P A to distinguish

from other P values

The procedures used

that will be

discussed

for calculating

later.

P A are pre-

sented in detail in textbooks of statistics and will

if

ble

ratio'*

or "z-score". For any single

value randomly chosen from a Gaussian

distri-

bution whose constituent values are x,, x 2 etc., a critical ratio

-

can be calculated as z

p)lcr. In this formula,

parent distribution;

and

Xj is the single

p. is

cr is its

the

mean

,

=

x3

.

(Xj

of the

standard deviation;

value with which

we

are con-

common

values, x and y, happen

If

1.

we can

p_>.

calculate

variance as follows:

we assume

that the population values

= p 2 the common mean for both samples is The p = (np, + np 2 )/2n = (p, + p 2 )/2. common variance of the difference in propor- p)/n. tions is 2p( for pi

.

I

2.

If

we do

values for p,

assume

not

=

p2

calculated as [p,(

I

.

that the

the

common

p,)

+

-

p2

(

The corresponding z val ues cases would be (p, - p 2 )/V'2p(

+

and

instance,

first

-

-

(p,

-

1

population variance

p2

for thes e I

-

is

)]/n.

p)/n

p 2 )/V[p,(

-

1

two

in the

p,)

p»)]/n in the second. In general, the

p formula for calculating the z value of a differ2(

ence

1

in

means

is

means

"standard error" of the difference c.

Regardless of whether a particular

critical

mean or for a difference in means, the z values have some distinctive, important properties. If we drew a large series of samples, consisting of single means or ratio

of

z is

calculated for a single

differences in means, and

were themselves

large,

and

if if

the sample sizes

a z score

were

cal-

culated for each sampling, the array of z values

cerned. If the data under consideration consist of

would approximate

means of samples, each of which has n members randomly drawn from the same par-

tion,

a series of

a

assume

not

w equals the sum

o\'

difference in

The simplest and most generally applicastatistical strategy rests on the idea of a

"'critical

mean

the

be summarized as follows: a.

we do

If

variance

be proportions, p, and

the

.

for the

it

particular,

sel-

is

negative or positive

associated with any

we assume, by

It

of the variance of x and the variance of y. In

to

however, the accep-

This

have a

population variances for x and y,

o\'

posm\e

y.

will

and y are equal, we can create

\

common

In general statistical usage,

dom

ances oi

then the

null hypothesis) as to gelling a

two ways.

o\'

equality

doubly negative phenomenon (rejection of the

—

x

own mean,

its

in the

seek the

hypothesis, that the population vari-

null

statisti-

is

=

w

variable,

"pooled variance" for w.

thai a bio-

also

new

he investigator thus regards a

impressive difference

cally impressive

done

usuall)

demonstrate

we

of two samples,

standard deviation that can be cal-

one

in

=

x, is z

analyze a difference

to

y,

which has

"common"

waj

culated

tor a positive

and

\

variable,

for the statisti-

investigators jn^\ statisticians use the inverted

ing.

we want

usually called Type

the difference in the

is

If

means.

is

any of these means,

n).

the general statistical

main reason

error. Probabl) the

- p)/{

too large

is

= 76%. We

23)

.15.] After

5

HA

P

1.08 and

[Doing the statistical procedure, we would choose p =

can be determined for any

327

side of statistical significance

"significant'.

Thus, for an observed value

a standard error.

of p 2

and the other

can represent an assigned

it

or for

size

the

error

= V(n 2 p iqi +

probability of falsely rejecting the alternative

nip 2 q 2 )/(n!n 2 ). For most practical purposes,

hypothesis.

we can assume

2. a.

of calculation for P B Equal sample sizes. Suppose an inves-

Illustration

tigator,

comparing the

tion with

that the rate at

rates of patient satisfac-

medical care

Hospital

A

at

and

84%

(19/23)

The investigator concludes is

not

two hospitals

of satisfaction was

statistically

70% at

finds

(16/23)

Hospital B.

that the difference

'significant'

because chi-

that VNpq = Vn 2 p7q7 + formula for finding beta error The n,p 2 q 2 - 8) Vnma/Npq. = (A would then be z B 3. The 'power curve' of a statistical test. .

If we want to operate a statistical test at a fixed level of $\alpha$ for making decisions, we can readily determine the values of $\beta$ that will be associated for any choice of $\alpha$. In this case, the chosen level of $\alpha$ will determine an assigned value of $z_c$, which will be the location of point c in Fig. 3. Since we previously developed the formula that $z_B = z_\Delta - z_c$, we can substitute the assigned value of $z_\alpha$ for $z_c$ and get

$$z_\Delta = z_\alpha + z_B.$$

Since $z_\Delta$ has a fixed value that depends on the magnitudes of $\Delta$, $n_1$, $n_2$, $N$, $p$, and $q$, the values of $z_\alpha$ and $z_B$ must sum to a constant. If one increases, the other decreases. Consequently, if higher values are assigned to $z_\alpha$, $z_B$ must decrease; and vice versa. This reciprocal relationship is analogous to what we have noted in a previous discussion³ of sensitivity and specificity for a diagnostic test.

This reciprocal aspect of the equation allows various statistical tests to be illustrated with "power" curves, which show the values of $1 - P_\beta$ that will occur with different choices of $\alpha$. For example, consider the situation where the true values of the compared rates are $p_1 = 27\%$ and $p_2 = 45\%$, so that $\Delta = p_2 - p_1 = .18$ and $v = 2p(1 - p) = 0.46$. If we take a sample size of 80 for each group in a comparison, we will have

$$z_\Delta = \Delta\sqrt{n/v} = (.18)\sqrt{80/.46} = 2.37.$$

If we decide to reject the null hypothesis at a two-sided $\alpha$ level of .05, we would have $z_B = 2.37 - 1.96 = 0.41$. The associated $P_B$ value would be .341 and the 'power' of the test would be 65.9%. If the null hypothesis were to be rejected at a two-sided $\alpha$ level of .1, $z_B = 2.37 - 1.645 = 0.73$, and the associated $P_B$ value would be .233, giving the test the higher 'power' of 76.7%.

C. The calculation of a 'doubly significant' sample size

In the foregoing discussion, we worked on the assumption that the research was complete. We had our data; we had determined or assigned the level of $P_A$ or $P_\alpha$; and we wanted to know what $P_B$ might be. A different application of these concepts occurs for the modern calculation of a 'doubly significant' sample size, which means that we want a sample large enough to be significant at the levels of both $\alpha$ and $\beta$. In the previous calculations, we began with known data for everything except $z_B$ and we solved for $z_B$. Now we begin by knowing (or assuming) all the necessary information except n, and we solve the equation for n.

1. Simplified procedure. The simplest approach for these calculations is to take the previously developed formula and to substitute the assigned values of $z_\alpha$ and $z_\beta$ for their respective counterparts, $z_c$ and $z_B$. We then would have

$$z_\alpha + z_\beta = \Delta\sqrt{n/v}.$$

If we square both sides and solve for n, we get

$$n = \frac{v(z_\alpha + z_\beta)^2}{\Delta^2}.$$

For example, suppose we expect that $p_1 = .50$ and $p_2 = .70$, so that $\Delta = .20$, and that we want to attain statistical significance for a 2-sided $\alpha$ level of .05 and a one-sided $\beta$ level of .05. What size should our sample be? We have assigned $z_\alpha = 1.96$ and $z_\beta = 1.645$. From previous calculations, we know that v is either 0.48 or 0.46. Let us call it 0.47. Substituting directly into the cited equation, we get

$$n = \frac{(0.47)(1.96 + 1.645)^2}{(.20)^2} = 152.7.$$

We would thus need 153 patients in each group, for a total sample size of 306 patients.

2. Stricter procedure. In a mathematically stricter set of calculations, we make provision for the fact that the value of $z_\alpha$ is determined using the null hypothesis, whereas the value of $z_\beta$ depends on the alternative hypothesis. Thus, the point c in Fig. 3 will define an $\alpha$ level of

$$z_\alpha = \frac{c}{\sqrt{2p(1 - p)/n}}.$$

At this same point c,

$$z_\beta = \frac{\Delta - c}{\sqrt{[p_2(1 - p_2) + p_1(1 - p_1)]/n}}.$$

If we solve the first of these two equations for c, and substitute the results into the second, we get

$$z_\beta = \frac{\Delta - \left(\sqrt{2p(1 - p)/n}\right)(z_\alpha)}{\sqrt{[p_2(1 - p_2) + p_1(1 - p_1)]/n}}.$$

This equation becomes

$$\frac{1}{\sqrt{n}}\left\{z_\beta\sqrt{p_2(1 - p_2) + p_1(1 - p_1)} + z_\alpha\sqrt{2p(1 - p)}\right\} = \Delta.$$

Squaring both sides and solving for n, we get

$$n = \left(\frac{1}{\Delta}\left\{z_\alpha[2p(1 - p)]^{1/2} + z_\beta[p_2(1 - p_2) + p_1(1 - p_1)]^{1/2}\right\}\right)^2.$$

This formula, which is less formidable than it looks, is the one that regularly appears in many statistical discussions⁸,¹¹,¹³ of sample size.
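The arithmetic of the 'power' illustration and of the simplified sample-size formula can be checked with a short sketch. The function names are hypothetical, the standard Gaussian distribution is taken from Python's `statistics.NormalDist`, and v is assumed to be $2p(1 - p)$ with p the average of the two rates, as in the example above.

```python
from math import ceil, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def power_two_proportions(p1, p2, n, alpha_two_sided=0.05):
    """Return (z_delta, z_b, p_b, power) for two groups of n members each,
    using z_delta = delta * sqrt(n / v) and z_b = z_delta - z_alpha."""
    p_bar = (p1 + p2) / 2
    v = 2 * p_bar * (1 - p_bar)  # assumption: the 'common variance' form of v
    z_delta = abs(p2 - p1) * sqrt(n / v)
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha_two_sided / 2)
    z_b = z_delta - z_alpha
    p_b = 1 - STANDARD_NORMAL.cdf(z_b)  # one-sided tail beyond z_b
    return z_delta, z_b, p_b, 1 - p_b


def simplified_sample_size(delta, v, z_alpha, z_beta):
    """Simplified formula: n = v * (z_alpha + z_beta)**2 / delta**2 per group."""
    return v * (z_alpha + z_beta) ** 2 / delta ** 2


# Rates of 27% and 45% with 80 members per group, two-sided alpha of .05
z_delta, z_b, p_b, power = power_two_proportions(0.27, 0.45, 80, 0.05)

# Expected rates of .50 and .70, v taken as 0.47, z_alpha = 1.96, z_beta = 1.645
n_per_group = ceil(simplified_sample_size(0.20, 0.47, 1.96, 1.645))
```

Because the sketch carries unrounded z values, its results may differ from the hand-calculated figures in the last decimal place.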

In the numerical example just cited, the actual values are

$$n = \left(\frac{1}{.20}\left\{1.96[2(.60)(.40)]^{1/2} + 1.645[(.70)(.30) + (.50)(.50)]^{1/2}\right\}\right)^2 = \frac{1}{.04}\left\{1.96[.69] + 1.645[.68]\right\}^2 = \frac{1}{.04}[1.35 + 1.12]^2 = \frac{(2.47)^2}{.04} = 153.$$

The result is identical to what we obtained with the previous set of simplified calculations.

3. Use of statistical tables. These computations can be avoided if we make use of appropriate sets of prepared tables. A particularly good collection of tables, showing sample size for different values of $\alpha$, $\beta$, $\Delta$, and $p_1$, is contained in Table A-3 (pages 176-194) of the excellent textbook by Fleiss.⁷ For example, if $\Delta$ is 5%, if $\alpha$ (two-sided) is .05, and if $\beta$ (one-sided) is .05, Fleiss' Table A-3 shows that the size of n will range from 796 if $p_1 = 5\%$, to 2669 if $p_1 = 50\%$. If $\Delta$ is 20%, under the same conditions of $\alpha$ and $\beta$, the size of n will range from 99 if $p_1 = 5\%$ to 172 if $p_1 = 50\%$. The values of n in Fleiss' tables are higher than those calculated in the preceding illustrations here because Fleiss' computations include a 'correction for continuity', analogous to that of the Yates' 'correction' in chi-square tests. Using a formula derived by Kramer and Greenhouse,⁹ the n values in our calculations can be converted to the n' values cited by Fleiss. The formula is

$$n' = \frac{n}{4}\left[1 + \sqrt{1 + (8/n\Delta)}\right]^2.$$

4. Differences in sample-size methods. We can now note the difference between calculating a 'singly' and a 'doubly significant' sample size. In the classical old formula for a 'singly significant' sample,

$$n = \frac{z_\alpha^2 v}{\Delta^2}.$$

For the 'doubly significant' sample,

$$n = \frac{(z_\alpha + z_\beta)^2 v}{\Delta^2}.$$

The change to 'double significance' thus increases the sample size by a ratio of $[(z_\alpha + z_\beta)/z_\alpha]^2$. If we choose equal values for our levels of $\alpha$ and $\beta$, the right-hand ratio of z values will be $(1 + 1)/1$, and the 'doubly significant' sample will be $(1 + 1)^2 = 4$ times as large as the first.

To illustrate this point, suppose we want to achieve a one-sided significance level of .05 in a clinical trial where the expected rates of success are 10% in the control group and 20% in the treated group. From these data, $p_2 = .20$, $p_1 = .10$, $\Delta = .10$, and v is estimated as $(.20)(.80) + (.10)(.90) = 0.25$. [The alternative estimation for v would be $(2)(.15)(.85) = 0.26$.] For the 'one-sided' calculations of sample sizes, $z_\alpha = 1.645$. For single-level significance, we would have

$$n = \frac{z_\alpha^2 v}{\Delta^2} = \frac{(1.645)^2(.25)}{(.10)^2} = 67.65$$

and we would need 2 x 68 = 136 patients altogether. To calculate sample size for both $\alpha$ and $\beta$ levels of significance, we would have

$$n = \frac{(z_\alpha + z_\beta)^2 v}{\Delta^2} = \frac{(1.645 + 1.645)^2(.25)}{(.10)^2} = 270.6.$$

We would therefore need 2 x 271 = 542 patients, an amount that is about four times larger than before.

Two other important features to be noted about these formulas are the crucial roles of $\Delta$ and v. Since $\Delta$ is always a value between 0 and 1, the value of $\Delta^2$ is always smaller than $\Delta$. (For example, if $\Delta = .3$, $\Delta^2 = .09$.) Furthermore, the smaller the value of $\Delta$, the smaller will be the value of $\Delta^2$ and the larger will be the corresponding value of $(1/\Delta^2)$, which is used as a factor in determining n. Thus, the smaller the difference for which we want to show 'statistical significance', the larger is the sample size that is required. In fact, if we want to prove the null hypothesis exactly, and to show that $p_1$ and $p_2$ are absolutely identical, we would need a sample of infinite size because $\Delta = 0$.

Since v appears in the numerator of the factors that are multiplied to calculate n, the size of n will decrease as v decreases. The value of v, being dependent on $2p(1 - p)$, will be at a maximum when $p = 50\%$ and will take minimum values near the polar extremes of 0% or 100%. Thus, if p is close to 0% or to 100%, v will be small and n will be correspondingly small.
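As a rough numerical check on the stricter formula, the continuity correction, and the 'singly' versus 'doubly significant' comparison, the following sketch recomputes the figures quoted in this section. The function names are hypothetical. The z values of 1.96 and 1.645 are those assigned in the text, and the correction is coded as $n' = (n/4)[1 + \sqrt{1 + 8/(n\Delta)}]^2$, the form attributed in the text to Kramer and Greenhouse.

```python
from math import sqrt


def strict_sample_size(p1, p2, z_alpha, z_beta):
    """Stricter formula: z_alpha is scaled by the null-hypothesis variance
    2p(1-p), z_beta by the alternative-hypothesis variance p1*q1 + p2*q2."""
    delta = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ((z_alpha * null_sd + z_beta * alt_sd) / delta) ** 2


def continuity_corrected(n, delta):
    """Convert an uncorrected n to the continuity-corrected n'."""
    return (n / 4) * (1 + sqrt(1 + 8 / (n * delta))) ** 2


def z_formula_sample_size(delta, v, z_alpha, z_beta=0.0):
    """n = (z_alpha + z_beta)**2 * v / delta**2; z_beta = 0 gives the
    'singly significant' size, a nonzero z_beta the 'doubly significant' one."""
    return (z_alpha + z_beta) ** 2 * v / delta ** 2


# Rates of .50 and .70: about 153 per group, about 172 after correction
n_strict = strict_sample_size(0.50, 0.70, 1.96, 1.645)
n_corrected = continuity_corrected(n_strict, 0.20)

# Rates of .10 and .20, one-sided alpha and beta of .05, v = 0.25
n_single = z_formula_sample_size(0.10, 0.25, 1.645)
n_double = z_formula_sample_size(0.10, 0.25, 1.645, 1.645)
```

Under these assumptions the corrected values fall within one unit of the tabulated figures of 796, 2669, 99, and 172 quoted above, which suggests, but does not prove, that the tables were built from the same formulas.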

Mathematical mystiques and

statistical strategies

the other hand, when p is very close extreme, a large value for A may be a polar

On

small. to

The

difficult or unfeasible to obtain.

extremely

cology

.

occurs for

of the bioequivalence of

tests

two pharmaceutical preparations.

In such cir-

A

cumstances, we assign a value of

as the maxi-

advantage of a very high or very low value for p ma) thus be completely obliterated by the asso-

mum permissible difference between the groups.

ciated disadvantages of a very small value for A.

shall

If

the observed difference

equivalent.

The importance

D.

Because

fi

error

scientific research

showing

at

of

that

investigators

two

chosen for

usuall) directed

is

most

entities are different,

depend on

The values oi P B

v

are

generally omitted, cither because the investiga-

unaware o(

is

tor is

attention to the possibility lent

to setting

value of

The absence

'power' of the

o\' fi

error

= =

value of z^

the

one-sided

the

/,;.

because he

their existence or

not concerned about them.

test is

P|,

- P„

I

or

is

o\'

equiva-

As noted

A and

mining sample

routine studies

a and

cited earlier, for

84

2 1 )'

(

can emerge from

problem

both equal to .05 and

fi

from

according to the values of p,. Since sample sizes of this magnitude will usually be to 143,

For

and the

unfeasible, the values of

50%.

In other

be

made

to

t).

a and

quite liberal. Thus,

committing the

15, the size

if

may have

f3

a

(two-sided)

(one-sided)

if (3

is

to is

increased

of n will vary as follows:

rejecting the alternative hypothesis

Most investigators accept

this risk

with equa-

nimity, since their main concern in the custo-

mary

significance' testing

situation o\

with a false positive conclusion. There

error

A

5%

5%

Pi

57c

509c

20% 5%

20% 50%

n

254

72S

38

57

with

is

These sample

sizes, although smaller than be-

two major scientific circumstances, however, in which the role of fi error becomes

fore, are

still

patients

whose data have been customarily

particularly important.

amined for studies of

are at least

The trial

of these circumstances

first

which we want

in

satisfactory

when

to

a clinical

is

be sure of having a

a substantial A we fail to find "statistical siga level, we might like to be

chance of detecting

exists. If

it

nificance' at the

reasonably confident about accepting the null

This

hypothesis.

strategy

responsible

is

sample-size calculations that culminate phrases as

909c chance of finding a

"'a

in

20%

for

such dif-

ference at the .05 level". In this phrase, the associated

=

in

example

the size of n could range

.

increased to 0.2 and

a

that

0.

false negative error of incorrect-

in deter-

f3

bioavailability. In an

o\'

words, the investigator takes a 50-50 chance of

ly

a and

size.

The high values of n

A =

is

p, will be as

these calculations will be a major

for

we

the value that

earlier,

magnitude of

the

.5;

this

smaller than A,

important as the choices of

statistical tests that pro-

vide values onlj for P

is

conclude that the groups are essentially

statistical

.05 and (one-sided)

values (3

—

are .

A =

a

.20,

1.

substantially larger than the 6 to 10

availability research

over" manner,

in

ex-

bioavailability. If the bio'

conducted

is

in a

"cross-

one group of patients rather

than two groups, the paired arrangement of data will

permit a further reduction

Nevertheless,

come demanded

size.

standards be-

for studies of bioequivalence,

the problems of obtaining

people for the

sample

in

statistical

if strict

tests

may

ample numbers of

be so formidable that

the studies will be impossible to conduct. Just as the old calculations of

made no

provision for

tions of

sample size of

f3

PA

a

for

error alone

new calculaalone may have

error, the

(3

error

j

The second (and perhaps more important) role of to

error

{3

show

that

different.

is

in the situation

two groups

An example

clinical trial

12 -

15

of such a situation

whose conclusion was

is

a

that the

quality of primary care provided by nurse prac-

what

is

by physicians. Another example, which creasingly

common

be done without consideration of

a

error.

where we want

are similar rather than

titioners is essentially equal to

to

situation in clinical

offered is

an

in-

pharma-

E. Caveats and abuses

A knowledge of β reasoning and the 'other side' of 'significance' can lead to prompt detection of a classical abuse in the way that statistical tests are often reported in medical literature. The β reasoning has also been applied to create new problems in the calculation of sample size.

1. Conclusions when the null hypothesis is conceded. In a routine statistical test of 'significance', what conclusion do we draw if we have found that the Pα value is higher than α? The correct answer to this question is that such a high P value makes us concede, i.e., fail to reject, the null hypothesis. With this concession, we conclude that the observed difference is statistically not significant. The wrong answer to the question is that we accept the null hypothesis and conclude that the difference is insignificant.

The distinctions between concede and accept and between not significant and insignificant can be clinically illustrated by recalling the purpose of the sputum Pap smear as a diagnostic test. We order the test in search of a positive diagnosis of lung cancer. If the test is negative, however, we cannot conclude that lung cancer has been ruled out. We would merely concede that we have failed to show its presence. To accept the negative diagnosis that lung cancer is absent, i.e., to rule it out, we would want to check results from additional tests, such as the chest X-ray.

Consequently, in a simple test of 'statistical significance', a high P value is like the Scottish verdict of not proved. When Pα exceeds α, we neither reject nor accept the null hypothesis. We concede it or fail to reject it. Our conclusion must therefore be that the observed difference is not significant, rather than insignificant. To conclude that it is insignificant, we would have to accept the null hypothesis, a decision that would require additional evidence for the possibility of β error.

One of the main abuses of tests of statistical significance occurs, therefore, when an investigator who gets a high Pα value, i.e., Pα > α, concludes that the observed difference between two groups is 'insignificant' and that the groups should be regarded as similar. If this type of reasoning were correct, we could always 'prove' that two treatments were identical, merely by using a small sample size for the study. Thus, if we put 3 patients in each group, a result as extreme as 0/3 (0%) successes for treatment A vs. 3/3 (100%) for treatment B could still not achieve 'statistical significance'. (The two-sided P value is .1.) From this failure to attain 'statistical significance', it would be absurd to conclude that the observed difference is insignificant and that the two treatments are equivalent. Nevertheless, such errors regularly appear in medical literature.

The previous example of satisfaction with care at two hospitals provided an illustration of the erroneous conclusions that can occur when Pα > α. The investigator wanted to claim that the satisfaction was similar at the two hospitals, but we would not accept his claim because it had a 'power' of only 71%. In fact, if our original sample size was quadrupled, and if the proportion of successes remained the same, the resulting numbers would be 64/92 vs. 76/92 and the difference would be statistically significant at P < .05 even though the observed δ was only 13%.

Even when the two contrasted results seem quite similar, we still cannot conclude that their difference is insignificant. For example, suppose the success rate is 3/7 (43%) for treatment A and 4/9 (44%) for treatment B. This result seems unimpressive, but it could readily arise by chance if the true success values for treatments A and B were, respectively, 29% and 56%. Thus, if we exchanged one success and one failure in the patients comprising groups A and B, we would get success rates of 2/7 (29%) for A and 5/9 (56%) for B. This difference is impressive although not statistically significant.

An analogous problem occurs when statistical tests are done to determine whether the act of randomization provided an equitable distribution of the patient groups before treatment began in a clinical trial. When good grounds exist for suspecting baseline inequalities, a high Pα value cannot alone be accepted as confirmation of their absence. The analysis is incomplete unless attention is also given to Pβ. (A memorable example of such omissions occurred in analyses [1, 17] of the celebrated UGDP study of diabetes. When statistically significant differences were not found in certain analyses of baseline distinctions, the data analysts concluded that the baseline differences were insignificant, although no levels of β-error were cited.)

The point to be borne in mind is that an ordinary test of 'statistical significance' can be used only to reject the null hypothesis, not to accept it. The test either shows or does not show a 'significant' difference. It cannot show an 'insignificant' difference. To draw the latter conclusion, we would need to know the Pβ value for the possibility of the other kind of error.
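The arithmetic behind the small-sample and quadrupled-sample examples can be checked with a short sketch. It rests on assumptions that the text does not state: the Fisher exact test is used for the 0/3 vs. 3/3 comparison, an uncorrected chi-square test (1 degree of freedom) is used for the hospital example, and the "original" hospital counts of 16/23 vs. 19/23 are inferred simply by dividing the quadrupled counts by four. The function names are hypothetical.

```python
from math import comb, erfc, sqrt


def fisher_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact P value for a/n1 vs. b/n2 successes."""
    successes = a + b
    denom = comb(n1 + n2, successes)

    def prob(k):
        # Hypergeometric probability that the first group holds k successes.
        return comb(n1, k) * comb(n2, successes - k) / denom

    observed = prob(a)
    k_min = max(0, successes - n2)
    k_max = min(n1, successes)
    # Add up every possible table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(k_min, k_max + 1))
               if p <= observed * (1 + 1e-9))


def chi_square_p(a, n1, b, n2):
    """Uncorrected chi-square (1 degree of freedom) and its P value."""
    c, d = n1 - a, n2 - b          # failures in each group
    n = n1 + n2
    chi2 = n * (a * d - b * c) ** 2 / (n1 * n2 * (a + b) * (c + d))
    return chi2, erfc(sqrt(chi2 / 2))


p_tiny = fisher_two_sided(0, 3, 3, 3)           # 0/3 vs. 3/3
chi2_16, p_16 = chi_square_p(16, 23, 19, 23)    # inferred original counts (assumption)
chi2_64, p_64 = chi_square_p(64, 92, 76, 92)    # quadrupled counts from the text

print(f"0/3 vs 3/3: two-sided P = {p_tiny:.2f}")
print(f"16/23 vs 19/23: chi-square = {chi2_16:.2f}, P = {p_16:.3f}")
print(f"64/92 vs 76/92: chi-square = {chi2_64:.2f}, P = {p_64:.3f}")
```

Under these assumptions the same 13% difference moves from "not significant" to P < .05 only because the groups are four times larger, and the chi-square value is exactly quadrupled. A continuity-corrected test would give somewhat different P values.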

2. Problems in calculating sample size. With the increasing performance of controlled clinical trials, many alternative strategies have been proposed for determining sample size. So many different proposals have been made, in fact, that contriving new ways to gauge sample size seems to have become a favorite indoor sport of statistical theoreticians. The various strategies include 'play-the-winner' techniques and alternative schemata based on sequential analysis and Bayesian conjectures. Schneiderman [14] has provided a well-written summary of the state of the art in some of the statistical ideologies. For practical purposes, the material presented here has been based on the currently accepted "conventional wisdom" of Neyman-Pearsonian strategies.

Like many other statistical activities, a preoccupation with the mathematical tactics of determining sample size may often distract both statisticians and investigators from basic challenges that are the really fundamental issues in scientific research. In order to calculate a sample size, we often ignore these issues and assume that they have been taken care of. After the sample size calculations have yielded a number, the clinical investigator may become awed by the precision of the number ('you will need exactly 984 patients') or flustered by its magnitude ('how the devil can I possibly get so many?'). As negotiations about the number and its reduction become the focus of attention, the clinician and statistician may forget that the basic scientific problems remain unresolved. Among them are the following:

a. The univariate choice of an endpoint. To determine p1, p2, and Δ, we must choose a single variable whose outcome will be the "endpoint" in the research. This concentration on only one kind of outcome is contrary to every tenet of good clinical investigation, which calls for an appraisal of the multitude of variables that are involved in a patient's responses to treatment. Nevertheless, there currently exist no satisfactory biostatistical procedures for either choosing sample size in a multivariate manner or preparing a clinically effective composite of important multiple variables into a single univariate index.

b. The focus on hard data. Since everything in the sample size calculations depends on the endpoint noted in a single variable, statisticians usually want to be sure that this endpoint is an item of 'hard data', such as death. Since changes in death rates are usually smaller than the changes that can occur in important 'soft data' variables, the result of the focus on hard data is to create a relatively small value of Δ, which may lead to excessively large values for the calculated sample size. A more important consequence of the hard-data focus is that an important soft-data variable, such as vascular complications or quality of life, may become ignored in the early stages of biostatistical planning for the trial and may remain ignored (or poorly managed) thereafter. Because of this inattention to 'soft data', the most important clinical and human aspects of therapy (the associated risks, benefits, costs, joys, and sorrows of treatment) often become grossly neglected in the research [16].

c. The current uninformed choice of p1. Because good data are seldom available for 'historical controls', the choice of p1 (as an estimate of the outcome rate for the control group) becomes an act of guesswork that often turns out to be erroneous. If the error leads to a huge overestimate of sample size, the trial becomes excessively expensive.

d. The future uninformed choice of p1. The estimate of a single value of p1 has no real clinical precision. What is usually needed, instead, is a series of p1 values, one for each of the cogent clinical strata of patients subjected to therapy. If the data of large-scale randomized therapeutic trials are not analyzed with a cogent clinical stratification, however, the results of a current trial cannot provide a good estimate of p1 values for use in future trials. Today's expensive, unproductive therapeutic trial may thus be followed by tomorrow's.

e. The arbitrary choices of α and β. Despite the elaborate reasoning that has been discussed for choosing α and β, their values are seldom selected in the abstract intellectual manner described here. What often happens is that the statistician and investigator decide on the size of Δ. The magnitude of the sample is then chosen to fit the two requirements (1) that the selected number of patients can actually be obtained for the trial and (2) that their recruitment and investigation can be funded. The values of α and β are then adjusted to fit this number and a suitable mathematical rationale is then developed for presentation to the granting agency.

f. The arbitrary choice of Δ. As the magnitude of difference that indicates 'clinical significance', the magnitude of Δ is often the most crucial issue in both planning and evaluating the research. Despite this importance, the scope of Δ is not only constrained by the univariate restrictions noted earlier, but it also gets chosen arbitrarily. Judgments about the proper size of Δ have received almost no concentrated attention via symposia, workshops, or other conclaves of experts assembled to adjudicate matters of clinical importance. In the absence of established standards, the clinical investigator, on being badgered by the statistician to choose a Δ so that sample size can be calculated, picks what seems like a reasonable value. This value gets tossed into the formula, using zα, zβ, etc. If the sample size that emerges is unfeasible, Δ gets adjusted accordingly, and so do α and β, until n comes out right.

In some brave new world of the future, when clinicians begin to insist that large-scale therapeutic trials be truly clinical investigations as well as elaborate exercises in mathematics, better solutions may be developed for these clinical and scientific problems. In the meantime, clinical investigators can take comfort in knowing about the panacea-like marvels offered by modern statistical methods for determining α-error, β-error, and sample size. Even if we don't know what we're doing and even if we can't specify it, repeat it, or make good clinical sense out of it, we can still calculate the required populational numbers and determine the probabilistic uncertainties in going both directions. Not since the days of alchemy have scientists been able to rely on such dazzling logical transmutations.

*For my education in these concepts and for other helpful comments on this text, I am indebted to several clinical and statistical colleagues: Donald Archibald, Robert Deupree, Michael Gent, Walter Spitzer, and Carolyn Wells. Their aid is gratefully acknowledged, while they are also absolved of responsibility for the contents.

References

1. Cornfield, J.: The University Group Diabetes Program. A further statistical analysis of the mortality findings, J. A. M. A. 217:1676-1687, 1971.

2. Feinstein, A. R.: Clinical biostatistics. The purposes of prognostic stratification, Clin. Pharmacol. Ther. 13:285-297, 1972.

3. Feinstein, A. R.: Clinical biostatistics. XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2), Clin. Pharmacol. Ther. 14:898-915, 1973.

4. Feinstein, A. R.: Clinical biostatistics. XXXI. On the sensitivity, specificity, and discrimination of diagnostic tests, Clin. Pharmacol. Ther. 17:104-116, 1975.

5. Feinstein, A. R.: Clinical biostatistics. XXXII. Biologic dependency, 'hypothesis testing', unilateral probabilities, and other issues in scientific direction vs. statistical duplexity, Clin. Pharmacol. Ther. 17:499-513, 1975.

6. Feinstein, A. R., and Ramshaw, W. A.: A procedure for rapid mental calculation of the fourfold chi-square test, J. Chron. Dis. 25:551-553, 1972.

7. Fleiss, J. L.: Statistical methods for rates and proportions. New York, 1973, John Wiley & Sons, Inc.

8. Halperin, M., Rogot, E., Gurian, J., and Ederer, F.: Sample sizes for medical trials with special reference to long-term therapy, J. Chron. Dis. 21:13-24, 1968.

9. Kramer, M., and Greenhouse, S. W.: Determination of sample size and selection of cases, in Cole, J. O., and Gerard, R. W., editors: Psychopharmacology: Problems in evaluation. National Academy of Sciences, National Research Council, 1959, pp. 356-371.

10. Neyman, J., and Pearson, E. S.: On the use and interpretation of certain test criteria for the purposes of statistical inference, Biometrika 20A:175 and 263, 1928.

11. Pasternack, B. S.: Sample sizes for clinical trials designed for patient accrual by cohorts, J. Chron. Dis. 25:673-681, 1972.

12. Sackett, D. L., Spitzer, W. O., Gent, M., and Roberts, R. S.: The Burlington randomized trial of the nurse practitioner: Health outcomes of patients, Ann. Intern. Med. 80:137-142, 1974.

13. Schlesselman, J. J.: Planning a longitudinal study. I. Sample size determination, J. Chron. Dis. 26:535-560, 1973.

14. Schneiderman, M. A.: The proper size of a clinical trial: 'Grandma's strudel' method, J. New Drugs 4:3-11, 1964.

15. Spitzer, W. O., Sackett, D. L., Sibley, J. C., Roberts, R. S., Gent, M., Kergin, D. J., Hackett, B. C., and Olynich, A.: The Burlington randomized trial of the nurse practitioner, N. Engl. J. Med. 290:251-256, 1974.

16. Spitzer, W. O., Feinstein, A. R., and Sackett, D. L.: What is a health care trial? J. A. M. A. 233:161-163, 1975.

17. University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. I. Design, methods and baseline results; II. Mortality results, Diabetes 19 (Suppl. 2):747-830, 1970.

CHAPTER 23

Problems in the summary and display of statistical data

This chapter originally appeared as "Clinical biostatistics XXXVII. Demeaned errors, confidence games, nonplussed minuses, inefficient coefficients, and other statistical disruptions of scientific communication." In Clin. Pharmacol. Ther. 20:617, 1976.

After completing a research project, an investigator encounters several different challenges in the management of data. One of these challenges is in organizing and analyzing the information; another is in drawing conclusions; a third is in communicating the results.

The organizational activities consist of choosing the variables to be analyzed; preparing suitable arrangements of the data for each variable; and summarizing the data in a way that allows appropriate discernment of relations and contrasts.

The formation of conclusions occurs in two different steps. From knowledge of the scientific background and architecture of the research, the investigator first notes the magnitude of the observed relations and contrasts, and uses scientific judgment to make decisions about their substantive importance. From the observed magnitudes and from the size of the examined groups, the investigator then uses mathematical inference to make decisions about "statistical significance". Thus, an investigator might find, in one study, a difference of 30% in comparing the percentage improvement with two treatments in 10 people; or, in another study, a correlation coefficient of .05 for the relationship of two variables in 5,000 people. Using both scientific and mathematical methods of decision-making, the investigator might conclude that the therapeutic difference was clinically important although statistically "not significant", whereas the cited correlation was statistically "significant" but clinically trivial.

The communication of results is a challenge that occurs after the others have been completed. To meet this challenge, the investigator must provide scientific colleagues with a clear account of what was done, what was found, and what was concluded. Unless this last challenge is suitably managed, the previous activities will have an unsatisfactory outcome. The research will not be reported in a manner that makes it comprehensible, appraisable, and usable by the scientific community.

The tools of statistics play diverse roles in these activities. The basic architecture [7] of the research and the choice of important variables depend on scientific rather than statistical decisions, but all the other procedures involve statistical methods. The techniques of descriptive statistics provide expressions for the numbers that are used to summarize data, to show relationships, and to indicate contrasts. To summarize data for individual variables, descriptive statistics offers such numerical expressions as means, medians, proportions, standard deviations, ranges, and percentiles. To show the relationships among variables, the descriptive statistical methods include two-way tables and graphs, correlation coefficients, and regression equations. To indicate contrasts, descriptive statistics supplies such expressions as increments, decrements, ratios, and proportionate
differences. The techniques of inferential statistics provide the mathematical methods for drawing probabilistic conclusions about the results cited in the descriptive expressions. The mathematics of inferential statistics produces such probabilistic statements as standard errors, confidence intervals, and the various tests that yield P values for correlation or regression coefficients and for the differences found in contrasts of means, proportions, or other descriptive summaries.

The descriptive procedures are obviously basic necessities for communicating the results of scientific research. None of the inferential statistical maneuvers can be applied until the descriptive processes are completed, and none of the inferential maneuvers have any substantive meaning or scientific acceptability unless the descriptive data are available for evaluation. Nevertheless, in the modern literature of biomedical science, many investigators and editors have become so infatuated with inferential or analytic statistics that the fundamental role of descriptive statistics has become neglected or perverted. The neglect occurs when investigators and editors make important decisions solely on the basis of P values or other inferential calculations, ignoring the magnitude of the contrasts and relationships from which the statistical calculations are derived. The perversion occurs when descriptive expressions for the contrasts and relationships are eliminated and replaced by the inferential statistical statements.

1. The demeaned error. The "standard error" has become a popular method of reporting results, although most of the investigators using this term do not know its definition, source, or connotations. If you doubt the foregoing remark, I invite you to take the following test. Choose an intelligent, reasonable "layperson" who knows nothing about statistics but who understands the arithmetic average or mean, and who has enough common sense to question any statements you make that seem inconsistent or foolish. Now try to explain to that person exactly what is meant by the standard error of the mean and why you, as an investigator, use it in your work. If the explanation is satisfactory to you, the other person should be looking baffled and seeking a clearer account. If the explanation is satisfactory to that other person, you should now be feeling quite uncomfortable. Your friend will be asking you why scientists are engaging in abstract fantasies and why the flights of fantasy are described in such peculiar words as standard error.

The standard error has nothing to do with standards, with errors, or with the communication of realistic scientific data. The concept is an abstract idea, spawned by the imaginary world of statistical inference and pertinent only when certain operations of that imaginary world are met in scientific reality.

a. The idea of error. If you want to explain the phrase standard error of the mean, you first decide the word with which to begin. The critical idea is the idea of error. This idea happens to have a realistic origin. It arose in reference to a practical problem that today would be called "observer variability". Suppose the same entity or substance has been measured several times and suppose the different measurements do not all agree. For example, if the chemistry laboratory at your institution is particularly fastidious and conducts its tests in quadruplicate, the lab may get such values as 249, 250, 247, and 258 mg/dl as measurements of cholesterol concentration in the same specimen of serum. Which one of these measurements should be issued by the lab as the formal, correct value?

The answer to this question involves some fundamental philosophic issues in the statistics of mensuration. Should we average all the values together and take the mean? If so, the "correct" result is 251. Should we take the average of the three values that remain after discarding the value of 258 because it seems to be an outlier, far away from the others? If so, the correct result is 248.7. Should we take an average based only on the two closest values? If so, the correct result is 249.5. The decision about which and how many values to include is beyond the scope of this discussion. For the moment, the point to be noted is that regardless of how the candidate values are chosen for inclusion, each of the "correct" results emerged from calculating the mean of the candidates.
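The three candidate "correct" values can be reproduced with a minimal sketch. The helper name `closest_pair` and the rule it applies (the smallest gap between neighbouring sorted readings) are illustrative assumptions, not a procedure given in the text.

```python
from statistics import mean

# Quadruplicate cholesterol readings (mg/dl) quoted in the text.
readings = [249, 250, 247, 258]


def closest_pair(values):
    """Return the two neighbouring sorted values with the smallest gap."""
    ordered = sorted(values)
    return min(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])


all_four = mean(readings)                        # every reading
without_outlier = mean(sorted(readings)[:-1])    # drop the highest value, 258
two_closest = mean(closest_pair(readings))       # 249 and 250

print(all_four, round(without_outlier, 1), two_closest)
```

Whichever subset is chosen, the reported result is still a mean of the chosen candidates, which is the point made in the preceding paragraph.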

The idea of using the mean as the right result is an old statistical tradition. The tradition is justified by its reasonableness (since there seems to be no better routine way of solving the problem) and is sanctified by more than a century of scientific usage.

Once we have chosen the mean, we can note the differences between the mean and the actual values that are observed in the measurements. If we call these differences deviations, the name is straightforward and non-pejorative. If the mean, however, has been endowed with the sublime virtues of truth, correctness, and propriety, the deviations can be suitably castigated for their failings. They can be called errors.

This use of the word error, of course, is itself untrue, incorrect, and improper. To call a deviation an error implies that there is something wrong with the measurement, that the equipment or the observer (or both) may not have been functioning accurately. In fact, however, if we had some unequivocal method for determining the accurate value of the measurement, we might have found that it was any one of the "deviant" results. The deviation may thus have had nothing wrong with it and its designation as an error made it a victim of a scientifically bizarre "morality" in nomenclature. Nevertheless, this demeaning use of error has persisted in statistical vocabulary, being embellished in such additional maledictions as error variance (or residual error) for the sum of the squared deviations around a fitted mean or for the deviations of observed measurements from a fitted line.

For many other statistical circumstances, however, the word error has been displaced from this unsatisfactory usage; and deviations from the mean are actually called deviations. At least two reasons can account for the displacement. The first reason is that the word error is obviously foolish when it is applied to deviations from the mean, not in different measurements of a single specimen, but in single measurements of different specimens. After all, if we determined the serum cholesterol of each member of a group of people, calculated the mean cholesterol for the group, and then found each person's deviation from the mean, we would be unacceptably silly if we referred to those deviations as errors. This approach may seem silly today, but it was a basic principle of the reasoning used by Quetelet a century ago to lay the foundations of populational statistics. In transferring deviations and variance from being a property of individual measurements to being a property of individual people, Quetelet also adopted the idea that the mean was the ideal. The statistical glorification of mediocrity in people was perpetuated when Galton, Pearson, and other founders of the British school of biometry accepted both the conceptual transfer of variance and the associated nomenclature of error.

Although modern recognition of this folly should have been the main reason for evicting the word error from its former role in describing deviations, the second reason is probably more cogent. The word error was needed (as we shall see later) for a different job, describing a different type of deviation, occurring in a different type of modern world: the abstract imagery of statistical inference.

b. The idea of 'standard'. Once we have decided to use the mean as either the correct result or the focal point of a series of n measurements, we want to get an idea of the dispersion of values around that mean. The most obvious approach is to find the average dispersion or average deviation of the values. To take the sum of the deviations is a futile task, however, since they will always add up to zero, by virtue of the way they were defined. [If you wonder about this statement, recall that the mean is defined as $\bar{x} = \sum x_i / n$, where $x_i$ is any one of the n measurements. The sum of the deviations is then $\sum (x_i - \bar{x})$, which becomes algebraically expanded to $\sum x_i - n\bar{x}$, which is $n\bar{x} - n\bar{x} = 0$.]

To get around this problem, we could take the sum of the absolute magnitudes of the deviations, regardless of whether they are positive or negative. This sum, when divided by n, would give the average absolute deviation, a value that is clear and reasonable. Unfortunately, because absolute values are a nuisance to calculate and work with, they are mathematically unappealing.
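Both statements (that signed deviations from the mean always sum to zero, and that the average absolute deviation is a simple, readable number) can be checked on the quadruplicate cholesterol readings used earlier in the chapter. This is only an illustrative sketch; the function names are hypothetical.

```python
readings = [249, 250, 247, 258]   # mg/dl, the cholesterol example from the text


def deviations_from_mean(values):
    """Return the mean and each value's signed deviation from it."""
    centre = sum(values) / len(values)
    return centre, [x - centre for x in values]


def mean_absolute_deviation(values):
    """Average of the deviations after their signs are ignored."""
    _, deviations = deviations_from_mean(values)
    return sum(abs(d) for d in deviations) / len(values)


centre, deviations = deviations_from_mean(readings)
print(centre)                              # 251.0
print(deviations)                          # [-2.0, -1.0, -4.0, 7.0]
print(sum(deviations))                     # 0.0
print(mean_absolute_deviation(readings))   # 3.5
```

The positive and negative deviations cancel exactly, so the plain sum carries no information about dispersion, whereas the average absolute deviation of 3.5 mg/dl does.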

The next option field.

To avoid

is

the

the one that has swept the

negative

signs,

we can

Mathematical mystiques and

all

e

deviations. Preparing for getting

tlie

we now add

rage,

2

squared de-

those

viations together, as classical!) x)

it

and is 1

symbol, S xx

a special

to call this shall

sion by dividing the standard deviation by the

to be verj

sum of the squares, (Mj own preference

mean

get the coefficient of variation.

to

thermore, that

953

conventional nomenclature

man)

years,

municating

i

we now

divide the

viations bj n. w Inch

the

is

we

(or observations),

sum

get

the squared de-

i^\

number of deviations the mean squared de-

o\ the

standard

1.96)

to the

data will be contained

o\ the

spanned on either side tually

adhere

summary

a

(ac-

Thus,

for

deviations.

most popular way

the

~

I'Xi

for a set

com-

oi'

univariate

in

the form

mean ± standard deviation, usually svmbolized as x ± s. [There are good reasons as

oi

it

1

"

—

—

for rejecting

with medians and per-

x)'

centile ranges; but this essay

n

o\'

data has been to cite the results

this fashion, replacing

J

/one

in a

mean by two

discussed earlier in this scries

viation.

Fur-

Gaussian circumstances, we know

in

expression the group variance, hut

in this essaj If

.

can

sum

the

is

manipulations and so

often gets a nickname,

of the

We

dispersion oi data around the mean.

also describe the dispersion in a single expres-

of the squared deviations, happens useful for other statistical

summary

deviation serves as a splendid

symbolized by

This expression, which

.

statistical strategies

concerned with

is

n

the sins of the standard error, not the standard

which

also called the variance

is

variance,

square root of

tlie

rate!) called a

root

we

B) taking the

get

w

hat is accu-

mean squared deviation. To

shorten the phrase, the term might be called the

The phrase

average square deviation. used, however,

actuall) I

have been unable

question of whv like

is

an answer for the

word

scientific

standard was pressed into

mathematical

is

standard deviation.

to find

distinctive

a

that

peculiar

this

(According

F.

N.

David, the phrase standard deviation was

first

service.

Man)

to

deviation.]

e. The entrance and emergence of indirect inference. The events just noted were routine procedures of the descriptive statistics that evolved in the days when scientific investigators were concerned with demonstrating what they had found. These statistical activities required no knowledge of mathematics (beyond the ability to do arithmetic) and no mental flights to any probabilistic aeries. When investigators performed comparisons, however,

a

other two-word

role

became

alternatives were available, including adjusted

The

investigator's comparison might involve a

deviation and adapted deviation.

direct contrast of

used by Karl Pearson.)

mean squared

suitable

word dispersion,

the root

or an indirect contrast of a greater-than-zero

deviation could have been called

value for a correlation coefficient or regression

or

the

mean

The

dispersion.

word variance might even have been reserved for this purpose, so that what is now called the variance

or

the

squared standard deviation

could have been called the squared variance.

With complete disregard for the important scientific roles of the words standards and standardized, however, the idea of standard was seized and joined to deviation, where it has remained in its status as a statistically fused, scientifically malformed neologism.

What made the malformation so acceptable

phrase

standard

course

is

excellent

data have

deviation

so

that the idea (not the

way i

two means or two proportions;

B)

re-definition of the

the dispersion

available for statistical inference.

has

and the

assumed

coefficient against an

null value of 0.

After deciding, from scientific judgment, that the contrasted results were substantively impressive, the investigator would then want to determine whether the observed groups were large enough for the results to be more than a chance occurrence.

The need to make this probabilistic decision brought statistical inference into scientific research. The two different mathematical ways to do the probabilistic analysis were described previously in this series. 9

One way was

simple,

popular,

of

straightforward, and easily comprehensible.

It

name)

an

relied

on permutations of the observed data

to

of communicating results.

is

If the

Gaussian distribution, the standard

provide a specific distribution of alternative possibilities for arranging the results, showing exact P values for each alternative.
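The permutation idea described above can be sketched in a few lines. This is only a minimal illustration, not the author's own procedure: the function name `exact_permutation_p`, the choice of the absolute difference in means as the index of contrast, and the two tiny groups of values are all hypothetical assumptions made for the example.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Exact two-sided P value from every rearrangement of the pooled data.

    Assumption for this sketch: the contrast is the absolute difference
    between the two group means.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    pooled_total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    as_extreme = 0
    arrangements = 0
    # Each choice of n_a positions is one alternative arrangement of the data.
    for positions in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in positions)
        diff = abs(sum_a / n_a - (pooled_total - sum_a) / n_b)
        arrangements += 1
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / arrangements


# Hypothetical data: two groups of three observations each.
p_value = exact_permutation_p([1, 2, 3], [4, 5, 6])
print(p_value)
```

With groups this small every arrangement can be listed by hand, which is why the approach was simple to understand but, for larger groups, laborious before computers.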

The other way was complex, convoluted, and hard to understand. It relied on a gerrymandered transformation of the parametric theories developed for inferential estimation from single samples; an acceptance of numerous assumptions that could never be verified; and a willingness to think in the abstract domains of hypothetical acts of sampling, infinite populations, mathematically perfect distributions, pooled variances, and approximate P values.

The first way was what scientists wanted; the second was what statisticians offered. The first way, in the era before computers, required difficult and sometimes formidable calculations; the second way, once one learned the rules of the game, was easily carried out with a desk (or hand-held) calculator. Not because of any logical or scientific desirability, but merely because of calculational convenience, the second way has thus far been triumphant. In the foreseeable future, when digital computers have become cheap and easy to use and when hand-held, battery-powered computer terminals have become ubiquitous, the scientifically desirable, direct methods of statistical inference may become the conventional procedure for performing probabilistic contrasts. For the immediately foreseeable future, however, investigators must deal with the indirect forms of probabilistic reasoning, with statistical consultants who prefer that form of reasoning (because of its comfort, its familiarity, and perhaps its mystery), and with the adverse consequences that the indirect forms of statistical inference have had on scientific communication.

d. The confusion between estimations and contrasts. To be applied for evaluating comparisons, the indirect methods of statistical inference had to be altered from their original purposes. The original goal of the indirect inferential strategy was parametric estimation, not probabilistic contrast. The inferential tactics were initially intended for political poll-takers, market research analysts, and other people who attempt to estimate the "parameters" of a population by inferences drawn from the values found in a single random sample of that population. Thus, if we take a random sample of 150 potential voters in Connecticut, find that 81 of them say they will vote Republican, and thereby conclude that the Connecticut vote in favor of Republicans will be 54% in the next election, we perform a parametric estimation. If we find that the Republican preference is 44% among 150 clinical epidemiologists and 56% among 150 molecular biologists and if we thereby conclude that molecular biologists are statistically significantly more Republican than clinical epidemiologists, we perform a probabilistic contrast.

For parametric estimation, indirect inference is a necessity, not a calculational luxury. In an estimation procedure, the investigator has only one sample. There is no second group available with which to form the permutations and other arrangements used in direct inference for contrasts. With only one sample available, however, the estimating investigator must engage in educated guesswork (called assumptions) about the distribution and other characteristics of a hypothetical parent population; and must accept the other theoretical components of the process that leads from observed values in a sample to estimated values for its frame (or population).

The distinction between a parametric estimation from a single random sample and a probabilistic contrast for the results of two groups does not appear to be clearly understood by many statisticians and many scientists. The consequent confusion may well be responsible for many of the major problems in communication that statistical analysis has brought to scientific research. An investigator performing a probabilistic contrast of two groups has no interest in parametric estimation, and uses the indirect principles of parametric inference only because he doesn't know anything else to do, because he is computationally constrained, or because he has received misleading advice. Conversely, an investigator performing a parametric estimation from a single random sample cannot apply direct principles of inference and uses the indirect methods because they are the best (and the only) tactics at his disposal.
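The difference between the two activities can be made concrete with the voting figures quoted in this section (81 of 150 voters; 44% and 56% of 150 in each of two groups, taken here as 66 and 84 people). The sketch below is an illustration under stated assumptions, not the author's calculation: the function names are hypothetical, the standard error uses the usual binomial formula, and the contrast is computed as an exact two-sided probability from the hypergeometric arrangements of a 2 × 2 table with fixed margins (commonly known as the Fisher exact test), which the text does not name.

```python
from math import comb, sqrt


def estimate_proportion(successes, n):
    """Parametric estimation from ONE sample: a proportion and its
    standard error (binomial formula, assumed for this sketch)."""
    p = successes / n
    return p, sqrt(p * (1 - p) / n)


def exact_contrast_p(a, n1, c, n2):
    """Probabilistic contrast of TWO groups: exact two-sided P value
    over all 2 x 2 tables that keep the observed margins."""
    m = a + c                      # total "successes" in both groups
    denom = comb(n1 + n2, m)

    def prob(k):
        # Probability that group 1 holds k of the m successes.
        return comb(n1, k) * comb(n2, m - k) / denom

    p_observed = prob(a)
    lowest, highest = max(0, m - n2), min(n1, m)
    # Sum the tables that are no more probable than the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Estimation: 81 of 150 sampled voters.
p_hat, se = estimate_proportion(81, 150)

# Contrast: 66/150 (44%) versus 84/150 (56%).
p_contrast = exact_contrast_p(66, 150, 84, 150)
print(round(p_hat, 2), round(se, 3), p_contrast)
```

The first function needs assumptions about an unseen parent population; the second uses nothing but the rearrangements of the observed groups.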

The confusion among scientists is widespread and can readily be seen from the frequency with which indirect parametric techniques are applied to contrast the results of groups that were not selected as random samples and that therefore permit no parametric estimations.

The confusion among statisticians is also widespread and can readily be seen from the frequency with which statisticians improperly use the word samples for groups that were not samples, the members of the group having entered the research because of their convenience and availability to the investigator. Perhaps the most dramatic example of statisticians' confusion was provided by the late G. W. Snedecor, one of America's leading statisticians and the principal instructor for a generation of contemporary statistical consultants. In the minds of scientific investigators, experiments are intended to allow comparisons of results that will provide valid answers to research questions. According to Snedecor, 18 however, "the purpose of an experiment is to produce a sample of observations which will furnish estimates of the parameters of the population together with measures of the uncertainty of these estimates."

e. Estimating a mean and its standard error. In the rare circumstances in which a clinical or epidemiologic investigator has obtained a random sample for estimating a parameter such as a mean, the indirect statistical reasoning would go as follows.

In the available sample, we can calculate a mean and a standard deviation. From this information, how likely are we to be right or wrong in estimating the true mean of the parent population? To answer this question, we begin a long chain of abstract reasoning. Suppose we drew another sample and calculated its mean and standard deviation. Now suppose we obtained yet another sample and found its mean and standard deviation. Now suppose we repeated this process over and over, restoring each sample, if necessary, back into the original frame (or population) before the next sample was drawn.

As the process of repeated sampling continued, we would obtain a series of means, one for each sample. Let us now concentrate on that set of sample means and think of them as though they were the individual values in a collection of data. In fact, let us calculate the mean and the standard deviation of the set of those means. The value for the standard deviation of the means would need a name. We could call it the standard deviation of the means, but this term can be rejected either because it is too clear to be used in statistical nomenclature or because it might be confused with the other kind of standard deviation, calculated for individual single samples, not for the means of a series of samples. Besides, our old friend error has been hanging around, awaiting a call to active duty. To avoid letting this old soldier fade away, the standard deviation of the means (plural) will be christened the standard error of the mean (singular).

The philologic restoration of error and the transition from the plural to the singular of mean was accompanied by some basic acts of mathematical faith. It is this faith that enables you, as a realistic scientist, to forget that you have only one sample and that a repetitive sampling process never occurred. With this faith, you assume that such a process took place and you can then draw conclusions about its results. The conclusions, which represent the fundamental tenets of statistical inference for parametric estimation, can be stated as follows:

1. As the theoretical sampling process continues over and over, the mean of the means will approach the true value of the populational mean. Lacking the reality of repeated samples to tell us this true value, we must take a guess about it. There are convincing mathematical proofs to show that the best guess, i.e., the best estimate, for the populational mean will be the mean of our single available sample.

2. Although the mean of that single sample provides the best estimate of the populational mean, the variance of the single sample, calculated as $S_{xx}/n$, is not the best estimator of the populational variance. With a mathematical proof that is too complex to be shown here, it can be demonstrated that the populational variance is best estimated as $S_{xx}/(n - 1)$. [This aspect of statistical inference is the reason that so many statistical textbooks and programmed calculators determine the standard deviation as $s = \sqrt{\sum (X_i - \bar{x})^2/(n - 1)}$, rather than using the more intuitively "logical" value of n in the denominator. The commonly cited explanation, that n - 1 is the "degrees of freedom" in the data, doesn't explain. It is another source of statistogenic confusion. The real reason is that n - 1 offers a better estimate of the value in the population.]

now invoke

butions. According to this principle,

With another mathematical proof

3.

be spared here,

will

it

mean

the old principle of Gaussian distri-

mean

that

means found

more

that

you

Re-phrased

this

ner, this statement says there

in a

value for the standard deviation in

that the true populational

sample can be used to estimate the

x

calculated

s

error

is

means

as-

those hypothetical samples. With

in all

as

VS xx /(n -

1),

standard

the

of the

where

,

The appearance of the square root of

highly accurate estimations. In order to halve the standard error, we must quadruple the sample size. For example, let us consider the standard error of the a

random sample of 150

54
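The arithmetic behind the last remarks can be checked with a short sketch. It is an illustration only, resting on assumptions not spelled out in the surviving text: the standard error of a mean is taken as s/√n with s computed on the n - 1 divisor, the standard error of a proportion is taken from the usual binomial formula, the eight data values are hypothetical, and the 54% and 150 figures are the voting example used earlier in this essay.

```python
import math
import statistics


def standard_error_of_mean(values):
    """s / sqrt(n), where s uses the n - 1 divisor (statistics.stdev)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def standard_error_of_proportion(p, n):
    """Binomial standard error of a proportion (assumed formula)."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical sample, used only to contrast the n and n - 1 divisors.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sd_with_n = statistics.pstdev(sample)         # divisor n
sd_with_n_minus_1 = statistics.stdev(sample)  # divisor n - 1
se_mean = standard_error_of_mean(sample)

# Quadrupling the sample size halves the standard error.
se_150 = standard_error_of_proportion(0.54, 150)
se_600 = standard_error_of_proportion(0.54, 600)
print(sd_with_n, sd_with_n_minus_1, se_mean, se_150, se_600)
```

Because n enters only through its square root, each further halving of the standard error costs a fourfold increase in the number of people sampled.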

Consider the following array of survival rates for an ordinal age partition of patients with a particular disease: below age 31, 30%; age 31-45, 65%; age 46-60, 67%; above age 60, 29%. In this non-monotonic sequence of survival rates, the curve is trapezoidal (rising and falling), with survival rates worst in the two extreme age groups, and best in the middle groups. Since we had no advance beliefs about the prognostic distinctions of advancing age (particularly below age 60), the absence of monotonicity is not surprising or contradictory to what had been expected.

The property of monotonicity can be determined by simple inspection of the array of values for the target rates in the ordinally arranged strata. Alternatively, adjacent target rates can be subtracted from one another. In a monotonic stratification, all the increments will be positive or all negative, and a reversal in sign will denote a non-monotonic partition. Thus, for the three stratifications cited in the first paragraph of this section, the increments of rate were, respectively: -15% and -25% for the monotonically decreasing partition; 4%, 5%, and 7% for the monotonically increasing partition; and -22%, +5%, and -39% for the one that was non-monotonic.
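The subtraction test just described is easy to mechanize. The sketch below is a minimal illustration with hypothetical function names; it treats a zero increment as a failure of monotonicity, which is an assumption of the example rather than a rule stated in the text. The first array is the age-partitioned survival rates quoted above; the second is a made-up rising array chosen to give increments of 4, 5, and 7.

```python
def increments(rates):
    """Differences between adjacent rates in ordinally arranged strata."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def is_monotonic(rates):
    """True when every increment has the same sign (all + or all -)."""
    steps = increments(rates)
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


# Survival rates (%) for the four age strata cited in the text.
survival_by_age = [30, 65, 67, 29]
print(increments(survival_by_age), is_monotonic(survival_by_age))

# Hypothetical rising array with increments of 4, 5, and 7.
rising = [10, 14, 19, 26]
print(increments(rising), is_monotonic(rising))
```

A reversal of sign anywhere in the list of increments is enough to mark the partition as non-monotonic.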

A test for monotonicity of gradient is particularly important when a metric variate, such as age, height, or blood pressure, is partitioned into dichotomous strata. With a dichotomous split, any substantial difference in rates will create a gradient, and the investigator may draw a spurious conclusion about monotonicity unless the gradient has been specifically checked in a polychotomous rather than dichotomous partition.

For example, suppose the target rate in a population is 30/200 (15%). Suppose we now divide the population dichotomously according to height and find the rates of 10/40 (25%) for the shorter group and 20/160 (13%) for the taller group. We then conclude that the target rate declines with an increase in height. If we had specifically checked for monotonicity of the gradient, however, we would have divided the population into at least three ordinal strata, and preferably more than three. Had we performed such a polychotomous partition we might have found the following results: short, 10/40 (25%); below medium, 8/120 (7%); above medium, 7/20 (35%); and tall, 5/20 (25%). Our idea about a falling gradient would have been erroneous. Therefore, to avoid misleading conclusions about the existence of gradients, a metric variate should always be checked for its monotonicity in a polychotomous partition before it is expressed in dichotomous form.

A striking example of the unsatisfactory dichotomous partitions for metric variates is contained in the report 14 of the UGDP study of diabetes mellitus. The stratifications for "risk factors" in Table 8 (pages 802-803) of that UGDP report included seven metric variates (age, blood pressure, fasting blood glucose, visual acuity, cholesterol, relative body weight, and serum creatinine). All of these variates were split dichotomously, according to "cutting points" that were "arbitrarily selected." The target rates were listed for the two strata of each variate, but no data were presented to show whether the rates rose or fell monotonically in the manner expected of the biologic gradient for a "risk factor." In the absence of a test for monotonicity in these variates, the reader (and possibly the investigators) can have no idea of the true effect that would be discerned with a scientifically adequate polychotomous partition.

b. Total gradient. The total gradient in the target rate for a partition is the difference between the highest and lowest rates found in the individual strata. Assuming all other numerical requirements have been fulfilled (for modicum size, statistical distinctiveness, and, when appropriate, monotonicity), we would prefer a partition with a large total gradient to one with a smaller gradient. If a partition contained a dichotomous split performed in search of single "risk factors," the two strata would be regarded as essentially trivial, with neither stratum demarcating a significant risk factor, unless the gradient between them was sufficiently high. The choice of a "sufficiently high" value for this gradient is arbitrary, but a value of at least 10% seems reasonable.
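The height example and the definition of total gradient can be worked through numerically. The sketch below is illustrative only: the function names are hypothetical, each stratum is written as an (events, members) pair, and the 10% cut-off is the arbitrary value suggested in the text rather than a fixed rule.

```python
def rates(strata):
    """Target rate (%) in each stratum; strata are (events, members) pairs."""
    return [100.0 * events / members for events, members in strata]


def total_gradient(strata):
    """Highest minus lowest target rate among the strata."""
    r = rates(strata)
    return max(r) - min(r)


def is_monotonic(strata):
    """True when adjacent rates always rise or always fall."""
    r = rates(strata)
    steps = [later - earlier for earlier, later in zip(r, r[1:])]
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


# Height example from the text: the same 30/200 population split two ways.
dichotomous = [(10, 40), (20, 160)]
polychotomous = [(10, 40), (8, 120), (7, 20), (5, 20)]

for label, strata in (("dichotomous", dichotomous),
                      ("polychotomous", polychotomous)):
    gradient = total_gradient(strata)
    print(label, [round(x, 1) for x in rates(strata)],
          round(gradient, 1), gradient >= 10.0, is_monotonic(strata))
```

Run on these figures, the two-stratum split shows a falling gradient that passes the 10% criterion, while the four-stratum split of the same people shows that the rates do not fall steadily with height.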

The UGDP report 14 also provides an obvious example of the inappropriate application of the term "risk factors" to dichotomous partitions of strata that produced only minor or even trivial gradients in their target rates. Because the UGDP's Table 8 did not list the numbers of patients involved in the numerators or totals of the cited rates, the actual "risk" of the "risk factors" is not apparent in that table. Using the method of calculation described elsewhere, 8 I have determined the appropriate numbers and target rate percentages for the "selected baseline characteristics" reported by the UGDP. The results are shown here in Table I.

If we regard a gradient of 10% as indicative of a distinct risk factor, the data of Table I indicate that two of the strata labelled "cardiovascular risk factors" by the UGDP (definite hypertension present and serum cholesterol > 300 mg./100 ml.) were essentially trivial in this population, being associated with cardiovascular death gradients of only 7.1% and 7.4%. In forming a cluster designated as one or more cardiovascular risk factors, the UGDP data analysts created a union of five strata: hypertension, digitalis, angina pectoris, ECG abnormality, and elevated cholesterol. The results of this cluster were also trivial, producing a cardiovascular death gradient of only 9.0%. Omitted from this cluster were two strata that are shown in Table I to be more substantial cardiovascular risk factors: arterial calcification (gradient 14.2%) and serum creatinine > 1.5 mg./100 ml. (gradient 10.3%). Another omission from the UGDP's combination of "cardiovascular risk factors" was Age > 55, which also had a higher cardiovascular risk gradient (9.5%) than the selected cluster.

With one exception, all of the strata that have just been described as producing significant or trivial gradients for rates of cardiovascular death produced correspondingly significant or trivial gradients when the target event was total deaths. The exception was visual acuity < 20/200, which had a gradient of 4.9% for cardiovascular deaths, but 11.4% for total deaths. The latter gradient for the relationship of visual acuity and total deaths was higher than the corresponding gradient for hypertension, elevated cholesterol, and the UGDP's cardiovascular cluster.

In a subsequent report, 4 the UGDP group extended its cardiovascular risk cluster to a union of eight rather than five "risk factors." This union included the previous five strata and three additional ones: age > 55 years, relative body weight > 1.25, and arterial calcification. The first and third of these augmented "risk factors" were reasonable additions to the cluster since they were associated with respective gradients, as shown in Table I, of 9.5% and 14.2%. The body weight "risk factor," however, was bizarre. Its gradient was only 3.9% and besides, the rate of cardiovascular deaths, as well as total deaths, was lower in the stratum with the higher body weight. The UGDP's carefully selected cluster of eight "baseline risk factors" for cardiovascular death thus included four strata that were significant risk factors (digitalis, angina pectoris, significant ECG abnormality, and arterial calcification), one stratum that was borderline significant (age > 55), two strata that were trivial (hypertension and elevated cholesterol), and one stratum (relative body weight > 1.25) that was actually beneficial rather than detrimental in this population.

c. Isometry of clusters. As noted in our previous discussion, 10 the strata combined in a multivariate cluster should have essentially similar target rates. Without this isometry, a cluster would be an indiscriminate conglomerate of heterogeneous groups, rather than a scientifically meaningful aggregation. An excellent example of scientific attention to isometry in clusters is provided in the staging system for breast cancer developed by Cutler and Myers. 5, 6 These authors stratified patients according to a large number of diverse risk factors, and then formed "stages" by clustering the groups that had similar survival rates.

An excellent example of the neglect of attention to this scientific principle is provided in the "cardiovascular risk cluster" formed in the UGDP reports. 14 The original UGDP cluster of five strata contained an admixture of factors with cardiovascular death rates that can be noted in Table I to range from 33.3% (ECG abnormality) to 12.2% (hypertension). In the augmented cluster of eight strata, the corresponding death rates range from 33.3% to 5.7% (relative body weight > 1.25). The indiscriminate manner in which the "risk factors" were originally selected was magnified when the cardiovascular risk cluster was later stratified, in a subsequent UGDP report, 4 for patients who had 0, 1, 2, 3, 4, 5, or 6 of the cited eight "risk factors." With this type of stratification, based neither on biologic concordance nor on target rate isometry, a patient with the single negative risk-factor of elevated body weight would be placed

°The values for arterial calcification have been listed according to the correction later reported by the UGDP group. 1 '

The

analysis of multiple variables

Table I. Stratification of "risk factors" in the UGDP report

Total deaths

No. of Variate

Partition

<

Age

Pis.

X nnibcr and

X umber and rate

Gradient

rate

%

Gradient

55 55

449 374

22 (4.9 67 (17.9%)

13.0%

14 (3.1%) 47 (12.6%)

9.5%

Sex

Male Female

229 594

38 (16.6%) 51 (8.6%)

8.0%

25 (iO.9%) 36 (6.1%)

4.8%

Race

White Nonwhite

435 388

59 (13.6%)

5.9%

41

(9.4%) (5.2%)

4.2%

30

Absent

552 254

48 (8.7%) 38 (15.07c)

28 (5.1%) 31 (12.2%)

7.1%

762 46

69

785 47

74

Absent

777

74

Present

33

13 (39.4%;

697 108

71 (10.2%)

> 300 None One or more

411 361

27

(6.6%) 55 (i5.2%)

8.6%

13

(3.2%)

9.0%

<

272 547

21

(7.7%) 67 (J2.2%;

4.5%

13 (4.8%) 48 (8.8%) 44 (12.2%)

4.0%

>

110 110

< >

1.25

365 458

52 (14.0%) 37 (8.1%)

5.9%

1.25

20/200 20/200

725 41

76 (10.5%) 9 (21.9%)

<

1.5

781

>

1.5

18

>

1).

Ii\

finite

pertension

I'n sent

History of digitalis use

No Yes

No

History of angina pectoris Significant

Y,

ECG

abnormality

Cluster of

CV

risk

factors

Fasting blood glucose

Relative

body weight

>

Visual acuity

< Serum

s

co-morbidity

nostic

disease

.4). In addition, in the bivariate stratification of co-morbid G.I. disease vs. symptom stage, the total gradient for co-morbid G.I. disease becomes statistically indistinct (P > .35) when tested within the category indolent, which is Column 1 of that table. From these results, we would suspect that the most biologically effective of the three bivariate stratifications is the one containing prognostic co-morbidity vs. symptom stage. This suspicion is supported by

Table II. Chi-square results for stratifications shown in Table I

         Co-morbid G.I. disease      Prognostic co-morbidity     Prognostic co-morbidity
         vs. symptom stage           vs. symptom stage           vs. co-morbid G.I. disease
         X² value   d.f.   P         X² value   d.f.   P         X² value   d.f.   P
Total    27.53      8
