
Interrupted Time Series Designs

Overview
Role of ITS in the history of WSC
Two classes of ITS for WSCs

Two examples of WSC comparing ITS to RE


Issues in ITS vs RE WSCs
Methodological and logistical
Analytical

ITS
A series of observations on a dependent variable
over time
N = 100 observations is the desirable standard
N < 100 observations is still helpful, even with very few observations, and is by far the most common!

Interrupted by the introduction of an intervention.


The time series should show an effect at the time
of the interruption.

Two Classes of ITS for WSC


Large scale ITS on aggregates
Single-case (SCD) and N-of-1 designs in social
science and medicine
These two classes turn out to have very
different advantages and disadvantages in the
context of WSCs.
Consider examples of the two classes:

Large Scale ITS on Aggregates:


The effects of an alcohol warning label on prenatal drinking

[Figure: prenatal drinking (vertical axis, roughly -0.6 to 0.8) by Month of First Prenatal Visit, Sep-86 through Sep-91, with the Label Law Date marked and the impact beginning shortly after.]

Large Scale ITS on Aggregates


Advantages:
Very high policy interest.
Sometimes very long time series which makes analysis easier.

Disadvantages
Typically very simple with only a few design elements (perhaps a
control group, little chance to introduce and remove treatment,
rarely even implemented with multiple baseline designs).
Usual problems with uncontrolled and unknown attrition and
treatment implementation
We have yet to find a really strong example in education

Formidable logistical problems in designing WSCs that are well enough controlled to meet the criteria we outlined on Day 1 for good WSCs.
We are not aware of any WSCs comparing RE to this kind of ITS.

Single-case (SCD) and N-of-1 designs in social science and medicine
Each time series is done on a single person, though a study usually
includes multiple SCDs
Advantages:
Very well controlled with many opportunities to introduce design
elements (treatment withdrawal, multiple baseline and more), low
attrition, excellent treatment implementation.
Plentiful in certain parts of education and psychology

Disadvantages:
Of less general policy interest except in some circles (e.g., special
education) but
IES now allows them for both treatment development studies and for impact
studies under some conditions.
Increasing interest in medicine (e.g. CENT reporting standards).

Typically short time series, which makes analysis more difficult
Much work is currently being done on this.

Should be applicable to short time series in schools or classes


Has proven somewhat more amenable to WSC

Two Examples of WSC of RE vs ITS


Roifman et al (1987)
WSC Method: A longitudinal randomized crossover
design
Medical Example
One study that can be analyzed simultaneously as
Randomized experiment
6 single-case designs

Pivotal Response Training


WSC Method: Meta-analytic comparison of RE vs ITS
Educational example on treatment of autism
Multiple studies with multiple outcomes

No claim that these two examples are optimal


But they do illustrate some possibilities, and the
design, analytical and logistical issues that arise.

Roifman et al (1987)
High-dose versus low-dose intravenous
immunoglobulin in hypogammaglobulinaemia and
chronic lung disease
12 patients in a longitudinal randomized cross-over
design. After one baseline (no IgG) observation:
Group A: 6 receive high dose for 6 sessions, then low dose
for 6 sessions.
Group B: 6 receive low dose for 6 sessions, then high dose
for 6 sessions

Outcome is serum IgG levels


Here is a graph of results

Even though this example uses individual people for each time series, one can imagine this kind of study being implemented using schools or classrooms.

How many time points are needed is an interesting question.

Analysis Strategy
To compare RE to SCD results, we analyze the
data two ways
As an N-of-1 Trial: Analyze Group B only as if it
were six single-case designs
As a RE: Analyze Time 6 data as a randomized
experiment comparing Group A and Group B.

Analyst blinding
I analyzed the RE
David Rindskopf analyzed SCD

Analytic Methods
The RCT is easy
usual regression (or ANOVA) to get group mean
difference and se.
We did run ANCOVA covarying pretest but results were
essentially the same.

Or a usual d-statistic (or bias-corrected g)

The SCD analysis needs to produce a result that is in a comparable metric
Used a multilevel model in WinBUGS to adjust for
nonlinearity and get a group mean difference at time
6 (or 12, but with potential carryover effects)
d-statistic (or g) for SCD that is in the same metric as
the usual between-groups d (Hedges, Pustejovsky &
Shadish, in press).
But current incarnation assumes linearity

Analysis: RCT
If we analyze as a randomized experiment
with the endpoint at the last observation
before the crossover (time 6):
Group A (M = 794.93, SD = 90.48)
Group B (M = 283.89, SD = 71.10)
MD = 511.05 (SE = 46.98) (t = 10.88, df = 10, p <
.001)

d = 6.28, g = 5.80, V(g) = 1.98 (se = 1.41)
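As a check, these numbers can be reproduced in a few lines of R (a sketch: n = 6 per arm is from the design, and the variance formula is one common large-sample choice):

    mA <- 794.93; sA <- 90.48    # Group A mean and SD at time 6
    mB <- 283.89; sB <- 71.10    # Group B mean and SD at time 6
    n  <- 6                      # patients per arm

    sp <- sqrt((sA^2 + sB^2) / 2)           # pooled SD (equal n)
    d  <- (mA - mB) / sp                    # standardized difference: ~6.28
    J  <- 1 - 3 / (4 * (2 * n - 2) - 1)     # small-sample correction factor
    g  <- J * d                             # bias-corrected g: ~5.80
    vd <- (n + n) / (n * n) + d^2 / (2 * (n + n))  # ~1.98, se ~1.41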

Analysis 2: SCD
If we analyze only Group B (6 cases) using a d-estimator¹:
g = 4.59, V(g) = 1.43 (se = 1.196)
Close to RE estimate g = 5.80, V(g) = 1.98 (se = 1.41)

We also have a WinBUGS analysis² taking trend into account:
MD = 495, SE = 54, t = 495/54 = 9.2
Very close to the best estimate from the randomized
experiment of MD = 511.05, SE = 46.98
¹ Hedges, Pustejovsky and Shadish, in press, Research Synthesis Methods
² Script and data input available on request
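Purely as an illustration (the WinBUGS script is available on request), a roughly analogous frequentist multilevel model could be sketched in R with nlme. The data layout, variable names, and the linear-trend-plus-AR(1) structure here are assumptions, not our exact model:

    library(nlme)
    # assumed long format: igg = serum IgG, session = 1..12, id = patient,
    # high = 0/1 indicator of the high-dose phase; Group B cases only
    m <- lme(igg ~ session + high,
             random = ~ session | id,                   # per-patient intercept and trend
             correlation = corAR1(form = ~ session | id),
             data = groupB)
    summary(m)   # the fixed effect of 'high' plays the role of the dose contrast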

Comparing Results RE vs SCD


Means and d in same direction
Means and d of similar magnitude
It is not clear that the standard errors from previous
slides are really comparable, but treating them as if
they were:
Test overlap using 84% confidence intervals, which approximates a z-test¹
For g, they are 3.82 < 5.80 < 7.76 for RE
2.91 < 4.59 < 6.27 for SCD
For the group mean difference, 419.13 < 495 < 570.87 for SCD
445.04 < 511 < 577.06 for RE
That is, no significant difference between the SCD and RE.

Another option would be to bootstrap the standard errors.
¹ Julious, 2004, Pharmaceutical Statistics
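A small R sketch of the 84% overlap check using the values above (the z multiplier qnorm(.92) is approximately 1.405):

    ci84 <- function(est, se) est + c(-1, 1) * qnorm(0.92) * se

    ci84(5.80, 1.41)     # RE g:  approximately 3.82 to 7.78
    ci84(4.59, 1.196)    # SCD g: approximately 2.91 to 6.27 -- intervals overlap
    ci84(511.05, 46.98)  # RE mean difference
    ci84(495, 54)        # SCD mean difference -- also overlaps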

Comments on This WSC Method


Using randomized crossover designs with
longitudinal observations is a promising
method.
Statistical problems:
how to compare results from RE and SCD when
they clearly are not independent.
Did not deal with autocorrelation
Should be possible to do in several ways
but correcting would likely make SEs larger, making RE vs ITS differences less significant

Need to explore further the effects of trend and nonlinearities

Example: PRT
Pivotal Response Training (PRT) for Childhood
Autism
This WSC method does a meta-analytic
comparison of results from SCDs to results from
an RE.
Meta-analytic WSCs have a long history but also
have significant flaws in that many unknown
variables may be confounded with the designs.
But those flaws may often be no more than in the
usual 3-arm nonrandomized WSC
Big difference is the latter usually has raw data but
meta-analysis does not. In the case of SCDs, however,
we do have the raw data (digitized).

The PRT Data Set


Pivotal Response Training (PRT) for Childhood
Autism
18 studies containing 91 cases.
We used only the 14 studies with at least 3 cases
(66 cases total).
If there were only one outcome measure per
study, this would result in 14 effect sizes.
But each study measures multiple outcomes on
cases, so the total number of effect sizes is 54

Histogram of Effect Sizes

Statistics for g (N = 54 effect sizes, 0 missing):
Mean = 1.311520, Median = 1.041726, Std. Deviation = 1.2359393, Minimum = -0.4032, Maximum = 5.4356

Data Aggregated (by simple averaging) to Study Level

sid          g           vg            w
19     .25643452    .06523894    15.328268
20     .54197801    .06563454    15.235879
 9     .63926057    .02315301    43.190927
 5     .75588367    .03473584    28.788705
 8     .99578116    .17598106     5.6824295
18    1.3178189     .36416389     2.7460164
17    1.3252908     .3465313      2.8857422
15    1.6105048     .11272789     8.8709195
 7    1.6148902     .1154149      8.6643927
16    1.6345153     .23100681     4.3288767
10    2.5494302     .26735438     3.7403539
 3    2.5985178     .67319145     1.4854615
11    2.5989373     .78074928     1.2808209
 4    3.5641752     .15104969     6.6203379

Initial Meta-Analysis (aggregated to the study level)

Distribution Description: N = 14, Min ES = .256, Max ES = 3.564, Weighted SD = .767

Fixed & Random Effects Model:
          Mean ES   -95%CI   +95%CI      SE         Z       P
Fixed      1.0100    .8493   1.1706   .0820   12.3222   .0000
Random     1.4540    .9887   1.9193   .2374    6.1244   .0000

Random effects variance component: v = .592448
Homogeneity analysis: Q = 87.4789, df = 13, p = .0000; I² = 85.14%
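As a sketch, the fixed- and random-effects means above can be closely approximated from the study-level table with, e.g., the metafor package in R (the original run used a different program, so method-of-moments estimates may differ slightly):

    library(metafor)
    g  <- c(.2564, .5420, .6393, .7559, .9958, 1.3178, 1.3253,
            1.6105, 1.6149, 1.6345, 2.5494, 2.5985, 2.5989, 3.5642)
    vg <- c(.0652, .0656, .0232, .0347, .1760, .3642, .3465,
            .1127, .1154, .2310, .2674, .6732, .7807, .1510)

    rma(yi = g, vi = vg, method = "FE")   # fixed effect:   mean ES ~ 1.01
    rma(yi = g, vi = vg, method = "DL")   # random effects: mean ES ~ 1.45, I2 ~ 85%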

RE on PRT: Nefdt et al. (2010)


From one RE (Nefdt et al., 2010), we selected the
outcomes most similar to those used in the SCDs
G = .875, v(G) = .146 (se = .382).

Recall that the meta-analysis of PRT showed G = 1.454, V(G) = .056 (se = .2374)

Are they the same?


Same direction, somewhat different magnitudes
Again using 84% confidence interval overlap test:
.338 < .875 < 1.412 for RE
1.120 < 1.454 < 1.788 for SCDs
Again, the confidence intervals overlap substantially, so
no significant difference between RE and SCD

Comments on PRT Meta-Analytic Example
This is just a very rough first stab
Need to deal better with linearity issues in
SCD analyses
We are currently expanding our d-statistic to cases
with nonlinearity
In the meantime, can detrend prior to analysis

Need also to code possible covariates confounded with the design for meta-regression.

Some Other Possible Examples


Labor Economics (Dan Black)
Many randomized experiments in labor economics
use outcomes that are in archives with 20-30 data
points

Army Recruiting Experiment (Coady Wing)


Incentive provided for enlisting into select
specialties, but only in randomly selected
recruiting districts.
Could tap VA/SSA/etc records for outcomes

Issues in WSCs comparing ITS to RE


Design Issues and Options
Analytic Issues
Logistical Issue

Design Issue: How to Make Random Assignment to RE vs ITS More Feasible in a Four-Arm Study
How to motivate people to take multiple
measures so that attrition is not a problem in the
time series?
Large amounts of money for each observation?
E.g., 300 people times 12 observations times $100 per
session = $360,000
If 100 observations, total cost is $3,000,000, but how to
space observations in a policy relevant way?
Do pilot study to determine optimal payment per
observation.

Shorten overall time by using very frequent measurements per day (e.g., the mood literature).
Could then reduce costs by paying per day or the like
But how policy relevant?

Design Issues: Nonrandomized Studies Comparing ITS to RE
E.g., Michalopoulos et al (2004) had time
series data with about 30 observations over
time for randomized treatment group,
randomized control group, nonrandomized
comparison group.
In this study, longitudinal randomized
experiments were done within cities.
The nonrandomized comparison group
was from another city.

More on Michalopoulos
They did not analyze the ITS as a time series.
Instead just substituted the comparison group
for the control group and analyzed like a
randomized experiment. This is not how an ITS
would be analyzed.
In addition, this has the usual confounds
between method and third variables.
But this perhaps can and should be done more
Does someone want to reanalyze Michalopoulos?
Compare results from the usual randomized estimate at one time point to
results from analyzing the randomized control group as an ITS

Design Issues: Randomized Crossover Designs
Key Issue: How many longitudinal randomized
crossover designs exist with enough data points?
A Medline search for "randomized crossover design" found 27,986 hits. Surely some of these are longitudinal.
A Medline search for "longitudinal randomized crossover design" found 126 hits. But a quick look suggested few were what we need, so it will be a tedious search to find truly longitudinal randomized crossover designs.
I have a list of
This is another good study for someone to do.

Might need a meta-analytic summary, so statistical issues will emerge about what effect size to use and how to deal with trend.

Design Issues: Meta-Analytic Approaches
Feasible in areas where both ITS and REs are
commonly used
A key issue would be to find areas with both
(medicine and N-of-1 trials? SCDs in education and psychology?)
My lab is currently working on this

In many respects, these are just aggregations of three- and four-arm studies, with all the flaws and strengths therein.
But meta-regression could help clarify some
confounds of methods with third variables.

Analytic Issues
Because so many real examples are short ITS,
usual ARIMA modeling etc is not practical.
Key analytic issues:
What is the metric for the comparison of ITS and
RE?
Dealing with trend
Dealing with autocorrelation

Analytic Issues:
When the Metric is the Same
To compare RE to ITS, the same effect estimate has to be
measured in the same metric
Not a problem
In longitudinal randomized crossover designs
In three arm studies in which participants are randomized to all
methods and treated identically and simultaneously (e.g.,
Shadish et al., 2008, 2011)
In three-arm studies like Michalopolous that are all part of one
large study
Or over multiple studies if they just happened to use the same
outcome variable.

In all cases, ensure the effect estimate is the same (ATE, ToT, etc.).
But meta-analyzing all of this can require finding a common metric if outcomes differ across studies.

Analytic Issues: Metric Is Different


Special Case: When all outcomes are
dichotomous but may measure different
constructs
Can use HLM (etc.) with a code for outcome type (Haddock, Rindskopf & Shadish, 1998)

Otherwise, need a common effect size estimate like d, r, odds ratio, rate difference, etc.:

Analytic Issue: Metric is Different


d-statistic for ABk SCDs (Hedges, Pustejovsky & Shadish, in press, RSM)
SPSS macro in progress; R script available but needs individual adaptation to each study
Assumes no trend and a normally distributed outcome
Takes autocorrelation and between/within-case variability into account; requires a minimum of 3 cases (see the naive sketch at the end of this slide)

d-statistic for multiple baseline design SCDs nearing completion (HPS also)
Grant proposal pending to extend work to
Various kind of trend
Various kinds of outcome (e.g., counts distributed as Poisson or
binomial)

There are other effect sizes, but none are well justified statistically or comparable to between-groups effect sizes.
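For intuition only, a naive sketch of a between-case standardized difference for an ABk design. This is NOT the HPS estimator (it ignores autocorrelation and small-sample bias, and the between-case variance term is crude), but it shows why between-case plus within-case variability enters the denominator, putting the result on the scale of a between-groups d:

    # assumed input: a list of cases, each with baseline (A) and treatment (B) vectors
    ab.d.naive <- function(cases) {   # hypothetical helper, minimum 3 cases
      diffs  <- sapply(cases, function(cs) mean(cs$B) - mean(cs$A))
      within <- sapply(cases, function(cs)
                  var(c(cs$A - mean(cs$A), cs$B - mean(cs$B))))
      # total SD mimics a between-groups SD: between-case + within-case variance
      mean(diffs) / sqrt(var(diffs) + mean(within))
    }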

Analytic Issues:
How to Deal with Trend in the ITS
The issue is to prevent the presence of linear
or nonlinear trend from causing a spurious
difference between RE and ITS estimate.
[Figure: illustrative single-case time series over 10 time points (values roughly 0 to 6), with Baseline and Treatment phases marked.]

Trend:
Outcome Already in Common Metric
If you do not need to convert to a common metric (e.g., the outcome in the RE and ITS is identical), there are two options:
Model trend using ordinary regression (Huitema, 2011) or multilevel models (Van den Noortgate & Onghena; Kyse, Rindskopf & Shadish).
But may produce two estimates:
Main effect of treatment
Interaction of trend with treatment

Working out whether the regression/HLM estimate is truly identical to the ATE from the RE is not transparent
For example, how would we demonstrate that the effect estimate from WinBUGS for the SCDs in the Roifman example is really an ATE?

Remove trend by detrending the data using:
First-order differencing (but is second-order differencing needed? and data points are lost)
Or regression on trend (but what polynomial order?) with subsequent analysis of the residuals
Then could use the HPS d (a sketch follows this list)
But it is not clear it is GOOD to remove the trend x treatment interaction, which may be a real effect of treatment.
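A quick R sketch of the two detrending options (the linear order in option 2 is an assumed judgment call):

    # Option 1: first-order differencing (loses one data point; check whether
    # the differenced series still trends before differencing again)
    dy <- diff(y)

    # Option 2: regress out a fitted trend, then analyze the residuals
    trend.fit   <- lm(y ~ time)       # polynomial order is a modeling choice
    y.detrended <- resid(trend.fit)   # the HPS d could then be computed on these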

Trend:
Outcomes in Different Metrics
E.g., in the PRT meta-analysis where each
study had the same construct but different
measures:
Detrend the data using methods previously
described
Then compute HPS d
Or wait till HPS d with trend adjustment is ready in
a few years.

Analytic Issues: Diagnosing Trend and Sensitivity Analyses
A different approach is to use nonparametric or
semi-parametric methods to see whether the
presence of an effect is sensitive to the presence
of trend or trend x treatment interactions.
These methods allow the data to tell you about trend and interactions, whereas parametric methods require you to know the trend beforehand.
We have been exploring Generalized Additive
Models (GAMs), a semi-parametric method.

Introduction to GAM
Like a parametric regression, but replacing some or all
of the parametric predictors with smoothed
nonparametric predictors. E.g.,
Parametric:
Yt = β0 + β1Xt + β2zt + β3[Xt - (n1 + 1)]zt + εt

GAM with smoothed trend and interaction:
Yt = β0 + s1(Xt) + β2zt + s3([Xt - (n1 + 1)]zt) + εt
For the techies: Smooth is cubic regression spline with
iteratively reweighted least squares with best fitting
model chosen by generalized cross validation.
Modeled in R using the mgcv package (Wood, 2010)
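A minimal sketch of both models in R with mgcv; the variable names (y, t, z), the baseline length n1, and the basis size k are illustrative assumptions, not our exact specification:

    library(mgcv)
    # assumed data: y = outcome, t = time (1..n), z = 0/1 treatment phase
    n1 <- 10                                    # assumed baseline length
    d  <- data.frame(y = y, t = t, z = z,
                     tz = (t - (n1 + 1)) * z)   # centered time x treatment

    m.par <- gam(y ~ t + z + tz, data = d)      # fully parametric version

    # smoothed trend and interaction; bs = "cr" gives cubic regression splines,
    # with smoothness chosen by generalized cross validation (mgcv's default)
    m.gam <- gam(y ~ s(t, bs = "cr", k = 5) + z + s(tz, bs = "cr", k = 5), data = d)

    summary(m.gam)                    # edf well above 1 suggests nonlinearity
    anova(m.par, m.gam, test = "F")   # rough model comparison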

GAM Questions to Ask


Is there trend or trend x treatment
interaction?
Is either nonlinear?
Is the treatment effect robust to trend?
Consider the following SCD:

Parametric and GAM Results

Parametric GLM with binomial errors finds a treatment effect but no trend or interaction.
The best-fitting GAM smooths the interaction:
The degree of nonlinearity may be high; the edf is monotonically related to polynomial order.
Fit indices show a borderline linear parametric trend; the treatment effect is significant.
The F test suggests the smoothed interaction is not significant, but that test is underpowered.

GAM Conclusions
About our three questions:
A trend x treatment interaction might be present
If so, it may be highly nonlinear
But the treatment effect is robust to trend compared
to the usual GLM.

About GAM based on our experience so far:


Works well with > 20 data points, less sure < 20
Good as sensitivity analysis about trend, too early to
say if good as primary analysis of ITS effects.
Open power questions
Model comparison tests seem well powered (too well?)
F test for smooth is said to be underpowered.

Does GAM overfit the data?

Conclusions about Trend


Probably the most difficult problem for WSCs
of RE and ITS
Lots of methods available and in development
but no best practice yet
So multiple sensitivity analyses warranted

The decision about trend depends in part on the decision/context regarding metric
If all outcomes are the same, more flexibility.
If outcomes vary, detrend, use d, use GAM for
sensitivity analyses?

Analytic Issues: Autocorrelation


Observations (or their errors) on the same
case over time are autocorrelated
Both effect estimates and standard errors can be
biased if not modeled correctly.

Typically computed with a standard YuleWalker estimator on the residuals from four
n 1
parameter regression:
yt yt 1
rj t 1n
yt 2
t 1

Correcting for Bias in Autocorrelation

Biased downwards (a) with small time series and (b) the more regression parameters used to estimate the residuals; the bias is on the order of (P + 3)/t.
With the usual four-parameter model, a correction is:

r̃ = (t·r + 4) / (t - 3)
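A small R sketch of this estimator and correction; the four-parameter model and the corrected formula follow the reconstruction above, and the variable names are assumptions:

    # assumed: y = outcome, time = 1..t.n, phase = 0/1 treatment indicator
    fit <- lm(y ~ time + phase + I(time * phase))   # four-parameter model
    e   <- resid(fit)
    t.n <- length(e)

    r1     <- sum(e[-t.n] * e[-1]) / sum(e^2)   # lag-1 Yule-Walker on residuals
    r1.adj <- (t.n * r1 + 4) / (t.n - 3)        # small-sample correction, as above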

Variability in Autocorrelations
Raw autocorrelations can be quite variable:

Autocorrelation and Sampling Error

Much of the observed variability may be due to sampling error:

v_j = (1 - ρ_j²) / (t - j - 3)

E.g., consider Bayesian (or empirical Bayes) estimates of autocorrelations from two SCD studies:

Trace Plot for Schutte Data

[Figure: per-case autocorrelation estimates conditional on tau (horizontal axis roughly 0.005 to 4.157), with conditional means roughly between -0.8 and 0.4 and the posterior probability of tau overlaid. Legend: A = (Intercept); B through N = cases 1-13.]

Most plausible values of τ (those with large bars) are small, corresponding to variances of .09 or less, though they seem to indicate that τ is unlikely to be zero.
At the most likely values of τ, the ACs are greatly shrunken around a conditional mean of just less than 0.20, with a range that probably does not extend much beyond -0.05 to -0.30, not enough to bias G or V(G) much.

Trace Plot for Dyer Data

Here the results are different:

[Figure: per-case autocorrelation estimates conditional on tau (horizontal axis roughly 0.025 to 2.288), with conditional means roughly between -0.3 and 0.4 and the posterior probability of tau overlaid. Legend: A = (Intercept); B through E = cases 1-4.]

1. Cases B, C, E shrink to a common small negative autocorrelation.
2. Case D is an outlier with a larger positive autocorrelation. Inspection of the graph for Case D suggests a ceiling effect that was not present in Cases B, C, E, which could cause higher autocorrelation.
3. Again, however, τ is unlikely to be zero.
4. And again, the range of the AC is likely to fall between -0.20 and +0.30, not enough to bias G or V(G) much.

Implications of Bayesian Results


Doing Bayesian analyses of ITS/SCDs may be a
very useful approach
We need more research to understand whether
assuming a common underlying (Bayesian or EB)
autocorrelation is justified for, say, cases within
studies.
Schutte data says yes
Dyer data says no (but moderator analyses?)

If we can make that assumption, much of the variability goes away and the remaining autocorrelation may be small enough to ignore
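As a rough illustration of the empirical Bayes idea (not our WinBUGS model), per-case autocorrelations can be shrunk toward a precision-weighted common mean; the between-case variance estimator here is a crude method-of-moments assumption:

    # r = per-case lag-1 autocorrelations, v = their sampling variances
    eb.shrink <- function(r, v) {
      w    <- 1 / v
      mu   <- sum(w * r) / sum(w)           # precision-weighted common mean
      tau2 <- max(0, var(r) - mean(v))      # crude between-case variance
      mu + (tau2 / (tau2 + v)) * (r - mu)   # shrunken per-case estimates
    }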

Dealing with Autocorrelations


Lots of methods, but don't ignore it entirely
In GLM/HLM/GAM:
Incorrect specification of trend in such models can
lead to spurious autocorrelations, so modeling
trend is important to results
Tentative work (GAMs) suggests properly
modeling trend may reduce ACs to levels that are
unimportant for bias.

For d, our SPSS macro estimates the AC and adjusts d appropriately.

Logistical Issue
How to get the data for the time series.
Sometimes it is available in an archive etc.
Sometimes it has to be digitized from graphs

The latter can be done with high validity and reliability:


Shadish, W.R., Brasil, I.C.C., Illingworth, D.A., White, K.,
Galindo, R., Nagler, E.D. & Rindskopf, D.M. (2009). Using
UnGraph to Extract Data from Image Files: Verification of
Reliability and Validity. Behavior Research Methods, 41,
177-183.
There is freeware also.

But digitizing can be time-consuming and tedious for large numbers of studies
E.g., a very good graduate student digitizing 800 SCDs from
100 studies took 8 months (including coding; Shadish &
Sullivan 2011).

Questions to Ask
This is the one area where we still need studies on the main-effect question: can ITS = RE?
Design variations to also examine:
Does it help to add a nonequivalent control?
Does it help to add a nonequivalent DV?
What about variations in ITS design?
Ordinary ITS with one intervention at one time
Multiple baseline designs with staggered
implementation of intervention over time
Over cases
Over measures within one case

Conclusion
WSCs of ITS and RE badly needed if ITS is
going to regain the credibility it once had
(assuming it does, in fact, give a good answer).
Challenging to design and analyze, but much
progress has been made already, especially in
SCDs and N-of-1 Trials.
Questions?

Some Comments on Meta-Analysis


Meta-analysis is probably the oldest empirical
approach to studying RE-QE differences
Smith, Glass & Miller (1981): psychotherapy
Lipsey and Wilson (1991): meta-meta-analysis
Both compared results from REs to results from
NREs with no adjustments of any kind.
Both found that d(RE) = d(NRE), but perhaps s²(RE) < s²(NRE)
But

How Much Credibility Should We Give to Such Studies?
The RE-NRE question was always secondary to
substantive interests (does psychotherapy work;
do behavioral and educational interventions
work?)
So no careful attention to definitions of RE-NRE
Just used the original researchers' word for it

No attention to covariates confounded with RE-NRE.

Glaser (my student) carefully recoded a large random sample of Lipsey and Wilson using clear definitions, etc., and found that L&W's finding did not replicate.

2nd Generation Meta-Analyses


E.g., Shadish-Ragsdale 1996; Heinsman-Shadish 1996.
Careful selection of two hundred well-defined REs and
NREs
from 5 areas (psychotherapy, presurgical patient
education, SAT coaching, ability grouping of students,
prevention of juvenile drug abuse),
coded on a host of covariates potentially confounded with
assignment method,
meta-regression to adjust RE-NRE difference for those
covariates.

Other studies of the same sort (Kownacki & Shadish, 1999, Alcoholics Anonymous; Shadish et al., 2000, psychotherapy)

Some Illustrative General Findings


Confounds with assignment methods are
rampant, and different across RE v NRE
And are often quite different across
substantive areas
But adjusting for those confounds greatly
reduces or eliminates RE-NRE effects.
Tentative hypothesis: If studies were conducted identically in all respects except for assignment method, they would yield similar results.

Some Illustrative Specific Findings


Studies using self-selection into conditions yielded far more bias than studies using other-selection.
Local controls produce more accurate NRE
estimates than non-local controls.
Crucial to control for activity level in control
group.
Across a set of studies, pretest d (on the
outcome) is often a very strong predictor of
posttest d.

3rd Generation Meta-Analyses?


Most meta-analyses do not have access to
individual person data within studies.
So they cannot use typical statistical adjustments at that level: PSA, ANCOVA, SEM, etc.

But access may be more possible today
Digitizing individual outcome data from graphs as in SCDs, often with some narrative and/or quantitative description of each case's personal characteristics
Growing individual patient meta-analyses, that is, from REs and NREs where the individual data are available both within and across multiple sites.

May allow some use of adjustments to NREs within and across studies that has not been possible before, but it is too early to tell.

What are the Disadvantages of Meta-Analyses?
For example, at the study level, meta-analytic data are
correlational data, so our ability to know and code for
confounds with assignment method is inherently
limited.
Also true of all three-arm designs so far?
Even worse because within a single three-arm study assignment is
totally confounded with study level covariates (but not individual
level covariates).
At least within meta-analysis one has variation in those study-level confounds over studies to model and adjust. This may only inform about study-level covariates (and perhaps aggregate-level person covariates, e.g., average age), but those covariates are nonetheless important in understanding RE-NRE differences in results (and this may change with 3rd generation meta-analysis).

Other criticisms? Discussion of the role of MA in WSC?
