I have a lot to thank John Hattie for. When I began seriously investigating education research in 2011, Hattie’s Visible Learning was one of the first books I read. One reference in Visible Learning had a profound influence on me — the 2006 review article by Kirschner, Sweller and Clark that put me on a path that ultimately led to my PhD research.
Unfortunately, I no longer accept the validity of Hattie’s methods, an approach broadly shared in the U.K. by the Education Endowment Foundation’s toolkit. In this post, I will attempt to explain why.
Let me present two graphs:
Graph 1
Graph 2
These graphs could represent a number of different scenarios — which is part of the issue that I will return to. So, let’s come up with a story for them. Imagine we have two groups of students — one group is taught using Ashman’s patent teaching package and the other is given a business-as-usual approach — the ‘control’. Let the orange line show the subsequent test scores of the students given the package and the blue line the scores of the control students. The average for the package is higher than for the control and so we may conclude it is more effective.
But notice the difference between Graph 1 and Graph 2. In Graph 2, the scores are more spread out. We can calculate something known as the ‘standard deviation’ — a measure of how spread-out the data is. It is sort-of* like the average distance of each data point from the mean. To keep things simple, for each graph, I have made the standard deviations the same for the package and control. In Graph 1, the standard deviation is 10 and in Graph 2, the standard deviation is 20.
We can use this data to find a standardised ‘effect size’. This is simply the difference between the two means divided by the standard deviation.
For Graph 1:
Effect size = (110 - 90) ÷ 10 = 2
For Graph 2:
Effect size = (110 - 90) ÷ 20 = 1
The effect size has no units; it is simply a measure of the difference between the means relative to the standard deviation. Provided we have a difference in means and a standard deviation — 'pooled' if it varies between conditions — we can calculate an effect size. These are the only requirements, and this is the problem. The fact that we can calculate this value from a wide variety of different studies leads to the misconception that the effect size is some kind of standardised measure, representing the same thing in each case. It is not.
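To make that recipe concrete, here is a minimal sketch in Python of the calculation, using the pooled standard deviation mentioned above. The numbers are invented to echo Graph 1; nothing here comes from a real study.

```python
import numpy as np

def cohens_d(treatment, control):
    """Standardised mean difference using a pooled standard deviation."""
    treatment, control = np.asarray(treatment, float), np.asarray(control, float)
    n1, n2 = len(treatment), len(control)
    # Pooled SD: weight each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * treatment.var(ddof=1) +
                  (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

# Illustrative data echoing Graph 1: means of 110 and 90, SD of roughly 10.
rng = np.random.default_rng(0)
package = rng.normal(110, 10, 200)
control = rng.normal(90, 10, 200)
print(round(cohens_d(package, control), 2))  # close to 2, as in the worked example
```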
Returning to our thought experiment, imagine if the students selected for the intervention package were advanced students whereas the control students were a mix of ability levels. We could still calculate an effect size for ‘the intervention’ even though the difference is most likely due to differing student ability in the two conditions. Best practice would be to randomly allocate students to each condition, but we have not done that. Instead, we have created a really bad ‘quasi-experiment’. The effect size calculation does not care.
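To illustrate just how indifferent the calculation is, here is a small made-up simulation (hypothetical numbers, not data from any study) in which the 'package' group is simply drawn from the top of the ability distribution and the intervention does nothing at all:

```python
import numpy as np

rng = np.random.default_rng(1)

# One population of underlying ability; the 'intervention' does nothing.
ability = rng.normal(100, 15, 10_000)

# Bad quasi-experiment: advanced students get the package, a mixed group is the control.
package_group = np.sort(ability)[-200:]    # the top 200 students
control_group = rng.choice(ability, 200)   # a mixed-ability sample

pooled_sd = np.sqrt((package_group.var(ddof=1) + control_group.var(ddof=1)) / 2)
d = (package_group.mean() - control_group.mean()) / pooled_sd
print(round(d, 2))  # a large positive 'effect size' with no intervention effect at all
```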
Perhaps even worse, we could have no control at all. Instead, the blue line could represent test scores before the intervention and the orange line could represent test scores afterwards. We could still calculate an 'effect size', but it would not be remotely comparable. In this new case, we would be measuring before-versus-after, whereas in the experiment we were comparing two different afters. Assuming teaching has some effect, before-versus-after is likely to generate a positive effect size in most cases. However, when comparing two experimental conditions, we should expect a fair proportion of zeros.
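A similar toy simulation (again, invented numbers) shows the asymmetry: if ordinary teaching improves everyone by roughly the same amount, the before-versus-after comparison produces a large 'effect size' while the comparison of the two afters produces roughly zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Everyone improves by about 15 points through ordinary teaching;
# the intervention itself adds nothing.
pre = rng.normal(90, 10, 200)
post_control = pre + rng.normal(15, 5, 200)
post_package = rng.normal(90, 10, 200) + rng.normal(15, 5, 200)

print(round((post_control.mean() - pre.mean()) / pre.std(ddof=1), 2))            # before-vs-after: large
print(round((post_package.mean() - post_control.mean()) / post_control.std(ddof=1), 2))  # after-vs-after: near zero
```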
With the appropriate mathematical formula, we can even generate an effect size from a correlation between two variables. For instance, if we collected data showing that consumption of ice-cream correlates with hat-wearing, we could compute an effect size for the 'effect' of hat-wearing on ice-cream eating. Equally, we could compute an effect size for the 'effect' of ice-cream eating on hat-wearing. The fact that there is no cause-and-effect relationship between the two, and that both are likely caused by changes in the weather, does not prevent me from doing this.
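For the record, the conversion usually used is d = 2r ÷ √(1 − r²); the ice-cream correlation below is, of course, invented.

```python
import numpy as np

def d_from_r(r):
    """Convert a correlation coefficient into a Cohen's-d-style effect size."""
    return 2 * r / np.sqrt(1 - r**2)

# An invented correlation between hat-wearing and ice-cream consumption.
print(round(d_from_r(0.3), 2))  # ~0.63: an 'effect size' with no cause-and-effect behind it
```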
In slightly different ways, both Hattie and the Education Endowment Foundation toolkit take effect sizes drawn from a wide range of different kinds of studies and group them into broad categories such as 'feedback' and 'metacognition and self-regulation'. They then average the effect sizes to generate one effect size to rule them all. The Education Endowment Foundation then adds the spurious step of translating this into 'Additional months progress'.
How are we to interpret that?
For a start, we cannot be sure that any of these effect sizes represent a meaningful cause-and-effect relationship without reading the underlying studies. Even when they do, is averaging them valid? If you look at the Education Endowment Foundation's toolkit, an interesting pattern emerges. The Education Endowment Foundation also conducts its own well-designed** randomised controlled trials, which it feeds back into its toolkit. However, these tend to have effect sizes at the lower end of the range of those we see in the toolkit studies. One way to interpret this is that the better the design of a study, the lower the effect size — this is the essence of the late Robert Slavin's powerful critique of Hattie's work. It is therefore tempting to assume that the categories with a high average effect size are full of bad studies.
Relatively low effect sizes may also be related to the Education Endowment Foundation’s tendency to use standardised assessments as an outcome measure.
To explain why this is an issue, consider my PhD research. I delivered a short intervention to teach upper primary children how to calculate energy efficiency. I tested the effect of two different approaches by giving the students an assessment I created, full of energy efficiency questions. I tested for statistical significance and computed an effect size.
Imagine if the Education Endowment Foundation came along and insisted I replace my bespoke assessment with a standardised science assessment such as PAT Science. These tests may not even have a single energy efficiency question on them. I would expect no effect to show up because the assessment would not be sensitive enough to pick it up. Yet, in both cases, the real-world effect I had on student learning would be the same.
So in addition to the issues already noted, selection of assessment task also affects the effect size. Bespoke tasks lead to larger effect sizes than standardised ones. Either may be appropriate in different circumstances, but assuming the effect sizes describe the same thing would be misconceived.
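As a rough illustration of that point, here is a made-up simulation in which the intervention genuinely improves the taught skill, but only the bespoke test is aligned closely enough to detect it. All of the numbers, including the small 0.1 loading of the taught skill on the standardised test, are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# The intervention genuinely improves the taught skill (energy efficiency) by 0.8 SD.
skill_control = rng.normal(0, 1, n)
skill_package = rng.normal(0.8, 1, n)
general_science = rng.normal(0, 1, 2 * n)  # unaffected by the short intervention

def d(a, b):
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

# Bespoke test: all items target the taught skill (plus some measurement noise).
bespoke_pkg = skill_package + rng.normal(0, 0.5, n)
bespoke_ctl = skill_control + rng.normal(0, 0.5, n)

# Standardised test: mostly general science, with little energy-efficiency content.
standard_pkg = 0.1 * skill_package + general_science[:n] + rng.normal(0, 0.5, n)
standard_ctl = 0.1 * skill_control + general_science[n:] + rng.normal(0, 0.5, n)

print(round(d(bespoke_pkg, bespoke_ctl), 2))    # large effect size
print(round(d(standard_pkg, standard_ctl), 2))  # close to zero, same real learning
```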
It is also worth remembering that we can manipulate an effect size by changing the difference between the means or by changing the standard deviation. We assume the former is the main driver when it may be the latter. In early education, standardised assessments may measure a small range of skills — such as letter-sound relationships — and children's performance on these skills could be quite tightly bunched. As Dylan Wiliam points out, 'older students tend to be more spread out than younger students,' and so we cannot easily compare effect sizes across students of different ages.
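A trivial worked example (with invented numbers) makes the point: the same raw gain of five points produces very different effect sizes depending on how spread out the scores are.

```python
# Same raw gain of 5 points in both cases; only the spread of scores differs.
gain = 5
sd_young = 8    # early-years scores tightly bunched (illustrative number)
sd_older = 20   # older students more spread out (illustrative number)

print(gain / sd_young)  # 0.625
print(gain / sd_older)  # 0.25
```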
So, what can be done?
In some cases, where the study designs are very similar, the age of students is controlled and the outcome measures are aligned and well-defined, averaging effect sizes may be valid. Studies into aspects of early reading instruction may fit these criteria.
In most other cases, the best we can do to summarise existing evidence is a narrative review. Such a review would simply describe the research done in the relevant area, outlining findings and the strengths and weaknesses of the evidence, with no average effect sizes.
If we want to answer the question of whether Intervention A is more effective than Intervention B then we have no real alternative to testing the best versions of each against each other in the same experiment, maybe with a third control condition.
Despite the many thousands of education faculties and bureaucracies around the world, pretty much nobody in mainstream education research is doing this.
*Technically, we take each data point and find its distance from the mean, then we square that. We then find the mean of these values and then square-root that.
**Maybe that’s too strong — many are ‘underpowered’
This is a very clear critique and I think you could have added even a few more things, like temporal effects and the file drawer problem. I was duped into this stuff to some degree a few years ago, but luckily for me I focused more on the feedback professional learning, and the VL platform was a useful framework for building self-evaluation in the school. Effect sizes, however, seemed a pat way to explain evidence-based approaches to almost everything and why you might be clever enough to be pushing a strategy in school. But then I started to read more around effect sizes and understood their pitfalls.

Like you, Greg, this started me on a road of discovery, albeit not a PhD, in relation to the poor evidence for a lot of what was being pushed by academics and commercial companies supported by large-scale glitzy conferences - I even spoke at a few. The truth is out there and, as a school leader, I am now firmly on the path of targeting work promoting explicit teaching, cognitive load theory (including talking to staff about primary and secondary biological knowledge and fluid and crystallized intelligence) and an in-school project we call the Big 5, which is focused on Dylan Wiliam's five key embedded formative assessment strategies. However, this is frustrating work and there is extreme deafness in the system.
Mr.-Dr. Ashman, Mr. Jay, & Mr. F., please allow me to wade into the conversation here. (Sorry if I used mistaken courtesy nouns of address there!)
Thank you! Thank you all for letting me barge in and for your comments.
This is a potentially fruitful conversation, so I'm glad to be able to participate.
Mr.-Dr. Ashman, I think your post includes important points. Those points include (but are not limited to) these:
(a) The influence of standard deviations on ESs (especially when the SDs for compared measures differ; as you rightly note, under that condition, one should use a pooled SD) must be considered; if one changes the denominator of a fraction (e.g., an ES), that makes a difference.
(b) Selection of samples (randomly chosen from what population?) and assignment to conditions (randomly assigned?) can influence the outcomes of a study, whether it is a quasi-experiment or even a "randomized controlled trial"...just how are students randomized into groups (i.e., conditions)? Sequential coin flips or rolls of the dice? Stratified on some basis (e.g., pretest)? Etc.
(c) How different types of instructional methods might be tested can make differences. Pre- vs. post-test studies might (probably will!) produce different outcomes than RCTs. Correlational (i.e., descriptive) studies are likely to produce different (and more widely variable) ESs than other types of studies.
Well, here's the thing: These are important concerns. They do not, in my view, however, provide sufficient reason for rejecting analyses of ESs. I see them as arguments for good practice in meta-analysis. A meta-analyst could code for these differences: She could (a) have something simple like a 0-1 code for each study about whether it used control SD or pooled SD (and, obviously, the code could be expanded to cover other variations); (b) include codes for sampling and sample assignment methods employed in each study; and (c) categorize (i.e., code) types of research designs for the studies in the corpus of the review.
The big point here is that scientifically savvy meta-analysts _code_ for these sorts of differences in a corpus of studies and then they _analyze_ the ESs according to those codes. That is, they go beyond simply reporting overall effect sizes; they report the moderators that affect those main effects: not just age, gender, ethnicity, and such, important as they may be, but also methodological moderators.
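To sketch what that coding-then-analysing step might look like in practice (the effect sizes and design labels below are entirely hypothetical, and a real meta-analysis would also weight studies by their precision):

```python
import numpy as np

# Hypothetical coded corpus: each study gets an effect size and a design code.
effect_sizes = np.array([1.2, 0.9, 1.1, 0.4, 0.3, 0.2, 0.35, 0.7, 0.8])
design_codes = np.array(["correlational", "correlational", "correlational",
                         "rct", "rct", "rct", "rct",
                         "quasi", "quasi"])

# Analyse effect sizes by moderator code rather than pooling everything into one number.
for code in np.unique(design_codes):
    subset = effect_sizes[design_codes == code]
    print(f"{code:>14}: mean ES = {subset.mean():.2f} (k = {len(subset)})")
```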
So, Jay, I'll agree with Prof. Hattie about the importance of looking beyond simple ESs. Code one's studies thoroughly and carefully. Get to know the studies like a primary teacher gets to know her first graders. But do it systematically, so that one can test questions about their ESs (e.g., do correlational studies yield different ESs than RCTs?).
So, should I accept or dismiss any concerns about Hattie's research? Definitely maybe. But I'd prefer to defer my decision until advocates of either position demonstrate the actual evidence that objectively shows why I should go left or right, up or down, forward or backward.
Now, if I may, let me add this tag: I hope we, as educators concerned about evidence and research, press to have our researchers use methods of open science so that we can examine results from meta-analyses (and other research) systematically. Publish not just your report of your meta-analysis, but your actual freaking coding system and your data.
Thanks for letting me interrupt!
JohnL
John Wills Lloyd, Ph.D.
Professor Emeritus, UVA School of Ed & HD
Co-editor, Exceptional Children
Editor, https://www.SpecialEducationToday.com/