Monday, 18 January 2016

An investigation of the evidence John Hattie presents in Visible Learning


At the 2005 ACER conference (p14) Hattie said, "We must contest the evidence – as that is the basis of a common understanding of progression." Then in Visible Learning [VL] he quotes Karl Popper: "Those amongst us unwilling to expose their ideas to the hazard of refutation do not take part in the scientific game" (p4).

Concerned about the lack of quality educational research, the Australian Government Productivity Commission made a number of recommendations in its 2017 report:

"A range of processes can be used to ensure the findings from completed research are robust. These include independent validation of the findings, peer review of research, publication of all outputs to enable scrutiny and debate (irrespective of findings), and the provision of project data for secondary analysis" (p19).

Ironically, the Maths curriculum for all Victorian schools (the state in which Hattie lives) details the following criterion for ALL students to achieve by the end of Year 10:

"Evaluate statistical reports in the media and other places by linking claims to displays, statistics and representative data." Mathematics Statistics and Probability Levels 7-10A.

We expect our students to evaluate statistical claims, but we rarely do this ourselves, least of all with educational statistics, and in particular with Hattie's claims.

Tom Bennett, the founder of researchED, wrote an influential paper, 'The School Research Lead', in which he states, 'There exists a good deal of poor, misleading or simply deceptive research in the ecosystem of school debate' (p9).

'Where research contradicts the prevailing experiential wisdom of the practitioner, that needs to be accounted for, to the detriment of neither but for the ultimate benefit of the student or educator' (p10).

In his excellent article "School Leadership and the cult of the guru: the neo-Taylorism of Hattie", Professor Scott Eacott says,

'The uncritical acceptance of his work as the definitive word on what works in schooling, particularly by large professional associations such as ACEL, is highly problematic' (p11).

The aim of this blog is to contest the evidence that Hattie presents in his 2009 book Visible Learning [VL] by using independent peer reviews and by analysing the studies that Hattie used.

Hattie's Aim:


Hattie uses the old-fashioned REDUCTIONIST approach, attempting to break down the complexity of teaching into simple, discrete categories or influences.

Hattie states: "The model I will present ... may well be speculative, but it aims to provide high levels of explanation for the many influences on student achievement as well as offer a platform to compare these influences in a meaningful way... I must emphasise that these ideas are clearly speculative" (p4).

Hattie uses two statistics, the Effect Size (d) and the Common Language Effect Size (CLE), to interpret, compare and rank educational influences.

The effect size is supposed to measure the change in student achievement. However, each study measured achievement differently. Most studies did not use a robust experimental design (as Hattie claims, p8) but rather simple correlational analysis. Also, many studies did not measure achievement at all, but something else, e.g., IQ, hyperactivity, behavior, or engagement. Hattie then ranks these effect sizes from largest (self-report grades) to smallest.
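To make this concrete, here is a minimal sketch of how a standardised effect size (Cohen's d) is computed; the numbers are hypothetical, purely for illustration. Two studies can produce identical effect sizes from completely different outcome measures:

```python
# Minimal sketch of a standardised mean difference (Cohen's d).
# All numbers are hypothetical, for illustration only.

def cohens_d(mean_treatment, mean_control, pooled_sd):
    """Difference in group means, expressed in units of the pooled SD."""
    return (mean_treatment - mean_control) / pooled_sd

# Study A: outcome is a standardised maths test (achievement).
d_maths = cohens_d(mean_treatment=52.0, mean_control=48.0, pooled_sd=10.0)

# Study B: outcome is an IQ measure - not achievement at all.
d_iq = cohens_d(mean_treatment=103.0, mean_control=99.0, pooled_sd=10.0)

print(d_maths, d_iq)  # both 0.4, yet they measure entirely different things
```

Both studies yield d = 0.4, but averaging or ranking them together assumes they measure the same underlying thing, which they do not.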

This is the classic problem of comparing apples to oranges and has led many scholars to question the validity and reliability of Hattie's effect sizes and rankings, e.g., Higgins and Simpson (2011):

"We argue the process by which this number has been derived has rendered it practically meaningless" (p199).

Blatchford et al. (2016) state that Hattie's comparison of effect sizes "is not really a fair test" (p96).

Dr Neil Hooley, in his review of Hattie, talks about the complexity of classrooms and the difficulty of controlling variables: "Under these circumstances, the measure of effect size is highly dubious" (p44).

Professor Dylan Wiliam agrees: "... the effect sizes proposed by Hattie are, at least in the context of schooling, just plain wrong. Anyone who thinks they can generate an effect size on student learning in secondary schools above 0.5 is talking nonsense." The US national effect size benchmarks support Professor Wiliam's contention.

"To believe Hattie is to have a blind spot in one’s critical thinking when assessing scientific rigour. To promote his work is to unfortunately fall into the promotion of pseudoscience. Finally, to persist in defending Hattie after becoming aware of the serious critique of his methodology constitutes wilful blindness.”  Professor Pierre-Jérôme Bergeron

The Common Language Effect Size (CLE) is a probability statistic that is usually used to interpret the effect size. However, three peer reviews showed Hattie calculated every CLE incorrectly (he reported probabilities that were negative or greater than 1!). As a result, he now claims the CLE statistic is not important and instead focuses on the interpretation that an effect size of d = 0.40 is the 'hinge point', claiming this is equivalent to a year's progress. There are, however, significant problems with this interpretation.
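For readers unfamiliar with the CLE: under the usual assumptions (two independent, normally distributed groups with equal variance), it is the probability that a randomly chosen student from the treatment group outscores a randomly chosen student from the control group, computed as CLE = Φ(d/√2) (McGraw and Wong, 1992). A minimal sketch:

```python
# Minimal sketch of the Common Language Effect Size (CLE), assuming two
# independent normal distributions with equal variance (McGraw and Wong, 1992).
from math import sqrt
from statistics import NormalDist

def cle(d):
    """Probability a random treatment score beats a random control score."""
    return NormalDist().cdf(d / sqrt(2))

print(cle(0.40))   # ~0.61
print(cle(1.44))   # ~0.85
print(cle(-2.00))  # ~0.08 - even a large negative d gives a valid probability
```

Because the CLE is the output of a cumulative distribution function, it can never be negative or exceed 1; the values outside [0, 1] reported in VL can only be calculation errors.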

Hattie's Interpretation of the Meta-analyses:

Meta-analysis as a methodology has been widely criticised for not representing the original studies faithfully:

'the methodology used, neglects the original theory that drives the primary studies it seeks to review' (p4), Myburgh (2016).

But Hattie takes this interpretation problem to another level, as his methodology is meta-meta-analysis (Snook et al., 2009). Hattie claims to faithfully interpret a wide range of meta-analyses which use different experimental designs, on different groups of people (university students, doctors, nurses, tradesmen, and sometimes high school students!), with vastly different measures of student achievement, or often no measure of achievement at all!

As Emeritus Professor Peter Blatchford points out about Hattie's VL,

"it is odd that so much weight is attached to studies that don't directly address the topic on which the conclusions are made" (p13).

A typical example of this is Hattie's influence 'reducing disruptive behavior', where he uses 3 meta-analyses (see the menu on the right for details) to get a low average effect size of 0.34. Hattie has often used the polemic 'THE DISASTERS and GOING BACKWARD' to describe influences with low effect sizes. Yet most teachers would say that reducing disruptive behavior is one of their major aims.

A detailed look at the 3 meta-analyses is revealing: one compares the achievement of students with ADHD with a normative group; another compares the behavior of students with 'emotional/behavioral' disturbance (EBD) with a normative group. The third does focus on reducing disruptive behavior, but measures behavior, NOT achievement! In addition, one meta-analysis gets a large negative result, i.e., student achievement decreased!

HOW CAN REDUCING DISRUPTIVE BEHAVIOUR DECREASE STUDENT ACHIEVEMENT?

A thorough look at this study clearly shows Hattie's misinterpretation. There is significant doubt as to why it was included in this category at all, given it is about emotionally disturbed kids. Moreover, its effect size was calculated as (emotional/behavioral disturbance achievement) - (normative achievement), giving a NEGATIVE result, yet in other behavioral studies the effect size is calculated the opposite way, giving a positive result!

Hattie should have adjusted for this inconsistency. If this were done, the influence would rocket up Hattie's rankings to #6. If he removed this study entirely, its ranking would rise further to #3, and if he also removed the study on ADHD kids, to #2! This would then be consistent with teacher experience. A sketch of the arithmetic follows.
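Here is a minimal sketch of the arithmetic involved, using hypothetical effect sizes chosen so the naive average matches the 0.34 Hattie reports (the real values differ; the point is the sign convention, not the exact numbers):

```python
# Hypothetical effect sizes for the three meta-analyses in this influence,
# chosen so the naive average reproduces Hattie's reported 0.34.
# The third is negative only because its difference was computed as
# (disturbed group - normative group) rather than the reverse.
effect_sizes = [0.78, 0.64, -0.40]
naive_average = sum(effect_sizes) / len(effect_sizes)    # 0.34

# Flip the inconsistent sign convention before averaging.
consistent = [0.78, 0.64, 0.40]
consistent_average = sum(consistent) / len(consistent)   # ~0.61

print(round(naive_average, 2), round(consistent_average, 2))
```

A single sign-convention inconsistency nearly doubles the average, and with it the influence's position in the rankings.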

A key tenet of the scientific method is reliability; this simple analysis demonstrates how unreliable Hattie's rankings are.

The menu on the right displays our detailed analyses so far, starting with the background to the research: Effect Size, Student Achievement, the CLE statistic, a Year's Progress, and Validity & Reliability. Then come the two highest-ranked influences, Self-Report Grades and Piagetian Programs; other analyses are added as time permits.

With Evidence Like This Who Needs Your Opinion:


In spite of these significant errors, Hattie often proclaims, "statements without evidence are just opinions". This belittles teacher experience and opinion and elevates his so-called evidence and rankings above them.

Yet in Hattie's 2017 publication, Learning strategies: a synthesis and conceptual model, he appears to have made a complete retraction and argues against ranking!

“There is much debate about the optimal strategies of learning, and indeed we identified >400 terms used to describe these strategies. Our initial aim was to rank the various strategies in terms of their effectiveness but this soon was abandoned. There was too much variability in the effectiveness of most strategies depending on when they were used during the learning process …” (p9).

The Problem of Breaking Down the Complexity of Teaching into Simple Categories, Influences or 'Buckets':

'The partitioning of teaching into smallest measurable units, a piecemeal articulation of how to improve student learning, is not too removed from the work of Taylor over 100 years ago. Despite its voluminous and fast expanding literatures, educational administration remains rooted to the same problems of last century' (p10). Eacott (2017).

A similar critique of reductionism is given by Professor Robert Sapolsky in his course 'Introduction to Human Behavioral Biology' (see 47:20 - 48:30).


Peer Reviews:


Professor John O'Neill has reviewed these influences: micro-teaching, professional development, providing formative evaluation, comprehensive interventions for learning disabled students, feedback, spaced vs. massed practice, problem-solving teaching, metacognition strategies, teaching strategies, co-operative vs. individualistic learning, study skills and mastery learning.

Dr Kristen Dicerbo has analysed self-report grades.

Dr Mandy Lupton has analysed Problem-Based and Inquiry-Based Learning.

Professors Higgins and Simpson have published Hattie's calculation errors.

Professor Arne Kåre Topphol also published Hattie's calculation errors (in Norwegian); a summary is here.

Emeritus Professor Ivan Snook, et al., give a general critique of VL focusing on the lack of quality studies and the problems of Hattie's rankings and generalisations. They use class size and homework as examples.

Professor Ewald Terhart published a general critique of Hattie's methodology and raised issues about Hattie's conflict of interest.

Hattie's retort to Snook et al. and Terhart is basically a defence of meta-analysis as a methodology.

Snook, Clark, Harker, Anne-Marie O'Neill and John O'Neill then responded to Hattie's retort.

Dr Myburgh analysed Hattie's retort to Snook et al. and Terhart, above. Myburgh focuses on the critique of meta-analysis as a methodology rather than on the specific critiques of Hattie's misrepresentations.

Professor Bergeron published Hattie's calculation errors plus other issues about correlation studies and misinterpretation.

Professor Adrian Simpson, "The main issue is that whatever the quality of evidence provided by the aggregation of effect sizes in the manner seen in Marzano (1998), Hattie (2009) and the EEF toolkit (Higgins et al. 2013), it is not evidence of more and less effective educational interventions, it does not indicate where there is more bang for our educational buck. It is evidence of research areas which are more or less susceptible to research design manipulation: areas where it is easier to make what may be educationally unimportant differences stand out through methodological choices. That is, standardised effect size is a research tool for individual studies, not a policy tool for directing whole educational areas" (p14).

Professor Scott Eacott's critique of the 'cult of Hattie': how and why it came to be, and its dangers.

Kelvin Smythe has written extensively on Hattie; his work is summarised here.

Whilst not directly about Hattie's evidence for feedback, David Didau gives an excellent overview of the evidence for feedback here. Also, Gary Davies has an excellent blog - Is Education Research Scientific?

If you want to contribute please let me know. Many of the controversial influences only have 1-3 meta-analyses to read. I can provide you copies of most of the research used.

"Garbage in, Gospel out” Dr Gary Smith (2014)

What has often been missed is that Hattie prefaced his book with significant doubt: "I must emphasise these are clearly speculative" (p4). Yet his rankings have taken on "gospel" status due to: major promotion by politicians, administrators and principals (it's in their interest, e.g., class size); very little contesting by teachers (they don't have the time, and who is going to challenge the principal?); and limited access to scholarly critiques - see Gary Davies' excellent blog on this.

'Hattie’s work has provided school leaders with data that appeal to their administrative pursuits' (p3). Eacott (2017).

"Materialists and madmen never have doubts" G. K. Chesterton

Interestingly, his reservation has changed to an authority and certainty at odds with the caution that ALL of the authors of his studies recommend, e.g., on class size and ability grouping: caution due to the lack of quality studies, the inability to control variables, major differences in how achievement is measured, and the many confounding variables. There is also significant critique by scholars who identify the many errors Hattie makes, from major calculation errors and excessive inference to misrepresenting studies, e.g., Higgins and Simpson (2011).

The Rise of the Policy Entrepreneur:


Science begins with scepticism; however, in the hierarchical leadership structures of educational institutions, sceptical teachers are not valued, although, ironically, the sceptical skills of questioning and analysis are valued in students. This paves the way for the many 'snake oil' remedies and the rise of policy entrepreneurs who 'shape and benefit from school reform discourses'.

Professor John O'Neill, in analysing Hattie's influence on New Zealand education policy, describes the process well:

"public policy discourse becomes problematic when the terms used are ambiguous, unclear or vague" (p1). The "discourse seeks to portray the public sector as ‘ineffective, unresponsive, sloppy, risk-averse and innovation-resistant’ yet at the same time it promotes celebration of public sector 'heroes' of reform and new kinds of public sector 'excellence'. Relatedly, Mintrom (2000) has written persuasively in the American context, of the way in which ‘policy entrepreneurs’ position themselves politically to champion, shape and benefit from school reform discourses" (p2).

Hattie's recent public appearance in the TV documentary 'Revolution School' confirms Professor O'Neill's analysis. Dan Haesler reports that Hattie's remedy cost the school around $60,000.

Professor Ewald Terhart (2011): "A part of the criticism on Hattie condemns his close links to the New Zealand Government and is suspicious of his own economic interests in the spread of his assessment and training programme (asTTle)" (p434).

Ambiguous, Unclear or Vague?


There are many examples of ambiguity in the detailed analysis of each influence on the right menu, but the first striking one is in Hattie's preface to VL:

"It is not a book about classroom life, and does not speak to the nuances and details of what happens within classrooms."

However, many influences such as class size, teacher subject knowledge, teacher training, ability grouping, student control, mentoring, teacher immediacy, problem-based learning, exercise, welfare, and homework are considered to be about classroom life, yet Hattie has given them a low ranking.

In Hattie's 2012 update of VL he does an about-face and says,
"I could have written a book about school leaders, about society influences, about policies – and all are worthwhile – but my most immediate attention is more related to teachers and students: the daily life of teachers in preparing, starting, conducting, and evaluating lessons, and the daily life of students involved in learning" (preface).

Also, in his presentations, he describes many of these low ranked influences as DISASTERS! 

This seems to DEFY widespread teacher experience.


Is Hattie's Evidence Stronger than Other Researchers' or Widespread Teacher Experience?


A summary of the major issues scholars have found with Hattie's work (details on the page links on the right):

  • Hattie misrepresents studies, e.g., peer evaluation in 'self-report', and studies on emotionally disturbed students are included in 'reducing disruptive behavior'.
  • Hattie often reports the opposite conclusion to that of the actual authors of the studies he reports on, e.g. 'class-size', 'teacher training', 'diet' and 'reducing disruptive behavior'.
  • Hattie jumbled together and averaged the effect sizes of different measurements of student achievement: teacher tests, IQ, standardised tests, and physical tests like rallying a tennis ball against a wall.
  • Hattie jumbled together and averaged effect sizes for studies that do not use achievement but something else, e.g. hyperactivity in the Diet study, i.e., he uses these as proxies for achievement, which he advised us NOT to do in his 2005 ACER presentation.
  • The studies are mostly about non-school or abnormal populations, e.g., doctors, nurses, university students, tradesmen, pre-school children, and 'emotionally/behaviorally' disturbed students.
  • The US Education Department's benchmark effect sizes per year level indicate another layer of complexity in interpreting effect sizes: studies need to control for the age of students as well as the time over which the study runs. Hattie does not do this.
  • Related to the US benchmarks is Hattie's use of d = 0.40 as the hinge point for judgments about what is a 'good' or 'bad' influence. The U.S. benchmarks show this is misleading (see the sketch after this list).
  • Most of the studies Hattie uses are not high-quality randomised controlled studies but much poorer-quality correlation studies.
  • Most scholars are cautious/doubtful in attributing causation to separate influences in the precise surgical way in which Hattie infers. This is because of the unknown effect of outside influences or confounds.
  • Hattie makes a number of major calculation errors, e.g., negative probabilities.
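As flagged above, here is a minimal sketch of why a single hinge point misleads. Annual growth in achievement, expressed as an effect size, shrinks as students get older; the figures below are rough illustrative values in the spirit of the US benchmark studies, NOT the published figures. The same observed d then translates into very different amounts of 'progress' at different year levels:

```python
# Illustrative, approximate annual-growth effect sizes by stage of schooling.
# These are rough values in the spirit of the US benchmark studies,
# not the published figures - the shape of the decline is what matters.
annual_growth = {
    "early primary": 1.00,  # young students gain a lot per year
    "upper primary": 0.40,
    "secondary":     0.20,  # older students gain far less per year
}

observed_d = 0.40  # Hattie's hinge point

for stage, growth in annual_growth.items():
    years = observed_d / growth
    print(f"{stage}: d = {observed_d} is about {years:.1f} year(s) of progress")
```

On these figures, d = 0.40 is well under half a year's progress in an early primary class but two years' progress in a secondary class; no single hinge point can mean "a year's progress" for both.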

Misrepresentation:


The reducing disruptive behavior studies are one example of Hattie's misrepresentation. Another key example is class size, where Hattie interprets the meta-analyses differently to the actual authors of the studies. I was surprised to find this is a common issue in VL.

Class Size:


For example, Glass and Smith (1979), one of the 3 studies that Hattie uses for class size, summarise their data in a graph and table:


The trend and the difference between good and poor quality research are clearly displayed. The authors conclude,

"A clear and strong relationship between class size and achievement has emerged... There is little doubt, that other things being equal, more is learned in smaller classes" (p15).


Hattie uses an average from the above table of d = 0.09 (averaging is itself another issue, discussed below; and it seems the average is actually closer to d = 0.25). Hattie concludes class size has minimal impact on student learning. In fact, he goes further: in his 2005 ACER presentation (using this research) he calls class size a DISASTER! At other times he interprets d < 0.40 as "going backward"!
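A minimal sketch (with hypothetical numbers, loosely in the spirit of the Glass and Smith table rather than their actual data) of how averaging every pairwise class-size comparison dilutes the result: large effects from substantial reductions are swamped by near-zero effects from trivial ones.

```python
# Hypothetical effect sizes for pairwise class-size reductions,
# keyed as (larger class, smaller class). NOT Glass and Smith's actual data.
comparisons = {
    (40, 35): 0.02,  # trivial reduction, near-zero effect
    (35, 30): 0.03,
    (30, 25): 0.05,
    (25, 20): 0.10,
    (20, 15): 0.20,
    (20, 10): 0.35,  # substantial reductions show real effects
    (15, 5):  0.50,
}

overall = sum(comparisons.values()) / len(comparisons)
big_cuts = [d for (big, small), d in comparisons.items() if big - small >= 10]

print(round(overall, 2))                        # ~0.18: looks 'minimal'
print(round(sum(big_cuts) / len(big_cuts), 2))  # ~0.42: a different story
```

Which comparisons go into the average largely determines the headline number, which is one reason the single figure Hattie reports sits so far below what Glass and Smith themselves concluded.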


The Quality of the Research:

'Extraordinary claims require extraordinary evidence.' Carl Sagan

Generally, Hattie dismisses the need for quality and makes the astonishing caveat that there is "... no reason to throw out studies automatically because of lower quality" (p11).

Also, his constant proclamation that 'it is the interpretations that are critical, rather than data itself' (VL 2012 summary, p3) is worrying, as it is the opposite of the scientific method paradigm, as Emeritus Professor Ivan Snook, et al., explain:

"Hattie says that he is not concerned with the quality of the research ..., of course, quality is everything. Any meta-analysis that does not exclude poor or inadequate studies is misleading, and potentially damaging if it leads to ill-advised policy developments. He also needs to be sure that restricting his data base to meta-analyses did not lead to the omission of significant studies of the variables he is interested in" (p2).

Professor John O'Neill wrote a significant letter to the NZ Education Minister regarding the poor quality of Hattie's research, in particular the overuse of studies about university, graduate or pre-school students and the danger of making classroom policy decisions without consulting other forms of evidence, e.g., case and naturalistic studies: "The method of the synthesis and, consequently, the rank ordering are highly problematic" (p7).

The U.S. Department of Education has set up the National Center for Education Research, whose focus is to investigate the quality of educational research. Its results are published in the What Works Clearinghouse, which also publishes a Teacher Practice Guide that differs markedly from Hattie's results - see Other Researchers.

Importantly, they focus on the QUALITY of the research and reserve their highest rating for research that uses randomised division of students into a control and an experimental group. Where students are non-randomly divided into control and experimental groups, in what they term a quasi-experiment, a moderate rating is used; the two groups must also show some measure of equivalence before the intervention. A low rating is used for other research designs, e.g., correlation studies.

Given that most of the research Hattie uses is correlation based, he has skilfully managed to sidestep the quality debate within school circles (but not within the academic community - see References).

Self-Report Grades - the Highest Ranked Influence??


Hattie concludes the 'best' influence is self-reported grades, with d = 1.44, which he interprets as advancing student achievement by 3+ years (on his own hinge-point conversion, 1.44 / 0.40 ≈ 3.6 years' progress)!

This is an AMAZING claim if true: that merely predicting your grade somehow magically improves your achievement to that extent. I hope my beloved "under-achieving" Australian football team, the St Kilda Saints, are listening: "boys, you can make the finals next year just by predicting you will - you don't need to do all that hard training!"

Whilst it may be simpler and easier to see teaching as a set of discrete influences, the evidence shows that these influences interact in ways that no-one, as yet, can quantify. It is the combining of influences in a complex way that defines the 'art' of teaching.

A Teacher's Lament:


Gabbie Stroud resigned from her teaching position and wrote:
"Teaching – good teaching - is both a science and an art. Yet in Australia today [it]… is considered something purely technical and methodical that can be rationalised and weighed.

But quality teaching isn't borne of tiered 'professional standards'. It cannot be reduced to a formula or discrete parts. It cannot be compartmentalised into boxes and 'checked off'. Good teaching comes from professionals who are valued. It comes from teachers who know their students, who build relationships, who meet learners at their point of need and who recognise that there's nothing standard about the journey of learning. We cannot forget the art of teaching – without it, schools become factories, students become products and teachers: nothing more than machinery."


John Oliver gives a funny overview of the problems with Scientific Studies:


Another overview of the issues with published studies: