Ability Grouping

There is a range of issues with the Ability Grouping (or Setting/Streaming) research:

1. The definition of ability groups differs from researcher to researcher. So, in combining disparate studies, as done in the meta-meta-analysis technique used by Hattie and the EEF, care needs to be taken in deciding which categories, from different researchers, are the same. The discussion below shows there are many issues with Hattie's judgement on this. Andrew Old also gives an excellent analysis of the EEF here, showing mistakes, misrepresentations and carelessness.

2. The outcomes measured differ, ranging from achievement to social/emotional and attitudinal outcomes. Hattie often just combines these into one effect size without discriminating between measures.

3. There are significant confounding variables, e.g., novice teachers are often assigned to the lower ability groups.

4. The choice of comparison group (the control group) matters: accelerated/gifted students can be compared with the year level they are accelerated into or with the year level they came from, and the two choices yield hugely different effect sizes (see the sketch after this list). For example, Kulik & Kulik (1992) report,
"These two types of studies produced distinctly different results. Each of the 11 studies with same-age control groups showed greater achievement in the accelerated class; the average effect size in these studies was 0.87. Studies with older comparison groups were as likely to produce positive as negative differences between groups. The average effect size in the 12 studies with older comparison groups was -0.02." (p. 76)
Hattie, firstly, mistakenly reports +0.02 (see the reference from VL (2008), Appendix A, below) and, secondly, neither mentions nor discusses the higher result of 0.87!


5. The meta-meta-analysis technique that Hattie & the EEF use counts the same studies two, three or even more times over, leading to significant bias in average effect sizes (Wecker et al. (2017); Shanahan (2017)).

Note: Hattie has finally admitted this is a major problem and has re-examined his feedback research, getting totally different results: Wisniewski, Zierer & Hattie (2020), 'The Power of Feedback Revisited',
"...a source of distortion when using a synthesis approach results from overlapping samples of studies. By integrating a number of meta-analyses dealing with effects of feedback interventions without checking every single primary study, there is a high probability that the samples of primary studies integrated in these meta-analyses are not independent of each other...Therefore, these would have to be considered as duplets–primary studies that are included in the result of the synthesis more than once–and consequently cause a distortion." (p. 2)
Hopefully, Hattie will re-examine ALL of his other influences, including Ability Grouping!

6. Selective reporting. Hattie includes some results from studies but not others, e.g., Kulik & Kulik (1992) below.

7. Most of the research is old (pre-2000).
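To make point 4 concrete, here is a minimal sketch in Python of how the choice of control group drives the effect size (Cohen's d). All scores and the cohens_d helper are invented for illustration; the 0.87 and -0.02 figures quoted above are Kulik & Kulik's, not outputs of this code.

```python
import statistics

def cohens_d(treatment, control):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = statistics.stdev(treatment), statistics.stdev(control)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Invented test scores:
accelerated = [72, 75, 78, 80, 83, 85]   # gifted students moved up a year
same_age = [60, 63, 66, 68, 71, 74]      # the year level they came from
older_peers = [71, 74, 77, 81, 84, 86]   # the year level they moved into

print(cohens_d(accelerated, same_age))    # strongly positive: acceleration looks powerful
print(cohens_d(accelerated, older_peers)) # essentially zero: accelerates merely keep pace
```

The arithmetic is identical in both calls; only the choice of comparison group changes, and with it the conclusion.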

Definitions

Kulik & Kulik (1992, pp. 73-74) define these variations-

Multilevel classes:
Students in the same grade are divided into groups (often high, middle, and low groups) on the basis of ability, and the groups are instructed in separate classrooms, either for a full day or for a single subject. Note: some researchers call these attainment groups.

Cross-grade grouping:
Children from several grades are formed into groups on the basis of their level of achievement in a subject, and the groups are then taught the subject in separate classrooms without regard to the children’s regular class placement.

Within-class grouping:
A teacher forms ability groups within a single classroom and provides each group with instruction appropriate to its level of aptitude.

Enriched classes for the gifted and talented:
Students who are high in aptitude receive richer, more varied educational experiences than would be available to them in the regular curriculum for their age level.

Accelerated classes for the gifted and talented:
Students who are high in academic aptitude receive instruction that allows them to proceed more rapidly through their schooling or to finish schooling at an earlier age than other students.

They conclude,
"Our conclusion is that effects of grouping are a function of program type." (p. 74)
Yet, Slavin (1990) defines Ability Grouping more generally as,
"...any school or classroom organisation plan that is intended to reduce the heterogeneity of each class for a given subject is reduced." (p. 471)

Hattie originally used the category "Ability Grouping" but in recent years has changed that to "Tracking/Streaming". The EEF uses "Setting/Streaming" and this is subtly different from Hattie's definition, but markedly different from the Kulik & Kulik (1992) definitions.

Hattie's Summary from Visible Learning (2008):

Hattie used the definitions below, differing a little from Kulik & Kulik (1992) by removing the categories of Cross-grade grouping and Multilevel classes and adding "general" Ability grouping & Enrichment.

"General" Ability grouping - effect size d = 0.12 (cross-grade grouping d = 0.30 but Hattie does not include this in this category, I’m not sure why).

Hattie's references from VL (2008, Appendix A):


Note: In Hattie's latest summary (January, 2021) he has changed the "General" Ability grouping to "Tracking/Streaming".

Within-class grouping - d = 0.16 (Hattie did not include the Kulik & Kulik (1992) study detailed below, which gave d = 0.25). Here's an example of a school in my state which has significantly improved students' maths scores by within-class ability grouping.

Ability grouping for gifted students - d = 0.30 (Hattie did not include the result from Kulik & Kulik (1992), where d = 0.87; if this were used, d = 0.47 - see the sketch below).

Acceleration - d = 0.88 (Hattie’s 5th ranked influence, but overlaps with gifted students).

Enrichment - d = 0.39 (significant overlap with Acceleration & Gifted).
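To see how sensitive these category averages are to inclusion decisions, here is a minimal sketch, assuming (for illustration only) that a category average is an unweighted mean of the meta-analysis results assigned to it. All values in `included` are invented; only the 0.87 comes from Kulik & Kulik (1992).

```python
# A Hattie-style category average is, roughly, an unweighted mean of the
# meta-analytic effect sizes assigned to that category, so inclusion
# decisions drive the headline number. All values below are invented,
# except 0.87 (Kulik & Kulik, 1992, same-age controls).
included = [0.30, 0.21, 0.39]   # hypothetical meta-analysis means
omitted = 0.87                  # the result Hattie leaves out

mean_without = sum(included) / len(included)
mean_with = (sum(included) + omitted) / (len(included) + 1)

print(f"category mean without 0.87: {mean_without:.2f}")  # 0.30
print(f"category mean with 0.87:    {mean_with:.2f}")     # 0.44
```

A single omitted result can move a category from well below to near Hattie's d = 0.40 'hinge point'.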

Dylan Wiliam gives a quick overview of the problems of Ability Grouping research-



Goldacre (2008) comments about meta-analysis in education:
"I think you’ll find it’s a bit more complicated than that".
De Heaume (2018) provides an excellent systematic review of homogeneous grouping here,
"As a result of the adoption of data-driven practices, HOW2Learn and Hattie’s effect sizes, schools in my local region are choosing to disband within-school ability grouping. Hattie’s meta-analysis (NSWDEC, 2015) claims that ability grouping has an effect size of 0.12, approximately a quarter of the expected growth over a 12 month period. According to this publication, ability grouping does not allow for impressive student achievement...
I was interested to see the consistent influence of Hattie’s research in schools and the disbanding of ability grouping as a result. I am interested to know the impact that removing ability grouping in the forms of whole school streamed Mathematics" (p. 4).
De Heaume concludes,
"It could be predicted that, the removal of clustered Mathematics groups will have a detrimental effect on gifted mathematicians academically. This will also be reflected in an increase in the prevalence of boredom with gifted learners" (p. 20).
Hattie's Use of the Ability Grouping Research:

There is much overlap from one category to another. For example, there is virtually no difference between the categories 'ability grouping for gifted' and 'acceleration', or between 'ability grouping' and 'cross-grade grouping'.

The researchers who dominate these categories in Hattie’s book are Kulik & Kulik. We will look at their 1992 study as Hattie includes it in many of the different influences (except 'within-class grouping').

Hattie is selective about which studies he includes in each category. For example, the Acceleration group is constructed by grouping all the gifted students together. Kulik & Kulik report some very high effect sizes of around 0.90 which could be placed in the 'ability grouping for gifted' category, which would raise that effect size significantly, but Hattie does not do this.

Technical Note: most of the meta-analyses include many of the same studies and this can lead to bias. This issue was looked at in detail in Effect Size - Problem 6 (multiple uses of the same data in several studies).
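A minimal simulation of this double-counting problem (all study labels and effect sizes invented): when overlapping meta-analyses are pooled without checking their primary studies, the most re-used studies are weighted several times over.

```python
# Three hypothetical meta-analyses, each listing the primary studies it
# synthesised as (study_id, effect_size). Study "A" (a large positive
# result) appears in all three; studies "D" and "E" appear only once.
meta_1 = [("A", 0.90), ("B", 0.10), ("C", 0.05)]
meta_2 = [("A", 0.90), ("B", 0.10), ("D", -0.20)]
meta_3 = [("A", 0.90), ("C", 0.05), ("E", 0.00)]

pooled = meta_1 + meta_2 + meta_3

# Naive pooling: every entry counts, so "A" is counted three times.
naive = sum(d for _, d in pooled) / len(pooled)

# De-duplicated pooling: each primary study counts once.
unique = {study: d for study, d in pooled}
dedup = sum(unique.values()) / len(unique)

print(f"naive average:         {naive:.2f}")   # 0.31 - inflated by the repeats of A
print(f"de-duplicated average: {dedup:.2f}")   # 0.17
```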

Study 1 (used in Ability grouping and ability grouping for Gifted):

Kulik and Kulik (1992) - Meta-analytic Findings on Grouping Programs.

They summarise:
"Meta-analytic reviews have focused on five distinct instructional programs that separate students by ability: multilevel classes, cross-grade programs, within-class grouping, enriched classes for the gifted and talented, and accelerated classes.

The reviews show that effects are a function of program type.

Multilevel classes, which entail only minor adjustment of course content for ability groups, usually have little or no effect on student achievement. Programs that entail more substantial adjustment of curriculum to ability, such as cross-grade and within class programs, produce clear positive effects. Programs of enrichment and acceleration, which usually involve the greatest amount of curricular adjustment, have the largest effects on student learning.

These results do not support recent claims that no one benefits from grouping or that students in the lower groups are harmed academically and emotionally by grouping" (p. 73).
Multi-level classes-
A total of 36 of the 51 studies examined results separately by ability level. Effects varied slightly with aptitude. The average effect size was 0.10 for higher aptitude, -0.02 for middle aptitude, and -0.01 for lower aptitude students.

The average effect size in all multi-level class programs was 0.03. 

Cross-grade grouping-

The average effect size in the 14 studies was 0.30.

Cross-grade grouping is like multi-level grouping in that students of different ability levels are taught in separate classrooms. But in cross-grade plans, there are typically more levels.

Comment - Hattie does not include this category in ability grouping, and he does not explain why.

Within Class grouping-
The average overall effect of grouping in the studies was d = 0.25. The average effect size was 0.30 for the higher-ability students, 0.18 for the middle-ability students, and 0.16 for the low-ability students. Once again, an average hides the detail.

Comment - Hattie does not include this result, but rather an earlier Kulik study (1985), where d = 0.15.

Gifted and Talented-

The average effect in the studies was 0.41.

Hattie does not report this result.

Accelerated classes for the gifted and talented-

The two types of studies produced distinctly different results. Each of the 11 studies with same-age control groups showed greater achievement in the accelerated class; the average effect size in these studies was 0.87.

However, if the (usually one year older) students are used as the control group, the average effect size in the studies was -0.02. Hattie mistakenly reports d = +0.02 in the category 'ability grouping for gifted students'.

As stated above, Hattie does not mention nor discuss the higher result of 0.87.

However, he reports this result in his Acceleration category, citing Kulik (2004). But his detailed references do not list Kulik (2004), so it is difficult to check this paper.


I suspect the 11 studies and d = 0.87 are exactly what Kulik & Kulik (1992) report.

This is another clear example that Hattie is selective as to which results he reports, and also into which category he reports them.

Kulik and Kulik (1992) conclude:
"talented youngsters who were accelerated into higher grades performed as well as the talented older students already in those grades. Second, in the subjects in which they were accelerated, talented accelerates showed almost a year's advancement over talented same-age nonaccelerates" (p. 89).
Study 2 (Acceleration):

Kulik and Kulik (1984) - Synthesis of Research on Effects of Accelerated Instruction.

Kulik and Kulik highlight the importance of what you are comparing in your study. They state that acceleration studies are divided into two groups.
"The first type of study compares the performance of accelerated students with the performance of same age non-accelerates... The second type of study compares accelerates with same grade non-accelerates of equal intelligence. Comparison groups in this type of study are equivalent in grade and IQ" (p. 86).
With same-age control groups (13 studies), d = 0.88 (Hattie uses this result, although he incorrectly states there were 26 studies).

With the year-older control group (13 studies), d = 0.05.

Note: Kulik & Kulik (1992) get similar results, but Hattie uses the lower d value.

Kulik and Kulik (1984) note an issue with confounding variables, 
"The variation seemed great enough to lead us to suspect that factors other than acceleration were playing a role in determining study outcomes" (p. 87).
On the issue of how achievement is measured they say, 
"With poorly calibrated measuring instruments, investigators can hardly expect to achieve exact agreement in their results" (p. 89).
Regarding quality control, they state that no results from correlation studies were included in this meta-analysis, only proper experimental studies (p. 89).

Study 3 (Within Class grouping):

Lou et al. (1996) - Within-Class Grouping: A Meta-Analysis.

They conducted two analyses,

Analysis 1 included studies which compared Within-Class grouping with no grouping, d = 0.17. Analysis 2 included studies which directly compared homogeneous Within-Class grouping with heterogeneous Within-Class grouping, d = 0.12 (p. 429).

Hattie reports d = 0.17 in the Within-Class Grouping category BUT d = 0.12 in the 'general' Ability Grouping category.

This is an example of problem 1 mentioned above. It seems to me Hattie has just read the abstract, where Lou et al. (1996) summarise,
"The first set included 145 effect sizes and explored the effects of grouping versus no grouping on several outcomes. Overall, the average achievement effect size was +0.17, favoring small-group learning. The second set included 20 effect sizes which directly compared the achievement effects of homogeneous versus heterogeneous ability grouping. Overall, the results favored homogeneous grouping; the average effect size was +0.12."
Hattie seems to have read that Analysis 2 was about "homogeneous versus heterogeneous ability grouping" without reading the detail that followed: this analysis was also about Within-Class grouping.

Lou et al. (1996) conclude: 
"We caution the reader that this meta-analysis, like others, does not allow one to make strong causal inferences, particularly with regard to explanatory features.

Not only were we unable to extract information from every study about the existence of particular factors, which reduces the sensitivity of the analyses, but the study features were often intercorrelated while the heterogeneity within categories of study features were not resolved in many cases, which makes unambiguous interpretation impossible and untempered conclusions unwise."
Interestingly, they analyse ability grouping by class size: in classes of more than 35 students d = 0.35, whereas in classes of fewer than 26 students d = 0.22.

Lou et al. (1996), Conclusions and Recommendations:

"The practice of within-class grouping is supported by the results of this review. Within-class grouping appears to be a useful means to facilitate student learning, particularly in large classes and especially in math and science courses" (p. 446).

They also warn of the problem of comparing effect sizes derived from different tests - as discussed in detail on the Effect Size page listed on the right,

"...we found that the effect of small-group learning was much higher when achievement was measured with teacher-made tests than when researcher-made tests were used. Achievement measured with researcher-made tests was, in turn, higher than that measured with standardized tests. Similarly, we also showed that the effect sizes were higher when the outcome measures were geared to instruction than when they were not geared to instruction. Therefore, one explanation for the difference between locally developed tests and standardized tests is that teacher-made tests may have a closer match with local instructional objectives than researcher-made tests or standardized tests. Similarly, researcher made tests may have a closer match to the local instructional objectives than standardized tests. Thus, locally made tests may reflect the large influence of within-class grouping on proximal instructional objectives, while standardized tests may reflect the small influence of within-class grouping on distal instructional objectives." (p. 460)

Study 4 (Ability Grouping):

Neber, Finsterwald & Urban (2001) - Cooperative learning with gifted and high-achieving students.

Hattie reports d = 0.33 in the 'general' Ability Grouping category. However, it is not clear how Hattie derived d = 0.33 (see table below). Also, the study is on gifted students and focuses on Co-operative versus Individual learning. So Hattie's categorising is once again very questionable; at the very least, ignoring the Cooperative vs Individual learning aspect, Hattie should have included this in the 'Gifted' category.

Neber et al. (2001) summarise in a table (p. 209),


Neber et al. (2001) summarise, 
"few methodologically sound studies can be found at present." (p. 199).
and
"...high achievers' performance improve if they learn in homogeneous groups with other high-achieving students" (p. 210).

They report many limitations of the studies, e.g.,

"...another limiting characteristic is the preference for rather short interventions. Only three of the eight studies implemented cooperative formats of instruction which lasted longer than 2 hours." (p. 206)
Study 5 (Ability Grouping):

Slavin (1990) - Achievement effects of Ability Grouping in Secondary Schools.

Hattie reports one of the lowest effect sizes, d = -0.02.

Slavin used 29 studies (mostly from the 1960s): 6 randomised, 14 correlational, and 9 matched experiments. Effect sizes differed markedly, from d = 0.28 down to d = -0.48. He then used the median (not the mean) of d = -0.02. Slavin states,
"There are few consistent patterns in the study findings" (p. 484).
Slavin defends the use of the median, 
"In pooling findings across studies, medians rather than means were used, principally to avoid giving too much weight to outliers. However, any measure of central tendency ... should be interpreted in light of the quality and consistency of the studies from which it was derived, not a finding in its own right" (p. 477).
Note that 9 studies were statistically non-significant and Slavin assigns d = 0.00 to these studies (p. 484). Other meta-analyses differ, instead using the non-significant d values as reported; very few dismiss these studies completely. The different strategies PROFOUNDLY affect the mean or median d values obtained.
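To see how much these pooling strategies matter, here is a minimal sketch with invented effect sizes spanning roughly the range Slavin reports; only the three strategies, not the numbers, are taken from the discussion above.

```python
import statistics

# Invented effect sizes for the 6 studies with statistically significant
# results, spanning roughly the range Slavin reports (0.28 down to -0.48).
significant = [0.28, 0.25, 0.22, 0.18, 0.15, -0.48]
non_significant_count = 9   # studies whose results were not significant

# Strategy 1 (Slavin): impute d = 0.00 for each non-significant study, take the median.
imputed = significant + [0.0] * non_significant_count
print(round(statistics.median(imputed), 2))   # 0.0  - the imputed zeros occupy the middle

# Strategy 2: impute the zeros, but take the mean instead.
print(round(statistics.mean(imputed), 2))     # 0.04 - pulled towards zero

# Strategy 3: drop the non-significant studies entirely and take the mean.
print(round(statistics.mean(significant), 2)) # 0.1  - driven by the significant results only
```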

Similar to many other researchers, Slavin cautions the reader that many of the studies use approximation techniques to derive an effect size and these should be interpreted with even more caution than usual (p. 477).

Slavin also talks about the problem of varying sample size: "all other things being equal", studies with more students provide better evidence (p. 484). Note: Hattie has been criticised heavily for ignoring this issue.

Slavin also notes other major confounding variables:

-the problem of dropouts becomes serious in senior high school as those who are most likely to drop out are in the low tracks. This could reduce the differences between high and low track students (p. 488).

-students in higher tracks are also likely to be higher in such attributes as motivation, internal locus of control, self-esteem, and effort, factors that are not likely to be controlled in correlation studies (p. 489).

-high and low track students usually differ in pretests or IQ by 1-2 standard deviations, an enormous systematic difference for which no statistical procedure can adequately control (p. 489).

Slavin concludes,
"The present review cannot provide definitive answers " (p. 491).
He recommends a move to more proper experimental design using randomised control experiments rather than correlational studies (p. 490).

Other Commentary:

Professor Maureen Hallinan (1990) - The Effects of Ability Grouping in Secondary Schools: A Response to Slavin's Best-Evidence Synthesis.

Slavin is used by Hattie for a variety of influences, including ability grouping. Hallinan is very critical of Slavin's research. Her critique is also instructive for all research on ability grouping, and in many ways it also points out issues with much of the research that Hattie uses.

The Problem with Averaging:
"The fact that the studies Slavin examines show no direct effect of ability grouping on student achievement is not surprising. The studies compare mean achievement scores of classes that are ability grouped to those that are not. Since means are averages, they reveal nothing about the distribution of scores in the two kinds of classes. Ability grouping may increase the spread of test scores while leaving the mean unchanged.

This would occur if the practice had a differential impact on students with different abilities. Since teachers generally gear instruction to the ability level of the students being taught, students in a high ability group are likely to receive more and faster instruction and those in low ability groups less and slower instruction than pupils in an ungrouped class where instruction is geared to the average of the class. If greater gains of high achievers balance lesser gains of slow students in a grouped class, there should be no overall impact on the mean achievement of the class, compared to a heterogeneous class, even though the variance of the test scores in the two classes may differ markedly.

Studies comparing only mean would show no direct effect of grouping on achievement" (p. 501).
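Hallinan's point is easy to demonstrate. Here is a minimal sketch (all gains invented) of a grouped class where high achievers gain more and low achievers gain less, leaving the class mean unchanged while the spread widens.

```python
import statistics

# Invented post-test gains for a class of six (three high-ability, three low-ability).
ungrouped = [12, 11, 10, 10, 9, 8]   # instruction geared to the class average
grouped = [18, 16, 15, 5, 4, 2]      # high group gains more, low group gains less

print(statistics.mean(ungrouped), statistics.mean(grouped))    # identical means: 10 and 10
print(statistics.stdev(ungrouped), statistics.stdev(grouped))  # spread: ~1.4 vs ~7.1

# A study comparing only the two means reports "no effect of grouping",
# even though the distribution of outcomes has changed dramatically.
```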
Example: Kelley & Camilli (2007) on Teacher Training, comparing teachers.




The Complexity of Classroom Instruction:

"None of the studies referred to by Slavin takes into account either the content and pace of instruction or the pedagogical practice. The research systematically ignores instructional and curricular differences across classes. This is a fatal flaw in studies aimed to evaluate grouping effects" (p. 502).
The Issue of What is Used to Measure Achievement:
"Slavin's studies rely almost solely on standardised test scores to measure achievement. This outcome measure has well-known limitations. Standardised tests are not adequate measures of what students are taught in school. They can be viewed more accurately as tests of general ability or intelligence rather than of mastery of the curriculum. Failure to observe differences in standardised test scores across students is a poor measure of grouping effects... In general, Slavin's conclusions are based on a limited and flawed measure of student learning" (p. 502).
The Bias of Experimental Studies versus Case Studies:
"Finally, Slavin's selection of studies is skewed heavily in favour of experimental research. There are only a few surveys in his sample, and there is a complete absence of case studies... In so doing, he disregards important field work, such as that by Oakes (1985), Rosenbaum (1976), and others. Their work shows distinct differences in instructional techniques, teacher interactions, reward systems, student motivation, effort and self-esteem, student behaviour, disciplinary measures, administrative load, role modelling, and peer influences by level of ability group. It is difficult to believe that these dramatic findings are not related to differential learning patterns across ability groups. It may be that the design of some of the experimental studies Slavin examines hides the richness of the learning process—a complexity that is better detected by more in-depth studies" (p. 503).
Demirsky's (1991) review of Kulik's and Slavin's work picks up most of the issues that have already been discussed:
"If educators are to make informed decisions based on the findings ... they must study the original research and be sure the questions they're asking are the same ones posed by the researchers" (p. 60).
There are major issues depending on the tests used to measure achievement; for example, the scores of gifted students usually approach the ceiling on standardised tests, making it difficult to show significant academic improvement on their part (p. 61).
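A minimal sketch of this ceiling problem (all scores and the 100-point ceiling are invented): once gifted students' true scores exceed what the test can record, their measured growth collapses.

```python
import statistics

CEILING = 100   # maximum score the (hypothetical) standardised test can record

# Invented true achievement of six gifted students before and after a year of
# enriched instruction; several true post-scores exceed what the test can show.
true_pre = [88, 90, 92, 94, 95, 97]
true_post = [100, 104, 106, 109, 111, 115]

observed_post = [min(score, CEILING) for score in true_post]   # the test clips at 100

print(statistics.mean(true_post) - statistics.mean(true_pre))      # true gain: ~14.8
print(statistics.mean(observed_post) - statistics.mean(true_pre))  # measured gain: ~7.3
```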

Another major criticism by teachers is that the standardised tests do not evaluate what they are teaching (p. 61).
"The most destructive aspect of the controversy over ability grouping is the misrepresentation of the findings." (p. 62).
There has been a great deal of misrepresentation and misinterpretation of the research. Educators need to be critical consumers (p. 64).

Some Other Thoughts on Mixed Ability vs Setting

1 comment: