Blog

Four reasons instructional coaching is currently the best-evidenced form of CPD

At the ResearchEd 2018 National Conference, Steve Farndon, Emily Henderson and I gave a talk about instructional coaching. In my part of the presentation, I argued that instructional coaching is currently the best-evidenced form of professional development we have. Steve and Emily spoke about their experience of coaching teachers and embedding coaching in schools. This blog is an expanded version of my part of the presentation…

What is instructional coaching?

Instructional coaching involves an expert teacher working with a novice in an individualised, classroom-based, observation-feedback-practice cycle. Crucially, instructional coaching involves revisiting the same specific skills several times, with focused, bite-sized bits of feedback specifying not just what but how the novice needs to improve during each cycle.

In many ways, instructional coaching is the opposite of regular INSET CPD, which tends to involve a broad, one-size-fits-all training session delivered to a diverse group of teachers, involving little practice and no follow-up.

Instructional coaching is also very different to what we might call business coaching, in which the coach asks a series of open questions to draw out the answers that people already, in some sense, know deep down. Instructional coaches are more directive, very intentionally laying a trail of breadcrumbs to move the novice from where they are currently, to where the expert wants them to be.

Some instructional coaching models include a rubric outlining the set of specific skills that a participant will be coached on. Others are even more prescriptive, specifying a range of specific techniques for the teacher to master. There are also a range of protocols or frameworks available to structure the coaching interaction, with Bambrick-Santoyo’s Six Step Model being among the most popular.

Examples of established instructional coaching programmes for teachers include the SIPIC programme, the TRI model, Content Focused Coaching and My Teaching Partner. In the UK, Ark Teacher Training and the Institute for Teaching are two prominent users of instructional coaching.

What is the evidence for instructional coaching?

In 2007, a careful review of the literature found only nine rigorously evaluated CPD interventions in existence. This is a remarkable finding, which shows how little we knew about effective CPD just a decade ago.

Fortunately, there has been an explosion of good research on CPD since then and my reading of the literature is that instructional coaching is now the best-evidenced form of CPD we have. In the rest of the blog, I will set out four ways in which I think the evidence base for instructional coaching is superior.

Before I do, here are some brief caveats and clarifications:

  • By “best evidenced”, I mean the quality and quantity of underpinning research
  • I am talking about the form of CPD not the content (more on this later)
  • This is a relative claim, about it being better evidenced than alternative forms (such as mentoring, peer learning communities, business-type coaching, lesson study, analysis-of-practice, etc). Remember, ten years ago, we knew very little about effective CPD at all!
  • I am talking about the current evidence-base, which (we hope) will continue to develop and change in coming years.

Strength 1: Evidence from replicated randomised controlled trials

In 2011, a team of researchers published the results from a randomised controlled trial of the My Teaching Partner (MTP) intervention, showing that it improved results on Virginia state secondary school tests by an effect size of 0.22. Interestingly, pupils whose test scores improved the most were taught by the teachers who made the most progress in their coaching sessions.

Randomised controlled trials (RCTs) are uniquely good at isolating the impact of interventions, because the process of randomisation makes the treatment group (those participating in MTP) and control group (those not) identical in expectation. If the two groups are identical, then any difference in outcomes must be the result of the one remaining difference – participating in the MTP programme. Unfortunately, randomisation does not guarantee that the two groups are identical in any particular trial. There is a small chance that, even if MTP has zero effect on attainment, a well-run RCT will occasionally conclude that it has a positive impact (so-called random confounding).
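To see how random confounding can arise even in a well-run trial, here is a minimal simulation sketch in Python (entirely made-up data, nothing from the MTP studies): even when the true effect is exactly zero, a small share of trials will still declare a statistically significant positive effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_null_trial(n_per_arm=100):
    """Simulate one RCT of an intervention whose true effect is zero."""
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    _, p_value = stats.ttest_ind(treatment, control)
    # A "false positive": significant at the 5% level, with the treatment group ahead
    return p_value < 0.05 and treatment.mean() > control.mean()

n_trials = 10_000
false_positives = sum(run_null_trial() for _ in range(n_trials))
print(f"Null trials declaring a positive effect: {false_positives / n_trials:.1%}")  # about 2.5%

# The chance of the same false positive arising in two independent trials is far smaller
print(f"Expected rate across two independent replications: {0.025 ** 2:.4%}")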

This is where replication comes in. In 2015 the same team of researchers published the results from a second, larger RCT of the MTP programme, which found similar positive effects on attainment. The chances of two good trials mistakenly concluding that an intervention improved attainment, when in fact it had no effect, are far smaller than for a single trial. The replication therefore adds additional weight to the evidence base.

There are, however, other CPD interventions with evidence from replicated RCTs, meaning this is not a unique strength of the evidence on coaching.

Strength 2: Evidence from meta-analysis

In 2018, a team of researchers from Brown and Harvard published a meta-analysis of all available studies on instructional coaching. They found 31 causal studies (mostly RCTs) looking at the effects of instructional coaching on attainment, with an average effect size of 0.18. The average effect size was lower in studies with larger samples and in interventions that targeted general pedagogical approaches; however, these effects were still positive and statistically significant.

A second, smaller meta-analysis looking at CPD interventions in literacy teaching also concluded that coaching interventions were the most effective in terms of increasing pupil attainment.

The evidence from the replicated MTP trials described above shows that good instructional coaching interventions can be effective. The evidence from meta-analysis reviewed here broadens this out to show that evaluated coaching programmes work on average.

How does this compare to other forms of CPD? There are very few meta-analyses relating to other forms of professional development, and those we do have employ weak inclusion criteria, making it hard to interpret their results.

Strength 3: Evidence from A-B testing

Instructional coaching is a form of CPD. In practice, it must be combined with some form of content in order to be delivered to teachers. This raises the question of whether the positive evaluation results cited above are due to the coaching, or to the content which is combined with the coaching. Perhaps the coaching component of these interventions is like the mint flavouring in toothpaste: very noticeable, but not in fact an active ingredient in bringing about reduced tooth decay.

In February 2018, a team of researchers from South Africa published the results from a different type of randomised controlled trial. Instead of comparing treatment and control groups, they compared a control group to A) a group of teachers trained on new techniques for teaching reading at a traditional “away day” and B) a group of teachers trained on the exact same content using coaching. This type of A-B testing provides an opportunity to isolate the active ingredients of an intervention.

The results showed that pupils taught by teachers given the traditional “away day” type training showed no statistically significant increase in reading attainment. By contrast, pupils taught by teachers who received the same content via coaching improved their reading attainment by an effect size of 0.18. The coaching was therefore a necessary component of the training being effective. A separate A-B test in Argentina in 2017 also found coaching to be more effective than traditional training on the same content.

Besides these two coaching studies, there are very few other A-B tests on CPD interventions. Indeed, a 2017 review of the A-B testing literature found only one evaluation in which the two treatment arms produced different results – a programme combining analysis-of-practice with video cases. While very promising, this analysis-of-practice intervention does not yet have evidence from replicated trials or meta-analysis.

Strength 4: Evidence from systematic research programmes

A difficulty in establishing the superiority of one form of CPD is that you need to systematically test the other available forms. The Investing in Innovation (I3) Fund in the US does just this by funding trials on a wide range of interventions, as long as they have some evidence of promise. Since 2009, it has spent £1.4bn testing 67 different interventions.

The chart below shows the results from 31 RCTs investigating the impact of interventions on English attainment (left chart) and a further 23 on maths attainment (right chart). Bars above zero indicate a positive effect, and vice versa. Green bars indicate a statistically significant effect and orange bars indicate an effect which, statistically speaking, cannot be confidently distinguished from zero. [i]

[Chart: effect sizes from I3-funded evaluations of interventions on English attainment (left) and maths attainment (right)]

Two things stand out from this graph. First, most interventions do not work. Just seven out of thirty-one English and three out of twenty-three maths interventions had a positive and statistically significant effect on pupil attainment. This analysis provides a useful approximation of what we can expect across a broad range of CPD interventions.[ii]

In order to compare instructional coaching with the evidence from I3 evaluations, I constructed an identical chart including all the effect sizes I could find from school-age instructional coaching evaluations. The chart (below) includes all the relevant studies from the Kraft et al meta-analysis for which effect sizes could be straightforwardly extracted [iii], plus three additional studies [iv]. Of the sixteen studies included, eleven showed positive, statistically significant impacts on attainment. This compares very favourably to I3 evidence across different forms of CPD.

[Chart: effect sizes from instructional coaching evaluations, presented in the same format as the I3 chart above]

Conclusion

Instructional coaching is supported by evidence from replicated randomised controlled trials, meta-analysis, A-B testing and systematic research programmes. I have looked hard at the literature and I cannot find another form of CPD for which the evidence is this strong.

To be clear, there are still weaknesses in the evidence base for instructional coaching. Scaled-up programmes tend to be less effective than smaller programmes and the evidence is much thinner for maths and science than for English. Nevertheless, the evidence remains stronger than for alternative forms of CPD.

How should school leaders and CPD designers respond to this? Where possible, schools should strongly consider using instructional coaching for professional development. Indeed, it would be hard to justify the use of alternative approaches in the face of the existing evidence.

Of course, this will not be easy. My co-presenters Steve Farndon and Emily Henderson, both experienced coaches, were keen to stress that establishing coaching in a school comes with challenges.

Unfortunately, in England, lesson observation has become synonymous with remedial measures for struggling teachers. Coaches need to demonstrate that observation for the purposes of instructional coaching is a useful part of CPD, not a judgement. I have heard of one school tackling this perception by beginning coaching with senior and middle leaders. Only once this had come to be seen as normal did they invite classroom teachers to take part.

Another major challenge is time. Emily Henderson stressed that if coaching sessions are missed it can be very hard to get the cycle back on track. Henderson would ensure that the coaching cycle was the first thing to go in the school diary at the beginning of the academic year and she was careful to ensure it never got trumped by other priorities. Some coaching schools have simply redistributed INSET time to coaching, in order to make this easier.

Establishing coaching in your school will require skilled leadership. For the time being however, coaching is the best-evidenced form of professional development we have. All schools that aspire to be evidence-based should be giving it a go.

[i] I wouldn’t pay too much attention to the relative size of the bars here, since attainment was measured in different ways in different studies.
[ii] Strictly speaking, only 85% of these were classed as CPD interventions. The other 15% involve other approaches to increasing teacher effectiveness, such as altering hiring practices. It should be noted that the chart on the left also includes some coaching interventions!
[iii] It should be noted that I did not calculate my own effect sizes or contact original authors where effect sizes were not reported in the text. To the extent that reporting of effect sizes is related to study findings, this will skew the picture.
[iv]
Albornoz, F., Anauati, M. V., Furman, M., Luzuriaga, M., Podesta, M. E., & Taylor, I. (2017) Training to teach science: Experimental evidence from Argentina. CREDIT Research Paper.
Bruns, B., Costa, L., & Cunha, N. (2018) Through the looking glass: Can classroom observation and coaching improve teacher performance in Brazil? Economics of Education Review, 64, 214-250.
Cilliers, J., Fleisch, B., Prinsloo, C., Reddy, V., Taylor, S. (2018) How to improve teaching practice? Experimental comparison of centralized training and in-classroom coaching. Working Paper.


Teacher shortages: are a handful of schools a big part of the problem?

Sam Sims and Rebecca Allen

We recently met a Newly Qualified Teacher (NQT), let's call her Ellen, who had been delighted to get her first teaching job in a North London primary school deemed outstanding by Ofsted. She arrived on the first day of term looking forward to the challenge of teaching, but by lunchtime it dawned on her that the school had lost 100% of its classroom teaching staff since the previous academic year. At the time, she wondered what could have happened to make all these teachers leave.

She soon found out however, as she spent the next year being pressured into an unsustainable workload and subjected to highly bureaucratic and, at times, callous management. At the end of the year, all the classroom teaching staff left the school. Many of them, including Ellen, left the state education sector altogether.

We wanted to know whether this was an isolated anecdote or a more widespread problem. So in our paper for the February issue of the National Institute Economic Review we use the School Workforce Census to quantify the number of schools that recruit an unusually high proportion of NQTs and see an unusually high proportion of such teachers leave the profession within a year.

Identifying such schools is, however, challenging. For example, a small primary school might only recruit one teacher every few years. If this happens to be an NQT, and that teacher decides to leave through no fault of the school, then this school would show up as recruiting 100% NQTs and losing 100% of them. This is clearly not the same kind of phenomenon as a school losing 100% of their classroom teachers two years in a row. We needed a way to distinguish the two.

In order to avoid treating small schools overly harshly, we used a technique from the medical statistics literature called funnel plots, which were introduced by David Spiegelhalter, currently President of the Royal Statistical Society. The two funnel plots below show the proportion of teachers recruited by each school in the North West of England across the last five years that were Newly Qualified (left-hand side) and the proportion of these that left within a year (right-hand side). The horizontal line shows the regional average and the curved lines show the “control limits” beyond which we argue that schools are displaying unusually high turnover – note that these limits are wider for smaller schools. For full details of our method and the rationale for funnel plots, see our paper here.
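To give a flavour of how the control limits in a funnel plot are constructed, here is a simplified sketch in Python using the common normal approximation (the exact construction in our paper may differ, and every number below is hypothetical): the limits around the overall average proportion fan out as the number of teachers a school recruits shrinks, so small schools need a far more extreme leaver rate before they are flagged.

```python
import numpy as np

def funnel_limits(p_bar, n, z=3.09):
    """Approximate funnel-plot control limits for a proportion.

    p_bar : overall (e.g. regional) average proportion
    n     : number of teachers recruited by each school
    z     : 1.96 for ~95% limits, 3.09 for ~99.8% limits
    """
    n = np.asarray(n, dtype=float)
    se = np.sqrt(p_bar * (1 - p_bar) / n)      # binomial standard error shrinks as n grows
    lower = np.clip(p_bar - z * se, 0, 1)
    upper = np.clip(p_bar + z * se, 0, 1)
    return lower, upper

# Hypothetical example: flag schools whose NQT leaver rate sits above the upper limit
p_bar = 0.15                              # illustrative regional average leaver rate
recruits = np.array([1, 5, 10, 40])       # NQTs recruited by four hypothetical schools
leavers = np.array([1, 2, 4, 20])         # how many of them left within a year
_, upper = funnel_limits(p_bar, recruits)
print((leavers / recruits) > upper)       # only the largest school is flagged
```

In this toy example the tiny school that loses its only NQT stays inside the limits, while the large school with a consistently high leaver rate does not – which is exactly the distinction the funnel plot is designed to make.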

Our analysis identified 122 schools in England that both use and lose an unusually high proportion of NQTs from the profession. These schools have an NQT wastage rate three times the national average and between them lost 577 NQTs from the profession between 2010 and 2014. We show that if these schools had attrition rates equivalent to the average school then 376 additional teachers would have progressed beyond their NQT year. For context, this is equivalent to 22 percent of the nationwide shortfall of teachers in 2015.

So it seems that Ellen was indeed unlucky: very-high-turnover schools are rare. Unfortunately, they are still common enough to be making a material contribution to the system-wide teacher shortage. What can be done about this? Funnel plots are a simple, reliable and low-cost way of identifying these schools. This would allow them to be provided with additional support to improve their retention of teachers. Alternatively, we might consider providing this information to trainee teachers to allow them to make a more informed choice about where to accept their first teaching job. If Ellen had enjoyed access to that information, she might still be in the classroom now.


This piece originally appeared on the IoE and NIESR blogs.


Could We Get the Best Teachers into the Most Deprived Schools?

Sam Sims

In a recent IOE London blog post, Professor Becky Francis highlighted wide and persistent gaps in GCSE attainment and university entry rates between rich and poor pupils. This follows the recent Social Mobility Commission report, which argued that policy makers have spent too much time on structural reforms to the schooling system and not placed a high enough priority on getting the best teachers into struggling schools, echoing Francis’ own research. Francis concludes that, in order to improve social mobility, we need to do much more to “support and incentivise the quality of teaching in socially disadvantaged neighbourhoods.”

In recent work, Rebecca Allen and I found that there are indeed reasons to be concerned about disadvantaged pupils’ access to good teachers. Experience (or lack thereof) is a good indicator of teacher quality. We found that pupils in the most deprived fifth of schools are around twice as likely to get an unqualified teacher, and a quarter more likely to get a teacher with less than five years of experience, when compared to pupils in the least deprived fifth of schools. Moreover, we found that, even within schools, disadvantaged pupils are more likely to be assigned to an inexperienced teacher.

Research from the US shows that pupils who get access to good teachers have higher attainment at school, are more likely to attend university and have higher adult earnings. This suggests that redistributing teachers would indeed improve the life chances of disadvantaged pupils. But is it actually feasible to redistribute good teachers using incentives? For such a proposal to work we need to be able to identify good teachers, attract them to work in disadvantaged schools, and keep them there. What does existing evidence tell us about each of these three points?

A number of carefully evaluated initiatives from the US suggest that it is indeed possible to attract teachers to disadvantaged areas. The Talent Transfer Initiative offered teachers $20,000 spread over two years to move to the most disadvantaged schools in their district. Around 5% of the eligible teachers took up the offer. A similar policy in Washington State offered teachers a $5,000 per annum bonus and successfully increased the number of applications to work in disadvantaged schools.

There is also plenty of evidence to suggest that teachers working in disadvantaged schools can be incentivised to stay working there. In 2001, the North Carolina Bonus Programme offered shortage-subject (e.g. maths and science) teachers $1,800 per year to remain in disadvantaged schools and an evaluation showed that this reduced the probability of teachers leaving their school by around 17%. Careful evaluations of similar programmes in Florida and Georgia found similar effects.

But can we identify good teachers in the first place? The Talent Transfer Initiative in the US relied on statistical models to try to isolate individual teachers' contribution to pupil progress, or "value added". Teachers who ranked in the top fifth on this measure were eligible for the bonus. However, the value-added approach relies on annual standardised testing to measure what pupils know when they enter a teacher's classroom and then again when they leave at the end of the year. It also requires data linking individual pupils with their teachers. In England, we currently have neither. Lesson observations and teachers' academic credentials have been shown to be unreliable and/or weak indicators of teacher quality, making them poor alternatives.

Unlike in the Talent Transfer Initiative, eligibility for the Washington State policy was determined by whether a teacher held an advanced teaching qualification awarded by the National Board for Professional Teaching Standards. Research shows both that National Board certification is predictive of teachers’ having a higher value-added score and that randomly assigning students to teachers who scored highly on the qualification improves their attainment. However, certification itself is a fairly weak indicator of quality, which might explain why the Washington State policy had no impact on pupil attainment, whereas the Talent Transfer Initiative did have a small positive impact.

Using incentives to eliminate or indeed reverse the current inequalities in access to good teachers would therefore require us to develop reliable indicators of teacher quality. National Board Certification, along with other carefully validated teacher assessment tools such as the Classroom Assessment Scoring System, demonstrate that this can be done. They can also be used to help improve the quality of teachers already working in disadvantaged schools. Nevertheless, attaching powerful monetary incentives to acquiring such qualifications does create the potential for abuse. Guarding against misuse and maintaining trust in such a system would therefore be another necessary condition of using teacher incentives to improve the life chances of disadvantaged pupils.

This piece was originally written for the Institute of Education blog.


WHAT’S THE POINT OF THEORY IN APPLIED SOCIAL SCIENCE?

Sam Sims

Applied quantitative researchers use statistical methods to answer questions with direct practical applications. Generally this involves trying to isolate causal relationships so that policy makers or practitioners can be provided with reliable advice about how to achieve their goals. Labour economists, for example, study which training schemes increase employment. Education researchers evaluate the effect of different teacher training programmes on teacher retention. And criminologists try to identify how different police patrol patterns affect the crime rate. Applied research tends to attract pragmatic, empirically-minded people.

It is perhaps not surprising then that applied researchers generally have little time for theory. In my experience, they tend to see theory as impenetrable, untestable and unnecessary. In the first half of this blog I explain each of these objections; in the second half I argue that each of them is mistaken. My aim is to persuade applied quantitative researchers that they will make more progress with their research, and have more impact, if they made more use of theory in their work.

The first charge levelled against theory is that it is impenetrable. It is easy to underestimate how recently we have developed the statistical techniques, hardware, software and datasets necessary for doing applied quantitative research. Prior to this, policy researchers generally filled this empirical vacuum with theory and, as Noah Smith has pointed out, where theory is the only game in town, the competition for publication spots tends to become a battle for who can generate the most sophisticated ideas. This leads to a proliferation of theories in social science that are variously too complex to guide research design, so expansive as to make data collection prohibitively costly, or so nuanced as to make falsification impossible. Many applied researchers conclude that engaging with this sort of theory simply isn’t worth the hassle.

Even where falsification is possible, however, many empirically-minded researchers see testing theory as a fool's errand. Many assume that the best-case scenario is that a plausibly causal relationship inconsistent with a theory is identified, sometimes loosely referred to as falsification. But ex ante the researcher faces a substantial risk of failing to falsify the theory, a result that is less valuable still. The risks involved in this sort of research therefore dilute the incentives for testing theory.

Finally, even if an applied researcher found a suitable theory and were in principle willing to take the risk of trying to falsify it, many would still argue that testing theoretical relationships is unnecessary. Why not just conduct evaluations of existing policies or programmes that provide results about causal relationships which are of direct interest to policy makers? Ultimately, testing theories always seems one step removed from the most pressing problems facing applied researchers: does it work?

Though each of these arguments contains a grain of truth, they are all wrong in important ways.

The tide is now turning on overly complicated theory in the social sciences. Behavioural economics has persuaded many more economists to study simple heuristics rather than mathematically cumbersome models of sub-rational decision making. In political economy, to take another example, Dani Rodrik and Sharun Mukand have recently developed a theory that can explain something as complex as why liberal democracy does or does not emerge, using only six basic concepts. But perhaps the best illustration of the trend towards simpler theory comes from sociology. The American Sociological Association recently held a meeting to debate whether the discipline needs to make its theory more manageable. One of the papers presented at that meeting demonstrates how opinion is changing in the field with its blunt title: Fuck Nuance. It is a brilliant argument for keeping theory simple enough to be useful. The social psychologist Paul Van Lange has also developed a set of useful principles (Truth, Abstraction, Progress and Applicability, or TAPAS) that can help researchers identify and develop good theories.

Recent developments have also helped to make theory more testable. In a brilliant paper, the political scientist Kevin Clarke recently showed that the only way to really provide confirmation for a theory is to test it against other competing theories. Since then, two other political scientists, Kosuke Imai and Dustin Tingley, have shown how finite mixture models can be used to do just this. They have also developed a package for the statistical software R, making it straightforward for other researchers to test which theories apply best and, crucially, under which conditions. This approach also avoids the worst of the incentive problems associated with attempts at falsification.
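Their software is written for R, but the underlying idea is easy to sketch. Below is a toy illustration in Python (my own simplification, not their method or code): given the point predictions two rival theories make for the same outcomes, an EM algorithm for a two-component finite mixture estimates what share of observations each theory best explains, and which observations those are.

```python
import numpy as np
from scipy.stats import norm

def compare_theories(y, pred_a, pred_b, n_iter=200):
    """Toy EM for a two-component finite mixture of theories.

    y      : observed outcomes
    pred_a : each observation's predicted outcome under theory A
    pred_b : each observation's predicted outcome under theory B
    Returns the estimated share of observations governed by theory A and
    each observation's posterior probability of belonging to theory A.
    """
    pi = 0.5
    sigma_a, sigma_b = np.std(y - pred_a), np.std(y - pred_b)
    for _ in range(n_iter):
        # E-step: how plausible is each observation under each theory?
        lik_a = pi * norm.pdf(y, loc=pred_a, scale=sigma_a)
        lik_b = (1 - pi) * norm.pdf(y, loc=pred_b, scale=sigma_b)
        resp = lik_a / (lik_a + lik_b)
        # M-step: update the mixing weight and each theory's error scale
        pi = resp.mean()
        sigma_a = np.sqrt(np.sum(resp * (y - pred_a) ** 2) / resp.sum())
        sigma_b = np.sqrt(np.sum((1 - resp) * (y - pred_b) ** 2) / (1 - resp).sum())
    return pi, resp

# Simulated check: theory A truly governs 70% of cases
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=500)
pred_a, pred_b = 2 * x, 1 - x                         # the rival theories' predictions
is_a = rng.uniform(size=500) < 0.7
y = np.where(is_a, pred_a, pred_b) + rng.normal(0, 0.1, size=500)
share_a, resp = compare_theories(y, pred_a, pred_b)
print(round(share_a, 2))                              # recovers roughly 0.7
```

The posterior probabilities in resp are what make the "under which conditions" question tractable: one can inspect which kinds of observations end up attributed to each theory.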

Theory is also necessary in that it provides knowledge, essential to answering the "does it work?" question, which statistics alone cannot supply. Nancy Cartwright discusses a now well-known case which highlights this. The Tamil Nadu Integrated Nutrition Project (TINP) involved provision of healthcare and feeding advice for mothers of newborns and was shown to be effective in reducing child malnutrition. This is useful statistical evidence. However, when a near identical project was implemented in Bangladesh it was shown not to have an effect. Further research then found that educating the mothers of Bangladeshi children was ineffective because important aspects of food preparation there are not generally conducted by the mother. In Angus Deaton's terms, while the intervention was replicated, the mechanism underlying its success was not. Understanding the theory behind a programme can therefore help clarify its value in a way that statistics alone cannot.

In summary, theory is becoming steadily less impenetrable, increasingly easy to test and is necessary for applied researchers to confidently infer policy advice from specific evaluations. Instead of focusing solely on evaluating existing programmes, applied researchers would likely have more impact if they collectively adopted a more cyclical approach in which they: evaluated existing programmes to identify which were most effective; tested known theories or mechanisms which may underpin effective programmes; helped design new programmes based on the most successful theories; and then conducted further evaluations of the new interventions. This approach would contribute to a virtuous cycle of policy-relevant discoveries which could allow quantitative researchers to deliver on Pawson and Tilley's ideal of finding out "what works for whom in what circumstance… and how."


DEPARTMENTAL HEADS IN THE SAND: WHY YOUR DEPARTMENT IS PERFORMING WORSE THAN YOU THINK

Sam Sims

How’s your driving? Below average, pretty normal, or better than most?

Research suggests that the majority of people reading this article have just answered 'better than most'. Many of them must be wrong: only half of drivers can be better than the median. The other half are, by definition, below it.

This is an example of what psychologists call 'Illusory Superiority', a phenomenon which is by no means limited to our skills behind the wheel. Studies have shown that we tend to overestimate our ability relative to others in a range of areas, from leadership to parenting and even social skills.

What about schools? Surely teachers, with access to Ofsted judgements, regular testing and sophisticated outcome measures, have an accurate picture of their performance?

We surveyed a representative sample of English, maths and science Heads of Department to find out. We asked them how they thought their departments performed in 2013 GCSEs, relative to departments in other schools serving similar intakes. As figure 1 shows, fully 43% thought themselves to be superior, while only 17% assessed themselves as being worse than others.

We then calculated performance measures that control for school intake and split the departments into two groups based on their scores. The results for low performing departments are particularly striking (see figure 2). Almost a quarter incorrectly believe themselves to be high performing, with another 43% believing they perform similarly to others.

So according to our data, many Department Heads do indeed suffer from illusory superiority. If you're a middle leader reading this, you probably still think you're one of the select few with an accurate assessment. But that's the thing about illusory superiority: we all think we're special.

Given the prevalence of performance data these days, how can it be that so many Department Heads don’t have an accurate picture of their own performance? Recent research suggests that the dominant reason is our desire to avoid undesirable judgements about ourselves. This urge is particularly strong when we are being assessed on important tasks, such as teaching. Some middle leaders are simply choosing to ignore the data. Departmental Heads buried in the sand.

This cannot be good for schools, or pupils. Fortunately the research on driving offers pointers on how we can keep illusions of superiority in check. Motorists were found to offer more accurate assessments of their performance when they expected their judgement to be reviewed by others afterwards, particularly if they saw the reviewer as high status.

So what should schools do? Middle leaders could help keep themselves honest by collectively reviewing each other’s exam performance. An annual inter-school meeting to analyse results would be a start. The Families of Schools database, which groups schools with similar intakes, can help make these comparisons more transparent. But what really matters is that performance is reviewed by respected, knowledgeable colleagues.

We are all prone to thinking we’re better than we really are. Even you. That makes collaboration the only reliable antidote for complacency.

This piece was originally written for the Education Datalab website.


THE DFE’S NEW RANKING OF ACADEMY CHAINS AND LOCAL AUTHORITIES IS BADLY FLAWED

Sam Sims

In March the Department for Education (DfE) released a working paper called Measuring the Performance of Schools within Academy Chains and Local Authorities. It contains the first official performance ranking of the organisations responsible for groups of schools in England: local authorities and academy chains. The performance of individual schools has been monitored and published in the UK since at least 1992, but until now the performance of groups of schools has not been systematically reported on. The new ranking therefore represents the expansion of data-driven accountability to an additional tier of the school system.

The findings have been analysed and commented on widely. Anti-academy campaigners used them to claim that academy chains do not deliver better results than local authorities. Newsnight's Chris Cook pointed out that there are high- and low-performing examples of both local authorities and academy chains and argued that instead of debating their relative merits, we should focus on working out how to emulate the most successful examples, whether that be Ark academies or Hackney local authority. Robert Hill, education adviser in the Blair government, used the findings to do just that. Hill argues that high-ranking chains tend to work in localised clusters, expand slowly and focus on pedagogy and oversight. Although they do not say it explicitly, it is safe to assume that the DfE also intends school commissioners to use the rankings when choosing academy chains to take over 'failing' schools.

This may all sound like useful, evidence-based policy analysis. But I want to argue that this sort of performance ranking is so flawed that it is effectively meaningless and therefore not very useful, either for drawing policy lessons or for making commissioning decisions.

The new performance measure developed by the DfE ranks authorities and chains based on the 'value added' they achieve for pupils across their schools. This is measured as the difference between pupils' predicted GCSE grades (based on Key Stage 2 attainment) and their actual GCSE grades. This means that secondary schools do not take the credit (or blame) for the performance of their feeder primary schools, which is sensible. It also places a lower weight on the results of schools that have just joined an academy chain, to reflect the chain's limited influence on that particular school.
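To make the logic concrete, here is a stripped-down sketch of a value-added calculation in Python (hypothetical scores and group names throughout; the DfE's actual model predicts GCSE grades from Key Stage 2 in a more sophisticated way and applies the weighting described above): predict each pupil's GCSE score from their Key Stage 2 score, take the residual as that pupil's value added, then average the residuals across each chain's or authority's pupils.

```python
import numpy as np
import pandas as pd

# Hypothetical pupil-level data: KS2 score, GCSE score, and the chain or
# local authority responsible for the pupil's secondary school
pupils = pd.DataFrame({
    "ks2":   [26, 28, 30, 27, 29, 31, 25, 33],
    "gcse":  [40, 46, 52, 41, 50, 55, 36, 60],
    "group": ["Chain A", "Chain A", "Chain A", "LA B",
              "LA B", "LA B", "Chain C", "Chain C"],
})

# Step 1: predict GCSE attainment from KS2 attainment (simple linear fit)
slope, intercept = np.polyfit(pupils["ks2"], pupils["gcse"], deg=1)
pupils["predicted_gcse"] = intercept + slope * pupils["ks2"]

# Step 2: a pupil's value added is actual minus predicted attainment
pupils["value_added"] = pupils["gcse"] - pupils["predicted_gcse"]

# Step 3: the chain/LA score is the average value added of its pupils
print(pupils.groupby("group")["value_added"].mean().round(2))
```

A positive average residual means a chain's pupils tend to outperform what their Key Stage 2 results would predict; the criticism that follows is that this residual also soaks up anything else the model leaves out.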

What it does not take account of, however, are non-school factors which influence pupil progress during secondary school. As the DfE analysts put it, their measure assumes that schools have the same "propensity for improvement" (p16). The problem is that we know this is not true. Pupils' household income, for example, is known to have a strong relationship with attainment. Leaving it out will therefore create significant inaccuracies in the ranking.

The DfE hint at incorporating such contextual factors in future versions of the ranking (p17). This may sound sensible, but we have been here before. When dissatisfaction grew with the value-added measure (first introduced in 2002) used to rank individual schools, the government developed a Contextual Value Added (CVA) measure, which tried to control for such non-school factors. But research by Lorraine Dearden and colleagues demonstrated that leaving out the level of education of a pupil's mother (data which is not generally collected) caused "significant systematic biases in school CVA measures for the large majority of schools." Stephen Gorard then pointed out that (non-random) missing data meant there were large errors in the estimates and, by extension, any ranking based on them. Adding contextual information to the DfE's new measure would therefore only repeat the mistakes made with CVA, which has since been abolished.

These flaws are severe enough to make the new measure effectively meaningless, since it is unclear whether a high score represents the influence of the local authority/chain, the influence of other factors which are not taken into account, or just statistical noise.

There is also a more general sense in which this new measure ignores lessons from recent education research. A great deal of work in the last five years has tried to identify the policies and approaches behind London schools' relative success. But Simon Burgess has now shown, using census data, that all of London's superior performance can be accounted for by differences in ethnicity and migration patterns. If he is right, then the hunt for 'what worked' in London has largely been a wild goose chase. Indeed, the suspicious concentration of London-based local authorities and academy chains at the top of the DfE's new ranking suggests that migration patterns might also be what is driving the results of their analysis. Studying successful exemplars, whether cities or academy chains, is difficult and potentially misleading.

A better approach to finding out what works is to study policies. Because we can measure the attainment of the same pupils before and after a policy is implemented, it is possible to rule out the influence of a range of other factors, even when we cannot measure them. Returning to the London example, a highly-aspirational recent immigrant before the policy is implemented is still a highly-aspirational recent immigrant after it is implemented. Their migration status therefore cannot be what is driving any observed changes in outcomes after the policy is implemented. Another benefit of studying policies is that they are easier to replicate. If a specific professional development programme for teachers is found to be effective, for example, it is fairly straightforward to deliver that programme in other schools. Other things equal, knowing that Hackney is an effective local authority just isn’t as useful.

The flaws in the DfE's new accountability measure for local authorities and academy chains are severe enough to make it effectively meaningless. The ranking based on it is therefore not very useful, either for drawing policy lessons or for making commissioning decisions. In general, evaluating policies will provide more reliable and useful insights than trying to identify and analyse examples of effective providers. Let's not repeat the mistakes of past accountability reforms.

This piece originally appeared on the LSE Politics and Policy Blog.
