Ofsted and Workload: A concise review of the evidence

I recently tweeted:

“I think this is my favourite @TeacherTapp finding so far… Ofsted spends its time trying to make all schools like Good/Outstanding schools, but in the process makes RI/Inadequate schools unlike Good/Outstanding schools in important respects.”

About this graph:


Several people objected that this is just a cross-sectional correlation, and that I therefore cannot justify this conclusion based on this graph. I agree. It might be a spurious correlation, or indeed reverse causation. At the very least, you need longitudinal data to support that kind of interpretation. But I am not basing my interpretation on this graph alone. I am basing it on this graph plus the academic literature. So, throwing off the character limits imposed by Twitter, here is the rest of the evidence on which my interpretation is based:

First, this is not the only finding of a cross-sectional correlation between Ofsted grade and teacher workload. The nationally representative TALIS 2013 data shows that teachers working in Good or Outstanding schools spent 82-84% of their time teaching, while those in RI schools spent 77% and those in Inadequate schools 75% (Micklewright et al., 2014). The 2016 teacher workload survey also came to a similar finding (Higton et al., 2017).

In addition, a range of small-scale longitudinal studies have shown that workload increases with inspections, within schools. Male (1999) reports a one-year prospective longitudinal survey of teachers in a single inspected school. The results show that mean hours worked and the proportion of teachers working atypical hours increased in the run-up to the inspection. The proportion reporting atypical working hours did not fall back to pre-inspection levels in the immediate aftermath of the inspection. Case, Case & Catling (2000) report on a detailed, three-year, prospective qualitative study in three inspected schools. They find an increase in workload around the time of the inspection, particularly paperwork and documentation. Jeffrey (2002) reports on a 1.5-year, prospective longitudinal qualitative study in six schools, finding an increase in workload around the time of inspections. Chapman (2002) discusses ten retrospective case studies of inspected schools, finding that teachers “at all levels” reported that inspection contributed to workload. Perryman (2009) reports a detailed, two-year, prospective, longitudinal study of a single inspected school, finding a marked increase in workload in the build-up to the inspection. Courtney (2016) reports a retrospective survey of 36 headteachers in recently inspected schools and reaches similar conclusions, adding that the introduction of a new inspection framework has prompted additional work as heads update and revise documentation. Taken together, this research provides good evidence that inspection precedes workload, rather than workload preceding inspection.

This evidence does not, however, rule out spurious correlations. Perhaps a third variable, such as leadership, is causing both increased workload and low Ofsted judgements? Again, I would argue that this is implausible based on the existing evidence. In the government’s Workload Challenge consultation, 53% of a (non-random) sample of respondents cited ‘accountability/perceived pressures of Ofsted’ as a driver of their workload burden (Gibson et al., 2015). Similarly, in TALIS 2013, 85% of respondents reported that ‘accountability (e.g. Ofsted, performance tables)’ adds significantly to the pressure of their jobs. This self-report survey evidence is corroborated by qualitative research, which illuminates the mechanisms through which Ofsted increases workload. Case et al. (2000), Jeffrey (2002), Plowright (2007), Perryman (2009), Courtney (2016) and Perryman et al. (2018) all come to the same finding: Ofsted increases workload because teachers have to dedicate time to generating the type of paper-based evidence that inspectors can consume during a visit, such as schemes of work and data. As one teacher put it, “There was an obsession with recording everything in writing so you could prove that you’d done it… Something that might normally have been just a practical activity had to have some element of it written down” (Case et al., 2000, p. 616). Notice that this additional work is separate from, and additional to, the activity of educating pupils. Rather, it aims to make activity visible to people not present at the time.

Interestingly, the advent of short-notice inspections appears to have changed the way in which inspection creates workload, because schools maintain a constant state of readiness for inspection (Perryman et al., 2018; Allen & Sims, 2018). As one teacher interviewed in our book explained: “There is so much work I could pinpoint over the past seven years that was just done for Ofsted and has since gone in the bin because it hasn’t been needed. It was always just-in-case” (Allen & Sims, 2018, p. 93). A review of the literature on inspection found that this sort of over-documentation is a common side effect of school inspection systems across countries (de Wolf & Janssens, 2007).

None of the studies cited here would individually support my interpretation of the graph. Taken together, however, they provide a weight of evidence for the interpretation that Ofsted prompts RI and Inadequate schools to spend more time engaged in educationally useless activity in order to produce auditable material for inspectors. At present, it is hard to determine the amount of additional workload involved. Returning to the Teacher Tapp graph, 77% of respondents in RI/Inadequate schools and 50% of those in Outstanding schools report that “much” activity is driven by how things look to observers, including Ofsted. It is also noteworthy that, of the 7.4 extra hours that teachers in England spend at work each week compared to their international counterparts, the bulk is attributable to administration, management, marking and planning (Sellen, 2016); though, of course, we do not know how much of this is attributable to the inspection system. Precisely quantifying this would require longitudinal teacher survey data including workload measures, preferably collected using complete weekly diary methods. Fortunately, the DfE are currently considering the business case for such a study. This will provide a more complete evaluation of the costs of the current school inspection model, to be weighed against the benefits.

Allen, R., & Sims, S. (2018). The Teacher Gap. London: Routledge.

Case, P., Case, S., & Catling, S. (2000). Please show you’re working: a critical assessment of the impact of OFSTED inspection on primary teachers. British Journal of Sociology of Education, 21(4), 605-621.

Chapman, C. (2002). Ofsted and School Improvement: teachers’ perceptions of the inspection process in schools facing challenging circumstances. School Leadership & Management, 22(3), 257-272.

Courtney, S. J. (2016). Post-panopticism and school inspection in England. British Journal of Sociology of Education, 37(4), 623-642.

Gibson, S., Oliver, L., & Dennison, M. (2015). Workload challenge: Analysis of teacher consultation responses. Department for Education Research Report DFE-RR445. London: Department for Education.

Higton, J., Leonardi, S., Richards, N., Choudhoury, A., Sofroniou, N., & Owen, D. (2017). Teacher workload survey 2016. DfE Research Report RR633. London: Department for Education.

Jeffrey, B. (2002). Performativity and primary teacher relations. Journal of Education Policy, 17(5), 531-546.

Male, D. B. (1999). Special school inspection and its effects on teachers’ stress and health, workload and job-related feelings: a case study. European Journal of Special Needs Education, 14(3), 254-268.

Micklewright, J., Jerrim, J., Vignoles, A., Jenkins, A., Allen, R., Ilie, S., Bellabre, F., Barrera, F., & Hein, C. (2014). Teachers in England’s secondary schools: Evidence from TALIS 2013. London: Department for Education.

Perryman, J. (2009). Inspection and the fabrication of professional and performative processes. Journal of Education Policy, 24(5), 611-631.

Perryman, J., Maguire, M., Braun, A., & Ball, S. (2018). Surveillance, governmentality and moving the goalposts: the influence of Ofsted on the work of schools in a post-panoptic era. British Journal of Educational Studies, 66(2), 145-163.

Plowright, D. (2007). Self-evaluation and Ofsted inspection: developing an integrative model of school improvement. Educational Management Administration & Leadership, 35(3), 373-393.

Sellen, P. (2016). Teacher workload and professional development in England’s secondary schools: Insights from TALIS. London: Education Policy Institute.

de Wolf, I. F., & Janssens, F. J. (2007). Effects and side effects of inspections and accountability in education: an overview of empirical studies. Oxford Review of Education, 33(3), 379-396.

Four reasons instructional coaching is currently the best-evidenced form of CPD

At the ResearchEd 2018 National Conference, Steve Farndon, Emily Henderson and I gave a talk about instructional coaching. In my part of the presentation, I argued that instructional coaching is currently the best-evidenced form of professional development we have. Steve and Emily spoke about their experience of coaching teachers and embedding coaching in schools. This blog is an expanded version of my part of the presentation…

What is instructional coaching?

Instructional coaching involves an expert teacher working with a novice in an individualised, classroom-based, observation-feedback-practice cycle. Crucially, instructional coaching involves revisiting the same specific skills several times, with focused, bite-sized bits of feedback specifying not just what but how the novice needs to improve during each cycle.

In many ways, instructional coaching is the opposite of regular INSET-style CPD, which tends to involve a broad, one-size-fits-all training session delivered to a diverse group of teachers, involving little practice and no follow-up.

Instructional coaching is also very different to what we might call business coaching, in which the coach asks a series of open questions to draw out the answers that people already, in some sense, know deep down. Instructional coaches are more directive, very intentionally laying a trail of breadcrumbs to move the novice from where they are currently, to where the expert wants them to be.

Some instructional coaching models include a rubric outlining the set of specific skills that a participant will be coached on. Others are even more prescriptive, specifying a range of specific techniques for the teacher to master. There are also a range of protocols or frameworks available to structure the coaching interaction, with Bambrick-Santoyo’s Six Step Model being among the most popular.

Examples of established instructional coaching programmes for teachers include the SIPIC programme, the TRI model, Content Focused Coaching and My Teaching Partner. In the UK, Ark Teacher Training and the Institute for Teaching are two prominent users of instructional coaching.

What is the evidence for instructional coaching?

In 2007, a careful review of the literature found only nine rigorously evaluated CPD interventions in existence. This is a remarkable finding, which shows how little we knew about effective CPD just a decade ago.

Fortunately, there has been an explosion of good research on CPD since then and my reading of the literature is that instructional coaching is now the best-evidenced form of CPD we have. In the rest of the blog, I will set out four ways in which I think the evidence base for instructional coaching is superior.

Before I do, here are some brief caveats and clarifications:

  • By “best evidenced”, I mean the quality and quantity of underpinning research
  • I am talking about the form of CPD not the content (more on this later)
  • This is a relative claim, about it being better evidenced than alternative forms (such as mentoring, peer learning communities, business-type coaching, lesson study, analysis-of-practice, etc). Remember, ten years ago, we knew very little about effective CPD at all!
  • I am talking about the current evidence-base, which (we hope) will continue to develop and change in coming years.

Strength 1: Evidence from replicated randomised controlled trials

In 2011, a team of researchers published the results from a randomised controlled trial of the My Teaching Partner (MTP) intervention, showing that it improved results on Virginia state secondary school tests by an effect size of 0.22. Interestingly, pupils whose test scores improved the most were taught by the teachers who made the most progress in their coaching sessions.

Randomised controlled trials (RCTs) are uniquely good at isolating the impact of interventions, because the process of randomisation makes the treatment group (those participating in MTP) and control group (those not) identical in expectation. If the two groups are identical, then any difference in outcomes must be the result of the one remaining difference – participation in the MTP programme. Randomisation does not, however, guarantee that the two groups are identical. Even if MTP had zero effect on attainment, there is a small chance that a well-run RCT would conclude that it has a positive impact (so-called random confounding).

This is where replication comes in. In 2015 the same team of researchers published the results from a second, larger RCT of the MTP programme, which found similar positive effects on attainment. The chances of two good trials mistakenly concluding that an intervention improved attainment, when in fact it had no effect, are far smaller than for a single trial. The replication therefore adds additional weight to the evidence base.
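The value of replication can be quantified with a little arithmetic. The sketch below uses the conventional 5% significance level and assumes the two trials are independent; this is my own simplification for illustration, not a calculation from the MTP papers:

```python
# Probability of a false positive (Type I error) in a single
# well-run trial at the conventional 5% significance level.
alpha = 0.05
p_single = alpha

# If two trials are independent, the probability that BOTH
# mistakenly find an effect where none exists is the product.
p_replicated = alpha * alpha

print(f"Single trial:      {p_single:.1%}")     # 5.0%
print(f"Replicated trials: {p_replicated:.2%}")  # 0.25%
```

In other words, replication shrinks the false-positive risk from roughly 1-in-20 to roughly 1-in-400.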

There are, however, other CPD interventions with evidence from replicated RCTs, meaning this is not a unique strength of the evidence on coaching.

Strength 2: Evidence from meta-analysis

In 2018, a team of researchers from Brown and Harvard published a meta-analysis of all available studies on instructional coaching. They found 31 causal studies (mostly RCTs) looking at the effects of instructional coaching on attainment, with an average effect size of 0.18. The average effect size was lower in studies with larger samples, and in interventions that targeted general pedagogical approaches; however, these effects were still positive and statistically significant.
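For readers unfamiliar with how an average effect size is produced from many studies, here is a minimal fixed-effect pooling sketch using inverse-variance weights. The study numbers are invented for illustration; they are not the data from the Kraft et al. meta-analysis:

```python
import math

# Hypothetical study results: effect size and standard error.
# More precise (usually larger) studies have smaller standard errors.
effects = [0.25, 0.10, 0.30, 0.15]
ses = [0.10, 0.05, 0.12, 0.08]

# Fixed-effect meta-analysis: weight each study by 1/SE^2,
# so more precise studies count for more in the pooled average.
weights = [1 / se ** 2 for se in ses]
pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"Pooled effect size: {pooled:.2f} (SE {pooled_se:.3f})")
```

Note how the precise 0.10 study pulls the pooled estimate down towards itself: weighting by precision is exactly why pooled averages can sit below the simple mean of the reported effects.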

A second, smaller meta-analysis looking at CPD interventions in literacy teaching also concluded that coaching interventions were the most effective in terms of increasing pupil attainment.

The evidence from the replicated MTP trials described above shows that good instructional coaching interventions can be effective. The evidence from meta-analysis reviewed here broadens this out to show that evaluated coaching programmes work on average.

How does this compare to other forms of CPD? There are very few meta-analyses relating to other forms of professional development, and those we do have employ weak inclusion criteria, making it hard to interpret their results.

Strength 3: Evidence from A-B testing

Instructional coaching is a form of CPD. In practice, it must be combined with some form of content in order to be delivered to teachers. This raises the question of whether the positive evaluation results cited above are due to the coaching, or to the content with which the coaching is combined. Perhaps the coaching component of these interventions is like the mint flavouring in toothpaste: very noticeable, but not in fact an active ingredient in bringing about reduced tooth decay.

In February 2018, a team of researchers from South Africa published the results from a different type of randomised controlled trial. Instead of comparing treatment and control groups, they compared a control group to A) a group of teachers trained on new techniques for teaching reading at a traditional “away day” and B) a group of teachers trained on the exact same content using coaching. This type of A-B testing provides an opportunity to isolate the active ingredients of an intervention.

The results showed that pupils taught by teachers given the traditional “away day” type training showed no statistically significant increase in reading attainment. By contrast, pupils taught by teachers who received the same content via coaching improved their reading attainment by an effect size of 0.18. The coaching was therefore a necessary component of the training being effective. A separate A-B test in Argentina in 2017 also found coaching to be more effective than traditional training on the same content.

Besides these two coaching studies, there are very few other A-B tests on CPD interventions. Indeed, a 2017 review of the A-B testing literature found only one evaluation in which the two treatment arms produced different results – a programme based on joint analysis-of-practice using video cases. While very promising, this analysis-of-practice intervention does not yet have evidence from replicated trials or meta-analysis.

Strength 4: Evidence from systematic research programmes

A difficulty in establishing the superiority of one form of CPD is that you need to systematically test the other available forms. The Investing in Innovation (I3) Fund in the US does just this by funding trials on a wide range of interventions, as long as they have some evidence of promise. Since 2009, they have spent £1.4Bn testing 67 different interventions.

The chart below shows the results from 31 RCTs investigating the impact of interventions on English attainment (left chart) and a further 23 on maths attainment (right chart). Bars above zero indicate a positive effect, and vice versa. Green bars indicate a statistically significant effect and orange bars indicate an effect which, statistically speaking, cannot be confidently distinguished from zero. [i]


What stands out from this graph is that most interventions do not work: just seven out of thirty-one English and three out of twenty-three maths interventions had a positive and statistically significant effect on pupil attainment. This analysis provides a useful approximation of what we can expect across a broad range of CPD interventions.[ii]
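For readers wondering how the green/orange distinction on such charts is typically drawn, here is a generic sketch (illustrative only, not the I3 evaluators' actual procedure): an effect is conventionally flagged as statistically significant when its 95% confidence interval excludes zero.

```python
def is_significant(effect, se, z=1.96):
    """An effect is significant at the 5% level if its 95%
    confidence interval (effect +/- 1.96 * SE) excludes zero."""
    lower, upper = effect - z * se, effect + z * se
    return lower > 0 or upper < 0

# Hypothetical (effect size, standard error) pairs.
results = [(0.20, 0.05), (0.10, 0.09), (-0.05, 0.02)]
flags = [is_significant(d, se) for d, se in results]
print(flags)  # [True, False, True]
```

Note that the middle effect (0.10) is the same size as some "green" effects in the literature; it fails the test only because its standard error is large, which is why bar height alone is a poor guide.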

In order to compare instructional coaching with the evidence from I3 evaluations, I constructed an identical chart including all the effect sizes I could find from school-age instructional coaching evaluations. The chart (below) includes all the relevant studies from the Kraft et al. meta-analysis for which effect sizes could be straightforwardly extracted [iii], plus three additional studies [iv]. Of the sixteen studies included, eleven showed positive, statistically significant impacts on attainment. This compares very favourably to the I3 evidence across different forms of CPD.

Coaching I3


Instructional coaching is supported by evidence from replicated randomised controlled trials, meta-analysis, A-B testing and evidence from systematic research programmes. I have looked hard at the literature and I cannot find another form of CPD for which the evidence is this strong.

To be clear, there are still weaknesses in the evidence base for instructional coaching. Scaled-up programmes tend to be less effective than smaller programmes and the evidence is much thinner for maths and science than for English. Nevertheless, the evidence remains stronger than for alternative forms of CPD.

How should school leaders and CPD designers respond to this? Where possible, schools should strongly consider using instructional coaching for professional development. Indeed, it would be hard to justify the use of alternative approaches in the face of the existing evidence.

Of course, this will not be easy. My co-presenters Steve Farndon and Emily Henderson, both experienced coaches, were keen to stress that establishing coaching in a school comes with challenges.

Unfortunately, in England, lesson observation has become synonymous with remedial measures for struggling teachers. Coaches need to demonstrate that observation for the purposes of instructional coaching is a useful part of CPD, not a judgement. I have heard of one school tackling this perception by beginning coaching with senior and middle leaders. Only once this had come to be seen as normal did they invite classroom teachers to take part.

Another major challenge is time. Emily Henderson stressed that if coaching sessions are missed it can be very hard to get the cycle back on track. Henderson would ensure that the coaching cycle was the first thing to go in the school diary at the beginning of the academic year and she was careful to ensure it never got trumped by other priorities. Some coaching schools have simply redistributed inset time to coaching, in order to make this easier.

Establishing coaching in your school will require skilled leadership. For the time being however, coaching is the best-evidenced form of professional development we have. All schools that aspire to be evidence-based should be giving it a go.

[i] I wouldn’t pay too much attention to the relative size of the bars here, since attainment was measured in different ways in different studies.

[ii] Strictly speaking, only 85% of these were classed as CPD interventions. The other 15% involve other approaches to increasing teacher effectiveness, such as altering hiring practices. It should be noted that the chart on the left also includes some coaching interventions!

[iii] It should be noted that I did not calculate my own effect sizes or contact original authors where effect sizes were not reported in the text. To the extent that the reporting of effect sizes is related to study findings, this will skew the picture.

[iv] Albornoz, F., Anauati, M. V., Furman, M., Luzuriaga, M., Podesta, M. E., & Taylor, I. (2017). Training to teach science: Experimental evidence from Argentina. CREDIT Research Paper.

Bruns, B., Costa, L., & Cunha, N. (2018). Through the looking glass: Can classroom observation and coaching improve teacher performance in Brazil? Economics of Education Review, 64, 214-250.

Cilliers, J., Fleisch, B., Prinsloo, C., Reddy, V., & Taylor, S. (2018). How to improve teaching practice? Experimental comparison of centralized training and in-classroom coaching. Working Paper.

Teacher shortages: are a handful of schools a big part of the problem?

Sam Sims and Rebecca Allen

We recently met a Newly Qualified Teacher (NQT), let’s call her Ellen, who had been delighted to get her first teaching job in a North London primary school deemed Outstanding by Ofsted. She arrived on the first day of term looking forward to the challenge of teaching, but by lunchtime it dawned on her that the school had lost 100% of its classroom teaching staff since the previous academic year. At the time, she wondered what could have happened to make all those teachers leave.

She soon found out however, as she spent the next year being pressured into an unsustainable workload and subjected to highly bureaucratic and, at times, callous management. At the end of the year, all the classroom teaching staff left the school. Many of them, including Ellen, left the state education sector altogether.

We wanted to know whether this was an isolated anecdote or a more widespread problem. So in our paper for the February issue of the National Institute Economic Review we use the School Workforce Census to quantify the number of schools that recruit an unusually high proportion of NQTs and see an unusually high proportion of such teachers leave the profession within a year.

Identifying such schools is, however, challenging. For example, a small primary school might only recruit one teacher every few years. If this happens to be an NQT, and that teacher decides to leave through no fault of the school, then this school would show up as recruiting 100% NQTs and losing 100% of them. This is clearly not the same kind of phenomenon as a school losing 100% of their classroom teachers two years in a row. We needed a way to distinguish the two.

In order to avoid treating small schools overly harshly, we used a technique from the medical statistics literature called funnel plots, which were introduced by David Spiegelhalter, currently President of the Royal Statistical Society. The two funnel plots below show the proportion of teachers recruited by each school in the North West of England across the last five years that were Newly Qualified (left-hand side) and the proportion of these that left within a year (right-hand side). The horizontal line shows the regional average and the curved lines show the “control limits” beyond which we argue that schools are displaying unusually high turnover – note that these limits are wider for smaller schools. For full details of our method and the rationale for funnel plots, see our paper here.
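The core of the funnel-plot method is nothing more exotic than control limits around the average proportion, limits which widen as the denominator shrinks. Here is a rough sketch using the normal approximation to the binomial; it illustrates the general technique, not our paper's exact specification:

```python
import math

def funnel_limits(p_bar, n, z=1.96):
    """Approximate 95% control limits for a proportion, centred on
    the overall average p_bar, at denominator n. Uses the normal
    approximation to the binomial, so the limits widen as n shrinks
    and small schools are not flagged for ordinary sampling noise."""
    half_width = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)

# A school recruiting 4 teachers needs a far more extreme NQT share
# to fall outside the funnel than one recruiting 40.
print(funnel_limits(0.25, 4))   # wide limits
print(funnel_limits(0.25, 40))  # narrow limits
```

A school is only flagged if its point falls outside limits calibrated to its own size, which is what protects small schools from being penalised for chance fluctuations.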

Our analysis identified 122 schools in England that both use and lose an unusually high proportion of NQTs from the profession. These schools have an NQT wastage rate three times the national average and between them lost 577 NQTs from the profession between 2010 and 2014. We show that if these schools had attrition rates equivalent to the average school then 376 additional teachers would have progressed beyond their NQT year. For context, this is equivalent to 22 percent of the nationwide shortfall of teachers in 2015.

So it seems that Ellen was indeed unlucky: very-high-turnover schools are rare. Unfortunately, they are still common enough to be making a material contribution to the system-wide teacher shortage. What can be done about this? Funnel plots are a simple, reliable and low-cost way of identifying these schools. This would allow them to be provided with additional support to improve their retention of teachers. Alternatively, we might consider providing this information to trainee teachers to allow them to make a more informed choice about where to accept their first teaching job. If Ellen had enjoyed access to that information, she might still be in the classroom now.


This piece originally appeared on the IoE and NIESR blogs.


Applied quantitative researchers use statistical methods to answer questions with direct practical applications. Generally this involves trying to isolate causal relationships so that policy makers or practitioners can be provided with reliable advice about how to achieve their goals. Labour economists, for example, study which training schemes increase employment. Education researchers evaluate the effect of different teacher training programmes on teacher retention. And criminologists try to identify how different police patrol patterns affect the crime rate. Applied research tends to attract pragmatic, empirically-minded people.

It is perhaps not surprising, then, that applied researchers generally have little time for theory. In my experience, they tend to see theory as impenetrable, untestable and unnecessary. In the first half of this blog I explain each of these objections; in the second half I argue that each of them is mistaken. My aim is to persuade applied quantitative researchers that they will make more progress with their research, and have more impact, if they make more use of theory in their work.

The first charge levelled against theory is that it is impenetrable. It is easy to underestimate how recently we have developed the statistical techniques, hardware, software and datasets necessary for doing applied quantitative research. Prior to this, policy researchers generally filled this empirical vacuum with theory and, as Noah Smith has pointed out, where theory is the only game in town, the competition for publication spots tends to become a battle for who can generate the most sophisticated ideas. This leads to a proliferation of theories in social science that are variously too complex to guide research design, so expansive as to make data collection prohibitively costly, or so nuanced as to make falsification impossible. Many applied researchers conclude that engaging with this sort of theory simply isn’t worth the hassle.

Even where falsification is possible, however, many empirically-minded researchers see testing theory as a fool’s errand. The best-case scenario is that the researcher identifies a plausibly causal relationship inconsistent with a theory – sometimes loosely referred to as falsification. But ex ante, the researcher faces a substantial risk of failing to falsify the theory, which is even less valuable. The risks involved in this sort of research therefore dilute the incentives for testing theory.

Finally, even if an applied researcher found a suitable theory and were in principle willing to take the risk of trying to falsify it, many would still argue that testing theoretical relationships is unnecessary. Why not just conduct evaluations of existing policies or programmes that provide results about causal relationships which are of direct interest to policy makers? Ultimately, testing theories always seems one step removed from the most pressing problems facing applied researchers: does it work?

Though each of these arguments contains a grain of truth, they are all wrong in important ways.

The tide is now turning on overly complicated theory in the social sciences. Behavioural economics has persuaded many more economists to study simple heuristics rather than mathematically cumbersome models of sub-rational decision making. In political economy, to take another example, Dani Rodrik and Sharun Mukand have recently developed a theory that can explain something as complex as why liberal democracy does or does not emerge using only six basic concepts. But perhaps the best illustration of the trend towards simpler theory comes from sociology. The American Sociological Association recently held a meeting to debate whether the discipline needs to make its theory more manageable. One of the papers presented at that meeting demonstrates how opinion is changing in the field with its blunt title: Fuck Nuance. It is a brilliant argument for keeping theory simple enough to be useful. The social psychologist Paul Van Lange has also developed a set of useful principles (Truth, Abstraction, Progress and Applicability, or TAPAS) that can help researchers identify and develop good theories.

Recent developments have also helped to make theory more testable. In a brilliant paper, the political scientist Kevin Clarke recently showed that the only way to really provide confirmation for a theory is to test it against other competing theories. Since then, two other political scientists, Kosuke Imai and Dustin Tingley, have shown how finite mixture models can be used to do just this. They have also developed a package for the statistical software R, making it straightforward for other researchers to test which theories apply best and, crucially, under which conditions. This approach also avoids the worst of the incentive problems associated with attempts at falsification.

Theory is also necessary, in that it provides knowledge essential to answering the does it work question which statistics alone cannot provide. Nancy Cartwright discusses a now well-known case which highlights this. The Tamil Nadu Integrated Nutrition Project (TINP) involved the provision of healthcare and feeding advice for mothers of newborns and was shown to be effective in reducing child malnutrition. This is useful statistical evidence. However, when a near-identical project was implemented in Bangladesh, it was shown not to have an effect. Further research then found that educating the mothers of Bangladeshi children was ineffective because important aspects of food preparation there are not generally conducted by the mother. In Angus Deaton’s terms, while the intervention was replicated, the mechanism underlying success was not. Understanding the theory behind a programme can therefore help clarify its value in a way that statistics alone cannot.

In summary, theory is becoming steadily less impenetrable, increasingly easy to test, and is necessary for applied researchers to confidently infer policy advice from specific evaluations. Instead of focusing solely on evaluating existing programmes, applied researchers would likely have more impact if they collectively adopted a more cyclical approach in which they: evaluated existing programmes to identify which were most effective; tested the known theories or mechanisms which may underpin effective programmes; helped design new programmes based on the most successful theories; and then conducted further evaluations of the new interventions. This approach would contribute to a virtuous cycle of policy-relevant discoveries which could allow quantitative researchers to deliver on Pawson and Tilley's ideal of finding out "what works for whom in what circumstance… and how."


In March the Department for Education (DfE) released a working paper called Measuring the Performance of Schools within Academy Chains and Local Authorities. It contains the first official performance ranking of the organisations responsible for groups of schools in England: local authorities and academy chains. The performance of individual schools has been monitored and published in the UK since at least 1992, but until now the performance of groups of schools has not been systematically reported on. The new ranking therefore represents the expansion of data-driven accountability to an additional tier of the school system.

The findings have been analysed and commented on widely. Anti-academy campaigners used them to claim that academy chains do not deliver better results than local authorities. Newsnight's Chris Cook pointed out that there are high- and low-performing examples of both local authorities and academy chains and argued that instead of debating their relative merits, we should focus on working out how to emulate the most successful examples, whether that be Ark academies or Hackney local authority. Robert Hill, education adviser in the Blair government, used the findings to do just that. Hill argues that high-ranking chains tend to work in localised clusters, expand slowly and focus on pedagogy and oversight. Although they do not say it explicitly, it is safe to assume that the DfE also intends school commissioners to use the rankings when choosing academy chains to take over 'failing' schools.

This may all sound like useful, evidence-based policy analysis. But I want to argue that this sort of performance ranking is so flawed that it is effectively meaningless, and therefore not very useful either for drawing policy lessons or for making commissioning decisions.

The new performance measure developed by the DfE ranks authorities and chains based on the 'value added' they achieve for pupils across their schools. This is measured as the difference between pupils' predicted GCSE grades (based on their Key Stage 2 attainment) and their actual GCSE grades. This means that secondary schools do not take the credit (or blame) for the performance of their feeder primary schools, which is sensible. The measure also places a lower weight on the results of schools that have only recently joined an academy chain, to reflect the chain's limited influence on those schools.
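The DfE's published methodology is more involved, but the basic arithmetic of a weighted value added score can be sketched as follows. The school figures and the down-weighting of a recent joiner here are hypothetical, purely for illustration.

```python
def chain_value_added(schools):
    """Weighted mean value added across a group's schools.

    `predicted` and `actual` are mean GCSE point scores for a school's
    pupils; `weight` discounts schools that joined the chain recently,
    reflecting the chain's limited influence on them so far.
    """
    total_weight = sum(s["weight"] for s in schools)
    return sum(s["weight"] * (s["actual"] - s["predicted"])
               for s in schools) / total_weight

# Hypothetical chain: one long-standing school beating its prediction,
# one recent joiner (weight 0.25) falling slightly below its prediction.
schools = [
    {"predicted": 50.0, "actual": 53.0, "weight": 1.0},
    {"predicted": 48.0, "actual": 47.0, "weight": 0.25},
]
va = chain_value_added(schools)
```

Note that nothing in this calculation adjusts for non-school factors: two chains with identical influence on pupils but different intakes would get different scores, which is precisely the problem discussed below.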

What it does not take account of, however, are non-school factors which influence pupil progress during secondary school. As the DfE analysts put it, their measure assumes that schools have the same "propensity for improvement" (p16). The problem is that we know this is not true. Pupils' household income, for example, is known to have a strong relationship with attainment. Leaving it out will therefore create significant inaccuracies in the ranking.

The DfE hint at incorporating such contextual factors in future versions of the ranking (p17). This may sound sensible, but we have been here before. When dissatisfaction grew with the value added measure (first introduced in 2002) used to rank individual schools, the government developed a Contextual Value Added (CVA) measure, which tried to control for such non-school factors. But research by Lorraine Dearden and colleagues demonstrated that leaving out the level of education of a pupil's mother (data which is not generally collected) caused "significant systematic biases in school CVA measures for the large majority of schools." Stephen Gorard then pointed out that (non-random) missing data meant there were large errors in the estimates and, by extension, in any ranking based on them. Adding contextual information to the DfE's new measure would therefore only repeat the mistakes made with CVA, which has since been abolished.

These flaws are severe enough to make the new measure effectively meaningless, since it is unclear whether a high score represents the influence of the local authority or chain, the influence of other factors which are not taken into account, or just statistical noise.

There is also a more general sense in which this new measure ignores lessons from recent education research. A great deal of work in the last five years has tried to identify the policies and approaches behind London schools' relative success. But Simon Burgess has now shown, using census data, that all of London's superior performance can be accounted for by differences in ethnicity and migration patterns. If he is right, then the hunt for 'what worked' in London has largely been a wild goose chase. Indeed, the suspicious concentration of London-based local authorities and academy chains at the top of the DfE's new ranking suggests that migration patterns might also be driving the results of its analysis. Studying successful exemplars, whether cities or academy chains, is difficult and potentially misleading.

A better approach to finding out what works is to study policies. Because we can measure the attainment of the same pupils before and after a policy is implemented, it is possible to rule out the influence of a range of other factors, even when we cannot measure them. Returning to the London example, a highly aspirational recent immigrant before a policy is implemented is still a highly aspirational recent immigrant after it. Their migration status therefore cannot be what is driving any observed change in outcomes. Another benefit of studying policies is that they are easier to replicate. If a specific professional development programme for teachers is found to be effective, for example, it is fairly straightforward to deliver that programme in other schools. Other things equal, knowing that Hackney is an effective local authority just isn't as useful.
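One common design that embodies this logic is a difference-in-differences comparison, sketched below. Because each group of pupils is compared with itself over time, any stable characteristic (such as migration status) cancels out of the within-group change. The attainment figures are hypothetical.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Estimate a policy effect by comparing each group with itself.

    Any stable characteristic of a group appears in both its 'pre' and
    'post' means, so it cancels out of the within-group change;
    comparing the two changes then nets out trends shared by both
    groups over the same period.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical attainment scores before and after a policy.
effect = diff_in_diff(
    treated_pre=[48, 50, 52],   # mean 50
    treated_post=[56, 58, 60],  # mean 58: change of +8
    control_pre=[49, 50, 51],   # mean 50
    control_post=[52, 53, 54],  # mean 53: change of +3
)
```

Here the treated group improved by 8 points and the control group by 3, so the estimated policy effect is 5 points, however aspirational either group's pupils happened to be throughout.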

The flaws in the DfE's new accountability measure for local authorities and academy chains are severe enough to make it effectively meaningless. Rankings based on it are therefore not very useful, either for drawing policy lessons or for making commissioning decisions. In general, evaluating policies will provide more reliable and useful insights than trying to identify and analyse examples of effective providers. Let's not repeat the mistakes of past accountability reforms.

This piece originally appeared on the LSE Politics and Policy Blog.