Applied quantitative researchers use statistical methods to answer questions with direct practical applications. Generally this involves trying to isolate causal relationships so that policy makers or practitioners can be given reliable advice about how to achieve their goals. Labour economists, for example, study which training schemes increase employment. Education researchers evaluate the effect of different teacher training programmes on teacher retention. And criminologists try to identify how different police patrol patterns affect the crime rate. Applied research tends to attract pragmatic, empirically-minded people.
It is perhaps not surprising then that applied researchers generally have little time for theory. In my experience, they tend to see theory as impenetrable, untestable and unnecessary. In the first half of this blog I explain each of these objections; in the second half I argue that each of them is mistaken. My aim is to persuade applied quantitative researchers that they will make more progress with their research, and have more impact, if they make more use of theory in their work.
The first charge levelled against theory is that it is impenetrable. It is easy to underestimate how recently the statistical techniques, hardware, software and datasets necessary for applied quantitative research were developed. Before these tools existed, policy researchers generally filled the empirical vacuum with theory and, as Noah Smith has pointed out, where theory is the only game in town, the competition for publication spots tends to become a battle over who can generate the most sophisticated ideas. This has led to a proliferation of theories in social science that are variously too complex to guide research design, so expansive as to make data collection prohibitively costly, or so nuanced as to make falsification impossible. Many applied researchers conclude that engaging with this sort of theory simply isn't worth the hassle.
Even where falsification is possible, however, many empirically-minded researchers see testing theory as a fool's errand. Many assume that the best-case scenario is identifying a plausibly causal relationship that is inconsistent with a theory, an outcome sometimes loosely referred to as falsification. But ex ante the researcher faces a substantial risk of failing to falsify the theory, a result that is even less valuable. The risks involved in this sort of research therefore dilute the incentives for testing theory.
Finally, even if an applied researcher found a suitable theory and were in principle willing to take the risk of trying to falsify it, many would still argue that testing theoretical relationships is unnecessary. Why not just conduct evaluations of existing policies or programmes, which provide results about causal relationships that are of direct interest to policy makers? Ultimately, testing theories always seems one step removed from the most pressing question facing applied researchers: does it work?
Though each of these arguments contains a grain of truth, they are all wrong in important ways.
The tide is now turning on overly complicated theory in the social sciences. Behavioural economics has persuaded many more economists to study simple heuristics rather than mathematically cumbersome models of sub-rational decision making. In political economy, to take another example, Dani Rodrik and Sharun Mukand have recently developed a theory that can explain something as complex as why liberal democracy does or does not emerge using only six basic concepts. But perhaps the best illustration of the trend towards simpler theory comes from sociology. The American Sociological Association recently held a meeting to debate whether the discipline needs to make its theory more manageable. One of the papers presented at that meeting demonstrates how opinion is changing in the field with its blunt title: 'Fuck Nuance'. It is a brilliant argument for keeping theory simple enough to be useful. The psychologist Paul Van Lange has also developed a set of useful principles (Truth, Abstraction, Progress and Applicability, or TAPAS) that can help researchers identify and develop good theories.
Recent developments have also helped to make theory more testable. In a brilliant paper, the political scientist Kevin Clarke recently showed that the only way to really provide confirmation for a theory is to test it against other competing theories. Since then, two other political scientists, Kosuke Imai and Dustin Tingley, have shown how finite mixture models can be used to do just this. They have also developed a package for the statistical software R, making it straightforward for other researchers to test which theories fit best and, crucially, under which conditions. This approach also avoids the worst of the incentive problems associated with attempts at falsification.
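To make the idea concrete, here is a minimal sketch of the finite-mixture logic in Python. Imai and Tingley's own implementation is in R; everything below, including the function name and the assumption of two rival linear models with Gaussian errors, is my illustrative reconstruction rather than their code.

```python
# A minimal sketch, assuming two rival theories that each imply a linear
# model for the same outcome. Not Imai and Tingley's implementation.
import numpy as np
from scipy.stats import norm

def em_mixture_of_theories(y, X_a, X_b, n_iter=200):
    """EM for a two-component mixture of regressions: each observation is
    generated by theory A's model with probability pi, else theory B's."""
    pi = 0.5
    beta_a = np.linalg.lstsq(X_a, y, rcond=None)[0]
    beta_b = np.linalg.lstsq(X_b, y, rcond=None)[0]
    sigma_a = sigma_b = np.std(y)
    for _ in range(n_iter):
        # E-step: posterior probability that each case follows theory A
        lik_a = pi * norm.pdf(y, X_a @ beta_a, sigma_a)
        lik_b = (1 - pi) * norm.pdf(y, X_b @ beta_b, sigma_b)
        z = lik_a / (lik_a + lik_b)
        # M-step: weighted least squares for each theory's regression
        beta_a = np.linalg.solve(X_a.T @ (z[:, None] * X_a), X_a.T @ (z * y))
        beta_b = np.linalg.solve(X_b.T @ ((1 - z)[:, None] * X_b),
                                 X_b.T @ ((1 - z) * y))
        sigma_a = np.sqrt(np.average((y - X_a @ beta_a) ** 2, weights=z))
        sigma_b = np.sqrt(np.average((y - X_b @ beta_b) ** 2, weights=1 - z))
        pi = z.mean()
    return pi, z  # share of cases fitting theory A, and per-case posteriors
```

In the full approach the mixing probability is itself modelled as a function of observable conditions, which is what lets researchers say not just which theory fits best but when.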
Theory is also necessary in that it provides knowledge essential to answering the 'does it work?' question which statistics alone cannot supply. Nancy Cartwright discusses a now well-known case which highlights this. The Tamil Nadu Integrated Nutrition Project (TINP) involved providing healthcare and feeding advice to mothers of newborns, and was shown to be effective in reducing child malnutrition. This is useful statistical evidence. However, when a near-identical project was implemented in Bangladesh, it was shown to have no effect. Further research found that educating the mothers of Bangladeshi children was ineffective because important aspects of food preparation there are not generally carried out by the mother. In Angus Deaton's terms, while the intervention was replicated, the mechanism underlying its success was not. Understanding the theory behind a programme can therefore help clarify its value in a way that statistics alone cannot.
In summary, theory is becoming steadily less impenetrable and easier to test, and it is necessary for applied researchers to confidently infer policy advice from specific evaluations. Instead of focusing solely on evaluating existing programmes, applied researchers would likely have more impact if they collectively adopted a more cyclical approach in which they: evaluated existing programmes to identify which were most effective; tested known theories or mechanisms which may underpin effective programmes; helped design new programmes based on the most successful theories; and then conducted further evaluations of the new interventions. This approach would contribute to a virtuous cycle of policy-relevant discoveries which could allow quantitative researchers to deliver on Pawson and Tilley's ideal of finding out "what works for whom in what circumstance… and how."
DEPARTMENTAL HEADS IN THE SAND: WHY YOUR DEPARTMENT IS PERFORMING WORSE THAN YOU THINK
How’s your driving? Below average, pretty normal, or better than most?
Research suggests that the majority of people reading this article have just answered 'better than most'. They cannot all be right. At most half of drivers can be better than the median; the rest are, by definition, at or below it.
This is an example of what psychologists call 'illusory superiority', a phenomenon which is by no means limited to our skills behind the wheel. Studies have shown that we tend to overestimate our ability relative to others in a range of areas, from leadership to parenting and even social skills.
What about schools? Surely teachers, with access to Ofsted judgements, regular testing and sophisticated outcome measures, have an accurate picture of their performance?
We surveyed a representative sample of English, maths and science Heads of Department to find out. We asked them how they thought their departments performed in 2013 GCSEs, relative to departments in other schools serving similar intakes. As figure 1 shows, fully 43% thought themselves to be superior, while only 17% assessed themselves as being worse than others.
We then calculated performance measures that control for school intake and split the departments into two groups based on their scores. The results for low performing departments are particularly striking (see figure 2). Almost a quarter incorrectly believe themselves to be high performing, with another 43% believing they perform similarly to others.
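For readers who want the mechanics, the sketch below shows the kind of calculation involved. The column names and the simple linear intake adjustment are hypothetical placeholders rather than our exact methodology.

```python
# Illustrative sketch only: variable names (intake, gcse_score, self_rating)
# and the linear adjustment are assumptions, not the survey's actual measures.
import numpy as np
import pandas as pd

def self_assessment_accuracy(df: pd.DataFrame) -> pd.DataFrame:
    # Residualise GCSE results on intake, so departments are compared
    # against others serving similar pupils
    slope, intercept = np.polyfit(df["intake"], df["gcse_score"], 1)
    df["value_added"] = df["gcse_score"] - (intercept + slope * df["intake"])
    df["measured"] = np.where(df["value_added"] >= df["value_added"].median(),
                              "high performing", "low performing")
    # Cross-tabulate measured performance against Heads' own ratings
    return pd.crosstab(df["measured"], df["self_rating"], normalize="index")
```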
So according to our data, many Department Heads do indeed suffer from illusory superiority. If you're a middle leader reading this, you probably still think you're one of the select few with an accurate self-assessment. But that's the thing about illusory superiority: we all think we're special.
Given the prevalence of performance data these days, how can it be that so many Department Heads don’t have an accurate picture of their own performance? Recent research suggests that the dominant reason is our desire to avoid undesirable judgements about ourselves. This urge is particularly strong when we are being assessed on important tasks, such as teaching. Some middle leaders are simply choosing to ignore the data. Departmental Heads buried in the sand.
This cannot be good for schools, or pupils. Fortunately the research on driving offers pointers on how we can keep illusions of superiority in check. Motorists were found to offer more accurate assessments of their performance when they expected their judgement to be reviewed by others afterwards, particularly if they saw the reviewer as high status.
So what should schools do? Middle leaders could help keep themselves honest by collectively reviewing each other’s exam performance. An annual inter-school meeting to analyse results would be a start. The Families of Schools database, which groups schools with similar intakes, can help make these comparisons more transparent. But what really matters is that performance is reviewed by respected, knowledgeable colleagues.
We are all prone to thinking we're better than we really are. Even you. That makes collaboration the only reliable antidote to complacency.
This piece was originally written for the Education Datalab website.
In March the Department for Education (DfE) released a working paper called Measuring the Performance of Schools within Academy Chains and Local Authorities. It contains the first official performance ranking of the organisations responsible for groups of schools in England: local authorities and academy chains. The performance of individual schools has been monitored and published in the UK since at least 1992, but until now the performance of groups of schools has not been systematically reported on. The new ranking therefore represents the expansion of data-driven accountability to an additional tier of the school system.
The findings have been analysed and commented on widely. Anti-academy campaigners used them to claim that academy chains do not deliver better results than local authorities. Newsnight's Chris Cook pointed out that there are high and low performing examples of both local authorities and academy chains and argued that instead of debating their relative merits, we should focus on working out how to emulate the most successful examples, whether that be Ark academies or Hackney local authority. Robert Hill, education adviser in the Blair government, used the findings to do just that. Hill argues that high-ranking chains tend to work in localised clusters, expand slowly and focus on pedagogy and oversight. Although they do not say it explicitly, it is safe to assume that the DfE also intends school commissioners to use the rankings when choosing academy chains to take over 'failing' schools.
This may all sound like useful, evidence-based policy analysis. But I want to argue that this sort of performance ranking is so flawed as to be effectively meaningless, and therefore not very useful, either for drawing policy lessons or for making commissioning decisions.
The new performance measure developed by the DfE ranks authorities and chains based on the 'value added' they achieve for pupils across their schools. This is measured as the difference between pupils' actual GCSE grades and the grades predicted from their Key Stage 2 attainment. This means that secondary schools do not take the credit (or blame) for the performance of their feeder primary schools, which is sensible. It also places a lower weight on the results of schools which have only recently joined an academy chain, to reflect the chain's limited influence on those schools.
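A simplified sketch of this value-added logic is below. It is my own reconstruction; the DfE's methodology differs in its details, and the column names and the years-based weighting rule are hypothetical.

```python
# A rough sketch, assuming a linear KS2-to-GCSE prediction and a simple
# cap-at-five-years weighting; not the DfE's actual specification.
import numpy as np
import pandas as pd

def group_value_added(schools: pd.DataFrame) -> pd.Series:
    # Predict GCSE points from Key Stage 2 attainment across all schools
    slope, intercept = np.polyfit(schools["ks2"], schools["gcse"], 1)
    resid = schools["gcse"] - (intercept + slope * schools["ks2"])
    # Down-weight schools that joined the chain or authority recently,
    # since the group has had little influence on their results
    w = schools["years_in_group"].clip(upper=5)
    weighted = (resid * w).groupby(schools["group"]).sum()
    return (weighted / w.groupby(schools["group"]).sum()).sort_values()
```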
What it does not take account of, however, are non-school factors which influence pupil progress during secondary school. As the DfE analysts put it, their measure assumes that schools have the same "propensity for improvement" (p16). The problem is that we know this is not true. Pupils' household income, for example, is known to have a strong relationship with attainment. Leaving it out will therefore create significant inaccuracies in the ranking.
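A toy simulation makes the problem concrete (all numbers below are invented). Every chain has a true effect of zero, but average household income varies across chains and is omitted from the model, so the 'value added' ranking still separates them.

```python
# Invented data: chain effects are all zero, yet the naive value-added
# ranking tracks the omitted income variable almost perfectly.
import numpy as np

rng = np.random.default_rng(0)
n_chains, n_schools = 20, 30
# Chains differ systematically in the household income of their pupils
income = rng.normal(np.linspace(-1, 1, n_chains)[:, None], 0.5,
                    size=(n_chains, n_schools))
ks2 = rng.normal(size=(n_chains, n_schools))
gcse = 0.6 * ks2 + 0.4 * income + rng.normal(0, 0.3, size=ks2.shape)

# "Value added" that controls for KS2 only, as the DfE measure does
slope, intercept = np.polyfit(ks2.ravel(), gcse.ravel(), 1)
value_added = (gcse - (intercept + slope * ks2)).mean(axis=1)
# Correlation between the ranking and omitted income: close to 1
print(np.corrcoef(value_added, income.mean(axis=1))[0, 1])
```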
The DfE hint at incorporating such contextual factors in future versions of the ranking (p17). This may sound sensible, but we have been here before. When dissatisfaction grew with the value added measure used to rank individual schools (first introduced in 2002), the government developed a Contextual Value Added (CVA) measure, which tried to control for such non-school factors. But research by Lorraine Dearden and colleagues demonstrated that leaving out the level of education of a pupil's mother (data which is not generally collected) caused "significant systematic biases in school CVA measures for the large majority of schools." Stephen Gorard then pointed out that (non-random) missing data meant there were large errors in the estimates and, by extension, in any ranking based on them. Adding contextual information to the DfE's new measure would therefore only repeat the mistakes made with CVA, which has since been abolished.
These flaws are severe enough to make the new measure effectively meaningless, since it is unclear whether a high score represents the influence of the local authority or chain, the influence of other factors which are not taken into account, or just statistical noise.
There is also a more general sense in which this new measure ignores lessons from recent education research. A great deal of work in the last five years has tried to identify the policies and approaches behind London schools' relative success. But Simon Burgess has now shown, using census data, that all of London's superior performance can be accounted for by differences in ethnicity and migration patterns. If he is right, then the hunt for 'what worked' in London has largely been a wild goose chase. Indeed, the suspicious concentration of London-based local authorities and academy chains at the top of the DfE's new ranking suggests that migration patterns might also be driving the results of their analysis. Studying successful exemplars, whether cities or academy chains, is difficult and potentially misleading.
A better approach to finding out what works is to study policies. Because we can measure the attainment of the same pupils before and after a policy is implemented, it is possible to rule out the influence of a range of other factors, even when we cannot measure them. Returning to the London example, a highly aspirational recent immigrant before the policy is implemented is still a highly aspirational recent immigrant after it is implemented. Their migration status therefore cannot be what is driving any observed changes in outcomes. Another benefit of studying policies is that they are easier to replicate. If a specific professional development programme for teachers is found to be effective, for example, it is fairly straightforward to deliver that programme in other schools. Other things equal, knowing that Hackney is an effective local authority just isn't as useful.
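A minimal simulation illustrates the logic (hypothetical numbers throughout). Any fixed pupil characteristic appears in both periods and cancels out of the within-pupil change, leaving only the policy effect and noise.

```python
# Sketch under assumed numbers: a fixed, unobserved trait (e.g. the
# aspiration of recent immigrants) drops out of the before/after difference.
import numpy as np

rng = np.random.default_rng(1)
n_pupils = 1000
aspiration = rng.normal(size=n_pupils)   # fixed trait, never measured
policy_effect = 0.25

before = aspiration + rng.normal(0, 0.5, n_pupils)
after = aspiration + policy_effect + rng.normal(0, 0.5, n_pupils)

# The trait appears in both periods and cancels in the difference, so the
# mean within-pupil change recovers the policy effect despite the trait
# being unmeasured
print((after - before).mean())  # approximately 0.25
```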
The flaws in the DfE’s new accountability measure for local authorities and academy chains are severe enough to make them effectively meaningless. The ranking on which they are based is therefore not very useful, either for drawing policy lessons or making commissioning decisions. In general, evaluating policies will provide more reliable and useful insights than trying to identify and analyse examples of effective providers. Let’s not repeat the mistakes of past accountability reforms.
This piece originally appeared on the LSE Politics and Policy Blog.