Observing teachers

We've known for some time that observations of teachers are largely invalid. New research suggests they may be racist, sexist and classist too.

Back in the early 2010s, a fight was brewing in England between a teaching orthodoxy imposed by Ofsted, the school inspection service, and a band of freethinking teacher bloggers. At issue was Ofsted’s practice of observing lessons and then grading them. This practice appeared to enforce the use of ideologically driven teaching methods such as group work.

At around the same time over in the United States, a project funded by the Bill and Melinda Gates Foundation was seeking to find valid ways of evaluating teaching. The Measures of Effective Teaching (MET) Project examined a teacher’s prior student achievement gains (the amount by which student scores increased when taught by a given teacher in the past), student survey scores and scores on classroom observation rubrics as predictors of teacher quality, as measured by current student gains.

Unsurprisingly, past student gains were the best predictor of current student gains, despite being imperfect and noisy. The MET Project researchers trialed a number of classroom observation rubrics. They found that all of them had some power to predict student gains, but the predictions were far from perfect. As Professor Robert Coe pointed out in an influential blog post in 2014:

“One way to understand these values is to estimate the percentage of judgements that would agree if two raters watch the same lesson. Using Ofsted’s categories, if a lesson is judged ‘Outstanding’ by one observer, the probability that a second observer would give a different judgement is between 51% and 78%.”

Coe goes on to note the unusual circumstances under which the MET Project observed lessons. Observers received training and each teacher was observed multiple times by multiple observers. According to MET Project researchers, this was because “…the same teacher was often rated differently depending on who did the observation and which lesson was being observed.”

Moreover, in the MET Project, observers watched videos of lessons, so the teachers were unaware they were being observed. Teachers were also unaware of the rubric on which they were being assessed. Taken together, this would make it hard for teachers to put on a performance for the sake of the observer.

Coe noted that Ofsted lesson observations, which we could extend to pretty much all lesson observations conducted in schools, are not performed with this level of attention to reliability. In fact, teachers are well aware of the criteria on which they will be judged. If they are going to receive feedback following the observation, then it is impossible not to make them aware. And in real-world settings, judgements often result from a single observation or a series of observations from a single observer.

This killed the Ofsted lesson observation model, even if its death throes took a few years to peter out.

In subsequent years, more evidence has emerged from the MET Project about the effect of bias in lesson observations. For instance, one analysis found that male teachers and ‘teachers in classrooms with high concentrations of Black, Hispanic, male, and low-performing students receive significantly lower observation ratings’, despite the data suggesting such ratings are unlikely to represent an actual difference in teacher quality.

New research suggests such bias is not restricted to the particular circumstances of the MET Project and may be a more general phenomenon. This time, it was preservice teachers who were being observed. Again, males received systematically lower scores, along with those teaching in low-income and rural schools and, perhaps most significantly of all, preservice teachers of colour.

Subjective assessments always create a space for bias, which is one reason we should shun the siren call to replace school exams with subjective portfolio assessments. The particular pattern found in lesson observations seems to suggest a white, female, relatively affluent norm in the minds of observers, with deviations from this norm being marked down. The most likely explanations would be that this norm represents the experience of the observers or is somehow coded into the observation rubrics. I would guess it’s probably a little of both.

For a head of mathematics like me, this presents something of a predicament. I like going to watch my teachers. I don’t feel like I really know them as professionals until I have. I certainly feel like lesson observations have a value, even if this is not supported by research.

My hunch is that the value of lesson observations is non-linear. It is easy to tell if something is badly wrong, if the wheels have fallen off the wagon. If I enter a classroom in my current context and students are talking over the teacher or not completing the set tasks, then this rings alarm bells. We also have a large number of agreements, both as a school and as a department. For instance, all mathematics teachers work to a commonly prepared lesson plan, so it is easy to spot the teacher who is not doing this. However, past these basics, perhaps the value runs out. I am aware that I have adopted certain ways of teaching maths concepts from colleagues which I would not have valued particularly highly if I had observed them in class. I have adopted them because assessment data suggests they are effective. Lesson observations would not have got me there.

So I guess the moral of the story is to approach lesson observations with caution. Using the kind of lesson observations that typically take place in schools in order to grade the quality of teaching is not an evidence-based practice and could potentially even be racist, sexist and classist. At best, lesson observations may be effective at disseminating the basics of craft knowledge - I don’t know how else you would disseminate such knowledge to preservice teachers. So, lesson observations may be a useful tool, provided we do not rush to judge.