Priti Shah, an AAPS parent and a UM psychology professor, read a version of this during public commentary at a school board meeting, and she followed up her comments with a formal letter. I liked it so much that I asked if I could post it here. The reason I asked is that I think we need to understand what good evaluation would mean, and why the system being imposed on teachers by the school district is not a good one. And by the way, if you have never spoken at public comment (or haven't recently), I encourage it!
Dear Ann Arbor School Board Members:
This letter follows up on my comments during the public comment period of the January 2016 Ann Arbor School Board meeting, where I spoke about the new teacher evaluation system.
As a reminder, I'm the parent of two children in the Ann Arbor Public Schools (11th and 6th grade). I am also a Professor of Psychology at the University of Michigan, and my research areas are cognition, cognitive neuroscience, and educational psychology. I base my comments on my feelings as a parent as well as on the research evidence regarding teacher evaluations.
I wanted to speak because I am very concerned about the eroding climate of respect and collaboration between teachers and administration in the Ann Arbor Public Schools, and about the impact of that erosion on our children.
I start with three assumptions:
(1) we all want the very best teachers possible,
(2) we all want them to have the resources they need to provide the best possible educational experiences for each of our children, and
(3) we want to be able to do all that without wasting our hard-earned resources.
I strongly believe in setting high expectations and rewarding high-quality work. And as an educational scientist, I believe very much in high-quality, research-supported teacher evaluation. High-quality evaluation should be valid (that is, someone who is rated as a "good" teacher should actually be a good teacher, and someone who is rated as a "bad" teacher should actually be a bad teacher) and reliable (that is, the evaluation shouldn't change too much depending on which students are in the classroom or on which day the assessment occurs). Validity is a very hard nut to crack, because it depends fundamentally on one's definition of what a good teacher is.
The new teacher evaluation system relies on two components: (1) student growth on a menu of standardized tests and (2) the Charlotte Danielson teacher evaluation system. I would like to outline my concerns with respect to both of these approaches in terms of validity and reliability.
Student Growth
While I understand that incorporating student growth into teachers' evaluations is mandated by state law, I want to highlight that the use of student growth (and of how a teacher contributes to that growth) is problematic from a statistical perspective. The American Statistical Association, in its policy statement on the issue, points to numerous concerns with respect to using student growth data for teacher evaluation purposes. Most studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in system-level conditions. Student growth measures are not highly reliable, in other words.
A good teacher may look like a bad teacher depending on the composition of students in his or her class. A group of Ann Arbor students in AP English may not show huge growth on a standardized English test because those students are already performing at ceiling on the test; their teacher might be rated as ineffective because there was no growth. A teacher whose students may need safety and security (and warm coats and breakfast) may do an outstanding job and yet the circumstances that they are dealing with might lead to minimal growth on a standardized test.
Another problem with using test scores to evaluate teachers is that relevant test scores are not available for many of the subjects teachers teach: my children have taken outstanding courses in subjects for which no standardized tests are used, such as engineering design; communications, media and public policy; orchestra; and art. Some of these teachers interact with students only once a week for an hour. Evaluating these teachers on the performance of their students in subjects they do not teach, and of students they rarely see, is absurd.
Furthermore, there is good support for the idea that teachers change their practices in light of these high-stakes evaluations, often cutting activities that promote critical thinking and creativity in order to spend more time on tested material.
Most importantly, growth rates for the same teachers vary widely from year to year, suggesting that these measures are not very reliable indicators of teacher quality and are heavily influenced by exactly which students a teacher happens to have. And unfortunately, students will spend increasing amounts of time, and the district increasing amounts of money, on high-stakes tests that assess learning, to the detriment of resources spent on other activities.
The Ann Arbor Public Schools would like to focus on growth for the bottom 1/3 of students in hopes that this will be an incentive to reduce the achievement gap. Unfortunately, having only 1/3 of the data to work with will mean a massive reduction in the possible reliability of the data because of the smaller sample size. And the bottom 1/3 is a dramatically different benchmark across teachers (i.e., you cannot compare growth across teachers if one is evaluated on the bottom 33% of students in AP English and another on the bottom 33% of students in guitar).
The Charlotte Danielson Framework
The second proposed component of the new teacher evaluation system is the Charlotte Danielson Framework. On the surface, this is a reasonable measure that involves administrators evaluating teachers on a systematic set of 76 items that are likely to be positively associated with teacher quality.
Again, a good measure of teaching quality requires two key features: it needs to be reliable (that is, the same teacher would be rated the same across time and by different raters) and valid (that is, a good score on the measure means someone really is a good teacher). Unfortunately, neither the reliability nor the validity of this framework is clear, based on the extant evidence. Sure, you'll hear some relatively high numbers from the people who sell the Danielson system, but those numbers come from expert coders watching the same lessons on video. Consider rating a baseball player for 15 minutes during a game. If he hits a home run that day, two independent raters will rate him similarly high. If he strikes out, the two independent raters will rate him low. It will look like your rating system is highly reliable, even though a 15-minute snapshot says little about the player overall. That's how the reliability of these observational methods is tested. This is just one of many problems associated with such classroom observation methods.
I point the board to a 2012 article in Educational Researcher by Harvard Graduate School of Education professor and University of Michigan PhD Heather Hill for a more technical discussion of these and related concerns. And at the same time I appeal to your common sense: look at the rubrics and ask yourself, have you ever had a terrible teacher who could check off all the boxes and look like an "effective" teacher because they could use the right lingo and implement the criteria superficially? Have you ever had a stellar educator who inspired and motivated you to succeed but didn't see eye to eye with the administration's views on how classroom seating should be organized? Might there be a teacher who can shine during such a formal evaluation process but shows active disdain for some students throughout the school year?
I appreciate the extreme difficulty, but also the necessity, of evaluating teacher effectiveness. Still, I can confidently state that merely moving from rating teachers on one subset of the criteria annually to rating them on all four domains will not necessarily improve the reliability or validity of the measure. Indeed, it is likely to reduce the quality of the ratings and the validity of the measures while simultaneously increasing the burden on teachers and administrators. Just because there are more items does not mean an assessment is better. Nor do I think that the vast majority of highly effective, experienced teachers are suddenly going to become less effective. At my own job, our evaluations become less frequent with greater seniority; this makes sense to me.
Recommendation
Given that teachers must be evaluated, and that none of the proposed methods are particularly reliable or valid, I would probably use a combination of metrics, as the school board proposes. However, I would (1) try to minimize the burden on teachers and administrators (as in, not that many hours of time), (2) involve teachers in decision making at all phases (to get input on what they think should be included and what is reasonable and won't distract them from their real work), (3) include not just administrator evaluations but also peer evaluations (that is, ratings by other teachers, who often know more about what goes on in classrooms), and (4) consider the input of parents and students as well.
A proud mama moment: my son wrote an article for the Skyline Skybox advocating the inclusion of student ratings of teachers (http://readtheskybox.com/201601/why-students-are-the-best-tools-when-it-comes-to-teacher-evaluations/); while I think student evaluations can be problematic in some situations, he makes an excellent point. Student evaluations based on specific questions regarding teaching effectiveness (not just "was this a good class" but whether the teacher seemed to care, whether students respect the teacher, and so forth) can actually be better predictors of student growth than observational methods. And I can tell you that parents in our community are pretty well informed regarding which teachers seem engaged, caring, and effective. Parent and student surveys are cheap.
Conclusion
We need to start with some basic assumptions in revamping the teacher evaluation system in Ann Arbor.
My first assumption is that most of our teachers are smart, hardworking, and caring professionals. In my many visits to Ann Arbor Schools classrooms and interactions with teachers, I have observed far, far more excellence than ineffective teaching.
Second, the Ann Arbor school system needs to maintain its leadership position in school administration and governance as well as in school quality. The reason we have such outstanding teachers is that they want to work in our district. We want to attract the very best teachers, not drive them away with unnecessary busywork. Let's interpret our state's laws in the manner best suited to our teachers and students instead of jumping through hoops that may well be unnecessary.
Finally, let's all agree that we want to spend our time and money on what helps our children learn, and that we do not want more and more of our money going to for-profit testing companies; to consultants who train administrators and run workshops for teachers on evaluation rubrics; or to software so that administrators can rapidly rate teachers on numerous criteria in the classroom at the press of a button.
Thanks for your time, and I’m happy to have a longer conversation with anyone who would like to talk to me.
Sincerely,
Priti Shah
A few references:
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56-64.