Wednesday, March 16, 2016

Guest Post: Teachers, Statistics, and Teacher Evaluation

Have I mentioned that I love guest posts? 

Priti Shah, an AAPS parent and a UM psychology professor read a version of this during public commentary at a school board meeting, and she followed her comments up as a formal letter. I liked it so much that I asked if I could post it here. The reason I asked is that I think we need to understand what good evaluation would mean, and why the system being imposed on teachers by the school district is not a good system. And by the way, if you have never spoken at public comment (or haven't recently), I encourage it!

Dear Ann Arbor School Board Members:

This letter follows up on my comments during the public comment period of the Ann Arbor School Board meeting in January 2016. I spoke about the new teacher evaluation system.

As a reminder, I’m the parent of two children in the Ann Arbor Public Schools (11th and 6th grade). I am also a Professor of Psychology at the University of Michigan, and my research areas are cognition, cognitive neuroscience, and educational psychology. I base my comments on my perspective as a parent as well as on the research evidence regarding teacher evaluations.
Priti Shah

The reason I wanted to speak is that I am very concerned about the eroding climate of respect and collaboration between teachers and administration in the Ann Arbor Public Schools, and about the impact of that erosion on our children.

I start with three assumptions: 
(1) we all want the very best teachers possible,  
(2) we all want them to have the resources they need to provide the best possible educational experiences for each of our children, and 
(3) we want to be able to do all that without wasting our hard-earned resources. 

I strongly believe in setting high expectations and rewarding high quality work.  And as an educational scientist, I believe very much in high quality, research-supported teacher evaluation.  High quality evaluation should be valid (that is, someone who is rated as a “good” teacher should actually be a good teacher, and someone who is rated as a “bad” teacher should actually be a bad teacher) and reliable (that is, the evaluation shouldn’t change too much depending on who happens to be in one’s classroom or on which day the assessment occurs). Validity is a very hard nut to crack, because it depends fundamentally on one’s definition of what a good teacher is.

The new teacher evaluation system relies on two components: (1) student growth on a menu of standardized tests and (2) the Charlotte Danielson teacher evaluation system.  I would like to outline my concerns with respect to both of these approaches in terms of validity and reliability.

Student Growth

While I understand that incorporating student growth into teachers’ evaluations is mandated by state law, I want to highlight that the use of student growth--and how a teacher contributes to that growth--is problematic from a statistical perspective.  The American Statistical Association, in its policy statement on the issue, points to numerous concerns with respect to using student growth data for teacher evaluation purposes.  Most studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in system-level conditions. Student growth measures are not highly reliable, in other words.


A good teacher may look like a bad teacher depending on the composition of students in his or her class.  A group of Ann Arbor students in AP English may not show huge growth on a standardized English test because those students are already performing at ceiling on the test; their teacher might be rated as ineffective because there was no growth.  A teacher whose students may need safety and security (and warm coats and breakfast) may do an outstanding job and yet the circumstances that they are dealing with might lead to minimal growth on a standardized test. 

Another problem with using test scores to evaluate teachers is that relevant test scores are not available for many of the subjects teachers teach--my children have taken outstanding courses in subjects for which there are no standardized tests: engineering design; communications, media and public policy; orchestra; art.  Some of these teachers interact with students only once a week for an hour.  Evaluating these teachers on the performance of their students in subjects they do not teach, and of students they rarely see, is absurd.

Furthermore, there is good support for the idea that teachers change their practices in light of these high stakes evaluations, often removing activities that promote critical thinking and creativity to spend more time on tested materials.

Most importantly, growth rates for the same teachers vary widely from year to year, suggesting that these measures are not very reliable indicators of teacher quality and are highly influenced by exactly which kids a teacher happens to be teaching. And unfortunately, students will spend increasing amounts of time, and the district increasing amounts of money, on high stakes tests that assess learning, to the detriment of resources spent on other activities.
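A quick back-of-the-envelope simulation illustrates this instability (the variance numbers here are illustrative assumptions, not district data): if a teacher's stable quality accounts for only a modest share of the variance in class-average growth, then two years of growth scores for the same teacher will correlate only weakly.

```python
import random

random.seed(1)

def year_to_year_correlation(n_teachers=5000, teacher_sd=1.0, noise_sd=2.0):
    """Correlation between two years of measured class growth for the
    same teachers. True teacher quality is held constant; each year's
    score adds independent noise from that year's particular students."""
    xs, ys = [], []
    for _ in range(n_teachers):
        true_effect = random.gauss(0.0, teacher_sd)
        xs.append(true_effect + random.gauss(0.0, noise_sd))  # year 1 score
        ys.append(true_effect + random.gauss(0.0, noise_sd))  # year 2 score
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

r = year_to_year_correlation()
print(round(r, 2))  # roughly 0.2 under these assumed variances
```

With these assumed variances the expected correlation is teacher variance over total variance, 1/(1+4) = 0.2: a teacher's score in one year tells you very little about the next.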

The Ann Arbor Public Schools would like to focus on growth for the bottom 1/3 of students, in hopes that this will be an incentive to reduce the achievement gap.  Unfortunately, working with only 1/3 of the data will mean a massive reduction in the possible reliability of the measure because of the smaller sample size.  And the bottom 1/3 is a dramatically different benchmark across teachers (i.e., you cannot compare growth across teachers if one is evaluated on the bottom 33% of students in AP English and another on the bottom 33% of students in guitar).
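The sample-size point can be made concrete with a small sketch (again, purely illustrative numbers): cutting a class of 30 down to its bottom third roughly multiplies the statistical noise in the class-average growth score by the square root of 3.

```python
import math
import random

random.seed(0)

def class_average_spread(n_students, sd=10.0, n_sims=20000):
    """Simulate many classes of identical true quality and return the
    standard deviation of the class-average growth score -- i.e., how
    much the average bounces around purely by chance."""
    means = []
    for _ in range(n_sims):
        scores = [random.gauss(0.0, sd) for _ in range(n_students)]
        means.append(sum(scores) / n_students)
    mu = sum(means) / n_sims
    return math.sqrt(sum((m - mu) ** 2 for m in means) / n_sims)

spread_full = class_average_spread(30)   # whole class of 30
spread_third = class_average_spread(10)  # bottom third only
print(round(spread_third / spread_full, 2))  # roughly sqrt(3), about 1.73
```

The same teacher, with the same true effectiveness, gets a noticeably noisier score simply because the measure is computed from fewer students.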

The Charlotte Danielson Framework

The second proposed component of the new teacher evaluation system is the Charlotte Danielson Framework. On the surface, this is a reasonable measure that involves administrators evaluating teachers on a systematic set of 76 items that are likely to be positively associated with teacher quality. 

Again, a good assessment of teaching quality requires two key features: it needs to be reliable--in that the same teacher would be rated the same across time and by different raters--and valid--that is, a good score on the measure means someone really is a good teacher.  Unfortunately, the reliability and validity of this framework are just not clear, based on the extant evidence.  Sure, you’ll hear some relatively high numbers from the people who sell the Danielson system, but those numbers are based on expert coders watching the same lessons on video.  Consider rating a baseball player for 15 minutes during a game.  If he hits a home run that day, your two independent raters will rate him similarly high. If he strikes out, the two independent raters will rate him low. It will look like your rating system is highly reliable. That is how the reliability of these observational methods is tested. This is just one of many problems associated with such classroom observation methods.

I point the board to a 2012 article in Educational Researcher by Harvard School of Education Professor and University of Michigan PhD Heather Hill for a more technical discussion of these and related concerns. And at the same time I appeal to your common sense: Look at the rubrics and ask yourself--have you ever had a terrible teacher who could check off all the boxes and look like an “effective” teacher because they could use the right lingo and implement the criteria superficially?  Have you ever had a stellar educator who inspired and motivated you to succeed but didn’t see eye to eye with the administrators’ views on how classroom seating could be organized? Might there be a teacher who can shine during such a formal evaluation process but shows active disdain for some students throughout the school year?

I appreciate the extreme difficulty, but also the necessity, of evaluating teacher effectiveness, but I can confidently state that moving from rating teachers on one subset of the criteria annually to rating them on all four will not necessarily improve the reliability or validity of the measure. Indeed, it is likely to reduce the quality and validity of the ratings while simultaneously increasing the burden on teachers and administrators. Just because there are more items does not mean an assessment is better.   Nor do I think that the vast majority of highly effective, experienced teachers are suddenly going to become less effective. At my own job, our evaluations become less frequent with greater seniority; this makes sense to me.


Given that teachers must be evaluated, and that none of the proposed methods are particularly reliable or valid, I would probably use a combination of metrics as proposed by the school board. However, I would (1) try to minimize burden on the teachers and administrators (as in, not that many hours of time), (2) involve teachers in decision making at all phases (to get input on what they think should be included and what is reasonable and won’t distract them from their real work), (3) include not just administrator evaluations but peer evaluations (that is, ratings of other teachers, who often know more about what goes on in classrooms), and (4) consider also input of parents and students.   

A proud mama moment: my son wrote an article for the Skyline Skybox advocating the inclusion of student ratings of teachers; while I think student evaluations can be problematic in some situations, he makes an excellent point.   Student evaluations, based on specific questions regarding teaching effectiveness (not just “was this a good class” but whether the teacher seemed to care, whether students respect the teacher, and so forth) can actually be better predictors of student growth than observational methods.  And I can tell you that parents in our community are pretty well informed regarding which teachers seem engaged, caring, and effective. Parent and student surveys are cheap.


We need to start with some basic assumptions in revamping the teacher evaluation system in Ann Arbor.

My first assumption is that most of our teachers are smart, hard working, and caring professionals. I have observed far, far more excellence in Ann Arbor Public Schools classrooms, during my many visits and interactions with teachers, than I have ineffective teaching.

Second, the Ann Arbor school system needs to maintain its leadership position regarding school administration and governance as well as quality schools.  The reason we have such outstanding teachers is that they want to work in our district.  We want to attract the very best teachers, not drive them away with unnecessary busywork.  Let’s interpret our state’s laws in a manner best suited to our teachers and students instead of jumping through hoops that may well be unnecessary.

Finally, let’s all agree that we want to spend our time and money on what helps our children learn, and that we do not want more and more of our money going to for-profit testing companies, to consultants who train administrators and run workshops for teachers on evaluation rubrics, and to software so that administrators can rapidly rate teachers on numerous criteria in the classroom at the press of a button.

Thanks for your time, and I’m happy to have a longer conversation with anyone who would like to talk to me.


Priti Shah

A few references:

Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56-64.

Tuesday, March 8, 2016

Election Day: My High School's Polling Place

Here in Ann Arbor, we have relatively simple paper ballots that are, apparently, not too likely to be tampered with. They are easy to fill out, they keep the lines moving, and there is a paper record if there ever needs to be a recount. Filling in the little circles today did kind of remind me of standardized testing, but that's not my complaint.

I went to a high school where the middle school was attached, and there was a polling place in a quiet corner of the high school. Of course (or maybe not of course--does it happen today?) visiting the polling place was definitely part of the grade 7-12 social studies curriculum.

Plus, when I was little and would go with my parents to vote, there was a certain magic in going behind the private curtain (was it velvet? I think it might have been), pulling the private levers, and knowing that when you walked out of the booth, nobody knew who you had voted for.

And I know, rationally, that the system we have in Ann Arbor is much more secure than those old voting machines.

And I know, rationally, that these little booths are much easier for the clerks to move around from precinct to precinct.

And I know, rationally, that it is much easier to expand the number of booths in high volume elections.


When I was a kid, those voting booths were magical, and even today, I miss them.

(Image: Pauljoffe at en.wikipedia, CC BY-SA 3.0)

Consider subscribing to Ann Arbor Schools Musings by Email!

Tuesday, March 1, 2016

Ask the Ann Arbor School Board to Vote No on Tuition-Based Program

I was surprised--and not happily--to see this headline from the Ann Arbor News:

Tuition-based Program Would Bring Chinese Students to Ann Arbor High Schools.

The key things to know, from the article:

A new plan* proposed to the Ann Arbor Public Schools Board of Education Wednesday evening would place up to 200 students from China in the city's high schools each year...The district is considering a partnership with BCC International Education Group, a Chinese-American company that has already created similar programs bringing Chinese high school students to Dexter and Saline.
*This idea (or something quite similar) was actually discussed, and rejected, a few years ago in the Pat Green era.

Here's the letter I just sent to the school board. You can send emails to the school board at:

Dear Board of Ed-- 

I'm writing to ask you to oppose the proposed contract with a firm to bring in up to 200 Chinese tuition-paying students. 
We already have at least two exchange programs in the schools--Youth for Understanding and AFS--both programs devoted to bringing students from around the world, including--among many other countries--China. Did you know that YFU started in Ann Arbor as an exchange program with Germany post-World War II? Or that AFS's origins lie in the aftermath of World War I with a similar goal of inter-cultural understanding?  
Ann Arbor also hosts the USA's U-18 hockey program.  
I'm not sure exactly how many students come through these three programs but I think it's something in the range of 100 students.  
All of these programs rely on host family volunteers, and it's not so easy to find them. I know intimately what is involved, because we hosted a student from Sweden last year and a student from Uruguay the year before, both with YFU. Both were great experiences but it does involve a fair bit of work, and (I know I'm repeating myself) it's not easy to find host families.  
I asked several families this year if they could take a student, and none of them felt they were in a position to do it. And, in fact, just today I got an email asking for a host family for a student who needs to leave his current family--and that happens too, sometimes, in the middle of the year. 
I do understand the desire to bring in money for the district, but I don't think this is a good way. And I would say this even if I hadn't heard, today, that the Oxford School District has had a very negative experience with this company.  
With the current exchange programs the student in my house brought in the same per-pupil funding as every other student in the district, thus adding to the district's census.  
I'm asking you to vote no on this. 
Ruth Kraut
