Taking a Scholarly Approach to Evaluating Teaching:
A Review of and Recommendations for Assessing Teaching Excellence Across the Schools of the Great Lakes Colleges Association

Sarah L. Bunnell
Ohio Wesleyan University

Copyright 2016 Sarah L. Bunnell

Note:  This Essay for Action is part of a study by Sarah L. Bunnell, which included the gathering of data from member colleges of the Great Lakes Colleges Association.  The full version of this study includes both the text and the tables that accompany the essay.  To access the full essay with its attached tables, click here.   

Across the G.L.C.A., faculty and administrators commit a large amount of time and energy to the evaluation of teaching for the purposes of retention, tenure, promotion, and merit decisions. For small liberal arts colleges with teaching-centric missions, this work is critical to ensuring that students receive high-quality instruction from reflective, scholarly teachers. There appears to be a disconnect, however, between the methods used to evaluate teaching and the standards set forth for the evaluation of the scholarship of teaching and learning (SoTL), and this disconnect is in no way limited to our local practices. Scholarly teaching, as defined by Ernest Boyer (1990) and refined by Glassick, Huber, and Maeroff (1997), requires that instructors apply the same systematic approach to their teaching that they do to their disciplinary research, including the specification of objectives, the development of an awareness of previous work in the field, the collection of data using agreed-upon methods, and the public sharing of one’s results and conclusions. The fact that most measures used to rate teaching effectiveness do not share much in common with standards for scholarship in teaching and learning has been described as the “SoTL Paradox” (Walker, Boepher, & Cohen, 2008). In this action essay, I’ll review current practices used to evaluate teaching across the G.L.C.A., as well as summarize current literature on the strengths and challenges of these approaches. Then, I’ll present the six-pronged framework for scholarly teaching developed by Boyer and discuss the ways in which teaching effectiveness could be assessed using this scholarly teaching framework.

Current research reveals a number of issues with the tools that are commonly used to evaluate teaching in higher education. Student evaluations of teaching, which are the most frequently used measure of teaching effectiveness across U.S. institutions (e.g., Clayson, 2009), tend to be good measures of instructor organization, the clarity of faculty expectations and content delivery, perceived instructor availability and respectfulness, and overall student “satisfaction.” These same instruments, however, are poor measures of an instructor’s reflective practice, intentionality in course design, knowledge of teaching best practices, or willingness to explore and develop improved methods of instruction. They are also very poor measures of the amount of learning achieved; people are poor judges of their own learning, in large part because they are not aware of what they don’t know. A meta-analysis of the relationship between student grades and students’ evaluations of their learning revealed that the correlation between these two factors has decreased over the past several decades and is now effectively zero (Clayson, 2009), meaning that students’ perceptions of how much they have learned are not indicative of how much actual learning has occurred, at least in terms of the grade earned in a course. Further, it is well documented that students’ evaluations of instructors are heavily influenced by variables related to privilege and diversity, such that female instructors and those from minority racial/cultural groups tend to be rated by students as less effective than instructors who are male or from a majority racial group (e.g., Hamermesh & Parker, 2005; MacNell, Driscoll, & Hunt, 2015; Rubin, 1998).
The literature also finds that students expect female instructors to be more compassionate, nurturing, and available than male instructors; female instructors who are viewed as warm and “maternal” by students tend to be rated higher on course evaluations than are females who are viewed as less nurturing, although the same pattern does not hold for male faculty (Sprague & Massoni, 2005). In terms of instructor difficulty, students tend to provide higher ratings to instructors of “easy” classes over those whose course work they found challenging (e.g., Johnson, 2003), unless the student evaluators are themselves very high-achieving; these students appear to value being challenged (Stark & Freishtat, 2014).

As we know from the large body of research on introspection (see Wilson, 2003 for a review), individuals are much better at reporting on what happened than they are on how it was achieved, or why. In the context of teaching evaluations, questions that require students to report what the instructor did or did not do in the classroom, as well as their own personal perceptions of the classroom dynamic, are much more likely to be accurate reports than are responses to questions about how learning was achieved or whether an instructor, or the course itself, was effective in facilitating learning outcomes. In the case of the latter type of questions, implicit biases may play a more powerful role in shaping the responses given. As such, course evaluation survey tools are limited measures, depending upon the questions asked and the purposes for which they are used; while they may capture student perceptions as well as some of the outward behaviors and decisions of the instructor, they are much less well suited measures of an instructor’s efforts to facilitate learning outcomes and the resulting learning that has occurred. These findings likely come as no surprise to most faculty members, as many of us readily report deep dissatisfaction with course and/or teaching evaluation instruments, particularly those that rely upon a student survey instrument.

That being said, teaching evaluation surveys remain the norm at both the local and national level. First and foremost, we are institutions of higher learning that deeply value the student experience and student learning. As such, it makes great sense to collect information from our students about their experiences in the classroom. In addition to the valuing of student perspectives, however, there are several other driving forces that likely serve to maintain the use of course evaluation surveys as our primary mechanism for evaluating teaching effectiveness. First, institutional history and practices are powerful entities; faculty and administrators may be hesitant to deviate from how teaching has been evaluated in the past. Changing an evaluation instrument or process raises important questions about the fairness and equality of assessment over time, such that faculty evaluated under the “old” system may not be held to the same standards as those evaluated under the “new” system; in response, individuals would likely feel as though they must respond on-the-fly to a new set of criteria and resulting value system. Given faculty members’ limited time and resources, we are all forced to make decisions about how best to commit our time and energies; if the way in which retention, promotion, and merit decisions are made is altered, the paradigm under which faculty have been operating would shift as well, and that can be an uncomfortable and/or unwelcome process. The timing of such a change is also challenging. What is the best way to shift to a new system, with tenure-track faculty at different points in their review periods? Given the heavy weight that teaching evaluation responses are given on most campuses, these practical questions have significant potential implications for individuals’ career trajectories.

Finally, the revision of institutional policies and practices is time-intensive, and our faculty and faculty governing bodies are already over-stretched to complete the important work of personnel reviews. Developing, piloting, and implementing new metrics for capturing and evaluating the teaching work of faculty may quite reasonably be beyond the resources currently available or frankly, beyond the current interests of the institution. With ever greater pressure and import placed on research scholarship at liberal arts institutions, administrators may feel as though placing a greater emphasis on evaluating teaching, and asking faculty to commit more time and resources towards engaging in a scholarly approach to teaching, is counter to this goal.

Twelve of our 13 G.L.C.A. institutions employ some form of student evaluation of teaching survey tool (see Table 1); Wabash requires its faculty to collect feedback from students in their classes, but this feedback is not collected using a shared assessment measure. Most employ a 5- or 7-point Likert scale as well as open-ended response items (but see Earlham, which uses only open-ended questions). Across institutions that use student course evaluations, there is variability in the amount of flexibility that instructors have regarding choice of instrument as well as the nature of the questions asked. For instance, Oberlin requires that faculty ask questions across six factors (Course Organization and Clarity; Instructor Enthusiasm; Teacher-Student Interaction, Rapport, and Approachability; Workload and Difficulty of the Course; Exams, Papers, Grading Fairness, and Feedback; and Self-Rated Learning), but the instructor may choose which questions from each set best fit their teaching context and goals. Similarly, the College of Wooster has a set of six questions which all instructors must ask their students, but instructors are free to select any additional course evaluation tool they would like to administer in their classes, either from a set of college-provided options or from an outside source. Albion uses the IDEA Student Ratings of Instruction online survey tool, which allows instructors to rank questions in the instrument based on their teaching goals and practices; the resulting student responses are weighted according to these rankings.

There are several themes that emerge as common across our evaluation instruments (see Tables 2a-2f for an overview of common course evaluation themes and questions across schools which use a standardized instrument, and the Appendix for copies of each institution’s student evaluation of teaching form). Almost all of us ask students to rate the overall effectiveness of the instructor, while half also ask for a rating of the overall effectiveness of the course. In terms of course design, students are most frequently asked to rate the organization of the course and/or the effectiveness of how class time was utilized, followed by whether the instructor presented information clearly and whether the assignments in the course effectively facilitated learning. Most instruments also include at least one question about how hard students felt they had to work in order to be successful in the course. Finally, almost all of the assessment tools used across the G.L.C.A. ask students to report on the amount that they learned in the course, although these questions vary in terms of whether they ask students to globally rate their level of learning or whether students are asked to rate their learning across a range of learning outcomes or skills.

In terms of the learning environment itself, the questions vary more substantially across institutions. Questions include the amount of respect for students the instructor displayed, how much the instructor encouraged student questions, whether the instructor created a positive learning environment, and if the instructor encouraged students to consider multiple viewpoints and perspectives. Finally, in terms of instructor behaviors, the majority of instruments ask students if they found the instructor’s feedback in response to their work to be helpful. About half of our schools ask if the instructor was available during office hours or outside of class time and if assignments were returned in a reasonable amount of time. Less commonly asked questions include whether the instructor demonstrated enthusiasm for the course material, if the instructor evaluated student work fairly, if the instructor came to class prepared, and if the instructor was knowledgeable about the subject matter.

In addition to teaching evaluation surveys, 8 of the 13 G.L.C.A. schools currently require peer observations of teaching as part of their personnel review process, and peer observations of teaching for either formative or summative purposes are commonplace at institutions where they are not required (see Table 1). The typical structure of peer observations includes one or two class visits with a summary of the observations provided either to the faculty member themselves, in the case of formative peer observations, or submitted as part of a personnel review file, in the case of summative, evaluative peer observations. Recommendations for best practices in peer observation include: 1) pre- and post-visit discussions between the observer and the observed, with the conversation focused on the goals of the to-be-observed class and how the current class session aligns with the larger course learning goals; 2) appropriate preparation on behalf of the observer, such that they are familiar with the assignments for the class; and 3) a discussion of the evidence that will be used to evaluate whether student learning goals have been met (see Deborah Dezure’s chapter in Seldin (1999) for a thorough discussion of peer observation of teaching best practices).

The adoption of peer observation into our faculty evaluation systems stems from the belief that our own colleagues can more readily recognize the presence or absence of good teaching than can our students. Indeed, colleagues can speak to the teaching approaches used, the level of perceived engagement of the class, the responsiveness of the instructor to the current class climate and direction of the class discussion, and the success of the faculty member in redirecting student questions and clarifying students’ understandings. That being said, there are some important limitations to using peer observations as a summative tool for evaluating teaching effectiveness. First, the presence of the observer in the classroom, even if the individual positions themselves as unobtrusively as possible, can quickly change the dynamics of the class, especially in a small class setting or a discussion-heavy course. Although learning is best facilitated by the creation of a safe learning environment, the observer may disrupt this dynamic, resulting in an observation that does not accurately represent the nature of the course or class climate. Further, given the time-intensive nature of conducting classroom observations, the number of observations that can reasonably be expected to be conducted is limited, resulting in a small number of observations being used as evidence of one’s teaching efficacy. Because of this, faculty may see these observations as being quite high-stakes, and decisions surrounding the “who” of the observation, as well as the “when,” can be fraught. On the one hand, if the faculty member is allowed to select who will conduct the observations, as well as when the observations are conducted, they would be wise to choose individuals with whom they are friendly and with whom they share common teaching-related values and practices.
They would also be wise to select particularly engaging or successful class sessions for observation, regardless of how representative this class session is of the larger course offering. The resulting observational reports, in this context, then serve more as a letter of recommendation of one’s teaching than an objective observational report, although they are commonly treated as the latter. Conversely, if the peer observers are chosen by some outside body and/or the timing of the observations is set without consultation with the instructor, there is a risk that a singular “off day” in class, or a singular observation by someone who does not see value in the instructor’s teaching practices (e.g., a Socratic teacher who does not value lecture, or a lecture-heavy instructor who deems active learning approaches as lacking in rigor or coverage), will become heavily weighted evidence in the evaluation of one’s teaching and one’s related personnel status.

Finally, and perhaps most importantly, both peer observations and student evaluations of teaching commonly decouple the processes of teaching and learning. The emphasis is placed on the practices of the instructor without positioning these practices in the context of student learning outcomes. As the effectiveness of one’s teaching hinges upon the learning that is derived, this disconnect is problematic. In an attempt to recouple the teaching and learning processes, and more directly evaluate the teaching practices, and resulting learning, that occur, many individuals now argue for a shifting of the criteria used to evaluate teaching, such that our standards for teaching excellence could and should more closely align with those used to evaluate the scholarship of teaching and learning (e.g., Seldin, 1999; Stark & Freishtat, 2014; Wieman, 2015). So, what would our teaching evaluations look like if they were to more closely mirror the standards of scholarly teaching?

1) Clarity of Goals. Scholarly teachers provide clear statements of their goals for student learning, as well as how their teaching approaches and course design decisions have been driven by these stated goals (see Wiggins & McTighe “Backwards Design” for an overview). Evaluation of an instructor in terms of their clarity of goals would include an examination of how well he or she has aligned their learning outcome goals with their course design decisions, behaviors in the classroom, and policies. An additional way in which this alignment may be assessed would be through a forensic syllabus examination, in which the faculty member would submit their syllabi for review by knowledgeable colleagues outside one’s own institution, in order to provide review committees with information about how well the to-be-taught material reflects current knowledge, issues, and even teaching best practices in the instructor’s discipline.

2) Awareness and incorporation of the work of others into one’s own practices. There is a long history of research into educational practices conducted by those trained in schools of education and pedagogy, and there is a growing body of evidence on disciplinary and interdisciplinary research in the field of the scholarship of teaching and learning (e.g., International Journal for the Scholarship of Teaching and Learning, Teaching and Learning Inquiry). While many scholarly teachers read and contribute to these literatures, it can be challenging to stay current with this work while also maintaining one’s knowledge of current research in their disciplinary domains (assuming their research areas are not educational or pedagogical in nature). The G.L.C.A. is developing a database of article summaries to assist faculty with accessing relevant research; however, there are additional ways in which individuals may stay aware of the work of others beyond the reading of published literature. Attendance of teaching and learning conferences or sessions within disciplinary conferences that are focused on pedagogy, as well as involvement with on-campus or across-campus presentations on teaching and learning, increases one’s awareness of teaching methods and practices. The thoughtful consideration and incorporation of insights gleaned from these events into one’s own teaching is common and required practice for scholarly teachers; thus, evaluating an instructor upon this dimension would include an examination of their level of involvement with the field and/or relevant literatures focused on teaching and learning inquiries, as well as whether their practices and policies in the classroom have been informed by this literature.

3) Use of “Shared Methods.” A community of scholars shares an understanding of the accepted methods in the discipline. In the case of teaching scholarship, there are several ways of knowing which may be employed towards the systematic improvement of teaching and learning. Certainly, the intentional addition or subtraction of particular approaches, assignments, or activities into a course design and the resulting changes in student learning over time could be assessed. Alternatively, one could compare student learning across sections of a course, evaluating the resulting student learning between the “treatment” and “control” conditions. An obvious critique of these methods focuses on the fact that our classrooms are rarely controlled laboratory settings: students are not randomly assigned to class sections, instructors often change multiple components of the course in one iteration, and much of what happens in a course is influenced by the personalities and levels of preparation/engagement of the students rather than the instructor. Regardless, there is value in documenting changes that have occurred; repeated improvements in student learning over time are likely attributable to something in the class context, to which the instructor is a critical contributor, even if we can’t pinpoint the precise variable or variables driving this outcome. As such, evaluating an instructor on this dimension would include examining his/her intentionality in course design, such that he/she is able to systematically ask questions about the efficacy of choices made in and around the course on shaping student learning outcomes.

4) Collection of Evidence. Scholarly teachers provide evidence of the impact of their practices and decisions on student learning outcomes; this evidence should be aligned with the goals specified by the instructor. As discussed previously, student evaluations of teaching provide evidence of a particular type, such as the behaviors of the instructor and student perceptions of their learning and growth as a result of the course. Conversely, peer observations provide evidence of teaching practices and classroom dynamics as observed by a knowledgeable peer, although these observations may be impacted by the nature of their structure and/or implementation. Both of these streams of evidence tell us something valuable about the teaching and learning context. If, however, the instructor’s goals for their teaching involve some aspect of student learning, then evidence of this learning should be directly obtained from student work itself. While we may intuitively sense that our teaching is meeting the goals we have set for student learning, a scholarly approach to one’s teaching requires the provision of relevant and compelling evidence that such gains have indeed been achieved.

The evidence that would be provided by a scholarly instructor, then, would include examples of student learning products (e.g., term papers, exam responses, student journals, online discussions that occurred, portfolios of writing/art/music products) as well as summary data on how representative these examples are of the general class performance (for examples of how others have presented student work as evidence using electronic portfolios, see the University of Kansas Center for Teaching Excellence). It is most helpful to our colleagues and others who are reviewing evidence of our students’ learning if the instructor provides: a) examples of student work at various levels of performance (e.g., two assignments each which are indicative of exceptional, intermediate, and novice levels of understanding), b) brief summaries by the instructor which describe what aspects of each student work product led them to evaluate the work at a high, moderate, or low level, and c) an overview of the entire class distribution of performance on the assignment so that those reviewing this work are aware of the average level and range of performance in the course. Without a sense of the overall course performance, it is difficult for both the instructor and the outside reviewers to determine how representative the student work products are of overall student outcomes in the course.

5) Engages in Reflective Practice. The provision of evidence without reflection and iterative change is not scholarly. When we present the results of a disciplinary inquiry, it is expected that we also provide a discussion of what we have concluded from the data, in addition to the implications of these findings for our future research and practice. When applied to pedagogy, the scholarly instructor draws conclusions about the efficacy of their teaching, given the evidence they have collected, and reflects upon how they will respond to these observations in future instantiations of the course or in other courses. For instance, an instructor may discover that the vast majority of students in their introductory course continue to demonstrate novice-level understanding of a key conceptual issue on the final exam, despite providing students with additional in-class and out-of-class instruction on that topic. The instructor, therefore, may reasonably conclude that he or she should more rigorously scaffold this skill in future versions of this same course; he or she will then examine whether this approach is associated with improvements on the exam in future offerings of the course. In this way, a scholarly teacher generates next steps and next questions about their teaching, while specifying the observations that would indicate that their teaching practices have been effective.

6) Makes their work public. While we tend to be quite public with our disciplinary research, through presentations and publications, we also tend to be quite private about our teaching practices and impacts. Indeed, the collection of meaningful, direct evidence of student learning (Standard 4) and the sharing of one’s findings about one’s teaching with others are the two domains of a scholarly approach to teaching in which faculty least commonly engage (Bunnell & Bernstein, 2012). However, a scholarly approach to teaching requires that one share the results of their work with others, in order to receive feedback, inform others, and contribute to the larger knowledge base on teaching and learning. There are many ways in which individuals can “go public” with their teaching in a scholarly manner. Locally, individuals may present the results of their teaching inquiries to their departments or in the context of university-level conversations around teaching. More broadly, individuals may share their teaching-related work at G.L.C.A.-sponsored events, via online forums or teaching e-portfolios, or at regional, national, or international conferences on teaching and learning. Individuals may also write about their work for a wider audience and contribute to the literature on the scholarship of teaching and learning via peer-review publication. And of course, scholarly teachers share their findings in documents submitted for personnel decisions, which are shared with colleagues and others for review. Therefore, evaluating faculty upon this dimension would involve assessing the ways in which teachers have shared their findings and reflections with others, both locally and more broadly.

The evaluation of teaching and teachers is a complicated, time-intensive, and important process, particularly for institutions that place great value on hiring and retaining excellent teaching faculty. However, there are limitations to the evaluation approaches we currently employ, particularly in terms of evaluating the iterative, reflective processes of scholarly teaching. Student evaluations of teaching and peer reviews of teaching provide important but limited evidence of what happens in a classroom context. We should be cautious about the ways in which these forms of evidence are interpreted and the emphasis placed on the responses generated. The standards for scholarly teaching described above, and the related criteria upon which faculty would be asked to represent their teaching, would certainly require more work from faculty members whose teaching is being reviewed, and particular institutions may not feel that all six components of scholarly teaching are useful areas of evaluation for their teaching faculty. If, however, we truly value scholarly teaching on our campuses, we may be well served by adopting an evaluative framework that reflects and rewards this, by capturing the intellectual work of teaching, directly measuring student learning, and maintaining an emphasis on the critical link between teaching and learning.


Boyer, E.L. (1990). Scholarship Reconsidered: Priorities of the Professoriate. Carnegie Foundation for the Advancement of Teaching.

Bunnell, S.L. & Bernstein, D.J. (2012). Overcoming Some Threshold Concepts in Scholarly Teaching, The Journal of Faculty Development, 26(3), 14-18.

Clayson, D. (2009). Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature. Journal of Marketing Education, 31(1), 16-29.

Glassick, C.E., Huber, M.T., & Maeroff, G.I. (1997). Scholarship Assessed: Evaluation of the Professoriate. San Francisco, California: Jossey-Bass.

Hamermesh, D.S. & Parker, A.M. (2005). Beauty in the classroom: Instructors’ pulchritude and putative pedagogical productivity. Economics of Education Review, 24(4), 369-376.

MacNell, L., Driscoll, A., & Hunt, A.N. (2015). What’s in a name? Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291-303.

Miller, J. & Chamberlin, M. (2000). Women are teachers, men are professors: A study of student perceptions. Teaching Sociology, 28, 283-298.

Rubin, D.L. (1998). Help! My professor (or doctor or boss) doesn’t talk English. In J.N. Martin, T.K. Nakayama, & L.A. Flores (Eds.), Readings in Cultural Contexts (pp. 149-160). Mountain View, CA: Mayfield Publishing Company.

Seldin, P. (1999). Changing practices in evaluating teaching: A practical guide to improved faculty performance and promotion/tenure decisions. San Francisco, California: Jossey-Bass.

Sprague, J. & Massoni, K. (2005). Student Evaluations and Gendered Expectations: What we can’t count can hurt us. Sex Roles, 53(11/12), 779-793.

Stark, P.B. & Freishtat, R. (2014). An Evaluation of Course Evaluations, ScienceOpen. 1-7. doi: 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1

Walker, J.D., Boepher, P., & Cohen, B. (2008). The Scholarship of Teaching and Learning Paradox: Results without Rewards. College Teaching, 56(3), 183-189.

Wieman, C. (2015, January-February). A Better Way to Evaluate Undergraduate Teaching. Change: The Magazine of Higher Learning.

Wiggins, G. & McTighe, J. (2005). Understanding by Design. Alexandria, Virginia: Association for Supervision and Curriculum Development.

Wilson, T. (2003). Knowing when to ask: Introspection and the adaptive unconscious. Journal of Consciousness Studies, 10(9-10), 131-140.
