Student Self-Evaluation in Grade 5-6 Mathematics: Effects on Problem Solving Achievement
John A. Ross*
Ontario Institute for Studies in Education
at the University of Toronto
Paper presented at the annual meeting of the American Educational Research Association
Seattle, April 2001
*Corresponding Author: John A. Ross, OISE/UT Trent Valley Centre
Box 719, 633 Monaghan Road South, Peterborough, Ontario, Canada K9J 7A1
email@example.com Tel: (705) 742-8827 Fax: (705) 742-5104
Two studies examined the effects of self-evaluation training on mathematics achievement. Grade 5-6 students self-evaluated for 8 weeks in Study 1 (N=176 treatment, 174 control) and 12 weeks in Study 2 (N=259 treatment, 257 control). Results were mixed. Self-evaluation training had no effect on student achievement in Study 1. In Study 2, when students experienced self-evaluation for a longer period and teachers were given increased in-service and curricular support, treatment students outperformed controls (ES=.40). The findings contribute to the consequential validity argument for alternate assessment. These results also indicate that subject moderates the effects of self-evaluation on achievement. More support is required in mathematics than in language because it is more difficult to recover deep structures in mathematics than in writing, teacher conceptions of the discipline and teaching of mathematics are misaligned with curricular expectations, and students are less articulate about mathematical reasoning than they are in talking about writing.
Figure 1 describes a model linking self-evaluation to student achievement. Although the constructs in the model are conceptually distinct, they overlap empirically (Wigfield, Eccles, & Rodriguez, 1998). The model posits that achievement is the outcome of students’ personal goals and effort. Student goals can be categorized at general and specific levels. General goals identify student motives for engaging in a classroom activity. Goal theorists distinguish two types of general goal orientations, variously labeled mastery versus ability focused (Ames, 1992), learning versus performance (Dweck & Leggett, 1988), and task versus ego (Nicholls, 1984). Students who set mastery goals approach school tasks by focusing on what they might learn from their participation. They define success as mastering a skill or developing understanding and seek tasks that are more likely to generate these outcomes. Students who set ability goals focus on opportunities to demonstrate their abilities. Success is measured by higher grades or greater status in comparison to peers. Students who set ability goals seek familiar task situations. The highest academic performance is obtained when the former orientation predominates (Meece, Blumenfeld, & Hoyle, 1988). Although most research focuses on these two orientations, a third goal orientation, social affiliation, has been examined. In most studies, students who approach a school task by focusing on the opportunities it provides for social interactions with peers have lower achievement than students with high learning goal orientations. However, achievement and the need for affiliation can be positively associated, particularly in classrooms using cooperative learning techniques (Urdan & Maehr, 1995). At the specific goal level, achievement is likely to be higher when students focus on the lesson objectives embedded within the task.
The second term in the model, student effort, influences how well students achieve their goals, since persistence increases accomplishment. Effort is also influenced by students’ goals. For example, students are more likely to persist if they adopt goals that have unambiguous outcomes, that are achievable in the near future, and that are moderately difficult to achieve (Schunk, 1981).
Self-evaluation embodies three processes that self-regulating students use to observe and interpret their behavior (Schunk, 1996; Zimmerman, 1989). First, students produce self-observations, deliberately focusing on specific aspects of their performance relevant to their subjective standards of success. Second, students make self-judgments in which they determine how well their general and specific goals were met. Third, students form self-reactions, interpretations of the degree of goal achievement that express how satisfied they are with the result of their actions.
Self-evaluation contributes to self-efficacy beliefs, i.e., student perceptions of their ability to perform the actions required by similar tasks likely to be encountered in the future. Students who perceive themselves to have been successful on the current task are more likely to believe that they will continue to be efficacious in the future (Bandura, 1997).
Students with greater confidence in their ability to accomplish the target task are more likely to visualize success than failure. They set higher standards of performance for themselves. Student expectations about future performance also influence effort. Confident students persist. They are not depressed by failure but respond to setbacks with renewed effort. For example, students with high self-efficacy interpret a gap between aspiration and outcome as a stimulus while low self-efficacy students perceive such a gap as debilitating evidence that they are incapable of completing the task (Bandura, 1997).
Self-evaluation plays a key role in fostering an upward cycle of learning when the child’s self-evaluations are positive. Positive self-evaluations encourage students to set higher goals and commit more personal resources to learning tasks (Bandura, 1997; Schunk, 1995). Negative self-evaluations lead students to embrace goal orientations that conflict with learning, select personal goals that are unrealistic, adopt learning strategies which are ineffective, exert low effort and make excuses for performance (Stipek, Recchia, & McClintic, 1992).
Training in self-evaluation strategies could enhance achievement in several ways. Training could heighten self-observation. Since teachers try to assign tasks that students can complete successfully, student judgments of their performance would likely be positive and the upward cycle in which self-evaluation stimulates achievement could accelerate. Self-evaluation training could also modify the specific goals that students set, bringing them more closely in line with teacher expectations. Training might also spur students to greater efforts if they became more conscious of the specific gaps between their performance and their goals.
Fontana and Fernandes (1994) implemented a program to increase primary student control of mathematics learning. In the early phase of the 20 week program, students selected from a range of tasks identified by the teacher, negotiated learning contracts and determined whether they had fulfilled their commitments using assessment materials provided by the teacher. By the end of the program students were setting their own learning objectives, developing appropriate tasks, selecting suitable mathematical apparatus and developing their own self-assessment procedures. The program had a significant impact on student achievement for more able students but the effects were negligible for the less able. In this study, self-evaluation was embedded in a broader instructional treatment. The distinctive contribution of self-evaluation to the effects could not be disentangled from other program components.
Schunk (1996) implemented self-evaluation by asking grade 4 students to judge how certain they were they could solve computational problems. Self-evaluation had no effect on mathematics achievement in a learning goal condition (i.e., when students were told that the purpose of the activity was to learn how to solve fraction problems). Students given a performance goal (i.e., the directions made no reference to learning how to solve fraction problems) had higher achievement if they self-evaluated on six occasions (once after each lesson). In this study students were given no information about their performance; there was no training in self-evaluation; the duration of the treatment was short; the self-evaluation task was brief and infrequent.
Ross (1995) found that self-evaluation training increased cooperative student interactions associated with achievement. Grade 7 mathematics students working in cooperative groups were given edited transcripts of their group interactions and were trained how to interpret them. They used an instrument 1-2 times per week for 12 weeks to record the frequency of positive interactions. Self-assessment increased the frequency of productive help giving and help seeking and improved attitudes about asking for help. However, in this study no measure of mathematics achievement was administered. Other studies have also reported that self-evaluation has a positive effect on factors associated with achievement without measuring achievement directly. For example, self-evaluation increases persistence (Henry, 1994; Hughes, Sullivan, & Mosley, 1985; Schunk, 1996).
Our approach to teaching students how to evaluate their work began with a study of the student assessment practices of cooperative learning teachers identified by their peers and supervisors as exemplary (Ross, Rolheiser, & Hogaboam-Gray, 1998-a). We organized their strategies as a four-stage process: (i) involve students in defining evaluation criteria, (ii) teach students how to apply the criteria, (iii) give students feedback on their self-evaluations, and (iv) help students use evaluation data to develop action plans. Strategies for each stage and classroom usable tools (Rolheiser, 1996) were elaborated by a school-university partnership. Use of these strategies had a positive effect on student attitudes to evaluation (Ross, Rolheiser, & Hogaboam-Gray, 1998-b; 1999) and on language achievement (Ross, Rolheiser, & Hogaboam-Gray, 1999-a). Our goal in Study 1 was to extend these findings to mathematics. We hypothesized that students who were trained how to evaluate their work would have higher mathematics achievement. The model in Figure 1 provided the grounds for the hypothesis but in Study 1 we were not testing specific elements of the model. Our intention, deflected by the outcomes of Study 1, was to examine in subsequent studies the validity of the model once overall effects of the training had been detected.
We focused on grade 5-6 data management and probability problems because of their prominence in the provincial curriculum (Ontario Ministry of Education and Training, 1997) and in mathematics education (ASA-NCTM Joint Committee, 1985). Previous studies found that even sophisticated adults have difficulty solving real-life problems involving probability due to naïve conceptions that interfere with rational strategies (Bramald, 1994; Kahneman, Slovic, & Tversky, 1982; Konold, Pollatsek, Well, Lohmeier, & Lipson, 1993). Students are subject to similar misconceptions, which tend to increase with school experience (Garfield & Ahlgren, 1988; Green, 1983). Students also suffer from deficits in prerequisite knowledge such as proportional reasoning (Wavering, 1984) and general problem solving skills (Brenner et al., 1997; Hansen, McCann, & Myers, 1985).
Fourteen grade 5-6 classes (mean age 11 years, 9 months), in a large school district in Ontario (Canada), were randomly assigned to two conditions. The treatment group (N=176 students) was older (median age 11 years) and had a higher proportion of females (51%) than controls (N=174, median age 10 years, 46% female). No systematic data on students’ social class or ethnicity were collected. Teachers in both conditions reported that students came from a range of economic circumstances with few visible minorities. A few (<2%) mentally handicapped students were excluded from both samples.
Outcome Measures. Students completed an achievement test on three occasions: pre, post (after 8 weeks), and retention (4 weeks after the posttest). The post and retention items were probability problems measuring the Data Management and Probability expectations of the 1997 Ontario Curriculum for grade 5-6. The pretest was a general measure of problem solving (selected from Kulm, 1994) because students had little instruction on probability prior to the study. A marker with a Ph.D. in mathematics education coded the achievement items. The marker was blind to the experimental conditions of the students and to study goals.
In the pretest students designed rectangular dog pens using 24 meters of fencing. They were asked to make drawings of possible pens, select one that they would build, and explain why. The coding scheme for interpreting student responses had three levels: does not meet expectations (level 1), minimally satisfactory (level 2), and satisfactory answer demonstrating understanding of mathematics concepts (level 3). The levels were divided into high and low responses, creating a scale with six values: 1-, 1+, 2-, 2+, 3-, 3+. Each response was holistically placed in one of these levels using three criteria: reasoning or strategy for solving the problem, accuracy of concepts and computations, and communication of argument. Appendix 1 (Dog Pen) contains the rubric.
The posttest of achievement consisted of two items. Students were given a list of 17 movies, their starting times (9 values), and their categorization according to audience suitability (3 values). The first task (data management) was to create a bar chart showing how many movies were offered at each time of day. The second task (probability) was to find the probability of seeing a movie with a particular rating at a particular time. Student responses were coded into the same 6-point scale used in the pretest. Appendix 1 (Data Management, Probability) contains the rubric.
The retention test consisted of 16 probability items (e.g., what is the probability of getting heads when a coin is tossed?) that were scored correct or incorrect. The number correct was converted to the six-point scale used in the pretest: 0-1 correct=level 1-, 2-3 correct=level 1+, 4-8 correct=level 2-, 9-12 correct=level 2+, 13-14 correct=level 3-, and 15-16 correct (including the two most difficult items)=level 3+.
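The conversion above can be written as a simple lookup. The following is a minimal sketch in Python; the function name `retention_level` is hypothetical, and the sketch does not model the additional condition that level 3+ includes the two most difficult items, because the paper does not say how a score of 15 that misses one of those items was handled.

```python
def retention_level(n_correct: int) -> str:
    """Map a 0-16 number-correct score to the six-point rubric scale."""
    if not 0 <= n_correct <= 16:
        raise ValueError("n_correct must be between 0 and 16")
    # Upper bound of each band, in ascending order, paired with its level.
    # Assumption: the "two most difficult items" condition for 3+ is ignored.
    bands = [(1, "1-"), (3, "1+"), (8, "2-"), (12, "2+"), (14, "3-"), (16, "3+")]
    for upper, level in bands:
        if n_correct <= upper:
            return level
    raise AssertionError("unreachable: every valid score falls in a band")


print([retention_level(n) for n in (0, 3, 8, 9, 14, 15)])
```

Note that the bands are of unequal width (for example, five scores map to level 2- but only two to level 3-), so the converted scale is ordinal rather than a linear rescaling of the number correct.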
Intra-rater reliability based on a random sample of 60 responses for each test was acceptable: for the pretest Cohen’s κ=.62 for exact agreement and .92 for within one point on the 6-point scale; for the two post-test items Cohen’s κ=.57 and .93 for exact agreement and .92 and .97 for within one point on the scale; for the retention test Cohen’s κ=.96 for exact agreement and 1.0 for within one point.
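For readers unfamiliar with the statistic, the sketch below shows how an unweighted Cohen's kappa and a within-one-point agreement rate could be computed from two codings of the same responses. It is an illustration only: the function names are hypothetical, the six rubric levels are assumed to be coded as the integers 1 to 6, and the within-one-point figure is computed here as a simple proportion, which may not be the index used in the study.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two equal-length lists of category codes."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty codings of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    count_1, count_2 = Counter(first), Counter(second)
    # Chance agreement: product of the two marginal proportions, summed over categories.
    expected = sum(count_1[c] * count_2[c] for c in count_1) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def within_one_point(first, second):
    """Proportion of paired integer codes that differ by at most one scale point."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty codings of equal length")
    return sum(abs(a - b) <= 1 for a, b in zip(first, second)) / len(first)


# Made-up codings for illustration only (not data from the study).
coding_1 = [1, 1, 2, 2]
coding_2 = [1, 1, 2, 1]
print(cohen_kappa(coding_1, coding_2), within_one_point(coding_1, coding_2))
```

Kappa discounts the agreement expected by chance from the marginal distributions, which is why it can be noticeably lower than the raw proportion of identical codes.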
Tests of Sample Equivalence. Four instruments (derived from the model in Figure 1) were administered on the pretest to determine sample equivalence. Self-evaluation was measured with 6 items (alpha=.93). After completing the pretest achievement task students rated their overall performance with a 1-10 scale (anchored by 1=not well and 10=very well). They used the same scale to rate five dimensions of their problem solving (“How well you…understood the problem, made a plan, solved the problem, checked the solution, and explained the solution.”). Student self-efficacy consisted of 6 items identical to the self-evaluation measure except that each asked about expected future performance (“how sure are you that you could…”) rather than focusing on past performance (alpha=.89). Attitudes to self-evaluation consisted of 10 Likert items adapted from Paris, Turner, and Lawton (1990) and Wiggins (1993). Ross et al. (1998-b) presented evidence of the validity and reliability of the instrument but in this study the internal consistency was poor (alpha=.53). The goal orientations survey consisted of 16 items from Meece et al. (1988) distinguishing three orientations toward learning: mastery (alpha=.83), ego (alpha=.63), and affiliation (alpha=.65).
In the first six weeks treatment students were given direct instruction on how to evaluate their work. There were six 30-minute lessons in which the teacher demonstrated a particular self-evaluation technique or engaged students in a discussion of their self-evaluations. For example, in one activity students cooperatively developed a rubric. The activity began with students individually solving a mathematics problem. In a whole-class setting, students suggested criteria for judging the quality of their performance. The teacher recorded the suggestions on the board and asked groups of four to reach consensus and then vote as a class on which criteria were most important. After determining the top four criteria, the teacher had each small group describe high, medium, and low performance on one criterion. Outside of class, the teacher reworked student suggestions to construct a rubric that used student ideas and language, while reflecting expectations of the curriculum. Students then used the co-developed rubric to evaluate their work. Students worked through other activities based on the four-stage model, including 11 short practice sessions in which students completed a 3-5 minute self-evaluation using a form provided by the teacher. For example, students might be asked to assess how well they performed the social skill of the day (such as “praising good ideas and disagreeing agreeably”) using a 4-point “poorly” to “very well” scale. This form also had a place for the teacher to evaluate the student’s performance on the same scale. In a practice activity focused on mathematics, students used the same 4-point scale to rate how well they performed four recurrent problem solving skills (understanding the problem, devising a plan, carrying out the plan, and checking the solution). The form provided space for self-assessments on four separate occasions so that students could see their progress over time.
The activities implemented by teachers were based on suggestions in a teacher handbook (Rolheiser, 1996) and on ideas that teachers developed in working sessions during the in-service (described below). Teachers also received a handbook of performance tasks (Ontario Association for Mathematics Education, 1996).
In weeks 7-8 teachers implemented a data management and probability unit. In small groups students conducted a number of experiments in which they made predictions about the probability of various results produced by random generation devices such as number cubes. Students drew samples and represented their results in a variety of ways (pictographs, bar and line graphs, tally sheets). The activities were designed to demonstrate equal likelihood of outcomes and test probability estimates by counting. Other features of the unit were the use of manipulatives, real-life examples, and student dialogue about reasons for solutions. Teachers in the treatment condition continued to assign self-evaluation tasks during the probability unit.
Teachers in the treatment condition attended 3 three-hour, after-school in-service sessions distributed over the eight weeks. The three sessions modeled classroom activities (e.g., a tangram task was used to model the development of a rubric), provided structured opportunities for teachers to share successful self-evaluation activities and identify problems, and enabled teachers to collaboratively plan self-evaluation activities for their own classrooms. During these sessions the three authors recorded teacher plans, successes and problems. In addition artifacts (primarily lesson plans) were collected. Treatment teachers also attended four brief team meetings in their schools to review progress and solve problems that arose during their enactment of the treatment.
During the eight weeks of the project control group teachers continued teaching mathematics as they usually did without overt self-evaluation training. In the last two weeks control group classes worked through the probability unit without self-evaluation activities. Control group teachers received the probability unit in a three-hour after-school workshop. They (unlike treatment teachers) were also given a half-day of release time in their schools to plan how to use the probability unit. Control teachers also received a handbook of performance tasks (Ontario Association for Mathematics Education, 1996).
Table 1 reports the student achievement means and standard deviations for each study condition on three occasions and the results of the tests for pretest equivalence. Treatment students scored significantly higher than controls when rating themselves on the pretest achievement task [t(294.55)=2.09, p=.037] and treatment students tended to be older [t(287.72)=11.76, p<.001]. These between-sample differences were significantly correlated, albeit weakly (r=.18 to .27), with achievement on the post- and retention-tests.
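The fractional degrees of freedom in these t tests are consistent with an unequal-variance (Welch) test, although the paper does not name the test used. Under that assumption, the statistic and its Welch-Satterthwaite degrees of freedom can be sketched as follows; the function name `welch_t` and the sample values are hypothetical.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t test on two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation; the result is usually not a whole number.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up values for illustration only (not data from the study).
t_value, df_value = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_value, 3), round(df_value, 2))
```

When the two groups have similar variances and sizes, the approximate degrees of freedom approach the pooled-variance value of n1 + n2 - 2.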
We used the General Linear Modeling (GLM) procedures in SPSS to determine the effect of the treatment on the outcome measure, mathematics achievement. The within-subject factor was time (i.e., changes in achievement scores from pre- to post- to retention-tests). The between-subject factors were study condition (treatment or control), age, and pretest self-evaluation of mathematics performance (the latter two were continuous variables). We included age and pretest self-evaluation score in the equation because we found differences between the treatment and control groups on these variables at the beginning of the study. We had no explicit hypotheses as to how age and pretest self-evaluation might moderate the relationship between self-evaluation training and student achievement.
The two-way interaction of time × treatment tested the hypothesis of Study 1. There was no significant relationship between self-evaluation training and mathematics achievement [F(2,590)=.102, p=.903]. Even though the multivariate relationship was not significant we also examined the univariate findings: there were no significant differences between the treatment and control groups for either the linear [F(1,295)=.164, p=.686] or the quadratic [F(1,295)=.023, p=.880] trend. Students who were in the treatment group performed no better on the post- and retention tests than students who were in the control group, even when potential moderators that correlated with the post- and retention-test scores were included in the analysis. The effect size of the treatment was negligible: .08 on the post-test and .02 on the retention test.
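The paper does not state how the effect size was calculated. One common definition, assumed here purely for illustration, is the standardized mean difference: the treatment mean minus the control mean, divided by the pooled standard deviation. The function name and the scores below are hypothetical.

```python
from statistics import mean, variance


def standardized_mean_difference(treatment, control):
    """Treatment-minus-control mean difference in pooled standard deviation units."""
    n_t, n_c = len(treatment), len(control)
    if n_t < 2 or n_c < 2:
        raise ValueError("each group needs at least two scores")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5


# Made-up scores for illustration only (not data from the study).
print(standardized_mean_difference([2, 4, 6], [1, 3, 5]))
```

On this definition, a value of .08 means the treatment mean sat less than a tenth of a standard deviation above the control mean.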
The main finding of the study is that training in self-evaluation had no impact on student achievement. But we could not rule out several confounding factors that may have contributed to the non-impact of the treatment:
First, despite random assignment of classes, the groups were not equivalent. Treatment students on entry to the study rated their mathematics achievement higher than controls and treatment students tended to be older. Although we made statistical adjustments, we could not adjust for the impact these sample differences might have had on teacher pacing decisions and student observations of peer performance.
Second, the treatment may have been too short. Students’ understanding of their role in evaluation develops over their entire school career. An 8-week intervention may be insufficient to overturn the belief that evaluation is something done to students rather than a process in which they have a personal responsibility. In our previous research we found that student cognitions about evaluation changed when they were taught self-evaluation techniques but many of the misconceptions they had about the process, particularly its contribution to improved performance, continued unabated (Ross et al., 1998-b). Teachers reported that students were reluctant to self-evaluate in mathematics because they lacked key terms for describing their work, they were uncomfortable due to math anxiety, and some students had difficulty seeing gradations in performance, believing answers were correct or not.
Third, although the pre-, post-, and retention-tests each sampled the objectives of the grade 5-6 mathematics curriculum, they measured different objectives (problem solving, data management and probability, and probability respectively) in different ways (performance tasks for the pre- and post-tests; short answer items for the retention-test). Test nonequivalence made it more difficult to interpret treatment-time interactions. In addition students learned self-evaluation in a variety of topics but the achievement measure was based on the two-week data management and probability unit.
Study 2 was a replication of Study 1 with a few modifications to correct threats to the validity of Study 1. Once again we hypothesized that students who were trained how to evaluate their work would have higher mathematics achievement. Study 2 was a quasi-experimental pre-post design involving 12 teachers from the district in Study 1 matched with 12 teachers from an adjacent district. Our goal in shifting to a matched sample design was to increase the likelihood of equivalent groups, a problem in Study 1. The student measures to determine sample equivalence were modified by adding math anxiety. Our rationale was that math anxiety is a strong negative predictor of mathematics achievement (the meta-analysis of Schwarzer, Seip, & Schwarzer, 1989 found a correlation of r=-.30). However, we retained the self-efficacy measure used in Study 1 and did not revise our model in Figure 1 because other research (reviewed by Pajares & Urdan, 1996) suggests that math self-efficacy is a better predictor of achievement than anxiety. The duration of the treatment was extended to 12 weeks to address the concern that the Study 1 results might be attributable to insufficient treatment. The achievement measure was a performance task measuring problem solving skills addressed throughout the 12 weeks to address Study 1’s use of nonequivalent tests. Both treatment and control group teachers received the same release time to remedy a minor difference in the experimental conditions of Study 1. However, the amount of in-service time and the curricular support for teachers was increased.
Twelve grade 5-6 teachers, seven of whom participated in the treatment group in Study 1, volunteered for the study. They were matched (on grade, gender and experience) with 12 teachers from an adjacent district who were not using systematic self-evaluation procedures.
Students completed a performance task at the beginning and end of the project. Both were adapted from Kulm (1994). On the pretest students were given $10 to purchase items for a puppy. There were prices from three retail outlets for each of the three items. The task was to “Choose a possible selection of collar, dish and toy that you could buy. What is the cost? How much change would you receive? How many different ways could you buy the three items and still spend $10 or less. Show each combination.” On the posttest the task was “You have been hired by Hanson to design their stage for their upcoming concert in Toronto. The stage should be rectangular and have an area of 200 square meters. It will have a security rope on both sides and the front. Make some drawings of different rectangles that have 200 square meters of area. How many meters of security rope are needed for each one? Which shape would you use? Why?”
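To make the mathematical demands of the posttest task concrete, the sketch below enumerates candidate stage designs. It rests on assumptions that are not in the task wording: side lengths are restricted to whole numbers of meters, and "both sides and the front" is read as one front edge plus two side edges with the back left open. The function name `stage_designs` is hypothetical.

```python
def stage_designs(area=200):
    """List (front, depth, rope) for whole-number rectangles with the given area."""
    designs = []
    for front in range(1, area + 1):
        if area % front == 0:
            depth = area // front
            # Rope runs along the front and both sides; the back is left open.
            designs.append((front, depth, front + 2 * depth))
    return designs


for front, depth, rope in stage_designs():
    print(f"{front} m x {depth} m needs {rope} m of rope")

# One defensible criterion among several: the design that needs the least rope.
print(min(stage_designs(), key=lambda design: design[2]))  # (20, 10, 40)
```

Least rope is only one possible justification for a choice of shape; the task's "Which shape would you use? Why?" also admits arguments about sight lines or stage depth, which is what the reasoning and communication criteria in the rubric are meant to capture.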
The tests were independently coded by two experienced teachers using a modified version of the rubric in Appendix 1. A fourth level was added showing extensions (in reasoning, accuracy in concepts and computations, and communication) beyond expectations for the grade. Markers first assigned a level to each response and then determined whether it was high or low within the assigned level, creating an 8-point scale (1-, 1+, etc.). Inter-rater reliability was acceptable: Cohen’s κ=.73 for exact agreement and .90 for within one point (i.e., half level) on the scale.
The tests of sample equivalence consisted of the same measures as in Study 1 (self-evaluation, self-efficacy, and goal orientations) and two new measures. Math confidence (i.e., less threatening expressions of math anxiety) and math anxiety (i.e., more threatening expressions of anxiety) consisted of 10 Likert items, originally developed by Betz, that have been extensively used in previous research. Pajares and Urdan (1996) presented evidence of their reliability and validity. The reliabilities of the sample equivalence instruments, shown in Table 2, were adequate, except for the three-item ego and affiliation goal orientation measures.
The treatment condition was very similar to that of Study 1 but lasted 12 weeks rather than 8. Students were engaged in co-developing evaluation criteria, applying the criteria in self-evaluations, receiving feedback from teachers and peers on the accuracy of their self-evaluations, and setting goals for the future. Training for teachers increased from 9 to 15 after-school hours with an additional 3 hours of in-school release time. Teachers received the same materials as in Study 1 with an additional document that provided more examples about how to teach each of the four stages of self-evaluation in mathematics. Control teachers received no in-service on self-evaluation but they did receive 3 hours of in-school release time to develop activities for teaching problem solving.
Table 2 displays the means and standard deviations for Study 2. There was a significant difference between the samples on affiliative orientations to learning (the treatment group was higher) and on age (the treatment group was younger). Neither affiliative orientation (r=-.071, n=473, p=.123) nor age (r=.041, n=474, p=.480) was associated with posttest achievement but pretest achievement was (r=.366, n=476, p=.001). An analysis of covariance was conducted in which the outcome measure was mathematics achievement, the covariate was pretest achievement, and the independent variable was study condition (treatment or control). Pretest scores predicted posttest achievement [F(1,475)=82.20, p<.001]. The treatment had a significant effect [F(1,475)=14.58, p<.001]. Treatment students outperformed controls. The effect was small (ES=.40).
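The logic of the covariance adjustment can be illustrated with a short sketch. With one covariate and two groups, the analysis of covariance estimates the treatment effect as the raw posttest difference minus the pooled within-group regression slope times the pretest difference. The code below computes only that adjusted difference, not the F tests; it assumes equal slopes in the two groups, and the function name and scores are hypothetical rather than taken from the study.

```python
from statistics import mean


def ancova_adjusted_difference(pre_t, post_t, pre_c, post_c):
    """Treatment-minus-control posttest difference adjusted for pretest scores."""

    def centered_sums(pre, post):
        # Within-group sum of cross-products and sum of squares of the covariate.
        m_pre, m_post = mean(pre), mean(post)
        sxy = sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post))
        sxx = sum((x - m_pre) ** 2 for x in pre)
        return sxy, sxx

    sxy_t, sxx_t = centered_sums(pre_t, post_t)
    sxy_c, sxx_c = centered_sums(pre_c, post_c)
    # Pooled within-group slope of posttest on pretest (assumes equal slopes).
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    raw_difference = mean(post_t) - mean(post_c)
    return raw_difference - slope * (mean(pre_t) - mean(pre_c))


# Made-up scores: identical posttest means, but the treatment group started lower.
print(ancova_adjusted_difference([1, 2, 3], [3, 4, 5], [2, 3, 4], [3, 4, 5]))
```

In the made-up example the raw posttest means are equal, yet the adjusted difference favors the treatment group because it began with lower pretest scores; this is the sense in which the covariate corrects for initial nonequivalence.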
Study 2’s results conflicted with those of Study 1. Training in self-evaluation had an impact on student achievement only when students experienced self-evaluation for a longer period and additional in-service and curricular support were provided to teachers. The treatment in Study 1 was of the same type and duration as a study that found that the writing performance of grade 4-6 students improved when they were trained how to evaluate their narratives. Trained students increased integration of story elements around a central theme and more consistently adopted a narrative voice than students in the control group (Ross et al., 1999-a). Hillocks (1986) reviewed seven studies (all but one unpublished) that found that student writing improved when students were given scales for judging writing samples and used them to appraise the work of their peers and themselves. Arter, Spandel, Culham, and Pollard (1994) reported significant treatment effects when grade 5 students were taught how to apply a trait analysis scheme to their writing. These results suggest that subject has a moderating effect on the impact of self-evaluation on achievement.
The reason why self-evaluation requires more implementation effort to have an effect in mathematics than in language might be attributable to the nature of mathematics and teachers’ representation of it. The deep structure of mathematics may be more difficult for teachers to recover than the structures underlying writing. Most teachers understand mathematics as a set of arbitrary rules and algorithms to be memorized (Prawat & Jennings, 1997), in contrast with the reform view of mathematics as a dynamic set of intellectual tools for solving meaningful problems (e.g., NCTM, 2000). Teacher images of mathematics affect how they teach it. For example, Fennema, Franke, and Carpenter (1993) observed an exemplary teacher who was able to engender a high degree of student metacognition because her knowledge of mathematics was extensive, accurate, and hierarchically organized. In contrast, teachers who are uncertain about their grasp of the subject might find it difficult to identify performance criteria that run through a variety of topics within a course. Stein, Baxter, and Leinhardt (1990) found that a generalist teacher who lacked subject content knowledge was limited in her ability to make connections within grade 5 mathematics. Teachers might also be reluctant to share responsibility for assessment if they are uneasy about their ability to defend their own assessment decisions. Grade 5-6 teachers are more likely to have taken undergraduate courses in language than in mathematics and assign greater priority to the former than the latter when making in-service selections (Spillane, 2000). Experienced teachers teaching topics that are less familiar to them are more likely to use low risk strategies that reduce opportunities for student involvement in classroom decision making than when teaching familiar topics (Carlsen, 1993; Lee, 1995).
Students’ ability to self-evaluate might also be negatively affected by the nature of the discipline. For example, students might find it difficult to identify and apply assessment criteria and to generate precise goals to remedy their deficiencies if they represent mathematics learning as the mastery of discrete facts and procedures. The lack of positive effects of self-evaluation in mathematics may also be attributable to the difficulty students have in talking about their math performance, in contrast with the relative ease in describing language attainments. Resnick (1988) found that reciprocal teaching did not translate easily from language to mathematics classrooms, in large part because of difficulty in creating generic math prompts and student roles comparable to those used in the language version. The challenge faced in teaching self-evaluation in math may be embedded in the larger issue of the constructivist classroom in which teacher-student talk is a central vehicle for learning. Researchers (e.g., Williams & Baxter, 1996) have frequently observed how difficult it is for teachers, even those committed to constructivist approaches and trained in their use, to sustain productive dialogue on mathematics concepts.
The studies reported in this article contribute to knowledge in two ways. First, research on the effects of mathematics education reform (reviewed in Ross, McDougall, & Hogaboam-Gray, 2001) indicates that implementation of reform contributes to problem solving achievement, that change in assessment practice is only one dimension of reform implementation, that changing mathematics instruction is very difficult, and that teacher change has been obtained in multi-dimensional interventions. The nonsignificant result of Study 1 suggests that an intervention based only on changing assessment strategies may be insufficient to move teaching and learning of mathematics. In Study 2 greater attention was given to specific teacher concerns about using self-evaluation in mathematics and a booklet of subject-specific examples was provided. The positive result of Study 2 suggests that student assessment can be usefully combined with other dimensions of reform, particularly in-service that focuses on teachers’ cognitions about the nature of mathematics.
Second, Study 2 provides evidence of the positive effects of one approach to alternate assessment on mathematics achievement, providing additional support for the consequential validity argument for alternate assessment. It has been argued that shifting to assessment practices that are tightly integrated into daily instruction will have beneficial consequences for teachers and students (e.g., Wiggins, 1993). It is predicted that alternate assessments will focus teacher attention on the objectives to be measured and provide teachers with more useful information than is afforded by traditional tests. This focusing of teacher attention will contribute to improved student performance. Very little empirical data is available to test these predictions, especially the claims about student effects. One of the few studies to explore the effects of alternate assessment (Shepard, Flexer, Hiebert, Marion, Mayfield, & Weston, 1996) found a small but significant effect (ES=.13) for performance assessment on mathematics learning but not for reading. Our studies found positive effects for self-evaluation: in writing (Ross et al., 1999-a) and in mathematics (but only with increased support beyond that provided for the writing experiment). The findings suggest that the effects of student assessment vary with subject. Proponents of alternate assessment should seek to identify the conditions under which particular assessment practices are more effective than traditional assessment, rather than assume that one approach will be universally superior.
American Statistical Association--National Council of Teachers of Mathematics Joint Committee. (1985). Teaching statistics within the K-12 curriculum. Washington: American Statistical Association.
Ames, C. (1992). Classrooms: Goals, structures, and student motivation. Journal of Educational Psychology, 84, 261-271.
Anderson, J. R., Reder, L. M., & Simon, H. A. (1996). Situated learning and education. Educational Researcher, 25(4), 5-11.
Arter, J., Spandel, V., Culham, R., & Pollard, J. (1994, April). The impact of training students to be self-assessors of writing. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman.
Bramald, R. (1994). Teaching probability. Teaching Statistics, 16(3), 85-89.
Brenner, M. E., Mayer, R. E., Moseley, B., Brar, T., Duran, R., Reed, B. S., & Webb, D. (1997). Learning by understanding: The role of multiple representations in learning algebra. American Educational Research Journal, 34(4), 663-689.
Carlsen, W. (1993). Teacher knowledge and discourse control: Quantitative evidence from novice biology teachers' classrooms. Journal of Research in Science Teaching, 30(5), 471-481.
Dweck, C., & Leggett, E. (1988). A social-cognitive approach to motivation and personality. Psychological Review, 95, 256-273.
Fennema, E., Franke, M., & Carpenter, T. (1993). Using children's mathematical knowledge in instruction. American Educational Research Journal, 30(3), 555-583.
Fontana, D., & Fernandes, M. (1994). Improvements in math performance as a consequence of self-assessment in Portuguese primary school pupils. British Journal of Educational Psychology, 64, 407-417.
Garfield, J., & Ahlgren, A. (1988). Difficulties in learning basic concepts in probability and statistics: Implications for research. Journal for Research in Mathematics Education, 19(1), 44-63.
Geary, D. C. (1995). Reflections of evolution and culture in children's cognition: Implications for mathematical development and instruction. American Psychologist, 50(1), 24-37.
Green, D. R. (1983). A survey of probability concepts in 3000 pupils aged 11-16 years. In D. R. Grey, P. Holmes, V. Barnett, & C. Constable (Eds.), Proceedings of the first international conference on teaching statistics (pp. 766-783). Sheffield, UK: Teaching Statistics Trust.
Hansen, R., McCann, J., & Myers, J. (1985). Rote versus conceptual emphases in teaching elementary probability. Journal for Research in Mathematics Education, 16(5), 364-374.
Henry, D. (1994). Whole language students with low self-direction: A self-assessment tool. Virginia: University of Virginia. ED 372359.
Hillocks, G. (1986). Research on written composition: New directions for teaching. Urbana, IL: ERIC Clearinghouse on Reading and Communication Skills.
Hughes, B., Sullivan, H., & Mosley, M. (1985). External evaluation, task difficulty, and continuing motivation. Journal of Educational Research, 78(4), 210-215.
Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and biases. London: Cambridge University Press.
Konold, C., Pollatsek, A., Well, A., Lohmeier, J., & Lipson, A. (1993). Inconsistencies about students' reasoning about probability. Journal for Research in Mathematics Education, 24(5), 392-414.
Kulm, G. (1994). Mathematics assessment: What works in the classroom. San Francisco: Jossey-Bass.
Lee, O. (1995). Subject matter knowledge, classroom management, and interactional practices in middle school science classrooms. Journal of Research in Science Teaching, 32(4), 423-440.
Meece, J., Blumenfeld, P., & Hoyle, R. (1988). Students' goal orientations and cognitive engagement in classroom activities. Journal of Educational Psychology, 80(4), 514-523.
National Council of Teachers of Mathematics. (1995). Assessment standards for school mathematics. Reston, VA: Author.
Nicholls, J. G. (1984). Achievement motivation: Conceptions of ability, subjective experience, task choice, and performance. Psychological Review, 91, 328-346.
Ontario Association for Mathematics Education. (1996). Linking assessment and instruction in mathematics. Rosseau, ON: Author.
Ontario Ministry of Education and Training. (1997). Grades 1-8 Mathematics. Toronto: Author.
Pajares, F., & Urdan, T. (1996). Exploratory factor analysis of the Mathematics Anxiety Scale. Measurement and Evaluation in Counselling and Development, 29(1), 35-47.
Paris, S., Turner, J., & Lawton, T. (1990, April). Students' views of standardized achievement tests. Paper presented at the annual meeting of the American Educational Research Association, Boston.
Prawat, R., & Jennings, N. (1997). Students as context in mathematics reform: The story of two upper-elementary teachers. Elementary School Journal, 97(3), 251-270.
Resnick, L. (1988). Teaching mathematics as an ill-structured discipline. In R. Charles & E. Silver (Eds.), The teaching and assessing of mathematical problem solving (pp. 32-60). Hillsdale, NJ: Erlbaum.
Rolheiser, C. (Ed.) (1996). Self-evaluation: Helping students get better at it. Toronto: Visutronx.
Ross, J. A. (1995). Effects of feedback on student behavior in cooperative learning groups in a grade 7 math class. Elementary School Journal, 96(2), 125-143.
Ross, J. A., McDougall, D., & Hogaboam-Gray, A. (2001). Research on reform in mathematics education, 1993-2000. Submitted to Alberta Journal of Educational Research.
Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1999-a). Effects of self-evaluation training on narrative writing. Assessing Writing, 6(1), 107-132.
Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1998-a). Student evaluation in cooperative learning: Teacher cognitions. Teachers and Teaching, 4(2), 299-316.
Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1999-b). Teaching students how to self-evaluate when learning cooperatively: The effects of collaborative action research on teacher practice. Elementary School Journal, 99(3), 255-274.
Ross, J. A., Rolheiser, C., & Hogaboam-Gray, A. (1998-b). Skills training versus action research in-service: Impact on student attitudes to self-evaluation. Teaching and Teacher Education, 14(5), 463-477.
Schoen, H. L., Fey, J. T., Hirsch, C. R., & Coxford, A. F. (1999). Issues and options in the math wars. Phi Delta Kappan, 80(6), 444-453.
Schunk, D. (1981). Modelling and attributional effects on children's achievement: A self-efficacy analysis. Journal of Educational Psychology, 73(1), 93-105.
Schunk, D. (1995, April). Goal and self-evaluative influences during children's mathematical skill acquisition. Paper presented at the annual conference of the American Educational Research Association, San Francisco.
Schunk, D. (1996). Goal and self-evaluative influences during children's cognitive skill learning. American Educational Research Journal, 33(2), 359-382.
Schwarzer, R., Seip, B., & Schwarzer, C. (1989). Mathematics performance and anxiety: A meta-analysis. In R. Schwarzer, H. M. van der Ploeg, & C. D. Spielberger (Eds.), Advances in Test Anxiety Research (Vol. 6, pp. 105-119). Berwyn, PA: Swets North America.
Shepard, L. A., Flexer, R. J., Hiebert, E. H., Marion, S. F., Mayfield, V., & Weston, T. J. (1996). Effects of introducing classroom performance assessments on student learning. Educational Measurement: Issues and Practice, 15(3), 7-18.
Spillane, J. P. (2000). A fifth-grade teacher's reconstruction of mathematics and literacy teaching: Exploring interactions among identity, learning, and subject matter. Elementary School Journal, 100(4), 307-330.
Stein, M., Baxter, J., & Leinhardt, G. (1990). Subject-matter knowledge and elementary instruction: A case from functions and graphing. American Educational Research Journal, 27(4), 639-663.
Stiggins, R. (1994). Student-centered classroom assessment. New York: Merrill.
Stipek, D., Recchia, S., & McClintic, S. (1992). Self-evaluation in young children. Monographs of the Society for Research in Child Development, 57, 1-84.
Urdan, T., & Maehr, M. (1995). Beyond a two-goal theory of motivation and achievement: A case for social goals. Review of Educational Research, 65(3), 213-243.
Villasenor, A., & Kepner, H. S. (1993). Arithmetic from a problem-solving perspective: An urban implementation. Journal for Research in Mathematics Education, 24(1), 62-69.
Wavering, M. (1984). Interrelationships among Piaget's formal operations schemata: Proportions, probability, and correlation. Journal of Psychology, 118(1), 57-64.
Wigfield, A., Eccles, J. S., & Rodriguez, D. (1998). The development of children's motivation in school contexts. In P. D. Pearson, & A. Iran-Nejad (Eds.), Review of Research in Education, Vol. 23 (pp. 73-118). Washington, DC: American Educational Research Association.
Wiggins, G. (1993). Assessing student performance: Explore the purpose and limits of testing. San Francisco, CA: Jossey-Bass.
Williams, S., & Baxter, J. (1996). Dilemmas of discourse-oriented teaching in one middle school mathematics classroom. Elementary School Journal, 97(1), 21-38.
Zimmerman, B. J. (1989). A social cognitive view of self-regulated academic learning. Journal of Educational Psychology, 81(3), 329-339.
 This research was funded by the Social Sciences and Humanities Research Council of Canada and the Ontario Ministry of Education. The views expressed in the report do not necessarily reflect the views of the Council or the Ministry. We thank the teachers and students of Durham Catholic District School Board for their assistance.
 Models are simplifications. Achievement is influenced by myriad other factors, e.g., prior knowledge/ability, opportunity to learn, home background--to name but three salient variables not identified in Figure 1. We excluded these variables in order to focus attention on the variables most directly related to the effect of self-evaluation on student achievement.
 We also administered a battery of teacher instruments, finding no significant pretest differences between treatment and control teachers on teacher efficacy, self-reported use of student assessment procedures, gender, teaching experience, and beliefs about teaching. These data are not reported because the sample size is too small for meaningful comparisons.
 There were no significant multivariate or univariate three-way interactions of treatment X time X age or treatment X time X pretest self-evaluation score.
 It is not our intention to enter into the debate concerning the evidence for (e.g., Fennema et al., 1993; Schoen, Fey, Hirsch, & Coxford, 1999; Villasenor & Kepner, 1993) and against (Anderson, Reder, & Simon, 1996; Geary, 1995) constructivist approaches to mathematics teaching. Our finding that self-evaluation training does not contribute to mathematics achievement could be treated as evidence against constructivist approaches (because self-evaluation involves such constructivist principles as students reflecting on their learning and teachers sharing control with students). However, it is possible to have students self-evaluate in a traditional mathematics program dominated by the search for correct answers to decontextualized tasks.