The University of Illinois at Chicago goes on a fishing trip

Title: Subsequent-Grades Assessment of Traditional and Reform Calculus
Author(s): Judith Lee Baxter, Dibyen Majumdar, Stephen D. Smith
Publication type: article

Through 1994-1995 the University of Illinois at Chicago taught first-year calculus in the `traditional' manner; in 1994-1995, for example, the course used Stewart's book. From 1995-1996 on, the university switched to `reform' calculus for this first-year calculus sequence, using the book by Hughes-Hallett.

In this study the authors examine the results of this change. The grades of the 1994-1995 calculus class are compared to those of the 1995-1996 calculus class. Even though the ACT and mathematics placement scores of both classes are very similar, the reform class got significantly higher grades in calculus. The authors also study the grades these students obtained in other classes (and consider this to be a better measure of success than the calculus grades). The authors do not draw any firm, explicit conclusions, but the overall impression this reader got from the article is that they think their study shows that the reform approach works better. The following statement by the authors supports this impression:

We just wanted to assess the progress of the changeover, to decide whether to continue with the new method.

Since the University of Illinois at Chicago indeed continued with the `new method', we must assume that they considered the progress shown by the assessment to be satisfactory. As I will point out, it is at least debatable whether this is indeed the case.

In the article it is said:

We measured many other science and engineering courses available to us.

It is not mentioned how many, but it seems that this might be as high as a few dozen. The authors mention that for six courses the difference between the two approaches is statistically significant at the 5% level (4 in favor of reform and 2 in favor of traditional). Now you should remember what statistically significant means. Statistically significant at the 5% level means that, if there is no real difference between the two groups, a difference this large or larger occurs by chance in only 5% of the cases. If you consider a few dozen cases (which this study seems to do), then you should expect a couple of them to be statistically significant at the 5% level by chance alone; that follows directly from the definition of `statistically significant'! Statisticians have a word for `research' like this: it's called a fishing trip. There are statistical methods to deal with this issue, the simplest one being the Bonferroni method, which amounts to dividing the 5% by the number of cases that you consider.
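To make the arithmetic concrete, here is a minimal sketch in Python. The number of courses (36) is purely my assumption, standing in for `a few dozen'; the article does not give the actual count. The second function additionally assumes that the tests are independent of each other, which is a simplification.

```python
# Hypothetical number of courses compared; the article does not state it.
N_COURSES = 36
ALPHA = 0.05  # the 5% significance level used in the article


def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of 'significant' results when no real difference exists."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=ALPHA):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(n_tests, alpha=ALPHA):
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    print("expected false positives:", expected_false_positives(N_COURSES))
    print("P(at least one):", prob_at_least_one_false_positive(N_COURSES))
    print("Bonferroni per-test level:", bonferroni_level(N_COURSES))
```

Under these assumptions one expects about 1.8 spuriously significant courses, and the Bonferroni method would require each individual comparison to be significant at roughly the 0.14% level rather than the 5% level.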

One of the subsequent courses is studied in depth: Physics I. The authors only study the students who take this course immediately after the first calculus course (more on this later). The authors claim that their analysis shows that the reform students significantly outperformed the traditional students. This is, however, very questionable. Earlier, the authors state about the calculus sequence:

Another obviously important variable is the individual instructor. In our 2-year pool, by necessity only about 11 faculty and a similar number of TA’s worked with each sub-pool, so it would be extremely difficult to completely eliminate this factor. As earlier noted under "Background", the department intended a neutral selection of high-quality instructors, sequence-wide. Ideally also the more substantial size of our student pool will provide some compensation for the influence of individual teaching style.

They completely forget about this when they study the results in Physics I. I do not have the data for Spring 1995 and Spring 1996 (the semesters of the study), but in Spring 2007 there were two sections of Physics I, with the same instructor! If the staffing was similar in the study semesters, it is very plausible that the difference in scores between the traditional and the reform students in Physics I is due to teaching and assessment differences in this course between Spring 1995 and Spring 1996, and has nothing to do with reform versus traditional in the calculus sequence. As all students know: some teachers are good, some are bad; some teachers give high grades on average, some give low grades on average. The teacher factor cannot simply be ignored in the grade analysis of Physics I.
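The effect of ignoring the teacher can be illustrated with a small simulation. This is only a sketch under invented assumptions, not a reconstruction of the study: each cohort is taught by a single instructor whose grading shifts every grade in that cohort by a common random offset, there is no true difference between the cohorts, and the cohort size, the spread of the instructor offsets and the spread of the student grades are all hypothetical. The test applied is a naive two-sample z-test that treats every student's grade as independent.

```python
import math
import random
from statistics import fmean, variance


def naive_false_positive_rate(instructor_sd, n_students=200, trials=500, seed=0):
    """Fraction of simulated studies in which a naive z-test at the 5% level
    reports a difference, although the two cohorts do not truly differ."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        cohorts = []
        for _ in range(2):
            # One instructor per cohort: a shared offset on every grade.
            offset = rng.gauss(0.0, instructor_sd)
            cohorts.append(
                [offset + rng.gauss(0.0, 1.0) for _ in range(n_students)]
            )
        a, b = cohorts
        # Standard error that (wrongly) assumes independent students.
        se = math.sqrt(variance(a) / n_students + variance(b) / n_students)
        z = (fmean(a) - fmean(b)) / se
        if abs(z) > 1.96:  # two-sided test at the 5% level
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("no instructor effect:  ", naive_false_positive_rate(0.0))
    print("with instructor effect:", naive_false_positive_rate(0.3))
```

Without an instructor effect the naive test rejects in about 5% of the simulated studies, as it should. With even a modest instructor offset (0.3 grade points here, an assumed value) it rejects far more often, because the shared offset does not average out no matter how many students are in the cohort.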

There is another curious thing about the Physics I analysis. As mentioned, the main analysis is only done on the students who take Physics I directly after Calculus I. For these students there is a difference of .564 in mean and .366 in LSMean (table 2; the difference in mean can also be obtained from table 4 after some calculation). This exact same number of .366 is also given in table 6, but there it supposedly refers to all students who took Physics I up to Spring 1997. At least that's what I concluded from the sentence `we consider courses through spring semester of 1997'. Are the numbers in table 6 for the other courses also only for those students who took that course at the earliest possible moment? That's never said. Or is this a copy-and-paste error?

Summarizing, this paper does not show that either reform or traditionally taught calculus is better. The main faults are:

  • The authors test multiple hypotheses without adjusting the significance level for each individual hypothesis. This is bound to lead to false positives.
  • The grade of each student is treated as if it is independent of the grades of the other students. This is not true because of the instructor effect.
  • Something fishy might be going on with the data that the authors decide to include and the data they decide to exclude.