Mathematics Education Research

The University of Illinois at Chicago goes on a fishing trip

2007-06-25T18:01:00.000-07:00

Title: Subsequent-Grades Assessment of Traditional and Reform Calculus
Author(s): Judith Lee Baxter, Dibyen Majumdar, Stephen D. Smith
Publication type: article
Online: link

Through the 1994-1995 academic year the University of Illinois at Chicago taught first-year calculus in the `traditional' manner; in 1994-1995, for example, they used Stewart's book. From 1995-1996 on they switched to `reform' calculus for this first-year calculus sequence, using the book by Hughes-Hallett et al.

In this study the authors examine the results of this change. The grades of the 1994-1995 calculus class are compared to those of the 1995-1996 calculus class. Even though the ACT and mathematics placement scores of both classes are very similar, the reform class got significantly higher grades in calculus. The authors also study grades obtained in other classes by these students (and consider this to be a better measure of success than the calculus grades). The authors do not draw any firm explicit conclusions, but the overall impression this reader got from the article is that they think their study shows that the reform approach works better. This is confirmed by the following. The authors state:

We just wanted to assess the progress of the changeover, to decide whether to continue with the new method.

Since the University of Illinois at Chicago indeed continued with the `new method', we must assume that they considered the progress shown by the assessment to be satisfactory. As I will point out, it is at least debatable whether this is indeed the case.

In the article it is said:

We measured many other science and engineering courses available to us.

It is not mentioned how many, but it seems that this might be as high as a few dozen. The authors mention that for six courses the difference between the two approaches is statistically significant at the 5% level (4 in favor of reform and 2 in favor of traditional). Now you should remember what statistically significant means. Statistically significant at the 5% level means that, if there is no real difference between the groups, a difference this large or larger arises by chance in only 5% of cases. So if you consider a few dozen cases (which this study seems to do), you should expect to find a couple that are statistically significant at the 5% level by chance alone; that follows directly from what `statistically significant' means! Statisticians have a word for `research' like this: it's called a fishing trip. There are statistical methods to deal with this issue, the simplest being the Bonferroni method, which amounts to dividing the 5% by the number of cases that you consider.
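The multiple-comparisons arithmetic can be sketched in a few lines. The article does not say how many courses were tested, so the counts 24 and 36 below are purely illustrative assumptions standing in for `a few dozen'; the sketch also assumes that the tests are independent and that no real differences exist.

```python
ALPHA = 0.05  # per-test significance level used in the article


def expected_false_positives(num_tests: int, alpha: float = ALPHA) -> float:
    """Expected number of 'significant' results when no real effect exists."""
    return num_tests * alpha


def prob_at_least_one(num_tests: int, alpha: float = ALPHA) -> float:
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_level(num_tests: int, alpha: float = ALPHA) -> float:
    """Per-test level that keeps the overall error rate at or below alpha."""
    return alpha / num_tests


# Hypothetical course counts; the article does not report the real number.
for num_courses in (24, 36):
    print(
        num_courses,
        round(expected_false_positives(num_courses), 2),
        round(prob_at_least_one(num_courses), 2),
        round(bonferroni_level(num_courses), 4),
    )
```

Under these assumptions one or two spurious `significant' courses are expected, and with a Bonferroni correction each individual course would have to clear a threshold of roughly 0.1% to 0.2% rather than 5%.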

One of the subsequent courses is studied in depth: Physics I. The authors only study the students who take this course immediately after the first calculus course (more on this later). The authors claim that their analysis shows that the reform students significantly outperformed the traditional students. This is, however, very questionable. Earlier, the authors state the following about the calculus sequence:

Another obviously important variable is the individual instructor. In our 2-year pool, by necessity only about 11 faculty and a similar number of TA’s worked with each sub-pool, so it would be extremely difficult to completely eliminate this factor. As earlier noted under "Background", the department intended a neutral selection of high-quality instructors, sequence-wide. Ideally also the more substantial size of our student pool will provide some compensation for the influence of individual teaching style.

They completely forget about this when they study the results in Physics I. I do not have the data for Spring 1995 and Spring 1996 (the semesters of the study), but in Spring 2007 there were two sections of Physics I, with the same instructor! So it is very plausible that the difference in scores between the traditional and the reform students in Physics I is due to the teaching/assessment differences in this course between Spring 1995 and Spring 1996 and has nothing to do with reform versus traditional in the calculus sequence. As all students know: some teachers are good, some are bad; some teachers give on average high grades, some on average low grades. The teacher factor cannot simply be ignored in the grade-analysis of Physics I.

There is another curious thing about the Physics I analysis. As mentioned the main analysis is only done on the students who take Physics I directly after Calculus I. There is a difference of .564 in mean and .366 in LSMean (table 2 and after some calculation for the mean also by table 4). This exact same number of .366 is also given in table 6 but here supposedly it refers to all students who took Physics I up to Spring 1997. At least that's what I concluded from the sentence `we consider courses through spring semester of 1997'. Are the numbers in table 6 for the other courses also only for those students that took that course at the earliest possible moment? That's never said. Or is this a copy and paste error?

Summarizing, this paper does not show that either reform or traditionally taught calculus is better. The main faults are

The authors test multiple hypotheses without adjusting the significance level for each individual hypothesis. This is bound to lead to false positives.
The grade of each student is treated as if it is independent of the grades of the other students. This is not true because of the instructor effect.
Something fishy might be going on with the data that the authors decide to include and the data they decide to exclude.

Calculus reform: the case of Roger Williams University

2007-06-24T16:03:00.000-07:00

Title: Does calculus reform work?
Author(s): Joel Silverberg
Publication type: article
Online: link

The author describes the results of an experiment at Roger Williams University where some sections of the first course in the calculus sequence were taught using a 'reform' approach and the other sections were taught using what the author calls a 'traditional' approach.

The author mentions the textbook used in the reform section (Dubinsky-Schwingendorf) and describes the teaching in these sections (which it seems was all done by himself), but does not mention the textbook(s) used in the so-called traditional sections or what kind of teaching actually went on in these sections. He does mention that there was a weekly computer lab associated with the so-called traditional sections. This alone would qualify them as reform and not as traditional in many eyes. So we might actually be looking at reform versus reform here.

The test used to assess the experiment was a common final. The following is what the author has to say about this.

The instructors of the traditional sections prepared a final examination to be taken by all calculus sections. The exam was designed to cover the skills emphasized in the traditional sections rather that the types of problems emphasized in the reform sections. In an attempt to minimize any variance in the ways individual instructors graded their examinations, each faculty member involved in teaching calculus graded certain questions for all students in all sections.

The experiment lasted three semesters, each semester starting with a new population of students. The following information on the grades on the final exams is given.

Final Exam Grades/ Traditional Sections
Percentile Range:	0-24	25-49	50-74	75-100
semester 1	C	C	B	A
semester 2	F	C	C+	A-
semester 3	D	C	B	B

Final Exam Grades /Reform Sections
Percentile Range:	0-24	25-49	50-74	75-100
semester 1	F	D	C	B-
semester 2	F	C-	C	B
semester 3	C	B	B	B+

We are not given the number of students or the precise number of sections involved.

The above indicates that the performance of the students in the reform sections was worse in the first two semesters of the experiment and better in the third semester of the experiment. The author credits changes that he made in implementation for this. There is of course another possible explanation: it could be that the students who took the third semester version of the reform section were better. It seems that the students were not randomly assigned to sections, so this is something to take into serious consideration. From the article it can be deduced that Roger Williams University, like most universities, knows the SAT scores of its students and has a mathematics placement examination. This data can be used to correct for initial differences between the students in the reform and the traditional sections. This is however not done. Due to this basically no conclusions can be drawn from this study.

As we have seen, this study has several significant flaws. Summarized:

It isn't clear whether the sections that are labeled traditional are really traditional.
The reform sections seem to all have been taught by the same professor. So it could be that it is just this professor's teaching ability versus that of his colleagues that is measured.
The number of students involved is not indicated.
No effort is made to adjust for initial student differences. Since the assignment of students wasn't random, this is detrimental to the study.

So the question 'does calculus reform work?' can unfortunately not be credibly answered by this study.

The case of Amber Hill and Phoenix Park

2007-06-22T15:13:00.000-07:00

Title: Experiencing school mathematics; traditional and reform approaches to teaching and their impact on student learning
Author(s): Jo Boaler
Publication type: book
Online: no, but see the website of the author for related articles. See especially this article ('open and closed mathematics'). Also see this very short article by Boaler in Education Week.

Jo Boaler spent three years following students of two UK schools that use radically different approaches to mathematics education. The school that she labels 'Phoenix Park' uses discovery learning in small groups (a 'reform' approach) and the school that she labels 'Amber Hill' uses a traditional approach with lectures by the teacher and students working individually on exercises from the textbook. She later replicated this study in California with schools that she labeled 'Greendale', 'Hilltop' and 'Railside'.

The important thing here of course is which approach gives rise to the best results. A nationally normed test (NFER) was administered to the students at the beginning of the study and another nationally normed test (GCSE) was administered at the end. The GCSE is a mandatory national test and is important for university entrance. The distribution of standardized NFER scores (mean 100, standard deviation 15 nationally) for the two schools is (percentages are used for easy comparison between the two schools; at Amber Hill 160 students were tested, at Phoenix Park 109)

	73 to 82	82 to 91	91 to 100	100 to 109	109 to 118	118+
AH	25	25	25	16	8	2
PP	17	35	25	17	6	2

The performance seems to be about the same at both schools. We can do a similar thing for the GCSE scores. The British grading system is a bit odd; it will suffice to know the following about it. The grade A* is the highest, G is the lowest and U,X,Y are different kinds of failing grades. A crucial thing for university entrance is to have a grade in the A*-C range. Again percentages are given; at Amber Hill 182 students were tested and at Phoenix Park 108.

	A*	A	B	C	D	E	F	G	U,X,Y
AH	0	0.5	2.2	10.9	13.7	22.0	20.3	14.3	15.9
PP	1.0	1.9	1.0	8.3	12.0	25.9	25.0	18.5	7.4

Also these scores are very similar. A notable difference is that rather a lot of students at Amber Hill fail, whereas more students at Phoenix Park get the very low grades E,F,G. Boaler sees this as a positive thing about Phoenix Park. A possible explanation (which Boaler does not give) has to do with the fact that the GCSE is actually not one exam, but three exams. There is the higher exam (grades A*-C), the intermediate exam (grades B-E) and the basic exam (grades D-G). So it is for example not possible to obtain a 'D' on the higher exam: the only possibilities are A*,A,B,C or fail. Boaler unfortunately does not indicate which percentage of the students at the two schools took which exam. But it is perfectly conceivable that at Amber Hill many students aimed higher than they could achieve and failed. Note that it is essential for further education to receive at least a C, so that participating in the basic exam is virtually useless. The figures show that nonetheless at Phoenix Park at least 43.5 percent of the students (the Fs and Gs) participated in this exam and by doing this gave up their chance at higher education without even trying.

Boaler picked Phoenix Park first and then picked Amber Hill as a comparison school. She chose Amber Hill based on the fact that it used a content-based mathematical approach and that it had a student body that was almost identical to that of Phoenix Park. These conditions are probably not unique to Amber Hill though, so there is some room for a biased selection here (from her writings it is quite obvious that Boaler prefers the Phoenix Park approach to mathematics education). What we can do is compare Phoenix Park with the nation instead of with the school that Boaler picked. Since the GCSE is a national test we can find the national mathematics scores from 1995 (the year the students in Boaler's study took the GCSE) on the web. The mathematics GCSE results for 1995 are as follows (with the Phoenix Park scores beneath them for easy comparison):

	A*	A	B	C	D	E	F	G	U,X,Y
UK	1.9	6.5	13.4	23.1	17.1	16.1	12.8	6.7	2.4
PP	1.0	1.9	1.0	8.3	12.0	25.9	25.0	18.5	7.4

Of course this is an unfair comparison since the NFER test showed that the students at Phoenix Park were below the national average when they entered Phoenix Park. We can compensate for this by using the NFER scores. I must do this in a rather crude way since I do not have the scores for the individual students, so a proper statistical analysis is out of the question. So what I've done is the following (the results are in the accompanying picture). I basically plot the points in the above table with GCSE scores against each other: (Phoenix Park, UK). To make the picture easier to read I do not plot (for example) the percentage of students with a C against each other, but the percentage of students with a C or lower. This gives the purple crosses. I connect these purple crosses by straight line segments. The fact that the purple line is below the blue diagonal represents that Phoenix Park did worse than the national average on the GCSE. We can do a similar thing for the NFER scores. Since the national scores are not known here I plotted this against a normal distribution (the standardized NFER scores for the country are close to a normal distribution according to NFER). This gives the black circles. Connecting the black circles gives the black line. This black line is also below the blue line, indicating that Phoenix Park scored below the national average on the NFER. The interesting thing is that the purple (GCSE) line is below the black (NFER) line (except at the very top scores). This indicates that, compared to the nation, the students at Phoenix Park did worse on the GCSE than they did on the NFER. So Phoenix Park seems not to have done its students a lot of good. The same is of course true for Amber Hill, which performed very similarly to Phoenix Park. I also took a look on the internet at typical average scores of schools on the GCSE. It seems that Phoenix Park and Amber Hill are just about the schools with the worst GCSE scores in the UK. I cannot help but think that Amber Hill was specifically chosen for this fact.
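The points behind such a plot can be recomputed from the tables in this post. The following is only a sketch of that calculation, not the original plotting script: it assumes the lowest NFER bin contains every student scoring 82 or below, and it uses a normal distribution with mean 100 and standard deviation 15 as the national NFER reference.

```python
from itertools import accumulate
from statistics import NormalDist

# GCSE percentages from the tables above, ordered from lowest to highest grade.
GRADES = ["U,X,Y", "G", "F", "E", "D", "C", "B", "A", "A*"]
PP_GCSE = [7.4, 18.5, 25.0, 25.9, 12.0, 8.3, 1.0, 1.9, 1.0]
UK_GCSE = [2.4, 6.7, 12.8, 16.1, 17.1, 23.1, 13.4, 6.5, 1.9]

# Phoenix Park NFER percentages by upper bin edge (the open 118+ bin is left out).
NFER_UPPER_EDGES = [82, 91, 100, 109, 118]
PP_NFER = [17, 35, 25, 17, 6]


def cumulative(percentages):
    """Running totals: the percentage of students at or below each cutoff."""
    return [round(total, 1) for total in accumulate(percentages)]


# One (grade, Phoenix Park %, UK %) point per grade cutoff.
gcse_points = list(zip(GRADES, cumulative(PP_GCSE), cumulative(UK_GCSE)))

# One (score, Phoenix Park %, national %) point per NFER cutoff.
national = NormalDist(mu=100, sigma=15)
nfer_points = [
    (edge, pp_share, round(100 * national.cdf(edge), 1))
    for edge, pp_share in zip(NFER_UPPER_EDGES, cumulative(PP_NFER))
]

for point in gcse_points + nfer_points:
    print(point)
```

In each tuple the second number is the Phoenix Park percentage at or below the cutoff and the third is the national percentage. A point lies below the diagonal when the third number is smaller than the second. The Phoenix Park percentages are rounded in the source tables, so the running totals can end slightly above 100.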

The following enlightening footnote appears in the article 'open and closed mathematics' by Boaler:

When Phoenix Park first adopted a process-based approach, they were involved in a small-scale pilot of a new GCSE examination that assessed process as well as content. In 1994 the School Curriculum and Assessment Authority (SCAA) withdrew this examination, and the school was forced to enter students for a traditional, content-based examination. The proportion of students attaining grades A-C and A-G dropped from 32% and 97%, respectively, in 1993 to 12% and 84% in 1994. The school has now reintroduced textbook work in an attempt to raise examination performance.

Boaler doesn't say anything about the GCSE scores of Amber Hill at the moment that she decided to include this school in her study, but there is no reason to believe that they were markedly different from the above-mentioned scores for Amber Hill. If that is the case, then Boaler seems to have been stacking the deck in favor of Phoenix Park and its discovery learning approach to mathematics teaching. But this didn't quite pan out the way that she probably wanted because the SCAA withdrew the process-based examination.

In the Education Week article Boaler mentions something rather funny if you know the facts. She says

On the national examination, three times as many students from the heterogeneous groups in the project school as those in the tracked groups in the textbook school attained the highest possible grade.

This proportion is actually even more impressive: it is infinity! The highest possible grade is A* and one student at Phoenix Park got that grade versus no student at Amber Hill. What Boaler seems to refer to here is a grade of at least A. Then she is right that the ratio is 3 to 1. These are also the absolute numbers however: 3 students at Phoenix Park versus 1 at Amber Hill. Talk about statistics with small numbers... She also writes in the Education Week article:

One of the results of these differences was that students at the second school--what I will call the project school, as opposed to the textbook school--attained significantly higher grades on the national exam.

But as we've seen, this is not exactly true. The percentage of students at Phoenix Park who get an A*-C grade is actually slightly lower than at Amber Hill. And this is the grade range that counts: a 'pass' on the GCSE below a C is basically worthless. Boaler also doesn't mention that the grades for the GCSE at both schools are lower than one would expect given the NFER scores. She seems determined to interpret everything in favor of Phoenix Park.
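The small numbers behind these claims can be recovered from the percentages and cohort sizes given earlier in this post. The sketch below does that; the head counts are rounded estimates, since only percentages are reported.

```python
# Cohort sizes and top-grade percentages from the GCSE table earlier in this post.
AH_TESTED, PP_TESTED = 182, 108
AH_PERCENT = {"A*": 0.0, "A": 0.5, "B": 2.2, "C": 10.9}
PP_PERCENT = {"A*": 1.0, "A": 1.9, "B": 1.0, "C": 8.3}


def share(percentages, grades):
    """Combined percentage of students in the given grades."""
    return round(sum(percentages[grade] for grade in grades), 1)


def headcount(percent, tested):
    """Approximate number of students behind a reported percentage."""
    return round(percent / 100 * tested)


a_star_to_c = ["A*", "A", "B", "C"]
print("A*-C share:", share(AH_PERCENT, a_star_to_c), share(PP_PERCENT, a_star_to_c))

at_least_a = ["A*", "A"]
print(
    "students with A or better:",
    headcount(share(AH_PERCENT, at_least_a), AH_TESTED),
    headcount(share(PP_PERCENT, at_least_a), PP_TESTED),
)
```

The A*-C shares come out at 13.6% for Amber Hill and 12.2% for Phoenix Park, and the `three times as many' claim rests on about 3 students versus 1.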

Calculus reform in Minnesota

2007-06-21T15:15:00.000-07:00

Title: Redesigning the calculus sequence at a research university: issues, implementation, and objectives
Author(s): Harvey B. Keynes and Andrea M. Olson
Publication type: journal article
Online: link (full-text for subscribers only)

Authors' abstract. The paper discusses the progress and challenges of a new reformed calculus sequence for science, engineering, and mathematics students developed by the Institute of Technology Centre for Educational Programs and School of Mathematics, University of Minnesota. The main objective of the Initiative is to enable undergraduates to better learn calculus and the critical thinking skills necessary to apply it in a variety of science and engineering problems. Changes in content and pedagogy are emphasized, including instructional teamwork and student-centred learning, involving students working cooperatively in small groups and exploring mathematical ideas using appropriate technologies. Achievement and retention of Initiative students are compared with a control group from the standard calculus sequence. Student attitudes about the usefulness of the Initiative's curriculum, pedagogy, and its influence on learning are discussed. Future implications including new uses of distributed learning are also addressed.

The University of Minnesota performed an experiment on calculus teaching that is, in principle, interesting. The usual way calculus is taught at a research university is as follows: the students have three hours of lectures a week by a professor and one hour of discussion led by a teaching assistant (usually a graduate student). One can of course investigate whether this is the optimal mix or not. In the experiment the University of Minnesota traded one hour of lectures a week for two hours of discussion a week (and renamed the discussion 'workshop'). This is from the student point of view. From the university point of view the situation is somewhat different since there are multiple workshop sessions for one lecture session. The typical situation is depicted in the following tables.

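traditional	#students per session	#hours per week (student)	total #hours per week (staff)
lectures	100	3	3
workshops	25	1	4

experimental	#students per session	#hours per week (student)	total #hours per week (staff)
lectures	100	2	2
workshops	25	3	12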

So from the university point of view 1 hour of lectures is replaced by 8 hours of workshops. Even though teaching assistants are cheaper than professors, this will mean an increase in cost. The authors of the study indeed write

Even after implementing all reasonable economies, there is an incremental cost difference of 20-25% over the standard calculus sequence

This is probably an underestimate since in the actual experiment the professor was also supposed to help during some of the workshops and the teaching assistants were supposed to be present during the lectures. The indicated cost increase does seem to be somewhat realistic if this team-teaching is abolished and only the trading of lectures for workshops is considered. Apart from the increased cost there is another problem that the authors mention: staffing. The experimental condition requires three times as many teaching assistants. One of the ways in which this was addressed in the experiment was by employing high-school teachers and undergraduates as teaching assistants. This of course raises all kinds of issues.
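The staffing arithmetic from the two tables can be written out as a small sketch. It assumes, as the tables do, one lecture section of 100 students and workshops of 25 students; the function name and its parameters are illustrative only.

```python
from math import ceil


def weekly_teaching_hours(class_size, lecture_hours, workshop_hours, workshop_size=25):
    """Staff contact hours per week for one lecture section and its workshops."""
    workshop_sections = ceil(class_size / workshop_size)
    return {
        "lecture": lecture_hours,
        "workshop": workshop_sections * workshop_hours,
    }


traditional = weekly_teaching_hours(100, lecture_hours=3, workshop_hours=1)
experimental = weekly_teaching_hours(100, lecture_hours=2, workshop_hours=3)

lecture_hours_saved = traditional["lecture"] - experimental["lecture"]
workshop_hours_added = experimental["workshop"] - traditional["workshop"]
workshop_ratio = experimental["workshop"] / traditional["workshop"]

print(lecture_hours_saved, workshop_hours_added, workshop_ratio)
```

This reproduces the figures used in the text: one lecture hour saved, eight workshop hours added, and three times as many workshop hours to staff.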

Let's look at the results: does trading an hour of lectures for two hours of workshops actually lead to better results? The authors claim that it does, but I'm not convinced. Comparisons were made between the experimental classes and the traditional classes. Students were not randomly assigned to one of the two conditions. It is justifiable to not do this, but then one should be very careful in making comparisons. The authors look at the average calculus grade point average: this is 3.27 for the experimental condition and 2.85 for the traditional condition. They also looked at how many students took a second year of calculus: this was 77% in the experimental condition and 56% in the traditional condition. The authors also included seven common questions on the final exams for both conditions. In the experimental condition 76% of the responses to these seven common questions were correct, whereas in the traditional condition only 60% were correct. So this is all clearly in favor of the (more expensive) experimental condition.

Now comes something strange. The authors compare the grade point average of the students in all their upper division courses. In the experimental condition 43% had a GPA of 3.5 or higher and only 23% had a GPA less than 3.0. For the traditional condition these percentages are 15% and 58%, respectively. From this the authors conclude that the experimental condition provides students with strong mathematical skills necessary for success in future courses. The more obvious interpretation of this difference is that the students in the experimental condition were just smarter. Remember that students were not randomly assigned!

And it becomes even stranger. Like all mathematics departments at research universities the University of Minnesota has a mathematics placement test that is administered to all incoming students. This can serve as a pre-test to determine whether the two groups are comparable and to statistically adjust scores if they aren't. This is however not done. The only reason I can think of for not doing this is that the placement scores are similar to the upper division GPA scores (which would show that this difference is indeed due to the fact that the students in the experimental condition are smarter) and that if one adjusts for this, then the experimental condition turns out not to outperform the traditional condition. At this point it is good to say that the authors of the article are involved in the experiment and therefore have much to lose if the experiment is deemed a failure.

We can do some ballpark statistics on the above information on GPAs. We assume that students with a GPA of more than 3.5 on average have a GPA of 3.75, those with a GPA in between 3.0 and 3.5 have on average a GPA of 3.25 and those with a GPA of less than 3.0 have on average a GPA of 2.5. Then the average GPA of students in the experimental condition is 3.29 and that of the students in the traditional condition is 2.89. Both of these figures are extremely close to the average calculus GPA for these conditions. So the calculus grades seem to fit in perfectly with the grades in all other courses.
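This ballpark estimate is easy to reproduce. The band averages (3.75, 3.25 and 2.5) are the assumptions stated above, not figures from the article, and the function name is illustrative.

```python
# Assumed average GPA within each band (the ballpark assumptions stated above).
BAND_AVERAGE = {"high": 3.75, "middle": 3.25, "low": 2.5}


def estimated_mean_gpa(percent_high, percent_low):
    """Weighted mean GPA from the shares at or above 3.5 and below 3.0."""
    percent_middle = 100 - percent_high - percent_low
    total = (
        percent_high * BAND_AVERAGE["high"]
        + percent_middle * BAND_AVERAGE["middle"]
        + percent_low * BAND_AVERAGE["low"]
    )
    return round(total / 100, 2)


print(estimated_mean_gpa(percent_high=43, percent_low=23))  # experimental
print(estimated_mean_gpa(percent_high=15, percent_low=58))  # traditional
```

Changing the assumed band averages shifts both estimates in the same direction, so the gap between the two conditions is less sensitive to these assumptions than the individual values are.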

Based on the information on the website of the University of Minnesota something can be said about the aftermath of this experiment. Both the (now no longer) experimental and the traditional condition still exist at the University of Minnesota. The traditional sequence seems to attract twice as many students.

Collaborative discovery learning

2007-06-20T09:56:00.000-07:00

Title: Teacher interventions aimed at mathematical level raising during collaborative learning
Author(s): Rijkje Dekker and Marianne Elshout-Mohr
Publication type: journal article
Online: link (full-text for subscribers only)

Authors' abstract. This article addresses the issue of helping students who work collaboratively on mathematical problems with the aim of raising the level of their mathematical understanding and competence. We investigated two kinds of teacher interventions aimed at helping students. The rationale of these interventions was based on a process model for interaction and mathematical level raising. One kind of interventions focused on the interaction between the students, the other – on the mathematical content of the tasks. The effects of the two kinds of interventions were investigated using a pre-test – post-test comparison of students’ learning outcomes and analyzing the transcripts of students’ verbal utterances and worksheets. Our analyses point to interventions focused on students’ interactions as more effective in terms of students’ learning outcomes. Theoretical and practical implications of the research are discussed.

I've discussed a similar study (by Pijls) before. That study was partially a follow-up of the study currently under review. Here, too, students are supposed to do discovery learning in small groups (triples in this case), although no computers are used in this study. Again there are two experimental conditions: a process-help condition and a product-help condition. In the process-help condition the teacher was not supposed to talk about mathematics in any way. In the product-help condition he was allowed to give hints. This is what the authors have to say about the product-help condition.

Being used to collaborative learning as instructional arrangement, the teacher habitually limited himself to hints, avoided direct instructions or lengthy explanations, and gave help only when this was manifestly needed.

So it is important to note that this is in no way a comparison between discovery learning and instruction; it's a comparison between two kinds of discovery learning in small groups. I got curious about this study because the study by Pijls that I reviewed before claimed that the study currently under review showed that process-help was better than product-help (a result that Pijls was not able to replicate). The authors of the current study indeed claim that they show this, but they don't. To see why they don't we look at the statistics. The authors state:

A pre- and a post-test were constructed to measure the results of students’ learning. The tests consisted of different items, but were parallel in relevant aspects.[...] Maximum total scores were 25 for both pre- and post-test.

In between the pre- and the post-test the students followed 2 sessions of 65 minutes. The authors go on to say:

The hypothesis about the post-test scores was that these would be higher in the process-help condition than in the product-help condition. This hypothesis was confirmed (p < .05).

Using the diagrams that the authors provide it is possible to deduce the pre- and post-test scores of all students in both conditions, so we can redo the statistics exactly. Did you carefully read the first sentence of the preceding quotation? That's where the answer lies: the authors did a one-sided statistical test. Had they done the usual two-sided statistical test, the result would not have been statistically significant at the 5% level (p=0.08). It seems they just applied the test that gave them a statistically significant result.
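The relation between the two p-values is simple whenever the test statistic has a symmetric null distribution, as in a t-test. That symmetry is an assumption here, since this review does not name the exact test; the function below is an illustrative helper, not the authors' procedure.

```python
ALPHA = 0.05


def one_sided_from_two_sided(p_two_sided, effect_in_predicted_direction=True):
    """Convert a two-sided p-value, assuming a symmetric null distribution."""
    half = p_two_sided / 2
    return half if effect_in_predicted_direction else 1 - half


p_two_sided = 0.08  # the two-sided value recomputed in this review
p_one_sided = one_sided_from_two_sided(p_two_sided)

print(p_one_sided, p_one_sided < ALPHA, p_two_sided < ALPHA)
```

Halving the p-value is only legitimate when the direction of the effect was fixed before looking at the data. Otherwise the one-sided test simply doubles the chance of a false positive.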

More interesting information can be deduced from these diagrams. Using the usual statistical test, it cannot be shown that the students in the product-help group 'on average' improved (the t-test for the gain from pre-test to post-test has p=0.15). Actually 6 of the 15 students in the product-help group did not improve their score. In the process-help group this is the case for 4 of the 20 students. Now of course 2 sessions of 65 minutes are not a lot of time to learn something new. It keeps amazing me how education researchers seem to think that you can learn significant material in such a short amount of time....

Collaborative discovery learning with the computer

2007-06-18T13:28:00.000-07:00

Title: Collaborative mathematical investigations with the computer; learning materials and teacher help.
Author(s): M.H.J. Pijls
Publication type: PhD thesis
Online: link

The PhD thesis that I have in front of me is on collaborative discovery learning using the computer.

Of course it is interesting to know whether education with computers gives better results than education without them. But that's not what this research is about: in all conditions the computer is used. In the first experiment described in this thesis it is investigated at which moment the computer can best be used: only in the beginning of the lesson cycle, during the whole lesson cycle or only at the end of the lesson cycle. The answer: for the test scores it doesn't matter. That one could also not use a computer at all doesn't seem to be a thought that crossed the researcher's mind. In the second described experiment the computer was used during the whole lesson cycle.

Of course it's also interesting to investigate to what extent collaborative learning is effective. But this is also not done in this thesis. The researcher wonders whether it would have been better to divide the class into triples instead of into pairs, but that one could also divide the class into singles doesn't seem to have crossed her mind.

In the second experiment the effectiveness of 'process-help' versus 'product-help' is investigated. At first glance you might think that this means discovery versus instruction, but the following quote clearly indicates that it's not:

The learning materials contained no 'theory blocks' (i.e. sections in which a mathematical concept was shown and explained) and no 'correction sheets' with the correct answers to assess their work afterwards (the students were used to correct their work with the help of correction sheets after finishing the tasks). No classroom discussions took place in either project.

So both conditions are discovery learning. The only difference is that in the product-help condition the teacher was allowed to give the students (as pairs) mathematical hints and the process-help teacher was not allowed to talk about mathematics at all. Of course the product-help teacher encountered the problem with small-group learning: there is simply not enough time to give all the small groups the attention that they need. The teacher and the students therefore wanted to have 'whole class moments', but this was not allowed. On the post-test there was no difference between the process-help group and the product-help group. The researcher gave this the following spin (original in Dutch):

More explanation not always better.

Now let's take a closer look at the results. Maybe you know the game in the picture above. The students played games like this on the computer during the experiments. On one of the tests (it isn't mentioned whether it was the pre-test or the post-test) there was a question about this particular game (but with only 5 boxes at the bottom instead of the 10 that are shown in the picture above). The question was 'what is the probability that the ball will fall into the middle box?' There is a simple trick that helps to answer this (Pascal's triangle) and this was exactly the kind of question that was practiced during the 10-lesson experiment. There were 13 questions like this on the post-test for a total of 46 points. The maximum number of points for this particular question was 4. For the (absolutely wrong) answer '1/5' a student got 1 point. For the correct but inaccurate answer 'more than 1/5' the student got 2 points. The average score of the students on the post-test was 14.29 points; that's an average of slightly more than 1 point per question. Since one apparently gets 1 point for a completely wrong answer to a question this is a very saddening score. In 10 lessons the students seem to have learned almost nothing. The researcher remarks the following about the improvement from pre-test to post-test:

The open-ended questions in this post-test were very much comparable to the pretest. The number of questions and the division of points were the same and the questions often had very similar contexts. This time, however, we expected the students to be able to make the majority of the tasks. [...] The difference between pre- and post-test in both conditions showed that on average all students' learning results improved (t-test for difference between pre- and post-test: t = 6.367, df = 51, p <.001).

This is 'on average' indeed true. From the information provided we can deduce that the average increase in score was 4.98 points with a standard deviation of 5.64 points. We can do some 'ballpark statistics' with this limited information. Assuming that the increase in scores is normally distributed, a gain of zero lies about 0.88 standard deviations below the mean, so roughly 19% of the students have a negative increase. So our 'ballpark statistics' tells us that about 10 of the 52 students knew less about this topic after the ten lessons on this topic than they knew before! However, this of course doesn't lead the researcher to question her computer-assisted collaborative discovery learning method...
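Several numbers in this review can be re-derived with a short script. It is a sketch under stated assumptions: the 5-box game is modelled as 4 rows of pegs with a 50/50 bounce at each, the standard deviation is assumed to have been recovered from the reported t-value, and the gains are assumed to be normally distributed.

```python
from math import comb, sqrt
from statistics import NormalDist

# Middle box of a 5-box board: assumes 4 rows of pegs, each bounce 50/50.
rows = 4
p_middle = comb(rows, rows // 2) / 2 ** rows

# Average post-test points per question (14.29 points over 13 questions).
points_per_question = 14.29 / 13

# Reported paired t-test: t = 6.367, df = 51 (so 52 students), mean gain 4.98.
n_students = 52
mean_gain = 4.98
t_value = 6.367
sd_gain = mean_gain * sqrt(n_students) / t_value

# Share of students with a negative gain under the normality assumption.
share_negative = NormalDist(mu=mean_gain, sigma=sd_gain).cdf(0)

print(p_middle, round(points_per_question, 2))
print(round(sd_gain, 2), round(share_negative, 3), round(share_negative * n_students, 1))
```

The middle-box probability is 6/16, comfortably more than 1/5. The exact normal tail comes out a little under one fifth, somewhat larger than the 16% that the one-standard-deviation rule of thumb would give.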