Educational Performance over Time: Changes in Mathematical Attainment between 1976 and 2009 in England

Over the past fifty years


Introduction
Educational performance over time is a perennial concern of politicians, policy-makers and other commentators across the world. Some have claimed that performance in the US, the UK and elsewhere in the developed world has stagnated or even fallen since the mid-1970s (e.g., Goldin & Katz, 2008; Hanushek et al., 2010) and have pointed to the comparative success of the Pacific Rim and others in international surveys (e.g., Oates, 2011). These claims are hotly contested (e.g., Kilpatrick, 2011; Openshaw & Walshaw, 2010). So, for example, increases in school leaving or graduation qualifications attained are cited to support both the case that schooling is becoming ever more successful and also the opposing case that schooling is becoming less successful because the 'standards' of such qualifications are 'falling'. Yet, the evidence on which these claims about changes in educational performance are based is weak, because only limited data comparing educational performance extend back before the mid-1990s. Comparisons based on the international surveys, TIMSS and PISA, are valid for only a relatively short period back to the 'anchor' years of 1995 and 2003 respectively (PISA mathematics scores are comparable to 2003, although reading and science scores are comparable to 2000 and 2006, respectively). In the absence of reliable data, Goldin and Katz (2008), for example, base their analysis largely on years of schooling, whilst others analyse IQ test data over time (Nuthall, 1985) or compare attainment on different tests (Afrassa & Keeves, 1999; Rashid & Brooks, 2010). The Long-Term Trend National Assessment of Educational Progress (NAEP-LTT) in the US, which dates back to the 1970s, is a notable exception (Kloosterman, 2010), although even these data are problematic, because the test focuses on a rather narrow range of topics (consisting in mathematics of largely computationally based items).
Downloaded from Brill.com 09/03/2024 02:59:38PM via Open Access. This is an open access article distributed under the terms of the CC BY 4.0 license. https://creativecommons.org/licenses/by/4.0/
This debate is important, because it bears on the success or otherwise of educational reform historically, and thus can inform future policy. We contend that England provides a critical case that is of interest internationally. Since the 1970s, England has, in common with many other countries around the world, implemented a range of educational reforms aimed at raising educational (including mathematical) attainment for all students (Cockcroft, 1982; Hodgen et al., 2022). Indeed, many systems have looked to England as an inspiration or as a model for educational change. One measure of whether this extensive reform and investment has been successful is the extent to which it has improved educational attainment. The focus of this paper is on changes over time in attainment in mathematics in the context of major investment in reform in this subject.
In this paper, we report on a replication of several surveys of student attainment first carried out in England in the 1970s. Specifically, we report the findings of a study of lower secondary students' mathematical attainment in England in which the results of a national sample of students tested in 2008 and 2009 are compared to the results of a similar national sample of students tested in 1976 and 1977 using the same research-based tests of conceptual understanding in mathematics (Hart et al., 1981, 1985).
As such, this study is both a constructive replication, investigating the phenomena at a different time, and a conceptual replication, utilising different statistical methods (Melhuish, 2018). The replication therefore enables us to examine whether English students' mathematical understandings have changed over the period and, thus, to consider whether the various reforms of mathematics education in England have been successful in raising attainment. The replication also enables us to investigate the original findings on students' misunderstandings and misconceptions, although these findings are reported elsewhere (e.g., Faerch & Hodgen, 2023; Hodgen et al., 2012). In conducting this replication, we re-validated the tests, using modern statistical methods that utilise the computing power now available. Specifically, we used Rasch modelling, which is now widely used in assessment (Bond & Fox, 2007; Cascella et al., 2023; Gilberti & Maffia, 2022) but was not well developed at the time of the original study. We also have the benefit of a research team consisting of both researchers who were not involved in the original study (the first two authors, Hodgen and Coe) and two who were (the third and fourth authors, Brown and Küchemann), thus enabling us to combine the strengths of internal and external replications (Aguilar, 2020).
[…] raise attainment together with the level of resources committed and "pressure and support" mechanisms (Fullan, 2000, p. 15), one might expect that the combined effect of these reforms would be to produce at least a modest increase in educational performance in mathematics.

Mathematical Performance over Time
During this 30-year period of educational reform, results in national examinations of mathematics in England have shown steady and substantial rises.
For example, the proportion of 11-year-olds achieving the targeted level in national tests rose from 54% in 1996 to 79% in 2009. Similarly, the proportion of 16-year-olds achieving grade C or above at GCE O-level/GCSE rose from about 23% in the mid-1980s to 45% in 1992 and 55% in 2007.² Some commentators (e.g., Barber & Mourshed, 2007) have claimed this represents considerable success for educational reform in England.
It is notoriously difficult to compare performance over time (e.g., Goldstein & Heath, 2000). One of the particular problems with using national tests and GCSE examinations to compare performance is that all papers are released, so new tests have to be used each year. This makes it difficult to maintain standards over time, and there is some evidence that the standard represented by the award of the same grade or level in these examinations for successive years has decreased (Coe & Tymms, 2008; Jones et al., 2016; Tymms, 2004). For example, Coe (2008) shows that between 1996 and 2007 performance in mathematics GCSE for students with equivalent ability scores rose by 0.9 of a grade, which was a larger increase than for any other mainstream subject. Prior to this period, use of a grade criteria system by the Graded Assessment in Mathematics (GAIM) project suggested that there was an average rise of approximately one grade in 1988 (the first year of the new GCSE) in comparison with the previous year's results on the earlier system, so that students who would previously have been awarded a grade D would now receive a grade C (Brown, 1989, 1996). Taken together, these results suggest a shift of up to 2 grades in the standard required at the GCSE grade C boundary over 30 years.
The international surveys, the International Association for the Evaluation of Educational Achievement's TIMSS (Trends in International Mathematics and Science Study) […] (Mullis et al., 2012),³ whereas in PISA, England's performance at age 15 showed a small decline between 2003 and 2012 (Wheater et al., 2013).⁴ However, independent analyses of the surveys suggest that, after correcting for sample bias and changes to the date of test administration, the performance on PISA has probably been broadly stable over time (Brown et al., 2007; Jerrim, 2013). Before 1995, some notion of comparison can only be gained in relation to rank order against other countries. However, there is no clear evidence of significant improvement from the earlier tests run by the IEA in 1964 (FIMS) and 1982 (SIMS), in each of which England scored close to the international mean. Indeed, the OECD's 2012 survey of adult skills indicates that mathematical skills are lower for younger cohorts, suggesting a gradual decline over time (OECD, 2013).
In a review of the research evidence on educational attainment over time, Rashid and Brooks (2010) find limited comparative evidence of mathematical attainment over time, and the evidence that they report is over a relatively short timescale. For example, they report that the Assessment of Performance Unit (APU) National Monitoring Survey conducted between 1978 and 1987 indicated a small overall improvement in mathematics attainment between 1978 and 1982, but no overall change thereafter, although there were decreases for the topics of algebra and number.
Internationally, aside from TIMSS and PISA, one of the few longstanding rigorous national systems for monitoring mathematical performance is the NAEP-LTT. Although based in the US, this provides comparable data on the performance of students aged 9, 13 and 17 over the period 1978-2004, and shows statistically significant gains of 22, 17 and 7 points in the mean score, respectively (Kloosterman, 2010). We note, however, that the NAEP-LTT tested a narrow range of procedural skills in contrast to the Main NAEP or the CSMS tests considered in this paper.⁵
In summary, the evidence on educational performance over time is limited, both in England and internationally, particularly prior to 1995. Since then, the evidence in England suggests a slight rise in performance at secondary level, although these rises may be at least partly due to factors other than increases in genuine mathematical attainment or competence.

4 The Design and Content of the Tests
Increasing Competence and Confidence in Algebra and Multiplicative Structures (ICCAMS) was a 4½ year project in England and included a survey of 11-14-year-olds' understandings of algebra and multiplicative reasoning, and their attitudes to mathematics. This survey involved both cross-sectional and longitudinal samples. The ICCAMS study used three tests (Algebra, Decimals and Ratio) which were first developed and administered to nationally representative samples of English students in the 1970s as part of the Social-Science-Research-Council-funded Concepts in Secondary Mathematics and Science (CSMS) study.

4.1 The Design of the CSMS Tests
The development of the tests was in several stages (Hart & Johnson, 1983). First, an initial set of items was developed on the basis of a review of the research literature and an analysis of the then curriculum. Since there was not then a national curriculum in England, this analysis was conducted using the most commonly used textbooks and contemporary national attainment surveys; e.g., Tests of Attainment of Mathematics in Schools (TAMS) (Sumner, 1975). Second, items were trialed in 20-30 interviews with students across ages 11-14 and the mainstream ability range, and finally pilot versions of the tests were administered in one or two schools. During this whole process, some items were dropped, some added and others amended. The majority of items on the tests are short open-response tasks intended to capture students' approaches without disadvantaging students who have reading difficulties.

4.2 The Operationalisation of "Understanding" in the CSMS Tests
The focus of the CSMS tests is on what is often described as "conceptual" as opposed to "procedural" understanding (Hiebert, 1986); CSMS aimed to use "problems which were recognisably connected to the mathematics curriculum but which would require the child to use methods which were not obviously 'rules'" (Hart & Johnson, 1983, p. 2). "Excessive computation" was generally avoided in favour of non-routine "word problems where the numbers were simple" (p. 22). The CSMS tests deliberately did not attempt to cover all aspects of the secondary mathematics curriculum, but rather intended to "reveal the different strategies that children used (rather than just to determine whether an item was answered correctly, or was easy or difficult)" (Küchemann, 1981, p. 30).

4.2.1 Algebra
The Algebra test focuses on generalised arithmetic rather than algebraic structure (Küchemann, 1981). The test extended Collis's (1975) analysis of the different ways in which pronumerals can be interpreted, and items were devised to bring out the following six categories (Küchemann, 1981): Letter evaluated, Letter not used, Letter as object, Letter as specific unknown, Letter as generalised number, and Letter as variable. As an example, consider item 5c: "If e + f = 8, e + f + g = …." Here the letter g has to be treated as at least a specific unknown which is operated upon: the item was designed to test whether students would readily 'accept the lack of closure' (Collis, 1975) of the expression 8 + g, rather than give the closed response 8g, or a numerical response such as 9, 12 or 15.

4.2.2 Decimals
This test⁶ focuses on decimals as an aspect of rational number, including place value, and its full title is "Place value and decimals". Items were designed to assess meaning and structure "in the sense of understanding how [the place value system] works and how to apply it in appropriate situations both mathematical and drawn from real-life" (Brown, 1981, p. 214). Computational aspects were included, but the focus was on meaning rather than on methods of calculation. Meaning was defined as "how children conceptualise number operations, visually, verbally and symbolically, and how applications of them are recognised" (Brown, 1981, p. 75). These foci are illustrated by item 19d, "The cost of 6.22 litres of petrol was £4.86. What would the price of one litre be?", which requires students to identify the correct calculation rather than to calculate the answer. Students need to reason that the context requires division (4.86 ÷ 6.22), moreover a division where the divisor is larger than the dividend, and that this makes sense. The Decimals test was developed at a time when research on rational number was at a relatively early stage of development and the foci of meaning and structure reflected the state of research at that time. Research in the United States had concurrently identified seven inter-related sub-constructs of rational number (Kieren, 1976), a framework that was subsequently refined into a smaller number (four or five) of inter-related sub-constructs: quotient, ratio, operator and measure (and/or part-whole relations) (Behr et al., 1992; Kieren, 1988; see also, Lamon, 2007). Although developed independently, test items were matched at the time to cover all the sub-constructs in Kieren's (1976) early version of this later framework (Brown, 1981, p. 66).

4.2.3 Ratio
The Ratio Test focuses on the application of ratio-based thinking in increasingly complex situations, with a particular focus on whether students used multiplicative rather than additive approaches (Hart, 1980). The test, titled "Test R", deliberately does not mention ratio, so that students are not cued to use taught techniques, and instead have to work out the relationships inherent in the situation. Complexity was operationalised in terms of the multipliers involved, the steps required and the context (Hart & Johnson, 1983, p. 23). The test development was influenced by Piaget's ideas and two items, involving eels and similar rectangles respectively, were drawn from questions used in his research (Piaget et al., 1960). Another item was a version of the well-known 'Mr Short and Mr Tall' task developed by Karplus and colleagues (Karplus & Petersen, 1970) that requires a conversion between two non-standard units of measurement. The correct response demands multiplicative reasoning (or at least rated addition, Carraher, 1986)⁷ rather than a direct additive approach. In order to avoid "excessive computation", emphasis was placed on relatively simple ratios like 3:2 and 5:2. These were judged to provide substantive evidence of proportional thinking whereas, importantly, simple integer ratios did not (Hart & Johnson, 1983, p. 19; Karplus et al., 1972). A few items involved more numerically 'complex' ratios such as 5:3 and the technical report suggests that the team considered such a ratio to be sufficiently complex to indicate near-complete understanding of ratio as a multiplicative relationship (Hart & Johnson, 1983, p. 31).
Implementation and Replication Studies in Mathematics Education 4 (2024) 83-124

4.2.4 Fractions
Some additional items focusing on fractions were included at the end of the ICCAMS tests (14 in the Ratio test and one in the Algebra test). This enabled a broader assessment of multiplicative reasoning, and made the Ratio test a similar length to the other tests. These items were drawn from the two CSMS Fractions tests, which took a similar approach to that described above for place value and decimals, with a focus on the 'measurement' and the 'multiplicative' areas of quotient and operator.

4.2.5 The CSMS Hierarchies of Levels
In the original CSMS data analysis in the 1970s, items were selected empirically from each test to form a series of hierarchical levels of difficulty. For each level, groups of items were identified based on the strength of phi correlations between the items within the context of a Guttman scaling model. (The method is described in detail by Hart and Johnson, 1983.) These hierarchies used 30 out of the 51 items in the Algebra test, 39 out of the 72 items in the Decimals test, and 20 out of the 27 items in the Ratio test. There were different numbers of levels associated with each test, since the levels were derived empirically; Algebra and Ratio both had 4 levels and Decimals had 6. Students were judged to have been successful at a specific level if they had successfully answered two-thirds or more of the items at that level. Students who had not achieved two-thirds of Level 1 items were said to be 'at Level 0'. It was possible to broadly describe the type of mathematical understanding required for the items in each level in each topic, although these were not always neat descriptions, since the items and levels were assigned mostly on empirical rather than theoretical grounds. These hierarchies provide a basis for comparing the understandings of students across the attainment range in 1976-1977 and 2008-2009. The additional 15 items on fractions were drawn from items that appeared in the hierarchy of the CSMS Fractions test, which also had 4 levels. However, in the case of Fractions, only single item comparisons with the 1976 data are possible, since the full set of Fractions items was not used.
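The two-thirds criterion described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the original CSMS procedure: the function names and the example marks are hypothetical, and the rule that a student's reported level is the highest level reached without failing any lower level is an assumption in the spirit of Guttman scaling, not something the text states explicitly.

```python
def passes_level(item_scores):
    """Return True if at least two-thirds of the items at one level are correct.

    item_scores is a list of 0/1 marks for the items assigned to that level.
    Integer arithmetic avoids floating-point error at the 2/3 boundary.
    """
    return 3 * sum(item_scores) >= 2 * len(item_scores)


def assign_level(scores_by_level):
    """Assign a hierarchy level from per-level item marks (Level 1 first).

    Assumption: levels are cumulative, so counting stops at the first level
    the student fails. A student who fails Level 1 is 'at Level 0'.
    """
    level = 0
    for item_scores in scores_by_level:
        if not passes_level(item_scores):
            break
        level += 1
    return level


# Hypothetical student: 5 of 6 items at Level 1, 4 of 6 at Level 2, 1 of 5 at Level 3.
student = [[1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 1, 0], [0, 1, 0, 0, 0]]
print(assign_level(student))  # 2
```

Under the cumulative assumption, a student who passed Level 3 but failed Level 2 would still be reported at Level 1; a different convention would need a different `assign_level`.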

4.3 The Relationship to the English School Mathematics Curriculum of the 1970s
The original CSMS study was conducted prior to the introduction of a statutory National Curriculum in England. However, as described above, considerable efforts were made to ensure that the test items related to the curriculum at the time and that the items would engage students in the 1970s. The team's subsequent analysis (Hart & Johnson, 1983, pp. 30-33) shows that, within the overall focus of each test on particular aspects of understanding, the test content could be shown to be covered in the curriculum, if not always explicitly, and that a relatively large proportion of the curriculum was covered by the whole set of tests.

4.4 The Relevance of the CSMS Tests to the 2008-2009 Mathematics Curriculum in England
In the 30 years between the CSMS study and this replication, there were considerable changes to the school mathematics curriculum in England. The first National Curriculum introduced in 1988 was influenced by the results of the CSMS study, particularly in relation to progression within and between topics and the recognition of the frequent use of informal methods. As part of its introduction, several aspects of topics, and most notably formal symbolic algebra, were introduced in Grade 6, which was earlier than before. (The introduction and revision of the National Curriculum are described in Brown, 1996.) Around a decade later, alongside the second revision of the National Curriculum, two National Strategies were introduced (the primary National Numeracy Strategy in 1999, and the secondary Key Stage 3 Strategy in 2001), which, amongst other things, introduced a very much more detailed year-by-year teaching framework as a programme for teaching mathematics (DfEE, 2001). A further amended national curriculum was introduced for secondary schools in 2007, but with little change in relation to content. Our analysis of the two National Strategies documents indicates that the CSMS tests are still relevant to the curriculum. Indeed, compared to 1976-1977, the CSMS tests appear in some ways to be slightly closer to the curriculum in place during 2008-2009, since the framework references some specific items or contexts utilised in the CSMS tests. A more significant change is that in 2008-2009 greater emphasis is placed on measurement and computation with decimals, rather than with fractions, due to the increasing prevalence of calculators, computers and the metric system.

4.5 Survey Methods
The ICCAMS and the CSMS projects each administered the tests to nationally representative samples of English lower secondary students, across 2008 and 2009, and in 1976 and 1977, respectively.⁸ At both times, the Algebra and Ratio tests were administered to students in Grades 7 and 8, and the Decimals test to students in Grades 6, 7 and 8. The focus will therefore be on those age groups, and, more specifically, on Grade 8.⁹ The sampling for each project is described below. It is clear that in order to achieve a reliable comparison over time, the two samples would have to be selected in similar ways.

4.6 The 1976 and 1977 CSMS Samples
Fifty-four schools from across England were involved in the whole testing programme. The schools were selected from among those where a teacher had volunteered the participation of their school, responding to requests from the research team, either during professional development events or following an article in the professional press. The sample was not formally stratified, although there was a deliberate balance between rural and urban schools, schools with different ranges of social class and ability in their intakes, and state-funded and independent sector (private) schools. For each test in each grade, a sub-sample of at least 6 schools was selected. Care was taken to ensure that the score distribution across the sample on a (then) widely used test, the Calvert Non-Verbal IQ Test (Calvert, 1958), matched the national norms. The Calvert test was administered only to students in Grade 7, the age group for which it was designed; at the time, there was no equivalent test available for other age groups, and it was checked in each school that there was no obvious reason why the IQ score distribution of other grades would differ significantly from that of the Grade 7 students.
In 1976, the Algebra and Ratio tests were administered to entire year groups at Grades 7 and 8 within a school. In 1977, the arrangements were similar, except that the Decimals and Fractions tests were administered to a randomly selected proportion of the students in each class, since most schools were organised into classes for mathematics by attainment level. Again, the achieved samples were checked for their matches to national IQ norms. No full record exists of the details of the matching, except in the case of the Decimals test in 1977, where the means for each of four age groups were all in the 99.0-101.5 range, with the standard deviations all in the 14.0-15.0 range (Calvert norms were 100 and 15, respectively); the shapes of the four distributions were found not to be significantly different (p > 0.2) from the standard normal curve, using the Kolmogorov-Smirnov goodness-of-fit test (Brown, 1981, pp. 237-230). It is believed that these values were typical of those for all the tests, especially since the Decimals test had the smallest sample sizes. Because of the different administrations, the sample sizes of students for the 1977 tests were smaller than for the 1976 tests, although the number of schools was similar in each case. The samples are shown in Table 1.
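A goodness-of-fit check of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the original 1977 analysis: the scores are hypothetical, the reference distribution is assumed to be normal with the Calvert norms (mean 100, standard deviation 15), and the p-value uses the asymptotic Kolmogorov series, which is only approximate for small samples.

```python
import math


def normal_cdf(x, mean=100.0, sd=15.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(scores, mean=100.0, sd=15.0):
    """One-sample Kolmogorov-Smirnov statistic D against Normal(mean, sd)."""
    ordered = sorted(scores)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = normal_cdf(x, mean, sd)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_p_value(d, n, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    lam = math.sqrt(n) * d
    if lam < 0.3:
        return 1.0  # the series converges slowly here and p is effectively 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical IQ scores for a small group of students.
scores = [78, 85, 91, 96, 99, 101, 104, 109, 116, 124]
d = ks_statistic(scores)
print(round(d, 3), round(ks_p_value(d, len(scores)), 3))
```

A large p-value, such as the p > 0.2 reported in the text, means that the test gives no grounds for rejecting the hypothesis that the scores follow the reference distribution.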

4.7 The 2008-2009 ICCAMS Sample
The aim was to draw the sample in a similar way, but since it was no longer possible to use the Calvert test as a control for the representativeness of the sample, it was decided instead to employ the MidYIS (Middle Years Information System), a value-added reporting system, which is widely used by schools across England (Tymms & Coe, 2003). The control test used to measure representativeness is a measure of developed ability, and consists of verbal (receptive vocabulary), numerical (everyday mathematics) and spatial (3D visualisation) problems.
The MidYIS system held a database of schools, so that it was possible to draw a random sample of schools. The intention had been to test a sample (20 schools) in the summer of 2008; however, owing to delays in the approval of funding for the project and higher than expected refusal from schools under time pressure, only 10 schools actually completed the tests (and one school did not test any Grade 8 students). A further round of testing was therefore conducted in the summer of 2009 to make up the sample.

4.7.1 The 2008 Sample
In order to obtain the right proportion of students from each sector, the sample was made up of two independent (private) schools and 18 government-funded schools.
The group of government-funded schools in the MidYIS database turned out to be a very close match to the group of all government-funded schools in England,¹⁰ so a simple random sample of 18 was drawn. The characteristics of this sample were well matched to the population. For each sampled school, a reserve school was selected with matching characteristics in order to maintain the balance of the sample characteristics in the event of any school non-response. The use of a stratified sample was considered, using a range of variables to define strata, but the trial samples did not appear to produce a better fit to the population than a simple random sample.
The group of independent schools in the MidYIS database had slightly higher scores than the average for other independent schools (there were 264 independent schools with three years of MidYIS paper test data in the database). The sample was therefore limited to schools whose average MidYIS score over the three years was within one standard deviation of the mean for all schools, in order to ensure that the overall sample of 20 schools would be close to being nationally representative. Two schools were chosen at random from this subgroup. Given the much smaller numbers, reserve schools were simply chosen at random from the remaining eligible schools.
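The restriction described above amounts to filtering the candidate schools and then sampling at random from those that remain. The sketch below illustrates this under stated assumptions: the school identifiers and scores are hypothetical, and the reference mean and standard deviation are passed in by the caller, because the text does not give their values.

```python
import random


def choose_schools(school_means, centre, sd, n_schools=2, seed=None):
    """Pick schools at random from those within one SD of a reference mean.

    school_means maps a (hypothetical) school identifier to its average score.
    Returns (chosen, remaining): the sampled schools and the other eligible
    schools, from which reserves could later be drawn at random.
    """
    rng = random.Random(seed)
    eligible = sorted(s for s, m in school_means.items() if abs(m - centre) <= sd)
    chosen = rng.sample(eligible, n_schools)
    remaining = [s for s in eligible if s not in chosen]
    return chosen, remaining


# Hypothetical three-year average scores for six independent schools.
independent = {"A": 96.0, "B": 104.5, "C": 118.0, "D": 111.0, "E": 123.5, "F": 101.0}
chosen, reserves = choose_schools(independent, centre=100.0, sd=15.0, seed=1)
print(chosen, reserves)
```

With these illustrative figures, schools C and E fall outside the eligible band and can never be selected.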

4.7.2 Selecting the 2009 Sample to Balance the 2008 Sample
A weighted random sample of schools was selected. Such a small sample has a lot of variability in sample mean. Therefore sampling was repeated until the mean MidYIS score was within 0.5 points of the desired value (the population standard deviation of MidYIS is 15 points). Reserve replacement schools were identified in the same way as for the 2008 sample.
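Repeating the draw until the sample mean is close enough to a target can be sketched as a simple rejection loop. The sketch is illustrative only: the school identifiers, scores and target are hypothetical, and it draws schools with equal probability, whereas the text refers to a weighted random sample whose weights are not specified.

```python
import random
from statistics import mean


def draw_until_balanced(school_means, n_schools, target, tolerance=0.5,
                        seed=None, max_draws=100_000):
    """Redraw a random sample of schools until its mean score is near target.

    Assumption: equal-probability draws without replacement; the weights used
    in the original study are not given, so none are applied here.
    """
    rng = random.Random(seed)
    schools = sorted(school_means)
    for _ in range(max_draws):
        sample = rng.sample(schools, n_schools)
        sample_mean = mean([school_means[s] for s in sample])
        if abs(sample_mean - target) <= tolerance:
            return sample
    raise RuntimeError("no sample met the tolerance; widen it or raise max_draws")


# Hypothetical population of 50 schools with mean scores from 85.0 to 114.4.
school_means = {f"school_{i:02d}": 85 + 0.6 * i for i in range(50)}
sample = draw_until_balanced(school_means, n_schools=10, target=100.0, seed=1)
print(round(mean([school_means[s] for s in sample]), 2))
```

Conditioning on the sample mean in this way reduces sampling variability in that one statistic, but it does not by itself balance other characteristics of the schools.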

4.7.3 The Achieved 2008-2009 Sample
Although 20 schools agreed to complete the surveys, only 19 actually managed to do so. Altogether, we approached 86 schools, which represents a school-level response rate of 22%, a value that is lower than might have been hoped for, but is within a typical range for studies of this kind (Coe & Hodgen, 2012, 2017c; Education Endowment Foundation, 2013). A range of student- and school-level characteristics were compared for the achieved sample of 19 responders, the 67 non-responders and the wider population.¹¹ Most differences were small and within chance variation for a sample of this size, though, overall, the achieved sample contained pupils from schools with slightly higher than average levels of Free School Meal¹² eligibility (18% vs 13% nationally), lower than average attainment (44% 5A*−C vs 50%), but above average value-added, both in mathematics and overall (Coe & Hodgen, 2012) (standardised effect sizes for the difference of 0.25 and 0.13, respectively). Nevertheless, there may be some bias due to the response rate of 22% in the 2008-2009 sample and, although we are able to show that there was no significant non-response bias in the school average MidYIS score for the students in the study, we did observe some differences between responders and non-responders in the attainment and progress of a previous cohort of students in those schools. A small difference in attainment (0.15 SD) suggests that the achieved sample might underestimate attainment in the national population, while a slightly larger difference in value-added (0.25 SD) points in the opposite direction.
¹⁰ The database was restricted to schools with three years of MidYIS test data in order to ensure that MidYIS data were available for all three Grades tested. There were 301 such schools out of a total of 1164 in the database, which is around a third of secondary schools in England.
In addition, we were able to match individual students' MidYIS scores with their ICCAMS scores. Correlations between MidYIS score and the total score on each ICCAMS test in each grade varied between 0.679 and 0.746, and all were statistically significant (p < 0.001). The strength of this relationship, combined with the availability of national norms for MidYIS scores, allowed us to increase the precision of sample-based estimates of population parameters by weighting the achieved responses to make their MidYIS scores fit the national distribution.
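One common way to make a sample fit a known national distribution is post-stratification: each response is weighted by the ratio of the national share to the sample share of its score band. The sketch below illustrates the idea with hypothetical bands, shares and test scores; the text does not specify the weighting scheme actually used, so this should be read as one plausible approach rather than the authors' method.

```python
from collections import Counter


def poststratification_weights(sample_bands, national_shares):
    """Weight each response by national share / sample share of its band."""
    counts = Counter(sample_bands)
    n = len(sample_bands)
    return [national_shares[band] / (counts[band] / n) for band in sample_bands]


def weighted_mean(values, weights):
    """Mean of values with the given non-negative weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: three students in a 'low' band and one in a 'high' band,
# although the two bands are assumed to be equally common nationally.
bands = ["low", "low", "low", "high"]
test_scores = [10, 10, 10, 30]
weights = poststratification_weights(bands, {"low": 0.5, "high": 0.5})
print(round(weighted_mean(test_scores, weights), 6))  # 20.0
```

In this toy case the unweighted mean is 15, so the weighting corrects for the over-representation of the lower band. The stronger the correlation between the control score and the test score, the more such weighting can improve precision.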
As the use of MidYIS scores was central to the approach used to obtain accurate estimates, we also investigated the sensitivity of the results to unmatched or missing MidYIS scores (Coe & Hodgen, 2017b). Overall, 9.5% of ICCAMS scores could not be matched and lower scores were more likely to be missing. Despite this, estimates using observed scores were within 0.02 of a standard deviation of those derived from multiple imputation; using weighted MidYIS scores seemed to give small but appropriate corrections (Coe & Hodgen, 2017b).
¹¹ The variables available for the population of all schools in England were: whether the school is single sex or mixed; whether the school is selective or not; whether the school is independent or maintained; total number of pupils in the school; school percentage FSM; school percentage achieving 5 + A* − C at GCSE (the equivalent of a Level 2 qualification, see Footnote 3 above); overall school value-added; mathematics value-added for the school. Full details of the comparisons can be found in Coe & Hodgen (2012).
¹² The proportion of students eligible for Free School Meals (FSM) is commonly used as a measure of deprivation in England. The data available to compare responding and non-responding schools relate to different cohorts of students from those who were involved in the study.

4.8 Test Administration
The administration of the tests was the same for ICCAMS and CSMS and took place at the end of the school year in June and July. Each test was designed to be taken in one mathematics lesson and was administered by the students' regular mathematics teacher. In 2008-2009, to reduce test fatigue, each student completed just two of the tests on separate occasions. In 1976-1977, a sub-sample of students took two tests. Detailed instructions were provided, with only minor updating of language for the ICCAMS administration. In addition, whilst the tests were administered under examination conditions, teachers were encouraged to "ensure that all the students understand what the questions are asking of them … [but not to] give any information about how to tackle the questions" and to read the questions to students if required. In the 1970s, Algebra and Ratio, both shorter tests, were sometimes taken together in one lesson. Hence, in 2008-2009, students may have had a longer time to complete these tests than some students in the 1970s.
For the 2008-2009 sample, the performance of all three tests was analysed using both classical and item response theory (Rasch) models; full details are in Coe & Hodgen (2014). All tests performed well on dimensionality tests, and had high levels of internal consistency (e.g., Cronbach's alpha values: Algebra 0.95; Decimals 0.96; Ratio 0.94). Almost all items provided an excellent fit to the Rasch model, with occasional misfit well below the level that would degrade measurement.
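As a reminder of what the internal-consistency figures above summarise, the following is a minimal sketch of the standard Cronbach's alpha formula. It is not the authors' analysis code: the function name and the small response matrix are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (students x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return n_items / (n_items - 1) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical data: 5 students, 4 items scored correct (1) or incorrect (0).
demo = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
alpha = cronbach_alpha(demo)
print(round(alpha, 3))
```

Values close to 1, such as those reported for the three tests, indicate that the items within a test rank students very consistently.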

Are the 2008-2009 and the 1976-1977 Samples Equivalent and Comparable?

While, as far as possible, the 2008-2009 sample was constructed in the same way as that in 1976-1977, so as to enable valid comparisons of results, there were inevitably some differences. First, in order to improve precision, in 2008-2009, a larger sample of schools contributed to the results for each test in each grade, although the total number of schools involved in 1976-1977 was greater. In retrospect, a larger sample of schools for each test in each grade in 1976-1977 would have been preferable in order to reduce the extent to which the schools involved did not reflect schools nationally, but there was nothing that we could do about that. Second, in 2008-2009 the schools were selected at random from the MidYIS database of schools in England, using the National Pupil Database (NPD) to establish the representativeness of the sample. In practice, however, the low response rate means that even a systematic sampling process does not guarantee that the 2008-2009 sample achieved is representative. In the 1970s, no equivalent database was available, so schools of different types and from different regions were asked to participate on a rather opportunistic basis. Third, the Calvert Non-Verbal Reasoning test originally used to establish the national representativeness of the sample for each test is no longer available and therefore, as already noted, an alternative, the MidYIS test, was substituted. It seems very unlikely that this change had anything but a very small effect.
Overall, therefore, it seems that in relation to the national distribution of IQ, the samples could be judged to be equivalent and comparable, and we know that there was a high correlation between these ability measures and scores on the mathematics tests. However, in relation to the effectiveness (or other characteristics) of the schools involved in the two samples, it is not possible to be completely confident whether either sample was nationally typical or whether they were strictly comparable. We know that schools in the 2008-2009 sample were slightly more effective than the national average in terms of their value-added progress in mathematics from grade 5 to 10, but also that their overall attainment in grade 10 was below the national average for England. In 1976-1977, many of the schools became part of the sample through a staff member volunteering at a professional development event, or through some other personal connection, so it is possible that these teachers were more confident, enthusiastic and perhaps more effective than typical. These are all limitations that we are unable to overcome and that should be borne in mind in interpreting any claims about national performance.
Nevertheless, we believe that some claims about national performance can still be made on the basis of these samples, for the following reasons. First, for all their limitations, both samples were the result of a systematic process to select a representative group and check its representativeness. Second, both samples used matching to a highly correlated, nationally standardised measure to limit the size of any variation from national norms. Third, no other longitudinal surveys exist, especially not on this scale, involving thousands of pupils at two time points. Ideally, our knowledge of changes in performance would be based on the interpretation of multiple and independent studies, each using different methodologies to give a balance of different strengths and weaknesses. Our study is far from the final word in such a process, but we hope it provides a start.

4.10 The Estimation of Item Facilities and Confidence Intervals

Bootstrapping was adopted as an approach to estimating the sampling error on 2008-2009 item facilities, after weighting to make the distribution of MidYIS scores in the sample for each test and grade nationally representative. Because we employed a two-stage sampling process (selecting schools, then pupils), estimation of standard errors must take account of possible clustering (the tendency for pupils in the same school to be more similar than pupils chosen independently). Although this can be done with standard statistical adjustments to the data from a single sample (e.g., using multilevel modelling or the Huber-White correction), the bootstrap approach is preferable. Our procedure for estimating item facilities was more complex: drawing a sample of schools, testing a sample of pupils in those schools, then applying weights to those test scores to achieve the same distribution in our sample of MidYIS scores as was known to be nationally representative. Part of the reason for using this weighting approach was to reduce the standard errors of our facility estimates: estimates from different samples chosen and weighted this way should be expected to vary less than they would if no weighting were applied. In the absence of an analytical way to calculate standard errors, the bootstrap approach allowed us to estimate the variation in facility parameters from repeated samples by simulating a process of repeated sampling and calculating that variation. For the Algebra test, 3000 bootstrap samples were generated for each grade in order to check the agreement across three different bootstrap methods: Standard 'Bootstrap-t' confidence interval; Simple percentile method; and BCa (bias corrected accelerated) percentile method (Efron & Tibshirani, 1993). In addition, the standard Bootstrap-t method was applied to item facilities without weighting based on MidYIS scores. Full details of the approach and results can be found in Coe & Hodgen (2017a).
The three methods that used MidYIS weightings were found to agree extremely well, with all inter-method correlations in excess of 0.99, and over 90% of pairwise comparisons within 10% of each other. Comparison with confidence intervals estimated without using weighting showed that weighting typically reduced the width of confidence intervals by 20%-30%, though for some items the reduction was much greater (Coe & Hodgen, 2017b). For the other tests, the simplest method (standard 'Bootstrap-t') was therefore used to estimate 95% confidence intervals.
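The idea of resampling whole schools, so that clustering is respected, and recomputing a weighted facility each time can be sketched as follows. This is an illustrative simplification and not the procedure documented in Coe & Hodgen (2017a): it uses the simple percentile method only, treats the weights as fixed rather than re-deriving them from MidYIS scores in each resample, and all function names and data are hypothetical.

```python
import numpy as np

def weighted_facility(correct, weights):
    """Weighted proportion of correct responses (the item facility)."""
    return np.average(correct, weights=weights)

def cluster_bootstrap_ci(school_ids, correct, weights, n_boot=1000, seed=0):
    """Percentile 95% CI for a weighted facility, resampling whole schools."""
    rng = np.random.default_rng(seed)
    school_ids = np.asarray(school_ids)
    correct = np.asarray(correct, dtype=float)
    weights = np.asarray(weights, dtype=float)
    schools = np.unique(school_ids)
    # Pre-compute the row indices belonging to each school.
    rows = {s: np.flatnonzero(school_ids == s) for s in schools}
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw schools with replacement; every pupil in a drawn school is kept.
        drawn = rng.choice(schools, size=len(schools), replace=True)
        idx = np.concatenate([rows[s] for s in drawn])
        estimates[b] = weighted_facility(correct[idx], weights[idx])
    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return weighted_facility(correct, weights), lower, upper

# Hypothetical data: 8 schools of 30 pupils, one item, arbitrary positive weights.
demo_rng = np.random.default_rng(1)
school_ids = np.repeat(np.arange(8), 30)
correct = demo_rng.integers(0, 2, size=school_ids.size)
weights = demo_rng.uniform(0.5, 1.5, size=school_ids.size)
estimate, lower, upper = cluster_bootstrap_ci(school_ids, correct, weights)
print(estimate, lower, upper)
```

Resampling at the school level rather than the pupil level is what allows the interval to reflect between-school variation; resampling individual pupils would typically understate the uncertainty.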
The CSMS results were published in a lengthy technical report (Hart & Johnson, 1983), several doctoral theses (Brown, 1981; Hart, 1980; Küchemann, 1981) and a book (Hart et al., 1981). Although we had access to the detailed results from these sources, we did not have access to the full CSMS dataset. This meant that we could not reanalyse the data and, since standard errors and confidence intervals were not calculated for the original survey, we had to estimate these through a simulation process as described below. Hence, bootstrapping was also used to estimate confidence intervals for item facilities from the 1976-1977 round of testing, in the absence of any direct estimates from the original study. A bootstrap sample of six schools was taken to represent a typical sample, and the same method as that used in the 2008-2009 sample (Coe & Hodgen, 2017a) was applied. An estimate of the 95% confidence interval around the change in facility was calculated from the standard errors of measurement at each point.
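One common way to combine two standard errors into an interval for a change is sketched below. The exact formula used in the study is not stated here, so this assumes independent samples and a normal approximation; the function name and all numerical values are hypothetical.

```python
import math

def change_ci(facility_then, se_then, facility_now, se_now, z=1.96):
    """Approximate 95% CI for the change in facility between two independent samples."""
    change = facility_now - facility_then
    # For independent samples, the variances of the two estimates add.
    se_change = math.sqrt(se_then ** 2 + se_now ** 2)
    return change, change - z * se_change, change + z * se_change

# Hypothetical facilities (in %) and standard errors, chosen only for illustration.
change, lower, upper = change_ci(60.0, 4.0, 50.0, 3.0)
# The change is treated as significant if the interval excludes zero.
significant = not (lower <= 0.0 <= upper)
print(change, lower, upper, significant)
```

Because the combined standard error is dominated by the larger of the two, a much less precise estimate at one time point limits how precisely the change can be known, however precise the other estimate is.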

Results: Changes to Students' Understanding over Time
In order to examine and compare overall performance between 1976-1977 and 2008-2009, we examine how item facilities as a whole have changed, together with the mean item facility in each topic. We begin by focusing on the comparison of the results over time for the oldest group of students tested in 2008-2009, those in Grade 8. We examine how the performance on the items in each topic has changed and then how performance has changed across the attainment range in the cohort by comparing the proportion of students at the different levels in the hierarchy of understanding. We then briefly consider how mathematical understanding has changed for younger students in order to consider the changes across lower secondary.

Changes to Item Facilities at Grade 8
In this section, we discuss the changes to the item facilities across all four topics (Algebra, Decimals, Ratio and Fractions) in order to examine overall how students' understanding has changed over time. In Table 2, we summarise these changes by using the mean facilities for each topic together with an overview of the numbers and percentages of items where the facilities have increased, decreased or not changed significantly. It can be clearly seen that the overall mean facilities have declined in all topics, with the decline smallest for Decimals and greatest for Fractions. The decline is statistically significant for all topics except Decimals, and the effect sizes range from d = −0.18 for Decimals to d = −0.45 for Fractions. Over time, mean facilities on roughly half of the items have decreased significantly, and roughly half have not changed significantly. Only 5 (or 3%) of the total 163 items have mean facilities that have increased significantly, and all of these are from the Decimals test.
[…] The picture for Algebra shows a similar decline across the range of facilities, except that there are significant declines for a cluster of items with facilities greater than 75% in 1976. In 2008-2009 several items involving apparently simple arithmetic appear to be surprisingly difficult when presented in the context of the algebra test. For example, two items (see Figure 5) involving numerical calculations of area, a 3 by 4 rectangle (with a grid) and a 6 by 10 rectangle (without a grid), show declines from 91.4% to 78.4%, and 88.6% to 75.0%, respectively. Other items presented in geometric contexts also showed considerable declines in facilities. For example, two items (see Figure 6) involved enumerating diagonals in a polygon, given an example of a five-sided polygon. The item facilities for 57- and k-sided polygons fell from 74.6% to 53.8%, and from 52.0% to 41.7%, respectively. This may be due to less emphasis being placed on geometry in general than in the 1970s, but is nevertheless salutary considering the extensive use of geometric contexts in the teaching of algebra at lower secondary.

[…] 3 together with the change in facility over time. It can be seen that the facilities for three of the items show considerable declines of between 10% and 15%. However, the final item (8d), which asks students to work out the cost of a £20 coat when reduced by 5%, shows a non-significant increase in facility of 7 percentage points, from 26.5% in 1976 to 33.5% in 2008-2009. This may be due to greater emphasis being placed on mental and other 'informal' methods for calculating percentages.13

One additional and potentially important change is that the proportion of blank or non-responses has increased over time. The frequencies of blank responses have risen from means of 12.8%, 6.7% and 7.6% in the 1970s to means of 21.0%, 17.9% and 17.9% in 2008-2009 for the Algebra, Decimals and Ratio items, respectively. This is a curious result, which we address in the discussion below. The blank responses for the Fractions items have also increased (from a mean of 14.6% in the 1970s to 31.6% in 2008-2009). This increase is perhaps less surprising, given that much less emphasis was placed on fractions in 2008-2009 than was the case in the 1970s.

5.2 Changes across the Attainment Range at Grade 8

We now turn to examine how students' understanding has changed across the attainment range. We examine how the proportions of students at each level in the CSMS topic hierarchies have changed. Again, we focus on the oldest students tested, those at Grade 8. Here, we report on Algebra, Decimals and Ratio, but not Fractions, because only a small subset of Fractions items was used in the 2008-2009 administration.
In Table 4 and Figure 7, we show the change in the proportion of Grade 8 students achieving each Level or above in the CSMS hierarchy for each test. As noted earlier, the levels were well-ordered in both administrations of the tests and there were very few students who achieved a higher level but not a lower level. In Algebra and Ratio, the proportion of the lowest achievers, i.e., those at "Level 0", has increased dramatically over time. In Algebra, the proportions have significantly declined for those achieving each level or above, except Level 2 or above. The proportion of students achieving at least Level 3 is of particular interest, since this is when students begin to understand the key algebraic concept of variable. For Ratio, the proportions achieving at least Level 1 and at least Level 2 have declined significantly. Level 2 is important, since this is when students begin to understand contexts involving non-integer ratios as being multiplicative. The proportion achieving at least Level 2 has declined to only around a third of the cohort. The results for Decimals indicate a slightly more positive picture, in that there has been some improvement for the middle range of attainment, with an increase in the proportion achieving Level 4 or above, although this is offset by what appears to be a corresponding decline at the highest levels. As with Algebra and Ratio, the proportion of the lowest achievers "at Level 0" has increased, although the change is smaller and is not statistically significant.

5.3 Comparing Changes at Grade 6 and Grade 7 to Those at Grade 8

In this section, we compare progression in Algebra, Decimals, Ratio and Fractions. Thus, in Table 5, we compare the mean facilities for 1976 or 1977 and 2008-2009 at Grade 8 with those at Grade 7 (aged 12-13) for all four areas and at Grade 6 (aged 11-12) for Decimals and Fractions. Changes across the 30 years at Grades 6 and 7 were rather different across the four areas. In Ratio (Grade 7 only) and Fractions, the changes were similar to Grade 8; in Algebra (Grade 7 only), there was only a small decline. In Decimals, there were small rises, greater for Grade 6 than Grade 7. In Table 6, we compare progression from Grade 6 to Grade 8 across the attainment range in 2008-2009 for Algebra, Decimals and Ratio. It can be seen that the gap in attainment from the 10th to the 90th percentile increases for the older students with greater increases for Algebra and Ratio. This is in large part due to much smaller gains for the lowest attainers.
An important change to the curriculum is that symbolic algebra is now generally introduced in Grade 6, which, as we noted previously, is earlier than in the 1970s (Brown, 1996). One might have expected that this earlier introduction would have boosted performance at Grade 7, so the fact that the 2008-2009 mean facility is below that for 1976 is surprising. Moreover, since the Grade 8 students in 2008-2009 had a longer exposure to symbolic algebra than their counterparts in 1976, three years rather than two, one might have expected a further boost in performance at Grade 8. Hence, it is very striking indeed that the change in mean facility, or progression, from Grade 7 to Grade 8 has more than halved to 4.0%, and that the gap between the highest and lowest attainers actually widens from Grade 6 to Grade 8. This suggests that the earlier introduction of symbolic algebra may have had little or no lasting effect beyond a possible 'initiation' effect, particularly for the lowest attaining group of students.
For Decimals, the increase at Grade 6 may reflect the fact that many aspects of decimal number, particularly measurement aspects such as place value and the use of number lines, are now taught in primary (DfEE, 1999). Decimals are taught much more extensively and very much earlier than in the 1970s and there is much greater emphasis on decimals outside school. It is therefore somewhat surprising that this has not resulted in better performance on the Grade 8 tests.
For Ratio, the decrease at Grade 7 is of a similar order to the decrease at Grade 8, suggesting a consistent decline in the understanding of ratio, but it is striking that the 2008-2009 mean facility for Grade 8 is below that for Grade 7 in the 1970s.
For Fractions, the mean facilities have declined substantially for all three age groups, with a slightly greater decrease for Grade 6. The current Grade 8 facility is well below that for Grade 6 in the 1970s. This is perhaps to be expected, because as we have already noted there is now much less teaching of fractions than in the 1970s. The change in mean facility, or progression, from Grade 6 to Grade 8 has increased from 3.8% in 1976-1977 to 6.0% in 2008-2009.

It can also be seen from Table 7 that the proportion of the lowest attaining group of students, those "at Level 0" in the CSMS hierarchy, has increased significantly for all ages of students in Algebra and Ratio. For Decimals, the proportion of low attaining students has fallen slightly at Grade 6 (but not significantly) and remained stable at Grade 7.

On Replication

Before discussing the results of the replication further, we consider the challenges that we faced in replicating a study carried out in the 1970s and how we used methods now available to address these replication issues. The CSMS study was at the time one of the largest and most rigorous studies that had been carried out worldwide. Yet, viewed with a modern eye, the study has some limitations. The original analysis was limited by the methods and computing power then available. In addition, the expectations regarding statistical practice and reporting in the 1970s were lower. Hence, the original study reported point estimates but not standard errors or confidence intervals. Such measures of precision are critical to judging the significance of changes over time. These kinds of issues are likely to affect all, or most, of the significant studies carried out prior to the mid-1990s before academic journals in education began to establish clear guidelines for statistical reporting (Hill & Shih, 2009). We addressed this by making use of statistical techniques and computing power now available. First, we used a more robust modern method, Rasch modelling, to re-validate the tests. Second, we used statistical simulation to estimate standard errors, thus enabling the comparison. Ideally, our claims would be validated against other assessments and samples, but no such evidence is available. In the absence of any other data, as is likely to be the case in any other similar replication, we believe modern statistical methods, such as simulation, have an important role to play in replicating studies and comparing the results over time.
Some might argue that the study carried out here is simply a comparison of two national-scale studies and, whilst the comparison itself is of value, the study does not provide a specific contribution to the replication literature. Certainly, the examination of national-level reform has received little attention in the literature on replication (although this has been a significant concern of the implementation literature, see, e.g., Helenius, 2021). Following Hüffmeier et al. (2016), Melhuish (2018) provides a typology of replication types: exact, close, constructive and conceptual. Of most relevance to the research presented here are the constructive and conceptual variants, which Melhuish argues "may contain divergences from the original study to better test, refine, or expand a theory or theoretical propositions" (p. 11). Constructive replications do this by introducing a new element, whilst conceptual replications contain changes to the methodology. The study reported here involves both a different element to the original study, a sample from the population of English students at a different time point, and changes to the methodology, specifically the use of modern statistical methods. This combination of changes has enabled us to refine and expand the original findings about how students understand these key areas of mathematics and how these understandings develop over the lower secondary phase. This aspect of the replication is certainly very important, but is reported elsewhere (e.g., Hodgen et al., 2012). In contrast, the focus of this paper is on examining whether the levels of understanding, which were identified in the 1970s, still hold 30 years later and, thus, assessing whether reform in England has had a positive effect on student attainment. There are, of course, other tests and surveys that enable a comparative analysis of these long-term trends over time, such as national tests (such as GCSEs in England) and international surveys (such as PISA and TIMSS), although our replication demonstrates how comparisons can be made when other tests and surveys are either not available or are limited in scope. Importantly, this study focuses specifically on conceptual understanding, a key strand of mathematics that is often underemphasised in official tests and surveys. Thus, as a replication, this study provides an independent research-led assessment that is not influenced by national or international political concerns and makes an additional contribution by demonstrating how modern statistical methods can overcome the challenges of comparisons with studies where the original data is either not available or available only in a limited aggregated form.

6.2 Comparing Performance over Time

A major conclusion of our replication study is that, at Grade 8, there has been an overall decline in students' attainment since the mid-1970s in each of the areas tested. There is a more mixed picture for Decimals, where students' understanding appears to have slightly increased over time for the middle attaining students, although this is in the context of an overall decline. This general decline is a surprising result, since, as we have already noted, England has seen a concerted attempt to improve educational performance in mathematics over the past 30 years.
Ultimately, if our aim is to measure the change in attainment, the benefits of further refining our estimates of population parameters from the 2008-2009 sample are constrained by the much larger uncertainty around the estimates from 1976-1977.
The decline is equivalent to effect sizes of Cohen's d = −0.32, −0.18, −0.45 and −0.29 for Algebra, Decimals, Ratio and Fractions, respectively. Effect sizes of this order are often classed as low to moderate in the educational literature when judging the impact of educational interventions. However, these are arguably large systemic effects and are of a similar order to major changes in systems' performances on TIMSS and PISA. They are also large in relation to the typical growth we observed in students' scores on the tests between Grades 6 and 8. The decline in performance in Fractions is equivalent to the progress typically made in two years of schooling, while for the other three tests it is of the order of over a year. In other words, students at the end of Grade 8 in 1976-1977 were well over a year ahead of their counterparts in 2008-2009. This overall decline is in marked contrast to the increase in examination results, which have risen dramatically over the period. There are several possible reasons for this anomaly. One possible hypothesis is that the nature and value of qualifications has changed. There is a great deal of recent research indicating that grade standards in English mathematics examinations may have 'slipped' over time (e.g., Coe & Tymms, 2008; Jones et al., 2016). It is important to note that obtaining qualifications, particularly in mathematics, has become much more crucial for all students since the 1970s. Hence, examination results may have improved because a greater proportion of students have been given the opportunity to sit the examinations, because these students have greater motivation to do well and because schools are influenced by accountability measures.
It could also be that tests which focus on a deeper level of reasoning, such as the CSMS tests, show a decline, whereas those, such as the national GCSE examination, involving more routine items and/or more coached performances show an improvement. Indeed, as we have discussed above, the CSMS tests were deliberately designed to test conceptual understanding rather than the ability to perform routine, procedural tasks. In addition, our own comparative analysis of mathematical textbooks in England (Hodgen et al., 2010) suggests that much less emphasis was placed on conceptual understanding in 2008-2009 than in the 1970s.
Another possible explanation is that, unlike GCSE examinations, the ICCAMS and original CSMS tests were administered without preparation or revision, whilst secondary education in England has become more highly focused on examination performance in recent years (Office for Standards in Education, 2012). It may be that an effect of the rise in prevalence of high-stakes testing between 1976-1977 and 2008-2009 is that low-stakes tests (such as ours) seem less worthy of effort by comparison. This might also explain why the number of unanswered questions is higher in 2008-2009 than in 1976-1977. Unfortunately, the existing literature is not conclusive. Penk et al. (2014) show that some studies do find an association between test performance and motivation, whereas others do not. Some experimental studies do find relatively large effects for motivation, although these may be distorted by the effect of academic ability (Wise & DeMars, 2005). In addition, these large effects are recorded in designs that emphasise extreme differences in the stakes of tests or monetary incentives. Whilst we cannot rule out the possibility of a test motivation effect in the decline, the current evidence suggests that any such effect would be small at most, given that the test was low stakes at both administrations.
The decline may also be related to changes in the population of students in England's schools, particularly to changes in the proportion of ethnic minority students, students with English as an Additional Language (EAL) or students with Special Educational Needs (SEN). Unfortunately, data on students' ethnicity or EAL was not systematically collected in the 1970s (Khan, 1983), but it is generally accepted that the proportions of these students significantly increased over the period. However, the evidence that is available suggests that these factors have not contributed to the decline (and might, if anything, have reduced the decline over time). Strand (2015) finds that, over the period 1991 to 2006, the gap in educational attainment between ethnic minority and White British students has narrowed. In addition, Strand et al. (2015) examine the relationship between EAL and achievement between 1997 and 2013 and, whilst they identify an attainment gap in the early years, they find that, in mathematics, this is almost eliminated by age 11, and that, by age 16, EAL students outperform First Language English students.
One serious issue concerns the proportion of the lowest attaining students, those who fail to achieve Level 1 and are thus "at Level 0". In Algebra and Ratio, the proportion of these students has more than doubled to around 15% of the population. These students have difficulty with the very simplest items on the tests and thus do not fully grasp some of the core ideas in the primary curriculum. This may be partially reflected in the TIMSS results, which show no change between 1995 and 2007 in the proportion of Grade 8 students who do not achieve the low international benchmark, despite a significant rise in England's average attainment (Sturman et al., 2008). It is difficult to explain this; one possibility is the closing of many Special Schools and greater inclusivity of students with SEN within the mainstream sector. The Warnock Report into special educational needs records that 1.8% of the school population (ages 5-15) were in special schools or classes in 1977 (Warnock, 1978, p. 37). In 2007, 1.05% of students were in special schools, with a further 0.2% in pupil referral units (Department for Children Schools and Families, 2008). Hence, it is unlikely that this factor could account for the full size of the difference. Another possible explanation lies in the finding that the National Numeracy Strategy had the effect of depressing attainment at the lower end, perhaps because of the failure to address children's particular needs in attempting to provide equal access to the curriculum (Barnes et al., 2003; Brown et al., 2008). The decline is also in contrast to England's performance in international surveys, although, as previously noted, these surveys enable reliable comparisons only over a much shorter period: back to 1995 for TIMSS and back to 2003 for PISA. One possible explanation for England's rises in TIMSS is that the English mathematics curriculum has become closer to the curriculum tested, particularly at primary (Brown, 2011; Burstein, 1992). It is also important to note that the CSMS tests do not test the whole curriculum and, indeed, do not test the entirety of algebra and multiplicative reasoning. Nevertheless, the topics tested are critical to further progression in mathematics.
Of course, it is possible that, whilst mathematical performance in England declined over the period as a whole, this decline may not have been monotonic; our findings would be consistent with some improvements over shorter periods. Indeed, the evidence at primary level does suggest a modest rise since 1995 (Tymms, 2004), although the OECD's (2013) survey of adult skills suggests a gradual decline over the period. Nevertheless, the issue of when the decline took place, and indeed whether the decline is associated with any particular reform initiatives, remains open.
A related issue is the increase in the proportion of blank responses to questions. In 1976-1977, missing responses were treated as incorrect, and so, for purposes of comparability, we have treated the 2008-2009 missing responses as incorrect. Brown et al.'s (2014) analysis of NAEP data indicates that this is a reasonable approach, because alternative methods (such as ignoring missing responses or using imputation) produce similar estimates for large samples. Analysis of the missing responses does suggest that the rise in missing responses is more than would be expected to arise purely from increased difficulty, although the effect is relatively small (Coe & Hodgen, 2017c). This does not appear to be the result of students having less time for the tests. One possible explanation is that an increased focus on examination technique has led to some students leaving a blank response to items that they are unsure about.
The fall in the proportion of those at the highest level of attainment is also of concern. Although this fall is statistically significant only for the Algebra test, the actual proportion of the current cohort at this level of attainment is worryingly low.
On the Decimals test, the effect of a rise in attainment focused in that section of the attainment range between the 15th and 60th percentile again has possible explanations. This could reflect greater focus on coaching students predicted to be around the Level 4 borderline in Year 6 and then the C/D borderline at GCSE, since these are key performance indicators for schools in England. While this does not explain why this feature is not present in any of the other curriculum areas, this could be attributed to the fact that these borderline students are more likely to have been coached in basic number than in the more formal and abstract topic of algebra. However, these differences may also occur because of cultural changes in student knowledge. There was probably more use and knowledge of fractions in 1970s society in England. The 1970s saw the advent of decimalised money and metrication, and also the rise of calculator and computer use, which employ decimal notation. These societal changes probably had the effect of enhancing knowledge about decimals in relation to knowledge about fractions. This would explain both improvements in the middle range for Decimals and the presence in that test of a greater proportion of items which are unchanged or improved compared to other areas.

Conclusion
In this study, we conducted a 'scaling out' replication of the CSMS study originally carried out in the 1970s in order to compare performance over time in key areas of lower secondary mathematics. One key contribution of the study is to demonstrate how modern statistical methods can be used to carry out such a comparison when the original data and statistical findings are no longer fully available. It certainly seems counterintuitive that, given the long list of major Government initiatives between 1985 and 2009 aimed at increasing attainment in mathematics, there has not been any obvious positive effect on the understanding and application of two of the key areas in lower secondary. Of course, one might speculate that this list in itself explains why performance does not appear to have risen; the effectiveness of teachers and schools may be negatively affected by initiative overload (Perryman et al., 2011). Indeed, there is some evidence that higher performing countries are less prone to frequent external initiatives (Askew et al., 2010). The evidence presented here does suggest that government investment in initiatives is not sufficient on its own to increase mathematical attainment across the system and that it is also important to focus on the quality of reform initiatives. Hence, one implication of this study is that such initiatives should include research focused on building the evidence base on the efficacy of educational interventions (Coe, 2009).
A further implication of this study is that it is important to take steps to monitor standards of attainment over time. Frequently, the debate about educational performance in England and elsewhere has focused on examination standards (Anthony & Walshaw, 2007; Truss, 2013; Walport et al., 2010). However, we believe this focus to be somewhat misplaced, since the nature and purpose of the examination system changes significantly over time, and public, high-stakes examinations are not well-suited to monitoring standards over time. In the US, the NAEP-LTT program goes at least some way towards meeting this need, but in England there is currently no equivalent. If we want to know about system-wide changes in performance over time, we need an assessment program designed for this purpose.
A related implication is that there is a need for mathematics education to place greater focus on conceptual understanding (see also Kilpatrick et al., 2001). Procedural understanding of mathematics is important, but conceptual understanding is critical to using mathematics in new and unfamiliar contexts.
Overall, our results are sobering. In England, over a 30-year period, despite huge investment in well-intentioned reform and widespread perception of improvement, student outcomes appear to have declined, at least in the key areas of mathematics that are the focus of this study. The most plausible interpretation of our results is that overall attainment in mathematics for 14-year-olds in England has declined. This should be a salutary warning to anyone who thinks that systemic educational improvement can be decreed, imposed, bought or assumed: evidently it needs something much smarter than that.
Implementation and Replication Studies in Mathematics Education 4 (2024) 83-124

In Figures 1-4, the facilities of 2008-2009 are plotted against those of 1976-1977 and any significant changes are indicated. These scatterplots show that for Decimals, Ratio and Fractions, items that have declined significantly are spread across the range of item facilities.

Figure 1 Scatterplot of 51 matched item facilities for Algebra test at Grade 8

Figure 5 Items 7a and 7b involving the numerical calculation of area from the Algebra test

Figure 7 Proportional bar charts showing achievement of CSMS hierarchy levels in Algebra, Decimals and Ratio across the cohort at Grade 8. Key: L0, Level 0, etc.
and Science Survey) and the OECD's PISA (Programme for International Student Assessment), do have comparable data going back to 1995 for TIMSS and 2003 for PISA mathematics. In TIMSS, England's performance at Grade 8 rose from a mean of 498 in 1995 to 513 in 2007

Table 1
The samples in 1976-1977 and in 2008-2009

Table 2
Summary of change to mean facility of items on each test at Grade 8. Effect size calculated using score change based on mean facility as a proportion of standard deviation in 2008-2009.
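The effect size described in the note to Table 2 can be sketched as follows. This is an illustration only: it assumes that the score change and the 2008-2009 standard deviation are expressed on the same scale, and the numerical values are hypothetical rather than taken from the table. The function name `effect_size` is likewise illustrative.

```python
def effect_size(mean_facility_then: float,
                mean_facility_now: float,
                sd_now: float) -> float:
    """Change in mean facility as a proportion of the later standard deviation.

    A negative value indicates a decline between the two administrations.
    """
    if sd_now <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_facility_now - mean_facility_then) / sd_now


# Hypothetical values on a 0-1 proportion-correct scale (not study results).
d = effect_size(mean_facility_then=0.50, mean_facility_now=0.45, sd_now=0.20)
print(f"effect size: {d:.2f}")
```

Standardising by the 2008-2009 standard deviation alone, rather than a pooled value, is the convention stated in the note; a pooled standard deviation would be a different, though related, measure.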

Table 3
Facilities, standard errors and change over time for the four items involving percentages on Ratio (Test R) at Grade 8 (age 13-14) with abbreviated descriptions

Table 4
Change over time of proportions of students achieving CSMS hierarchy levels in Algebra, Decimals and Ratio at Grade 8

Table 5
Change over time for Algebra, Decimals, Ratio and Fractions at all grades. For Fractions, there are more common items at Grade 8, and the mean facilities at each administration for these 15 common items are shown in italics.

Table 6
Effect sizes (Cohen's d) of gain in attainment from the end of Grade 6 to the end of Grade 8 across the attainment range for the Algebra, Decimals and Ratio tests in 2008-2009

Table 7
Proportion of students "at Level 0" in the CSMS hierarchy in Algebra, Decimals and Ratio at all grades