A Review and Critical Analysis of the Woodcock Reading Mastery Tests – Revised (NU)

Info: 3723 words (15 pages) Dissertation
Published: 9th Dec 2019

The ability to read and communicate effectively is a critical aspect of every person’s daily life. Individuals who struggle with reading will have increased difficulty across all academic subjects and can also face significant difficulties in their personal lives outside of the school system. Difficulties with reading can occur for many reasons, including visual or auditory difficulties and difficulties with letter recognition, phonics, or the comprehension of words or paragraphs. These can be due to a lack of understanding in a specific area or to a physical and/or learning disability. Standardized individual batteries such as the Woodcock Reading Mastery Tests – Revised – Normative Update (WRMT-R/NU) allow the performance of an individual in different domains of reading to be determined and compared to that of individuals from similar age/grade groups. If this information can be accurately obtained, it can be used to assist in the development of a learning program that uses the individual’s strengths to assist them in areas of difficulty. The purpose of this paper is to summarize and critically analyze the Woodcock Reading Mastery Tests – Revised (NU) in terms of its development, technical information, and uses for decision making regarding individuals.

Test Development

The WRMT-R is a revised edition of the original 1973 Woodcock Reading Mastery Tests; it was published in 1987 and includes some new and expanded tests (Woodcock, 1998). The WRMT-R/NU was published in 1998 and uses exactly the same stimulus material as the WRMT-R; the only difference is that it includes a different set of normative tables for individuals up to grade 12 and under age 23. Above these age and grade cutoffs, the original normative tables are still used for subjects completing the test.

According to the WRMT-R/NU manual, the battery contains two parallel forms (G and H) that allow for later retesting or for more comprehensive testing by combining the two forms (Woodcock, 1998). Of note, however, is that Form H is not the complete battery: it does not include the readiness cluster, whose tests cover visual-auditory learning and letter identification, or the supplementary checklists that test recognition of capital and lowercase letters. Both Form G and Form H include the basic skills cluster and the reading comprehension cluster. The basic skills cluster includes two tests, word identification and word attack, which test the subject’s ability to pronounce either increasingly difficult words or nonsense words using the correct phonics for different letter combinations. The reading comprehension cluster includes a test of word comprehension, in which subjects provide synonyms and antonyms of printed words and complete analogies, and a test of passage comprehension, in which the individual fills in a missing word from a sentence or passage. A subject’s ability to complete the various tests, and the errors they make, can be used to determine areas of reading strength and weakness. Furthermore, for a more comprehensive assessment, the manual provides details about subtests from other batteries that can be combined with the WRMT-R/NU battery if additional information is necessary.

The items used in the different tests of the WRMT-R were initially selected using classical selection methods with the assistance of experts in the field of education (Woodcock, 1998). Once the initial set of items was determined, the Rasch model was also used to ensure that every item met strict statistical criteria before it was included in the battery, thus ensuring that the battery met the then-current requirements of the Standards for Educational and Psychological Tests set out by the American Psychological Association in 1974.
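For readers unfamiliar with the Rasch model mentioned above, a minimal Python sketch follows. It shows only the basic one-parameter logistic form, in which the probability of a correct response depends on the difference between a person’s ability and an item’s difficulty (both expressed in logits). The function name and the example values are hypothetical and are not taken from the manual, whose calibration procedure and item statistics are not described here.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch (one-parameter) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical item difficulties in logits, for illustration only.
for difficulty in (-1.0, 0.0, 1.0):
    p = rasch_probability(ability=0.0, difficulty=difficulty)
    print(f"item difficulty {difficulty:+.1f}: P(correct) = {p:.2f}")
```

When ability equals difficulty the probability is exactly one half, and it rises as the item becomes easier relative to the person. Item screening of the kind the manual describes would typically check how well observed responses fit this curve, although the specific criteria used are not given in the text.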

Technical Information

Standardization

All procedures for the assessment are laid out in detail to ensure standardization of the results (Woodcock, 1998), from the types of acceptable physical settings and the necessary materials to the seating arrangement needed for valid testing. Recommendations on how to develop and maintain a positive rapport with the subject are also given, including reminders to smile, to call the subject by name, and to make positive comments at random intervals without indicating whether responses are correct or incorrect. The test itself is administered using a standing easel kit that has the visual stimuli for the subject printed on one side and the instructions for the administrator and the correct answers printed on the other. Each test section is marked by a tabbed page that also gives a brief description of the test’s purpose and any important key points, such as the starting point and how to determine the subject’s basal (lowest) and ceiling (highest) points. The manual itself explains the testing instructions and procedures in more detail than the stimulus book. Because the battery includes two tests that examine the pronunciation of words, the manual contains a written pronunciation key, the stimulus book contains a more detailed written pronunciation key for complex words, and the kit includes a cassette with an auditory pronunciation guide to ensure that the same standards are applied to all words. There is also a note regarding individuals with a handicap and ESL students, which covers possible allowable accommodations and their effects on the standardization of the results.

The manual includes information to allow individuals administering the WRMT-R/NU to train themselves (Woodcock, 1998). This includes a recommendation to complete multiple practice administrations of the battery with both young and mature subjects until all procedures feel comfortable and natural, together with a checklist of training activities to assist with this. Furthermore, the authors recommend that once the individual has reached this level of competence, the final practice session be observed by someone experienced in administering the reading battery, and the manual includes an observation checklist of all standardized procedures to ensure compliance.

Normative Procedures of the WRMT-R

The normative data for the WRMT-R were collected using a stratified sampling design that controlled for specific variables so that the sample distribution for each age group matched the U.S. population distribution, as determined by multiple indexes, in order to reduce selection bias (Woodcock, 1998). The variables used included geographical region, community size, gender, race, socioeconomic status, and origin (Hispanic or non-Hispanic). The college sample also took into account institution type (public/private) and course type/length, and the adult sample took into account years of education, occupational status, and occupational type. The overall sample consisted of 6089 subjects from 60 communities throughout the U.S.: 4201 subjects in kindergarten to grade 12, 1023 college/university subjects, and 865 adults (aged 20 to 80+ and not in college). An effort was made to oversample groups that are underrepresented in the U.S. population (e.g., Asian-Pacific). For the school-aged population, schools were selected that were considered representative of the community, and students were randomly selected from each group. Furthermore, once the subjects were selected, the authors applied subject weighting to remove any small differences between the sample and U.S. distributions.
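The subject weighting described above can be illustrated with a simple post-stratification sketch in Python. Each subject in a stratum receives a weight equal to the stratum’s share of the population divided by its share of the sample, so that the weighted sample matches the target distribution. The strata and proportions below are invented for illustration; the manual’s actual weighting variables and values are not reproduced here, and its exact procedure may differ.

```python
def poststratification_weights(population_share, sample_counts):
    """Return a per-subject weight for each stratum.

    weight = (stratum's population share) / (stratum's sample share)
    """
    total = sum(sample_counts.values())
    return {
        stratum: population_share[stratum] / (count / total)
        for stratum, count in sample_counts.items()
    }


# Hypothetical strata and figures (not taken from the WRMT-R manual).
population_share = {"urban": 0.50, "suburban": 0.30, "rural": 0.20}
sample_counts = {"urban": 450, "suburban": 350, "rural": 200}

weights = poststratification_weights(population_share, sample_counts)

# The weighted counts reproduce the population proportions.
weighted_counts = {s: weights[s] * n for s, n in sample_counts.items()}
weighted_total = sum(weighted_counts.values())
for stratum, count in weighted_counts.items():
    print(f"{stratum}: weight {weights[stratum]:.3f}, "
          f"weighted share {count / weighted_total:.2f}")
```

Strata that are underrepresented in the sample receive weights above 1 and overrepresented strata receive weights below 1, which is how small mismatches between the sample and the population can be removed after data collection.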

The norms for all samples were collected as continuous-year norms (Woodcock, 1998). Because the normative data were not collected at only one or two time points during the year, the resulting normative tables are a more accurate representation of scores at various points throughout the year and are less affected by error variance. Percentile ranks and standard scores are therefore calculated from the median values at each tenth of a year for grades K to 16.9, each month for ages 5-0 to 18-11, and each year for ages 19 and over. The data for the WRMT-R were also collected in conjunction with the G-F-W Sound Symbol Test (GFW) and the Woodcock-Johnson Psycho-Educational Battery (WJ) for 600 subjects. This allowed the authors to create a common scale for the three tests so that scores on them can be directly compared.

Normative Update – 1998

The WRMT-R was renormed in 1995-1996 at the same time as the KeyMath-Revised, the Peabody Individual Achievement Test-Revised (PIAT-R), and the Comprehensive and Brief forms of the Kaufman Test of Educational Achievement (K-TEA) (Woodcock, 1998). During the project, the subtests of the different batteries were compared and grouped into domains in which the items were judged to measure the same thing. This allowed subtests from different batteries to be normed together, so that the difficulty levels of different stimuli could be compared and scaled and scores on subtests from the same domain could be compared. Some of the WRMT-R tests were normed on their own because their content did not overlap with that of the other batteries. Importantly, none of the stimuli or responses in the batteries were altered during this update; only the norm tables themselves were affected.

The subjects used to establish the new norms included 3184 individuals from kindergarten through grade 12 and 245 young adults aged 18 to 22 (Woodcock, 1998). The subjects were selected using a stratified multistage sampling procedure to ensure that the distribution was representative of the 1994 U.S. Census. At the beginning of the project, coordinators across the country distributed consent forms through the school system that included questions about the variables of interest (date of birth, gender, race, grade, education level of the parents, and level of English). The returned consent forms were then run through a computerized stratified random procedure that selected individuals who would make up a representative sample. Individuals were assigned to complete one of the five batteries and a few subtests from other batteries, with each group being a representative sample. Individuals assigned to the WRMT-R were also randomly assigned to one of the two alternate forms. The same procedure was repeated in the spring with the remaining unselected consent forms and a new set of consent forms. The number of subjects completing each test of the WRMT-R varied greatly, from 721 to 2662 individuals (Woodcock, 1998).

The new norms were developed in a manner similar to that described above for the original WRMT-R (Woodcock, 1998). One difference was that the W values used in the tables were adjusted by a constant so that the median score of individuals at the beginning of grade 5 would be 500. Also of note is that the original norms were retained for individuals above grade 12 and over age 23. To allow this, the project attempted to ensure that the norms that overlap were as close to each other as possible; however, for some values there is a noticeable change. A comparison of the original and updated norms shows few changes in the level of performance for individuals in the average to above-average range of their grade or age group. However, for individuals whose level of performance is below average, there was a noticeable decline in performance on most of the WRMT-R (NU) tests, especially in the elementary and middle-school grades.
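The re-centering of the W scale mentioned above amounts to adding a single constant to every score. The Python sketch below illustrates the idea under the assumption that the constant is simply the difference between 500 and the reference group’s median; the score values and the function name are hypothetical, and the manual’s actual computation is not shown in the text.

```python
from statistics import median


def recenter(scores, reference_scores, target_median=500.0):
    """Shift all scores by one constant so the reference group's median equals the target."""
    shift = target_median - median(reference_scores)
    return [score + shift for score in scores], shift


# Hypothetical scale values for a reference group at the beginning of grade 5.
reference = [478.0, 486.5, 491.0, 495.0, 503.5]

recentered, shift = recenter(reference, reference)
print(f"constant added: {shift}")
print(f"new reference median: {median(recentered)}")
```

Because the same constant is added to every score, the distances between scores are unchanged; only the anchor point of the scale moves.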

Test Reliability

Internal consistency, or the degree to which items measure the same construct, was calculated using the split-half procedure, in which raw scores on the odd- and even-numbered items were compared (Woodcock, 1998). The authors provided values for all tests and cluster scores across seven different age groups. They reported reliability coefficients of .68 to .99 and standard errors of measurement (SEM) of 2.0 to 6.7 W scale units for either the G or H form of the test (Woodcock, 1998). Increased precision can be achieved by combining Forms G and H into one test, for which the reported reliability coefficients were .81 to .99 and the SEM values were 1.5 to 5.1 W scale units (Woodcock, 1998).
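As a rough illustration of the split-half procedure and the SEM described above, the following Python sketch correlates odd-item and even-item half scores, applies the Spearman-Brown correction to estimate full-length reliability, and computes the SEM as SD × sqrt(1 − r). The Spearman-Brown step is the conventional correction for split-half estimates and is assumed here rather than taken from the text above; the item responses and function names are invented for illustration.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_half_reliability(item_matrix):
    """Odd/even split-half reliability with the Spearman-Brown correction."""
    odd = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)


def standard_error_of_measurement(total_scores, reliability):
    """SEM = SD * sqrt(1 - reliability), in the units of the total score."""
    return pstdev(total_scores) * math.sqrt(1 - reliability)


# Hypothetical responses: one row per subject, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

totals = [sum(row) for row in responses]
reliability = split_half_reliability(responses)
sem = standard_error_of_measurement(totals, reliability)
print(f"split-half reliability: {reliability:.2f}, SEM: {sem:.2f} raw-score points")
```

The same relationship explains why combining Forms G and H lowers the SEM: a longer test yields a higher reliability coefficient, and the SEM shrinks as reliability approaches 1. Note that this sketch reports the SEM in raw-score points, whereas the manual reports it in W scale units.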

Validity

Content validity is the degree to which the items in a test measure the specific construct or trait they are meant to measure (Thorndike & Thorndike-Christ, 2010; Woodcock, 1998). The manual indicated that a review of the items included in the WRMT-R tests, in terms of sequencing and the scope of the questions asked, will show that they cover all important factors of the trait they are attempting to measure (Woodcock, 1998). It stated that because the items were developed with the assistance of outside experts, including teachers and curriculum specialists, they would be comprehensive in terms of difficulty level and content. However, the only direct evidence the authors provided was a series of charts illustrating a criterion-referenced scale. These charts show the distribution of a sample of items by difficulty level for different grades, with 96%, 75%, and 50% of individuals able to answer the item in question correctly. In addition, because the items themselves are open-ended questions, they closely parallel situations that occur in everyday reading.

Evidence of the concurrent validity of the WRMT-R was provided in the form of data showing how closely its results corresponded to those of similar tests. The first table showed how the Woodcock-Johnson Reading Tests subtests correlated with the different tests of the WRMT-R, with correlations ranging from .25 to .91 depending on the specific grade and subtests being compared (Woodcock, 1998). Additional tables provide the following correlation values for the WRMT-R Total Reading score: Iowa Tests of Basic Skills (.78 to .83), Iowa Tests of Educational Development (.79), PIAT Reading (.87 to .87), WJ Reading Achievement (.87 to .92), and the WRAT reading tests (.86 to .92) at different grade levels (grades 3, 5, and 12) (Woodcock, 1998).

The WRMT-R manual also provided tables indicating the degree of content overlap between the different tests at different grade levels (Woodcock, 1998). The tables indicated that some tests measured substantially different traits, with correlations as low as .11, whereas others overlapped considerably in content, with correlations as high as .98 (Woodcock, 1998).

Critical Analysis

Strengths

The WRMT-R has many positive traits. The administration and scoring instructions are concise yet well explained. The battery takes little time to administer, with each test taking approximately 5-10 minutes and the entire battery taking 40-45 minutes (Woodcock, 1998). The time required for scoring and interpretation depends on the level of interpretive information needed, which ranges from lower levels that give information about an individual’s performance on different skills to the highest level, which provides information about their performance compared to a reference group (Woodcock, 1998). Furthermore, because the WRMT-R is considered a Level B assessment, any individual trained in the ethics, administration, and scoring of assessments can access it for use. This means that a greater number of professionals can access and administer the assessment. Combined with the reduced time commitment for administration and scoring at lower levels of analysis, this makes the WRMT-R an asset for less significant educational decisions, such as minor instructional decisions, and it may even be used in conjunction with other information to highlight individuals who may benefit from a more extensive assessment. Finally, a variety of normative score options are available, including the more common standard scores and percentile ranks.

Limitations

However, the WRMT-R(NU) also has some limitations. One major factor to note is that the norms used for this version of the assessment are quite old: the normative update values were published in 1998, and the WRMT-R norms used for older individuals date from 1987. This is highlighted by the fact that an updated version (WRMT-III) was published in 2011. In addition, the outdated audio cassette for the pronunciation guide and the hand-calculated score sheets are unattractive compared with more recent assessments.

An important area of concern is that little detail or evidence is provided regarding the reliability and validity of the assessment. The little information that the manual does provide comes from the WRMT-R version, with no further data on the updated normative values introduced in this version of the assessment and no comparisons between the two. Woodcock (1998) indicated that all the technical information from the WRMT-R remained valid for the WRMT-R(NU). For the reliability of the WRMT-R, the only evidence provided was internal consistency values from the split-half procedure (Woodcock, 1998). This is unusual, since the WRMT-R has a parallel form but no comparisons between the two forms are reported. There is also no indication of the reliability or stability of an individual’s score over time. In addition, no true evidence is provided regarding the validity of the assessment. Woodcock indicates that a thorough examination of the items, together with the fact that “outside experts” were used to develop them, is sufficient evidence (Woodcock, 1998). Another area of concern is whether the norms may contain racial bias, as both sets of samples consisted primarily of white Americans, who made up 82.3% of the WRMT-R sample and 65-83% of the WRMT-R (NU) sample.

A paper that analyzed differences between the normative values of the WRMT-R and the WRMT-R/NU found that, when scores from each set of norms were compared with subtests of the Wide Range Achievement Test (WRAT) and the Kaufman Brief Intelligence Test (K-BIT), their correlations with these tests were very similar (within .01 to .02), indicating little change in how their results related to those tests (Pae et al., 2005). However, the study found consistent differences between the WRMT-R and WRMT-R/NU standard scores for each subtest, and scatterplots of the scores indicated a greater difference for individuals with below-average performance than for individuals with average or above-average performance (Pae et al., 2005). This finding was also noted in the manual, with the WRMT-R/NU scores being higher (Woodcock, 1998). The problem is that this increase in standard scores meant that the number of individuals who fell within the range for a reading disability designation, as determined by a combination of scores from the K-BIT and WRMT-R, decreased by 19% when the standard scores from the WRMT-R/NU norms were used instead. These individuals could therefore lose access to special education services if the WRMT-R/NU norms are used (Pae et al., 2005).

Uses of the Test

According to the WRMT-R manual, the comprehensive nature of the test and the availability of normative values for a large age range allow its use in many applications, including clinical assessment, the diagnosis of students with reading difficulties, and assistance in the development of an appropriate Individual Education Plan (IEP) (Woodcock, 1998). This is supported to a certain extent by an independent study that found the Full-Scale Reading score of the WRMT-R to be a valid measure of the general reading construct, accounting for 85.23% of the variance in the scores of special education students (Ronald & Thomas, 2006). However, the same authors found that, of the two separate cluster factors, one explained only 3.65% of the variance and the other 81.58%, and that the two factors were closely correlated (.83), which indicates that they measure similar constructs (Ronald & Thomas, 2006).

Conclusion

Because the WRMT-R/NU tests only a small number of the factors related to the construct of reading ability, let alone more general learning disabilities, it should not be used alone for the diagnosis of an individual. Instead, the WRMT-R/NU should be used as part of a larger combination of assessments if it is to inform high-stakes decisions that may significantly affect an individual’s prospects. A better use of the assessment may be early testing by teachers or aides to determine a student’s specific areas of weakness in reading. If this is done, the information about performance and about where the individual made errors can be used to tailor a learning plan in the classroom. Because the test is inexpensive in terms of time, if such a plan fails to improve classroom performance, the results could also be used as one piece of evidence that the child may qualify for or require a more in-depth psycho-educational assessment. Therefore, while the Woodcock Reading Mastery Tests offer a quick and efficient way to determine basic reading strengths and weaknesses, they should only be used on their own for low-stakes decisions such as classroom learning, and as evidence combined with other, more detailed assessments for decisions that have a greater long-term effect on the individual.

References

Pae, H. K., Wise, J. C., Cirino, P. T., Sevcik, R. A., Lovett, M. W., Wolf, M., & Morris, R. D. (2005). The Woodcock reading mastery test: Impact of normative changes. Assessment, 12(3), 347–357. https://doi.org/10.1177/1073191105277006

Ronald, C., & Thomas, O. (2006). Exploratory and confirmatory factor analyses of the Woodcock Reading Mastery Tests-Revised with special education students. Psychology in the Schools, 38(6), 561–567. https://doi.org/10.1002/pits.1043

Thorndike, R. M., & Thorndike-Christ, T. (2010). Measurement and evaluation in psychology and education (8th ed.). Boston, MA: Pearson Education.

Woodcock, R. W. (1998). Woodcock Reading Mastery Tests – Revised NU: Examiner’s Manual. Minneapolis, MN: Pearson Assessments.