CRITERIA OF A GOOD EXAMINATION


       A good examination must pass the following criteria:

 

Validity

       Validity refers to the degree to which a test measures what it is intended to measure. It is the usefulness of the test for a given measure. A valid test is always reliable. To establish the validity of a test, it must be pretested in order to determine if it really measures what it intends to measure or what it purports to measure.

 

 

Reliability

       Reliability pertains to the consistency with which a test measures whatever it is supposed to measure. The test of reliability is the consistency of the results when the test is administered to different groups of individuals with similar characteristics in different places at different times. Also, the results are almost the same when the test is given to the same group of individuals on different days, and the coefficient of correlation is not less than 0.85.

 

Objectivity

       Objectivity is the degree to which personal bias is eliminated in the scoring of the answers. When we refer to the quality of measurement, essentially, we mean the amount of information contained in a score generated by the measurement. Measures of student instructional outcomes are rarely as precise as those of physical characteristics such as height and weight.


Student outcomes are more difficult to define, and the units of measurement are usually not physical units. The measures we take of students vary in quality, which prompts the need for different scales of measurement. The terms that describe the levels of measurement in these scales are nominal, ordinal, interval, and ratio.


       Measurements may differ in the amount of information the numbers contain. These differences are distinguished by the terms nominal, ordinal, interval and ratio scales of measurement.


       The terms nominal, ordinal, interval, and ratio actually form a hierarchy. Nominal scales of measurement are the least sophisticated and contain the least information. Ordinal, interval, and ratio scales increase respectively in sophistication.


The arrangement is a hierarchy because each higher level contains all the information of the levels below it, along with additional information. For example, numbers from an interval scale of measurement contain all of the information that nominal and ordinal scales would provide, plus some additional information.


However, a ratio scale of the same attribute would contain even more information than the interval scale. This idea will become clearer as each scale of measurement is described.


Nominal Measurement

       Nominal scales are the least sophisticated; they merely classify objects or events by assigning numbers to them. These numbers are arbitrary and imply no quantification, but the categories must be mutually exclusive and exhaustive. For example, one could nominally designate baseball positions by assigning the pitcher the numeral 1; the catcher, 2; the first baseman, 3; the second baseman, 4; and so on. These assignments are arbitrary; no arithmetic of these numbers is meaningful. For example, 1 plus 2 does not equal 3, because a pitcher plus a catcher does not equal a first baseman.
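The point about arbitrary codes can be sketched in a few lines (a Python illustration of the baseball example above; the code values are the ones assigned in the text):

```python
# Nominal scale: numbers are arbitrary labels for mutually exclusive,
# exhaustive categories. Arithmetic on the labels is meaningless.
positions = {1: "pitcher", 2: "catcher", 3: "first baseman", 4: "second baseman"}

# The codes classify, nothing more: label 1 plus label 2 has no meaning,
# because a pitcher plus a catcher does not equal a first baseman.
print(positions[1], "+", positions[2], "is not", positions[1 + 2])
```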


Ordinal Measurement

       Ordinal scales classify, but they also assign rank order. An example of ordinal measurement is ranking individuals in a class according to their test scores. Students' scores could be ordered from the first, second, third, and so forth down to the lowest score.


Such a scale gives more information than nominal measurement, but it still has limitations. The units of ordinal measurement are most likely unequal. The number of points separating the first and second students probably does not equal the number separating the fifth and sixth students. These unequal units of measurement are analogous to a ruler in which some inches are longer than others. Addition and subtraction of such units yield meaningless numbers.
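The unequal-units problem is easy to see with a small example (a Python sketch using hypothetical test scores; the specific numbers are invented for illustration):

```python
# Ordinal scale: hypothetical test scores ranked from highest to lowest.
scores = sorted([70, 95, 87, 50, 88, 69], reverse=True)  # [95, 88, 87, 70, 69, 50]

# Rank 1 is the highest score, rank 2 the next highest, and so on.
ranks = {rank: score for rank, score in enumerate(scores, start=1)}

# The units between adjacent ranks are unequal: rank order alone does not
# tell us how far apart two students really are.
gap_1_2 = ranks[1] - ranks[2]  # 95 - 88 = 7 points
gap_2_3 = ranks[2] - ranks[3]  # 88 - 87 = 1 point
print(gap_1_2, gap_2_3)
```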

 

Interval Measurement

 

       In order to be able to add and subtract scores, we use interval scales, sometimes called equal interval or equal unit measurement. This measurement scale contains the nominal and ordinal properties and is also characterized by equal units between score points.

Examples include thermometers and calendar years. For instance, the difference in temperature between 10º and 20º is the same as that between 47º and 57º. Likewise, the difference in length of time between 1946 and 1948 equals that between 1973 and 1975. These measures are defined in terms of physical properties such that the intervals are equal.

For example, a year is the time it takes for the earth to orbit the sun. The advantage of equal units of measurement is straightforward: sums and differences now make sense, both numerically and logically. Note, however, that the zero point in interval measurement is really an arbitrary decision; for example, 0º does not mean that there is no temperature.
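The interval-scale properties above can be checked with simple arithmetic (a Python sketch using the temperatures and years from the text):

```python
# Interval scale: equal units but an arbitrary zero point.
# Differences are meaningful; ratios are not.

# Equal temperature intervals (in degrees):
assert (20 - 10) == (57 - 47)

# Equal spans of calendar years:
assert (1948 - 1946) == (1975 - 1973)

# Because 0 degrees is an arbitrary zero, not "no temperature",
# it does not follow that 20 degrees is "twice as hot" as 10 degrees.
print("differences are comparable; ratios are not")
```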

Ratio Measurement

       The most sophisticated type of measurement includes all the preceding properties, but in a ratio scale, the zero point is not arbitrary; a score of zero indicates the absence of what is being measured. For example, if a person’s wealth equaled zero, he or she would have no wealth at all.

This is unlike a social studies test, where missing every item (i.e., receiving a score of zero) may not indicate the complete absence of social studies knowledge.

Ratio measurement is rarely achieved in educational assessment, in either cognitive or affective areas. The desirability of ratio measurement scales is that they allow ratio comparisons, such as Ann being 1-1/2 times as tall as her little sister, Mary. We can seldom say that one person’s intelligence or achievement is 1-1/2 times as great as that of another person.

An IQ of 120 may be 1-1/2 times as great numerically as an IQ of 80, but a person with an IQ of 120 is not 1-1/2 times as intelligent as a person with an IQ of 80.

       Note that carefully designed tests over a specified domain of possible items can approach ratio measurement. For example, consider an objective concerning multiplication facts for pairs of numbers less than 10. In all, there are 45 such combinations.

However, the teacher might randomly select 5 or 10 test problems to give a particular student. Then, the proportion of items that the student gets correct could be used to estimate how many of the 45 possible items the student has mastered. If the student answers 4 or 5 of 5 items correctly, it is legitimate to estimate that the student would get 36 to 45 items correct if all 45 items were administered.


This is possible because the set of possible items was specifically defined in the objective, and the test items were a random, representative sample from that set. Most educational measurements are better than strictly nominal or ordinal measures, but few can meet the rigorous requirements of interval measurement.
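The proportional estimate described above amounts to a one-line calculation (a Python sketch; the function name is invented for illustration, and the 45-item domain and 5-item sample are the ones from the text):

```python
# Estimating domain mastery from a random sample of items.
# Domain: multiplication facts for pairs of numbers less than 10 (45 in all).
def estimate_mastered(num_correct, sample_size=5, domain_size=45):
    """Estimate how many of the 45 facts the student has mastered,
    using the proportion correct on the sampled items."""
    return round(num_correct / sample_size * domain_size)

print(estimate_mastered(4))  # 4 of 5 correct -> estimate 36 of 45 mastered
print(estimate_mastered(5))  # 5 of 5 correct -> estimate 45 of 45 mastered
```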

Educational testing usually falls somewhere between ordinal and interval scales in sophistication. Fortunately, empirical studies have shown that arithmetic operations on these scales are appropriate, and the scores do provide adequate information for most decisions about students and instruction. Also, as we will see later, certain procedures can be applied to scores with reasonable confidence.


Norm-Referenced and Criterion-Referenced Measurement

 

       When we contrast norm-referenced measurement (or testing) with criterion-referenced measurement, we are basically referring to two different ways of interpreting information. However, Popham (1988, p. 135) points out that certain characteristics tend to go with each type of measurement, and it is unlikely that results of norm-referenced tests will be interpreted in criterion-referenced ways, and vice versa.


       Norm-referenced interpretation historically has been used in education; norm-referenced tests continue to comprise a substantial portion of the measurement in today’s schools. The terminology of criterion-referenced measurement has existed for close to three decades, having been formally introduced in Glaser’s (1963) classic article.

Over the years, there has been occasional confusion about the terminology and how criterion-referenced measurement applies in the classroom. Do not infer that just because a test is published, it will necessarily be norm-referenced, or that if it is teacher-constructed, it will be criterion-referenced. Again, we emphasize that the type of measurement or testing depends on how the scores are interpreted. Both types can be used by the teacher.

 

Norm-Referenced Interpretation

 

Norm-referenced interpretation stems from the desire to differentiate among individuals or to discriminate among the individuals of some defined group on whatever is being measured. In norm-referenced measurement, an individual’s score is interpreted by comparing it to the scores of a defined group, often called the normative group.

 

       Norm-referenced interpretation is a relative interpretation based on an individual’s position with respect to some group, often called the normative group. Norms consist of the scores, usually in some form of descriptive statistics, of the normative group.

 

       In norm-referenced interpretation, the individual’s position in the normative group is of concern; thus, this kind of positioning does not specify performance in absolute terms. The norm being used is the basis of comparison, and the individual's score is designated by its position in the normative group.

 

Achievement Tests as an Example. Most standardized achievement tests, especially those covering several skills and academic areas, are primarily designed for norm-referenced interpretation. However, the form of results and the interpretations of these tests are somewhat complex and require concepts not yet introduced in this text.

Scores on teacher-constructed tests are often given norm-referenced interpretations. Grading on the curve, for example, is a norm-referenced interpretation of test scores on some type of performance measure. Specified percentages of scores are assigned the different grades, and an individual's score is positioned in the distribution of scores.


       Suppose an algebra teacher has a total of 150 students in five classes, and the classes have a common final examination. The teacher decides that the distribution of letter grades assigned to the final examination performance will be 10 percent As, 20 percent Bs, 40 percent Cs, 20 percent Ds, and 10 percent Fs. (Note that the final examination grade is not necessarily the course grade.)


Since the grading is based on all 150 scores, do not assume that 3 students in each class will receive As on the final examination.

James receives a score on the final exam such that 21 students have higher scores and 128 students have lower scores. What will James's letter grade be on the exam? The top 15 scores will receive As, and the next 30 scores (20 percent of 150) will receive Bs. Counting from the top score down, James's score is positioned 22nd, so he will receive a B on the final examination.

 

Note that in this interpretation example, we did not specify James's actual numerical score on the exam. That score would have been necessary in order to determine that his position was 22nd in the group of 150 scores. But the interpretation of the score was based strictly on its position in the total group of scores.
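The curve computation in the James example can be written out directly (a Python sketch; the function name is invented, and the 10-20-40-20-10 split and class size of 150 are the ones from the text):

```python
# Grading on the curve: a norm-referenced interpretation that assigns a
# letter grade purely from a score's rank position in the group.
def curve_grade(rank, n=150,
                split=(("A", 0.10), ("B", 0.20), ("C", 0.40),
                       ("D", 0.20), ("F", 0.10))):
    """Return the letter grade for the given rank (1 = highest score)."""
    cutoff = 0
    for letter, fraction in split:
        cutoff += int(fraction * n)  # cumulative cutoffs: 15, 45, 105, 135, 150
        if rank <= cutoff:
            return letter
    return split[-1][0]

# James: 21 students scored higher, so his rank is 22nd -> a B.
print(curve_grade(22))
```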

 

Criterion-Referenced Interpretation

       The concepts of criterion-referenced testing have developed with a dual meaning for criterion-referenced. On one hand, it means referencing an individual’s performance to some criterion that is a defined performance level. The individual’s score is interpreted in absolute rather than relative terms. The criterion, in this situation, means some specified level of performance that has been determined independently of how others might perform.


       A second meaning for criterion-referenced involves the idea of a defined behavioral domain—that is, a defined body of learner behaviors. The learner’s performance on a test is referenced to a specifically defined group of behaviors. The criterion in this situation is the desired behaviors.


       Criterion-referenced interpretation is an absolute rather than relative interpretation, referenced to a defined body of learner behaviors, or, as is commonly done, to some specified level of performance.


Criterion-referenced tests require the specification of learner behaviors prior to constructing the test. The behaviors should be readily identifiable from instructional objectives. Criterion-referenced tests tend to focus on specific learner behaviors, and usually only a limited number are covered on any one test.

 

       Suppose before the test is administered, an 80-percent-correct criterion is established as the minimum performance required for mastery of each objective. A student who does not attain the criterion has not mastered the skill sufficiently to move ahead in the instructional sequence. To a large extent, the criterion is based on teacher judgment. No magical, universal criterion for mastery exists, although some curriculum materials that contain criterion-referenced tests do suggest criteria for mastery.

Also, unless the objectives are appropriate and the criterion for achievement relevant, there is little meaning in the attainment of the criterion, regardless of what it is.
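A criterion-referenced decision of this kind compares a student's performance to the absolute criterion, not to other students (a Python sketch; the function name is invented, and the 80-percent level is the teacher-set criterion from the text):

```python
# Criterion-referenced interpretation: mastery is judged against an
# absolute, teacher-set criterion, independent of how others perform.
MASTERY_CRITERION = 0.80  # 80 percent correct, set by teacher judgment

def has_mastered(num_correct, num_items, criterion=MASTERY_CRITERION):
    """Return True if the proportion correct meets the mastery criterion."""
    return num_correct / num_items >= criterion

print(has_mastered(8, 10))  # meets the criterion: move ahead in the sequence
print(has_mastered(7, 10))  # below the criterion: more instruction needed
```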


Distinctions between Norm-Referenced and Criterion-Referenced Tests

 

       Although interpretations, not characteristics, provide the distinction between norm-referenced and criterion-referenced tests, the two types do tend to differ in some ways. Norm-referenced tests are usually more general and comprehensive and cover a large domain of content and learning tasks. They are used for survey testing, although this is not their exclusive use.

       Criterion-referenced tests focus on a specific group of learner behaviors. To show the contrast, consider an example. Arithmetic skills represent a general and broad category of student outcomes and would likely be measured by a norm-referenced test. On the other hand, behaviors such as solving addition problems with two five-digit numbers or determining the multiplication products of three- and four-digit numbers are much more specific and may be measured by criterion-referenced tests.


       Criterion-referenced tests tend to focus more on subskills than on broad skills.

Thus, criterion-referenced tests tend to be shorter. If mastery learning is involved, criterion-referenced measurement would be used.


       Norm-referenced test scores are transformed to positions within the normative group. Criterion-referenced test scores are usually given as the percentage of correct answers or another indicator of mastery or the lack thereof. Criterion-referenced tests tend to lend themselves more to individualizing instruction than do norm-referenced tests. In individualizing instruction, a student’s performance is interpreted more appropriately by comparison to the desired behaviors for that particular student, rather than by comparison with the performance of a group.


       Norm-referenced test items tend to be of average difficulty. Criterion-referenced tests have item difficulty matched to the learning tasks. This distinction in item difficulty is necessary because norm-referenced tests emphasize discrimination among individuals and criterion-referenced tests emphasize the description of performance.

Easy items, for example, do little for discriminating among individuals, but they may be necessary for describing performance.


       Finally, when measuring attitudes, interests, and aptitudes, it is practically impossible to interpret the results without comparing them to a reference group.

The reference groups in such cases are usually typical students or students with high interests in certain areas. Teachers have no basis for anticipating these kinds of scores; therefore, in order to ascribe meaning to such a score, a referent group must be used.

For instance, a score of 80 on an interest inventory has no meaning in itself. On the other hand, if a score of 80 is the typical response of a group interested in mechanical areas, the score takes on meaning.

