3.1 VALIDITY
A test is said to be valid if it measures accurately
what it is intended to measure. This seems simple enough. When closely
examined, however, the concept of validity reveals a number of aspects, each of
which deserves our attention.
3.1.1 Content validity
A test is said to have content validity if
its content constitutes a representative sample of the language skills,
structures, etc. with which it is meant to be concerned. It is obvious that a
grammar test, for instance, must be made up of items testing knowledge or
control of grammar. But this in itself does not ensure content
validity. The test would have content validity only if it included a
proper sample of the relevant structures. Just which structures are relevant
will depend, of course, upon the purpose of the test. We would not
expect an achievement test for intermediate learners to contain just
the same set of structures as one for advanced learners. In order to
judge whether or not a test has content validity, we need
a specification of the skills or structures, etc. that it is meant to cover. Such a
specification should be made at a very early stage in
test construction. It is not to be expected that everything in the
specification will always appear in the test; there may simply be too many
things for all of them to appear in a single test.
What is the importance of content validity? First, the
greater a test’s content validity, the more likely it is to be an accurate
measure of what it is supposed to measure. Secondly, a test that lacks content
validity is likely to have a harmful backwash effect: areas which are not tested
are likely to become areas ignored in teaching and learning. Too often the content of
tests is determined by what is easiest to test rather than by what is important to
test; the best safeguard against this is to write full test specifications
and to ensure that the test content is a fair reflection of these.
The effectiveness of a content validation strategy can
be enhanced by making sure that the experts who judge the test content are truly
experts in the appropriate field and that they have adequate and appropriate tools,
in the form of rating scales, so that their judgments can be sound and focused. However,
testers should never rest on their laurels. Once they have established that a
test has adequate content validity, they must immediately explore other kinds
of validity of the test in terms related to the specific performances of the
types of students for whom the test was designed in the first place.
3.1.2 Criterion-related validity / Empirical validity
There are essentially two kinds of criterion-related
validity: concurrent validity and predictive validity. Concurrent validity is
established when the test and the criterion are administered at about the same
time. To exemplify this kind of validation in achievement testing, let us
consider a situation where course objectives call for an oral component as part
of the final achievement test. The objectives may list a large number of
‘functions’ which students are expected to perform orally, testing all of which
might take 45 minutes for each student. This could well be impractical. In such a
case we might settle for a much shorter oral test, establishing its concurrent
validity by comparing candidates’ scores on it with their scores on the full
45-minute test, which serves as the criterion.
The second kind of criterion-related validity is
predictive validity. This concerns the degree to which a test can predict
candidates’ future performance. An example would be how well a proficiency test
could predict a student’s ability to cope with a graduate course at a British
university. The criterion measure here might be an assessment of the student’s
English as perceived by his or her supervisor at the university, or it
could be the outcome of the course (pass/fail, etc.).
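In practice, criterion-related validity is usually reported as a validity coefficient: the correlation between scores on the test and scores on the criterion measure. The sketch below shows one way such a coefficient might be computed; the score lists are invented purely for illustration.

```python
# Sketch: a criterion-related validity coefficient computed as the Pearson
# correlation between test scores and criterion scores.
# The score lists below are invented purely for illustration.

from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Proficiency test scores and, say, later supervisor ratings (the criterion).
test_scores = [52, 61, 70, 45, 66, 58, 73, 49]
criterion   = [55, 64, 75, 40, 60, 62, 78, 47]

print(f"Predictive validity coefficient: {pearson(test_scores, criterion):.2f}")
```

The closer the coefficient is to 1.0, the more closely performance on the test mirrors performance on the criterion.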
3.1.3 Construct validity
A test, part of a test, or a testing technique is said
to have construct validity if it can be demonstrated that it measures just the
ability which it is supposed to measure. The word ‘construct’ refers to any underlying
ability (or trait) which is hypothesized in a theory of language ability. One
might hypothesize, for example, that the ability to read involves a number of
sub-abilities, such as the ability to guess the meaning of unknown words from
the context in which they are met. It would be a matter of empirical research to
establish whether or not such a distinct ability existed and could be measured.
If we attempted to measure that ability in a particular test, then that part of
the test would have construct validity only if we were able to
demonstrate that we were indeed measuring just that ability.
Construct validity is the most important form of
validity because it asks the fundamental validity question: what is this test
really measuring? We have seen that all variables derive from constructs and
that constructs are nonobservable traits, such as intelligence, anxiety, and
honesty, "invented" to explain behavior. Constructs underlie the variables that
researchers measure. You cannot see a construct; you can only observe its
effect. "Why does one person act this way and another a different way?
Because one is intelligent and one is not – or one is dishonest and the other
is not.” We cannot prove that constructs exist, just as we cannot perform brain
surgery on a person to “see” his or her intelligence, anxiety, or honesty.
3.1.4 Face validity
A test is said to have face validity if it looks as if
it measures what it is supposed to measure. For example, a test which pretended
to measure pronunciation ability but which did not require
the candidate to speak (and there have been some) might be thought to lack face
validity. This would be true even if the test’s construct and criterion-related
validity could be demonstrated. Face validity is hardly a scientific concept,
yet it is very important. A test which does not have face validity may not be
accepted by candidates, teachers, education authorities or employers. It may
simply not be used; and if it is used, the candidates’ reaction to it may mean
that they do not perform on it in a way that truly reflects their ability.
3.1.5 The use of validity
What use is the reader to make of the notion of
validity? First, every effort should be made in constructing tests to ensure
content validity. Where possible, the tests should be validated empirically
against some criterion. Particularly where it is intended to use indirect
testing, reference should be made to the research literature to
confirm that measurement of the relevant underlying constructs has
been demonstrated using the testing techniques that are to be used.
3.2 RELIABILITY
Reliability is a necessary characteristic of any good
test: for it to be valid at all, a test must first be reliable as a measuring
instrument. If a test is administered to the same candidates on
different occasions (with no language practice or learning taking place
between those occasions), then, to the extent that it produces differing
results, it is not reliable. Reliability measured in this way is commonly referred
to as test/re-test reliability to distinguish it from mark/re-mark
reliability. In short, in order to be reliable, a test must be consistent in
its measurements.
Factors affecting the reliability of a test are:
- the extent of the sample of material selected for
testing: whereas validity is concerned chiefly with the content of the
sample, reliability is concerned with its size. The larger the sample (i.e.
the more tasks the testees have to perform), the greater the
probability that the test as a whole is reliable; hence the favoring of
objective tests, which allow a wide field to be covered.
- the administration of the test: is the same test administered to
different groups under different conditions or at different times?
Clearly, this is an important factor in deciding reliability, especially
in tests of oral production and listening comprehension.
One method of measuring the reliability of a test is
to re-administer the same test after a lapse of time. It is assumed that all
candidates have been treated in the same way in the interval – that they have
either all been taught or that none of them have.
Another means of estimating the reliability of a test
is by administering parallel forms of the test to the same group. This assumes
that two similar versions of a particular test can be constructed; such tests
must be identical in the nature of their sampling, difficulty, length, rubrics,
etc.
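Whichever approach is used, test/re-test or parallel forms, the result is conventionally expressed as a reliability coefficient: the correlation between the two sets of scores. A minimal sketch, assuming Python 3.10+ for statistics.correlation and using invented scores:

```python
# Sketch: estimating test/re-test (or parallel-forms) reliability as the
# correlation between two sets of scores from the same candidates.
# Requires Python 3.10+ for statistics.correlation; the scores are invented.

from statistics import correlation

first_administration  = [34, 47, 52, 29, 61, 44, 38, 55]
second_administration = [36, 45, 50, 31, 59, 47, 35, 57]

reliability = correlation(first_administration, second_administration)
print(f"Reliability coefficient: {reliability:.2f}")  # values near 1.0 indicate consistency
```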
3.2.1 How to make tests more reliable
As we have seen, there are two components of test
reliability: the performance of candidates from occasion to occasion, and the
reliability of the scoring.
Take enough samples of behavior. Other
things being equal, the more items that you have on a test, the more reliable that
test will be. This seems intuitively right. While it is important to make a test
long enough to achieve satisfactory reliability, it should not be made so long
that the candidates become so bored or tired that the behavior that they
exhibit becomes unrepresentative of their ability. At the same time,
it may often be necessary to resist pressure to make a test shorter than is
appropriate. The usual argument for shortening a test is that it is not
practical.
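Although the text gives no formula, the relationship between test length and reliability is often estimated with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened or shortened by a factor n. The starting reliability of 0.70 below is an invented figure, used purely for illustration.

```python
# Sketch: the Spearman-Brown prophecy formula, which predicts how reliability
# changes when a test is lengthened (or shortened) by a factor n.
# The starting reliability of 0.70 is an invented illustrative figure.

def spearman_brown(reliability: float, n: float) -> float:
    """Predicted reliability when the number of items is multiplied by n."""
    return (n * reliability) / (1 + (n - 1) * reliability)

current = 0.70
print(f"Doubled in length: {spearman_brown(current, 2):.2f}")    # ~0.82
print(f"Halved in length:  {spearman_brown(current, 0.5):.2f}")  # ~0.54
```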
Do not allow candidates too much
freedom. In some kinds of language test there is a tendency to
offer candidates a choice of questions and then to allow them a great deal of
freedom in the way that they answer the ones that they have chosen. Such a
procedure is likely to have a depressing effect on the reliability
of the test. The more freedom that is given, the greater is likely to be the
difference between the performance actually elicited and the performance that
would have been elicited on another occasion.
Write unambiguous items. It
is essential that candidates should not be presented with items whose meaning
is not clear or to which there is an acceptable answer which the
test writer has not anticipated.
Provide clear and explicit instructions. This
applies both to written and oral instructions. If it is possible for candidates to
misinterpret what they are asked to do, then on some occasions some of them
certainly will. Test writers should not rely on the students’ powers of
telepathy to elicit the desired behavior.
Ensure that tests are well laid out and perfectly
legible. Too often, institutional tests are badly typed (or
handwritten), have too much text in too small a space, and are poorly
reproduced. As a result, students are faced with additional tasks which
are not ones meant to measure their language ability.
Their variable performance on the unwanted tasks will lower the
reliability of a test.
Candidates should be familiar with format and testing
techniques. If any aspect of a test is
unfamiliar to candidates, they are likely to perform less well than they would do
otherwise (on subsequently taking a parallel version, for example). For this
reason, every effort must be made to ensure that all candidates have the
opportunity to learn just what will be required of them.
Provide uniform and non-distracting conditions of
administration. The greater the differences
between one administration of a test and another, the greater the differences
one can expect between a candidate’s performance on two occasions. Great care
should be taken to ensure uniformity.
Use items that permit scoring which is as objective as
possible. This may appear to be a
recommendation to use multiple choice items, which permit completely
objective scoring. An alternative to multiple choice is the open-ended item which
has a unique, possibly one-word, correct response which the candidates produce
themselves. This too should ensure objective scoring, but in fact problems with such
matters as spelling which makes a candidate’s meaning unclear often make
demands on the scorer’s judgment. The longer the required response, the greater
the difficulties of this kind.
Make comparisons between candidates as direct as
possible. This reinforces the suggestion already made that
candidates should not be given a choice of items and that they should be
limited in the way that they are allowed to respond. Scoring the compositions
all on one topic will be more reliable than if the candidates are allowed to
choose from six topics, as has been the case in some well-known tests. The
scoring should be all the more reliable if the compositions are guided, in line
with the earlier advice not to allow candidates too much freedom.
Provide a detailed scoring key. This
should specify acceptable answers and assign points for partially correct
responses. For high scorer reliability, the key should be as detailed as possible in
its assignment of points.
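As a rough illustration of how detailed such a key might be, here is a minimal sketch in which the acceptable answers and their point values (all invented) are made explicit so that any scorer would assign the same marks.

```python
# Sketch: a detailed scoring key specifying acceptable answers and
# partial-credit points. Items, answers, and point values are invented.

scoring_key = {
    "item_1": {"has been living": 2, "has lived": 2, "is living": 1},
    "item_2": {"unless": 2, "if not": 1},
}

def score(item: str, response: str) -> int:
    """Return the points for a response, or 0 if it is not in the key."""
    return scoring_key.get(item, {}).get(response.strip().lower(), 0)

print(score("item_1", "has lived"))  # 2
print(score("item_2", "if not"))     # 1 (partially correct)
print(score("item_2", "because"))    # 0 (not an acceptable answer)
```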
Train scorers. This
is especially important where scoring is most subjective. The scoring of
compositions, for example, should not be assigned to anyone who has
not learned to score accurately using compositions from past administrations. After
each administration, patterns of scoring should be analyzed. Individuals whose
scoring deviates markedly and inconsistently from the norm should not be used
again.
Identify candidates by number, not name. Scorers
inevitably have expectations of candidates that they know. Except in purely
objective testing, this will affect the way that they score. Studies have shown
that even where the candidates are unknown to the scorers, the name on a script
(or a photograph) will make a significant difference to the scores
given. For example, a scorer may be influenced by the gender or nationality of
a name into making predictions which can affect the score given. The
identification of candidates only by number will reduce such effects.
Employ multiple, independent
scoring. As a general rule, and certainly where testing is
subjective, all scripts should be scored by at least two independent
scorers. Neither scorer should know how the other has scored a test paper.
Scores should be recorded on separate score sheets and passed to a third,
senior, colleague, who compares the two sets of scores and investigates
discrepancies.
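The comparison carried out by the third colleague can be largely mechanical. A minimal sketch, with invented candidate numbers, scores, and discrepancy threshold:

```python
# Sketch: comparing two independent scorers' marks and flagging
# discrepancies for a senior colleague to investigate.
# Candidate numbers, scores, and the threshold are invented.

scorer_a = {"001": 14, "002": 18, "003": 9,  "004": 16}
scorer_b = {"001": 15, "002": 12, "003": 10, "004": 16}

THRESHOLD = 3  # maximum acceptable difference before investigation

for candidate in scorer_a:
    difference = abs(scorer_a[candidate] - scorer_b[candidate])
    if difference > THRESHOLD:
        print(f"Candidate {candidate}: scores differ by {difference} - investigate")
```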
3.3 ADMINISTRATION
A test must be practicable; in other words, it must be
fairly straightforward to administer. It is only too easy to become so
absorbed in the actual construction of the test items that the most obvious
practical considerations concerning the test are overlooked. The length of time
available for the administration of the test is frequently misjudged even by
experienced test writers, especially when the complete test consists of a
number of sub-tests. In such cases sufficient time may not be
allowed for the administration of the test unless it has first been tried out
(i.e. given to a small but representative group of testees).
Another practical consideration concerns the answer
sheets and the stationery used. Many tests require the testees to enter their
answers on the actual question paper (e.g. circling the letter of the correct
option), thereby unfortunately reducing the speed of the scoring and
preventing the question paper from being used a second time. In some tests the
candidates are presented with a separate answer sheet, but too often
insufficient thought has been given to possible errors arising from the
(mental) transfer of the answer from the question paper to the answer sheet itself.
A final point concerns the presentation of the test
paper itself. Where possible, it should be printed or typewritten and appear
neat, tidy and aesthetically pleasing. Nothing is more disconcerting
to the testee than an untidy test paper, full of misspellings, omissions and
corrections.