Language testing and assessment: some fundamental considerations.

C.Alexander MA (Applied Linguistics + TESOL), PGCert (TESOL), CTEFLA, LGSM, Doctor of Education Student (Bristol)






In this paper I argue that for the externally graded ‘matura’ examination to be ethical in the context of the Central Examination Commission (Centralna Komisja Egzaminacyjna), validity, reliability, and influences on test performance have to be considered. With regard to the English-language ‘matura’ in Poland, for example, institutional accountability, stakeholders and the issue of test consequences will become increasingly important when Poland becomes a full member of the European Union.

Construct validity and theories of language use

The simplified Chapelle (1999) questions ‘What does our test measure?’ or ‘Does this test measure what it is supposed to measure?’ are a good starting point. Messick (1989, 1994) notes that validity is not a characteristic of a test but a feature of the inferences made on the basis of test scores and of the uses to which a test is put. Cronbach & Meehl (1955) hold that it is not a test that is validated but ‘a principle for making inferences’. In order to establish what is being tested, testers need to consider what is known about language knowledge and ability, and the ability to use language, i.e. not design a test arbitrarily. Test misuse refers to using a test for a purpose for which it was not intended and for which its validity is unknown. Alderson et al. (2001, 62-64) maintain that no one person can produce a good test, or even a good item, alone: the item writer knows what the item is intended to test and will find it difficult to see that it might in fact be testing something quite different, or something in addition to what is intended. Alderson et al. (2001, 63) state that all tests should be edited with the help of an expert committee, and it is essential that the committee does not just read the test and its items but actually attempts each item as if they were students. It is also held (Alderson et al. 2001, 64) that one person should be made responsible for ensuring that the committee’s recommendations are not only recorded but also acted upon and implemented in a revised test.

Training of ‘writing’ examiners

It is clear that examiners need to be trained, especially for tests that involve subjective marking, e.g. writing and speaking. A rating scale is required, e.g. a holistic scale (general impressions) or an analytic scale (more detailed). Examiners should understand the principles behind the particular rating scales they must work with, and be able to interpret their descriptors consistently (Alderson et al. 2001, 106-112).

With regard to grading writing scripts, Alderson et al. (2001, ibid) recommend the following steps:

  1. A Chief Examiner should select scripts which are examples of ‘adequate’ and ‘inadequate’ performance i.e. ‘consensus’ and ‘problem’ scripts.
  2. These scripts should be graded using the rating scale (a special examination committee could do this) i.e. a standardising committee. Each member should be given copies of the scripts and should mark them. Marks should be compared and a ‘consensus’ mark should be reached; if necessary, the rating scale should be refined.
  3. The Chief Examiner should divide the scripts into ‘consensus’ and ‘problem’ scripts. The first batch could be used during the initial training of examiners and the second during a second meeting.

Once the rating scale is thoroughly understood examiners should proceed to the marking stage. Alderson et al. (2001, ibid) hold examiners should go through this training process at regular intervals. Alister Cumming (1997) discusses in detail the testing of writing in a second language.

Training examiners of speaking

Stevenson (1985) notes that it is problematic to establish methods of testing which accurately reflect the targeted abilities and in which ability/trait and method are clearly separated. In the context of a speaking test, there seem to be three influences: (i) the type of language elicited, i.e. how variations in the testee’s language affect the generalisations made about language use; (ii) the nature of the learner and the suitability of the test for him or her, which may limit the ability to generalise beyond the test itself; (iii) the test method itself, i.e. generalisations may reflect how well candidates cope with the test format rather than their actual proficiency.

Alderson et al. (2001, 114-117) take the view that the training of examiners of speaking is similar to that of examiners of writing, with three principal differences: (1) since in most institutions examiners give their grades during the test rather than after it, the training needs to take place before the test is administered; (2) ‘the institution needs to use recordings of student performances instead of written scripts, both when the committee is setting standards and during the standardisation meeting’. Audio recordings are usually used, but video recordings are becoming increasingly common. Alderson et al. (2001, ibid) maintain that ‘the chief examiner should choose sample performances and edit them onto a single cassette for the standardisation committee’. The institution could also invite volunteer students to be tested during the standardisation meeting so that examiners can try out their skills; (3) there should be an ‘interlocutor’ and an ‘assessor’. Both need training, i.e. the type of language elicited and the way it is elicited need to be considered (was the candidate given the best chance to display the abilities being tested? Alderson et al. 2001, ibid). Affective factors and the constructs behind the test also need to be considered.

A great deal of research has been undertaken on the way professional non-native speakers (NNSs) and native speakers (NSs) judge oral performances (Brown 1995; Ellis 1995, 63-67); in the light of this research, I maintain that whether raters are NNSs or NSs is relevant (though defining what a ‘native speaker’ is remains problematic). Research suggests that there are significant differences in harshness between NNS and NS assessments of different productive skills. The background of the NS may also be pertinent: Brown (1995, 7-8) found that NSs with an industrial background were harsher than those with a teaching background. The frequency and type of negotiation may also differ according to whether the interlocutor is familiar or unfamiliar to the test takers. There may be more subtle differences between NSs as well: Lazaraton (1996, 166) found that native-speaker examiners provide candidates with eight types of support, and that this support is not consistent and so could affect a candidate’s language use and the rating.

In oral assessments, close attention needs to be paid not only to the interlocutor but also to possible task variables. Structure (competency-related tasks could be functional or vocational), cognitive load (i.e. difficulty) and familiarity of content are features internal to the task; the availability of planning time and whether the interlocutor is a NS or NNS should be treated as external conditions. Spence-Brown (2001) states that the authenticity of a task is now firmly established as a central concern in test design and test validation.

Barry O’Sullivan (Head of Applied Linguistics at Reading University and Visiting Lecturer at Bristol University, who validates all UCLES speaking examinations) and colleagues (O’Sullivan et al. 2002) present a checklist which enables language samples elicited by a task to be scanned for language functions in real time. O’Sullivan (2002, 291) also holds that ‘a test-candidate’s degree of acquaintanceship with his or her interlocutor as well as the sex of that interlocutor, relative to that of the candidate, represent a set of variables whose effect on performance is both predictable and significant within that context’.

The testing of vocabulary is a more active field of study than the testing of grammar; the purpose of a vocabulary test is the starting point for its design (see Read and Chapelle 2001, ‘A framework for second language vocabulary assessment’; also Read 1998, ‘Validating a Test to Measure Depth of Vocabulary Knowledge’). Traditionally, vocabulary was tested through individual interviews in which learners provided explanations of words; this methodology, however, is time-consuming and restricts assessment to small samples of words. An alternative approach is written tests with multiple-choice options. Every speaking test should be graded according to fluency, accuracy and complexity, though each of these aspects must be clearly defined for the assessors and affective factors must be taken into consideration.

Monitoring examiner reliability

The ability of an examiner or examiners to grade a test consistently, using the same assessment standards, is of the utmost importance. There are essentially two issues of concern: (i) would a student receive a different grade (written or oral) if he or she took the test with a different assessor using the same marking scale (inter-rater reliability)? (ii) would an examinee’s grade (written or oral) be the same if the test were taken at a different time or date with the same assessor (intra-rater reliability)? All professional testing institutions (including universities) should attempt to measure the reliability of test assessors. Rater reliability can be measured using a correlation coefficient or through analysis of variance. Inter-rater reliability can be improved in the following ways (noted in Alderson et al. 2001, 128-136):

  1. Central marking. There are three main ways of marking scripts centrally: (a) sampling by the chief examiner, whereby the team leader monitors the marking process as it actually happens and is available if there are any doubts; (b) using ‘reliability scripts’: each examiner independently marks the same set of ‘reliability scripts’ chosen by the chief examiner and/or a standardising committee. ‘Eyeballing’ examiners’ grades is an initial technique for identifying grading discrepancies between examiners; the examiners’ marks can also be correlated with the standardising committee’s marks. Alderson et al. (2001, ibid) regard a correlation of 0.8 or above as ‘the best’ outcome. It is important to identify lenient and strict examiners; (c) routine double-marking, in which every piece of work a student produces is marked by two different examiners and a mean grade is obtained. By contrast, (d) if examiners mark at home they will probably grade scripts at different times of the day under varying conditions; it is then important that a team leader (and not the examiner) randomly samples scripts that have not been marked centrally to establish the quality and consistency of the marking process.
  2. With regard to speaking tests, Alderson et al. (2001, ibid) suggest using an independent assessor. When a candidate completes the test, the two assessors (NB not the interlocutor) can compare their marks. There are good arguments for using reliability tapes to train examiners.
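The ‘reliability scripts’ check above can be quantified quite simply. The sketch below, with invented marks for illustration, correlates one examiner’s grades on a shared set of scripts with a standardising committee’s consensus marks and applies the 0.8 threshold mentioned by Alderson et al.:

```python
# Sketch: inter-rater reliability via the Pearson correlation between
# one examiner's marks on a set of 'reliability scripts' and the marks
# agreed by a standardising committee. All marks below are invented.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two lists of marks."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

committee = [14, 11, 17, 9, 13, 16, 8, 12]   # consensus marks per script
examiner  = [13, 10, 18, 9, 12, 17, 7, 12]   # one examiner's marks

r = pearson_r(committee, examiner)
print(f"correlation with committee: {r:.2f}")

# Following Alderson et al., flag examiners whose correlation with the
# committee falls below 0.8 for retraining or closer monitoring.
if r < 0.8:
    print("marking diverges from the committee standard")
```

The same computation serves for intra-rater reliability by correlating an examiner’s first and second markings of the same scripts; the invented data above yields a correlation of roughly 0.98, comfortably above the 0.8 threshold.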

Intra-rater reliability

It is also essential that the internal consistency of markers is checked. This can be achieved through the routine re-marking of scripts. Giving teachers or examiners too many writing groups or examination scripts to mark will, in my opinion, adversely affect intra-rater reliability.

Setting pass marks and examination construction

It is important to decide whether a norm-referenced or criterion-referenced grading system will be used; both have advantages and disadvantages. A normal procedure in language test construction and evaluation is to elicit feedback from administrators, candidates and examiners using questionnaires that seek information about elements of the test; such feedback may reveal, for example, how students feel about particular test elements.

Examination construction is a dynamic, not a static, process.


In this paper I have provided an overview of issues relevant to grading the English-language ‘matura’. I hold that it is important to continuously analyse and research the concepts discussed in this paper in order to develop and improve tests: rater reliability, construct validity, the training of examiners for speaking and writing tests, feedback, the testing of vocabulary, and the use of assessors in speaking tests.


References

Alderson, J., C. Clapham & D. Wall (2001) Language Test Construction and Evaluation. Cambridge: CUP

Brown, A. (1995) The effect of rater variables in the development of an occupation-specific

language performance test. Language Testing 12/1: 1-15

Chapelle, C. (1999) Validity in language assessment. Annual Review of Applied Linguistics, 19: 254-272. New York: Cambridge University Press


Cronbach, L. J. & P. E. Meehl (1955) Construct validity in psychological tests. Psychological Bulletin, 52: 281-302

Cumming, A. (1997) ‘The Testing of L2 writing’, Language Testing, 7: 51-63

Ellis, R. (1995) The Study of Second Language Acquisition. Oxford: OUP

Lazaraton, A. (1996) Interlocutor support in oral proficiency interviews: the case of

CASE. Language Testing 13/2: 151-172

Messick, S. (1989) Validity. In R. Linn (ed.) Educational Measurement, third edition (13-103). New York: Macmillan

Messick, S. (1994). The interplay of evidence and consequences in the validation of

performance assessments. Educational Researcher 23 (2), 13-23

O’Sullivan, B., C. Weir & N. Saville (2002) Using observation checklists to validate speaking-test tasks. Language Testing 19 (1): 33-56

O’Sullivan, B. (2002) Learner acquaintanceship and oral proficiency test pair-task performance. Language Testing 19 (3): 277-295

Read, J. & C. Chapelle (2001) A framework for second language vocabulary assessment. Language Testing 18 (1): 1-32

Read, J. (1998) Validating a test to measure depth of vocabulary knowledge. In Kunnan, A. J. (ed.) Validation in Language Assessment, Chapter 3. Mahwah, NJ: Lawrence Erlbaum Associates

Spence-Brown, R. (2001) The eye of the beholder: authenticity in an embedded assessment task. Language Testing, 18: 463-481

Stevenson, K. (1985) Authenticity, validity and a tea party. Language Testing, 2 (1): 41-47