2 authors, including:
Nick Zacharov
Delta
All content following this page was uploaded by Nick Zacharov on 02 December 2015.
The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have
been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from
the author’s advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes
no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio
Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights
reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the
Journal of the Audio Engineering Society.
ABSTRACT
Assessor panels are used to perform perceptual evaluation tasks in the form of listening and viewing tests. In
order to ensure the quality of collected data it is vital that the selected assessors have the desired qualities in
terms of discrimination aptitude as well as consistent rating ability. This work extends existing procedures in
this field to provide a statistically robust and efficient manner for assessing and evaluating the performance
of assessors for listening and viewing tasks.
Evaluation of potential
The three-stage screening process is defined in the
following section, comprising
Table 1: Summary of assessor categories employed in sensory analysis, as defined in ISO standard 8586-2
[14], applied to the food industry and recommended for adoption in the field of audio.
The purpose of the questionnaire was mainly to gain knowledge about age, experience and interest in listening and viewing tests, availability and native language.

4. Age between 18 - 50 years
5. Availability for tests during daily hours (Mon - Fri 9-18)

43 persons from the original pool complied with all five selection criteria and were selected for the first qualification test session.

The first qualification stage contained the following tasks/tests
2. Stereopsis better than 250 seconds of arc. Preferably better than 50 seconds of arc.
3. No colour deficiency

Hearing

1. The person should have an audiogram showing equal to or less than 15 dB HL for all frequencies. However, 20 dB HL was allowed for one frequency per ear.
2. Loudness test should be 100 % correct.

It was not a criterion that the persons should comply with both the vision and hearing criteria, but it was highly desirable.

The visual criteria were specified by the optometrist designing the tests. The audiogram criterion is generally accepted as normal hearing threshold deviations.

Twenty-two persons qualified from Qualification stage I and were invited to Qualification stage II (Qs2). The selected persons gave the impression of being suited for panel work and group discussions and showed a high level of motivation and enthusiasm. 17 persons complied with the criteria for normal vision, and 19 persons complied with the criteria for normal hearing. Thirteen persons complied with the criteria for both normal vision and hearing.

Because two persons dropped out prior to Qs2, the final selected group consisted of 20 persons.

2.4. Qualification stage II
The purpose of Qualification stage II was to test the persons' ability to perform perceptual tests within the sound and visual domains. The persons were invited in groups of 3 - 5 to perform the tests. The perceptual tests were performed individually.

The second qualification stage contained the following tasks/tests

The session started with a follow-up group discussion on the chocolate tasting task performed in Qs1. The intention was to introduce the potential assessors to the work process of word elicitation and language development methods. The task was to subgroup a list of 20 descriptive words on chocolate collected from Qs1 into three groups. The three subgroups should be well defined by the persons and there should be consensus about the meaning of each of the descriptors. One subgroup of words was picked out and the panel would define one scale including two anchor point labels to rate chocolate. This work had two purposes: to introduce the potential assessors to panel work and to get an impression of group interaction and social skills.

The four perceptual tests were performed at different computer work stations in the listening room. To minimize disturbance between test persons, separation walls were put up. The persons rotated between the work stations. Hearing protectors were provided for persons performing the visual tests to eliminate the audible sound transmission from the open headphones used by persons performing the listening tests.

2.4.1. Perceptual screening tests
The four perceptual tests are explained in this section. Each test had its own dedicated computer to eliminate any variations of test material across test persons. The tests were performed sitting at a table with a screen distance of approximately 50 cm from the eye. The ambient light measured perpendicular to the computer screens was 110 Lux. The acoustic output levels from the Sennheiser HD 650 headphones used for the two listening tests were adjusted to a calibrated most comfortable level. The listening room complied with NR 10 background noise level.

For the purpose of these tests the triangle test was chosen to overcome problems associated with the pair comparison method, as employed in [26], where a specific attribute should be assessed by the assessor. Considering that assessors in this screening
experiment are not familiar with complex attributes such as speech quality or spatial quality, the triangle test offers a more understandable task, i.e. to identify which of the three samples is different. Additionally, the statistical analysis of the triangle test is robust and absolute, reported in terms of percentage correct.

Each test was designed to start with two easily discriminable samples presented in six balanced triads to work as an introduction. According to the ISO 4120 [16] triangle test standard method there should be six triads for each pair of stimuli. The triads have the following balanced stimuli order: ABB, AAB, ABA, BAA, BBA, BAB. The presentation sequence of stimuli pairs was the same for all persons. The reason for having the same presentation order with increasing task difficulty was to ensure the same learning effect for each person. The presentation order of the triads within each stimulus pair was randomized, double-blind, by the test software.

In the following description of test material the task difficulty is indicated by the number of stars (*).

Speech codec test: Stimuli were constructed using Nokia Multimedia Converter and QuickTime Pro. The AMR (adaptive multirate) narrowband codec was used. The file bit rate in kbps listed by the Nokia software is noted as the sample name. The original speech sample was extracted from the original Danish speech intelligibility test CD known as Dantale II [8]. See Table 3 for details of the stimulus parameters.

Group  Difficulty  Sample A  Sample B
H1     Intro       10 kbps   PCM
H2     *           17 kbps   PCM
H3     **          10 kbps   17 kbps
H4     ***         12 kbps   17 kbps

Table 3: Stimulus description of the AMR (adaptive multirate) narrowband codec stimuli employed in the speech codec screening test.

Audio codec test: Stimuli were constructed using Adobe Audition 1.5 with the Fraunhofer MP3 codec. Constant bit rate was used. The original music sample was extracted from the commercially available AES CD 'Perceptual Audio Coders: What to Listen For' [1], Track 92 by Brian Gilmour. In all 4 groups, one sample was always the original PCM wave file. The sampling frequency and compression rate are noted as the sample name. See Table 4 for details of the stimulus parameters.

Group  Difficulty  Sample A          Sample B
H1     Intro       24 kHz 64 kbps    PCM
H2     *           32 kHz 80 kbps    PCM
H3     **          32 kHz 96 kbps    PCM
H4     ***         32 kHz 112 kbps   PCM

Table 4: Stimulus description of the MP3 codec stimuli employed in the audio codec screening test.

Picture compression test: Picture compressions of original ITU-R BT.802-1 [18] test material. The test images were JPG compressed using IrfanView version 4.20. The quality is noted as xxx 50 = 50 %, using a JPEG standard 4:2:2 format. See Table 5 for details of the stimulus parameters.

Picture brightness test: Images with the size of 800 x 600 pixels were constructed in Corel Photo Paint X3, as illustrated in Figures 13 and 14. In the center of the images a square of 101 x 101 pixels of different brightness was placed. In this test, brightness is expressed by changes in the RGB (Red, Green, Blue) values. For example, the stimulus 100 105 has a background grey color of R = 100, G = 100, B = 100 and the centred smaller square is R = 105, G = 105, B = 105, which is perceived as brighter than the surrounding outer grey color. See Table 6 for details of the stimulus parameters.

2.5. Software
The triangle test software was built in LabVIEW 7.1. Sound stimuli were all in PCM wave format. Picture stimuli were in JPG format. The cross-fade time when switching between sounds was 50 ms. The cross-fade applied when switching between stimuli within each triad was linear and calculated by the software in real time. Direct switching was used when changing between pictures in the visual tests.
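The presentation scheme described for the screening tests (six balanced triads per stimulus pair, a fixed order of stimulus pairs, and a randomized triad order within each pair) can be sketched in a few lines. This is an illustration only and not the LabVIEW software used in the study. The function names `odd_position` and `build_session`, the `seed` parameter and the group labels passed in are assumptions made for this example.

```python
import random

# The six balanced triad orders listed in the text.
BALANCED_TRIADS = ["ABB", "AAB", "ABA", "BAA", "BBA", "BAB"]


def odd_position(triad):
    """Return the 0-based index of the sample that differs from the other two."""
    for index, label in enumerate(triad):
        if triad.count(label) == 1:
            return index
    raise ValueError("no odd sample found in triad")


def build_session(groups, seed=None):
    """Build a trial list: fixed group order, shuffled triads within each group.

    `groups` is an ordered list of stimulus-pair names (hypothetical labels
    such as "H1".."H4"); `seed` only makes this illustration reproducible.
    """
    rng = random.Random(seed)
    trials = []
    for group in groups:  # same order of increasing difficulty for everyone
        triads = list(BALANCED_TRIADS)
        rng.shuffle(triads)  # randomized order within the stimulus pair
        for triad in triads:
            trials.append((group, triad, odd_position(triad)))
    return trials


if __name__ == "__main__":
    for trial in build_session(["H1", "H2", "H3", "H4"], seed=0):
        print(trial)
```

Each trial records which position holds the odd sample, so a response can later be scored as correct or incorrect without the operator knowing the order in advance.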
• Project ID
2.6. Stage II selection criteria
The selection criteria for the assessor panel were based purely on performance in the perceptual tests.

Based on feedback from the persons during Qs2 and by inspecting the results it was clear that the speech and audio codec tests were more challenging than the visual tests. This led to the following selection criteria:

• Picture compression and brightness tests should be performed with at least 90 % correct responses in each test to qualify for visual panel work.
• Audio and speech codec tests should be performed with at least 75 % correct responses in each test for audio panel work.

The final selection was thus based on the average score for each subject in each of the screening tests.

The performance of the 20 individual assessors evaluated in Qs2 is illustrated in Figures 4 - 7.

For the sound screening tests a wider spread of ratings can be seen in the audio coding task compared to the speech coding task. However, in the audio coding test, a few assessors were able to obtain 100 % correct scores (assessors 2, 4, 7, 16), compared to the speech coding task, where the highest score for the most difficult stimulus set is 83 % (assessors 1, 5, 13, 16, 20, 21, 22).

Of the picture quality tasks the picture brightness was the easier of the two, with only the most difficult sample pairs causing most of the misidentification and 15 assessors obtaining greater than 80 % correct.

3. FINAL SELECTION CRITERIA
To qualify for the selected assessor panel the criteria defined from qualification sessions I and II should be fulfilled.

This means that all selected persons should have Danish as mother tongue, be between 18 and 50 years old, and be motivated and available for tests during daytime hours. Further, the selected persons should have normal visual acuity, stereopsis better than 250 seconds of arc and no colour blindness. Persons 6, 9, 13 and 14 had more than 2 errors per eye in the 1.0 visual acuity test, but corrective glasses are considered to compensate.

The selected assessors' audiograms showed equal to or less than 15 dB HL for all frequencies. Persons 3, 4, 5 and 18 exceeded this level, but after age-correcting the results according to ISO 7029 [11] only person number 4 fails, on his left ear, with a hearing threshold of 40 dB HL; this might have been due to poor fitting of the headphones.

All selected assessors except persons 6 and 8 had a score of minimum 90 % correct for the perceptual visual tests. All but persons 3 and 18 had a score of minimum 75 % correct for the sound tests in Qs2. Persons 1, 2, 4, 5, 7, 9, 10, 13, 14, 16 and 20 fulfilled the criteria for both visual and sound tests based on Qs2 criteria (see Figure 8).

A summary of the average selected assessor performance in each test is presented in Figures 9 and 10.

Fig. 9: Average assessor performance for all (15) selected assessors in audio screening tests as a function of the sound group (H1 - H4), as defined in Tables 3 and 4.

Fig. 10: Average assessor performance for all (15) selected assessors in visual screening tests as a function of the visual group (V1 - V7), as defined in Tables 3 and 4.

4. CONCLUSION
The assessor selection procedure defined here allows for a rapid screening of assessor suitability for listening and viewing quality evaluation tests. The procedure has improved upon previous methods through use of the triangle test method, which can provide robust statistical analysis and allows for measures of the assessors' repeatability as well as allowing for comparison to other assessors. Additionally, the triangle test eliminates the problems associated with use of line scales and focuses assessors on the simple task of finding the different sample of three. This latter aspect is beneficial for naive assessors who have potentially never performed a listening or viewing test before. The screening can be performed with 6 assessors in approximately four hours, which means that this type of screening can be readily applied when new assessors are to be found.

Once assessors have passed through this process, they can be categorized as selected assessors according to ISO 8586-2 [14] and are ready for further training and assessment for them to be categorized as expert assessors for any given application domain (speech coding, noise suppression, etc.).

5. FURTHER WORK
From this study it can be seen that the performance of assessors is strongly determined by the stimuli selected and their relation with respect to the just noticeable difference. In this study it was apparent that the visual stimuli were less critical than the audio stimuli, potentially due to the lack of high-level expertise of the experimenter in the visual quality domain. This finding leads to the view that a more systematic approach to selection of stimuli would be desirable.
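To make the link between stimulus difficulty and triangle-test scores more concrete, the following sketch computes how likely a given score is under pure guessing. It assumes independent trials with a guessing probability of 1/3 (one odd sample out of three), which is the usual reasoning behind triangle-test analysis. The function name `p_at_least` and the trial counts in the usage lines are assumptions for illustration, and the paper itself reports results only as percentage correct.

```python
from math import comb

CHANCE = 1.0 / 3.0  # probability of picking the odd sample by guessing


def p_at_least(correct, trials, p=CHANCE):
    """Probability of `correct` or more right answers in `trials` by guessing."""
    return sum(
        comb(trials, k) * p**k * (1.0 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # 5 of 6 triads correct corresponds to an 83 % score on one stimulus pair.
    print(p_at_least(5, 6))
    # 18 of 24 correct corresponds to 75 %, assuming four stimulus pairs
    # of six triads each in one test.
    print(p_at_least(18, 24))
```

Under these assumptions, five or more correct answers out of six has a guessing probability of 13/729, or roughly 1.8 %. A score near 33 % on a stimulus pair, by contrast, suggests that the difference lies below what the assessor can detect.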
(a) Foreground colour RGB 100:100:100 (b) Foreground colour RGB 105:105:105
Fig. 13: Brightness test: Background colour RGB 100:100:100. Yellow arrow is introduced to indicate the
region of difference for both images.
Additionally, it is observed that whilst expertise is defined in both ITU-R BS.1116-1 [19] and more extensively in ISO 8586-1 [13], an absolute measure of assessor expertise is lacking within the industry. The performance of assessors is not only associated with their sensory acuity, experience and training, but also with the stimuli with respect to the threshold of perception. In order to sharpen the definition of assessor expertise, it is suggested that we work towards a more unique definition of stimuli for assessor screening, leading to a more absolute measure of assessor competence and expertise.

6. ACKNOWLEDGEMENTS
The activities and results reported in this paper have been co-funded by the Danish Agency for Science, Technology and Innovation. The assessors involved in this work are thanked for their time and effort for
(a) Foreground colour RGB 200:200:200 (b) Foreground colour RGB 195:195:195
Fig. 14: Brightness test: Background colour RGB 200:200:200. Yellow arrow is introduced to indicate the
region of difference for both images.
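The brightness stimuli shown in Figures 13 and 14 (a uniform grey 800 x 600 image with a 101 x 101 square of slightly different grey level in the centre) can be approximated with a short script. This is a sketch under stated assumptions and not the procedure used in the study, where the images were drawn in Corel Photo Paint X3 and stored as JPG. The function names, the output file name and the use of the uncompressed PGM greyscale format are choices made only for this example. Because 800 - 101 is odd, the square is centred to within one pixel.

```python
def brightness_stimulus(background, foreground, width=800, height=600, square=101):
    """Return a grey image as rows of grey levels (R = G = B) with a centre square."""
    left = (width - square) // 2
    top = (height - square) // 2
    image = [[background] * width for _ in range(height)]
    for y in range(top, top + square):
        for x in range(left, left + square):
            image[y][x] = foreground
    return image


def write_pgm(path, image):
    """Write the image as a binary PGM (greyscale) file."""
    height, width = len(image), len(image[0])
    with open(path, "wb") as handle:
        handle.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in image:
            handle.write(bytes(row))


if __name__ == "__main__":
    # Hypothetical file name; corresponds to the "100 105" example in the text.
    write_pgm("brightness_100_105.pgm", brightness_stimulus(100, 105))
```

Calling `brightness_stimulus(200, 195)` would give the darker-square case of Figure 14 in the same way.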
[14] ISO. 8586-2. Sensory analysis - General guidance for the selection, training and monitoring of assessors - Part 2: Experts. International Organization for Standardization, 1994.

[15] ISO. 13299. Sensory analysis - Methodology - General guidance for establishing a sensory profile. International Organization for Standardization, 2003.

[16] ISO. 4120. Sensory analysis - Methodology - Triangle test. International Organization for Standardization, 2004.

[17] ISO. 6658. Sensory analysis - Methodology - General guidance. International Organization for Standardization, 2005.

[18] ITU-R. Test Pictures and Sequences for Subjective Assessments of Digital Codecs Conveying Signals Produced According to Rec. ITU-R BT.601. International Telecommunications Union Radiocommunication Assembly, 1994.

[19] ITU-R. Recommendation BS.1116-1, Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. International Telecommuni-

Telecommunications Standardization Sector, 2000.

[24] Koivuniemi, K., and Zacharov, N. Unraveling the perception of spatial sound reproduction: Language development, verbal protocol analysis and listener training. In Proceedings of the Audio Engineering Society 111th International Convention (2001), Audio Engineering Society.

[25] Lorho, G. Individual vocabulary profiling of spatial enhancement systems for stereo headphone reproduction. In Proceedings of the 119th Convention of the Audio Engineering Society (New York, USA, October 2005).

[26] Mattila, V.-V., and Zacharov, N. Generalized listener selection (GLS) procedure. In Proceedings of the 110th Convention of the Audio Engineering Society (Amsterdam, Holland, 2001).

[27] Kanehara Trading Inc. Ishihara's tests for colour deficiency - 38 plates edition. Tokyo, Japan, 2005.

[28] Stereo Optical Co., Inc. RANDOT®