
Sonification of Gestures

Exploring polarity of mapping & tools

Names: Angelina von Gegerfelt (anvg@kth.se) & Andreas Almqvist (aalmqvis@kth.se)


Course: DT2300 Sound in Interaction
Assignment: Project Report
Examiner: Roberto Bresin
Kungliga Tekniska Högskolan
Date: March 2017

   
Abstract

This paper argues for which polarity a sonified gesture generally should have to achieve a higher
degree of perceived naturalness, with regard to the velocity and scaling of the gesture. The exploration
of current tools and the realization of a gestural sonification are documented and evaluated. The
ofxGVF is a viable tool for gesture recognition. Granular synthesis of a sound source with a
rich frequency spectrum works well for exploring how variation in a single physical dimension affects
the perception of a single acoustic parameter. Our results suggest that the polarity to use in a
mapping of the physical dimension to the acoustic dimension seems to correspond to naturally
occurring events and physical phenomena that humans have perceived and experienced since our
origin. Smaller objects and animals more often produce sounds of higher pitch, and larger
objects and animals more often produce sounds of lower pitch, which might be a reason why
that polarity of scale was perceived as more favourable. That increased velocity should increase the
loudness or brightness of a sound can be traced to the physical phenomenon of a high-velocity
impact. The perceived acoustic output of a system should therefore presumably correspond to
the (metaphorical) physical energy put into the system.
Table of Contents
Introduction

Background
Previous Research
Gesture Variation Follower
The focus of this project

Method
Gesture Variation Follower
Motion tracking
Open Sound Control (OSC)
Sound engine
Mapping and polarity
Data flow
Test procedure
Evaluation

Results

Discussion

Conclusion

References
Introduction

By designing a tool that is properly adapted to a user’s nature and perception of the world, the
user’s perceived effectiveness of the tool is likely to increase. Yesterday’s mainstream
technology was largely based around visual stimulation, mouse and keyboard input and simple
warning sounds. With the development of touchscreens, voice controlled systems and motion
controls, a broader spectrum of human nature is accommodated. To make technology more
usable for all human functional variations and to take further advantage of human potential,
technology could rely more heavily on our capacity for accurate movement and on our sensitive
auditory system.
As discussed by Walker, there are three important questions to initially consider when designing
a sonic interaction (Walker, 2000): Which sonic parameter is best fitted to represent the data at
hand? What would be the most natural polarity of that mapping? What is the relational scaling
between the relative change of data and the corresponding sonic feedback that conveys the
change in the most accurate way to a listener? This paper focuses on the second question,
regarding the polarity of a mapping.

Background

Previous Research
Bevilacqua et al. present a Hidden Markov Model based system for real-time gesture analysis
which can be used to build systems that control sonic and/or visual media using gestures.
They focused on two aspects of gesture analysis: the first being where in the gesture the user
was (time progression), and the second the likelihood that the performed gesture was
mimicking one stored in a database, i.e. the similarity between them. They described some
cases in which their paradigm was used, for example in the context of music pedagogy. Using
their system, users were able to control the playing speed of an orchestral recording (Bevilacqua
et al., 2009).
Sonification is a relatively new research area, as suggested by the struggle to finalize a
definition of the word itself. A definition that seems to have stuck is the following (Dubus and
Bresin, 2013): "Sonification is defined as the use of nonspeech audio to convey information.
More specifically, sonification is the transformation of data relations into perceived relations in
an acoustic signal for the purposes of facilitating communication or interpretation." (Kramer et
al., 2010)
Sonification may unlock the potential for processing more information simultaneously than a
standard visual display allows (Scaletti and Craig, 1991). Due to the nature of sound and how humans
perceive it, sonification might be especially applicable for time-related tasks, for
facilitating kinesthesia, or for helping the visually impaired (Dubus and Bresin, 2013).
Kramer et al. describe several successful applications of sonification, testifying to its potential.
Among other things, they discuss sonifications that have helped identify emergencies more quickly
during surgery and that have aided the understanding of oscillating phenomena
observed in nature (Kramer et al., 2010).
Löwgren talks of the fluency and pliability of HCI. Fluency concerns how well an interaction deals
with and facilitates multimodality and multiple input channels. Pliability concerns how well a user
perceives an interaction in terms of malleability and shapeability. By incorporating sonification
and gestural input in a well-designed interaction, both of these concepts may be increasingly
accommodated (Löwgren, 2009). In the sonification of gestures, the gestures are not the only thing
that influences the sound; the auditory feedback also plays a vital part. It has been shown
in a study that when a sound’s causal action is identified, gestures made primarily mimic this
action. When the identification cannot be made, the sound’s acoustic contour is traced
(Caramiaux et al., 2014)​. This suggests that certain sounds have gestures that are more
naturally coupled than others. Exploiting this could make an interaction more natural.
The polarity of a mapping is vital to how effective a sonic interaction is, as it decides how the
interaction is actually perceived and used. Walker and Kramer showed that a polarity
that is intuitive to the designer is not necessarily the best one as perceived by users. This
emphasizes the importance of empirical groundwork in sonic interaction design (Walker and
Kramer, 2005).

Gesture Variation Follower


Baptiste Caramiaux created an application called the Gesture Variation Follower (ofxGVF) that is
used for gesture recognition (Caramiaux et al., 2015). There are three versions available: one
for PureData, one for OpenFrameworks and one for MaxMSP. The application has three
modes: "clear" (the default), "learning" and "following". In "clear" the program ignores
user input and should remove (clear) all saved gestures. In the "learning" mode it records
gestures as templates, and in the "following" mode it compares the movement to the recorded
gestures. The program will always try to match a recorded gesture, no matter how little the
movement resembles it; if there is only one recorded gesture, the program will report that the user is
always making that gesture. For each frame of movement, the comparison to the saved gestures is
summarized in a data structure called GVFOutcomes. It holds five variables: the likeliest gesture (an
integer identifying the recorded gesture), alignment (where in the gesture the user is), dynamics (the
speed relative to the recorded gesture), scalings (the size relative to the recorded gesture) and
rotations (the gesture rotation angle).
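
As an illustration, a minimal C++ sketch of how such a per-frame outcome could be represented
follows. The field names and types are inferred from the description above and are not taken
verbatim from the ofxGVF source.

    // Simplified sketch of the per-frame recognition outcome described above.
    // Field names are illustrative; the actual ofxGVF structure may differ.
    struct GVFOutcomes {
        int   likeliestGesture;  // index of the recorded gesture that matches best
        float alignment;         // position within that gesture, 0.0 (start) to 1.0 (end)
        float dynamics;          // speed relative to the recorded gesture (1.0 = same speed)
        float scaling;           // size relative to the recorded gesture (1.0 = same size)
        float rotation;          // rotation angle relative to the recorded gesture
    };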
This project used the OpenFrameworks version, since it is more stable than the PureData
version and MaxMSP costs money to use. OpenFrameworks is an open source programming
tool for "creative coding" in C++ (OpenFrameworks 2017). Caramiaux has written an example
project for OpenFrameworks called example-2dShapes (OfxGVF-example 2016). The project
uses two addons: ofxUI, which is deprecated in the current OpenFrameworks and has to be
added to the project manually, and ofxXmlSettings, which is a dependency of the UI. The example
uses a 2D space and the user interacts with it using the mouse. When the user presses a lowercase
or uppercase "L", the program goes into "learning" mode and then waits
for the user to press the mouse button, drag the mouse and then release it. That completes a
gesture, which is then saved as a template, after which the program immediately goes into the
"following" mode. The user can return to the "learning" mode as many times as desired. When the
program is in the "following" mode, it likewise looks for a mouse button press, movement and then
release (OfxGVF 2016).

The focus of this project


This study deals with an aspect of the groundwork of designing successful gestural sonification.
The focus is on exploring the polarity of mappings from the physical dimension, and on the
currently and easily available tools used to realize a gestural sonic interaction.

Method

Gesture Variation Follower


In this project, the example project from Caramiaux was modified to ignore the mouse and
instead utilize motion tracking. There is a new command, lowercase or uppercase "K", which
changes the mode to "following". If the program is in "learning" mode and finishes a gesture, it
does not automatically change into the following mode as before, but remains in the "learning"
mode and allows the user to program several gestures in a row. The input comes in the form of
a message on an OSC channel containing seven values, although only the first three are used:
the X, Y and Z coordinates. When the Z value is less than zero the program accepts gestures;
otherwise it ignores the input. The X and Y values are used for recording the
gesture. When the program is in the "following" mode and the user makes a movement, it outputs
seven values: an integer representing the likeliest gesture, followed by that gesture's
alignment, dynamics (two values), scalings (two values) and rotation. Normally, these
fields are lists containing values for each of the programmed gestures, but here the values for the
other gestures are filtered out. The output is sent over a different OSC channel.
The code can be found on GitHub: ​https://github.com/Glassig/ofxGVF_motion
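
For illustration, a minimal sketch of how the OSC input and output described above might be
handled with the ofxOsc addon follows. This is not the repository's actual code; the port numbers,
host address, OSC address pattern and the placeholder outcome variables are assumptions.

    #include "ofxOsc.h"

    // Sketch of the OSC handling described above, assuming an OpenFrameworks app.
    // Port numbers, host address and address pattern are placeholders.
    ofxOscReceiver receiver;   // coordinates arriving from the PureData patch
    ofxOscSender   sender;     // recognition results going to the sound engine

    void setupOsc() {
        receiver.setup(9000);                  // assumed input port
        sender.setup("192.168.0.3", 9001);     // assumed sound-engine host and port
    }

    void updateOsc() {
        while (receiver.hasWaitingMessages()) {
            ofxOscMessage in;
            receiver.getNextMessage(in);
            // The tracking message carries seven values; only the first three
            // (X, Y, Z) are used. They are read as floats here for simplicity.
            float x = in.getArgAsFloat(0);
            float y = in.getArgAsFloat(1);
            float z = in.getArgAsFloat(2);
            if (z >= 0) continue;              // outside the active zone: ignore the input

            // Inside the active zone: X and Y would be fed to the gesture follower
            // here (stand-in for the actual ofxGVF call). In "following" mode the
            // seven outcome values are then forwarded on the outgoing OSC channel.
            int   likeliestGesture = 0;        // placeholder values; in the real
            float alignment        = 0.0f;     // program these come from the ofxGVF
            ofxOscMessage out;
            out.setAddress("/gvf/outcomes");   // assumed address pattern
            out.addIntArg(likeliestGesture);
            out.addFloatArg(alignment);
            // ... dynamics (2 values), scalings (2 values) and rotation added similarly
            sender.sendMessage(out);
        }
    }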

Motion tracking
The gesture code is not dependent on what kind of program is used for the motion tracking, as
long as the X, Y and Z coordinates are sent over an OSC channel. For this project, OptiTrack
Arena (OptiTrack 2017) was used to track a rigid body and send its values on an OSC channel
using PureData. The three-point rigid body was handheld during gesture performance.

Open Sound Control (OSC)


OSC is a protocol created for real-time communication between computers and other multimedia
devices, optimized for modern networking technology (OSC 2011). Here it is used to send floats
and integers between the programs doing the calculations. PureData is used for the transfer from
OptiTrack Arena (motion tracking), and an OpenFrameworks addon for OSC is used to send data
to the Pure Data patch doing the sound calculations.

Sound engine
Pure Data is an open source visual programming language that enables users to create
software graphically in order to generate sound, video, 2D/3D graphics and MIDI. It works on
many devices, from personal computers and Raspberry Pis to smartphones (PureData 2016).
A sound engine was built in Pure Data based on granular synthesis by modifying an already
existing patch (​Granulator 2016​). The code can be found on GitHub:
https://github.com/aalmqvis/Ljud-i-interaktion/blob/master/GRAIN_experiment.pd
Granular sound synthesis can be described as dividing a sound (​granulation​) into many small
parts (​grains​) in the temporal domain, which together make up the full sound (​Roads, 1988​).
The position where the user "is" in a gesture is output from OpenFrameworks as a value
(alignment), and this value is used to find the corresponding location in the sound file using
equation (1):

    index = alignment * sampleLength(soundfile),   where alignment ∈ [0, 1]        (1)

The grain that is being played is then given by (2):

    grain = soundfile[index : index + 3000]        (2)
By decoupling gesture velocity from the sound file's pitch, pitch does not interfere while the
polarities of other mappings are being analyzed. Granular synthesis thus allowed a one-to-one
mapping to be evaluated.
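
As an illustration of equations (1) and (2), a minimal C++ sketch of the grain selection follows.
The actual sound engine is a Pure Data patch; the 3000-sample grain length is taken from equation
(2) and the function name is hypothetical.

    #include <algorithm>
    #include <vector>

    const int GRAIN_LENGTH = 3000;  // grain length in samples, as in equation (2)

    // Given the alignment value in [0, 1] and the loaded sound file,
    // return the grain that should be triggered next.
    std::vector<float> selectGrain(float alignment, const std::vector<float>& soundfile) {
        // Equation (1): index = alignment * sampleLength(soundfile)
        int index = static_cast<int>(alignment * soundfile.size());
        // Equation (2): grain = soundfile[index : index + 3000], clipped to the file end
        int end = std::min<int>(index + GRAIN_LENGTH, static_cast<int>(soundfile.size()));
        return std::vector<float>(soundfile.begin() + index, soundfile.begin() + end);
    }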
Due to the stream of indices being input, grains are triggered constantly, which ideally
should construct a continuous sound. However, in our implementation there was a disruption
between the playback of every grain, causing what was perceived as a "chopped-up" sound. This
was apparent as a perceptually disturbing discontinuity in the sound. Several attempts to
remediate this were made without noticeable success. The PureData reverb [freeverb~] was used
to slightly suppress the issue.
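
A common remedy for such clicks at grain boundaries is to apply a short amplitude envelope, for
example a Hann window, to every grain so that each grain fades in and out. This was not the fix
used in the project; the sketch below only illustrates the standard technique.

    #include <cmath>
    #include <vector>

    // Apply a Hann window to a grain so that its amplitude fades in and out,
    // reducing the click at the boundary between consecutive grains.
    void applyHannWindow(std::vector<float>& grain) {
        const std::size_t n = grain.size();
        if (n < 2) return;
        const float pi = 3.14159265f;
        for (std::size_t i = 0; i < n; ++i) {
            float w = 0.5f * (1.0f - std::cos(2.0f * pi * i / (n - 1)));
            grain[i] *= w;
        }
    }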
No matter the velocity at which a gesture is performed, it does not alter the original
pitch of the sound file. This is the reason why this synthesis technique was used in this
project.
The sound file used was a six-second guitar clip with a rich frequency spectrum. This sound
file accommodated the need for a sound with a distinct pitch (required by one of the investigated
mappings), as a sound such as white noise did not give a usable sensation of pitch change
when the playback speed was altered in the sound engine. A shorter sound file could have
been used, as long as it matched the length of the original gesture.

Mapping and polarity


As the user performs a gesture, the sound file is read and played. Velocity is a common
physical dimension used in sonification (Dubus and Bresin, 2013). Recognition of a gesture
performed with the same velocity as the "learned" gesture outputs the value 1. The scale factor is
another comparative physical dimension output by the ofxGVF, and recognition of a gesture of the
same size as the "learned" gesture likewise outputs the value 1. Both the velocity and scaling values
typically stayed within (0.5, 1.5) during gesture recognition, even when the gesture was pushed
towards the extremes. One mapping at a time was used. The mappings and polarities investigated
were as follows.

Mapping 1: Velocity to cutoff frequency of a low-pass filter
  1st polarity: velocity * 2500 = cutoffFreq
  2nd polarity: (1 / velocity) * 2500 = cutoffFreq

Mapping 2: Velocity to loudness
  1st polarity: velocity * loudness = loudnessFactor
  2nd polarity: (1 / velocity) * loudness = loudnessFactor

Mapping 3: Scale to pitch (playback speed of grain)
  1st polarity: scale = playbackSpeedFactor
  2nd polarity: (2 - scale) = playbackSpeedFactor

The rescaling of the third mapping’s second polarity corresponded to a perceptual polarity
inversion.
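
For illustration, the three mappings and their polarities can be expressed as small functions of
the ofxGVF output values. This C++ sketch mirrors the formulas above (the actual mappings were
realized in the Pure Data patch); the function names are hypothetical and the constant 2500 is the
one given in Mapping 1.

    // Mapping 1: velocity to cutoff frequency of a low-pass filter.
    float cutoffFreq(float velocity, bool firstPolarity) {
        return (firstPolarity ? velocity : 1.0f / velocity) * 2500.0f;
    }

    // Mapping 2: velocity to loudness.
    float loudnessFactor(float velocity, float loudness, bool firstPolarity) {
        return (firstPolarity ? velocity : 1.0f / velocity) * loudness;
    }

    // Mapping 3: scale to pitch (playback speed of the grain).
    float playbackSpeedFactor(float scale, bool firstPolarity) {
        return firstPolarity ? scale : (2.0f - scale);
    }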

Data flow

The flow of data for the gestural sonification was as follows: a rigid body was defined,
discovered and tracked on computer 1 with the OptiTrack Arena software. The coordinates of
the rigid body were locally converted to the OSC protocol with the help of Natnet, streaming the
coordinate data to a PureData patch. This patch received the data stream and sent it onwards
to an external computer, number 2. This computer received it in OpenFrameworks, where the
gesture handling took place. From OpenFrameworks, the output was sent onwards to computer
3, where the OSC data stream was received by a PureData patch where the sound synthesis
took place. From this computer, the output sound was sent into a routing matrix of the Fireface
audio interface and finally output to the 8.0 surround system. Despite this chain, no perceptible
latency was observed, most likely because all computers were on the same network and the only
data sent between the machines was integers and floats.

Test procedure
A circular gesture was recorded by one of the authors by activating the ofxGVF and performing
the gesture in the defined zone in space. The ofxGVF was then put into "following" mode and the
author redid the gesture in different sizes (for the mapping of scale to pitch) and at different
velocities (for the mappings of velocity to cutoff frequency and velocity to loudness). Half of the
time spent on each mapping was dedicated to each of the two polarities. During the "following"
mode, while the user was performing a gesture, the ofxGVF constantly output values which
triggered the granular synthesis, outputting sound into the speakers.

Evaluation
No external participants were recruited to test the polarities of the mappings due to a shortage of
time. The authors themselves made a verbalized assessment of the polarities after performing the
gestures with all of the presented mappings and polarities.

Results

Regarding the mapping of velocity to the cutoff frequency of the low-pass filter, the polarity in
which higher velocity gave a higher cutoff frequency was perceived as more natural than the one in
which lower velocity gave a higher cutoff frequency. For the mapping of velocity to loudness, the
polarity in which higher velocity gave higher loudness was perceived as preferable. As for scaling,
a larger scaling value giving a lower pitch was perceived as a more preferable polarity than a
larger scaling value giving a higher pitch.
If a gesture included movement at the same point in space multiple times, this could lead to
unpredictable outputs of the ofxGVF. The best gesture was a circle, since a fluid, smooth motion
gave more predictable output from the OptiTrack than a fast, jerky motion.
Using granular synthesis for the investigation was considered a good match for the needs of the
project scope. However, the sound engine used did have a significant issue regarding the
procedure used to play grains: every new grain that is triggered (and played back) cancels the
previous grain's playback if that grain is still sounding when the new one is triggered, which in
practice is all the time except for the very last grain.

Discussion

In this project a lot of focus was put on getting the technology to work and getting an appropriate
and functional sound engine in place. Early on, a lot of time was spent trying to get the PureData
version of the ofxGVF to work, but it was very unstable and would crash a lot. Much time was then
spent getting the OpenFrameworks version to work, since many of its dependencies
are outdated and the original code was created with a 2D space and a mouse in mind.
One of the biggest weaknesses of this project is that no study with external participants could be
conducted due to a lack of time. If the gesture tracking had been completed earlier, the sound
synthesis could have been perfected earlier and there would have been more time for external
participant testing.
In hindsight, we should have recognized early that the PureData version wasn't going to work
and looked for alternatives. Motion tracking and gesture analysis are very complex, and if our
project had been less complicated (i.e., no gesture analysis, focusing instead on motion
tracking and polarity mapping) we might have accomplished more in the given time. The time
spent looking for alternatives to the granular synthesis patch that was eventually used could
have been invested more carefully. Changing the sound synthesis technique meant changing
the basic conditions of how the interaction would take place, which stirred up unnecessary
confusion.
The ofxGVF gave unpredictable outputs when a gesture included movement at the same point
in space multiple times, as it tried to match the current gesture position to the beginning of the
gesture. Favourable results were obtained by using a wide diversity of gestures. For instance, two
similar gestures, such as a triangle and a circle drawn with the same direction of rotation, were
hard for the ofxGVF to distinguish accurately.
We gained a lot of knowledge about how motion tracking works: it can be very precise, but
recognizing a gesture reliably can be quite hard. There are applications that can do it to some
extent, but they are not always reliable, as experienced with the ofxGVF.
Using granular synthesis initially accommodated a need for playback of a fixed sound
file whose pitch would not change with the velocity at which a gesture is performed, i.e.
the speed at which the sound file is played back. However, as the project proceeded and
its direction changed, the root of that need dissolved, but the sound engine that had been built
was still utilized in an advantageous manner.

As previously mentioned, a different, more neutral sound source such as Gaussian white noise
would have been sufficient if the acoustic parameter pitch had not been part of an investigated
mapping. However, as the sound used has strong components across the frequency spectrum, and
it gave similar experiences of the polarities across the mappings, the nature of the sound is
judged not to have had a negative influence on the project. The disrupted playback of grains did
have an inconvenient influence on the perceived sound, as more focus was required from the users
to notice the acoustic shifts, but it is judged not to have distorted the overall perception of the
polarities.
Generally, there seem to be two indications regarding what a preferable polarity of a mapping
is, linking the polarities to natural occurrences and how humans have perceived and
experienced the world since our origin. The polarity of higher velocity giving higher
loudness and a higher cutoff frequency is in line with the natural phenomenon that a higher
velocity of an object gives a more forceful impact, which produces a louder sound with
higher-frequency components. The polarity of a smaller gestural scaling giving a high-pitched
sound and a larger scaling giving a low-pitched sound mirrors natural events, such as the impacts
of objects of different sizes and the sounds of animals of different sizes.
The two authors shared a sense of the importance of communication and of the responsibility to
do the best possible job. The contributions of the two authors to the project were equal. The
project plan contained no explicit statement that one author was to take care of the gesture
recognition and the other of the sound synthesis; rather, this division evolved naturally during
the working process, through the discovery of the tools and the authors' previous knowledge of
C++ and sound synthesis, respectively.

Conclusion

This paper deals with an aspect of the groundwork of designing successful gestural sonification.
The focus is on exploring the polarity of mappings and the tools used to realize an
interaction. Gestural tracking was done in OptiTrack Arena and gesture recognition was done in
OpenFrameworks through a modification of the so-called Gesture Variation Follower. Sonification
of the gestures was done in PureData with a granular synthesis technique. Mapping polarities
were investigated by the use of one-to-one mappings, exploring the perceived impact of one
physical input parameter on a single acoustic parameter, in a total of three mappings (and thus
six polarities).
The use of the ofxGVF in OpenFrameworks was fully functional for the application at hand. It
had some difficult bugs, such as "clear" not working when more than one gesture was programmed.
It also forced the gesture recognition to match at least one gesture, even when the performed
gesture in reality did not match any at all. When the user was deliberately trying to match a
gesture, it was very accurate, both in sensing which gesture it was and in estimating the velocity
and scale relative to the original gesture.
Using a granular synthesis technique to separate gesture velocity from pitch-altering playback,
and thus to evaluate one acoustic parameter at a time, works well as long as grain playback and
temporal evolution are undistorted.
More research has to be done before establishing a general rule for what polarity a mapping
from the physical dimension to the acoustic dimension should have. Our results suggest that the
polarity to use when mapping a physical dimension to an acoustic dimension seems to
correspond to naturally occurring events and physical phenomena that humans have perceived and
experienced since our origin. Smaller objects and animals more often produce sounds of
higher pitch, and larger objects and animals more often produce sounds of lower pitch, which
might be a reason why that polarity of scale was perceived as more favourable. That
increased velocity should increase the loudness or brightness of a sound can be traced to the
physical phenomenon of a high-velocity impact. The perceived acoustic output of a system should
therefore presumably correspond to the (metaphorical) physical energy put into the system.

References

Bevilacqua, F., Zamborlin, B., Sypniewski, A., Schnell, N., Guédy, F. and Rasamimanana, N.,
2009, February. Continuous realtime gesture following and recognition. In ​International gesture
workshop​ (pp. 73-84). Springer Berlin Heidelberg.

Caramiaux, B., Bevilacqua, F., Bianco, T., Schnell, N., Houix, O., & Susini, P. (2014). The role
of sound source perception in gestural sound description. ​ACM Transactions on Applied
Perception (TAP)​, ​11​(1), 1.

Caramiaux, B., Montecchio, N., Tanaka, A., & Bevilacqua, F. (2015). Adaptive gesture
recognition with variation estimation for interactive systems. ​ACM Transactions on Interactive
Intelligent Systems (TiiS)​, ​4​(4), 18.

Dubus, G. and Bresin, R., 2013. A systematic review of mapping strategies for the sonification
of physical quantities. ​PloS one​, ​8​(12), p.e82491.

Granulator 2016, accessed 10 March 2017,
<http://sites.google.com/site/maximoskp/grain_offLine_course_final.pd.zip?attredirects=0&d=1&pageId=104610731018232454088>

Kramer, G., Walker, B., Bonebright, T., Cook, P., Flowers, J. H., Miner, N., & Neuhoff, J. (2010).
Sonification report: Status of the field and research agenda.

Löwgren, J. (2009). Toward an articulation of interaction esthetics. New Review of Hypermedia
and Multimedia, 15(2), 129-146.

OfxGVF 2016, accessed 10 March 2017, <​https://github.com/bcaramiaux/ofxGVF​>

OfxGVF-example 2016, accessed 10 March 2017,
<https://github.com/bcaramiaux/ofxGVF/tree/master/openFrameworks/example-2dShapes>

OpenFrameworks 2017, accessed 10 March 2017, <​http://openframeworks.cc/​>

OptiTrack 2017, accessed 10 March 2017,
<http://www.optitrack.com/support/software/arena.html>

OSC 2011, accessed 10 March 2017, <​http://opensoundcontrol.org/introduction-osc​>

PureData 11 August 2016, accessed 10 March 2017, <​https://puredata.info/​>

Roads, C. (1988). Introduction to granular synthesis. Computer Music Journal, 12(2), 11-13.

Scaletti, C., & Craig, A. B. (1991, June). Using sound to extract meaning from complex data. In
Electronic Imaging'91, San Jose, CA​ (pp. 207-219). International Society for Optics and
Photonics.

Walker, B. N. (2000). Magnitude estimation of conceptual data dimensions for use in
sonification (Doctoral dissertation, Rice University).

Walker, B. N., & Kramer, G. (2005). Sonification design and metaphors: Comments on Walker
and Kramer, ICAD 1996. ​ACM Transactions on Applied Perception (TAP)​, ​2​(4), 413-417.

