Quantifying Approaches To Discourse For Social Scientists PDF

Postdisciplinary Studies in Discourse
Series Editor
Johannes Angermuller
Centre for Applied Linguistics
University of Warwick
Coventry, UK
Postdisciplinary Studies in Discourse engages in the exchange between
discourse theory and analysis while putting emphasis on the intellectual
challenges in discourse research. Moving beyond disciplinary divisions in
today’s social sciences, the contributions deal with critical issues at the
intersections between language and society. Edited by Johannes
Angermuller together with members of DiscourseNet, the series wel-
comes high-quality manuscripts in discourse research from all disciplin-
ary and geographical backgrounds. DiscourseNet is an international and
interdisciplinary network of researchers which is open to discourse ana-
lysts and theorists from all backgrounds. Editorial board: Cristina
Arancibia, Aurora Fragonara, Péter Furkó, Tian Hailong, Jens Maesse,
Eduardo Chávez Herrera, Michael Kranert, Jan Krasni, María Laura
Pardo, Yannik Porsché, Kaushalya Perera, Luciana Radut-Gaghi, Marco
Antonio Ruiz, Jan Zienkowski
More information about this series at

http://www.palgrave.com/gp/series/14534
Ronny Scholz
Editor
Quantifying
Approaches to
Discourse for Social
Scientists
Editor
Ronny Scholz
Centre for Applied Linguistics
University of Warwick
Coventry, UK
Postdisciplinary Studies in Discourse

ISBN 978-3-319-97369-2 ISBN 978-3-319-97370-8 (eBook)
https://doi.org/10.1007/978-3-319-97370-8
Library of Congress Control Number: 2018958470
© The Editor(s) (if applicable) and The Author(s) 2019

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and trans-
mission or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Cover design by Tjasa Krivec
This Palgrave Macmillan imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgements
The idea for this volume was born during the First International
DiscourseNet Congress in Bremen in late summer 2015. Together with
Marcus Müller, Tony McEnery, and André Salem, we had organized the
panel ‘Quantifying methods in discourse studies. Possibilities and limits
for the analysis of discursive practices’. With scholars of international
renown coming from linguistics, statistics, computer sciences, sociology
and political sciences in countries such as Canada, France, Germany,
Switzerland and the United Kingdom as well as guests from many other
countries, the panel was a real success. Papers presented seminal work
using a great variety of quantifying methods mostly in combination with
qualitative methods.
This edited volume is driven by the interdisciplinary and international
attitude of the inspiring discussions that we led in the panel. Completing
such a fascinating project would not have been possible without a net-
work of international supporters, to name only a few of them: ERC
DISCONEX Group and the Professional and Academic Discourses
Group, both hosted in Applied Linguistics at the University of Warwick;
the Centre d’Étude des Discours, Images, Textes Écrits, Communication
(CEDITEC) at the University Paris Est-Créteil; DiscourseLab at the TU
Darmstadt; and last but not least, DiscourseNet, which unites discourse
researchers across national and disciplinary borders all over the world.
v
vi Acknowledgements
This volume would not have seen the light without the support of
many colleagues and friends. I am grateful to the anonymous reviewers
who provided me with detailed and encouraging feedback. I also thank
the series editor Johannes Angermuller and the editorial assistant Beth
Farrow from Palgrave for supporting this publication project throughout
with great enthusiasm. Finally, I am thankful to my wife Joy Malala and
to our new-born son Gabriel Amani for tolerating the extra hours that I
had to put into editing this volume.
Praise for Quantifying Approaches to
Discourse for Social Scientists
“In today’s complex world of communication there is an urgent need to stand

back, to analyse, and to ask what is going on. This multi-national collection of
papers by communication scientists does just that. The papers in this book not
only provide technical tools of both a quantitative and a qualitative kind, they
make possible a perspective that gives objectivity to our understanding of the
disturbing world of words in which we flounder.”
—Paul Chilton, Emeritus Professor, Department of Linguistics and English
Language, Lancaster University, UK
“This is a very welcome addition to the literature on quantitative methods

for the analysis of language and meaning-making processes. Taking into
account textual and social contexts of language use including extra-textual
contexts the volume convincingly demonstrates that quantifying approaches
to discourse should not – and cannot – be reduced to the mere counting of
words. This book will be of use to students and researchers interested in
particular in the challenges posed by big data and technology-driven lan-
guage practices.”
—Alexandra Homolar, Associate Professor of International Security, Department
of Politics and International Studies, The University of Warwick, UK
“Bringing together a wide range of quantitative approaches to discourse analysis

which stretch far beyond the relatively well established methods of corpus lin-
vii
viii Praise for Quantifying Approaches to Discourse for Social Scientists
guistics, this volume provides a great overview for both beginners and experts in
the field of quantitative discourse analysis.”
—Professor Annika Mattissek, Professor for Economic Geography and Sustainable
Development, Department of Geography, University of Freiburg, Germany
“In the fast-moving field of text processing, keeping up with methodological

innovation can be a real challenge. This volume provides social scientists with a
rich and varied palette of approaches and tools, combining sound theoretical
foundations with practical advice on methods.”
—Professor Gerlinde Mautner, Institute for English Business Communication,
Vienna University of Economics and Business, Austria
“With the discourse turn in the social sciences, the need for a state of the art
guide to practice and theory of meaning construction is evident. In this volume,
leading British and continental scholars present quantitative and qualitative
methods of exploring discourse and the wider context into which texts are
embedded, while discussing and bringing together the approaches of Critical
Discourse Analysis and the Foucauldian dispositif. Long overdue!”
—Wolfgang Teubert, Emeritus Professor, Department of English Language and
Linguistics, University of Birmingham, UK
Contents
Part I Introductory Remarks 1
1 Understanding Twenty-First-Century Societies Using

Quantifying Text-Processing Methods 3
Ronny Scholz
2 Beyond the Quantitative and Qualitative Cleavage:

Confluence of Research Operations in Discourse Analysis 23
Jules Duchastel and Danielle Laberge
Part II Analysing Institutional Contexts of Discourses 49
3 The Academic Dispositif: Towards a Context-Centred

Discourse Analysis 51
Julian Hamann, Jens Maesse, Ronny Scholz, and Johannes
Angermuller
ix
x Contents
4 On the Social Uses of Scientometrics: The Quantification

of Academic Evaluation and the Rise of Numerocracy
in Higher Education 89
Johannes Angermuller and Thed van Leeuwen
Part III Exploring Corpora: Heuristics, Topic Modelling and

Text Mining 121
5 Lexicometry: A Quantifying Heuristic for Social Scientists

in Discourse Studies123
Ronny Scholz
6 Words and Facts: Textual Analysis—Topic-Centred

Methods for Social Scientists155
Karl M. van Meter
7 Text Mining for Discourse Analysis: An Exemplary Study

of the Debate on Minimum Wages in Germany183
Gregor Wiedemann
Part IV New Developments in Corpus-Assisted Discourse

Studies 213
8 The Value of Revisiting and Extending Previous Studies:

The Case of Islam in the UK Press215
Paul Baker and Tony McEnery
9 The Linguistic Construction of World: An Example of

Visual Analysis and Methodological Challenges251
Noah Bubenhofer, Klaus Rothenhäusler, Katrin Affolter, and
Danica Pajovic
Contents xi
10 Multi-method Discourse Analysis of Twitter

Communication: A Comparison of Two Global Political
Issues285
Jörn Stegmeier, Wolf J. Schünemann, Marcus Müller, Maria
Becker, Stefan Steiger, and Sebastian Stier
Index315
Notes on Contributors
Katrin Affolter is a PhD student in a joined program between the Zurich

University of Applied Science (ZHAW) and the University of Zurich (UZH),
Switzerland. The research topic of her PhD thesis is natural language interfaces
for databases. From 2012 to 2016, she studied computer science at the UZH,
focusing on computational linguistics and databases. In her master’s thesis
‘Visualization of Narrative Structures’, she developed a web application to
explore narrative structures based on an interactive graph visualization.
Johannes Angermuller is Professor of Discourse and the director of the ERC
DISCONEX research group at the Centre for Applied Linguistics at Warwick,
UK, and EHESS in Paris, France. He is a discourse researcher in linguistics and
sociology. His recent publications deal with academic and political discourses
and include books such as Poststructuralist Discourse Analysis (2014) and Why
There Is No Poststructuralism in France (2015), which have come out in English,
French, German, Turkish, Portuguese, and Spanish versions.
Paul Baker is Professor of English Language at the Department of Linguistics
and English Language, Lancaster University, UK, where he is a member of the
Corpus Approaches to Social Sciences ESRC Research Centre. He specializes in
corpus linguistics, using and developing corpus methods to carry out discourse
analysis, as well as being involved in research in media language, variation and
change and social identities. He has written 16 books, including Using Corpora
to Analyse Discourse (2006), Sexed Texts: Language, Gender and Sexuality (2008),
and Discourse Analysis and Media Attitudes (2013). He is also commissioning
editor of the journal Corpora (EUP).
xiii
xiv Notes on Contributors
Maria Becker is a doctoral researcher at the Department of Computational

Linguistics at Heidelberg University. She is also a researcher at Discourse Lab, a
research environment for digital discourse analysis at TU Darmstadt. She stud-
ied German philology, linguistics, philosophy, communication science, and psy-
chology at the universities of Heidelberg and Mannheim. For her PhD thesis,
she works on the automated reconstruction of implicit knowledge in argumen-
tative texts. Her research interests further include deep learning, corpus linguis-
tics, discourse analysis, media linguistics, spoken language and medical
communication.
Noah Bubenhofer is head of the Digital Linguistics group at the ZHAW
University of Applied Sciences in Winterthur, Switzerland. His research topics
are linguistics, corpus linguistics, and language and digitality. From 2015 to
2018, he was a senior researcher at the Institute of Computational Linguistics at
the University of Zurich. For his PhD thesis ‘Muster an der sprachlichen
Oberfläche’ (patterns at the linguistic surface), he developed corpus linguistic
methods for discourse and cultural analyses.
Jules Duchastel is Emeritus Professor of the Department of Sociology at the
Université du Québec à Montréal. His main areas of research are transformation
of the welfare state in Canada and Québec starting from the 1940s to present
time and computer-assisted analysis of political discourse. He has held, from
2001 to 2008, a Canadian Research Chair on Globalization, Citizenship and
Democracy and has founded the Centre for Computer Assisted Text Analysis
(ATO) in 1983.
Julian Hamann is a postdoctoral researcher at the Leibniz Center for Science
and Society at the Leibniz University Hannover, Germany. His research draws
on the sociology of social sciences and humanities, the sociologies of knowledge
and culture, and higher education studies. His present work touches on topics
like evaluation and boundaries, subjectivity and performativity, academic
knowledge and academic careers, and power and social inequality. His current
research has recently appeared in Poetics, Higher Education, Zeitschrift für
Soziologie, and History of Humanities.
Danielle Laberge is Emeritus Professor of the Department of Management
and Technology at the Université du Québec à Montréal. Throughout her career,
she has taught methodology and epistemology and published extensively with
Jules Duchastel on these questions. Her recent research deals with project man-
agement and governance.
Notes on Contributors xv
Thed van Leeuwen is a senior researcher at the Centre for Science and
Technology Studies (CWTS) of Leiden University in the Netherlands. He is co-
leading the research theme on Open Science, and the project leader of the Open
Science Monitor. As a member of the SES research group, other research topics
Thed is involved in relate to the evaluation of research, in particular in the social
sciences and humanities, as well in the ways research quality is perceived. The
overarching science policy context under which research assessments are orga-
nized and the role of bibliometric indicators therein are of major concern for
this research agenda. Thed is co-editor of the OUP journal Research Evaluation,
as well as associate editor of the Frontiers journal Research Metrics & Analytics.
Jens Maesse is Assistant Professor in the Department of Sociology, University
of Giessen. His research focus is on discourse analysis, sociology of science and
education, economic sociology and political economy. His publications include
‘Austerity discourses in Europe: How economic experts create identity projects’,
Innovation: The European Journal of Social Science Research 31 (1): 8–24 (2018).
‘The elitism dispositif. Hierarchization, discourses of excellence and organisa-
tional change in European economics’, Higher Education 73: 909–927 (2017).
Tony McEnery is Distinguished Professor of English Language and Linguistics
at Lancaster University. He is currently a Group Director (Sector Strategy) at
Trinity College London, on secondment from Lancaster University. Tony was
previously Director of Research and Interim Chief Executive at the UK’s
Economic and Social Research Council (ESRC). He was also the Director of the
ESRC Centre for Corpus Approaches to Social Science at Lancaster. He has
published extensively on corpus linguistics.
Karl M. van Meter is a research sociologist at the Centre Maurice Halbwachs
(ENS Paris) and an expert in sociological methods and methodologies. He is an
American-French citizen with university degrees from the US, the UK and
France. Although his PhD was in pure mathematics, he founded and directed
for 34 years the bilingual Bulletin of Sociological Methodology/Bulletin de
Méthodologie sociologique, which is now with Sage Publications. In his research
he uses mainly quantitative text processing methods with which he traces major
historical shifts in French, German and American sociologies, and the represen-
tation of politics in society.
Marcus Müller is full professor in German Studies—Digital Linguistics at the
Department of Linguistics and Literature, Technische Universität Darmstadt.
He studied German philology, romance studies and European art history at the
xvi Notes on Contributors
universities of Heidelberg and Granada. Müller leads the Discourse Lab, a

research environment for digital discourse analysis (http://discourselab.de/). His
research interests include digital linguistics, discourse analysis, language and art,
and science communication.
Danica Pajovic obtained her master’s degree in computational linguistics at
the University of Zurich. She worked on the project ACQDIVIZ: Visualising
Development in Longitudinal First Language Acquisition Data in the compara-
tive linguistics department and was a collaborator in the project Visual
Linguistics, led by Noah Bubenhofer at the University of Zurich.
Klaus Rothenhäusler is a junior researcher in the Digital Linguistics group at
the ZHAW University of Applied Sciences in Winterthur, Switzerland. He
received his master’s degree in computational linguistics from the University of
Heidelberg and specialized in distributional semantic models during his time at
the IMS Institute for Natural Language Processing in Stuttgart. Over the past
years he has worked in numerous digital humanities projects.
Ronny Scholz coordinates the ERC-funded DISCONEX project on academic
discourses at the University of Warwick, UK. He holds a master’s degree in dis-
course studies from the University Paris XII and a PhD in sociology and linguis-
tics from Magdeburg and Paris Est. His work focuses on the question of
legitimization of power in political discourses especially in the post-democratic
era. He uses lexicometric tools as quantifying heuristic helping to explore new
perspectives in various corpora of political discourse.
Wolf J. Schünemann is junior professor of political science with a focus on
Internet and politics at Hildesheim University. His research and teaching cover
the fields of Internet governance, international relations and European integra-
tion. After having studied political science, philosophy, German literature, and
media at Kiel University and Sciences Po in Rennes, France, he worked as a
research fellow and lecturer at the University of Koblenz-Landau. He received
his doctoral degree with a comparative discourse study of referendum debates in
France, the Netherlands and Ireland.
Jörn Stegmeier is a postdoctoral researcher at the Department of Linguistics
and Literature, Technische Universität Darmstadt. Together with Marcus
Müller, he heads the Discourse Lab, a research environment for digital discourse
analysis (http://discourselab.de). His research interests include digital linguis-
tics, corpus linguistics, and discourse analysis.
Notes on Contributors xvii
Stefan Steiger is a research associate at the University of Hildesheim and doc-

toral student at the Institute of Political Science at Heidelberg University. He
studied political science, history and philosophy at Heidelberg University. His
research interests include cybersecurity, Internet governance, political commu-
nication and foreign policy analysis.
Sebastian Stier is a postdoctoral researcher in the Department Computational
Social Science at GESIS—Leibniz Institute for the Social Sciences in Cologne,
Germany. He is an interim professor at the Institute of Political Science and the
NRW School of Governance of the University of Duisburg-Essen. He received
his PhD in political science at the University of Heidelberg. His main research
interests include political communication, comparative politics, populism and
computational social science methods.
Gregor Wiedemann is a post-doctoral researcher in the Language Technology
group of the computer science department at Hamburg University, Germany.
He studied political science and computer science in Leipzig and Miami. Due to
his interdisciplinary background, he has worked in several projects in the fields
of digital humanities and computational social science. In his research, he
focuses on methods and workflows to analyse large text collections with tech-
nologies from natural language processing.
List of Figures
Fig. 2.1 Confluences in discourse analysis 25

Fig. 2.2 Transformation of the text. The figure is an adaption of a
schema presented in Meunier (1993) 36
Fig. 3.1 Correspondence analysis of keywords and research interests of
UK sociologists—rows only represented 75
Fig. 3.2 Correspondence analysis of keywords and research interests of
UK sociologists—rows and columns represented 76
Fig. 3.3 Correspondence analysis of keywords and research interests
represented on the websites of professors in sociology depart-
ments in the UK 77
Fig. 5.1 Correspondence analysis of the German press corpus on the
financial crisis 2008 in the partition ‘month’ (Representation of
column names only) 134
Fig. 5.2 Correspondence analysis of the sub-corpus German press inter-
views on the financial crisis 2008 134
Fig. 5.3 DHC in the German press corpus on the financial crisis 2008
(Analysed with Iramuteq)137
Fig. 5.4 The dominating semantic field in the German press corpus on
the financial crisis 2008 139
Fig. 5.5 Over- and under-represented groups of words referring to dis-
course participants and discourse objects (partition ‘month’) 141
Fig. 5.6 Summary of the macrostructure of the financial crisis press
corpus142
xix
xx List of Figures
Fig. 5.7 Map of text sections of interviewees displaying prototypical

sentences of Angela Merkel’s interviews 143
Fig. 6.1 Factorial correspondence analysis of geographical terms in the
official biographies of members of the last Central Committees
of the USSR 159
Fig. 6.2 French Sociological Association (AFS) congress 2004 strategic
diagram of all abstracts (femme)164
Fig. 6.3 French Sociological Association (AFS) congress 2004 ‘femme’
(woman) cluster with its keywords, including ‘travail’ (work) 165
Fig. 6.4 French Sociological Association (AFS) congress 2004 strategic
diagram of all abstracts (without femme)166
Fig. 6.5 Strategic diagram of the first four months of the 2006
Association for the Right to Information corpus 171
Fig. 6.6 2006 keywords’ attractive power over the three four-month
periods175
Fig. 6.7 Dominant 2007-1 terms over the four periods of 2007–2008
(Bush vs. UN) 176
Fig. 7.1 Document frequency of articles on minimum wages in two
German newspapers 193
Fig. 7.2 Area plot of topic distribution over time 199
Fig. 7.3 Relative frequencies of documents containing stances on mini-
mum wages 203
Fig. 8.1 Average number of articles about Islam per newspaper per
month, 1998–2014 220
Fig. 8.2 Proportion of mentions of different branches of Islam for each
newspaper237
Fig. 8.3 References to Sunni, Shia, Sufi, Salafi and Wahhabi over time.
Dark grey denotes the proportion of mentions of references to
branches of Islam (e.g. Sunni, Shia, Wahhabi); light grey bars
denote references to Islam 238
Fig. 8.4 Summary of all data, comparing proportions of change over
time241
Fig. 8.5 Claimed causes of radicalisation in the press in 1998–2009 244
Fig. 8.6 Claimed causes of radicalisation in the press in 2010–2014 245
Fig. 8.7 Claimed causes of radicalisation in the press in 2014 246
Fig. 9.1 Geocollocations control panel 262
Fig. 9.2 Geocollocations map view 264
Fig. 9.3 Dorling diagram view 265
List of Figures xxi
Fig. 9.4 Reduced Dorling view, comparison of selected countries:

collocate Beziehung (relationship) 267
Fig. 9.5 Map view, selection of collocates Migration, Flüchtlinge
(migration, refugees)—Spiegel/Zeit corpus 2010–2016 270
Fig. 9.6 Map view, selection of collocates Migration, Flüchtlinge
(migration, refugees)—Spiegel/Zeit corpus 1945–1960 271
Fig. 9.7 Close view on the collocates in the migration discourse—
Spiegel/Zeit corpus 2010–2016 272
Fig. 9.8 Javascript library ‘D3.js’, ‘visual index’ of examples on the
website277
Fig. 10.1 Tweets on #ClimateChange 294
Fig. 10.2 Tweets on #NetNeutrality 295
Fig. 10.3 Network analysis of the tweets on #ClimateChange 297
Fig. 10.4 Network analysis of the tweets on #NetNeutrality 298
List of Tables
Table 3.1 Three levels of analysis 56

Table 3.2 Four ideal typical dimensions of social context for the
analysis of discourses 59
Table 4.1 National evaluation schemes in some Western higher
education systems 104
Table 7.1 Topics terms and shares 196
Table 7.2 Text classification of stances on minimum wages 202
Table 8.1 The structure of the two newspaper corpora 219
Table 8.2 Collocates of Muslim women 229
Table 8.3 Patterns around veiling for Muslim women 232
Table 8.4 Patterns around veiling—change over time (summary) 233
Table 8.5 Arguments against veiling 234
Table 8.6 Collocates of Muslim men 235
Table 8.7 Levels of belief 239
Table 8.8 Extremism keywords 243
Table 10.1 Geolocated tweets and retweets of the ten most frequent
countries293
Table 10.2 The ten most represented countries in both samples 296
Table 10.3 Categories derived from context analysis 302
Table 10.4 Categorised keywords for tweets containing #ClimateChange 303
Table 10.5 Categorised keywords for tweets containing #NetNeutrality 305
xxiii
Part I
Introductory Remarks
1
Understanding Twenty-First-Century
Societies Using Quantifying Text-
Processing Methods
Ronny Scholz
1 Analysing Knowledge-Based
Post-industrial Societies: Challenges
and Chances
During the last 50 years, Western societies have experienced substantial
changes. The phenomena of Europeanisation and globalisation as well as
technical innovations such as the Internet and social media have revolu-
tionised the way we use language when interacting, socialising with each
other, or storing and recalling knowledge. In fact, the Internet has fos-
tered access to globally produced information. In the abundance of
sometimes contradicting information, the formation of knowledge in
I am thankful to Malcolm MacDonald, Joy Malala and Yannik Porsché for their helpful comments
on earlier versions of this text.
R. Scholz (*)
Centre for Applied Linguistics, University of Warwick, Coventry, UK
e-mail: r.scholz@warwick.ac.uk
© The Author(s) 2019 3

R. Scholz (ed.), Quantifying Approaches to Discourse for Social Scientists,
Postdisciplinary Studies in Discourse, https://doi.org/10.1007/978-3-319-97370-8_1
4 R. Scholz
discourses with a particular inherent logic becomes evident. Moreover,

recent debates in the public sphere about gender identities, national and
cultural identities, and post-truth evidence an increased self-reflexivity in
Western societies, emphasising the construction of knowledge as pivotal
for the way we make sense of ourselves and our surrounding world.
Drawing from Foucault, Althusser, Pêcheux, Laclau, and others, discourse
studies has developed as a transdisciplinary research field that responds to
a need for a better understanding of how contexts influence the making
of meaning and knowledge on various levels of social interaction.
This volume is a compilation of papers that present sophisticated quan-
tifying methods used to analyse textual and social contexts of language
use. The volume covers a large range of quantifying methods that could be
used by social scientists who investigate the construction of knowledge in
society. Before presenting the texts that have been compiled for this vol-
ume at the end of this introduction, the first part of the introduction will
outline how trends in society have contributed to discourse becoming an
object worthy of social sciences research. The second part will explain the
importance of context for analysing the construction of meaning in differ-
ent discourses. The third part of the introduction discusses the benefits of
using quantifying methods to analyse meaning-making process in the
digital age, and the last part presents the purpose of this volume.
2 ocietal Trends Influencing Knowledge

S
Production and Knowledge Exchange
This book showcases methods developed in natural language processing,
quantitative linguistics, corpus linguistics and statistics for the analysis of
discourses across the social sciences. Discourses have become a preemi-
nent object of investigation across the social sciences and humanities
because they have become palpable phenomena in everyday life. This is
mainly due to an increased amount of coherent information contradict-
ing previously established knowledge formations or narratives. The dis-
cursivity of knowledge is evidenced in a number of societal developments:
(1) an increase in expert-based legitimation of decisions and competing,
sometimes contradictory, expertise; (2) an increasingly rapid exchange of
information on a global scale, sometimes contradicting mainstream
Understanding Twenty-First-Century Societies Using… 5
media narratives; and (3) a democratisation of information production

through social media in which communities are formed beyond national
borders, and in which historical events can be recontextualised by contra-
dicting a narrative circulated, for instance, in the mass media.
First, in post-industrial societies, work, as a primary object for socio-
logical research, is often knowledge-based. Hence, the construction of
knowledge and the conditions of its emergence become crucial for better
understanding societies, and their dynamics and structuration. Moreover,
its continuous production leads to a constant renewal of knowledge and
thus, to a higher dynamic and fluidity of established bodies of knowledge.
For example, knowledge about societal structures, including national,
cultural, social and political boundaries seems to be constantly challenged
by cross-boundary influences and experiences. One way of dealing with
the emerging complexity of societies, is the instalment of knowledge
authorities referred to as experts. They help to interpret, order, evaluate
and interlink the abundance of produced information that becomes
knowledge. Expertise, however, is the result of a knowledge industry that
extends from classical universities to think tanks, which compete in the
public space for recognition. Nowadays, political decisions are often legit-
imised by an army of experts and interpreted by political agents in line
with their political argument (Maesse 2015). Some scholars have argued
that the demonstration of scientific information has become part of a
range of governance techniques contributing to the construction of
Europe as a political and scientific community (Rosental 2015).
Second, faster routes of communication have helped to raise awareness of
political problems in the most remote areas on the planet. Nevertheless, the
ever-larger news corporations and the deregulation of the media landscape
have led to increasing competition in news production, favouring the cover-
age of dramatic events and information, which guarantee a high news value
(Luhmann 2000), attracting high numbers of audience. Failing to provide
balanced background information about a political issue enabling the citi-
zens to engage fully with a political debate, these infrastructural changes
have led to a new media style, a superficial ‘infotainment’, which can eas-
ily be argued against with similar superficial but contradicting information.
Third, important technological changes have taken place that impact on
the way we socialise. Some scholars have stressed the importance of new
media for economic development and political innovation (Barry 2001).
6 R. Scholz
Indeed, new communication devices have enhanced communication on a

global scale leading, among others, to similar cultural trends triggered by a
globalised cultural industry. Moreover, social media have opened up a new
social arena enhancing socialisation beyond geographical and social bound-
aries. Nowadays, personal and collective identities are often influenced to a
substantial extent by experiences in the virtual space of social media. At the
same time, knowledge emerges and is reproduced much more dynamically
than it used to be. Large sets of information are available through the
Internet and can be accessed and reproduced individually at any time and
as often as desired. Smartphones and tablets have led to a popularisation of
information production. Thereby the Internet functions as a new social
location, in which socialisation can take place often without control or
influence of mainstream political stakeholders. This new social location has
not only fostered democratic debates, it has also enhanced a particularisa-
tion of discourse communities, in which the flourishing of counter-dis-
courses and identity-building that oppose official political positions is
facilitated. This has given rise to populist and extremist discourses all over
the world evidenced by the political success of movements like PEGIDA,
SYRIZA, and PODEMOS or politicians like Trump, Orban, Duterte, or
Le Pen. Social media, such as Facebook or Twitter, especially gives indi-
viduals easy access to communicating with the relatively anonymous
masses. The twofold constellation of mainstream and social media has pro-
vided the grounds for a broader reproduction of knowledge, in which cur-
rent and historical events can be recontextualised. As a result of the manifold
recontextualisations available in the public space, various interpretations,
narratives, and representations of historic and contemporary events seem
to circulate in society that not only researchers but also the wider public
refers to as ‘discourses’. In sum, we can identify a number of aspects that
render discourse a relevant research object for social scientists.
3 Discourse, Context, and Meaning

The overall objective of discourse studies is to analyse language use in
context in order to better understand the social construction of meaning
and its influence on the knowledge circulating in any given society or
part of it. Thus, social scientists doing discourse analysis will, for instance,
gain insights into: how knowledge about groups, communities, and social
identities is constructed in discourses (inclusion and exclusion; gender,
religion, race, class, nation); how this structuration is justified; how social
spaces and positions are constructed, negotiated, and orchestrated; which
narratives and ideologies drive different actors, groups, and communities
in society; and how decisions in a given society or a part of it are being
legitimised. Moreover, with its capacity to map prevailing argumenta-
tions and narratives, discourse studies can reveal how values are articu-
lated in a particular way in order to justify the stance of a specific group,
social strata, or class. In this sense, discourse analysts can show how soci-
ety is governed and hence can feed into the formulation of a social cri-
tique that contributes to social progress (Herzog 2016).
Foucault’s philosophy has helped to understand the formation of knowl-
edge in terms of discourses that organise knowledge. Most importantly he
has insisted on the fact that discourses are driven by power relations that
are rooted in institutional, social, societal, and historical contexts in which
language users have to operate (Foucault 1970, 1972, 1979). Foucault’s
theoretical categories have informed discourse analytical approaches not
only in France but across the globe—to name only a few, the Discourse
Linguistic Approach (Warnke and Spitzmüller 2008) and the Sociology of
Knowledge Approach (Keller 2013) in Germany or the Discourse
Historical Approach (Reisigl and Wodak 2016) and Critical Discourse
Analysis (Dijk 1997; Fairclough 1995; Wodak and Meyer 2001). What is
common to all approaches in discourse studies is their fundamental inter-
est in meaning construction through natural language use in context.
There are numerous definitions of discourse. I will touch upon two
which best fit the purposes of this volume. First, there is Busse and
Teubert’s definition which is common in German discourse linguistics.
They define discourse as a ‘virtual text corpus’ containing all sorts of texts
that have been produced on a particular topic. In order to analyse a dis-
course, a researcher has to compile a ‘concrete text corpus’ which is com-
piled from a representative selection of texts of the ‘virtual text corpus’
(Busse and Teubert 2014, 344). This definition might satisfy corpus lin-
guists, but if we want to analyse discourse practices from a perspective that
accommodates the broader spectrum of social sciences and humanities,
8 R. Scholz
we have to add elements that refer to the social structure, the context of

circulation, and the actions that allow an utterance to become effective.
Achard’s (1993, 10) definition seems to suit this demand somewhat bet-
ter. According to him, a discourse emerges from language practice in a
situation in which a speech act is considered as effective as a result of its
relation to a set of other linguistic and non-linguistic acts. Consequently,
the analysis of discourse has to put emphasis on institutional and social
context in which interlocutors relate to each other. At the same time, con-
text cannot be understood as a stable, stereotypical, neutral, and self-
contained entity, in which everything seems to happen (Blommaert 2005,
56). The conceptualisation and operationalisation of context is in fact a
necessary analytical step in order to understand how meaning emerges in
discourses. Discourse studies is a broad field of research integrating differ-
ent disciplines and their foci on specific aspects of society, materiality, and
context (Beetz and Schwab 2018). Hence, the understandings of contexts
are quite diverse and multilayered (Porsché 2018, 81–129). This volume
presents a collection of texts using quantitative methods which are applied
to a range of inside and outside perspectives on language use. Whereas the
former focuses on textual contexts (co-text) of discourse or the construc-
tion of context from within a particular situation of communication, the
latter emphasises its institutional, social, societal, and historical contexts
(Leimdorfer 2011, 74).
Foucault’s concept of the dispositif articulates the inside and outside
perspective on language use (Foucault 1972; Raffnsøe et al. 2016). This
concept captures the nexus of power and knowledge reflecting institu-
tional constraints in interpretive processes. Studies referring to this
concept often distinguish, on the one hand, between a meso- and
macro-level of society providing the institutional and societal struc-
tures in which discourses emerge, and, on the other hand, a micro-level
on which discourses are performed (Bührmann et al. 2007). Therefore,
the term ‘discourse’ often refers to institutionalised practices which fol-
low certain sets of rules fixed over time. These practices are considered
central for particular fields and areas of society and are evidenced in
institutionalised ways of speaking (Bührmann and Schneider 2007, 5).
The analysis focuses on the interplay between situational contexts
and practices with discourses or the constitution of contexts through
discourses (Keller 2013). In this volume, Hamann et al. (Chap. 3) use

quantitative corpus methods to study the dispositif of academic
discourse.
Studies highlighting the analysis of dispositifs usually distinguish
between discursive and non-discursive practices in discourse production
(Bührmann and Schneider 2007). From a perspective of French discourse
analysis or a post-structuralist discourse theory, the assumption of a pre-
discursive world, in which non-discursive elements have their place and
in which subjects can choose their position freely, would be considered as
problematic. Rather than asking how the knowledge is constructed,
which a subject then uses to make a sense, post-structuralist approaches
would ask how a discourse is constructed, in which subjects are placed
and can emerge in a particular position. In this sense, pre-discursive ele-
ments can become relevant to discourse, and subsequently to society in
general, only if they are articulated with a particular meaning in relation
to other discursive elements. This idea goes back to Vološinov’s philoso-
phy of language according to which the emergence of meaning is a result
of a dialectic relation between the concrete utterance and the language
system which, in opposition to Saussure, cannot be regarded as indepen-
dent from communication. It is only in concrete social relation of people
that a linguistic sign, together with the language system, can acquire
meaning (Vološinov 1973, 157–158).
Against this background, French discourse analysts have tried to go
beyond a structuralist perspective to better account for discursive prac-
tices. Socio-pragmatic approaches to discourse tried to integrate contex-
tual variables with the analysis of utterances. Aiming to capture social
and cognitive dynamics, which are triggered by forms and implications
on the textual level, they study the construction of meaning in its social
context (e.g. Bronckart 1996). There are a couple of strands highlighting
different aspects of context. Enunciative pragmatics has developed since
the 1980s. It focuses on the reflexivity of the speech activity allowing the
speakers to convert the system of language into discourse (Angermuller
et al. 2014, 135–139; Authier-Revuz 1984; Kerbrat-Orecchioni 1980;
Maingueneau 1997; Reboul and Moeschler 1998). Furthermore, a socio-
linguistic tradition has developed that looked at how institutional struc-
tures influence the use of language, analysed through the perspective of a
10 R. Scholz
sociology of language (Achard 1993; Bacot and Rémy-Giraud 2011;

Boutet et al. 1995). The operationalisation of an analysis of discourse
practices has been taken most seriously in the praxematic approach to
discourse cumulating in the analytical concept of the ‘praxem’ (Lafont
1978; Lafont et al. 1983; Siblot 1990, 1997). A similar idea has influ-
enced sociologists in Québec when they analysed society as a result of the
work on meaning production (Bourque and Duchastel 1984, 118). Here,
the internal dynamics in discourses are considered to have a particular
impact on this process (Bourque and Duchastel 1988, 51).
4 hallenges and Chances for Discourse

C
Research with Quantifying Methods
Even though the societal changes aforementioned have rendered dis-
courses important for social scientists, they also confront researchers with
various challenges: Texts are produced and distributed at an enormous
speed, in substantial numbers, by a large number of institutional and
non-institutional actors; political discourses often emerge in an interna-
tional institutional context and are reproduced in different languages (for
a sociolinguistic account of this issue: Duchêne and Heller 2012). For
example, the debates on the ‘Euro crisis’, the international trade agree-
ment TTIP, or leaked document collections trigger political discourses in
different countries. Moreover, computers influence more and more inter-
actions between humans. Complex algorithms respond to our actions in
virtual space and robots emulate individuals in social networks (Hegelich
and Janetzko 2016) so that the concept of the author and the dialogue
with such algorithmic agents (Antonijevic 2013, 95–97; Tufekci 2015)
may need a new set of analytical categories and methods.
The fact that large amounts of natural language data are stored in digi-
tised form and can be accessed through the Internet is, in fact, to the
advantage of computer-based methods as they are presented in this vol-
ume. There are now large databases for digitised political texts and press
texts. In the ‘digital humanities’, enormous efforts are undertaken to cre-
ate large digital archives of historical and literature texts from past
centuries (e.g. the CLARIN centres). The abundance of available textual
data creates methodological challenges that are similar to those discussed

around ‘big data’. Some argue that ‘big data’ will radically change the way
in which we make sense of the world to the extent that theoretical and
abstract models become obsolete. Others make a case for the opposite,
contending for theories and methods of interpretation as necessary to
maintain an analytical stance in the abundance of information (Boyd and
Crawford 2012; Gonzalez-Baillón 2013; Schroeder 2014). Discourse
researchers also have to face these points of debate. They have to discuss:
To what extent can the analysis be based on text-processing software solu-
tions? To what extent is there a need to understand the underlying algo-
rithms in order to interpret the results? How important are discourse
theories for the interpretation? How can results produced with quantita-
tive methods complement qualitative approaches to discourse?
The ‘digital age’ has indeed changed the way texts are produced, stored,
made available, and read. This does not remain without consequences for
discursive practices. Social scientists that try to capture and study the
production of knowledge under these circumstances need theories and
methods that are able of accounting for the emergence of meaning in the
different social, historic, and textual contexts, in which phenomena
become meaningful and therefore can become social. Thus the conceptu-
alisation of context is as diverse as the disciplines that have subscribed to
this transdisciplinary field. Consequently, discourse analysts have dis-
cussed on an epistemological (Maingueneau 2013; Paveau 2017) and a
methodological level (Norris and Jones 2005; Jones et al. 2015) how to
grasp social phenomena in the digital age. In this regard there has been a
special emphasis on the analysis of discourses as perpetuated via social
and new media (Barats 2013; KhosraviNik 2016).
Complementary to these approaches, this volume emphasises the use
of quantifying methods. Indeed, corpus linguistic methods have been
used to analyse discourses for a long time, especially in France, where the
lexicometric approach has been developed since the 1970s (see Chap. 5).
Also, in Critical Discourse Analysis, the benefits of corpus methods, such
as a better navigation in large text collections based on quantification,
have been advocated for some time (Baker 2006; Mautner 2005, 2012)
and are now being established in the field under the acronym CADS for
Corpus-Assisted Discourse Studies (see Part IV in this volume and
12 R. Scholz
Partington et al. 2013). Corpus methods applied in discourse studies can

help to gain an overview over dominant discourse participants, discursive
positions, topics, and arguments, and their interrelations relatively
quickly. They can trace discursive dynamics over time within these dis-
cursive structures. And they permit the analyst to ‘zoom in’ into parts of
the data which have proven to be of particular interest within the research
process. In this sense corpus tools allow organising the research process in
quantitative and qualitative stages—whereby the researcher can develop
hypotheses and theories by alternating repeatedly in a reflexive loop
between the analysis of the research material on different levels and the
interpretation of the different results (Demazière et al. 2006; Leimdorfer
and Salem 1995).
5 The Purpose of This Volume

Recent publications using corpus methods in discourse studies are numer-
ous. Most of these publications focus on a particular approach such as
corpus linguistics (Baker and McEnery 2015) or text mining (Ignatow
and Mihalcea 2017; Wiedemann 2016). Some of them are very good
introductions into technical and/or statistical aspects of the analysis
(Biemann and Mehler 2014; Jockers 2014). The present volume is com-
plementary to these works in that it takes a global perspective covering a
broad range of quantifying methodologies that are used to analyse dis-
courses. The volume aims to provide a useful introduction to researchers
in social sciences facing the challenges of new technology-driven language
practices and big data. It goes beyond the simple analysis of the textual
level by taking into account the extra-textual context that is essential in
order to understand how meaning is constructed in society. The aim is to
give an overview of the broad range of quantifying corpus methods that
can be deployed in order to address different levels of the co-text and the
context. What all empirical contributions in this volume have in com-
mon is the use of some sort of quantification that reorganises the natural
language data. The book is driven by a reflexive mindset regarding the
possibilities and limits of quantifying methods, which are understood as
complementary not as in opposition to qualitative methods.
The authors of this volume were asked to present their methods in an

accessible way to corpus beginners and experts alike. For the first time,
different national and disciplinary traditions of quantifying methods in
discourse studies that have been developing to a great extent in parallel
without taking much notice of the other approaches are presented in this
collective volume. Moreover, this book draws on experiences from schol-
ars in the Anglo-Saxon, German, and French-speaking academic world
that work with similar methods, yet meet rarely, due to language, disci-
plinary, or institutional boundaries. With texts coming from authors in
sociology (Chaps. 2 and 3), scientometrics (Chap. 4), quantitative lin-
guistics and lexicometry (Chaps. 5 and 6), computational linguistics and
text mining (Chap. 7) as well as corpus linguistics and political science
(Chaps. 8, 9 and 10), the volume has a strong transdisciplinary outlook.
The book is divided into Parts I–IV and comprises ten chapters. After
the introductory remarks in the two chapters of Part I, the two following
chapters of Part II look into how to integrate institutional contexts into
discourse analysis. The three chapters in Part III set out complex algo-
rithms developed in quantitative linguistics, lexicometry, and text min-
ing. The benefit of such methods is clearly their heuristic strength. They
take into account the complete vocabulary of all texts in the corpus at
once and enable the development of new ideas not only on topics touched
upon in a given corpus but also on the macro-structure of the discourse
represented in it. Part IV covers new developments in Computer-Assisted
Discourse Studies. All three chapters use classical corpus linguistic meth-
ods such as collocation analysis and keyword analysis and bring them to
new splendour by taking innovative methodological steps.
Chapter 1 has introduced the volume. By outlining societal develop-
ments that have rendered discourses a prominent research object across
the social sciences, it has advocated the benefits of quantifying text-
processing methods as a means of studying societies in the digital age.
Second, the text has addressed some of the challenges discourse analysts
have to face when studying digitised media communication. Third, it has
further highlighted the importance of quantitative methods for an analy-
sis of contextuality on different levels.
Chapter 2 takes the reader on a methodological journey which will
help him/her to stop thinking of research in terms of quantitative
14 R. Scholz
and qualitative paradigms. After a short overview of theories that have

influenced discourse studies, Jules Duchastel and Danielle Laberge
present a general model of how social reality is being analysed with text-
processing methods. By highlighting strengths and weaknesses of quali-
tative and quantitative methods the text advocates an integrative mixed
methods approach. The text shows that scientific interpretation is
bound to specific explanatory operations that pave the way for a par-
ticular understanding of the world. Any scientific process, no matter
whether qualitative or quantitative, is based on a common ground
mobilising research operations for the identification of units, their
description and their analysis. While the analytical paradigms differ in
their epistemological and methodological assumptions, they are facing
the same problem of reducing and restoring complexity.
Chapter 3 outlines the dispositif approach, which combines a linguistic
discourse analysis of texts with a sociological study of the social context
(i.e. the dispositif understood as an institutional arrangement of practices
and structures). Julian Hamann, Jens Maesse, Ronny Scholz, and
Johannes Angermuller use the discourse of academic researchers to exem-
plify this approach. By articulating correspondence analysis of self-
representations on researchers’ homepages with institutional data of
sociology professors in the United Kingdom, they outline a research
design that consists of three components: a linguistic analysis of texts, a
sociological analysis of institutional contexts, and a theoretical account of
how the two are related in the academic dispositif. The dispositif perspec-
tive on discourse aims to respond to a demand for systematic discourse
research on the social and institutional contexts of discursive practices.
Chapter 4 presents scientometrics as a type of corpus research which
measures the scientific output of academic researchers and analyses
underlying differences and inequalities among researchers based upon
their scientific outputs. Johannes Angermuller and Thed van Leeuwen
discuss the history of the field since Eugene Garfield launched the Science
Citation Index in 1963 and investigate its practices and indicators, such
as the Journal Impact Factor or the h-index. This contribution places the
development of the field in the context of the rise of ‘numerocracy’—a
regime of power knowledge which aims at governing large populations by
numbers. By applying and extending Michel Foucault’s governmentality
thesis, the authors point out the non-scientific conditions of scientific

practices in corpus research and make the case for a socially reflexive
approach.
In this contribution, Ronny Scholz draws a connection between
Bachelard’s concept of ‘epistemic rupture’ and quantitative methods
which allows the discovery of discursive phenomena prior to the interpre-
tation of meaning in texts. Lexicometry is a corpus-driven approach that
deploys, besides common corpus linguistic methods, complex algorithms
that exhaustively analyse the lexis of a given corpus. It does so by con-
trasting different corpus parts organised in partitions. Taking examples
from a corpus of 4000 press texts on the global financial crisis of 2008,
the contribution illustrates how a large text corpus can be reduced sys-
tematically to a readable size. It also demonstrates different ways of
exploring lexicosemantical macro-structures using correspondence analy-
sis, descending hierarchical classification, and other methods.
Chapter 6 explains how complex statistical methods such as factorial cor-
respondence analysis, both descending hierarchical classification (Alceste,
Topics) and ascending hierarchical classifications (Leximappe- Lexinet,
Calliope) can be used to study which concepts and topics dominate a par-
ticular discourse in society at a certain period in time. Karl M. van Meter
demonstrates how semantic and thematic shifts can be traced over time and
which future developments might be more or less probable. The examples
are taken from two projects: first, a synchronic and diachronic analysis of a
corpus of conference abstracts submitted to the annual conferences of the
American, French, and German national associations of sociology; second,
drawing from Pete Dale Scott’s concept of ‘World Parapolitics’, a diachronic
perspective on the representation of conflicts in the international press.
Chapter 7 develops a discourse analysis approach based on the many
opportunities provided by text mining. Gregor Wiedemann introduces
unsupervised and supervised machine learning techniques to analyse a cor-
pus covering twenty years of public discourse on statutory minimum wages
in Germany. His contribution demonstrates how topic modelling can be
used to reveal thematic clusters on the macro-level, and how text classifica-
tion is able to trace utterances of political stance on the micro-level of a dis-
course. In particular, the combination of data-driven clustering and
theory-driven classification allows for complex analysis workflows on very
16 R. Scholz
large text collections, thus making qualitative aspects of diachronic discourses

quantifiable.
In Chap. 8, Paul Baker and Tony McEnery introduce CADS, a means
of using the methods of corpus linguistics to facilitate discourse analysis
of large volumes of textual data. The chapter uses this framework not
only to demonstrate the value of CADS but also to explore the impor-
tance of repeating studies over time to test the degree to which discourse
is static, or changes, through time. By extending a study of the represen-
tation of Muslims and Islam in the UK press, the chapter shows the value
of exploring the dynamic nature of discourse as a way of cautioning
against the idea that discourse is necessarily stable across time.
In Chap. 9, Noah Bubenhofer, Klaus Rothenhäusler, Katrin Affolter,
and Danica Pajovic discuss common approaches using data visualisations
within the field of digital humanities. They argue that by assigning equal
importance to the development, as well as the usage of a visualisation
framework, researchers can question dogmatic ‘best-practice’ norms for
data visualisations which may prevent them from developing visualisa-
tions that can be used to find emergent phenomena within the data. They
then focus on the question of how visualisations reconstitute language by
using diagrammatic operations. Working with digital visualisations, the
technological background is of great importance for the interpretation
and the development of new tools. As an example, they present a
visualisation framework for ‘geocollocations’, which can be used as a tool
to detect words that typically collocate with toponyms in text corpora.
In Chap. 10, Jörn Stegmeier, Wolf J. Schünemann, Marcus Müller,
Maria Becker, Stefan Steiger, and Sebastian Stier present a multi-method
discourse analytical approach to analyse Twitter communication on two
political issues of global concern: environmental policy/climate change
and Internet governance/net neutrality. Their corpus is compiled from
Twitter messages containing #NetNeutrality or #ClimateChange, which
the authors gathered between January and March 2015. First, they map
and compare the geographical landscapes of the two policy fields by using
geolocation information from the Twitter API and the Data Science
Toolkit. Second, they carry out a comparative network analysis defining
Twitter users as nodes, and Retweets (RT) and mentions (@) as links.
Finally, the authors apply keyword analysis to identify discursive pat-
terns. Combining these three methods allows the authors to assess the
degree of transnationalisation in the two fields.
References
Achard, Pierre. 1993. La sociologie du langage. Paris: PUF.
Angermuller, Johannes, Dominique Mainguenau, and Ruth Wodak, eds. 2014.
The discourse studies reader. Main currents in theory and analysis. Amsterdam:
John Benjamins.
Antonijevic, Smiljana. 2013. The immersive hand: Non-verbal communication
in virtual environments. In The immersive internet. Reflections on the entan-
gling of the virtual with society, politics and the economy, ed. Dominic Power
and Robin Teigland, 92–105. Basingstoke: Palgrave Macmillan.
Authier-Revuz, Jacqueline. 1984. Hétérogénéité(s) énonciative(s). Langages 73:
98–111.
Bacot, Paul, and Silvianne Rémy-Giraud. 2011. Mots de l’espace et conflictualité
sociale. Paris: L’Harmattan.
Baker, Paul. 2006. Using corpora in discourse analysis. London: Continuum.
Baker, Paul, and Tony McEnery. 2015. Corpora and discourse studies. Integrating
discourse and corpora, Palgrave advances in language and linguistics. Basingstoke:
Palgrave Macmillan.
Barats, Christine, ed. 2013. Manuel d’analyse du web en Sciences Humaines et
Sociales. Paris: Armand Colin.
Barry, Andrew, ed. 2001. Political machines. Governing a technological society.
London: Athlone Press.
Beetz, Johannes, and Veit Schwab, eds. 2018. Material discourse—Materialist
analysis. Lanham, MD: Lexington Books.
Biemann, Chris, and Alexander Mehler. 2014. Text mining. From ontology learn-
ing to automated text processing applications – Festschrift in honor of Gerhard
Heyer. Translated by Gerhard Heyer. Theory and Applications of Natural
Language Processing. Cham: Springer.
Blommaert, Jan. 2005. Discourse, a critical introduction. Cambridge: Cambridge
University Press.
Bourque, Gilles, and Jules Duchastel. 1984. Analyser le discours politique
duplessiste: méthode et illustration. Cahiers de recherche sociologique 2 (1, Le
discours social et ses usages): 99–136.
18 R. Scholz
———. 1988. Restons traditionnels et progressifs. Pour une nouvelle analyse du

discours politique. Le cas du régime Duplessis au Québec. Montréal: Les Éditions
du Boréal.
Boutet, Josiane, Bernard Gardin, and Michèle Lacoste. 1995. Discours en situ-
ation de travail. Langages 29 (117): 12–31.
Boyd, Danah, and Kate Crawford. 2012. Critical questions for big data.
Information, Communication & Society 15 (5): 662–679.
Bronckart, Jean-Paul. 1996. Activité langagière, textes et discours; pour un interac-
tionnisme socio-discursif. Lausanne: Delachaux et Niestlé.
Bührmann, Andrea D., Rainer Diaz-Bone, Encarnación Gutiérrez-Rodríguez,
Werner Schneider, Gavin Kendall, and Francisco Tirado. 2007. Von Michel
Foucaults Diskurstheorie zur empirischen Diskursforschung. Forum Qualitative
Sozialforschung/Forum: Qualitative Social Research 8(2): Editorial FQS.
Bührmann, Andrea D., and Werner Schneider. 2007. Mehr als nur diskursive
Praxis? – Konzeptionelle Grundlagen und methodische Aspekte der
Dispositivanalyse. Forum Qualitative Sozialforschung/Forum: Qualitative
Social Research 8 (2): Art. 28.
Busse, Dietrich, and Wolfgang Teubert. 2014. Using corpora for historical
semantics. In The discourse studies reader. Main currents in theory and analysis,
edited by Johannes Angermuller, Dominique Mainguenau and Ruth Wodak,
340–349. Amsterdam: John Benjamins. Original edition, 1994.
Demazière, Didier, Claire Brossaud, Patrick Trabal, and Karl Van Meter, eds.
2006. Analyses textuelles en sociologie. Logiciels, méthodes, usages. Rennes:
Presses universitaires de Rennes.
van Dijk, Teun A., ed. 1997. Discourse as social interaction. Discourse studies.
London: Sage.
Duchêne, Alexandre, and Monica Heller. 2012. Language in late capitalism.
Pride and profit. New York: Routledge.
Fairclough, Norman. 1995. Critical discourse analysis. The critical study of lan-
guage. London: Longman.
Foucault, Michel. 1970. The order of things. An archaeology of the human sciences
(World of man). London: Tavistock. Original edition, 1966.
———. 1972. The archeology of knowledge. London: Tavistock. Original edi-
tion, 1969.
———. 1979. Discipline and punish. The birth of the prison. New York: Vintage
Books. Original edition, 1975.
Gonzalez-Baillón, Sandra. 2013. Social science in the era of big data. Policy &
Internet 5 (2): 147–160.
Hegelich, Simon, and Dietmar Janetzko. 2016. Are social bots on Twitter polit-
ical actors? Empirical evidence from a Ukrainian social botnet. Proceedings of
the Tenth International AAAI Conference on Web and Social Media. Accessed
July 1, 2018. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/
paper/view/13015.
Herzog, Benno. 2016. Discourse analysis as social critique—Discursive and non-
discursive realities in critical social research. Basingstoke: Palgrave Macmillan.
Ignatow, Gabe, and Rada Mihalcea. 2017. Text mining. A guidebook for the social
sciences. Los Angeles: SAGE.
Jockers, Matthew Lee. 2014. Text analysis with R for students of literature
(Quantitative Methods in the Humanities and Social Sciences). Cham:
Springer.
Jones, Rodney H., Alice Chik, and Christoph A. Hafner, eds. 2015. Discourse
and digital practices. Doing discourse analysis in the digital age. London:
Routledge, Taylor & Francis Group.
Keller, Reiner. 2013. Doing discourse research. An introduction for social scientists.
London: Sage.
Kerbrat-Orecchioni, Catherine. 1980. L’Énonciation. De la subjectivité dans le
langage. Paris: Armand Colin.
KhosraviNik, Majid. 2016. Social Media Critical Discourse Studies (SM-CDS):
Towards a CDS understanding of discourse analysis on participatory web. In
Handbook of critical discourse analysis, ed. John Flowerdew and John
E. Richardson. London: Routledge.
Lafont, Robert. 1978. Le travail et la langue. Paris: Flammarion.
Lafont, Robert, Françoise Madray-Lesigne, and Paul Siblot. 1983. Pratiques
praxématiques: introduction à une analyse matérialiste du sens. Numéro spé-
cial de: Cahiers de linguistique sociale 6: 1–155.
Leimdorfer, François. 2011. Les sociologues et le langage. Paris: Editions de la
MSH.
Leimdorfer, François, and André Salem. 1995. Usages de la lexicométrie en
analyse de discours. Cahiers des Sciences humaines 31 (1): 131–143.
Luhmann, Niklas. 2000. The reality of the mass media. Cambridge: Polity Press.
Original edition, 1995.
Maesse, Jens. 2015. Economic experts. A discursive political economy of eco-
nomics. Journal of Multicultural Discourses 10 (3): 279–305.
Maingueneau, Dominique. 1997. Pragmatique pour le discours littéraire. Paris:
Dunod.
20 R. Scholz
———. 2013. Genres de discours et web: existe-t-il des genres web? In Manuel
d’analyse du web en Sciences Humaines et Sociales, ed. Christine Barats, 74–98.
Paris: Armand Colin.
Mautner, Gerlinde. 2005. Time to get wired: Using web-based corpora in criti-
cal discourse analysis. Discourse & Society 16 (6): 809–828.
———. 2012. Corpora and critical discourse analysis. In Contemporary corpus
linguistics, ed. Paul Baker, 32–46. London: Continuum.
Norris, Sigrid, and Rodney H. Jones. 2005. Discourse in action. Introducing
mediated discourse analysis. London: Routledge.
Partington, Alan, Alison Duguid, and Charlotte Taylor. 2013. Patterns and
meanings in discourse. Theory and practice in corpus-assisted discourse studies
(CADS), Studies in corpus linguistics. Vol. 55. Amsterdam: Benjamins.
Paveau, Marie-Anne. 2017. L’analyse du discours numérique. Dictionnaire des
formes et des pratiques, Collection Cultures numériques. Paris: Hermann.
Porsché, Yannik. 2018. Public representations of immigrants in museums—
Exhibition and exposure in France and Germany. Basingstoke: Palgrave
Macmillan.
Raffnsøe, Sverre, Marius Gudmand-Høyer, and Morten S. Thaning. 2016.
Foucault’s dispositive: The perspicacity of dispositive analytics in organiza-
tional research. Organization 23 (2): 272–298.
Reboul, Anne, and Jacques Moeschler, eds. 1998. Pragmatique du discours. De
l’interprétation de l’énoncé à l’interprétation du discours. Paris: Colin.
Reisigl, Martin, and Ruth Wodak. 2016. Discourse Historical Approach (DHA).
In Methods of critical discourse studies, ed. Ruth Wodak and Michael Meyer,
23–61. Los Angeles, CA: Sage.
Rosental, Claude. 2015. From numbers to demos: Assessing, managing and
advertising European research. Histoire de la Recherche Contemporaine 4 (2):
163–170.
Schroeder, Ralph. 2014. Big data and the brave new world of social media
research. Big Data & Society 1 (2): 1–11.
Siblot, Paul. 1990. Une linguistique qui n’a plus peur du réel. Cahiers de praxé-
matique 15: 57–76.
———. 1997. Nomination et production de sens: le praxème. Langages 127:
38–55.
Tufekci, Zeynep. 2015. Algorithmic harms beyond Facebook and Google:
Emergent challenges of computational agency. Colorado Technology Law
Journal 13 (2): 203–218.
Vološinov, Valentin N. 1973. Marxism and the philosophy of language. New York:
Seminar Press. Original edition, 1929.
Warnke, Ingo H., and Jürgen Spitzmüller, eds. 2008. Methoden der
Diskurslinguistik. Sprachwissenschaftliche Zugänge zur transtextuellen Ebene.
Berlin and New York: De Gruyter.
Wiedemann, Gregor. 2016. Text mining for qualitative data analysis in the social
sciences. A study on democratic discourse in Germany. Wiesbaden: Springer VS.
Wodak, Ruth, and Michael Meyer. 2001. Methods of critical discourse analysis.
Introducing qualitative methods. London: Sage.
2
Beyond the Quantitative
and Qualitative Cleavage: Confluence
of Research Operations in Discourse
Analysis
Jules Duchastel and Danielle Laberge
1 Introduction
The world of social and language sciences is characterised by many cleav-
ages: between understanding and explaining, between structural and phe-
nomenological analysis, between different fields and disciplines related to
the study of language, between different national and continental tradi-
tions, and between qualitative and quantitative approaches. These opposi-
tions often create new avenues of thought, but they become sterile when
giving up important aspects of the analysis. We will ask ourselves how
J. Duchastel (*)
Department of Sociology, UQAM – Université du Québec à Montréal,
Montréal, QC, Canada
e-mail: duchastel.jules@uqam.ca
D. Laberge
Department of Management and Technology, UQAM – Université du Québec
à Montréal, Montréal, QC, Canada
e-mail: laberge.danielle@uqam.ca

24 J. Duchastel and D. Laberge
different approaches in discourse analysis deal with these oppositions, and

eventually with their possible convergence. We will explore the capacity
of mixed methods to overcome the opposition between qualitative and
quantitative methods. We will see how interpretation and explanation are
constitutive parts of the research process.
First, we will show how discourse analysis stands at an intersection of
disciplines, traditions, and approaches. We will then discuss the opposi-
tion between qualitative and quantitative methods and the mixed meth-
ods approach as a proposed solution. This will lead us to reconsider the
distinction between explaining and understanding: we put forward the
existence, in all sciences, of a hermeneutic arc that does not separate
interpretation from explanation. Through the description of different
states of the text in the process of discourse analysis, we will describe the
necessary phases of reduction and restoration of complexity, whether the
approach is quantitative or qualitative. We will illustrate the compatibil-
ity of these methods, showing that the concepts of causality and measure-
ment can apply in either approach.
2 Oppositions and Convergences
in the Field of Discourse Analysis
Discourse analysis stands at the confluence of various disciplines, tradi-
tions, and approaches. It arose from a dual need to overcome, in the
humanities, the limited focus on content and, in the language sciences,
the restricted structural approach to language. Discourse analysis intro-
duced the need to consider language in its social context and apprehend
content as it is materialised in linguistic forms and functions. Discourse
analysis can be considered as a merger of two great traditions: the herme-
neutical tradition of humanities and social sciences, based on the mean-
ing of social practices and institutions, and the more functional and
structural tradition of language sciences that focuses on the description of
different aspects of language use. Within the context of this confluence, a
third axis emerged, that of statistical and computer sciences, leading to
the development of a tradition of computer-assisted discourse analysis. If
Beyond the Quantitative and Qualitative Cleavage: Confluence… 25
one can hardly speak of discourse analysis as a discipline, it is because of

this profusion of influences. They are produced by as many analytical
practices as there are many disciplines and intersections between them.
Figure 2.1 represents the set of oppositions and similarities of the vari-
ous traditions of discourse analysis as they emerged in the sixties. The
diagram shows, at its centre, discourse analysis as the crossing point of all
these traditions. Therefore, it is not to be regarded as a discipline but as a
field of research practices sharing a number of designs from several disci-
plines. This confluence is also marked by numerous exchanges between
national traditions. The diagram can be read as a set of oppositions, from
top to bottom, left to right, and along the diagonals. The first major
opposition from top to bottom distinguishes qualitative and quantitative
approaches. It is possible to consider approaches at the top of the figure
as belonging to ‘letters’, for example, quality, while the bottom part refers
to ‘numbers’, for example, quantity (Pires 1982). The second major
opposition can be read, from left to right, French versus Anglo-Saxon
Hjelmslev (1931/1963) Linguistic Cercle of

Copenhaguen
G. H Mead (1934) Symbolic
interactionnism
Austin (1962) Speech acts

Harris (1952) Discours Analysis
Berger & Luckman (1966) Social

Barthes (1957) Semiology construction of reality
Qualitative analysis
Dubois (1969), Benveniste (1966)
French school of
Enonciation analysis
Discourse analysis
Lacan (1966), Psychoanalysis

Garfinkel (1967) Ethnomethodology
Pêcheux (1969), Automatic discourse analysis Searle (1970) Philosophy of langage
Foucault (1969), Discourse analysis Sacks (1972) Conversation analysis
Althusser (1970), Ideology
Stone (1966) General Inquirer
analysis
Laswell (1952) Communication theory
Lexicometry Content analysis
Muller (1968) Lexical statistics
Berelson (1952) Content Analysis
Guiraud (1960) Linguistic statistics

Holsti (1969) Content analysis
Fig. 2.1 Confluences in discourse analysis

traditions,1 highlighting the relative preponderance of linguistic on the

left and of social sciences on the right.
Figure 2.1 illustrates a space where each term is opposed to the other,
horizontally, vertically, or diagonally. At the top of the diagram, within
the so-called qualitative perspective, the French school of discourse analy-
sis and the Anglo-Saxon tradition of qualitative analysis form the first
opposition. What distinguishes them most is that they belong to differ-
ent disciplinary traditions. French discourse analysis is anchored in the
distributional, functional, and pragmatic linguistics, aiming to study lan-
guage as it is used in the real world. It owes much to the structuralist
tradition: understanding symbolic phenomena in their systemic dimen-
sion. It has gradually given attention to speech as a theoretical and an
empirical object (Foucault 1969), and evolved into a form of text linguis-
tics (Adam 1999; Rastier 2001).
On the other hand, the qualitative analysis has evolved from the bosom
of symbolic interactionism and phenomenology, also under the influence
of the philosophy of language and pragmatism. These traditions have a
common interest in the intentional action manifested through speech
acts. While the French tradition focuses on the linguistic aspects of situ-
ated speech, the American tradition is mostly interested in language as a
vehicle for the social construction of reality. What particularly distin-
guishes the two traditions is the type of empirical speech that is favoured.
From the beginning, the French tradition was interested in institutional
discourse, that is, political or literary discourses. The American tradition
was rather more inclined towards speech in everyday life, that is, localised
interlocutions or conversation.
On the bottom axis of the diagram, which represents the quantitative
perspective, we can also contrast two different approaches. On one side,
we have the French tradition of lexical analysis (lexicometry), and on the
other, the American tradition of content analysis. Both approaches share
a common interest for the quantification and measurement of linguistic
phenomena, but they can be distinguished by their disciplinary origin.
1
It has to be noted that both traditions are not hermetically closed. For instance, the French school
of discourse analysis initially was inspired by Zellig Harris (1952) distributional approach to
language.
While in France there is an interest in statistics applied to literary and

political corpora, in America, it is the study of communication and pro-
paganda that gave birth to a tradition of content analysis. While in both
cases, there is a strong belief in the power of explanation with figures, the
mathematical and statistical models greatly differ. On the one hand,
complex statistical methods are applied to words in their ‘natural’ exis-
tence, that is to say, without coding, on the other hand, relatively simple
counts of coded units are produced. But in both cases, the access to
meaning is through the numbers.
Observing the figure along the vertical axis, it is possible to distinguish
on the left an opposition between the French tradition of discourse analy-
sis at the top and the lexical approach at the bottom. This opposition has
gradually evolved from a ‘dialogue of the deaf ’, during the sixties and
seventies, to a mutual recognition in recent years, as computer-assisted
discourse analysis systems began to impose their own legitimacy.
Everything happens as if the requirements of formalisation of computing
procedures made statistics less daunting in the eyes of those primarily
interested in the description of language functions. On the right side, in
the American tradition, the same opposition existed between qualitative
and quantitative methods. In both cases, the interest lies primarily in the
meaning of discourses, but the qualitative tradition emphasises the inter-
pretive reading based on the coding of units, while content analysis is
concerned, at least in its early stages, with the essentially quantitative
count of units of speech. This opposition has also diminished over the
years, and there aren’t hardly any purely orthodox researchers left. As
proof of this, one has only to look at mixed qualitative and quantitative
features in computer-assisted qualitative data analysis systems.
Finally, on the diagonal axes of the diagram, we oppose, two by two,
each tradition. It is clear that the opposition between lexical and qualita-
tive analyses follows the same logic as that between the two approaches in
quantitative and qualitative content analyses in the American tradition.
But this opposition is not really present in the literature. The opposition
that puts face-to-face discourse analysis and content analysis took shape
in the founding act of discourse analysis in France. We should remember
that the French tradition of discourse analysis comes from the critique of
the content analysis tradition (Haroche et al. 1971). It criticises the
ignorance of the linguistic substratum of discourse in this tradition,

although some authors, such as Osgood (1959), have justified its whole
importance.
Discourse analysis as a research practice has always had a syncretic
character, each tradition drawing on several disciplinary and method-
ological sources. It follows that the oppositions described here have pro-
gressively moved towards a confluence of diverse perspectives. This is true
of the reconciliation, in France, between the traditions of discourse anal-
ysis and of lexical analysis. A sign of this coming together is the growing
place of the statistical analysis of textual dimensions, often referred to as
‘textometry’ or ‘logometry’ (Mayaffre 2007). This is also true of the com-
bination of qualitative and quantitative methods in content analysis in
the Anglo-Saxon tradition. Similarly, French and American traditions of
discourse analysis have grown closer in recent decades. That which origi-
nally distinguished them—the nature of discourse analysed (in the first
case, political and literary discourses and in the other, the everyday life
discourses) and the disciplinary origin (for one, linguistic and for the
other, pragmatic)—gradually converged. It is interesting to note that the
authors of reference of the Anglo-Saxon tradition of critical discourse
analysis (Fairclough 2007) or the school of social representations (Hall
2009) are the same as those of the French school: Barthes (1957),
Althusser (1970), Foucault (1969), Derrida (1967), and Lévi-Strauss
(1949). It is equally interesting to note that the analysis of ordinary
knowledge and conversation has crossed the Atlantic in the other direc-
tion. It is out of the question to define a fictional unity of discourse analy-
sis domain, but it is certainly worth noting that the research practices in
discourse analysis combine, rather than oppose, more and more disci-
plines, approaches, and methodologies.
3 Mixed Methods
The confluence of theoretical and methodological approaches in the cur-
rent practices of discourse analysis involves the use of mixed methods.
The idea of mixed methods fits into the broader project to overcome the
opposition between qualitative and quantitative approaches, and to
somehow combine the two methodologies. While the quantitative meth-

ods are relatively easy to define, it is not the case for the qualitative ones.
For example, the contrast between the upper left and the upper right of
Fig. 2.1 indicates two different qualitative perspectives. Methods of dis-
course analysis aim to describe the forms and functions of language; in
fact they take into account the qualitative aspects of speech. The latter
refers more properly to the qualitative paradigm as such. But before going
further in the characterisation of quantitative and qualitative paradigms,
we must insist on the fundamental difference between the two approaches.
While the definition of the quantitative approach is quite simple, for
example, the use of mathematical and statistical tools in order to describe,
explain, and predict phenomena through operationalised concepts as
measurable variables, the qualitative approach refers to a large number of
research practices, such as those listed by Denzin and Lincoln (1994):
case study, ethnography, participant observation, phenomenology, eth-
nomethodology, grounded theory, biographical method, action research,
and clinical research.
More profoundly, quantitative and qualitative paradigms differ on
three levels: epistemological, analytical, and operational. The paradig-
matic configurations can vary in different ways according to the ontologi-
cal positions adopted by researchers, but they generally indicate common
positions regarding the task they are given. For the moment, we will not
develop further the ontological questions regarding the existence of real-
ity and truth that lies upstream of epistemological positions. These pos-
tures, positivist, post-positivist, critical, or constructivist give reality a
more or less autonomous status. The same can be said about the regime
of truth, the degree of relativity increasing, here, on the axis ranging from
positivism to constructivism.2 These postures necessarily influence the
various paradigmatic positions.
We will instead concentrate on the analytical and operational plans
characterising both qualitative and quantitative paradigms. These form a
series of oppositions that should be thoroughly discussed. But the goal here
2
See also Table 6.2 ‘Paradigm positions on Selected Practical Issues’ in Guba and Lincoln (1994)
and Table 1 ‘Trois positions ontologiques dans les sciences sociales contemporaines’ in Duchastel
and Laberge (1999b).
is to give an overview of the main debates between the two viewpoints. At

the epistemological level, three questions arise. The first question regards
the viewpoint of the observer: while the quantitative approach adopts a
positivist perspective, advocating a measure of distance between the
observer and the data and procedural objectivity, the qualitative approach
promotes empathy and subjectivity. The second question concerns the
capacity for generalization. Quantitative scientists aim at formulating gen-
eral and universal propositions, while the qualitative scientists insist on
uniqueness and context. The third question is about the value of truth.
Quantitative researchers put forward procedures’ validity and observers’
neutrality. The qualitative researchers prefer the ideas of transferability and
credibility to those of validity and axiological commitment to neutrality.
In analytical terms, quantitative methods proceed to the reduction of
complexity, while qualitative methods favour its full apprehension.
Quantitative scientists promote a deductive approach, at least in the con-
firmatory phase, while the qualitative researchers support induction or
abduction. Moreover, the quantitative analysts encourage width (thin
analysis) rather than depth (thick analysis) that characterizes the qualita-
tive approach. Finally, in terms of operations, quantitative research
works on variables while qualitative research is more interested in inten-
tional actions. Quantitative research favours measurement rather than
focus on qualitative processes. Consequently, quantitative researchers
seek confirmatory statistical tests when qualitative researchers employ
exploratory procedures. In summary, the purpose of quantitative m ethods
would be causal explanation and that of qualitative methods the under-
standing of meaning.
The use of mixed methods can be explained by the relative weakening
of the paradigmatic oppositions between quantitative and qualitative
methods, and the adoption of a more pragmatic attitude. Aware of the
variable nature of the data and of their actual availability, researchers have
come to use materials or analytical approaches that have previously
tended to be opposed. These changes are mostly based on pragmatic
arguments: ‘It works!’ A review of practices in the area of mixed methods
shows that there are essentially three arguments to justify these combina-
tions. The first argument can be described as functional. It consists of
simply juxtaposing the use of various types of methods according to the
needs of the research project and the nature of the data. The choice is up
to the researcher to establish the sequence of qualitative and quantitative
methods and their relative importance (QUAN > qual, QUAL > quan,
QUAN = QUAL) as part of the research process. The second argument is
more substantive. It justifies the hybridization of methods according to
the nature of data. For example, discourse analysis and content analysis
are applied to phenomena including aspects of both qualitative and
quantitative nature. The third argument is epistemological. The use of
mixed methods is legitimated by the idea of triangulation. Triangulation
is seen as a way to increase confidence in the research results. However,
we must recognize that the use of the term ‘triangulation’ is mostly meta-
phorical (Kelle 2001) and does not formally ensure a greater validity,
except in the form of convergence or confirmation of findings. In sum,
the use of mixed methods only proves that there should not be mutually
exclusive types of methods. It seems, however, insufficient to reduce the
issue of mixed methods to their sole effectiveness without trying to
understand the implications of epistemological, analytical, and opera-
tional oppositions characterizing both qualitative and quantitative para-
digms on these new forms of empirical approaches.
4 Explaining and Understanding
What can be drawn from the above? On the one hand, we have estab-
lished that the practice of discourse analysis is at the confluence of several
disciplines, themselves, relying on more or less quantitative or qualitative,
phenomenological or structural, linguistic or sociological approaches.
While each tradition has established itself on epistemological, theoretical,
and methodological oppositions with other traditions, we can neverthe-
less observe a certain convergence in the use of methods and the mitiga-
tion of previous fractures. On the other hand, the fundamental opposition
between qualitative and quantitative methods seems to dissolve in the
pragmatic choice of mixed methods. This pragmatism often avoids exam-
ination of ontological and epistemological foundations of this practice.
This is why we have to question the possible reconciliation of these two
so strongly opposed paradigms.
To elucidate this question, it is useful to return to the starting point of

the distinction between natural science and humanities as established by
Dilthey in the late nineteenth century. This distinction was built on the
contrast between explaining and understanding. According to this view,
the natural sciences were entirely dedicated to the identification of causal
relationships between phenomena, while the humanities sought to
uncover the meaning of historically situated experiences. It is this design
that better differentiates the paradigmatic opposition between quantita-
tive and qualitative methods. But instead, we will rather rely on the
assumption of Ricœur (1981, 161) that “it seems possible to situate
explanation and interpretation along a unique hermeneutical arc and to
integrate the opposed attitudes of explanation and understanding within
an overall conception of reading as the recovery of meaning.”
In fact, Ricœur defines a hermeneutical arc, from explanation to
understanding, that is to say that the interpretation unfolds in a set of
objective procedures for observation, description, and analysis resulting
in the understanding of the research object. Hermeneutics cannot be
reduced to the immediate interpretation of the observed reality, as might
be the case for everyday knowledge. In scientific knowledge, the interpre-
tation is necessarily supported by the mediation of operations that can be
named explanatory procedures.
This assumption allows us to reject two common conceptions of inter-
pretation. The first comes from within the qualitative paradigm where
interpretation is often seen as a hermeneutical comment. One textbook
defines qualitative analysis as “a deliberate and rigorous representation
and conscious transposition of the ‘self-other-world’ system, in order to
make a new exploration in the particular perspective of the humanities
and social sciences, which strive to bring out the sense rendering it under-
standable.” (Our translation, Paillé and Mucchielli 2008, 24)
The researchers set out to reveal the meaning of speech in context. In
fact, they are mostly interested in the referential function of discourse.
But should we not consider that the essence of discourse analysis is to
highlight the various linguistic and paralinguistic aspects of speech whose
disclosure is necessary for an overall understanding? Interpretation can-
not stand on its own and it requires the work of description and
explanation.
The interpretative process’s second conception is restricted to the inter-

pretation of results. In quantitative or qualitative frameworks, the work
of interpretation is often limited in establishing the meaning of the results
generated by research operations. It then maintains the illusion that these
operations are absolutely objective until meaning is assigned to the results
they produce. Such a point of view ignores the importance of interpretive
acts that mark each stage of the research process. The projection of a theo-
retical framework, the identification of analytical dimensions, and the
choice of values lent to research objects are all housed in the same inter-
pretive acts within objectification procedures.
What then is interpretation? In the broadest sense, there is a tendency
to confuse this concept with that of understanding or appropriating, for
ourselves, the meaning of an action, an intention, or a thought. The
researcher would then be asked to develop his empathic abilities, which
could give him access to the consciousness of the observed subject. It is
true that, at the end of every project, the researcher arrives at a global
interpretation of the observed phenomenon that is somehow detached
from observation, description, and analytical procedures. This holistic
interpretation can be seen as an appropriation for ourselves of the object,
the global comprehension of the phenomenon (Duchastel and Laberge
1999a). But in the context of a scientific process, interpretation must be
seen as the continuous confrontation of the researcher with discursive
materiality (Conein et al. 1981) or language materiality (Paveau 2012).
For several authors, we find this strong intuition that access to meaning
cannot dodge the discursive materiality. Pêcheux (1975), and later on
Paveau (2012) and Molino (1989), insisted that only the very materiality
of speech could render analysis possible. Similarly, Ricœur (1981, 149)
speaks of “the eclipse of the circumstantial world by the quasi-world of
texts” as a condition for reading and interpreting. In sum, hermeneutics
as the art of interpretation should be based on a set of procedures for the
description, exploration, and analysis of material units of discourse.
The intuition behind the project of discourse analysis was, from the
outset, to go beyond content analysis and take into account the linguistic
dimension of speech. Speech was not to be reduced to its purely linguistic
dimensions—lexical or semantic. The hypothesis was to find various
traces of discourse functions, such as those developed by Jakobson (1963),
in the material fabric of language. This is the case with enunciation analy-
sis that seeks the inscription of speaker and audience in the thread of
discourse. The same is true with the study of markers of argumentation.
According to Gee (2011), discourse analysis is about the study of speech
on three levels: the analysis of the information it conveys (saying), that of
action it raises (doing) and of identity it formulates (being). Each of these
dimensions is identifiable only through linguistic forms that make them
intelligible. The interpretation must rely on certain classes of observation
units and the description of their properties. This process is objectifying
as well as interpretative.
If this is true, a restrictive approach of interpretation cannot be sus-
tained. Interpretation cannot be limited to the final act of the research
process when making sense of results. Rather, interpretation should be
present at the very beginning of the research process. Interpretation is
part of every research procedure, and all procedures rely on interpreta-
tion. This means that explanatory procedures and interpretation go hand
in hand and do not oppose each other, as the quarrel of paradigms would
suggest. Rather than designing two general paradigms defined by their
purpose, explaining, or understanding, it is more productive to integrate
both actions within a single process. No science can do without a proper
pre-comprehension of the object. There is always a knowledge frame,
more or less theoretical, which predetermines the grasping of reality.
What is sought is to increase this preliminary understanding. Explanation
is most often thought of as establishing a relationship between two phe-
nomena. But, it also has a semantic sense. Kaplan (1964) has defined
interpretation as a semantical explanation, thus explaining the meaning
of a statement. In both cases, the goal is to better understand. The various
procedures for observation, description and analysis of objects are
designed to enhance understanding by distancing the object from the
subject and by linking the object with the cognitive frameworks at play.
However, we must consider the asymmetry of both processes of expla-
nation and interpretation. While explanatory procedures can be con-
trolled to a certain point, the act of interpretation, even if it is well framed,
remains difficult to define. The cognitive capacities of the researcher,
semantic, emotional, or cultural, will result in some uncertainty of inter-
pretation. However, it is easier to control the micro-level of the interpre-
tive process in various descriptive and analytical procedures than in the

overall understanding of a phenomenon. That is why we distinguish
‘local interpretations’ that can be thought of, if not perfectly controlled,
at all stages of the research process and ‘global interpretations’ that bring
meaning to the complexity of the object at the expense of an assured
mastery of the cognitive processes at work (Duchastel and Laberge
1999a).
5 The Problem of Complexity

One of the most fundamental criticisms addressed to the quantitative
paradigm is its reductive approach to the problem of complexity. On the
other hand, the comprehensive paradigm is based on the idea that the full
complexity of any phenomena must be preserved. It shows strong resis-
tance to any reduction that may dissipate meaning. Instead, an empathic
approach is advocated. But is it possible to grasp an object without reduc-
ing its complexity and describing it? Qualitative methods are not exempt
from this requirement as they shall, themselves, proceed to the identifica-
tion of units of varying size (words, textual segments, sentences, and para-
graphs) to which they affix referential or factual categories. Yet, proponents
of the qualitative paradigm insist on the whole rather than the parts.
The question may be ill defined. It is rather more appropriate to distin-
guish between systematic reduction of complexity and oversimplifica-
tion. Admittedly, the distinction between in-depth analysis (thick) and
wide analysis (thin) remains relevant and it is understandable that the
first type embraces more complexity. But in all cases, reducing the phe-
nomenon under study is unavoidable. It is not possible to grasp an object
in its totality, if not intuitively. Thus we need to temporarily neglect some
of its components to retain only a few. Ricœur (1986) explains that dis-
course analysis can be only done through the mediation of the text. This
methodical act of concealing the complexity of the social conditions of
discourse allows the proper identification of textual materiality and the
observation of its properties and relationships. Such mixed interpretative
and explanatory procedures will progressively lead to a more comprehen-
sive understanding at the very end of the research process.
We see the process of understanding as a spiral formed by overlapping

circles each having a point of origin based on a prior understanding of the
object and an endpoint defined as the enriched understanding of the
same object. Between these two points, there is a set of operations of
construction, description, and analysis involving both explanation and
interpretation procedures. These procedures are formed by the identifica-
tion of dimensions and units, the description of units based on conceptual
dimensions, and the exploration of their relationship. All these opera-
tions can be performed only on a well-defined materiality. This material-
ity is that of the text and the text is the main trace of the speech situation.
The text is thus some reduction of the situated discourse. It is not possible
to carry out the analysis without the use of a textual support, in contrast
to mundane understanding in everyday life.
The transformation of the text over the course of research will show
how a dual process of reduction and recovery of complexity operates.
Figure 2.2 shows the various stages in the transformation of the text
with each specific methodical operations of discourse analysis. The initial
form of the text is speech itself. It consists of the raw material on which
we will perform various research operations. The ‘speech’ text is the start-
ing point, a complex object produced within a socio-historical, cultural,
cognitive and linguistic context, and a specific communication situation.
The first transformation is to establish a ‘manuscript’ text. Initially, we
may have a spoken or written speech, already in the form of a text. In the
case of written speech, we then must select and authenticate a version of
Fig. 2.2 Transformation of the text. The figure is an adaption of a schema pre-
sented in Meunier (1993)
the speech that will become a text ‘outside of the world’, in the words of
Ricœur. In the case of oral discourse, we first proceed to its transcription.
Oral discourse includes a set of prosodic and contextual features that can
be recorded in a more or less developed format using established conven-
tions (e.g., Jefferson 2004). The ‘manuscript’ text is an object both differ-
ent and less complex than the original, in the sense that the conditions
and context of its production and enunciation are no more present oth-
erwise than within the text itself.
The next transformation will produce an ‘edited’ text. Whatever the
characterization of the manuscripts, transcripts of oral, in paper or com-
puterized format, standardization and normalization work must be done
in order to make the various elements of a corpus comparable. Information
about the conditions of production of speech and of enunciation (speaker,
support, place, time, etc.) must define each document of a corpus. We get
a new ‘edited’ text which will be subsequently the object of description,
exploration, and analysis. In summary, the ‘manuscript’ text is a deriva-
tion of the original discourse which version has been established by
authentication or transcription and the edited text is, in turn, the result
of standardization and indexation according to a system of rules and
descriptive categories. It is on the basis of this ‘edited’ text that the work
of description, exploration, and analysis can be further performed.
Which actions should then be performed on this textual material? We
can define two universal research operations whatever the approach. The
first task is to establish the observation units: What is to be observed? The
second task consists of the description of these units based on one or
more systems of categories: How is it to be observed? Observation units
can be represented as a set of nested elements, from the global corpus to
the sub-corpora, to the collection of texts that constitute each of them, to
the various parts of each text, and finally to the middle and micro-level
text units. Each nesting level of units may be described into a system of
categories. The corpus itself and its subsets are indexed with a metadata
system. Every text component (section, paragraph, verbal exchanges,
etc.) can be marked. Finally, speech units (textual segments, turns of
speech, sentences, words) are coded depending on the research target
(e.g., morpho-syntactic, semantic, pragmatic, enunciative, argumentative
coding). Thus, the descriptive system unfolds at three levels: The corpus
is described by meta-categories, the parts of text are described by struc-

tural variables, and the speech units are described by a variety of proper-
ties associated with the research design. Arguably, the ‘edited’ text is
actually transformed into a series of ‘representations’, in the sense that the
text is now enriched with descriptions, and in some way, a form of com-
plexity is partially restored. It represents, however, multiple images of the
original text, but in no way corresponds fully to the context of its
utterance.
All text descriptions can be sorted and compiled. They may or may not
be the subject of counts, crossovers, comparisons based on various seg-
ments established on the basis of metadata or structural variables. Each
data mining operations described will result in the production of many
new texts in the form of comments or numerical results. Each of these
subtexts will only be a distant image of the original text. It is the accumu-
lation of these images which will allow further exploration of the original
speech and lead to the interpretation of the data, producing a new trans-
formation of the text in the form of ‘interpretation’. The interpretation of
the results can be partial or global, depending on whether we choose to
interpret the empirical data produced by different sets of explorations or
we attempt to give an overall sense of the whole data. Global interpreta-
tion will then mobilize much more than methodological devices.
Theoretical and socio-historical knowledge are needed to restore the full
complexity of discourse in action. The final form of the text is a new text,
the ‘interpretation’ text taking the form of an article or monograph aim-
ing at the increased understanding of the phenomenon being studied.
This more or less metaphorical representation of a succession of states
of the text goes to show that speech can only be grasped in the form of its
textual materiality which must be later subjected to methodical opera-
tions. From this point of view, it does not seem appropriate to distinguish
between quantitative and qualitative methods. On the epistemological
level, it is not productive to oppose complexity and simplicity. We have
seen that understanding and explanation should form a hermeneutical
arc. Any methodological approach necessarily implies a reduction of the
object allowing some objectification of data. As we saw earlier, this pro-
cess involves both operations of explanation and interpretation. These
operations ultimately lead to the formulation of interpretative hypotheses
that allow for the appropriation of the object for ourselves, that is to say,
its understanding.
6 ausality and Measurement in Discourse

C
Analysis
We have tried so far to show how discourse analysis is, as its name sug-
gests, a practice that focuses on the discursive materiality and implements
systematic operations, both explanatory and interpretative. We have chal-
lenged the strict opposition between the qualitative and quantitative
paradigms while recognizing the existence of distinctive practices con-
cerned with quantitative or qualitative aspects of phenomena. The para-
digmatic opposition between qualitative and quantitative approaches
emphasizes two distinct criteria. As we have pointed out, the quantitative
approach would favour measurement and causal explanation, and the
qualitative approach would rather choose the global understanding of
phenomena. To be convinced of the compatibility of the two approaches,
it is useful to examine the presence of causal reasoning in the practice of
discourse analysis and the relevance of measuring as an operation favour-
ing at the same time reduction and restoration of complexity. We will
attempt to illustrate how causal explanation and measurement have their
place in the qualitative approach.
With regard to causation, we refer to Tacq’s proposal (2011) that causal
reasoning is present in both quantitative and qualitative researches. He gives
an overview of different theories of causality in the social sciences to stress
the idea of an experimental logic present in both approaches. He starts from
the premise that in science, the causal relationship is rarely apprehended
directly, but rather is considered in an indirect way, a sort of encirclement
process. Thus, science most often uses probabilistic or statistical approaches
to examine the necessary and sufficient conditions explaining a phenome-
non, without being able to establish a direct causal link between phenom-
ena. To support his conception of experimental logic, Tacq relies on the
INUS model (Insufficient but Necessary part of a set, which is Unnecessary
but Sufficient for the Result, Mackie 1974), which bases the nature of rea-
soning on all the conditions making possible the occurrence of an event.
According to the INUS model, an event may be the product of a nec-

essary condition but insufficient in general, while being sufficient
although not necessary under the circumstances. Tacq gives the following
example: Experts may say that fire is the result of a short circuit. The
cause cannot be declared necessary because other factors could cause fire.
It cannot be declared sufficient since other conditions may contribute to
the spread of fire. All we can say is that, combined with the short circuit,
there is a set of positive or negative conditions that are sufficient without
being necessary to trigger the fire. It is a counterfactual argument that
questions the possibility of the occurrence of an event in the absence of
an identified causal factor. The perspective is that of a causal field rather
than a logical causation. According to the author, this type of reasoning
is widely used in experimental research. But it is surely the kind of logic
that is applied in qualitative research.
To support his thesis, Tacq responds to the main arguments that aim
at distinguishing qualitative and quantitative approaches. The first argu-
ment pertains to the measurement scales, nominal, ordinal, interval, and
metric. The first two levels, nominal and ordinal, would characterize the
qualitative approach, allowing limited mathematical operations, thus
excluding the causal logic implied by quantitative models. While math-
ematical operations vary depending on the nature of the variables, it
does not follow that the causal logic is de facto excluded. The second
argument is based on the difference in sample size between the qualita-
tive and quantitative approaches. In extreme cases, qualitative studies
will apply to a single case, making causal analysis improbable. Tacq notes
that there are few objective criteria for determining the minimum sam-
ple size and even the analysis of a single case can make sense, provided it
is placed in relation with other single-case studies. The analysis of com-
plex necessary and sufficient conditions is still possible by the counter-
factual examination of these conditions. The third argument regards the
possibility of statistical tests. Obviously, the power of statistical tests var-
ies greatly depending on the sample size. However, there are a variety of
tests that have been produced to validate the results of small samples,
and comparison of data with data obtained in other studies is, in itself, a
kind of test, even if not statistical. The last argument pertains to the dif-
ference between thin and thick analyses. Again, there is no doubt that
in-depth analysis multiplies the dimensions of the object that can be

observed, while the analysis in width multiplies the number of individuals
observed for a limited number of dimensions. This should not, however,
change the argument, especially as there is no reason not to combine
qualitative and quantitative procedures at various stages of the research
process.
The author comes to the conclusion that if we use the counterfactual
and conditional approach of INUS’s model and the method of difference
at the base of the experimental approach as formulated by John Stuart
Mill, there is no principled difference between quantitative and qualita-
tive methods in terms of causal reasoning.
We will conclude by showing that the use of measurement is not
inconsistent with a qualitative approach. If one refers to the qualitative
paradigm, measurement is conceived as a distortion of the research object
and would constitute a misleading and unnecessary analysis, precisely
because it reduces complexity. However, measurement is one of the
research operations that allows at the same time a reduction of the dimen-
sions under study and possibly the production of another order of com-
plexity. We retain the definition proposed by Kaplan (1964, 177):
“Measurement, in the most general terms, can be regarded as the assign-
ment of numbers to objects (or events or situations) in accord with some
rule.” The properties of the object and their measurability do not exist
independently of a theory. The qualitative or quantitative representation
of an object depends on the choice of a system of symbolic representa-
tion. In the words of Kaplan, “quantities are of qualities and a measured
quality has just the magnitude expressed in its measure” (1964, 207). In
sum, measure can be applied at various levels of construction of the
object. First, it can be applied to any object with an independent material
existence, regardless of its nature, size, and complexity, such as individu-
als, world objects, texts, statements, events. Second, it can be applied to
segments or properties of these objects not directly accessible to observa-
tion, but arising from research work. Third, the measure may even extend
to intangible objects that exist through the work of the mind. This last
kind of objects might be a social production (success, wealth, popularity,
etc.) or the product of disciplinary knowledge (anomie, social relativism,
creativity, etc.).
To resume our earlier discussion, the measuring may indeed appear to

be a reduction of information. In the different phases leading to measure-
ment, only certain attributes are deemed relevant to the process. It implies
that we waiver the diversity of concrete manifestations, physical or imag-
ined, of one’s research object. This work of abstraction is present in both
qualitative and quantitative approaches. It is reflected in the operations of
description and categorization of the chosen units. Categorization con-
sists in a double reduction of the object by identifying a particular aspect
of the object and allocating an abstract value that can represent it. Giving
values to units and their properties follows previous work of reduction
and abstraction of the object’s dimensions. In return, measurement may
also help restore complexity. It can indeed be a powerful heuristic strategy
to rebuild complex representations of aspects or attributes postulated in
theory. For example, the construction of indices to represent a concept by
adding and weighting indicators leads to the emergence of a form of
complexity non-apparent at the starting point. In the same fashion, mul-
tidimensional statistical analysis produces information that was not there
from the start (see also Duchastel and Laberge 2011).
Discourse analysis is a good example for the use of measurement as
part of a mixed methods approach. The different operations of descrip-
tion and analysis of discourse data show that measurement can contrib-
ute both to the abstraction of specific dimensions of the object and to the
restoration of complexity. Analysis relies on the capacity to identify series
of discrete speech units (words, semantically meaningful phrases, broader
textual segments, etc.) and to determine a system of categorization
(semantic, sociological, argumentative, pragmatic, enunciative, etc.). The
researcher remains free to determine whether he will take into account
only the units, whatever the type, or if he is interested in their properties.
Counting these objects will only give a partial view of the whole. For
example, we could learn about the proportion of nouns belonging to a
semantic class, the dominant premises of an argument, the relative impor-
tance of certain enunciative markers in a political speech, the frequency
of speech turns in a conversation, etc. Thus, one can speak of a reductive
reading manifested both by a certain selection of aspects of the text and
by its representation in a measurement system. But it is also possible to
speak of a more complex representation of the text by the multiplication

of observations and accumulated elements measured. The accumulation
of observations and measurements can lead to the construction of indices
or increase the size of the analysis. Measurement is then one of the opera-
tions available in discourse analysis. It is not inherently incompatible
with the qualitative approach.
7 Conclusion
We have shown that discourse analysis is not a discipline but a
research practice that is at the confluence of a set of disciplinary and
national traditions. The rich heritage of disciplinary, theoretical, and
methodological knowledge explains the privileged position of discourse
analysis. The very purpose of discourse analysis predisposes it to stay at
the frontier of different methodological approaches, which might be
called mixed methods. We have shown that the paradigmatic opposi-
tions between qualitative and quantitative approaches, although strongly
advocated in the body of scientific literature, have become obsolete in
the pragmatic use of mixed methods. We went beyond this pragmatic
attitude to defend the thesis that there is indeed a common background
in all methodologies, whatever their paradigmatic affiliation. We have
shown that we cannot explain without interpreting at the same time, and
that the very identification of research units and operations of descrip-
tion and analysis combines, at all times, explanation and interpretation.
We further stated that scientific knowledge cannot proceed without
applying some reduction procedures, but that the combination of these
procedures can lead to a restoration of the complexity of the object. We
ended by showing that the logic of causality and measurement, seem-
ingly opposed to the qualitative paradigm, applies to both qualitative
and quantitative approaches.
Acknowledgements We are thankful to Beltz Juventa Publishing House for

allowing to reprint this text, which originally was published in the Zeitschrift für
Diskursforschung (2014/2).
References
Adam, Jean-Michel. 1999. Linguistique textuelle. Des genres de discours aux textes.
Paris: Nathan.
Althusser, Louis. 1970. Idéologie et appareils idéologiques d’État. La Pensée 151
(juin).
Barthes, Roland. 1957. Mythologies. Paris: Seuil.
Conein, Bernard, Jean Jacques Courtine, Françoise Gadet, Edward W. Marandin,
and Michel Pêcheux, eds. 1981. Matérialités discursives. Actes du colloque de
Nanterre (24–26 avril 1980). Lille: Presses universitaires de Lille.
Denzin, Norman K., and Yvonna S. Lincoln. 1994. Handbook of qualitative
research. London: Sage.
Derrida, Jacques. 1967. L’écriture et la différence. Paris: Seuil.
Duchastel, Jules, and Danielle Laberge. 1999a. Des interprétations locales aux
interprétations globales: Combler le hiatus. In Sociologie et normativité scien-
tifique, ed. Nicole Ramognino and Gilles Houle, 51–72. Toulouse: Presses
Universitaires Du Mirail.
———. 1999b. La recherche comme espace de médiation interdisciplinaire.
Sociologie et Sociétés XXXI (1): 63–76.
———. 2011. La mesure comme représentation de l’objet. Analyse et interpré-
tation. Sociologies (Avril). Accessed June 27, 2018. https://journals.openedi-
tion.org/sociologies/3435.
Fairclough, Norman. 2007. Discourse and social change. Cambridge: Polity.
Foucault, Michel. 1969. L’archéologie du savoir. Paris: Gallimard.
Gee, James Paul. 2011. An introduction to discourse analysis. Theory and method.
3rd ed. New York: Routledge.
Guba, Egon G., and Yvonna S. Lincoln. 1994. Competing paradigms in quali-
tative research. In Handbook of qualitative research, ed. Norman K. Denzin
and Yvonna S. Lincoln, 105–117. London: Sage.
Hall, Stuart. 2009. Representation, cultural representations and signifying practices.
London: Sage.
Haroche, Claudine, Paul Henry, and Michel Pêcheux. 1971. La Sémantique et
la coupure saussurienne: Langue, langage, discours. Langages 24: 93–106.
Harris, Zellig. 1952. Discourse analysis. Language 28 (1): 1–30.
Jakobson, Roman. 1963. Essais de linguistique générale. Paris: Minuit.
Jefferson, Gail. 2004. Glossary of transcript symbols. In Conversation analysis:
Studies from the first generation, ed. Gene H. Lerner, 13–31. Amsterdam: John
Benjamins Publications.
Kaplan, Abraham. 1964. The conduct of inquiry. Methodology for behavioral sci-
ence. New York: Chandler Publishing.
Kelle, Udo. 2001. Sociological explanations between micro and macro and the
integration of qualitative and quantitative methods. Forum Qualitative Social
Research 2(1). https://doi.org/10.17169/fqs-2.1.966. Accessed June 27,
2018.
Mackie, John L. 1974. The cement of the universe. A study of causation. Oxford:
Oxford University Press.
Mayaffre, Damon. 2007. Analyses logométriques et rhétoriques des discours. In
Introduction à la recherche en sic, ed. Stéphane Olivési, 153–180. Grenoble:
Presses Universitaires De Grenoble.
Meunier, Jean-Guy 1993. Le traitement et l‘analyse informatique des textes.
Revue de Liaison de la recherche en informatique cognitive des organisations
(ICO Québec) 6 (1–2): 19–41.
Molino, Jean. 1989. Interpréter. In L‘interprétation des textes, ed. Claude
Reichler, 9–52. Paris: Editions De Minuit.
Osgood, Charles E. 1959. The representational model and relevant research
methods. In Trends in content analysis, ed. Ithiel de Sola Pool, 33–88. Urbana:
University of Illinois Press.
Paillé, Pierre, and Alex Mucchielli. 2008. L’analyse qualitative en sciences humaines
et sociales. Paris: Armand Colin.
Paveau, Marie-Anne. 2012. L’alternative quantitatif/qualitatif à l’épreuve des
univers discursifs numériques. In Colloque international et interdisciplinaire
Complémentarité des approches qualitatives et quantitatives dans l’analyse des
discours?, Amiens, France.
Pêcheux, Michel. 1975. Les vérités de la Palice, linguistique, sémantique, philoso-
phie. Paris: Maspero.
Pires, Alvaro P. 1982. La méthode qualitative en Amérique du nord: un débat
manqué (1918–1960). Sociologie et société 14 (1): 16–29.
Rastier, François. 2001. Arts et sciences du Texte. Paris: PUF.
Ricœur, Paul. 1981. Hemeneutics and the human sciences. Essays on language,
action and interpretation. Cambridge: Cambridge University Press.
———. 1986. Du texte à l’action. Paris: Seuil.
Tacq, Jacques. 2011. Causality in qualitative and quantitative research. Quality
and Quantity 45 (2): 263–291.
Zienkowski, Jan. 2012. Overcoming the post-structuralist methodological defi-
cit. Metapragmatic markers and interpretative logic in a critique of the bolo-
gna process. Pragmatics 22 (3): 501–534.
References of Figure 2.1
Althusser, Louis. 1970. Idéologie et appareils idéologiques d‘État. La Pensée 151

(juin).
Austin, John L. 1962. How to do things with words. New York: Oxford University
Press.
Barthes, Roland. 1957. Mythologies. Paris: Seuil.
Benveniste, Emile. 1966. Problèmes de linguistique générale. 1. Paris: Gallimard.
Benzecri, Jean-Paul. 1973. L’analyse des données: l’analyse des correspondances.
Paris: Dunod.
Berelson, Bernard. 1952. Content analysis in communication research. New York:
Hafner Publications.
Berger, Peter, and Thomas Luckman. 1966. The social construction of reality. A
treatise in the sociology of knowledge. New York: Anchor Books.
Dubois, Jean. 1969. Énoncé et énonciation. Languages 4 (13): 100–110.
Foucault, Michel. 1969. L’archéologie du savoir. Paris: Gallimard.
Garfinkel, Harold. 1967. Studies in ethnomethodology. Englewood Cliffs, NJ:
Prentice Hall.
Guiraud, Pierre. 1960. Problèmes et méthodes de la statistique linguistique. Paris:
PUF.
Harris, Zellig. 1952. Discourse analysis. Language 28 (1): 1–30.
Hjelmslev, Louis. 1947. Structural analysis of language. Studia Linguistica 1
(1–3): 69–78.
Holsti, Ole R. 1969. Content analysis for the social sciences and humanities.
Reading, MA: Addison Wesley.
Lacan, Jacques. 1966. Écrits. 2 vols. Paris: Seuil.
Lasswell, Harold D., Daniel Lerner, and Ithiel de Sola Pool. 1952. The compara-
tive study of symbols. Stanford, CA: Stanford University Press.
Lévi-Strauss, Claude. 1949. Les structures élémentaires de la parenté. Paris: PUF.
Mead, George H. 1934. Mind, self, and society. Chicago: University of Chicago
Press.
Muller, Charles. 1968. Initiation à la statistique linguistique. Paris: Larousse.
Pêcheux, Michel. 1969. Analyse automatique du discours. Paris: Dunod.
Sacks, Harvey. 1972. An initial investigation of the usability of conversational
data for doing sociology. In Studies in social interaction, ed. David Sudnow,
31–74. New York: Free Press.
Schütz, Alfred. 1967. The phenomenology of the social world. Evanston, IL:
Northwestern University Press.
Searle, John. 1970. Speech acts. An essay in the philosophy of language. Cambridge:
Cambridge University Press.
Stone, Philip J., Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie.
1966. The general inquirer. A computer approach to content analysis. Cambridge,
MA: MIT Press.
Part II
Analysing Institutional Contexts of
Discourses
3
The Academic Dispositif: Towards
a Context-Centred Discourse Analysis
Julian Hamann, Jens Maesse, Ronny Scholz,
and Johannes Angermuller
1 Introduction
In discourse, meanings are realised and established among members of a
social community. Discourse is a meaning-making practice, which oper-
ates with gestures, images, and, most importantly, with language. From a
discourse analytical point of view, texts and contexts, utterances and their
The authors thank Johannes Beetz, Sixian Hah, and one anonymous reviewer for their comments
on previous versions of this contribution. They are also very grateful to Marie Peres-Leblanc for
improving the design of the visualisations.
J. Hamann (*)
Leibniz Center for Science and Society, Leibniz University Hannover,
Hannover, Germany
e-mail: julian.hamann@lcss.uni-hannover.de
J. Maesse
Department of Sociology, Justus-Liebig University Giessen, Gießen, Germany
e-mail: Jens.Maesse@sowi.uni-giessen.de
R. Scholz • J. Angermuller
e-mail: r.scholz@warwick.ac.uk; J.Angermuller@warwick.ac.uk

52 J. Hamann et al.
uses in certain social structures and institutional settings always need to

be studied in conjunction. Language does not make sense outside the
contexts in which it is used. And it is difficult to account for contexts
without language referring to them. Discourse analysts, therefore, deal
with the problem of how texts are articulated with contexts, i.e. the social
configurations, which are not simply represented but also constituted in
discursive practices.
Discourse analysts deal with the relationships between texts and social
contexts in many ways. While for some approaches social contexts are
closely tied to discursive practices, others conceive of social context as
existing independently of discursive practices. Pragmatic approaches to
discourse, which look at the contextualisation of utterances, can be cited
as an example of the former. To analyse discourse pragmatically means to
ask how utterances, through the cues, markers, and instructions they pro-
vide, evoke certain (aspects of ) contexts. The latter, by contrast, is testi-
fied by structural or power theoretical approaches to discourse. In the
latter view, for the social meaning of texts to be understood, one needs to
relate to the wider social and historical horizon in which they are used.
Discourse analysis, in that view, would produce a partial picture of social
meaning-making practices if it ignored aspects of social reality that are
not shown in texts.
In this contribution, we will make the case for a context-centred dis-
course analysis by citing findings from our research on academic dis-
courses in the ERC DISCONEX project.1 Dispositif is a Foucauldian
concept that tries to capture and link the heterogeneous textual and
non- textual context conditions in which discourses emerge.
Operationalisations of the concept pay attention to hierarchies and
power structures in institutions, architectural order, regulations and
laws, and scientific statements, and so on that together form a network
1
The concept for the information system from which we draw our examples was developed within
the research project ‘Discursive Construction of Academic Excellence’, funded by the European
Research Council and led by Johannes Angermuller. We are grateful to the whole ERC DISCONEX
team for allowing us to present a part of their research ideas to which all four authors have contrib-
uted in various stages. For more information see: http://www.disconex.discourseanalysis.net.
The Academic Dispositif: Towards a Context-Centred Discourse… 53
of dispositional elements that the construction of meaning in discourses

depends on (for an introduction: Raffnsøe et al. 2016). A dispositif anal-
ysis integrates an analysis of language use with an analysis of the afore-
mentioned context elements. In this contribution, we will give an
example of how to account for a dispositif with the help of quantitative
methods. We apply these methods to short texts from institutional web-
pages. This first analysis is part of a larger study in which we combine an
analysis of academic texts with a dispositif analysis of academic careers
and institutions. This larger study aims to show that academic publica-
tions need to be seen against the background of the researchers’ social
positions in the institutional dispositif, as well as the symbolic dynamics
within their scientific communities. We will apply these theoretical con-
siderations with two specific empirical questions in mind: How can we
account for the discursive construction of social order in the academic
discipline of sociology? How do social relationships of difference and
inequality that constitute the sociological field go together with the lin-
guistic organisation of the field, that is, a specific distribution of words
and expressions across the social space?
The first section will discuss why we should integrate context data
into our discourse analysis, and it will elaborate on the different types
of social contexts that sociologists relate to in their discursive practices.
In the second section, we will ask how sociological insights into the
social context can inform a discourse analysis of the discipline of soci-
ology. Specifically, we will discuss possible (statistical) categories that
help to operationalise contexts of academic discourse. Our approach is
illustrated in the third section, which outlines an empirical analysis of
the research interests that UK full professors in sociology mention on
their homepages. We demonstrate how these data can serve as an entry
point for a context-centred analysis of academic discourse. The fourth
section ties the argument back to the theoretical background of the
dispositif approach and reveals the methodological framework that
guides the analysis and offers pathways of interpretation. The contribu-
tion closes with a discussion of our proposed approach in the fifth
section.
54 J. Hamann et al.
2 Integrating Sociological Data

into Discourse Analysis
In this section we will outline a methodological argument that explains
why analyses of discourse need an explicit and systematic analysis of
social and historical contexts. The combination of linguistically informed
discourse analysis and sociologically informed context study seems to be
useful because texts, language, gestures, and pictures, on the one hand, as
well as social and institutional rules and structures, on the other, refer to
and constitute each other.
2.1 The Problem of Text and Context
The question of text and context has been the subject of a great deal of
controversy. In line with socially minded linguists, who have long insisted
on the systematic empirical observation of real linguistic and discursive
practices (Bhatia et al. 2008; Blommaert 2005; Sarangi and Coulthard
2000), we will make the case for a sociological take on social and histori-
cal contexts. We will cite and elaborate Foucault’s concept of dispositif in
order to seize the social context as an institutional arrangement of linguis-
tic practices and non-linguistic practices, rules, and structures in a larger
social community. While text and talk can be analysed with the classical
instruments of discourse analysis (from pragmatics to corpus analysis),
the dispositif is analysed with the help of sociological methods (such as
interviews, questionnaires, statistical analysis, ethnography).
With the concept of the dispositif, we make the case for sociological
perspectives on discursive practices as embedded in institutional power
arrangements (the notion of dispositif has been the object of debate in
France, where the term originated, and Germany: Angermüller 2010;
Angermuller and Philippe 2015; Bührmann and Schneider 2007, 2008;
Maesse and Hamann 2016; Maingueneau 1991; Spieß et al. 2012). The
dispositif approach encompasses power and social structures (Bourdieu
2010), the nexus of power and knowledge (Foucault 1972), as well as
institutionally organised processes of interpretation (Angermüller 2010).
It takes CDA perspectives further, in that it pleads for studying social
context empirically and systematically. Rather than leaving claims about

social inequality to intuition, it asks for methodological symmetry: con-
texts should be studied just as systematically as texts. Moreover, for the
dispositif analyst, context is not a social reality waiting as it were behind
texts. It is an arrangement of social practices that not only represent the
social reality, but by representing it constitute it.
Our plea for a combination of linguistic discourse analysis and socio-
logical dispositif analysis responds to a demand for addressing questions
of power and inequality in discourse research. While we agree with those
interactional and praxeological discourse analysts who criticise structural
approaches to discourses for subsuming the material under preconceived
theories, we do think that power and inequality are questions that dis-
course analysts often need to address. Social structures can have an
important effect on meaning-making practices in society without being
referred to explicitly, or even implicitly, in a given text or discursive prac-
tice. Indeed, how could one understand the implicit political messages of
an academic publication, an OECD report, a New York Times article, or
the performance of a stand-up comedian if one filtered out the institu-
tional setting and the broader power field in which it circulates? Yet,
claims about broader institutional, social, and historical contexts cannot
be produced in an ad hoc way; they necessitate systematic theoretical and
empirical scrutiny.
We enter the debate on discourse and context from both a sociological
as well as a linguistic point of view. Discourse analysis does not only
include the study of linguistic characteristics of texts, symbols, icons, ges-
tures, and other forms of expression. It also requires the systematic
empirical study of social contexts in which these linguistic forms orches-
trate interpretation and create social meaning. However, the differentia-
tion between text and context is one possibility among others that seeks
to come to terms with this problem in discourse analysis. We take the
opposition of text and context as an analytical starting point in order to
argue for a methodological differentiation between the analysis of lin-
guistic qualities (of, for example, texts) and the sociological study of con-
texts (e.g. social structures and institutions). While ‘discourse’ is
understood as the combination of text and context, we keep both dimen-
sions apart for analytical and heuristic reasons that become relevant in
56 J. Hamann et al.
practical research processes (Angermüller 2004, 2006; Hamann 2014;

Maesse 2010, 2015a).
2.2 Levels of Discourse Analysis
The discourse analytical process usually takes place at three different lev-
els of empirical investigation, as outlined in Table 3.1. The first level deals
with problems that are first and foremost linguistic in nature and located
on the text level of discourse. At this stage, qualitative and quantitative
methods are applied to analyse the formal rules that make linguistic forms
(spoken words, written texts, gestures, as well as pictures) readable in
respective social contexts. Thus, the analysis of argumentation, deixis,
categorisations, polyphony, co-occurrences, and so forth requires the
study of small utterances as well as large textual corpora. The particular
choice of method depends on the research question or on corpus charac-
teristics. Nonetheless, the linguistic level must not be confused with the
social contexts in which language appears in order to create meaning(s).
After the linguistic level, a sociological level emerges, which cannot be
studied with linguistic methods. At this point, the discourse analytical
Table 3.1 Three levels of analysis

Level Example Analytical goal
Language A book, a media corpus, an Studying linguistic forms’
utterance, a corpus of deixis, co-occurrence,
utterances attribution, etc. with
quantitative and
qualitative methods from
discourse analysis
Social context An academic discipline, Studying the social and
situations such as a institutional rules as well
workshop, the national as social conventions
Higher Education System
Theoretical ‘Class struggle within Give explanations of data
interpretation academia’, ‘identity and build theoretical
formations of researchers’, models of what is happing
‘functional differentiation of in society
academic systems’, ‘face in
academic interactions’, etc.
process moves from the first, the linguistic, level, to the second level of
investigation: social context(s) (Table 3.1). This switch from one level to
another is required in qualitative as well as in quantifying approaches to
textual materials. Large data sets and corpora as well as small utterances
neither speak nor interpret themselves.
As is illustrated in Table 3.1, the linguistic and sociological levels of
discourse analysis are complemented by a third level: theoretical interpre-
tation. Taken by themselves, neither linguistic nor contextual data are
interpretations. Furthermore, the combination of discourse analytical
data with data from sociological analysis is not an automatic and natural
procedure either. Interpretations do not emerge from the data; they are
made by those who interpret them. This is where the significance of the
theoretical level of analysis comes into play. Researchers can mobilise
theories and paradigms for the interpretation of data and they can build
new theories and explanations on the basis of data interpretations led by
theories. Whereas positivistic approaches follow a data-theory determin-
ism, we suggest giving theory its place in research processes as a tool for
data interpretation and as a result of those processes. Theory is simultane-
ously the creative starting point and the result of every research process.
It helps to make sense of data. While the theoretical level will be addressed
in the fourth section of this contribution, let us briefly return to the con-
textual level.
The three levels of analysis can be observed in various types of dis-
course analysis. For example, if pragmatic discourse analysts ask how
utterances are contextualised through markers of polyphony (Angermuller
2015), their analysis does not stop at the textual level. An example from
economic expert discourses (Maesse 2015b) can show how an analysis of
meaning-making on the micro level (of utterances) can be combined
with the study of institutional contexts on the macro level. The following
exemplary statement was uttered by the economist Joseph Stiglitz.
What is needed is a macroeconomic theory based on theories of imperfect

competition and imperfect information. (Stiglitz 1984, 355)
The author provides us with linguistic material that can be studied

with respect to the deixis of person, time, and place. The utterance thus
58 J. Hamann et al.
refers the reader to the context in which it is uttered by the locutor (i.e.
Stiglitz). To make sense of this utterance, the reader will need an under-
standing of the locutor and their context. It will, therefore, be important
to know that Stiglitz has been awarded a Nobel Prize, is an established
academic at several Ivy League universities, a popular commentator of
economic policies and chief economist of the World Bank.
Yet, knowledge about the social context is not only important for this
type of linguistic micro-analysis of utterances. As will become clear in the
following, more structural approaches, too, articulate linguistic and soci-
ological levels of analysis and integrate them into a theoretical
explanation.
2.3 Accounting for Social Context
There is a variety of qualitative and quantitative methods that can be used

to account for a multitude of social contexts of discourse (Table 3.2). In
sociology, quantitative methods are typically used to portray aspects of
social reality from a macro-sociological perspective that is interested in
large-scale distribution. For instance, there is a tradition of comparing
societies by analysing their social structures. Scholars draw, for example,
on large data sets comprising birth rate, education duration and attain-
ment, family size, income level, professional occupation, and income
differences.
The first impression from a discourse analytical perspective is quite
similar: data on social or institutional structures are not very helpful
when we are actually interested in discursive practices in the production
of knowledge, meaning, and subjectivity in different cotexts and con-
texts. We argue, however, that socio-structural data are indeed important
for a better understanding of the social contexts and institutional power
relations in which discourses emerge. Some relevant social contexts may
betray more stable structures. They spread out over vast areas of the social
world and their influence can only be detected on the macro level. Among
these ‘hard facts’ of social life one can discover, for example, stable insti-
tutional structures in which a discourse on educational reforms is embed-
ded (Maesse 2010) or the socio-structural backgrounds of a university
Table 3.2 Four ideal typical dimensions of social context for the analysis of
discourses
Type of Social relations Institutions and Epistemic Forms of social
context organisations resources practice
Example Academic Universities, Rankings, tacit Reading, writing
community professorships, knowledge books/papers/
networks, funding about certain articles,
teaching organisations, institutions, presenting
relations, publishers, scientific papers at big
organisational editorial theories and conferences/
hierarchies boards, methods, informal circles,
between commissions, ideological being involved
deans and political knowledge, in email
professors or parties, knowledge communication,
professors and business firms, about and so forth
PhD administrative political and
candidates, offices, and economic
relations bodies organisations
between
politicians,
media
journalists,
and academics
population that have an impact on how certain academic ideas are discur-
sively constructed (Hamann 2014). Furthermore, it can be worth look-
ing at the institutional context and practices of text production that
reveal information on the influences on choices of topics, arguments, and
positions in, for example, election manifestos (Scholz 2010).
Data that can be analysed with quantitative methods are one suitable
route to assess the structures in which a discourse is embedded. Such an
approach enables us to analyse social phenomena that are spread over a
relatively large geographical space, such as national higher education sys-
tems, and which concern relatively large groups, like disciplines or a pop-
ulation of professors. Furthermore, we are able to trace developments of
social contexts that encompass discourses over relatively long periods of
time. We can account for a large quantity of data in order to get an over-
view of the research object analysed before we actually start interpreting
language-related phenomena in the discourse.
60 J. Hamann et al.
Accounting for social contexts of discourses will take into consider-

ation four ideal typical and usually interrelated dimensions and, depend-
ing on the research question, their development over time: social relations
(actors and their relations), institutional frameworks (organisations, bod-
ies, entities and their rules), epistemic resources (ideas, concepts, tacit
and biographical knowledge), and forms of social practice (field rules and
their interrelations with other field rules) (Table 3.2).
3 Dispositif as a Heuristic Concept

for a Context-Centred Discourse Analysis
We suggest the concept of dispositif as a theoretical framework that can
guide our interpretation. It is crucial for our argument that academic
discourse—that is, not only the research interests stated that we will dis-
cuss below but also, for example, theories, schools, subdisciplinary spe-
cialisations—and the practices that constitute them—for example,
publications, talks, citations, mentoring, societal engagement—are not
taking place in a social vacuum. Rather, academic discursive practices are
embedded in social and institutional contexts. These contexts range from
very specific settings in specific departments, archives, or laboratories
(Knorr Cetina 1981; Grafton 2011) to very broad and durable structures
on a national or even global scale (Frank and Gabler 2006; Fourcade
2009). In addition, academic discursive practices are enforced and estab-
lished in power relations. Power relations are expressed, for example, on
the micro level, where uncertain funding may influence scholarly prac-
tices, and thus encourage, or discourage, specific research interests (Morris
and Rip 2006), on the meso-level, where departments that attract the
majority of research funding have the resources to shape the image and
development of a discipline (Hamann 2016a), or on the macro level,
where there is a social expectation that research has an impact on society
(Maesse 2015b). After discussing methodological aspects, we now turn to
the theoretical question: how can we combine the study of linguistic
material and of social and institutional contexts and how can we concep-
tualise the multifarious relations between discursive practices and their
contexts marked by power and inequality?
3.1 Three Aspects of Academia as a Dispositif
Foucault’s notion of dispositif provides a valuable starting point for this

endeavour. The term dispositif describes the ensemble that discourses
form with elements as heterogeneous as institutions, practices, rules, and
even architectural forms. In short, the focus of interest is powerful appa-
ratuses that relate elements of “the said as much as the unsaid” (Foucault
1977, 194). This makes a dispositif a vital heuristic concept for the ques-
tion of the relation between discourses and contexts.
Foucault’s dispositif concept is well suited to account for social dynam-
ics in the academic world. Academic power relationships are often organ-
ised through a formal status hierarchy. But many are indirect and emerge
from spontaneous interactions among academics. While language plays
an important role in producing and reproducing the academic social
order, any linguistic expressions that academics use are characterised by a
certain interpretive openness. Their meanings depend on the contexts in
which they are used. While meaning production is by nature excessive
and ‘recalcitrant’, it is contained by institutional arrangements such as
the academic dispositif. Meanings are homogenised, interpretations are
smoothed out, ideas are domesticated through the interrelated practices,
rules, and resources of the academic dispositif (Angermüller 2010,
90–96). Conceptualising academia as a dispositif thus emphasises three
aspects: power effects of closure and sedimentation, heterogeneous and
overlapping contexts, and the discursive circulation of signs between aca-
demia and other fields (Maesse and Hamann 2016).
3.1.1 Power, Closure, Sedimentation
The dispositif concept reminds us that power is more than mere interpre-
tative efforts describing power as an “open, more-or-less coordinated […]
cluster of relations” (Foucault 1977, 199). It emphasises effects of closure
and sedimentation that are also part of the academic world. There are
many examples where meaning-making is domesticated and controlled
in the academic world, think of the rhetoric of excellence and competi-
tion as well as discourses of inclusion and equality, the pressures for exter-
62 J. Hamann et al.
nal funding and on university admissions (cf. Zippel 2017; Münch 2014;
Friedman et al. 2015; Kennelly et al. 1999).
3.1.2 Fields as Heterogeneous Arenas
Conceptualising academia as a dispositif emphasises a second aspect: aca-

demic contexts are complex and heterogeneous arenas that overlap with
other arenas. Foucault (1977, 199) calls for “a grid of analysis which
makes possible an analytic of relations of power”. However, an analytical
grid for the power relations that encompass discourses and contexts was
not systematically developed by Foucault. Thus, we would like to draw
attention to a framework that already exists for the analysis of social
power in specific arenas: Bourdieu’s field theory of symbolic production
(Bourdieu 1988, 2004). Combining Foucauldian and Bourdieusian
approaches allows us to account for the various effects of closure, stratifi-
cation, and sedimentation that we highlighted in the previous paragraph.
It also provides a hypothesis for the relations between discourses and con-
texts by suggesting a structural similarity between the positions within a
field and the statements made from these positions (Fig. 3.3). The main
reason for combining Foucauldian and Bourdieusian approaches is, how-
ever, that it allows for a sophisticated approach to institutional and social
contexts of discourses. Rather than remaining diffuse and opaque, con-
texts are elevated to a main object of investigation. Understanding them
as fields enables us not only to grasp the disparity and asynchrony of dif-
ferent contexts, but also to relate various social logics, structures, and prac-
tices to one another (cf. Bourdieu 1983, 1996). Our dispositif approach
links both perspectives: the field perspective emphasises the dimension of
social and institutional power while the discourse perspective incorpo-
rates linguistic practices of interpretation and translation.
3.1.3 Discursive Circulation and Interpretation
Conceptualising academia in this way further underscores a third aspect.

It highlights that discourses play an important role because they give
social actors the opportunity to act in an open field, as well as enabling

discursive circulation throughout many fields between academia and
society. Academic discourses span academic and other societal contexts
because they consist of signs that can be interpreted differently in various
contexts. In linguistic terms, signs are material carriers of multiple mean-
ings. They open up a multitude of meanings that are subsequently closed
again by specific interpretations. This is why academic discourses can be
simultaneously embedded into interrelated academic and other societal
contexts. For example, a journal article in a science discipline might rep-
resent a discovery that is translated into a patent in the judicial context,
and into research cooperation with enterprises in the economic context.
In the social sciences and humanities, other contextual trajectories are
more likely. A publication could be interpreted as a theoretical position
that matches a certain climate in the political context (Lamont 1987;
Angermuller 2015), or as an insight that is picked up by media outlets
that interpret the sign in terms of expertise (Maesse 2015b). Signs that
are interpreted as symbolic academic capital in the academic context may
traverse into judicial, economic, media, or political fields, where different
actors charge them with different meanings. The research that emerges
from these positions would have to traverse from the academic into a
number of other societal contexts.
In order for discursive utterances in Foucault’s sense and capital in
Bourdieu’s sense to make sense and mean something, the material sign as
a carrier of meaning has to coincide with the meaning it generates in a
specific context. The material form of utterances is a prerequisite to gen-
erating meaning, just as the form needs a context in order to refer to
something meaningful. Our dispositif theory does not describe the circu-
lation of meaning (or capital) since meaning is the product of using lan-
guage. What circulates is not meaning, but text and talk that are charged
with meaning in various contexts (cf. Beetz 2016, 82–88). The semiotic
substantiation of our nexus of discourse—dispositif—field allows us to
treat discourses as distinct social entities that are neither above nor subor-
dinated to their contexts, but embedded in them.
Our heuristic toolkit will now allow us to understand, first, how ele-
ments of academic discourse—for example, a journal article that presents
64 J. Hamann et al.
a particular approach—are embedded in academic and other contexts,

for example, when they are picked up by researchers from other subfields,
or when they influence political debates. This requires, second, that we
understand how these discursive elements are produced in the context of
an academic field, equipped with a specific academic meaning and re-
interpreted when they travel into other fields.
4 ow to Use Sociological Data

H
in Discourse Analysis?
So far, we have outlined the importance of the linguistic, contextual, and
theoretical levels of discourse analyses. In the following two sections, we
will explain how socio-structural sociological data about institutional
contexts of discursive practices can be articulated with discourse analysis
in order to study the discursive construction of social order, here for
example in an academic discipline. We will draw on empirical examples
from an information system that has been built for this purpose in the
ERC DISCONEX project. The information system aims to provide soci-
ological information about institutional and epistemic contexts of aca-
demic discourses. With this goal in mind, we have produced quantifiable
data that can inform us about larger social structures in which discourses
emerge. Such an analysis of the macro-structures of the social context
helps to decide and justify which particular cases of discourse production
should be chosen for a more thorough analysis with qualitative methods.
Therefore, we do not understand quantifying methods in terms of a posi-
tivistic approach to discourse, but as a means to explore and discover. For
our discourse analytical perspective, quantifying methods are a useful
heuristic tool, which is followed by a more fine-grained analysis (Kleining
1994). In the first part of this section, we will give an example of a data
set that informs us about social contexts of a discourse. Second, we will
explain how we use such data for the purposes of discourse analysis.
Thereby, we will further develop the theoretical framework for a context-
centred discourse analysis.
4.1 apping Social Contexts with Statistics: Actors,

M
Institutions, Organisations
To map the contexts of discursive practices, our first step is to assess the
relevant actors within and outside academia. The social contexts of aca-
demic discourses consist of, for example, researchers, institutions like
universities, publishers, and funding agencies, as well as disciplines,
research groups, and networks. In a broader sense, these are all actors
that, in one way or another, participate as social entities in an academic
discourse, no matter whether they are individual or collective, human or
non-human. Hence, the first step in our analysis is to identify the dis-
course actors that are relevant for the discourse we want to study. This can
be done via a systematic investigation of the institutional structures of
our research object. In our case we catalogued all national higher educa-
tion and research institutions that are relevant to the academic discourse
in a particular research field together with the full professorships in each
institution. In addition, we also tried to account for higher education
policies effecting the classification of universities. In the UK case, classifi-
catory instances that are part of the higher education dispositif include,
for example, such groups as the Russell Group and European Research
Universities, and also governance instruments like the Research Excellence
Framework, with its highly influential focus on research excellence and
societal impact (Hamann 2016a). The importance of these classificatory
instances notwithstanding, our approach in the project is more focused
on the individuals in higher education institutions, the way they position
themselves, are positioned by others and the career trajectories they have
followed.
There are numerous other methods to identify the actors or partici-
pants of a particular discourse. For a cademic discourse, citation analysis
has become the preferred approach in order to map the structures of sci-
entific communities (e.g. Estabrooks et al. 2004; Xu and Boeing 2013),
and concept-mapping has been applied to identify how different actors in
a cross-disciplinary context conceptualise their research area (Falk-
Krzesinski et al. 2011). Below, we will illustrate how a mapping of the
positions of discourse participants can be produced with correspondence
66 J. Hamann et al.
analysis by contrasting the distribution of all the words used in their

texts.
Taking the example of academic discourse, the mapping of contexts
should not only enable us to identify relevant actors, it should also help
to locate researchers institutionally and epistemically based on their pro-
fessional activities. We distinguish, as suggested in earlier studies
(Angermuller 2013; Whitley 1984), between an institutional world
where researchers can occupy posts in the organisational hierarchy of aca-
demia and an academic world where researchers can occupy positions in
academic communities and become references over knowledge horizons.
In Bourdieu’s (1988) terms, this means that, in order to build up a mean-
ingful, intelligible position in academia, a researcher needs to draw on
two types of capital. S/he needs institutional academic capital in terms of
worldly power that is created, acknowledged, and accumulated, for
example, in committees, councils, and on boards, and s/he also needs
symbolic academic capital in terms of prestige, recognition, and acco-
lades that are awarded by peers.
In this chapter we will focus on a synchronic view of the academic
world. In addition, however, we are interested in diachronic develop-
ments of researchers’ locations. This view of trajectories is first and fore-
most concerned with researchers’ biographies—if possible going back to
their social origins, tracing their careers by systematically collecting infor-
mation on researchers’ educational pathways and the institutions they
have gone through and the positions they have held in the course of their
careers (cf. Hamann 2016b). For the quantitative analysis of such data,
Sequence Analysis is one route to follow (Blanchard et al. 2014).
4.2 Challenges and Potentials of Context Data
We propose to analyse data pertaining to institutional backgrounds in

which a discourse is produced, and to the biographical data of discourse
participants that can actually be held responsible for the texts produced
within a discursive formation. Both institutional and biographical con-
text data can be analysed from a diachronic and synchronic perspective,
which results in different challenges and potentials.
Biographical information on researchers’ administrative positions,

teaching, media activities, and non-academic as well as research-related
activities, for instance, allows us to delineate how these various activities
influence particular research and academic discourse in general.
Furthermore, it might be interesting to see which activities become
important at which points in a career (cf. Ćulum et al. 2015). By collect-
ing data on publications and collaborations, we can find out more about
the social networks within the community of a particular research field.
Also, such data help to study institutions through the prism of their
researchers: how is an institutional framework reflected in the writing of
its researchers? Which institution has the most diverse research interests,
with the greatest range of transdisciplinary research interests, with the
most cited researchers, with the most and largest grants, and so on? One
objective of exploring the institutional background can be to identify
hubs for particular research interests and theoretical approaches. With
methods like correspondence analysis, we are able to find out more about
the transdisciplinary contacts between research fields and disciplines.
Additionally, it is possible to observe trends in research fields from a dia-
chronic perspective.
However, these potentials notwithstanding, institutional structures,
research infrastructures, affiliations and career progress differ substan-
tially in different countries (Angermuller 2017; Paradeise et al. 2009).
Hence, developing a grid of categories accounting for the institutional
and social reality of the academic world in various countries is anything
but straightforward.
For our example, it is important to note that France has a more cen-
tralised academic system, the system in the UK is less centralised, though
many resources are still concentrated in Oxford-, Cambridge-, and
London-based institutions, while German higher education and research
institutions have more autonomy and are usually dependent on individ-
ual states [Länder].
Moreover, in order to account for the context of the academic dis-
course in different countries, the institutional working conditions under
which research is conducted can add an important analytical dimension.
For instance, in contrast to Germany or the UK, there is an institutional
distinction between research and teaching activities in France. This
68 J. Hamann et al.
French peculiarity does not exist in many other countries. Research is

organised by discipline in trans-institutional laboratories according to
particular fields of research. If they are not working for the National
Scientific Research Centre (CNRS), researchers have to teach at a univer-
sity. There is a national body (National Council of Universities [Conseil
National des Universités—CNU]) that classifies these researchers and allo-
cates their work to one of the currently 87 sections corresponding to
fields of research. After completing a PhD, a researcher pursuing an aca-
demic career would normally try to become qualified by the CNU, and
then apply for a permanent civil service position. Their counterpart in
Germany would have to be happy with a renewable contract. In case they
do not qualify for a professorship after 12 years, they risk their academic
career coming to an end (Law on Temporary Employment Contracts in
Sciences [Wissenschaftszeitvertragsgesetz]). Through the prism of the dis-
positif approach, we consider such institutional differences as part of the
power structures establishing the social context in which processes of text
production in academic discourse play out. An analysis of these institu-
tional social contexts will help to discover and interpret socially induced
features on the textual level of academic discourse.
4.3 Examples of Statistical Categories
A context-centred study of academic discourses can be based on publicly

available information about the positions, careers, and activities of
researchers and their institutions. For smaller research projects, such data
can be collected in spreadsheets. For more complex studies, however, it
might be worthwhile creating a system into which data can be fed manu-
ally (e.g. from online CVs) and automatically (e.g. from literature data-
bases). In the case of the DISCONEX project, we created an information
system that comprises socio-institutional and biographical data. All data
were taken at face value, as presented in their sources. Hence, categories
that were used in the data sources are reproduced in the database. In this
way we hope to account for the concepts and labels that the actors believe
to be important in the social and discursive arena in which they aim to
position themselves. Of course, in research practice, this ideal has some
pitfalls that must be tackled. For instance, universities, degrees, and even
countries change names. In order to ensure the continuity of the reference
labels in the information system, old and new names have to be linked to
each other. This is by no means an innocent step because we are interven-
ing in the game of names and their meaning that is at the heart of ques-
tions on discursive construction, which we are actually interested in.
The quantitative analysis of these data aims to reveal aspects of the
social structure and dynamics of research fields. Why would discourse
analysts be interested in such questions? The answers help to understand
the social and institutional context to which discourse participants must
respond implicitly or explicitly if they want to produce a meaningful
statement in a particular discourse. We assume that discourse practices
relate—in one way or another—to these context conditions. Thus, a dis-
course analysis that integrates context data can enrich the interpretation
of textual data with information about institutional and societal condi-
tions that are usually not obvious merely by studying text data.
What we propose here is a first step towards the systematic acquisition
and study of such data whose results would still have to be articulated
with the (quantitative and qualitative) text analytical methods that are
widely used in discourse analysis. In terms of a context-centred perspec-
tive, this will help to better understand why a certain type of research,
colluding with particular statements and narratives, occurs in particular
institutions and locations. In this sense, we could integrate into the anal-
ysis of academic discourses, for example, the impact of research and fund-
ing policies on the research landscape, and the topography of a field of
research.
In order to conduct integrative studies, we propose to collect ‘hard data’
(a) on institutions, (b) on individuals and their career trajectories in those
institutions, and (c) on the professional networks of these individuals.
A. The following information about departments as well as the research

interests of the senior research staff allows locating research hubs:
1. Institutions or groups of institutions (e.g. League of European
Research Universities, Russell Group) with a high number of research-
ers in a field;
70 J. Hamann et al.
2. Mappings of research interests on the institutional, disciplinary,

and national levels;
3. The publication types and journals in which researchers with par-
ticular research interests publish predominantly—for example,
one might find that German economic sociologists publish more
books than journal articles when compared to their British col-
leagues; or that corpus linguists from the University of Lancaster
publish their articles in a particular journal that differs from the
publication preferences of their colleagues at the University of
Birmingham.
Furthermore, the enquiry into biographies can help to understand
whether particular institutions and/or disciplines prefer individuals with
a particular social or career background. By systematically recording edu-
cation and career trajectories, we might, for instance, identify those insti-
tutions, disciplines, or research fields in which researchers have typically
a migrant, working-class, or elite background.
B. The following information on education and career steps collected

from the CVs of research staff allows identifying typical and atypical
career trajectories in different fields, disciplines, and countries:
1. Ages and biographical backgrounds of researchers in a field (in
future projects this might be completed with information on the
social backgrounds of researchers);
2. The average time spent in each career stage in a respective field and
country;
3. The average time needed to become a professor in a certain field,
discipline, and country;
4. The number of institutional positions held until professorship is
reached in a certain field, discipline, and country;
5. The typical and less typical activities of researchers in a particular
research field until achieving professorship;
6. The institutional origins of senior researchers in a field.
In terms of a discourse analytical perspective that focuses on institutional
contexts, this information helps to illustrate the similarities and differences
in career trajectories between different fields of research as well as coun-

tries, relations between career progress (seniority) and particular activities
(number and quality of publications, acquisition of research funding,
administrative activities/positions in the higher education institution) or
the conditions for a change of research interests in a career, among other
things.
C. Finally, information on collaboration in research projects and publi-

cations can be used to study the professional networks of individuals
in institutions. With the help of social network analysis, we could
investigate such questions as:
1. Co-authorship;
2. Who publishes in the same journal, or with the same publisher?
3. What are typical or less typical text types that are published in a
discipline or a particular field of research?
4. Citation analysis: social networks of citing researchers in a particu-
lar field.
–– Who are the most cited researchers in a field of research?
–– Do they come from a certain type of institution?
–– Relation between seniority and citation frequency: are researchers
in senior positions cited more than junior academics?
–– How important are early career contacts in citation patterns?
In terms of a context-centred discourse analytical perspective, this

information facilitates a social network analysis mapping the social
and institutional contexts of academic discourses. Once we have col-
lected a data set according to the research interests outlined between
A1 and C4, we can produce data queries that correspond to these
interests. A results report is basically a spreadsheet containing variables
and data, which will then have to be analysed and visualised with sta-
tistical software. In the following section, we will give an example that
focuses on the research interests of full professors in UK sociology
departments.
72 J. Hamann et al.
5 xample: Research Interests as an Entry

E
Point for a Context-Centred Analysis
of Academic Discourses
Based on the analytical framework developed in the previous sections, the
current section will outline what a quantifying exploration of social con-
texts could look like. In doing so, we will address the empirical questions
posed in the introduction: How can we account for the discursive con-
struction of social order in the academic discipline of sociology? How do
social relationships of difference and inequality that constitute the socio-
logical field go together with the linguistic organisation of the field, that
is, a specific distribution of words and expressions across the social space?
The aim of this section is to demonstrate a method that can help to get
an idea about the macrostructure of academic discourse in UK sociology
based on research interests. Ideally, this analysis of the institutional con-
text of academic discourse and its structure should help to explain hierar-
chies and power relations. The division between central and periphery
institutions that this section reveals on the level of research interests could
help to explain the manifestation and sedimentation of symbolic and/or
economic capital that these institutions accumulate in evaluation proce-
dures, such as the Research Excellence Framework or various other rank-
ings—an analytical step that we allude to in the last section of this text.
As an example of how to use quantitative methods as a heuristic tool
for a qualitative analysis of discourse data, we explore one dimension of
academic discourse, the research interests that individual full professors
display on their homepage profiles in sociology departments. In a more
thorough analysis, other data sets could be integrated depending on our
research question. For the sake of illustration, we only represent one of
the numerous methods that could be applied. In a second step, which we
won’t be able to present here, we could further investigate the similarities
in academic discourse of those institutions grouped by our method in
one cluster. With this exemplary study we aim to identify the research
hubs of various research fields in a given discipline and country—in this
case, UK sociology. With the help of correspondence analysis, we are able
to create maps that present those institutions on whose webpages we find
similar research interests and keywords close to each other, whereas dif-
ferences in terms of research interests are represented by greater distance
on the map.
5.1 The Corpus and Method
For this study, we compiled a trial corpus with texts of research interests
and keywords that full professors at 76 UK sociology departments pres-
ent on their institutional webpages. There are more than 90 sociology
departments in the UK, but not all of them have full professors on their
staff. We consider full professors to be preeminent stakeholders in aca-
demic discourse and therefore the analysis of their data is the starting
point of our study. The corpus was partitioned in such a way that we
could compare research interests on the institutional, disciplinary, and
national levels. With a size of 11,980 tokens, our corpus of UK sociolo-
gists is quite small. However, it is big enough to present the method and
its potential for future research. The corpus has not been lemmatised and
also includes all grammatical words. Our choice is based on the assump-
tion that different grammatical forms of content words and grammatical
words themselves have a particular influence on the meaning construc-
tion for what we want to account for in our analysis.
We analyse our data set with correspondence analysis. This is a statisti-
cal method to simplify complex multivariate data by grouping entities
under investigation according to corresponding features. In classical
empirical social research, the method has been used to group actors
according to similar occupations and dispositions (Bourdieu 2010), in
discourse analysis the approach has been used to group speakers accord-
ing to corresponding features in their language use. In this sense, it is a
powerful method to discover similar language use of different speakers in
particular time periods by taking into account the complete vocabulary
of a given corpus and comparing it according to the partitions intro-
duced. In this study, we took the partition ‘institution’ and we contrasted
the distribution of all word tokens used to express research interests on
the website of a given department of sociology in the UK with it. To the
extent that the method takes into account the entire vocabulary of a
74 J. Hamann et al.
corpus, it is a corpus-driven (Tognini-Bonelli 2001) heuristic approach

for the analysis of language use. It was first developed in France in the
1970s within the lexicometric approach to discourse (see Chap. 5).
Applying correspondence analysis to text corpora is a way of visualis-
ing similarities and differences in language use by projecting words into
an at least two-dimensional visualisation (Chap. 6 in this volume, and
Bécue-Bertaut 2014; Benzécri 1980; Husson and Josse 2014; Lebart and
Saporta 2014; Salem 1982). There is now a range of lexicometric tools
that offer correspondence analysis, for example, Lexico3, Hyperbase,
and Iramuteq. However, for this text we used the software TextObserver,
developed at the CEDITEC laboratory, University of Paris-East. This
tool provides maximum flexibility in terms of the integration and exclu-
sion of corpus parts into the analysis, which can be helpful to understand
the impact that certain corpus parts have on the visualisation.
Correspondence Analysis applied to text corpora is based on a table of
the distributions of word token frequencies (rows) in different corpus
parts (columns). In our case, one can find in the first column of this table
all the word tokens of the corpus. All other columns are named after the
location of the university from which the texts of the corpus originate.
The rows of these columns contain the frequencies of the word tokens
they refer to in the texts of a particular university. In simple terms, the
method can be described as an algorithm grouping together those words
with similar distributions across different universities (columns).
Figures 3.1 and 3.2 serve purely didactic purposes—they are not readable
as such. However, they demonstrate how the words of a given corpus are
taken into account and visualised according to their distribution profiles.
By distribution profiles, we mean a similar high or low distribution of a
group of the same word tokens across different institutions (columns).
The word tokens of such profiles are positioned close to one another.
In Fig. 3.1, the words in blue refer to keywords used by professors in
UK sociology departments. The blue dots refer to the positions of word
tokens calculated with an algorithm comparing the distribution of all
word tokens across the columns. In Fig. 3.2, the words in red refer to the
locations of the universities from which the texts were taken (column
names). Their positions are calculated based on the positions of the blue
dots. The names of columns with similar distribution profiles are placed
Fig. 3.1 Correspondence analysis of keywords and research interests of UK soci-

ologists—rows only represented
close to one another, whereas those with very different distribution pro-
files are more distant from one another. The axes are situated alongside
the highest concentrations of similar characteristics (here similar frequen-
cies of the same words in different institutions). Deciphering the mean-
ing of the axes is part of the interpretation process, which often needs
further analytical steps using different methods.
To demonstrate how the visual can be interpreted, we have chosen to
make it more readable by removing certain word tokens from the repre-
sentation. In Fig. 3.3, we have kept those content words that we under-
stand have particular importance for occupying a position in the research
field of sociology. This is simply for the sake of demonstrating the method.
A more systematic analysis would have to be more explicit about the
76 J. Hamann et al.
Fig. 3.2 Correspondence analysis of keywords and research interests of UK soci-

ologists—rows and columns represented
words that have been removed from the visualisation (but not from the
analysis). Moreover, the data set should be completed with more and
longer texts. However, regardless of these limitations, we think that the
potential of the approach will become obvious.
When looking at a location in the system of coordinates, we must
consider that both axes represent dominant information concerning a
certain variable. Finding the answer to the question as to which variable
that might be is part of the researcher’s interpretation. In the case of tex-
tual data, the concept of ‘variable’ would have to be understood as the
semantic realm triggered by the referential meanings of certain words.
However, the interpretation of a correspondence analysis based on tex-
tual data is never straightforward because it is based on lexical items, not
semantic units. The problem is that the same semantic can be expressed
Fig. 3.3 Correspondence analysis of keywords and research interests represented

on the websites of professors in sociology departments in the UK
with different words or variations of words. In this sense, the method

tends to analyse the style of speech that has a semantic dimension, even
though it cannot be reduced to a pure analysis of semantic worlds. Hence,
the distances represented on the axes could originate from solely lexical or
solely morphological differences, or a mixture of the two. In this sense,
we have to be prudent with the claims we make on the semantic level. In
order to verify our claims, we can use concordance analysis, or other
methods that allow direct access to textual data. Regardless of the out-
come of such verification, we can be sure that the method reveals domi-
nant tendencies in the style of speech within texts from the different
corpus parts introduced (here, texts from different institutions).
Thus, Fig. 3.3 will help to understand which institutions are closely
related to each other in terms of the research interests of their professors.
At this point we have to acknowledge that the visualisation seems to
78 J. Hamann et al.
imply that the authors of the texts—UK full professors in sociology—

represent the research interests pursued at the particular universities they
work for. Moreover, one has to consider that there is no direct correlation
between the distance of keywords and their frequencies in the text of an
institution. For example, in Fig. 3.3, poverty is relatively close to Bangor
(2nd quadrant). However, the word form poverty does not occur in texts
taken from the University of Bangor—but it does occur in the text taken
from the University of Bath, which is displayed closest to Bangor. These
two are located close to one another because of the similarities of their
distribution profiles throughout their texts. Yet, this does not mean that
the vocabulary in the texts of both institutions is identical, and so par-
ticular words can be missing in one or other part. Hence, some words,
such as poverty, do not occur in both text collections. Thus, we should
concentrate interpretation of the visual on the similarities of institutions
and less on particular cases of vocabulary contributing to the visualisa-
tion: the texts used at Bangor are most similar to those used at Bath and
Belfast. The keywords displayed in Fig. 3.3 can help to give us a general
picture. For the interpretation, we should try to identify general common
features in the words displayed in a particular realm of the visualisation.
In trying to understand Fig. 3.3, we should also direct our attention
towards the meaning of the axes. In order to interpret these, we can look
at the most extreme cases on both axes. The most extreme cases on the
x-axis are the University of Aberdeen and the University of the West of
England (UWE Bristol). On the left side of the x-axis we see notions such
as societies, European, Europe, digital, death. On the right side of the x-axis
there are notions such as governance, power, failure, education, employ-
ment, technologies, practices, and others.
Ideally, we would now be able to relate the visualisation to the social
world—the academic world that they were taken from. Once we inte-
grate the complete list of all word forms contributing to the position of a
data point (not represented for reasons of readability), we can interpret
the x-axis as follows: on the left side of the x-axis we find institutions at
which researchers present their work in terms of an external perspective
of their discipline: they refer predominantly to different topics of the
discipline, internationality, and interdisciplinarity on their websites (psy-
chology, geography, Europe, international, etc.). In contrast, on the right
side, researchers present their work from more of an internal perspective

of their discipline. They emphasise the way they do sociology (compara-
tive, ethnographic, empirical, theoretical), whereas the international aspect
remains largely unaddressed by focusing on national British questions
(Britain, nationalism). While on the left side we have more problem-
driven topics, such as death, crisis, poverty, welfare, on the right side we
find, with some exceptions (illness, failure), more applied topics that ask
for affirmative solutions (communication, education, governance, employ-
ment) and stress technological and/or scientific aspects of society, such as
corporate power, regulation and governance of health technologies, regenera-
tive medicine, science and technology studies.
Most universities are situated around the origin of the coordinates,
which in comparison with the rest of the corpus represents the corpus parts
with the least specific distribution of the vocabulary. We can also say that
researchers at these institutions use vocabulary that is used most fre-
quently—and thus may represent the mainstream of UK sociology depart-
ments. Such terms include culture(s), gender, identity, racism, but also
references to time, such as current and historical. The fact that these institu-
tions are situated in the centre could also mean that they use terms that can
be found on both sides of the x-axis and also the y-axis to a similar extent.
Meanwhile the y-axis seems to represent a continuum between research
interests emphasising a rather critical attitude at the bottom and research
interests that are presented in a rather concise and neutral style at the top.
5.2 Limits of and Obstacles to Interpretation
With regard to the limitations of our interpretation, we have to admit

that texts from the genre that we are looking at are by their nature very
short. Thus, the outliers in Fig. 3.3 may also result from the fact that
there are only very few researchers, and hence particular research interests
that differ greatly from the ‘mainstream’ situated around the origin. A
bigger corpus might help to sustain the hypothetical results that we gen-
erated at this stage. These limitations notwithstanding, the visual illus-
trates a method that could be used to integrate the data on institutions
into a discourse analysis.
80 J. Hamann et al.
The advantage of using this somewhat imperfect corpus is that the cor-
pus parts are of a size that we can manage to read. We gain a better under-
standing of the method by simply reading the closest and the most distant
texts (Bangor, Bath, Belfast, and Aberdeen versus Glasgow and Bristol
[UWE]). The disadvantage of the relatively small corpus size is that changes
in the visual might be quite substantial if we add or remove texts of
researchers from these institutions. At any rate, we do not claim that these
visuals represent a positivistic depiction of ‘the reality’. Rather, through the
prism of correspondence analysis, we get a vague idea about hidden rela-
tions that are not visible in academic texts, the aim being to find out about
relations that could be relevant either on other levels from subsequent
investigations with other variables, or on the discursive level itself.
Suppose that we somehow had ‘complete’ data, we could relate these
results to studies that map, for instance, the level of access of UK institu-
tions to research funding in general, or to funding for research on par-
ticular topics. This would allow us to cluster institutions with a similar
level of access to research funding, and subsequently analyse to what
extent these clusters match the maps that we produce based on research
interests. We could also include other data, for example, on the perma-
nence and duration of positions across institutions, disciplines, and
countries, in order to investigate the impact of such variables on aca-
demic discourse in the short and long terms.
Given adequate data, the analysis of social contexts becomes a power-
ful supplement to discourse analytical approaches. This section has dem-
onstrated an exemplary starting point for such an undertaking. The
remaining question is connected to the third level of analysis (Table 3.1),
the theoretical interpretation of linguistic and social context data. In the
following section, we will suggest a theoretical framework that integrates
the linguistic and sociological dimensions of discourse analysis.
6 he Heuristic Potential of the Dispositif

T
Approach
The linguistic and sociological analyses of data that we have sketched out
in the previous sections can help us find out about the institutional and
discursive structures of research fields. But neither the linguistic material
nor the collection of sociological data can account for the social organisa-
tion of academia.
In this section, we propose a dispositif theoretical approach in order to
go beyond the opposition of micro and macro social structure and discur-
sive practice. The dispositif analysis we propose would read statistical data
pragmatically and reflexively. We take them as a starting point for further
investigations into an empirical object that resists simplification. We
point to three aspects of academia as a dispositif: (a) we emphasise a
rather structuralist notion of power that yields effects of closure and sedi-
mentation in academia, (b) we emphasise that academic contexts are
complex and heterogeneous arenas that overlap with other arenas, and (c)
we emphasise that discourses play an important role because they give
social actors the opportunity to act in an open field, as well as to enable
discursive circulation through many fields between academia and society.
As highlighted by the following sections, all three aspects are addressed by
the dispositif concept.
Let us illustrate the heuristic potential of a dispositif theory that guides
the analysis of academic texts and contexts. Coming back to our empiri-
cal example of full professors in sociology in the UK (cf. Sect. 5), the
three aspects of our dispositif approach generate the following analytical
perspectives: First, we have argued for a rather structuralist notion of
power that emphasises the effects of closure and sedimentation (cf. Sect.
3.1.1). What (more) can we take from Fig. 3.3 if we follow this argu-
ment? The specific distribution of sociology departments in terms of the
research interests of their professors might tentatively be interpreted in
terms of a centre and a periphery. Departments at the centre of the field,
including many London-, Oxford-, and Cambridge-based institutions,
and Warwick, could represent a thematically coherent core of ‘top’ depart-
ments. Anchoring this assumption with additional data would enable us
to test whether these departments are also ‘competitive’ in terms of fund-
ing. Professors at departments on the periphery appear to be pursuing
alternative research strategies that do not represent the ‘core interests’ of
the field.
Second, the dispositif theoretical framework introduces fields as a
main object of investigation, thus allowing for a systematic account of
different contexts that overlap with each other (cf. Sect. 3.1.2). Following
82 J. Hamann et al.
this argument, the distribution in Fig. 3.3 can be understood as a field, as

a power-soaked arena in which professors and departments struggle for
scarce resources. A sophisticated notion of contexts as fields allows us to
investigate phenomena such as the specific resources that the actors in
Fig. 3.3 are actually competing for. It also allows us to consider—and
distinguish—other contexts that overlap with the field of UK sociology
and influence it with their specific logics. Interdisciplinary research inter-
ests could then be interpreted as influences from other academic fields,
while applied research topics that demonstrate an interest in, say, health
technologies, could be interpreted as intrusions from respective industry
sectors that overlap with the academic sphere.
The third aspect of our dispositif framework (cf. Sect. 3.1.3) highlights
discursive circulation through academic and social fields. We argue that
this circulation is possible because academic discourses consist of signs
that can be interpreted differently in various contexts. Coming back to
our empirical example of full professors in sociology in the UK, this
insight allows us to follow the products of specific research interests
throughout different contexts, and to study how they are interpreted dif-
ferently in their respective contexts. For example, a research interest in
‘poverty’, as displayed in Fig. 3.3, might not only result in a publication
in an academic journal, but also in policy advice. While the journal arti-
cle would be interpreted in an academic context, and thus in terms of its
contribution to the state of research, the policy advice that results from
the research would be interpreted in a political context, and thus in terms
of its practical feasibility and party political implications. Eventually, one
would be able to track the discursive circulation even further and study
not only how poverty research turns into policy advice, but how policy
advice in turn circulates in the judicial field where it is interpreted in
terms of new welfare policy legislation.
7 Conclusion
We have highlighted some shortcomings of text-centred conceptualisa-
tions of context and pointed out the necessity for a more systematic inte-
gration of social contexts and a theory-based interpretation of discourses.
Taking the example of academic discourse, we have identified sociologi-

cal aspects that, from our point of view, are of importance for a more
comprehensive understanding of how discourses shape the social and are
shaped by them.
Correspondence Analysis was presented as a possible starting point
that can help to articulate sociological data with discourse analysis. In the
example presented, we produced a map of the field of academic discourse
in terms of institutions and research interests. In a follow-up study, we
could translate keywords and research interests into one language in order
to create a cross-country mapping of research interests related to institu-
tions. This would help to better understand the interconnectedness of
research fields across national borders. We hope to be able to interpret
such visuals more thoroughly once we have a deeper knowledge of the
material and the institutional context we are studying. In order to better
understand how dispositifs influence academic discourse, we would also
have to take into account other variables producing other maps of the
academic world. We would have to look at funding infrastructures, social
networks, the social backgrounds of researchers and their career trajecto-
ries. Such explorations would help, for instance, to identify centre and
peripheral institutions in terms of research interests. They would help us
gain better insights into the social dynamics of academic discourse. In the
long term, we could also study how these mappings change over time:
Are there certain institutions that display consistently a stable conglom-
eration of research interests? How and depending on which conditions
do research interests change over time? Embedded in a social theory
framework, the operationalisation of the dispositif concept will help to
capture non-linguistic aspects of discourse that can complete the linguis-
tic data and enrich the analysis of discourse.
References
Angermüller, Johannes. 2004. Institutionelle Kontexte geisteswissenschaftlicher
Theorieproduktion: Frankreich und USA im Vergleich. In
Wissenschaftskulturen, Experimentalkulturen, Gelehrtenkulturen, ed. Markus
Arnold and Gert Dressel, 69–85. Wien: Turia & Kant.
84 J. Hamann et al.
———. 2006. L’analyse des textes à la croisée de diverses traditions

méthodologiques: les approches qualitatives et quasi qualitatives. In
Méthodologie qualitative. Postures de recherche et variables de terrain, ed. Pierre
Paillé, 225–236. Paris: Armand Colin.
———. 2010. Widerspenstiger Sinn. Skizze eines diskursanalytischen
Forschungsprogramms nach dem Strukturalismus. In Diskursanalyse meets
Gouvernementalitätsforschung. Perspektiven auf das Verhältnis von Subjekt,
Sprache, Macht und Wissen, ed. Johannes Angermüller and Silke van Dyk,
71–100. Frankfurt/Main: Campus.
Angermuller, Johannes. 2013. How to become an academic philosopher.
Academic discourse as a multileveled positioning practice. Sociología Histórica
2: 263–289.
———. 2015. The moment of theory. The rise and decline of structuralism in france
and beyond. London: Continuum.
———. 2017. Academic careers and the valuation of academics. A discursive
perspective on status categories and academic salaries in France as compared
to the U.S., Germany and Great Britain. Higher Education 73: 963–980.
Angermuller, Johannes, and Gilles Philippe, eds. 2015. Analyse du discours et
dispositifs d’énonciation. Autour des travaux de Dominique Maingueneau.
Limoges: Lambert Lucas.
Bécue-Bertaut, Mónica. 2014. Distributional equivalence and linguistics. In
Visualization and verbalisation of data, ed. Jörg Blasius and Michael Greenacre,
149–163. London and New York: CRC.
Beetz, Johannes. 2016. Materiality and subject in Marxism, (post-)structuralism,
and material semiotics. London: Palgrave Macmillan.
Benzécri, Jean-Paul. 1980. Pratique de l’analyse des données. Paris: Dunod.
Bhatia, Vijay, John Flowerdew, and Rodney H. Jones. 2008. Advances in dis-
course studies. New York: Routledge.
Blanchard, Philippe, Felix Bühlmann, and Jacques-Antoine Gauthier. 2014.
Advances in sequence analysis. Theory, method, applications (Life Course
Research and Social Policies, 2). Heidelberg: Springer.
Blommaert, Jan. 2005. Discourse: A critical introduction. Cambridge: Cambridge
University Press.
Bourdieu, Pierre. 1983. The field of cultural production, or: The economic
world reversed. Poetics 1983 (12): 311–356.
———. 1988. Homo Academicus. Cambridge: Polity Press.
———. 1996. The state nobility. Elite schools in the field of power. Cambridge:
Polity Press.
———. 2004. Science of science and reflexivity. Cambridge: Polity Press.

———. 2010. Distinction: A social critique of the judgement of taste. London:
Routledge & Kegan Paul. Original edition, 1984.
Bührmann, Andrea D., and Werner Schneider. 2007. Mehr als nur diskursive
Praxis? – Konzeptionelle Grundlagen und methodische Aspekte der
Dispositivanalyse. Forum Qualitative Sozialforschung/Forum: Qualitative
Social Research 8 (2): Art. 28.
———. 2008. Vom Diskurs zum Dispositiv. Eine Einführung in die
Dispositivanalyse, Sozialtheorie: intro. Bielefeld: Transcript.
Ćulum, Bojana, Marko Turk, and Jasmina Ledić. 2015. Academics and com-
munity engagement: Comparative perspective from three European coun-
tries. In Academic work and careers in Europe: Trends, challenges, perspectives,
ed. Tatiana Fumasoli, Gaële Goastellec, and Barbara M. Kehm, 133–150.
Dordrecht: Springer.
Estabrooks, C.A., C. Winther, and L. Derksen. 2004. Mapping the field: A bib-
liometric analysis of the research utilization literature in nursing. Nursing
Research 53 (5): 293–303.
Falk-Krzesinski, Holly J., Noshir Contractor, Stephen M. Fiore, Kara L. Hall,
Cathleen Kane, Joann Keyton, Julie Thompson Klein, Bonnie Spring, Daniel
Stokols, and William Trochim. 2011. Mapping a research agenda for the sci-
ence of team science. Research Evaluation 20 (2): 145–158.
Foucault, Michel. 1972. The archeology of knowledge and the discourse of lan-
guage. London: Tavistock. Original edition, 1969.
———. 1977. The confession of the flesh. In Power/knowledge: Selected inter-
views & other writings, 1972–1977, by Michael Foucault, ed. Colin Gordon,
194–228. New York: Pantheon.
Fourcade, Marion. 2009. Economists and societies: Discipline and profession in the
United States, Britain and France, 1890s–1990s. Princeton, NJ and Oxford:
Princeton University Press.
Frank, David J., and Jay Gabler. 2006. Reconstructing the university. Worldwide
shifts in academia in the 20th century. Stanford, CA: Stanford University
Press.
Friedman, Sam, Daniel Laurison, and Andrew Miles. 2015. Breaking the ‘class’
ceiling? Social mobility into Britain’s elite occupations. The Sociological Review
63 (2): 259–289.
Grafton, Anthony T. 2011. In Clio’s American atelier. In Social knowledge in the
making, ed. Charles Camic, Neil Gross, and Michèle Lamont, 89–117.
Chicago and London: University of Chicago Press.
86 J. Hamann et al.
Hamann, Julian. 2014. Die Bildung der Geisteswissenschaften. Zur Genese einer
sozialen Konstruktion zwischen Diskurs und Feld. Konstanz: UVK.
———. 2016a. The visible hand of research performance assessment. Higher
Education 72 (6): 761–779.
———. 2016b. ‘Let us salute one of our kind’. How academic obituaries con-
secrate research biographies. Poetics 56: 1–14.
Husson, François, and Julie Josse. 2014. Multiple correspondence analysis. In
Kennelly, Ivy, Joya Misra, and Marina Karides. 1999. The historical context of
gender, race, & class in the academic labor market. Race, Gender & Class 6
(3): 125–155.
Kleining, Gerhard. 1994. Qualitativ-heuristische Sozialforschung. Schriften zur
Theorie und Praxis. Hamburg-Harvestehude: Fechner.
Knorr Cetina, Karin. 1981. The manufacture of knowledge. An essay on the con-
structivist and contextual nature of science. Oxford: Pergamon.
Lamont, Michèle. 1987. How to become a dominant French philosopher: The
case of Jacques Derrida. The American Journal of Sociology 93 (3): 584–622.
Lebart, Ludovic, and Gilbert Saporta. 2014. Historical elements of correspon-
dence analysis and multiple correspondence analysis. In Visualization and
verbalisation of data, ed. Jörg Blasius and Michael Greenacre, 31–44. London
and New York: CRC.
Maesse, Jens. 2010. Die vielen Stimmen des Bologna-Prozesses. Bielefeld:
Transcript.
———. 2015a. Eliteökonomen. Wissenschaft im Wandel der Gesellschaft.
Wiesbaden: VS.
———. 2015b. Economic experts. A discursive political economy of econom-
ics. Journal of Multicultural Discourses 10 (3): 279–305.
Maesse, Jens, and Julian Hamann. 2016. Die Universität als Dispositiv. Die
gesellschaftstheoretische Einbettung von Bildung und Wissenschaft aus dis-
kurstheoretischer Perspektive. Zeitschrift für Diskursforschung 4 (1): 29–50.
Maingueneau, Dominique. 1991. L’Analyse du discours. Introduction aux lectures
de l’archive. Paris: Hachette.
Morris, Norma, and Arie Rip. 2006. Scientists’ coping strategies in an evolving
research system: The case of life scientists in the UK. Science and Public Policy
33 (4): 253–263.
Münch, Richard. 2014. Academic capitalism. Universities in the global struggle for
excellence. New York: Routledge.
Paradeise, Catherine, Emanuela Reale, Ivar Bleiklie, and Ewan Ferlie, eds. 2009.
University governance. Western European Perspectives. Dordrecht: Springer.
Raffnsøe, Sverre, Marius Gudmand-Høyer, and Morten S. Thaning. 2016.
Foucault’s dispositive: The perspicacity of dispositive analytics in organiza-
tional research. Organization 23 (2): 272–298.
Salem, André. 1982. Analyse factorielle et lexicométrie. Mots – Les langages du
politiques 4 (1): 147–168.
Sarangi, Srikant, and Malcolm Coulthard, eds. 2000. Discourse and social life.
Harlow: Longman.
Scholz, Ronny. 2010. Die diskursive Legitimation der Europäischen Union.
Eine lexikometrische Analyse zur Verwendung des sprachlichen Zeichens
Europa/Europe in deutschen, französischen und britischen Wahlprogrammen
zu den Europawahlen zwischen 1979 und 2004. Dr. phil., Magdeburg und
Paris-Est, Institut für Soziologie; École Doctorale “Cultures et Sociétés”.
Spieß, Constanze, Łukasz Kumięga, and Philipp Dreesen, eds. 2012.
Mediendiskursanalyse. Diskurse – Dispositive – Medien – Macht. Wiesbaden:
VS.
Stiglitz, Joseph E. 1984. Price rigidities and market structure. The American
Economic Review 74 (2): 350–355.
Tognini-Bonelli, Elena. 2001. Corpus linguistics at work. Amsterdam: Benjamins.
Whitley, Richard. 1984. The intellectual and social organization of the sciences.
Oxford: Clarendon Press.
Xu, Yaoyang, and Wiebke J. Boeing. 2013. Mapping biofuel field: A bibliomet-
ric evaluation of research output (review). Renewable and Sustainable Energy
Reviews 28 (Dec.): 82–91.
Zippel, Kathrin. 2017. Women in Global Science: Advancing Careers Through
International Collaboration. Stanford: Stanford University Press.
4
On the Social Uses of Scientometrics:
The Quantification of Academic
Evaluation and the Rise of Numerocracy
in Higher Education
Johannes Angermuller and Thed van Leeuwen
1 Introduction
Corpus approaches have a long tradition. They have recourse to computer-
aided tools which reveal patterns, structures, and changes of language use
that would go unnoticed if one had to go through large text collections
‘manually’. If such research is known for rigorous, replicable, and ‘ratio-
nal’ ways of producing scientific claims, one cannot understand its suc-
cess without accounting for the role of non-academic actors.
Scientometrics, also known as ‘bibliometrics’, is a type of corpus
research which measures the scientific output of academic researchers and
represents citation patterns in scientific communities. Scientometrics is a
J. Angermuller (*)
e-mail: J.Angermuller@warwick.ac.uk
T. van Leeuwen
Centre for Science and Technology Studies (CWTS), Leiden University,
Leiden, The Netherlands
e-mail: leeuwen@cwts.leidenuniv.nl

90 J. Angermuller and T. van Leeuwen
social sciences field studying human communication, in particular pro-

cesses through which knowledge is codified through texts and published
in various forms (books, chapters in books, journal articles, conference
papers, etc.). We choose to use the terms ‘scientometrics’ and ‘bibliomet-
rics’ in a more or less synonymous way although we know in the field
some scholars wish to distinguish between the terms (Hood and Wilson
2001). Whereas sciento or biblio relates to the elements that are studied in
the field, metrics relates to the fact that quantification plays an important
role in the field. Scientometrics typically aims to represent developments
in science and technology through indicators.
In this contribution, we will reflect on scientometrics as an example of
the social effects that the quantifying instruments of corpus research have
today. Scientometrics responds to specific social demands and occupies a
niche which ‘mainstream’ corpus analysis is not always aware of. While
corpus analysis is known for its rigorous methods, scientometrics reminds
us that the tools of corpus research have become important in some
applied fields with a specific empirical expertise. We consider it as part
and parcel of ‘numerocratic’ practices that constitute the world of aca-
demic research.
A field of research since the 1970s, scientometrics has been propelled
by non-scientific demands, such as evaluation and policy-making in
higher education. Governments and other institutional bodies often rely
on scientometric knowledge in institutional decision-making. Therefore,
scientometrics is a textbook example of a field of research which responds
to power relationships within the academic field as well as outside. An
example of such power relationships at work could be found in the uni-
versity rankings that have gained currency over the last decade. Such
rankings often mobilize scientometric knowledge in order to compare
individuals and groups, organizations, and systems within and across
countries. Within university systems, rankings have an impact on higher
education policies which try to ‘learn the lesson’ in order to achieve better
ratings. As rankings are presented as objective facts, especially in media
discourse, they increasingly organize perceptions of universities in the
global academic market place.
While academic governance and institutional decision-making increas-
ingly relies on scientometrics, we do not support the idea that
On the Social Uses of Scientometrics: The Quantification… 91
s cientometrics promotes the agenda of neoliberalism or managerialism as

such. Rather, our question is how to account for scientometrics as a social
practice and to reflect on its non-scientific conditions, purposes, and
effects. Scientometrics impacts on society and society impacts on sciento-
metrics. If scientometrics wants to strengthen its scientific ambitions, it
cannot but gain from reflecting on the ways its claims are made and
established socially.
The objective of this contribution is both theoretical and historical.
Inspired by Michel Foucault’s governmentality theory, we will discuss
scientometrics against the background of a ‘numerocratic’ dispositif
emerging since the eighteenth century. And by unpacking the social prac-
tices constituting the scientometric field, we will come up with a critical
account of its history.
This paper consists of four parts. First, we will start with a theoretical
discussion of the ‘numerocratic’ dispositif of power-knowledge in which
scientometrics needs to be placed. Second, we will give a historical
account of its emergence and ask how it relates to the rise of numerocratic
practices of governing academics. In the third part, we will have a closer
look into the indicators and rankings scientometrics produces and dis-
cuss their social effects. In the conclusion, we will develop critical per-
spectives on the uses and abuses of scientometrics in the numerocratic
dispositif.
2 cience, the Social Order, and the Rise

S
of the Numerocratic Governmentality
Social scientists study social life. Their objective is to understand what
drives and constrains people in doing, thinking, and saying certain things.
And some social scientists study the social life of social scientists them-
selves. Thus, in the area of Science and Technology Studies, a number of
research strands have emerged which deal with the practices and struc-
tures, the ideas and institutions that characterize scientific communities
and higher education institutions. Post-war sociologists of science looked
into the historical circumstances and social conditions in which scientific
knowledge emerged (Merton 1962). While they understood science as an
institution of liberal and democratic societies (even though the Soviet

Union, too, produced scientific achievements), a certain emphasis was
put on the cultural norms and shared values that the science system needs
to produce knowledge that is considered as legitimate knowledge (such as
a professional academic ethos, the social recognition of scientists and
their work, a certain autonomy of higher education institutions, etc.). If
their question was to account for the rise of science as a social system, the
1970s marked a turning point when more critical and constructivist epis-
temologies entered the scene and challenged the nature of scientific
knowledge more radically. Against this background, members of the
Edinburgh School made the case for ‘symmetric’ approaches (Bloor
1976). Critically interrogating the implied hierarchy (‘asymmetry’) of
older, institutionalist approaches, they called for explaining ‘true’ knowl-
edge socially just as one would explain ‘false’ knowledge (rather than ask-
ing how ‘true’ scientific knowledge was discovered). The plea for symmetry
has inspired a great deal of qualitative researchers and ethnographers
(e.g., in Laboratory Studies). These studies focus on the situated practices
and contingent processes in which knowledge claims are produced and
turn into ‘objective facts’ (Knorr Cetina 1981; Latour and Woolgar
1979).
It is against this background that we want to bring in Michel Foucault’s
perspective on science as power-knowledge (Foucault 1980). Defending
a consistently ‘symmetric’ understanding of scientific knowledge,
Foucault puts special emphasis on the intrinsic intertwinements of
knowledge and power. Power is needed to make knowledge ‘true’. And
knowledge is needed to exercise power in society. The question then is
how people are placed in a hierarchical social order through scientific
knowledge and how large populations are put under control.
Foucault deals with science from several angles, which can approxi-
mately be related to the problem of archaeology, genealogy, and govern-
mentality. First, some of his works deal with historically changing
configurations of proto-scientific and scientific ideas, most systematically
in The Order of Things (2002), where he deals with the emergence of eco-
nomic, biological, and linguistic ideas in the early modern and modern
period (sixteenth to nineteenth centuries). If the label of ‘archaeology’ has
sometimes been attached to this early work, a focus on ideas can also be
observed in later publications such as the Birth of Biopolitics (2008),

which traces the development of ordo- and neo-liberalism in economic
theory. Second, in some of his other books, the most well-known being
Discipline and Punish (1995), Foucault looks into practices of exercising
power and disciplining individuals in various historical moments. By
investigating institutional practices of the early modern nation-state, he
asks what is the knowledge that legitimates them. These works have
sometimes been situated in his ‘genealogical phase’ even though one
clearly sees traces of these interests in earlier works such as Birth of the
Clinic (1973). And third, in his posthumously published lectures on gov-
ernmentality, published under the title Security, Territory, Population
(2007), Foucault’s interests shift towards the terrain of political sociology
which traces the changes in governmental practices and political ideas
during the eighteenth century. Crucially inspiring Governmentality
Studies, this work investigates how direct modes of exercising power (i.e.,
order and command) were replaced by more abstract power technologies
operating from a distance through a system of rules, regulations, and
incentives. From a governmentality point of view, large populations are
placed into hierarchical social orders through governmental and non-
governmental practices. Rather than expressing a sovereign will, these
practices are dependent on dominant ideas of ‘good governance’ which
are circulated through discourse. Informed by the legal, administrative,
and social scientific knowledge circulating at the time and embedded in
institutional power structures, governmental and non-governmental
practices constitute the dispositif of governmentality which aims at coor-
dinating the members of large populations and giving them their place in
the social structure. For Foucault, liberalism and neoliberalism are types
of ‘governmentality’ operating by means of ‘freedom’ where subjects are
governed from a distance rather than by order and command.
Against the background of Foucault’s governmentality theory, one can
point out the social forces and constraints working on a societal level:
knowledge is not only constructed under conditions of social inequality
but also used to legitimate and sometimes even to construct and reinforce
inequalities. The social and the relationships that make up the social,
therefore, are not only represented but are constituted through discursive
practices.
Foucault mainly deals with the seventeenth and eighteenth centuries,

when sovereignism in Western Europe showed signs of crisis. This was a
time when the seeds for a ‘liberal’ governmentality were sown, which
conceived power as a technology of coordinating populations from a dis-
tance. In the transition from sovereignism to liberalism, governmental
practices as the legitimate exercise of power changed their form. They
were no longer directed from one subject to another. They discovered the
‘population’ as a new object of governmental practice. As a consequence,
there was an increasing demand for new governmental knowledge about
how to run the state and the economy. Specialists, experts, bureaucrats,
administrators, and policy-makers, who were no social scientists, yet were
needed to devise the rules, standards, and procedures that would allocate
goods through ‘free’ markets and coordinate decision-making processes
through the nation-state. They were the officials, regulators, and techno-
crats who made sure that economic and political activities could develop
within certain limits. For Foucault, therefore, both the modern nation-
state and the free market economy testify to a new regime of governing
the social. Informed by a certain type of social knowledge, these practices
aim to represent the social space and, through representation, also con-
tribute to constituting it. Embedded in a dispositif comprising both gov-
ernmental and non-governmental practices, they are not only based in
the government or a ministry but they also mobilize a host of private and
corporate agents which have the social as their object (e.g., mass media,
insurances, demographers).
While Foucault stops short of an analysis of the more contemporary
moment, some of his observations of proto-liberal governance in the
eighteenth century can be extrapolated to the way the liberal governmen-
tality worked in the nineteenth century and to the way neoliberalism
works today. It is common to the regimes of governmentality since the
eighteenth century that they applied a ‘numerocratic’ logic to the social.
These regimes can be qualified as ‘numerocratic’ since they aim at govern-
ing the social through numbers and other standardizing codes. The effects
of ‘governing by numbers’ are typically seen in social arenas where goods
and resources are distributed by means of ‘free’ price-setting mechanisms.
Thus, markets have come to be seen as arenas in which values are con-
structed through signalling and networking practices. Yet numbers are
also used in the administrative domain of the nation-state where decision-

making practices increasingly follow the ‘democratic method’: everybody
counts and is counted in votes and elections.
Numerocracy has given birth to an arrangement of practices which
aim at ‘governing’ large social populations by applying numbers and
other standardizing codes to the heterogeneity of the social. Numerocratic
governmentalities have subjected various social arenas to practices of cal-
culation, quantification, and standardization. In these practices, the
social is not only screened, measured, and overseen but it also becomes a
reality that can no longer be ignored. One can think of the effects that the
introduction of unemployment statistics has had on modern nation-
states: large social groups are now defined as having work or not and the
‘well-being’ of the country is expressed through statistics. Many economic
policies are based on such quantifying knowledge and welfare pro-
grammes have been created to reallocate resources among groups.
Applying numbers, indicators, grids, and scales to the social is not a neu-
tral operation but a practice that constitute order. Through these numeric
and standardizing codes, the various elements that constitute the hetero-
geneous terrain of the social can be made commensurable within orders
of value (a bus ticket is the equivalent of three oranges; writing a novel is
the equivalent of two Master’s degrees). What is more, hierarchies are
constructed between the elements of the social (writing a novel is worth
more than a Master’s degree which is worth more than a bus ticket which
is worth more than an orange). As a result of numbers representing social
relationships, the heterogeneous relationships of the social are turned
into the hierarchical structures of society. And this is why the usage of
numeric, standardizing and classifying codes to compare things and peo-
ple does not just render a given reality. They do not represent the social
without intervening in it. And when they are applied to the multifarious
ties and numerous relationships of the social, they can bring forth the
structures that characterize society.
Foucault emphasizes the discontinuity of the historical process. History
is seen as a flow of many lines, with old ones terminating and new ones
beginning. Yet number-based practices of governing the social have
characterized all governmentalities since the eighteenth century

(Angermuller 2013a; Angermuller and Maeße 2015). The social has since
been subject to numerocratic practices and increasingly so. Numerocracy

is a dispositif of power-knowledge which comprises numerocratic prac-
tices that are located inside the state apparatus and outside it. Numerocratic
practices typically mobilize a background of power relationships, that is,
institutions, bureaucracies, rules, and laws (monthly unemployment sta-
tistics need a bureau, department, or ministry which provides resources
and gives credence to them). Yet, as they represent power, these practices
can constitute power relationships (e.g., between those who are catego-
rized as ‘real’ job-seekers and may legitimately claim social benefits and
those who do not count as such). But while numerocratic practices pro-
duce and reproduce structures of social inequality, they also produce and
reproduce the knowledge that is needed to structure the social (cf.
Bourdieu 1992). Unemployment statistics build on administrative and
proto-scientific expertise (e.g., on who counts as ‘working poor’, what is
a regular work contract, what are relevant parameters measured by statis-
ticians…). Yet, they also create, update, or change the existing knowledge
about the state of the economy in a country (some people may no longer
be counted, new professions may be recognized, etc.). Such knowledge is
often both controversial (which is why the numbers from national and
international agencies often differ and numbers are often challenged
from various stakeholders) and complex (it may combine sophisticated
academic theories of socio-economic development with simple adminis-
trative reporting and accounting techniques). Numerocratic practices can
operate with numbers (e.g., one can measure the level of income of the
unemployed), but also with more general codes (e.g., taxonomies, grids,
and scales). Different categories of unemployed people may be projected
onto social populations, which may organize the perception of the ‘needi-
ness’ of individuals or groups (as a result, a ‘single mom’ may claim more
benefits than an ‘asylum-seeker’). Numerocratic power is at work when
members of a population are classified according to hierarchical scales,
which can be numeric (e.g., salary scales), ordinal (e.g., professional sta-
tus) or binary (e.g., nationals and immigrants), mixes of these or all at the
same time. Numerocratic practices are discursive in that they have
recourse to numeric signifying codes but not all discursive practices are
numerocratic. While discursive practices mobilize all kind of codes,
including natural languages, numerocratic practices are generally not
restricted to ‘natural language’ and also includes standardizing codes such

as numbers, benchmarks, indicators, lists, grids, scales, which represent
the social and give it a structure. Numerocracy thus designates a dispositif
of numerocratic, discursive, and non-discursive practices that classify
members of social populations and assign a value to them.
Proto-liberal, liberal, and neoliberal governmentalities crucially rely on
numerocratic arrangements to coordinate large populations. In proto-
liberal governmentality (eighteenth century), numerocratic mechanisms
(such as the ‘market’) are used to coordinate economic activities. In lib-
eral governmentalities (nineteenth century and early twentieth century),
the transition to the market in the economic sphere is complete and
numerocratic principles are increasingly extended to the realm of politi-
cal decision-making and administrative procedures (see the rise of the
nation-state and the advent of parliamentary democracy). In neoliberal
governmentality (since the late 1970s), the economic and political spheres
having been subsumed, numerocracy now discovers the sphere of culture,
media, and education. Thus, while we follow Michel Foucault’s explora-
tions of post-sovereignist regime of governance which started to take
shape in the eighteenth century, we point out the increasingly important
role of ‘big numbers’ in many domains which are not economic or politi-
cal (Desrosières 1998; Ogien 2010; Porter 1994). If both capitalist mar-
kets and democratic nation-states crucially rely on ‘big numbers’ for the
distribution of economic goods and for political decision-making, such
numbers become ever more important in the sphere of education and
higher education (Angermuller 2017) by the last third of the twentieth
century. The rise of scientometrics goes hand in hand with the expansion
of numerocracy as a social technique. From this point of view, therefore,
corpus research not only contributes to the production of knowledge but
is also deeply involved in practices of exercising power across large popu-
lations. It is part and parcel of what can be called a numerocratic regime
of power-knowledge which has subjected more and more areas of the
social to the logics of governing by numbers (Miller 2001).
In the neoliberal governmentality since the late 1970s, the relation
between knowledge and power has been redefined. If older
governmentalities (until nineteenth-century liberalism) had mobilized
knowledge as an ‘external’ resource as it were (i.e., produced by experts
who were exempt from the constraints of the system they helped put in
place), the production of knowledge is now fully integrated into the gov-
ernmentality. The specialized knowledge and the administrative expertise
of the agents of governmentality now become the object of numerocratic
innovation. The large and growing social arena of educationalists and
researchers is now subsumed to numerocracy. Neoliberalism, in other
words, heralds the governmentalization of education and higher
education.
It will now be possible to discuss scientometrics as a field which
emerges in a situation of changing historical circumstances. Scientometrics
testifies to the growing importance of numerocratic practices for the con-
stitution of social order in Western societies since the eighteenth century
in general and the advent of these practices in the higher education sector
since the post-war era more particularly. Yet while one can observe a
growing demand for scientometric knowledge, scientometrics has always
had to grapple with a tension between applied and more academic
research orientations. Is scientometrics subordinate to normative political
goals or does it engage in fundamental social research in order to reveal
how research and researchers work? Also, specialized researchers in scien-
tometrics cannot but acknowledge the explosive growth of scientometric
data produced by corporate actors such as Thomson Reuters, Google, or
Elsevier. Therefore, it remains an open question how the increasing
amount of scientometric data which is now circulating in the academic
world impacts on knowledge production and decision-making in aca-
demia. To what degree is scientometric knowledge linked with practices
of governing academics? To obtain responses to these questions, we will
have a closer look at the directions that scientometrics has taken as a field.
3 he Emergence of Scientometrics
T
as a Field
Scientometrics (or bibliometrics) as a social science is a relatively young
field. Its origins reach back the second part of the twentieth century
(Hood and Wilson 2001). It comprises quantifying methods from social
research and uses numbers to account for structures and changes of many
texts and changes over time. While contemporary scientometrics has

recourse to computers and often uses digital texts, corpus research goes
back to medieval times. It started with exegetical and philological prac-
tices of studying sacred and literary texts, for example, through word lists
and concordances. The development of scientometrics after the world
war had much to do with managing academic libraries on tight budgets
and therefore keeping statistics on what articles and books were requested
and how long they were kept. Those figures justified paying for other
subscriptions or buying new books. At the time, the term ‘bibliometrics’
was used for that practice.1 If scientometrics can be counted as an estab-
lished methodology in the social sciences today, it has never been a neu-
tral scientific method which generates ‘pure’ knowledge. The practice of
scientometrics has been highly responsive to technological achievements
(such as data-processing tools) and dependent on the legal framework of
knowledge production (notably copyright laws). And by no means is it a
purely academic practice. In today’s knowledge-based economy, the text-
processing practices of scientometrics have become crucial if one thinks
of corporations such as Google and Facebook or governments monitor-
ing digital communication.
Initially, scientometrics was embedded in a much broader scientific
discipline, the field of Science and Technology Studies (STS), where both
qualitative and quantitative methods were used against a broad variety of
academic backgrounds, ranging from sociology of science, philosophy of
science, and history of science. As a relatively new branch of this field, the
quantitative study of science started in the 1960s and involved a number
of disciplines, such as library and information science, with some of the
early pioneers coming from the natural sciences and engineering, such as
computing, mathematics, statistics, and even physics. As a relatively new
academic discipline, there was little space for housing scientometrics in
existing schools and faculties. Scientometrics has found a more hospita-
ble environment in applied research institutes at the intersection of aca-
demic and policy-oriented research. Many scientometricians have a
first-hand experience of the marketization and commodification of aca-
demic research, which they see in more or less critical terms. And to the
We thank an anonymous reviewer for that idea.

1
degree that the field is part and parcel of numerocratic practices, one can
understand that there is a tendency in scientometrics that posits the
investment into numerocratic governance as a norm to other researchers
(Burrows 2012; Radder 2010). In the late 1980s, scientometrics became
autonomous from the larger STS field and developed their own profes-
sional organizations (such as ISSI, and somewhat later, ENID) with sepa-
rate scientific conferences (the ISSI cycle, next to the S&T cycle) and
dedicated journals (such as Scientometrics, JASIST, Research Policy to
name a few).
The academic field of scientometrics has broken off from the more
qualitative and theoretical strands in STS. It has always put strong
emphasis on empirical, strongly data-driven empirical research and
focused on electronic data of various types. Just like statistics in the nine-
teenth century, scientometrics testifies to the numerocratization of the
social. To some degree, statistics is subservient to civil servants, techno-
crats and administrators who carry out censuses, create standards in cer-
tain areas and devise regulative frameworks of action. Scientometrics,
too, is an academic field which is tied to the rise of ‘numerocratic’ tech-
niques of exercising power, which aim to govern large populations
through numbers, standards, benchmarks, indices, and scales.
All of this did not happen in a sociopolitical vacuum. After the eco-
nomic crises in the 1970s and 1980s (Mandel 1978), ending a long phase
of prosperity, the political climate in Europe and the USA changed and
neoliberal approaches prevailed. With Reagan and Thatcher in charge in
the USA and the UK, economic policies implemented austerity pro-
grammes, which meant budget cuts in various sectors of society, including
higher education. The political ideology of neoliberalism was practically
supported by New Public Management (NPM). NPM proposes to orga-
nize the public sector according to management techniques from the cor-
porate sector (Dunleavy and Hood 1994). However, science has been
mainly evaluated through peer review and for a long time quantitative
information on higher education was limited. In the USA until the 1980s,
for example, the reports by the National Science Foundation were the
only source of quantitative data on the higher education system as a whole.
Yet, in a period of austerity, more justification was needed for spending
taxpayer’s money and policy-makers were less prone to distributing
resources on the basis of the internal criteria of the sector. Peer review–
based evaluation gives little control to policy-makers and governing agen-
cies over how money is spent. Therefore, one can observe a growing
demand for simpler mechanisms of evaluation and decision-making.
Examples of these type of simplistic indicators used for evaluation and
decision-making will be introduced in Sect. 4.
Scientometric measures first appeared in science policy documents in
the 1970s, when the US National Science Foundation integrated research
metrics in its annual national science monitor, which gives an account of
the US science system. Scientometric results were then used to describe
research activity on a macro level in the USA and other countries. It took
another 20 years before large-scale scientometric reports of national sci-
ence systems were produced in Europe. It was in the Netherlands in the
early 1990s that the first national science and technology monitor
appeared (NOWT, reports covering the period 1994–2014). The last
series of these reports were produced in 2014 (WTI 2012, 2014). In this
series of reports from the Netherlands, the national Dutch science sys-
tem was compared internationally with other EU countries. Indicators
have been devised with the aim to represent technological performance,
based on the number of patents or revenue streams. These reports also
contained information on the sector as a whole (e.g., the relationship
between the public and private sector, other public institutions, hospi-
tals, etc.) and on the institutional level (e.g., comparisons between uni-
versities). These analyses typically broke down the research landscape
into disciplinary fields and domains with various levels (countries, sec-
tors, institutions).
In France, the government also initiated a series of national science
monitor reports, produced by the Observatoire des Sciences et Technologies
(OST, reports covering the period 1992–current). In France, the reports
contained an internationally comparative part, as well as a national
regional part. These reports are produced by a national institution
financed by the government while in the Netherlands the reports were
produced by a virtual institution with various participating organizations
from inside and outside the academic sector. Various countries organize
now such indicator reports, combining metrics on the science system
with indicators on the economic system, innovation statistics, and so on.
In the USA, such monitoring work has often been done by the disciplin-
ary associations, whereas in Germany research institutes in higher educa-
tion studies have played an important role in producing numerocratic
knowledge.
The first country in Europe to start developing a national system of
research evaluation was the UK, which launched its first research assess-
ment exercise in 1986. From this first initiative in the mid-1980s, the UK
has seen periodic assessments of its research system (Moed 2008), with
varying criteria playing a role in the assessment. This has been accompa-
nied with a continuous discussion on the role of peer review in the assess-
ment. In particular, it was held against approaches based on metrics solely
(see, for example, the 2007 CWTS report and The Metric Tide report of
2015). The UK research assessment procedure evaluates the whole
national system at one single moment by cutting the scientific landscape
into units of assessment. An important element in the UK research assess-
ment procedure is that it links the outcomes of assessment to research
funding. Overall, one can conclude that the UK research assessment exer-
cises tend to be a heavy burden for the total national science system. By
organizing this as one national exercise, every university is obliged to
deliver information on all the research at one single moment, across a
large number of research fields. Many senior UK scholars get involved in
the assessment of peers. Other countries in Europe also initiated national
assessment systems. Finland, for example, was one of the countries initi-
ating such a system shortly after the UK. Outside Europe, Australia has
implemented a system which follows the UK model in many respects.
The Netherlands initiated their research assessment procedure in the
early 1990s, which is still in place, albeit with a changed design. In the
Netherlands, the periodical assessment of research was institutionalized
from the early 1990s onwards. Until 2003, assessment was organized
under the supervision of the VSNU, the association of universities in
the Netherlands (Vereniging van Samenwerkende Nederlandse
Universiteiten). Their so-called chambers, consisting of representatives
from the disciplines, decided how to design the research assessment pro-
cess, which includes the question whether research metrics were appro-
priate for the research evaluation in the respective fields. In a wide
number of fields, data from advanced scientometric analysis have been
applied to complement peer review in research assessment procedures
(e.g., biology, chemistry, physics, and psychology). After 2003, the ini-
tiative to organize research assessment was put in the hands of the uni-
versity boards, which meant that it was no longer a national matter. The
Royal Academy of Arts and Sciences in the Netherlands has carried out
studies that have influenced the recent revision of the standard evalua-
tion protocol. The focus is now no longer only on research output and
its impact, but it also considers, like in the UK, the societal relevance of
academic work (‘impact’). It remains to be seen to what extent the sys-
tem can still rely on peer review and what will be the role of scientomet-
ric measures in the future.
While some other countries are building up national frameworks of
assessing research quality, national evaluation schemes are still an excep-
tion (Table 4.1). One can see few tendencies in federal states like in the
USA, where evaluation tends to be done on an institutional level, or
Germany, where there are no plans for another national evaluation
scheme after a ‘pilot study’ of the Wissenschaftsrat evaluated the research
excellence of two disciplines (chemistry and sociology) in 2007. The
French AERES carries out a ‘light-touch’ evaluation compared with the
UK. AERES regularly visits French research groups (laboratoires) and
only examines publication records and not the publications themselves.
Such national evaluation schemes can be highly consequential. The
measurement of research performance can be done in a very explicit man-
ner, as is the case in the UK, where the outcomes of the most recent
research assessment exercise has direct consequences for money flows
between and within universities, or Italy, where a similar model as applied
in the UK, has been adopted over the last years. Here research perfor-
mance is linked up to a nationwide research assessment practice, with
intervals of five to eight years between the exercises. In other countries,
research assessment has crept in more silently as some research funding is
distributed on an annual basis, based upon last year’s output numbers, in
which productivity in internationally renowned sources (journals and
book publishing houses) provide higher rewards as compared to publish-
ing in locally oriented journals and/or book publishing houses (Flanders,
Norway, and Denmark). In France, negative assessments can lead to
research clusters being closed down, whereas the German evaluation of
2007 has shown no immediate effects and little lasting influence. The
Table 4.1 National evaluation schemes in some Western higher education systems
104
USA no
Netherlands VSNU Protocol, systematic,
Germany currently the SEP (Standard national
Wissenschaftsrat Evaluation Protocol) assessment
UK REF 2013 France AERES 2007 protocol implemented
Unit of Researchers, Research groups Subdepartmental Initially departments and NA
evaluation departments, and (laboratoires) units groups, now only
institutions institutes/departments
Evaluation of Peer review No peer review Peer review Peer review NA
publications Scientometrics is Scientometrics is Advanced scientometrics
through not used officially not used only when it fits the field
Allocation of Yes No direct funding No direct effects No direct linking of NA
funding? effects but Scientometrics is assessment of research
future of groups not used and research funding
can be officially
questioned
J. Angermuller and T. van Leeuwen
Effects on Significant impact Academics may be No effects known Only indirect and implicit NA
academics on academic counted as effects on internal
recruitment in ‘non-publishing’ university policies
many fields but their job
security is not at
stake
Production of Performance Ad hoc statistics Some statistics On the national level NA
statistics statistics of are produced on are produced monitoring of the whole
and institutions and institutional system, on the
indicators the whole sector level institutional level periodic
are produced assessment of research
within institutions
more ambitious schemes collect publications of the researchers and have

them read by assessment panels (e.g., UK and Germany).
In all such evaluations, one can observe heated debates over the use of
quantitative and computer-aided tools in the evaluation of the evaluation
files and publications. Yet while more and more aspects of research activi-
ties are quantified (such as research revenue, publications, theses) and
research activities are assessed by means of factors, indices, benchmarks
and national evaluation schemes have so far stopped short of using the
quantifying tools of scientometrics to rate individual research publica-
tions. In fact, while it is difficult to assess the precise impact of sciento-
metric methods on academic decision-making, one can observe persistent
conflicts between evaluators and types of evaluation (such as ‘manual’
peer review and ‘automatic’ quality assessment).
What are the effects of the growing expertise that scientometrics can
offer? While academic governance has become more professional, the gap
between managers and decision-makers on the one hand and the aca-
demic disciplines on the other may be growing. Such a gap may be con-
ducive to numerocratic practices in academic governance. Indeed, large
disciplines organized in large institutional units (e.g., business and medi-
cine) give more credence to quantitative indicators such as journal impact
factors. And especially in non-Western countries, where resources are less
abundant and managers sometimes have less training in the disciplines,
quantitative signals of academic quality (e.g., publications in ‘interna-
tionally recognized’ journals) often play a more important role than in
institutions where decision-makers are recognized as experts in their
disciplinary fields and have a more intimate understanding of what is
perceived to be good research.
Scientometrics centres around the codified scientific knowledge pro-
duction as can be measured through publications in various types of com-
munication, especially in journal articles (for a number of historical
reasons, see Wouters 1999). The Science Citation Index (SCI) forms the
main basis for the scientometric method applied in basic research and in
applications supporting research assessment. The SCI has been developed
by Garfield Associates, which was to become part of a larger company,
started by Eugene Garfield in 1958—the Institute for Scientific
Information (ISI), located in Philadelphia. Garfield was an entrepreneur
who obtained a PhD in structural linguistic in 1961 from Pennsylvania.

The SCI database started to track the communication and referencing in
natural science literature. The SCI was launched in 1963, followed by the
Social Sciences Citation Index (SSCI) in 1973 and the Arts and Humanities
Citation Index (AHCI) in 1978 (Klein 2004; Cawkella and Garfield
2001). Over time, an increasing number of journals have been covered by
the ISI indices, which now count references in well over 11,000 journals
from dozens of disciplines over the last decade. New developments now
implicate Web-based counting practices, such as the number of occur-
rences of an author’s name and the number of ‘hits’ or ‘reads’ of texts
posted on the Web. Both Web of Science and Scopus now make extensive
use of this form of alternative research metrics, also called altmetrics. As
always, one has to be careful with these numbers, as the meaning of these
new research metrics and their statistical robustness are not yet clear.
With ISI, Garfield became the “undisputed patriarch of citation index-
ing” (Cronin and Atkins 2000, 1). Garfield’s work has been crucially
influential in the creation of scientometrics as an academic field. Yet he
was also successful commercially. In 1992, ISI and its citation indices
were sold to the Thomson Reuters Corporation for $210 million. In
October 2016, Thomson Reuters completed the sale of its intellectual
property and science division and transferred citation indexing to a new
company called Clarivate Analytics.
Given the role of Garfield’s commercial activities in the scientometric
field, one wonders how academic and economic dimensions relate to
each other. Is it a field that has emerged because specialized academic
expertise in STS has been applied and commercialized outside the aca-
demic world? Or is citation indexing an activity of huge economic impact
which has given birth, accidentally as it were, to an academic field?
4 cientometric Indices and Rankings

S
as Social Practices
In the slipstream of Garfield’s commercial activities, impact indicators
have been developed and taken up in the academic sector. An indicator
that Garfield developed in the 1950s and 1960s is the Journal Impact
Factor (JIF), with first publications on the indicator in 1955 and 1963
(Garfield 1955; Garfield and Sher 1963). Journal citation statistics were
included in the Journal Citation Reports (JCR), the annual summarizing
volumes to the printed editions of the SCI and the SSCI. In the growing
higher education sector, in which more and more journals appeared on
the scene, the JIF became a tool used by librarians for managing their
collections. When the JCR started to appear on electronic media, first on
CD-ROM, and later through the Internet, JIF was more frequently used
for other purposes, such as assessments of researchers and units, which
was always sharply criticized by Garfield himself (Garfield 1972, 2006).
The JIF has been included in the JCR from 1975 onwards, initially only
for the SCI, later also for the SSCI. For the AHCI, no JIFs were pro-
duced. From the definition of the JIF, it becomes apparent that JIF is a
relatively simple measure, is easily available through the JCR, and relates
to scientific journals, which are the main channel for scientific commu-
nication in the natural sciences, biomedicine and parts of the social sci-
ences (e.g., in psychology, economics, business and management) and
humanities (e.g., in linguistics).
While the ISI indices cover a number of features in journal articles,
they focus mostly on the references cited by the authors. These references
are taken to express proximity and influence between citing and cited
people. On the receiving end, the question is to whom they relate. Here
references are considered as citations. Citation theory argues that value is
added when references become socially recognized citations, objectifying
as it were the researchers’ social capital (see Wouters 1999; Angermuller
2013b; Bourdieu 1992). The value that members of scientific communi-
ties add to works can be objectified through normalized measures. Thus,
by processing and counting references the scientometrician does not only
reflect a given distribution of academic value but she or he also adds value
to references. Other types of information used in the analysis relates to
the authors and co-authors, their institutions and cooperations, the jour-
nals in which their papers come out, their publishing houses, information
on the moment of publishing, information on the language of communi-
cation in the publication, meta-information on the contents, such as key-
words, but also words in titles and abstracts, as well as information on the
classification of the publications in various disciplinary areas.
The fact that these indices mostly focus on journal publications has
been widely criticized. It has been argued, for instance, that certain disci-
plinary areas have been put in a disadvantageous position (e.g., history,
where monographs are more important, or new disciplinary and transdis-
ciplinary fields which don’t have established journals yet). Moreover, it
needs to be recalled that the three indexes tend to over-represent research
in English as the main language of international scientific communica-
tion (van Leeuwen 2013; van Leeuwen et al. 2001). Areas in which
English-language journals are not standard outlets for research tend to
become peripheral in citation indexes. As a result, Western (i.e., US and
Western European) journals in the natural, life, and biomedical sciences
had long been given a certain prominence, which has been called into
question only after the rise of the BRIC countries (Brazil, Russia, India
and in particular of China). Other geographical regions of the world are
now better represented in the ISI, although the bias towards English has
remained intact as many of these countries follow models of the English-
speaking world.
The rise of these indices took place when some scientometric indica-
tors such as the JIF and the h-index started to be used in evaluation
practices throughout the science system. The JIF was originally designed
by Garfield for librarians to manage their journals in their overall library
collection and for individual researchers in the natural sciences to help
them decide on the best publication strategies (Garfield 1955, 1972,
2006; Garfield and Sher 1963). The JIF has been used in a variety of
contexts, for example, by managers who evaluate whole universities
(often with a more formal registration of research outputs of the scholarly
community) and by individual scholars to ‘enrich’ their publication lists
while applying for research grants and for individual promotion or job
applications (Jiménez-Contreras et al. 2002). Yet indicators such as the
JIF are not neutral as they can bring forth the realities they represent
through the numbers.
There are no indicators that have been universally accepted to repre-
sent quality of research or the performance of researchers. As they have
been the object of controversial debate, a number of serious flaws with
the JIF have been pointed out which disqualify the indicator for any use
in science management, let alone for evaluation purposes. Thus, it is cal-
culated in ways that in about 40% of all journals JIF values are overrated
(Moed and van Leeuwen 1995, 1996). Another issue is that JIF values do
not take into consideration the way the journal is set up. Journals, for
example, that contain many review articles, tend to get cited more fre-
quently as compared to normal research articles. Therefore, review jour-
nals always end up on top of the ranking lists. A third issue relates to the
fact that JIF values do not take into consideration the field in which the
journal is positioned. Reference cultures differ, as do the number of jour-
nals per field. This means that fields with a strong focus on journal pub-
lishing, and long reference lists, have much higher JIF values as compared
to fields where citations are not given so generously. A fourth reason
relates to the fact that citation distributions are, like income distribu-
tions, skewed by nature. This means that a JIF value of a journal only
reflects the value of few much-cited articles in the journal while most
have lower impacts. This creates a huge inflation in science, given the
practice mentioned above, in which scholars tend to enrich their publica-
tion lists with JIF values, which say nothing about the citation impact of
their own articles. Moreover, JIF values tend to stimulate one-indicator
thinking and to ignore other scholarly virtues, such as the quality of
teaching, the capability to ‘earn’ money for the unit, the overall readiness
to share and cooperate in the community.
The h-index was introduced in 2005 (Hirsch 2005). It is to assess an
individual researcher’s performance by looking at the way citations are
distributed across all publications of that person. If one takes the output
in a descending order (by number of received citations), the h-index rep-
resents the number of citations received equalling the rank order (i.e., if
somebody has published five articles, cited 20, 15, 8, 4, and 3 times, the
h-index is four). Due to the simplicity of the indicator, it has been widely
adopted and is sometimes even mentioned to justify hiring and firing
decisions as well as the evaluations of research proposals in research
councils.
The problems with the h-index are manifold. First of all, a number of
issues related to JIF also apply on the h-index, one of them being the
issue of the lack of normalization, which makes comparisons across fields
impossible (van Leeuwen 2008). A next issue is the conservative nature of
the indicator, it can only increase, which makes the h-index unfit for
predictions. The next set of issues relates to the way this indicator is cal-
culated. Depending on the database, calculations of the h-index can dif-
fer significantly. In many cases, authors and their oeuvres cannot be
determined easily. A final set of issues relates to a variety of more general
questions, such as the publication strategies chosen by researchers (put-
ting your name on every single paper from the team or be more selective),
the discrimination against younger staff and the invisibilization of schol-
arly virtues.
Indicators such as the JIF and the h-index are nowadays easily available
for everybody. They are readily used by people in research management
and science policy, government officials, librarians, and so on. Even
though precise effects are difficult to prove, these indicators often play a
role for decision-making in grant proposal evaluation, hiring of academic
personnel, annual reviews as well as promotion and tenure decisions. If
they are applied in a mechanistic way without reflecting on their limits,
such indicators can go against values which have defined the academic
ethos, for example, the innovation imperative, the service to the commu-
nity, the disinterested pursuit of ‘truth’.
While the JIF and the h-index testify to numerocratic practices within
academic research, university rankings are an example for the numeroc-
ratization of higher education in the broader social space. University
rankings were initially intended to help potential students to select a
proper university matching their educational background. Yet these rank-
ings have turned into a numerocratic exercise that now assesses many
aspects of university performance. Since 2004, with the launch of the
so-called Shanghai ranking, universities have been regularly ranked
worldwide, which has contributed to creating a global market of higher
education. As a result, these league tables can no longer be ignored by
managers and administrators, especially in Anglo-American institutions,
which highly depend on fees brought by international students (Espeland
and Sauder 2007). With the exception of the Leiden ranking, which is
based entirely upon research metrics, university rankings usually contain
information on educational results, student-staff ratio, reputation, and
research performance. All prominent university rankings, including the
ARWU Ranking (Academic Ranking of World-Class Universities, aka
Shanghai Ranking), the Times Higher Education university ranking and
the QS Ranking, include scientometric information in their ranking for-

mula. Many of the problems that one can observe with scientometric
analyses, such as coverage and language, find their way into university
rankings (see van Leeuwen 2013; van Leeuwen et al. 2001; van Raan
et al. 2011a, b). In the European context, the performance of French,
German, Italian and Spanish universities is seriously underrated because
their output contains local language output (see van Raan et al. 2011a,
b). Some researchers, particularly from Germany, have therefore started
to publish in English. And since university rankings tend to reward sheer
size policy-makers have cited the need to do well in international univer-
sity rankings to push through reforms such as the German
Exzellenzinitiative or the recent clustering of French institutions.
Generally speaking, we can point out various types of scientometric
studies, which testify to the different levels of numerocratic practices
objectifying the social through numbers. A macro-level scientometric
analysis provides scientometric information on the level of countries or
groups of countries (e.g., the EU, the OECD countries, etc.); a meso-
level analysis concentrates on institutions such as universities, publicly
funded research organizations, corporations and larger units within these
institutions (faculties, research groups, etc.). And, finally, one can see
micro-level analyses that deal with small research groups, programs, and
individual researchers. This is a level to which scientometric analysis is
extended more often nowadays. The requirements of data collection vary
according to the level of analysis. Raw data for macro-level analyses is
relatively well organized for some countries if one takes citation indices as
a point of departure. One can distinguish between the outputs of differ-
ent European countries relatively easily and most work here involves
questions such as the grouping the four countries England, Scotland,
Wales and Northern Ireland within the UK or the creation of labels such
as the EU (as it has been progressively enlarged) or the OECD. On the
meso-level, data handling requires significant knowledge of the science
system on the national level (e.g., how academic hospitals are related to
universities in the same cities), what name variations exist of a university
and its various locations, how are other publicly funded research organi-
zations grouped on the national level), and in numerous cases, data han-
dling needs to achieve unification and bring the many variants of an
institution under one umbrella name. On the micro level, the challenges
of scientometric analysis are the greatest. In the first place, micro-level
analysis often, if not always, needs to involve those who are the object of
the analysis so as to define research groups, projects, programmes, and
publications realized by these units. Some sort of certification or authori-
zation is required without which the outcomes of the study would lack
legitimacy. In the second place, on the level of the individual researcher,
the problem of homonyms and synonyms plays an important role. As
one single person can publish under various names in the international
serial literature (e.g., by using various initial combinations, sometimes
one of the first names is written in full, etc.), these occurrences have to be
brought back to one single variation. Likewise, one name variation can
hide various persons, due to the occurrence of very common names (in
the English language area names like Brown or Smith), in combination
with one single initial but many Chinese scholars mean formidable chal-
lenges to citation indices. Scientometric data handling requires informa-
tion on the full names of individuals, the field in which people have been
working and also about their career track. One can try to collect such
data. However, ideally, one should consult the authors as they are those
who know best. Verifying publications not only increases the validity of
the outcomes of the scientometric study but also adds to the transparency
of the process.
In order to critically reflect on how numerocracy works in and through
scientometrics, one needs to understand how the empirical data for indi-
cators and rankings are collected. With respect to this phase of data col-
lection, we can cite for one moment the work of Pierre Duhem, a French
natural scientist and philosopher of science. In his so-called law of cogni-
tive complementarity, he holds that the level of accuracy and the level of
certainty stand in a complex relationship with each other (Rescher 2006).
If we follow Duhem, the analysis of the macro level can teach us about
the performance of a country in a particular field but it cannot tell us
anything about any particular university active in that field, let alone
about any of the research programmes or individual scholars in that field.
Vice versa, while analyses on the micro level can instruct us about indi-
vidual scholars and the research projects they contribute, it cannot inform
us about the national research performance in that same field of research.
Even at the meso-level, where we would expect the level of certainty and
accuracy to be easier in balance, the world remains quite complicated: an
overview of the research in, say, a medical centre in the field of immunol-
ogy does not relate in a one-to-one relationship to the department of
immunology in that centre, as researchers from various departments
might publish in the field of immunology, such as haematologists, oncol-
ogists, and so on. This tension between the level of certainty and accuracy
exists at any moment and influences the range and reach of the conclu-
sions that can be drawn from scientometric data.
5 onclusion: For a Critical Reflection

C
of Scientometrics as a Numerocratic
Practice
Over the last few decades, higher education has gone through profound
changes. While the sector has grown hugely and in many areas continues
to grow, higher education is under conditions of austerity and both tax-
payers and private sponsors expect more and more justification as to how
money is spent on academic activities. Academic systems have been under
pressure to turn away from ‘personal’ modes of evaluation (such as
peer review) towards more managerial and neoliberal approaches.
Scientometrics responds to these demands by subjecting research output
of individuals, institutions, and academic systems to numeric analysis
and representation.
It was the objective of this paper to place the rise of scientometrics in
the context of ‘numerocracy’, a regime which uses numbers to place peo-
ple into social networks, groups, or structures. Against this background,
scientometrics signals the advent of numerocracy in the higher education
sector. In higher education, numerocratic governance aims at monitor-
ing and controlling large academic populations through numbers. Such
approaches typically define arenas where academics are allowed to engage
in ‘free’ competition over scarce resources and constitute hierarchies of
‘excellence’ among researchers. Scientometrics can provide the tools
that objectify social inequalities and justify hierarchical relationships
between researchers. By measuring research impact and citation pat-

terns, scientometricians numerocratic practices not only legitimate but
also create unequal distributions of resources and prestige. Thus, scien-
tometrics can to naturalizing and depoliticizing social struggles among
the participants of the academic ‘game’. Numerocracy theory provides
critical tools to reflect on the social effects of such practices, which the
actor does not always control. While it acknowledges the part of the
actors in producing and reproducing social inequalities (Espeland and
Stevens 2008), it places scientometric research in the larger context of a
knowledge-power dispositif (see Chap. 3 in this volume) which has
applied numerocratic techniques to a number of arenas and populations
since the eighteenth century.
Scientometrics has met with widespread suspicion if not outright criti-
cism from various corners of the academic community for participating
in power games. If scientometricians are perhaps the first ones to be aware
of the political uses and abuses of the expertise it can offer, we want to
insist on the dynamics of power in which scientometric research finds
itself entangled. As a field, scientometrics is part and parcel of a system
which creates, reinforces, and legitimates social inequalities between aca-
demics. How should scientometricians act in the light of political effects
of their work? How should they deal with the fact that their knowledge is
not always used in ways that is in accord with their goals and values?
One solution could be to distinguish between descriptive (or basic)
and normative (or applied) research in scientometrics. Descriptive scien-
tometrics would put the emphasis on revealing differences in behaviours
and practices of academics. It would draw on methods of social research
to study social order in the world of science empirically. The goal of nor-
mative scientometrics, by contrast, would be to facilitate institutional
decision-making. Through indicators such as the JIF or the h-index, it
could claim, for instance, to assess the ‘performance’ of researchers and to
measure the ‘quality’ of research output. Scientometricians would then
have the choice to take on one or the other role. They would do sciento-
metrics either as an empirical social research (and perhaps integrate it
back into STS) or they would produce applied knowledge for decision-
makers (following Garfield, whose work has crucially contributed to the
ISI Web of Science indicators).
Yet if we take seriously the idea of scientometrics as an aspect of a

numerocratic dispositif, the distinction between descriptive and norma-
tive research is difficult to maintain. For neither practice can stand by
itself. While descriptive research uses numbers to represent social reali-
ties, its representations cannot but partake in the constitution of such
realities. Just like any social research that aims to make a difference, sci-
entometrics cannot simply render a given state of the social world with-
out intervening in it. And how could there be normative scientometrics
which does not at the same time claim to be descriptive (and the other
way round)? Even more fundamentally, one can ask to what degree
applied research is in control of the political effects that it has in the social
world? Is applied research always affirmative of the values that motivate
its sponsor to give resources for it? Does not ‘descriptive’ research that
wants to make a positive difference in the world precisely need to follow
the ‘applied’ route and seek to work with agents and stakeholders outside
its scientific community?
There is probably no general response to these dilemmas which many
scientometricians are confronted with. And there is no way to find
responses without accounting for the specific circumstances in which
researchers struggle to find solutions to the problems they deal with in
their everyday lives. While it is difficult to assess the impact of academic
research on the social world, it is even more difficult to determine the
precise effects that scientometric knowledge has on actions and decision-
making. Over the last couple of years, ethnographic studies on the role
of research metrics on shop floor level shows the degree of penetration
of those metrics in various aspects of academic life, for example, in
decision-making on publication strategies, on promotion procedures,
the role of research metrics in research grant procedures, and so on
(Rushforth and de Rijcke 2015). It is all too easy to fall prey to the myth
of a scientometric conspiracy in higher education. Research has always
been and will always remain a creative social practice, which will never
be totally dominated by benchmarks, indicators and algorithms.
Scientometrics, therefore, should be critical of its role in the numeroc-
ratized dispositif of higher education and it cannot but gain from a critical
reflection on the fragile and preliminary nature of scientific knowledge
more generally.
Today, numerocratic technologies are used in many areas of social life

and the rise of scientometrics more generally testify to the increasing
demand for scientifically established numbers. Yet with the spread of the
digital medium, numbers have become common in various areas of social
life. Consumers use numbers when they choose a hotel or a mobile phone
provider. Citizens use numbers in the democratic debate to determine
what clickworthy news is. And numbers even have become crucial in the
match-making and dating business. By taking scientometrics as an exam-
ple of the emerging numerocratic dispositif, we invite to critically reflect
on the social effects that numbers have in many areas of social life today.
References
Angermuller, Johannes. 2013a. Discours académique et gouvernementalité
entrepreneuriale. Des textes aux chiffres. In Les discours sur l’économie, ed.
Malika Temmar, Johannes Angermuller, and Frédéric Lebaron, 71–84. Paris:
PUF.
———. 2013b. How to become an academic philosopher. Academic discourse
as a multileveled positioning practice. Sociología histórica 3: 263–289.
———. 2017. Academic careers and the valuation of academics. A discursive
perspective on status categories and academic salaries in France as compared
to the U.S., Germany and Great Britain. Higher Education 73 (6): 963–980.
Angermuller, Johannes, and Jens Maeße. 2015. Regieren durch Leistung. Zur
Verschulung des Sozialen in der Numerokratie. In Leistung, ed. Alfred Schäfer
and Christiane Thompson, 61–108. Paderborn: Schöningh.
Bloor, David. 1976. Knowledge and social imagery. London: Routledge & Kegan
Paul.
Bourdieu, Pierre. 1992. Homo academicus. Frankfurt/Main: Suhrkamp.
Burrows, Richard. 2012. Living with the h-index? Metric assemblages in the
contemporary academy. Sociological Review 60 (2): 355–372.
Cawkella, Tony, and Eugene Garfield. 2001. Institute for scientific information.
In A century of science publishing, ed. E.H. Fredriksson, 149–160. Amsterdam:
IOS Press.
Committee on the Independent Review of the Role of Metrics in Research
Assessment and Management. 2015. The metric tide. Report to the HEFCE,
July 2015. Accessed June 27, 2018. http://www.hefce.ac.uk/media/
HEFCE,2014/Content/Pubs/Independentresearch/2015/The,Metric,
Tide/2015_metric_tide.pdf.
Cronin, Blaise, and Helen Barsky Atkins. 2000. The scholar’s spoor. In The web
of knowledge: A festschrift in honor of Eugene Garfield, ed. Blaise Cronin and
Helen B. Atkins, 1–8. Medford, NJ: Information Today.
CWTS. 2007. Scoping study on the use of bibliometric analysis to measure the
quality of research in UK higher education institutions. Report to HEFCE
by the Centre for Science and Technology Studies (CWTS), Leiden
University, November 2007.
Desrosières, Alain. 1998. The politics of large numbers: A history of statistical rea-
soning. Cambridge, MA: Harvard University Press.
Dunleavy, Patrick, and Christopher Hood. 1994. From old public-administration
to new public management. Public Money & Management 14 (3): 9–16.
Espeland, Wendy Nelson, and Michael Sauder. 2007. Rankings and reactivity:
How public measures recreate social worlds. American Journal of Sociology
113 (1): 1–40.
Espeland, Wendy Nelson, and Michell L. Stevens. 2008. Commensuration as a
social process. Annual Review of Sociology 24: 313–343.
Foucault, Michel. 1973. The birth of the clinic: An archaeology of medical percep-
tion. London: Routledge. Original edition, 1963.
———. 1980. Power/knowledge: Selected interviews and other writings
1972–1977. Edited by Colin Gordon. New York: Pantheon.
———. 1995. Discipline and punish: The birth of the prison. New York: Vintage
Books.
———. 2002. The order of things. An archeology of the human sciences. London:
Routledge. Original edition, 1966.
———. 2007. Security, territory, population. lectures at the college de France.
Basingstoke: Palgrave Macmillan. Original edition, 1977/78.
———. 2008. The birth of biopolitics. Lectures at the Collège de France,
1978–1979. London: Palgrave Macmillan.
Garfield, Eugene. 1955. Citation indexes to science: A new dimension in docu-
mentation through association of ideas. Science 122 (3159): 108–111.
———. 1972. Citation analysis as a tool in journal evaluation. Science 178
(4060): 471–479.
———. 2006. The history and meaning of the journal impact factor. JAMA
295 (1): 90–93.
Garfield, Eugene, and Irving H. Sher. 1963. New factors in the evaluation of
scientific literature through citation indexing. American Documentation 14
(3): 195–201.
Hirsch, Jorge Eduardo. 2005. An index to quantify an individual’s scientific
research output. Proceedings of the National Academy of Sciences of the USA
102 (46): 16569–16572.
Hood, William W., and Concepción S. Wilson. 2001. The literature of biblio-
metrics, scientometrics and informetrics. Scientometrics 52 (2): 291–314.
Jiménez-Contreras, Evaristo, Emilio Delgado López-Cózar, Rafael Ruiz-Pérez,
and Victor M. Fernández. 2002. Impact-factor rewards affect Spanish
research. Nature 417: 898.
Klein, Daniel B. 2004. The social science citation index. A black box—With an
ideological bias? Econ Journal Watch 1 (1): 134–165.
Knorr Cetina, Karin. 1981. The manufacture of knowledge. An essay on the con-
structivist and contextual nature of science. Oxford and New York: Pergamon
Press.
Latour, Bruno, and Steve Woolgar. 1979. Laboratory life. Princeton, NJ:
Princeton University Press.
van Leeuwen, Thed N. 2008. Testing the validity of the Hirsch-index for research
assessment purposes. Research Evaluation 17 (2): 157–160.
———. 2013. Bibliometric research evaluations, web of science and the social
sciences and humanities: A problematic relationship? Bibliometrie – Praxis
und Forschung, 1–18. Accessed June 27, 2018. http://www.bibliometrie-pf.
de/article/view/173.
van Leeuwen, Thed N., Henk F. Moed, Robert J.W. Tijssen, Martijn S. Visser,
and Ton F.J. Van Raan. 2001. Language biases in the coverage of the science
Citation Index and its consequences for international comparisons of national
research performance. Scientometrics 51 (1): 335–346.
Mandel, Ernest. 1978. The second slump. London: Verso.
Merton, Robert K. 1962. Science and the social order. In The sociology of science,
ed. Bernard Barber and Walter Hirsch, 16–28. Westport, CT: Greenwood.
Miller, Peter. 2001. Governing by numbers. Why calculative perspectives mat-
ter. Social Research 68 (2): 379–396.
Moed, Henk F. 2008. UK research assessment exercises: Informed judgments on
research quality or quantity? Scientometrics 74 (1): 153–161.
Moed, Henk F., and Thed N. van Leeuwen. 1995. Improving the accuracy of
institute for scientific information’s journal impact factors. Journal of the
American Society for Information Science 46: 461–467.
———. 1996. Impact factors can mislead. Nature 381: 186.

NOWT. Wetenschaps- en Technologie Indicatoren, by the Netherlands
Observatory of Science & Technology (NOWT). Report to the Dutch
Ministry of Science and Education, 1994, 1996, 1998, 2000, 2003, 2005,
2008, 2010, 2012, 2014.
Ogien, Albert. 2010. La valeur sociale du chiffre. La quantification de l’action
publique entre performance et démocratie. Revue française de socio-économie
5: 19–40.
OST. 1992. Observatoire de Science et Technologie (OST). Edition 1992,
1994, 1996, 1998, 2000, 2002. Science & Technologie Indicateurs. Paris.
Porter, Theodore. 1994. Trust in numbers. The pursuit of objectivity in science and
public life. Princeton, NJ: Princeton University Press.
van Raan, Ton F.J., Thed N. van Leeuwen, and Martijn S. Visser. 2011a. Non-
English papers decrease rankings. Nature 469 (1): 34.
———. 2011b. Severe language effect in university rankings: Particularly
Germany and France are wronged in citation-based rankings. Scientometrics
88: 495–498.
Radder, Hans. 2010. The commodification of academic research. Science and the
modern university. Pittsburgh, PA: University of Pittsburgh Press.
Rescher, Nicolas. 2006. Epistemetrics. Cambridge: Cambridge University Press.
Rushforth, Alex, and Sarah de Rijcke. 2015. Accounting for impact? The jour-
nal impact factor and the making of biomedical research in the Netherlands.
Minerva 53: 117–139.
Wouters, Paul. 1999. The citation culture. PhD thesis, University of Amsterdam,
the Netherlands.
WTI. (2012) 2014. Wetenschaps-, Technologie en Innovatie Indicatoren
(WTI). Report to the Dutch Ministry of Science, Culture & Education,
2012, 2014.
Part III
Exploring Corpora: Heuristics, Topic
Modelling and Text Mining
5
Lexicometry: A Quantifying Heuristic
for Social Scientists in Discourse Studies
Ronny Scholz
1 Introduction
Most discourse analytical studies have a common interest in patterns of
how knowledge is (re-)produced, (re-)distributed, and controlled
through social practices of language use. Discourse analysts have devel-
oped numerous methods to demonstrate how meaning is constructed.
However, the reason why a particular text or textual sequence in a given
corpus was chosen to be analysed often remains arbitrary. Distinguishing
hermeneutic from heuristic methods, this contribution introduces a
systematic quantitative methodology guiding the analyst’s choices of
texts and textual sequences. The text emphasises the heuristic strength
I am thankful to Malcolm MacDonald for his helpful comments on earlier versions of this text.
Additionally, I want to thank André Salem for the numerous personal tutorial sessions and
discussions of the software Lexico3 with which most of the analyses in this text have been
conducted.
R. Scholz (*)
e-mail: r.scholz@warwick.ac.uk

124 R. Scholz
of lexicometric methods, providing the researcher with numerous per-

spectives on the text collections she or he is analysing. On the basis of
these perspectives, the analyst can develop hypotheses about discourse
phenomena driven by the corpus data, which can be verified and falsi-
fied in the hermeneutic research phase.
At least since the 1970s, lexicometry, not without polemic, has devel-
oped into one of the prominent methodologies in French discourse
analysis, mainly in order to analyse the language use in political dis-
courses (Bonnafous and Tournier 1995; Tournier 1975, 1993, for an
introduction; Mayaffre 2016; Mayaffre and Poudat 2013; Scholz and
Mattissek 2014). Lexicometric methods are especially suited for the
study of recurrent language use patterns assuming that a high frequency
of a particular use of language reflects the typical way knowledge is
structured in society or a part of it. These patterns are normally studied
with a number of quantitative methods in text collections (corpora) of
varying size. When analysing society through language use, we have to
operationalise the world in a particular way. Discourse analysts gain
their data through a chain of procedures through which they reduce the
complexity of the world (see also Chap. 2 in this volume). It is only after
having transcribed, edited, and enriched actual language use with some
metadata that quantitative and qualitative methods can be used to anal-
yse and interpret discourses.
Lexicometric methods allow describing discourses in terms of con-
crete and ‘differentiated’ language use which can be explained with ref-
erence to historical and social situations as well as to individuals and
social groups that are embedded into power relations. Computational
methods are one way to account rigorously for these relations, to the
extent that they can be translated into a machine-readable language. The
concept of ‘differentiated use’ [usage différentiel] refers to an understand-
ing of the sociolinguistic diversity (historical situation, social groups,
communication situation, and genre) in relation to the language system.
With contrastive measurements, one can distinguish the predominant
topics, text types, genres, and subgenres (Mayaffre 2005). Stylistic char-
acteristics as well as rhetoric and argumentative effects can be described
and juxtaposed. Furthermore, the distribution of particular patterns
of language use or similarities between texts can be measured (Mayaffre
Lexicometry: A Quantifying Heuristic for Social Scientists… 125
2016). Even if there is a substantial overlap with corpus-assisted dis-

course studies in the Anglo-Saxon world, both traditions have ignored
each other for a number of different reasons, which I will not elaborate
here. Therefore, one objective of this chapter is to present the French
quantitative approach to a novice readership. Notwithstanding this
introductory impetus, experts in the field will also find some useful ideas
for their research.
In the following section, I outline the historical context in which the
lexicometric approach developed. Thereafter, in the third section I explain
theoretical aspects that are important for the understanding of a quanti-
tative heuristic approach. In the fourth section, I remind the reader of the
criteria that should be taken into account when building a corpus to be
analysed with lexicometric methods. The fifth section is the main part.
Here, I will demonstrate how exhaustive methods that take into account
the whole vocabulary of a corpus simultaneously can be used to take a
variety of angles onto the data. I will illustrate how quantifying methods
can be used to explore the lexicosemantic macro-structures in a discourse.
Important discursive aspects can be discovered without a prior interpreta-
tion of texts. Furthermore, I will show how lexicometric methods can be
used to reduce a large corpus to a selection of prototypical texts, which
then can be interpreted with various methods.
2 ome History of the Lexicometric

S
Methodology
Influenced by textual philology, stylistics, statistics, and computer sci-
ences, the lexicometric approach has been developed in France since the
1970s as a computer-assisted methodology for the analysis of language
use in political discourses. Early precursors of the approach can be found
in quantitative linguistics related to names such as George Yule (1944),
George Zipf (1929, 1935) Gustav Herdan (1964, 1966), or Pierre
Guiraud (1954, 1960). Influential was Charles Muller’s contrastive statis-
tic model to describe lexico-grammatical differences between the comic
and the epic parts of the drama L’illusion comique by Corneille (Muller
1967). Ever since, based on Benzécri’s multivariate statistics (Benzécri
126 R. Scholz
1963, 1969, 1982), methods for exhaustive, contrastive analysis of cor-

pora have been developed (Lafon 1984; Lebart and Salem 1988; Lebart
et al. 1998).
The term ‘lexicometry’ (lexicométrie) had been used by Maurice
Tournier (1975) and his team at the Laboratoire de lexicologie et textes
politiques (Ecole Normale Supérieure de Saint-Cloud) since the 1970s in
order to mark the methodological independence of the approach focus-
ing on lexical statistics. Meanwhile, other terms such as textométrie and
stylométrie have been introduced highlighting different aspects of linguis-
tic investigation. Lately Mayaffre (2007) has advocated the ‘logometric
approach’ aiming for an automated analysis of all textual levels compris-
ing all corpus methods and exceeding the simple analysis of word forms
by including lemmata, grammatical structures, and a possible recourse on
the original text source.
Like Pêcheux’s discourse analysis, lexicometry aims for a thorough
analysis of political language use. However, even though both approaches
developed in a similar intellectual environment with a similar objective,
their proponents encountered each other with distance and suspicion in
the first instance. Pêcheux and his disciples (Pêcheux 1982; Pêcheux et al.
1979) whose work is based on Harris’s (1952) distributional approach
disputed that counting words cannot help to uncover the hidden inher-
ent ideological permeations of texts—being an important aspect of
French discourse analysis at the time (Scholz and Fiala 2017). In contrast,
lexicometric scholars criticised Pêcheux’s approach for its oversized theo-
retical superstructure and for an incoherent methodology when trying to
link non-linguistic and linguistic aspects of language use (Demonet et al.
1975, 38). They argued that by founding the analysis on statistical results,
in the first instance, the researcher can get rid of his or her ideological
presumptions that cannot be easily controlled when interpreting texts.
However, since then the mutual scepticism has faded over time due to the
development of sophisticated methods allowing quantification of the co-
text (Heiden and Lafon 1998; Martinez 2012; Salem 1987), retracing of
syntactic structures (Fiala et al. 1987) and direct access to the textual
material (Fiala 1994).
3 pistemological Rupture and Heuristic

E
Methodology
Lexicometry is a data-driven approach to (mostly political) language use
in which the analyst applies a range of statistical algorithms onto textual
data in order to unveil its lexicosemantic macro-structures from a syn-
chronic and diachronic perspective. Historical or sociological categories
such as time of text production or authorship can be introduced into the
text corpus as metadata prior to the analysis. They are used to partition
the corpus and allow typical and untypical language use to be highlighted
in the resulting parts of the corpus through the exhaustive contrasting of
word frequencies. Thus, ‘exhaustive’ means that all word tokens of a cor-
pus are taken into account at once. Applying these methods, the researcher
alienates himself or herself temporarily from the textual material in the
first instance for the benefit of an explorative phase. The decontextualized
representation of text statistical results can trigger new ideas about rela-
tions between different corpus parts referring to authors and time periods
that might stay hidden if one limits the analysis to methods of text inter-
pretation. The strength of these quantifying methods is their capability to
map a corpus and help the researcher discovering the most dominant
lexicosemantic features—even before he or she starts reading the texts.
Against this backdrop I want to suggest using quantifying methods as a
sort of machine-led reading technique that guides the analyst to statisti-
cally salient text sequences, which then can be analysed with qualitative
interpretative methods. In this sense, interpretations concerning the con-
struction of meaning in context are postponed to a later point in research
(Scholz 2016; Scholz and Angermuller 2013; Scholz and Ziem 2015).
Moreover, I argue that by postponing the interpretation in this way, we
can systematically induce epistemological ruptures. Based on such rup-
tures we can put into practice Foucault’s “project of pure description of
discursive events1” in their original neutrality, which puts aside the inher-
ent continuities of knowledge formations triggered when allocating an
utterance to a certain author, period, text type, and so on (Foucault 2003,
Italic in original.
1
128 R. Scholz
400). Foucault draws from Bachelard’s epistemology, according to which

different ‘worlds of reality’ are the result of different scientific descrip-
tions of the world. New descriptions of the world only have become pos-
sible because they have ruptured with the everyday-life perception and
thought of the world. Against Husserl’s phenomenology Bachelard states:
‘The world in which one thinks is not the world in which one lives’
(Bachelard 1962, 110). The precondition for scientific thought is an epis-
temological rupture with the habitual way of thinking (Diaz-Bone 2007).
If we understand lexicometric methods as a way of putting epistemo-
logical ruptures into practice, it is because they allow us to ignore the
interpretation of utterances in prefabricated categories. Based on word
frequencies these methods render apparent continuities and discontinui-
ties of discursive practices not based on the assumption that texts of dif-
ferent authors, periods, and text genres must be different—but by
verifying, if, on the statistical level, such differences can be confirmed or
not. Lexicometry can help to get a grip on a large amount of data repre-
senting discourses. They allow a critical stance by providing a quantifying
heuristic of these data, which then can be analysed thoroughly with vari-
ous methods used in lexicometric and other approaches of discourse
analysis. Normally a combination of lexicometric methods with qualita-
tive methods of discourse tackling text and particular utterances on the
micro level will give most robust and telling results.
Heuristic methodologies can be distinguished from hermeneutic
methodologies. Whereas heuristic methods trigger research discoveries
based on new perspectives onto the research material, hermeneutic meth-
ods help interpreting the meaning of this material. The combination of
both activities is crucial for analysis. However, in social research the heu-
ristic aspect sometimes is neglected in favour of an abundant interpreta-
tion within a particular theoretical framework. Lexicometry meets the
criteria that Kleining has developed for heuristic methods in qualitative
social research: Openness of the researcher for new unexpected results, a
provisory research object, a maximum of structural variation of perspec-
tives, an analysis focusing on common features (Kleining 1994, 22–46
and 178–198). Lexicometry is a methodology that allows in a given cor-
pus exploring textual data through a variety of operations helping to
produce hypotheses and narrowing down the research material based on
statistical salience. Moreover, the different methods inform the interpre-

tation of the textual sequences that have become relevant throughout the
first phase of investigation (Glady and Leimdorfer 2015).
4 ompiling a Text Collection

C
for a Lexicometric Analysis
When working with lexicometric methods, we normally work with tex-
tual data of all sorts which have to be compiled according to particular
criteria. First of all, we need to have a basic understanding of our research
object. There are numerous rather elaborate definitions of what a dis-
course is. However, for our purposes, we chose a very general definition
that has been advocated for by German discourse linguists. Within this
approach, discourse was defined from a perspective of research practice
as a ‘virtual text corpus’ containing all kinds of texts concerning a certain
discourse topic. To analyse a discourse, a researcher must compile a ‘con-
crete text corpus’ constituting the object of research which contains a
representative number of texts taken from the virtual corpus (Busse and
Teubert 2014, 344).
Representativeness of our data is an important criterion to produce
valid hypotheses of general value about the research object, a given dis-
course. Notwithstanding the importance of aiming for representativeness
of the research data I would say that this is an ideal typical requirement.
Because it is rather difficult to decide which texts are representative of a
discourse before we actually start the analysis, instead I would like to
highlight that it is important to keep the issue of ‘representedness’ in
mind throughout the whole research process: What, who, and how are
the texts under investigation representing? To what extent do the data
structure and the context, in which the data were produced, influence the
statistical results? Finally, we have to be aware of the fact that, in terms of
statistical algorithms, we are applying a rigid apparatus on a more or less
arbitrary collection of data. And in this sense, the choice of texts prede-
termines the results. If we are not reflecting on what our data are actually
representing, there is always a risk of producing artefacts. For instance,
130 R. Scholz
visuals can perfectly represent the result of a particular method we

applied, but at the same time, it is possible that they do not represent a
discourse in general and hence the social reality we were actually inter-
ested in. This becomes obvious when we add texts from a particular dis-
course participant and leave out texts from another participant. By
omitting or overemphasising particular aspects of a discourse, we are
actively influencing the outcome of our analysis. However, experiment-
ing with different datasets and methods in this way is also an important
way of validating our research results.
Furthermore, in order to avoid the production of artefacts, it is impor-
tant that our research corpus is homogenous in terms of text register (Lee
2001). Therefore, all parts of a corpus should contain a similar amount of
texts produced in the same or similar contexts. This is necessary because
we want to make sure that our rigorous methodical apparatus compares
features of language use on the same ontological level. For example, if our
corpus contains only public speeches for one discourse participant and
only interview data for another participant we would not be able to deter-
mine if the result is caused by the different text types or by a different
language use of a speaker. As in discourse analysis we are predominantly
interested in the latter aspect, we should compile our corpus from texts
coming from more or less homogeneous sources of text production.
Before we can start the examination of the corpus, we have to make
sure that the different metadata, such as author, date of text production,
and text type, are allocated to the data material in a way that computer
software is able to partition the corpus according to these variables. The
partitioning of the corpus is a way to account for the influence of the
context of text production onto a discourse. For instance, contrasting
corpus parts built from different discourse participants presupposes that
there might be different ways, in which different participants speak/write
concerning a certain topic and subsequently position themselves in the
discursive field; corpus parts created from different periods of text pro-
duction are based on the presumption that the vocabulary and therefore
the discourse on a certain topic evolves over a period of time, and so on.
Metadata are one way to integrate sociological research categories into
our analysis. In this sense, we can account for different actors and
institutions, their spheres of activity, and social spaces. When we, in a
second step, analyse the language use within these sociological categories,
we then can refer the use of particular words to the construction of mean-
ing within a particular social context (Duchastel and Armony 1995,
201). Such an analysis aims at an investigation of topics, utterances, and
positioning practices structuring the organisation of the symbolic field
and the exercise of power in contemporary societies (Duchastel and
Armony 1993, 159). It is the contrastive analysis of actual language use
with reference to sociological metadata with which we aim to investigate
the meaning construction entangled in social and societal relations such
as work, class, gender, race, and poverty.
5 Examples of the Empirical Analysis

In the following section of this text, I shall be illustrating some lexicomet-
ric methods. The corpus used for the following exemplary studies was
compiled in the context of a larger project on crisis discourses in the
German press since the 1970s, led by Martin Wengeler and Alexander
Ziem and funded by the German Research Foundation. The corpus for
the financial crisis 2008 was compiled from press articles taken from the
five largest national newspapers and magazines with the intent to cover
the German political spectre: the Bild (populist conservative daily tab-
loid), the Frankfurter Allgemeine Zeitung (FAZ—conservative), the
Süddeutsche Zeitung (SZ—progressive liberal), Die Zeit (weekly,
centre-left liberal), and the weekly magazine Der Spiegel (centre-left).
The research period was determined according to the intensity of the
media coverage of the crisis starting in September 2008 with the bank-
ruptcy of the Lehman bank and ending in April 2009 with the G20 sum-
mit on financial market issues. The search term used to identify relevant
articles was Finanzkrise (financial crisis). The abundance of texts found
with this search term prevented the use of other search terms that would
have led to an unmanageable number of texts. All identified texts were
read rapidly and double-checked for their relevance to the crisis issue.
The total number of press articles included in the corpus was 3,814 with
a total of 2,324,914 word tokens.
132 R. Scholz
5.1 xploring and Discovering with Exhaustive

E
Statistical Methods
As outlined in Section 3, exhaustive methods take into account all word

tokens of a given corpus at once and contrast them alongside a corpus
partition that the researcher introduces to the corpus-based on the origin
of these texts. The advantage of exhaustive methods when aiming for a
quantifying heuristic is that they allow us to reorganise the textual data
based on statistical algorithms and not on our interpretation. Such an
inductive, corpus-driven (Tognini-Bonelli 2001) approach to the data
develops hypotheses about the semantic structure of a particular dis-
course on the basis of the data and not based on presumptions about the
discourse. On the basis of these methods, I will give some examples, of
how to retrieve macro-structures from a corpus representing a particular
political discourse.
5.1.1 Correspondence Analysis
One powerful method to discover similarities in language use between dif-

ferent speakers and time periods is correspondence analysis. This method
takes into account the complete vocabulary of a given corpus and com-
pares it according to the partitions which have been introduced. Applying
correspondence analysis to text corpora is a way of visualising differences
and similarities in language use by projecting words into at least a two-
dimensional visualisation (Bécue-Bertaut 2014; Benzécri 1980; Husson
and Josse 2014; Lebart and Saporta 2014; Salem 1982).
There is a range of lexicometric software solutions that offer correspon-
dence analysis, for example Lexico3, Hyperbase, TextObserver and
Iramuteq. The method is based on a matrix table containing the fre-
quency distributions of each word (rows) in different corpus parts (names
of columns). The first column of this table contains all words (types) of
the corpus. All other columns are named after the part of the corpus they
refer to. The rows of one column contain the frequencies of all word
tokens in a particular part of the corpus (for example all token frequen-
cies of the words used by a particular author across all his texts contained
in the corpus). Put simply, the method can be described as an algorithm
grouping together those words with similar distributions across different

columns into a profile. By distribution profiles we mean a similar high or
low distribution of a group of same tokens across columns.
In the visual representation distribution profiles are represented as
clusters. Based on these distribution profiles the algorithm calculates the
distance between the lexis of each column. As a result, the name of each
column can be located in a two-dimensional diagram. Column names
with similar distribution profiles are placed close to one another, whereas
column names with very different distribution profiles are placed distant
from one another.
Correspondence Analysis produces more than two axes. The axes are
situated alongside the highest concentration of similar characteristics
across distribution profiles. The most dominant features of the distribu-
tion profiles are represented on the first two axes—less dominant features
on the remaining axes. The significance of the difference between the
characteristics of distribution profiles decreases with the growing number
of axes. We only concentrate here on the first two axes. Deciphering the
meaning of the axes is part of the process of interpretation which often
needs further analytical steps with different methods. To interpret a
visual, we need to get an idea what extreme positions in this visualisation
stand for. The word tokens whose distribution profiles provide the basis
for this representation are not represented in Fig. 5.2 as the abundance of
overlaps would interfere with the readability of most of the represented
words. Therefore, only column names are shown. In the current examples
they refer to particular months (Fig. 5.1) or names of interviewees
(Fig. 5.2).
Lexical Proximity over Time
In Fig. 5.1, we have excluded the texts of September 2008 from the

calculation as they were extremely distant, which had accumulating
effects on all other data points—when excluding the extreme values we
get a more detailed picture of the rest of the data. The extreme position
of the September 2008 gives us a hint that the vocabulary to describe
the financial crisis during the first month of the research period is
134 R. Scholz
2009/02
2009/03 2008/10
2009/04 2008/09
2009/01
2008/11
2008/12
Fig. 5.1 Correspondence analysis of the German press corpus on the financial
crisis 2008 in the partition ‘month’ (Representation of column names only)
Pofalla/CDU
Müntefering/SPD
Glos/CSU
Seehofer/CSU
Steinmeier/SPD
Kauder/CDU Enzensberger
Schäuble/CDU
Merkel/CDU Leyen/CDU
Köhler/B. -Präsi.
Lagarde/Frankreich Steinbrück/SPD
Juncker/EU Lundgren/ Schmidt H./SPD
Prof.
Schweden
Weischenberg
Merkel/Sarkozy
Barroso/EU
Fig. 5.2 Correspondence analysis of the sub-corpus German press interviews on

the financial crisis 2008
substantially different from the rest of the research period. To under-

stand this extreme position better, we would need to use other meth-
ods to find out which words in particular are overrepresented during
this month in comparison to the rest period—an analysis which is
described further below in this text. We also see that the last three
months of the research period are placed together within the same
quadrant but in inverse order—February 2009 is the most distant data
point whereas April 2009 is closest to the origin of coordinates. The
inverse order is a hint that something unusual must have happened in
the development during these three months—a phenomenon which,
once again, we would have to analyse in more detail with other meth-
ods. The fact that the last three months are situated in the same quad-
rant shows that the vocabulary did not develop as dynamic as during
the first months of the research period.
Lexical Proximity between Discourse Participants
Another way of using this method is to contrast the language use of dif-
ferent speakers in order to find out which speakers use the most similar
vocabulary. Figure 5.2 represents the similarities and differences in the
vocabulary of discourse participants that have been interviewed concern-
ing the financial crisis 2008 in the German press throughout the research
period.
For this analysis, we have created a sub-corpus of the above press cor-
pus on the financial crisis compiled from 28 interviews of 19 interviewees
and a report from the G20 summit in Washington (15–16 November
2008) written by the German Chancellor Merkel and the French
President Sarkozy. The sub-corpus was compiled with the intention of
getting to know better the positions of individuals that were chosen by
the journalists to be experts in this discourse. The visualisation shows that
the positions of these ‘experts’ cannot be ordered according to their polit-
ical party affiliation. With the rather intellectual figures Enzensberger,
Köhler, Schmidt, and Weischenberg on the right and numerous politi-
cians on the left, the x-axis seems to represent a continuum between
societal aspects and party politics. Accordingly, the y-axis seems to repre-
136 R. Scholz
sent a continuum between international/European (at the bottom) and

national politics (at the top). In this sense the correspondence analysis
can help to find out which social dimensions dominate the lexis in a
given corpus.
5.1.2 Descending Hierarchical Classification
Another exhaustive method is the descending hierarchical classification

(DHC) (Roux 1985), which was further developed by Reinert (1983) in
the ALCESTE2 software (also available in the open source software
Iramuteq). The method creates thematic classes from statistically sig-
nificant co-occurrences of word tokens by the following procedure: (A)
Based on punctuation signs and word counts, the corpus is partitioned
into contextual units. (B) The algorithm measures which words are more
likely to co-occur together in one contextual unit. These words are put
into the same class of words. (C) Starting with two classes, the algorithm
creates up to 10 subclasses by dividing the word classes of a superior
class. Each class contains word forms that have a high probability of co-
occurring in the same contextual unit. Therefore, words of the same
class are considered as belonging to the same ‘semantic world’. The more
classes are created, the more they become consistent. The algorithm
stops once a class is stabilised when no statistic significant evidence can
be found to create a new subclass. Assigning a word class to a certain
topic is a result of the researcher’s interpretation and therefore some-
times could be challenged—especially in cases when a class is consti-
tuted of words that cannot be assigned to only one but multiple topics.
Figure 5.3 represents six classes that have been formed with the DHC
method. The percentages refer to the total of contextual units being clas-
sified in a corpus—the two classes referring to the largest part of the
corpus being the classes 1 and 5. The most overrepresented words are
represented with an increased font size at the top of the list. As words co-
occur with a variety of other words, a word can occur in more than one
2
ALCESTE stands for ‘Analyse des Lexèmes Cooccurrents dans un Ensemble de Segments de
Texte’, which means analysis of co-occurring lexemes in a totality of text segments.
Fig. 5.3 DHC in the German press corpus on the financial crisis 2008
(Analysed with Iramuteq)
class (e.g. nicht).3 Whereas class 1 contains words that seem to refer to a
rather technocratic discourse describing the macroeconomic context of
the financial crisis, class 5 contains words which are above all names of
banks that were involved or affected by the financial crisis. Class 3 con-
tains words referring above all to effects in the real economy, and class 2
to the social market economy as part of a discussion on the crisis of the
predominant economic system criticised from a viewpoint of political
economy. Similarly, class 4 refers to Marx and his analysis of capitalism
which seems to refer in a particular way to various social contexts - Kinder
(children), Mann (man), Frau(en) (woman/women). Contrary to this
discussion of the political system on a more abstract level, class 6 contains
words referring to the politics on the national, supra-, and transnational
sphere. Without going into more detail, we can see the potential strength
3
For French and English corpora the software uses more sophisticated dictionaries which exclude
functional words for this type of analysis.
138 R. Scholz
of this method to reveal the different semantic worlds and parts of the
narrative elements prevailing in a corpus. In this sense it allows us to
make assumptions about how semantic units are put into context and
subsequently knowledge is constructed.
5.1.3 Semantic Clusters Based on Reciprocal Co-occurrences
Another method to analyse corpus data exhaustively is to cluster all recip-

rocal co-occurrences (Martinez 2011, 2012). Reciprocal co-occurrence is
when two terms within a determined space of the text (e.g. a sentence)
show the same level of ‘attraction’ for each other. Therefore, we measure
not only if the term A is overrepresented in sentences containing the term
B (simple co-occurrence) but also if the term B is overrepresented to the
same extent in all sentences containing the term A.4 For example, in a
given corpus we might find that all sentences stating the word house con-
tain the article the, and the content word insurance. However, if we look
into all sentences stating the, we would find that house is not used to the
same extent. This would be a case of a simple co-occurrence of house and
the. Contrary to this, if looking into all sentences stating the word insur-
ance we might find that house is used to a similar extent as insurance in
sentences containing house. As both words co-occur to the same extent in
the same delimited space we consider them as reciprocal co-occurrences.5
If different words co-occur to a similar extent, we infer that they are
part of the same semantic field. By applying this method to all words of
a given corpus we are able to identify groups of words belonging to the
same semantic field which can be represented as semantic networks. By
manipulating the parameters, we can increase the size of a network (num-
ber of words within one cluster) and decrease the total number of net-
works or decrease the size of the networks and increase their total number.
An advantage of this method is that depending on our research question
we can narrow down the textual data to a number of keywords that then
can be used to identify topic specific text sequences.
4
The term ‘co-occurrence’ refers to an instance of an above-chance frequency of occurrence of two
terms (probability distribution). Instances of a systematic co-occurrence taking into account word
order or syntactic relations would be referred to with the term ‘collocation’ in this terminology.
5
Mutual Information score addresses the same issue but with a different algorithm.
Fig. 5.4 The dominating semantic field in the German press corpus on the finan-
cial crisis 2008
Figure 5.4 represents the dominant semantic field in the German press
corpus on the financial crisis 2008. Reciprocal co-occurrences can be ana-
lysed with the software CooCs. We have chosen parameters that produce
a maximum of reciprocal co-occurrences in one network when at the same
time aiming for a readable representation. Figure 5.4 represents all tokens
for which the probability to co-occur in a paragraph containing the node
bank is very high. Forming a symmetric matrix of the lexis of a given cor-
pus, it was measured to what extent each type of the corpus is overrepre-
sented in a paragraph containing another type of this corpus—if for
instance the token Lehman was overrepresented in paragraphs containing
the token bank. To obtain a reciprocal co-occurrence bank is ought to be
overrepresented to a similar extent in paragraphs containing Lehman.
In Fig. 5.4 one can distinguish a number of different semantic fields
enabling us to map the discourse of the financial crisis in the German
press. In the centre of the visualisation, we find the token bank. One can
see that, for instance, words referring to German politics are intercon-
nected with the rest of the network via particular lexical elements: Merkel
is connected to tokens referring to the inter- and transnational political
140 R. Scholz
field such as Sarkozy, Barroso, EU, G[7], IWF, Weltbank (world bank)
and others. Steinbrück is linked to Hypo Real Estate and Finanzminister to
tokens referring to US American political actors such as Paulson, Bush,
Obama, Kongress (Congress). Based on this map we could go into more
depth by analysing in more detail the textual context in which these
dominant lexical elements are used in the corpus.
5.2 Typical Language Use
Once we have explored our data with exhaustive methods we usually start
deepening our analysis with additional methods. Keyword analysis
is helpful to find out more about the language use that is typical for a
particular speaker or a particular time period. This can be done with an
algorithm calculating which word tokens are overrepresented in a particu-
lar part of the corpus when compared to all the other parts (see ‘partition’
in Sect. 3. and 4.). Based on the total number of tokens in the corpus, the
total number of token in one part, and the frequency of each token in the
whole corpus the algorithm calculates the expected frequency of each
token in the part investigated. If the observed frequency in this part is
higher than the expected frequency then this token is overrepresented and
can be considered as belonging to the typical vocabulary of this part of the
corpus. In the software Lexico3, which we used to run most of the analy-
ses presented in this text, the algorithm is the same as the one which is
used for the calculation of the co-occurrences (see also Sect. 5.1.2).
5.2.1 Uncovering Discourse Dynamics
To find out about lexicosemantic dynamics in a discourse, we can calcu-

late the typical language use in subsequent time periods. Figure 5.5 illus-
trates the dynamics in the above-mentioned corpus on the financial crisis
2008. On the basis of the list of the vocabulary overrepresented in each
month, we have created groups of words referring to the same content
category. For instance, all words referring to a bank of issue have been
added to the group of words named accordingly. This category had
been created because such words had occurred recurrently in the list of
the overrepresented vocabulary for September 2008. In this way we have
55
45
35
25
15
-5
-15
Sep 08 Oct 08 Nov 08 Dec 08 Jan 09 Feb 09 Mar 09 Apr 09
-25
-35
-45
-55
financial crisis economic crisis countries banks banks of issue
ministry of finance German government global actors financial products government measures
Fig. 5.5 Over- and under-represented groups of words referring to discourse

participants and discourse objects (partition ‘month’)
created the categories: countries, banks, ministry of finance, German gov-

ernment, global actors, government measures, and financial products. finan-
cial crisis and economic crisis refer to the word forms as such and not to the
name of a group of words.
Figure 5.5 shows that in the first month of the research period words
referring to country and bank names as well as to banks of issue and min-
istries of finance are strongly overrepresented. In October 2008 we see
that the word financial crisis is predominantly used in the press. If we
interpret September and October together we can hypothesise that before
the term financial crisis can be used as a common reference in a discourse,
it first needs to be constructed with regard to an institutional (banks),
geographical (countries) and political (banks of issue, ministry of finance)
location. We also can observe that the polity level increases with the
duration of the crisis. Whereas in the beginning the ministries of finance
are represented in relation to the financial crisis, it is later the German
government and the global actors that are represented as being in charge of
142 R. Scholz
solving the crisis. Furthermore, throughout the duration of the crisis the
term financial crisis loses importance in favour of the term economic crisis.
Moreover interestingly, the government measures against the crisis—
overrepresented in January 2009—seem not to be discussed together
with the origin of the crisis, the financial products, which are overrepre-
sented in November, March, and April—always together with the global
actors—such as the International Monetary Fund (IMF).
Figure 5.6 summarises the interpretation of Fig. 5.5: The discourse
representing the financial crisis in the German press refers on the one
hand to the national political sphere and on the other hand to the inter-
national political sphere. Both spheres are structured differently in terms
of discourse participants and their represented actions. Whereas on the
international level we can find the international political actors deliberat-
ing about the origins and the actors responsible for the crisis, on the
national level we can observe political action against the effects but not
against the origins of the crisis. In this sense, crisis politics seems to be
divided into national political action against crisis effects, which are
Fig. 5.6 Summary of the macrostructure of the financial crisis press

corpus
treated separately from their origins on the international level where

political action is missing. Of course the value of such an analysis is first
and foremost heuristic and needs to be substantiated with other discourse
analytical methods. However, this example shows the potential that these
methods have for the exploration of large text corpora and the discovery
of discourse phenomena regarding their lexicosemantic macro-structures
that could not be uncovered systematically by simply reading and inter-
preting the texts of a corpus.
5.2.2 Reducing the Amount of Textual Data
In the last section of this chapter, we will illustrate briefly how we can use
quantitative methods of corpus analysis in order to reduce systematically
the amount of textual data that can subsequently be analysed with quali-
tative methods. Based on the above mentioned sub-corpus of press inter-
views, we have calculated the specific vocabulary for each interviewee.
With the help of the resulting lists of keywords we were able to identify
prototypical text sequences for each interviewee.
Fig. 5.7 Map of text sections of interviewees displaying prototypical sentences of

Angela Merkel’s interviews
144 R. Scholz
Figure 5.7 is a screenshot of the corpus map function in Lexico3. On

the left-hand side, we see the list of keywords from interviews of the
German Chancellor Angela Merkel when compared to the vocabulary of
all other interviewees. On the right-hand side, we see a map of the cor-
pus. Each box represents one sentence of the corpus. The map is divided
according to the partition interviewee (interloc). Most boxes are shaded in
blue. These boxes contain words that are overrepresented in Merkel’s
interviews (list in the left). The darker the shade, the more of these words
are contained in one sentence. We can see that two boxes in the Merkel
part are highlighted in dark blue. As these are the two sentences that
contain the highest number of keywords, we consider these sentences as
prototypical sentences for Merkel’s interviews.
In order to enable the reader to make more sense of the identified pro-
totypical sentences, I have added the enclosing sentences into the
sequence:
Merkel: There have been a number of years in which the fund has barely
had its classic role to play—supporting countries that have experienced
serious economic and financial difficulties. Therefore the savings program
was decided. However, if we now assign new tasks to the IMF to monitor
the stability of the financial markets, we must also equip it properly. […]
With our stimulus package aiming to stabilize the economy, we immedi-
ately provide effective support for investment and consumption. We are
building a bridge between businesses and citizens so that in 2009 the con-
sequences of the global crisis will be absorbed and the economy will rise
again in 2010.6 (Interview by Süddeutsche Zeitung, 14 November
2008)
The German original is kept in the footnote highlighting those words

in bold which are contained in the list of the overrepresented vocabulary.
6
Merkel: Es gab jetzt eine ganze Reihe von Jahren, in denen der Fonds seine klassische Rolle—die
Unterstützung von Ländern, die in ernste wirtschaftliche und finanzielle Schwierigkeiten geraten
sind—kaum noch ausüben musste. Deshalb wurde das Sparprogramm beschlossen. Wenn wir dem
IWF aber nun neue Aufgaben bei der Überwachung der Finanzmarktstabilität übertragen, müs-
sen wir ihn auch ordentlich ausstatten. Mit unserem Paket zur Stabilisierung der Konjunktur
geben wir dagegen sofort wirksame Hilfen für Investitionen und Konsum. […] Wir bauen damit
Unternehmen und Bürgern eine Brücke, damit 2009 die Folgen der weltweiten Krise aufgefan-
gen werden und es 2010 wieder aufwärts geht.
We could now start analysing this quotation with qualitative methods of

discourse research—just to name a few: the analysis of argumentation
topoi (Wengeler 2015; Žagar 2010), of cognitive frames (Scholz and
Ziem 2013, 2015; Ziem 2014) and of metaphors (Kuck and Scholz
2013) which come into play in Merkel’s rhetoric. With these rhetorical
features Merkel constructs a particular knowledge about the crisis and
her political position to it—all aimed at convincing the reader from the
rightfulness of the political steps that she and her political party proposes
against the crisis. We could also take a political science perspective and
pay more attention to the political level to which Merkel allocates the
political responsibility for the crisis. Contrary to Merkel whose crisis dis-
course has a strong international orientation (proposing, for instance, a
stronger role for the IMF), this international element seems to be missing
from the discourse of her political rival, the then Minister of Finance Peer
Steinbrück (SPD). His prototypical sentences refer to national
economics:
Steinbrück: The depth of the recession will not be known until afterwards.
[…]
Spiegel: If one sees with which efforts the recession is countered abroad, one
can get the impression that you are quite passive—or just stubborn.
Steinbrück: I am not stubborn, I obey economic reason.7 (Interview by
Spiegel, 1 December 2008)
Similar to Steinbrück the prototypical sentences of the conservative

politicians Horst Seehofer (CSU) and Ronald Pofalla (CDU) refer rather
to national politics. The importance of national politics has to be under-
stood in the context of the national parliament elections (Bundestagswahl)
which were held in September 2009. In the vocabulary of both p oliticians
the term Bundestagswahl is overrepresented next to tax reduction
(Steuersenkungen) in Seehofer’s interviews and strong references to other
parties of the Bundestag in Pofalla’s interviews.
Steinbrück: Wie tief die Rezession ausfällt, wird man erst hinterher genau wissen.
7
Spiegel: Wenn man sieht, wie man sich im Ausland gegen diese Rezession stemmt, dann muss
man den Eindruck bekommen, dass sie ziemlich passiv sind. Oder einfach nur stur.
Steinbrück: Ich bin nicht stur, ich gehorche der ökonomischen Vernunft.
146 R. Scholz
Contrary to the national framing of the crisis we find strong references

to the supranational level in the interviews with the then President of the
European Commission, José Manuel Barroso. However, he reemphasises
the political role of the nation-state and does not take the opportunity to
argue for the increased political influence of the European institutions:
SZ: Do we need a European economic government, as France’s President

Sarkozy calls for?
Barroso: After the meeting of the Heads of State and Government on 7th
November, we in Europe agree that we should better coordinate national
activities but not harmonize everything. If, for example, Poland decides an
economic program, this affects Germany and certainly also vice versa.8
(Interview by Süddeutsche Zeitung, 14 November 2008)
Even though these text sequences could be analysed in more detail,

we want to end this section by having a look again at Fig. 5.2, in which
we tried to find out about the dimensions dominating the discourse on
the financial crisis in the German press interviews. With the help of the
cited text sequences, our hypothesis about the meaning of the axes rep-
resented in Fig. 5.2 can be confirmed. Where the y-axis represents a
continuum between the international and the national political sphere
(Barroso versus Pofalla), the x-axis represents a continuum between
political and sociocultural aspects of the crisis—displaying the writer
and former Marxist Hans Magnus Enzensberger at the extreme right of
the x-axis in whose interview ethical questions concerning morality are
addressed:
Spiegel: Have the bankers failed morally?

Enzensberger: It goes a bit too far to hold especially bankers accountable for
morality. […]
8
SZ: Brauchen wir eine europäische Wirtschaftsregierung, wie sie Frankreichs Präsident Sarkozy
fordert?
Barroso: Nach dem Treffen der Staats—und Regierungschefs am 7. November sind wir uns in
Europa einig, dass wir nationale Aktivitäten besser koordinieren, aber nicht alles vereinheitlichen
müssen. Wenn etwa Polen ein Wirtschaftsprogramm beschließt, wirkt sich das auf Deutschland
aus und sicher auch umgekehrt.
Spiegel: The financial debacle caused a profound crisis of the so-called real
economy.
Enzensberger: It is incomprehensible to me why the whole world is so sur-
prised. This is a bit like in England. If it is snowing in winter, the English
are quite taken aback, because entire regions sink into the snow, as if winter
were not a periodically recurrent fact. Likewise every boom is followed by
a crash. This is of course very uncomfortable.9 (Interview by Spiegel,
3 November 2008)
In this section, we have given an example of how to reduce the amount

of textual data systematically according to the distribution of the over-
represented vocabulary of interviewees which we have used to identify
prototypical text sequences in their interviews. Other criteria might be
used to identify text sequences that are relevant for our research question.
For instance, elsewhere we have analysed textual sequences containing
the terms growth and prosperity in order to compare metanarratives used
in the petrol crisis 1973 and the financial crisis 2008 (Scholz 2016). In
another study we have used topic specific words identified with the
semantic cluster method (see Sect. 5.1.2) in order to locate topic specific
press articles within the financial crisis corpus dealing with questions of
social and regulatory policies (Kuck and Scholz 2013). The idea of this
section was to illustrate that a mixed-method approach to language use
can be a powerful tool, on the one hand, to guide the qualitative analysis
of discourses and, on the other hand, to explain quantitative research
results. In this sense a combined use of quantitative and qualitative meth-
ods can lead to a mutual fertilisation of research results.
9
Spiegel: Haben die Banker moralisch versagt?
Enzensberger: Es ist ein bisschen viel verlangt, dass ausgerechnet die Banker für die Moral
zuständig sein sollen. […]
Spiegel: Aus dem Finanzdebakel erwächst eine tiefgreifende Krise der sogenannten Realwirtschaft.
Enzensberger: Es ist mir unbegreiflich, weshalb die ganze Welt davon so überrascht ist. Das ist
ein bisschen wie in England. Wenn es dort im Winter schneit, dann sind die Engländer ganz
verblüfft, weil ganze Regionen im Schnee versinken, so, als wäre der Winter nicht ein periodisch
wiederkehrendes Faktum. Genauso folgt jedem Aufschwung ein Absturz. Das ist natürlich sehr
ungemütlich.
148 R. Scholz
6 Conclusion: Strengths and Limitations

In this chapter I have presented the lexicometric approach to discourse. I
have emphasised the heuristic strength of this approach in helping to find
out about the underlying lexicosemantic macro-structures in text corpora
which are not identifiable by simply reading texts. The various quantita-
tive methods allow the researcher to take a number of different perspec-
tives onto the research material guided by statistical algorithms. With the
help of these algorithms, the researcher is able to map a discourse accord-
ing to its lexicosemantic features, to develop hypotheses about the domi-
nant language use by different discourse participants and in different
time periods (discourse dynamics). Moreover, the researcher can use these
methods to discover new discourse elements that he or she might have
missed by limiting the research on qualitative methods. The particular
strength of the approach is that it allows a continuous movement back
and forward between quantitative and qualitative methods guiding the
interpretation process and enriching mutually research results gained
with different methods. The multitude of perspectives on the corpus
material arising from these different perspectives renders the lexicometric
approach a truly heuristic quantitative apparatus.
Of course an apparatus comes with all the shortcomings that rigid
methods can have in research. First, the results depend on the kind of
data that they are based on—changing the composition of a corpus, espe-
cially in small corpora—can change the results substantially. Therefore,
prudent questioning of what claims concerning a discourse can be made
based on the text corpus used is as important as experimenting with dif-
ferent corpus compositions in order to ensure the validity of the results.
The examples used in this text might only partly illustrate the potential
use of quantitative text processing methods for a classical sociological
investigation. However, these methods have been used previously to anal-
yse questionnaires or interviews (Leimdorfer and Salem 1995). For the
analysis of such data, especially the exhaustive methods (Sect. 5.1), can
help get an overview of its dominant content and structure. With regard
to a social constructionist approach in sociology, it should have become
clear how much the methodology presented here can help us understand
better the relationship between the social and textual context in language
use. This then can be used to investigate the construction of social catego-
ries such as race, gender, or class (Leimdorfer 2010). Even though the
lexicometric approach has not yet been used extensively in sociological
research, this chapter should help to integrate more quantitative research
on language use into the social sciences.
References
Bachelard, Gaston. 1962. La philosophie du non. Essai d’une philosophie du nouvel
esprit scientifique. Paris: PUF. Original edition, 1940.
Bécue-Bertaut, Mónica. 2014. Distributional equivalence and linguistics. In
Benzécri, Jean-Paul. 1963. Course de Linguistique Mathématique. Rennes:
Universitée de Rennes.
———. 1969. Statistical analysis as a tool to make patterns emerge from data.
In Methodologies of pattern recognition, ed. Satosi Watanabe, 35–74. New York:
Academic Press.
———. 1980. Pratique de l’analyse des données. Paris: Dunod.
———. 1982. Histoire et préhistoire de l’analyse des données. Paris: Dunod.
Bonnafous, Simone, and Maurice Tournier. 1995. Analyse du discours, lexico-
métrie, communication et politique. Mots – Les langages du politique 29
(117): 67–81.
Busse, Dietrich, and Wolfgang Teubert. 2014. Using corpora for historical
semantics. In The discourse studies reader. Main currents in theory and analysis,
ed. Johannes Angermuller, Dominique Mainguenau and Ruth Wodak,
340–349. Amsterdam: John Benjamins. Original edition, 1994.
Demonet, Michel, Annie Geffroy, Jean Gouazé, Pierre Lafon, Maurice
Mouillaud, and Maurice Tournier. 1975. Des tracts en mai 1968. Paris: Colin.
Diaz-Bone, Rainer. 2007. Die französische Epistemologie und ihre Revisionen.
Zur Rekonstruktion des methodologischen Standortes der Foucaultschen
Diskursanalyse. Forum Qualitative Sozialforschung/Forum: Qualitative Social
Research 8 (2): Art. 24.
Duchastel, Jules, and Victor Armony. 1993. Un protocole de description de
discours politiques. Actes des Secondes journées internationales d’analyse
statistique de données textuelles, Paris.
150 R. Scholz
———. 1995. La catégorisation socio-sémantique. Actes des Secondes journées

internationales d’analyse statistique de données textuelles, Rome.
Fiala, Pierre. 1994. L’interprétation en lexicométrie. Une approche quantitative
des données lexicales. Langue Française 103 (Sep.): 113–122.
Fiala, Pierre, Benoît Habert, Pierre Lafon, and Carmen Pineira. 1987. Des mots
aux syntagmes. Figement et variations dans la Résolution générale du congrès
de la CGT de 1978. Mots – Les Langages du Politiques 14 (1): 47–87.
Foucault, Michel. 2003. On the archaeology of the sciences: Response to the
epistemology circle. In The essential Foucault. Selections from the essential works
of Foucault 1954–1984, ed. Paul Rabinow and Nikolas Rose, 392–422.
New York: New Press. Original edition, 1968.
Glady, Marc, and François Leimdorfer. 2015. Usages de la lexicométrie et inter-
prétation sociologique. Bulletin de Me´thodologie Sociologique 127: 5–27.
Guiraud, Pierre. 1954. Les caractères statistiques du vocabulaire. Paris: PUF.
———. 1960. Problèmes et méthodes de la statistique linguistique. Paris: PUF.
Harris, Zellig S. 1952. Discourse analysis: A sample text. Language 28 (4):
474–494.
Heiden, Serge, and Pierre Lafon. 1998. Cooccurrences. La CFDT de 1973 à
1992. In Des mots en liberté, Mélanges Maurice Tournier, ed. Pierre Fiala and
Pierre Lafon, 65–83. Lyon: ÉNS Éditions.
Herdan, Gustav. 1964. Quantitative linguistics. London: Butterworths.
———. 1966. The advanced theory of language as choice and chance. Berlin:
Springer.
Husson, François, and Julie Josse. 2014. Multiple correspondence analysis. In
Kleining, Gerhard. 1994. Qualitativ-heuristische Sozialforschung. Schriften zur
Theorie und Praxis. Hamburg-Harvestehude: Fechner.
Kuck, Kristin, and Ronny Scholz. 2013. Quantitative und qualitative Methoden
der Diskursanalyse als Ansatz einer rekonstruktiven Weltpolitikforschung.
Zur Analyse eines internationalen Krisendiskurses in der deutschen Presse. In
Rekonstruktive Methoden der Weltpolitikforschung. Anwendungsbeispiele und
Entwicklungstendenzen, ed. Ulrich Franke and Ulrich Roos, 219–270. Baden-
Baden: Nomos.
Lafon, Pierre. 1984. Dépouillements et statistiques en lexicométrie. Paris:
Champion.
Lebart, Ludovic, and André Salem. 1988. Analyse statistique des données textu-
elles. Questions ouvertes et lexicométrie. Paris: Dunod.
Lebart, Ludovic, André Salem, and Lisette Berry. 1998. Exploring textual data.
Dordrecht: Kluwer.
Lebart, Ludovic, and Gilbert Saporta. 2014. Historical elements of correspon-
dence analysis and multiple correspondence analysis. In Visualization and
verbalisation of data, ed. Jörg Blasius and Michael Greenacre, 31–44. London
and New York: CRC.
Lee, David. 2001. Genres, registers, text types, domains, and styles: Clarifying
the concepts and navigating a path through the BNC jungle. Language
Learning & Technology 5 (3): 37–72.
Leimdorfer, François. 2010. Les sociologues et le langage. Paris: Maison des sci-
ences de l’homme.
Leimdorfer, François, and André Salem. 1995. Usages de la lexicométrie en
analyse de discours. Cahiers des Sciences Humaines 31 (1): 131–143.
Martinez, William. 2011. Vers une cartographie géo-lexicale. In Situ, 15.
Accessed July 1, 2018. http://journals.openedition.org/insitu/590.
———. 2012. Au-delà de la cooccurrence binaire… Poly-cooccurrences et
trames de cooccurrence. Corpus 11: 191–216.
Mayaffre, Damon. 2005. De la lexicométrie à la logométrie. L’Astrolabe. Accessed
July 1, 2018. https://hal.archives-ouvertes.fr/hal-00551921/document.
———. 2007. Analyses logométriques et rhétorique du discours. In Introduction
à la recherche en SIC, ed. Stéphane Olivesi, 153–180. Grenoble: Presses
Universitaires de Grenoble.
———. 2016. Quantitative linguistics and political history. In Quantitative lin-
guistics in France, ed. Jacqueline Léon and Sylvain Loiseau, 94–119.
Lüdenscheid: Ram Verlag.
Mayaffre, Damon, and Céline Poudat. 2013. Quantitative approaches to politi-
cal discourse. Corpus linguistics and text statistics. In Speaking of Europe.
Approaches to complexity in European political discourse, ed. Kjersti Fløttum,
65–83. Amsterdam: Benjamins.
Muller, Charles. 1967. Étude de statistique lexicale. Le vocabulaire du théâtre de
Pierre Corneille. Translated by Pierre Corneille. Paris: Larousse.
Pêcheux, Michel. 1982. Language, semantics and ideology (Language, Discourse,
Society Series). London: Macmillan.
Pêcheux, Michel, Claudine Haroche, Paul Henry, and Jean-Pierre Poitou. 1979.
Le rapport Mansholt: un cas d’ambiguïté idéologique. Technologies, Idéologies,
Pratiques 2: 1–83.
Reinert, Max. 1983. Une méthode de classification descendante hiérarchique.
Cahiers analyse des données VIII (2): 187–198.
152 R. Scholz
Roux, Maurice. 1985. Algorithmes de classification. Paris: Masson.

Salem, André. 1982. Analyse factorielle et lexicométrie. Mots – Les Langages du
Politiques 4 (1): 147–168.
———. 1987. Pratique des segments répétés. Essai de statistique textuelle. Paris:
Klincksieck.
Scholz, Ronny. 2016. Towards a post-material prosperity? An analysis of legiti-
mising narratives in German crisis discourses from 1973 and 2008. French
Journal for Media Research [online] 5 (Narratives of the Crisis/Récits de crise).
Accessed July 1, 2018. http://frenchjournalformediaresearch.com/index.
php?id=614.
Scholz, Ronny, and Johannes Angermuller. 2013. Au nom de Bologne ? Une
analyse comparative des discours politiques sur les réformes universitaires en
Allemagne et en France. Mots – Les Langages du Politiques 102: 22–36.
Scholz, Ronny, and Pierre Fiala. 2017. Politolinguistik in Frankreich. In
Handbuch Sprache und Politik, ed. Jörg Kilian, Thomas Niehr, and Martin
Wengeler, 1163–1199. Bremen: Hempen.
Scholz, Ronny, and Annika Mattissek. 2014. Zwischen Exzellenz und
Bildungsstreik. Lexikometrie als Methodik zur Ermittlung semantischer
Makrostrukturen des Hochschulreformdiskurses. In Diskursforschung. Ein
interdisziplinäres Handbuch. Band 2: Methoden und Analysepraxis. Perspektiven
auf Hochschulreformdiskurse, ed. Martin Nonhoff, Eva Herschinger, Johannes
Angermuller, Felicitas Macgilchrist, Martin Reisigl, Juliette Wedl, Daniel
Wrana, and Alexander Ziem, 86–112. Bielefeld: Transcript.
Scholz, Ronny, and Alexander Ziem. 2013. Lexikometrie meets FrameNet: das
Vokabular der ‘Arbeitsmarktkrise’ und der ‘Agenda 2010’ im Wandel. In
Sprachliche Konstruktionen von Krisen: Interdisziplinäre Perspektiven auf ein
fortwährend aktuelles Phänomen, ed. Martin Wengeler and Alexander Ziem,
155–185. Bremen: Hempen.
———. 2015. Das Vokabular im diskurshistorischen Vergleich: Skizze einer
korpuslinguistischen Untersuchungsheuristik. In Diskurs – interdisziplinär.
Zugänge, Gegenstände, Perspektiven, ed. Heidrun Kämper and Ingo Warnke,
281–313. Berlin and New York: De Gruyter.
Tognini-Bonelli, Elena. 2001. Corpus linguistics at work. Amsterdam: Benjamins.
Tournier, Maurice. 1975. Un vocabulaire ouvrier en 1848. Essai de lexicométrie.
Quatre volumes multicopiés. Saint-Cloud: École Normale Supérieure.
———. 1993. Lexicometria – Séminaire de lexicométrie. Lisbonne: Universidade
Aberta. Original edition, 1988.
Wengeler, Martin. 2015. Patterns of argumentation and the heterogeneity of
social knowledge. Journal of Language and Politics 14 (5): 689–711.
Yule, George Udny. 1944. The statistical study of literary vocabulary. Cambridge:
Žagar, Igor Ž. 2010. Topoi in critical discourse analysis. Lodz Papers in Pragmatics
6 (1): 3–27.
Ziem, Alexander. 2014. Frames of understanding in text and discourse. Theoretical
foundations and descriptive applications. Amsterdam: Benjamins.
Zipf, George K. 1929. Relative frequency as a determinant of phonetic change.
Harvard Studies in Classical Philology 40: 1–95.
———. 1935. The psycho-biology of language. An introduction to dynamic
philology. Boston: Mifflin.
6
Words and Facts: Textual Analysis—
Topic-Centred Methods for Social
Scientists
Karl M. van Meter
1 Introduction
In arguing for systematic textual analysis as a part of discourse analysis,
Norman Fairclough stated that ‘[t]he nature of texts and textual analysis
should surely be one significant cluster of issues of common concern’
within discourse analysis (Fairclough 1992, 196). As a contribution dis-
course analysis and a reinforcement of the close association between dis-
course analysis and textual analysis, I will here deal with texts from several
different origins and over a time span extending from the 1980s to now.
I will try to explain and show how complex statistical methods such as
factorial correspondence analysis (see the TriDeux software of Cibois
2016), both descending hierarchical classification analyses (see the Alceste
software (Image 2016) and the Topics software (Jenny 1997)) and ascend-
ing hierarchical classification analyses (such as Leximappe-Lexinet
(Callon et al. 1991) and Calliope (De Saint Léger 1997)), can be used to
K. M. van Meter (*)

Centre Maurice Halbwachs, École Normale Supérieure—Paris, Paris, France
e-mail: karl.vanmeter@ens.fr

156 K. M. van Meter
produce detailed representations of lexico-semantic structures in large

text collections or corpora. As you can see from the dates, these methods
and their applications to textual analysis came well before the new term
‘big data’ became popular among scientists and the general public.
Interestingly enough, the development of these methods and their ini-
tial applications in social science research took place primarily in Paris
and were centred around such figures as Jean-Paul Benzécri and Pierre
Bourdieu, and their work in the 1970s and 1980s. Indeed, most of the
authors cited here and I have worked together in textual analysis and have
been associated with Benzécri, Bourdieu, and Callon at the Université
Pairs VI (now Université Pierre et Marie Curie), the Maison des Sciences
de l’Homme, and the École des Mines de Paris.
Here we present three studies that we hope will help today’s social sci-
entists in understanding which concepts and topics dominate a particular
discourse in society and the relationship between these same diverse top-
ics both diachronically and synchronically. Furthermore, we will show
how semantic and thematic shifts in a given society can be traced over
time and which future developments might be more or less probable.
These methods—used individually or combined in what is now called
multi-method analysis (Van Meter 2003)—can produce unintended and
sometimes amazing results (Glady and Leimdorfer 2015).
I’ll look at three different cases:
1. In the first case, I will look at how statistical analysis of texts from
1989 produces a geographical map of the then Soviet Union and how
those texts and this map help to define the political and economic
structures of current-day Russia and its ‘near abroad’.
2. In the second case, I’ll take a synchronic perspective on American,
French, and German sociologies by looking at abstracts submitted to
the annual conferences of each national association of sociology, and
also a diachronic perspective in the specific case of French sociology.
The first analysis shows how the discursive field of sociology in each
country is structured in terms of topics dealt with, of relationships or
ties between these topics, and of the history and culture of each coun-
try. In a second analysis, I will provide a diachronic perspective by
Words and Facts: Textual Analysis—Topic-Centred Methods… 157
looking at the thematic dynamics over time in French academic

discourse on sociology from 2004 to 2009 (Van Meter 2009). Such an
analysis can trace the evolution of a topic from past to present. It even
allows possible projections into the future.
3. With the third case, I try to contribute to the discussion of what poet,
academic, and diplomat Pete Dale Scott calls ‘Parapolitics’, or empha-
sizing the real use of power. In his own words:
The investigation of parapolitics, which I defined (with the CIA in mind)

as a “system or practice of politics in which accountability is consciously
diminished.” …I still see value in this definition and mode of analysis. But
parapolitics as thus defined is itself too narrowly conscious and intentional
… it describes at best only an intervening layer of the irrationality under
our political culture’s rational surface. Thus I now refer to parapolitics as
only one manifestation of deep politics, all those political practices and
arrangements, deliberate or not, which are usually repressed rather than
acknowledged. (Scott 1993, 6–7)
‘Parapolitics’ or deep politics are revealed by looking at the largest pos-

sible portion of the international press, by analysing how topics emerge
in these texts in certain socio-historical and institutional contexts, and by
how them are then reproduced in texts over time and in other contexts.
In particular, I shall be looking at semantic structures and relationships
referring to international conflicts. For instance, in 2009 Obama prom-
ised to close Guantanamo, which is still open today, and our ‘parapoliti-
cal’ analysis of the international press shows that the closure of
Guantanamo was only an Obama Administration priority during 2009
and not afterward. And during the same period of time in the interna-
tional press, the dominating image of the then new US president was
overshadowed by the representation of the previous president, George
W. Bush, and his ‘legacy’—even today the term ‘Bush’ remains closely
associated with ‘torture’ and ‘Iraq’ in the international media. In a dia-
chronic perspective, such a study can explain how certain forces are put
into motion, how they persist over time and how they reach into the
future.
158 K. M. van Meter
2 extual Geography of the Last Soviet

T
Central Committee and Russia Today
Following the development of a science, of the adoption of an invention,
of the spread of an epidemic, of the policies of a political party, of the
evolution of an international conflict, all this can be done—and has been
done—through the study of the texts, the words, and their associations
that these processes have left behind. And often with such massive data,
‘following’ such developments is so solidly established that such analysis
can open a window on the future: in the international media, as men-
tioned above, the words ‘legacy’ and ‘George Bush’ have continued for
several years now to be closely associated with ‘torture’, ‘Guantanamo’
and ‘Iraq’, and will in all probability continue to be in the future. The
geographical words in 1989 in the official biographies of the last Soviet
Encyclopaedia drew a nice map of what is today Russia and its ‘near
abroad” neighbours. Keywords in American sociology of religion—
although usually not geographical—remain very different from those of
German and French sociologies of religion and that will continue long
into the future. The analysis of keywords and geographical names can
also show that important scientific research ‘moves around’ geographi-
cally as the names of the authors and their institutional affiliations change
(Maisonobe 2015).
In the case of the Soviet Union/Russia, together with Philippe Cibois,
Lise Mounier and Jacques Jenny (Van Meter et al. 1989), we used the
online computer data base, SOVT on the server GECAM in Paris, which
at that time provided detailed official biographies of leading figures of the
Soviet Union. Since 1982 and the death of Secretary-General Leonid
Brezhnev, there were three successive secretary-generals, the last being
Mikhail Gorbachev. We examined the biographies of all members of the
Central Committee of the Communist Party of the Soviet Union during
the period 1981–1987. In that data, there were 503 individuals, includ-
ing 18 women, and a total of 7844 dated sequences for any average of
15.6 dated sequences per individual. The data included names, demo-
graphic information, dates, and descriptive texts. We began our study
with the analysis of the corpus of all words used to describe the individuals
of our population, but quickly settled on using only 100 geographical
terms, and even then, we used only the 20 most frequent names as active
variables with the 80 other geographical names being used only as non-
active or passive elements that did not enter into the calculations. We used
the TriDeux factorial correspondence analysis program (Cibois 1983,
1985, 2016) inspired by Benzécri, which is a hierarchically descending
method that successively ‘cuts’ the set of data points along the most statis-
tically significant axes or dimensions, thus producing the following two-
dimensional diagram based on the two most pertinent factors (Fig. 6.1).
Fig. 6.1 Factorial correspondence analysis of geographical terms in the official

biographies of members of the last Central Committees of the USSR
160 K. M. van Meter
Here we see that Stavropol (STAV), Latvia (LETT), and Moscow

(MOSC) are very largely responsible for the positive side of horizontal
factor or axis 1. Ukraine (UKRA), Dnepropetrovsk (DNEP), Donetsk
(DONE), Kiev (KIEV), and Kharkov (KHAR) are largely responsible for
the negative side of axis 1. This is of course statistically the most impor-
tant factor or axis of the entire population. The second most important
factor or axis opposes, on the positive side, Tashkent (TACH), Azerbaijan
(AZER) and Gorky (GORK, now Nizhny Novgorod) with, on the nega-
tive side, Latvia (LETT).
In 1989, with this simple textual analysis of geographical terms, we
stated:
Summarily interpreted, this analysis reveals that there is a very tight and
coherent network centred around Dnepropetrovsk and the Ukraine [and
other geographical names that figure prominently in Brezhnevian allies’
biographies], which corresponds with the political power base of Leonid
Brezhnev. The most significant cleavage in the entire population is between
this Brezhnevian group and the rest of the population. This ‘rest of the
population’ is in turn multi-centred with Stavropol (Mikhail Gorbachev’s
political power base), Latvia (an economically dynamic Baltic republic)
and Moscow (the centre of the state apparatus), distributing the remainder
of the names in an arc around themselves. The second most important
cleavage (the second axis) seems to be related to economic development
with Latvia at one extreme and Uzbekistan and Azerbaijan [in Central
Asia] at the other. This is, however, a tentative hypothesis that must be
examined further.
Indeed, this hypothesis ‘examined further’ is confirmed with Latvia

contributing to NATO operations as of 1994 before becoming a member
in 29 March 2004, and Ukraine becoming an independent country and
rival to Russia, before losing its eastern portion to Russia recently. And
soon after the publication of the above cited article, we learned that the
last Central Committee of Uzbekistan died before a firing squad, exe-
cuted for corruption and incompetence for storing outdoors the entire
Uzbek cotton crop that had rotted away while being sold and resold on
the international market, thus confirming the ‘cleavage’ due to ‘economic
development’ mentioned above. Have the situation and the cleavages
between Russia and Ukraine (the first axis of the diagram) and between
the Baltic republic and the Central Asian republics (the second axis)
changed that much since our textual analysis produced the above
diagram?
3 Ascending and Descending
Methodologies
We referred to factorial correspondence analysis as a hierarchical descend-
ing method by which we mean that you start with all the data together in
the form an n-dimensional cloud of data points, where n corresponds to
the number of variables being used in the description of each data point.
In the case of the Soviet biographies, it was 20 geographical terms or
variables. The cloud is then cut successively—thus in a ‘descending’ man-
ner—according to specific statistical criteria. In the case of the Soviet
biographies, the successive ‘cuts’ where on the basis of statistical contribu-
tions to the first factor or axis (Ukraine/Rest of the USSR), then to the
second axis (Baltic republics/Asian republics or economically developed/
economically underdeveloped), and so on to the third, fourth and other
axes with less and less statistical significance. This distinction between
ascending and descending methods is even clearer when it concerns clas-
sifications (Van Meter 1990, 2003). One either builds classes or ‘clusters’
by putting together data points with similar characteristics or values for
variables, or by ‘cutting’ the entire cloud of data points to successively
form the most homogeneous classes possible. Alceste mentioned above,
and first developed by Max Reinert (1987), is such a descending method
specifically intended for the analysis of corpora of texts and has been used
extensively in the analysis of very different sorts of textual data and con-
tinues to be extensively used today (Reinert 2003; Image 2016).
In the other direction, hierarchically ascending classifications try to
put in the same class or ‘cluster’ elements that have the most similar char-
acteristics (Van Meter et al. 1987, 1992). When the data are units of
texts, one of the more useful manners of constructing classes is to put
together words or terms that appear the most often together in the units
162 K. M. van Meter
of text in the data or corpus. We usually refer to this method as classifica-

tion by co-occurrence of keywords. Of course, one can combine these
different types of analysis in a multi-method approach. An example of
the use of a hierarchically descending factorial method (using the TriDeux
software), a hierarchically descending classification method (using the
Alceste software), and the hierarchically ascending classification method
(using the Calliope software), all applied to the same data, can be found
in Van Meter (2008).
Classification by co-occurrence of keywords has proven to be a power-
ful textual analysis tools in several disciplines. One example is the
Leximappe method first developed at the Paris Ecole des Mines by Michel
Callon et al. (1991) as a DOS program with a graphic extension called
Lexinet. Mathilde de Saint Léger (1997) developed a new Windows-
based version, Calliope, based on the Leximappe-Lexinet DOS package.
We first used the latter system in 1992 in an analysis of the initial ten
years of scientific articles published on sociological AIDS research (Van
Meter and Turner 1992), describing the S-curve typically found in track-
ing the development of scientific publications in any new specialty. And
again in 1995, Leximappe-Lexinet was used to analyse the same data in
detail for thematic clusters and to reveal the role of national policies in
sociological AIDS research in the US, Great Britain, and France (Van
Meter et al. 1995). Then in 1997, we were able to show the role and tra-
jectory of individual authors, and the evolution of publishing policy of
individual scientific journals (Van Meter and Turner 1997), thus showing
that this textual analysis tool used several time on the same data base
could produce new and often unanticipated results each time.
4 F rench, German, and American

Sociologies as Seen Through Congress
Presentations
In a comparative study on research agendas in French, German, and
American sociologies, we analysed corpora comprising all abstracts of the
national sociology association congresses in each of the three countries.
For France, we analysed abstracts from the congresses of the French
Sociological Association or AFS (Association française de sociologie) follow-

ing its created in 2002. The association 2002 replaced the moribund
Société française de sociologie, a scholarly society created by Émile
Durkheim himself. At the first 2004 AFS congress, along with several
other researchers, we decided to analyse all the abstracts using different
methods of textual analysis. This produced a thematic issue of the BMS
(Bulletin of Methodological Sociology/Bulletin de Méthodologie Sociologique),
the number 85 issue in January 2005. In our contribution to that the-
matic issue (De Saint Léger and van Meter 2005), we used the new
Windows version of Leximappe-Lexinet, Calliope, for the first time. That
issue of the BMS also became a book in French, Analyses textuelles en
sociologie—Logiciels, méthodes, usages (Textual Analyses in Sociology—
Computer Programs, Methods, Practices; Demazière et al. 2006).
What did we find? Nothing simpler than showing you the Strategic
Diagram of AFS 2004.
Some explanation would be helpful in understanding the meaning of
the strategic diagram that is formed by a first horizontal axis of ‘centrality’
and a second vertical axis of ‘density’. Here, the classes or clusters are
constructed using co-occurrence of keywords. Thus, keywords that appear
often together in the different units of text—here, in abstracts of scien-
tific presentations made at a conference—will have a high index of simi-
larity and will be placed together in the same class. The classes thus
constructed are then used as the basis for the calculation of ‘in-ties’ (ties
between keywords within the same class) and ‘out-ties’ (ties between key-
words outside of their class and with keywords in other classes). The ‘in-
ties’ are used to calculate an axis of ‘density’ (the vertical or second axis)
and the ‘out-ties’ to calculate an axis of ‘centrality’ (the horizontal or first
axis). With density as a vertical axis, and centrality as a horizontal axis, a
two-dimensional strategic diagram is constructed in such a manner that
it is possible to place the different classes according their mean density
and mean centrality. The strategic diagram, by construction, therefore
has a dense and central upper right first quadrant, which we call the
‘mainstream’. The dense but not central upper left second quadrant is
called ‘ivory tower’. The not dense and not central lower left third quad-
rant is sometimes called ‘chaos’ or simply ‘unstructured’. The central but
not dense lower right fourth quadrant is called ‘bandwagon’ (Callon et al.
164 K. M. van Meter
1991). This means that ‘mainstream’ classes (in the first quadrant) will
have relatively numerous ‘in-ties’ (co-occurrences of the class’ keywords
together in the text units) and relatively numerous ‘out-ties’ (co-
occurrences in text units of the class’ keywords with other keywords from
other classes). ‘Bandwagon’ classes have keywords with numerous ‘out-
ties’ but relatively fewer ‘in-ties’. And, of course, ‘ivory tower’ classes have
numerous ‘in-ties’ but relatively few ‘out-ties’.
In Fig. 6.2, one can clearly see the dominant role of the term ‘femme’
(and thus sociology of women), and in Fig. 6.3, you can see the internal
structure of the class ‘femme’ and its constituent keywords (including
‘travail’) and the cluster’s ‘in-ties’. The statistical weight of the term
‘femme’ was so dominant that we decided to see what the data base would
look like without ‘femme’. So we removed that term and redid the analy-
sis, producing a new strategic diagram (Fig. 6.4). The astonishing similar-
ity between Figs. 6.2 and 6.4 leads to the conclusion that the terms
Fig. 6.2 French Sociological Association (AFS) congress 2004 strategic diagram of

all abstracts (femme)
‘femme’ and ‘travail’ are interchangeable in the structure of the 2004 AFS
corpus and, by implication, the sociology of women and the sociology of
work (a classic and historically dominant theme of French sociology)
have become contemporary equivalents (De Saint Léger and van Meter
2005). This result was then confirmed by the analysis of the 2006 AFS
congress and the 2009 congress (De Saint Léger and van Meter 2009).
In the case of the 2009 congress, there was a declared central theme:
‘violence in society’. Therefore, in the 2009 Strategic Diagram, the domi-
nant position of ‘femme’/‘travail’ had been taken over by the term
‘Violence’. But by employing the same technique of deleting the domi-
nant term and re-analysing the data, we produced a new 2009 Strategic
Diagram that was indeed very similar to the 2004 and 2006 results, and
Fig. 6.3 French Sociological Association (AFS) congress 2004 ‘femme’ (woman)
cluster with its keywords, including ‘travail’ (work)
166
K. M. van Meter
Fig. 6.4 French Sociological Association (AFS) congress 2004 strategic diagram of all abstracts (without femme)
thus confirming the most or less established structure of contemporary

French sociology.
A similar Calliope analysis was done of the 2004 congress of the
Deutsche Gesellschaft für Soziologie (DGS), or German Sociological
Association, held in Munich. The entire set of 401 conference abstracts
was available after the congress free of charge on the DGS Web site and
the results permitted a direct comparison of the structure of contempo-
rary French and Germany sociologies and stressed their similarity (Van
Meter and de Saint Léger 2009a). This was followed by the Calliope
analysis of 2003, 2004, 2005, and 2006 American Sociological Association
(ASA) conference abstracts of more than 8000 presentations (Van Meter
and de Saint Léger 2014). The comparison of the ASA results with the
AFS and DGS results was particularly surprising concerning the sociol-
ogy of religion and the relationship between family, education, and work,
which is quite distinct in each of the three cases and clearly related to the
history and culture of these three nations.
5 L ongitudinal Analysis of World Media

on Conflicts—Parapolitics 2006–2012
There has been a great deal of exchange over the past few years between
discourse analysis, sociology, and political science. This has even resulted
in joint sessions in international conferences and joint publications in
which the BMS has played a certain role (Marchand 2005, 2007a).
Indeed, Marchand’s work in political science resulted in the publication
of a book that cites his contribution to discourse analysis, sociological
methodology and to the BMS, in particular (Marchand 2007b). But it is
Michel Pinault’s work on the politics and history of French scientific
research (Pinault 2006) that cited not only sociological methodology
developments and the BMS but used some the same methods we have
presented here. In this particular area of politics, Pinault utilized Calliope
to analyse the texts and discourses of the major social and political actors,
and then applied social network analysis to examine the roles and rela-
tionships between those actors in the development of a coherent national
168 K. M. van Meter
policy for French scientific research, another surprising and well-done

multi-method analysis of a complex subject.
If political science studies politics and the use of political power as it
presents itself on the public scene, ‘parapolitics’ is the study of the use of
political power that is not presented on the public scene; in short, ‘the real
use of political power’, as we have described it above. The term seems to
have been coined by Prof. Scott at the University of California, Berkeley,
in the late 1970s or early 1980s. We worked with Scott and with one of his
students, Jonathan Marshall, who founded the journal Parapolitics USA,
in the 1980s. Other colleagues of Scott founded the French ‘Association
for the Right to Information’ (Association pour le Droit à l’Information—
ADI) and the journal, Parapolitics, during the same period. The ADI con-
tinues to publish books and reviews, although the journal Parapolitics has
evolved and changed its title to become—since 1995—the current Internet
journal, Intelligence, Secrecy & Power (ISP), which is currently published
twice a month (https://intelligenceadi.wordpress.com/ and http://groups.
yahoo.com/group/intelligence-secrecy-power-adi/). ADI and ISP infor-
mation has served previously in the social network analysis (or ‘traffic
analysis’) of the long-term cat-and- mouse/cops-and-robbers dynamic
between law enforcement and suspected offenders (Van Meter 2001).
Many different centres of research and information on political devel-
opments, particularly on the international level, produce annual reports,
yearbooks or ‘tops of the year.’ Some of the better-known examples
include the ‘United Nations’ Yearbook’, the Stockholm International
Peace Research Institute’s (SIPRI) ‘SIPRI Yearbook’, the CIA’s annual
‘World Factbook’, the Council of Europe’s ‘European Yearbook’. Other
examples include the International Institute for Strategic Studies’
‘Strategic Survey’, the Institut des Relations Internationales et Stratégiques
annual ‘L’Année stratégique’, the Swedish National Defence College’s
‘Strategic Yearbook’, the ‘Statesman’s Yearbook’, by Barry Turner, and the
‘World Political Almanac’, by Chris Cook and Whitney Walker. In most
cases, these works are compilations of a selected number of reports made
during a given year, but they are often accompanied by analyses or opin-
ions attributed to experts in the domain concerned, be it international
politics, security, economics, armaments, and so on. The events they tend
to emphasis are what we often call ‘yearbook’ events.
Methodologically speaking, these works usually suffer from a double

weakness: the limited sampling basis for the information or data used in
the analysis and the reliance upon an ‘expert’ or ‘experts’ to analyse and
interpret the data in a non-formalized manner, and often without contra-
dictory evaluation or critique. The consequences of this double weakness
in international politics can be seen with the Bush White House’s deci-
sion to invade Iraq because of the supposed existence of weapons of mass
destruction.
While it is extremely difficult—if not impossible—to develop either
an exhaustive sample or even a statistically representative sample of publi-
cally available reports on international political developments for a given
year, there is a tremendous amount of information available in this area
as the above-mentioned reports show. Indeed, the problem has more to
do with handling and managing the vast quantity of information, and
selecting the pertinent and representative information for further analy-
sis. Informal experiments have been run in this domain to show how a
report by a major wire service—Associated Press, Reuters, Agence France
Presse, etc.—is often ‘rewritten’ into multiple, even hundreds, of media
reports which furnish little or no further information than that contained
in the original wire service report. The difference often resides in the
political, cultural, and ideological choices of words to communicate and
to interpret the same initial information. This choice of words, this lan-
guage, does furnish further information, but at another level: that of the
analysis of the ‘rewriting’ and those that carry it out.
One of the ADI’s objectives is to document international politics and
parapolitics through the systematic wide sampling of the media. For
several years, the ADI has published in each issue daily entries of media
report headlines or titles, along with an indication of the source and
eventually additional text from the original report. For example, the
extract below is the first six entries of the 43 ADI entries for 31 August
2016:
• Death of senior leader al-Adnani caps bad month for ISIS/CNN. The
death of one of ISIS’ most prominent figures, Abu Mohammad al-
Adnani, is one more example of the pressure the group is under in
both Iraq and Syria.
170 K. M. van Meter
• Trump to meet in Mexico with the country’s president/WP. Donald Trump

will travel to Mexico City on Wednesday for a meeting with Mexican
President Enrique Peña Nieto, just hours before he delivers a high-
stakes speech in Arizona to clarify his views on immigration policy.
• Monsoon Rains, Terrorism Ad Lib Snag Kerry in South Asia/NYT. US
Secretary of State John Kerry’s visit to South Asia this week has been
filled with serious diplomatic meetings and agreements to boost ties
with India and Bangladesh but his motorcade’s struggle with monsoon
downpours and an …
• North Korea Has Executed a Deputy Premier, Seoul Reports/NYT. Kim
Jong-un, the North Korean leader, in a picture issued by his govern-
ment’s Korean Central News Agency.
• Vote on Whether to Remove President Nears in Brazil’s Senate/NYT.
Senators debated the fate of Brazilian President Dilma Rousseff into
the wee hours of Wednesday, then planned a short break before casting
votes that will decide whether to remove her permanently as leader of
Latin America’s most …
• At least nine dead as Typhoon Lionrock pummels Japan/CNN. Nine bod-
ies were found in a home for the elderly in the town of Iwaizumi in
Iwate Prefecture, which suffered flooding following Typhoon Lionrock,
police tell CNN.
These six reports come from a variety of sources: CNN, WP (Washington

Post) and NYT (New York Times). In 2008, we used the entire set of 2006
ADI entries as a data set to be analysed with Calliope, but the size of the text
corpus encouraged us to divide 2006 into three successive periods of four
months and produce a strategic diagram for each period. Figure 6.5 is the
strategic diagram for the first four months of 2006. Clearly, ‘Iran’ is the domi-
nant term, but it was replaced by ‘Iraq’ in the two following 2006 Strategic
Diagrams, as would be expected with the massive dominance of the Iraq war
in 2006 political and parapolitical developments. The presence of ‘Iran’ in the
‘mainstream’ first quadrant in the first (winter to spring 2006) four-month
section shows that the Bush White House intention to attack Iran and Iran’s
nuclear program played a major political and parapolitical role during the
first part of 2006 before declining significantly towards the end of the year,
but this topic did come back as a major theme in middle and late 2007.
Fig. 6.5 Strategic diagram of the first four months of the 2006 Association for
the Right to Information corpus
In the first two 2006 Strategic Diagrams, we previously noted the

important position of ‘eavesdropping’ (first diagram) and ‘NSA’ (second
diagram) in exactly the same position. The first diagram also includes
‘NSA spy’ and ‘surveillance’ clusters near the origin, while the second
diagram has a ‘spy’ cluster near the origin. These clusters are related to the
Bush White House program of warrantless eavesdropping on American
citizens by the NSA. However, the third diagram included only a ‘surveil-
lance’ cluster in the unstructured third quadrant. Nonetheless, the calcu-
lated attractive power of the term ‘spy’ (see van Meter and de Saint Léger
2014 for a description of keyword ‘attractive power’) was fairly constant
over the three periods of 2006. With the Iraq war as the overall frame-
work, ISP stated, in late February: ‘Although the Bush White House is
plagued with numerous scandals and crises, the three current major ones
are Iran and its nuclear program, the ‘domestic’ NSA spying scandal, and
172 K. M. van Meter
the series of reports on US torture/abuse and calls for the closure of

Guantanamo.’ This corresponds with the position of ‘Iran’ and ‘eaves-
dropping’ in the first diagram, and the continued permanence of the
latter throughout the year when ‘Iran’ and ‘torture’ declined
importance.
There were three notable 2006 ‘yearbook’ events: another invasion of
Lebanon by Israel (July), a North Korean nuclear weapon test (October),
and a US Democrat Congressional election victory (November). These
events are duly situated by the analysis as the respective clusters ‘Israel’
and ‘Lebanon’ (second diagram, 2006-2), ‘North Korea’ (second dia-
gram), and ‘Congress’ (third diagram, 2006-3), but only as passing major
events. ‘Israel’, ‘Lebanon’, and ‘Gaza’ figure only in the second diagram
and not in the ‘mainstream’ first quadrant and nowhere else during the
year. North Korea figures in the ‘mainstream’ first quadrant, but only in
the second diagram and nowhere else. ‘Congress’ figures only in the third
diagram and in the ‘ivory tower’ second quadrant. However, preparations
for those important elections were widely commented on during the pre-
ceding second period and account in large part for the presence of the
clusters ‘GOP’ and ‘Democrat’ in the ‘ivory tower’ second quadrant and
‘election’ in the ‘mainstream’ first quadrant. But now, in 2016, and with
this formal textual analysis of the 2006 media, the very major changes
these three ‘yearbook’ events supposedly announced at the time—‘the
Israeli invasion will change the face of the Middle East’, ‘a North Korea
atomic weapons will change world politics,’ ‘a Democrat Congress will
change everything in the US’—and the choice of these three events as the
three most notable 2006 ‘yearbook’ events seem rather shallow and with-
out fundamental importance.
It is only by following keywords or clusters backward and forward in
time that coherent major political developments can be identified, unless
of course it is a question of a clearly monumental political development.
The case of ‘Hamas’ in the 2006 third period is a good example of both
how a political development can be identified and followed with this
analysis, and how the international media can ‘skip’ or bias the coverage
of a major political development.
According to our analysis of 2006, there are, however, two major
developments that this data rather clearly designate: in early 2006, the
international press usually talks about ‘Hamas terrorists’ but, by the end
of the year, this has become ‘Hamas militants’; China starts the year as a
major player on the international scene but finishes 2006 as one of the
most insignificant international actors (Van Meter and de Saint Léger
2008, 2009b). The case of ‘Hamas’ is quite interesting in its implications
for text analysis of political developments. If a ‘Hamas’ cluster appears in
the third diagram (2006-3), it is because the international media concen-
trated on the killing of an elected Hamas government official by the
Israelis in November 2006. But the truly important political develop-
ment associated with Hamas occurred in January 2006 when Hamas won
the internationally recognized democratic elections in Gaza and the
Occupied Territories, thus causing a major change in perspective that
should have been widely commented in the press, but was not. Again, in
July, Israel detained a large number of elected Hamas government offi-
cials, against the wishes of the larger international community. But,
again, Hamas does not appear in the second diagram (2006-2), and the
international media gave only passing attention to this imprisonment of
elected Palestinian officials. It was only in November with the Israeli kill-
ing of one such official that the international media finally seemed to pay
attention to what was happening. Indeed, since the publication of our
study, several major news agencies have confirmed that at a very senior
level there was an editorial decision to no longer label all Hamas mem-
bers as ‘terrorists’ and instead use the term ‘militant’ or ‘activist’, or in the
case of this killing, ‘Hamas official’. This major change in 2006 is still
with us today and could only be identified by moving back and forth in
time over the results of these textual analyses.
There have been many other similar instances of such ‘uneven’ or
clearly biased coverage of particularly sensitive topics or events that can
be discerned by following the formation of certain clusters back in time
and also forward in time. Inversely, if a topic or cluster cannot be fol-
lowed over time, it is very likely a transient event or not a coherent topic.
The supposedly major ‘yearbook’ event that was the election of a 2006
Democrat majority in Congress hardly resulted in any memorable politi-
cal developments. Few people other than Middle Eastern specialists
remember the 2006 Israeli invasion of Lebanon or how many times Israel
has invaded its neighbours. And North Korea has its nuclear weapon, but
174 K. M. van Meter
has not used it until today and international politics has been thoroughly
preoccupied by other issues since then.
But let us look at another less than evident 2006 development that is
nonetheless fundamental and whose consequences are still with us today.
During the first four-month period of 2006, the keyword ‘China’ was the
keyword with the highest statistical attractive power in the construction
of the clusters and axes (see Van Meter and de Saint Léger 2009b for a
description of this index). By looking at the texts of the first period that
included ‘China’, one finds that they often involved international nego-
tiations trying to keep the Bush White House for invading ‘Iran’, which
was the dominant term of that first period. China, Russia, and Europe
were involved in those negotiations to try to keep Bush and Cheney from
starting Word War III. Four months later, Bush and Cheney no longer
wanted to invade Iran, but were hinting that an invasion of North Korea
would stop the development of an atomic arm in that country. Again
there were international negotiations, and this time China was playing
the leading role, which largely explains how the attractive power of
‘China’ increased to a maximum during the 2006-2 period. But on 9
October 2006, North Korea detonated a nuclear device and that was the
end of negotiations and the attractive power of the keyword ‘China’ fell
precipitously as China disappeared from the international media as a
major player on the world stage (see Fig. 6.6, ‘2006 keywords’ attractive
power over the three four-month periods). Little wonder that China soon
became far more aggressive on the world scene and is currently browbeat-
ing the entire world concerning its offensive in the South China Sea.
Since the publication of the 2006 results, we have also looked at the
last two years of the Bush White House—2007–2008 (Van Meter and de
Saint Léger 2011)—and, as could be expected, we found some rather
intriguing developments. That data was divided into four successive six-
month periods (2007-1, 2007-2, 2008-1, and 2008-2) and strategic dia-
grams were produced for each period. However, it was in the evolution of
keyword attractive power that the most surprising evolution appears (see
Fig. 6.7).
Most of the terms decline in the last period (2008-2), but two top
terms do increase: ‘UN’ (United Nations) and ‘kill’ (which does not des-
ignate any particular institution but does tend to characterize an interna-
Fig. 6.6 2006 keywords’ attractive power over the three four-month periods
Words and Facts: Textual Analysis—Topic-Centred Methods…
175
176 K. M. van Meter
tional increase in violence and instability). But among the other top
terms, there is one in serious decline: ‘Bush’. This can be interpreted as
indicating that during an international situation of increasing violence
and instability (‘kill’ going up), the leader of the world’s most powerful
nation was in decline or, as certain commentators stated, ‘had abandoned
the helm’ or was too discredited to lead the world. That responsibility was
being turned over to the institution ‘on the rise’; that is the United
Nations. In short, the Republican government of George W. Bush was
being replaced on the international scene by the Republicans’ worst
nightmare, an ‘international government’ headed by the United Nations.
Although Bush was replaced at the White House by a Democrat, Barack
Obama, our analysis of 2009–2012, which was recently published under
the title 2009–2012—Obama’s First Term, Bush’s ‘Legacy’, Arab Spring &
World Jihadism (Van Meter 2016), confirms that the UN and not the US
president continues to lead the world on the current violent international
scene.
Fig. 6.7 Dominant 2007-1 terms over the four periods of 2007–2008 (Bush
vs. UN)
6 onclusions—Press Cycles, Fashionable

C
Thinking, and Cumulating Information
From the above ‘parapolitical’ publications covering successively 2006,
2007–2008, 2009–2012, and soon 2013 and 2015–2016, it becomes
clear that ‘yearbook’ events that preoccupy the international press, and of
which we are all well aware, are often based on ‘expert advice’ and current
events that come in waves—often called the ‘press cycle of attention’—
and each successive wave replaces the preceding one on stage ‘front and
centre’ while we forget what happened during the previous period. We
can escape these successive waves by keeping the texts they generate and
performing a systematic and formal analysis of their content and struc-
ture. Often when we present these results during conferences, we ask the
auditorium, for example, “What do you remember about 2006? What
were the most important events of 2006?” The answers are usually a
number of ‘yearbook’ events, because that is what we have in our mem-
ory. Textual analysis provides tools allowing to look back at that material
and to challenge what you have retained from the successive waves of the
media flood.
In the domain of scientific publishing, where there are fewer waves of
fashionable thinking, these same methods can dig deep into the content
to reveal structures that do not appear of themselves, even to careful
researchers trying to maintain neutrality when studying, for example, the
sociology of education and how to bring up your kids and educate them
for the future in France, Germany, or the United States. Remember that
in the latter country almost half the population will grow up believing
that the Bible explains better the existence of man than Darwinian evolu-
tion, while less than five percent believe that in the two former countries,
and this is reflected in the textual structures of French, German, and
American sociologies.
As for Russia, its geopolitical structure today still looks very much like
the results of our textual analysis of 1989 in the official biographies with
Ukraine isolated by itself, the ‘western fringe’ of developed Baltic repub-
lics now part of the Western world, and Central Asian republics in a situ-
ation of flux and far from the two centres of power which are Moscow
and Saint Petersburg.
178 K. M. van Meter
Formal textual analysis is probably one of the very few methods avail-
able to us for the systematic study of scientific and cultural production in
all of these countries and throughout the world, permitting anything
approaching scientific neutrality, the possibility of comparative study and
accumulating further information in these domains.
References
Callon, Michel, Jean-Pierre Courtial, and William Turner. 1991. La méthode
Leximappe – Un outil pour l’analyse strategique du développement scienti-
fique et technique. In Gestion de la recherché – Nouveaux problèmes, nouveaux
outils, ed. Dominique Vinck, 207–277. Brussels: De Boeck.
Cibois, Philippe. 1983. Methodes post-factorielles pour le dépouillement
d’enquêtes. Bulletin of Sociological Methodology/Bulletin de Méthodologie
Sociologique 1: 41–78.
———. 1985. L’analyse des donees en sociologic. Paris: Presses Universitaires de
France.
———. 2016. Le logiciel Trideux. Accessed June 27, 2018. http://cibois.pages-
perso-orange.fr/Trideux.html.
De Saint Léger, Mathilde. 1997. Modélisation de la dynamique des flux
d’informations – Vers un suivi des connaissances. Thèse de doctorat, CNAM,
Paris.
De Saint Léger, Mathilde, and Karl M. van Meter. 2005. Cartographie du pre-
mier congrès de l’AFS avec la méthode des mots associés. Bulletin of
Sociological Methodology/Bulletin de Méthodologie Sociologique 85: 44–67.
———. 2009. French sociology as seen through the co-word analysis of AFS
congress abstracts: 2004, 2006 & 2009. Bulletin of Sociological Methodology/
Bulletin de Méthodologie Sociologique 102: 39–54.
Demazière, Didier, Claire Brossaud, Patrick Trabal, and Karl van Meter. 2006.
Analyses textuelles en sociologie – Logiciels, méthodes, usages. Rennes: Presses
Universitaires de Rennes.
Fairclough, Norman. 1992. Discourse and text. Linguistic and intertextual anal-
ysis within discourse analysis. Discourse & Society 3 (2): 193–217.
Glady, Marc, and François Leimdorfer. 2015. Usages de la lexicométrie et inter-
prétation sociologique. Bulletin of Sociological Methodology/Bulletin de
Méthodologie Sociologique 127: 5–25.
Image. 2016. Alceste – Un logiciel d’analyse et d’aide à la décision simple

d’utilisation. Accessed June 27, 2018. http://www.image-zafar.com/Logiciel.
html.
Jenny, Jacques. 1997. Méthodes et pratiques formalisées d’analyse de contenu et
de discours dans la recherche sociologique française contemporaine – État
des lieux et essai de classification. Bulletin of Sociological Methodology/Bulletin
de Méthodologie Sociologique 54: 64–122.
Maisonobe, Marion. 2015. Emergence d’une spécialité scientifique dans
l’espace – La Réparation de l’ADN. Bulletin of Sociological Methodology/
Marchand, Pascal. 2005. Le grand oral de Dominique de Villepin. Bulletin of
———. 2007a. Un vert, ça va – Dix verts, bonjour les débats! Bulletin of
———. 2007b. Le grand oral – Les discours de politique générale de la Ve
République. Brussels: De Boeck.
Pinault, Michel. 2006. La science au Parlement. Les débuts d’une politique des
recherches scientifiques en France. Paris: CNRS Editions.
Reinert, Max. 1987. Classification descendante hiérarchique et analyse lexicale
par context – Application au corpus des poésies d’A. Rimbaud. Bulletin of
———. 2003. Le rôle de la répétition dans la représentation du sens et son
approche statistique par la méthode ‘Alceste’. Semiotica 147 (1/4): 389–420.
Scott, Pete Dale. 1993. Deep politics and the death of JFK. Berkeley and Los
Angeles: University of California Press. Trade paper edition, 1996. Accessed
June 27, 2018. http://www.peterdalescott.net/B-IV.html.
Van Meter, Karl M. 1990. Methodological and design issues. Techniques for
assessing the representatives of snowball samples. In The collection and inter-
pretation of data from hidden populations, ed. Elizabeth Y. Lambert, 31–43.
Washington, DC: National Institute of Drug Abuse, Research Monograph
Series 98.
———. 2001. Terrorists/liberators. Researching and dealing with adversary
social networks. Connections 24 (3): 66–78.
———. 2003. Multimethod analysis and stability of interpretation. In
Interrelation between type of analysis and type of interpretation, ed. Karl M. van
Meter, 91–124. Bern: Peter Lang.
———. 2008. Analyses of a quarter of century of publishing at the BMS.
Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique 100:
6–15.
180 K. M. van Meter
———. 2009. The AFS and the BMS. Analyzing contemporary French sociol-
ogy. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique
102: 5–13.
———. 2016. 2009–2012 – Obama’s first term, Bush’s ‘Legacy’, Arab Spring &
world jihadism. Paris: Harmattan.
Van Meter, Karl M., Philippe Cibois, Lise Mounier, and Jacques Jenny. 1989.
East meets West—Official biographies of members of the central committee
of the communist party of the soviet union between 1981 and 1987, ana-
lyzed with western social network analysis methods. Connections 12 (3):
32–38.
Van Meter, Karl M., Martin W. de Vries, and Charles D. Kaplan. 1987. States,
syndromes, and polythetic classes. The operationalization of cross-
classification analysis in behavioral science research. Bulletin of Sociological
Methodology/Bulletin de Méthodologie Sociologique 15: 22–38.
Van Meter, Karl M., Martin W. de Vries, Charles D. Kaplan, and
C.I.M. Dijkman-Caes. 1992. States, syndromes, and polythetic classes.
Developing a classification system for ESM data using the ascending and
cross-classification method. In The experience of psychopathology. Investigating
mental disorders in their natural settings, ed. Martin W. de Vries, 79–94.
Cambridge: Cambridge University Press.
Van Meter, Karl M., and Mathilde de Saint Léger. 2008. Co-word analysis
applied to political science. 2006 international political & ‘parapolitical’
headlines. Bulletin of Sociological Methodology/Bulletin de Méthodologie
Sociologique 97: 18–38.
———. 2009a. German & French contemporary sociology compared: Text
analysis of congress. Bulletin of Sociological Methodology/Bulletin de
———. 2009b. World politics and “parapolitics” 2006. Computer analysis of ADI
timelines. Paris: Harmattan.
———. 2011. 2007–2008—The end of Bush and the rise of the UN. Link
analysis of world media headlines. USAK Yearbook of International Politics
and Law 4: 1–21.
———. 2014. American, French and German sociologies compared through
link analysis of conference abstracts. Bulletin of Sociological Methodology/
Van Meter, Karl M., and Wiliam A. Turner. 1992. A cognitive map of sociologi-
cal AIDS research. Current Sociology 40 (3): 129–134.
Van Meter, Karl M., and William A. Turner. 1997. Representation and confron-
tation of three types of longitudinal network data from the same data base of
sociological AIDS research. Bulletin of Sociological Methodology/Bulletin de
Van Meter, Karl M., William A. Turner, and Jean-Bernard Bizard. 1995.
Cognitive mapping of AIDS research 1980–1990. Strategic diagrams, evolu-
tion of the discipline and data base navigation tools. Bulletin of Sociological
Methodology/Bulletin de Méthodologie Sociologique 46: 30–44.
7
Text Mining for Discourse Analysis:
An Exemplary Study of the Debate
on Minimum Wages in Germany
Gregor Wiedemann
1 Introduction
Two developments have widened opportunities for discourse analysts in
recent years and paved the way for incorporation of new computational
methods in the field. First, amounts of digital textual data worth investi-
gating are growing rapidly. Not only newspapers publish their content
online and take efforts retro-digitizing their archives, but also users inter-
actively react to content in comment sections, forums, and social net-
works. Since the revolution of the Web 2.0 made the Internet a
participatory many-to-many medium, vast amounts of natively digital
text emerge, shaping the general public discourse arena as much as they
form new partial public spheres following distinct discourse agendas.
Second, computational text analysis algorithms greatly improved in their
ability to capture complex semantic structures.
G. Wiedemann (*)
Department of Informatics, Language Technology Group, Hamburg
University, Hamburg, Germany
e-mail: gwiedemann@informatik.uni-hamburg.de

184 G. Wiedemann
Early approaches of computational content analysis (CCA; Stone et al.

1966) have been criticized from a qualitative research perspective as sim-
ple ‘word counts’ which treat character strings as representatives of fixed
meaning and largely ignore contexts of their discursive formation. As a
reaction, in 1969, Michel Pêcheux sketched the program of an ‘Automatic
Discourse Analysis’ (Pêcheux et al. 1995) which was based on the idea to
examine the formation of the meaning of words themselves through sta-
tistical observation of language use in context with other words. In the
following decades, many variants of statistical approaches of computer-
assisted text analysis originated in the Francophone branch of discourse
analysis. The methods such as key term extraction, frequency analysis,
and co-occurrence analysis can be summarized by the term ‘lexicometry’
(Dzudzek et al. 2009). During the 2000s, lexicometry and corpus linguis-
tics slowly became part of the method toolbox also in Anglo-Saxon and
German social sciences and humanities through integration with (criti-
cal) discourse analysis (e.g. Helsloot and Hak 2007; Glasze 2007; Baker
et al. 2008; Mautner 2009; Scholz and Mattissek 2014).
For some years now, there are a number of advanced algorithms for
text analysis from the field of natural language processing in computer
science. The algorithms, summarized by the term ‘text mining’, further
extend computational capabilities to capture semantic structures in large
text collections (Heyer et al. 2006). They provide new opportunities for
exploratory studies which can be directly applied in discourse analysis
contexts. Further, they allow for automation of certain content analysis
steps, which can contribute to research designs where coding of texts is
utilized as part of operationalization in a discourse study (Wedl et al.
2014). For discourse analysis, text mining provides promising answers to
the challenge of vastly growing amounts of data in our digital era
(Wiedemann and Lemke 2016).
In this study, I will introduce two classes of ‘machine learning’ algo-
rithms to demonstrate the capabilities of text mining and what it can
contribute to a discourse research design. As an exemplary case, I will
investigate the public discourse on the introduction of statutory m
inimum
wages in the Federal Republic of Germany. The discourse is examined by
analysing more than 7,600 articles on the topic published in two German
newspapers between 1995 and 2015. I will demonstrate how an ‘unsu-
Text Mining for Discourse Analysis: An Exemplary Study… 185
pervised’ machine learning (ML) method such as topic modelling con-

tributes to partition the large corpus thematically and temporally for
closer exploration of contained discourse structures. Moreover, I show
how ‘supervised’ learning, also known as text classification, can contrib-
ute to reveal categorical patterns in the text data and, hence, may support
more deductive analysis steps.
However, before further introduction of the exemplary study and its
results, I will explain the ideas behind ML in more detail in the upcom-
ing section. I also reflect methodologically on characteristics of statistical
ML models for text in contrast to lexicometric measures and how both
may contribute to reveal patterns of discourse in diachronic studies. But
it should be mentioned beforehand that for text mining the same applies
as for any other computer-assisted text analysis: the mere use of specific
software is not a method on its own. Software, first and foremost, sup-
ports human analysts to structure and order data within a methodologi-
cal setting (Kelle 1997). This is also true for the new algorithms, although,
as we will see, they are able to capture more complex structures in very
large amounts of texts.
2 From Lexicometrics to Machine Learning

Notorious and ongoing debates on the character of discourse research
repeatedly state that there is no single formal method to conduct empiri-
cal discourse analysis. Consequently, discourse analysis sometimes is
described rather as a specific style of research than a method (Wedl et al.
2014, 540). When it comes to concrete data analysis, discourse studies
usually borrow instruments and processing steps from a variety of other
methods and methodologies such as grounded theory, lexicometry, or
content analysis. For this reason, any discourse research requires its own
specific operationalization to produce empirically founded hypothesis
about the investigated discourse (cp. Angermuller 2014, 24).
Consequently, I do not want to pay much attention to methodological
debates about discourse theory or any specific notion of ‘discourse’ in this
chapter. I rather concentrate on methodical questions to reveal structural
patterns in large textual data sets. Doubtlessly, text mining comprises
186 G. Wiedemann
useful tools for both lexicometry and content analysis (Wiedemann and
Lemke 2016). Therefore, I expect that its capabilities to structure and
order data also can make a valuable contribution to discourse studies
conducted against a large variety of methodological and theoretical
backgrounds.
First, let us have a look at the way how advanced text mining algo-
rithms, in particular, ML, proceed to extract knowledge from textual
data. In a second step, we compare the characteristics of already estab-
lished computational approaches such as CCA and lexicometric analyses
(Lebart et al. 1998), on the one hand, and ML, on the other hand, to
reflect on characteristics of the new approaches and their potential for
discourse analysis.
2.1 Supervised and Unsupervised Machine Learning
Generally, two types of ML are distinguished: (a) unsupervised learning,

which is purely data-driven to obtain structure originating from the data
itself, and (b) supervised learning, which tries to infer relations between
given text collections and knowledge represented in text external vari-
ables, for example, code categories a text is labelled with.
In discourse analysis scenarios, unsupervised learning can be, for
instance, automatic clustering of document collections by thematic
coherence. For this purpose, topic models provide a set of ‘algorithms for
discovering the main themes that pervade a large and otherwise unstruc-
tured collection of documents’, which can be employed to ‘organize the
collection according to the discovered themes’ (Blei 2012, 77). The basic
model Latent Dirichlet Allocation (LDA) takes the single words of a doc-
ument as features for partitioning the collection into a given number of
K topics. In a complex computation, LDA infers two ‘latent’ variables as
probability distributions. The first is a topic-term distribution, which
encodes information on how probable a term occurs in each of the K top-
ics. The second is a topic-document distribution, which encodes the
share of all K topics in each individual document. Highly probable terms
in single topic represent semantic coherences from the collection which
can be interpreted as a thematic cluster. Since distribution of these the-
matic clusters across all documents is a result of the modelling process, we

can assess on their overall share in the collection, as well as in arbitrary
slices of the collection such as time periods of single years. Therefore,
topic models allow for a broad thematic partitioning to gain insights and
as an inductive starting point for further detailed analysis. Moreover, top-
ics can also be evaluated with respect to metadata such as time or author-
ship to analyse evolvement of themes or distinguish thematic preference
of authors.
Evans (2014), for instance, analyses around 15,000 newspaper docu-
ments to identify topics regarded as ‘unscientific’ in the US-American
public discourse. Elgesem et al. (2015) utilize topic modelling to study
the discourse on climate change in more than 1.3 million Internet blog
articles and compare topic prevalence in different net communities.
Supervised learning, in contrast, allows not only to cluster collections
for broad thematic coherences, but to create models encoding the asso-
ciation of certain textual features with deductively defined content cate-
gories. By this, documents, paragraphs, or sentences can automatically be
coded analogue to the coding process in manual content analysis (CA).
For this, human coders are trained to identify text parts fitting the defini-
tion of a certain category given by a code book. The coder closely reads
(samples of ) documents from the target collection and extracts text parts
fitting code book definitions. A classification model trained by supervised
learning in this regard represents the equivalent of a naive human coder
who applies his/her knowledge to code a given text. By incorporating this
knowledge in a statistical model instead of employing human coders,
supervised ML algorithms can decide whether or not any new given text
belongs into the category. This allows for automatic coding of potentially
unlimited amounts of texts. Once an entire, big collection is automati-
cally coded, it allows for interesting quantitative evaluations of category
distributions and dependency, or filtering into subcollections. Interesting
analysis, for example, would be measurement of frequency of categories
across time or comparison of certain data subsets such as single publica-
tions. Another type of analysis could focus on dependency or co-
occurrence of categories in single documents.
Lemke et al. (2015), for instance, utilize text classification to code
paragraphs in several million newspaper articles to learn about the
188 G. Wiedemann
evolvement of economized justifications of policy measures in German

public discourse. They are able to show that the use of economized jus-
tification such as arguments referring to cost and efficiency significantly
increased in many policy areas over the last six decades. Wiedemann
(2016) employs supervised learning on a corpus of around 600,000
newspaper documents to identify patterns of excluding speech acts tar-
geted against certain actors and ideas in the German political discourse
on democracy. He finds that the far-right is more likely to be excluded
from the democratic discourse by moral and legal arguments referring to
democratic ideals and repressive policy measures of ‘fortified democracy’
such as ban of parties. Leftist politics, in contrast, more often is discred-
ited by equating it to its far-right counterpart, or devalued by reference
to ‘extremism’ as generic concept. These examples prove the ability of ML
approaches to capture complex, qualitatively defined categories in large
text collections and make them accessible to quantitative analysis steps.
2.2 Quality Assurance and Active Learning
One necessary condition for supervised learning to produce accurate

results is that it needs to be provided with sufficient training data, that is,
enough manually annotated text examples representative for a certain
category. The quality of automatic coding processes can be evaluated ana-
logue to inter-rater reliability known from CA where two or more ana-
lysts code the same set of texts and compare their results. Instead of two
human coders, reliability in the automatic coding setting can be obtained
by comparing human and machine decisions. The share of agreement
between human and machine decisions reports the accuracy of the pro-
cess. Measures such as Krippendorff’s alpha or Cohen’s kappa regularly
used in linguistic and social science contexts to compare coding results
from two coders apply well. Also the F1-score can be used, a measure of
the harmonic mean between precision and recall of the coding, which is
more common in machine learning contexts (Dumm and Niekler 2016).
Unfortunately, it is not as easy to obtain a sufficient training set which
captures all relevant features of a category in a collection and, hence, may
lead to poor automatic classification accuracy. A training set of sufficient
quality can be guaranteed by a certain selection strategy called active

learning. For this, Wiedemann (2016, 125ff) proposes a workflow of
selecting documents across the entire time period of investigation and
from all prominent topics in a collection, to compile and manually code
an initial training set. This initial training set is then extended iteratively
in the ‘active learning’ process, where the machine classifier is utilized to
choose new training examples. For this, the classifier is trained on the
current training set and then applied to the yet unlabelled data. The
resulting suggestions of positive category texts are retrieved and revised
by the human analyst who accepts or rejects the classifiers decision. The
revised and potentially corrected positive and negative examples are sub-
sequently used to extend the training set and train a new model. Every
iteration of automatic classification followed by human revision makes
the training set more representative for the entire collection, and the clas-
sification model becomes more accurate. This process is repeated until
classifier accuracy is no longer increasing or has reached a satisfying level.1
A model based on the final training set is then applied to code the entire
collection.
2.3 Local and Global Contexts
Since discourse analysts strive for inference on transtextual knowledge

produced in countless acts of communication, while at the same time
actually close reading single documents, they necessarily oscillate between
two levels of analysis. These two levels can be distinguished along with
specific notions of context they incorporate for their analysis. On the one
hand, they look at single documents to identify supposedly relevant
structures for the discourse. Single observations within documents are
assessed against the background of their local context, that is, the other
knowledge structures referred to in the document. Then, a comprehen-
sive overview of extracted patterns on an intertextual level is approached
to identify recurrent patterns and to reconstruct underlying knowledge
structures or sort out negligible aspects. By reading through multiple
1
Accuracy in this scenario can be determined by k-fold cross-validation on the current training set
(Dumm and Niekler 2016).
190 G. Wiedemann
documents, the view of the analyst is sharpened for specific textual for-
mations and language regularities contributing to shape a discourse in its
very own specific way. Analysis on this level embeds empiric observations
from the data in their global context. If one is able to condense these for-
mations and language regularities into some sort of analytic categories,
she/he keeps on extracting such patterns from local contexts and relating
them to each other on global context level until a saturated description of
the discourse can be assumed. This alteration between inductive, data-
driven category development and deductive category subsumption is at
the core process of knowledge reconstruction in discourse analysis—or,
as Wodak and Meyer (2009, 9) phrase it: ‘Of course, all approaches
moreover proceed abductively’. This way of proceeding has some analogy
to the unsupervised and supervised nature of ML algorithms. They also
give researchers the opportunity to combine inductive and deductive
steps of analysis into creative workflows. On the one hand, unsupervised
ML allows for exploration of patterns buried in large text collections to
learn about contents without any prior knowledge. On the other hand,
supervised ML provides results for descriptive statistics and hypothesis
testing on the basis of deductively defined categories.
Also algorithmically, ML has some similarities to abduction to infer
optimal knowledge representations for given data sets.2 Unlike humans,
algorithms are capable of processing extremely large quantities of textual
data sets without getting tired or distracted. At the same time, usually
these data sets are the only source they can learn structure from in a sta-
tistical way.3 So far, in contrast to humans, they lack common ‘world
knowledge’ and experience from outside the investigated text collection
to relate observed patterns and draw inference on. In this respect, local
2
Optimization algorithms in machine learning, such as Expectation Maximization, usually start
with random or informed guesses for their initial model parameters. In an iterative process the
model parameters are adapted in small steps to better fit the given data. In the end, the model
parameters (nearly) optimally describe the data set in some structural way. Not coincidentally, this
process resembles the abductive process of knowledge reconstruction from text corpora in qualita-
tive data analysis.
3
Of course, there are already text mining approaches which incorporate text external resources such
as comparison corpora or structured knowledge bases. It is the task of further research and develop-
ment to evaluate on the contribution of such resources for specific research questions in qualitative
data analysis.
and global contexts for computer algorithms need to be defined in a dras-

tically limited manner. Local contexts of observations such as words,
statements, or topics are simply defined as all the other words, statements,
or topics also occurring in the same context window, for example, the
same document. Global contexts of an observation then can be defined
as (statistically significant) observations which typically co-occur across
documents. Saussurean structuralism allows us to describe specific rela-
tions based on these contexts. For instance, the term sun often co-occurs
in a local context with shine. When both terms appear together in one
sentence, they describe a syntagmatic relation. At the global context level,
we can observe that this relation is statistically significant, and that other
terms such as light also describe a comparable relation with shine. Because
sun and light share a similar global context, they form a paradigmatic
relation.
The new characteristic which distinguishes ML algorithms from
already established methods in lexicometry and corpus linguistics is
directly related to their ability to connect the local and global context
level. ML approaches are able to link meaning structures from local con-
texts, for example, words in documents, to meaning on global context
levels such as thematic coherence or dependency on external variables.
Such links are captured explicitly in statistical models. In contrast, lexico-
metric measures such as ‘keyness’ of terms or co-occurrence are restricted
to descriptions of global contexts. They only provide an aggregated, dis-
tant view on a text collection. Formation of (latent) meaning of words in
this respect is encoded solely in measures expressing their statistical sig-
nificance. Typically, such aggregated numbers are decoupled from their
constituents, the individual observations in the data. ML, in contrast,
relies on probabilistic models of latent meaning retaining the connection
between statistically inferred global context knowledge (e.g. thematic
clusters) and individual data instances (e.g. documents). Statistical mod-
els, once trained on a text collection, can even be applied to new data to
observe and measure inductively or deductively derived categories across
(sub-)collections. Hence, in ML, context is no longer ignored like in
simple word counts, or a mere result of an analysis like in lexicometric
analysis. Instead, ML takes an intermediate position between the local and
the global view, which allows for switching between both perspectives.
192 G. Wiedemann
With this, the combination of lexicometry and ML, which I summarize

under the term text mining, doubtlessly improves opportunities for com-
bined qualitative and quantitative insights into text collections. Applied
in the right manner, it allows for complex, insightful analysis workflows
which have an enormous potential to open up discourse research to big
data sets, which is of special interest for longitudinal studies covering
large time periods or broad, fractionated discourse communities (Abulof
2015).
3 wenty Years of Dispute: Minimum

T
Wages in Germany
To further display the potentials of ML for discourse analysis, in the
upcoming sections, I sketch an exemplary study on the public debate
about minimum wages in Germany using topic models and text classifi-
cation. Rather than providing a comprehensive policy study, the focus of
this chapter is to contribute to the understanding of what these technolo-
gies can contribute to discourse analysis.
3.1 Example Study
Jäger (2004) defines discourses as ‘flows of knowledge through time’ and,

therewith, highlights the temporal component of the concept. In this
respect, longitudinal large text corpora appear as natural source to study
discourses. Especially, public dispute on reform in certain policy issues
can cover long time periods and evolve rather slowly, shaped by persistent
pressure from interest groups against or in favour of a certain measure.
Also, there are spontaneous reforms as a reaction to abrupt changes of
social, political, or institutional conditions. Studying longitudinal dis-
course under such conditions can be a challenging endeavour due to the
massive amounts of data. The introduction of minimum wages in the
Federal Republic of Germany makes an excellent case to study as such a
longitudinal discourse. Not only is it interesting to reveal how an excep-
tionally long-lasting political dispute finally became law. It also may dem-
onstrate the striking contribution text mining techniques have to offer

for discourse analysis.
The introduction of statutory minimum wages was a hot topic for
more than two decades in Germany. Since January 2015, wages of at least
8.50 EUR per hour have to be paid throughout the country. The under-
lying law, the ‘Mindestlohngesetz’, passed the Bundestag in August 2014
after long discussions between political parties, business lobbyists, unions,
and the public. This act realized an integral part of the coalition agree-
ment between the two major parties, Christian Democrats (CDU) and
Social Democrats (SPD), who formed a grand coalition after the federal
election in 2013. Up to this point, the dispute lasted already for several
years on the highest political level. First peaks of the debate can be dated
back to the mid-nineties. A simple frequency analysis of the term
‘Mindestlohn’ (minimum wage) in German newspapers reveals that the
term started to occur in the late 1980s, but the debate did not gain
importance before 1995, with peaks in 1996, 2007, and 2013 (Fig. 7.1).
400
documents
200
0
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
year
publication FAZ FR
Fig. 7.1 Document frequency of articles on minimum wages in two German

newspapers
194 G. Wiedemann
For the exemplary study, I inspect and compare two major German
newspapers. Following the ‘distant reading’ paradigm (Moretti 2007),
first, I strive for revealing global contexts of the debate with respect to its
temporal evolvement. What are major topics and subtopics within the
discourse, when did they emerge or disappear, and how are they con-
nected to each other? Can we determine distinct time periods of the
debate from the data? The use of topic models in combination with fur-
ther text mining techniques will enable us to answer these questions. For
a substantial analysis, we also want to zoom in from this distant perspec-
tive and have a close look on individual units of language use shaping the
discourse. In a deductive step, I will look for statements and utterances
expressing political stance towards the issue. How is approval or rejection
of the introduction of minimum wages expressed and justified through-
out time? Then, with the help of text classification, we will be able to
trace this antagonism between proponents and opponents of statutory
minimum wages quantitatively.
3.2 Data Selection
To study the discourse on minimum wages (MW) in Germany, I com-

piled a corpus of articles from two newspapers, the Frankfurter Rundschau
(FR) and the Frankfurter Allgemeine Zeitung (FAZ). Both are widely
regarded as ‘quality newspapers’ with nationwide dissemination.
Moreover, they represent two sides of the political spectrum. The FAZ,
founded in 1949, is generally viewed as a voice for conservative milieus
in Germany, while the FR, founded in 1945 covers the liberal, left-wing
part of the political spectrum. Therefore, a comparison of both publica-
tions may give interesting insights into the political debate, which as a
measure for promotion of social justice has been propelled by left-leaning
actors mainly. Since both newspapers are published in the Hessian
metropolis Frankfurt am Main, possible local aspects of the debate may
also be covered in a comparable manner.
Articles were selected by their publication date between 1 January
1995 and 31 December 2015, covering the time period from the first
larger impacts of the issue in the discourse agenda to the first year after
the ‘Mindestlohngesetz’ came into force. This allows tracing the genesis
of the policy measure from the first noticeable public demands, over vari-
ous sidetracks of the discourse and the point when lawmakers in majority
supported the measure, up to reflections on actual effects of the enacted
law. From the entire newspaper archive, those articles were retrieved
which contained the term ‘Mindestlohn*’, where the asterisk symbol
indicates a placeholder to include inflected forms, plural forms, and com-
pounds.4 Additionally, articles had to be related to German politics
mainly, which could be achieved by restricting the retrieval in the archive
databases by provided metadata. This results in a corpus of 7,621 articles
(3,762 in the FAZ; 3,859 in the FR) comprising roughly 3.76 million
word tokens. Their distribution across time reveals the changing intensity
of the public debate. Absolute document frequencies indicate that both
publications, although from opposing sites of the political spectrum,
cover the topic in a surprisingly similar manner (Fig. 7.1).
4 Inductive Analysis: Corpus Exploration

with Topic Models
For the exemplary study, I computed an LDA topic model with K = 15
topics on the entire corpus of newspaper articles. Before the LDA model
inference, common stop words were removed, as well as words occurring
less than 20 times in the corpus. Remaining terms were reduced to their
word stem and transformed to lower case to unify similar word types in
the vocabulary of the collection. The analysis is realized in the R statistical
programming environment using the topicmodels package by Grün and
Hornik (2011), which provides algorithms for LDA and other topic
models. Table 7.1 displays the inferred topics by listing their ten most
probable terms and their share of the collection in decreasing order.
Additionally, a label is attached to each topic, manually determined by
interpretation of the semantic coherence of its top terms.
4
The search index treats German Umlaute as their ASCII equivalent, such that Mindestlohn* also
retrieves articles containing ‘Mindestlöhne’.
196 G. Wiedemann
Table 7.1 Topics terms and shares

Share
No. (%) Top 10 terms Label
1 10.5 polit sozial gesellschaft mensch staat Social market economy
wirtschaft okonom gerecht arbeit
marktwirtschaft
2 9.3 mindestlohn branch gesetz tarifvertrag Sector-specific MW
euro spd union entsendegesetz
muntefering gewerkschaft
3 8.2 spd merkel partei koalition muntefering Social democrats
beck sozialdemokrat gabriel steinbruck
kanzlerin
4 8.2 mann alt leut jahr tag mensch polit gut General terms
welt leb
5 7.9 spd union cdu merkel koalition csu nahl Grand coalition
koalitionsvertrag seehof gabriel
6 7.8 euro uhr mitarbeit kund frankfurt stadt MW implementation
arbeit preis stund unternehm
7 7.6 prozent arbeitsmarkt arbeitslos institut Job market
million euro
8 6.3 euro milliard kind arbeitslos Social welfare
arbeitslosengeld rent hartz alt sozialhilf
langzeitarbeitslos
9 5.6 ausland deutsch unternehm illegal arbeit Undeclared work
pol baustell schwarzarbeit gesetz bau
10 5.4 gewerkschaft dgb zeitarbeit metall Temporary work
leiharbeit verdi beschaftigt
gewerkschaftsbund somm arbeit
11 5.2 grun fdp cdu spd partei wahl koch hess MW in Hesse
hessisch link
12 5.1 bau prozent gewerkschaft arbeitgeb ost MW in construction
west branch allgemeinverbind sector
wiesehugel baugewerb
13 4.8 europa deutsch deutschland wirtschaft MW in Europe
regier frankreich griechenland franzos
land milliard
14 4.2 link partei linkspartei spd lafontain gysi Socialist party
parteitag berlin pds wahl
15 3.9 post zumwinkel pin deutsch wettbewerb MW in postal sector
brief euro konz tnt verdi
In general, evaluation of clustering results and, in particular, evalua-

tion of topic models is not a trivial task. There are two major problems to
deal with. First, well-fitted separation of data into clusters can be obtained
at different levels of granularity. Second, due to mechanisms of stochastic
inference which involve random sampling methods to obtain model
parameters from data observation, results are not entirely deterministic
(Lancichinetti et al. 2015).
To solve the first problem, analysts may desire a data-driven, automatic
solution to answer the question: How many topics does my collection
contain? Unfortunately, although there are numerous measures to deter-
mine cluster quality (Walesiak and Dudek 2015) which may be utilized
to suggest a numerically optimal number of clusters, such automatically
derived parameters may not fit well to the needs of an analyst. Imagine,
for example, the data-driven parameter selection suggests splitting the
collection into two topics. This purely would not satisfy the researchers
demand for gaining deeper insights into the constituent parts of the dis-
course, since topics would be very general and hardly interpretable as
meaningful facets. In another case, it may suggest a separation into a
hundred or more clusters as numeric optimal solution. We certainly
would not be able to analyse and interpret such a fine-grained, potentially
unstable partitioning. This means that although we can use cluster qual-
ity indices as orientation, we need to define reasonable lower and upper
bounds to the number of clusters according to our research design.
The second problem, variance of modelling results due to the non-
deterministic nature of the inference mechanism, also needs careful
inspection during the analysis process. Choosing a higher number K to
separate the collection may lead to an unreliable clustering which cannot
be reproduced between repeated runs of the model algorithm. This
means, topics which occur in one process of modelling and which may
appear as meaningful and interpretable semantic coherence to the analyst
cannot necessarily be retrieved again in a second run of the algorithm. To
produce valid interpretations from topic modelling, researchers should
rely on reliable, hence, reproducible topics only. This certainly needs
some experimentation with parameters and familiarity with the overall
procedure. Usually, in topic model analysis researchers compute a variety
198 G. Wiedemann
of models and compare them in a systematic procedure. To select an

appropriate model, Evans (2014) suggests three steps:
The first criterion for model selection is to utilize numeric measures to
determine the model quality. Three measures are customarily used
throughout the literature: (1) perplexity, a measure to determine how well
the model generalizes to unseen data (Wallach et al. 2009); (2) coherence,
measuring how often topic defining terms actually co-occur in docu-
ments (Mimno et al. 2011); and (3) reliability, the share of topics we are
able to retrieve in a stable manner between repeated runs (Lancichinetti
et al. 2015). Since perplexity and coherence highly depend on the alpha-
parameter of the LDA modelling process, a numeric prior which deter-
mines the shape of topic distributions across documents, it makes sense
to utilize them as an indicator for selection of an optimal alpha-parameter.
For the newspaper collection, I computed different models, each with
K = 15 topics but varying the alpha-parameters, and selected the one with
optimal coherence and perplexity. For this, reliability could be obtained,
showing that around 81% of the topics can be retrieved stably, which is
within acceptable range. In case of low reliability, it would be reasonable
to decrease the number of topics K to retrieve more stable topics.
As the second step, we investigate the top terms of all topics and evalu-
ate if we are able to assign a descriptive label to them. For the presented
model on the MW collection, such a label for each topic is given in
Table 7.1. While most of the labels represent very straightforward inter-
pretations, in some cases it is rather hard to come up with a unique label.
Topic 4, for instance, mainly consists of general terms not related to any
specific theme. To support intuitions from top word lists, we can sample
topic representative documents from the collection and read them care-
fully to obtain more information to base our interpretation on. Overall,
most topics can be interpreted and labelled unambiguously. We can iden-
tify interesting subtopics in the debate around MW, such as consequences
for the job market, relations to undeclared work and social welfare or
different forms of implementation such as MW for temporary work or
specific sectors.
In the third step, we assess whether topic distributions over time fit to
our prior expectations about the discourse, for example, if certain events
or developments we already know about are represented by the model.
Since our model provides topic probabilities for each document, we can
aggregate probabilities according the document metadata such as its pub-
lication year and normalize them in a range between 0 and 1 to interpret
them as proportions. Average topic proportions can be visualized, for
instance, as an area plot, which allows a visual evaluation of topic trends
over time (Fig. 7.2). To make the plot more readable, topics are sorted in
a specific manner. On the single curves of the topic proportions over
time, I computed a regression line and sorted the topics according the
slope of this line. This results in an ordering where largest increases of
topic shares are on the top of the list while topic shares most decreasing
over time are located at the bottom. Now we can easily identify specific
trends in the data. In the beginning of our investigated time period, the
discourse is largely dominated by the issue of MW in the construction
sector which were introduced with the ‘Entsendegesetz’ in 1996 to pre-
vent dumping wages, but led to heated discussions on increases of unde-
Topic
1.00
Grand coalition
Social democrats
MW in Hesse
0.75 Job market
Socialist party
Aggregated probability
MW and temporary work
Sector specific MW
0.50
MW in postal sector
General terms
MW implementation
Social welfare
0.25
MW in Europe
Social market economy
MW and undeclared work

0.00 MW in construction sector
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
Date
Fig. 7.2 Area plot of topic distribution over time

200 G. Wiedemann
clared work. Dispute around the millennial turn was focusing on the
question whether statutory wages can even be enforced by executive pow-
ers. Then, steadily more industrial and service sectors became subject to
the debate. Throughout the 2000s, sector-specific MWs with an excep-
tional highlight on the postal sector were preferred above a general MW
for all sectors in the entire country. During that time, the topics on social
market economy entangled with demands for social justice and concerns
for the job market formed a steady and solid background for the debate.
In 2013, we could identify a new shift of the debate when a general mini-
mum wage became a central policy objective in the coalition agreement
of CDU/CSU and SPD after the federal election. From this year onwards,
topics on implementation of MW and possible consequences on the job
market increase.
From the perspective of quality assurance of the analysis process, these
results can be viewed as a successful evaluation, suggesting we were able
to obtain a valid model for our research purpose. We selected model
parameters according to optimized conventional numeric evaluation
measures and were able to label and interpret single topics, as well as their
quantitative evolvement over time. But of course, carrying out the entire
analysis on distributions and trends of semantic clusters covers only a
very distant perspective of the discourse. The method rather provides an
overview of the discourse, a suggestion for its separation, and a starting
point for further analyses to gain insight through contrasting along data
facets such as theme and time. To develop a profound understanding of
the data and what is going on in distinct topics at certain points of time,
we still need to read single articles. Fortunately, the topic model provides
us information on which documents to select. For every topic at any dis-
tinct point in time, for example, the striking increase in the topic of sec-
tor-specific MW in 2004, we can sample from the most representative
documents for that subset of our data to prepare further manual analysis
steps. Thus, the close connection between the global and the local context
through the model allows for a straightforward realization of demands for
‘blended reading’, the close integration of distant and close reading steps
(Lemke and Stulpe 2016).
5 eductive Analysis: Classification

D
of Propositional Statements
The previous step demonstrated how unsupervised ML contributes to
information extraction from large text collections by relying on data-
internal structures. This allowed us to inductively explore the debate on
MW in temporal and thematic manner. It made knowledge emerging
from trans-textual coherence of language use explicitly accessible through
the latent variables of the topic model. At the same time, this only pro-
vided rather broad contextual information. To analyse and understand
discourses, we do not strive for the macro view only, but want to reveal
how single constituents of utterance contribute to its formation.
Depending on the research design, we may derive categories of language
use contributing to the production of a discourse directly from the data,
or already have concrete categories for the analysis derived from text
external sources such as theoretical assumptions or previous studies. In
both cases, we can apply a subsuming analysis step where we assign cat-
egories to samples of text as representatives of empiric evidence. For
closed, well-defined category systems, this step of categorization can be
supported by supervised text classification or active learning.
For the exemplary study on the minimum wage discourse, I employ a
very simple category scheme for demonstration purposes. The goal is to
measure approval and opposition of political stance in the newspaper
articles to reveal their trends. For this, I hand-coded sentences in articles.
I extracted 107 sentences representing approval to the idea of statutory
MW, and 102 expressing opposition out of a set of 30 documents. Text
classification is realized with a linear support vector machine (SVM)
operating on a feature set of unigrams and bigrams of word stems in the
sentences. SVM introduced in text classification by Joachims (1998)
already almost two decades ago, is still one of the most efficient discrimi-
native classification models for this task. I utilized the LIBLINEAR
implementation by Fan et al. (2008). In six iterations of active learning,
I extended these sets to 460 approval sentences and 524 sentences express-
ing opposition. Since both categories are constructed as independent, to
each set related sentences, called ‘positive’ set, a contrasting ‘negative’ set
202 G. Wiedemann
of sentences irrelevant to the category is collected. These ‘negative’ sen-

tences mainly result from corrections during the active learning process.
Therefore, they also provide very valuable feature information to the clas-
sifier to be able to distinguish category representative units correctly from
those not fitting the category definition. A tenfold cross-validation on the
final training set shows results around 0.7 for the F1-score (Table 7.2),
which can be seen as a satisfying agreement between machine and human
coding.
With the final TS obtained from active learning, we now are able to
classify each of the around 259,000 sentences from the entire collection as
approval, opposition, or irrelevant in terms of our category system. Finally,
the resulting sets of positive sentences fitting each category can be utilized
for quantitative evaluation. Figure 7.3 plots frequencies of documents
containing at least one expression of approval or opposition towards the
issue of MW in the two publications. Frequencies were normalized to
proportions with respect to the collection size in the single time slices.
Table 7.2 Text classification of stances on minimum wages

Approval Opposition
Example ‘An Mindestlöhnen wird ‘Wissenschaftler warnen seit
sentence deshalb kein Weg langem vor der Einführung von
vorbeigehen’. [Hence, Mindestlöhnen in Deutschland’.
there is no alternative to [Scientists have long been
minimum wages.] warning against the introduction
of minimum wages.]
Initial 107 102
training
set (TS)
Final TS 460 524
positive
Final TS 1,226 1,621
negative
Features 24,524 28,600
Cross- 0.720 0.667
validation
F1
Sentences 259,097 259,097
all
Sentences 5,937 2,026
positive
Trends, that is, changes of proportions across the entire time frame,
appear to be very similar between the FAZ and the FR. We can observe
early peaks of support for the idea of MWs in 1996 and around the years
1999/2000. In 1996, an MW was introduced in the construction sector.
Around the turn of the millennium, although characterized by a large
relative share of approval (Fig. 7.3), the debate remained on rather low
level in absolute terms (Fig. 7.1). In contrast, intensity of the debate in
absolute counts and the share of approval statements for the policy mea-
sure started to increase simultaneously from 2004 onwards. Intensity
peaks in 2013 while retaining high approval shares. In this year, MWs
became part of the grand coalition agreement as a soon to be enacted law.
For expressions of opposition towards MW, we can observe interesting
trends as well. Not surprisingly, the overall share of negative sentiments
towards the policy measure is higher in the more conservative newspaper
FAZ. But, more striking is the major peak in the year 2004, just at the
beginning of the heated intensity of the debate. In 2005, there has been
an early election of the Bundestag after the government led by chancellor
FAZ FR
0.6
0.4
proportion
0.2
0.0
1995 2000 2005 2010 2015 1995 2000 2005 2010 2015
year
category approval opposition
Fig. 7.3 Relative frequencies of documents containing stances on minimum wages

204 G. Wiedemann
Gerhard Schröder (SPD) dissolved the parliament. His plan for backing
up the support for aimed social reforms, the so-called Agenda 2010, by
renewing his government mandate failed. Schröder lost the election and
a new government under Angela Merkel (CDU) was formed together
with the Social Democrats as a junior partner. Under the impression of
heated dispute about the Agenda 2010, the 2005 election campaign was
highly influenced by topics related to social justice. Switching to the
mode of campaign rhetoric in policy statements may be an explanation
for the sharp drop of oppositional statements to MW.
One year earlier oppositional stances peaked in the public discourse
presenting MW as a very unfavourable policy measure. Due to their bad
reputation throughout conservatives as well as Social Democrats,
demands for MWs did not become a major topic in the campaign of
2005. The interesting finding now is that the relative distribution of
approval and opposition in the public discourse in that year already was
at similar levels compared with that in the years 2012/2013 when the
idea finally became a favoured issue of the big German parties. This may
be interpreted in a way that statutory MWs actually could have been a
successful campaign driver in 2005 already, if Social Democrats would
have been willing to adopt them as part of their program. In fact, left-
leaning Social Democrats demanded them as a compensatory measure
against the social hardship of the planned reforms. Instead, the SPD
opted for sticking to the main principle behind the Agenda 2010 to
include low-skilled workers into the job market by rather subsidizing low
wages than forcing companies to pay a general minimum. It took Social
Democrats until elections of 2009 to take stance for the idea, and another
four years until government leading Christian Democrats became com-
fortable enough with it. Over the years 2014/2015, we could observe a
drop in both approval and opposition expressions, which may be inter-
preted as a cool-down of the debate.
In addition to trace trends of expressions for approval or opposition
quantitatively, we also can evaluate on the used arguments more qualita-
tively. Since supervised learning provides us lists of positively classified
sentences for each category, we can quickly assess their contents and iden-
tify major types of arguments governing the discourse. For approval, for
instance, statements mainly refer to the need for some kind of social
justice. The claim, ‘people need to be able to afford living from their
earned income’, can often be found in variants in the data. In fact, low
wage policy in Germany led to situations where many employees were
dependent on wage subsidies financed by the welfare state. Companies
took advantage of it, by creating business models relying on public subsi-
dies of labour to increase competitiveness. In addition to the social justice
argument, there are more economic arguments presented, especially in
relation to the demand for sector-specific MW. They are welcomed not
only by workers, but also by entrepreneurs as a barrier against unfair con-
ditions of competition on opened European markets. Oppositional
stances to the introduction of MWs also point to the issue of competi-
tiveness. They are afraid that competitiveness of German industry and
services will be diminished and, hence, the economy will slow down.
Turning it more to a social justice argument, major layoffs of workforce
are predicted. Often the claim can be found that MW are unjust for low-
skilled workers because they are preventing them from their entry into
the job market. One prominent, supposedly very specific German argu-
ment in the debate is the reference to ‘Tarifautonomie’, the right of coali-
tions of employees and employers to negotiate their work relations
without interference by the state. Statutory MW, so opponents claim, are
a major threat to this constitutional right. For a long time, German work-
ers unions followed this argument, but steadily realized that power of
their organized coalition in times of heated international competitiveness
was no longer able to guarantee decent wages for their members.
This brief characterization of the main argument patterns identifiable in
public discourse could be the basis for a further refinement of the category
system used for supervised learning. The active learning workflow applied
to this extended system would allow for measuring of specific trends and
framings of the debate—for instance, if reference to the argument on
‘Tarifautonomie’ diminishes over time, or if oppositional statements are
more referring to threats of increased unemployment in a framing of social
justice than in a framing of general damage to the economy. Although we
would have been able to identify these argumentative patterns also with
purely manual methods, we would not be able to determine easily and
comprehensibly on their relevancy for the overall discourse. Certainly, we
would not be able to determine on trends of their relevancy over time.
206 G. Wiedemann
6 iscussion: From Implicit Knowledge

D
to Explicit Models
In this exemplary study, I demonstrated how current technologies of text
mining and ML can contribute to the analysis of the discourse on statu-
tory MW in Germany over the past twenty years. The investigation of
public discourse covered in two major newspapers from opposing sites of
the political spectrum revealed interesting coherence between the identi-
fied patterns such as frequency of relevant documents as well as trends of
stances towards the policy measure. We were able to infer that the overall
agenda is determining the entire discourse community in very compara-
ble manner to a large extent, resulting in similar macro-level trends for
both publications. Further, we were able to identify changes of thematic
focuses of the debate across time in a purely data-driven manner with the
help of statistical topic models. With text classification we measured
approval and opposition towards MWs from its first peak in the mid-
nineties up to the enactment as law in 2015. For this, we applied an SVM
classifier in an active learning workflow which combines human and
machine coding of text in an alternating process to produce TSs effi-
ciently and with sufficient quality.
How do these approaches compare to non-computational steps of ana-
lysing discourse? Obviously, we were evaluating more surface structures
rather than revealing micro-structures of utterance as the base foundation
of discursive statements. While hermeneutic-interpretive investigation of
text samples may uncover how speech acts about threats of layoff of
employed work or forecasts of a shrinking economy contribute to a nega-
tive framing of MW, we simply measured their frequency and embedding
in a broader thematic structure formed by regularities in certain vocabu-
lary use. In this respect, computer-mediated discourse analysis rather
complements qualitative, more hermeneutic steps of discourse analysis
by allowing for assessment of relevancy and identification of trends of
structures.
The major contribution of text mining technologies including ML is
that they are able to make implicit knowledge acquired by the individual
discourse analyst during her/his work explicit by encoding it into (statis-
tical) models of language use representative for the categories of interest.

This process of encoding implicit understanding of a discourse in explicit
models has three major advantages for the analysis. First, it allows the
assessment of very large data sets and, thus, the quantified evaluation of
categories. Second, the explicit connection between local and global con-
text in ML models also allows retrieving examples on the micro level, for
example, single documents, which can be regarded as paradigmatic for
macro-level structures. Typical d ocuments for a topic cluster or exem-
plary expressions for classified statements can easily be selected from such
analysis for further manual inspection. Third, it allows for a more repro-
ducible analysis which can be subject to scientific argument more easily
than purely manual analysis procedures. As well as other approaches of
computer-mediated discourse analysis, results based on ML allow for
compliance with demands for inter-subjective comprehensibility and
reproducibility of research. At the same time, although explicitly relying
on quantification and statistics, text mining approaches remain highly
compatible with key assumptions from many discourse theories. Since
they do not assume fixed meaning associated with character strings, but
strive to ‘learn’ these associations from observations in empirical data,
they fit the demand for reconstruction of knowledge structures from
combined observations on both the textual and intertextual levels.
Despite the algorithmic advancements, the debate on best practices for
the integration of the new technologies in qualitative data analysis is still
far from being settled (Wiedemann and Lemke 2016). Not much has
been said so far, on the integration of various text mining methods into
complex analysis workflows. While ML on large data sets clearly shed
lights on broader macro-level structures of discourse by revealing topics
inductively or category trends deductively, lexicometric methods addi-
tionally provide valuable information to better understand production of
discourse formation on more fine-grained entities such as key words and
collocations. They can be interpreted in qualitative analysis steps as indi-
vidual events in textual data on the meso and micro levels. However, they
only become a pattern on the aggregated macro level of discourse. This
interplay between the distinguished levels has been subject to broad theo-
retic, methodological reflection to situate such approaches beyond the
traditional qualitative-quantitative divide (Angermüller 2005).
208 G. Wiedemann
From the technical perspective too, there is a lot of research necessary

to systematically integrate text mining approaches for capturing aspects
on the different discourse levels. Beyond broad topics, it would be inter-
esting to get closer to the intermediate semantic structures such as propo-
sitional patterns as a basis for a discourse formation. If we acknowledge
the advantages of computer-assisted methods for discourse analysis, we
need intensified discussion on best practices and guidance for valid, effi-
cient, and reliable application of state-of-the-art technologies, while at
the same time paying respect to matured methodological debates in dis-
course research which deny the possibility for full standardization of the
method (Feustel et al. 2014). In this regard, even when computational
methods help us finding patterns and selecting good samples, it largely
remains the task of the creative human analyst to link findings in the data
on the micro- and macro levels, and to draw the right conclusions in
conjunction with her/his knowledge about the sociopolitical contexts.
This means that also with nowadays ML methods, we are far away from
a purely automatic discourse analysis. But the new technologies offer us
the chance not only to generate new findings from very large data sets,
but at the same time, to facilitate the access to empiric analysis by point-
ing into interesting, relevant, and inter-subjectively comprehensible
directions.
References
Abulof, Uriel. 2015. Normative concepts analysis: Unpacking the language of
legitimation. International Journal of Social Research Methodology 18 (1):
73–89.
Angermüller, Johannes. 2005. Qualitative methods of social research in France:
Reconstructing the actor, deconstructing the subject. Forum Qualitative
Sozialforschung/Forum: Qualitative Social Research 6 (3). Accessed July 1,
2018. http://nbn-resolving.de/urn:nbn:de:0114-fqs0503194.
———. 2014. Einleitung: Diskursforschung als Theorie und Analyse. Umrisse
eines interdisziplinären und internationalen Feldes. In Diskursforschung. Ein
interdisziplinäres Handbuch. Band 1: Theorien, Methodologien und
Kontroversen, ed. Johannes Angermuller, Martin Nonhoff, Eva Herschinger,
Felicitas Macgilchrist, Martin Reisigl, Juliette Wedl, Daniel Wrana, and

Alexander Ziem, 16–36. Bielefeld: Transcript.
Baker, Paul, Costas Gabrielatos, Majid KhosraviNik, Michael Krzyzanowski,
Tony McEnery, and Ruth Wodak. 2008. A useful methodological synergy?
Combining critical discourse analysis and corpus linguistics to examine
discourses of refugees and asylum seekers in the UK press. Discourse & Society
19 (3): 273–306.
Blei, David M. 2012. Probabilistic topic models: Surveying a suite of algorithms
that offer a solution to managing large document archives. Communications
of the ACM 55 (4): 77–84.
Dumm, Sebastian, and Andreas Niekler. 2016. Methoden, Qualitätssicherung
und Forschungsdesign. Diskurs- und Inhaltsanalyse zwischen
Sozialwissenschaften und automatischer Sprachverarbeitung. In Text Mining
in den Sozialwissenschaften: Grundlagen und Anwendungen zwischen qualitati-
ver und quantitativer Diskursanalyse, ed. Matthias Lemke and Gregor
Wiedemann, 89–116. Wiesbaden: Springer VS.
Dzudzek, Iris, Georg Glasze, Annika Mattissek, and Henning Schirmel. 2009.
Verfahren der lexikometrischen Analyse von Textkoprora. In Handbuch
Diskurs und Raum: Theorien und Methoden für die Humangeographie sowie die
sozial- und kulturwissenschaftliche Raumforschung, ed. Georg Glasze and
Annika Mattissek, 233–260. Bielefeld: Transcript.
Elgesem, Dag, Lubos Steskal, and Nicholas Diakopoulos. 2015. Structure and
content of the discourse on climate change in the blogosphere: The big pic-
ture. Environmental Communication 9 (2): 169–188. https://doi.org/10.108
0/17524032.2014.983536.
Evans, Michael S. 2014. A computational approach to qualitative analysis in
large textual datasets. PloS ONE 9 (2). https://doi.org/10.1371/journal.
pone.0087908. Accessed July 1, 2018.
Fan, Rong-en, Kai-wei Chang, Cho-jui Hsieh, Xiang-rui Wang, and Chih-jen
Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of
Machine Learning Research 9: 1871–1874.
Feustel, Robert, Reiner Keller, Dominik Schrage, Juliette Wedl, Daniel Wrana,
and Silke van Dyk. 2014. Zur method(olog)ischen Systematisierung der sozi-
alwissenschaftlichen Diskursforschung. Herausforderung, Gratwanderung,
Kontroverse. In Diskursforschung. Ein interdisziplinäres Handbuch. Band 1:
Theorien, Methodologien und Kontroversen, ed. Johannes Angermuller, Martin
Nonhoff, Eva Herschinger, Felicitas Macgilchrist, Martin Reisigl, Juliette
Wedl, Daniel Wrana, and Alexander Ziem, 482–506. Bielefeld: Transcript.
210 G. Wiedemann
Glasze, Georg. 2007. Vorschläge zur Operationalisierung der Diskurstheorie

von Laclau und Mouffe in einer Triangulation von lexikometrischen und
interpretativen Methoden. Forum Qualitative Sozialforschung/Forum:
Qualitative Social Research 8 (2). Accessed July 1, 2018. http://nbn-resolving.
de/urn:nbn:de:0114-fqs0702143.
Grün, Bettina, and Kurt Hornik. 2011. Topicmodels: An R package for fitting
topic models. Journal of Statistical Software 40 (13). Accessed July 1, 2018.
http://www.jstatsoft.org/v40/i13/.
Helsloot, Niels, and Tony Hak. 2007. Pêcheux’s contribution to discourse analy-
sis. Forum Qualitative Sozialforschung/Forum: Qualitative Social Research 8 (2).
Accessed July 1, 2018. http://nbn-resolving.de/urn:nbn:de:0114-fqs070218.
Heyer, Gerhard, Uwe Quasthoff, and Thomas Wittig. 2006. Text mining:
Wissensrohstoff Text: Konzepte, Algorithmen, Ergebnisse. Bochum: W3L.
Jäger, Siegfried. 2004. Kritische Diskursanalyse: Eine Einführung. 4th ed.
Münster: Unrast.
Joachims, Thorsten. 1998. Text categorization with support vector machines:
Learning with many relevant features. In Proceedings: Machine Learning:
ECML-98, Heidelberg, Berlin, 137–142.
Kelle, Udo. 1997. Theory building in qualitative research and computer pro-
grams for the management of textual data. Sociological Research Online 2 (2).
Accessed February 28, 2017. http://www.socresonline.org.uk/2/2/1.html.
Lancichinetti, Andrea, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad
Körding, and Luís A. Nunes Amaral. 2015. High-reproducibility and high-
accuracy method for automated topic classification. Physical Review X 5 (1).
Accessed July 1, 2018. https://journals.aps.org/prx/pdf/10.1103/
PhysRevX.5.011007.
Lebart, Ludovic, André Salem, and Lisette Berry. 1998. Exploring textual data.
Dordrecht: Kluwer.
Lemke, Matthias, Andreas Niekler, Gary S. Schaal, and Gregor Wiedemann.
2015. Content analysis between quality and quantity. Datenbank-Spektrum
15 (1): 7–14.
Lemke, Matthias, and Alexander Stulpe. 2016. Blended Reading: Theoretische
und praktische Dimensionen der Analyse von Text und sozialer Wirklichkeit
im Zeitalter der Digitalisierung. In Text Mining in den Sozialwissenschaften:
Grundlagen und Anwendungen zwischen qualitativer und quantitativer
Diskursanalyse, ed. Matthias Lemke and Gregor Wiedemann, 17–62.
Wiesbaden: Springer VS.
Mautner, Gerlinde. 2009. Checks and balances: How corpus linguistics can
contribute to CDA. In Methods of critical discourse analysis, ed. Ruth Wodak
and Michael Meyer, 122–143. London: SAGE.
Mimno, David, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and
Andrew McCallum. 2011. Optimizing semantic coherence in topic models.
In Proceedings of the conference on Empirical Methods in Natural Language
Processing (EMNLP’11), 262–272. Stroudsburg: ACL.
Moretti, Franco. 2007. Graphs, maps, trees: Abstract models for literary history.
London and New York: Verso.
Pêcheux, Michel, Tony Hak, and Niels Helsloot. 1995. Automatic discourse anal-
ysis. Amsterdam and Atlanta: Rodopi.
Stone, Phillip J., Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie.
1966. The general inquirer: A computer approach to content analysis. Cambridge,
MA: MIT Press.
Walesiak, Marek, and Andrzej Dudek. 2015. clusterSim: Searching for optimal clus-
tering procedure for a data set. http://CRAN.R-project.org/package=clusterSim.
Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov, and David Mimno.
2009. Evaluation methods for topic models. In Proceedings of the 26th Annual
International Conference on Machine Learning (ICML’09), 1105–1112.
New York: ACM.
Wedl, Juliette, Eva Herschinger, and Ludwig Gasteiger. 2014. Diskursforschung
oder Inhaltsanalyse? Ähnlichkeiten, Differenzen und In-/Kompatibilitäten.
In Diskursforschung. Ein interdisziplinäres Handbuch. Band 1: Theorien,
Methodologien und Kontroversen, ed. Johannes Angermuller, Martin Nonhoff,
Eva Herschinger, Felicitas Macgilchrist, Martin Reisigl, Juliette Wedl, Daniel
Wiedemann, Gregor. 2016. Text mining for qualitative data analysis in the social
sciences: A study on democratic discourse in Germany. Kritische Studien zur
Demokratie. Wiesbaden: Springer VS.
212 G. Wiedemann
Wiedemann, Gregor, and Matthias Lemke. 2016. Text Mining für die Analyse
qualitativer Daten: Auf dem Weg zu einer Best Practice? In Text Mining in
den Sozialwissenschaften: Grundlagen und Anwendungen zwischen qualitativer
und quantitativer Diskursanalyse, ed. Matthias Lemke and Gregor Wiedemann,
397–420. Wiesbaden: Springer VS.
Wodak, Ruth, and Michael Meyer. 2009. Critical discourse analysis: History,
agenda, theory and methodology. In Methods of critical discourse analysis, ed.
Ruth Wodak and Michael Meyer, 1–33. London: Sage.
Part IV
New Developments in Corpus-
Assisted Discourse Studies
8
The Value of Revisiting and Extending
Previous Studies: The Case of Islam
in the UK Press
Paul Baker and Tony McEnery
1 Introduction1
Discourse analyses often tend to be time-bound. A discourse is observed,
its nature characterised and an analysis concludes. This, of itself, is not
problematic—analyses have beginnings and ends. Researchers invest the
time and effort into a research question as their research programme
demands and then move on to their next question. A slightly more prob-
lematic situation arises, however, when discourse is described and then
assumed to remain static. Such an analysis will background the fact that
discourse is dynamic. While we may concede that dynamism in discourse
may be topic sensitive and that such change may vary in terms of speed
and degree, it is nonetheless probably the rule rather than the exception
1
The work reported on in this chapter was supported by the ESRC Centre for Corpus Approaches
to Social Science, grant number ES/K002155/1.
P. Baker (*) • T. McEnery

Department of Linguistics and English Language, University of Lancaster,
Lancaster, UK
e-mail: j.p.baker@lancaster.ac.uk; a.mcenery@lancaster.ac.uk

216 P. Baker and T. McEnery
that discourse is Protean over time. However, it can be easy to refer to

past analyses and assume that they still have contemporary relevance. For
example, L’Hôte (2010) studies the language of globalisation in the
British Labour party in the period 1994–2005 using a 234,387-word
corpus of documents produced by the party. Yet the study has been cited
since as being about ‘the particular meaning of “globalisation” in Labour
manifestos’ (Evans and Schuller 2015). Similarly, Kambites (2014, 339)
says that L’Hôte’s paper ‘analyses the use of the term “globalization” by
successive UK governments and finds that “new Labour discourse is sig-
nificantly more concerned with the process of globalisation than
Conservative discourse”’. Both papers imply that the findings of L’Hôte
are not time-bound—but at the very least that possibility should be
accepted, otherwise contemporary discourse may be mischaracterised.
Nonetheless, the difficulty of extending some studies, especially where
they are large in scale, may be so daunting that a full investigation of the
ongoing dynamism of a discourse may not be undertaken.2 In this chap-
ter, we will return to one major study, Baker et al. (2013), to explore how,
in the period following their study, the discourse that they studied
changed, if at all. Baker et al. explored the representation of Muslims and
their religion, Islam, in the British national press. They did this by exam-
ining 143 million words of British newspaper articles from 1998 to 2009.
While their study is an exhaustive account of the period covered, numer-
ous subsequent global events relating to, or involving, Muslims, such as
the Arab Spring and the rise of ISIS, mean that the possibility clearly
exists that discourse around Islam has changed somewhat since 2009.
Additionally, the political context in the UK has changed; there was a
change in government in 2010, with a Conservative-led coalition replac-
ing Gordon Brown’s Labour Party. The popularity of the English Defence
League as well as increasing support for UKIP suggest that Britain’s social
barometer has become more right-leaning after 2009. With both the
political context and the world context of reporting about Muslims and
2
An example of such an extension is the work of Blinder and Allen (2016), who looked at the
representation of refugees and asylum seekers in a 43-million-word corpus of UK press material
from 2010–2012 in a complementary study to the investigation of the same subject by Baker et al.
(2008) using a 140-million-word corpus of newspaper articles covering 1996–2005.
The Value of Revisiting and Extending Previous Studies… 217
Islam in the UK press having changed, the opportunity presents itself to

consider an important methodological issue—how stable might the anal-
ysis of a discourse prove to be in the time following the completion of a
study?
2 The Approach—Corpus Linguistics

While we situate the analysis undertaken here in the broad field of dis-
course analysis, the method that we use to approach the discourse is
corpus-assisted discourse studies (CADS; Partington 2003; Partington
et al. 2004). CADS, as its name implies, relies on the methods of corpus
linguistics. Corpus linguistics (McEnery and Hardie 2012) is an approach
to the study of language based on the study of large volumes of attested
language use. These collections of language data, called corpora (singular
corpus), allow analyses which cycle between large-scale quantitative analy-
ses and more traditional close reading of texts. The data is analysed,
manipulated and navigated using programmes called concordance sys-
tems. For this study a system called CQPweb (Hardie 2012) was used.
The main features of CQPweb used in this study are keywords, collocates
and concordancing. Keywords are designed to bring into focus words
which are unusually frequent (or infrequent) in one corpus when it is
compared to another. Keywords have been shown (e.g. by Baker 2006) to
be of particular use in CADS as they often help, for example, to charac-
terise the overall content of the data set, the construction of groups within
the data set and rhetorical devices used in the corpus. These are clearly
useful things for any discourse analysis to achieve. Using keywords, we
can achieve these things on a large scale, relatively objectively and swiftly.
Collocation is another helpful procedure—it helps the analyst to look at
the explicit and implicit construction of the meaning of a word in context
(see, e.g. Taylor 2017). It does this by looking at words which have a
strong preference to co-occur with the word in question—so-called col-
locates. This is determined using a suitable statistic, typically an effect size
measure which shows the strength of attraction of the two words in
question (see Gablasova et al. 2017, for a discussion of various collocation
measures). Being able to use this tool to look at implicit and explicit
meaning as well as using it to contrast any differences between two cor-
pora (e.g. one of broadsheet news stories about a group and another of
tabloid news stories about the same group) has obvious applications
within discourse analysis, especially in the area of the construction of
identities and in and out groups. Given that there is growing evidence
that collocates have some root in psychological reality (Durrant and
Doherty 2010; Millar 2011), the value to the discourse analyst in using
collocation to explore discourse is further strengthened. The final tool is
the key tool which mediates between the relatively abstract large-scale
analyses provided by keyword and collocation analysis; concordancing
allows us to map back from the abstract quantitative analyses to the tex-
tual reality on which they are based. Concordancing allows us to navigate
back to the examples in context that produce a keyword or a collocate,
allowing us to rapidly scan those contexts to understand the finding in a
more nuanced way. Alternatively, we may start with concordancing and
work up to the more abstract level, exploring whether something we see
in one text is unique to it, relatively rare, average or in some ways unusu-
ally frequent, for example.
CADS uses the tools of corpus linguistics in order to subject corpora,
both large and small, to discourse analysis. The subsequent analyses have
the benefit of scale, can avoid an inadvertent cherry-picking bias that the
exploration of isolated, and potentially atypical, texts may promote and
has the advantage that some elements of the analysis are relatively subjec-
tive and reproducible. This chapter is an exploration of one discourse in
the UK press which will use CADS both in order to illuminate how that
discourse has changed, if at all, over time and, at the same time, to dem-
onstrate briefly what the CADS approach can achieve. So in order to
explore how stable the discourse around Muslims and Islam was in the
UK press, we extended the original study, analysing a corpus of articles
about Islam and Muslims from 2010 to 2014, which for convenience we
will call Corpus B, making comparisons back to the findings from the
original 1998 to 2010 study which was based on a corpus we will call
Corpus A.
2.1 Collecting the Articles
To be able to make a fair comparison, we collected Corpus B using the

same method of collecting newspaper articles that we used for Corpus
A. This involved gathering articles from the online news database Nexis
UK. We collected articles between January 2010 and December 2014,
using the same search term used to collect Corpus A. As before we col-
lected national newspapers and their Sunday editions.
However, even in the process of data collection we encountered our
first example of how the discourse around these topics may have
changed—since the collection of Corpus A, there had been changes to the
availability of some newspapers. The Business was a weekly newspaper
which went out of print in February 2008, so we were not able to collect
that newspaper. Additionally, The News of the World stopped publishing
in 2011 but was replaced with The Sunday Sun, so we have collected that
newspaper as the Sunday equivalent of The Sun. These changes, in them-
selves, open the possibility of an overall change in discourse.
Table 8.1 shows the number of words for both the older and the new
corpus, for each newspaper.
Table 8.1 The structure of the two newspaper corpora

Corpus A: 1998–2009 Corpus B: 2010–2014
Percentage of Percentage of
Newspaper Total words the corpus Total words the corpus
Business 577,234 0.3938 N/A N/A
Express 6,419,173 4.379286 3,510,462 4.394532
Guardian 24,344,632 16.60839 19,740,632 24.71208
Independent 25,591,916 17.45931 7,622,731 9.542428
Mail 17,216,224 11.74525 5,379,219 6.733914
Mirror 8,067,444 5.503768 3,117,405 3.902488
Observer 10,264,984 7.002973 3,516,404 4.40197
People 663,192 0.452443 336,226 0.420901
Star 2,669,466 1.821162 1,343,924 1.682376
Sun 5,018,404 3.423654 4,131,110 5.171483
Telegraph 16,125,825 11.00135 12,623,169 15.80217
Times 29,621,874 20.20862 18,561,226 23.23566
Total 146,580,368 100 79,882,508 100
It should be borne in mind that the broadsheets make up the majority

of the data in both corpora because they have longer articles and more
articles per issue than the tabloids. For the 2010–2014 corpus, just two
broadsheet newspapers, The Guardian and The Times make up almost half
of all the data. Adding in the other broadsheets (The Independent, The
Observer and The Telegraph), amounts to three-quarters of the data. As a
proportion of the corpus, The Guardian (a liberal newspaper) actually
contributed more in Corpus B than Corpus A (8% more), so it should be
borne in mind that the overall composition of corpus data is different
between the two time periods examined, and this may subsequently
impact on results found.
3 Number of Articles per Month

Looking at the volume of data produced and the trend over time, two
features of the Baker et al. study are shown to be stable—the volume of
data increases at points in time where violent events involving Muslims
occurs and, over time, the overall trend is for more articles to be written
350
300
250
200
150
100
50
0
1998-01
1998-08
1999-03
1999-10
2000-05
2000-12
2001-07
2002-02
2002-09
2003-04
2003-11
2004-06
2005-01
2005-08
2006-03
2006-10
2007-05
2007-12
2008-07
2009-02
2009-09
2010-04
2010-11
2011-06
2012-01
2012-08
2013-03
2013-10
2014-05
2014-12
Fig. 8.1 Average number of articles about Islam per newspaper per month,
1998–2014
mentioning Muslims and Islam. Figure 8.1 shows both of these points.

The figure combines results from Corpus A and Corpus B to show long-
term trends in the data.
This graph shows a trend line indicating that press interest in Islam
and Muslims has continued to increase, particularly since the events of
9/11. The rise of the Islamic State and its invasion of Iraq in 2014 appear
to have caused a third large spike in reporting (with 9/11 and 7/7 being
two other spikes). But since the end of 2009, there appears to have been
a notable upturn in reporting, following a general fall from 2005 to 2009.
Yet the overall volumes of data over time, while indicative of a poten-
tial change, does not describe or account for it. To look at this issue, we
decided to explore the differences between Corpus A and Corpus B using
the keywords procedure. We wanted to focus on keywords which indi-
cated a difference between the two periods and then to reflect on how
those keywords either meshed with, or indicated a change in, our previ-
ous findings. So we contrasted the language of the 1998–2009 articles
with those from 2010 to 2014 to identify words which occurred signifi-
cantly more frequently in one of these periods when compared against
the other. These keywords were then used as a way of exploring the con-
tinuation or change of findings in our original study of Corpus A.
4 Stability—What Has Remained the Same

Stability in the analysis across the two periods was the exception. Yet
there were some broadly stable results. Firstly, the relative frequency of
the word Islamic, shown by Baker et al. (2013, 44) to be associated with
extremism, retains this association. Secondly, the strong association of
Muslims and Islam with conflict in the UK press (Baker et al. 2013,
62–63) has also been sustained. Thirdly, the ‘horror discourse’ around the
veil identified in 1998–2009 is still present in 2010–2014. This was iden-
tified in the first study (Baker et al. 2013, 216–217) as a way of framing
Muslims in Islam in terms of science fiction or gothic horror—for exam-
ple by comparing them to movie of literary monsters: and in the original
and new study it is easy to find examples such as women wearing the veil
as looking like Daleks or Darth Vader. Finally, the original study reported
that the phrase devout Muslim was negatively loaded; this is still true in
the 2010–2015 articles where we found references to devout Muslims
described them as cheating (e.g. failing drug tests, having affairs etc.),
becoming radicalised, engaging in extremist activity or clashing with
‘Western’ values in some way.3 However, even in these cases, there are
some slight changes. For example, reporting of the Islamic State group’s
activities has served to intensify the association of the word Islamic with
extremism. Similarly, in the 1998–2009 corpus the word forms relating
to conflict (see Baker et al. 2013, 59) constituted 2.72% of the corpus.
For the 2010–2014 data these words constituted 2.75% of that corpus.
So while overall these findings have remained the same, minor details
relating to the findings have been subject to flux. However, these changes
are minimal by comparison to the major changes that have taken place
across the two time periods, hence the bulk of this paper will be devoted
to a discussion of these differences.
5 What Has Changed

The dynamic nature of discourse is shown clearly when we consider other
findings arising from Baker et al. (2013). In the following subsections, we
will consider a range of findings from the first study and show how, in the
second study, these features have changed. Throughout, the figures
quoted in the tables are normalised frequencies per million words as this
allows for an ease of comparison between the two corpora.
In the analysis that follows, after an initial focus on which countries are
mentioned and the prevalence of conflict lexis in the corpora, we will
focus on six key questions, relating to major findings by Baker et al., to
illustrate the dynamism of the discourse across the two periods with
regard to how, in Corpus B relative to Corpus A:
• the words Muslim, Muslims and Islam are used;

• Muslim women and men are represented;
3
For example, ‘A JUDGE yesterday ruled that a devout Muslim woman must remove her full face
veil if she gives evidence in court’ (The Sun, January 23, 2014).
• different branches of Islam are constructed;

• strength of belief is communicated and used;
• extremism is constructed;
• radicalisation is discussed.
These investigations are guided by words which are key when the two
corpora are contrasted.
5.1 Country and Conflict Keywords
In terms of location, there has been a shift away from stories about con-
flicts or attacks in Iraq, Palestine, and America, which are key when
Corpus A is compared to Corpus B. When Corpus B is compared to
corpus A, we find instead that in Corpus B Syria, Libya, Iran and Egypt
are key.
In terms of conflict, as noted, the strong relationship with both
Muslims and Islam is relatively stable across the two corpora. However,
when the lexis used to realise the presentation of conflict in the two cor-
pora is examined, a clear difference emerges. The top keywords (in
descending order) in Corpus A, when compared to Corpus B, are war,
terrorist, terrorists, attacks, bomb, bombs, terrorism, suicide, invasion,
destruction, raids, and hijackers. Key in Corpus B, when compared to
Corpus A are islamist, rebels, crisis, revolution, protesters, protest, sanction,
rebel, activists, uprising, islamists, jihadists, jihadist and jihadi. How can we
interpret these findings? World events tend to be a major driving force in
the contexts that Muslims and Islam are written about—such events
align well with news values. We therefore hypothesise that since 2009
references to terrorism have fallen sharply in articles about Islam particu-
larly because large-scale orchestrated attacks like 9/11 and 7/7 in
Anglophone countries in particular have been less marked. Many words
which directly refer to conflict have also seen sharp falls: war, bomb, raids,
destruction, attacks. Yet other words relating to conflict of a principally
civil kind have increased, such as crisis, revolution, protests, sanctions and
uprising. While stories about armed conflict have not gone away, reference
to political/civil conflict has risen dramatically. This makes us reflect
again upon the apparently stable finding linking Muslims and Islam with
conflict. While the picture in terms of the frequency of conflict words
appears relatively stable, the relative proportions of the different types of
conflict words are not stable. Concerns over Iran’s nuclear intentions, and
reporting of events around the Arab Spring have replaced the focus on
the Iraq war and 9/11. While mentions of al-Qaeda and the Taliban have
been reduced, they have been replaced by other groups like Islamic State,
Boko Haram and the Muslim Brotherhood. There are also more refer-
ences in 2010–2014 to rebels, activists, Islamists, protestors and jihadists. So
rather than being framed around fear of terrorist attacks, the discourse
between 2010 and 2014 is more linked to revolution, political protest
and Islam as a political force. The concept of jihad and those engaged in
it (while less frequent than some of the other terms) has also risen over
time. These changes in turn impact on the frequency of the selection of
different items of conflict lexis.
5.2 Muslim, Muslims and Islam
In our original study we were particularly interested in words which

appear next to Muslim, Muslims and Islam (as well as Islamic, which has
been discussed above) as, if they are frequent enough, they are likely to
trigger the concept even before it is mentioned. For example, if a person
encounters the phrase ‘Muslim fanatic’ enough times, we hypothesise
that they are likely to be primed to think of the word fanatic if they hear
the word Muslim by itself. So, as in the initial study, we looked for pat-
terns like Muslim [X], as well as [X] Muslims and [X] Islam.
5.2.1 Muslim
The top ten words immediately following Muslim in Corpus A are, in

descending order of frequency, community, world, council, women, leaders,
countries, cleric, country, men and communities. In Corpus B they are
brotherhood, community, women, world, men, woman, communities, con-
vert, countries and council.
The change, tokened by the contrast between the two lists, is quite
marked. We see a strong rise in the phrase Muslim Brotherhood, indicating
the salience of stories coming out of the Arab Spring and the uprising in
Egypt. In 2010–2014 Brotherhood follows over 1 in every 10 mentions of
Muslim. The Muslim council appears to be of less interest to journalists in
the later period, as do Muslim leaders and the phrase Muslim cleric. So
apart from the Muslim Brotherhood, it appears that there is now less
focus on people or groups who are seen as leading various Muslim
communities.
The term Muslim convert has become more common though, although
this term usually refers to stories about Muslim converts who are involved
in crime, usually terrorism or militancy, for example:
In the feverish atmosphere of Kenya’s war on terror, rumours abound as to

the whereabouts of Samantha Lewthwaite, the Muslim convert from
Aylesbury who is on the run after a foiled bomb plot. (Telegraph, July 16,
2012)
Yet there is an element of stability in the word—Muslim world, Muslim

countries and Muslim community continue to be phrases used frequently
to refer to Muslims. By continuing to use terms like Muslim world,
Muslim community and Muslim countries, the British press continue to
collectivise large numbers of Muslims as living in similar circumstances,
implying that they are separate from everyone else.
What of the plural form, Muslims? For Corpus A, the top ten right-
hand modifiers are British, young, Shia, moderate, Sunni, Bosnian, fellow,
other, radical and devout. For Corpus B, they are British, young, Sunni,
Shia, moderate, Bosnian, fellow, other, radical and devout.
This example is of interest as, although the two wordlists appear very
similar, rank orderings within the lists have changed, indicating a shift in
focus in the discourse. Similarly, the normalised frequencies for some
words have changed, even though their ranking has not. For example, the
ordering of mention of Sunni and Shia Muslims has been reversed. Also,
although references to British Muslims have decreased over time (from
Table 8.1 examples per million words in Corpus A to 4.9 in Corpus B),
this is still the most common way that Muslims are referred to. Further
changes in the representation of Muslims emerge when we consider what

British Muslims are reported as doing in Corpus B. They are described as
alienated, sometimes impressionable and prone to radicalisation:
The alienation of many young British Muslims goes deep—some despise

society enough to want to harm it. (The Times, February 2011)
Maybe after the 7/7 bombings and numerous failed plots to blow up
aeroplanes, nightclubs, an airport and a shopping centre it’s not so surpris-
ing. Nobody knows the true level of radicalisation among British Muslims.
(The Guardian, July 2010)
They are described as having travelled (often to places like Syria to join
ISIS):
According to our intelligence agencies, some 500 young British Muslims

have travelled to Syria to join the Islamic State or other terrorist organisa-
tions. (The Mail, August 2014)
They are expected to condemn jihad and terrorism (but a minority are
sometimes described as not doing so):
While most British Muslims wholeheartedly condemn the killers, we know

from bleak experience that a significant minority will tell opinion pollsters
they actually endorse what was done (The Mail, May 2013)
While the link through collocation of Muslims to alienation and radi-

calisation is reported in Corpus A by Baker et al. (2013, 42) travelled is
not. This is strongly indicative of a shift in the discourse to focus on the
issue of Muslims travelling to join jihadi groups and the wish for British
Muslims to discourage such travel by condemning it. In the case of con-
demn, in particular, the press has shifted its stance from a neutral one in
Corpus A, where it discussed the reaction or response of Muslims to terror-
ist acts, to a situation where it was seeking to shape that response, that is,
encouraging them to condemn such acts.
A focus on the word young reveals further differences. This collocate is
not reported in the analysis of Muslims in Corpus A by Baker et al. (2013,
42–44) yet in Corpus B it is notable that young and British both often
appear together at the same time as modifiers of Muslims. As a result,
many of the collocates of young Muslims are the same as those of British
Muslims. Those for young Muslims actually show a stronger concern about
radicalisation. Young Muslims are described as impressionable, disaffected,
rootless, angry and susceptible. They are at risk of being lured, recruited,
indoctrinated or brainwashed to commit crimes or jihad.
The Prison Service has long been concerned at the spread of radical Islam
inside Britain’s jails. Experts say a tiny number of fanatics, most serving
long sentences, have huge influence over disaffected young Muslims. (The
Sun, May 2013)
Cameron said a clear distinction must be made between the religion of
Islam and the political ideology of Islamist extremism, but the ‘non-violent
extremists’ who disparage democracy, oppose universal human rights and
promote separatism were also ‘part of the problem’, because they lure
young Muslims onto the path of radicalisation, which can lead them to
espouse violence. (The Times, February 2011)
So while Muslims are discussed as a collective group, the most salient

pattern is in the context of the radicalisation of young British Muslims.
The last word we will consider in this section relates to the belief sys-
tem that Muslims follow, Islam. In this case looking at the collocates
immediately preceding the word is revealing. In Corpus A, the top ten
such collocates are radical, militant, political, Shia, insulting, fundamen-
talist, anti, British, insult and moderate. For Corpus B it is radical, anti,
militant, political, Shia, Sunni, moderate, fundamentalist, insulting and
British. Note that these lists of collocates are similar, but not the same.
One collocate is unique to each list—insult for Corpus A and militant for
Corpus B. Also, these lists are ordered from strongest to weakest collo-
cates—hence it is clear to see that the rank ordering of the collocates has
changed too.
These minor changes, however, mask some real similarities. As with
Islamic, the word Islam continues to be associated with extremism, with
the words radical, militant and fundamentalist appearing in the top 10
left-hand collocates (although moderate also appears). Other words
suggest conflict: insulting, insult (which have both decreased over time)
and anti (which has increased over time):
Dorje Gurung, a Nepalese teacher in Qatar, who was imprisoned for 12

days in May 2013, after a 12-year-old student accused him of insulting
Islam, believes the men are likely to be held without any form of informa-
tion or support. (The Guardian, September 2014)
Note, however, that this extremism has been reinforced in Corpus B

by the new top ten collocate militant, which reinforces this discourse of
extremism in the corpus focussed upon the word Islam. By the same
token, however, the loss of insult from the top ten collocate list for Corpus
B perhaps tokens a weakening of the link of the word to insult Islam.
Nonetheless, the collocate insulting is still present in the Corpus B top
ten collocate list, meaning that the link between Islam and the process of
insulting persists, though the reality of insults to Islam sometimes appear
questionable according to the narrative voice of some articles.
Saudi liberal activist Raif Badawi was sentenced to 1,000 lashes, 10 years in
prison and a heavy fine for insulting Islam. In fact, his crime was to estab-
lish an online discussion forum where people were free to speak about
religion and criticise religious scholars. (The Independent, May 2014)
Overall, Islam is still often constructed in Corpus B as an entity that is

prone to feeling insulted, with any insults provoking strong reactions
from its adherents.
5.3 Representation of Gender
The terms Muslim women and Muslim men are frequent in the corpus. We
found that in the previous study, Muslim women tended to be discussed
in terms of the veil, due to a debate which took place in 2006 after com-
ments made about veiling by the then Home Secretary Jack Straw.
Muslim men were most often discussed in terms of their potential for
radicalisation. How have Muslim men and women been written about
since 2010?
Table 8.2 shows collocates (words which frequently occur near or next
to the word or phrase we are interested in) of Muslim women in the two
time periods—we only considered content words (nouns, verbs or
Table 8.2 Collocates of Muslim women

Category Collocates: 1998–2009 Collocates: 2010–2014
The veil and dress wear, veils, veil, wearing, wear, wearing, veil, face,
remove, worn, hijab, full, veils, burka, full, faces,
Straw, cover, Jack, niqab, worn, dress, cover,
faces, face, headscarves, hijab, veiled, covering,
veiled, Straw’s, dress, covering, niqab, public, remove,
head, headscarf, traditional, headscarves, burkas,
burkas, burka, wore, dressed, modesty, coverings
covered, veiling, burqa,
burkha
Identity words young, men, children, women, young, children, women,
(age, gender people, girls, old, generation, men, people, group,
and husbands, marry marry, girls, woman,
relationships)
Reporting and said, saying, feel, asked, call, said, says, see, feel,
feeling told, asking, comments, say, talking, think, say
ask, talk, revealed, believe,
suggested, called, calling,
urged, prefer, hope, know,
wants, claimed, speak,
question, warned, understand
Locations British, Britain, country, world, British, Britain, France,
London, English, western, French, English, London,
Bosnian, constituency, society, world, country, central,
Blackburn, France, Europe, western, Arab
street, Serb, town, French,
Arab, community
Freedom rights, allowed, forced, ban, ban, rights, right,
issue, choose, debate, power, allowed, law, banned,
support, help, free, allow, oppressed, choose,
row, required, banned, allowing, debate, free
encourage, campaign, choice,
freedom
Religion Islamic, faith, religious
Others role, take, way, rape,
protect, minority,
majority, come, made,
designed, swimming,
sport, living
a djectives) which occurred 10 times or more. Words that are unique to

2010–2014 are shown in bold.
We first examined some of the new collocates of Muslim women. For
example, modesty is used to refer to wearing the burka or hijab. Modesty
is generally represented as a positive aspect of being a Muslim woman,
although some authors cite this point in order to problematise it:
For many, the hijab represents modesty and freedom of choice, but we
cannot ignore that it is also one of the most contentious and divisive issues
of modern times—within the Muslim community as well as outside it.
(Guardian, February 16, 2010)
Other authors imply that adherence to modesty does not necessarily

mean that a Muslim woman cannot be stylish:
She is part of an expanding group of bloggers in London and America,

known as hijabistas, who are proving that it is possible to wear a hijab, a
symbol of modesty, and be stylish at the same time. (Sunday Times, January
19, 2014)
Swimming referred mostly to stories about swimming pools that ran

special sessions for Muslim women only. Such sessions were viewed as
problematic and contribute towards a wider discourse that was frequently
encountered in the 1998–2010 corpus of Muslims receiving ‘special
treatment described as unnecessary’:
The Walsall debacle comes six months after The Daily Express revealed how
Hull City Council was accused of running Muslim women-only swim-
ming sessions in secret—to the fury of regular baths users. (The Express,
July 6, 2010)
The collocate rape most often refers to atrocities that took place in
Bosnia in the 1990s:
In 1996 a UN tribunal indicted eight Bosnian Serb soldiers and policemen

for the rape of Muslim women. (Daily Mirror, June 12, 2014)
Sport is seen as an area which Muslim women should be encouraged to

get involved in more:
RIMLA AKHTAR, winner of the Community Award, believes more needs

to be done to engage Muslim women in sport at the grassroots level, writes
Andrew Longmore. (Sunday Times, December 8, 2013)
Of the 18 times that oppressed collocates with Muslim women, almost

all of them cite the idea that Muslim women are oppressed in order to
disagree with the notion:
‘The media portray Muslim women as oppressed and subjugated and

Islam is often presented as misogynist and patriarchal,’ she said, and her
book was intended as an antidote to that. (The Observer, March 16, 2014)
People use this idea that Muslim women are oppressed as an excuse for
pulling off their head coverings. (The Sunday Times, July 20, 2014)
The collocate designed most often refers to an all-in-one swimsuit for

Muslim women, called a burkini, with the following case giving a some-
what unflattering description:
TV chef Nigella Lawson has admitted she resembled ‘a hippo’ when she
wore a burkini on Bondi Beach. The 51-year-old caused a storm two years
ago by donning the all-in-one swimsuit designed for Muslim women dur-
ing a visit to Australia. (The Sun, February 25, 2013)
Finally, role collocates with Muslim women 12 times of which 8 are

used positively in the phrase role model.
I want to give something back to the community and be a positive role

model for young Muslim women—and for young women in general.
(Daily Star, March 25, 2011)
So since 2010 there has been a small but significant increase in positive
discourses around Muslim women, particularly in terms of questioning
their oppression or discussion of positive female role models. However,
the main picture is a continuation of older discourses which focus on
Muslim women as victims, receiving special treatment, victimisation or

problematising their dress.
In terms of stability in the discourse, the table above suggests that the
main focus around Muslim women—veiling—has not actually changed,
with a similar set of words relating to veiling and dress collocating mostly
with Muslim women. Other categories in the table tend to be linked to
veiling, particularly the one relating to freedom. So, has the debate
around veiling changed at all since 2010? In the 1998–2009 study we
concluded that the veil was discussed negatively and seen as an ‘issue’,
characterised by ambivalence and conflict. Some Muslim women were
described as oppressed by the veil, others as demanding to wear it. There
were a wide range of arguments given as to why Muslim women should
not veil, and they were discussed in metaphors relating to things that
glide on water, soft furnishings and (most negatively) frightening super-
natural monsters.
To examine the 2010–2014 articles, we retrieved cases of the following
pattern: insist*/demand*/force*/right/cho* to wear/wearing [the veil or
any related clothing item]
We only counted cases where the construction uncritically or unques-
tioningly referred to the veil as a choice, right, demand or imposition.
Table 8.3 shows what patterns we found for each newspaper.
We can see that the most frequent construction here is of Muslim
women being forced to wear the veil. The two negative constructions
Table 8.3 Patterns around veiling for Muslim women

Forced to Right to Choosing to Demanding to
wear it wear it wear it wear it
Times 10 3 4 5
Telegraph 11 1 9 6
Sun 3 0 2 3
Mail 4 2 3 4
Express 4 0 2 2
Star 0 0 0 3
Mirror 5 0 2 1
Guardian/ 10 2 14 6
Observer
Independent 1 2 2 1
Total 48 10 38 31
(force and demand/insist) are more frequent together than the more posi-
tive ones (right or choice). Uncritical descriptions of the veil as a right
were relatively infrequent.
We note the higher frequency of the veil being described as a choice by
The Guardian, although this newspaper also has fairly high representa-
tions of it being linked to compulsion as well. Table 8.4 compares pro-
portions of these constructions of wearing the veil to the earlier set of
data.
Over time, the veil is more likely to be described in negative terms,
either as Muslim women being forced into wearing it, or in terms of them
demanding or insisting on wearing it. Discussion of the veil as a right
appears to have sharply declined, although it is slightly more likely to be
described as a choice.
We also looked at arguments given for why Muslim women should not
veil. This was found by carrying out a search on terms describing the veil,
appearing in the same vicinity as the word because. Of the 135 cases of
these, 32 gave arguments as to why a Muslim woman should not wear the
veil. These are shown in Table 8.5.
The argument about the veil (particularly face-covering veils) making
communication with the veil-wearer difficult was the most frequently
cited. In particular, a court case where a veiled female juror was asked to
step down was mentioned, as well as there being references to school-
teachers who veil their faces.
I’m with Ken Clarke when he says that women should not be allowed to
wear the full-face veil in court because it is difficult to give evidence from
inside a kind of bag. (Daily Mail, November 5, 2013, Richard Littlejohn)
People are nervous about speaking to burka wearers. That’s because we
want direct communication, not just through eye contact but through
Table 8.4 Patterns around veiling—change over time (summary)

Forced to Right to Choosing to Demanding to
wear it (%) wear it (%) wear it (%) wear it (%)
1998–2009 28.5 34.8 25.5 10.6
2010–2014 37.7 7.8 29.9 24.4
Table 8.5 Arguments against veiling

Argument against veiling Frequency
It makes communication with the wearer difficult 13
It’s a symbol of oppression of women 5
It’s alien to the culture it’s been worn in 3
It reduces your field of vision 3
It compromises national security 3
It’s a symbol of extremist Islam which is seen as intolerant 2
It overexcites men 2
It’s false consciousness 1
It’s offensive 1
interesting and sometimes revealing facial expressions. We want to see the

lips move. (The Sun, April 10, 2010)
Again, the picture is of stability at one level, change on another. There

is still a focus on the veil in the second corpus. However, the main argu-
ment against Muslim women wearing the veil has changed from the
oppression of women (1998–2009) to a focus on difficulties surrounding
communication with the veil-wearer (2010–2014). The increase in argu-
ments relating to difficulties in communication could perhaps be seen as
‘strategic’ by those who oppose the veil, as it does not require its oppo-
nents to make claims about Islam’s ideology or attitude towards women.
Instead, the argument focusses on a more ‘practical’ concern, which may
be difficult to counter as being Islamophobic. This could suggest that
opponents of Islam are developing more careful and subtle arguments to
support their views.
Having explored the changing representation of Muslim women in the
UK press, let us now consider Muslim men. Table 8.6 shows collocates
(occurring 10 times or more) of Muslim men for the two time periods.
Again, those in bold are unique to the last period.
Apart from the reporting words, many of the categories used to discuss
Muslim men are the same in both periods, with discussion of them as
victims of violence and also as guilty of crimes. There is a new category
though: abuse. The words grooming, abuse and gangs collocate with
Muslim men in stories about the sexual abuse of (often) white girls.
Table 8.6 Collocates of Muslim men

Category Collocates: 1998–2009 Collocates: 2010–2014
Identity words (age, young, boys, women, children, boys, young, women,
gender and wives, whore, dating girls, marry, children
relationships)
Locations Srebrenica, Bosnian, British, Srebrenica, Bosnian,
Britain, Serb, Serbs, Iraq, British, Pakistani, Serb,
London, Asian, town Britain, Asian
Killing massacre, killed, massacred, massacre, killed,
slaughter, murder, murdered, killing,
murdered, slaughtered, slaughter, massacred,
died, war, killing, suicide, deaths, unarmed,
bombers murder
Reporting and said, say, believed, feel, says,
feeling told
Law and order arrested, accused, executed, accused, arrested,
innocent, alleged, trial, law charged, guilty,
convicted, found
Radicalisation/ impressionable, disaffected, radicalised
terrorism radicalised, radicalisation,
training, terror, radical, Abu
Abuse grooming, gangs,
abuse, force
Other white, allowed, disgraceful, white, forces
angry, born, beards, forces,
gathered, way, dignity, see,
get
And when Jack Straw condemned the grooming by British Muslim men of
Pakistani origin of vulnerable white girls, he was instantly flamed as a
bigot. (The Times, January 22, 2011)
This was far from a one-off case. Police operations going back to 1996
have revealed a disturbingly similar pattern of collective abuse involving
small groups of Muslim men committing a particular type of sexual crime.
(Daily Mail, January 10, 2011)
The authorities have been just as reprehensible in their reluctance to
tackle the sickening exploitation of white girls by predatory gangs of
Muslim men. (The Express, May 17, 2012)
Force does not occur in stories about sexual abuse but relates to cases
where Muslim men apparently force women to wear the veil.
We also note that the category of radicalisation/terrorism has fewer

words in it in 2010–2014. However, overall, the concept of radicalisation
has grown during this period (as shown earlier). What seems to be the
case is that it is not as gendered a concept as it previously was. When the
British press speak about radicalisation, they talk about young British
Muslims, but do not mention gender as much as they did in the past. So
radicalisation has become a more gender-neutral subject than it used to
be.
In this category only one word: radicalised occurs as a collocate with
Muslim men more than 10 times. However, this occurs in similar ways to
its use in 1998–2009, with fears about Muslim men being radicalised
(particularly young Muslim men).
The father of the Muthana brothers, Ahmed Muthana, suggested yesterday

that young Muslim men were being radicalised at ‘pop-up’ meetings in
Cardiff rather than at any mosque or via internet videos. (The Guardian,
June 24, 2014)
THOUSANDS of lone wolf extremists could launch similar attacks to
the Woolwich bloodbath, a senior police officer warned yesterday. And
Assistant Met Commissioner Cressida Dick said the threat cannot be eradi-
cated while young Muslim men are radicalised via the internet. (Daily
Mirror, December 20, 2014)
In Corpus B, stories about Muslim men in the British press have

focussed around them as either victims or perpetrators of crime, with
particular focus on the sexual abuse of white girls or the risk of them
being radicalised.
5.4 Branches of Islam
The earlier study found that about half the newspapers refer to Islam
generally rather than discussing different branches of Islam like Sunni,
Shia and Wahhabi. Is there any evidence that this behaviour has changed?
Figure 8.2 shows the proportions of times that a newspaper refers to these
Sunni and Shia Islam in both Corpus A and Corpus B. It shows the pro-
portions for each newspaper for the two time periods.
2.5
1.5
0.5
Fig. 8.2 Proportion of mentions of different branches of Islam for each

newspaper
The first bar shows 1998–2009, while the second shows 2010–2014. It
can be seen that The Independent has greatly increased the proportion of
times it refers to branches of Islam as opposed to writing more generally
about Islam. Six other newspapers have also gone in this direction (although
not hugely). However, The Guardian, Telegraph, Express and Star have gone
the other way and refer to the branches less than they used to.
Generally, a distinction can be made between the broadsheets and the
tabloids here, with all the broadsheets referring more often to branches of
Islam rather than Islam in general, while the reverse is true of the
tabloids.
So again, we have some stability. British tabloids continue to paint a
simplistic picture of Islam, not usually referring to or distinguishing
between different branches like Sunni and Shia, although The Mirror is
the tabloid that makes the most effort to do this. On the other hand, all
the broadsheets are more likely to refer to branches of Islam as opposed
to Islam itself, with The Independent being most likely to do this. Yet
within this overall picture of stability, variation by newspaper can be
notable, especially with regard to The Independent and The People. The
change underlying this apparent stability becomes all the more obvious if
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Fig. 8.3 References to Sunni, Shia, Sufi, Salafi and Wahhabi over time. Dark grey
denotes the proportion of mentions of references to branches of Islam (e.g. Sunni,
Shia, Wahhabi); light grey bars denote references to Islam
we consider change over time not by period covered by the corpus, but
by year. Figure 8.3 shows how overall references to different branches of
Islam have changed since 1998.
Since the start of the new collection of data (2010), newspapers have
begun once again to increasingly make distinctions between different
branches of Islam, as opposed to simply referring to Islam itself. However,
such references often relating to fighting between Sunnis and Shias (often
in Iraq) and the Sunni uprising in Syria.
5.5 Strength of Belief
This section examines how Muslims and Islam are associated with differ-
ent levels of belief. Phrases like Muslim extremist and Muslim fanatic were
found to be extremely common in our earlier study, and one way of gaug-
ing whether representations of Muslims have changed is to examine
whether such terms have increased or decreased. We would argue that the
presence of such terms, particularly in large numbers, is of concern as
readers of newspapers would begin to unconsciously associate Muslims

and Islam with extremism. Such terms tend to be more frequent in the
broadsheets, like The Guardian and The Times, simply because these
newspapers contain more print than the tabloids. So simply counting the
number of times a phrase occurs in a newspaper, and then comparing
across newspapers, is not the best way of ascertaining the scope of the
problem. Instead, we have taken an approach which looks at the propor-
tion or percentage of cases a word like extremist occurs next to Muslim in
each newspaper. This takes into account the fact that some newspapers
may mention Muslims a lot and some not very much at all.
Our initial study looked at labels which could be used to refer to peo-
ple. So we were interested in terms like Muslim extremist or fanatical
Muslim. We classified three different types of belief, as shown in Table 8.7.
As the table includes words based on labels for people, it does not con-
sider words for related abstract concepts like extremism, radicalism, mili-
tancy and separatism. Nor does it consider words related to processes like
radicalisation. In fact, as shown earlier, the words extremism and radicali-
sation were found to be significantly more common in the 2010–2014
articles, compared against the 1998–2009 articles. The term extremism
occurs after Islamic 15% of the time, while extremist(s) occurs after Islamic
in a further 31% of cases.
So before looking in more detail at other terms, it is worth bearing in
mind that in articles about Islam the newspapers are more likely to write
about the concept of extremism than they used to. Hence even if they
have reduced the number of times they talk about Islam or Muslims as
extremist, there is evidence that there has been a shift in language towards
Table 8.7 Levels of belief

Level of
belief Words considered
Extreme Extremist, extremists, fanatical, fanatic, fanatics, firebrand,
firebrands, fundamentalist, fundamentalists, hardline, hardliner,
hardliners, militant, militants, radical, radicals, separatist,
separatists
Strong Devout, faithful, orthodox, pious
Moderate Liberal, liberals, moderate, moderates, progressive, progressives,
secular
a greater emphasis on the abstract idea of extremism. This may make the
articles superficially less personalised, although it does not remove the
general focus on extremism. As found with the 1998–2009 data set,
extremism is more likely to be associated with the word Islamic, than
Islam or Muslim(s). Proportionally, The Star uses the most extremist words
next to Islamic in 22% of cases (almost 1 in 4). Compare this to The
Guardian which does this 6% of the time (about 1 in 17 cases). The
Express is the newspaper most likely to associate Islam with an extremist
word (1 in 10 cases), while The Mirror does this least (1 in 42 times). For
Muslim and its plural, it is The Express again which has the highest use of
extremist associations (1 in 13 cases), and The Guardian which has the
least (1 in 83 cases). However, overall in the British press, Muslim(s)
occurs next to an extreme word 1 in 31 times, for Islam this is 1 in 21 and
for Islamic the proportion is 1 in 8.
The picture for the words Muslim and Muslims combined shows that
fewer uses of the word Muslims are linked to extremism overall, with the
proportion in 1998–2009 being 1 in 19, while it is 1 in 31 for 2010–2014.
The People shows the largest fall in this practice, although we should bear
in mind that this is based on a much smaller amount of data than for the
other newspapers (e.g. The People mentions Muslims less than 500 times
overall in the period 2010–2014, compared to The Guardian which has
over 20,000 mentions in the same period). However, all newspapers show
falls in this practice overall.
For the word Islamic, there are also falls in its association with extrem-
ism, with the average number of mentions of an extremist word next to
Islamic being 1 in 6 in 1998–2009 and 1 in 8 in 2010–2014. The Star
and Sun are most likely to link the two words, while it is the least com-
mon in The Guardian and its sister newspaper The Observer. The picture
for the word Islam is somewhat different, however. Here the average
number of mentions of an extreme word near Islam has actually increased
slightly, from 1 in 25 to 1 in 21. The practice has become noticeably
more common in The Express, although most newspapers have followed
suit. Only The Mirror and The Telegraph show a move away from this
practice.
What of the moderate words? It is The Express, Mail and People which
are more likely to refer to Muslims as being moderate, with this practice
being least common in The Mirror. On average it is Muslims who are more
likely to be called moderate (1 in 161 cases), as opposed to the concept of
Islam (1 in 271 cases). However, these figures are much smaller than those
for the extremist words. For the 2858 mentions of extreme Muslim(s) in
the press, there are only 558 moderate Muslim(s), or rather 5 extremists for
every moderate. However, in the 1998–2009 articles, there were 9 men-
tions of extremist Muslims for every moderate, so we can see evidence that
moderate Muslims are starting to get better representation proportionally,
although they are still outnumbered. As Fig. 8.4 suggests, this is not
because moderate Muslims are being referred to more, it is more due to a
dip in mentions of extremist ones. For Muslim and its plural, it is The
People, Express and Mail which have shown greater increases in mentions
of moderate Muslims. However, on average, the number of mentions of
moderate Muslims has gone up but only slightly (now 1 in 161 cases).
For cases of Islamic occurring next to a moderate word, this was never
common, and has actually fallen slightly. Figures are based on low fre-
quencies, however, and as we have seen earlier, the word Islamic shows a
20
18 1998-2009
2010-2014
16
14
12
10
8
6
4
2
0
Fig. 8.4 Summary of all data, comparing proportions of change over time
very strong preference for extremist associations. For Islam, there is a

slight increase in the proportion of times it occurs next to a moderate
word, although again, this is very low overall.
Finally, let us consider the words devout, faithful, orthodox and pious,
and how often they occur next to Muslim(s), Islam and Islamic. These
words are more likely to occur next to Muslim(s) (about 1 in 140 times)
with only a handful of cases of them occurring next to Islamic and
Islam.
These strong belief words were barely present in the 1998–2009 arti-
cles next to Islam and Islamic too, so we will only consider change over
time for the word Muslim(s). What we see are falls for all newspapers,
with an average of 1 in 92 references to devout Muslims and similar terms
in 1998–2009, dropping to 1 in 140 such references in 2010–2014.
Figure 8.4 gives a better sense of the overall picture. We can see, for
example, how extremist words are most likely to occur near Islamic, but
also how there has been a drop in this association over time, as well as a
similar drop in the way that extremist words occur next to Muslim(s). We
can also see (small) increases in associations of extremist words with
Islam, and moderate words with Muslim(s).
So while references to Muslims as extremists have fallen in the British
press since 2010, journalists are writing about the abstract concept of
extremism much more frequently than they used to. Also, the concept of
Islam is more likely to be referred to as extreme than it used to be.
Extremism is still a hot news topic, but in this context, there has been a
move towards focussing more on the religion rather than on its adher-
ents. The ratio of mentions of extreme Muslims to moderate Muslims in
the British press is becoming slightly more equal (from 9 to 1 in
1998–2009 to 5 to 1 in 2010–2014). References to devout Muslims
have fallen since 2009, although these are still highest in The Mirror.
Relatively, The Star and Express have the most references to Islamic
extremists, while The Guardian has the fewest. However, the Express
refers to moderate Muslims the most too. The Mirror and People have
dramatically reduced their references to Islamic extremists. Given the
salience of extreme words in Table 8.8, let us now explore those words in
some more detail.
Table 8.8 Extremism keywords

1998–2009 2010–2014
Fundamentalism 22.4 10.3 ↓
Fundamentalist 38.1 18.8 ↓
Fundamentalists 27.1 13.0 ↓
Extremism 42.8 69.6 ↑
Radicalised 8.3 22.3 ↑
Radicalisation 6.1 19.9 ↑
Grooming 2.7 11.2 ↑
5.6 Extremism Keywords
There has been a strong decline in the words fundamentalism, fundamen-

talist and fundamentalists. Such words were found to have been particu-
larly strong in the years 2001 and 2004, and they have not returned.
However, the abstract concept of extremism (as opposed to a people-
labelling word like extremist) is more frequent in Corpus B than in Corpus
A, as well as three related words, radicalised, radicalisation and grooming.
This indicates a major shift in language around extremism, which is more
concerned with the process of becoming extreme, rather than labelling
people as such.
5.7 Reasons for Radicalisation
Radicalisation has been mentioned in several sections of this chapter so

far, so a closer look at radicalisation and the claimed causes of radicalisa-
tion seems appropriate—especially as it points to a major difference
between Corpus A and Corpus B and hence a source of dynamism in the
discourse over time.
References to radicalisation have increased since 1998–2009, with
there being almost double the number of mentions of that word in the
shorter 2010–2014 data set. The main pattern we see over time is increas-
ing attribution of blame for radicalisation on extremist Islam—in
1998–2009 this occurred in 1 in 3 cases. By 2014 it is 2 in 3 cases. Yet
the attribution of blame to government policy has decreased over time,
from 1 in 3 cases in 1998–2009 to 1 in 12 in 2014. The invasions of Iraq
and Afghanistan, while still mentioned, are now seen as almost historical
factors attributable to the ‘Labour years’, rather than as being relevant to
the present situation.
Two of the less frequent explanations for radicalisation found in the
1998–2009 data, ‘grievance culture’ and ‘multi-culturism’, seem to have
largely disappeared from the discourse around radicalisation in
2010–2014 (Figs. 8.5 and 8.6).
In the pie charts that follow, the different causes of radicalisation pre-
sented by the press are shown. The first pie chart shows the relative fre-
quency of causes in the period 1998–2009, the second covers 2010–2014,
while the third shows 2014 on its own. Below is a brief key explaining
each cause listed in the tables.
• Extremist Islam—cases where extremists are described as targeting

non-extremists. This is often described as occurring in prisons, schools
or universities.
• Government Policy—this usually refers to foreign policy, for example,
the invasion of Iraq, but in a rarer set of cases can also refer to a view
Multiculturism 1998-2009 Others

4% Poverty Grievance Culture 1%
4% 3%
Wars
4%
Alienation of Extremist Islam

Muslims 34%
14%
Government Policy
36%
Fig. 8.5 Claimed causes of radicalisation in the press in 1998–2009

Others
Poverty
4% 2010-2014
6%
Wars
7%
Alienation of
Muslims
10%
Extremist Islam
57%
Government Policy
16%
Fig. 8.6 Claimed causes of radicalisation in the press in 2010–2014
that the government has failed to properly tackle extremism at home,

or that policies that remove civil liberties are to blame.
• Alienation of Muslims—this refers to Islamophobia, the view that
Muslims and Muslim communities are cut off from others in the UK
and that youth in particular are disaffected.
• Wars—references to conflict abroad, particularly the Israel-Palestine
conflict, causing anger in the UK but also, more recently, the conflict
in Syria engaging British Muslims.
• Multiculturalism—blame on a general culture of accepting, even pro-
moting difference in the UK.
• Poverty—economic reasons, especially cases of Muslims living in areas
of economic deprivation.
• Grievance Culture—the view that Muslims wrongly feel victimised.
This is similar to ‘alienation of Muslims’, but here the sense of victimi-
sation is described as misguided.
• Others—‘one-off’ reasons such as individual Muslims experiencing
torture, or more global trends such as liberalising modernisation or
capitalism.
2014 only
Others
9%
Wars
8%
Alienation of
Muslims
9%
Government Policy
8% Extremist Islam
66%
Fig. 8.7 Claimed causes of radicalisation in the press in 2014
These three pie charts alone are sufficient cause to cast doubt on the
use of any time-bound analysis to cast light on what happens either before
or after the period studied. The results from Corpus A are very different
in terms of the proportions with which causes of radicalisation are men-
tioned. Figure 8.7 shows that one year in Corpus B is a much closer
match to the results of Corpus B than the comparison to Corpus A pro-
duces; that is, there is some evidence for internal consistency within
Corpus B, yet evidence of real change between Corpus A and Corpus B.
6 Conclusion
There is little doubt that the availability of corpus data which has allowed
large-scale investigations of discourses in certain genres, especially the
press, has been one of the most notable methodological developments in
discourse analysis in the past couple of decades. Such analyses, however,
are of necessity time-bound—the analysts collect data between two dates.
No matter how exhaustive the collection of that data, the capacity of the
data and its associated analysis to cast light on the discourse that preceded
or followed that data is, of necessity, limited. In this chapter we have

argued for a new methodological approach—the repeat of major studies
of this sort and a comparison across time of relevant results to begin to
approach the question of how discourse may shift through large-scale
systematic studies.
To demonstrate the need for this, we repeated the study of Baker
et al. (2013), collecting data for the six years following their analysis. By
doing that we were able to show the need for such a study—discourse
may indeed shift over time. Some elements remain the same, some shift
by a small degree, some shift quite substantially. Each shift has the
capacity to expose a driver in discourse in society—for example, how
the causes of radicalisation have been shifted in the UK press also has
the effect of backgrounding blame for some and foregrounding blame
for others. To simply assume that on the basis of the examination of 13
years of newspaper discourse (Corpus A) that we can generalise beyond
those 13 years is naïve—yet we do see statements in the literature that
seem, at least implicitly, to suggest that this is possible. Of course it may
be the case that if we subdivided the periods studied further we may also
find that the apparently monolithic nature of the two periods studied
are in themselves subject to diachronic shift—indeed Baker et al. (2013),
Gabrielatos et al. (2012) explore shifts within the period covered by
Corpus A in this chapter. However, the key point in this chapter is to
focus on the usefulness of the methods of CADS for exploring discourse,
but also to raise the general point that most discourses are not time-
bound and are dynamic through time—both within a period studied
and beyond it.
The methodological innovation that is needed to counteract this situ-
ation is simple—discourse must be monitored. Studies should be restaged,
and the findings of time-bound analyses should not be assumed to be
generalisable beyond the period studied unless there is clear evidence
from a follow-on study that such a generalisation is warranted. However,
on the basis of the study undertaken, we would predict that static find-
ings between an original study and a follow-on study, especially in a
dynamic medium such as press reportage, are likely to be the exception
rather than the norm.
References
Baker, Paul. 2006. Using corpora in discourse analysis. London: Continuum.
Baker, Paul, Costas Gabrielatos, Majid KhosravNik, Michal Kryzanowski, Tony
McEnery, and Ruth Wodak. 2008. A useful methodological synergy?
Combining critical discourse analysis and corpus linguistics to examine dis-
courses of refugees and asylum seekers in the UK press. Discourse and Society
19 (3): 273–306.
Baker, Paul, Costas Gabrielatos, and Tony McEnery. 2013. Discourse analysis
and media attitudes: The representation of Islam in the British press. Cambridge:
Blinder, Scott, and Will Allen. 2016. Constructing immigrants: Portrayals of
migrant groups in British national newspapers, 2010–2012. International
Migration Review 50 (1): 3–40.
Durrant, Philip, and Alice Doherty. 2010. Are high-frequency collocations psy-
chologically real? Investigating the thesis of collocational priming. Corpus
Linguistics and Linguistic Theory 6 (2): 125–155.
Evans, Matthew, and Simone Schuller. 2015. Representing ‘terrorism’: The radi-
calization of the May 2013 Woolwich attack in British press reportage.
Journal of Language, Aggression and Conflict 3 (1): 128–150.
Gablasova, Dana, Vaclav Brezina, and Tony McEnery. 2017. Collocations in
corpus-based language learning research: Identifying, comparing and inter-
preting the evidence. Language Learning 67 (S1): 130–154.
Gabrielatos, Costas, Tony McEnery, Peter Diggle, Paul Baker, and ESRC
(Funder). 2012. The peaks and troughs of corpus-based contextual analysis.
International Journal of Corpus Linguistics 17 (2): 151–175.
Hardie, Andrew. 2012. CQPweb—Combining power, flexibility and usability
in a corpus analysis tool. International Journal of Corpus Linguistics 17 (3):
380–409.
Kambites, Carol J. 2014. ‘Sustainable development’: The ‘unsustainable’ devel-
opment of a concept in political discourse. Sustainable Development 22:
336–348.
L’Hôte, Emilie. 2010. New labour and globalization: Globalist discourse with a
twist? Discourse and Society 21 (4): 355–376.
McEnery, Tony, and Andrew Hardie. 2012. Corpus linguistics: Method, theory
and practice. Cambridge: Cambridge University Press.
Millar, Neil. 2011. The processing of malformed formulaic language. Applied
Linguistics 32 (2): 129–148.
Partington, Alan. 2003. The linguistics of political argument. London: Routledge.

Partington, Alan, John Morley, and Louann Haarman, eds. 2004. Corpora and
discourse. Bern: Peter Lang.
Taylor, Charlotte. 2017. Women are bitchy but men are sarcastic? Investigating
gender and sarcasm. Gender and Language 11 (3): 415–445.
9
The Linguistic Construction of World:
An Example of Visual Analysis
and Methodological Challenges
Noah Bubenhofer, Klaus Rothenhäusler,
Katrin Affolter, and Danica Pajovic
1 Visual Analytics in Discourse Analysis

Using quantifying methods often leads to complex and confusing data to
be interpreted. This is especially true for corpus linguistic methods if they
are not merely understood as a way to produce some frequency numbers,
This research was supported by the Swiss National Science Foundation. We thank the reviewers
for their comments on an earlier version of the manuscript.
N. Bubenhofer (*) • K. Rothenhäusler

Department of Applied Linguistics, Digital Linguistics Unit ZHAW, Zurich
University of Applied Sciences, Winterthur, Switzerland
e-mail: noah.bubenhofer@zhaw.ch; klaus.rothenhaeusler@zhaw.ch
K. Affolter
ZHAW School of Engineering, Winterthur, Switzerland
e-mail: katrin.affolter@zhaw.ch
D. Pajovic
Department of Computational Linguistics, University of Zurich,
Zurich, Switzerland
e-mail: danica.pajovic@uzh.ch
252 N. Bubenhofer et al.
but a way to reveal patterns of language use in large corpora. Following

approaches of linguistic discourse analysis or lexicometry, the output of
an analysis of corpus data often comes in the form of a list, sometimes
long lists (collocates of a lexeme, significant keywords, n-grams etc.) fol-
lowing a specific structure. Networks are an example of such structured
lists. Or, to be precise, some structured lists represent elements and asso-
ciations between these elements that can be turned into a new visual
representation commonly being interpreted as a network graph. Apart
from the well-known and excessively used network graphs representing
social networks, in corpus linguistics, collocation graphs are an example
for such visualizations (Brezina et al. 2015; Steyer 2013; Bubenhofer
2015).
The goal of using visualization techniques for complex data is twofold.
On the one hand, visualizations help to make the data interpretable. The
visualization is successful if it turns the data in a form which makes it
accessible for interpretation, more accessible than before. On the other
hand, and closely connected to the first goal, the visualization should be
an instrument to work with, an instrument that allows ‘diagrammatic
operations’ potentially leading to new insights.
This diagrammatic perspective stresses the exploratory nature of visu-
alizations, clearly being more than just a display of something that is
already known, but rearranging data and mapping specific aspects of it to
visual elements, making it accessible alternatively. Diagrams not only
come in the form of elaborated visualizations such as network graphs,
maps and the whole range of displays of frequency distributions (box
plots, bar, line charts, etc.), but start in very inconspicuous forms such as
lists, enumerations, tables and so on. This means, of course, that the
aforementioned structured list of associated linguistic elements (e.g. col-
locations) is already of diagrammatic nature. A specific type of list, the
key word in context list used in corpus linguistics, actually is an impor-
tant diagrammatic form working as an index or register. This very old
form of rearranging data is an important tool to allow a new perspective
on text. It decontextualizes the original arrangement and breaks up the
unity of a text. Nevertheless, in corpus linguistics, it is often the case that
lists are not sufficient to work with. But turning list data into a more
The Linguistic Construction of World: An Example of Visual… 253
complex visualization such as a network graph is just a transformation

from one diagram into another.
Using visualizations in the aforementioned sense as exploratory tools is
the core idea of ‘visual analytics’ (Thomas and Cook 2005; Keim et al.
2010; Zhang 2007; Chen 2008). These methods are built upon a long
tradition of handling large data sets statistically (Friendly 2005; Friendly
and Denis 2001) and are widely used for data mining, both in academic
and in non-academic fields (business, government).
In the field of the digital humanities, which may be considered as the
broader academic context of quantitative discourse analysis, visual ana-
lytics plays an important role. Many examples of visual analytics
approaches have been developed and applied. Nevertheless, an in-depth
discussion about the practice of using visual analytics in the humanities
in general and for discourse analysis in particular is missing: The role such
tools play in the research process is largely undiscussed. Often, papers in
visual analytics introducing a novel approach mainly tackle questions
about technology and visual principles and aim at providing a tool for the
so-called expert, the humanist. On the other side of the gulf, the human-
ist plays with these tools without understanding them entirely and tries
to manipulate their parameters until they spit out plausible results. We
exaggerate, but in this chapter, we want to address some aspects of apply-
ing visual analytics in discourse analysis worth being discussed. To exem-
plify this discussion, we present our approach of calculating geocollocations
to visualize how discourses construct world views.
2 Hermeneutics and Distant Reading

Scholars in the humanities and in some branches of the social sciences
focussing on text—we will use the term ‘humanists’—are good at read-
ing, understanding and interpreting texts, in short: doing hermeneutics,
or doing ‘close reading’, as Moretti puts it: ‘in all of its incarnations, from
the new criticism to deconstruction’ (Moretti 2000, 57). The hermeneu-
tic approach to texts is still an important method in the humanities, also
in the digital age where massive amounts of text are available in electronic
form. But approaches such as discourse analysis (Foucault 1966;
Spitzmüller and Warnke 2011) or Moretti’s distant reading try to find a

new way of doing hermeneutics. Sometimes, these approaches are misin-
terpreted as just a new version of traditional hermeneutics, which, as a
plus, allows to ‘read’ larger amounts of text efficiently by using fancy
query engines, statistics and visualizations. That is only half the truth—or
perhaps even completely misleading. What we should be concerned with
when doing some form of distant reading is finding emergent phenom-
ena: Phenomena which cannot be explained by the properties of the
underlying entities. This helps to ‘understand the system in its entirety’
and gain ‘theoretical knowledge’—for the price of losing the details
(Moretti 2000, 57). A price not too high to pay for the advantage of get-
ting emergent concepts that can be interpreted to get ‘the big picture’: of
the development of world literature, changes in cultural behaviour, zeit-
geist, discourses, just to name a few.
Methods of visual analysis aim at making this ‘big picture’ accessible.
A good example in corpus linguistics again is collocation graphs showing
patterns of language use as an emergent phenomenon: The basis of the
visualization is a statistical computation to find words in text co-occurring
more often than we would expect (Evert 2009). Representing the collo-
cational profile of a word as a table already helps to see the emergent
patterns of language use, but drawing a graph may make these patterns
even more obvious. However, the graph representation is not self-explan-
atory: Many parameters must be set (e.g. how to reduce the multidimen-
sionality to two or three dimensions?) and mapping principles must be
found (e.g. do the lengths of the edges have a meaning?). Using such
visualizations for discourse analysis has important implications that need
to be discussed—we will touch some of them in the form of the following
hypotheses.
2.1 ool Development and Using a Tool Are

T
the Same
Developing visual analytic tools means getting one’s hands dirty: Finding
a diagrammatic expression for the data, selecting a programming frame-
work, using or developing algorithms, and programming. Normally, in
the humanities, the programming skills are limited. But not only in the
humanities, also in more information technology-oriented disciplines,
the people building and using such tools are not the same. This separa-
tion between visualization tool developers and so-called experts using
them is at the same time comprehensible and fatal. The disciplines deal-
ing with visual analytics developed theoretical foundations and method-
ological frameworks to solve visualization problems on an advanced
level. In consequence, humanists may use an increasing number of com-
plex tools, but they are using them as if they were black boxes. If visual
analytics is not just a tool, but a framework to explore data and to find
emergent, meaningful phenomena, then building the visualization
framework is as well an integral part of the research process (Kath et al.
2015). From choosing the statistics and data aggregation modes, the
mappings of data, and graphical forms, to designing an interface, every
single step in building the visualization framework demands full atten-
tion and reflection of the humanist. How does that influence the inter-
pretation? And more important, how does that influence the research
process itself? Software is a ‘semiotic artefact’, something ‘gets “expressed”
in software’ (Goffey 2014, 36), for example, the cultural surrounding it
is developed in, and its enunciations influence the process of
interpretation.
2.2 No Best Practice
The development of visual analytic methods is driven by disciplines hav-

ing a background in information science and engineering: informatics,
data and text mining, computational linguistics, business intelligence
and so on. The thought styles (German ‘Denkstil’, see Fleck 1980) of
these disciplines demand specific scientific standards to be met, one of
great importance being evaluation. Methods must be proofed to be useful
by evaluating them by means of a so-called ground truth or gold stan-
dard. On that basis, descriptions of best practices are drafted to solve a
specific problem using methods that have been evaluated. For visual ana-
lytic tools the need for evaluation has also been regarded as crucial (Keim
et al. 2010, 131).
But if it is true that developing and using visualization frameworks

goes hand in hand, then, obviously, defining best practices for doing
visual analytics in the humanities is very hard (Keim et al. 2010, 54). As
the research questions and processes differ enormously, a best practice
must be defined for each use case. Indeed, doing so helps researchers to
reflect on all the decisions taken in building the visualization
framework.
Despite the difficulties in evaluating visual analytic tools, several
approaches have been proposed. Most interesting from a humanist per-
spective are more ethnographic-oriented methodologies such as design
studies: Sedlmair et al. (2012) are fully aware of the pitfalls of problem-
driven visualization research. They propose a nine-stage framework that
can be followed not only to design and develop a visualization framework
for a given problem, but also to reflect and evaluate the whole process.
But still, this methodology separates the visualization researcher from the
domain expert. In our eyes, it is crucial at least from a humanist perspec-
tive to combine these roles: For the humanist, the visualization frame-
work is not just a tool to be applied, but an important part of her research
process. There is no decision on the visualization side of the joint research
process which is not of interest for the humanist.
2.3 ogmas of Visualization Science Must

D
Be Questioned
From a humanist perspective some dogmas of visualization science are

questionable. Taking for example Shneiderman’s ‘Visual Information
Seeking Mantra’: ‘Overview first, zoom and filter, then details-on-
demand’ (Shneiderman 1996, 336) or the extended version of Keim et al.
(2006, 6): What are the details when visualizing language use? The obvi-
ous answer is: The text itself or a text snippet. In the example of the col-
location graph, one would expect to have the text snippets available upon
which the calculation of the collocations is based. But if we understand
collocations as an emergent phenomenon, the actual text snippets will
not provide any important information we do not already have. If the
researcher has the impression of not getting the full picture by inspecting
the collocations, the phenomenon they are interested in has not been
modelled sufficiently. If it is mandatory to go through the text snippets,
the researcher is interested in the single text snippets, not the big picture.
A visualization framework with an easy access to the text details traps the
researcher in the cage of single text hermeneutics: ‘what we really need is
a little pact with the devil: we know how to read texts, now let’s learn how
not to read them’ (Moretti 2000, 57).
Another issue is related to a more general topos in information science
described by Fuller as the ‘idealist tendencies in computing’ (Fuller 2003,
15) or by Geoffey as ‘an extreme understanding of technology as a utili-
tarian tool’ (Goffey 2014, 21). These topoi lead principles in computing
like efficiency and effectiveness, to dominate algorithmic approaches to
text understanding: Visual analytics therefore aims at building ‘effective
analysis tools’ (Keim et al. 2010, 2), ‘to turn information overload […]
into a useful asset’ (Keim et al. 2006, 1). As these goals can be justified
for business applications, they can be misinterpreted in the humanities as
a faster way to read texts. But instead, the capability of visual analytics in
the humanities lies in getting a completely different view on human
interaction—seeing emergent phenomena. A visual analytics framework
useful for humanists will not provide a compact overview of the data and
not merely a more efficient access, but should make the data nourishing
for further analyses that were not possible before.
3 Geocollocations
The visualizing experiments we will present now stand against the back-
ground we sketched so far. The research questions lie in the domain of
corpus linguistic discourse analysis.
3.1 Research Question
We are interested in the way mass media and other mass communication
players shape our perception of the world. News articles often deal with
countries, cities, regions, continents and so on and attribute values and
statements to them. We use an approach of discourse analysis using large

corpora and data-driven methods to reveal linguistic patterns that are
relevant for discourses (see Bubenhofer 2009; Bubenhofer et al. 2015;
Sinclair 2004; Teubert 2006; Felder et al. 2011; Scholz and Mattissek
2014, and also the contributions in this book). In a discourse analytical
perspective, we are interested in space being constructed socially (Glasze
and Mattissek 2014, 209). We therefore search for linguistic patterns
(frequently co-occurring words) contributing to the discursive construc-
tion of space.
Our approach to reveal such linguistic patterns is the calculation of
words co-occurring significantly often with toponyms. These so-called
‘collocates’ of toponyms should reveal the attributions attached to these
entities. Our term for collocations consisting of at least one toponym is
‘geocollocations’. An example of a geocollocation would be ‘Switzerland—
banks’ where ‘banks’ is the collocate of the toponym ‘Switzerland’.
The world shaped by discourses normally differs from geographic real-
ity regarding proximity (what areas seem to be near or far away?) and
positioning (country X seems nearer than Y, although geographically it is
not). Not all places and areas are present equally detailed and the world
may be divided into parts like ‘south’ or ‘middle east’. Of course, all these
conceptions are relative to the geographic position of the players in the
discourses.
Whereas it is already interesting for discourse studies to see which top-
ics and attributions are attached to places (and what the differences are
between different discourses), the more challenging goal is another one:
to obtain abstract categories how geography is shaped through language
use. We will later give some pointers to this question.
Of course, the worlds constructed by different discourses may differ
significantly from each other. Therefore the selection of a specific dis-
course as an object of the study and the underlying data is of crucial
importance. But the following case study does not aim at giving a com-
plete study on a specific discourse. The intention of our paper is to pres-
ent our analytical framework and to discuss some methodological issues
working with such a framework.
3.2 Data and Calculations
The data used for this case study consists of two data sets: (1) A corpus of
the magazine ‘Der Spiegel’ and the weekly journal ‘Die Zeit’ from
Germany from 1946 to 2010 (640,000 texts, 551 million tokens, Spiegel/
Zeit corpus) crawled from the complete digital archives available online1
and (2) records of the German parliament Bundestag of the legislative
period 2009 to 2013 (363,000 contributions, 22 million tokens) com-
piled by Blätte (2013). The data has been processed with the part-of-
speech tagger and lemmatizer ‘TreeTagger’ (Schmid 1994, 1999). In
addition, named entity recognition (NER) has been applied to the data
using the Stanford Named Entity Recognizer (Finkel et al. 2005) in a
version adapted to German (Faruqui and Padó 2010). The recognizer
tags not only toponyms but also names of persons, companies and orga-
nizations. In our case, only the toponyms were used.
In order to calculate the geocollocations, all toponyms above a mini-
mum frequency limit were selected and words (lexemes) co-occurring
significantly often with the toponym in the same sentence were calcu-
lated. The selection of an association measure is influenced not only by
statistical considerations, but primarily by the theoretical modelling of
the collocation concept. We used a log likelihood ratio significance test-
ing which is widely used in discourse linguistics to study language usage
patterns (Evert 2009).
The data set now contains toponyms and their collocating lexemes
with frequency information and the level of significance of the colloca-
tion. In order to place the toponyms on a map, they have to be geocoded.
Although there are several geocoding services available like Google Maps
API or Nominatim (OpenStreetMap), the task is challenging because of
reference ambiguities (‘Washington’[DC or the state], ‘Berlin’[capital of
Germany or city in New Hampshire?]), historicity of the toponyms
(‘Yugoslavia’ or ‘Ex-DDR’ do not exist anymore) or the use of unofficial
names (‘the States’ for USA, German ‘Tschechei’ instead of ‘Tschechien’,
‘Western Sahara’, which is not officially recognized as a state). Luckily
1
See http://www.zeit.de/2017/index and http://www.spiegel.de/spiegel/print/index-2017.html
(accessed 6 March 2017).
with geonames (http://www.geonames.org), a gazetteer is available that

can deal with the last two problems by providing a broad range of names
for places, including historical variants, and as a community-based proj-
ect it is easily extendible on top. However, it is just a list and hence on its
own it is of no help for the task of ambiguity resolution. To tackle the
latter, we used CLAVIN (Cartographic Location and Vicinity INdexer2),
an open source tool, which performs automatic geocoding based on the
document context. It employs different heuristics to disambiguate the
toponyms identified by the NER: place candidates are ranked according
to population size and in a second step a set of places is chosen that are
within the vicinity of one another.
The precomputed collocations are stored in an Elasticsearch3 index for
convenient access of toponyms along with their collocates and to allow
for fast visual display in the browser. Most of the filtering possibilities
discussed in the following section can also be computed in the backend
to further reduce the load of the code running on the client side and
enhance the user experience.
3.3 Visualization
The computed list of geocollocations contains several thousand entries

and obviously is not suitable to get an overview over the data. In addi-
tion, reading the list lets the analyst constantly draw references to the
geographical places which might be easy in the case of toponyms like
Paris, London, Berlin, Rome, France, UK, Germany, and Italy, but is
more difficult with an unordered list of less known toponyms like Madhya
Pradesh, Dschidda, Masar-e Scharif, Sibiu. Contextualizing these top-
onyms by placing them on a map helps to draw the references to the
actual places and to semantically enrich them for further interpretation.
Also geographical vicinity gets transparent.
Hence the most straightforward way to visualize the data is placing the
collocates of the toponyms on a map which is how we built our first pro-
See https://github.com/Berico-Technologies/CLAVIN/ (accessed 6 March 2017).

2
See https://www.elastic.co/ (accessed 6 March 2017).

3
totype. This may be seen as the most literate translation of the term geo-
co-location because words that occur near each other in text are located
together on a map and share the same coordinates.
This representation is easily understood as it stays very close to the
data. Nevertheless, it enables the user to interactively explore the infor-
mation presented on the map. To facilitate this kind of explorations, we
provide a number of visual hints and controls:
• In order to see the distribution of a collocate and identify similarities

in the inventory of collocates between different toponyms we high-
light the word under the mouse pointer everywhere else on the map.
• The collocates are super-tagged for nominal, adjectival and verbal
word classes (identified as sets of part-of-speech tags), which are
assigned colour codes via display styles.
• The number of collocations identified for even decently sized corpora
quickly overgrows the limits that can be displayed in an orderly fash-
ion on a map. Hence, we introduce three means to further reduce this
amount. The first puts a threshold on the significance level for the
computed collocations so that only higher levels of significance above
the standard minimum (p < 0.05) are treated as significant and become
visible on the map (Fig. 9.1, menu item 3). The second thresholds col-
locates based on a minimum frequency (Fig. 9.1, menu item 4a) and
the last one simply cuts off the list of collocates for a toponym if it is
larger than the given number (Fig. 9.1, menu item 4b)
• To thematically narrow down the presented data, the user can type in
a regular expression which deletes all non-matching words from the
display (Fig. 9.1, menu item 5).
• The statistical computation of collocations regularly generates artefacts
such as the appearance of general high-frequency words (e.g. function
words) in collocate lists. To overcome this limitation and weed out
such candidates a linguistic filter can be applied that cleans out words
which do not belong to a set of given word classes (Fig. 9.1, menu item
1). But of course, this is also a means to focus on certain kinds of
collocates in order to narrow down to specific aspects of the discourse
centring around some geographic entity.
views
map, dorling, separated
label type
smart (show only if enough space), always, hidden
part-of-speech
all, nouns, adjectives, verbs
level of significance
frequency of the collocation

minimum frequency; top x
search collocates with string
dataset
number of collocations displayed
Fig. 9.1 Geocollocations control panel
• As an experimental feature, the UI lets users aggregate collocations for

countries (Fig. 9.1, menu item 2). We store with each computed geo-
collocation not only the toponym’s coordinates but also the country to
which it belongs. Aggregation is thus straightforward.
A second approach we have chosen is to leave the strict geographical

representation of the data. What we were interested in was finding a visu-
alization which better reflects discourse reality than geographic reality
(Fig. 9.2). There are a number of disadvantages with regular geographic
maps (Glasze 2009; Smith 1992). First of all, we are well used to world
maps that employ customary projections. However, they distort actual
geographic proportions in various ways. For example, the Mercator pro-
jection distorts areas so that countries near the equator appear smaller
than those in the temperate zone. A projection that has been criticized as
Eurocentric for decades, yet it is used by all major online map providers
and consolidates a dubious mental representation of the world. Secondly,
the actual size of geopolitical entities (i.e. most of all countries) intro-
duces a considerable visual bias that puts more weight on larger ones.
Thirdly, we are generally so accustomed to the shapes of countries which
are considered important that their salience on a map is so obtrusive that
an unprejudiced reading easily becomes obfuscated.
To overcome these shortcomings, we built an alternative visualization
(Fig. 9.3) using a Dorling diagram (Dorling 1993). In a Dorling dia-
gram, countries (or other entities) are represented as circles whose radius
depends on the dimension of interest. We currently scale the circles
according to the number of collocations associated with the respective
country. This visualization allows to grasp with a glimpse which countries
are associated with a large number of collocates in the underlying corpus
and abstracts away from their geographical size. In order for the user not
to completely lose orientation when switching between the different
modes of presentation, the position of the country circles is only moved
so far as not to overlap one another.
Another important information is the distribution of the collocates.
Some collocates are tightly attached to specific places, others appear
worldwide at different places and are less place-specific.
To explore such distributions we use yet another kind of diagram in
the maps—integrated into the Dorling diagram: A Sankey diagram (first
introduced by Sankey 1896) depicts the distribution of some quantity in
the form of a flow where line thickness indicates the share of the quantity
(see white lines in Fig. 9.3). From within the Dorling diagram, a coun-
try’s collocates can be selected for display in a Sankey diagram where the
quantity flow corresponds to the co-occurrences of an individual collo-
cate with all the toponyms in a corpus or more precisely the respective
countries. So the flow fans out from a collocate to all the countries with
which it co-occurs. Thus, for specific collocates we see lines with only a
few crutches whereas for non-specific ones the stem dissolves into many
thin lines.
3.4 Exploring the Data
The most important aspect of the visualization is its interactive and

exploratory nature. In this section we will give some insights we gained so
264
N. Bubenhofer et al.
Fig. 9.2 Geocollocations map view

Fig. 9.3 Dorling diagram view

far in working with the framework to provide some ideas of what kind of
research questions can be dealt with. Often the framework serves as a
means of developing hypotheses which must be evaluated in a second
step involving other methods.
The map view on the data reveals the focuses and the gaps of discourse-
driven world views. A good indicator in this direction is the number of
collocates attached to regions and locations in conjunction with the
observation, if a collocate is specific for a location or widely spread. The
Zeit and Spiegel news corpus of the time period 2001–2010 shows the
following results: Collocates like Stadt (city), Land (country), Jahr (year)
and the like are very generic. On the other hand, collocates like
Menschenrecht (human rights) or Flüchtling (refugee) are not generic
because they are attached to specific locations which have similar roles in
different discourses. Collocates like chinesisch (Chinese) or Obama are
very location specific.
The combination of the Dorling and Sankey diagrams are useful to see
the distribution of collocates. Figure 9.3 shows the correlations of some
collocates to states and in particular the collocate Frühling (spring) which
is selected. Frühling, of course, as part of the expression Arabischer
Frühling (Arab Spring), is attached to countries such as Algeria, Tunisia,
Libya, Syria, Egypt. Other collocates like Euro or Krieg (war) are used
more generically and have connections to a lot of countries. By clicking
on a country, the ten most frequent or specific collocates can be selected.
Choosing the top ten collocates of the United States shows that news
coverage about Northern America is dominated by some collocates used
mainly in the context of this country, although they are potentially
generic. Some examples are Kultur (culture), Museum (museum),
Universität (university) or Unterstützung (support). Press coverage about
China is dominated by generic collocates used to introduce locations
probably unknown to the reader. Examples are chinesisch (Chinese),
Provinz (province) or Hafenstadt (port).
To compare some specific countries, the Dorling view can be reduced
to just these countries (see Fig. 9.4). The collocate Beziehung (relation-
ship) has connections to Russia, China and the US (and other countries),
but not to Germany and France. It is an indicator for relations between
Germany and other countries which are sometimes strong and enduring,
The Linguistic Construction of World: An Example of Visual…
Fig. 9.4 Reduced Dorling view, comparison of selected countries: collocate Beziehung (relationship)
267
but nevertheless sometimes difficult (Germany and the US), or subject of

permanent doubt and efforts (Russia, China). Another example is Kultur
(culture) which correlates with Germany, the US and France (and other
countries), but not Russia and China. It seems that discourses in German
news coverage conceptualize Russia and China not as nations of notewor-
thy cultural heritage.
Changes in time can also be observed: Comparing three corpora from
1946 to 1960, 1991 to 2000, and 2001 to 2010 in the Spiegel/Zeit cor-
pus shows the traces of colonial history and Cold War indicated by col-
locates such as (1) britisch, französisch (British, French) and (2) sowjetisch,
kommunistisch and amerikanisch (Soviet, communist, American). The
distribution of these collocates, except amerikanisch, decreased dramati-
cally until 2010 and is only used very locally. Taking sowjetisch as an
example, it probably collocates with countries still suffering from Soviet
occupation during the Cold War (Eastern Europe, Georgia, Ukraine,
Pakistan etc.).
Using the corpus of the German parliament Bundestag reveals striking
differences in the world views between the political parties. Focussing on
the region of Afghanistan, Pakistan and India, the collocates reflect the
war in Afghanistan and the mission of the German army. Whereas politi-
cians of the centre-right CDU (which is part of the government) men-
tion Erfolg, Lage, Sicherheit, Einsatz (success, situation, security, mission)
together with Afghanistan, the left-wing party Die Linke uses the terms
Krieg, töten and Kampfdrohne (war, kill, battle drone).
Sometimes a closer view on the distribution of specific terms is useful.
To illustrate, imagine we were interested in the discourse about migration
and refugees in German newspapers. In the dataset of the Spiegel/Zeit
corpus of the time period from 2010 to 2016 we selected collocates
matching the following regular expression:
.*([Ff ]l[uü]cht|[Mm]igrant|[Mm]igration).*
This expression covers words containing the strings ‘flucht’, ‘flücht’,

‘migrant’ and ‘migration’ (refugee, migrant, migration). Figure 9.5 shows
the locations covered by these collocates: Both countries of the origin and
the target of migration are highlighted, but collocates attached to Europe
and Germany are the most frequent. Comparing this distribution to the
period after WWII (1945 to 1960, see Fig. 9.6), reveals the changes in
the discourse since then: Press coverage on Africa is much less dominated
by the migration topic, whereas the Americas (north, central and south)
play a role in the discourse. Obviously having a look at the collocates
shows the differences: In the discourse after WWII, the Americas are
countries of emigration, not migration or refugees.
Figure 9.7 shows the subtle differences of the recent discourse of migra-
tion. There are several derivations of the stems ‘Flucht’, ‘Flücht’, ‘Migration’
and ‘Migrant’: Flüchtlingslager (refugee camp), Flüchtlingspolitik (refugee
policy), Zuflucht (refuge), Bootsflüchtling (boat people), Flüchtlingswelle
(wave of refugees), Flüchtlingszahl (number of refugees) and many more.
For Europe and Germany, the variety of collocates attached to these
places is much higher than for the places of origin of the refugees. These
collocates indicate the discourse about domestic policy dominated by
metaphors and topoi provoking worries and fears of migration. Of course,
the variety of collocates attached to some places also shows the impor-
tance of this place in the discourse. Countries such as Bulgaria, Tunisia,
Albania or Ukraine show only Flüchtling as collocate, meaning that these
places are not in the focus of the migration discourse in Germany.
These examples, far away from extensive studies, must suffice to give
an impression of the possibilities. We wanted to show the manifold
approaches possible with this framework: They range from very broad
approaches interested in the big picture to narrow ones focussing on
specificities in some regions. It is possible to explore the data and observe
abstract concepts such as colonialism tightly attached to specific regions
or to study determined collocates and see their derivations as shown in
the short study about migration.
4 What Does the Tool Do with the Data?

When working with large data sets, visualizations may help to gain an
overview of the data. That is, indeed, an important goal of the geocolloca-
tion tool we have presented. But more important, visualization tech-
niques always have an effect on the data. They rearrange and transform
270
Fig. 9.5 Map view, selection of collocates Migration, Flüchtlinge (migration, refugees)—Spiegel/Zeit corpus 2010–2016
The Linguistic Construction of World: An Example of Visual…
Fig. 9.6 Map view, selection of collocates Migration, Flüchtlinge (migration, refugees)—Spiegel/Zeit corpus 1945–1960
271
272
Fig. 9.7 Close view on the collocates in the migration discourse—Spiegel/Zeit corpus 2010–2016
the data to another view, they are affected by diagrammatic operations

(see Sect. 1). The next section discusses some core transformations being
in effect and shows, why these transformations are specifically useful for
a discourse analytical approach.
But as visualizations of digital data heavily rely on techniques of pro-
gramming and software engineering, the influence of these techniques on
the visualizations and the interpretation of them will be tackled in the
next section but one.
4.1 Reconceptualizing Language
In Sect. 1 we elaborated the process of transformation by choosing

another diagram for the data. The complex geocollocations diagram is an
ensemble of transformations worth a look at. In general, there seems to
exist a set of core transformation types in play with linguistic data. In the
geocollocations diagram, the following types play important roles (see
Bubenhofer 2018 for details):
1. De- and recontextualization: Counting linguistic entities (e.g. words)

in a corpus is a process of de-contextualization whereas the original
contexts of the entities are discarded. At the same time building this
list of entities and their frequencies is an enrichment of a new context.
Therefore the end of the diagrammatic operation is a recontextualiza-
tion. Counting linguistic entities are just one example of recontextu-
alization; all different forms of reformatting the linguistic entities are
a form of recontextualization.
2. De-sequentialization: As language use follows a sequential order, pro-
cesses of recontextualization often appear together with a process of
de-sequentialization when linguistic entities are removed from the
syntagmatic order of the text.
3. Dimensional enrichment: Diagrammatic forms may enrich data with
further dimensions. This is the case with different forms of graphs,
trees and of course maps. The linguistic entities are connected with
other linguistic or extra linguistic entities.
4. Re-materialization: Diagrammatic transformations are able to turn

linguistic data into a new materiality. A collocation profile is a statisti-
cal aggregation of the distributional behaviour of the lexeme and
therefore constitutes a new object which can be read as the semantic
profile of the lexeme. Similarly, linguistic entities associated with a
map (enrichment of dimensions) constitute a new object of language
in situ.
In the case of the geocollocations diagram, a collection of texts is the

starting point of the process of analysis and visualization, followed by
these further steps:
1. Detection and listing of toponyms: De-contextualization and de-

sequentialization of toponyms appearing in the texts.
2. Aggregated recontextualization of these toponyms using their sur-
rounding contexts (calculation of collocates). This leads at the same
time to a re-materialization of the toponyms in form of collocation
profiles, interpretable as semantic profiles.
3. Dimensional enrichment is the effect of geocoding the toponyms and
displaying them on a map. In parallel, the mapping process leads to a
re-materialization of the linguistic data as discourse in situ.
4. Various interaction modes lead to dimensional enrichment in the
sense of linking toponyms, locations and collocates with other
information:
(a) Highlighting of the same collocates attached to different top-
onyms shows collocates being used globally or locally.
(b) Using the Dorling visualization of the locations, the size of the
nodes expresses the number of collocates.
(c) The Sankey visualization shows similarities of locations through
their collocates.
All these transformations reconceptualize the object of study, language
use. These transformations are possible thanks to the digital processing of
the data and the diagrammatic forms of expression and operation. In the
context of discourse analysis, it is important, if this reconceptualized lan-
guage model is nourishing for discourse analysis. As we have already
described (Bubenhofer 2009), and also has been widely discussed in cor-
pus linguistic approaches to discourse analysis (Spitzmüller and Warnke
2011; Teubert 2005; Sinclair 2004; Mautner 2012; Bubenhofer et al.
2015; Felder et al. 2011; Lebart and Salem 1994; Glasze 2007; Scholz
2010 and many more), this seems to be the case: Discourse analysis seeks
to reveal systems of énoncés, statements, that break the borders of texts.
Aggregated recontextualizations are one form to break down texts in
smaller entities and to rearrange them to detect similarities of passages
across texts. The dimensional enrichment has the potential to unveil a
systematic organization of énoncés. Coercing linguistic data into a new
materiality gives the chance to see in a system of énoncés an emergent
phenomenon that potentially can be interpreted as discourse. At least for
corpus linguistic approaches to discourse, especially approaches being
data-driven, having language surface as a starting point is the nucleus of
corpus pragmatics (Bubenhofer and Scharloth 2014; Feilke 2000; Feilke
and Linke 2009). These pointers must suffice to give an impression of the
nourishing qualities of the above-described transformations.
4.2 Influence of Coding Cultures
Building visualizations such as the one for working with geocollocations

demands the use of technology. For humanists, technology normally is
not of big interest. In Sect. 2.1 we have already described, that this leads
to the separation of making and using tools, a separation that should be
overcome. But what in our case of the geocollocations tool are the tech-
nological premises that influence the application of the tool?
Programming is always culturally shaped—programming takes place
in specific cultural settings and choosing a programming framework or
language is not merely a technical decision, but one about the culture the
programmers feel attracted or belonging to (Coleman 2012; Manovich
2014; Ford 2015). People writing software often argue in favour or
against a specific language using aesthetic or ideological topoi, despite the
utilitarian tendencies normally dominating programming discourses (see
Sect. 2.3). The following statement about using the scripting language R
for visualizations illustrates this:
The visualizations you can create in R are much more sophisticated and
much more nuanced. And, philosophically, you can tell that the visualiza-
tion tools in R were created by people more interested in good thinking
about data than about beautiful presentation. (The result, ironically, is a
much more beautiful presentation, IMHO). (Milton 2010)
This is just one example of a whole range of similar statements that

link the practice of programming (with a specific language) with catego-
ries like ‘sophisticated’, ‘good thinking’, ‘beauty’ and so on. The impor-
tant question here is not, if these statements are true or not, but how the
cultural settings in which programming takes place influence the result-
ing software.
In our case, the tool mainly uses the Javascript library ‘D3’ (Bostock
et al. 2011), a library widely used for data-driven visualization applica-
tions. Javascript is a scripting language, invented right from the begin-
ning around 1995 to display interactive web content (see Wikipedia
‘JavaScript’). Its source code does not get compiled before execution, but
is executed in the client’s browser and therefore the code is completely
visible to the user. It can be copied, modified, and reused by others for
their own projects. In the case of D3 this led to a constantly growing col-
lection of code bits and examples demonstrating the use of this library for
data visualization purposes. The D3 website (www.d3js.org) contains a
‘visual index’ gallery listing hundreds of square thumbnail pictures with
names of visualization types (see Fig. 9.8). They do not follow a thematic
order, they are probably ordered chronologically, starting with well-
known names like ‘box plot’, ‘bullet chart’ and so on and coming to new
or unknown forms like ‘ulam spiral’, ‘aster plot’, ‘SOM hexagonal heat-
map’ and many more. The code to reproduce the visualizations is just one
click away and can easily be adapted to other data.
How does this programming culture mostly following the principles of
open source influence academic use of visualizations? Sources like the
aforementioned visual index (and a lot of other projects being presented
online, for example, in reddit.com groups such as ‘data is beautiful’, ‘map
porn’ or ‘infographics’) are very inspiring and a possibility to find possible
visualizations for one’s own data. But this reverses the process normally
proposed in visual analytics which starts with a clear problem definition
Fig. 9.8 Javascript library ‘D3.js’, ‘visual index’ of examples on the website
and analysis before finding visualization solutions (Keim et al. 2010,
119). The alternative ‘coding culture’ simplifies experimenting with data
and visualizations and promotes showing these experiments to a big audi-
ence uninformed about the data and exact goals of the visualization. The
visualization itself, as a result of transcriptive processes (Jäger 2007), gen-
erates Eigensinn (Jäger 2005, 140), the visualization is self-referential.
The visual forms we used for the geocollocations tool are of course
inspired by traditional forms (map, Sankey, Dorling) that have been
transformed to interactive browser-based versions. What if we had cho-
sen another programming language, for example, R, also used for visual-
ization tasks but used mainly for static visualizations and curating a
different culture of visualization examples? Do we justify our selection of
the programming language with cultural arguments? Probably not, even
if they influence every selection, but they are not accepted as valid argu-
ments in academics.
Most of the D3 examples are interactive in the sense that the user can
influence the way things are displayed by moving, pointing or clicking
the mouse or hovering over an element. The possibilities of Javascript
(and the sister technologies HTML5 and Cascading Style Sheets [CSS])
and the structure of the D3 library suggest or almost coerce the program-
mer to enable these modes of interactivity. Also built-in are functions for
animating elements and smoothing the transition between the states. In
our example the transition from the traditional map to the Dorling dia-
gram is a smooth transition from countries to nodes implying countries
and nodes are the same. Even if this makes sense, by using this technology
it becomes most probable that while programming the tool, such an
effect is being activated without much thinking about it.
We have to stop the discussion here (and have to refer to Bubenhofer
2016, 2018; Bubenhofer and Scharloth 2015), but hope having shown:
The methodological reflections and the sensitivity discourse analysts nor-
mally have for the discursive and cultural settings surrounding them
must be broadened if digitality and visualizations enter the arena. Kath
et al. (2015) propose a ‘new visual hermeneutics’ as a starting point, but
especially the influence of technological cultural settings have not been
discussed sufficiently yet.
5 Conclusions
In its current state, the geocollocation explorer is a framework which can
be used to explore linguistic data reflecting discourses related to geogra-
phy. But more than that, we used the development of the framework to
reflect upon the effect of diagrammatic operations on linguistic data and
the influence of ‘coding cultures’ on the technical implementation of
visualizations. Development and using of the framework went hand in
hand as an iterative process, where data exploration led to new ideas for
visualization modes, such as abstracting the representation of countries
from geography by turning them into a combination of a Dorling and
Sankey diagram. The process is ongoing and will also in the future refrain
from being teleologic: The further development remains unpredictable.
We consider it critical for humanists to be interested in the technical
and algorithmic details of visual analytics and for developers to provide
for the involvement of humanists. Regarding well established and theo-
retical sound methods of visual analytics and scientific visualizations,
we doubt some of the premises of the methods if they are applied in the
humanities. The methods of visual analytics often follow idealist ten-
dencies in computing. These tendencies incorporate goals in research
that do not necessary match the ones in the humanities. One example
are the differences between data mining and discourse analysis, two
approaches sharing similar methods and research tools, for example,
tools similar to the geocollocations framework. In data mining, the
goal is to efficiently find the right document or categories the docu-
ments with the right labels—following the ‘gold standard’ defined
before—in discourse analysis the task is much more complicated:
Researchers using Foucauldian discourse analysis, approaches in the
paradigms of constructivism or deconstructivism and the like would
mistrust the very idea of being capable to define a gold standard. Tools
and methods that allow new perspectives on or new readings of the data
are of much greater interest for them. We have shown the importance
of diagrammatic operations as one possibility to do that, for example,
in breaking up the unity of texts to find énoncés. The tool then is not
just a tool, but the essential part of the methodological approach itself.
Hence for scholars in the humanities, a critical stance on tools being

developed in disciplines following different logics of research is crucial.
The culture for the development of such tools is embedded in and has
an influence not only on the functionality of the tool, but also on the
hermeneutical process triggered by the tool. However, the chances
offered by visual analytic methods and other data mining approaches
prevail over the risks—if they are reflected critically under the premises
of the humanities.
References
Blätte, Andreas. 2013. PolMine-Plenardebattenkorpus (PolMine—German
parliamentary debates corpus). Accessed June 29, 2018. http://polmine.sowi.
uni-due.de/daten.html.
Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3: Data-driven
documents. IEEE Transactions on Visualization & Computer Graphics (Proc.
InfoVis). Accessed June 29, 2018. http://vis.stanford.edu/papers/d3.
Brezina, Vaclav, Tony McEnery, and Stephen Wattam. 2015. Collocations in
context. A new perspective on collocation networks. International Journal of
Corpus Linguistics 20 (2): 139–173.
Bubenhofer, Noah. 2009. Sprachgebrauchsmuster. Korpuslinguistik als Methode
der Diskurs- und Kulturanalyse. Berlin and New York: De Gruyter.
———. 2015. Muster aus korpuslinguistischer Sicht. In Handbuch Satz –
Äußerung – Schema, ed. Christa Dürscheid and Jan Georg Schneider,
———. 2016. Drei Thesen Zu Visualisierungspraktiken in den Digital
Humanities. Rechtsgeschichte Legal History—Journal of the Max Planck
Institute for European Legal History 24: 351–355.
———. 2018. Visual linguistics: Plädoyer für ein neues Forschungsfeld. In
Visual linguistics, ed. Noah Bubenhofer and Marc Kupietz, 25–62. Heidelberg:
Heidelberg University Publishing.
Bubenhofer, Noah, and Joachim Scharloth. 2014. Korpuspragmatische
Methoden Für Kulturanalytische Fragestellungen. In Kommunikation Korpus
Kultur: Ansätze Und Konzepte Einer Kulturwissenschaftlichen Linguistik, ed.
Nora Benitt, Christopher Koch, Katharina Müller, Lisa Schüler, and Sven
Saage, 47–66. Trier: WVT.
———. 2015. Maschinelle Textanalyse im Zeichen von Big Data und data-
driven Turn – Überblick und Desiderate. Zeitschrift Für Germanistische
Linguistik 43 (1): 1–26.
Bubenhofer, Noah, Joachim Scharloth, and David Eugster. 2015. Rhizome digi-
tal: Datengeleitete Methoden Für Alte Und Neue Fragestellungen in Der
Diskursanalyse. Zeitschrift für Diskursforschung, Sonderheft Diskurs,
Interpretation, Hermeneutik 1: 144–172.
Chen, Chun-houh. 2008. In Handbook of data visualization, ed. Wolfgang
Härdle and Antony Unwin. Berlin: Springer.
Coleman, E. Gabriella. 2012. Coding freedom: The ethics and aesthetics of hack-
ing. Princeton, NJ and Oxford: Princeton University Press.
Dorling, Danny. 1993. Map design for census mapping. The Cartographic
Journal 30 (2): 167–183.
Evert, Stefan. 2009. Corpora and collocations. In Corpus linguistics. An interna-
tional handbook, ed. Anke Lüdeling and Merja Kytö, 1212–1248. Berlin and
New York: De Gruyter.
Faruqui, Manaal, and Sebastian Padó. 2010. Training and evaluating a German
named entity recognizer with semantic generalization. In Proceedings of
KONVENS 2010, 129–134.
Feilke, Helmuth. 2000. Die pragmatische Wende in der Textlinguistik. In Text-
und Gesprächslinguistik/Linguistics of text and conversation, ed. Klaus Brinker,
Feilke, Helmuth, and Angelika Linke, eds. 2009. Oberfläche Und Performanz.
Untersuchungen Zur Sprache Als Dynamische Gestalt. Berlin and New York:
De Gruyter.
Felder, Ekkehard, Marcus Müller, and Friedemann Vogel. 2011. Korpuspragmatik:
Thematische Korpora als Basis diskurslinguistischer Analysen. Berlin and
Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005.
Incorporating non-local information into information extraction systems by
Gibbs Sampling. In Proceedings of ACL, 363–370.
Fleck, Ludwik. 1980. Entstehung und Entwicklung einer wissenschaftlichen
Tatsache: Einführung in die Lehre vom Denkstil und Denkkollektiv. Frankfurt/
Main: Suhrkamp.
Ford, Paul. 2015. What is code? If you don’t know, you need to read this.
Businessweek, June. Accessed June 29, 2018. http://www.bloomberg.com/
whatiscode/.
Foucault, Michel. 1966. Die Ordnung der Dinge: Eine Archäologie der
Humanwissenschaften. Frankfurt/Main: Suhrkamp.
Friendly, Michael. 2005. Milestones in the history of data visualization: A case

study in statistical historiography. In Classification: The ubiquitous challenge,
ed. Claus Weihs and Wolfgang Gaul, 34–52. New York: Springer.
Friendly, Michael, and Daniel J. Denis. 2001. Milestones in the history of the-
matic cartography, statistical graphics, and data visualization. Accessed June
29, 2018. http://www.datavis.ca/milestones/.
Fuller, Matthew. 2003. Behind the blip: Essays on the culture of software. New York:
Autonomedia.
Glasze, Georg. 2007. Vorschläge zur Operationalisierung der Diskurstheorie
von Laclau und Mouffe in einer Triangulation von lexikometrischen und
interpretativen Methoden. Forum Qualitative Sozialforschung/Forum:
Qualitative Social Research 8 (2): Art. 14. Accessed June 29, 2018. http://
nbn-resolving.de/urn:nbn:de:0114-fqs0702143.
———. 2009. Kritische Kartographie. Geographische Zeitschrift 97 (4):
181–191.
Glasze, Georg, and Annika Mattissek. 2014. Diskursforschung in der
Humangeographie. In Diskursforschung. Ein interdisziplinäres Handbuch.
Band 1: Theorien, Methodologien und Kontroversen, ed. Johannes Angermuller,
Martin Nonhoff, Eva Herschinger, Felicitas Macgilchrist, Martin Reisigl,
Juliette Wedl, Daniel Wrana, and Alexander Ziem, 208–223. Bielefeld:
Transcript.
Goffey, Andrew. 2014. Technology, logistics and logic: Rethinking the problem
of fun in software. In Fun and software: Exploring pleasure, paradox, and pain
in computing, ed. Olga Goriunova, 21–40. New York: Bloomsbury Academic.
Jäger, Ludwig. 2005. Vom Eigensinn des Mediums Sprache. In Brisante
Semantik. Neuere Konzepte und Forschungsergebnisse einer kulturwissenschaftli-
chen Linguistik, ed. Dietrich Busse, Thomas Niehr, and Martin Wengeler,
45–64. Tübingen: Niemeyer.
———. 2007. Transkriptive Verhältnisse. Zur Logik intra- und intermedialer
Bezugnahmen in ästhetischen Diskursen. In Transkription und Fassung in der
Musik des 20. Jahrhunderts: Beiträge des Kolloquiums in der Akademie der
Wissenschaften und der Literatur, Mainz, vom 5. bis 6. März 2004, ed. Gabriele
Buschmeier, Ulrich Konrad, and Albrecht Riethmüller, 103–134. Stuttgart:
Steiner.
Kath, Roxana, Gary S. Schaal, and Sebastian Dumm. 2015. New visual herme-
neutics. Zeitschrift für Germanistische Linguistik 43 (1): 27–51.
Keim, Daniel A., Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann.
2010. Mastering the information age: Solving problems with visual analytics.
Goslar: Eurographics Association.
Keim, Daniel A., Florian Mansmann, Jörn Schneidewind, and Hartmut Ziegler.
2006. Challenges in visual data analysis. In Proceedings of Tenth International
Conference on Information Visualization (IV’06), 9–16.
Lebart, Ludovic, and André Salem. 1994. Statistique textuelle. Paris: Dunod.
Manovich, Lev. 2014. Software is the message. Journal of Visual Culture 13 (1):
79–81.
Mautner, Gerlinde. 2012. Corpora and critical discourse analysis. In
Contemporary corpus linguistics, ed. Paul Baker, 32–46. London and New York:
Continuum.
Milton, Michael. 2010. When to use Excel, when to use R. Webpage. Accessed
March 27, 2017. http://www.michaelmilton.net/2010/01/26/when-to-use-
excel-when-to-use-r/.
Moretti, Franco. 2000. Conjectures on world literature. New Left Review 1:
54–68.
Sankey, Henry R. 1896. The thermal efficiency of steam-engines. (Including
appendixes). Minutes of the Proceedings of the Institution of Civil Engineers
125: 182–212.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision
trees. In Proceedings of International Conference on New Methods in Language
Processing, Manchester, UK.
———. 1999. Improvements in part-of-speech tagging with an application to
German. In Natural language processing using very large corpora, ed. Susan
Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne
Tzoukermann, and David Yarowsky, 13–25. Dordrecht: Springer Netherlands.
Scholz, Ronny. 2010. Die diskursive Legitimation der Europäischen Union: Eine
lexikometrische Analyse zur Verwendung des sprachlichen Zeichens Europa/
Europe in deutschen, französischen und britischen Wahlprogrammen zu den
Europawahlen zwischen 1979 und 2004. Magdeburg, Univ., Fak. für Geistes-,
Sozial- und Erziehungswiss, Magdeburg. Accessed June 29, 2018. http://
edoc2.bibliothek.uni-halle.de/hs/urn/urn:nbn:de:101:1-201108243629.

Sedlmair, Michael, Miriah Meyer, and Tamara Munzner. 2012. Design study
methodology: Reflections from the trenches and the stacks. IEEE Transactions
on Visualization and Computer Graphics 18 (12): 2431–2440.
Shneiderman, Ben. 1996. The eyes have it: A task by data type taxonomy for
information visualizations. In Proceedings of 1996 IEEE Symposium on Visual
Languages, 3–6 September 1996, 336–343.
Sinclair, John. 2004. Trust the text. Language, corpus and discourse. London:
Routledge.
Smith, Neil. 1992. History and philosophy of geography: Real wars, theory
wars. Progress in Human Geography 16 (2): 257–271.
Spitzmüller, Jürgen, and Ingo H. Warnke. 2011. Diskurslinguistik: Eine
Einführung in Theorien und Methoden der transtextuellen Sprachanalyse. Berlin
and New York: De Gruyter.
Steyer, Kathrin. 2013. Usuelle Wortverbindungen: Zentrale Muster des
Sprachgebrauchs aus korpusanalytischer Sicht. Tübingen: Narr Francke
Attempto.
Teubert, Wolfgang. 2005. My version of corpus linguistics. International Journal
of Corpus Linguistics 10 (1): 1–13.
———. 2006. Korpuslinguistik, Hermeneutik und die soziale Konstruktion
von Wirklichkeit. Linguistik Online 28 (3): 41–60.
Thomas, James J., and Kristin A. Cook, eds. 2005. Illuminating the path: The
research and development agenda for visual analytics. Los Alamitos, CA: IEEE
Computer Society.
Zhang, Jin. 2007. Visualization for information retrieval. Berlin and Heidelberg:
Springer.
10
Multi-method Discourse Analysis
of Twitter Communication:
A Comparison of Two Global Political
Issues
Jörn Stegmeier, Wolf J. Schünemann, Marcus Müller,
Maria Becker, Stefan Steiger, and Sebastian Stier
J. Stegmeier (*) • M. Müller

Department of Language and Literature Studies, Technische Universität
Darmstadt, Darmstadt, Germany
e-mail: stegmeier@linglit.tu-darmstadt.de; mueller@linglit.tu-darmstadt.de
W. J. Schünemann
Department for Social Sciences, Hildesheim University, Hildesheim, Germany
e-mail: wolf.schuenemann@uni-hildesheim.de
M. Becker
Department of Computational Linguistics, Heidelberg University,
Heidelberg, Germany
e-mail: mbecker@cl.uni-heidelberg.de
S. Steiger
Institute of Political Science, Heidelberg University, Heidelberg, Germany
e-mail: stefan.steiger@ipw.uni-heidelberg.de

286 J. Stegmeier et al.
1 Basic Assumptions and Research Interest

In this chapter, we apply three different methods of discourse analysis, in
the broadest sense of the term, to Twitter communication in order to get
a multi-perspective view on the formation of global discourse on two
much-discussed topics of political concern: net neutrality and climate
change. With our multi-method design, we can better understand how
political, social, thematic, and technical aspects intertwine in inter- or
transnational discourse formation. The methods we apply are geoloca-
tion, network analysis, and keyword analysis. Among these, geolocation
and network analysis are well-established methods of data science that
have been applied to the analysis of Twitter communication in social sci-
ence and language studies in recent years. For the classification of users
according to their national background, we rely on the results of the geo-
location analysis conducted beforehand. Finally, keyword analysis only
seldom appears in Twitter studies. There are very few studies that include
keyword analyses of tweets (e.g. McEnery et al. 2015; Baker and McEnery
2015). However, we can build on their pioneering work when we inter-
pret our findings within the framework of discourse studies.
We have chosen the three quantitative methods above all for two rea-
sons. First, they are complementary, as one method might balance out
the technical limitations and shortcomings of the others. Limitations are,
of course, numerous and diverse: geolocation analysis, for instance, can-
not incorporate the tweets of users who have disabled geotagging in their
settings. Network analysis requires a number of technical decisions: a
suitable centrality measure and a visualisation mode need to be chosen
before one can even start with the demanding task of interpretation. It is
important to note that geolocation and network analysis in combination
S. Stier
Institute of Political Science, NRW School of Governance, University of
Duisburg-Essen, Duisburg, Germany
Department of Computational Social Science, GESIS – Leibniz Institute for
the Social Sciences, Cologne, Germany
e-mail: sebastian.stier@gesis.org
Multi-method Discourse Analysis of Twitter Communication… 287
can only deliver insights into the actor-based structuration of the global
debates and, thus, the preconditions for the transnational circulation and
diffusion of discursive patterns. For the calculation of keywords, an ade-
quate reference corpus is needed, as it more or less determines the value
of the resulting keyword lists (cf. the discussion of the individual meth-
ods below). Of course, these kinds of decisions cannot be avoided in
empirical research. Each decision must be made and justified per se.
However, by mixing the methods, we get views on our data from differ-
ent perspectives. We might be in a better position to judge whether an
analysis based on a certain configuration is an artefact of research, needs
to be explained by a hidden variable, or gives us reliable results. Moreover,
it seems particularly important to us that we apply our methods not only
to one corpus but to two corpora which are then compared. If we keep
constant the configuration of the tools we use and the principles of apply-
ing them, we can be more certain that the variation we find has mostly to
do with our research object and not so much with our methods. Thus, a
combination of methods and the comparison of different corpora are
helpful for both: a good interpretation of our findings and a constant
check on our methodology.
Second, we take advantage of the fact that each method puts a specific
focus on our data. Geolocation highlights national differences. Network
analysis gives insights into the relations between users without taking
into account thematic aspects. Finally, keyword analysis brings us closer
to the subject matter of Twitter discourses. That is to say, our mix of
methods is designed as a funnel leading progressively from geographical
to social aspects and on to the conceptual level of discourse.
In the following, we use the term ‘discourse’ in the general sense as
‘language in use’ (e.g. Fasold 1990). This needs to be specified: discourse
is formed by language patterns to be analysed as traces of social interac-
tion (Müller 2012, 2015). Discourse patterns indicate the individual,
social, and collective knowledge of the speakers (Felder and Müller 2009).
Our underlying aim is thus to learn about the interrelationship of lan-
guage, knowledge, and society in Twitter communication. Aside from a
measure of relative influence in networks, which is of course heavily
dependent on an actor’s speaker position and institutionalised influence
both online and offline, we do not delve deeper into power effects and
power/knowledge relations. In line with the conception of this volume,

we concentrate on methodological issues in this chapter. We are particu-
larly interested in whether the adaptation and combination of the meth-
ods shortly described above help us to get a better and more comprehensive
understanding of our subject or whether they rather produce unrelated
results.
From their very beginnings, internet development in general and the
more recently established social media applications have triggered opti-
mistic expectations regarding a transnationalisation of public spheres
among political and academic observers of international relations and
processes of global communication. This optimism is attached to a trans-
nationalisation hypothesis that emphasises the internet’s status as a
medium of global dimensions. It thus expects internet development to
contribute to a transnationalisation of social communication and,
thereby, also of political publics due to its technical architecture, the cul-
tural globalisation that is promoted by it, or at least—as more cautious
observers would state—the development of a ‘public of publics’ (Bohman
2007). For this chapter, our aim is to understand whether social networks
like Twitter can be seen as catalysts of transnationalisation or not. To
answer this question, it should be applied to both the social and concep-
tual dimensions of Twitter communication. For both paths of theoretical
reflection, the social media platforms and applications so central to the
second phase of internet development (Web 2.0) are of significant impor-
tance, as they suggest a coming capacity for social integration beyond
national societies. Thus, they seem able to compensate in cyberspace for
the loss of communicative cohesion in dissolving national societies.
Among the diverse tools, platforms, and channels that would belong
to the class of social media, we deliberately selected Twitter. From a theo-
retical perspective, it can be argued that Twitter constitutes a special
channel of issue-oriented communication that is most able to serve as a
functional equivalent to traditional media (Kwak et al. 2010). Moreover,
it is used intensively by political actors across the (democratic) world for
the dissemination of policy ideas, the communication of news, and the
organisation of collective action (Jeffares 2014). There are signs that
social media has the potential to set the agenda of public debates (Neuman
et al. 2014). While Twitter incorporates ambitions towards a more acces-
sible and intensive culture of political communication and a diversifica-

tion of the media system, it is important to note that among the most
active Twitter users are societal and political elites like journalists and
actors from party politics (Freelon and Karpf 2015). These influential
actors transfer political issues and their interpretations from Twitter to
the mass media (Chadwick 2013; Freelon and Karpf 2015).
The study of Twitter communication comes with some methodologi-
cal caveats (Boyd and Crawford 2012; Ruths and Pfeffer 2014). First,
there is a ‘platform bias’ as Twitter usage differs significantly across the
world: while in 2014, 23% of the US online population used Twitter
(Duggan et al. 2015), only 9% of German netizens used Twitter from
time to time (van Eimeren and Frees 2014). Furthermore, there is a ‘lan-
guage bias’, as by our query terms we almost exclusively collected English-
language tweets. As we considered the variation in media usage,
population size, and language usage, the clear predominance of Anglo-
Saxon countries in ‘transnational’ Twitter communication had been
expected from the start. Of course, the Twitter population and its sub-
group of politically active users are not representative mirrors of offline
populations. However, these limitations do not affect our research that
much, as we put our focus on the comparison of transnationalisation in
both policy fields that should be biased in the same way by the general
communicative patterns of Twitter.1 Moreover, the structural biases add
up to the most likely design, which is obvious in our case selection any-
way (see below).
Our investigation is not only characterised by a mix of methods but
also by an interdisciplinary authorship coming from political science,
computational social science, and linguistics. Being interested in the
social and political aspects of verbal communication, we think inter- or
1
In general, the analysis of social behaviour on the internet suffers from uncertainty, which is inher-
ent to the medium (Boyd and Crawford 2012; Ruths and Pfeffer 2014). The streaming of tweets
using the API, for instance, is restricted to 1% of real time Twitter traffic. However, this threshold
has not been passed at any time during our study. Moreover, besides relevant messages, communi-
cation in social networks produces a lot of ‘noise’, e.g. spam and automated messages sent from bots
that distort political discourses. For this reason, the accounts @All4NeutralNet and @
RealNeutralNet set up by activists from Demand Progress were excluded from data collection, since
they sent the same citizen petitions to Republican politicians and President Obama in an infinite
loop.
even transdisciplinary collaboration is not only particularly important

but also the only way to understand how language use and its social and
political context interrelate. As we are confronted with rather new phe-
nomena of social communication, the development of innovative meth-
ods is not and should not be confined to the borders of academic
disciplines.
As methodology stands central in this chapter, the political science
background of our research questions and the corresponding interpreta-
tion of our findings are only presented to the extent that is necessary to
follow the methodological discussion. For the same reason, we only
shortly describe the two policy fields that our corpora are derived from in
Sect. 2, in which we present the data collection process. Sections 3, 4 and
5 focus on the application of the three methodologies that we used to
analyse our corpora. We end our text with a short conclusion.
2 Object of Investigation and Corpus

Even though the two fields of environmental policy and internet gover-
nance differ with regard to political conflict and judicial governance,
there is nevertheless a common point in how both broader fields have
shown a higher degree of transnational mobilisation in earlier conflicts.
While net neutrality is one of the core questions dominating the agenda
of the transnational community of internet activists that has a high level
of mobilisation potential, climate change is the paradigm case of a global
political challenge that reaches high degrees of mobilisation, also
through already institutionalised actors operating transnationally. It is
important to note that for our comparative design we deliberately
selected most likely cases of transnationalisation. This can be seen also as
precaution regarding the hashtag and language biases in our data collec-
tion. While there is much more international or transnational and mul-
tilingual communication o ccurring on social media like Twitter, it is
most likely to find structural traits of transnational discourse when
streaming predominantly English content via the globally used, yet
English, hashtags.
2.1 Net Neutrality
Net neutrality is a core issue of internet governance. It has become a

contentious issue of political debate in many countries across the devel-
oped world. The community of so called netizens fights against the new
business models of telecommunication companies, in which they see an
assault on the original idea of the internet. Net neutrality pleas for a
nationally or internationally guaranteed commitment to the principle
of an equal transfer of data packages and the prohibition of any zero-
rating services, which privilege certain content providers over their
competitors. The principle of net neutrality and its alleged endanger-
ment thus matters for the liberal (or libertarian) self-understanding of
netizens. The massive mobilisation of the netizen community against
the anti-counterfeiting agreement ACTA might serve as a comparable
case. After years of negotiations and at the end of an international deci-
sion process, the effectively articulated protests of the transnationally
networked user groups caused an important political turnaround
(Kneuer 2013, 7).
In contrast, the movement for net neutrality is clearly supported by
some governments, such as the US government, which has a clear leader-
ship role in internet governance issues and which, according to its self-
conception, takes on the role of the defender of a free internet.
Consequently, even in this field, US legislation and especially the highly
expected regulation of the oversight commission FCC (Federal
Communications Commission) in 2015 serve as a worldwide role model
for regulations on this issue.
However, it is important to note that this aspect of telecommunication
regulation between service providers and users could effectively be made
and transposed on the national level. In contrast to other problems of
internet governance and certainly to the challenges of climate change, the
net neutrality issue does not necessarily require cross-border or interna-
tional regulation. However, the fundamental regulation question is
currently on the table in many countries and regularly surfaces in day-to-
day politics.
2.2 Climate Change
Environmental crises and catastrophes either naturally caused or man-

made are traditionally seen as the clearest cases for the necessity of trans-
national regulation (Pries 2010).
Contemporary environmental problems are perhaps the clearest and stark-

est examples of the global shift in human organization and activity, creat-
ing some of the most fundamental pressures on the efficacy of the
nation-state and of state-centric politics. (Held 1997, 258)
Moreover, environmental movements have been established world-

wide with the help of globally reported media events. These activists were
somehow ‘at the forefront regarding the use of new information tech-
nologies and especially the internet as tools for the organisation and
mobilisation’ (Castells 2002, 141).
Climate change policy channels the mostly shared acceptance of the
devastating effects of environmental pollution into a global political regu-
lation effort. Against this backdrop, it is not surprising that in a compara-
tively early phase—especially after the 1992 UN Conference on
environment and development in Rio de Janeiro—institutionalised inter-
national fora and actors were established for this policy field. Thereby,
international climate policy is the paradigmatic example of a classical glo-
balised agenda. It serves as a good example for comparison, since political
stakeholders already promoted transnational activities and initiatives at an
early stage (Kielmansegg 2013, 257). While business interests were domi-
nant in the beginning, an observation over time indicates that interest
representation at global climate conferences has since become as diverse
and institutionalised as it is for national policy fields (Hanegraaff 2015).
2.3 Data
For the automated data collection of this study, we streamed tweets

(Twitter messages) containing the hashtags #NetNeutrality and
#ClimateChange between 14 January and 6 March 2015. Thereby, we
considered Twitter hashtags as topical query terms that within their com-
munication contexts serve ‘as a vehicle for otherwise unconnected partici-

pants to be able to join in a distributed conversation’ (Bruns and Burgess
2011, 49). The hashtags chosen for our study are the central ‘issue con-
tainers’ for the transnational debates on the two policy fields. However,
issue-related discussion elements that do not use these hashtags as well as
related hashtag ‘populations’ that have emerged during the debate, like
#OpenInternet, have not been queried under these selection criteria. On
the other hand, only such a careful approach allows for an equal consider-
ation of both policy fields. Applying the R package streamR (Barberá
2014), we collected the metadata and content of 884,729 tweets and
retweets. For further analysis, we parsed the data into data frames in
R. First, we conducted the geolocation analysis using the DataScienceToolkit
(DSTK). Since most tweets do not contain information on geolocation,
DSTK is needed to assign tweets to specific countries. DSTK provides an
extensive database with geolocation information that is used to identify
the origin of as many tweets as possible by checking information provided
in the metadata of all tweets. After running the DSTK, we were able to
ascribe 55% of the data to a country. For #ClimateChange, we collected
380,890 tweets, for #NetNeutrality 503,839 tweets. We were able to
determine the geolocation for 54% of #ClimateChange tweets and 56%
of NetNeutrality tweets, so there is not much difference in geolocation
coverage in relative terms. Being interested in processes of transnationali-
sation, we worked exclusively with the geolocated tweets, restricting our
analyses to the ten most frequent countries in our corpus (cf. below). The
number of geolocated tweets and Retweets are shown in Table 10.1.
We used the visualisation software Gephi for the network analysis.
Regarding step three, discourse analysis, we chose a corpus-analytical
methodology, as the sheer amount of collected data and the reduced con-
tent and language of Twitter messages plausibly suggest an automated
approach instead of a qualitative analysis.
Table 10.1 Geolocated tweets and retweets of the ten most frequent countries
Tweets Retweets Total
#ClimateChange 80,324 88,336 168,660
#NetNeutrality 117,786 125,438 243,224
Total 198,110 213,774 411,884
3 eolocation Analysis and the Global

G
Distribution of Tweets
Because we streamed English hashtags, the vast majority of the tweets we
received are also in English. To a limited extent there are also tweets writ-
ten in other languages. This is not problematic for a network analysis, but
it is indeed a problem for quantitative linguistic research. Therefore,
though we based the later network analysis on tweets of all languages, the
corpus linguistic studies were based only on English tweets (see below).
For this reason, we give two different results of the geolocation analysis:
Figures 10.1 and 10.2 show the global distribution of English tweets for
both hashtags, considering the ten most represented countries. Eighty-
four per cent of all tweets with #NetNeutrality were sent from the United
States, and other than some developed democracies, only very few coun-
tries produced more than 0.25% of tweets. Net neutrality thus seems to
be an issue predominantly for users from highly developed democratic
countries. The degree of transnationalisation (i.e. on this analytical level:
international participation) for the climate change discourse appears
#ClimateChange
Australia
7% Brazil
1%
Canada
13% France
3%
Germany
1%
India
2%
USA Irland
60% Italy 1%
1%
UK
11%
Fig. 10.1 Tweets on #ClimateChange

Canada France
#NetNeutrality 1% Germany
4%
Brazil 1% India
1% 2%
Italy
Australia
1%
1%
Mexico
1%
UK
4%
USA
84%
Fig. 10.2 Tweets on #NetNeutrality
higher. This points to the fact that there are also users from countries in
Africa (e.g. Kenya, Nigeria, South Africa), Asia, and South America that
are directly affected by climate change.
This first finding already indicates that both policy fields diverge
regarding Twitter communication patterns. Dynamic developments of
the policy debate in the USA during the research period are clearly
reflected by the activities in the #NetNeutrality sample. On February 4,
2015, in an interview published by the magazine Wired and on Twitter,
FCC Chairman Tom Wheeler announced his decision to advocate for the
principle of net neutrality in his regulation proposal, which caused a
strong increase in activity on Twitter.2 Mass media also reported compre-
hensively. For the New York Times, the FCC decision concluded the ‘lon-
gest, most sustained campaign of Internet activism in history’. Civil
rights organisations and telecommunication firms commented on the
decision from their respective positions. Finally, political actors such as
President Obama, Senator John McCain, and the Speaker of the House
2
#Tom Wheeler (@TomWheelerFCC): “I have outlined the new #OpenInternet proposal in an
op-ed just posted on @Wired here: http://wrd.cm/16nDJn5 #NetNeutrality”.
Table 10.2 The ten most represented countries in both samples

% Tweets #NetNeutrality % Tweets #ClimateChange
USA 75.56 USA 46.81
Great Britain 3.83 Great Britain 12.13
Canada 3.19 Canada 9.00
France 1.31 Australia 6.85
India 1.21 India 2.14
Australia 1.11 France 2.12
Italy 0.93 Brazil 1.75
Germany 0.86 Italy 1.26
Brazil 0.86 Ireland 1.02
Mexico 0.65 Germany 1.01
of Representatives John Boehner also used #NetNeutrality for their offi-

cial statements on Twitter.
In contrast, the climate change debate remained at a stable level of
activity during the research period, which is due to the fact that there
were no political events with higher mobilisation potential in that times-
pan. Accordingly, the total volume of tweets for #NetNeutrality is 32%
higher than for #ClimateChange. We expect that the relation as well as
the discursive constellations vary in more intensive phases of climate pol-
icy negotiations, for example, during UN Climate Conferences.3 Our
research results, as analyses of ad hoc communication processes in general,
are contingent on the specification of time ranges for data collection.
Table 10.2 shows the ten most represented countries in both samples
considering all tweets, not only the English ones. In the climate change
debate, 47% of tweets still originate in the USA. Moreover, the list shows
that in the other English-speaking countries such as Great Britain,
Canada, and Australia there is also an intense debate on climate change
via Twitter. The network analysis presented below will have to clarify
which actors concretely engage in the debates and whether national clus-
ters can be identified within the #ClimateChange network. However,
Table 10.2 shows that the dominant role of US users on Twitter also
overlays the debates in the Anglo-Saxon countries.
This is one reason for us to continue data collection until the end of the year so that we cover the
3
UN Climate Summit in Paris in December 2015.

conservsbelieve
krobertory
politicallaughs debnicolina
climatecou
ibdeditorials
gopfashionista anonspress
chuckwoolery
lrihendry
scienceadvances nature_mi
thedavidmcguire
chucknellis occupycorruptdc
washtimes genelingerfelt
georgewhitejr parvasaeua greenfraud
katrinapierson newsroompostcom
audubonsociety
natlparkservice
realalexjones leodicaprio
citi
cli
cbsthismorning blossomnnodim
nowiknowmyabcs robfit pmgeezer realtimers clickhole
nprpolitics
notjoshearnest
sarahmlauren
gma lauraecpaul
juleslalaland
iluvco2 startalkradio
newscientist
carbongate
alammaldives greenl4l
c40cities
mintpressnews scottwx_twn
shaughn_a heritage
carinabloro organicconsumer
greenforall
usgs
mormondems everyvoice
peter_bowden ucberkeley
statedept randcorporation
secretaryjewell jimmybear2
foe_us
vanobserver nprnews rt_america
billnye johnkerry
bob_owens dailycaller ecointeractive
noaaclimate
weatherchannel
climatetreaty who commercegov
interior billmaher
gizmodo
wired robertswan2041
agimcorp
danielgennaoui foxnews rahul358 fidelherrera jrcarmichael
yaleclimatecomm markruffalo
nowwithalex
huffingtonpost
climateopp
bloombergdotorg ndgain jeffreydsachs
pharrell
telegraph
edshow paulcarfoot
cnn
theeconomist
ap
markeymemo
peta sierraclub thescienceguy unuehs
luciagrenna
real_liam_payne
awea
climatenewsca drbobbullard
youtube
earthjustice
ginaepa britanniacomms
dodo
peddoc63 mprnews
megynkelly
kdungul
michael_shank
glen4ont
nasagiss
lindasuhler takepart alroker narendramodi mcampaign
slate
cleanairmoms newclimateecon wsj
sensanders usaid upeace
gmiller1952 epa chriscmooney pnudperu
senatorkirk kennytorrella
salon
sciam noaa worldresources m3metic
reportingclimat whitehouse unesco

politibunny
thehill yearsofliving mikebloomberg androidpm
un_climatetalks iansomerhalder
pcgtw mahamosa imabannedd ontarioclimate
centerforbiodiv
maydnusa nbcnews nrdc actadaptation yaleenviro
davos unfoundation
maryhooksilver
speakerboehner
ceresnews
nasa climateriskworldbank environmentont
rockefellerfdn uninindia
citizensrock truthout lcvoters chichilnisky cosmincorendea

insideclimate un_spokesperson
revyearwood
washingtonpost euclimateaction
allenwest
rednationrising
ecowatch
globalthermo gravitydynamic
connect4climate limacop20 wmonews un_news_centre
mashable resilientearth
senwhitehouse
laureldavilacpa momentum_unfccc
algore
floggermiester
triplepundit
nytimes
bostonglobe
dofvoteclimate wsf5_2015 eu_commission
allanmargolin climategroup
michaelemann
barackobama
cdp undpasiapac
mcconnellpress thebaxterbean
konalowell
undp reinvest2015
wef
carminezozzora gop thedailyclimate
carlsiegrist reuters undp_india
thinkprogress
rollingstone lorabruncke climatereality
350climatehotnews nilimajumder
marcvegan uncclearn
thedailyedge planetexperts
adb_hq
tjwiseman
tchenya nodesystems rtccclimatenews
freja_petersen ecowarrior1980 nappeema un_women
jiminhofe brighteyedjaymi
un
cfigueres iaeaorg
policyngeco
bettybeekeeper cifor
democracynow
uniteblue paulhbeckwith ruisaldanha
unrightswire
unep
goodbyekoch
onahunttoday ungeneva
theearthnetwork
richardhine usrealitycheck yourveganho
latimes silenced_not ineeshadvs
africagreenmedi teriin
desmogblog greenpeaceusa
earthhour
ericwolfson matthieunappee2 rt_com
time
sninkypoo bluedotregister
johnlundin unfccc unepasiapacific
ajit_ranade
lee_tennant climatecentral
ipcc_ch irena
sentedcruz
richardohornos
mikeloburgio moxyladies
guardian
jackiepburgoyne
wwf
scottwesterfeld
mercianrockyrex
jimharris
greencomedy resilientneighb
amazonwatch fhollande
climatewise2015 climateprogress grist greenvideos
greenpeaceaustp iepjeltok action2015
sustainia
huffpostpol
politicalant whatsyurimpact
richardmclellan helenclarkundp
greenpeace
cdhill9 aj
skepticscience
senfeinstein mrfcj
johnfugelsang
hugoc3318
physorg_com
huffpostgreen isaacbiovega
bulletinatomic billmckibben
stanford tveitdal earthvitalsigns
benandjerrys thecvf
climatechange_a
ddimick liveearth
lluisahicart coolmyplanet planamikebarry
mcspocky business
senatorsanders senschumer faonews designboom
smh anjakolibri evclimatechange unredd

environewstv
bennydiego natgeo profraywills
jackthelad1947guardianeco pik_climate theelders
tr_foundation
sharethis
billmoyershq whytovotegreen
sengillibrand senatorboxer abcnews faoknowledge
motherjones
conversationuk
boilfroggie
cechr_uod actonclimate rnfrstalliance cop21 wfp
cuestionmarque ajam dpcarrington
alex_verbeek iied
suzyji
ecoeye carbonbrief winnie_byanyima ajarabic
wessmith123 naomiaklein thoriumfuture

divestfund eco_wife oxfam faoforestry
oxfamaustralia
oxfamamerica
nyccamp guardiansustbiz wwfcanada
elonjames climatecouncil pontifex climatekic
andrealeon fractivist
mikehudema cgiarclimate foeeurope
exploreamazing_ desmoguk
conversationedu theatlantic civileats
mwbloem
albd1971
carlfletcher15 deutschebank
espuelasvox robinhoodtax
jerrybrowngov yahoonews davidsuzukifdn
climate_rev
roshart
ed_miliband
trocaire
greenamerica p_hannam
alexanderknight jrf_uk
al_perri bbcnews stephenfry
oceanwarrior
brenttoderian
uniteblueky
coffeewarblers csironews
wwf_uk judiwakhungu
shell davidharewood
mgliksmanmdphd otiose94 david_cameron
rupertmurdoch
politics_pr vice
arusbridger
wrobertsfood
hearyanow
tonyabbottmhr edjoyce nick_clegg
mmnngreenworld
wwwfoecouk onegreenplanet deborahmeaden radioleary
wendy_bacon
pittgriffin kimacheson deccgovuk
theccoalition columbia_biz
reclaimanglesea wildlifetrusts
edie greenpeaceuk
malmmckay
jamie_woodward_ circularecology food_tank
greghuntmp anothergreen cafod
stopcoalexports maggiejordanacn sustaincities

georgemonbiot
carolinelucas
craigjack36
renew_economy
citizenradio
michaelhallida4
catholicclimate christian_aid natures_voice
terrence_mccoy
realtimferguson
stansteam2
_bto
strobetalbott
elitedaily
johnwren1950 nhm_london
petricholas
hankgreen albomp
Fig. 10.3 Network analysis of the tweets on #ClimateChange
4 Network Analysis
Figures 10.3 and 10.4 display the results of our network analyses for the
#ClimateChange and #NetNeutrality debates. For illustrative reasons, we
restricted the graphs to the 500 most important actors in each network.4
Colours are based on frequent connections identified by Gephi’s algo-
rithm to detect communities in networks. The #NetNeutrality network
4
We used the PageRank algorithm to position actors and to scale the size of actor labels. Results
remain robust if we apply the Betweenness centrality algorithm as a comparison. The graphical
design of networks is based on the Fruchterman-Reingold layout.
docdead elgatogaming
bestvpns
educationweek
flattr veteransfp jahovaswitniss

sputniknewsus
dbargen
adafruit
littlebits thepushdaily
change
sarajchipps republican_mrs
engadget
politics_pr
anonymousky solodemocrats natshupe

ouranonguardian
buzzfeednews tomcostellonbc
latestanonnews cspanradio mdcolangelo
michaelbrownusa
mitchellbaker thehpalliance
yahoonews
elisewho cnnmoney
digitalagendaeu
occupythemob lwbluedragon jonstreet noahwehrman
caseyparksit
fingersflying
neilstandish usatodaytech bloviate_barbie ronpaul

x_net_
theanonmovement stevescalise twitchyteam
thespiritweaver mcspocky arizonaafp cameron_gray
jayduplass wessmith123 cyberdustapp
brianstelter anthonybialy
bluedupage jessconditt
gottalaff braveheart_usa
progressivepush
50th_president michellemalkin
dmashak g_humbertson
policy
ansip_eu atlantahumanist
selinasorrels
kimberly_canete jarjarbug
edri blueva_hound net_alliance bettinavla

anonautopsy naphisoc heatherpustalka aaronrobinow
carlquintanilla jennjacques
gravitydynamic rebeksy
upworthy appsame smolloydvm hardline_stance
doctorow jasonabbruzzese
unitebluewi
bfeld lesliemarshall
blob_fish
jgalt9 bosshoggusmc
guardianus nickgillespie galtsgirl occupycorruptdc
mashabletech
dlb703
140elect
internetassn
marietjeschaake hapkidobigdad
thedemocrats teapartyexpress
pmgeezer afpnevada amymek
sensanders
uniteblue afphq
mashable usatoday
aaronsw senate_gops
aclu
billmoyershq housegop
zeldman
cendemtech
firefox
sarah__reynolds joshuawoodz
reason ladysandersfarm
resisttyranny
cblacktx
guyverhofstadt
anonyops occupywallstnyc
bipartisanism
smith83k gop gop_thinker
mozillaadvocacy
laureldavilacpa barneyfranken patriotic_me calfreedommom
boingboing thebaxterbean cmdorsey
mozilla wisco defunddc

lastweektonightwebfoundation steph93065
fcc
speakerboehner
bbcusa173
anonymousglobo
cspan peddoc63 alyssalafage
chucknellis
joec1776
accessnowdailykos tomrisen albertdeascenti
thepatriot143
ronsantofan
bannerite dennygirltwo
mr_oneliner
senmarkey mikesnider poetvix
kuperj0nes
jjauthor
comcast
taliesan
jbspharmd
repfredupton betseyross rbpundit
iamjohnoliver normanpenny politibunny alysiastern
youranonnews b_fung
barackobama
couragecampaign latinagopvoter
bradthor
mcuban
boldprogressive
ronwyden
sharneal
christichat gt_rman
greenpeaceusa moveon mikeofcc housecommerce ibdinvestors
rwsurfergirl
timberners_lee dominicru
thomasb00001
ctia oxco martydrinksbeer
allenwestrepub
alfranken
holding_our_own dagnyred
jim_b60
theblaze
tomwheelerfcc
jdtabish
keithellison senjohnthune rednationrising
jensan1332
sumofus ppinternational senblumenthal
tanyainalameda ajitpaifcc tea_alliance
croatansound
beanfrompa
lrihendry
theopenmediamclyburnfcc thehill
countermoonbat
honestjames4 gerfingerpoken
fightfortheftr
spreadbutter
verizonwhitehouse
hitrecordjoe jrosenworcel
edhaskl
credomobile
openmedia_ca twitter netflix
johnlaprise
greensboro_nc
norsu2
snap_politics klsouth kerryepp
guardian markeymemo marshablackburn lybr3
freepress
rachellive tsb1974 chuckwoolery
craignewmark michaelkhoo jimfak wordsmithguy
brettglass opposenetttax
glennbeck
nhmc
colorofchange
michaelscurato
gigibsohnfcc darrellissa dbongino
demandprogress
ivote4usa twistedpolitix
propublica publicknowledge
verge
rosariodawson
freenetbot washtimes dailysignal
senatorleahy
culturejedi techfreedom foxnews tedcruz
jkfreespirit
thegeeksjt1 adambaldwin bmoc98
markruffalo repjohnlewis mediajusticemediaaction katyonthehill
onlineontheair ppr_intern
evangelinelilly mattfwood etsy

fredbcampbelljr
sentedcruz prisonplanet
rockprincess818
rashadrobinson
timkarr
popresistance
wsj
chrisj_lewis mcconnellpress
micahflee realalexjones
politifact berinszoka iiabroadband dailycaller jhaletweets
presenteorg meddemfundjgonzaleznhmc senjohnmccain
evan_greer haroldfeld
notaaroncraig gopoversight
yelp bennydiego cnn iowahawkblog
dailydot
xeni blackrepublicanheritage
repgregwalden
oti davidsussman
18millionrising
candacejeanne coppsm matthops82
washingtonpost
stevenrenderos vanschewick
elonjames nctacable shaughn_a
ammori digiphile
nationaction corybooker commoncause wired jeffreyatucker
nbcnightlynews allenwestarmy
dloesch
truthout josephatorres
engineorg
usabillofrights tedcruz45 abc
gnagesh
senwarren
future_of_music eff amy_schatz
youtube
hannahkauthor
ericschiffer
chaddickerson ttoboyle freedomworks
jessalanfields
zephyrteachout
regiteric
thefaithfulnet
repannaeshoo att cnet libsinamerica
realdeancain
buyvpnservice
rt_com booyahboyz mattwalshblog
penn
superwuster nytimes bloombergtv wsjd
nerdywonka
billdeblasio tumblr sprint
msnbc foxbusiness
lalaruefrench75 netadvisor
anniescranton
reddit cnbc anthonycumia
hitrecord newyorker maplight wsjpolitics google
scrawford westjournalism
lnonblonde foxnewspolitics
technocowboy
kickstarter
democracynow womensmediacntr newsweek bidenscreepyhug
oldcaesarcole
variety sharethis
feliciaday
cityjournal
stanfordcis vzpublicpolicy poltoons
rt_america cbsnews darth fox5sandiego
twc business forbes
dreamhost
manhattaninst
ga boogie2988 thaddeusrussell
techraptr internetarchive jerezim
gigaom
gitgirl
citizenradio nprnews techcrunch

venturebeat
femshaveballzbobmurphyecon
ssludgeworth
stanfordlaw
abcpolitics
abbymartin
cbseveningnews thenextweb
tedtalks occupy_www
mckennasmark
anonymous5thnov huffpostpol universilence
theskimm jamesrhine
westchesterocpy change4india
harikondabolu
pcmag thepeoplescube
maddie_marshall
wikileaks
blackberry
bassnectar benandjerrys
angryblacklady
josephscrimshaw mattmay iglvzx
npr
bbcbreaking theneedledrop
producercody
criticl_me
vpnbaron marceloclaure
ign rsprasad
jonrisinger thinkgeek cbcalerts
digiges
gamespot
joshjepson
allispeed
debnicolina
brownjenjen
darthputinkgb riotfest
adage chrisdemarais
Fig. 10.4 Network analysis of the tweets on #NetNeutrality
(Fig. 10.4) is clearly dominated by US American actors coming from

different social spheres: politics, business, and media. The network cen-
trality of the regulation commission FCC and its chairman Tom Wheeler
clearly reflect the domestic policy debate. We can observe a significant
political polarisation with proponents of net neutrality on the left side
and critics of FCC regulation on the right. The network separation
depicts the tendency of actors on both sides to predominantly name,
link, and share content with users in the same camp. Among the propo-
nents, we find a number of NGOs from the liberal-progressive spectrum,
for example, Fight for the Future and Demand Progress, civil rights
movements such as the Electronic Frontier Foundation, individual activ-

ists, and the hacker group Anonymous. The smaller camp of the critics is
composed of Republican politicians such as Ted Cruz, Ron Paul, John
Boehner, the accounts of the GOP, the Republican National Committee,
as well as other conservative groups and activists, for example, from the
Tea Party movement. In the centre of the graph, we mostly find media
actors that are regularly referred to by both camps. Actors from business
are mostly on the left side of the network.
In contrast, the #ClimateChange network (Fig. 10.3) is much more
homogeneous. Transnationally operating NGOs and activists dominate
the #ClimateChange debate while national political actors are less cen-
tral, with the notable exceptions of climate activist and former politician
Al Gore as well as US President Obama. In contrast to the #NetNeutrality
network, we find a diversity of academic actors and institutions with a
technical focus such as NASA, IAEA, or the Intergovernmental Panel on
Climate Change (IPCC). The high degree of transnational activity in the
#ClimateChange network is expressed by the presence and centrality of
many UN institutions and their respective communities. Their concen-
tration on the centre-right side of the network indicates the existence of
different subfields of the debate within the network. In the context of
international climate negotiations, there is constant interaction within a
group of actors constituted of organisations such as WWF or Greenpeace
that try to influence the relevant institutions with their advocacy.
However, we also identify two national debate clusters on the periphery:
one cluster with actors from the UK in the bottom right corner of the
graph and another on the upper left populated by US American actors.
These results might lead to the interpretation that there is a low level
of political polarisation in Twitter discourse on climate change. At first
glance, the results of our keyword analysis seem to confirm this finding.
However, a look at the contexts of the keywords reveals that climate scep-
tics actually influence the discourse noticeably (see below). Still, climate
sceptics do not show up as central nodes of the Twitter networks. This is
to say that climate scepticism on Twitter is performed occasionally by
infrequent participants of the debate while the users who tweet and are
addressed regularly in the context of climate change discourse support
the IPPC position on anthropogenic climate change.
5 Keyword Analysis
Geolocation analysis and network analysis operate on large quantities of
data, which makes them perfect for a ‘bird’s eye analysis’ of the corpus
and for gaining insight into the internal structure of the corpus.
Geolocation analysis and network analysis with user names as nodes give
valuable information on where the discourse actors are located and who
among them are the most visible and, therefore, the most important
ones. However, due to their rather coarse-grained approach, they are not
as suitable for providing hermeneutic insight into the data. The last part,
keyword analysis, aims at finding topic-specific words by comparing
word frequency lists.
Keyword analysis is a well-established approach in corpus linguistics in
which a keyword is considered to be a word ‘which can be shown to occur
in the text with a frequency greater than the expected frequency (using
some relevant measure), to an extent which is statistically significant’
(Wynne 2008, 730; cf. Demmen and Culpeper 2015 for a comprehen-
sive overview). The token frequency in the reference corpus is used to set
the expected frequencies, and the token frequencies in the main corpus
are the observed frequencies. We used Laurence Anthony’s concordancer
software AntConc (Anthony 2005) for keyword computation, as it is one
of the leading free software packages for this task. It uses the Log-
Likelihood measure to test the null hypothesis that there is no difference
between the observed and the expected frequencies.
For this analysis word means token, which is a syntactically used word
form like is, are and was as opposed to the base form be (= lemma). While
using lemmas for keyword analysis serves the purpose of finding topic-
specific vocabulary well, we chose word forms over lemmas because of the
formal challenges of Twitter messages. Automatically finding and anno-
tating the base form of the word forms used in the Twitter messages
(= lemmatising) did not prove sufficiently correct at the time of writing
even though this is a standard procedure in Natural Language Processing
for regular text (cf. Manning et al. 2014).
As mentioned above, the features of the reference corpus affect the
outcome of the whole procedure: ‘Features which are similar in the refer-
ence corpus and the [research] corpus itself will not surface in the com-
parison, […] only features where there is significant departure from the
reference corpus norm will become prominent for inspection’ (Scott
2009, 80). In other words, comparing the frequencies of the tokens
occurring in the research corpus and the tokens occurring in the reference
corpus yields content- and topic-related statistically significant tokens as
keywords if the corpora do not belong to the same domain and share
predominant formal features. The keywords are, in turn, a starting point
for a more detailed hermeneutic topic analysis by inspection of the con-
text in which the keywords occur.
In our case, the most prominent features that should not show up as
key are the linguistic patterns which are specific to Twitter communica-
tion as a social media platform. These include, among others, certain
acronyms like ‘lol’ (‘laughing out loud’) or ‘wtf ’ (‘what the fuck’), which
appear quite regularly in social media discourse but not (as much) in
media articles. Therefore, it seems prudent to use a Twitter corpus as the
reference corpus since items like these will not stand out in the compari-
son. The features that should register in the comparison are those which
are mainly content-related, including Twitter-specific constructions like
@-mentions and hashtags. To make sure that these items would be
counted as regular words in AntConc and following Baker and McEnery
2015, we changed the default word definition, which counts any string
of ASCII characters as words, to include the following characters ‘@’, ‘–’,
‘_’, ‘:’, and ‘#’.
For our analysis, we used a corpus of tweets belonging to the domain
‘art’ and ‘communication about art’ which seems sufficiently removed
from the domains covered in the research corpus to make topic specific
words statistically significant. It was built by streaming tweets containing
at least one of the following artist’s name in hashtag form: #botticelli,
#schiele, #gursky, #calder, #kandinsky. The corpus covers the period from
1 December 2015 to 31 January 2016 because there were exhibitions in
several countries featuring these artists.
For a more detailed profiling of the two policy fields net neutrality and
climate change, we split the research corpus into two subcorpora, one
containing all the tweets dealing with net neutrality and the other one
containing the tweets dealing with climate change.
For a better interpretation of the ‘aboutness’ marked by the keywords,

we categorised all content words that were among the first 100 keywords
for each subcorpus by manually grouping them together and labelling the
groups based on the topic they deal with. Function words like to, on or is
were not categorised. Also not explored further were content words like
tell or big, which are used with too many different meanings for clear
categorisation. Categorising keywords in this manner entails the close
reading of the context surrounding it (= concordance analysis) and using
one’s own linguistic expertise and world knowledge to bundle keywords
together. It is this step of the analysis which combines quantitative and
qualitative analyses: the keywords themselves are the result of purely
quantitative analysis while the categorisation process is purely qualitative.
Categorisation therefore provides a very rich platform from which trends
and meanings in a corpus can be inferred.
Again following Baker and McEnery’s (2015) approach to Twitter key-
word analysis, we present the keywords as a categorised list. The c ategories
we derived from context analysis of the keywords are similar but not
identical for the two subcorpora (Table 10.3).
The similarities between the categories reflect the similarities between
the two subcorpora. Especially the fact that they both deal with policy
fields is quite obvious from the categories, since they all somehow deal
with the evaluation of a state or process and how to deal with it.
Table 10.3 Categories derived from context analysis

Subcorpus ‘climate Subcorpus
Category change’ ‘net-neutrality’
1 Discourse actors x
2 Affected domain x
3 Agents causing the effect x
4 Effect on the domain x
5 The desired effect x
6 Scale of the effect x x
7 Evidence and factuality x
8 Measures to reduce the x
effect
9 Measures to cause the x
desired effect
Hashtags and @-mentions that are also keywords are treated as regular
words in the categorisation. It should be pointed out, however, that
hashtagged items are more than mere words in Twitter communication.
They can fulfil up to three functions. In their purest form, they categorise
a whole tweet by adding a meaningful tag to it without any syntactic con-
nection to the rest of the tweet. For example: ‘Help save the Internet!
#fcc’. However, any syntactically used word within a tweet can be turned
into a hashtag: ‘#Republicans bill to gut the #fcc and kill #netneutrality’.
And, as shown in the next and last example, the word used as a hashtag
can also refer to a discourse actor like in ‘#fcc’ (which stands for federal
communications commission) (Table 10.4).
Table 10.4 Categorised keywords for tweets containing #ClimateChange

Category Keywords
Discourse #auspol, @barackobama / #obama / obama, #cdnpoli
actors (‘canadian politics’), #cop (‘conference of the parties’),
14 total | deniers, news, people, scientist / scientists, senate, #sotu
23.72% (‘state of the union’), #tcot (‘top conservative on twitter’), we
Affected climate / #climate, earth, #environment, future, ice, planet,
domain weather
8 total |
13.56%
Agents carbon, emissions, fossil, #geoengineering
causing the
effect
3 total | 5.08%
Effect on the #climatechange, change (‘climate change’), #globalwarming,
domain impact/impacts, rise, risk, sea, threat, warming
12 total |
20.34%
Scale of effect global, planet, world, hottest, years (‘2014 probably hottest
5 total | 8.47% year in 10,000 years’)
Evidence and believe, data, debate, real, report, says, #science / science,
factuality study, record (‘hottest year on record’)
10 total |
16.95%
Measures to act, action, #actonclimate, fight, energy / #energy (‘renewable
reduce the energy’), #sustainability
effect
7 total |
11.86%
Of all the categories, Discourse actors and Effect on the domain are the
most frequent; together, they account for almost half of the keywords
that were categorised.
Of the fourteen keywords, only two are in the form of @-mentions,
which means that the remaining twelve do not register as nodes on the
network graph. The discourse actors that proved to be statistically signifi-
cant in comparison with our reference corpus show that there must be
substantial differences of opinion (‘deniers’) regarding the topic of cli-
mate change. A clear focus on politicians and policymakers is also evident
if the discourse actors doubling as hashtags are also taken into account.
This confirms the findings of the network analysis and refines it by bring-
ing actors to attention which do not (or not dominantly) show up as
nodes in the network graph like ‘deniers’ and ‘scientists’.
The affected domain presents itself as widespread, covering virtually
the whole globe, and the effect on it is painted as clearly negative. The
keywords dealing with the effect on the affected domain are the second
strongest category, which indicates the need of the users to talk about it.
The category Evidence and factuality is the third largest, which shows that
the evidence for the postulated effect is subject to debate. The measures
to reduce the effect that are part of the keyword list are rather vague and
aimed at rallying people or expressing one’s own willingness to do
something.
Interestingly, the category Scale of effect is completely devoid of hashtags
even though the words that prompted this category seem quite suited to
be used as hashtags. Their rank on the keyword list is so low, however,
that they were not part of the categorisation process. This means that the
users who chose not to use these words as hashtags did not find it prob-
able that they would fulfil the role of pooling tweets relevant to their own
(Table 10.5).
Just like in the subcorpus on climate change, most of the categorised
keywords belong to only one of two categories: Discourse actors and
Measures to cause the desired effect. While both are significant parts of the
debate—together, they comprise more than half the categorised key-
words—the Measures seem to be the most important topic of this debate.
The discourse actors who made it on the keyword list are mainly poli-
cymakers (@ajitpaifcc, @tomwheelerfcc, @fcc/fcc, chairman, congress,
Table 10.5 Categorised keywords for tweets containing #NetNeutrality

Category Keywords
Discourse @ajitpaifcc (Ajit Pai is a commissioner at the FCC), #fcc / @fcc / fcc
actors (‘federal communications commission’), #fcclive, @
16 total | tomwheelerfcc (Tom Wheeler is chairman at the FCC),
27.12% chairman, comcast (largest home internet service provider in
the USA), congress, government, isps (‘internet service
providers’), obama, people, republicans, #tcot (‘top
conservatives on twitter’), we
Affected broadband, cable, #internet / internet, net, access (‘internet
domain access’), service (‘internet service’)
7 total |
11.86%
Desired free (‘free and open internet’), freedom, #netneutrality,
effect on neutrality, #nonetneutrality, open, #openinternet, public
the (‘internet as public utility’), utility (‘internet as public utility’)
domain
9 total |
15.25%
Scale of strong (‘strong net neutrality’)
effect
1 total |
1.69%
Measures to bill, control, debate, decision, fight, passed, plan, protect,
cause the regulate/regulation, rules/ruling, stop, title (‘title II
desired regulations’), #titleii (‘title II’ of the US Communications Act of
effect 1934), vote / voted / votes, approves (‘FCC approves
19 total | #netneutrality rules’), proposal (‘NetNeutrality proposal’),
32.20% thank (is a measure by extension as in ‘Thank you #FCC’)
government, Obama, republicans, #tcot) and companies (‘comcast’,

‘isps’) that are connected to the affected domain. Just like in the subcor-
pus on climate change, most of the keywords referring to discourse actors
are not in the form of @-mentions and therefore not part of the network
graph but rather topics of debate. The affected domain (‘internet’, ‘broad-
band’) presents itself as surprisingly technical while the desired effect on
this domain reveals a rather idealistic and abstract-philosophical perspec-
tive (‘freedom’, ‘neutrality’, ‘open’, ‘free’). The measures to cause the
desired effect centre around governmental processes and a very orderly
conduct of things.
The geolocation and network analyses showed that the net neutrality
discourse proceeds mostly along national boundaries, which gives it a
certain homogeneity. Keywords like bill, decision, title (‘title II regula-
tions’), vote/voted/votes, approves (‘FCC approves #netneutrality rules’)
show how rooted the discourse is in national boundaries, especially, in
this case, the boundaries of the USA since most of them describe parts of
US law-making processes. At the same time, the net neutrality discourse
is also heterogeneous in that the discourse actors belong to different social
spheres (politics, business, and media). Most of the keywords in the cat-
egory ‘discourse actors’ refer to politicians. Only two refer to business and
the remaining two are quite vague (‘people’ and ‘we’) but still part of
political discourse patterns, which is especially true for ‘we’ which func-
tions as a group defining entity (s. also below). The media, surprisingly,
do not register within the first 100 keywords. This is especially surprising
in the light of the fact that Twitter users regularly refer to news outlets by
@-mention and hashtags. However, the names of media actors are so low
on the keyword list that they did not make it into the categorisation pro-
cess. While this could be interpreted as a very strong dominance of non-
media related actors, it needs to be taken into account that the keywords
are computed by using the text of the tweets only. The Twitter account
where they are tweeted from is not part of the keyword computation.
This means that anything mentioned in the category ‘discourse actor’ is
either an @-mention or a word referring to a person or institution that
was used in the text of a tweet. The keyword list does not give any indica-
tion of how strong the impact of media actors is as communicators. It
does, however, show that they are not important as discourse topics. This
is also consistent with the fact that media actors like ‘foxnews’, ‘usatoday’,
‘theopenmedia’, and ‘freepress’ are important nodes in the network analy-
sis (see above).
The keyword list of the net neutrality tweets shows a dominance of
liberal points of view in the top 100 keywords, which coincides with a
critical perspective on centralisation tendencies on the internet. Only a
few keywords indicate the antagonism of political perspectives shown
above by the network analysis: the hashtags #NetNeutrality vs. #nonet-
neutrality, #tcot (‘top conservative on twitter’), and the words ‘Obama’
vs. ‘republicans’ and eventually the keyword ‘we’, which is used exten-
sively when communicators try to reinforce group identities by putting

emphasis on the differences to other groups (cf. Wodak et al. 1998,
99–102).
While the net neutrality discourse almost stays within national bor-
ders, the climate change discourse seems to be a more transnational phe-
nomenon notwithstanding the fact that more than 60% of all tweets in
English with geotagging enabled come from within the USA. Other than
in the net neutrality discourse, the keywords in the climate change corpus
refer mostly to topics of transnational concern like ‘#energy’, ‘#environ-
ment’ and ‘#geoengineering’. There are also keywords indicating a more
national perspective like ‘Obama’ or ‘#sotu’, but in contrast to the net
neutrality discourse, they are not restricted to the USA, as the keywords
‘#auspol’ (stands for ‘Australian politicians’) and ‘cdnpoli’ (= Canadian
politics) show. Of the keywords indicating climate change as a topic,
none refers to any national boundaries. Instead they refer to generic cat-
egories like ‘future’ or ‘planet’, which shows that the topic in question is
much bigger than just national politics.
Like the net neutrality discourse, the climate change discourse is
noticeably polarised. However, in contrast to the net neutrality discourse,
where the proponents and critics build separate communities (see net-
work analysis), the polarisation in the climate change discourse is a polar-
isation of content. The so-called climate sceptics and the supporters of
the official IPCC position are not only organised in the same network
but also use the same key vocabulary. Keywords from the category ‘evi-
dence and factuality’, for example, are by no means only used by people
who regard climate change as real and as a real threat to humanity—cli-
mate sceptics also use ‘science’, ‘data’, ‘study’, and so on in their tweets.
On a higher level than the keywords themselves, the categories we
developed by means of co-text analysis of the keywords also illustrate
some of the differences and similarities between the two Twitter dis-
courses. Fully present in both discourses are only the following three cat-
egories: 1. Discourse actors, 2. Affected domain, 4. Effect on the domain.
Category 5. Scale of the effect might be present in both discourses, but it
plays only a minor role in the net neutrality discourse.
Categories 7. Measures to reduce the effect (in the climate change dis-
course corpus) and 8. Measures to cause the desired effect (in the net neu-
trality corpus) both refer to ways of dealing with whatever change the
affected domain is subject to, which makes them eligible as categories
present in both discourses. However, as their labels already show, the
underlying keywords refer to quite different things, which makes them
eligible as categories marking the differences between the two discourses.
In fact, a closer look reveals that while category 9. Measures to cause the
desired effect is composed mostly of words referring to various steps in the
regulatory process, category 8. Measures to reduce the effect mostly consists
of more generic words that constitute a call to action.
The categories 4. Agents causing the effect (in the climate change corpus)
and 7. Evidence and factuality illustrate the most striking difference
between the two discourses. As far as the discourse on net neutrality is
concerned, the what and how is open for and in need of debate. In the
climate change discourse, however, the very existence of the issue is open
for debate. Again, both statements come as no surprise for those who
already have knowledge of the discourses. But still, it also shows how
keyword analysis can help make sense of an overwhelmingly large num-
ber of texts.
6 Conclusion
In this study, we have put our focus on the transnationalisation of politi-
cal communication via the social network Twitter. On the current level of
analysis (i.e. metadata analyses of geolocations and networks, preliminary
results of content analysis), we observed transnational Twitter communi-
cation for both the issues examined. We used a sophisticated set of meth-
ods the benefits of which are discussed in the second part of this
conclusion. We end this contribution on an outlook into the future.
6.1 Summary of Results
We found that the methods we used led to supplementary and even com-
plementary results. First, the geolocation analysis and the network analy-
sis showed how topics like #ClimateChange or #NetNeutrality are
discussed on different levels of international involvement and intercon-

nection. While net neutrality seemed to be predominantly debated in
highly developed democratic countries, the degree of transnationalisation
for the climate change discourse appeared higher. The communication on
#NetNeutrality was clearly dominated by US American actors belonging
to different social groups: politics, business, and media (see Fig. 10.4).
Within the communication network, however, a clear separation into
two spheres could be observed, which we interpreted as the result of the
tendency of actors on both sides to predominantly name, link, and share
content with users in the same camp. The communication on
#ClimateChange turned out to be more homogeneous. Instead of a clear
separation into communities, most of the users seemed to interact with-
out perceptible preferences in terms of group membership. However,
there are two national debate clusters on the periphery: one with actors
from the UK and the other with US American actors.
Second, the keyword analysis gave a more fine-grained insight into the
communication. It showed clearly that discourse actors are not only
actors but also topics. Only a small number of the keywords in the cate-
gory Discourse actors in both subcorpora were in the form of @-mentions;
the remaining were regular words and did therefore not register as nodes
on the respective network graphs but were still significant words in the
subcorpora.
Apart from Discourse actors, the topics dealt with most frequently in
the subcorpus on climate change were the Effect on the domain and
Evidence and factuality. For the net neutrality subcorpus, the most impor-
tant topic was Measures to cause the desired effect. These differences indi-
cate that the debate on climate change is a debate on content while the
debate on net neutrality is more concerned with consolidation.
However, the results need to be differentiated by adding some impor-
tant caveats and comparative findings. First, in Twitter debates for both
policy fields US American and Anglo-Saxon users and groups are clearly
predominant. This finding is related to query decisions in favour of glob-
ally used hashtags and the English language’s status as the lingua franca of
online communication. Second, network analyses show some separated
national clusters, as keyword analysis revealed an orientation towards the
national regulatory level, at least for the net neutrality debate. Both find-
ings are quite in line with the assumptions of structural nationalism as

defined by Schünemann (2014, 512), which is said to hamper the devel-
opment of transnational patterns of political communication. National
societies constitute communication or discourse communities by sharing
a public discourse from which legitimacy for political decision-making is
normally derived. Thus, when a policy decision is under way on the
national level, we would expect a relative closure of discourse formation
along national borders (or the borders of the ruling territory/the respec-
tive jurisdiction).
Consequently, the transnational potential of social media needs to be
regarded from an issue-specific perspective. While the higher intensity of
the net neutrality debate as measured by the total volume of tweets sup-
ports the assumption of the high mobilisation potential of internet politi-
cal issues, the transnational participation beyond the US and the
Anglo-Saxon world is higher for communication on climate change. On
top of the dominance of American users in overall Twitter communica-
tion, our sample period during a domestic policy debate even amplifies
the centrality and agenda-setting function of US American actors in the
net neutrality debate. This effect is so strong that the domestic polarisa-
tion clearly leaves its mark in the network analysis of #NetNeutrality.
However, the deeper analysis of networks including PR distributions and
the first steps of corpus analysis compared by national subsets show that
it is indeed the net neutrality debate that features less national peculiari-
ties. In contrast, the results for the climate change debate are more mixed,
with English-speaking countries showing a higher degree of discursive
independence than non-English-speaking countries. However, these pre-
liminary findings need to be substantiated by further research.
6.2 Linking the Different Approaches
The internet has brought a plethora of new empirical sources for research,
such as social media, and an ever-growing number of applicable methods,
but this development also poses the risk of very data- or method-driven
studies, missing a theoretical foundation. We tried to avoid that by always
evaluating the benefits of applied methods for our research question. Our
approach therefore did not only use the advantages offered by every sin-
gle method, but combined them in a way that significantly improved our
understanding of the matter at hand. Geotagging is a necessary precondi-
tion for the further analysis of potential transnationalisation in Twitter
communication. Therefore, most of the analytical steps directly built on
and profited from geolocation information, since this information
enabled us to build specific subcorpora for network and linguistic analy-
sis. Employing geolocalisation therefore enabled us to engage with our
research question. Furthermore, the combination of different methods
enabled us to verify and enhance some of our findings. This holds
especially true for the combination of network analysis and keyword
analysis. Combing these two methods and going deeper into the results
of keyword analysis, we were able to identify the conceptual polarisation
of the climate change discourse on Twitter, which did not show up in
network analysis. Building upon both methods, we could further sub-
stantiate and differentiate our findings. Although some findings differed
slightly, as mentioned above, the results nevertheless enabled us to draw
a more nuanced picture. The combination of different methods may also
help compensate for the potential weaknesses of some methods and shed
light on what would otherwise be blind spots. While network analysis
focuses only on relations between actors, linguistic analysis supplements
this by actually looking for similarities and differences in content. This
also illustrates the benefits not only of combining methods but also of an
interdisciplinary approach.
6.3 Future Work
One point concerning further research is the question of whether all

hashtags can be regarded as functional keywords since they show which
subjects or discourses the respective users deem related to their tweets.
Another point concerns the possibilities of closer combination of the
approaches discussed above, especially the combination of the results of
the network analysis with those of the keyword analysis. For instance, a
question which arises in this context is how often specific keywords occur
with a specific actor (@-mention) in one tweet, like @barackobama and
freedom in contrast to @barackobama and regulate.
To further develop and improve this study, we need to expand our

research period in order to analyse both policy fields at times of compa-
rable political attention. A qualitative analysis of tweets would provide
further insight into the actual transnationalisation of different ideas. This
could be further substantiated by also including non-English hashtags
and additional sources other than Twitter.
References
Anthony, Laurence. 2005. AntConc: Design and development of a freeware
corpus analysis toolkit for the technical writing classroom. In IEEE:
Proceedings International Professional Communication Conference, 729–737.
Baker, Paul, and Tony McEnery. 2015. Who benefits when discourse gets
democratised?: Analysing a Twitter corpus around the British Benefits Street
debate. In Corpora and discourse studies. Integrating discourse and corpora, ed.
Paul Baker and Tony McEnery, 244–265. London: Palgrave Macmillan.
Barberá, Pablo. 2014. Package ‘streamR’. Accessed February 21, 2015. http://
cran.r-project.org/web/packages/streamR/index.html.
Bohman, James. 2007. Democracy across borders: From Dēmos to Dēmoi. Studies
in contemporary German social thought. Cambridge, MA: MIT Press.
Boyd, Danah, and Kate Crawford. 2012. Critical questions for Big Data.
Information, Communication & Society 15 (5): 662–679.
Bruns, Axel, and Jean Burgess. 2011. #Ausvotes: How twitter covered the 2010
Australian federal election. Communication, Politics & Culture 44 (2): 37–56.
Castells, Manuel. 2002. Die Macht der Identität: Teil 2 der Trilogie: Das
Informationszeitalter. Das Informationszeitalter: Wirtschaft, Gesellschaft,
Kultur. Vol. 2. Opladen: Leske + Budrich.
Chadwick, Andrew. 2013. The hybrid media system: Politics and power. Oxford:
Oxford University Press.
Demmen, Jane E., and Jonathan V. Culpeper. 2015. Keywords. In The Cambridge
handbook of English corpus linguistics, ed. Douglas Biber and Randi Reppen,
90–105. Cambridge: Cambridge University Press.
Duggan, Maeve, Nicole B. Ellison, Cliff Lampe, Amanda Lenhart, and Mary
Madden. 2015. Social media update 2014. Accessed February 14, 2015.
http://www.pewinternet.org/2015/01/09/social-media-update-2014.
Fasold, Ralph W. 1990. The sociolinguistics of language. Oxford: Blackwell.
Felder, Ekkehard, and Marcus Müller, eds. 2009. Wissen durch Sprache. Theorie,
Praxis und Erkenntnisinteresse des Forschungsnetzwerks “Sprache und Wissen”.
Berlin and New York: De Gruyter.
Freelon, Deen, and David Karpf. 2015. Of big birds and bayonets: Hybrid
Twitter interactivity in the 2012 Presidential debates. Information,
Communication & Society 184: 390–406.
Hanegraaff, Marcel. 2015. Transnational advocacy over time: Business and
NGO mobilization at UN climate summits. Global Environmental Politics 15
(1): 83–104.
Held, David. 1997. Democracy and globalization. Global Governance 3:
251–267.
Jeffares, Stephen. 2014. Interpreting hashtag politics: Policy ideas in an era of social
media. Basingstoke: Palgrave Macmillan.
Kielmansegg, Peter G. 2013. Die Grammatik der Freiheit: Acht Versuche über den
demokratischen Verfassungsstaat. Baden-Baden: Nomos.
Kneuer, Marianne. 2013. Bereicherung oder Stressfaktor?: Überlegungen zur
Wirkung des Internets auf die Demokratie. In Veröffentlichungen der
Deutschen Gesellschaft für Politikwissenschaft: Vol. 31. Das Internet: Bereicherung
oder Stressfaktor für die Demokratie? ed. Marianne Kneuer, 7–31. Baden-
Baden: Nomos.
Kwak, Haewoon, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is
Twitter, a social network or a news media? In Proceedings of the 19th interna-
tional conference on World Wide Web—WWW ’10, 591–600. Raleigh, NC:
ACM Press.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven
J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural lan-
guage processing toolkit. In Proceedings of 52nd Annual Meeting of the
Association for Computational Linguistics: System Demonstrations, 55–60.
Accessed March 15, 2017. http://www.aclweb.org/anthology/P/P14/P14-
5010.
McEnery, Tony, Mark McGlashan, and Robbie Love. 2015. Press and social
media reaction to ideologically inspired murder: The case of Lee Rigby.
Discourse and Communication 9 (2): 1–23.
Müller, Marcus. 2012. Vom Wort zur Gesellschaft: Kontexte in Korpora: Ein
Beitrag zur Methodologie der Korpuspragmatik. In Korpuspragmatik.
Thematische Korpora als Basis diskurslinguistischer Analysen, ed. Ekkehard
Felder, Marcus Müller, and Friedemann Vogel, 33–82. Berlin and New York:
De Gruyter.
———. 2015. Sprachliches Rollenverhalten: Korpuspragmatische Studien zu diver-

genten Kontextualisierungen in Mündlichkeit und Schriftlichkeit. Berlin and
Neuman, W. Russell, Lauren Guggenheim, S. Mo Jang, and Soo Y. Bae. 2014.
The dynamics of public attention: Agenda-setting theory meets big data.
Journal of Communication 64 (2): 193–214.
Pries, Ludger. 2010. Transnationalisierung: Theorie und Empirie grenzüberschreit-
ender Vergesellschaftung. Wiesbaden: Springer VS.
Ruths, Derek, and Jürgen Pfeffer. 2014. Social media for large studies of behav-
ior. Science 346 (6213): 1063–1064.
Schünemann, Wolf J. 2014. Subversive Souveräne: Vergleichende Diskursanalyse
der gescheiterten Referenden im europäischen Verfassungsprozess. Theorie und
Praxis der Diskursforschung. Wiesbaden: Springer VS.
Scott, Michael. 2009. In search of a bad reference corpus. In What’s in a word-
list? Investigating word frequency and keyword extraction, ed. Dean Archer,
79–92. Aldershot: Ashgate.
van Eimeren, Beate, and Beate Frees. 2014. 79 Prozent der Deutschen online –
Zuwachs bei mobiler Internetnutzung und Bewegtbild. Media Perspektiven
7–8: 378–396.
Wodak, Ruth, Rudulf de Cillia, and Martin Reisigl. 1998. Zur diskursiven
Konstruktion nationaler Identität. Frankfurt/Main: Suhrkamp. [Engl. version:
2009. The discursive construction of national identity. Edinburgh: University
Press.]
Wynne, Martin. 2008. Searching and concordancing. In Corpus linguistics: An
international handbook, ed. Anke Lüdeling and Merja Kytö, 706–737. Berlin
and New York: De Gruyter.
Index1
A network, 16, 71, 167, 168, 286,

Actor(s) 287, 293, 294, 296–300,
discourse, 65, 68, 300, 303, 309 304, 306–311
group, 73 textual, 155
institutionalised, 290, 292 visual, 251–280
media, 299, 306 Archive, 60, 195
political, 140, 142, 167, 288, 295, digital, 10, 183, 259
299 newspaper, 195
social, 60, 63, 68, 81
Analysis
content, 26–28, 31, 33, 184–187 B
correspondence, 14, 15, 65–67, Big data, 11, 12, 156, 192
72–77, 80, 83, 132, 134,
136, 155, 159
geolocation, 286, 294–297, 300, C
306, 308 CADS, see Corpus-Assisted
keyword, 13, 16, 155–178, 218, Discourse Studies
286, 299–309, 311 Capital, 63
Note: Page numbers followed by ‘n’ refer to notes.

1

Postdisciplinary Studies in Discourse, https://doi.org/10.1007/978-3-319-97370-8
316 Index
Capital (cont.) local, 189–192, 200

economic, 72 political, 63, 216, 286, 290
symbolic academic, 63, 66 social, 4, 7, 8, 14, 24, 52–59, 62,
Causality, 24, 39–43 64–66, 68, 72, 80, 82, 131,
Cluster, 61, 133, 147 137, 148, 290
actor network, 296, 299, 309 societal, 63
co-occurrence, 163 textual (co-text), 11, 52, 140,
correspondence, 80 148, 302
semantic, 138–140, 147, 200 Contextuality, 9, 13, 52, 57, 260
thematic, 15, 162, 171, 173, 186, Corpus, 37, 89, 123, 217
191 partition, 132, 185
Collocation, 13, 207, 217, 218, reference corpus, 287, 300, 301,
226, 252, 254, 256, 304
258–260 Corpus-Assisted Discourse Studies
geocollocations, 16, 253, (CADS), 11, 16, 24, 27,
257–269, 273–275, 278, 217, 247
279 Corpus-driven, 15, 74, 132
graph, 252, 254, 256 Corpus pragmatics, 275
profile, 254, 274 Critical Discourse Analysis (CDA),
Communication 7, 11, 28
acts of, 189 Culture, 6, 92, 97, 254, 255,
global, 6, 288 275–278
mass, 257 discourse, 245
multilingual, 290 political, 157, 169
political, 289, 308, 310 of political communication, 289
social, 288, 290
transnational, 290
Twitter communication, 16, D
285–312 Digital humanities, 10, 16, 253
verbal, 289 Discourse
Complexity, 15, 24, 30, 35–43, 62, analysis, 7, 9, 11, 13–16, 42,
73, 81, 96, 124, 168, 183, 51–83, 123, 124, 126, 128,
188, 252 130, 155, 167, 183–208,
Concordance/keyword in context 215, 217, 218, 246,
(KWIC), 218, 302 251–254, 257, 258, 274,
Context 275, 279, 286–312
global, 189–192, 194, 207 media, 90
historical, 7, 8, 11, 54, 55, 125 social, 301
Index 317
political, 10, 124, 125, 132, 188, K

289n1, 306 Knowledge, 3, 6–8, 11, 14, 28, 34,
Dispositif, 8, 9, 14, 51–83, 91, 94, 43, 54, 58, 83, 90, 92–99,
96, 97, 114–116 102, 105, 111, 114, 115,
Dynamic(s) 123, 124, 127, 138, 189,
cognitive, 9 190, 190n2, 190n3, 192,
discourse, 140–143, 148, 215, 308
247, 295 biographical, 60
social, 9, 61, 83 collective, 287
symbol, 53 construction, 4–6, 190, 207
disciplinary, 41, 43
emerging, 201
G formation, 3, 4, 7, 127
Globalisation, 3 frame, 34
Governmentality, 14, 91–98 horizon, 66
ideological, 59
implicit, 206–208
H methodological, 43
Heuristic production, 4–6, 11, 58, 97, 98,
qualitative, 64, 72, 128 105
quantitative, 72, 123–149 scientific, 32, 43, 91–93, 105, 115
socio-historical, 38
structure, 189, 207
I theoretical, 38, 43, 254
Interaction, 10, 299 transtextual, 189
academic, 56, 63 world, 190, 302
human, 257
social, 4, 287
spontaneous, 61 L
Interpretation, 11, 12, 14, 16, 24, Language
32–34, 36, 38, 43, 53–55, bias, 289, 290
57, 60–64, 69, 75, 76, functions, 27, 29
78–80, 82, 125, 127–129, materiality, 33
132, 136, 142, 148, 195, model, 206–208, 274
197, 198, 252, 253, 255, natural, 4, 7, 10, 12, 96, 184
260, 273, 286 natural language processing, 4,
local, 35 184, 300
theoretical, 57, 80 scripting, 275, 276, 278
318 Index
Language use, 73, 194, 201, 239, qualitative, 12, 30, 32, 35, 38, 41,
258, 290 56, 58, 64, 124, 128, 145,
patterns of language use, 124, 147, 148
252, 254, 259, 287 quantitative, 8, 11, 13, 14, 24,
visualising, 256 27–32, 38, 41, 53, 56, 58,
Legitimation, 4, 27, 112, 188, 310 59, 99, 123, 124, 143, 147,
Lexicometry, 13, 26, 184–192, 148, 286
252
Linguistics
corpus, 4, 7, 11–13, 15, 16, 70, P
184, 191, 217–220, 251, Participant, 12, 66, 69, 130,
252, 254, 257, 275, 294, 135–136, 141, 142, 148
300 Politics, 135, 136, 145, 188, 195,
quantitative, 4, 13, 125, 294 291, 292, 306, 307
socio, 9, 10, 124 deep, 157
text, 26 international, 142, 168, 169, 174
parapolitics, 157, 167–176
of scientific research, 167
M Position
Meaning, 6–12, 30, 32, 33, 35, 51, discursive, 12
52, 63, 64, 73, 78, 123, geographic, 258
127, 128, 131, 184, 191, institutional, 80
207 ontological, 29
access, 27, 33 political, 6, 145
construction, 6, 7, 9, 53, 73, 127, social, 53
131, 217 speaker, 287
explicit, 217, 218 symbolic, 65
implicit, 217, 218 Power, 7, 8, 14, 27, 52, 54, 55, 62,
latent, 191 66, 78, 79, 81, 90, 92–94,
multiple, 63 96, 97, 100, 114, 124, 157,
production, 10, 61, 63 229
referential, 76 academic, 61
social, 52, 55 attractive, 171, 174, 175
Methods effects, 61, 287
corpus, 9, 11, 12, 16, 126, 143, political, 160, 168
217, 251 relations, 7, 58, 60, 62, 72, 124,
mixed, 14, 24, 28–31, 42, 43, 288
147 structures, 52, 68, 93
Index 319
Practice, 8–10, 12, 14, 15, 25, 28, Scientometrics, 13, 14, 89–116
30, 31, 39, 52, 54, 62, 69, Situation, 36, 124
90–99, 105, 106, 113–116, Social
123, 128, 131, 255–256 change, 3, 5, 10, 98, 192
discursive, 7, 9, 11, 14, 52–55, dynamic, 9, 61, 69, 83
58, 60, 64, 65, 81, 93, 96, logics, 62
128 order, 53, 61, 64, 72, 91–98, 114
institutionalised, 8 reflexivity, 15
language, 8, 12 Social media
linguistic, 54, 62 Facebook, 6, 99
meaning-making, 51, 52, 55 Twitter, 6, 16, 286–312
political, 157 Society, 55, 56, 58, 60, 63, 81, 91,
of programming, 276 92, 95, 98, 100, 124, 131,
research, 25, 28, 29, 43, 68, 129, 229, 288, 310
253 Sociology, 10, 13–15, 53, 58, 71–73,
social, 24, 55, 60, 91, 106–113, 77, 78, 81, 82, 162–167,
115, 123 177
of text production, 59 of education, 177
of religion, 158, 167
of work, 165
R Strategy, 81, 108, 110, 115
Register, 130 selection, 189
Representation, 15, 16, 32, 38, 41, Structure, 8, 14, 54, 55, 60, 62, 80,
55, 113, 190, 216n2, 91, 93, 95, 96, 113, 126,
228–236, 241 129, 132, 148, 167, 184,
geographical, 262, 279 185, 189, 207
mental, 262 actor-based, 287
visual, 133, 252 data-driven, 186, 201
Representativeness, 7, 102, 129, 169, discourse, 12, 72, 185
184, 188, 189, 198, economic, 156
200–202, 207, 289, 296 geopolitical, 177
institutional, 9, 58, 65, 67
internal, 164, 300
S lexicosemantic, 156, 165, 167,
Science(s) 252
political, 13, 145, 167, 168, 289 macro-structure, 13, 15, 64, 72,
social, 4, 7, 12, 13, 26, 32, 39, 125, 127, 132, 142, 143, 148
63, 98, 99, 107, 149, 156, semantic, 132, 157, 183, 184,
184, 188, 253, 286, 289 208
320 Index
Structure (cont.) information, 52n1, 64, 68, 69

social, 8, 52, 54, 55, 58, 64, 69, media, 289
81, 93 self-other-world, 32
micro and macro, 81 symbolic, 41
societal, 5, 8
textual, 177
thematic, 177 T
Studies Text
communication, 13, 27, 90 digital
discourse, 4, 6, 7, 12–14, 123–149, natively digital, 183
184–186, 258, 286 retro-digitized, 183
media, 13 Text mining, 12, 13, 15, 183–208,
Synchron/diachron, 15, 16, 66, 67, 255
127, 156, 157, 185, 247 Topic
System major, 194, 204
academic, 67, 113 modelling, 15, 185–187, 192,
belief, 227 194–201, 206
category, 37, 42, 201, 202, 205 sub, 194
concordance, 217 Transnationalisation, 17, 288–290,
of énoncés, 275 293, 294, 308, 311

Quantifying Approaches To Discourse For Social Scientists PDF

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Quantifying Approaches To Discourse For Social Scientists PDF

Încărcat de

Drepturi de autor:

Formate disponibile

Postdisciplinary Studies in Discourse

More information about this series at

Postdisciplinary Studies in Discourse

Library of Congress Control Number: 2018958470

© The Editor(s) (if applicable) and The Author(s) 2019

Cover design by Tjasa Krivec

“In today’s complex world of communication there is an urgent need to stand

“This is a very welcome addition to the literature on quantitative methods

“Bringing together a wide range of quantitative approaches to discourse analysis

“In the fast-moving field of text processing, keeping up with methodological

Part I Introductory Remarks 1

1 Understanding Twenty-First-Century Societies Using

2 Beyond the Quantitative and Qualitative Cleavage:

Part II Analysing Institutional Contexts of Discourses 49

3 The Academic Dispositif: Towards a Context-Centred

4 On the Social Uses of Scientometrics: The Quantification

Part III Exploring Corpora: Heuristics, Topic Modelling and

5 Lexicometry: A Quantifying Heuristic for Social Scientists

6 Words and Facts: Textual Analysis—Topic-Centred

7 Text Mining for Discourse Analysis: An Exemplary Study

Part IV New Developments in Corpus-Assisted Discourse

8 The Value of Revisiting and Extending Previous Studies:

9 The Linguistic Construction of World: An Example of

10 Multi-method Discourse Analysis of Twitter

Katrin Affolter is a PhD student in a joined program between the Zurich

Maria Becker is a doctoral researcher at the Department of Computational

universities of Heidelberg and Granada. Müller leads the Discourse Lab, a

Stefan Steiger is a research associate at the University of Hildesheim and doc-

Fig. 2.1 Confluences in discourse analysis 25

Fig. 5.7 Map of text sections of interviewees displaying prototypical

Fig. 9.4 Reduced Dorling view, comparison of selected countries:

Table 3.1 Three levels of analysis 56

© The Author(s) 2019 3

discourses with a particular inherent logic becomes evident. Moreover,

2  ocietal Trends Influencing Knowledge

media narratives; and (3) a democratisation of information production

Indeed, new communication devices have enhanced communication on a

3 Discourse, Context, and Meaning

we have to add elements that refer to the social structure, the context of

discourses (Keller 2013). In this volume, Hamann et al. (Chap. 3) use

sociology of language (Achard 1993; Bacot and Rémy-Giraud 2011;

4  hallenges and Chances for Discourse

data creates methodological challenges that are similar to those discussed

Partington et al. 2013). Corpus methods applied in discourse studies can

5 The Purpose of This Volume

The authors of this volume were asked to present their methods in an

and qualitative paradigms. After a short overview of theories that have

thesis, the authors point out the non-scientific conditions of scientific

large text collections, thus making qualitative aspects of diachronic discourses

———. 1988. Restons traditionnels et progressifs. Pour une nouvelle analyse du

© The Author(s) 2019 23

different approaches in discourse analysis deal with these oppositions, and

one can hardly speak of discourse analysis as a discipline, it is because of

Hjelmslev (1931/1963) Linguistic Cercle of

Austin (1962) Speech acts

Berger & Luckman (1966) Social

Lacan (1966), Psychoanalysis

Pêcheux (1969), Automatic discourse analysis Searle (1970) Philosophy of langage

Foucault (1969), Discourse analysis Sacks (1972) Conversation analysis

Althusser (1970), Ideology

Stone (1966) General Inquirer

Guiraud (1960) Linguistic statistics

Fig. 2.1 Confluences in discourse analysis

Part I Introductory Remarks 1

1 Understanding Twenty-First-Century Societies Using

2 Beyond the Quantitative and Qualitative Cleavage:

Part II Analysing Institutional Contexts of Discourses 49

3 The Academic Dispositif: Towards a Context-Centred

4 On the Social Uses of Scientometrics: The Quantification

Part III Exploring Corpora: Heuristics, Topic Modelling and

5 Lexicometry: A Quantifying Heuristic for Social Scientists

6 Words and Facts: Textual Analysis—Topic-Centred

7 Text Mining for Discourse Analysis: An Exemplary Study

Part IV New Developments in Corpus-Assisted Discourse

8 The Value of Revisiting and Extending Previous Studies:

9 The Linguistic Construction of World: An Example of

10 Multi-method Discourse Analysis of Twitter

Fig. 2.1 Confluences in discourse analysis 25

Table 3.1 Three levels of analysis 56

2 ocietal Trends Influencing Knowledge

3 Discourse, Context, and Meaning

4 hallenges and Chances for Discourse

5 The Purpose of This Volume

5 The Problem of Complexity

6 ausality and Measurement in Discourse

2 Integrating Sociological Data

2.1 The Problem of Text and Context

2.2 Levels of Discourse Analysis

2.3 Accounting for Social Context

3 Dispositif as a Heuristic Concept

3.1 Three Aspects of Academia as a Dispositif

3.1.1 Power, Closure, Sedimentation

3.1.2 Fields as Heterogeneous Arenas

3.1.3 Discursive Circulation and Interpretation

4 ow to Use Sociological Data

4.1 apping Social Contexts with Statistics: Actors,

4.2 Challenges and Potentials of Context Data

4.3 Examples of Statistical Categories

5 xample: Research Interests as an Entry

5.1 The Corpus and Method

5.2 Limits of and Obstacles to Interpretation

6 he Heuristic Potential of the Dispositif