Series Editor
Johannes Angermuller
Centre for Applied Linguistics
University of Warwick
Coventry, UK
Postdisciplinary Studies in Discourse engages in the exchange between discourse theory and analysis while putting emphasis on the intellectual challenges in discourse research. Moving beyond disciplinary divisions in today’s social sciences, the contributions deal with critical issues at the intersections between language and society. Edited by Johannes Angermuller together with members of DiscourseNet, the series welcomes high-quality manuscripts in discourse research from all disciplinary and geographical backgrounds. DiscourseNet is an international and interdisciplinary network of researchers which is open to discourse analysts and theorists from all backgrounds.
Editorial board: Cristina Arancibia, Aurora Fragonara, Péter Furkó, Tian Hailong, Jens Maesse, Eduardo Chávez Herrera, Michael Kranert, Jan Krasni, María Laura Pardo, Yannik Porsché, Kaushalya Perera, Luciana Radut-Gaghi, Marco Antonio Ruiz, Jan Zienkowski
Quantifying Approaches to Discourse for Social Scientists
Editor
Ronny Scholz
Centre for Applied Linguistics
University of Warwick
Coventry, UK
This Palgrave Macmillan imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgements
The idea for this volume was born during the First International
DiscourseNet Congress in Bremen in late summer 2015. Together with
Marcus Müller, Tony McEnery, and André Salem, we had organized the
panel ‘Quantifying methods in discourse studies. Possibilities and limits
for the analysis of discursive practices’. With scholars of international
renown coming from linguistics, statistics, computer sciences, sociology
and political sciences in countries such as Canada, France, Germany,
Switzerland and the United Kingdom as well as guests from many other
countries, the panel was a real success. The papers presented seminal work using a great variety of quantifying methods, mostly in combination with qualitative methods.
This edited volume is driven by the interdisciplinary and international
attitude of the inspiring discussions that we led in the panel. Completing
such a fascinating project would not have been possible without a network of international supporters, to name only a few of them: ERC
DISCONEX Group and the Professional and Academic Discourses
Group, both hosted in Applied Linguistics at the University of Warwick;
the Centre d’Étude des Discours, Images, Textes Écrits, Communication
(CEDITEC) at the University Paris Est-Créteil; DiscourseLab at the TU
Darmstadt; and last but not least, DiscourseNet, which unites discourse
researchers across national and disciplinary borders all over the world.
This volume would not have seen the light of day without the support of
many colleagues and friends. I am grateful to the anonymous reviewers
who provided me with detailed and encouraging feedback. I also thank
the series editor Johannes Angermuller and the editorial assistant Beth
Farrow from Palgrave for supporting this publication project throughout
with great enthusiasm. Finally, I am thankful to my wife Joy Malala and
to our new-born son Gabriel Amani for tolerating the extra hours that I
had to put into editing this volume.
Praise for Quantifying Approaches to Discourse for Social Scientists
guistics, this volume provides a great overview for both beginners and experts in
the field of quantitative discourse analysis.”
—Professor Annika Mattissek, Professor for Economic Geography and Sustainable
Development, Department of Geography, University of Freiburg, Germany
“With the discourse turn in the social sciences, the need for a state-of-the-art guide to practice and theory of meaning construction is evident. In this volume,
leading British and continental scholars present quantitative and qualitative
methods of exploring discourse and the wider context into which texts are
embedded, while discussing and bringing together the approaches of Critical
Discourse Analysis and the Foucauldian dispositif. Long overdue!”
—Wolfgang Teubert, Emeritus Professor, Department of English Language and
Linguistics, University of Birmingham, UK
Contents
Index 315
Notes on Contributors
Thed van Leeuwen is a senior researcher at the Centre for Science and
Technology Studies (CWTS) of Leiden University in the Netherlands. He is co-
leading the research theme on Open Science, and the project leader of the Open
Science Monitor. As a member of the SES research group, Thed is also involved in research topics relating to the evaluation of research, in particular in the social sciences and humanities, as well as in the ways research quality is perceived. The overarching science policy context under which research assessments are organized and the role of bibliometric indicators therein are of major concern for this research agenda. Thed is co-editor of the OUP journal Research Evaluation,
as well as associate editor of the Frontiers journal Research Metrics & Analytics.
Jens Maesse is Assistant Professor in the Department of Sociology, University
of Giessen. His research focus is on discourse analysis, sociology of science and
education, economic sociology and political economy. His publications include
‘Austerity discourses in Europe: How economic experts create identity projects’, Innovation: The European Journal of Social Science Research 31 (1): 8–24 (2018), and ‘The elitism dispositif: Hierarchization, discourses of excellence and organisational change in European economics’, Higher Education 73: 909–927 (2017).
Tony McEnery is Distinguished Professor of English Language and Linguistics
at Lancaster University. He is currently a Group Director (Sector Strategy) at
Trinity College London, on secondment from Lancaster University. Tony was
previously Director of Research and Interim Chief Executive at the UK’s
Economic and Social Research Council (ESRC). He was also the Director of the
ESRC Centre for Corpus Approaches to Social Science at Lancaster. He has
published extensively on corpus linguistics.
Karl M. van Meter is a research sociologist at the Centre Maurice Halbwachs
(ENS Paris) and an expert in sociological methods and methodologies. He is an
American-French citizen with university degrees from the US, the UK and
France. Although his PhD was in pure mathematics, he founded and directed
for 34 years the bilingual Bulletin of Sociological Methodology/Bulletin de
Méthodologie sociologique, which is now with Sage Publications. In his research
he uses mainly quantitative text-processing methods with which he traces major historical shifts in French, German and American sociologies, and the representation of politics in society.
Marcus Müller is full professor in German Studies—Digital Linguistics at the
Department of Linguistics and Literature, Technische Universität Darmstadt.
He studied German philology, romance studies and European art history at the
Part I
Introductory Remarks
1
Understanding Twenty-First-Century Societies Using Quantifying Text-Processing Methods
Ronny Scholz
1 Analysing Knowledge-Based Post-industrial Societies: Challenges and Chances
During the last 50 years, Western societies have experienced substantial changes. The phenomena of Europeanisation and globalisation as well as technical innovations such as the Internet and social media have revolutionised the way we use language when interacting, socialising with each other, or storing and recalling knowledge. In fact, the Internet has fostered access to globally produced information. In the abundance of sometimes contradicting information, the formation of knowledge in
I am thankful to Malcolm MacDonald, Joy Malala and Yannik Porsché for their helpful comments
on earlier versions of this text.
R. Scholz (*)
Centre for Applied Linguistics, University of Warwick, Coventry, UK
e-mail: r.scholz@warwick.ac.uk
part of it. Thus, social scientists doing discourse analysis will, for instance, gain insights into: how knowledge about groups, communities, and social identities is constructed in discourses (inclusion and exclusion; gender, religion, race, class, nation); how this structuration is justified; how social spaces and positions are constructed, negotiated, and orchestrated; which narratives and ideologies drive different actors, groups, and communities in society; and how decisions in a given society or a part of it are being legitimised. Moreover, with its capacity to map prevailing argumentations and narratives, discourse studies can reveal how values are articulated in a particular way in order to justify the stance of a specific group, social strata, or class. In this sense, discourse analysts can show how society is governed and hence can feed into the formulation of a social critique that contributes to social progress (Herzog 2016).
Foucault’s philosophy has helped to understand the formation of knowledge in terms of discourses that organise knowledge. Most importantly, he has insisted on the fact that discourses are driven by power relations that are rooted in the institutional, social, societal, and historical contexts in which language users have to operate (Foucault 1970, 1972, 1979). Foucault’s theoretical categories have informed discourse analytical approaches not only in France but across the globe—to name only a few, the Discourse Linguistic Approach (Warnke and Spitzmüller 2008) and the Sociology of Knowledge Approach (Keller 2013) in Germany, or the Discourse Historical Approach (Reisigl and Wodak 2016) and Critical Discourse Analysis (van Dijk 1997; Fairclough 1995; Wodak and Meyer 2001). What is common to all approaches in discourse studies is their fundamental interest in meaning construction through natural language use in context.
There are numerous definitions of discourse. I will touch upon two which best fit the purposes of this volume. First, there is Busse and Teubert’s definition, which is common in German discourse linguistics. They define discourse as a ‘virtual text corpus’ containing all sorts of texts that have been produced on a particular topic. In order to analyse a discourse, a researcher has to compile a ‘concrete text corpus’ from a representative selection of texts of the ‘virtual text corpus’ (Busse and Teubert 2014, 344). This definition might satisfy corpus linguists, but if we want to analyse discourse practices from a perspective that accommodates the broader spectrum of social sciences and humanities,
terns. Combining these three methods allows the authors to assess the
degree of transnationalisation in the two fields.
References
Achard, Pierre. 1993. La sociologie du langage. Paris: PUF.
Angermuller, Johannes, Dominique Maingueneau, and Ruth Wodak, eds. 2014.
The discourse studies reader. Main currents in theory and analysis. Amsterdam:
John Benjamins.
Antonijevic, Smiljana. 2013. The immersive hand: Non-verbal communication in virtual environments. In The immersive internet. Reflections on the entangling of the virtual with society, politics and the economy, ed. Dominic Power
and Robin Teigland, 92–105. Basingstoke: Palgrave Macmillan.
Authier-Revuz, Jacqueline. 1984. Hétérogénéité(s) énonciative(s). Langages 73:
98–111.
Bacot, Paul, and Silvianne Rémy-Giraud. 2011. Mots de l’espace et conflictualité
sociale. Paris: L’Harmattan.
Baker, Paul. 2006. Using corpora in discourse analysis. London: Continuum.
Baker, Paul, and Tony McEnery. 2015. Corpora and discourse studies. Integrating
discourse and corpora, Palgrave advances in language and linguistics. Basingstoke:
Palgrave Macmillan.
Barats, Christine, ed. 2013. Manuel d’analyse du web en Sciences Humaines et
Sociales. Paris: Armand Colin.
Barry, Andrew, ed. 2001. Political machines. Governing a technological society.
London: Athlone Press.
Beetz, Johannes, and Veit Schwab, eds. 2018. Material discourse—Materialist
analysis. Lanham, MD: Lexington Books.
Biemann, Chris, and Alexander Mehler, eds. 2014. Text mining. From ontology learning to automated text processing applications – Festschrift in honor of Gerhard Heyer. Theory and Applications of Natural Language Processing. Cham: Springer.
Blommaert, Jan. 2005. Discourse, a critical introduction. Cambridge: Cambridge
University Press.
Bourque, Gilles, and Jules Duchastel. 1984. Analyser le discours politique
duplessiste: méthode et illustration. Cahiers de recherche sociologique 2 (1, Le
discours social et ses usages): 99–136.
Hegelich, Simon, and Dietmar Janetzko. 2016. Are social bots on Twitter political actors? Empirical evidence from a Ukrainian social botnet. Proceedings of
the Tenth International AAAI Conference on Web and Social Media. Accessed
July 1, 2018. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/
paper/view/13015.
Herzog, Benno. 2016. Discourse analysis as social critique—Discursive and non-
discursive realities in critical social research. Basingstoke: Palgrave Macmillan.
Ignatow, Gabe, and Rada Mihalcea. 2017. Text mining. A guidebook for the social
sciences. Los Angeles: SAGE.
Jockers, Matthew Lee. 2014. Text analysis with R for students of literature
(Quantitative Methods in the Humanities and Social Sciences). Cham:
Springer.
Jones, Rodney H., Alice Chik, and Christoph A. Hafner, eds. 2015. Discourse
and digital practices. Doing discourse analysis in the digital age. London:
Routledge, Taylor & Francis Group.
Keller, Reiner. 2013. Doing discourse research. An introduction for social scientists.
London: Sage.
Kerbrat-Orecchioni, Catherine. 1980. L’Énonciation. De la subjectivité dans le
langage. Paris: Armand Colin.
KhosraviNik, Majid. 2016. Social Media Critical Discourse Studies (SM-CDS):
Towards a CDS understanding of discourse analysis on participatory web. In
Handbook of critical discourse analysis, ed. John Flowerdew and John
E. Richardson. London: Routledge.
Lafont, Robert. 1978. Le travail et la langue. Paris: Flammarion.
Lafont, Robert, Françoise Madray-Lesigne, and Paul Siblot. 1983. Pratiques praxématiques: introduction à une analyse matérialiste du sens. Numéro spécial de: Cahiers de linguistique sociale 6: 1–155.
Leimdorfer, François. 2011. Les sociologues et le langage. Paris: Editions de la
MSH.
Leimdorfer, François, and André Salem. 1995. Usages de la lexicométrie en
analyse de discours. Cahiers des Sciences humaines 31 (1): 131–143.
Luhmann, Niklas. 2000. The reality of the mass media. Cambridge: Polity Press.
Original edition, 1995.
Maesse, Jens. 2015. Economic experts. A discursive political economy of economics. Journal of Multicultural Discourses 10 (3): 279–305.
Maingueneau, Dominique. 1997. Pragmatique pour le discours littéraire. Paris:
Dunod.
———. 2013. Genres de discours et web: existe-t-il des genres web? In Manuel
d’analyse du web en Sciences Humaines et Sociales, ed. Christine Barats, 74–98.
Paris: Armand Colin.
Mautner, Gerlinde. 2005. Time to get wired: Using web-based corpora in critical discourse analysis. Discourse & Society 16 (6): 809–828.
———. 2012. Corpora and critical discourse analysis. In Contemporary corpus
linguistics, ed. Paul Baker, 32–46. London: Continuum.
Norris, Sigrid, and Rodney H. Jones. 2005. Discourse in action. Introducing
mediated discourse analysis. London: Routledge.
Partington, Alan, Alison Duguid, and Charlotte Taylor. 2013. Patterns and
meanings in discourse. Theory and practice in corpus-assisted discourse studies
(CADS), Studies in corpus linguistics. Vol. 55. Amsterdam: Benjamins.
Paveau, Marie-Anne. 2017. L’analyse du discours numérique. Dictionnaire des
formes et des pratiques, Collection Cultures numériques. Paris: Hermann.
Porsché, Yannik. 2018. Public representations of immigrants in museums—
Exhibition and exposure in France and Germany. Basingstoke: Palgrave
Macmillan.
Raffnsøe, Sverre, Marius Gudmand-Høyer, and Morten S. Thaning. 2016. Foucault’s dispositive: The perspicacity of dispositive analytics in organizational research. Organization 23 (2): 272–298.
Reboul, Anne, and Jacques Moeschler, eds. 1998. Pragmatique du discours. De
l’interprétation de l’énoncé à l’interprétation du discours. Paris: Colin.
Reisigl, Martin, and Ruth Wodak. 2016. Discourse Historical Approach (DHA).
In Methods of critical discourse studies, ed. Ruth Wodak and Michael Meyer,
23–61. Los Angeles, CA: Sage.
Rosental, Claude. 2015. From numbers to demos: Assessing, managing and
advertising European research. Histoire de la Recherche Contemporaine 4 (2):
163–170.
Schroeder, Ralph. 2014. Big data and the brave new world of social media
research. Big Data & Society 1 (2): 1–11.
Siblot, Paul. 1990. Une linguistique qui n’a plus peur du réel. Cahiers de praxématique 15: 57–76.
———. 1997. Nomination et production de sens: le praxème. Langages 127:
38–55.
Tufekci, Zeynep. 2015. Algorithmic harms beyond Facebook and Google:
Emergent challenges of computational agency. Colorado Technology Law
Journal 13 (2): 203–218.
Vološinov, Valentin N. 1973. Marxism and the philosophy of language. New York:
Seminar Press. Original edition, 1929.
Warnke, Ingo H., and Jürgen Spitzmüller, eds. 2008. Methoden der
Diskurslinguistik. Sprachwissenschaftliche Zugänge zur transtextuellen Ebene.
Berlin and New York: De Gruyter.
Wiedemann, Gregor. 2016. Text mining for qualitative data analysis in the social
sciences. A study on democratic discourse in Germany. Wiesbaden: Springer VS.
Wodak, Ruth, and Michael Meyer. 2001. Methods of critical discourse analysis.
Introducing qualitative methods. London: Sage.
2
Beyond the Quantitative and Qualitative Cleavage: Confluence of Research Operations in Discourse Analysis
Jules Duchastel and Danielle Laberge
1 Introduction
The world of social and language sciences is characterised by many cleavages: between understanding and explaining, between structural and phenomenological analysis, between different fields and disciplines related to the study of language, between different national and continental traditions, and between qualitative and quantitative approaches. These oppositions often create new avenues of thought, but they become sterile when giving up important aspects of the analysis. We will ask ourselves how
J. Duchastel (*)
Department of Sociology, UQAM – Université du Québec à Montréal,
Montréal, QC, Canada
e-mail: duchastel.jules@uqam.ca
D. Laberge
Department of Management and Technology, UQAM – Université du Québec
à Montréal, Montréal, QC, Canada
e-mail: laberge.danielle@uqam.ca
2 Oppositions and Convergences in the Field of Discourse Analysis
Discourse analysis stands at the confluence of various disciplines, traditions, and approaches. It arose from a dual need to overcome, in the humanities, the limited focus on content and, in the language sciences, the restricted structural approach to language. Discourse analysis introduced the need to consider language in its social context and apprehend content as it is materialised in linguistic forms and functions. Discourse analysis can be considered as a merger of two great traditions: the hermeneutical tradition of humanities and social sciences, based on the meaning of social practices and institutions, and the more functional and structural tradition of language sciences that focuses on the description of different aspects of language use. Within the context of this confluence, a third axis emerged, that of statistical and computer sciences, leading to the development of a tradition of computer-assisted discourse analysis. If
[Figure: traditions converging in discourse analysis. Qualitative analysis: Dubois (1969), Benveniste (1966), French school of enunciation analysis. Content analysis: Lasswell (1952) communication theory; Muller (1968) lexical statistics; Berelson (1952) content analysis; lexicometry.]
1. It has to be noted that both traditions are not hermetically closed. For instance, the French school of discourse analysis was initially inspired by Zellig Harris’s (1952) distributional approach to language.
3 Mixed Methods
The confluence of theoretical and methodological approaches in the current practices of discourse analysis involves the use of mixed methods. The idea of mixed methods fits into the broader project to overcome the opposition between qualitative and quantitative approaches, and to
2. See also Table 6.2 ‘Paradigm positions on Selected Practical Issues’ in Guba and Lincoln (1994) and Table 1 ‘Trois positions ontologiques dans les sciences sociales contemporaines’ in Duchastel and Laberge (1999b).
needs of the research project and the nature of the data. The choice is up to the researcher to establish the sequence of qualitative and quantitative methods and their relative importance (QUAN > qual, QUAL > quan, QUAN = QUAL) as part of the research process. The second argument is more substantive. It justifies the hybridization of methods according to the nature of the data. For example, discourse analysis and content analysis are applied to phenomena including aspects of both a qualitative and a quantitative nature. The third argument is epistemological. The use of mixed methods is legitimated by the idea of triangulation. Triangulation is seen as a way to increase confidence in the research results. However, we must recognize that the use of the term ‘triangulation’ is mostly metaphorical (Kelle 2001) and does not formally ensure greater validity, except in the form of convergence or confirmation of findings. In sum, the use of mixed methods only proves that there should not be mutually exclusive types of methods. It seems, however, insufficient to reduce the issue of mixed methods to their sole effectiveness without trying to understand the implications of the epistemological, analytical, and operational oppositions characterizing both qualitative and quantitative paradigms on these new forms of empirical approaches.
4 Explaining and Understanding
What can be drawn from the above? On the one hand, we have established that the practice of discourse analysis is at the confluence of several disciplines, themselves relying on more or less quantitative or qualitative, phenomenological or structural, linguistic or sociological approaches. While each tradition has established itself on epistemological, theoretical, and methodological oppositions with other traditions, we can nevertheless observe a certain convergence in the use of methods and the mitigation of previous fractures. On the other hand, the fundamental opposition between qualitative and quantitative methods seems to dissolve in the pragmatic choice of mixed methods. This pragmatism often avoids examination of the ontological and epistemological foundations of this practice. This is why we have to question the possible reconciliation of these two so strongly opposed paradigms.
in the material fabric of language. This is the case with enunciation analysis, which seeks the inscription of speaker and audience in the thread of discourse. The same is true with the study of markers of argumentation. According to Gee (2011), discourse analysis is about the study of speech on three levels: the analysis of the information it conveys (saying), of the action it raises (doing), and of the identity it formulates (being). Each of these dimensions is identifiable only through linguistic forms that make them intelligible. The interpretation must rely on certain classes of observation units and the description of their properties. This process is objectifying as well as interpretative.
If this is true, a restrictive approach to interpretation cannot be sustained. Interpretation cannot be limited to the final act of the research process when making sense of results. Rather, interpretation should be present at the very beginning of the research process. Interpretation is part of every research procedure, and all procedures rely on interpretation. This means that explanatory procedures and interpretation go hand in hand and do not oppose each other, as the quarrel of paradigms would suggest. Rather than designing two general paradigms defined by their purpose, explaining or understanding, it is more productive to integrate both actions within a single process. No science can do without a proper pre-comprehension of the object. There is always a knowledge frame, more or less theoretical, which predetermines the grasping of reality.
What is sought is to increase this preliminary understanding. Explanation is most often thought of as establishing a relationship between two phenomena. But it also has a semantic sense. Kaplan (1964) has defined interpretation as a semantical explanation, that is, explaining the meaning of a statement. In both cases, the goal is to better understand. The various procedures for observation, description and analysis of objects are designed to enhance understanding by distancing the object from the subject and by linking the object with the cognitive frameworks at play.

However, we must consider the asymmetry of the two processes of explanation and interpretation. While explanatory procedures can be controlled to a certain point, the act of interpretation, even if it is well framed, remains difficult to define. The cognitive capacities of the researcher, semantic, emotional, or cultural, will result in some uncertainty of interpretation. However, it is easier to control the micro-level of the interpretation
Fig. 2.2 Transformation of the text. The figure is an adaptation of a schema presented in Meunier (1993)
the speech that will become a text ‘outside of the world’, in the words of Ricœur. In the case of oral discourse, we first proceed to its transcription. Oral discourse includes a set of prosodic and contextual features that can be recorded in a more or less developed format using established conventions (e.g., Jefferson 2004). The ‘manuscript’ text is an object both different from and less complex than the original, in the sense that the conditions and context of its production and enunciation are no longer present otherwise than within the text itself.
The next transformation will produce an ‘edited’ text. Whatever the characterization of the manuscripts, transcripts of oral discourse, in paper or computerized format, standardization and normalization work must be done in order to make the various elements of a corpus comparable. Information about the conditions of production of speech and of enunciation (speaker, support, place, time, etc.) must define each document of a corpus. We get a new ‘edited’ text which will subsequently be the object of description, exploration, and analysis. In summary, the ‘manuscript’ text is a derivation of the original discourse whose version has been established by authentication or transcription, and the ‘edited’ text is, in turn, the result of standardization and indexation according to a system of rules and descriptive categories. It is on the basis of this ‘edited’ text that the work of description, exploration, and analysis can be further performed.
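By way of illustration only (not the authors’ own tooling), the derivation of an ‘edited’ text from a ‘manuscript’ transcript, standardization of the wording plus indexation with metadata on the conditions of production, might be sketched as follows; the names `EditedText` and `edit_text` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EditedText:
    """An 'edited' text: standardized wording plus metadata on its production."""
    text: str
    metadata: dict = field(default_factory=dict)

def edit_text(manuscript: str, **metadata) -> EditedText:
    # Standardization: collapse line breaks and repeated spaces left over
    # from transcription, and unify typographic quotes.
    normalized = " ".join(manuscript.split())
    normalized = normalized.replace("\u201c", '"').replace("\u201d", '"')
    # Indexation: attach the conditions of production and enunciation
    # (speaker, support, place, time, ...) to the document.
    return EditedText(text=normalized, metadata=metadata)

doc = edit_text("Ladies  and gentlemen,\nwelcome.",
                speaker="A", place="Bremen", time="2015")
print(doc.text)  # Ladies and gentlemen, welcome.
```

Whatever the concrete implementation, the point is that the edited text and its metadata travel together, so that every element of the corpus remains comparable and traceable to its conditions of production.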
Which actions should then be performed on this textual material? We can define two universal research operations, whatever the approach. The first task is to establish the observation units: What is to be observed? The second task consists of the description of these units based on one or more systems of categories: How is it to be observed? Observation units can be represented as a set of nested elements, from the global corpus to the sub-corpora, to the collection of texts that constitute each of them, to the various parts of each text, and finally to the middle- and micro-level text units. Each nesting level of units may be described with a system of categories. The corpus itself and its subsets are indexed with a metadata system. Every text component (section, paragraph, verbal exchanges, etc.) can be marked. Finally, speech units (textual segments, turns of speech, sentences, words) are coded depending on the research target (e.g., morpho-syntactic, semantic, pragmatic, enunciative, argumentative coding). Thus, the descriptive system unfolds at three levels: The corpus
that allow for the appropriation of the object for ourselves, that is to say,
its understanding.
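As a minimal sketch (hypothetical data structure and function names, not the authors’ software), the nested observation units and their category systems described above can be represented as plain nested dictionaries, with a small retrieval function standing in for an observation procedure:

```python
# Nesting: corpus -> texts -> components -> speech units, with descriptive
# categories attached at each level, as described in the chapter.
corpus = {
    "metadata": {"topic": "austerity"},               # corpus-level indexation
    "texts": [{
        "metadata": {"speaker": "minister", "genre": "speech"},
        "components": [{
            "kind": "paragraph",                       # marked text component
            "units": [                                 # coded speech units
                {"token": "We",   "coding": {"pos": "PRON"}},
                {"token": "must", "coding": {"pos": "AUX", "modality": "deontic"}},
                {"token": "act",  "coding": {"pos": "VERB"}},
            ],
        }],
    }],
}

def find_units(corpus, key, value):
    """Observation: yield every token whose coding carries the given category value."""
    for text in corpus["texts"]:
        for component in text["components"]:
            for unit in component["units"]:
                if unit["coding"].get(key) == value:
                    yield unit["token"]

print(list(find_units(corpus, "modality", "deontic")))  # ['must']
```

The design choice here simply mirrors the chapter’s three descriptive levels: metadata on the corpus and its subsets, markers on text components, and codings on individual speech units.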
7 Conclusion
We have shown that discourse analysis is not a discipline but a research practice that is at the confluence of a set of disciplinary and national traditions. The rich heritage of disciplinary, theoretical, and methodological knowledge explains the privileged position of discourse analysis. The very purpose of discourse analysis predisposes it to stay at the frontier of different methodological approaches, which might be called mixed methods. We have shown that the paradigmatic oppositions between qualitative and quantitative approaches, although strongly advocated in the body of scientific literature, have become obsolete in the pragmatic use of mixed methods. We went beyond this pragmatic attitude to defend the thesis that there is indeed a common background in all methodologies, whatever their paradigmatic affiliation. We have shown that we cannot explain without interpreting at the same time, and that the very identification of research units and operations of description and analysis combines, at all times, explanation and interpretation. We further stated that scientific knowledge cannot proceed without applying some reduction procedures, but that the combination of these procedures can lead to a restoration of the complexity of the object. We ended by showing that the logic of causality and measurement, seemingly opposed to the qualitative paradigm, applies to both qualitative and quantitative approaches.
References
Adam, Jean-Michel. 1999. Linguistique textuelle. Des genres de discours aux textes.
Paris: Nathan.
Althusser, Louis. 1970. Idéologie et appareils idéologiques d’État. La Pensée 151
(juin).
Barthes, Roland. 1957. Mythologies. Paris: Seuil.
Conein, Bernard, Jean-Jacques Courtine, Françoise Gadet, Jean-Marie Marandin,
and Michel Pêcheux, eds. 1981. Matérialités discursives. Actes du colloque de
Nanterre (24–26 avril 1980). Lille: Presses universitaires de Lille.
Denzin, Norman K., and Yvonna S. Lincoln. 1994. Handbook of qualitative
research. London: Sage.
Derrida, Jacques. 1967. L’écriture et la différence. Paris: Seuil.
Duchastel, Jules, and Danielle Laberge. 1999a. Des interprétations locales aux interprétations globales: Combler le hiatus. In Sociologie et normativité scientifique, ed. Nicole Ramognino and Gilles Houle, 51–72. Toulouse: Presses Universitaires Du Mirail.
———. 1999b. La recherche comme espace de médiation interdisciplinaire.
Sociologie et Sociétés XXXI (1): 63–76.
———. 2011. La mesure comme représentation de l’objet. Analyse et interprétation. Sociologies (Avril). Accessed June 27, 2018. https://journals.openedition.org/sociologies/3435.
Fairclough, Norman. 2007. Discourse and social change. Cambridge: Polity.
Foucault, Michel. 1969. L’archéologie du savoir. Paris: Gallimard.
Gee, James Paul. 2011. An introduction to discourse analysis. Theory and method.
3rd ed. New York: Routledge.
Guba, Egon G., and Yvonna S. Lincoln. 1994. Competing paradigms in qualitative research. In Handbook of qualitative research, ed. Norman K. Denzin and Yvonna S. Lincoln, 105–117. London: Sage.
Hall, Stuart. 2009. Representation: Cultural representations and signifying practices.
London: Sage.
Haroche, Claudine, Paul Henry, and Michel Pêcheux. 1971. La Sémantique et
la coupure saussurienne: Langue, langage, discours. Langages 24: 93–106.
Harris, Zellig. 1952. Discourse analysis. Language 28 (1): 1–30.
Jakobson, Roman. 1963. Essais de linguistique générale. Paris: Minuit.
Jefferson, Gail. 2004. Glossary of transcript symbols. In Conversation analysis:
Studies from the first generation, ed. Gene H. Lerner, 13–31. Amsterdam: John
Benjamins Publications.
Beyond the Quantitative and Qualitative Cleavage: Confluence… 45
Kaplan, Abraham. 1964. The conduct of inquiry. Methodology for behavioral sci-
ence. New York: Chandler Publishing.
Kelle, Udo. 2001. Sociological explanations between micro and macro and the
integration of qualitative and quantitative methods. Forum Qualitative Social
Research 2(1). https://doi.org/10.17169/fqs-2.1.966. Accessed June 27,
2018.
Mackie, John L. 1974. The cement of the universe. A study of causation. Oxford:
Oxford University Press.
Mayaffre, Damon. 2007. Analyses logométriques et rhétoriques des discours. In
Introduction à la recherche en sic, ed. Stéphane Olivési, 153–180. Grenoble:
Presses Universitaires De Grenoble.
Meunier, Jean-Guy. 1993. Le traitement et l'analyse informatique des textes.
Revue de Liaison de la recherche en informatique cognitive des organisations
(ICO Québec) 6 (1–2): 19–41.
Molino, Jean. 1989. Interpréter. In L'interprétation des textes, ed. Claude
Reichler, 9–52. Paris: Editions De Minuit.
Osgood, Charles E. 1959. The representational model and relevant research
methods. In Trends in content analysis, ed. Ithiel de Sola Pool, 33–88. Urbana:
University of Illinois Press.
Paillé, Pierre, and Alex Mucchielli. 2008. L’analyse qualitative en sciences humaines
et sociales. Paris: Armand Colin.
Paveau, Marie-Anne. 2012. L’alternative quantitatif/qualitatif à l’épreuve des
univers discursifs numériques. In Colloque international et interdisciplinaire
Complémentarité des approches qualitatives et quantitatives dans l’analyse des
discours?, Amiens, France.
Pêcheux, Michel. 1975. Les vérités de la Palice, linguistique, sémantique, philoso-
phie. Paris: Maspero.
Pires, Alvaro P. 1982. La méthode qualitative en Amérique du Nord: un débat
manqué (1918–1960). Sociologie et sociétés 14 (1): 16–29.
Rastier, François. 2001. Arts et sciences du Texte. Paris: PUF.
Ricœur, Paul. 1981. Hermeneutics and the human sciences. Essays on language,
action and interpretation. Cambridge: Cambridge University Press.
———. 1986. Du texte à l’action. Paris: Seuil.
Tacq, Jacques. 2011. Causality in qualitative and quantitative research. Quality
and Quantity 45 (2): 263–291.
Zienkowski, Jan. 2012. Overcoming the post-structuralist methodological defi-
cit. Metapragmatic markers and interpretative logic in a critique of the bolo-
gna process. Pragmatics 22 (3): 501–534.
46 J. Duchastel and D. Laberge
References of Figure 2.1
Searle, John. 1970. Speech acts. An essay in the philosophy of language. Cambridge:
Cambridge University Press.
Stone, Philip J., Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie.
1966. The general inquirer. A computer approach to content analysis. Cambridge,
MA: MIT Press.
Part II
Analysing Institutional Contexts of
Discourses
3
The Academic Dispositif: Towards
a Context-Centred Discourse Analysis
Julian Hamann, Jens Maesse, Ronny Scholz,
and Johannes Angermuller
1 Introduction
In discourse, meanings are realised and established among members of a
social community. Discourse is a meaning-making practice, which oper-
ates with gestures, images, and, most importantly, with language. From a
discourse analytical point of view, texts and contexts, utterances and their
The authors thank Johannes Beetz, Sixian Hah, and one anonymous reviewer for their comments
on previous versions of this contribution. They are also very grateful to Marie Peres-Leblanc for
improving the design of the visualisations.
J. Hamann (*)
Leibniz Center for Science and Society, Leibniz University Hannover,
Hannover, Germany
e-mail: julian.hamann@lcss.uni-hannover.de
J. Maesse
Department of Sociology, Justus-Liebig University Giessen, Gießen, Germany
e-mail: Jens.Maesse@sowi.uni-giessen.de
R. Scholz • J. Angermuller
Centre for Applied Linguistics, University of Warwick, Coventry, UK
e-mail: r.scholz@warwick.ac.uk; J.Angermuller@warwick.ac.uk
1 The concept for the information system from which we draw our examples was developed within
the research project ‘Discursive Construction of Academic Excellence’, funded by the European
Research Council and led by Johannes Angermuller. We are grateful to the whole ERC DISCONEX
team for allowing us to present a part of their research ideas to which all four authors have contrib-
uted in various stages. For more information see: http://www.disconex.discourseanalysis.net.
The Academic Dispositif: Towards a Context-Centred Discourse… 53
The question of text and context has been the subject of a great deal of
controversy. In line with socially minded linguists, who have long insisted
on the systematic empirical observation of real linguistic and discursive
practices (Bhatia et al. 2008; Blommaert 2005; Sarangi and Coulthard
2000), we will make the case for a sociological take on social and historical contexts. We will take up and elaborate Foucault's concept of dispositif in order to grasp the social context as an institutional arrangement of linguistic and non-linguistic practices, rules, and structures in a larger
social community. While text and talk can be analysed with the classical
instruments of discourse analysis (from pragmatics to corpus analysis),
the dispositif is analysed with the help of sociological methods (such as
interviews, questionnaires, statistical analysis, ethnography).
With the concept of the dispositif, we make the case for sociological
perspectives on discursive practices as embedded in institutional power
arrangements (the notion of dispositif has been the object of debate in
France, where the term originated, and Germany: Angermüller 2010;
Angermuller and Philippe 2015; Bührmann and Schneider 2007, 2008;
Maesse and Hamann 2016; Maingueneau 1991; Spieß et al. 2012). The
dispositif approach encompasses power and social structures (Bourdieu
2010), the nexus of power and knowledge (Foucault 1972), as well as
institutionally organised processes of interpretation (Angermüller 2010).
It takes CDA perspectives further, in that it pleads for studying social
The discourse analytical process usually takes place at three different lev-
els of empirical investigation, as outlined in Table 3.1. The first level deals
with problems that are first and foremost linguistic in nature and located
on the text level of discourse. At this stage, qualitative and quantitative
methods are applied to analyse the formal rules that make linguistic forms
(spoken words, written texts, gestures, as well as pictures) readable in
respective social contexts. Thus, the analysis of argumentation, deixis,
categorisations, polyphony, co-occurrences, and so forth requires the
study of small utterances as well as large textual corpora. The particular
choice of method depends on the research question or on corpus charac-
teristics. Nonetheless, the linguistic level must not be confused with the
social contexts in which language appears in order to create meaning(s).
After the linguistic level, a sociological level emerges, which cannot be
studied with linguistic methods. At this point, the discourse analytical
process moves from the first, the linguistic, level, to the second level of
investigation: social context(s) (Table 3.1). This switch from one level to
another is required in qualitative as well as in quantifying approaches to
textual materials. Large data sets and corpora as well as small utterances
neither speak nor interpret themselves.
As is illustrated in Table 3.1, the linguistic and sociological levels of
discourse analysis are complemented by a third level: theoretical interpre-
tation. Taken by themselves, neither linguistic nor contextual data are
interpretations. Furthermore, the combination of discourse analytical
data with data from sociological analysis is not an automatic and natural
procedure either. Interpretations do not emerge from the data; they are
made by those who interpret them. This is where the significance of the
theoretical level of analysis comes into play. Researchers can mobilise
theories and paradigms for the interpretation of data and they can build
new theories and explanations on the basis of data interpretations led by
theories. Whereas positivistic approaches follow a data-theory determin-
ism, we suggest giving theory its place in research processes as a tool for
data interpretation and as a result of those processes. Theory is simultane-
ously the creative starting point and the result of every research process.
It helps to make sense of data. While the theoretical level will be addressed
in the fourth section of this contribution, let us briefly return to the con-
textual level.
The three levels of analysis can be observed in various types of dis-
course analysis. For example, if pragmatic discourse analysts ask how
utterances are contextualised through markers of polyphony (Angermuller
2015), their analysis does not stop at the textual level. An example from
economic expert discourses (Maesse 2015b) can show how an analysis of
meaning-making on the micro level (of utterances) can be combined
with the study of institutional contexts on the macro level. The following
exemplary statement was uttered by the economist Joseph Stiglitz.
refers the reader to the context in which it is uttered by the locutor (i.e.
Stiglitz). To make sense of this utterance, the reader will need an under-
standing of the locutor and their context. It will, therefore, be important
to know that Stiglitz has been awarded a Nobel Prize, is an established academic at several Ivy League universities, a popular commentator on economic policies, and a former chief economist of the World Bank.
Yet, knowledge about the social context is not only important for this
type of linguistic micro-analysis of utterances. As will become clear in the
following, more structural approaches, too, articulate linguistic and soci-
ological levels of analysis and integrate them into a theoretical
explanation.
Table 3.2 Four ideal typical dimensions of social context for the analysis of discourses

Social relations (example): academic community networks, teaching relations, organisational hierarchies between deans and professors or professors and PhD candidates, relations between politicians, media journalists, and academics

Institutions and organisations (example): universities, professorships, funding organisations, publishers, editorial boards, commissions, political parties, business firms, administrative offices, and bodies

Epistemic resources (example): rankings, tacit knowledge about certain institutions, scientific theories and methods, ideological knowledge, knowledge about political and economic organisations

Forms of social practice (example): reading, writing books/papers/articles, presenting papers at big conferences/informal circles, being involved in email communication, and so forth
population that have an impact on how certain academic ideas are discur-
sively constructed (Hamann 2014). Furthermore, it can be worth look-
ing at the institutional context and practices of text production that
reveal information on the influences on choices of topics, arguments, and
positions in, for example, election manifestos (Scholz 2010).
Data that can be analysed with quantitative methods offer one suitable route to assess the structures in which a discourse is embedded. Such an
approach enables us to analyse social phenomena that are spread over a
relatively large geographical space, such as national higher education sys-
tems, and which concern relatively large groups, like disciplines or a pop-
ulation of professors. Furthermore, we are able to trace developments of
social contexts that encompass discourses over relatively long periods of
time. We can account for a large quantity of data in order to get an over-
view of the research object analysed before we actually start interpreting
language-related phenomena in the discourse.
60 J. Hamann et al.
The dispositif concept reminds us that power is more than mere interpre-
tative efforts describing power as an “open, more-or-less coordinated […]
cluster of relations” (Foucault 1977, 199). It emphasises effects of closure
and sedimentation that are also part of the academic world. There are
many examples where meaning-making is domesticated and controlled in the academic world: think of the rhetoric of excellence and competition as well as discourses of inclusion and equality, or the pressures for external funding and on university admissions (cf. Zippel 2017; Münch 2014;
Friedman et al. 2015; Kennelly et al. 1999).
To map the contexts of discursive practices, our first step is to assess the
relevant actors within and outside academia. The social contexts of aca-
demic discourses consist of, for example, researchers, institutions like
universities, publishers, and funding agencies, as well as disciplines,
research groups, and networks. In a broader sense, these are all actors
that, in one way or another, participate as social entities in an academic
discourse, no matter whether they are individual or collective, human or
non-human. Hence, the first step in our analysis is to identify the dis-
course actors that are relevant for the discourse we want to study. This can
be done via a systematic investigation of the institutional structures of
our research object. In our case we catalogued all national higher educa-
tion and research institutions that are relevant to the academic discourse
in a particular research field together with the full professorships in each
institution. In addition, we also tried to account for higher education
policies affecting the classification of universities. In the UK case, classifi-
catory instances that are part of the higher education dispositif include,
for example, such groups as the Russell Group and European Research
Universities, and also governance instruments like the Research Excellence
Framework, with its highly influential focus on research excellence and
societal impact (Hamann 2016a). The importance of these classificatory
instances notwithstanding, our approach in the project is more focused
on the individuals in higher education institutions, the way they position
themselves, are positioned by others and the career trajectories they have
followed.
There are numerous other methods to identify the actors or partici-
pants of a particular discourse. For academic discourse, citation analysis
has become the preferred approach in order to map the structures of sci-
entific communities (e.g. Estabrooks et al. 2004; Xu and Boeing 2013),
and concept-mapping has been applied to identify how different actors in
a cross-disciplinary context conceptualise their research area (Falk-
Krzesinski et al. 2011). Below, we will illustrate how a mapping of the
positions of discourse participants can be produced with correspondence
pitfalls that must be tackled. For instance, universities, degrees, and even
countries change names. In order to ensure the continuity of the reference
labels in the information system, old and new names have to be linked to
each other. This is by no means an innocent step because we are interven-
ing in the game of names and their meaning that is at the heart of ques-
tions on discursive construction, which we are actually interested in.
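The linking of old and new names can be kept explicit in the information system. The sketch below is a minimal illustration of the idea, not the project's actual data model; all institution names and reference labels are invented.

```python
# Minimal sketch of linking name variants to a stable reference label.
# All institution names and IDs below are invented for illustration.
name_variants = {
    "inst-001": ["Polytechnic of X", "University of X"],  # renamed institution
    "inst-002": ["University of Y"],
}

# Invert the mapping into a lookup table, so records catalogued under
# either the old or the new name resolve to the same reference label.
lookup = {variant: ref for ref, variants in name_variants.items() for variant in variants}

print(lookup["Polytechnic of X"] == lookup["University of X"])  # continuity preserved
```

Keeping the variant list explicit, rather than silently overwriting old names, documents the interpretive decision that two names denote 'the same' institution.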
The quantitative analysis of these data aims to reveal aspects of the
social structure and dynamics of research fields. Why would discourse
analysts be interested in such questions? The answers help to understand
the social and institutional context to which discourse participants must
respond implicitly or explicitly if they want to produce a meaningful
statement in a particular discourse. We assume that discourse practices
relate—in one way or another—to these context conditions. Thus, a dis-
course analysis that integrates context data can enrich the interpretation
of textual data with information about institutional and societal condi-
tions that are usually not obvious merely by studying text data.
What we propose here is a first step towards the systematic acquisition
and study of such data whose results would still have to be articulated
with the (quantitative and qualitative) text analytical methods that are
widely used in discourse analysis. In terms of a context-centred perspec-
tive, this will help to better understand why a certain type of research,
colluding with particular statements and narratives, occurs in particular
institutions and locations. In this sense, we could integrate into the anal-
ysis of academic discourses, for example, the impact of research and fund-
ing policies on the research landscape, and the topography of a field of
research.
In order to conduct integrative studies, we propose to collect ‘hard data’
(a) on institutions, (b) on individuals and their career trajectories in those
institutions, and (c) on the professional networks of these individuals.
similar research interests and keywords close to each other, whereas dif-
ferences in terms of research interests are represented by greater distance
on the map.
For this study, we compiled a trial corpus with texts of research interests
and keywords that full professors at 76 UK sociology departments pres-
ent on their institutional webpages. There are more than 90 sociology
departments in the UK, but not all of them have full professors on their
staff. We consider full professors to be preeminent stakeholders in aca-
demic discourse and therefore the analysis of their data is the starting
point of our study. The corpus was partitioned in such a way that we
could compare research interests on the institutional, disciplinary, and
national levels. With a size of 11,980 tokens, our corpus of UK sociolo-
gists is quite small. However, it is big enough to present the method and
its potential for future research. The corpus has not been lemmatised and
also includes all grammatical words. Our choice is based on the assumption that different grammatical forms of content words, as well as grammatical words themselves, have a particular influence on the construction of meaning, which we want to account for in our analysis.
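A corpus prepared along these lines, partitioned by department and tokenised without lemmatisation or stop-word removal, might be sketched as follows; the department names and texts are invented placeholders, not the actual webpage data.

```python
from collections import Counter

# Invented placeholder texts standing in for the 'research interests'
# sections of departmental webpages; one partition per department.
corpus = {
    "Dept_A": "discourse analysis of media and the politics of inequality",
    "Dept_B": "the sociology of gender and of migration in the media",
}

def tokenise(text):
    # No lemmatisation and no stop-word list: inflected forms and
    # grammatical words ('the', 'of', ...) are deliberately kept.
    return text.lower().split()

# Word-frequency profile per partition: the input a correspondence
# analysis would compare across institutions.
freq_by_partition = {dept: Counter(tokenise(text)) for dept, text in corpus.items()}
print(freq_by_partition["Dept_A"]["of"])  # grammatical words are counted too
```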
We analyse our data set with correspondence analysis. This is a statisti-
cal method to simplify complex multivariate data by grouping entities
under investigation according to corresponding features. In classical
empirical social research, the method has been used to group actors
according to similar occupations and dispositions (Bourdieu 2010); in discourse analysis, the approach has been used to group speakers according to corresponding features in their language use. In this sense, it is a
powerful method to discover similar language use of different speakers in
particular time periods by taking into account the complete vocabulary
of a given corpus and comparing it according to the partitions introduced. In this study, we took the partition 'institution' and contrasted it with the distribution of all word tokens used to express research interests on the website of a given department of sociology in the UK. To the
extent that the method takes into account the entire vocabulary of a
close to one another, whereas those with very different distribution pro-
files are more distant from one another. The axes are situated alongside
the highest concentrations of similar characteristics (here similar frequen-
cies of the same words in different institutions). Deciphering the mean-
ing of the axes is part of the interpretation process, which often needs
further analytical steps using different methods.
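The mechanics of such a correspondence analysis can be illustrated with a toy contingency table. In the sketch below, the institutions, words, and counts are all invented, and the computation (a singular value decomposition of the standardised residuals) is one textbook way of obtaining principal coordinates, not the specific software pipeline used in this study.

```python
import numpy as np

# Toy word-frequency table: rows = institutions (corpus partitions),
# columns = word tokens. All names and counts are invented.
institutions = ["Inst_A", "Inst_B", "Inst_C", "Inst_D"]
words = ["gender", "migration", "theory", "media", "inequality"]
N = np.array([
    [12,  2,  5,  1,  9],
    [10,  3,  4,  2,  8],
    [ 1,  9,  2, 11,  2],
    [ 2, 10,  3,  9,  1],
], dtype=float)

# Correspondence analysis via SVD of the standardised residuals.
P = N / N.sum()                      # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the rows (institutions) on the first axes;
# nearby points share similar word-frequency profiles.
row_coords = (U * sv) / np.sqrt(r)[:, None]
for name, (x, y) in zip(institutions, row_coords[:, :2]):
    print(f"{name}: axis1={x:+.3f}, axis2={y:+.3f}")
```

Institutions with similar word-frequency profiles receive nearby coordinates on the first axes, which is exactly the property exploited when reading the map of departments.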
To demonstrate how the visual can be interpreted, we have chosen to
make it more readable by removing certain word tokens from the repre-
sentation. In Fig. 3.3, we have kept those content words that we under-
stand have particular importance for occupying a position in the research
field of sociology. This is simply for the sake of demonstrating the method.
A more systematic analysis would have to be more explicit about the
words that have been removed from the visualisation (but not from the
analysis). Moreover, the data set should be completed with more and
longer texts. However, regardless of these limitations, we think that the
potential of the approach will become obvious.
When looking at a location in the system of coordinates, we must
consider that both axes represent dominant information concerning a
certain variable. Finding the answer to the question as to which variable
that might be is part of the researcher’s interpretation. In the case of tex-
tual data, the concept of ‘variable’ would have to be understood as the
semantic realm triggered by the referential meanings of certain words.
However, the interpretation of a correspondence analysis based on tex-
tual data is never straightforward because it is based on lexical items, not
semantic units. The problem is that the same semantic can be expressed
The advantage of using this somewhat imperfect corpus is that the cor-
pus parts are of a size that we can manage to read. We gain a better under-
standing of the method by simply reading the closest and the most distant
texts (Bangor, Bath, Belfast, and Aberdeen versus Glasgow and Bristol
[UWE]). The disadvantage of the relatively small corpus size is that changes
in the visual might be quite substantial if we add or remove texts of
researchers from these institutions. At any rate, we do not claim that these
visuals represent a positivistic depiction of ‘the reality’. Rather, through the
prism of correspondence analysis, we get a vague idea about hidden rela-
tions that are not visible in academic texts, the aim being to find out about
relations that could be relevant either on other levels, to be explored in subsequent investigations with other variables, or on the discursive level itself.
Supposing that we somehow had 'complete' data, we could relate these
results to studies that map, for instance, the level of access of UK institu-
tions to research funding in general, or to funding for research on par-
ticular topics. This would allow us to cluster institutions with a similar
level of access to research funding, and subsequently analyse to what
extent these clusters match the maps that we produce based on research
interests. We could also include other data, for example, on the perma-
nence and duration of positions across institutions, disciplines, and
countries, in order to investigate the impact of such variables on aca-
demic discourse in the short and long terms.
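How such a comparison between funding-based clusters and map-based groupings could proceed can be sketched in a few lines; the institutions, cluster labels, and groupings below are entirely hypothetical.

```python
from collections import Counter

# Hypothetical partitions: one from funding data, one read off a
# correspondence-analysis map. All labels are invented.
funding_cluster = {"Inst_A": "high", "Inst_B": "high", "Inst_C": "low", "Inst_D": "low"}
map_group = {"Inst_A": "core", "Inst_B": "core", "Inst_C": "periphery", "Inst_D": "periphery"}

# Cross-tabulate the two partitions to see how far they coincide.
crosstab = Counter((funding_cluster[i], map_group[i]) for i in funding_cluster)

# Share of institutions on the diagonal pairing (high<->core, low<->periphery).
agreement = sum(
    n for (f, m), n in crosstab.items() if (f == "high") == (m == "core")
) / len(funding_cluster)
print(crosstab)
print(agreement)  # 1.0 when the two partitions coincide exactly
```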
Given adequate data, the analysis of social contexts becomes a power-
ful supplement to discourse analytical approaches. This section has dem-
onstrated an exemplary starting point for such an undertaking. The
remaining question is connected to the third level of analysis (Table 3.1),
the theoretical interpretation of linguistic and social context data. In the
following section, we will suggest a theoretical framework that integrates
the linguistic and sociological dimensions of discourse analysis.
nor the collection of sociological data can account for the social organisa-
tion of academia.
In this section, we propose a dispositif theoretical approach in order to
go beyond the opposition of micro and macro social structure and discur-
sive practice. The dispositif analysis we propose would read statistical data
pragmatically and reflexively. We take them as a starting point for further
investigations into an empirical object that resists simplification. We
point to three aspects of academia as a dispositif: (a) we emphasise a
rather structuralist notion of power that yields effects of closure and sedi-
mentation in academia, (b) we emphasise that academic contexts are
complex and heterogeneous arenas that overlap with other arenas, and (c)
we emphasise that discourses play an important role because they give
social actors the opportunity to act in an open field, as well as to enable
discursive circulation through many fields between academia and society.
As highlighted by the following sections, all three aspects are addressed by
the dispositif concept.
Let us illustrate the heuristic potential of a dispositif theory that guides
the analysis of academic texts and contexts. Coming back to our empiri-
cal example of full professors in sociology in the UK (cf. Sect. 5), the
three aspects of our dispositif approach generate the following analytical
perspectives: First, we have argued for a rather structuralist notion of
power that emphasises the effects of closure and sedimentation (cf. Sect.
3.1.1). What (more) can we take from Fig. 3.3 if we follow this argu-
ment? The specific distribution of sociology departments in terms of the
research interests of their professors might tentatively be interpreted in
terms of a centre and a periphery. Departments at the centre of the field,
including many London-, Oxford-, and Cambridge-based institutions,
and Warwick, could represent a thematically coherent core of ‘top’ depart-
ments. Anchoring this assumption with additional data would enable us
to test whether these departments are also ‘competitive’ in terms of fund-
ing. Professors at departments on the periphery appear to be pursuing
alternative research strategies that do not represent the ‘core interests’ of
the field.
Second, the dispositif theoretical framework introduces fields as a
main object of investigation, thus allowing for a systematic account of
different contexts that overlap with each other (cf. Sect. 3.1.2). Following
7 Conclusion
We have highlighted some shortcomings of text-centred conceptualisa-
tions of context and pointed out the necessity for a more systematic inte-
gration of social contexts and a theory-based interpretation of discourses.
References
Angermüller, Johannes. 2004. Institutionelle Kontexte geisteswissenschaftlicher
Theorieproduktion: Frankreich und USA im Vergleich. In
Wissenschaftskulturen, Experimentalkulturen, Gelehrtenkulturen, ed. Markus
Arnold and Gert Dressel, 69–85. Wien: Turia & Kant.
Hamann, Julian. 2014. Die Bildung der Geisteswissenschaften. Zur Genese einer
sozialen Konstruktion zwischen Diskurs und Feld. Konstanz: UVK.
———. 2016a. The visible hand of research performance assessment. Higher
Education 72 (6): 761–779.
———. 2016b. ‘Let us salute one of our kind’. How academic obituaries con-
secrate research biographies. Poetics 56: 1–14.
Husson, François, and Julie Josse. 2014. Multiple correspondence analysis. In
Visualization and verbalisation of data, ed. Jörg Blasius and Michael Greenacre,
165–183. London and New York: CRC.
Kennelly, Ivy, Joya Misra, and Marina Karides. 1999. The historical context of
gender, race, & class in the academic labor market. Race, Gender & Class 6
(3): 125–155.
Kleining, Gerhard. 1994. Qualitativ-heuristische Sozialforschung. Schriften zur
Theorie und Praxis. Hamburg-Harvestehude: Fechner.
Knorr Cetina, Karin. 1981. The manufacture of knowledge. An essay on the con-
structivist and contextual nature of science. Oxford: Pergamon.
Lamont, Michèle. 1987. How to become a dominant French philosopher: The
case of Jacques Derrida. The American Journal of Sociology 93 (3): 584–622.
Lebart, Ludovic, and Gilbert Saporta. 2014. Historical elements of correspon-
dence analysis and multiple correspondence analysis. In Visualization and
verbalisation of data, ed. Jörg Blasius and Michael Greenacre, 31–44. London
and New York: CRC.
Maesse, Jens. 2010. Die vielen Stimmen des Bologna-Prozesses. Bielefeld:
Transcript.
———. 2015a. Eliteökonomen. Wissenschaft im Wandel der Gesellschaft.
Wiesbaden: VS.
———. 2015b. Economic experts. A discursive political economy of econom-
ics. Journal of Multicultural Discourses 10 (3): 279–305.
Maesse, Jens, and Julian Hamann. 2016. Die Universität als Dispositiv. Die
gesellschaftstheoretische Einbettung von Bildung und Wissenschaft aus dis-
kurstheoretischer Perspektive. Zeitschrift für Diskursforschung 4 (1): 29–50.
Maingueneau, Dominique. 1991. L’Analyse du discours. Introduction aux lectures
de l’archive. Paris: Hachette.
Morris, Norma, and Arie Rip. 2006. Scientists’ coping strategies in an evolving
research system: The case of life scientists in the UK. Science and Public Policy
33 (4): 253–263.
Münch, Richard. 2014. Academic capitalism. Universities in the global struggle for
excellence. New York: Routledge.
Paradeise, Catherine, Emanuela Reale, Ivar Bleiklie, and Ewan Ferlie, eds. 2009.
University governance. Western European Perspectives. Dordrecht: Springer.
Raffnsøe, Sverre, Marius Gudmand-Høyer, and Morten S. Thaning. 2016.
Foucault’s dispositive: The perspicacity of dispositive analytics in organiza-
tional research. Organization 23 (2): 272–298.
Salem, André. 1982. Analyse factorielle et lexicométrie. Mots – Les langages du
politique 4 (1): 147–168.
Sarangi, Srikant, and Malcolm Coulthard, eds. 2000. Discourse and social life.
Harlow: Longman.
Scholz, Ronny. 2010. Die diskursive Legitimation der Europäischen Union.
Eine lexikometrische Analyse zur Verwendung des sprachlichen Zeichens
Europa/Europe in deutschen, französischen und britischen Wahlprogrammen
zu den Europawahlen zwischen 1979 und 2004. Dr. phil., Magdeburg und
Paris-Est, Institut für Soziologie; École Doctorale “Cultures et Sociétés”.
Spieß, Constanze, Łukasz Kumięga, and Philipp Dreesen, eds. 2012.
Mediendiskursanalyse. Diskurse – Dispositive – Medien – Macht. Wiesbaden:
VS.
Stiglitz, Joseph E. 1984. Price rigidities and market structure. The American
Economic Review 74 (2): 350–355.
Tognini-Bonelli, Elena. 2001. Corpus linguistics at work. Amsterdam: Benjamins.
Whitley, Richard. 1984. The intellectual and social organization of the sciences.
Oxford: Clarendon Press.
Xu, Yaoyang, and Wiebke J. Boeing. 2013. Mapping biofuel field: A bibliomet-
ric evaluation of research output (review). Renewable and Sustainable Energy
Reviews 28 (Dec.): 82–91.
Zippel, Kathrin. 2017. Women in Global Science: Advancing Careers Through
International Collaboration. Stanford: Stanford University Press.
4
On the Social Uses of Scientometrics:
The Quantification of Academic
Evaluation and the Rise of Numerocracy
in Higher Education
Johannes Angermuller and Thed van Leeuwen
1 Introduction
Corpus approaches have a long tradition. They have recourse to computer-
aided tools which reveal patterns, structures, and changes of language use
that would go unnoticed if one had to go through large text collections
‘manually’. If such research is known for rigorous, replicable, and ‘ratio-
nal’ ways of producing scientific claims, one cannot understand its suc-
cess without accounting for the role of non-academic actors.
Scientometrics, also known as ‘bibliometrics’, is a type of corpus
research which measures the scientific output of academic researchers and
represents citation patterns in scientific communities. Scientometrics is a
J. Angermuller (*)
Centre for Applied Linguistics, University of Warwick, Coventry, UK
e-mail: J.Angermuller@warwick.ac.uk
T. van Leeuwen
Centre for Science and Technology Studies (CWTS), Leiden University,
Leiden, The Netherlands
e-mail: leeuwen@cwts.leidenuniv.nl
who were exempt from the constraints of the system they helped put in
place), the production of knowledge is now fully integrated into the gov-
ernmentality. The specialized knowledge and the administrative expertise
of the agents of governmentality now become the object of numerocratic
innovation. The large and growing social arena of educationalists and
researchers is now subsumed under numerocracy. Neoliberalism, in other
words, heralds the governmentalization of education and higher
education.
We can now discuss scientometrics as a field which emerges in a situation of changing historical circumstances. Scientometrics
testifies to the growing importance of numerocratic practices for the con-
stitution of social order in Western societies since the eighteenth century
in general and the advent of these practices in the higher education sector
since the post-war era more particularly. Yet while one can observe a
growing demand for scientometric knowledge, scientometrics has always
had to grapple with a tension between applied and more academic
research orientations. Is scientometrics subordinate to normative political
goals or does it engage in fundamental social research in order to reveal
how research and researchers work? Also, specialized researchers in scien-
tometrics cannot but acknowledge the explosive growth of scientometric
data produced by corporate actors such as Thomson Reuters, Google, or
Elsevier. Therefore, it remains an open question how the increasing
amount of scientometric data which is now circulating in the academic
world impacts on knowledge production and decision-making in aca-
demia. To what degree is scientometric knowledge linked with practices
of governing academics? To obtain responses to these questions, we will
have a closer look at the directions that scientometrics has taken as a field.
3 The Emergence of Scientometrics as a Field
Scientometrics (or bibliometrics) as a social science is a relatively young
field. Its origins reach back to the second half of the twentieth century
(Hood and Wilson 2001). It comprises quantifying methods from social
research and uses numbers to account for structures and changes of many
degree that the field is part and parcel of numerocratic practices, one can
understand that there is a tendency in scientometrics to posit investment in numerocratic governance as a norm for other researchers
(Burrows 2012; Radder 2010). In the late 1980s, scientometrics became autonomous from the larger STS field and developed its own professional organizations (such as ISSI and, somewhat later, ENID) with separate scientific conferences (the ISSI cycle, next to the S&T cycle) and
dedicated journals (such as Scientometrics, JASIST, Research Policy to
name a few).
The academic field of scientometrics has broken off from the more
qualitative and theoretical strands in STS. It has always put a strong emphasis on data-driven empirical research and
focused on electronic data of various types. Just like statistics in the nine-
teenth century, scientometrics testifies to the numerocratization of the
social. To some degree, statistics is subservient to civil servants, techno-
crats and administrators who carry out censuses, create standards in cer-
tain areas and devise regulative frameworks of action. Scientometrics,
too, is an academic field which is tied to the rise of ‘numerocratic’ tech-
niques of exercising power, which aim to govern large populations
through numbers, standards, benchmarks, indices, and scales.
All of this did not happen in a sociopolitical vacuum. After the eco-
nomic crises in the 1970s and 1980s (Mandel 1978), ending a long phase
of prosperity, the political climate in Europe and the USA changed and
neoliberal approaches prevailed. With Reagan and Thatcher in charge in
the USA and the UK, economic policies implemented austerity pro-
grammes, which meant budget cuts in various sectors of society, including
higher education. The political ideology of neoliberalism was practically
supported by New Public Management (NPM). NPM proposes to orga-
nize the public sector according to management techniques from the cor-
porate sector (Dunleavy and Hood 1994). However, science has been
mainly evaluated through peer review and for a long time quantitative
information on higher education was limited. In the USA until the 1980s,
for example, the reports by the National Science Foundation were the
only source of quantitative data on the higher education system as a whole.
Yet, in a period of austerity, more justification was needed for spending
taxpayers’ money and policy-makers were less willing to distribute
resources on the basis of the internal criteria of the sector. Peer review–
based evaluation gives little control to policy-makers and governing agen-
cies over how money is spent. Therefore, one can observe a growing
demand for simpler mechanisms of evaluation and decision-making.
Examples of this type of simplistic indicator used for evaluation and decision-making will be introduced in Sect. 4.
Scientometric measures first appeared in science policy documents in
the 1970s, when the US National Science Foundation integrated research
metrics in its annual national science monitor, which gives an account of
the US science system. Scientometric results were then used to describe
research activity on a macro level in the USA and other countries. It took
another 20 years before large-scale scientometric reports of national sci-
ence systems were produced in Europe. It was in the Netherlands in the
early 1990s that the first national science and technology monitor
appeared (NOWT, reports covering the period 1994–2014). The last
series of these reports were produced in 2014 (WTI 2012, 2014). In this
series of reports from the Netherlands, the national Dutch science sys-
tem was compared internationally with other EU countries. Indicators
have been devised with the aim to represent technological performance,
based on the number of patents or revenue streams. These reports also
contained information on the sector as a whole (e.g., the relationship
between the public and private sector, other public institutions, hospi-
tals, etc.) and on the institutional level (e.g., comparisons between uni-
versities). These analyses typically broke down the research landscape
into disciplinary fields and domains with various levels (countries, sec-
tors, institutions).
In France, the government also initiated a series of national science
monitor reports, produced by the Observatoire des Sciences et Technologies
(OST, reports covering the period 1992–current). In France, the reports
contained an internationally comparative part, as well as a national
regional part. These reports are produced by a national institution
financed by the government while in the Netherlands the reports were
produced by a virtual institution with various participating organizations
from inside and outside the academic sector. Various countries now organize such indicator reports, combining metrics on the science system
with indicators on the economic system, innovation statistics, and so on.
In the USA, such monitoring work has often been done by the disciplin-
ary associations, whereas in Germany research institutes in higher educa-
tion studies have played an important role in producing numerocratic
knowledge.
The first country in Europe to start developing a national system of
research evaluation was the UK, which launched its first research assess-
ment exercise in 1986. From this first initiative in the mid-1980s, the UK
has seen periodic assessments of its research system (Moed 2008), with
varying criteria playing a role in the assessment. This has been accompanied by a continuous discussion of the role of peer review in the assessment; in particular, peer review was weighed against approaches based solely on metrics
(see, for example, the 2007 CWTS report and The Metric Tide report of
2015). The UK research assessment procedure evaluates the whole
national system at one single moment by cutting the scientific landscape
into units of assessment. An important element in the UK research assess-
ment procedure is that it links the outcomes of assessment to research
funding. Overall, one can conclude that the UK research assessment exercises tend to be a heavy burden for the entire national science system. By
organizing this as one national exercise, every university is obliged to
deliver information on all the research at one single moment, across a
large number of research fields. Many senior UK scholars get involved in
the assessment of peers. Other countries in Europe also initiated national
assessment systems. Finland, for example, was one of the countries initi-
ating such a system shortly after the UK. Outside Europe, Australia has
implemented a system which follows the UK model in many respects.
The Netherlands initiated their research assessment procedure in the
early 1990s, which is still in place, albeit with a changed design. In the
Netherlands, the periodical assessment of research was institutionalized
from the early 1990s onwards. Until 2003, assessment was organized
under the supervision of the VSNU, the association of universities in
the Netherlands (Vereniging van Samenwerkende Nederlandse
Universiteiten). Their so-called chambers, consisting of representatives
from the disciplines, decided how to design the research assessment process, which included the question of whether research metrics were appropriate for research evaluation in the respective fields. In a number of fields, data from advanced scientometric analyses have been
applied to complement peer review in research assessment procedures
(e.g., biology, chemistry, physics, and psychology). After 2003, the ini-
tiative to organize research assessment was put in the hands of the uni-
versity boards, which meant that it was no longer a national matter. The
Royal Academy of Arts and Sciences in the Netherlands has carried out
studies that have influenced the recent revision of the standard evalua-
tion protocol. The focus is now no longer only on research output and
its impact, but it also considers, like in the UK, the societal relevance of
academic work (‘impact’). It remains to be seen to what extent the sys-
tem can still rely on peer review and what will be the role of scientomet-
ric measures in the future.
While some other countries are building up national frameworks of
assessing research quality, national evaluation schemes are still an excep-
tion (Table 4.1). One can see few such tendencies in federal states like the USA, where evaluation tends to be done on an institutional level, or
Germany, where there are no plans for another national evaluation
scheme after a ‘pilot study’ of the Wissenschaftsrat evaluated the research
excellence of two disciplines (chemistry and sociology) in 2007. The
French AERES carries out a ‘light-touch’ evaluation compared with the
UK. AERES regularly visits French research groups (laboratoires) and
only examines publication records and not the publications themselves.
Such national evaluation schemes can be highly consequential. The
measurement of research performance can be done in a very explicit man-
ner, as is the case in the UK, where the outcomes of the most recent research assessment exercise have direct consequences for money flows between and within universities, or Italy, where a model similar to that applied in the UK has been adopted in recent years. Here research performance is linked to a nationwide research assessment practice, with
intervals of five to eight years between the exercises. In other countries,
research assessment has crept in more silently as some research funding is
distributed on an annual basis, based upon the previous year’s output numbers, whereby publishing in internationally renowned outlets (journals and book publishing houses) yields higher rewards than publishing in locally oriented journals and/or book publishing houses (Flanders,
Norway, and Denmark). In France, negative assessments can lead to
research clusters being closed down, whereas the German evaluation of
2007 has shown no immediate effects and little lasting influence. The
Table 4.1 National evaluation schemes in some Western higher education systems

UK (REF 2013)
Unit of evaluation: researchers, departments, and institutions
Evaluation of publications through: peer review; scientometrics is not used
Allocation of funding: yes
Effects on academics: significant impact on academic recruitment in many fields
Production of statistics and indicators: performance statistics of institutions and the whole sector are produced

France (AERES 2007)
Unit of evaluation: research groups (laboratoires)
Evaluation of publications through: no peer review; scientometrics is officially not used
Allocation of funding: no direct funding effects, but the future of groups can be officially questioned
Effects on academics: academics may be counted as ‘non-publishing’ but their job security is not at stake
Production of statistics and indicators: ad hoc statistics are produced on the institutional level

Germany (Wissenschaftsrat)
Unit of evaluation: subdepartmental units
Evaluation of publications through: peer review
Allocation of funding: no direct effects; scientometrics is not used
Effects on academics: no effects known
Production of statistics and indicators: some statistics are produced

Netherlands (VSNU Protocol, currently the SEP, Standard Evaluation Protocol)
Unit of evaluation: initially departments and groups, now only institutes/departments
Evaluation of publications through: peer review; advanced scientometrics only when it fits the field
Allocation of funding: no direct linking of research assessment and research funding
Effects on academics: only indirect and implicit effects on internal university policies
Production of statistics and indicators: on the national level monitoring of the whole system, on the institutional level periodic assessment of research within institutions

USA (no systematic national assessment protocol implemented)
Unit of evaluation: NA
Evaluation of publications through: NA
Allocation of funding: NA
Effects on academics: NA
Production of statistics and indicators: NA
Journal Impact Factor (JIF) was first described in print in 1955 and 1963 (Garfield 1955; Garfield and Sher 1963). Journal citation statistics were
included in the Journal Citation Reports (JCR), the annual summarizing
volumes to the printed editions of the SCI and the SSCI. In the growing
higher education sector, in which more and more journals appeared on
the scene, the JIF became a tool used by librarians for managing their
collections. When the JCR started to appear on electronic media, first on
CD-ROM, and later through the Internet, JIF was more frequently used
for other purposes, such as assessments of researchers and units, which
was always sharply criticized by Garfield himself (Garfield 1972, 2006).
The JIF has been included in the JCR from 1975 onwards, initially only
for the SCI, later also for the SSCI. For the AHCI, no JIFs were pro-
duced. From the definition of the JIF, it becomes apparent that JIF is a
relatively simple measure, is easily available through the JCR, and relates
to scientific journals, which are the main channel for scientific commu-
nication in the natural sciences, biomedicine and parts of the social sci-
ences (e.g., in psychology, economics, business and management) and
humanities (e.g., in linguistics).
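The simplicity of the measure can be sketched in Python. The following is an illustrative computation of the conventional two-year impact factor (citations received in a given year to a journal's items from the two preceding years, divided by the citable items published in those years); it is not the JCR's actual procedure, and the numbers below are hypothetical:

```python
def two_year_jif(citations_received, citable_items, year):
    """Conventional two-year impact factor for `year`.

    citations_received: dict mapping publication year -> citations
        received in `year` to the journal's items from that year.
    citable_items: dict mapping publication year -> number of
        citable items (articles, reviews) published that year.
    """
    cites = citations_received.get(year - 1, 0) + citations_received.get(year - 2, 0)
    items = citable_items.get(year - 1, 0) + citable_items.get(year - 2, 0)
    return cites / items if items else 0.0

# A hypothetical journal: 50 + 40 citable items in the two prior
# years, which collected 100 + 80 citations in the census year.
jif = two_year_jif({2016: 100, 2015: 80}, {2016: 50, 2015: 40}, 2017)
print(jif)  # → 2.0
```

The division of a citation count by an item count is all there is to the indicator, which is precisely why it travels so easily beyond the library context for which it was designed.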
While the ISI indices cover a number of features in journal articles,
they focus mostly on the references cited by the authors. These references
are taken to express proximity and influence between citing and cited
people. On the receiving end, the question is to whom they relate. Here
references are considered as citations. Citation theory argues that value is
added when references become socially recognized citations, objectifying
as it were the researchers’ social capital (see Wouters 1999; Angermuller
2013b; Bourdieu 1992). The value that members of scientific communi-
ties add to works can be objectified through normalized measures. Thus,
by processing and counting references the scientometrician does not only
reflect a given distribution of academic value but she or he also adds value
to references. Other types of information used in the analysis relate to
the authors and co-authors, their institutions and cooperations, the jour-
nals in which their papers come out, their publishing houses, information
on the moment of publishing, information on the language of communi-
cation in the publication, meta-information on the contents, such as key-
words, but also words in titles and abstracts, as well as information on the
classification of the publications in various disciplinary areas.
The fact that these indices mostly focus on journal publications has
been widely criticized. It has been argued, for instance, that certain disci-
plinary areas have been put in a disadvantageous position (e.g., history,
where monographs are more important, or new disciplinary and transdis-
ciplinary fields which don’t have established journals yet). Moreover, it
needs to be recalled that the three indexes tend to over-represent research
in English as the main language of international scientific communica-
tion (van Leeuwen 2013; van Leeuwen et al. 2001). Areas in which
English-language journals are not standard outlets for research tend to
become peripheral in citation indexes. As a result, Western (i.e., US and
Western European) journals in the natural, life, and biomedical sciences
had long been given a certain prominence, which has been called into
question only after the rise of the BRIC countries (Brazil, Russia, India
and in particular of China). Other geographical regions of the world are
now better represented in the ISI, although the bias towards English has
remained intact as many of these countries follow models of the English-
speaking world.
The rise of these indices took place when some scientometric indica-
tors such as the JIF and the h-index started to be used in evaluation
practices throughout the science system. The JIF was originally designed
by Garfield for librarians to manage their journals in their overall library
collection and for individual researchers in the natural sciences to help
them decide on the best publication strategies (Garfield 1955, 1972,
2006; Garfield and Sher 1963). The JIF has been used in a variety of
contexts, for example, by managers who evaluate whole universities
(often with a more formal registration of research outputs of the scholarly
community) and by individual scholars to ‘enrich’ their publication lists
while applying for research grants and for individual promotion or job
applications (Jiménez-Contreras et al. 2002). Yet indicators such as the
JIF are not neutral as they can bring forth the realities they represent
through the numbers.
There are no indicators that have been universally accepted to repre-
sent quality of research or the performance of researchers. As they have
been the object of controversial debate, a number of serious flaws with
the JIF have been pointed out which disqualify the indicator for any use
in science management, let alone for evaluation purposes. Thus, it is calculated in ways that overrate the JIF values of about 40% of all journals
(Moed and van Leeuwen 1995, 1996). Another issue is that JIF values do
not take into consideration the way the journal is set up. Journals that contain many review articles, for example, tend to get cited more frequently, since review articles attract more citations than normal research articles. Therefore, review journals always end up on top of the ranking lists. A third issue relates to the
fact that JIF values do not take into consideration the field in which the
journal is positioned. Reference cultures differ, as do the number of jour-
nals per field. This means that fields with a strong focus on journal pub-
lishing, and long reference lists, have much higher JIF values as compared
to fields where citations are not given so generously. A fourth reason
relates to the fact that citation distributions are, like income distribu-
tions, skewed by nature. This means that the JIF value of a journal only reflects the value of a few much-cited articles in the journal while most
have lower impacts. This creates a huge inflation in science, given the
practice mentioned above, in which scholars tend to enrich their publica-
tion lists with JIF values, which say nothing about the citation impact of
their own articles. Moreover, JIF values tend to stimulate one-indicator
thinking and to ignore other scholarly virtues, such as the quality of
teaching, the capability to ‘earn’ money for the unit, the overall readiness
to share and cooperate in the community.
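The skewness argument is easy to make concrete: the JIF is a mean, and with skewed citation counts the mean sits far above the impact of the typical article. The counts below are hypothetical, chosen only to illustrate the shape of such distributions:

```python
import statistics

# Hypothetical citation counts for ten articles in one journal:
# a few highly cited papers dominate, most are rarely cited.
citations = [120, 45, 9, 4, 3, 2, 1, 1, 0, 0]

mean = statistics.mean(citations)      # what a JIF-style average reflects
median = statistics.median(citations)  # the 'typical' article
print(mean, median)  # → 18.5 2.5
```

An author pointing to the journal-level mean of 18.5 says nothing about an article that, like the median one here, was cited two or three times.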
The h-index was introduced in 2005 (Hirsch 2005). It is meant to assess an individual researcher’s performance by looking at the way citations are distributed across all publications of that person. If one ranks the output in descending order by number of received citations, the h-index is the highest rank at which the number of citations received still equals or exceeds that rank (i.e., if somebody has published five articles, cited 20, 15, 8, 4, and 3 times, the h-index is 4). Due to the simplicity of the indicator, it has been widely
adopted and is sometimes even mentioned to justify hiring and firing
decisions as well as the evaluations of research proposals in research
councils.
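The ranking rule above can be sketched in a few lines of Python; this is an illustrative implementation, not the computation of any particular citation database:

```python
def h_index(citations):
    """Return the h-index: the largest rank h such that the h-th
    most-cited publication has received at least h citations."""
    ranked = sorted(citations, reverse=True)  # descending citation counts
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# The example from the text: five articles cited 20, 15, 8, 4, and 3 times.
print(h_index([20, 15, 8, 4, 3]))  # → 4
```

The brevity of this computation goes some way towards explaining the indicator's rapid adoption, and also why its limitations are so easily overlooked.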
The problems with the h-index are manifold. First of all, a number of issues related to the JIF also apply to the h-index, one of them being the lack of normalization, which makes comparisons across fields impossible (van Leeuwen 2008). A further issue is the conservative nature of the indicator: it can only increase, which makes the h-index unfit for predictions. The next set of issues relates to the way this indicator is cal-
culated. Depending on the database, calculations of the h-index can dif-
fer significantly. In many cases, authors and their oeuvres cannot be
determined easily. A final set of issues relates to a variety of more general
questions, such as the publication strategies chosen by researchers (putting your name on every single paper from the team or being more selective),
the discrimination against younger staff and the invisibilization of schol-
arly virtues.
Indicators such as the JIF and the h-index are nowadays easily available
for everybody. They are readily used by people in research management
and science policy, government officials, librarians, and so on. Even
though precise effects are difficult to prove, these indicators often play a
role for decision-making in grant proposal evaluation, hiring of academic
personnel, annual reviews as well as promotion and tenure decisions. If
they are applied in a mechanistic way without reflecting on their limits,
such indicators can go against values which have defined the academic
ethos, for example, the innovation imperative, the service to the commu-
nity, the disinterested pursuit of ‘truth’.
While the JIF and the h-index testify to numerocratic practices within
academic research, university rankings are an example of the numerocratization of higher education in the broader social space. University
rankings were initially intended to help potential students to select a
proper university matching their educational background. Yet these rank-
ings have turned into a numerocratic exercise that now assesses many
aspects of university performance. Since 2004, with the launch of the
so-called Shanghai ranking, universities have been regularly ranked
worldwide, which has contributed to creating a global market of higher
education. As a result, these league tables can no longer be ignored by
managers and administrators, especially in Anglo-American institutions,
which highly depend on fees brought by international students (Espeland
and Sauder 2007). With the exception of the Leiden ranking, which is
based entirely upon research metrics, university rankings usually contain
information on educational results, student-staff ratio, reputation, and
research performance. All prominent university rankings, including the
ARWU Ranking (Academic Ranking of World Universities, aka
Shanghai Ranking), the Times Higher Education university ranking and
institution under one umbrella name. On the micro level, the challenges
of scientometric analysis are the greatest. In the first place, micro-level
analysis often, if not always, needs to involve those who are the object of
the analysis so as to define research groups, projects, programmes, and
publications realized by these units. Some sort of certification or authori-
zation is required without which the outcomes of the study would lack
legitimacy. In the second place, on the level of the individual researcher,
the problem of homonyms and synonyms plays an important role. As
one single person can publish under various names in the international
serial literature (e.g., by using various initial combinations, sometimes
one of the first names is written in full, etc.), these occurrences have to be
reduced to a single variant. Likewise, one name variation can hide various persons, due to the occurrence of very common names (in the English-language area, names like Brown or Smith) in combination with one single initial; the many very common names among Chinese scholars likewise pose formidable challenges to citation indices. Scientometric data handling requires informa-
tion on the full names of individuals, the field in which people have been
working and also about their career track. One can try to collect such
data. However, ideally, one should consult the authors as they are those
who know best. Verifying publications not only increases the validity of
the outcomes of the scientometric study but also adds to the transparency
of the process.
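The synonym side of this problem can be sketched as a crude normalization step. The name strings below are hypothetical, and real disambiguation pipelines draw on much richer evidence (affiliations, co-authors, career tracks) than a surname-plus-initial key:

```python
def author_key(surname, given):
    """Crude disambiguation key: surname plus first initial, lowercased.
    This collapses synonyms (John A. Brown / J. Brown / J. A. Brown all
    map to 'brown, j') but worsens homonyms: distinct people can end up
    sharing the same key."""
    return f"{surname.strip().lower()}, {given.strip()[0].lower()}"

# Three name variants of one person -- plus Jane Brown, a different person.
variants = [("Brown", "John A."), ("Brown", "J."),
            ("Brown", "J. A."), ("Brown", "Jane")]
keys = {author_key(s, g) for s, g in variants}
print(keys)  # → {'brown, j'}
```

The single resulting key shows both sides of the trade-off at once: the three variants of John A. Brown are correctly merged, but Jane Brown is wrongly merged with them, which is exactly why author verification by the researchers themselves remains necessary.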
In order to critically reflect on how numerocracy works in and through
scientometrics, one needs to understand how the empirical data for indi-
cators and rankings are collected. With respect to this phase of data col-
lection, we can briefly invoke the work of Pierre Duhem, a French natural scientist and philosopher of science. In his so-called law of cognitive complementarity, he holds that the level of accuracy and the level of certainty trade off against each other (Rescher 2006).
If we follow Duhem, the analysis of the macro level can teach us about
the performance of a country in a particular field but it cannot tell us
anything about any particular university active in that field, let alone
about any of the research programmes or individual scholars in that field.
Vice versa, while analyses on the micro level can instruct us about individual scholars and the research projects they contribute to, they cannot inform us about the national research performance in that same field of research.
Even at the meso level, where we would expect the levels of certainty and accuracy to be more easily balanced, the world remains quite complicated: an
overview of the research in, say, a medical centre in the field of immunol-
ogy does not relate in a one-to-one relationship to the department of
immunology in that centre, as researchers from various departments
might publish in the field of immunology, such as haematologists, oncol-
ogists, and so on. This tension between the level of certainty and accuracy
exists at any moment and influences the range and reach of the conclu-
sions that can be drawn from scientometric data.
References
Angermuller, Johannes. 2013a. Discours académique et gouvernementalité
entrepreneuriale. Des textes aux chiffres. In Les discours sur l’économie, ed.
Malika Temmar, Johannes Angermuller, and Frédéric Lebaron, 71–84. Paris:
PUF.
———. 2013b. How to become an academic philosopher. Academic discourse
as a multileveled positioning practice. Sociología histórica 3: 263–289.
———. 2017. Academic careers and the valuation of academics. A discursive
perspective on status categories and academic salaries in France as compared
to the U.S., Germany and Great Britain. Higher Education 73 (6): 963–980.
Angermuller, Johannes, and Jens Maeße. 2015. Regieren durch Leistung. Zur
Verschulung des Sozialen in der Numerokratie. In Leistung, ed. Alfred Schäfer
and Christiane Thompson, 61–108. Paderborn: Schöningh.
Bloor, David. 1976. Knowledge and social imagery. London: Routledge & Kegan
Paul.
Bourdieu, Pierre. 1992. Homo academicus. Frankfurt/Main: Suhrkamp.
Burrows, Richard. 2012. Living with the h-index? Metric assemblages in the
contemporary academy. Sociological Review 60 (2): 355–372.
Cawkella, Tony, and Eugene Garfield. 2001. Institute for scientific information.
In A century of science publishing, ed. E.H. Fredriksson, 149–160. Amsterdam:
IOS Press.
Committee on the Independent Review of the Role of Metrics in Research
Assessment and Management. 2015. The metric tide. Report to the HEFCE,
July 2015. Accessed June 27, 2018. http://www.hefce.ac.uk/media/HEFCE,2014/Content/Pubs/Independentresearch/2015/The,Metric,Tide/2015_metric_tide.pdf.
Cronin, Blaise, and Helen Barsky Atkins. 2000. The scholar’s spoor. In The web
of knowledge: A festschrift in honor of Eugene Garfield, ed. Blaise Cronin and
Helen B. Atkins, 1–8. Medford, NJ: Information Today.
CWTS. 2007. Scoping study on the use of bibliometric analysis to measure the
quality of research in UK higher education institutions. Report to HEFCE
by the Centre for Science and Technology Studies (CWTS), Leiden
University, November 2007.
Desrosières, Alain. 1998. The politics of large numbers: A history of statistical rea-
soning. Cambridge, MA: Harvard University Press.
Dunleavy, Patrick, and Christopher Hood. 1994. From old public-administration
to new public management. Public Money & Management 14 (3): 9–16.
Espeland, Wendy Nelson, and Michael Sauder. 2007. Rankings and reactivity:
How public measures recreate social worlds. American Journal of Sociology
113 (1): 1–40.
Espeland, Wendy Nelson, and Mitchell L. Stevens. 2008. Commensuration as a
social process. Annual Review of Sociology 24: 313–343.
Foucault, Michel. 1973. The birth of the clinic: An archaeology of medical percep-
tion. London: Routledge. Original edition, 1963.
———. 1980. Power/knowledge: Selected interviews and other writings
1972–1977. Edited by Colin Gordon. New York: Pantheon.
———. 1995. Discipline and punish: The birth of the prison. New York: Vintage
Books.
———. 2002. The order of things. An archeology of the human sciences. London:
Routledge. Original edition, 1966.
———. 2007. Security, territory, population: Lectures at the Collège de France.
Basingstoke: Palgrave Macmillan. Original edition, 1977/78.
———. 2008. The birth of biopolitics. Lectures at the Collège de France,
1978–1979. London: Palgrave Macmillan.
Garfield, Eugene. 1955. Citation indexes to science: A new dimension in docu-
mentation through association of ideas. Science 122 (3159): 108–111.
———. 1972. Citation analysis as a tool in journal evaluation. Science 178
(4060): 471–479.
———. 2006. The history and meaning of the journal impact factor. JAMA
295 (1): 90–93.
Garfield, Eugene, and Irving H. Sher. 1963. New factors in the evaluation of
scientific literature through citation indexing. American Documentation 14
(3): 195–201.
Hirsch, Jorge Eduardo. 2005. An index to quantify an individual’s scientific
research output. Proceedings of the National Academy of Sciences of the USA
102 (46): 16569–16572.
Hood, William W., and Concepción S. Wilson. 2001. The literature of biblio-
metrics, scientometrics and informetrics. Scientometrics 52 (2): 291–314.
Jiménez-Contreras, Evaristo, Emilio Delgado López-Cózar, Rafael Ruiz-Pérez,
and Victor M. Fernández. 2002. Impact-factor rewards affect Spanish
research. Nature 417: 898.
Klein, Daniel B. 2004. The social science citation index. A black box—With an
ideological bias? Econ Journal Watch 1 (1): 134–165.
Knorr Cetina, Karin. 1981. The manufacture of knowledge. An essay on the con-
structivist and contextual nature of science. Oxford and New York: Pergamon
Press.
Latour, Bruno, and Steve Woolgar. 1979. Laboratory life. Princeton, NJ:
Princeton University Press.
van Leeuwen, Thed N. 2008. Testing the validity of the Hirsch-index for research
assessment purposes. Research Evaluation 17 (2): 157–160.
———. 2013. Bibliometric research evaluations, web of science and the social
sciences and humanities: A problematic relationship? Bibliometrie – Praxis
und Forschung, 1–18. Accessed June 27, 2018. http://www.bibliometrie-pf.
de/article/view/173.
van Leeuwen, Thed N., Henk F. Moed, Robert J.W. Tijssen, Martijn S. Visser,
and Ton F.J. Van Raan. 2001. Language biases in the coverage of the science
Citation Index and its consequences for international comparisons of national
research performance. Scientometrics 51 (1): 335–346.
Mandel, Ernest. 1978. The second slump. London: Verso.
Merton, Robert K. 1962. Science and the social order. In The sociology of science,
ed. Bernard Barber and Walter Hirsch, 16–28. Westport, CT: Greenwood.
Miller, Peter. 2001. Governing by numbers. Why calculative perspectives mat-
ter. Social Research 68 (2): 379–396.
Moed, Henk F. 2008. UK research assessment exercises: Informed judgments on
research quality or quantity? Scientometrics 74 (1): 153–161.
Moed, Henk F., and Thed N. van Leeuwen. 1995. Improving the accuracy of
institute for scientific information’s journal impact factors. Journal of the
American Society for Information Science 46: 461–467.
On the Social Uses of Scientometrics: The Quantification… 119
1 Introduction
Most discourse analytical studies have a common interest in patterns of
how knowledge is (re-)produced, (re-)distributed, and controlled
through social practices of language use. Discourse analysts have devel-
oped numerous methods to demonstrate how meaning is constructed.
However, the reason why a particular text or textual sequence in a given
corpus was chosen to be analysed often remains arbitrary. Distinguishing
hermeneutic from heuristic methods, this contribution introduces a
systematic quantitative methodology guiding the analyst’s choices of
texts and textual sequences. The text emphasises the heuristic strength
I am thankful to Malcolm MacDonald for his helpful comments on earlier versions of this text.
Additionally, I want to thank André Salem for the numerous personal tutorial sessions and
discussions of the software Lexico3 with which most of the analyses in this text have been
conducted.
R. Scholz (*)
Centre for Applied Linguistics, University of Warwick, Coventry, UK
e-mail: r.scholz@warwick.ac.uk
128 R. Scholz
second step, analyse the language use within these sociological categories, we can then relate the use of particular words to the construction of meaning within a particular social context (Duchastel and Armony 1995, 201). Such an analysis investigates the topics, utterances, and positioning practices that structure the organisation of the symbolic field and the exercise of power in contemporary societies (Duchastel and Armony 1993, 159). By contrasting actual language use against sociological metadata, we aim to investigate how meaning construction is entangled in social and societal relations such as work, class, gender, race, and poverty.
Fig. 5.1 Correspondence analysis of the German press corpus on the financial crisis 2008 in the partition ‘month’ (Representation of column names only)
[Fig. 5.2: data point labels comprise Pofalla/CDU, Müntefering/SPD, Glos/CSU, Seehofer/CSU, Steinmeier/SPD, Kauder/CDU, Schäuble/CDU, Merkel/CDU, Leyen/CDU, Köhler/B.-Präsi., Lagarde/Frankreich, Steinbrück/SPD, Juncker/EU, Lundgren/Schweden, Schmidt H./SPD, Prof. Weischenberg, Enzensberger, Merkel/Sarkozy, Barroso/EU]
Another way of using this method is to contrast the language use of dif-
ferent speakers in order to find out which speakers use the most similar
vocabulary. Figure 5.2 represents the similarities and differences in the
vocabulary of discourse participants who were interviewed about the 2008 financial crisis in the German press throughout the research period.
For this analysis, we have created a sub-corpus of the above press cor-
pus on the financial crisis compiled from 28 interviews of 19 interviewees
and a report from the G20 summit in Washington (15–16 November
2008) written by the German Chancellor Merkel and the French
President Sarkozy. The sub-corpus was compiled to gain a clearer picture of the positions of the individuals whom the journalists presented as experts in this discourse. The visualisation shows that
the positions of these ‘experts’ cannot be ordered according to their polit-
ical party affiliation. With the rather intellectual figures Enzensberger,
Köhler, Schmidt, and Weischenberg on the right and numerous politi-
cians on the left, the x-axis seems to represent a continuum between
societal aspects and party politics. Accordingly, the y-axis seems to repre-
2 ALCESTE stands for ‘Analyse des Lexèmes Cooccurrents dans un Ensemble de Segments de Texte’, which means analysis of co-occurring lexemes in a totality of text segments.
Lexicometry: A Quantifying Heuristic for Social Scientists… 137
Fig. 5.3 Descending hierarchical classification (DHC) in the German press corpus on the financial crisis 2008 (analysed with Iramuteq)
class (e.g. nicht).3 Whereas class 1 contains words that seem to refer to a
rather technocratic discourse describing the macroeconomic context of
the financial crisis, class 5 contains words which are above all names of
banks that were involved or affected by the financial crisis. Class 3 con-
tains words referring above all to effects in the real economy, and class 2
to the social market economy as part of a discussion on the crisis of the
predominant economic system criticised from a viewpoint of political
economy. Similarly, class 4 refers to Marx and his analysis of capitalism, apparently linked in a particular way to various social contexts: Kinder (children), Mann (man), Frau(en) (woman/women). In contrast to this more abstract discussion of the political system, class 6 contains words referring to politics in the national, supranational, and transnational spheres. Without going into more detail, we can see the potential strength
3 For French and English corpora, the software uses more sophisticated dictionaries which exclude functional words for this type of analysis.
of this method to reveal the different semantic worlds and parts of the
narrative elements prevailing in a corpus. In this sense it allows us to
make assumptions about how semantic units are put into context and, subsequently, how knowledge is constructed.
4 The term ‘co-occurrence’ refers to an instance of an above-chance frequency of occurrence of two terms (probability distribution). Instances of a systematic co-occurrence taking into account word order or syntactic relations would be referred to with the term ‘collocation’ in this terminology.
5 The Mutual Information score addresses the same issue but with a different algorithm.
Fig. 5.4 The dominating semantic field in the German press corpus on the finan-
cial crisis 2008
Figure 5.4 represents the dominant semantic field in the German press
corpus on the financial crisis 2008. Reciprocal co-occurrences can be ana-
lysed with the software CooCs. We have chosen parameters that produce a maximum of reciprocal co-occurrences in one network while at the same time aiming for a readable representation. Figure 5.4 represents all tokens whose probability of co-occurring in a paragraph containing the node bank is very high. Forming a symmetric matrix of the lexis of a given corpus, we measured the extent to which each type of the corpus is overrepresented in paragraphs containing another type of this corpus, for instance, whether the token Lehman was overrepresented in paragraphs containing the token bank. To obtain a reciprocal co-occurrence, bank must likewise be overrepresented to a similar extent in paragraphs containing Lehman.
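The overrepresentation test behind such a network can be sketched in a few lines of Python. This is a minimal illustration assuming a hypergeometric model of how often two tokens share a paragraph; the actual algorithm implemented in Lexico3 and CooCs may differ in its details, and the function names and significance threshold here are our own.

```python
from math import comb

def cooccurrence_pvalue(n_paragraphs, n_with_a, n_with_b, n_with_both):
    """P(X >= n_with_both) under a hypergeometric model: draw the
    n_with_b paragraphs containing token B at random and count how
    many fall among the n_with_a paragraphs containing token A."""
    total = comb(n_paragraphs, n_with_b)
    p = 0.0
    for k in range(n_with_both, min(n_with_a, n_with_b) + 1):
        p += comb(n_with_a, k) * comb(n_paragraphs - n_with_a, n_with_b - k) / total
    return p

def reciprocal_cooccurrence(paragraphs, a, b, alpha=0.01):
    """True if tokens a and b co-occur in paragraphs more often than
    chance. The hypergeometric model is symmetric in a and b, so one
    p-value covers both directions of the 'reciprocal' test."""
    has_a = [a in p for p in paragraphs]
    has_b = [b in p for p in paragraphs]
    n_both = sum(x and y for x, y in zip(has_a, has_b))
    return cooccurrence_pvalue(len(paragraphs), sum(has_a), sum(has_b), n_both) < alpha
```

With paragraphs represented as sets of tokens, `reciprocal_cooccurrence(paragraphs, 'bank', 'lehman')` flags the pair when each token is overrepresented in the other's paragraphs.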
In Fig. 5.4 one can distinguish a number of different semantic fields
enabling us to map the discourse of the financial crisis in the German
press. In the centre of the visualisation, we find the token bank. One can
see that, for instance, words referring to German politics are intercon-
nected with the rest of the network via particular lexical elements: Merkel
is connected to tokens referring to the inter- and transnational political
field such as Sarkozy, Barroso, EU, G[7], IWF, Weltbank (world bank)
and others. Steinbrück is linked to Hypo Real Estate and Finanzminister to
tokens referring to US political actors such as Paulson, Bush, Obama, Kongress (Congress). Based on this map, we could deepen the analysis by examining in more detail the textual contexts in which these dominant lexical elements are used in the corpus.
Once we have explored our data with exhaustive methods, we usually start deepening our analysis with additional methods. Keyword analysis is helpful to find out more about the language use that is typical of a particular speaker or a particular time period. This can be done with an algorithm calculating which word tokens are overrepresented in a particular part of the corpus when compared to all the other parts (see ‘partition’ in Sects. 3 and 4). Based on the total number of tokens in the corpus, the total number of tokens in one part, and the frequency of each token in the whole corpus, the algorithm calculates the expected frequency of each token in the part investigated. If the observed frequency in this part is higher than the expected frequency, then the token is overrepresented and can be considered as belonging to the typical vocabulary of this part of the corpus. In the software Lexico3, which we used to run most of the analyses presented in this text, the algorithm is the same as the one used for the calculation of the co-occurrences (see also Sect. 5.1.2).
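The expected-frequency comparison just described can be sketched as follows. This is a deliberate simplification: it ranks tokens by their observed/expected ratio, whereas Lexico3's specificity calculation rests on a probability model (the same one used for the co-occurrences); all function names are illustrative.

```python
def expected_frequency(corpus_freq, part_tokens, corpus_tokens):
    """Expected count of a token in one part of the corpus, assuming it
    is spread over the corpus in proportion to the part's token share."""
    return corpus_freq * part_tokens / corpus_tokens

def keywords(part_counts, corpus_counts, part_tokens, corpus_tokens):
    """Tokens whose observed frequency in the part exceeds their
    expected frequency, ranked by the observed/expected ratio."""
    ranked = []
    for token, observed in part_counts.items():
        expected = expected_frequency(corpus_counts[token], part_tokens, corpus_tokens)
        if observed > expected:
            ranked.append((token, observed / expected))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

A token such as crisis that occurs far more often in one month's articles than its corpus-wide share predicts would head the resulting keyword list for that month.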
[Fig. 5.5: line chart of overrepresentation scores (vertical scale from −55 to 55) per month, Sep 08 to Apr 09, for the categories financial crisis, economic crisis, countries, banks, banks of issue, ministry of finance, German government, global actors, financial products, government measures]
solving the crisis. Furthermore, throughout the duration of the crisis the
term financial crisis loses importance in favour of the term economic crisis.
Interestingly, moreover, the government measures against the crisis (overrepresented in January 2009) seem not to be discussed together with the origin of the crisis, the financial products, which are overrepresented in November, March, and April, always together with the global actors such as the International Monetary Fund (IMF).
Figure 5.6 summarises the interpretation of Fig. 5.5: The discourse
representing the financial crisis in the German press refers on the one
hand to the national political sphere and on the other hand to the inter-
national political sphere. Both spheres are structured differently in terms
of discourse participants and their represented actions. Whereas on the
international level we can find the international political actors deliberat-
ing about the origins and the actors responsible for the crisis, on the
national level we can observe political action against the effects but not
against the origins of the crisis. In this sense, crisis politics seems to be
divided into national political action against crisis effects, which are
In the last section of this chapter, we will illustrate briefly how we can use
quantitative methods of corpus analysis in order to reduce systematically
the amount of textual data that can subsequently be analysed with quali-
tative methods. Based on the above-mentioned sub-corpus of press interviews, we have calculated the specific vocabulary for each interviewee.
With the help of the resulting lists of keywords we were able to identify
prototypical text sequences for each interviewee.
Merkel: There have been a number of years in which the fund has barely
had its classic role to play—supporting countries that have experienced
serious economic and financial difficulties. Therefore the savings program
was decided. However, if we now assign new tasks to the IMF to monitor
the stability of the financial markets, we must also equip it properly. […]
With our stimulus package aiming to stabilize the economy, we immedi-
ately provide effective support for investment and consumption. We are
building a bridge between businesses and citizens so that in 2009 the con-
sequences of the global crisis will be absorbed and the economy will rise
again in 2010.6 (Interview by Süddeutsche Zeitung, 14 November
2008)
Steinbrück: The depth of the recession will not be known until afterwards.
[…]
Spiegel: If one sees with which efforts the recession is countered abroad, one
can get the impression that you are quite passive—or just stubborn.
Steinbrück: I am not stubborn, I obey economic reason.7 (Interview by
Spiegel, 1 December 2008)
7 Steinbrück: Wie tief die Rezession ausfällt, wird man erst hinterher genau wissen.
Spiegel: Wenn man sieht, wie man sich im Ausland gegen diese Rezession stemmt, dann muss man den Eindruck bekommen, dass sie ziemlich passiv sind. Oder einfach nur stur.
Steinbrück: Ich bin nicht stur, ich gehorche der ökonomischen Vernunft.
8 SZ: Brauchen wir eine europäische Wirtschaftsregierung, wie sie Frankreichs Präsident Sarkozy fordert?
Barroso: Nach dem Treffen der Staats- und Regierungschefs am 7. November sind wir uns in Europa einig, dass wir nationale Aktivitäten besser koordinieren, aber nicht alles vereinheitlichen müssen. Wenn etwa Polen ein Wirtschaftsprogramm beschließt, wirkt sich das auf Deutschland aus und sicher auch umgekehrt.
Spiegel: The financial debacle caused a profound crisis of the so-called real
economy.
Enzensberger: It is incomprehensible to me why the whole world is so sur-
prised. This is a bit like in England. If it is snowing in winter, the English
are quite taken aback, because entire regions sink into the snow, as if winter
were not a periodically recurrent fact. Likewise every boom is followed by
a crash. This is of course very uncomfortable.9 (Interview by Spiegel,
3 November 2008)
9 Spiegel: Haben die Banker moralisch versagt?
Enzensberger: Es ist ein bisschen viel verlangt, dass ausgerechnet die Banker für die Moral zuständig sein sollen. […]
Spiegel: Aus dem Finanzdebakel erwächst eine tiefgreifende Krise der sogenannten Realwirtschaft.
Enzensberger: Es ist mir unbegreiflich, weshalb die ganze Welt davon so überrascht ist. Das ist ein bisschen wie in England. Wenn es dort im Winter schneit, dann sind die Engländer ganz verblüfft, weil ganze Regionen im Schnee versinken, so, als wäre der Winter nicht ein periodisch wiederkehrendes Faktum. Genauso folgt jedem Aufschwung ein Absturz. Das ist natürlich sehr ungemütlich.
use. This then can be used to investigate the construction of social catego-
ries such as race, gender, or class (Leimdorfer 2010). Even though the
lexicometric approach has not yet been used extensively in sociological
research, this chapter should help to integrate more quantitative research
on language use into the social sciences.
References
Bachelard, Gaston. 1962. La philosophie du non. Essai d’une philosophie du nouvel
esprit scientifique. Paris: PUF. Original edition, 1940.
Bécue-Bertaut, Mónica. 2014. Distributional equivalence and linguistics. In
Visualization and verbalisation of data, ed. Jörg Blasius and Michael Greenacre,
149–163. London and New York: CRC.
Benzécri, Jean-Paul. 1963. Cours de Linguistique Mathématique. Rennes: Université de Rennes.
———. 1969. Statistical analysis as a tool to make patterns emerge from data.
In Methodologies of pattern recognition, ed. Satosi Watanabe, 35–74. New York:
Academic Press.
———. 1980. Pratique de l’analyse des données. Paris: Dunod.
———. 1982. Histoire et préhistoire de l’analyse des données. Paris: Dunod.
Bonnafous, Simone, and Maurice Tournier. 1995. Analyse du discours, lexico-
métrie, communication et politique. Mots – Les langages du politique 29
(117): 67–81.
Busse, Dietrich, and Wolfgang Teubert. 2014. Using corpora for historical
semantics. In The discourse studies reader. Main currents in theory and analysis,
ed. Johannes Angermuller, Dominique Maingueneau and Ruth Wodak,
340–349. Amsterdam: John Benjamins. Original edition, 1994.
Demonet, Michel, Annie Geffroy, Jean Gouazé, Pierre Lafon, Maurice
Mouillaud, and Maurice Tournier. 1975. Des tracts en mai 1968. Paris: Colin.
Diaz-Bone, Rainer. 2007. Die französische Epistemologie und ihre Revisionen.
Zur Rekonstruktion des methodologischen Standortes der Foucaultschen
Diskursanalyse. Forum Qualitative Sozialforschung/Forum: Qualitative Social
Research 8 (2): Art. 24.
Duchastel, Jules, and Victor Armony. 1993. Un protocole de description de
discours politiques. Actes des Secondes journées internationales d’analyse
statistique de données textuelles, Paris.
Lebart, Ludovic, André Salem, and Lisette Berry. 1998. Exploring textual data.
Dordrecht: Kluwer.
Lebart, Ludovic, and Gilbert Saporta. 2014. Historical elements of correspon-
dence analysis and multiple correspondence analysis. In Visualization and
verbalisation of data, ed. Jörg Blasius and Michael Greenacre, 31–44. London
and New York: CRC.
Lee, David. 2001. Genres, registers, text types, domains, and styles: Clarifying
the concepts and navigating a path through the BNC jungle. Language
Learning & Technology 5 (3): 37–72.
Leimdorfer, François. 2010. Les sociologues et le langage. Paris: Maison des sci-
ences de l’homme.
Leimdorfer, François, and André Salem. 1995. Usages de la lexicométrie en
analyse de discours. Cahiers des Sciences Humaines 31 (1): 131–143.
Martinez, William. 2011. Vers une cartographie géo-lexicale. In Situ, 15.
Accessed July 1, 2018. http://journals.openedition.org/insitu/590.
———. 2012. Au-delà de la cooccurrence binaire… Poly-cooccurrences et
trames de cooccurrence. Corpus 11: 191–216.
Mayaffre, Damon. 2005. De la lexicométrie à la logométrie. L’Astrolabe. Accessed
July 1, 2018. https://hal.archives-ouvertes.fr/hal-00551921/document.
———. 2007. Analyses logométriques et rhétorique du discours. In Introduction
à la recherche en SIC, ed. Stéphane Olivesi, 153–180. Grenoble: Presses
Universitaires de Grenoble.
———. 2016. Quantitative linguistics and political history. In Quantitative lin-
guistics in France, ed. Jacqueline Léon and Sylvain Loiseau, 94–119.
Lüdenscheid: Ram Verlag.
Mayaffre, Damon, and Céline Poudat. 2013. Quantitative approaches to politi-
cal discourse. Corpus linguistics and text statistics. In Speaking of Europe.
Approaches to complexity in European political discourse, ed. Kjersti Fløttum,
65–83. Amsterdam: Benjamins.
Muller, Charles. 1967. Étude de statistique lexicale. Le vocabulaire du théâtre de Pierre Corneille. Paris: Larousse.
Pêcheux, Michel. 1982. Language, semantics and ideology (Language, Discourse,
Society Series). London: Macmillan.
Pêcheux, Michel, Claudine Haroche, Paul Henry, and Jean-Pierre Poitou. 1979.
Le rapport Mansholt: un cas d’ambiguïté idéologique. Technologies, Idéologies,
Pratiques 2: 1–83.
Reinert, Max. 1983. Une méthode de classification descendante hiérarchique.
Cahiers analyse des données VIII (2): 187–198.
Yule, George Udny. 1944. The statistical study of literary vocabulary. Cambridge:
Cambridge University Press.
Žagar, Igor Ž. 2010. Topoi in critical discourse analysis. Lodz Papers in Pragmatics
6 (1): 3–27.
Ziem, Alexander. 2014. Frames of understanding in text and discourse. Theoretical
foundations and descriptive applications. Amsterdam: Benjamins.
Zipf, George K. 1929. Relative frequency as a determinant of phonetic change.
Harvard Studies in Classical Philology 40: 1–95.
———. 1935. The psycho-biology of language. An introduction to dynamic
philology. Boston: Mifflin.
6
Words and Facts: Textual Analysis—Topic-Centred Methods for Social Scientists
Karl M. van Meter
1 Introduction
In arguing for systematic textual analysis as a part of discourse analysis,
Norman Fairclough stated that ‘[t]he nature of texts and textual analysis
should surely be one significant cluster of issues of common concern’
within discourse analysis (Fairclough 1992, 196). As a contribution to discourse analysis and a reinforcement of the close association between discourse analysis and textual analysis, I will here deal with texts from several different origins and over a time span extending from the 1980s to now.
I will try to explain and show how complex statistical methods such as
factorial correspondence analysis (see the TriDeux software of Cibois
2016), both descending hierarchical classification analyses (see the Alceste
software (Image 2016) and the Topics software (Jenny 1997)) and ascend-
ing hierarchical classification analyses (such as Leximappe-Lexinet
(Callon et al. 1991) and Calliope (De Saint Léger 1997)), can be used to
1. In the first case, I will look at how statistical analysis of texts from
1989 produces a geographical map of the then Soviet Union and how
those texts and this map help to define the political and economic
structures of current-day Russia and its ‘near abroad’.
2. In the second case, I’ll take a synchronic perspective on American,
French, and German sociologies by looking at abstracts submitted to
the annual conferences of each national association of sociology, and
also a diachronic perspective in the specific case of French sociology.
The first analysis shows how the discursive field of sociology in each
country is structured in terms of topics dealt with, of relationships or
ties between these topics, and of the history and culture of each coun-
try. In a second analysis, I will provide a diachronic perspective by
terms, and even then, we used only the 20 most frequent names as active
variables with the 80 other geographical names being used only as non-
active or passive elements that did not enter into the calculations. We used
the TriDeux factorial correspondence analysis program (Cibois 1983,
1985, 2016) inspired by Benzécri, which is a hierarchically descending
method that successively ‘cuts’ the set of data points along the most statis-
tically significant axes or dimensions, thus producing the following two-
dimensional diagram based on the two most pertinent factors (Fig. 6.1).
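The core of factorial correspondence analysis can be sketched in a few lines of NumPy as a singular value decomposition of the standardized residuals of the contingency table (here, biographies by geographical terms). TriDeux's own computations and options differ in detail; this is only the textbook decomposition, with illustrative function and variable names.

```python
import numpy as np

def correspondence_analysis(table, n_axes=2):
    """Row coordinates on the first factorial axes of a contingency
    table, via the classic SVD of the standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    # standardized residuals: deviation from independence, rescaled
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    # principal row coordinates on the leading axes
    rows = (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]
    inertia = sv[:n_axes] ** 2             # variance carried by each axis
    return rows, inertia
```

Plotting the first two coordinate columns against each other yields exactly the kind of two-dimensional diagram discussed above, with rows of similar profile landing close together.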
Summarily interpreted, this analysis reveals that there is a very tight and
coherent network centred around Dnepropetrovsk and the Ukraine [and
other geographical names that figure prominently in Brezhnevian allies’
biographies], which corresponds with the political power base of Leonid
Brezhnev. The most significant cleavage in the entire population is between
this Brezhnevian group and the rest of the population. This ‘rest of the
population’ is in turn multi-centred with Stavropol (Mikhail Gorbachev’s
political power base), Latvia (an economically dynamic Baltic republic)
and Moscow (the centre of the state apparatus), distributing the remainder
of the names in an arc around themselves. The second most important
cleavage (the second axis) seems to be related to economic development
with Latvia at one extreme and Uzbekistan and Azerbaijan [in Central
Asia] at the other. This is, however, a tentative hypothesis that must be
examined further.
between Russia and Ukraine (the first axis of the diagram) and between
the Baltic republic and the Central Asian republics (the second axis)
changed that much since our textual analysis produced the above
diagram?
3 Ascending and Descending Methodologies
We referred to factorial correspondence analysis as a hierarchical descend-
ing method, by which we mean that you start with all the data together in the form of an n-dimensional cloud of data points, where n corresponds to
the number of variables being used in the description of each data point.
In the case of the Soviet biographies, it was 20 geographical terms or
variables. The cloud is then cut successively—thus in a ‘descending’ man-
ner—according to specific statistical criteria. In the case of the Soviet
biographies, the successive ‘cuts’ were made on the basis of statistical contributions to the first factor or axis (Ukraine/Rest of the USSR), then to the
second axis (Baltic republics/Asian republics or economically developed/
economically underdeveloped), and so on to the third, fourth and other
axes with less and less statistical significance. This distinction between
ascending and descending methods is even clearer when it concerns clas-
sifications (Van Meter 1990, 2003). One either builds classes or ‘clusters’
by putting together data points with similar characteristics or values for
variables, or by ‘cutting’ the entire cloud of data points to successively
form the most homogeneous classes possible. Alceste, mentioned above and first developed by Max Reinert (1987), is such a descending method
specifically intended for the analysis of corpora of texts and has been used
extensively in the analysis of very different sorts of textual data and con-
tinues to be extensively used today (Reinert 2003; Image 2016).
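The descending logic can be illustrated with a deliberately simplified divisive procedure: start with all data points in one cluster and repeatedly cut the most dispersed cluster at the mean of its most dispersed dimension. Alceste's actual criterion, which maximises a chi-square statistic over lexical tables, is considerably more elaborate; this sketch only conveys the top-down direction of the method, and its names are invented.

```python
def variance(points, dim):
    vals = [p[dim] for p in points]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def divisive_clustering(points, n_clusters):
    """Hierarchically descending classification: successive binary cuts
    of the most dispersed cluster along its most dispersed dimension."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        # pick the cluster with the largest total dispersion
        idx = max(range(len(clusters)),
                  key=lambda i: len(clusters[i]) * sum(
                      variance(clusters[i], d) for d in range(len(clusters[i][0]))))
        target = clusters.pop(idx)
        dim = max(range(len(target[0])), key=lambda d: variance(target, d))
        cut = sum(p[dim] for p in target) / len(target)
        left = [p for p in target if p[dim] <= cut]
        right = [p for p in target if p[dim] > cut]
        if not left or not right:          # cluster cannot be split further
            clusters.append(target)
            break
        clusters += [left, right]
    return clusters
```

Each pass removes one cluster and replaces it with two, mirroring the successive ‘cuts’ of the n-dimensional cloud described above.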
In the other direction, hierarchically ascending classifications try to
put in the same class or ‘cluster’ elements that have the most similar char-
acteristics (Van Meter et al. 1987, 1992). When the data are units of
texts, one of the more useful manners of constructing classes is to put
together words or terms that appear the most often together in the units
1991). This means that ‘mainstream’ classes (in the first quadrant) will
have relatively numerous ‘in-ties’ (co-occurrences of the class’ keywords
together in the text units) and relatively numerous ‘out-ties’ (co-
occurrences in text units of the class’ keywords with other keywords from
other classes). ‘Bandwagon’ classes have keywords with numerous ‘out-
ties’ but relatively fewer ‘in-ties’. And, of course, ‘ivory tower’ classes have
numerous ‘in-ties’ but relatively few ‘out-ties’.
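Counting ‘in-ties’ and ‘out-ties’ from keyword co-occurrences in text units can be sketched as follows. Leximappe-style co-word analysis works on normalised association indices rather than the raw counts used here, so this is an illustration of the bookkeeping only; all names are invented.

```python
from itertools import combinations

def tie_counts(text_units, classes):
    """For each class, count 'in-ties' (two of its keywords co-occurring
    in a text unit) and 'out-ties' (one of its keywords co-occurring
    with a keyword belonging to another class)."""
    membership = {kw: name for name, kws in classes.items() for kw in kws}
    ties = {name: {"in": 0, "out": 0} for name in classes}
    for unit in text_units:
        present = [kw for kw in unit if kw in membership]
        for a, b in combinations(present, 2):
            if membership[a] == membership[b]:
                ties[membership[a]]["in"] += 1
            else:
                ties[membership[a]]["out"] += 1
                ties[membership[b]]["out"] += 1
    return ties
```

A class with high counts on both tallies would sit in the ‘mainstream’ quadrant of the strategic diagram; out-ties dominating suggests a ‘bandwagon’ class, in-ties dominating an ‘ivory tower’ class.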
In Fig. 6.2, one can clearly see the dominant role of the term ‘femme’
(and thus sociology of women), and in Fig. 6.3, you can see the internal
structure of the class ‘femme’ and its constituent keywords (including
‘travail’) and the cluster’s ‘in-ties’. The statistical weight of the term
‘femme’ was so dominant that we decided to see what the database would
look like without ‘femme’. So we removed that term and redid the analy-
sis, producing a new strategic diagram (Fig. 6.4). The astonishing similar-
ity between Figs. 6.2 and 6.4 leads to the conclusion that the terms
‘femme’ and ‘travail’ are interchangeable in the structure of the 2004 AFS
corpus and, by implication, the sociology of women and the sociology of
work (a classic and historically dominant theme of French sociology)
have become contemporary equivalents (De Saint Léger and van Meter
2005). This result was then confirmed by the analysis of the 2006 AFS
congress and the 2009 congress (De Saint Léger and van Meter 2009).
In the case of the 2009 congress, there was a declared central theme:
‘violence in society’. Therefore, in the 2009 Strategic Diagram, the domi-
nant position of ‘femme’/‘travail’ had been taken over by the term
‘Violence’. But by employing the same technique of deleting the domi-
nant term and re-analysing the data, we produced a new 2009 Strategic
Diagram that was indeed very similar to the 2004 and 2006 results, and
Fig. 6.3 French Sociological Association (AFS) congress 2004 ‘femme’ (woman)
cluster with its keywords, including ‘travail’ (work)
Fig. 6.4 French Sociological Association (AFS) congress 2004 strategic diagram of all abstracts (without femme)
• Death of senior leader al-Adnani caps bad month for ISIS/CNN. The
death of one of ISIS’ most prominent figures, Abu Mohammad al-
Adnani, is one more example of the pressure the group is under in
both Iraq and Syria.
Fig. 6.5 Strategic diagram of the first four months of the 2006 Association for
the Right to Information corpus
international press usually talks about ‘Hamas terrorists’ but, by the end
of the year, this has become ‘Hamas militants’; China starts the year as a
major player on the international scene but finishes 2006 as one of the
most insignificant international actors (Van Meter and de Saint Léger
2008, 2009b). The case of ‘Hamas’ is quite interesting in its implications
for text analysis of political developments. If a ‘Hamas’ cluster appears in
the third diagram (2006-3), it is because the international media concen-
trated on the killing of an elected Hamas government official by the
Israelis in November 2006. But the truly important political develop-
ment associated with Hamas occurred in January 2006 when Hamas won
the internationally recognized democratic elections in Gaza and the
Occupied Territories, thus causing a major change in perspective that
should have been widely commented on in the press, but was not. Again, in
July, Israel detained a large number of elected Hamas government offi-
cials, against the wishes of the larger international community. But,
again, Hamas does not appear in the second diagram (2006-2), and the
international media gave only passing attention to this imprisonment of
elected Palestinian officials. It was only in November with the Israeli kill-
ing of one such official that the international media finally seemed to pay
attention to what was happening. Indeed, since the publication of our
study, several major news agencies have confirmed that at a very senior
level there was an editorial decision to no longer label all Hamas mem-
bers as ‘terrorists’ and instead use the term ‘militant’ or ‘activist’, or in the
case of this killing, ‘Hamas official’. This major change in 2006 is still
with us today and could only be identified by moving back and forth in
time over the results of these textual analyses.
There have been many other similar instances of such ‘uneven’ or
clearly biased coverage of particularly sensitive topics or events that can
be discerned by following the formation of certain clusters back in time
and also forward in time. Inversely, if a topic or cluster cannot be fol-
lowed over time, it is very likely a transient event or not a coherent topic.
The supposedly major ‘yearbook’ event that was the election of a 2006
Democrat majority in Congress hardly resulted in any memorable politi-
cal developments. Few people other than Middle Eastern specialists
remember the 2006 Israeli invasion of Lebanon or how many times Israel
has invaded its neighbours. And North Korea has its nuclear weapon, but
has not used it to this day, and international politics has been thoroughly
preoccupied by other issues since then.
But let us look at another less than evident 2006 development that is
nonetheless fundamental and whose consequences are still with us today.
During the first four-month period of 2006, the keyword ‘China’ was the
keyword with the highest statistical attractive power in the construction
of the clusters and axes (see Van Meter and de Saint Léger 2009b for a
description of this index). By looking at the texts of the first period that
included ‘China’, one finds that they often involved international nego-
tiations trying to keep the Bush White House from invading ‘Iran’, which
was the dominant term of that first period. China, Russia, and Europe
were involved in those negotiations to try to keep Bush and Cheney from
starting World War III. Four months later, Bush and Cheney no longer
wanted to invade Iran, but were hinting that an invasion of North Korea
would stop the development of an atomic weapon in that country. Again
there were international negotiations, and this time China was playing
the leading role, which largely explains how the attractive power of
‘China’ increased to a maximum during the 2006-2 period. But on 9
October 2006, North Korea detonated a nuclear device and that was the
end of negotiations and the attractive power of the keyword ‘China’ fell
precipitously as China disappeared from the international media as a
major player on the world stage (see Fig. 6.6, ‘2006 keywords’ attractive
power over the three four-month periods). Little wonder that China soon
became far more aggressive on the world scene and is currently browbeat-
ing the entire world concerning its offensive in the South China Sea.
Since the publication of the 2006 results, we have also looked at the
last two years of the Bush White House—2007–2008 (Van Meter and de
Saint Léger 2011)—and, as could be expected, we found some rather
intriguing developments. That data was divided into four successive six-
month periods (2007-1, 2007-2, 2008-1, and 2008-2) and strategic dia-
grams were produced for each period. However, it was in the evolution of
keyword attractive power that the most surprising development appears (see
Fig. 6.7).
Most of the terms decline in the last period (2008-2), but two top
terms do increase: 'UN' (United Nations) and 'kill' (which does not designate any particular institution but does tend to characterize an international increase in violence and instability). But among the other top
terms, there is one in serious decline: ‘Bush’. This can be interpreted as
indicating that during an international situation of increasing violence
and instability (‘kill’ going up), the leader of the world’s most powerful
nation was in decline or, as certain commentators stated, ‘had abandoned
the helm’ or was too discredited to lead the world. That responsibility was
being turned over to the institution 'on the rise', that is, the United
Nations. In short, the Republican government of George W. Bush was
being replaced on the international scene by the Republicans’ worst
nightmare, an ‘international government’ headed by the United Nations.
Although Bush was replaced at the White House by a Democrat, Barack
Obama, our analysis of 2009–2012, which was recently published under
the title 2009–2012—Obama’s First Term, Bush’s ‘Legacy’, Arab Spring &
World Jihadism (Van Meter 2016), confirms that the UN and not the US
president continues to lead the world on the current violent international
scene.
Fig. 6.7 Dominant 2007-1 terms over the four periods of 2007–2008 (Bush
vs. UN)
Words and Facts: Textual Analysis—Topic-Centred Methods… 177
Formal textual analysis is probably one of the very few methods available to us for the systematic study of scientific and cultural production in all of these countries and throughout the world that permits anything approaching scientific neutrality, the possibility of comparative study, and the accumulation of further information in these domains.
References
Callon, Michel, Jean-Pierre Courtial, and William Turner. 1991. La méthode
Leximappe – Un outil pour l'analyse stratégique du développement scienti-
fique et technique. In Gestion de la recherche – Nouveaux problèmes, nouveaux
outils, ed. Dominique Vinck, 207–277. Brussels: De Boeck.
Cibois, Philippe. 1983. Méthodes post-factorielles pour le dépouillement
d’enquêtes. Bulletin of Sociological Methodology/Bulletin de Méthodologie
Sociologique 1: 41–78.
———. 1985. L'analyse des données en sociologie. Paris: Presses Universitaires de
France.
———. 2016. Le logiciel Trideux. Accessed June 27, 2018. http://cibois.pages-
perso-orange.fr/Trideux.html.
De Saint Léger, Mathilde. 1997. Modélisation de la dynamique des flux
d’informations – Vers un suivi des connaissances. Thèse de doctorat, CNAM,
Paris.
De Saint Léger, Mathilde, and Karl M. van Meter. 2005. Cartographie du pre-
mier congrès de l’AFS avec la méthode des mots associés. Bulletin of
Sociological Methodology/Bulletin de Méthodologie Sociologique 85: 44–67.
———. 2009. French sociology as seen through the co-word analysis of AFS
congress abstracts: 2004, 2006 & 2009. Bulletin of Sociological Methodology/
Bulletin de Méthodologie Sociologique 102: 39–54.
Demazière, Didier, Claire Brossaud, Patrick Trabal, and Karl van Meter. 2006.
Analyses textuelles en sociologie – Logiciels, méthodes, usages. Rennes: Presses
Universitaires de Rennes.
Fairclough, Norman. 1992. Discourse and text. Linguistic and intertextual anal-
ysis within discourse analysis. Discourse & Society 3 (2): 193–217.
Glady, Marc, and François Leimdorfer. 2015. Usages de la lexicométrie et inter-
prétation sociologique. Bulletin of Sociological Methodology/Bulletin de
Méthodologie Sociologique 127: 5–25.
———. 2009. The AFS and the BMS. Analyzing contemporary French sociol-
ogy. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique
102: 5–13.
———. 2016. 2009–2012 – Obama’s first term, Bush’s ‘Legacy’, Arab Spring &
world jihadism. Paris: Harmattan.
Van Meter, Karl M., Philippe Cibois, Lise Mounier, and Jacques Jenny. 1989.
East meets West—Official biographies of members of the central committee
of the communist party of the soviet union between 1981 and 1987, ana-
lyzed with western social network analysis methods. Connections 12 (3):
32–38.
Van Meter, Karl M., Martin W. de Vries, and Charles D. Kaplan. 1987. States,
syndromes, and polythetic classes. The operationalization of cross-
classification analysis in behavioral science research. Bulletin of Sociological
Methodology/Bulletin de Méthodologie Sociologique 15: 22–38.
Van Meter, Karl M., Martin W. de Vries, Charles D. Kaplan, and
C.I.M. Dijkman-Caes. 1992. States, syndromes, and polythetic classes.
Developing a classification system for ESM data using the ascending and
cross-classification method. In The experience of psychopathology. Investigating
mental disorders in their natural settings, ed. Martin W. de Vries, 79–94.
Cambridge: Cambridge University Press.
Van Meter, Karl M., and Mathilde de Saint Léger. 2008. Co-word analysis
applied to political science. 2006 international political & ‘parapolitical’
headlines. Bulletin of Sociological Methodology/Bulletin de Méthodologie
Sociologique 97: 18–38.
———. 2009a. German & French contemporary sociology compared: Text
analysis of congress. Bulletin of Sociological Methodology/Bulletin de
Méthodologie Sociologique 104: 5–31.
———. 2009b. World politics and “parapolitics” 2006. Computer analysis of ADI
timelines. Paris: Harmattan.
———. 2011. 2007–2008—The end of Bush and the rise of the UN. Link
analysis of world media headlines. USAK Yearbook of International Politics
and Law 4: 1–21.
———. 2014. American, French and German sociologies compared through
link analysis of conference abstracts. Bulletin of Sociological Methodology/
Bulletin de Méthodologie Sociologique 122: 26–45.
Van Meter, Karl M., and William A. Turner. 1992. A cognitive map of sociological AIDS research. Current Sociology 40 (3): 129–134.
Van Meter, Karl M., and William A. Turner. 1997. Representation and confron-
tation of three types of longitudinal network data from the same data base of
sociological AIDS research. Bulletin of Sociological Methodology/Bulletin de
Méthodologie Sociologique 56: 32–49.
Van Meter, Karl M., William A. Turner, and Jean-Bernard Bizard. 1995.
Cognitive mapping of AIDS research 1980–1990. Strategic diagrams, evolu-
tion of the discipline and data base navigation tools. Bulletin of Sociological
Methodology/Bulletin de Méthodologie Sociologique 46: 30–44.
7
Text Mining for Discourse Analysis:
An Exemplary Study of the Debate
on Minimum Wages in Germany
Gregor Wiedemann
1 Introduction
Two developments have widened opportunities for discourse analysts in
recent years and paved the way for the incorporation of new computational
methods in the field. First, amounts of digital textual data worth investi-
gating are growing rapidly. Not only newspapers publish their content
online and take efforts retro-digitizing their archives, but also users inter-
actively react to content in comment sections, forums, and social net-
works. Since the revolution of the Web 2.0 made the Internet a
participatory many-to-many medium, vast amounts of natively digital
text emerge, shaping the general public discourse arena as much as they
form new partial public spheres following distinct discourse agendas.
Second, computational text analysis algorithms greatly improved in their
ability to capture complex semantic structures.
G. Wiedemann (*)
Department of Informatics, Language Technology Group, Hamburg
University, Hamburg, Germany
e-mail: gwiedemann@informatik.uni-hamburg.de
useful tools for both lexicometry and content analysis (Wiedemann and
Lemke 2016). Therefore, I expect that its capabilities to structure and order data can also make a valuable contribution to discourse studies
conducted against a large variety of methodological and theoretical
backgrounds.
First, let us look at how advanced text mining algorithms, in particular ML, proceed to extract knowledge from textual data. In a second step, we compare the characteristics of already established computational approaches such as CCA and lexicometric analyses
(Lebart et al. 1998), on the one hand, and ML, on the other hand, to
reflect on characteristics of the new approaches and their potential for
discourse analysis.
1
Accuracy in this scenario can be determined by k-fold cross-validation on the current training set
(Dumm and Niekler 2016).
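The k-fold procedure mentioned in this note can be sketched in plain Python (an illustrative sketch, not the cited implementation; `train` is a hypothetical function that fits a model and returns a predictor):

```python
from statistics import mean

def kfold_accuracy(examples, labels, train, k=10):
    """Estimate classifier accuracy by k-fold cross-validation.

    `train(examples, labels)` is any function that fits a model and
    returns a `predict(example) -> label` callable. Real implementations
    would shuffle (and often stratify) the data before splitting.
    """
    folds = [list(range(i, len(examples), k)) for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(examples)) if i not in held]
        predict = train([examples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        correct = sum(predict(examples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return mean(accuracies)
```

Each of the k folds serves once as held-out test data while the remaining folds form the current training set; the reported accuracy is the mean over the k runs.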
documents, the analyst's view is sharpened for specific textual formations and language regularities that contribute to shaping a discourse in its very own specific way. Analysis on this level embeds empirical observations
from the data in their global context. If one is able to condense these formations and language regularities into some sort of analytic categories, one keeps extracting such patterns from local contexts and relating them to each other at the global context level until a saturated description of the discourse can be assumed. This alternation between inductive, data-driven category development and deductive category subsumption is at the core of knowledge reconstruction in discourse analysis—or,
as Wodak and Meyer (2009, 9) phrase it: ‘Of course, all approaches
moreover proceed abductively’. This way of proceeding has some analogy
to the unsupervised and supervised nature of ML algorithms. They also
give researchers the opportunity to combine inductive and deductive
steps of analysis into creative workflows. On the one hand, unsupervised
ML allows for exploration of patterns buried in large text collections to
learn about contents without any prior knowledge. On the other hand,
supervised ML provides results for descriptive statistics and hypothesis
testing on the basis of deductively defined categories.
Algorithmically, too, ML has some similarity to abduction, as it infers optimal knowledge representations for given data sets.2 Unlike humans,
algorithms are capable of processing extremely large quantities of textual
data sets without getting tired or distracted. At the same time, usually
these data sets are the only source they can learn structure from in a sta-
tistical way.3 So far, in contrast to humans, they lack common ‘world
knowledge’ and experience from outside the investigated text collection
to relate observed patterns and draw inference on. In this respect, local
2
Optimization algorithms in machine learning, such as Expectation Maximization, usually start
with random or informed guesses for their initial model parameters. In an iterative process the
model parameters are adapted in small steps to better fit the given data. In the end, the model
parameters (nearly) optimally describe the data set in some structural way. Not coincidentally, this
process resembles the abductive process of knowledge reconstruction from text corpora in qualita-
tive data analysis.
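The iterate-until-fit logic described in this note can be illustrated with a hard-assignment relative of EM, two-cluster k-means on one-dimensional data (the data and the cluster count are invented for the example):

```python
def two_means(values, iterations=50):
    """Fit two cluster centres to 1-D data by alternating between
    (E-step) assigning each point to the nearer centre and
    (M-step) moving each centre to the mean of its assigned points."""
    a, b = min(values), max(values)  # informed initial guess
    for _ in range(iterations):
        left = [v for v in values if abs(v - a) <= abs(v - b)]
        right = [v for v in values if abs(v - a) > abs(v - b)]
        if not left or not right:    # degenerate split: stop iterating
            break
        a, b = sum(left) / len(left), sum(right) / len(right)
    return a, b
```

As in EM proper, each iteration adjusts the parameters (here, the two centres) so that they describe the data a little better, until the description stabilizes.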
3
Of course, there are already text mining approaches which incorporate text external resources such
as comparison corpora or structured knowledge bases. It is the task of further research and develop-
ment to evaluate on the contribution of such resources for specific research questions in qualitative
data analysis.
Fig. 7.1 Number of documents per year by publication (FAZ, FR), 1995–2015
For the exemplary study, I inspect and compare two major German
newspapers. Following the ‘distant reading’ paradigm (Moretti 2007),
first, I strive to reveal the global contexts of the debate with respect to its temporal evolution. What are the major topics and subtopics within the
discourse, when did they emerge or disappear, and how are they con-
nected to each other? Can we determine distinct time periods of the
debate from the data? The use of topic models in combination with fur-
ther text mining techniques will enable us to answer these questions. For
a substantial analysis, we also want to zoom in from this distant perspective and take a close look at individual units of language use shaping the
discourse. In a deductive step, I will look for statements and utterances
expressing political stance towards the issue. How is approval or rejection
of the introduction of minimum wages expressed and justified through-
out time? Then, with the help of text classification, we will be able to
trace this antagonism between proponents and opponents of statutory
minimum wages quantitatively.
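Schematically, the quantitative tracing step amounts to aggregating classified statements per year into stance proportions (a toy sketch with invented labels; the actual classification in the study is done by a supervised ML model):

```python
from collections import Counter, defaultdict

def stance_proportions(classified):
    """`classified` is an iterable of (year, label) pairs, where label is
    'approval' or 'opposition'. Returns {year: {label: share}} so that
    the shares within each year sum to 1."""
    counts = defaultdict(Counter)
    for year, label in classified:
        counts[year][label] += 1
    return {year: {label: n / sum(c.values()) for label, n in c.items()}
            for year, c in counts.items()}
```

Plotting such yearly shares per newspaper is what produces trend curves of the kind discussed below.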
the ‘Mindestlohngesetz’ came into force. This allows tracing the genesis
of the policy measure from the first noticeable public demands, over vari-
ous sidetracks of the discourse and the point when lawmakers in majority
supported the measure, up to reflections on actual effects of the enacted
law. From the entire newspaper archive, those articles were retrieved
which contained the term ‘Mindestlohn*’, where the asterisk symbol
indicates a placeholder to include inflected forms, plural forms, and com-
pounds.4 Additionally, articles had to be related to German politics
mainly, which could be achieved by restricting the retrieval in the archive
databases by provided metadata. This results in a corpus of 7,621 articles
(3,762 in the FAZ; 3,859 in the FR) comprising roughly 3.76 million
word tokens. Their distribution across time reveals the changing intensity
of the public debate. Absolute document frequencies indicate that both
publications, although from opposing sides of the political spectrum,
cover the topic in a surprisingly similar manner (Fig. 7.1).
4
The search index treats German Umlaute as their ASCII equivalent, such that Mindestlohn* also
retrieves articles containing ‘Mindestlöhne’.
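A sketch of such a retrieval filter (not the archive's actual query engine; the folding table and function name are illustrative): umlauts are folded to their ASCII equivalents, and the truncated search term is matched as a word prefix, covering inflected forms and compounds.

```python
import re

# Fold German umlauts to their ASCII equivalents, as the search index does.
ASCII_FOLD = str.maketrans({'ä': 'a', 'ö': 'o', 'ü': 'u', 'ß': 'ss',
                            'Ä': 'A', 'Ö': 'O', 'Ü': 'U'})

def matches_query(text, stem='Mindestlohn'):
    """True if the folded text contains a word starting with `stem`,
    so 'Mindestlöhne' (plural) and 'Mindestlohngesetz' (compound) both
    match the query 'Mindestlohn*'."""
    folded = text.translate(ASCII_FOLD)
    return re.search(r'\b' + re.escape(stem), folded, re.IGNORECASE) is not None
```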
Since our model provides topic probabilities for each document, we can
aggregate probabilities according to document metadata such as the publication year and normalize them to a range between 0 and 1 to interpret
them as proportions. Average topic proportions can be visualized, for
instance, as an area plot, which allows a visual evaluation of topic trends
over time (Fig. 7.2). To make the plot more readable, topics are sorted in
a specific manner. For each topic's curve of proportions over time, I computed a regression line and sorted the topics according to the slope of this line. This results in an ordering where the largest increases of
topic shares are at the top of the list, while the topic shares decreasing most over time are located at the bottom. Now we can easily identify specific
trends in the data. In the beginning of our investigated time period, the
discourse is largely dominated by the issue of MWs in the construction sector, which were introduced with the 'Entsendegesetz' in 1996 to prevent dumping wages but led to heated discussions on increases of undeclared work.
Fig. 7.2 Aggregated topic probabilities over time (topics: Grand coalition, Social democrats, MW in Hesse, Socialist party, Sector-specific MW, MW in postal sector, General terms, MW implementation, Social welfare, MW in Europe)
Dispute around the turn of the millennium focused on the question of whether statutory wages could even be enforced by executive powers. Then, steadily more industrial and service sectors became subject to
the debate. Throughout the 2000s, sector-specific MWs, most notably in the postal sector, were preferred over a general MW
for all sectors in the entire country. During that time, the topics on social
market economy entangled with demands for social justice and concerns
for the job market formed a steady and solid background for the debate.
In 2013, we could identify a new shift of the debate when a general mini-
mum wage became a central policy objective in the coalition agreement
of CDU/CSU and SPD after the federal election. From this year onwards,
topics on implementation of MW and possible consequences on the job
market increase.
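The sorting procedure described above, fitting a regression line to each topic's proportions over time and ordering topics by its slope, can be sketched as follows (topic names and series are invented for the example):

```python
def slope(series):
    """Least-squares regression slope of a sequence of yearly proportions."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def sort_topics_by_trend(topic_series):
    """Order topic names so the most increasing shares come first and
    the most decreasing ones last, as in the area plot described above."""
    return sorted(topic_series,
                  key=lambda name: slope(topic_series[name]),
                  reverse=True)
```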
From the perspective of quality assurance of the analysis process, these
results can be viewed as a successful evaluation, suggesting we were able
to obtain a valid model for our research purpose. We selected model
parameters according to optimized conventional numeric evaluation
measures and were able to label and interpret single topics, as well as their
quantitative evolvement over time. But of course, carrying out the entire
analysis on distributions and trends of semantic clusters covers only a
very distant perspective of the discourse. The method rather provides an
overview of the discourse, a suggestion for its separation, and a starting
point for further analyses to gain insight through contrasting along data
facets such as theme and time. To develop a profound understanding of
the data and what is going on in distinct topics at certain points of time,
we still need to read single articles. Fortunately, the topic model provides
us with information on which documents to select. For every topic at any dis-
tinct point in time, for example, the striking increase in the topic of sec-
tor-specific MW in 2004, we can sample from the most representative
documents for that subset of our data to prepare further manual analysis
steps. Thus, the close connection between the global and the local context
through the model allows for a straightforward realization of demands for
‘blended reading’, the close integration of distant and close reading steps
(Lemke and Stulpe 2016).
Trends, that is, changes of proportions across the entire time frame,
appear to be very similar between the FAZ and the FR. We can observe
early peaks of support for the idea of MWs in 1996 and around the years
1999/2000. In 1996, an MW was introduced in the construction sector.
Around the turn of the millennium, although characterized by a large
relative share of approval (Fig. 7.3), the debate remained at a rather low level in absolute terms (Fig. 7.1). In contrast, the intensity of the debate in
absolute counts and the share of approval statements for the policy mea-
sure started to increase simultaneously from 2004 onwards. Intensity
peaks in 2013 while retaining high approval shares. In this year, MWs
became part of the grand coalition agreement as a soon to be enacted law.
For expressions of opposition towards MW, we can observe interesting
trends as well. Not surprisingly, the overall share of negative sentiments
towards the policy measure is higher in the more conservative newspaper
FAZ. But, more striking is the major peak in the year 2004, just at the
beginning of the heated intensity of the debate.
Fig. 7.3 Proportions of approval and opposition statements per year in FAZ and FR
In 2005, there was an early election of the Bundestag after the government led by Chancellor Gerhard Schröder (SPD) dissolved the parliament. His plan to shore up support for the intended social reforms, the so-called Agenda 2010, by renewing his government mandate failed. Schröder lost the election and
a new government under Angela Merkel (CDU) was formed together
with the Social Democrats as junior partner. Against the background of the heated dispute about the Agenda 2010, the 2005 election campaign was
highly influenced by topics related to social justice. Switching to the
mode of campaign rhetoric in policy statements may be an explanation
for the sharp drop of oppositional statements to MW.
One year earlier, oppositional stances had peaked in the public discourse, presenting MW as a very unfavourable policy measure. Due to their bad reputation among conservatives as well as Social Democrats,
demands for MWs did not become a major topic in the campaign of
2005. The interesting finding now is that the relative distribution of
approval and opposition in the public discourse in that year already was
at similar levels compared with that in the years 2012/2013 when the
idea finally became a favoured issue of the big German parties. This may be interpreted to mean that statutory MWs could already have been a successful campaign driver in 2005, had the Social Democrats been willing to adopt them as part of their programme. In fact, left-leaning Social Democrats demanded them as a compensatory measure
against the social hardship of the planned reforms. Instead, the SPD
opted to stick to the main principle behind the Agenda 2010 of including low-skilled workers in the job market by subsidizing low wages rather than forcing companies to pay a general minimum. It took the Social Democrats until the elections of 2009 to take a stance for the idea, and another four years until the government-leading Christian Democrats became comfortable enough with it. Over the years 2014/2015, we could observe a
drop in both approval and opposition expressions, which may be inter-
preted as a cool-down of the debate.
In addition to tracing trends of expressions of approval or opposition quantitatively, we can also evaluate the arguments used more qualitatively. Since supervised learning provides us with lists of positively classified sentences for each category, we can quickly assess their contents and identify major types of arguments governing the discourse. For approval, for
instance, statements mainly refer to the need for some kind of social
justice. The claim, ‘people need to be able to afford living from their
earned income’, can often be found in variants in the data. In fact, low
wage policy in Germany led to situations where many employees were
dependent on wage subsidies financed by the welfare state. Companies
took advantage of this by creating business models relying on public subsidies of labour to increase competitiveness. In addition to the social justice
argument, there are more economic arguments presented, especially in
relation to the demand for sector-specific MW. They are welcomed not
only by workers, but also by entrepreneurs as a barrier against unfair con-
ditions of competition on opened European markets. Oppositional
stances to the introduction of MWs also point to the issue of competi-
tiveness. They are afraid that competitiveness of German industry and
services will be diminished and, hence, the economy will slow down.
Turning it into a social justice argument, major layoffs of the workforce are predicted. The claim can often be found that MWs are unjust for low-skilled workers because they prevent their entry into the job market. One prominent, supposedly very specific German argu-
ment in the debate is the reference to ‘Tarifautonomie’, the right of coali-
tions of employees and employers to negotiate their work relations
without interference by the state. Statutory MW, so opponents claim, are
a major threat to this constitutional right. For a long time, German workers' unions followed this argument, but gradually realized that the power of their organized coalition, in times of heated international competition, was no longer able to guarantee decent wages for their members.
This brief characterization of the main argument patterns identifiable in
public discourse could be the basis for a further refinement of the category
system used for supervised learning. The active learning workflow applied to this extended system would allow for the measurement of specific trends and framings of the debate—for instance, whether reference to the argument on 'Tarifautonomie' diminishes over time, or whether oppositional statements refer more to threats of increased unemployment in a framing of social justice than in a framing of general damage to the economy. Although we would have been able to identify these argumentative patterns with purely manual methods as well, we would not be able to easily and comprehensibly determine their relevance for the overall discourse. Certainly, we would not be able to determine trends in their relevance over time.
References
Abulof, Uriel. 2015. Normative concepts analysis: Unpacking the language of
legitimation. International Journal of Social Research Methodology 18 (1):
73–89.
Angermüller, Johannes. 2005. Qualitative methods of social research in France:
Reconstructing the actor, deconstructing the subject. Forum Qualitative
Sozialforschung/Forum: Qualitative Social Research 6 (3). Accessed July 1,
2018. http://nbn-resolving.de/urn:nbn:de:0114-fqs0503194.
———. 2014. Einleitung: Diskursforschung als Theorie und Analyse. Umrisse
eines interdisziplinären und internationalen Feldes. In Diskursforschung. Ein
interdisziplinäres Handbuch. Band 1: Theorien, Methodologien und
Kontroversen, ed. Johannes Angermuller, Martin Nonhoff, Eva Herschinger,
Mautner, Gerlinde. 2009. Checks and balances: How corpus linguistics can
contribute to CDA. In Methods of critical discourse analysis, ed. Ruth Wodak
and Michael Meyer, 122–143. London: SAGE.
Mimno, David, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and
Andrew McCallum. 2011. Optimizing semantic coherence in topic models.
In Proceedings of the conference on Empirical Methods in Natural Language
Processing (EMNLP’11), 262–272. Stroudsburg: ACL.
Moretti, Franco. 2007. Graphs, maps, trees: Abstract models for literary history.
London and New York: Verso.
Pêcheux, Michel, Tony Hak, and Niels Helsloot. 1995. Automatic discourse anal-
ysis. Amsterdam and Atlanta: Rodopi.
Scholz, Ronny, and Annika Mattissek. 2014. Zwischen Exzellenz und
Bildungsstreik. Lexikometrie als Methodik zur Ermittlung semantischer
Makrostrukturen des Hochschulreformdiskurses. In Diskursforschung. Ein
interdisziplinäres Handbuch. Band 2: Methoden und Analysepraxis. Perspektiven
auf Hochschulreformdiskurse, ed. Martin Nonhoff, Eva Herschinger, Johannes
Angermuller, Felicitas Macgilchrist, Martin Reisigl, Juliette Wedl, Daniel
Wrana, and Alexander Ziem, 86–112. Bielefeld: Transcript.
Stone, Phillip J., Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie.
1966. The general inquirer: A computer approach to content analysis. Cambridge,
MA: MIT Press.
Walesiak, Marek, and Andrzej Dudek. 2015. clusterSim: Searching for optimal clus-
tering procedure for a data set. http://CRAN.R-project.org/package=clusterSim.
Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov, and David Mimno.
2009. Evaluation methods for topic models. In Proceedings of the 26th Annual
International Conference on Machine Learning (ICML’09), 1105–1112.
New York: ACM.
Wedl, Juliette, Eva Herschinger, and Ludwig Gasteiger. 2014. Diskursforschung
oder Inhaltsanalyse? Ähnlichkeiten, Differenzen und In-/Kompatibilitäten.
In Diskursforschung. Ein interdisziplinäres Handbuch. Band 1: Theorien,
Methodologien und Kontroversen, ed. Johannes Angermuller, Martin Nonhoff,
Eva Herschinger, Felicitas Macgilchrist, Martin Reisigl, Juliette Wedl, Daniel
Wrana, and Alexander Ziem, 537–563. Bielefeld: Transcript.
Wiedemann, Gregor. 2016. Text mining for qualitative data analysis in the social
sciences: A study on democratic discourse in Germany. Kritische Studien zur
Demokratie. Wiesbaden: Springer VS.
Wiedemann, Gregor, and Matthias Lemke. 2016. Text Mining für die Analyse
qualitativer Daten: Auf dem Weg zu einer Best Practice? In Text Mining in
den Sozialwissenschaften: Grundlagen und Anwendungen zwischen qualitativer
und quantitativer Diskursanalyse, ed. Matthias Lemke and Gregor Wiedemann,
397–420. Wiesbaden: Springer VS.
Wodak, Ruth, and Michael Meyer. 2009. Critical discourse analysis: History,
agenda, theory and methodology. In Methods of critical discourse analysis, ed.
Ruth Wodak and Michael Meyer, 1–33. London: Sage.
Part IV
New Developments in Corpus-
Assisted Discourse Studies
8
The Value of Revisiting and Extending
Previous Studies: The Case of Islam
in the UK Press
Paul Baker and Tony McEnery
1 Introduction1
Discourse analyses often tend to be time-bound. A discourse is observed,
its nature characterised and an analysis concludes. This, of itself, is not
problematic—analyses have beginnings and ends. Researchers invest time and effort in a research question as their research programme demands and then move on to their next question. A slightly more problematic situation arises, however, when discourse is described and then
assumed to remain static. Such an analysis will background the fact that
discourse is dynamic. While we may concede that dynamism in discourse
may be topic sensitive and that such change may vary in terms of speed
and degree, it is nonetheless probably the rule rather than the exception
1
The work reported on in this chapter was supported by the ESRC Centre for Corpus Approaches
to Social Science, grant number ES/K002155/1.
2
An example of such an extension is the work of Blinder and Allen (2016), who looked at the
representation of refugees and asylum seekers in a 43-million-word corpus of UK press material
from 2010–2012 in a complementary study to the investigation of the same subject by Baker et al.
(2008) using a 140-million-word corpus of newspaper articles covering 1996–2005.
measures). Being able to use this tool to look at implicit and explicit
meaning as well as using it to contrast any differences between two cor-
pora (e.g. one of broadsheet news stories about a group and another of
tabloid news stories about the same group) has obvious applications
within discourse analysis, especially in the area of the construction of
identities and in and out groups. Given that there is growing evidence
that collocates have some root in psychological reality (Durrant and
Doherty 2010; Millar 2011), the value to the discourse analyst in using
collocation to explore discourse is further strengthened. The final tool is
the key tool which mediates between the relatively abstract large-scale
analyses provided by keyword and collocation analysis; concordancing
allows us to map back from the abstract quantitative analyses to the tex-
tual reality on which they are based. Concordancing allows us to navigate
back to the examples in context that produce a keyword or a collocate,
allowing us to rapidly scan those contexts to understand the finding in a
more nuanced way. Alternatively, we may start with concordancing and
work up to the more abstract level, exploring whether something we see
in one text is unique to it, relatively rare, average or in some ways unusu-
ally frequent, for example.
CADS uses the tools of corpus linguistics in order to subject corpora,
both large and small, to discourse analysis. The subsequent analyses have
the benefit of scale, can avoid the inadvertent cherry-picking bias that the
exploration of isolated, and potentially atypical, texts may promote, and
have the advantage that some elements of the analysis are relatively
objective and reproducible. This chapter is an exploration of one discourse in
the UK press which will use CADS both in order to illuminate how that
discourse has changed, if at all, over time and, at the same time, to dem-
onstrate briefly what the CADS approach can achieve. So in order to
explore how stable the discourse around Muslims and Islam was in the
UK press, we extended the original study, analysing a corpus of articles
about Islam and Muslims from 2010 to 2014, which for convenience we
will call Corpus B, making comparisons back to the findings from the
original 1998 to 2010 study which was based on a corpus we will call
Corpus A.
Fig. 8.1 Average number of articles about Islam per newspaper per month, 1998–2014 (y-axis: 0–350 articles; x-axis: months from 1998-01 to 2014-12)
that the phrase devout Muslim was negatively loaded; this is still true in
the 2010–2014 articles, where we found that references to devout Muslims
described them as cheating (e.g. failing drug tests, having affairs, etc.),
becoming radicalised, engaging in extremist activity or clashing with
‘Western’ values in some way.3 However, even in these cases, there are
some slight changes. For example, reporting of the Islamic State group’s
activities has served to intensify the association of the word Islamic with
extremism. Similarly, in the 1998–2009 corpus the word forms relating
to conflict (see Baker et al. 2013, 59) constituted 2.72% of the corpus.
For the 2010–2014 data these words constituted 2.75% of that corpus.
So while overall these findings have remained the same, minor details
relating to the findings have been subject to flux. However, these changes
are minimal by comparison to the major changes that have taken place
across the two time periods, hence the bulk of this paper will be devoted
to a discussion of these differences.
3. For example, ‘A JUDGE yesterday ruled that a devout Muslim woman must remove her full face veil if she gives evidence in court’ (The Sun, January 23, 2014).
These investigations are guided by words which are key when the two
corpora are contrasted.
In terms of location, there has been a shift away from stories about con-
flicts or attacks in Iraq, Palestine, and America, which are key when
Corpus A is compared to Corpus B. When Corpus B is compared to
Corpus A, we find instead that in Corpus B Syria, Libya, Iran and Egypt
are key.
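Keyness of this sort is typically computed by comparing a word's frequency in the two corpora with a statistic such as log-likelihood. A minimal sketch follows; the frequencies are invented toy values, not figures from the study.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Dunning-style log-likelihood keyness for a word occurring
    freq_a times in a corpus of size_a tokens versus freq_b times
    in a corpus of size_b tokens."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Toy frequencies for a hypothetical word in two equally sized corpora
print(round(log_likelihood(900, 1_000_000, 300, 1_000_000), 2))
```

Words are then ranked by this score; a high score with a higher relative frequency in Corpus A marks the word as key in A against B.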
In terms of conflict, as noted, the strong relationship with both
Muslims and Islam is relatively stable across the two corpora. However,
when the lexis used to realise the presentation of conflict in the two cor-
pora is examined, a clear difference emerges. The top keywords (in
descending order) in Corpus A, when compared to Corpus B, are war,
terrorist, terrorists, attacks, bomb, bombs, terrorism, suicide, invasion,
destruction, raids, and hijackers. Key in Corpus B, when compared to
Corpus A are islamist, rebels, crisis, revolution, protesters, protest, sanction,
rebel, activists, uprising, islamists, jihadists, jihadist and jihadi. How can we
interpret these findings? World events tend to be a major driving force in
the contexts that Muslims and Islam are written about—such events
align well with news values. We therefore hypothesise that references to
terrorism have fallen sharply in articles about Islam since 2009, largely
because large-scale orchestrated attacks like 9/11 and 7/7, in Anglophone
countries in particular, have been absent in this period. Many words
which directly refer to conflict have also seen sharp falls: war, bomb, raids,
destruction, attacks. Yet other words relating to conflict of a principally
civil kind have increased, such as crisis, revolution, protests, sanctions and
uprising. While stories about armed conflict have not gone away, reference
to political/civil conflict has risen dramatically. This makes us reflect
224 P. Baker and T. McEnery
again upon the apparently stable finding linking Muslims and Islam with
conflict. While the picture in terms of the frequency of conflict words
appears relatively stable, the relative proportions of the different types of
conflict words are not stable. Concerns over Iran’s nuclear intentions, and
reporting of events around the Arab Spring have replaced the focus on
the Iraq war and 9/11. While mentions of al-Qaeda and the Taliban have
been reduced, they have been replaced by other groups like Islamic State,
Boko Haram and the Muslim Brotherhood. There are also more refer-
ences in 2010–2014 to rebels, activists, Islamists, protestors and jihadists. So
rather than being framed around fear of terrorist attacks, the discourse
between 2010 and 2014 is more linked to revolution, political protest
and Islam as a political force. The concept of jihad and those engaged in
it (while less frequent than some of the other terms) has also risen over
time. These changes in turn impact on the frequency of the selection of
different items of conflict lexis.
5.2.1 Muslim
The change, tokened by the contrast between the two lists, is quite
marked. We see a strong rise in the phrase Muslim Brotherhood, indicating
the salience of stories coming out of the Arab Spring and the uprising in
Egypt. In 2010–2014 Brotherhood follows over 1 in every 10 mentions of
Muslim. The Muslim council appears to be of less interest to journalists in
the later period, as do Muslim leaders and the phrase Muslim cleric. So
apart from the Muslim Brotherhood, it appears that there is now less
focus on people or groups who are seen as leading various Muslim
communities.
The term Muslim convert has become more common, although
this term usually refers to stories about Muslim converts who are involved
in crime, usually terrorism or militancy, for example:
They are described as having travelled (often to places like Syria to join
ISIS):
They are expected to condemn jihad and terrorism (but a minority are
sometimes described as not doing so):
42–44) yet in Corpus B it is notable that young and British both often
appear together at the same time as modifiers of Muslims. As a result,
many of the collocates of young Muslims are the same as those of British
Muslims. Those for young Muslims actually show a stronger concern about
radicalisation. Young Muslims are described as impressionable, disaffected,
rootless, angry and susceptible. They are at risk of being lured, recruited,
indoctrinated or brainwashed to commit crimes or jihad.
The Prison Service has long been concerned at the spread of radical Islam
inside Britain’s jails. Experts say a tiny number of fanatics, most serving
long sentences, have huge influence over disaffected young Muslims. (The
Sun, May 2013)
Cameron said a clear distinction must be made between the religion of
Islam and the political ideology of Islamist extremism, but the ‘non-violent
extremists’ who disparage democracy, oppose universal human rights and
promote separatism were also ‘part of the problem’, because they lure
young Muslims onto the path of radicalisation, which can lead them to
espouse violence. (The Times, February 2011)
suggest conflict: insulting, insult (which have both decreased over time)
and anti (which has increased over time):
Saudi liberal activist Raif Badawi was sentenced to 1,000 lashes, 10 years in
prison and a heavy fine for insulting Islam. In fact, his crime was to estab-
lish an online discussion forum where people were free to speak about
religion and criticise religious scholars. (The Independent, May 2014)
The terms Muslim women and Muslim men are frequent in the corpus. We
found that in the previous study, Muslim women tended to be discussed
in terms of the veil, due to a debate which took place in 2006 after com-
ments made about veiling by the then Home Secretary Jack Straw.
Muslim men were most often discussed in terms of their potential for
radicalisation. How have Muslim men and women been written about
since 2010?
Table 8.2 shows collocates (words which frequently occur near or next
to the word or phrase we are interested in) of Muslim women in the two
time periods—we only considered content words (nouns, verbs or
For many, the hijab represents modesty and freedom of choice, but we
cannot ignore that it is also one of the most contentious and divisive issues
of modern times—within the Muslim community as well as outside it.
(Guardian, February 16, 2010)
The Walsall debacle comes six months after The Daily Express revealed how
Hull City Council was accused of running Muslim women-only swim-
ming sessions in secret—to the fury of regular baths users. (The Express,
July 6, 2010)
The collocate rape most often refers to atrocities that took place in
Bosnia in the 1990s:
TV chef Nigella Lawson has admitted she resembled ‘a hippo’ when she
wore a burkini on Bondi Beach. The 51-year-old caused a storm two years
ago by donning the all-in-one swimsuit designed for Muslim women dur-
ing a visit to Australia. (The Sun, February 25, 2013)
So since 2010 there has been a small but significant increase in positive
discourses around Muslim women, particularly in terms of questioning
their oppression or discussion of positive female role models. However,
the main picture is a continuation of older discourses which focus on
(force and demand/insist) are more frequent together than the more posi-
tive ones (right or choice). Uncritical descriptions of the veil as a right
were relatively infrequent.
We note the higher frequency of the veil being described as a choice by
The Guardian, although this newspaper also has fairly high representa-
tions of it being linked to compulsion as well. Table 8.4 compares pro-
portions of these constructions of wearing the veil to the earlier set of
data.
Over time, the veil is more likely to be described in negative terms,
either as Muslim women being forced into wearing it, or in terms of them
demanding or insisting on wearing it. Discussion of the veil as a right
appears to have sharply declined, although it is slightly more likely to be
described as a choice.
We also looked at arguments given for why Muslim women should not
veil. This was found by carrying out a search on terms describing the veil,
appearing in the same vicinity as the word because. Of the 135 cases of
these, 32 gave arguments as to why a Muslim woman should not wear the
veil. These are shown in Table 8.5.
The argument about the veil (particularly face-covering veils) making
communication with the veil-wearer difficult was the most frequently
cited. In particular, a court case where a veiled female juror was asked to
step down was mentioned, as well as there being references to school-
teachers who veil their faces.
I’m with Ken Clarke when he says that women should not be allowed to
wear the full-face veil in court because it is difficult to give evidence from
inside a kind of bag. (Daily Mail, November 5, 2013, Richard Littlejohn)
People are nervous about speaking to burka wearers. That’s because we
want direct communication, not just through eye contact but through
And when Jack Straw condemned the grooming by British Muslim men of
Pakistani origin of vulnerable white girls, he was instantly flamed as a
bigot. (The Times, January 22, 2011)
This was far from a one-off case. Police operations going back to 1996
have revealed a disturbingly similar pattern of collective abuse involving
small groups of Muslim men committing a particular type of sexual crime.
(Daily Mail, January 10, 2011)
The authorities have been just as reprehensible in their reluctance to
tackle the sickening exploitation of white girls by predatory gangs of
Muslim men. (The Express, May 17, 2012)
Force does not occur in stories about sexual abuse but relates to cases
where Muslim men apparently force women to wear the veil.
The earlier study found that about half the newspapers refer to Islam
generally rather than discussing different branches of Islam like Sunni,
Shia and Wahhabi. Is there any evidence that this behaviour has changed?
Figure 8.2 shows the proportion of times that each newspaper refers to
Sunni and Shia Islam in both Corpus A and Corpus B, giving the
proportions for each newspaper for the two time periods.
The first bar shows 1998–2009, while the second shows 2010–2014. It
can be seen that The Independent has greatly increased the proportion of
times it refers to branches of Islam as opposed to writing more generally
about Islam. Six other newspapers have also gone in this direction (although
not hugely). However, The Guardian, Telegraph, Express and Star have gone
the other way and refer to the branches less than they used to.
Generally, a distinction can be made between the broadsheets and the
tabloids here, with all the broadsheets referring more often to branches of
Islam rather than Islam in general, while the reverse is true of the
tabloids.
So again, we have some stability. British tabloids continue to paint a
simplistic picture of Islam, not usually referring to or distinguishing
between different branches like Sunni and Shia, although The Mirror is
the tabloid that makes the most effort to do this. On the other hand, all
the broadsheets are more likely to refer to branches of Islam as opposed
to Islam itself, with The Independent being most likely to do this. Yet
within this overall picture of stability, variation by newspaper can be
notable, especially with regard to The Independent and The People. The
change underlying this apparent stability becomes all the more obvious if
238 P. Baker and T. McEnery
Fig. 8.3 References to Sunni, Shia, Sufi, Salafi and Wahhabi over time, by year from 1998 to 2014. Dark grey denotes the proportion of mentions of references to branches of Islam (e.g. Sunni, Shia, Wahhabi); light grey bars denote references to Islam (y-axis: 0–100%)
we consider change over time not by period covered by the corpus, but
by year. Figure 8.3 shows how overall references to different branches of
Islam have changed since 1998.
Since the start of the new collection of data (2010), newspapers have
begun once again to increasingly make distinctions between different
branches of Islam, as opposed to simply referring to Islam itself. However,
such references often relate to fighting between Sunnis and Shias (often
in Iraq) and to the Sunni uprising in Syria.
This section examines how Muslims and Islam are associated with differ-
ent levels of belief. Phrases like Muslim extremist and Muslim fanatic were
found to be extremely common in our earlier study, and one way of gaug-
ing whether representations of Muslims have changed is to examine
whether such terms have increased or decreased. We would argue that the
presence of such terms, particularly in large numbers, is of concern as
a greater emphasis on the abstract idea of extremism. This may make the
articles superficially less personalised, although it does not remove the
general focus on extremism. As found with the 1998–2009 data set,
extremism is more likely to be associated with the word Islamic, than
Islam or Muslim(s). Proportionally, The Star uses extremist words next to
Islamic most often, in 22% of cases (almost 1 in 4). Compare this to The
Guardian which does this 6% of the time (about 1 in 17 cases). The
Express is the newspaper most likely to associate Islam with an extremist
word (1 in 10 cases), while The Mirror does this least (1 in 42 times). For
Muslim and its plural, it is The Express again which has the highest use of
extremist associations (1 in 13 cases), and The Guardian which has the
least (1 in 83 cases). However, overall in the British press, Muslim(s)
occurs next to an extreme word 1 in 31 times, for Islam this is 1 in 21 and
for Islamic the proportion is 1 in 8.
The picture for the words Muslim and Muslims combined shows that
fewer uses of the word Muslims are linked to extremism overall, with the
proportion in 1998–2009 being 1 in 19, while it is 1 in 31 for 2010–2014.
The People shows the largest fall in this practice, although we should bear
in mind that this is based on a much smaller amount of data than for the
other newspapers (e.g. The People mentions Muslims less than 500 times
overall in the period 2010–2014, compared to The Guardian which has
over 20,000 mentions in the same period). However, all newspapers show
falls in this practice overall.
For the word Islamic, there are also falls in its association with extrem-
ism, with the average number of mentions of an extremist word next to
Islamic being 1 in 6 in 1998–2009 and 1 in 8 in 2010–2014. The Star
and Sun are most likely to link the two words, while it is the least com-
mon in The Guardian and its sister newspaper The Observer. The picture
for the word Islam is somewhat different, however. Here the average
number of mentions of an extreme word near Islam has actually increased
slightly, from 1 in 25 to 1 in 21. The practice has become noticeably
more common in The Express, although most newspapers have followed
suit. Only The Mirror and The Telegraph show a move away from this
practice.
What of the moderate words? It is The Express, Mail and People which
are more likely to refer to Muslims as being moderate, with this practice
being least common in The Mirror. On average it is Muslims who are more
likely to be called moderate (1 in 161 cases), as opposed to the concept of
Islam (1 in 271 cases). However, these figures are much smaller than those
for the extremist words. For the 2858 mentions of extreme Muslim(s) in
the press, there are only 558 moderate Muslim(s), or rather 5 extremists for
every moderate. However, in the 1998–2009 articles, there were 9 men-
tions of extremist Muslims for every moderate, so we can see evidence that
moderate Muslims are starting to get better representation proportionally,
although they are still outnumbered. As Fig. 8.4 suggests, this is not
because moderate Muslims are being referred to more, it is more due to a
dip in mentions of extremist ones. For Muslim and its plural, it is The
People, Express and Mail which have shown greater increases in mentions
of moderate Muslims. However, on average, the number of mentions of
moderate Muslims has gone up but only slightly (now 1 in 161 cases).
For cases of Islamic occurring next to a moderate word, this was never
common, and has actually fallen slightly. Figures are based on low fre-
quencies, however, and as we have seen earlier, the word Islamic shows a
Fig. 8.4 Summary of all data, comparing proportions of change over time (bars contrast 1998–2009 with 2010–2014; y-axis: 0–20)
and Afghanistan, while still mentioned, are now seen as almost historical
factors attributable to the ‘Labour years’, rather than as being relevant to
the present situation.
Two of the less frequent explanations for radicalisation found in the
1998–2009 data, ‘grievance culture’ and ‘multiculturalism’, seem to have
largely disappeared from the discourse around radicalisation in
2010–2014 (Figs. 8.5 and 8.6).
In the pie charts that follow, the different causes of radicalisation pre-
sented by the press are shown. The first pie chart shows the relative fre-
quency of causes in the period 1998–2009, the second covers 2010–2014,
while the third shows 2014 on its own. Below is a brief key explaining
each cause listed in the tables.
[Pie charts] Causes of radicalisation as presented by the press. Legible segments: 2010–2014: Extremist Islam 57%, Government Policy 16%, Alienation of Muslims 10%, Wars 7%, Poverty 6%, Others 4%. 2014 only: Extremist Islam 66%, Alienation of Muslims 9%, Others 9%, Wars 8%, Government Policy 8%. A further fragment (Wars 4%, Government Policy 36%) appears to belong to the 1998–2009 chart.
These three pie charts alone are sufficient cause to cast doubt on the
use of any time-bound analysis to cast light on what happens either before
or after the period studied. The results from Corpus A are very different
in terms of the proportions with which causes of radicalisation are men-
tioned. Figure 8.7 shows that a single year of Corpus B is a much closer
match to the results of Corpus B as a whole than Corpus A is; that is,
there is some evidence for internal consistency within
Corpus B, yet evidence of real change between Corpus A and Corpus B.
6 Conclusion
There is little doubt that the availability of corpus data which has allowed
large-scale investigations of discourses in certain genres, especially the
press, has been one of the most notable methodological developments in
discourse analysis in the past couple of decades. Such analyses, however,
are of necessity time-bound—the analysts collect data between two dates.
No matter how exhaustive the collection of that data, the capacity of the
data and its associated analysis to cast light on the discourse that preceded
References
Baker, Paul. 2006. Using corpora in discourse analysis. London: Continuum.
Baker, Paul, Costas Gabrielatos, Majid KhosraviNik, Michał Krzyżanowski, Tony
McEnery, and Ruth Wodak. 2008. A useful methodological synergy?
Combining critical discourse analysis and corpus linguistics to examine dis-
courses of refugees and asylum seekers in the UK press. Discourse and Society
19 (3): 273–306.
Baker, Paul, Costas Gabrielatos, and Tony McEnery. 2013. Discourse analysis
and media attitudes: The representation of Islam in the British press. Cambridge:
Cambridge University Press.
Blinder, Scott, and Will Allen. 2016. Constructing immigrants: Portrayals of
migrant groups in British national newspapers, 2010–2012. International
Migration Review 50 (1): 3–40.
Durrant, Philip, and Alice Doherty. 2010. Are high-frequency collocations psy-
chologically real? Investigating the thesis of collocational priming. Corpus
Linguistics and Linguistic Theory 6 (2): 125–155.
Evans, Matthew, and Simone Schuller. 2015. Representing ‘terrorism’: The radi-
calization of the May 2013 Woolwich attack in British press reportage.
Journal of Language, Aggression and Conflict 3 (1): 128–150.
Gablasova, Dana, Vaclav Brezina, and Tony McEnery. 2017. Collocations in
corpus-based language learning research: Identifying, comparing and inter-
preting the evidence. Language Learning 67 (S1): 130–154.
Gabrielatos, Costas, Tony McEnery, Peter Diggle, Paul Baker, and ESRC
(Funder). 2012. The peaks and troughs of corpus-based contextual analysis.
International Journal of Corpus Linguistics 17 (2): 151–175.
Hardie, Andrew. 2012. CQPweb—Combining power, flexibility and usability
in a corpus analysis tool. International Journal of Corpus Linguistics 17 (3):
380–409.
Kambites, Carol J. 2014. ‘Sustainable development’: The ‘unsustainable’ devel-
opment of a concept in political discourse. Sustainable Development 22:
336–348.
L’Hôte, Emilie. 2010. New labour and globalization: Globalist discourse with a
twist? Discourse and Society 21 (4): 355–376.
McEnery, Tony, and Andrew Hardie. 2012. Corpus linguistics: Method, theory
and practice. Cambridge: Cambridge University Press.
Millar, Neil. 2011. The processing of malformed formulaic language. Applied
Linguistics 32 (2): 129–148.
Developing visual analytic tools means getting one’s hands dirty: Finding
a diagrammatic expression for the data, selecting a programming frame-
work, using or developing algorithms, and programming. Normally, in
The Linguistic Construction of World: An Example of Visual… 255
the humanities, programming skills are limited. But this holds not only in
the humanities: in more information-technology-oriented disciplines,
too, the people building such tools and the people using them are not the
same. This separation between visualization tool developers and the
so-called experts using them is at once comprehensible and fatal. The
disciplines dealing with visual analytics have developed theoretical
foundations and methodological frameworks for solving visualization
problems at an advanced level. As a consequence, humanists may use an
increasing number of complex tools, but they use them as if they were
black boxes. If visual
plex tools, but they are using them as if they were black boxes. If visual
analytics is not just a tool, but a framework to explore data and to find
emergent, meaningful phenomena, then building the visualization
framework is itself an integral part of the research process (Kath et al.
2015). From choosing the statistics and data aggregation modes, the
mappings of data, and graphical forms, to designing an interface, every
single step in building the visualization framework demands the
humanist’s full attention and reflection. How does that influence the
interpretation? And, more importantly, how does that influence the
research process itself? Software is a ‘semiotic artefact’: something ‘gets
“expressed” in software’ (Goffey 2014, 36), for example the cultural
surroundings in which it is developed, and its enunciations influence the
process of interpretation.
the collocations, the phenomenon they are interested in has not been
modelled sufficiently. If it is mandatory to go through the text snippets,
the researcher is interested in the single text snippets, not the big picture.
A visualization framework with easy access to the text details traps the
researcher in the cage of single-text hermeneutics: ‘what we really need is
a little pact with the devil: we know how to read texts, now let’s learn how
not to read them’ (Moretti 2000, 57).
Another issue is related to a more general topos in information science
described by Fuller as the ‘idealist tendencies in computing’ (Fuller 2003,
15) or by Goffey as ‘an extreme understanding of technology as a
utilitarian tool’ (Goffey 2014, 21). These topoi lead principles in
computing, such as efficiency and effectiveness, to dominate algorithmic
approaches to text understanding: visual analytics therefore aims at
building ‘effective analysis tools’ (Keim et al. 2010, 2), ‘to turn
information overload […] into a useful asset’ (Keim et al. 2006, 1).
While these goals may be justified for business applications, they can be
misinterpreted in the humanities as a faster way to read texts. Instead,
the capability of visual analytics in the humanities lies in getting a
completely different view of human interaction: seeing emergent
phenomena. A visual analytics framework useful for humanists will
provide neither a compact overview of the data nor merely more efficient
access; rather, it should make the data fruitful
for further analyses that were not possible before.
3 Geocollocations
The visualization experiments we present now stand against the
background sketched so far. The research questions lie in the domain of
corpus-linguistic discourse analysis.
We are interested in the way mass media and other mass communication
players shape our perception of the world. News articles often deal with
countries, cities, regions, continents and so on and attribute values and
258 N. Bubenhofer et al.
The data used for this case study consists of two data sets: (1) A corpus of
the magazine ‘Der Spiegel’ and the weekly journal ‘Die Zeit’ from
Germany from 1946 to 2010 (640,000 texts, 551 million tokens, Spiegel/
Zeit corpus) crawled from the complete digital archives available online1
and (2) records of the German parliament Bundestag of the legislative
period 2009 to 2013 (363,000 contributions, 22 million tokens) com-
piled by Blätte (2013). The data has been processed with the part-of-
speech tagger and lemmatizer ‘TreeTagger’ (Schmid 1994, 1999). In
addition, named entity recognition (NER) has been applied to the data
using the Stanford Named Entity Recognizer (Finkel et al. 2005) in a
version adapted to German (Faruqui and Padó 2010). The recognizer
tags not only toponyms but also names of persons, companies and orga-
nizations. In our case, only the toponyms were used.
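The filtering step, keeping only toponyms from the recognizer's output, might be sketched as follows, assuming the NER output is available as (token, tag) pairs in the common BIO scheme. The tag names and the example sentence are illustrative, not the tagset of the actual tools used.

```python
def extract_toponyms(tagged_tokens):
    """Collect multi-token location spans from BIO-tagged NER output.
    Tags like 'B-LOC'/'I-LOC' mark locations; all other entity types
    (persons, organizations, ...) and plain tokens are dropped."""
    toponyms, current = [], []
    for token, tag in tagged_tokens:
        if tag == "B-LOC":                 # a new location span begins
            if current:
                toponyms.append(" ".join(current))
            current = [token]
        elif tag == "I-LOC" and current:   # span continues
            current.append(token)
        else:                              # span (if any) ends here
            if current:
                toponyms.append(" ".join(current))
            current = []
    if current:
        toponyms.append(" ".join(current))
    return toponyms

# Toy tagged sentence: person entities are discarded, locations kept
tagged = [("Angela", "B-PER"), ("Merkel", "I-PER"), ("besuchte", "O"),
          ("New", "B-LOC"), ("York", "I-LOC"), ("und", "O"), ("Berlin", "B-LOC")]
print(extract_toponyms(tagged))
```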
In order to calculate the geocollocations, all toponyms above a mini-
mum frequency limit were selected and words (lexemes) co-occurring
significantly often with the toponym in the same sentence were calcu-
lated. The selection of an association measure is influenced not only by
statistical considerations, but primarily by the theoretical modelling of
the collocation concept. We used log-likelihood ratio significance
testing, which is widely used in discourse linguistics to study language
usage patterns (Evert 2009).
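The extraction step described here, counting lexemes that co-occur with a toponym within the same sentence, can be sketched as follows. The sentences are toy data; in the pipeline described above, such counts would then feed the log-likelihood test.

```python
from collections import Counter

def cooccurrences(sentences, toponyms):
    """Count lexemes co-occurring with each toponym in the same sentence.
    Tokenization is naive whitespace splitting; a real pipeline would use
    the lemmatized, POS-tagged tokens instead."""
    counts = {t: Counter() for t in toponyms}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for topo in toponyms:
            if topo.lower() in tokens:
                counts[topo].update(w for w in tokens if w != topo.lower())
    return counts

# Toy German sentences for illustration
sentences = ["Fluechtlinge erreichen Griechenland",
             "Griechenland verhandelt mit der EU",
             "Die EU tagt in Bruessel"]
counts = cooccurrences(sentences, ["Griechenland"])
print(counts["Griechenland"].most_common(3))
```

Each toponym ends up with a frequency profile of its sentence-level neighbours; the association measure then separates significant collocates from chance co-occurrence.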
The data set now contains toponyms and their collocating lexemes
with frequency information and the level of significance of the colloca-
tion. In order to place the toponyms on a map, they have to be geocoded.
Although there are several geocoding services available like Google Maps
API or Nominatim (OpenStreetMap), the task is challenging because of
reference ambiguities (‘Washington’ [DC or the state?], ‘Berlin’ [capital
of Germany or city in New Hampshire?]), historicity of the toponyms
(‘Yugoslavia’ or ‘Ex-DDR’ do not exist anymore) or the use of unofficial
names (‘the States’ for USA, German ‘Tschechei’ instead of ‘Tschechien’,
‘Western Sahara’, which is not officially recognized as a state). Luckily
1. See http://www.zeit.de/2017/index and http://www.spiegel.de/spiegel/print/index-2017.html (accessed 6 March 2017).
3.3 Visualization
totype. This may be seen as the most literal translation of the term
geo-co-location, because words that occur near each other in a text are
located together on a map and share the same coordinates.
This representation is easily understood as it stays very close to the
data. Nevertheless, it enables the user to interactively explore the infor-
mation presented on the map. To facilitate this kind of exploration, we
provide a number of visual hints and controls:
- views: map, dorling, separated
- label type: smart (show only if enough space), always, hidden
- part-of-speech: all, nouns, adjectives, verbs
- level of significance
- dataset
duces a considerable visual bias that puts more weight on larger ones.
Thirdly, we are generally so accustomed to the shapes of countries which
are considered important that their salience on a map is so obtrusive that
an unprejudiced reading easily becomes obfuscated.
To overcome these shortcomings, we built an alternative visualization
(Fig. 9.3) using a Dorling diagram (Dorling 1993). In a Dorling dia-
gram, countries (or other entities) are represented as circles whose radius
depends on the dimension of interest. We currently scale the circles
according to the number of collocations associated with the respective
country. This visualization allows to grasp with a glimpse which countries
are associated with a large number of collocates in the underlying corpus
and abstracts away from their geographical size. In order for the user not
to completely lose orientation when switching between the different
modes of presentation, the position of the country circles is only moved
so far as not to overlap one another.
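The overlap-removal idea behind such a layout can be sketched as an iterative relaxation: overlapping circles are pushed apart along the line connecting their centres until no pair overlaps. This is a simplification of a Dorling layout, which additionally pulls circles back toward their original geographic positions.

```python
import math

def dorling_layout(circles, iterations=200):
    """Iteratively push overlapping circles apart, each pair moving
    half the overlap along the line between their centres.
    circles: list of [x, y, r], modified in place and returned."""
    for _ in range(iterations):
        for i in range(len(circles)):
            for j in range(i + 1, len(circles)):
                xi, yi, ri = circles[i]
                xj, yj, rj = circles[j]
                dx, dy = xj - xi, yj - yi
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                overlap = ri + rj - dist
                if overlap > 0:
                    ux, uy = dx / dist, dy / dist
                    circles[i][0] -= ux * overlap / 2
                    circles[i][1] -= uy * overlap / 2
                    circles[j][0] += ux * overlap / 2
                    circles[j][1] += uy * overlap / 2
    return circles

# Two 'country' circles that initially overlap heavily
placed = dorling_layout([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
print(placed)
```

After relaxation, the distance between the two centres is at least the sum of the radii, so the circles touch rather than overlap while staying close to their starting positions.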
Another important piece of information is the distribution of the
collocates. Some collocates are tightly attached to specific places; others
appear worldwide at many different places and are less place-specific.
To explore such distributions, we integrate yet another kind of diagram into the maps, within the Dorling diagram: a Sankey diagram (first introduced by Sankey 1896) depicts the distribution of some quantity in the form of a flow, where line thickness indicates the share of the quantity (see the white lines in Fig. 9.3). From within the Dorling diagram, a country's collocates can be selected for display in a Sankey diagram, where the quantity flow corresponds to the co-occurrences of an individual collocate with all the toponyms in a corpus, or more precisely with the respective countries. The flow thus fans out from a collocate to all the countries with which it co-occurs: for specific collocates we see lines with only a few branches, whereas for non-specific ones the stem dissolves into many thin lines.
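The fan-out described here amounts to normalizing a collocate's per-country co-occurrence counts into flow shares. A sketch with invented counts:

```python
def fan_out(counts):
    """Turn a collocate's per-country co-occurrence counts into shares;
    each share would set the thickness of one Sankey flow line."""
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}

# Invented counts for one collocate across three countries:
shares = fan_out({'Tunisia': 30, 'Egypt': 20, 'Syria': 50})
```

A place-specific collocate concentrates its shares on few countries (few, thick lines); a generic one spreads them thinly across many.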
far in working with the framework to provide some ideas of what kind of
research questions can be dealt with. Often the framework serves as a
means of developing hypotheses which must be evaluated in a second
step involving other methods.
The map view of the data reveals the foci and the gaps of discourse-driven world views. A good indicator in this direction is the number of collocates attached to regions and locations, in conjunction with whether a collocate is specific to a location or widely spread. The Zeit and Spiegel news corpus for the period 2001–2010 shows the following results: collocates like Stadt (city), Land (country), Jahr (year) and the like are very generic. On the other hand, collocates like Menschenrecht (human rights) or Flüchtling (refugee) are not generic, because they are attached to specific locations that play similar roles in different discourses. Collocates like chinesisch (Chinese) or Obama are very location-specific.
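The generic/specific distinction drawn here can be made measurable, for instance, by the normalized entropy of a collocate's distribution over locations. This measure is our illustration, not one used in the chapter:

```python
import math

def location_spread(counts):
    """Normalized entropy of a collocate's counts over locations:
    0.0 = tied to a single place (location-specific),
    1.0 = spread evenly over all places (generic)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))
```

A collocate like chinesisch, concentrated on one country, scores near 0; one like Jahr, spread over many countries, scores near 1.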
The combination of the Dorling and Sankey diagrams is useful for seeing the distribution of collocates. Figure 9.3 shows the correlations of some collocates with states, in particular the selected collocate Frühling (spring). Frühling, of course, as part of the expression Arabischer Frühling (Arab Spring), is attached to countries such as Algeria, Tunisia, Libya, Syria and Egypt. Other collocates like Euro or Krieg (war) are used more generically and have connections to a lot of countries. By clicking on a country, its ten most frequent or most specific collocates can be selected. Choosing the top ten collocates of the United States shows that news coverage about North America is dominated by some collocates used mainly in the context of this country, although they are potentially generic. Examples are Kultur (culture), Museum (museum), Universität (university) or Unterstützung (support). Press coverage about China is dominated by generic collocates used to introduce locations probably unknown to the reader. Examples are chinesisch (Chinese), Provinz (province) or Hafenstadt (port city).
To compare some specific countries, the Dorling view can be reduced to just these countries (see Fig. 9.4). The collocate Beziehung (relationship) has connections to Russia, China and the US (and other countries), but not to Germany and France. It is an indicator for relations between Germany and other countries which are sometimes strong and enduring,
The Linguistic Construction of World: An Example of Visual…
Fig. 9.4 Reduced Dorling view, comparison of selected countries: collocate Beziehung (relationship)
.*([Ff]l[uü]cht|[Mm]igrant|[Mm]igration).*
and Germany are the most frequent. Comparing this distribution to the period after WWII (1945 to 1960, see Fig. 9.6) reveals how the discourse has changed since then: press coverage on Africa is much less dominated by the migration topic, whereas the Americas (north, central and south) play a role in the discourse. A look at the collocates shows the differences: in the discourse after WWII, the Americas are countries of emigration, not of migration or refugees.
Figure 9.7 shows the subtle differences within the recent discourse on migration. There are several derivations of the stems 'Flucht', 'Flücht', 'Migration' and 'Migrant': Flüchtlingslager (refugee camp), Flüchtlingspolitik (refugee policy), Zuflucht (refuge), Bootsflüchtling (boat people), Flüchtlingswelle (wave of refugees), Flüchtlingszahl (number of refugees) and many more. For Europe and Germany, the variety of collocates attached to these places is much higher than for the refugees' places of origin. These collocates indicate a discourse about domestic policy dominated by metaphors and topoi provoking worries and fears about migration. Of course, the variety of collocates attached to a place also shows the importance of this place in the discourse. Countries such as Bulgaria, Tunisia, Albania or Ukraine show only Flüchtling as a collocate, meaning that these places are not in the focus of the migration discourse in Germany.
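Applied with Python's re module, such a stem pattern separates migration-related collocates from generic ones. The collocate list is illustrative, and the stray space in the chapter's character class '[Ff ]' is read here as an extraction artifact:

```python
import re

# Stem pattern for Flucht/Flücht/Migrant/Migration derivations, after the
# chapter's regex; with search() the wrapping '.*' parts are unnecessary.
MIGRATION = re.compile(r'([Ff]l[uü]cht|[Mm]igrant|[Mm]igration)')

collocates = ['Flüchtlingslager', 'Zuflucht', 'Bootsflüchtling',
              'Flüchtlingswelle', 'Kultur', 'Provinz', 'Hafenstadt']
hits = [w for w in collocates if MIGRATION.search(w)]
```

Here `hits` keeps the four migration derivations and drops the generic place collocates.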
These examples, far from being extensive studies, must suffice to give an impression of the possibilities. We wanted to show the manifold approaches possible with this framework: they range from very broad approaches interested in the big picture to narrow ones focussing on specificities of some regions. It is possible to explore the data and observe abstract concepts such as colonialism that are tightly attached to specific regions, or to study particular collocates and trace their derivations, as shown in the short study on migration.
Fig. 9.5 Map view, selection of collocates Migration, Flüchtlinge (migration, refugees)—Spiegel/Zeit corpus 2010–2016
Fig. 9.6 Map view, selection of collocates Migration, Flüchtlinge (migration, refugees)—Spiegel/Zeit corpus 1945–1960
Fig. 9.7 Close view on the collocates in the migration discourse—Spiegel/Zeit corpus 2010–2016
described (Bubenhofer 2009), and also has been widely discussed in corpus linguistic approaches to discourse analysis (Spitzmüller and Warnke 2011; Teubert 2005; Sinclair 2004; Mautner 2012; Bubenhofer et al. 2015; Felder et al. 2011; Lebart and Salem 1994; Glasze 2007; Scholz 2010 and many more), this seems to be the case: discourse analysis seeks to reveal systems of énoncés, statements, that break the borders of texts. Aggregated recontextualizations are one way of breaking texts down into smaller entities and rearranging them to detect similarities between passages across texts. The dimensional enrichment has the potential to unveil a systematic organization of énoncés. Coercing linguistic data into a new materiality offers the chance to see a system of énoncés as an emergent phenomenon that can potentially be interpreted as discourse. At least for corpus linguistic approaches to discourse, especially data-driven ones, taking the language surface as a starting point is the nucleus of corpus pragmatics (Bubenhofer and Scharloth 2014; Feilke 2000; Feilke and Linke 2009). These pointers must suffice to give an impression of the nourishing qualities of the transformations described above.
The visualizations you can create in R are much more sophisticated and
much more nuanced. And, philosophically, you can tell that the visualiza-
tion tools in R were created by people more interested in good thinking
about data than about beautiful presentation. (The result, ironically, is a
much more beautiful presentation, IMHO). (Milton 2010)
Fig. 9.8 Javascript library ‘D3.js’, ‘visual index’ of examples on the website
and analysis before finding visualization solutions (Keim et al. 2010,
119). The alternative 'coding culture' simplifies experimenting with data and visualizations, and promotes showing these experiments to a large audience uninformed about the data and the exact goals of the visualization. The visualization itself, as a result of transcriptive processes (Jäger 2007), generates Eigensinn (Jäger 2005, 140): the visualization is self-referential.
The visual forms we used for the geocollocations tool are of course inspired by traditional forms (map, Sankey, Dorling) that have been transformed into interactive browser-based versions. What if we had chosen another programming language, for example R, which is also used for visualization tasks but mainly for static visualizations, and which curates a different culture of visualization examples? Do we justify our selection of the programming language with cultural arguments? Probably not; even though such arguments influence every selection, they are not accepted as valid arguments in academia.
Most of the D3 examples are interactive in the sense that the user can influence the way things are displayed by moving, pointing or clicking the mouse, or by hovering over an element. The possibilities of Javascript (and the sister technologies HTML5 and Cascading Style Sheets [CSS]) and the structure of the D3 library suggest, or almost coerce, the programmer into enabling these modes of interactivity. Functions for animating elements and smoothing the transitions between states are also built in. In our example, the transition from the traditional map to the Dorling diagram is a smooth transition from countries to nodes, implying that countries and nodes are the same. Even if this makes sense, the technology makes it very likely that such an effect is activated while programming the tool without much thought about it.
We have to stop the discussion here (and refer to Bubenhofer 2016, 2018; Bubenhofer and Scharloth 2015), but we hope to have shown that the methodological reflections and the sensitivity discourse analysts normally have for the discursive and cultural settings surrounding them must be broadened when digitality and visualizations enter the arena. Kath et al. (2015) propose a 'new visual hermeneutics' as a starting point, but the influence of technological cultural settings in particular has not yet been discussed sufficiently.
5 Conclusions
In its current state, the geocollocation explorer is a framework which can be used to explore linguistic data reflecting discourses related to geography. But more than that, we used the development of the framework to reflect upon the effect of diagrammatic operations on linguistic data and the influence of 'coding cultures' on the technical implementation of visualizations. Developing and using the framework went hand in hand as an iterative process in which data exploration led to new ideas for visualization modes, such as abstracting the representation of countries from geography by turning it into a combination of a Dorling and a Sankey diagram. The process is ongoing and will continue to resist a teleological reading: its further development remains unpredictable.
We consider it critical for humanists to be interested in the technical
and algorithmic details of visual analytics and for developers to provide
for the involvement of humanists. Regarding well-established and theoretically sound methods of visual analytics and scientific visualization, we doubt some of the premises of these methods when they are applied in the humanities. The methods of visual analytics often follow idealist tendencies in computing, and these tendencies incorporate research goals that do not necessarily match those of the humanities. One example is the difference between data mining and discourse analysis, two approaches sharing similar methods and research tools, for example tools similar to the geocollocations framework. In data mining, the goal is to efficiently find the right document, or to categorize documents with the right labels following a 'gold standard' defined beforehand; in discourse analysis, the task is much more complicated: researchers using Foucauldian discourse analysis, or approaches in the paradigms of constructivism or deconstructivism and the like, would mistrust the very idea of being able to define a gold standard. Tools and methods that allow new perspectives on, or new readings of, the data are of much greater interest to them. We have shown the importance of diagrammatic operations as one way of doing that, for example in breaking up the unity of texts to find énoncés. The tool then is not just a tool, but an essential part of the methodological approach itself.
References
Blätte, Andreas. 2013. PolMine-Plenardebattenkorpus (PolMine—German
parliamentary debates corpus). Accessed June 29, 2018. http://polmine.sowi.
uni-due.de/daten.html.
Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3: Data-driven
documents. IEEE Transactions on Visualization & Computer Graphics (Proc.
InfoVis). Accessed June 29, 2018. http://vis.stanford.edu/papers/d3.
Brezina, Vaclav, Tony McEnery, and Stephen Wattam. 2015. Collocations in
context. A new perspective on collocation networks. International Journal of
Corpus Linguistics 20 (2): 139–173.
Bubenhofer, Noah. 2009. Sprachgebrauchsmuster. Korpuslinguistik als Methode
der Diskurs- und Kulturanalyse. Berlin and New York: De Gruyter.
———. 2015. Muster aus korpuslinguistischer Sicht. In Handbuch Satz –
Äußerung – Schema, ed. Christa Dürscheid and Jan Georg Schneider,
485–502. Berlin and New York: De Gruyter.
———. 2016. Drei Thesen Zu Visualisierungspraktiken in den Digital
Humanities. Rechtsgeschichte Legal History—Journal of the Max Planck
Institute for European Legal History 24: 351–355.
———. 2018. Visual linguistics: Plädoyer für ein neues Forschungsfeld. In
Visual linguistics, ed. Noah Bubenhofer and Marc Kupietz, 25–62. Heidelberg:
Heidelberg University Publishing.
Bubenhofer, Noah, and Joachim Scharloth. 2014. Korpuspragmatische
Methoden Für Kulturanalytische Fragestellungen. In Kommunikation Korpus
Kultur: Ansätze Und Konzepte Einer Kulturwissenschaftlichen Linguistik, ed.
Nora Benitt, Christopher Koch, Katharina Müller, Lisa Schüler, and Sven
Saage, 47–66. Trier: WVT.
The Linguistic Construction of World: An Example of Visual… 281
———. 2015. Maschinelle Textanalyse im Zeichen von Big Data und data-
driven Turn – Überblick und Desiderate. Zeitschrift Für Germanistische
Linguistik 43 (1): 1–26.
Bubenhofer, Noah, Joachim Scharloth, and David Eugster. 2015. Rhizome digi-
tal: Datengeleitete Methoden Für Alte Und Neue Fragestellungen in Der
Diskursanalyse. Zeitschrift für Diskursforschung, Sonderheft Diskurs,
Interpretation, Hermeneutik 1: 144–172.
Chen, Chun-houh, Wolfgang Härdle, and Antony Unwin, eds. 2008. Handbook of data visualization. Berlin: Springer.
Coleman, E. Gabriella. 2012. Coding freedom: The ethics and aesthetics of hack-
ing. Princeton, NJ and Oxford: Princeton University Press.
Dorling, Danny. 1993. Map design for census mapping. The Cartographic
Journal 30 (2): 167–183.
Evert, Stefan. 2009. Corpora and collocations. In Corpus linguistics. An interna-
tional handbook, ed. Anke Lüdeling and Merja Kytö, 1212–1248. Berlin and
New York: De Gruyter.
Faruqui, Manaal, and Sebastian Padó. 2010. Training and evaluating a German
named entity recognizer with semantic generalization. In Proceedings of
KONVENS 2010, 129–134.
Feilke, Helmuth. 2000. Die pragmatische Wende in der Textlinguistik. In Text-
und Gesprächslinguistik/Linguistics of text and conversation, ed. Klaus Brinker,
64–82. Berlin and New York: De Gruyter.
Feilke, Helmuth, and Angelika Linke, eds. 2009. Oberfläche Und Performanz.
Untersuchungen Zur Sprache Als Dynamische Gestalt. Berlin and New York:
De Gruyter.
Felder, Ekkehard, Marcus Müller, and Friedemann Vogel. 2011. Korpuspragmatik:
Thematische Korpora als Basis diskurslinguistischer Analysen. Berlin and
New York: De Gruyter.
Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005.
Incorporating non-local information into information extraction systems by
Gibbs Sampling. In Proceedings of ACL, 363–370.
Fleck, Ludwik. 1980. Entstehung und Entwicklung einer wissenschaftlichen
Tatsache: Einführung in die Lehre vom Denkstil und Denkkollektiv. Frankfurt/
Main: Suhrkamp.
Ford, Paul. 2015. What is code? If you don’t know, you need to read this.
Businessweek, June. Accessed June 29, 2018. http://www.bloomberg.com/
whatiscode/.
Foucault, Michel. 1966. Die Ordnung der Dinge: Eine Archäologie der
Humanwissenschaften. Frankfurt/Main: Suhrkamp.
Keim, Daniel A., Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann.
2010. Mastering the information age: Solving problems with visual analytics.
Goslar: Eurographics Association.
Keim, Daniel A., Florian Mansmann, Jörn Schneidewind, and Hartmut Ziegler.
2006. Challenges in visual data analysis. In Proceedings of Tenth International
Conference on Information Visualization (IV’06), 9–16.
Lebart, Ludovic, and André Salem. 1994. Statistique textuelle. Paris: Dunod.
Manovich, Lev. 2014. Software is the message. Journal of Visual Culture 13 (1):
79–81.
Mautner, Gerlinde. 2012. Corpora and critical discourse analysis. In
Contemporary corpus linguistics, ed. Paul Baker, 32–46. London and New York:
Continuum.
Milton, Michael. 2010. When to use Excel, when to use R. Webpage. Accessed
March 27, 2017. http://www.michaelmilton.net/2010/01/26/when-to-use-
excel-when-to-use-r/.
Moretti, Franco. 2000. Conjectures on world literature. New Left Review 1:
54–68.
Sankey, Henry R. 1896. The thermal efficiency of steam-engines. (Including
appendixes). Minutes of the Proceedings of the Institution of Civil Engineers
125: 182–212.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision
trees. In Proceedings of International Conference on New Methods in Language
Processing, Manchester, UK.
———. 1999. Improvements in part-of-speech tagging with an application to
German. In Natural language processing using very large corpora, ed. Susan
Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne
Tzoukermann, and David Yarowsky, 13–25. Dordrecht: Springer Netherlands.
Scholz, Ronny. 2010. Die diskursive Legitimation der Europäischen Union: Eine
lexikometrische Analyse zur Verwendung des sprachlichen Zeichens Europa/
Europe in deutschen, französischen und britischen Wahlprogrammen zu den
Europawahlen zwischen 1979 und 2004. Magdeburg, Univ., Fak. für Geistes-,
Sozial- und Erziehungswiss, Magdeburg. Accessed June 29, 2018. http://
edoc2.bibliothek.uni-halle.de/hs/urn/urn:nbn:de:101:1-201108243629.
Scholz, Ronny, and Annika Mattissek. 2014. Zwischen Exzellenz und
Bildungsstreik. Lexikometrie als Methodik zur Ermittlung semantischer
Makrostrukturen des Hochschulreformdiskurses. In Diskursforschung. Ein
interdisziplinäres Handbuch. Band 2: Methoden und Analysepraxis. Perspektiven
S. Stier
Institute of Political Science, NRW School of Governance, University of
Duisburg-Essen, Duisburg, Germany
Department of Computational Social Science, GESIS – Leibniz Institute for
the Social Sciences, Cologne, Germany
e-mail: sebastian.stier@gesis.org
Multi-method Discourse Analysis of Twitter Communication… 287
can only deliver insights into the actor-based structuration of the global
debates and, thus, the preconditions for the transnational circulation and
diffusion of discursive patterns. For the calculation of keywords, an adequate reference corpus is needed, as it more or less determines the value of the resulting keyword lists (cf. the discussion of the individual methods below). Of course, these kinds of decisions cannot be avoided in empirical research. Each decision must be made and justified in its own right. However, by mixing the methods, we get views on our data from different perspectives. We might be in a better position to judge whether an analysis based on a certain configuration is an artefact of research, needs to be explained by a hidden variable, or gives us reliable results. Moreover, it seems particularly important to us that we apply our methods not only to one corpus but to two corpora, which are then compared. If we keep constant the configuration of the tools we use and the principles of applying them, we can be more certain that the variation we find has mostly to do with our research object and not so much with our methods. Thus, a combination of methods and the comparison of different corpora are helpful in two ways: for a sound interpretation of our findings and as a constant check on our methodology.
Second, we take advantage of the fact that each method puts a specific
focus on our data. Geolocation highlights national differences. Network
analysis gives insights into the relations between users without taking
into account thematic aspects. Finally, keyword analysis brings us closer
to the subject matter of Twitter discourses. That is to say, our mix of
methods is designed as a funnel leading progressively from geographical
to social aspects and on to the conceptual level of discourse.
In the following, we use the term 'discourse' in the general sense as 'language in use' (e.g. Fasold 1990). This needs to be specified: discourse is formed by language patterns to be analysed as traces of social interaction (Müller 2012, 2015). Discourse patterns indicate the individual, social, and collective knowledge of the speakers (Felder and Müller 2009). Our underlying aim is thus to learn about the interrelationship of language, knowledge, and society in Twitter communication. Aside from a measure of relative influence in networks, which is of course heavily dependent on an actor's speaker position and institutionalised influence both online and offline, we do not delve deeper into power effects and
288 J. Stegmeier et al.
1 In general, the analysis of social behaviour on the internet suffers from uncertainty, which is inherent to the medium (Boyd and Crawford 2012; Ruths and Pfeffer 2014). The streaming of tweets via the API, for instance, is restricted to 1% of real-time Twitter traffic. However, this threshold was not exceeded at any time during our study. Moreover, besides relevant messages, communication in social networks produces a lot of 'noise', e.g. spam and automated messages sent from bots that distort political discourses. For this reason, the accounts @All4NeutralNet and @RealNeutralNet, set up by activists from Demand Progress, were excluded from data collection, since they sent the same citizen petitions to Republican politicians and President Obama in an infinite loop.
2.3 Data
Table 10.1 Geolocated tweets and retweets of the ten most frequent countries

                 Tweets     Retweets   Total
#ClimateChange    80,324     88,336    168,660
#NetNeutrality   117,786    125,438    243,224
Total            198,110    213,774    411,884
[Pie charts: country shares of geolocated tweets. #ClimateChange: USA 60%, Canada 13%, UK 11%, Australia 7%, France 3%, India 2%, Brazil 1%, Germany 1%, Ireland 1%, Italy 1%. #NetNeutrality: USA 84%, UK 4%, Germany 4%, India 2%, Australia 1%, Brazil 1%, Canada 1%, France 1%, Italy 1%, Mexico 1%.]
higher. This points to the fact that there are also users from countries in
Africa (e.g. Kenya, Nigeria, South Africa), Asia, and South America that
are directly affected by climate change.
This first finding already indicates that both policy fields diverge
regarding Twitter communication patterns. Dynamic developments of
the policy debate in the USA during the research period are clearly
reflected by the activities in the #NetNeutrality sample. On February 4,
2015, in an interview published by the magazine Wired and on Twitter,
FCC Chairman Tom Wheeler announced his decision to advocate for the
principle of net neutrality in his regulation proposal, which caused a
strong increase in activity on Twitter.2 Mass media also reported comprehensively. For the New York Times, the FCC decision concluded the 'longest, most sustained campaign of Internet activism in history'. Civil
rights organisations and telecommunication firms commented on the
decision from their respective positions. Finally, political actors such as
President Obama, Senator John McCain, and the Speaker of the House
2 Tom Wheeler (@TomWheelerFCC): “I have outlined the new #OpenInternet proposal in an op-ed just posted on @Wired here: http://wrd.cm/16nDJn5 #NetNeutrality”.
3 This is one reason for us to continue data collection until the end of the year so that we cover the
[Fig. 10.3: network graph of the #ClimateChange debate; node labels are Twitter user names, e.g. barackobama, algore, sensanders, epa, ipcc_ch, greenpeace, un, nytimes.]
4 Network Analysis
Figures 10.3 and 10.4 display the results of our network analyses for the
#ClimateChange and #NetNeutrality debates. For illustrative reasons, we
restricted the graphs to the 500 most important actors in each network.4
Colours are based on frequent connections identified by Gephi’s algo-
rithm to detect communities in networks. The #NetNeutrality network
4 We used the PageRank algorithm to position actors and to scale the size of actor labels. Results remain robust if we apply the Betweenness centrality algorithm as a comparison. The graphical design of networks is based on the Fruchterman-Reingold layout.
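The PageRank ranking used here can be illustrated with a few lines of power iteration over a retweet edge list. This is a generic sketch, not Gephi's implementation, and the user names are invented:

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a directed edge list
    (retweeting user -> retweeted user)."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                for t in out[n]:
                    new[t] += damping * rank[n] / len(out[n])
            else:  # dangling node: spread its rank evenly
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Invented mini retweet network: a and b retweet c; c retweets a.
ranks = pagerank([('a', 'c'), ('b', 'c'), ('c', 'a')])
```

In this toy network, c (retweeted by two users) ranks highest, a inherits rank from c, and b, retweeted by nobody, ranks lowest.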
[Fig. 10.4: network graph of the #NetNeutrality debate; node labels are Twitter user names, e.g. tomwheelerfcc, fightfortheftr, demandprogress, barackobama, sensanders, aclu, netflix, google.]
5 Keyword Analysis
Geolocation analysis and network analysis operate on large quantities of
data, which makes them well suited to a ‘bird’s eye analysis’ of the corpus
and to gaining insight into its internal structure.
Geolocation analysis and network analysis with user names as nodes give
valuable information on where the discourse actors are located and who
among them are the most visible and, therefore, the most important
ones. However, due to their rather coarse-grained approach, they are not
as suitable for providing hermeneutic insight into the data. The last part,
keyword analysis, aims at finding topic-specific words by comparing
word frequency lists.
Keyword analysis is a well-established approach in corpus linguistics in
which a keyword is considered to be a word ‘which can be shown to occur
in the text with a frequency greater than the expected frequency (using
some relevant measure), to an extent which is statistically significant’
(Wynne 2008, 730; cf. Demmen and Culpeper 2015 for a comprehen-
sive overview). The token frequency in the reference corpus is used to set
the expected frequencies, and the token frequencies in the main corpus
are the observed frequencies. We used Laurence Anthony’s concordancer
software AntConc (Anthony 2005) for keyword computation, as it is one
of the leading free software packages for this task. It uses the Log-
Likelihood measure to test the null hypothesis that there is no difference
between the observed and the expected frequencies.
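The underlying computation can be sketched as follows. This is a simplified two-corpus version of the log-likelihood (G2) statistic; the function and variable names are our own choosing and do not reflect AntConc's internals:

```python
import math

def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Two-corpus log-likelihood (G2) keyness score for one word.

    freq_*: observed frequency of the word in each corpus,
    size_*: total number of tokens in each corpus.
    """
    # Expected frequencies under the null hypothesis that the word
    # is equally likely to occur in both corpora.
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# A word occurring 150 times in a 1M-token research corpus but only
# 10 times in a 1M-token reference corpus receives a high score;
# scores above 3.84 are significant at p < 0.05 (1 degree of freedom).
score = log_likelihood(150, 1_000_000, 10, 1_000_000)
```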
For this analysis, ‘word’ means token: a syntactically used word form
such as is, are, and was, as opposed to the base form be (= lemma).
Although lemmas would serve the purpose of finding topic-specific
vocabulary well, we chose word forms over lemmas because of the formal
challenges of Twitter messages: automatically finding and annotating
the base forms of the word forms used in Twitter messages
(= lemmatising) did not prove sufficiently accurate at the time of
writing, even though this is a standard procedure in Natural Language
Processing for regular text (cf. Manning et al. 2014).
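The distinction can be illustrated with a toy lemma lookup. The table below is a deliberately tiny stand-in for a real lemmatiser (such as the Stanford CoreNLP pipeline cited above), which would derive base forms automatically:

```python
# Toy lemma table; a real lemmatiser derives these mappings automatically.
LEMMAS = {"is": "be", "are": "be", "was": "be", "tweets": "tweet"}

def lemmatise(tokens: list[str]) -> list[str]:
    """Map each token (word form) to its base form where known."""
    return [LEMMAS.get(token, token) for token in tokens]

tokens = ["there", "are", "tweets"]   # word forms (tokens)
lemmas = lemmatise(tokens)            # base forms (lemmas)
```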
As mentioned above, the features of the reference corpus affect the
outcome of the whole procedure: ‘Features which are similar in the
reference corpus and the [research] corpus itself will not surface in the
comparison, […] only features where there is significant departure from
the reference corpus norm will become prominent for inspection’ (Scott
2009, 80). In other words, comparing the frequencies of the tokens
occurring in the research corpus with those of the tokens occurring in
the reference corpus yields content- and topic-related, statistically
significant tokens as keywords, provided that the two corpora do not
belong to the same domain but share their predominant formal features.
The keywords are, in turn, a starting point for a more detailed
hermeneutic topic analysis through inspection of the contexts in which
the keywords occur.
In our case, the most prominent features that should not show up as
key are the linguistic patterns which are specific to Twitter communica-
tion as a social media platform. These include, among others, certain
acronyms like ‘lol’ (‘laughing out loud’) or ‘wtf’ (‘what the fuck’), which
appear quite regularly in social media discourse but not (as much) in
media articles. Therefore, it seems prudent to use a Twitter corpus as the
reference corpus, since items like these will not stand out in the
comparison. The features that should register in the comparison are the
mainly content-related ones, including Twitter-specific constructions
like @-mentions and hashtags. To make sure that these items would be
counted as regular words in AntConc, we followed Baker and McEnery
(2015) and changed the default word definition, which counts any string
of ASCII characters as a word, to also include the characters ‘@’, ‘–’,
‘_’, ‘:’, and ‘#’.
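Outside AntConc, the same effect can be approximated with a tokeniser whose character class includes these symbols. The regex and function name below are our own, and we use an ASCII hyphen for simplicity; this is a sketch, not AntConc's actual tokeniser:

```python
import re

# Treat @, -, _, : and # as word-internal characters, mirroring the
# modified token definition described above, so that @-mentions and
# hashtags survive as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9@\-_:#]+")

def tokenise(tweet: str) -> list[str]:
    """Split a tweet into lower-cased tokens, keeping mentions and hashtags whole."""
    return TOKEN_RE.findall(tweet.lower())

tokens = tokenise("#Republicans bill to gut the #fcc and kill #netneutrality")
# '#republicans', '#fcc', and '#netneutrality' each remain one token
```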
For our analysis, we used a corpus of tweets belonging to the domains
‘art’ and ‘communication about art’, which seem sufficiently removed
from the domains covered in the research corpus to make topic-specific
words statistically significant. It was built by streaming tweets containing
at least one of the following artists’ names in hashtag form: #botticelli,
#schiele, #gursky, #calder, #kandinsky. The corpus covers the period from
1 December 2015 to 31 January 2016 because there were exhibitions in
several countries featuring these artists.
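The selection criterion for the reference corpus can be sketched as a simple hashtag filter. The function and variable names are our own; a real pipeline would apply this predicate to a live tweet stream (e.g. one collected with the streamR package cited in the references):

```python
# The five artist hashtags used to build the reference corpus.
ARTIST_TAGS = {"#botticelli", "#schiele", "#gursky", "#calder", "#kandinsky"}

def belongs_to_reference_corpus(tweet_text: str) -> bool:
    """Keep a tweet if it contains at least one of the artist hashtags."""
    tokens = {token.lower().strip(".,!?") for token in tweet_text.split()}
    return not ARTIST_TAGS.isdisjoint(tokens)

belongs_to_reference_corpus("Loved the #Schiele show in Vienna!")  # True
```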
For a more detailed profiling of the two policy fields net neutrality and
climate change, we split the research corpus into two subcorpora, one
containing all the tweets dealing with net neutrality and the other one
containing the tweets dealing with climate change.
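The split can be sketched as a partition over topic markers. The marker sets below are illustrative, not the exact queries we used:

```python
# Hypothetical topic markers; the actual corpus was compiled with
# the collection queries described earlier in the chapter.
NET_NEUTRALITY_MARKERS = {"#netneutrality", "net neutrality"}
CLIMATE_MARKERS = {"#climatechange", "climate change"}

def split_corpus(tweets: list[str]) -> tuple[list[str], list[str]]:
    """Partition tweets into a net-neutrality and a climate-change subcorpus."""
    net, climate = [], []
    for tweet in tweets:
        text = tweet.lower()
        if any(marker in text for marker in NET_NEUTRALITY_MARKERS):
            net.append(tweet)
        if any(marker in text for marker in CLIMATE_MARKERS):
            climate.append(tweet)
    return net, climate
```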
302 J. Stegmeier et al.
Hashtags and @-mentions that are also keywords are treated as regular
words in the categorisation. It should be pointed out, however, that
hashtagged items are more than mere words in Twitter communication.
They can fulfil up to three functions. In their purest form, they categorise
a whole tweet by adding a meaningful tag to it without any syntactic con-
nection to the rest of the tweet. For example: ‘Help save the Internet!
#fcc’. However, any syntactically used word within a tweet can be turned
into a hashtag: ‘#Republicans bill to gut the #fcc and kill #netneutrality’.
And, as the last example shows, a word used as a hashtag can also refer
to a discourse actor, as with ‘#fcc’ (which stands for the Federal
Communications Commission) (Table 10.4).
Of all the categories, Discourse actors and Effect on the domain are the
most frequent; together, they account for almost half of the keywords
that were categorised.
Of the fourteen keywords, only two are in the form of @-mentions,
which means that the remaining twelve do not register as nodes on the
network graph. The discourse actors that proved to be statistically signifi-
cant in comparison with our reference corpus show that there must be
substantial differences of opinion (‘deniers’) regarding the topic of cli-
mate change. A clear focus on politicians and policymakers is also evident
if the discourse actors doubling as hashtags are also taken into account.
This confirms the findings of the network analysis and refines them by
bringing to attention actors that do not (or do not dominantly) show up
as nodes in the network graph, such as ‘deniers’ and ‘scientists’.
The affected domain presents itself as widespread, covering virtually
the whole globe, and the effect on it is painted as clearly negative. The
keywords dealing with the effect on the affected domain are the second
strongest category, which indicates the need of the users to talk about it.
The category Evidence and factuality is the third largest, which shows that
the evidence for the postulated effect is subject to debate. The measures
to reduce the effect that are part of the keyword list are rather vague and
aimed at rallying people or expressing one’s own willingness to do
something.
Interestingly, the category Scale of effect is completely devoid of hashtags
even though the words that prompted this category seem quite suited to
be used as hashtags. Their rank on the keyword list is so low, however,
that they were not part of the categorisation process. This suggests that
the users who chose not to use these words as hashtags did not find it
probable that they would fulfil the role of pooling tweets relevant to
their own topic (Table 10.5).
As in the subcorpus on climate change, most of the categorised
keywords belong to just two categories: Discourse actors and
Measures to cause the desired effect. While both are significant parts of
the debate (together, they comprise more than half of the categorised
keywords), the Measures seem to be the most important topic of this debate.
The discourse actors who made it on the keyword list are mainly poli-
cymakers (@ajitpaifcc, @tomwheelerfcc, @fcc/fcc, chairman, congress,
The geolocation and network analyses showed that the net neutrality
discourse proceeds mostly along national boundaries, which gives it a
certain homogeneity. Keywords like bill, decision, title (‘title II
regulations’), vote/voted/votes, and approves (‘FCC approves
#netneutrality rules’) show how rooted the discourse is in national
boundaries, especially, in this case, those of the USA, since most of them
describe parts of US law-making processes. At the same time, the net
neutrality discourse
is also heterogeneous in that the discourse actors belong to different social
spheres (politics, business, and media). Most of the keywords in the
category ‘discourse actors’ refer to politicians. Only two refer to
business, and the remaining two are quite vague (‘people’ and ‘we’) but
still part of political discourse patterns; this is especially true for ‘we’,
which functions as a group-defining entity (see also below). The media, surprisingly,
do not register within the first 100 keywords. This is especially surprising
in light of the fact that Twitter users regularly refer to news outlets by
@-mentions and hashtags. However, the names of media actors are so low
on the keyword list that they did not make it into the categorisation
process. While this could be interpreted as a very strong dominance of
non-media-related actors, it needs to be taken into account that the keywords
are computed by using the text of the tweets only. The Twitter account
where they are tweeted from is not part of the keyword computation.
This means that anything mentioned in the category ‘discourse actor’ is
either an @-mention or a word referring to a person or institution that
was used in the text of a tweet. The keyword list does not give any indica-
tion of how strong the impact of media actors is as communicators. It
does, however, show that they are not important as discourse topics. This
is also consistent with the fact that media actors like ‘foxnews’, ‘usatoday’,
‘theopenmedia’, and ‘freepress’ are important nodes in the network analy-
sis (see above).
The keyword list of the net neutrality tweets shows a dominance of
liberal points of view in the top 100 keywords, which coincides with a
critical perspective on centralisation tendencies on the internet. Only a
few keywords indicate the antagonism of political perspectives shown
above by the network analysis: the hashtags #NetNeutrality vs. #nonet-
neutrality, #tcot (‘top conservative on twitter’), and the words ‘Obama’
vs. ‘republicans’, and, finally, the keyword ‘we’, which is used extensively.
The categories 8. Measures to reduce the effect (in the climate change
corpus) and 9. Measures to cause the desired effect (in the net neutrality
corpus) both refer to ways of dealing with whatever change the
affected domain is subject to, which makes them eligible as categories
present in both discourses. However, as their labels already show, the
underlying keywords refer to quite different things, which makes them
eligible as categories marking the differences between the two discourses.
In fact, a closer look reveals that while category 9. Measures to cause the
desired effect is composed mostly of words referring to various steps in the
regulatory process, category 8. Measures to reduce the effect mostly consists
of more generic words that constitute a call to action.
The categories 4. Agents causing the effect (in the climate change corpus)
and 7. Evidence and factuality illustrate the most striking difference
between the two discourses. As far as the discourse on net neutrality is
concerned, the what and how is open for and in need of debate. In the
climate change discourse, however, the very existence of the issue is open
for debate. Again, both statements come as no surprise to those who
already have knowledge of the discourses. Still, they show how keyword
analysis can help make sense of an overwhelmingly large number of texts.
6 Conclusion
In this study, we have put our focus on the transnationalisation of politi-
cal communication via the social network Twitter. On the current level of
analysis (i.e. metadata analyses of geolocations and networks, preliminary
results of content analysis), we observed transnational Twitter
communication for both the issues examined. We used a sophisticated set
of methods, the benefits of which are discussed in the second part of this
conclusion. We end this contribution with an outlook on future research.
We found that the methods we used led to supplementary and even
complementary results. First, the geolocation analysis and the network
analysis showed how topics like #ClimateChange or #NetNeutrality are
discussed across national borders.
The internet has brought a plethora of new empirical sources for research,
such as social media, and an ever-growing number of applicable methods,
but this development also poses the risk of studies that are purely data-
or method-driven and lack a theoretical foundation. We tried to avoid
this by always evaluating the benefits of the applied methods for our
research question. Our
approach therefore did not only use the advantages offered by every sin-
gle method, but combined them in a way that significantly improved our
understanding of the matter at hand. Geotagging is a necessary precondi-
tion for the further analysis of potential transnationalisation in Twitter
communication. Therefore, most of the analytical steps directly built on
and profited from geolocation information, since this information
enabled us to build specific subcorpora for network and linguistic
analysis. Employing geolocation thus enabled us to engage with our
research question. Furthermore, the combination of different methods
enabled us to verify and enhance some of our findings. This holds
especially true for the combination of network analysis and keyword
analysis. Combining these two methods and going deeper into the results
of keyword analysis, we were able to identify the conceptual polarisation
of the climate change discourse on Twitter, which did not show up in
network analysis. Building upon both methods, we could further sub-
stantiate and differentiate our findings. Although some findings differed
slightly, as mentioned above, the results nevertheless enabled us to draw
a more nuanced picture. The combination of different methods may also
help compensate for the potential weaknesses of some methods and shed
light on what would otherwise be blind spots. While network analysis
focuses only on relations between actors, linguistic analysis supplements
this by actually looking for similarities and differences in content. This
also illustrates the benefits not only of combining methods but also of an
interdisciplinary approach.
References
Anthony, Laurence. 2005. AntConc: Design and development of a freeware
corpus analysis toolkit for the technical writing classroom. In Proceedings
of the IEEE International Professional Communication Conference, 729–737.
Baker, Paul, and Tony McEnery. 2015. Who benefits when discourse gets
democratised?: Analysing a Twitter corpus around the British Benefits Street
debate. In Corpora and discourse studies. Integrating discourse and corpora, ed.
Paul Baker and Tony McEnery, 244–265. London: Palgrave Macmillan.
Barberá, Pablo. 2014. Package ‘streamR’. Accessed February 21, 2015. http://
cran.r-project.org/web/packages/streamR/index.html.
Bohman, James. 2007. Democracy across borders: From Dēmos to Dēmoi. Studies
in contemporary German social thought. Cambridge, MA: MIT Press.
Boyd, Danah, and Kate Crawford. 2012. Critical questions for Big Data.
Information, Communication & Society 15 (5): 662–679.
Bruns, Axel, and Jean Burgess. 2011. #Ausvotes: How twitter covered the 2010
Australian federal election. Communication, Politics & Culture 44 (2): 37–56.
Castells, Manuel. 2002. Die Macht der Identität: Teil 2 der Trilogie: Das
Informationszeitalter. Das Informationszeitalter: Wirtschaft, Gesellschaft,
Kultur. Vol. 2. Opladen: Leske + Budrich.
Chadwick, Andrew. 2013. The hybrid media system: Politics and power. Oxford:
Oxford University Press.
Demmen, Jane E., and Jonathan V. Culpeper. 2015. Keywords. In The Cambridge
handbook of English corpus linguistics, ed. Douglas Biber and Randi Reppen,
90–105. Cambridge: Cambridge University Press.
Duggan, Maeve, Nicole B. Ellison, Cliff Lampe, Amanda Lenhart, and Mary
Madden. 2015. Social media update 2014. Accessed February 14, 2015.
http://www.pewinternet.org/2015/01/09/social-media-update-2014.
Fasold, Ralph W. 1990. The sociolinguistics of language. Oxford: Blackwell.
Felder, Ekkehard, and Marcus Müller, eds. 2009. Wissen durch Sprache. Theorie,
Praxis und Erkenntnisinteresse des Forschungsnetzwerks “Sprache und Wissen”.
Berlin and New York: De Gruyter.
Freelon, Deen, and David Karpf. 2015. Of big birds and bayonets: Hybrid
Twitter interactivity in the 2012 Presidential debates. Information,
Communication & Society 18 (4): 390–406.
Hanegraaff, Marcel. 2015. Transnational advocacy over time: Business and
NGO mobilization at UN climate summits. Global Environmental Politics 15
(1): 83–104.
Held, David. 1997. Democracy and globalization. Global Governance 3:
251–267.
Jeffares, Stephen. 2014. Interpreting hashtag politics: Policy ideas in an era of social
media. Basingstoke: Palgrave Macmillan.
Kielmansegg, Peter G. 2013. Die Grammatik der Freiheit: Acht Versuche über den
demokratischen Verfassungsstaat. Baden-Baden: Nomos.
Kneuer, Marianne. 2013. Bereicherung oder Stressfaktor?: Überlegungen zur
Wirkung des Internets auf die Demokratie. In Veröffentlichungen der
Deutschen Gesellschaft für Politikwissenschaft: Vol. 31. Das Internet: Bereicherung
oder Stressfaktor für die Demokratie? ed. Marianne Kneuer, 7–31. Baden-
Baden: Nomos.
Kwak, Haewoon, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is
Twitter, a social network or a news media? In Proceedings of the 19th interna-
tional conference on World Wide Web—WWW ’10, 591–600. Raleigh, NC:
ACM Press.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven
J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural lan-
guage processing toolkit. In Proceedings of 52nd Annual Meeting of the
Association for Computational Linguistics: System Demonstrations, 55–60.
Accessed March 15, 2017. http://www.aclweb.org/anthology/P/P14/P14-5010.
McEnery, Tony, Mark McGlashan, and Robbie Love. 2015. Press and social
media reaction to ideologically inspired murder: The case of Lee Rigby.
Discourse and Communication 9 (2): 1–23.
Müller, Marcus. 2012. Vom Wort zur Gesellschaft: Kontexte in Korpora: Ein
Beitrag zur Methodologie der Korpuspragmatik. In Korpuspragmatik.
Thematische Korpora als Basis diskurslinguistischer Analysen, ed. Ekkehard
Felder, Marcus Müller, and Friedemann Vogel, 33–82. Berlin and New York:
De Gruyter.
Index

Language use, 73, 194, 201, 239, 258, 290
  patterns of language use, 124, 252, 254, 259, 287
  visualising, 256
Legitimation, 4, 27, 112, 188, 310
Lexicometry, 13, 26, 184–192, 252
Linguistics
  corpus, 4, 7, 11–13, 15, 16, 70, 184, 191, 217–220, 251, 252, 254, 257, 275, 294, 300
  quantitative, 4, 13, 125, 294
  socio, 9, 10, 124
  text, 26

M
Meaning, 6–12, 30, 32, 33, 35, 51, 52, 63, 64, 73, 78, 123, 127, 128, 131, 184, 191, 207
  access, 27, 33
  construction, 6, 7, 9, 53, 73, 127, 131, 217
  explicit, 217, 218
  implicit, 217, 218
  latent, 191
  multiple, 63
  production, 10, 61, 63
  referential, 76
  social, 52, 55
Methods
  corpus, 9, 11, 12, 16, 126, 143, 217, 251
  mixed, 14, 24, 28–31, 42, 43, 147
  qualitative, 12, 30, 32, 35, 38, 41, 56, 58, 64, 124, 128, 145, 147, 148
  quantitative, 8, 11, 13, 14, 24, 27–32, 38, 41, 53, 56, 58, 59, 99, 123, 124, 143, 147, 148, 286

P
Participant, 12, 66, 69, 130, 135–136, 141, 142, 148
Politics, 135, 136, 145, 188, 195, 291, 292, 306, 307
  deep, 157
  international, 142, 168, 169, 174
  parapolitics, 157, 167–176
  of scientific research, 167
Position
  discursive, 12
  geographic, 258
  institutional, 80
  ontological, 29
  political, 6, 145
  social, 53
  speaker, 287
  symbolic, 65
Power, 7, 8, 14, 27, 52, 54, 55, 62, 66, 78, 79, 81, 90, 92–94, 96, 97, 100, 114, 124, 157, 229
  academic, 61
  attractive, 171, 174, 175
  effects, 61, 287
  political, 160, 168
  relations, 7, 58, 60, 62, 72, 124, 288
  structures, 52, 68, 93
Practice, 8–10, 12, 14, 15, 25, 28, 30, 31, 39, 52, 54, 62, 69, 90–99, 105, 106, 113–116, 123, 128, 131, 255–256
  discursive, 7, 9, 11, 14, 52–55, 58, 60, 64, 65, 81, 93, 96, 128
  institutionalised, 8
  language, 8, 12
  linguistic, 54, 62
  meaning-making, 51, 52, 55
  political, 157
  of programming, 276
  research, 25, 28, 29, 43, 68, 129, 253
  social, 24, 55, 60, 91, 106–113, 115, 123
  of text production, 59

R
Register, 130
Representation, 15, 16, 32, 38, 41, 55, 113, 190, 216n2, 228–236, 241
  geographical, 262, 279
  mental, 262
  visual, 133, 252
Representativeness, 7, 102, 129, 169, 184, 188, 189, 198, 200–202, 207, 289, 296

S
Science(s)
  political, 13, 145, 167, 168, 289
  social, 4, 7, 12, 13, 26, 32, 39, 63, 98, 99, 107, 149, 156, 184, 188, 253, 286, 289
Scientometrics, 13, 14, 89–116
Situation, 36, 124
Social
  change, 3, 5, 10, 98, 192
  dynamic, 9, 61, 69, 83
  logics, 62
  order, 53, 61, 64, 72, 91–98, 114
  reflexivity, 15
Social media
  Facebook, 6, 99
  Twitter, 6, 16, 286–312
Society, 55, 56, 58, 60, 63, 81, 91, 92, 95, 98, 100, 124, 131, 229, 288, 310
Sociology, 10, 13–15, 53, 58, 71–73, 77, 78, 81, 82, 162–167, 177
  of education, 177
  of religion, 158, 167
  of work, 165
Strategy, 81, 108, 110, 115
  selection, 189
Structure, 8, 14, 54, 55, 60, 62, 80, 91, 93, 95, 96, 113, 126, 129, 132, 148, 167, 184, 185, 189, 207
  actor-based, 287
  data-driven, 186, 201
  discourse, 12, 72, 185
  economic, 156
  geopolitical, 177
  institutional, 9, 58, 65, 67
  internal, 164, 300
  lexicosemantic, 156, 165, 167, 252
  macro-structure, 13, 15, 64, 72, 125, 127, 132, 142, 143, 148
  semantic, 132, 157, 183, 184, 208