Tacit Knowledge Assertions

An extract from “The role of informal networks in creating knowledge among health-care managers: a prospective case study” https://www.journalslibrary.nihr.ac.uk/hsdr/hsdr02120/#/full-report


Background: Leximancer

Leximancer is computer software that conducts quantitative content analysis using a machine learning technique. It learns what the main concepts are in a text and how they relate to each other. It conducts a thematic analysis and a relational (or semantic) analysis of the interview data. Leximancer provides word frequency counts and co-occurrence counts of concepts present in the transcripts of the narrative interviews. It is:

[A] Method for transforming lexical co-occurrence information from natural language into semantic patterns in an unsupervised manner. It employs two stages of co-occurrence information extraction—semantic and relational—using a different algorithm for each stage. The algorithms used are statistical, but they employ nonlinear dynamics and machine learning.

Smith and Humphreys, p. 2686

Once a concept has been identified by the machine learning process, Leximancer then creates a thesaurus of words that are associated with that concept, giving the ‘concept its semantic or definitional content’.87

Leximancer makes us aware of the larger context of all the narrative interviews of the cluster and of the prominence of certain concepts, which ensures that we do not become fixated on some concepts to the detriment of others. It uses a combination of techniques, such as Bayesian statistics, to record the occurrence of a word and connect it to the occurrence of a series of other words. It then quantifies those outputs by coding segments of text, which can range from one sentence to groups of sentences. As the data set presented here is relatively small, we look at the data sentence by sentence. Each word or concept is associated with a subset of related terms. In the next step the machine learns from the concepts already uncovered and links them to other concepts, creating a ‘concept space’. It then iteratively creates a thesaurus around a group of seed concepts. This information is visualised using network analysis.
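The sentence-by-sentence counting described above can be illustrated with a minimal sketch. It is not Leximancer's algorithm: the naive sentence splitter, the function name and the sample text are all assumptions made for illustration, and no Bayesian weighting or thesaurus learning is attempted.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(text, concepts):
    """Count concept occurrences and concept-pair co-occurrences per sentence."""
    concepts = {c.lower() for c in concepts}
    word_counts = Counter()
    pair_counts = Counter()
    # Naive split on ., ! or ? followed by whitespace (an assumption,
    # not the segmentation Leximancer uses).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        word_counts.update(t for t in tokens if t in concepts)
        # Each pair is counted once per sentence in which both concepts appear.
        present = sorted(concepts.intersection(tokens))
        pair_counts.update(combinations(present, 2))
    return word_counts, pair_counts

# Invented sample text, not taken from the interviews.
sample = (
    "People who smoke need support. "
    "The service offers support to people. "
    "Smoking prevalence is a public health issue."
)
words, pairs = cooccurrence_counts(sample, ["people", "support", "service", "health"])
print(words)
print(pairs)
```

Counting each pair once per sentence mirrors the choice of the sentence as the unit of analysis; a larger data set might use blocks of several sentences instead.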

Emergent themes are then visible to the user, and are expandable using the map visualisation that links directly to the areas of the data in which the concept occurs. The themes map enables a quick reading of the narrative interviews. It lets us see what the dominant themes are, rather than imposing our own interpretations on the data. The proximity of two concepts indicates how often they appear in similar conceptual contexts. So, when two concepts are placed at a distance from each other, it indicates that they are not used in the same context. The themes are the coloured circles around clusters of concepts. The lines or pathways trace the most likely path in conceptual space between concepts in order to aid reading the map. The connectivity score reflects the degree (equivalent to the degree score in network analysis) to which the theme is connected to the other concepts in the map.
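As a reminder of what a degree score is in network analysis, the following sketch counts the number of distinct neighbours of each node in a small undirected network. The links are hypothetical and are not read off any of the maps in this report; how Leximancer itself derives the connectivity score is not shown here.

```python
from collections import defaultdict

def degree_scores(links):
    """Return the number of distinct neighbours for every node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in links:
        if a != b:  # ignore self-links
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}

# Hypothetical links, for illustration only.
links = [
    ("smoking", "health"),
    ("smoking", "services"),
    ("smoking", "prevalence"),
    ("health", "services"),
]
scores = degree_scores(links)
print(scores)
```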

Re-presenting narrative interviews

We focus here on results from one of our sites, site 1, to illustrate our methods. A thematic analysis looking at the ranked ordering of the concept list was created and then a thesaurus for each concept was collected. The thesaurus list for each concept, presented in Table 26, shows the words most strongly related, either directly or indirectly, to the concept they define.87

At site 1, two of the top 20 most important concepts, smoke/smoking and tobacco, are the focus of the cluster selected from the sociograms for the ‘Goes To’ network in round 2. This is the specific problem identified by the cluster. Its members are more generally concerned with the health of people; however, they have focused on smoking as the main hindrance to achieving public health. As the cluster is involved in public health, this is not surprising.

The cluster’s values and preferences are related to the urgency with which a working solution is required. Looking at the concept of time in Table 26, with an absolute count of 67 and a relative count of 20%, we can see that it is present in the cluster’s cognition. The source of that urgency is thematically related to meetings, issue, year, local, working, person, services, public and support.

The value present in the cluster is that of public or more specifically public health. There is a level of uncertainty surrounding the problem of smoking as the suppose (absolute count of 99, relative count of 29%) and probably (absolute count of 96, relative count of 28%) concepts are prominent for this cluster, with obviously (absolute count of 91, relative count of 27%) less prominent.

TABLE 26 Site 1: top 20 ranked concepts

Top 20 word-like concepts | Absolute count | Relative count | Thesaurus
People | 337 | 100% | smoke, probably, working, trying, services, time, talk, group, prevalence, suppose
Smoking | 273 | 81% | prevalence, service, services, smoke, working, somebody, talk, team, meeting, support
Health | 243 | 72% | public, management, issue, support, look, team, services, smoke, service, coming
Work | 200 | 59% | tobacco, public, trying, health, smoke, time, things, issue, probably, doing
Service | 159 | 47% | spec, management, provider, year, smoking, look, working, public, health, doing
Cause | 156 | 46% | stuff, smoke, management, things, involved, meetings, level, saying, work, talk
Public | 126 | 37% | health, management, working, probably, team, smoke, year, issue, service, different
Things | 121 | 36% | tobacco, different, issue, prevalence, look, cause, support, suppose, meetings, trying
Doing | 121 | 36% | spec, somebody, year, stuff, look, service, probably, provider, work, used
Suppose | 99 | 29% | trying, services, prevalence, smoke, talk, issue, things, money, meetings, probably
Probably | 96 | 28% | thought, public, person, services, talk, people, coming, management, doing, trying
Group | 96 | 28% | tobacco, person, trying, suppose, people, team, public, local, terms, support
Different | 93 | 28% | working, things, support, role, tobacco, look, public, meetings, level, service
Team | 91 | 27% | support, public, health, smoking, level, involved, services, different, group, thought
Obviously | 91 | 27% | management, provider, spec, prevalence, smoking, services, public, local, involved, role
Terms | 85 | 25% | support, level, look, person, different, probably, tobacco, stuff, coming, work
Services | 81 | 24% | smoking, probably, suppose, management, health, person, time, prevalence, team, tobacco
Tobacco | 77 | 23% | things, group, coming, different, trying, work, smoke, issue, services, terms
Time | 67 | 20% | used, meetings, issue, year, local, working, person, services, public, support
Look | 66 | 20% | prevalence, spec, provider, role, things, health, service, different, doing, terms

Name-like | Absolute count | Relative count | Thesaurus
Tobacco Alliance | 45 | 13% | person, prevalence, group, probably, meeting, tobacco, suppose, look, time, smoke
Site 1 | 40 | 12% | prevalence, trying, provider, coming, team, year, group, public, tobacco, services
PCT | 40 | 12% | year, role, money, doing, different, cause, time, prevalence, provider, people
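The relative counts in Table 26 appear to be each concept's absolute count expressed as a percentage of the count for the most frequent concept (People, 337). This reading is an inference from the figures rather than something stated in the text; the sketch below, with a hypothetical function name, reproduces a few rows under that assumption.

```python
def relative_counts(absolute):
    """Express each count as a rounded percentage of the largest count."""
    top = max(absolute.values())
    return {concept: round(100 * n / top) for concept, n in absolute.items()}

# Absolute counts copied from Table 26.
table_26 = {
    "People": 337,
    "Smoking": 273,
    "Health": 243,
    "Time": 67,
    "Tobacco Alliance": 45,
}
relative = relative_counts(table_26)
for concept, pct in relative.items():
    print(f"{concept}: {pct}%")
```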

The range of activities that can be used to share and exchange knowledge related to the specific problem of smoking is within the cluster. These activities are situated within the context that the cluster is in. After a 20-year period of market-inspired organisational reform (managerialism, or the New Public Management), concepts such as service (absolute count of 159 and relative count of 47%) and public (absolute count of 126 and relative count of 37%) could be indicative of a social policy-orientated outlook rather than a managerial one. The thesaurus lists do contain the more market-inspired concept of management; as a concept in its own right, however, management appears with a low absolute count of 30 and a relative count of 0.9%.

It appears that the cluster is social policy orientated. However, this is not unambiguously so. The concept of service is related to spec, management, provider, year, smoking, look, working, public, health and doing. The concept of public is related to health, management, working, probably, team, smoke, year, issue, service and different. So, the concept of service is orientated towards management rather more than towards public and health, while public relates to health and management more than to service. What this highlights is the level of ambiguity around the concept of management for the cluster.

How the concepts are semantically contextualised can be seen in the Leximancer concept map below (Figure 30). The map is a re-presentation of the relational or semantic characteristics of the concepts presented in Table 26. To paraphrase Rooney,87 the direct co-occurrence between concepts is extracted from the data and these direct links are based on the strength of relations between the concepts. The more often two concepts appear together in the same sentence, the more likely they are to be linked together. Leximancer then compares each concept’s thesaurus and creates indirect links between them, meaning that even when concepts do not appear in the same sentence together there can still be an indirect connection between them. So, Leximancer rank orders concepts and presents them according to the strength of association and semantic similarity. In Rooney’s words:

Concepts that are directly related but not necessarily strongly semantically linked can be far apart on the concept map while concepts that are strongly semantically related will be close to each other on the concept map . . . concepts that occur in similar semantic contexts tend to form clusters (or gather together).

Rooney, p. 41087
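One simple way to picture an indirect link is to compare two thesaurus lists and see how much they share. The sketch below uses Jaccard overlap as a stand-in measure; this is our assumption for illustration and is not the comparison Leximancer actually performs. The two lists are the thesauri for service and public from Table 26.

```python
def jaccard(a, b):
    """Share of distinct terms that two lists have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Thesaurus lists copied from Table 26.
service = ["spec", "management", "provider", "year", "smoking",
           "look", "working", "public", "health", "doing"]
public = ["health", "management", "working", "probably", "team",
          "smoke", "year", "issue", "service", "different"]

shared = sorted(set(service) & set(public))
overlap = jaccard(service, public)
print(shared)   # terms the two thesauri have in common
print(overlap)  # 4 shared terms out of 16 distinct terms
```

On this measure the two concepts share health, management, working and year, which is consistent with the ambiguity around management noted above.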

FIGURE 30 Leximancer map of the 11 site 1 narrative interviews combined. General concepts are in black with themes in colour.


The coloured circles indicate the thematic space of a theme with the label of that theme at the centre. The words in black are the concepts and the lines between are links that tell us which concepts are semantically linked. When two or more circles overlap it indicates that the themes are semantically related to each other.

Figure 30 shows that for the cluster the most dominant theme is SMOKING, followed by PEOPLE, CAUSE, TIME, SAYING, OBVIOUSLY, INVOLVED, GROUP, TERMS, LEVEL, TALK, TOBACCO, ALLIANCE, MONEY, THOUGHT and USED. The proximity of the SMOKING, PEOPLE, CAUSE, TIME, OBVIOUSLY and INVOLVED themes indicates that they are related to each other in a chain-like manner. GROUP, TERMS, LEVEL, TALK, TOBACCO, ALLIANCE, MONEY, THOUGHT and USED are semantically isolated. The name-like concepts of Tobacco Alliance and PCT are not directly connected to the dominant theme of SMOKING, while site 1 is within the SMOKING theme. Therefore, site 1 resides in the same semantic space as SMOKING and is connected to it, while Tobacco Alliance and PCT are not.

Focusing on the theme of SMOKING (see Figure 31), it is associated and linked with smoking, health, services, prevalence, support, spec, site 1, different and look. However, smoking is not linked directly but indirectly to management, team or working. SMOKING is also semantically associated with management, as the two themes overlap slightly.

FIGURE 31 Close-up of the SMOKING theme from site 1.


We tell more than we realise we know

Taking Polanyi’s concept of ‘knowing more than we can tell’, we can reformulate it to read ‘we tell more than we realise we know’, to paraphrase Zappavigna (p. 298).88

This is the position that speakers express what they tacitly know through grammatical patterns without being aware they are doing so. By carefully analysing the grammatical patterns of ‘under-representations’88 in texts, we can bring to the fore the tacit knowledge assertions of our interviewees.

The next phase of the analysis of the interview texts makes explicit that which is tacit by looking at the function of the grammatical choices the interviewees are making.

Systemic Functional Linguistics (SFL) is an analytical method from Halliday89 which is concerned with grammar’s functionality or, rather, how it creates and expresses meaning. It regards grammar as a system of explaining things by referring to other things. Each system of interconnected words constructs the ‘meaning potential’ shaped by the semantic choices being made and the activity in the brain. SFL’s position is that when text is analysed it brings to the fore the meaningful choices made at the expense of the choices that were not made. This analysis goes beyond the usual procedures employed by others who typically look at scenarios90,91 and narratives92 and then deliver a running commentary on the text.

The functionality in language is central to language or rather the function of language is to convey experience and to generate interaction with others. With the construction of experience and interaction needing cohesion and continuity of text, a second function of language emerges – that experience and interaction require text. According to SFL, language has three ‘metafunctions’, ideational, interpersonal and textual, with the term ‘metafunction’ being used to ensure that function is regarded as an integral component of the interaction of the three terms.

The texts collected during the narrative interviews are ways of being that allow the relationship between the text and the persons involved to bring in the ‘below-view patterning in language’.88 SFL allows us to bring out the ‘tacit assumptions and ideological assumptions’ that characterise certain domains of discourse. This corresponds with Halliday’s interpersonal function of language. So, when analysing the texts, this accounts for the social practices that are being realised in them.

Language is an abstract social structure that defines what is and is not possible. Orders of discourse are linguistic practices that select which linguistic elements are included and excluded, and texts or social events are the products of the mediation by orders of discourse. By focusing on the use of nominalisations, modality, generalisation and agency in what people commit themselves to when they make statements, ask questions, or make demands or offers in texts, we are able to categorise the tacit knowledge assertions that are being made during the narrative interviews.88

The following descriptions are taken from Zappavigna.88 Her approach is also based on the ideas of Nonaka and Takeuchi93 – that middle managers are knowledge engineers. When they are involved in creating mid-level business and product concepts they mediate between ‘what is’ (epistemic modality; is, are, was, were . . .) and ‘what should be’ (deontic modality; should, would, will, ought to be, can . . .). They remake reality, or engineer new knowledge assertions, according to the ideas they have received from meetings and documents from more senior or external inputs. ‘They facilitate all four varieties of knowledge conversion and engineer knowledge spirals between organisational levels (cross-levelling). Their essential skills are in project coordination, formulating hypotheses, integrative methodologies, facilitating dialogue, use of metaphor, ability to engender trust, and ability to envision the future based on an understanding of the past’ (Nonaka and Takeuchi).93

According to Zappavigna,88 these attempts at project co-ordination, formulating hypotheses, integrative methodologies, facilitating dialogue, use of metaphor, ability to engender trust, and ability to envision the future based on an understanding of the past are evident in the choices they make when talking about what they do. Analysing the specific words they use can highlight for us when they are facilitating knowledge.

Zappavigna argues that ‘the central linguistic process of tacit knowing is ‘under-representation’. The under-representation of meaning is how tacit knowledge is indicated in language.’88


The use of nominalisation in speech indicates an ongoing project. By looking at when the interviewees use nominalisation, we are seeing where the interviewee is presenting an ambiguous or unambiguous relationship with the statement they are making.

When they refer to processes as things, such as ‘health improvement’, which is an ongoing project that they are co-ordinating, they in fact see it as a project and refer to it as an entity in its own right. The meaning of ‘a person’s need to do something’ (i.e. improve health) has become condensed with the use of ‘ment’ in improvement.

Nominalisations are demarcated by the use of suffixes (able, ad, age, agogy, al, ality, ative, ment, to name only a few) which are placed at the end of words.
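Suffix matching of this kind can be sketched in a few lines. The function name, the minimum stem length and the choice of only three of the suffixes are assumptions made for illustration; plain suffix matching over-matches (for example, short suffixes such as ‘al’ would catch ‘local’), so flagged words would still need to be checked by a reader.

```python
import re

# A small subset of the suffixes named in the text.
NOMINAL_SUFFIXES = ("ment", "ality", "age")

def flag_nominalisations(text, suffixes=NOMINAL_SUFFIXES, min_stem=3):
    """Return words ending in one of the suffixes, with at least min_stem letters before it."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        t for t in tokens
        if any(t.endswith(s) and len(t) - len(s) >= min_stem for s in suffixes)
    ]

flagged = flag_nominalisations(
    "Health improvement depends on the management of the service."
)
print(flagged)
```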

Processes become things that act on other processes as things, then this relation of ‘acting upon’ itself becomes a thing. The unfolding of activity sequences are finally re-expressed as parts of composition taxonomies, as criteria for classifying the abstract entities they modify. Instead of a sensually experienced world of unfolding processes involving actual people, things, places and qualities, reality comes to be experienced virtually as a generalised structure of abstractions.

Rose, pp. 263–494


The use of modality in speech indicates the formulating of hypotheses [is, are, were] – ability to envision the future [should, would, will].

Examples of modality are can, could, should, would, might, must and probably (this list is not exhaustive). They are an indicator of the level of certainty or uncertainty that the speaker has in regard to the assertion being made.

Modality contains meaning by embedding the agent motivating the opinion expressed. The use of modality in text under-represents agency or cause. For example, an IT professional might say ‘I should reassess this requirement’.

The use of the modal verb should is masking the ‘who’ or ‘what’ motivating the process of reassessing. It could be a command from a senior and not from the interviewee.
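Spotting the modal markers themselves is straightforward; what they mask still has to be interpreted by the analyst. A minimal sketch, using only the markers listed above and a hypothetical function name:

```python
import re
from collections import Counter

# Markers taken from the (non-exhaustive) list in the text.
MODAL_MARKERS = {"can", "could", "should", "would", "might", "must", "probably"}

def modal_markers(sentence):
    """Count the modal markers that occur in a sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return Counter(t for t in tokens if t in MODAL_MARKERS)

found = modal_markers("I should reassess this requirement")
print(found)
```

The count shows that should is present, but not who or what is motivating the reassessment; that remains the under-represented part.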


Rather than saying that something is a fact, speakers make generalisations in order to sound less direct and allow for uncertainty in the statement that they are making. Generalisations indicate to us the cognitive process and contents of the statement.

Generalisation contains meaning through underspecifying a concept and pattern. Examples of words that demarcate generalisations are some, a bit, a few, any, part of, complete, entire, none, no one, nothing and zero (again to name only a few). The generalisation usually follows these words.

General terms are not necessarily more abstract; a bird is no more abstract than a pigeon. But some words have referents that are purely abstract – words like cost and clue and habit and strange; they are construing some aspect of our experience, but there is no concrete thing or process with which they can be identified.

Halliday and Matthiessen, p. 61595

Generalisation underspecifies meaning and highlights assumptions; examples are system and programme.

Cognitive analysis

Within Leximancer, there is a pre-set ability to conduct sentiment analysis. Simply put, sentiment analysis measures the attitude of a speaker or writer towards a concept, whether they express something positively or negatively. In order to conduct cognitive analysis, we have combined sentiment, nominalisation, generalisation and modality. By doing so we focus on what the interviewee holds to be pre-supposed or tacit knowledge, thereby enabling us to answer two questions: what do they know, and what do they not know?

Cognitive analysis using sentiment analysis settings

What types of knowledge is the cluster concerned with? Taking each concept as highlighted by Leximancer and extracting the complete thesaurus of all words related to that concept by Leximancer, we then count the number of uses of nominalisation, modality, generalisation and agency in relation to each concept (see Table 27).
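The counting step can be sketched as follows. This is a simplified stand-in, not the procedure used to produce Table 27: it counts markers in sentences that mention a concept rather than working from the Leximancer thesaurus, it uses only single-word markers and two suffixes, and it leaves agency out altogether. All names and the sample sentences are hypothetical.

```python
import re

MODALITY = {"can", "could", "should", "would", "might", "must", "probably"}
GENERALISATION = {"some", "any", "complete", "entire", "none", "nothing", "zero"}
NOMINAL_SUFFIXES = ("ment", "ality")

def classify(token):
    """Assign a token to one marker type, or None if it is not a marker."""
    if token in MODALITY:
        return "modality"
    if token in GENERALISATION:
        return "generalisation"
    if len(token) > 6 and token.endswith(NOMINAL_SUFFIXES):
        return "nominalisation"
    return None

def counts_by_concept(sentences, concepts):
    """Count marker types in every sentence that mentions each concept."""
    table = {c: {"nominalisation": 0, "modality": 0, "generalisation": 0}
             for c in concepts}
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        labels = [label for label in map(classify, tokens) if label]
        for concept in concepts:
            if concept in tokens:
                for label in labels:
                    table[concept][label] += 1
    return table

# Invented sentences, not taken from the interviews.
sentences = [
    "Some people probably need more encouragement.",
    "The service should support health improvement.",
]
table = counts_by_concept(sentences, ["people", "service", "health"])
for concept, row in table.items():
    print(concept, row)
```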

It is clear from Table 27 and Figure 32 that the cluster is predominantly involved in the use of nominalisations; this indicates ongoing projects being perceived as entities in their own right rather than processes. What Table 27 and Figure 32 do not tell us, however, is whether the cluster perceives these projects as ongoing or finished, whether they are making claims with epistemic certainty or uncertainty and whether their assertions are based on assumptions or ‘fact’.

TABLE 27 Site 1: top 20 ranked concepts by types of knowledge

No. | Concept | Nominalisation: project coordination | Modality: formulating hypotheses | Generalisations | Agency
1 | People | 119 | 54 | 30 |
2 | Smoking | 119 | 54 | 30 |
3 | Health | 119 | 54 | 30 |
4 | Work | 119 | 54 | 30 |
5 | Service | 56 | 54 | 30 |
6 | Cause | 119 | 54 | 30 |
7 | Public | NA | NA | NA |
8 | Things | 111 | 54 | 30 |
9 | Doing | 101 | 54 | 30 |
10 | Suppose | 64 | 54 | 30 |
11 | Probably | 55 | 54 | 30 |
12 | Group | 31 | 31 | 30 |
13 | Different | 55 | 54 | 30 |
14 | Team | 48 | 48 | 30 |
15 | Obviously | 50 | 50 | 30 |
16 | Terms | 41 | 41 | 30 |
17 | Services | 56 | 54 | 30 |
18 | Tobacco | 58 | 53 | 30 |
19 | Time | 53 | 48 | 30 |
20 | Look | 48 | 54 | 30 |
 | TOTAL | 1422 | 973 | 570 |

No. | Name-like | Nominalisation: project coordination | Modality: formulating hypotheses | Generalisations | Agency: power
21 | Tobacco Alliance | 13 | 13 | 13 |
22 | Site 1 | 45 | 45 | 30 |
23 | PCT | 28 | 28 | 28 |
 | TOTAL | 86 | 86 | 71 |

FIGURE 32 Nominalisation, modality and generalisation frequency for each concept of site 1 interviews.


What follows is an automated report generated by limiting the number of concepts to 23 listed in Figure 30, above, plus 2 GPs, and Public Health, as they were highlighted by Leximancer as potential names. The categories of interest are the interviewee data files. So, what we get is an analysis of each interviewee’s use of the top concepts for the cluster.

As well as that, the technology within Leximancer that analyses positive and negative sentiment has been altered to include categorisation of terms that indicate nominalisation, generalisation and modality. The results are presented in a high-level visual chart displayed in a ‘magic quadrant’ format. One axis is relative frequency, which is a measure of the conditional probability of the concept given the categories of Sentiment, Nominalisation, Generalisation and Modality (cognitive analysis – positive or negative). We are looking at the occurrence of positive or negative words when ‘health’ is mentioned. The axis labelled ‘strength’ is a measure of the conditional probability of the category (cognitive analysis – positive or negative) given the particular concept (e.g. how often is ‘service’ mentioned with positive or negative cognition?).

There are four areas to the quadrant, and the different colours of concepts refer to different interviewees’ accounts. Concepts in quadrant one (bottom left) are weak and less prevalent within the interviewee’s data – this is where negative Sentiment, Nominalisation, Generalisation and Modality manifest. Concepts in quadrant four (top right) are strong, prominent and more likely to co-occur with the category. This is where positive Sentiment, Nominalisation, Generalisation and Modality sit.
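The two axes can be read as two conditional probabilities estimated from counts over coded text segments: relative frequency as P(concept | category) and strength as P(category | concept). The sketch below works under that reading; the data structure, function name and segments are hypothetical, and how Leximancer estimates these quantities internally is not shown here.

```python
def frequency_and_strength(segments, concept, category):
    """Estimate P(concept | category) and P(category | concept) from coded segments.

    segments is a list of (concepts_in_segment, categories_in_segment) pairs.
    """
    n_concept = sum(1 for c, _ in segments if concept in c)
    n_category = sum(1 for _, k in segments if category in k)
    n_both = sum(1 for c, k in segments if concept in c and category in k)
    frequency = n_both / n_category if n_category else 0.0  # P(concept | category)
    strength = n_both / n_concept if n_concept else 0.0     # P(category | concept)
    return frequency, strength

# Hypothetical coded segments, for illustration only.
segments = [
    ({"health", "service"}, {"positive"}),
    ({"health"}, {"negative"}),
    ({"service"}, {"positive"}),
    ({"health"}, {"positive"}),
    ({"tobacco"}, {"positive"}),
]
frequency, strength = frequency_and_strength(segments, "health", "positive")
print(frequency, strength)
```

Here two of the four positive segments mention health (frequency 0.5), and two of the three health segments are positive (strength of about 0.67), which shows why a concept can sit at different positions on the two axes.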

Figure 33 indicates a low frequency for the majority of concepts except for terms and obviously and these are both from one interviewee. A majority of the concepts are also viewed negatively on the negative cognition scale.

FIGURE 33 Cognitive analysis quadrant of top 20 concepts: frequency and strength results for site 1.


When the data from Figure 33 are compared with the cognition scale frequency and strength results of the cluster, this generates Figure 34 (presented below). The concepts cause, service, health, smoking, people and work are viewed moderately positively on the cognition scale. They have also scored highly for cognition scale for each concept of site interviews in Figure 33.

FIGURE 34 Cognitive analysis quadrant of top 20 concepts: frequency and strength compared with cluster results for site 1.


The most striking aspect of Figure 34, which shows all interviews combined as well as the individual interviews, is that the Tobacco Alliance, which has a high frequency score, also has a negative or weak cognition score, meaning that the concept Tobacco Alliance is used in a manner that indicates that the cluster does not know what the Tobacco Alliance is, or what it intends to do. Work, cause, smoking, people, health and service are all within the positive quadrant of the scale, indicating that these terms are used positively and that the cluster knows what these things are. For the cluster, the concepts public, things, doing, suppose, probably, group, different, team, obviously, terms, tobacco, time, look, Tobacco Alliance, site 1 and PCTs fall into the negative, high-frequency quadrant.

Site 1 documents results

Figure 35 shows that for the cluster the most dominant theme is TOBACCO, followed by SMOKING, LOCAL, SUPPORT, GROUPS, SMOKEFREE, SCHOOL, ENSURE, SMOKING, YEAR, TOBACCO, CIGARETTES and PROJECT. The TOBACCO, LOCAL and GROUPS themes overlap, which indicates that they are related to each other in a chain-like manner. YEAR, SCHOOL and PROJECT are semantically isolated. The concepts of council, control, products, public, communities and inequalities are directly connected to the dominant theme of TOBACCO (Figure 36). The theme of SCHOOL is semantically isolated from the dominant theme of TOBACCO.

FIGURE 35 Leximancer default positions map of the documentation for site 1. General concepts are in black with themes in colour.


FIGURE 36 Close-up of the TOBACCO theme from documents for site 1.


TABLE 28 Site 1 documents top 20 ranked concepts

Top 20 word-like concepts | Count | Relevance | Thesaurus
smoking | 797 | 100% | prevalence, quit, risk, children, smokers, likely, reduce, social, groups, service
tobacco | 774 | 97% | products, control, illicit, councils, use, communities, key, public, local, reduces
control | 650 | 82% | councils, illicit, products, tobacco, use, key, communities, reduce, national, programme
inequalities | 458 | 57% | health, councils, public, use, approach, communities, reduce, control, services, national
health | 446 | 56% | inequalities, councils, public, reduce, use, approach, communities, services, control, social
people | 405 | 51% | young, children, social, likely, smokers, quit, groups, products, smoke, use
young | 393 | 49% | people, children, social, likely, smokers, quit, groups, smoke, products, group
local | 327 | 41% | services, effective, national, areas, communities, approach, public, partnership, community, use
smokers | 267 | 34% | quit, likely, groups, communities, services, impact, cigarettes, year, range, prevalence
illicit | 253 | 32% | products, programme, control, tobacco, partnership, reduce, working, impact, communities, key
smoke | 247 | 31% | children, risk, likely, smoke-free, legislation, cigarettes, smokers, people, year, young
support | 229 | 29% | services, effective, local, staff, areas, quit, legislation, ensure, national, research
use | 177 | 22% | reduce, social, communities, impact, national, range, products, areas, interventions, councils
work | 158 | 20% | partnership, legislation, effective, national, working, public, local, including, programme, reduce
communities | 157 | 20% | key, approach, councils, partnership, public, effective, social, use, local, reduce
groups | 157 | 20% | social, likely, smokers, key, range, group, services, areas, research, communities
school | 151 | 19% | policy, staff, smokefree, ensure, legislation, including, support, community, children, smoking
public | 149 | 19% | approach, legislation, communities, inequalities, health, working, reduce, local, partnership, work
councils | 149 | 19% | inequalities, key, communities, health, approach, control, use, tobacco, services, range
prevalence | 136 | 17% | reduce, areas, national, smoking, groups, smokers, further, year, services, likely

Name-like | Count | Relevance | Thesaurus
England | 123 | 15% | year, public, prevalence, reduce, young, people, national, children, control, research
Smoking | 86 | 11% | risk, prevalence, groups, social, smoking, interventions, further, including, year, smokers
R&M | 71 | 9% | groups, smokers, communities, group, key, likely, impact, social, councils, quit
Tobacco | 56 | 7% | products, control, use, tobacco, smoke, cigarettes, public, social, people, young

Note on R&M (‘routine and manual’) smokers

The term ‘routine and manual’ (R&M) is widely used by NHS partners, but is less commonly used by councils where deprivation and geographical classifications take precedence over occupational classifications. R&M smokers are defined by their occupation according to the Standard Occupational Classification (SOC) codes where jobs are classified by their skill level and skill content. The SOC codes for R&M groups include occupations such as lower supervisory and technical or routine and semi-routine occupations. While R&M smokers are defined by their occupation, most non-employed people (the unemployed, the retired, those looking after a home, those on government employment or training schemes, the sick, and people with disabilities) are classified according to their last main job. This means that many individuals who fall into the R&M category are not employed in R&M occupations. This qualification is important, particularly in the context of the current economic climate, with increased unemployment levels and worklessness being a key priority for many councils.

Comparison of site 1 documents against cluster (Figure 37)

FIGURE 37 Cognitive analysis quadrant of top 20 concepts: frequency and strength results for site 1 with comparison with all interviewee data for the same site.


Cambridge Artificial Intelligence Summit 2018 – Feature

From left to right: Dr Raoul-Gabriel Urma (CEO, Cambridge Spark), Thomas Westmacott (FiveAI), Kevin Nelson (Cloud Developer Advocate, Google), Dr Steven McDermott (Qualitative Analysis and Social Media Lead, HMRC), Dr Sebastian Kaltwang (FiveAI, Machine Learning Engineer), Alison Lowndes (Artificial Intelligence DevRel EMEA, Nvidia), Professor Kenneth Benoit (Professor of Quantitative Social Research Methods, LSE), Dr Maksim Sipos (Co-founder and CTO, causaLens) and Dr Jeremy Bradley (Senior Data Scientist, Royal Mail)

We hosted our Cambridge Artificial Intelligence Summit, sponsored by Cambridge Judge Business School Executive Education, on 15–16 June, welcoming Analysts, Data Scientists and Researchers to network, develop new skills and gain insight into the evolving field of Data Science.


AI as Moderator in the Recognition of Citizen’s Voice with Social Media

Dr Steven McDermott
Qualitative Analysis and Social Media Lead, HMRC

Session: AI as Moderator/Mediator in the Recognition of Citizen’s Voice with Social Media at the Cambridge AI Summit in June 2018.

Cambridge AI Summit

2 Day Artificial Intelligence Event in Cambridge


Join our community of hundreds of researchers, analysts and data scientists for an opportunity to network, develop new skills and gain insight into the evolving field of data science.


Hear from industry and academic speakers representing a range of sectors, from research and bioinformatics to business and finance

Learn about the practical application and implementation of the latest tools and techniques through industry case studies.

Share knowledge, pick up new ideas and connect with developers, analysts, researchers and executives.

The Data Science Summits are all about putting research into action. You can see how the latest techniques are implemented, network with other leaders and specialists in the field who make research actionable, and get insight on how you can help transform your company, teams and the way you work.

Sarah Curshen, Director of Executive Education Custom Programmes, Cambridge Judge Business School


Prof. Kenneth Benoit
Professor of Quantitative Social Research Methods, London School of Economics

Session: Quantitative Text Mining, the Social Scientific Way: Mining Social Media on Brexit

Dr. Sebastian Kaltwang and Brook Roberts
Machine Learning Engineer, FiveAI

Session: Overcoming the Data Bottleneck for Self-driving Cars

Kevin Nelson
Cloud Developer Advocate, Google

Session: Google Cloud AutoML

Alison Lowndes
Artificial Intelligence DevRel EMEA, Nvidia

Session: Artificial intelligence and the evolution of the computing platform

Dr Haitham Bou-Ammar
Head of Reinforcement Learning and Tuneable AI, Prowler

Session: Data-Efficient Reinforcement Learning

Dr Maksim Sipos
CTO, causaLens

Session: Automated feature extraction and selection for challenging time-series prediction problems

Dr Jeremy Bradley
Lead Data Scientist, Royal Mail

Session: Data Science as a Transformative process

Dr Steven McDermott
Qualitative Analysis and Social Media Lead, HMRC

Session: AI as Moderator/Mediator in the Recognition of Citizen’s Voice with Social Media


Scholarship Alert – Predictive analytics for tax compliance

Type of project

Competition funded PhD projects


Contact Dr Georgios Aivaliotis to discuss this project further informally.

Project description

HMRC collects a wealth of data regarding the tax compliance of companies and individuals. Sometimes people and companies do not pay the correct amount of tax on time for a variety of reasons (e.g. lack of knowledge, lack of ability, evasion). The data collected are “big”, i.e. they cover a high number of variables and many clients, and are of both a temporal (time-stamped) and a static nature.

The aim of this project will be to develop the methodology needed to extract information from the data and to apply machine learning and pattern mining alongside classical statistical techniques in order to predict which cases are most likely to result in non-compliance, so that early action can be taken. Linking SME and HMRC data will be an additional possibility and challenge. As a follow-up, economic models will be developed that look into the cost of interventions and what actions are economically meaningful to ensure compliance.

The successful PhD candidate will work under the guidance of an academic as well as industrial (HMRC Digital Academy and Cambridge Spark) supervisor(s). HMRC and Cambridge Spark will provide expertise in the data, the possibility of working onsite and training. Cambridge Spark offers a variety of training, conferences and workshops in AI and data analytics methodology. HMRC Digital Academy runs a series of regular seminars and is investing in research in data analytics.

Entry requirements

Applicants should have, or expect to obtain, a minimum of a UK upper second class honours degree in Mathematics or a related discipline, or equivalent. Applicants whose first language is not English must also meet the University’s English language requirements.

How to apply

Formal applications for research degree study should be made online through the university’s website. Please state clearly in the research information section that the PhD you wish to be considered for is ‘Predictive analytics for tax compliance’ and name Dr Georgios Aivaliotis as your proposed supervisor.

If English is not your first language, you must provide evidence that you meet the University’s minimum English Language requirements.

We welcome scholarship applications from all suitably-qualified candidates, but UK black and minority ethnic (BME) researchers are currently under-represented in our Postgraduate Research community, and we would therefore particularly encourage applications from UK BME candidates. All scholarships will be awarded on the basis of merit.

AI as Moderator/Mediator in the Recognition of Citizen’s Voice with Social Media

by Dr Steven McDermott

Qualitative Analysis and Social Media Lead, 
Digital Data Academy, 
Her Majesty’s Revenue and Customs, UK

Government departments are now utilising customer feedback channels and social media in an attempt to respond to crowdsourced insights and, eventually, to inform policy. They are also using social media listening platforms to listen in to conversations taking place regarding their departments, and are taking tentative steps into machine learning and AI techniques. The debates surrounding these tools have tended to frame such activity as surveillance and as opening up the possibility of Armageddon with the rise of the machines. However, how can the voice of the citizen be recognised and responded to if these departments are discouraged from listening and using the latest tools? Does the utilisation of social media, machine learning and AI offer a potential means of escaping from the stranglehold of top-down, stage-managed politics? If millions of people could be the producers as well as receivers of political messages, could that invigorate democracy? And what role will machine learning and AI play in this emerging new media ecology? I intend to present a peek behind the curtain regarding the level of listening that is taking place and how machine learning and AI are being applied, asking whether this can be done ethically and in ways that enhance democratic processes and improve evidence-based policy decisions. In which ways will democratic institutions have to change in order to meet these challenges?

Why a ‘Listening Organisation’?

Macnamara (2016) has issued a list of criteria for organisations wishing to adhere to the maxim of being a listening organisation. It is acknowledged within Her Majesty’s Revenue and Customs (HMRC) that it is some way short of meeting those criteria, despite pockets of good practice. Part of the strategy within HMRC is that, by moving to digital and utilising advances in technology, and software in particular, it will become a listening organisation. HMRC is trying to address the identified ‘crisis of listening’ within the organisation in the hope of regaining trust and re-engaging people whose voices are unheard or ignored. In doing so, HMRC understands that urgent attention to organisational listening is essential for maintaining governance, democracy, organisational legitimacy, business sustainability, and social equity.

The department is attempting to use data, procure software tools, implement processes and change the culture in order to act on the insights generated by the data. So there is an acknowledgement that the solution to the ‘crisis of listening’ is more than a technological one. A core aspect of overcoming this crisis is the implementation of a ‘real-time’ listening capability. A key component of ‘real-time’ listening is the monitoring of, and response to, social media interactions between HMRC representatives and citizens/customers of HMRC services. This is coupled with the empowerment of staff at all levels to act on feedback, so that HMRC is placed to be a world-class customer listening organisation. HMRC is exploring the capability of ‘data scientists’ and Machine Learning (ML) to develop less labour-intensive practices of responding to customer feedback.

The discourse of a ‘crisis of listening’ posited by Macnamara seems to be in stark contrast to the discourse of an emerging ‘surveillance capitalism’ and the rise of the machines. HMRC has pointed out that, as well as technically focused solutions, there is also the need to shift away from top-down, stage-managed politics. On one side, governments are invading citizens’ privacy by eavesdropping on social media; on the other, governments are not listening enough to their citizens and are failing to recognise, or are ignoring, the voice of the citizen. This polemic seems rather naïve and sensationalist. HMRC’s response is to side with Macnamara and build processes and systems that will enable the utilisation of big data and the empowerment of staff to respond to customer feedback. A key driver is the desire to use machine learning and artificial intelligence to achieve this. However, listening in on social media platforms such as Twitter and Facebook in an attempt to read public sentiment is state surveillance if it is not coupled with serious attempts “to connect representation to institutional work of speaking for, to and with the represented” (Coleman, 2017: 106).

The problem with implementing a ‘big data’ solution is not one of cultural resistance to change but one of knowledge. The limitation of introducing big data analytics and data scientists with machine learning is that humans need to make judgements on what is generated. The judgements require human interpretation of the results and visualisations generated by the algorithms. According to Floridi (2012), the problem with big data is an epistemological one, not a cultural one.

The material presented here will assess who are the self-declared experts and organisations that are laying claim to such expertise – “and […] claiming the power to authorize what constituted acceptable knowledge in specific fields and what [does] not” (Robertson and Travaglia, 2016[1]). The contention aimed at big data practitioners and democratic institutions, and at the methodological approaches that they practise, is that they have the potential to undermine free will and autonomy. The goal of big data analytics is to change people’s behaviour at scale. A Chief Data Scientist of a Silicon Valley company that develops applications to improve students’ learning states that:[2]

“The goal of everything we do is to change people’s actual behaviour at scale. When people use our app, we can capture their behaviours, identify good and bad behaviours, and develop ways to reward the good and punish the bad (emphasis added). We can test how actionable our cues are for them and how profitable for us.” (Zuboff, 2016)

Intending to “punish the bad” is a disciplinary rather than a control mechanism. This attitude contradicts at least two principles of ethical research and potentially a third with wider social and cultural ramifications. Central principles of social science research are that the subjects are afforded autonomy, beneficence and justice (Childress, Meslin and Shapiro, 2005). Individuals are to be treated as autonomous agents – this normally results in informed consent being sought before publication (a limited interpretation of autonomy, and one that needs to be addressed again); the researcher is to minimise harm and relate it to the potential benefits of the study; and finally the benefits are to be distributed in a just manner and no undue denial of such benefits is to be imposed on any member. Such goals as outlined by the data scientist are indicative of at least an individual, and potentially a discipline, that is devoid of ethical training.

Social media platforms and the people who use them are not a representative sample of the population as a whole – they are self-selected at best and possibly the already vocal within online debates and in wider society as a whole. The platforms (Twitter, Facebook and Instagram) are data brokers in the first instance and harvest and sell user data on to third parties, usually for the better targeting of advertising. These social media data brokerage firms are the visible vanguard of surveillance capitalism. The option to choose not to take an active role on these platforms is possible but severely curtailed, as it comes with a sense of opting out of the contemporary world (Cegłowski, 2016).

From the early days of the internet it was viewed as a potential way of escaping the top-down heavily managed political performances. Suddenly anyone could be a producer as well as a receiver of political messages. For Coleman (2017) governments and global institutions have failed to democratise their ways of operating. The opportunity to reinvent and re-strengthen democracy for the 21st Century has been missed.

Coleman’s view is that what needs to change is not the technology or the culture but the political architecture upon which democracy rests. On the one hand we have the technologically focused drive with a top-down structure. What is needed is a reorienting of the structures versus the human agent based on a cultural turn required to meet the crisis of listening.

There are two approaches: one dominated by a large, macro-level structuralist understanding of human behaviour, and another that allows for the micro events and personal understandings of agents to also influence changes in the environment. These disciplines are not clearly delineated – there are those within the Data Science discipline who are prepared to acknowledge the utility of human interpretation of data over algorithmic accounts. A principal Data Scientist at @BoozAllen recently stated[3] that only using computer algorithms for visualisation…

“[…] can miss salient (explanatory) features of the data [therefore] a data analytics approach that combines the best of both worlds (machine algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data”.

Data Science generates crude quantitative knowledge, or “calculated publics” (Gillespie, 2014). It also creates crude calculated customer/citizen types that are reductive. What is needed is an acknowledgment of the limitations of the quantitative approach, opening the door to the possibility of a cultural acknowledgment of qualitative approaches.

Such calculated citizen types are devoid of individual citizen responses to political, cultural and social intervention, and are wilfully disrespectful to the autonomy of the people involved and the dynamics of state and citizen interaction. There are calls among data science practitioners for a shift to include more social scientific approaches. The typology is also without notions of geographic location, postcode, gender, age, class or social status. It is a classification of people and groups without any reference to work from the social sciences.

There is a core narrative running through the department’s dominant narrative of making tax digital. It is the idea that digital online self-service applications and websites will somehow do away with the expensive telephone capabilities within the department. It is founded on the same march-of-technology story that has accompanied most shifts in the uptake of the most recent piece of technology. Radio was going to replace newspapers; television was going to replace radio; and computers were going to replace paper. What happens is not one medium replacing another but that content or processes move and continue to be reliant on the others. A lot of transactions between HMRC and their ‘customers’ can be facilitated by the move to digital platforms, but the telephone – or voice-to-voice interaction between two people – will still be required. Whether that is human to human or human to chatbot remains to be seen. What we are witnessing is an evolution of the media ecology. Government departments are enthralled by the prospects of moving the cost of interactions from the department to the consumer – jumping on the ‘home manufacturing’ or ‘self-service’ bandwagon.

The promotion of self-quantifying applications for governance purposes is part of the ongoing increase in “home manufacturing” (Lambert, 2015: 251-252). To the growing list of unpaid labour via the self-service petrol station, the self-checkout machine at the supermarket, check-in machines at the airport, ticket vending machines at the tube, train and bus stations, ATMs at the bank and self-service fast food restaurants, add self-help applications that facilitate and monitor a citizen’s governance. Governance apps and the mechanisms that facilitate them are another way of collecting data on people.

Big data analytics is built on a myth that tries to hide the reality that pairing human behaviour with technological innovation results in surplus behavioural value. The business world needs to convince the social that what they trade in – data – is worthless.

The real problems with big data are not the quantity or quality of the data but rather one of epistemology (Floridi, 2012); and another problem (according to a report published in 2016[4], which surveyed 448 senior executives and professionals based in the United States on the current state of marketing and sales analytics across pharmaceuticals, medical devices, IT and telecoms) is the lack of impact of big data analytics. The impact is described as modest at best.

What follows here is the presentation of the tools of big data process monitoring that are being used to listen to the voice of citizens in relation to HMRC in the United Kingdom. The methodologies that are to be applied here will ultimately shape the insights gained. The tools applied to this digital context will be digital and such an approach will render digital insights; an issue discussed at length by others (Baym, 2013; Boyd and Crawford, 2012; Clough et al., 2015; Gitelman and Jackson, 2013; Kitchin and Lauriault, 2014; Kitchin, 2014; Manovich, 2011; Van Dijck, 2014). Once the digital methods, tools and interpretations have been presented the material will then move to a more human centred analysis. Rather than move to the application of qualitative small scale methods the intention is to place the issues and insights within wider political economic and communicative interpretations of what is going on with big data and governance.

Social Media Analytics in Practice

A link to the slides presented on 8th March 2018 at the “Answering Social Science Questions with Social Media Data” conference hosted by NSMNSS.

The goal is to re-present the techniques and tools of data scientists by applying the data mining techniques of the social media analytics software Brandwatch and the visualisation tool Vizia to one social media platform (Twitter), using two forms of analysis: social network analysis and content analysis.

The tools here will also include machine learning and automated approaches.
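
The two forms of analysis named above can be sketched in a few lines. What follows is a minimal illustration only, assuming invented tweets and account handles; it does not reproduce Brandwatch or Vizia, whose internals are not described here. It treats the number of times an account is mentioned (its in-degree in a mention network) as a crude proxy for ‘influence’, and counts word frequencies as a crude content analysis. The function names, the stop-word list and the data are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical example tweets as (author, text) pairs; invented for illustration.
tweets = [
    ("alice", "@HMRCgovuk my tax return is still pending, any update?"),
    ("bob", "@HMRCgovuk @alice same problem with my tax return"),
    ("carol", "Great help from @HMRCgovuk on the phone today"),
    ("dave", "@alice did you try the online tax service?"),
]

MENTION = re.compile(r"@(\w+)")
WORD = re.compile(r"[a-z']+")
# A tiny, arbitrary stop-word list; a real analysis would need a fuller one.
STOPWORDS = {"my", "is", "the", "on", "with", "any", "did", "you", "from", "same"}


def mention_edges(tweets):
    """Social network analysis step: directed (author, mentioned account) edges."""
    edges = []
    for author, text in tweets:
        for target in MENTION.findall(text):
            edges.append((author, target.lower()))
    return edges


def in_degree(edges):
    """Count how often each account is mentioned: a crude 'influence' score."""
    return Counter(target for _, target in edges)


def term_frequencies(tweets):
    """Content analysis step: word counts with mentions and stop words removed."""
    counts = Counter()
    for _, text in tweets:
        cleaned = MENTION.sub(" ", text.lower())
        counts.update(w for w in WORD.findall(cleaned) if w not in STOPWORDS)
    return counts


edges = mention_edges(tweets)
influence = in_degree(edges)
terms = term_frequencies(tweets)

print(influence.most_common(3))
print(terms.most_common(5))
```

Even in this toy form the results depend on choices such as the stop-word list and what counts as a mention, so the ‘influence’ ranking is an output of those decisions rather than a neutral measurement.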

HMRC Listening Organisation

Data is never raw and is always the output of an algorithm that requires validation. ‘Data’ is the result of a long chain of requirements, goals and (in the case of big data) wider political economy. Without context and meaning the data becomes fetishized. The ‘insights’ are at the macro level – devoid of context and therefore meaning, and categorised into certain ‘calculated publics’. It is advisable that citizens and the public have a level of knowledge of how these calculations are performed in order to aid in navigating their outcomes, particularly in relation to our governance and the governance of the wider public. The algorithmic black box needs to be unpacked and the assemblages of control that reside within these instruments need to be displayed and debated.

  • What questions can be answered using social media analytics in a governance capacity?
  • How does an organisation show that it is listening to its customers?
  • What role can Machine Learning and AI play in the interface between state and citizen?
  • Are epistemological and ethical concerns playing their role in the uptake of Machine Learning, AI and algorithms?
  • What impact is the introduction of GDPR (General Data Protection Regulation) likely to have?

The methodological contention here is that big data does not represent what we think it represents. It is not representing the social structure or patterns of interaction at a macro scale. The data presented here is being presented in a way that is hopefully worthy of our consideration (Robertson and Travaglia, 2014). The data presented here is not objective – it does not represent how wider discourse surrounding policies is conducted on other platforms and in face-to-face interaction. What the approach applied here can tell us is which organisations and individuals are ‘influencing’ the debates about big data and governance on Twitter. Hopefully it is also clear that what is being presented here is a peek behind the curtains, a look under the hood, and a light shone into the black box of big data analytics and data mining. It will in some small way enable us to see how societies are to be regulated if left to the practices and procedures of data scientists. It is looking at a sphere of the social as seen through the prism of datafication and provides insight into the references and meanings that are being constructed; it is not only a glimpse of a limited percentage of the population of data scientists and big data and governance analytics. It is also a glimpse of the various ways in which they intend to define, manage and govern us; capture our behaviours, identify good and bad behaviours, and develop ways to reward the good and punish the bad among us.

Big data and governance not only has problems regarding matters such as privacy, ownership and a lack of tangible results (so far); it also has one that is less to do with the quantity and quality of data or even the technicalities surrounding it. It has an epistemological problem (Floridi, 2012). It doesn’t have a clearly distinct set of criteria that need to be met in order for it to be able to make assertions that are not simply statements of belief. Big data lacks a theory of knowledge. Some claim to have found such a theory – Pentland (2014) argues that they have created a true social physics.

[1] http://sociologicalimagination.org/archives/18555

[2] http://www.faz.net/aktuell/feuilleton/debatten/the-digital-debate/shoshana-zuboff-secrets-of-surveillance-capitalism-14103616-p3.html?printPagedArticle=true#pageIndex_4

[3] http://rocketdatascience.org/?p=567

[4] http://www.zsassociates.com/-/media/files/publications/public/broken-links-why-analytics-investments-have-yet-to-pay-off.pdf?la=en