What are knowledge graphs, and why should I care?

How to make sense of complex data by exploring the interconnectedness of all things

Joe Baker
Convivio

--

Standing woman taking a pose wearing a white dress. But how well does she know Kevin Bacon? Photo by Sergio Souza from Pexels.

In our recent projects we have found ourselves working with increasingly large amounts of information and substantially more complex data. Largely, we have been able to rely on our tried-and-tested approaches to handling that data, and for the most part we’ve managed fairly well. However, we’ve also been aware of some challenges, particularly when working with data about networks, and networks of data. Those challenges arise especially in the way we interact with these larger and more complex datasets: combining, analysing and traversing the data in order to understand it more deeply.

Consequently, we’ve been looking for new approaches to working with complex data, with networked data. In our research, we’ve started looking in particular at knowledge graphs.

Data, or knowledge

Digital services handle data in one form or another. For years, the power-tools for working with data in software applications have been relational databases. They were designed originally to emulate paper forms and tables, and they do that very well. Relational databases have limitations, though — they store highly structured data in defined tables with predetermined columns, in many, many rows. And that means that the software and its developers must conform to that strict data structure.

Nowadays, though, we are frequently dealing with more complex data that requires more speed to handle it, data with more nuanced connections that requires more agility to analyse it. With this sort of networked data, we are not just interested in the items of data themselves but also in the relationships between items of data.

In this data environment, that means both the entities in the data and their relationships have equal significance — they are, equally, first-class citizens. By understanding the entities and the relationships between them in a network of data we are moving beyond simple data — pieces of information about a thing — to knowledge — what we know about the world (or the limited world of a given data network).

So, what is a knowledge graph?

In simple terms, a ‘knowledge graph’ is a model of the entities and relationships in a network of data. The ‘graph’ part of the term is taken from the field of mathematics, and refers particularly to the relationship aspect of the network: a graph is a structure made up of a set of objects, some pairs of which are related.

Diagrams are often used to portray knowledge graphs, so here are some graph diagrams to help understand their potential power in analysing networks of data.

Actors and their work on films make a good, relatively straightforward network of data to introduce us to knowledge graphs. You probably know the movie trivia game Six Degrees of Kevin Bacon, where actors are scored for the number of actor friends (defined by the films they have worked on together) they would need to talk to in order to meet Kevin Bacon.

We can use a knowledge graph of actors and movies to help us cheat at Six Degrees of Kevin Bacon.

Simple graph of Kevin Bacon

A simple knowledge graph of Kevin Bacon

Here we have a simple graph of Kevin Bacon. In this knowledge graph there are two entity types:

  1. A Person entity type, of which there is only one entity — Kevin Bacon
  2. A Movie entity type, of which there are three entities — A Few Good Men; Apollo 13; and Frost/Nixon.

This simple graph of Kevin Bacon also includes the relationships between entities of these two types: Kevin Bacon ‘acted in’ each of these three films.
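
As a rough sketch of how this small graph might be held in code, the entities and relationships can be written as (start, relationship type, end) triples in Python. The variable names and the `ACTED_IN` spelling are assumptions made for illustration, not the notation of any particular graph database.

```python
# A toy, in-memory version of the simple Kevin Bacon graph.

# Each entity is given a label saying which entity type it belongs to.
labels = {
    "Kevin Bacon": "Person",
    "A Few Good Men": "Movie",
    "Apollo 13": "Movie",
    "Frost/Nixon": "Movie",
}

# Relationships as (start entity, relationship type, end entity) triples.
relationships = [
    ("Kevin Bacon", "ACTED_IN", "A Few Good Men"),
    ("Kevin Bacon", "ACTED_IN", "Apollo 13"),
    ("Kevin Bacon", "ACTED_IN", "Frost/Nixon"),
]

# Follow Kevin Bacon's ACTED_IN relationships to list his movies.
movies = [end for start, rel, end in relationships
          if start == "Kevin Bacon" and rel == "ACTED_IN"]
print(movies)
```

Even at this size, the relationships are stored as data in their own right rather than being implied by a table structure.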

We could create a similar simple knowledge graph for Jack Nicholson:

A simple knowledge graph of Jack Nicholson

This graph shows us that Jack Nicholson ‘acted in’ Hoffa, As Good as It Gets, Something’s Gotta Give, One Flew Over the Cuckoo’s Nest, and A Few Good Men.

These two simple graphs are just excerpts of data in the same larger knowledge graph, so we can query the data in this graph to discover the Bacon Number for Jack Nicholson:

Working out Jack Nicholson’s Bacon Number

Jack Nicholson has a Bacon Number of 1 — he worked directly with Kevin Bacon on the film A Few Good Men.

There are many other people with a Bacon Number = 1:

Actors with a Bacon Number = 1

Our knowledge graph contains data about many other actors and the films in which they acted. We can then query our graph to work out the Bacon Number of any person in the network — the number of other actors that someone would need to know through having worked together on films in order to connect with Kevin Bacon.
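
One way to picture such a query is as a breadth-first search outwards from Kevin Bacon, where each shared film counts as one step. The Python below is a minimal sketch under that assumption, not how a graph database would actually run it. It uses the films listed above for Kevin Bacon and Jack Nicholson, plus Helen Hunt, who is added here purely for illustration. The function name `bacon_numbers` is hypothetical.

```python
from collections import deque

# Toy data: person -> movies they acted in.
acted_in = {
    "Kevin Bacon": {"A Few Good Men", "Apollo 13", "Frost/Nixon"},
    "Jack Nicholson": {"Hoffa", "As Good as It Gets", "Something's Gotta Give",
                       "One Flew Over the Cuckoo's Nest", "A Few Good Men"},
    "Helen Hunt": {"As Good as It Gets"},  # added for illustration
}

def bacon_numbers(acted_in, start="Kevin Bacon"):
    """Breadth-first search from `start`; each shared film is one step."""
    # Invert the mapping so we can go from a movie to its cast.
    cast = {}
    for person, movies in acted_in.items():
        for movie in movies:
            cast.setdefault(movie, set()).add(person)

    numbers = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for movie in acted_in[person]:
            for co_star in cast[movie]:
                if co_star not in numbers:
                    # First time we reach this person, so this is the shortest route.
                    numbers[co_star] = numbers[person] + 1
                    queue.append(co_star)
    return numbers

numbers = bacon_numbers(acted_in)
print(numbers["Jack Nicholson"])  # 1, via A Few Good Men
print(numbers["Helen Hunt"])      # 2 within this tiny dataset
```

The numbers only reflect the films in this toy dataset; a fuller graph could well give a shorter route.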

Some example Bacon Numbers from this dataset:

The Bacon Numbers for a sample of actors

This kind of thing can be done with a query of a relational database, of course, but what makes a knowledge graph of the network of relationships between actors and movies different is that finding the ‘shortest path’ across the network between two actors is at the core of interacting with the graph’s data.

Even more can be done, though, by incorporating more information into the knowledge graph, especially information about the type of people and movies in the graph.

There is a new breed of databases that are designed specifically to work in this way — graph databases.

This is a type of that

Movie night. But what kind of film are we going to see? Photo by Myke Simon on Unsplash

You’ll notice that above I said ‘people’ rather than ‘actors’ because our Person entities in the graph could include many other people in the movie industry who have other relationships to a film. A person may have ‘directed’ or ‘produced’ or even ‘reviewed’ a film, for instance.

In this small movie knowledge graph, we do not need to have multiple variants on a Person entity (actor entities, director entities, producer entities, say), because the relationships distinguish the connection between a person and movie in this data network.

With relationship types that describe the manner in which one entity is connected to another, it would be possible to distinguish all the Person entities who have a ‘directed’ relationship to a Movie entity to find all the directors.
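
As a small sketch of that idea in Python, filtering on the relationship type is enough to pick out the directors. The two `DIRECTED` rows are added here for illustration, and the helper name `people_with` is hypothetical.

```python
# (person, relationship type, movie) triples.
relationships = [
    ("Kevin Bacon", "ACTED_IN", "Apollo 13"),
    ("Kevin Bacon", "ACTED_IN", "A Few Good Men"),
    ("Jack Nicholson", "ACTED_IN", "A Few Good Men"),
    ("Rob Reiner", "DIRECTED", "A Few Good Men"),  # added for illustration
    ("Ron Howard", "DIRECTED", "Apollo 13"),       # added for illustration
]

def people_with(rel_type, relationships):
    """Return every person with at least one relationship of the given type."""
    return sorted({person for person, rel, _ in relationships if rel == rel_type})

print(people_with("DIRECTED", relationships))  # ['Rob Reiner', 'Ron Howard']
```

Note that there is only one kind of Person here; whether someone counts as a director falls out of the relationships they have.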

There is a better way of doing this, though.

Types of things

It may be more useful to add information into the knowledge graph that would help us to categorise or classify the things already in our graph.

We could add a set of entities for movie Genres, for example. Each movie would then be able to have a relationship to a genre, a relationship called ‘has genre’, maybe.

List of Kevin Bacon’s sci-fi movies, from Google’s Knowledge Graph

That would allow us to quickly find all the sci-fi movies in our graph, say. And we would also be able to find the sci-fi movies in which Kevin Bacon has acted (6, by the way).

We might also add entities for movie budget levels, and revenue levels, which would allow us to find the highest grossing and the most profitable genres for Kevin Bacon movies, to find actors who work on movies that are more, or less, profitable than those in Kevin ‘The Chip’ Bacon’s oeuvre, and so on.

We might also consider adding entities for Occupations, so that each Person could have a ‘has occupation’ relationship to an occupation as an actor, director, producer, cinematographer, foley artist, or even reviewer, say. We could then discover actors with a lower Bacon Number when we consider relationships with directors, or directors whose movies have the same box office receipts as Kevin Bacon’s movies.

Graphs, ontologies and taxonomies

This approach to clarifying the information in a knowledge graph by relating it to classifications uses things like taxonomies and ontologies to structure the graph.

A simple taxonomy of the drama genre for movies

A taxonomy is a tree of related terms or categories. Each branch on the tree is a more specific version of its parent term.

For movie genres, for example, Drama is a Genre entity, with sub-genres like Period Drama or Crime Drama, which may themselves have sub-genres.
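
To make the tree idea concrete, here is a minimal Python sketch that stores each term’s parent and walks up the tree. Drama, Period Drama and Crime Drama come from the example above; ‘Costume Drama’ is a hypothetical deeper level added for illustration, and the function names are my own.

```python
# Child term -> parent term. Drama is the root of this small tree.
parent = {
    "Period Drama": "Drama",
    "Crime Drama": "Drama",
    "Costume Drama": "Period Drama",  # hypothetical sub-genre, for illustration
}

def ancestors(term):
    """Walk up the tree from a term to the root, nearest parent first."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

def is_a(term, broader):
    """True if `term` is `broader` itself or a more specific version of it."""
    return term == broader or broader in ancestors(term)

print(ancestors("Costume Drama"))           # ['Period Drama', 'Drama']
print(is_a("Crime Drama", "Drama"))         # True
print(is_a("Crime Drama", "Period Drama"))  # False
```

With a check like `is_a`, a search for Drama could also return movies tagged only with one of its sub-genres.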

Taxonomies tend to be relatively simple, usually with just a label for each term in the taxonomy. The Medical Subject Headings (MeSH) vocabulary is a widely used standard taxonomy.

An ontology is more formal than a taxonomy, and defines a) the types of things that can exist in our realm of information (movies, people, genres, occupations, etc. for entities; acted in, directed, type of, etc. for relationships), and b) the properties that each thing may have (release date and rating for movies; birth date and nickname for people; role name for the acted in relationship).

For example, Schema.org is a project to create an ontology for structured data on the internet, with schemas for each entity. As an ontology, though, it is not very well suited to knowledge graphs — it is primarily focussed on entities and does not define relationships as clearly.

The basic concepts of knowledge graphs

I’ve already outlined the two primary elements of knowledge graphs.

Entities

The simplest graph would have a single entity in it. In graph databases, an entity is often called a node — the data object that represents the real-world entity.

Entities have labels to group them together in sets, such as the Person and Movie labels in the knowledge graph above. Our knowledge graph could be extended to have entities with other labels as well, such as Locations. Actions can then be performed on entities with a given label — list all the Movies in the dataset, for instance.

Relationships

Relationships connect entities to each other. In graph databases, a relationship is often called an edge — the data relation that represents the real world relationship between two entities.

Relationships have a type to group them together in sets, such as the ‘acted in’ type above. Our movie knowledge graph could be extended to have other relationship types, so we can identify people who ‘wrote’, ‘directed’, ‘produced’ or even ‘reviewed’ the movies in the knowledge graph.

Properties

What is significant about knowledge graphs, and most graph databases too, is that entities and relationships are both first-class citizens, so both can have properties added to them.

For instance:

  • a Person entity might have properties like ‘name’ and ‘age’;
  • a Movie entity might have properties like ‘title’, ‘rating’, ‘release date’ and ‘tag line’;
  • an ‘acted in’ relationship might have properties like ‘role’, for the character’s name, and a ‘type’, such as lead, supporting, cameo or walk-on actor.
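
A minimal Python sketch of this model might look like the following. The class and field names are assumptions made for illustration rather than the API of any graph database, and the property values for Apollo 13 are added here as an example.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    label: str                                      # e.g. "Person" or "Movie"
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: Entity
    rel_type: str                                   # e.g. "ACTED_IN"
    end: Entity
    properties: dict = field(default_factory=dict)  # relationships get properties too

kevin = Entity("Person", {"name": "Kevin Bacon"})
apollo = Entity("Movie", {"title": "Apollo 13", "release_year": 1995})

# The role belongs to the relationship, not to the person or the movie.
acted = Relationship(kevin, "ACTED_IN", apollo, {"role": "Jack Swigert"})

print(f"{acted.start.properties['name']} played {acted.properties['role']} "
      f"in {acted.end.properties['title']}")
```

Putting ‘role’ on the relationship means the same person can have a different role in each film without any change to the Person entity.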

Traversals and paths

With a growing set of networked data in our knowledge graph, we can start to consider using those entities, relationships and properties to discover things about the network of data.

Queries that analyse the entities and relationships in a knowledge graph are often said to be ‘traversing’ the data network. The result of a traversal query, from one entity to another across a series of relationships, is called a ‘path’.

This is exactly what the Bacon Number queries above are doing — they are traversal queries that ask ‘what is the shortest path between two given actors, based on their work histories on movies with other actors?’
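
As a hedged sketch of a traversal that returns the path itself, the Python below runs a breadth-first search over a handful of ‘acted in’ links and records how each entity was reached. The function name `shortest_path` is hypothetical, the Helen Hunt link is added for illustration, and a real graph database would provide this kind of query for you.

```python
from collections import deque

# (person, movie) pairs for the 'acted in' relationship, treated as two-way links.
edges = [
    ("Kevin Bacon", "A Few Good Men"),
    ("Kevin Bacon", "Apollo 13"),
    ("Jack Nicholson", "A Few Good Men"),
    ("Jack Nicholson", "As Good as It Gets"),
    ("Helen Hunt", "As Good as It Gets"),  # added for illustration
]

def shortest_path(edges, source, target):
    """Return the shortest list of entities from source to target, or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {source: None}  # how we first reached each entity
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through `previous` to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

path = shortest_path(edges, "Helen Hunt", "Kevin Bacon")
print(" -> ".join(path))
```

The path alternates between people and movies, so the Bacon Number is the number of movies on it.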

How could a knowledge graph help me?

Knowledge graphs are helpful in contexts where the information being managed and analysed forms a web or network, where the data includes items that are related to each other, or are even interdependent.

They are particularly applicable in environments where the data is large and complex, and though the relationships between entities exist, the connections within the network are not immediately clear — in scenarios like these, traversal and path queries frequently reveal startling insights.

Here are some examples of use-cases.

Google’s Knowledge Graph

An excerpt from a Google knowledge panel for Kevin Bacon

Google pioneered the use of knowledge graphs in the digital world, and the best-known product of its knowledge graph is the ‘knowledge panels’ that accompany a Google search.

The knowledge panel for a ‘Kevin Bacon’ search on Google includes key biographical and personal data, as you might expect from Google’s search index. There are also links to images of Kevin Bacon, highlights from his oeuvre, links to his social media profiles, and suggestions of other people for whom people search. There are knowledge panels for most searches on Google, of course, not just Kevin Bacon.

Google’s pioneering work now forms one of their key backend services, called the Knowledge Graph, with a search API for developers to work directly with it.

Other knowledge graph case studies

Knowledge graphs are now used in a huge variety of contexts.

They are used to power voice assistants like Siri and Alexa.

Many large retailers like Walmart use knowledge graphs to understand their products and their customers, and provide recommendation engines.

In financial services, knowledge graphs are being used for fraud detection.

Investigative journalists have used knowledge graphs to investigate the Panama Papers and to analyse the TrumpWorld dataset that was released by BuzzFeed.

Knowledge graphs are being used in the healthcare sector for cause-effect analysis in the response to COVID-19.

Examples could go on for multiple industries, sectors, and contexts.

How would I go about using a knowledge graph?

If you are at the stage of thinking that a knowledge graph in the form of a graph database may be a good way forward in your project, and wondering how to go about it, here are some things to bear in mind.

1. Start small

When starting a project that uses a knowledge graph, it would be easy to be overwhelmed with a large amount of complex data and struggle to make much progress. Instead, choose a small and concrete context for your application, one that is fairly straightforward to understand and has a limited amount of data initially, and work towards a small prototype that can be demonstrated quickly. It might be helpful or even important to align this initial use-case with your business or organisation goals, and to define it according to user or business value.

2. Get to know your data

As you begin to work with the information you’ll be adding to your knowledge graph, make time for a thorough inventory of your data. The point of this cleaning and sifting is to gain a thorough understanding of the entities and the relationships in your data.

In doing this, you may well find that the structure of your data in your knowledge graph format needs to change, possibly multiple times, as your understanding improves. This is most likely to be true if you’re new to knowledge graphs, and if you’re working with or moving information that was structured for a relational database.

3. Make good use of best practices and standards

When structuring your data to work well in your knowledge graph, make sure you make good use of established standards for arranging your data.

In particular, there are numerous standards for ontologies, such as Schema.org for structured data on the internet. Google’s Knowledge Graph uses the Schema.org types, for example. There are established ontologies for many industries and sectors, and declaring the ontology to which your data conforms will help in both working with and adopting your knowledge graph data.

The same is true for standard taxonomies for the industry or sector for your data.

Make sure you spend some time to research the appropriate best practices and standards to which you will align your data, and declare the standards that your graph data complies with to help your team and anyone working with your data.

4. Work as a team

You will want to lean on the knowledge and insights of a range of specialist users and subject matter experts to make sure your knowledge graph best represents your information. Make sure you form a team with diverse perspectives and skills to help you when you are working to understand your data and adapt it to work best within your knowledge graph and match best practices and standards.

This team might include taxonomy specialists, end users, business domain users, data scientists and more, all of whom will help the iterative process of refining and clarifying the information in your knowledge graph, and of putting your knowledge graph to use in your context.

5. Learn lots, be adaptable, keep experimenting

Don’t be beguiled by the excitement of your first knowledge graph — it’s easy to take your early successes and quickly start developing a giant plan for an all-encompassing grand unifying graph for your data.

Instead, continue to be agile. Make iterative improvements and additions, rolling out small increments and extra dimensions regularly. Keep experimenting; learn as you go; make informed decisions about your next steps; keep improving.

In summary, why should I care about knowledge graphs?

Modern digital services are iteratively designed, continuously responding to new insights about user needs. In that environment of constant adaptation and improvement, it’s important that the way we store our data doesn’t constrain us.

Knowledge graphs offer a flexible way to draw valuable understanding from complex datasets. Thinking about connections, and degrees of connectedness, can open up new ways to serve users with our data, and even, maybe, help us all discover our own Bacon Number.

--

Writer, PhD in religion and narrative from Bristol University. Chief Research Officer at Convivio, the collaboration company.