The use of graph databases is rapidly increasing as organisations worldwide look to gain insight into large quantities of highly connected data. With analytics libraries like the Neo4j Graph Data Science module, the possibilities for analysis keep evolving. The new ‘link prediction’ function might be the next big hit in machine learning technology.

Suppose you ask a group of individuals to imagine and describe a database. Chances are they will all come up with the same picture: tables of rows and columns filled with information, linked together by common variables. This makes perfect sense, as the tabular representation of a database has been popular and, for a long time, even dominant since the British computer scientist Ted Codd introduced it in 1970.

While these databases, commonly referred to as relational databases, are still frequently used, the late 2000s gave rise to a new family: NoSQL databases. As data storage costs have drastically decreased over the years, the need for complex data models to avoid duplicates has diminished. As a result, developer comfort has become increasingly important, and the emphasis has shifted toward the speed and convenience of database queries. NoSQL databases are designed with these needs in mind.

Graph databases as a major trend

NoSQL databases come in many different shapes and forms. One that has recently gained a lot of attention is the graph database. Gartner has identified graph databases as a top data and analytics trend for several years in a row, stating that “by 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021”. A graph database is based on mathematical graph theory and is used to store and analyse strongly interconnected data. The concept is simple and consists of only three building blocks: nodes, edges and properties. For example, in a social network you could think of the nodes as individuals, the edges as relationships between these individuals, and the properties as attributes of either: a person's age and address, or the number of years two people have known one another.
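The three building blocks can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how Neo4j or any other graph database stores data; the class names (Node, Edge, PropertyGraph), the KNOWS relationship type and the sample people are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node (vertex) with a label and free-form properties."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed edge between two nodes, carrying its own properties."""
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Tiny in-memory property graph: nodes, edges and properties."""

    def __init__(self):
        self.nodes = {}   # node_id -> Node
        self.edges = []   # list of Edge

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = Node(node_id, label, properties)

    def add_edge(self, source, target, rel_type, **properties):
        # Refuse edges that point at nodes the graph does not contain.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, rel_type, properties))

    def neighbours(self, node_id):
        """Return the ids of nodes connected to node_id, ignoring direction."""
        found = set()
        for edge in self.edges:
            if edge.source == node_id:
                found.add(edge.target)
            elif edge.target == node_id:
                found.add(edge.source)
        return found


# Hypothetical social network: two people who have known each other for 7 years.
graph = PropertyGraph()
graph.add_node("alice", "Person", age=34, address="Utrecht")
graph.add_node("bob", "Person", age=41, address="Leiden")
graph.add_edge("alice", "bob", "KNOWS", years_known=7)
```

Note that age and address live on the nodes, while years_known lives on the edge: in this model a relationship is a first-class record rather than a join between two tables.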

As the reach of graph databases grows, so do the opportunities for analysis. An example is the Graph Data Science module in Neo4j (a popular graph database management system). This module allows the user to execute various algorithms and procedures in order to gain powerful insights from the database at hand. One of its most interesting functions is ‘link prediction’, a machine learning tool that predicts missing relationships between nodes in a graph. A data analyst allocates a subset of the data on which the algorithm is trained and indicates which node properties should be considered for the analysis. The algorithm then generates a list of plausible relationships between nodes in the graph, each with an estimated probability.
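To make the idea of link prediction concrete, here is a minimal sketch in plain Python. It is not the Neo4j procedure described above: instead of training a model on node properties, it swaps in a simple topological heuristic, the Adamic-Adar index, which scores a candidate pair by the neighbours the two nodes share. The output is therefore a ranked list of scores rather than probabilities. The function names and the sample friendships are hypothetical.

```python
from itertools import combinations
from math import log


def build_adjacency(edges):
    """Turn a list of undirected (a, b) pairs into a neighbour lookup."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def adamic_adar(adjacency, a, b):
    """Score a candidate link; rarely connected shared neighbours weigh more."""
    shared = adjacency[a] & adjacency[b]
    # A shared neighbour is linked to both a and b, so its degree is at
    # least 2 and log(degree) is never zero.
    return sum(1.0 / log(len(adjacency[n])) for n in shared)


def predict_links(edges, top_k=3):
    """Rank node pairs that are not yet connected, highest score first."""
    adjacency = build_adjacency(edges)
    candidates = []
    for a, b in combinations(sorted(adjacency), 2):
        if b in adjacency[a]:
            continue  # already linked, nothing to predict
        score = adamic_adar(adjacency, a, b)
        if score > 0:
            candidates.append((a, b, score))
    candidates.sort(key=lambda item: item[2], reverse=True)
    return candidates[:top_k]


# Hypothetical friendships in a small social network.
known_edges = [
    ("ann", "ben"), ("ann", "cas"), ("ben", "cas"),
    ("ben", "dee"), ("cas", "dee"), ("dee", "eli"),
]
predictions = predict_links(known_edges)

for a, b, score in predictions:
    print(f"{a} - {b}: {score:.2f}")
```

In this toy network, ann and dee are not connected but share two friends (ben and cas), so that pair ranks first. A trained model, as in the Neo4j module, would additionally take node properties into account and return a probability for each suggested link.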

Some of the many applications of link prediction

Link prediction opens up a whole new range of applications. Where graph databases have traditionally been used for the representation and convenient storage of existing data and their relationships, they can now be used to predict new relationships between data. Imagine, for example, a graph database that captures the network of work-related relationships between employees and departments in a large-scale organisation. When built with the correct architecture, link prediction could be used to identify relationships within the organisation that are not yet known. For instance, when new teams are assembled or employees with relevant skillsets or characteristics need to be added to existing teams, link prediction algorithms can make suggestions based on prior cooperation between employees. This means that database builders do not have to spend time identifying these relationships themselves. In that regard, the algorithm’s work can be compared to that of a sophisticated recommendation engine.

The link prediction module in Neo4j is currently still in its beta phase, which means that it has some known (and likely some unknown) drawbacks. For instance, its use is limited to homogeneous graphs: graphs with a single type of node and a single type of edge. Using the example of a social network, this limitation means that links can only be found between persons, and not between persons and cities. This limitation, however, is already addressed in other graph database packages, and support will most likely be added to Neo4j’s link prediction module in the near future. Another drawback is that only properties of individual nodes are considered. It is currently not possible to include properties of node pairs, which would most likely improve the predictive power of the algorithm.

Nonetheless, when the module reaches a more mature stage, it will likely take graph databases to a whole new level.

Would you like more information?

If you would like more information about this subject, please get in touch with our experts, who would be pleased to hear from you.

  • Jeroen Tegelaar