Revista de investigación sobre farmacia y administración de medicamentos

An Ensemble Learning Approach to Perform Link Prediction on Large Scale Biomedical Knowledge Graphs for Drug Repurposing and Discovery

Kai Liu, Vignesh Prabhakar, Chau Vu, Jennifer Crawford and Joseph Waite

Generating Knowledge Graph Embeddings (KGEs) to represent entities (nodes) and relations (edges) in large scale knowledge graph datasets has been a challenging problem in representation learning. This is primarily because the embeddings/vector representations that are required to encode the full scope of data in a large heterogeneous graph needs to have a high dimensionality. The orientation of a large number of vectors requires a lot of space which is achieved by projecting the embeddings to higher dimensions. This is not a scalable solution especially when we expect the knowledge graph to grow in size in order to incorporate more data. Any efforts to constrain the embeddings to lower number of dimensions could be problematic as insufficient space to spatially orient the large number of embeddings/vector representations within limited number of dimensions could lead to poor inferencing on downstream tasks such as link prediction which leverage these embeddings to predict the likelihood of existence of a link between two or more entities in a knowledge graph. This is especially the case with large biomedical knowledge graphs which relate several diverse entities such as genes, diseases, signaling pathways, biological functions etc. that are clinically relevant for the application of KGs to drug discovery. The sizes of the biomedical knowledge graphs are therefore much larger compared to typical benchmark knowledge graph datasets. This poses a huge challenge in generating embeddings/vector representations of good quality to represent the latent semantic structure of the graph. Attempts to circumvent this challenge by increasing the dimensionality of the embeddings often render hardware limitations as generating high dimensional embedding is computationally expensive and often times infeasible. To practically deal with representing the latent structure of such large scale Knowledge Graphs (KGs), our work proposes an ensemble learning model in which the full knowledge graph is sampled into several smaller subgraphs and KGE models generate embedding for each individual subgraph. The results of link prediction from the KGE models trained on each subgraph are then aggregated to generate a consolidated set of link predictions across the full knowledge graph. The experimental results demonstrated significant improvement in rank-based evaluation metrics on task specific link predictions as well as general link predictions on four open-sourced biomedical knowledge graph datasets.