Scoring Entities’ Authority in Academia based on Large-scale Heterogeneous Information Network

Demo: http://hinet.lindayi.me

Introduction

Hinet is an algorithm that quantifies the academic influence of entities in an academic heterogeneous information network formed by papers, venues, authors and organizations. Like the impact factor (IF) of a venue or the h-index of an author, the hi-index of an entity (i.e., a paper, venue, author or organization) indicates how important it is in academia.

Traditional algorithms for academic influence rely on simple statistics (e.g. h-index), a homogeneous information network, or a simple bi-type heterogeneous information network (e.g. RankClus). Hinet instead models academia with a more complex heterogeneous information network, as shown in the figure below.

[Figure: network]

How it works

Define A as the set of authors and Authority(a) as the hi-index of author a.

Similarly, define P, O and V as the sets of papers, organizations and venues respectively, and Authority(p), Authority(o) and Authority(v) as the hi-index of paper p, organization o and venue v.

Define Coauth[i,j] as the matrix of coauthor relationships. If authors i and j have coauthored, Coauth[i,j] equals the number of times they have collaborated, otherwise 0.

Define Work[i,j] as the matrix of employment. If author i is employed by organization j, Work[i,j] equals 1, otherwise 0.

Define Citate[i,j] as the matrix of citations. If paper i cites paper j, Citate[i,j] equals 1, otherwise 0.

Define Publish[i,j] as the matrix of publication. If paper i is published in venue j, Publish[i,j] equals 1, otherwise 0.
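
To make the four definitions concrete, here is a minimal sketch that builds the matrices from a toy data set. All entity names and the input layout are hypothetical, and the sketch assumes that the number of collaborations between two authors is the number of papers they share.

```python
# Toy data; every name below is hypothetical and only illustrates the definitions.
authors = ["alice", "bob", "carol"]
orgs = ["uni_x", "lab_y"]
papers = ["p1", "p2", "p3"]
venues = ["conf_a", "journal_b"]

# For each paper: its authors, its venue, and the papers it cites.
paper_info = {
    "p1": {"authors": ["alice", "bob"], "venue": "conf_a", "cites": []},
    "p2": {"authors": ["alice", "bob", "carol"], "venue": "journal_b", "cites": ["p1"]},
    "p3": {"authors": ["carol"], "venue": "conf_a", "cites": ["p1", "p2"]},
}
employment = {"alice": "uni_x", "bob": "uni_x", "carol": "lab_y"}


def zeros(rows, cols):
    """Return a rows x cols matrix filled with 0."""
    return [[0] * cols for _ in range(rows)]


def index_of(names):
    """Map each name to its row/column index."""
    return {name: i for i, name in enumerate(names)}


a_idx, o_idx = index_of(authors), index_of(orgs)
p_idx, v_idx = index_of(papers), index_of(venues)

coauth = zeros(len(authors), len(authors))
citate = zeros(len(papers), len(papers))
publish = zeros(len(papers), len(venues))
work = zeros(len(authors), len(orgs))

for paper, info in paper_info.items():
    # Coauth[i,j]: number of shared papers (assumed meaning of "times collaborated").
    for x in info["authors"]:
        for y in info["authors"]:
            if x != y:
                coauth[a_idx[x]][a_idx[y]] += 1
    # Citate[i,j] = 1 if paper i cites paper j.
    for cited in info["cites"]:
        citate[p_idx[paper]][p_idx[cited]] = 1
    # Publish[i,j] = 1 if paper i is published in venue j.
    publish[p_idx[paper]][v_idx[info["venue"]]] = 1

# Work[i,j] = 1 if author i is employed by organization j.
for author, org in employment.items():
    work[a_idx[author]][o_idx[org]] = 1
```

In this toy data, alice and bob share two papers, so their Coauth entry is 2, while Work, Citate and Publish contain only 0 and 1.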

With these definitions, we have the following formulas:

[Formula image: screen-shot-2016-10-24-at-10-33-47]

[Formula image: screen-shot-2016-10-24-at-10-33-52]

[Formula image: screen-shot-2016-10-24-at-10-33-57]

[Formula image: screen-shot-2016-10-24-at-10-34-04]

Initially, every entity has an authority of 1. In each iteration, every entity passes its authority to its neighbours. At the end of each iteration, a normalization step is applied to all entities:

[Formula image: screen-shot-2016-10-24-at-10-36-56]
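
The propagate-then-normalize round described above can be sketched roughly as follows. This is not Hinet's actual update rule: the toy graph, the edge weights, and the helper names `propagate` and `normalize` are made up for illustration, and the normalization used here (rescaling so that the mean authority stays 1) is only an assumption, since the exact formula is given in the image above.

```python
# Hypothetical toy graph: (source, target, weight) means that source passes
# weight * authority(source) to target. The weights are invented.
edges = [
    ("paper1", "author1", 1.0),
    ("paper2", "author1", 1.0),
    ("paper2", "paper1", 1.0),   # paper2 cites paper1
    ("author1", "paper1", 0.5),
    ("author1", "paper2", 0.5),
]
nodes = ["author1", "paper1", "paper2"]


def propagate(authority, edges):
    """One round: each node collects weighted authority from its in-neighbours."""
    incoming = {node: 0.0 for node in authority}
    for source, target, weight in edges:
        incoming[target] += weight * authority[source]
    return incoming


def normalize(authority):
    """Assumed normalization: rescale so that the mean authority is 1."""
    mean = sum(authority.values()) / len(authority)
    if mean == 0:
        return dict(authority)
    return {node: value / mean for node, value in authority.items()}


# Every entity starts with an authority of 1.
authority = {node: 1.0 for node in nodes}
authority = normalize(propagate(authority, edges))
```

After one round on this toy graph, author1 ranks above paper1, which ranks above paper2, and the values still average to 1.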

How it’s implemented

We implement the algorithm with the Pregel API in Spark GraphX. Pregel is an iterative graph-processing framework introduced by Google and built on the BSP (Bulk Synchronous Parallel) model, in which vertices exchange messages in synchronized supersteps. Using Pregel mainly comes down to implementing three user-defined functions: vprog, sendMsg and mergeMsg.

In our algorithm, the vprog function overwrites the previous hi-index value with the new value computed in the current iteration; the sendMsg function sends the entity’s hi-index value, multiplied or divided by the edge weight, to its neighbours according to the type of the edge (i.e., the type of relationship between the neighbours); the mergeMsg function simply sums all the hi-index values received from the entity’s neighbours.
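
The division of labour between the three functions can be imitated in plain Python, without Spark. The sketch below is not the GraphX API and not the project's code: the edge types, the weights, and the choice of which type multiplies or divides are assumptions made for illustration, and normalization is left out.

```python
# Hypothetical typed edges: (source, target, edge_type, weight).
EDGES = [
    ("a1", "a2", "coauthor", 2.0),
    ("a2", "a1", "coauthor", 2.0),
    ("p1", "a1", "written_by", 1.0),
    ("p1", "a2", "written_by", 1.0),
    ("p2", "p1", "cites", 1.0),
]


def send_msg(source_value, edge_type, weight):
    """Message sent along one edge; the per-type rules are assumptions."""
    if edge_type == "coauthor":
        return source_value * weight   # multiply by the edge weight
    if edge_type == "written_by":
        return source_value / weight   # divide by the edge weight
    return source_value                # other types: pass the value unchanged


def merge_msg(left, right):
    """Combine two messages addressed to the same vertex by summing them."""
    return left + right


def vprog(old_value, merged_message):
    """Overwrite the previous value with the newly received one."""
    return merged_message


def superstep(values, edges):
    """One synchronous round: send, merge, then update every vertex."""
    inbox = {}
    for source, target, edge_type, weight in edges:
        message = send_msg(values[source], edge_type, weight)
        inbox[target] = merge_msg(inbox[target], message) if target in inbox else message
    # In this sketch, a vertex that receives no message keeps its old value.
    return {v: vprog(old, inbox[v]) if v in inbox else old for v, old in values.items()}


values = {"a1": 1.0, "a2": 1.0, "p1": 1.0, "p2": 1.0}
values = superstep(values, EDGES)
```

All messages are computed from the values of the previous round before any vertex is updated, which is what makes the round synchronous in the BSP sense.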

The iteration stops when the average difference in hi-index between two consecutive iterations is less than 1e-7.
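
A stopping rule of this kind might look like the sketch below. It assumes the difference is the mean absolute change per entity; the helper names, the `max_iterations` safeguard, and the toy `step` function are hypothetical and stand in for one real Hinet iteration.

```python
def average_difference(previous, current):
    """Mean absolute change per entity between two iterations (assumed metric)."""
    return sum(abs(current[key] - previous[key]) for key in current) / len(current)


def run_until_stable(step, values, tolerance=1e-7, max_iterations=1000):
    """Apply step repeatedly until the average change drops below tolerance."""
    for iteration in range(1, max_iterations + 1):
        new_values = step(values)
        if average_difference(values, new_values) < tolerance:
            return new_values, iteration
        values = new_values
    # Safeguard (not from the source): give up after max_iterations rounds.
    return values, max_iterations


def toy_step(values):
    """Stand-in for one iteration: move every value halfway towards 1."""
    return {key: 0.5 * value + 0.5 for key, value in values.items()}


final_values, rounds = run_until_stable(toy_step, {"x": 3.0, "y": 0.0})
```

With the toy step, the values settle close to 1 well before the iteration cap is reached.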