A Framework for Graph-based Multi-Document Relation Exploration: A Case Study on the News Datasets


The 8th National Conference on Social Media Processing (SMP 2019)


Yongpan Sheng1 Haojie Huang2



1. Abstract

When overwhelming amounts of data arrive in the 24/7 stream of new articles, including news reports, business transactions, and digital media, considering only the key entities and concepts and their relationships saves time in acquiring key information. However, relevant connections may be distributed over a number of sources, which complicates relation exploration. In this paper, we propose a system that helps users quickly discern salient connections and facts from a set of related documents and presents the resulting information in a graph-based visualization. Specifically, given a set of relevant documents as input, we first propose a novel importance-based Open Information Extraction (Open IE) approach that exploits the global structure of a dependency tree to extract candidate facts from these sources. Unlike previous Open IE approaches, our method can cover more diverse relation expressions and measure the relative importance of candidate facts within a sentence. We then design a Two-Stage Candidate Triple Filtering (TCTF) approach, based on a self-training framework, that retains only the coherent candidate facts associated with the specified document topics and connects them into an initial graph. We further refine this graph with a heuristic strategy that iteratively removes the weakest entities and concepts, namely those with relatively low importance scores computed by an extended TextRank algorithm, so that the final conceptual graph consists only of facts likely to represent meaningful and salient relationships, which users can explore graphically. Extensive experiments on two real-world news datasets show that our extraction approach achieves an average F-score 3.2% higher than state-of-the-art Open IE methods.
We also conduct an empirical study of the quality of the final conceptual graph for different document topics, in terms of its coverage rate of topic entities and concepts, confidence score, the compatibility of the involved facts, and graph connectedness. Experimental results show the effectiveness of our proposed approach. More details of this system and the code can be found at https://shengyp.github.io/Multi-doc-rel-min/.
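
As a rough illustration of the graph-pruning step mentioned in the abstract, the sketch below scores nodes with plain TextRank (PageRank over an undirected, unweighted graph) and repeatedly drops the lowest-scoring node until the graph is small enough. It is not the released implementation: the extended TextRank variant is not specified on this page, and the names `textrank`, `build_graph`, `prune`, and the `max_nodes` stopping rule are assumptions made only for this example.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map from (subject, relation, object) triples."""
    neighbors = defaultdict(set)
    for subj, _rel, obj in triples:
        if subj != obj:  # ignore self-loops in this toy version
            neighbors[subj].add(obj)
            neighbors[obj].add(subj)
    return neighbors


def textrank(neighbors, damping=0.85, iterations=50):
    """Plain TextRank scores via power iteration (assumption: no extensions)."""
    nodes = list(neighbors)
    n = len(nodes)
    if n == 0:
        return {}
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[nb] / len(neighbors[nb]) for nb in neighbors[node])
            updated[node] = (1 - damping) / n + damping * incoming
        scores = updated
    return scores


def prune(triples, max_nodes):
    """Iteratively remove the weakest node until at most max_nodes remain.

    max_nodes is a hypothetical stopping criterion chosen for this sketch.
    """
    triples = list(triples)
    while True:
        neighbors = build_graph(triples)
        if len(neighbors) <= max_nodes:
            return triples
        scores = textrank(neighbors)
        # Sorting first makes tie-breaking deterministic (alphabetical).
        weakest = min(sorted(scores), key=scores.get)
        triples = [t for t in triples if weakest not in (t[0], t[2])]


# Toy input: one hub entity linked to four leaf entities.
toy_triples = [
    ("hub", "r1", "a"),
    ("hub", "r2", "b"),
    ("hub", "r3", "c"),
    ("hub", "r4", "d"),
]
pruned = prune(toy_triples, max_nodes=3)
print(pruned)
```

Rebuilding the graph after every removal keeps the scores consistent with the current structure, at the cost of recomputing TextRank each round; for large graphs a batched removal would likely be preferable.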

2. Video

A video presenting the system is available here.

3. Code and Dataset

To facilitate further research on this topic, we have made the source code and dataset of the proposed framework publicly available. If you use these resources, please cite our paper.

The code will be updated after the work is accepted.
github://

4. Acknowledgement

The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).