Visualizing Multi-Document Semantics via Open Domain Information Extraction

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2018

Abstract

Faced with the overwhelming amounts of data in the 24/7 stream of new articles appearing online, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we present a system that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. We rely on a series of natural language processing methods, including open-domain information extraction, a special filtering method to maintain only meaningful relationships, and a heuristic to form graphs with a high coverage rate of topic entities and concepts. Our graph visualization then allows users to explore these connections. In our experiments, we rely on a large collection of news crawled from the Web and show how connections within this data can be explored.

How does the system work?

The objective of our system is to support the user in quickly extracting salient entities, concepts, and their relationships from a set of related documents, discovering connections within and across them, and presenting the resulting information in a graph-based interactive visualization. It is composed of three major modules: fact extraction, fact filtering, and conceptual graph construction. In the following, we present the output of our system via a concrete example.
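Separately from the system's own output, the way the three modules fit together can be illustrated with a minimal Python sketch. This is not the implementation used in the paper: the `Fact` tuple, the confidence threshold, the pronoun-based filter, and the greedy coverage heuristic are all assumptions chosen only to make the flow concrete, and the facts are hand-written toy data standing in for the output of an open-domain information extraction tool.

```python
from typing import List, NamedTuple, Set, Tuple


class Fact(NamedTuple):
    """A hypothetical extracted fact: (subject, relation, object) plus a score."""
    subject: str
    relation: str
    obj: str
    confidence: float


# Assumed list of uninformative arguments; a real filter would be richer.
PRONOUNS = {"it", "he", "she", "they", "this", "that"}


def filter_facts(facts: List[Fact], min_confidence: float = 0.5) -> List[Fact]:
    """Keep facts that score at least min_confidence and have no pronoun arguments."""
    kept = []
    for fact in facts:
        if fact.confidence < min_confidence:
            continue
        if fact.subject.lower() in PRONOUNS or fact.obj.lower() in PRONOUNS:
            continue
        kept.append(fact)
    return kept


def build_graph(facts: List[Fact], topic_terms: Set[str]) -> Tuple[List[Fact], float]:
    """Greedily pick facts that cover the most not-yet-covered topic terms.

    Returns the chosen facts (graph edges) and the fraction of topic terms covered.
    """
    covered: Set[str] = set()
    edges: List[Fact] = []
    remaining = list(facts)
    while remaining:
        best = max(
            remaining,
            key=lambda f: len(({f.subject, f.obj} & topic_terms) - covered),
        )
        gain = ({best.subject, best.obj} & topic_terms) - covered
        if not gain:
            break  # no remaining fact adds coverage
        edges.append(best)
        covered |= gain
        remaining.remove(best)
    coverage = len(covered) / len(topic_terms) if topic_terms else 0.0
    return edges, coverage


# Toy input standing in for the fact extraction module's output.
facts = [
    Fact("Company A", "acquired", "Company B", 0.9),
    Fact("it", "was founded in", "2001", 0.8),
    Fact("Company B", "is based in", "City C", 0.7),
    Fact("Company A", "may consider", "City C", 0.2),
]
topics = {"Company A", "Company B", "City C"}

kept = filter_facts(facts)
edges, coverage = build_graph(kept, topics)
for edge in edges:
    print(f"{edge.subject} --[{edge.relation}]--> {edge.obj}")
print(f"topic coverage: {coverage:.2f}")
```

Each selected fact becomes a labelled edge between two nodes, so the resulting edge list can be handed directly to any graph visualization library.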

Downloads

Our system is designed to operate on large document collections that can then be searched for specific documents to be analysed. Our current corpus consists of 734,488 news articles and 265,512 blog articles, in total around 1 million English-language articles, with an average article length of 405 words. The articles stem from a variety of news sources and were released by Signal Media during a period of one month (1–30 September 2015).

References

Acknowledgement

This paper was supported in part by grants from the Natural Science Foundation of China (No. 61572111), the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015408), a 985 Project of UESTC (No. A1098531023601041), and the Fundamental Research Funds for the Central Universities of China (No. A03017023701). Gerard de Melo's research is funded in part by ARO grant no. W911NF-17-C-0098 (DARPA SocialSim program).

If you are interested in our work or have any questions, do not hesitate to contact me at shengyp2011@163.com.

Thanks and enjoy the journey.