Visualizing Multi-Document Semantics via Open Domain Information Extraction

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2018

Yongpan Sheng¹ Zenglin Xu¹ Yafang Wang² Xiangyu Zhang¹ Jia Jia² Zhonghui You¹ Gerard de Melo³

¹University of Electronic Science and Technology of China ²Shandong University ³Rutgers University

Abstract

Faced with the overwhelming amounts of data in the 24/7 stream of new articles appearing online, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we present a system that extracts salient entities, concepts, and their relationships from a set of related documents, discovers connections within and across them, and presents the resulting information in a graph-based visualization. We rely on a series of natural language processing methods, including open-domain information extraction, a special filtering method to maintain only meaningful relationships, and a heuristic to form graphs with a high coverage rate of topic entities and concepts. Our graph visualization then allows users to explore these connections. In our experiments, we rely on a large collection of news crawled from the Web and show how connections within this data can be explored.

How does the system work?

The objective of our system is to support the user in quickly extracting salient entities, concepts, and their relationships from a set of related documents, discovering connections within and across them, and presenting the resulting information in a graph-based interactive visualization. It is composed of three major components: a fact extraction module, a fact filtering module, and a conceptual graph construction module. In the following, we illustrate the output of the system with a concrete example.

Fact Extraction Module

The system first lists five predefined trending news topics: the Syrian refugee crisis, the Iran nuclear issue, the Volkswagen scandal, the US presidential election, and Chinese cooperation with Sudan. Each topic comes with a set of topic words, which are used to induce document representations. For example, when we pick several topic words under the topic of the US presidential election, the system shows a list of documents for this topic, ranked by the TF-IDF weights of the topic words in each document. By default, the top-k documents for the topic are selected for further processing.
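The ranking step described above can be sketched as follows. This is only a minimal illustration, not the system's actual implementation: the function name rank_documents, the whitespace tokenizer, the smoothed IDF formula, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def rank_documents(docs, topic_words, k=2):
    """Return the indices of the top-k documents, ranked by the summed
    TF-IDF weight of the topic words (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each topic word.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in topic_words}
    scores = []
    for idx, toks in enumerate(tokenized):
        counts = Counter(toks)
        score = 0.0
        for w in topic_words:
            if df[w] == 0 or not toks:
                continue
            tf = counts[w] / len(toks)
            # Assumed smoothing: +1 so that widely shared words still count.
            idf = math.log(n_docs / df[w]) + 1.0
            score += tf * idf
        scores.append((score, idx))
    # Highest score first; ties are broken by original document order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:k]]


# Made-up sample documents, for illustration only.
docs = [
    "the election campaign heats up as candidates debate",
    "volkswagen faces fines over emissions scandal",
    "presidential election polls show a tight race in the election",
]
top = rank_documents(docs, ["election", "presidential"], k=2)
print(top)
```

Here the third document ranks first because it mentions both topic words, and the document about Volkswagen is left out of the top-k selection.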

The system relies on a series of natural language processing methods, including coreference resolution and open-domain information extraction, to extract facts as (subject, predicate, object) triples, in which the subject, predicate, and object are natural language phrases taken from the sentences within the top-k documents selected above. These often correspond to the syntactic subject, predicate, and object, respectively. The extracted facts can be obtained from here.
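To make the shape of the extracted facts concrete, the sketch below stores each fact as a (subject, predicate, object) triple together with the document it came from, and then looks for argument phrases that occur in more than one document. The Fact class, the doc_id field, the function cross_document_arguments, and the sample triples are hypothetical and serve only as an illustration; the actual extraction relies on the NLP pipeline described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One extracted fact; field names are assumptions for this sketch."""
    subject: str
    predicate: str
    obj: str
    doc_id: int


def cross_document_arguments(facts):
    """Map each argument phrase to the documents mentioning it,
    keeping only phrases that appear in two or more documents."""
    seen = defaultdict(set)
    for fact in facts:
        seen[fact.subject.lower()].add(fact.doc_id)
        seen[fact.obj.lower()].add(fact.doc_id)
    return {arg: sorted(ids) for arg, ids in seen.items() if len(ids) > 1}


# Made-up triples, for illustration only.
facts = [
    Fact("the senator", "campaigned in", "Iowa", doc_id=0),
    Fact("the governor", "held a rally in", "Iowa", doc_id=1),
    Fact("the governor", "criticized", "the nuclear deal", doc_id=1),
]
shared = cross_document_arguments(facts)
print(shared)
```

Phrases shared across documents, such as "iowa" in this toy input, are the kind of anchor through which connections across articles can be surfaced.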

Fact Filtering Module

The facts produced by the fact extraction module are then filtered so that only the most salient, confident, and compatible facts are retained. That is, the fact filtering module aims to hide less representative facts in the visualization. The filtered results can be retrieved from here.
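A filter of this kind could look roughly like the sketch below. The concrete criteria are assumptions chosen for illustration, not the criteria used by the system: a confidence threshold stands in for "confident", a mention of at least one topic word stands in for "salient", and a cap on argument length removes overly long phrases. The "compatible" criterion is not modelled here. The function name filter_facts, its parameters, and the sample facts are hypothetical.

```python
def filter_facts(facts, topic_words, min_confidence=0.7, max_arg_tokens=5):
    """Keep facts that pass three assumed checks: extractor confidence,
    short arguments, and a mention of at least one topic word.

    Each fact is a (subject, predicate, object, confidence) tuple.
    """
    topic = {w.lower() for w in topic_words}
    kept = []
    for subj, pred, obj, conf in facts:
        if conf < min_confidence:
            continue  # not confident enough
        if len(subj.split()) > max_arg_tokens or len(obj.split()) > max_arg_tokens:
            continue  # argument too long to display as a graph node
        tokens = set((subj + " " + obj).lower().split())
        if not tokens & topic:
            continue  # no topic word in either argument
        kept.append((subj, pred, obj, conf))
    return kept


# Made-up facts and confidence scores, for illustration only.
facts = [
    ("the senator", "announced", "a presidential campaign", 0.92),
    ("it", "is", "a thing that many people said about the long election night", 0.95),
    ("the governor", "visited", "a local school", 0.88),
    ("the candidate", "won", "the election debate", 0.41),
]
kept = filter_facts(facts, ["election", "presidential", "campaign"])
print(kept)
```

Only the first fact survives: the second has an overly long object, the third mentions no topic word, and the fourth falls below the confidence threshold.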

Conceptual Graph Construction Module

Expert annotators merge potential entities and concepts that stem from the fact filtering process. A heuristic is then employed to guarantee that the final graph is connected and has a high coverage rate of topic concepts, although it might not find the subset of concepts that has the highest total importance score.
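The trade-off mentioned above, connectivity at the possible expense of total importance, can be illustrated with a simple greedy expansion. This is a sketch under stated assumptions and not necessarily the heuristic used by the system: it starts from the most important concept and repeatedly adds the most important concept adjacent to the current selection. The function name greedy_connected_subgraph, the budget parameter, and the toy graph with its importance scores are hypothetical.

```python
def greedy_connected_subgraph(edges, importance, budget):
    """Pick up to `budget` concepts that form a connected subgraph,
    greedily preferring concepts with a high importance score."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if not adjacency or budget <= 0:
        return []
    # Start from the most important concept (name breaks ties).
    start = max(adjacency, key=lambda n: (importance.get(n, 0.0), n))
    selected = [start]
    chosen = {start}
    while len(selected) < budget:
        # Candidates are the neighbours of what has been selected so far,
        # so every addition keeps the subgraph connected.
        frontier = {n for s in chosen for n in adjacency[s]} - chosen
        if not frontier:
            break
        best = max(frontier, key=lambda n: (importance.get(n, 0.0), n))
        selected.append(best)
        chosen.add(best)
    return selected


# Made-up concept graph and scores, for illustration only.
edges = [
    ("election", "candidate"),
    ("candidate", "debate"),
    ("election", "poll"),
    ("scandal", "fine"),
]
importance = {
    "election": 0.9, "candidate": 0.8, "debate": 0.5,
    "poll": 0.4, "scandal": 0.85, "fine": 0.3,
}
selected = greedy_connected_subgraph(edges, importance, budget=3)
print(selected)
```

In this toy graph the selection is connected, but it skips "scandal", which has the second-highest score and is not reachable from "election". The three concepts with the highest total score would therefore not form a connected graph, which is the kind of compromise the paragraph above describes.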

Downloads

Our system is designed to operate on large document collections that can then be searched for specific documents to be analysed. Our current corpus consists of 734,488 news articles and 265,512 blog articles, in total around 1 million English-language articles, with an average article length of 405 words. The articles stem from a variety of news sources and were released by Signal Media during a period of one month (1–30 September 2015).

Acknowledgement

This paper was supported in part by grants from the Natural Science Foundation of China (No. 61572111), the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015408), a 985 Project of UESTC (No. A1098531023601041), and the Fundamental Research Funds for the Central Universities of China (No. A03017023701). Gerard de Melo's research is funded in part by ARO grant no. W911NF-17-C-0098 (DARPA SocialSim program).

If you are interested in our work or have any questions, do not hesitate to contact me at shengyp2011@163.com.

Thanks and enjoy the journey.