Ou, S., Khoo, C.S.G., & Goh, D. (2008). Design and development of a concept-based multi-document summarization system for research abstracts. Journal of Information Science, 34(3), 308-326.


This paper describes a concept-based multi-document summarization system that was developed to summarize sets of dissertation abstracts in sociology that might be retrieved by an information retrieval system or Web search engine in response to a user query.

The summarization method developed in this study is a hybrid method comprising four major steps:

  1. Macro-level discourse parsing: An automatic discourse parsing method was developed to segment a dissertation abstract into several macro-level sections and identify which sections contain important research information;
  2. Information extraction: An information extraction method was developed to extract research concepts and relationships as well as other kinds of information from the micro-level structure (within sentences);
  3. Information integration: An information integration method was developed to integrate similar concepts and relationships extracted from different abstracts;
  4. Summary presentation: A presentation method was developed to combine and organize the different kinds of information using a variable-based framework, and present them in an interactive Web-based interface.

Each of the major steps was evaluated by comparing the system-generated output against human coding.
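The precision, recall and F-measure figures reported below can be read as set comparisons between system output and human coding. The following is a minimal sketch of that calculation, assuming exact matching between items; the paper's summary here does not state the matching criterion, and the function name and toy data are illustrative only.

```python
def precision_recall_f1(system_items, human_items):
    """Compare system output with human coding (assumes exact-match items)."""
    system, human = set(system_items), set(human_items)
    matched = len(system & human)  # items that both the system and the coder produced
    precision = matched / len(system) if system else 0.0
    recall = matched / len(human) if human else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data invented for illustration; not taken from the study.
system_terms = ["academic achievement", "parental involvement", "school", "study"]
human_terms = ["academic achievement", "parental involvement", "peer influence"]

p, r, f = precision_recall_f1(system_terms, human_terms)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under this reading, a system that extracts many candidates tends to score high recall and low precision, because extra items enlarge the denominator of precision without adding matches.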

In discourse parsing, a decision tree classifier was developed to categorize sentences into five standard sections. The system obtained an overall accuracy of 63%, noticeably lower than the inter-coder agreement of 80%. However, it reached a high accuracy of 91% in identifying the research objectives and research results sections. In the future, other supervised learning techniques such as support vector machines (SVM) and Naive Bayes will be investigated.

In term extraction, we used a rule-based method employing syntactic rules to extract multi-word terms. The system obtained a high recall of 90% for extracting important concepts from dissertation abstracts, but the precision of 46% was low. Among the extracted terms, we treated those extracted from the research objectives and research results sections as research concept terms. Furthermore, we identified contextual relation terms and research method terms throughout the whole text using cue phrases. The accuracy obtained was good: 86% precision and 90% recall for contextual relations, and 97% precision and 72% recall for research methods.
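As an illustration of what a syntactic rule for multi-word terms can look like, the sketch below collects runs of adjectives and nouns that end in a noun from a sentence that is already part-of-speech tagged. The rule, the Penn Treebank-style tags, and the example sentence are assumptions made for this sketch; the actual rules used in the study are not given here.

```python
NOUN_TAGS = {"NN", "NNS"}
MODIFIER_TAGS = {"JJ"} | NOUN_TAGS  # adjectives and nouns may appear inside a term


def extract_multiword_terms(tagged_tokens):
    """Return runs of adjectives/nouns that end in a noun and span 2+ words."""
    terms, current = [], []
    # A sentinel token closes any run still open at the end of the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in MODIFIER_TAGS:
            current.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so the term ends in a noun.
        while current and current[-1][1] not in NOUN_TAGS:
            current.pop()
        if len(current) >= 2:
            terms.append(" ".join(w for w, _ in current))
        current = []
    return terms


# Hand-tagged example sentence, invented for illustration.
tagged = [
    ("The", "DT"), ("study", "NN"), ("examines", "VBZ"), ("the", "DT"),
    ("effect", "NN"), ("of", "IN"), ("parental", "JJ"), ("involvement", "NN"),
    ("on", "IN"), ("academic", "JJ"), ("achievement", "NN"), ("of", "IN"),
    ("rural", "JJ"), ("students", "NNS"),
]
terms = extract_multiword_terms(tagged)
print(terms)
```

A purely syntactic rule of this kind accepts any well-formed noun phrase, whether or not it names an important concept, which is consistent with the high-recall, low-precision profile reported above.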

In relationship extraction, we pre-constructed a set of relationship patterns and performed pattern matching to identify the text segments that match the patterns. The method obtained a high precision of 81%, but the recall of 55% was low. In the future, relationships across sentences and implied relationships without clear cue phrases will be explored.
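The pattern-matching idea can be sketched with a few regular expressions keyed on cue phrases. The two patterns, their labels, and the example sentences below are hypothetical and far simpler than a real pattern set; they only show why such matching tends to be precise when a cue phrase is present and silent when it is not.

```python
import re

# Hypothetical cue-phrase patterns; each captures the two related concepts.
RELATION_PATTERNS = [
    ("cause-effect", re.compile(
        r"\b(?:effect|impact|influence)s? of (?P<a>.+?) on (?P<b>.+?)(?=[.,;]|$)",
        re.IGNORECASE)),
    ("correlation", re.compile(
        r"\b(?:relationship|association|correlation) between (?P<a>.+?) and (?P<b>.+?)(?=[.,;]|$)",
        re.IGNORECASE)),
]


def extract_relationships(sentence):
    """Return (label, concept_a, concept_b) for every pattern match in a sentence."""
    found = []
    for label, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(sentence):
            found.append((label, match.group("a").strip(), match.group("b").strip()))
    return found


explicit = extract_relationships(
    "This study examines the effect of parental involvement on academic achievement.")
between = extract_relationships(
    "We explore the relationship between family income and school attendance.")
implied = extract_relationships("Family income shapes school attendance.")
print(explicit, between, implied)
```

The third sentence expresses a relationship without any of the listed cue phrases, so nothing is extracted. Cases like this lower recall, and they are the kind of implied relationship the paragraph above names as future work.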

In information integration, we performed only syntactic-level generalization, since it is easy to implement and does not require an ontology, taxonomy or thesaurus. Although the system's clustering was more similar to each human coding (e.g. F-measure = 51.6) than the human codings were to each other (e.g. F-measure = 46.7), such generalization is not very accurate because it does not consider the semantic meanings of concepts.
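One simple surface-level way to generalize terms without any semantic resource is to group them by a shared word. The sketch below groups multi-word terms by their last word, treated as the head; this grouping criterion is an assumption chosen for illustration and may differ from the generalization procedure actually used in the system.

```python
from collections import defaultdict


def group_by_head_word(terms):
    """Cluster terms that share the same final word (assumed to be the head noun)."""
    groups = defaultdict(list)
    for term in terms:
        words = term.lower().split()
        if not words:
            continue  # skip empty strings
        groups[words[-1]].append(term)
    return dict(groups)


# Toy terms invented for illustration.
terms = [
    "academic achievement", "student achievement", "parental involvement",
    "school achievement", "community involvement", "educational attainment",
]
clusters = group_by_head_word(terms)
for head, members in clusters.items():
    print(head, "->", members)
```

Here "educational attainment" ends up in its own cluster even though it is close in meaning to the "achievement" terms, which illustrates why grouping on surface form alone is limited without semantic knowledge.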

User evaluation of the summary presentation is reported in the companion paper:

  • Ou, S., Khoo, C.S.G., & Goh, D. (2007). Automatic multi-document summarization of research abstracts: Design and user evaluation. Journal of the American Society for Information Science & Technology, 58(10), 1419-1435.

Development of the taxonomy for concept filtering and generalization was reported in:

  • Ou, S., Khoo, C.S.G., & Goh, D. (2005). Constructing a taxonomy to support multi-document summarization of dissertation abstracts. Journal of Zhejiang University: Science, 6A(11), 1258-1267.