My projects in NLP and text mining are currently focused on:

  • Multi-document summarization of research papers. Current work is focused on graphical visualization of social science research abstracts using a graph database.
  • Writing analytics: linguistic and content analysis of academic papers, with the aim of developing automatic tools that support academic writing and the analysis of academic texts. See the page on Academic Writing for more details.
  • Information extraction of adverse drug reactions from consumer drug reviews.
  • Information extraction from police charge sheets, and development of a predictive model for criminal sentences.
  • Sentiment analysis. Current work is focused on developing sentiment analysis resources and methods.

Project descriptions below:

Project: Multi-document summarization of research papers

The approach I’ve taken is not the usual one of sentence extraction. Instead, it combines information extraction, identification of concepts and conceptual relations, modelling of the domain using ontologies, and text generation to output a summary in the form of a literature review. I’m focusing on the domain of social science research.

Current work is focused on graphical visualization of research results in a set of social science research abstracts using a graph database. The abstracts of social science journal articles are converted to a dependency tree representation using the Stanford parser (i.e. the sentences are converted to a graph form that represents the syntactic/grammatical relations between the words). This representation is stored in a graph database (Neo4j). Graph matching operations are used to convert the syntactic representations of sentences into semantic (meaning) network representations. The sentence representations are then linked and merged to form more abstract knowledge representations. The purpose of the study is to develop graph mining methods to summarize multiple journal articles into an overview representation, and to synthesize knowledge and infer new knowledge.
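The steps described above (dependency edges, graph matching, merging) can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the dependency edges are written by hand rather than produced by the Stanford parser, an in-memory dictionary stands in for Neo4j, and the function names and the single subject-verb-object pattern are hypothetical, not the matching rules used in the project.

```python
from collections import defaultdict

# Hand-written dependency edges (head, relation, dependent) for the toy
# sentence "Stress reduces performance". Illustrative only, not parser output.
edges = [
    ("reduces", "nsubj", "Stress"),
    ("reduces", "dobj", "performance"),
]


def build_graph(dependency_edges):
    """Index each head word's dependents by relation label.

    An in-memory stand-in for storing the dependency tree in a graph database.
    """
    graph = defaultdict(dict)
    for head, relation, dependent in dependency_edges:
        graph[head][relation] = dependent
    return graph


def match_subject_verb_object(graph):
    """Match heads that have both an nsubj and a dobj edge, and emit
    (subject, predicate, object) triples as a simple semantic representation."""
    triples = []
    for head, relations in graph.items():
        if "nsubj" in relations and "dobj" in relations:
            triples.append((relations["nsubj"], head, relations["dobj"]))
    return triples


def merge_triples(triple_lists):
    """Merge triples from several sentences, counting how often each
    (case-normalized) assertion occurs."""
    counts = defaultdict(int)
    for triples in triple_lists:
        for subject, predicate, obj in triples:
            counts[(subject.lower(), predicate.lower(), obj.lower())] += 1
    return dict(counts)


sentence_triples = match_subject_verb_object(build_graph(edges))
merged = merge_triples([sentence_triples, [("stress", "reduces", "performance")]])
```

In this sketch, merging simply counts repeated assertions across sentences. Linking and merging into more abstract knowledge representations, as described above, would require considerably more than case normalization.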

Earlier work was carried out in collaboration with two former PhD students, Dr Ou Shiyan and Dr Kokil Jaidka.

The work with Dr Ou Shiyan proposed a “variable-based framework” for representing research concepts and relationships as well as contextual relations and research methods in a set of related dissertation abstracts. The framework contains four kinds of information:

  • Main concepts: The common research concepts, often operationalized as research variables.
  • Research relationships between concepts: For each main concept, the descriptive attribute values or relationships with other concepts (e.g. correlations and cause-effect relationships) investigated in different dissertation abstracts.
  • Contextual relations: Concepts and relationships in the perception, attitude, insight, etc. of a target population, or in the context, framework, model, theory, etc.
  • Research methods: One or more research methods used to explore the attributes of concepts and relationships, including research design, sampling, and data measurement & analysis method.

This framework has been expanded and is being applied to my current work on Academic Writing.
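As a rough illustration, the four kinds of information in the framework could be held in a record like the one below. This is a hypothetical Python sketch: the class names, field names, and example values are assumptions made for illustration, not the representation used in the published system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Relationship:
    """A relationship between a main concept and another concept."""
    kind: str            # e.g. "correlation" or "cause-effect"
    other_concept: str
    source: str          # identifier of the abstract that reports it


@dataclass
class ConceptRecord:
    """One main concept with the other three kinds of information attached."""
    concept: str
    relationships: List[Relationship] = field(default_factory=list)
    contextual_relations: List[str] = field(default_factory=list)
    research_methods: List[str] = field(default_factory=list)

    def sources(self):
        """Return the sorted, de-duplicated abstracts that mention this concept."""
        return sorted({rel.source for rel in self.relationships})


# Invented example values, for illustration only.
record = ConceptRecord(
    concept="job satisfaction",
    relationships=[
        Relationship("correlation", "salary", "abstract-2"),
        Relationship("cause-effect", "staff turnover", "abstract-1"),
        Relationship("correlation", "workload", "abstract-2"),
    ],
    contextual_relations=["perception of school teachers"],
    research_methods=["survey", "regression analysis"],
)
```

Grouping relationships under each main concept is one way to bring together findings on the same concept from different abstracts.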

The subsequent work with Dr Kokil Jaidka sought to produce multi-document summaries that read like literature reviews. To that end, we analysed human-written literature reviews to identify the document structure (macro-level structure) of literature reviews, rhetorical functions/relations used, and information selection tactics used by authors to select information and text from the source papers to include in their literature reviews. The work is being continued under the topic of Academic Writing.

Selected papers:

  • Jaidka, K., Khoo, C.S.G., & Na, J.C. (2013). Literature review writing: How information is selected and transformed. Aslib Proceedings, 65(3), 303-325. [PDF]
  • Khoo, C.S.G., Na, J.C., & Jaidka, K. (2011). Analysis of the macro-level discourse structure of literature reviews. Online Information Review, 35(2), 255-271. [PDF]
  • Jaidka, K., Khoo, C.S.G., & Na, J.C. (2010). Imitating human literature review writing: An approach to multi-document summarization. In 12th International Conference on Asia-Pacific Digital Libraries, ICADL 2010: Proceedings (Lecture Notes in Computer Science 6102, pp. 116-119). Berlin: Springer.
  • Ou, S., Khoo, C.S.G., & Goh, D. (2008). Design and development of a concept-based multi-document summarization system for research abstracts. Journal of Information Science, 34(3), 308-326. [PDF]
  • Ou, S., Khoo, C.S.G., & Goh, D. (2007). Automatic multi-document summarization of research abstracts: Design and user evaluation. Journal of the American Society for Information Science & Technology, 58(10), 1419-1435. [PDF]
  • Ou, S., Khoo, C.S.G., & Goh, D. (2005). Constructing a taxonomy to support multi-document summarization of dissertation abstracts. Journal of Zhejiang University: Science 6A(11), 1258-1267.

Project: Sentiment analysis

My current work is focused on:

  • Development and application of sentiment lexicons and sentiment analysis resources.
  • A general-purpose English sentiment lexicon called WKWSCI Sentiment Lexicon v1.1 (named after the authors’ school, the Wee Kim Wee School of Communication & Information) is available for download. The lexicon is based on the 12dicts common American English word lists compiled by Alan Beale from twelve source dictionaries. The lexicon contains 29,729 words tagged with 4 parts-of-speech: adjective, adverb, noun, and verb. The lexicon comprises 3,187 positive words, 7,247 negative words and 19,295 neutral words. WKWSCI Sentiment Lexicon v1.0 was described and compared with five existing lexicons in the paper “Lexicon-Based Sentiment Analysis: Comparative Evaluation of Six Sentiment Lexicons”. Version 1.1 includes some improvements resulting from the reported study.
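To show how a lexicon of this kind (words tagged with a part of speech and a polarity) might be applied, here is a minimal Python sketch of lexicon-based scoring. The lexicon entries, the +1/0/-1 polarity coding, and the function names are all assumptions made for illustration. They do not reflect the actual contents or file format of the WKWSCI Sentiment Lexicon, nor the method evaluated in the paper.

```python
# Toy lexicon: (word, part of speech) -> polarity.
# Invented entries; +1 = positive, -1 = negative, 0 = neutral (assumed coding).
toy_lexicon = {
    ("excellent", "adjective"): 1,
    ("happily", "adverb"): 1,
    ("failure", "noun"): -1,
    ("disappoint", "verb"): -1,
    ("table", "noun"): 0,
}


def score(tagged_tokens, lexicon):
    """Sum the polarity of each (word, part of speech) pair.

    Pairs not found in the lexicon contribute 0.
    """
    total = 0
    for word, pos in tagged_tokens:
        total += lexicon.get((word.lower(), pos), 0)
    return total


def classify(tagged_tokens, lexicon):
    """Label the text by the sign of its summed polarity score."""
    total = score(tagged_tokens, lexicon)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Tokens are assumed to be already part-of-speech tagged by some upstream tool.
review = [("Excellent", "adjective"), ("table", "noun"), ("failure", "noun"),
          ("happily", "adverb")]
label = classify(review, toy_lexicon)
```

Keying on the part of speech as well as the word lets the same word form carry different polarities in different grammatical roles. Negation and intensifiers are deliberately left out of this sketch.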

I have worked in the following areas in collaboration with Dr Jin-Cheon Na:

  • Domains: product reviews, movie reviews, drug reviews, news text
  • Genre: formal text (expert reviews, news articles) and social media (discussion forum, user reviews, blog, microblog)
  • Granularity level: overall document sentiment, sentence level, clause level vs. sentiment towards various aspects
  • Presentation: scoring vs. visualization vs. text summary

Selected papers:

  • Khoo, C.S.G., & Johnkhan, S.B. (early view). Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science. (The accepted pre-publication version is available at the NTU digital repository.)
  • Khoo, C.S.G., Johnkhan, Sathik B., & Na, J.C. (2015). Evaluation of a general-purpose sentiment lexicon on a product review corpus. In R.B. Allen, J. Hunter, & M.L. Zeng (Eds.), Digital libraries: Providing quality information: 17th International Conference on Asia-Pacific Digital Libraries, ICADL2015: Proceedings (LNCS 9469, pp. 82–93). Berlin: Springer. [PDF]
  • Khoo, C.S.G., Nourbakhsh, A., & Na, J.C. (2012). Sentiment analysis of news text: A case study of appraisal theory. Online Information Review, 36(6), 858-878. [PDF]
  • Na, J.C., Kyaing, W.Y.M., Khoo, C., Foo, S., Chang, Y.-K., & Theng, Y.L. (2012). Sentiment classification of drug reviews using a rule-based linguistic approach. In Proceedings of ICADL (International Conference on Asian Digital Libraries) 2012, Taipei (Lecture Notes in Computer Science, v. 7634, pp. 189-198). Berlin: Springer-Verlag.
  • Goeuriot, L., Na, J.C., Kyaing, W.Y.M., Khoo, C.S.G., Chang, Y.K., Theng, Y.L., & Kim, J.J. (2012). Sentiment lexicons for health-related opinion mining. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium (pp. 219-225). New York: ACM.
  • Na, J.C., Thet, T.T., Khoo, C.S.G., & Kyaing, W.Y.M. (2011). Visual sentiment summarization of movie reviews. In Proceedings of ICADL 2011 (Lecture Notes in Computer Science, v. 7008, pp. 277-287). Berlin: Springer Verlag.
  • Na, J.C., Thet, T.T., & Khoo, C.S.G. (2010). Comparing sentiment expression in movie reviews from four online genres. Online Information Review, 34(2), 317-338.
  • Thet, T.T., Na, J.C., & Khoo, C. (2010). Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science, 36(6), 823-848.

Project: Information extraction of conceptual relations

I have an abiding interest in conceptual relations and semantic relations, especially cause-effect relations. I’ve not been actively working on information extraction of relationship information for a while, but plan to get back to it soon.

In the “early days”, I worked on information extraction of cause-effect relations for the purpose of information retrieval:

  • Khoo, C., Chan, S., & Niu, Y. (2000). Extracting causal knowledge from a medical database using graphical patterns. In ACL-2000: 38th Annual Meeting of the Association for Computational Linguistics, 1-8 October 2000, Hong Kong (pp. 336-343). New Brunswick, NJ: Association for Computational Linguistics.
  • Khoo, C., Chan, S., Niu, Y., & Ang, A. (1999). A method for extracting causal knowledge from textual databases. Singapore Journal of Library & Information Management, 28, 48-63. [PDF]
  • Khoo, C., Myaeng, S.H., & Oddy, R. (2001). Using cause-effect relations in text to improve information retrieval precision. Information Processing and Management, 37(1), 119-145.
  • Khoo, C., Kornfilt, J., Oddy, R., & Myaeng, S.H. (1998). Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary & Linguistic Computing, 13(4), 177-186. [PDF]

Some survey papers which may be useful:

  • Khoo, C., & Na, J.C. (2006). Semantic Relations in Information Science. Annual Review of Information Science and Technology, 40, 157-228. [PDF]
  • Khoo, C., Chan, S., & Niu, Y. (2002). The many facets of the cause-effect relation. In R. Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 51-70). Dordrecht: Kluwer. [PDF]
  • Khoo, C., & Myaeng, S.H. (2002). Identifying semantic relations in text for information retrieval and information extraction. In R. Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 161-180). Dordrecht: Kluwer. [PDF]