My projects in NLP and text mining are currently focused on:

  • Multi-document summarization of research papers, especially literature review generation.
  • Linguistic and content analysis of research papers, especially the introduction and literature review sections.
  • Sentiment analysis of health-related social media sites and political news.
  • Information extraction of conceptual relations, especially cause-effect information.

Project: Evidence-based teaching of literature review writing (MOE-Tertiary Education Research funding for 2 years)

The literature review is an important and pervasive type of academic writing. Students have to include a literature review in term papers, project reports, research proposals and dissertations. A literature review is more than just a summary of previous research on a particular topic. It is difficult to write a good literature review as it requires critical thinking, argumentation and writing skills. The process of literature review writing includes assessing and selecting relevant information from previous research papers, integrating the information (e.g., comparing and generalizing reported research results), synthesizing arguments to justify the current research, and presenting the arguments in coherent and persuasive text.

It is difficult to teach literature review writing because of the different types of intellectual activities and skills involved. Instructors of report writing need more resources in the form of best-practice patterns and examples, and online or computerized diagnostic aids, to give detailed guidance to students. The project will carry out in-depth linguistic and content analyses of literature reviews published in top journals in three fields (sociology, biological science and mechanical engineering), to identify the linguistic, informational and argumentation strategies used.

Based on the results of the analyses, online pedagogical resources will be developed, including a catalogue of linguistic, informational and argumentation patterns found in good literature reviews, together with specific examples. At least two e-learning modules will be developed: the first on constructing the discourse/rhetorical and argumentative structure of the literature review; and the second on the selection of information from the cited papers and the transformation and integration of the information into a literature review. A computerized diagnostic tool (computer program) will be developed that uses the catalogue of patterns to analyse student literature reviews, compare their profiles with those of published literature reviews, and suggest directions for improvement. Suitable evaluation metrics will be developed for use by the diagnostic tool. Finally, an evaluation study will be carried out in which classroom lessons on literature review writing (using the online resources and computerized diagnostic tool) are developed and applied in undergraduate courses on academic report writing. The performance of the “treatment” group receiving this new “evidence-based” teaching will be compared with that of a control group taught using the “traditional” method.

Project: Literature review generation (multi-document summarization of research papers)

Rather than the usual sentence extraction, the approach I’ve taken combines information extraction, identification of concepts and conceptual relations, modelling of the domain using ontologies, and text generation to output a summary in the form of a literature review. I’m focusing on the domain of social science research. This work has been carried out in collaboration with two former PhD students, Dr Ou Shiyan and Dr Kokil Jaidka.

The earlier work with Dr Ou Shiyan proposed a “variable-based framework” for representing research concepts and relationships as well as contextual relations and research methods in a set of related dissertation abstracts. The framework contains four kinds of information:

  • Main concepts: The common research concepts, often operationalized as research variables.
  • Research relationships between concepts: For each main concept, the descriptive attribute values or relationships with other concepts (e.g. correlations and cause-effect relationships) investigated in different dissertation abstracts.
  • Contextual relations: Concepts and relationships in the perception, attitude, insight, etc. of a target population, or in the context, framework, model, theory, etc.
  • Research methods: One or more research methods used to explore the attributes of concepts and relationships, including research design, sampling, and data measurement & analysis method.
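The four kinds of information listed above could be held in a small record structure like the one sketched below. This is only an illustration under assumptions: the class names, field names and the `merge_by_concept` helper are hypothetical and are not taken from the published system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Relationship:
    """A research relationship between a main concept and another concept."""
    other_concept: str
    relation_type: str    # e.g. "correlation" or "cause-effect"
    source_abstract: str  # identifier of the abstract that reports it


@dataclass
class ConceptRecord:
    """Information gathered about one main concept (hypothetical layout)."""
    main_concept: str
    relationships: List[Relationship] = field(default_factory=list)
    contextual_relations: List[str] = field(default_factory=list)
    research_methods: List[str] = field(default_factory=list)

    def sources(self) -> Set[str]:
        """Identifiers of the abstracts that report a relationship."""
        return {r.source_abstract for r in self.relationships}


def merge_by_concept(records: Iterable[ConceptRecord]) -> Dict[str, ConceptRecord]:
    """Combine per-abstract records that share the same main concept."""
    merged: Dict[str, ConceptRecord] = {}
    for rec in records:
        target = merged.setdefault(rec.main_concept, ConceptRecord(rec.main_concept))
        target.relationships.extend(rec.relationships)
        target.contextual_relations.extend(rec.contextual_relations)
        target.research_methods.extend(rec.research_methods)
    return merged


# Made-up example data: two abstracts that study the same concept.
a1 = ConceptRecord(
    "job satisfaction",
    relationships=[Relationship("turnover", "correlation", "A1")],
    research_methods=["survey"],
)
a2 = ConceptRecord(
    "job satisfaction",
    relationships=[Relationship("salary", "cause-effect", "A2")],
    contextual_relations=["perception of nurses"],
    research_methods=["interview"],
)
combined = merge_by_concept([a1, a2])
```

Grouping by main concept in this way puts everything reported about one variable across abstracts in a single place, which is the view a summary organised around concepts would need.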

Later work with Dr Kokil Jaidka sought to produce multi-document summaries that read like literature reviews. To that end, we analysed human-written literature reviews to identify the document structure (macro-level structure) of literature reviews, rhetorical functions/relations used, and information selection tactics used by authors to select information and text from the source papers to include in their literature reviews. The work is ongoing.

Selected papers:

  • Jaidka, K., Khoo, C.S.G., & Na, J.C. (2013). Literature review writing: How information is selected and transformed. Aslib Proceedings, 65(3), 303-325. [PDF]
  • Khoo, C.S.G., Na, J.C., & Jaidka, K. (2011). Analysis of the macro-level discourse structure of literature reviews. Online Information Review, 35(2), 255-271. [PDF]
  • Jaidka, K., Khoo, C.S.G., & Na, J.C. (2010). Imitating human literature review writing: An approach to multi-document summarization. In 12th International Conference on Asia-Pacific Digital Libraries, ICADL 2010: Proceedings (Lecture Notes in Computer Science 6102, pp. 116-119). Berlin: Springer.
  • Ou, S., Khoo, C.S.G., & Goh, D. (2008). Design and development of a concept-based multi-document summarization system for research abstracts. Journal of Information Science, 34(3), 308-326. [PDF]
  • Ou, S., Khoo, C.S.G., & Goh, D. (2007). Automatic multi-document summarization of research abstracts: Design and user evaluation. Journal of the American Society for Information Science & Technology, 58(10), 1419-1435. [PDF]
  • Ou, S., Khoo, C.S.G., & Goh, D. (2005). Constructing a taxonomy to support multi-document summarization of dissertation abstracts. Journal of Zhejiang University: Science, 6A(11), 1258-1267.

Project: Sentiment analysis

I have worked in the following areas in collaboration with Dr Jin-Cheon Na:

  • Domains: product reviews, movie reviews, drug reviews, news text
  • Genre: formal text (expert reviews, news articles) and social media (discussion forum, user reviews, blog, microblog)
  • Granularity level: overall document sentiment, sentence level, clause level vs. sentiment towards various aspects
  • Presentation: scoring vs. visualization vs. text summary

My current work is focused on:

  • Development and application of sentiment lexicons. A 30,000-word English sentiment lexicon called the WKWSCI Sentiment Lexicon has been completed and will be made available for download soon. A paper evaluating the lexicon in comparison with five other lexicons is under review. Future work on the lexicon includes adding multiword terms and developing versions customized to product reviews, health-related social media postings and news text.
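As a rough illustration of how a sentiment lexicon can be applied to text, the sketch below sums word scores and flips the score of a word that directly follows a negator. The lexicon entries, the scoring scale and the function names are all hypothetical; they are not entries or methods from the WKWSCI Sentiment Lexicon.

```python
import re

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"good": 1, "excellent": 2, "bad": -1, "terrible": -2}
NEGATORS = {"not", "no", "never"}


def score_text(text, lexicon=TOY_LEXICON):
    """Sum lexicon scores, flipping a word's score if a negator precedes it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    # Pair each token with the token immediately before it.
    for prev, tok in zip([""] + tokens, tokens):
        value = lexicon.get(tok, 0)
        if prev in NEGATORS:
            value = -value
        total += value
    return total


def classify(text, lexicon=TOY_LEXICON):
    """Map the summed score to a coarse polarity label."""
    score = score_text(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A customized version of a lexicon, such as one for drug reviews, would in this sketch simply be a different dictionary passed as the `lexicon` argument.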

Selected papers:

  • Khoo, C.S.G., Johnkhan, Sathik B., & Na, J.C. (2015). Evaluation of a general-purpose sentiment lexicon on a product review corpus. In R.B. Allen, J. Hunter, & M.L. Zeng (Eds.), Digital libraries: Providing quality information: 17th International Conference on Asia-Pacific Digital Libraries, ICADL2015: Proceedings (LNCS 9469, pp. 82–93). Berlin: Springer. [PDF]
  • Khoo, C.S.G., Nourbakhsh, A., & Na, J.C. (2012). Sentiment analysis of news text: A case study of appraisal theory. Online Information Review, 36(6), 858-878. [PDF]
  • Na, J.C., Kyaing, W.Y.M., Khoo, C., Foo, S., Chang, Y.-K., & Theng, Y.L. (2012). Sentiment classification of drug reviews using a rule-based linguistic approach. In Proceedings of ICADL (International Conference on Asian Digital Libraries) 2012, Taipei (Lecture Notes in Computer Science, v. 7634, pp. 189-198). Berlin: Springer-Verlag.
  • Goeuriot, L., Na, J.C., Kyaing, W.Y.M., Khoo, C.S.G., Chang, Y.K., Theng, Y.L., & Kim, J.J. (2012). Sentiment lexicons for health-related opinion mining. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium (pp. 219-225). New York: ACM.
  • Na, J.C., Thet, T.T., Khoo, C.S.G., & Kyaing, W.Y.M. (2011). Visual sentiment summarization of movie reviews. In Proceedings of ICADL 2011 (Lecture Notes in Computer Science, v. 7008, pp. 277-287). Berlin: Springer-Verlag.
  • Na, J.C., Thet, T.T., & Khoo, C.S.G. (2010). Comparing sentiment expression in movie reviews from four online genres. Online Information Review, 34(2), 317-338.
  • Thet, T.T., Na, J.C., & Khoo, C. (2010). Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of Information Science, 36(6), 823-848.

Project: Information extraction of conceptual relations

I’ve an abiding interest in conceptual relations and semantic relations, especially cause-effect relations. I’ve not been actively working on information extraction of relationship information for a while, but plan to get back to it soon.

In the “early days”, I worked on information extraction of cause-effect relations for the purpose of information retrieval:

  • Khoo, C., Chan, S., & Niu, Y. (2000). Extracting causal knowledge from a medical database using graphical patterns. In ACL-2000: 38th Annual Meeting of the Association for Computational Linguistics, 1-8 October 2000, Hong Kong (pp. 336-343). New Brunswick, NJ: Association for Computational Linguistics.
  • Khoo, C., Chan, S., Niu, Y., & Ang, A. (1999). A method for extracting causal knowledge from textual databases. Singapore Journal of Library & Information Management, 28, 48-63. [PDF]
  • Khoo, C., Myaeng, S.H., & Oddy, R. (2001). Using cause-effect relations in text to improve information retrieval precision. Information Processing and Management, 37(1), 119-145.
  • Khoo, C., Kornfilt, J., Oddy, R., & Myaeng, S.H. (1998). Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary & Linguistic Computing, 13(4), 177-186. [PDF]
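To give a flavour of pattern-based extraction of cause-effect information, here is a minimal sketch that matches a few explicit causal cue phrases with regular expressions. It is a simplified stand-in and not the method reported in the papers above; the pattern list and the function name `extract_cause_effect` are assumptions made for illustration.

```python
import re

# A handful of explicit causal cue patterns (illustrative, far from complete).
CAUSAL_PATTERNS = [
    # "<effect> because of <cause>"
    re.compile(r"^(?P<effect>.+?)\s+because of\s+(?P<cause>.+?)\.?$", re.IGNORECASE),
    # "<cause> causes / leads to / results in <effect>"
    re.compile(
        r"^(?P<cause>.+?)\s+"
        r"(?:causes|caused|leads to|led to|results in|resulted in)\s+"
        r"(?P<effect>.+?)\.?$",
        re.IGNORECASE,
    ),
]


def extract_cause_effect(sentence):
    """Return a (cause, effect) pair for the first matching pattern, else None."""
    text = sentence.strip()
    for pattern in CAUSAL_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("cause"), match.group("effect")
    return None
```

Surface patterns like these only catch explicitly signalled relations in simple sentences; causation that is implied, or expressed through a wider range of verbs and constructions, needs richer linguistic analysis.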

Some survey papers which may be useful:

  • Khoo, C., & Na, J.C. (2006). Semantic Relations in Information Science. Annual Review of Information Science and Technology, 40, 157-228. [PDF]
  • Khoo, C., Chan, S., & Niu, Y. (2002). The many facets of the cause-effect relation. In R. Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 51-70). Dordrecht: Kluwer. [PDF]
  • Khoo, C., & Myaeng, S.H. (2002). Identifying semantic relations in text for information retrieval and information extraction. In R. Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 161-180). Dordrecht: Kluwer. [PDF]