Khoo, C.S.G., Stanley-Baker, M., Zakaria, F.B., Chen, J., Ang, S.Q.R., & Huang, B. (2023). Development of the Polyglot Asian Medicine Knowledge Graph System. In D.H. Goh, S.J. Chen, & S. Tuarob (Eds.), 25th International Conference on Asia-Pacific Digital Libraries (ICADL 2023) (Lecture Notes in Computer Science, vol. 14458, pp. 3-11). Springer. https://doi.org/10.1007/978-981-99-8088-8_1

Abstract: The Polyglot Asian Medicine system hosts a research database of Asian traditional and herbal medicines, represented as a knowledge graph and implemented in a Neo4j graph database system. The current coverage of the database is mainly traditional Chinese medicines with some Malay and Indonesian data, with plans to extend to other Southeast Asian communities. The knowledge graph currently links medicine names, in the original languages and in English, to alternate names and scientific names, to the plant/animal parts they are made from, to the literary and historical sources in which they are mentioned, to the geographic areas they are associated with, and to external database records. A novel graph visualization interface supports user searching, browsing and visual analysis. This is an example of representing a digital humanities research dataset as a knowledge graph for reference and research purposes. The paper describes how the knowledge graph was derived from a dataset comprising over 25 Microsoft Excel spreadsheets, and how the spreadsheet data were processed and mapped to the graph database using upload scripts in the Neo4j Cypher graph query language. The advantages of using a knowledge graph system to support user browsing and analysis through a graph visualization interface are illustrated. The paper describes issues encountered, solutions adopted, and lessons learned that can be applied to other digital humanities data.

Extracts: … we propose an alternative definition that distinguishes a knowledge graph from an ontology and emphasizes its support for human information seeking and information use. We propose that the focus of a knowledge graph is less on logical reasoning and more on connecting things in a graph (network) representation. The growth of graph databases has stimulated interest in these aspects of knowledge graphs. We informally characterize a knowledge graph as a network of nodes connected by directed links, where nodes represent resources (ideas, concepts and entities) and links represent semantic relations between them. The nodes are assigned meaning by labeling them with classes from a taxonomy and assigning them properties. The links are likewise labeled with relationship types and may be assigned properties as well.
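This informal characterization can be sketched in Cypher, the query language used later in the paper. The labels, properties and relationship type below are illustrative examples drawn from the paper's domain, not the project's actual schema:

```cypher
// Nodes labeled with classes (MedicineName, Plant) and assigned properties,
// connected by a directed, typed link that carries a property of its own.
// All names here are hypothetical, for illustration only.
CREATE (m:MedicineName {name: '人參', language: 'Chinese'})
CREATE (p:Plant {scientificName: 'Panax ginseng'})
CREATE (m)-[:IDENTIFIED_AS {certainty: 'probable'}]->(p)
```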

… For our knowledge graph implementation, we adopted the Neo4j graph database management system, a popular graph database software based on the labeled property graph model. A labeled property graph can be viewed as a lightweight alternative to RDF/OWL2. … A major difference is that in a labeled property graph, the links (relations) can be assigned properties. In an RDF/OWL2 ontology, a link with properties has to be represented as an intermediate node linked to the source and target nodes. A labeled property graph, as implemented in a Neo4j database, is schema-free (or schema-less): the database does not store a schema specifying mandatory properties for each node type or a datatype for each property. Nor does it store the domain-range and cardinality restrictions commonly found in RDF/OWL2 ontologies. Thus, a node or link can have any property (i.e., attribute-value pair). This makes it easier to represent digital humanities datasets that include data from multiple sources and in multiple languages, are stored in many spreadsheets, and are continually expanded with new types of data. It makes it possible for the knowledge graph to evolve with changing conceptualizations and ideas for analysis and application. However, some structure and a style guide need to be imposed on the data, outside of the graph database system.
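The difference over link properties can be made concrete with two Cypher sketches (node labels, properties and relationship types are hypothetical, not the project's schema):

```cypher
// Labeled property graph: the link itself carries the property.
MATCH (m:MedicineName {name: '人參'}), (s:Source {title: 'Bencao gangmu'})
CREATE (m)-[:MENTIONED_IN {chapter: 12}]->(s);

// The RDF/OWL2-style workaround, expressed here in Cypher for comparison:
// the statement becomes an intermediate node linked to source and target.
MATCH (m:MedicineName {name: '人參'}), (s:Source {title: 'Bencao gangmu'})
CREATE (x:Mention {chapter: 12})
CREATE (m)-[:HAS_MENTION]->(x)
CREATE (x)-[:IN_SOURCE]->(s);
```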

… The primary data storage is Google Sheets, which are used for data entry. Google Sheets has a publish-to-CSV function that dynamically converts a spreadsheet to a CSV file at a specified URL. Upload scripts (in Neo4j's Cypher graph query language) can then be submitted to the Neo4j database to retrieve the CSV file from the specified URL, process it, and map it to the knowledge graph. The graph database is thus used only for
searching and analysis, and not for data entry.
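A minimal sketch of such an upload script, assuming a published sheet with `name` and `scientific_name` columns (the URL pattern, column names and labels are hypothetical placeholders, not the project's actual scripts):

```cypher
// Retrieve the published CSV and merge each row into the graph.
// MERGE (rather than CREATE) avoids duplicating nodes when the
// same sheet is re-uploaded after edits.
LOAD CSV WITH HEADERS
FROM 'https://docs.google.com/spreadsheets/d/<sheet-id>/pub?output=csv' AS row
MERGE (m:MedicineName {name: row.name})
MERGE (p:Plant {scientificName: row.scientific_name})
MERGE (m)-[:IDENTIFIED_AS]->(p)
```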

… The dataset was originally stored in over 25 Microsoft Excel spreadsheets. We considered converting the spreadsheets to a relational database to ensure the referential integrity of the links (i.e., to check that the links refer to existing nodes) and also to impose some property constraints. As some of the spreadsheets have complex structures, conversion to a relational database in third normal form would necessitate decomposing each spreadsheet into multiple tables, with each table representing entities of a particular type. We decided that this would disrupt the researchers' mental model of the dataset and make data entry more difficult and error-prone (even with the use of data views and data-entry forms). We have found that digital humanities researchers are comfortable with spreadsheets, and it is more natural for them to enter and store related information in the same spreadsheet. So a significant decision was made to work with the researchers' spreadsheets, with minor adjustments. The spreadsheets were converted to Google Sheets to support collaborative data entry as well as direct upload to the graph database on Neo4j's AuraDB cloud database service. This section describes the issues encountered in processing the data and mapping them to nodes and links in the graph database.
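Although a graph database does not enforce referential integrity at load time, a check of this kind could be approximated after upload with a query along these lines (a sketch with illustrative labels, not the project's actual checks):

```cypher
// Flag medicine-name nodes that ended up with no link to any plant,
// i.e., spreadsheet rows whose reference did not resolve to an
// existing node during upload.
MATCH (m:MedicineName)
WHERE NOT EXISTS { (m)-[:IDENTIFIED_AS]->(:Plant) }
RETURN m.name AS unlinkedName
```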

… We have shared lessons learned in applying knowledge graph technologies to a digital humanities research dataset, to develop a Web application to support browsing, reference and research. We have found that the flexibility of the knowledge graph technologies
adopted—labeled property graph, graph database and graph visualization—together with Google Sheets as the primary data store, can meet the needs of digital humanities applications that have evolving datasets and changing conceptualizations.