A practical guide to building your Knowledge Organization System
July 21, 2022
By Veronique Moore, PhD
A data representation specialist at Elsevier shares tips for optimizing your KOS for users and reusability
What is a Knowledge Organization System?
A Knowledge Organization System (KOS) is a structured way to organize sets of synonyms about a specific domain, usually targeted to a specific use case. By sets of synonyms, I mean that a unique identifier (ID) groups together the terms that represent the same notion, such as the three different sets for the noun form of the word “fracture” below, as represented in Princeton University’s WordNet:
Each bullet point represents a different notion, which can be expressed by the word “fracture” in each case. A KOS helps make that distinction and also associates “geological fault” with the geological context of “fracture.” This association helps search engines retrieve documents about geological faults for a query on the word “fracture”; a KOS bridges the gap between different perspectives on the same notion.
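The idea of grouping terms under one identifier can be sketched in a few lines of Python. This is only an illustration: the notion IDs (`N001`, `N002`) and the term lists are hypothetical and hand-written, not WordNet's actual identifiers or data.

```python
# Hypothetical notion IDs; the terms per notion are illustrative, not WordNet data.
SYNSETS = {
    "N001": {"fracture", "break"},                      # injury to a bone
    "N002": {"fracture", "fault", "geological fault"},  # crack in the earth's crust
}


def notions_for(term):
    """Return the IDs of every notion that lists `term` among its synonyms."""
    needle = term.lower()
    return sorted(nid for nid, terms in SYNSETS.items() if needle in terms)


def synonyms_for(term):
    """Collect all terms that share at least one notion with `term`."""
    found = set()
    for nid in notions_for(term):
        found |= SYNSETS[nid]
    return found
```

Here the ambiguous word “fracture” resolves to two notion IDs, while “geological fault” resolves to one; a search engine could use `synonyms_for` to bridge the two wordings.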
WordNet also gives more context to this word “fracture” by linking the synonym set (synset) with its broader notion in the geological domain, along with other relationships:
This is an example of an organized structure of synonym sets, a KOS. The relationships highlighted above can be useful for use-cases such as:
Query expansion: Give me a set of documents on a broader topic than my search query, i.e., documents that contain entities in the list of “direct hypernym.”
Faceted search results display: Documents that contain entities from the list “direct hyponym.”
Recommender systems: Give me a set of related documents, i.e., documents that contain entities related to my search query (not featured above but could be along the lines of: minerals present in a fault, fault ecosystem …).
Question answering: What is an example of a geological fault? This could be documents containing entities from the list of “has instance.”
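As a rough sketch of how the first two use cases might consume hierarchical relationships, the Python below expands a query with the labels of a notion's direct broader or narrower neighbours. The notion IDs, labels and links are assumptions written by hand for illustration; they are not an export of WordNet.

```python
# Hypothetical notion IDs and labels, hand-written for illustration.
LABELS = {
    "geo:crack": ["crack", "cleft", "fissure"],
    "geo:fault": ["fault", "geological fault", "fracture"],
    "geo:strike_slip": ["strike-slip fault"],
}
# child -> parent links, i.e. the "direct hypernym" of each notion
BROADER = {"geo:fault": "geo:crack", "geo:strike_slip": "geo:fault"}


def narrower(notion_id):
    """Return the direct hyponyms of a notion."""
    return sorted(child for child, parent in BROADER.items() if parent == notion_id)


def expand_query(notion_id, direction="broader"):
    """Return the notion's own labels plus those of its direct neighbours."""
    if direction == "broader":
        neighbours = [BROADER[notion_id]] if notion_id in BROADER else []
    else:
        neighbours = narrower(notion_id)
    terms = list(LABELS[notion_id])
    for neighbour in neighbours:
        terms.extend(LABELS[neighbour])
    return terms
```

Calling `expand_query("geo:fault")` adds the labels of the broader notion for query expansion, while `expand_query("geo:fault", "narrower")` yields the labels that could feed facets.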
In a nutshell, a KOS or vocabulary helps you find and organize your content, helping your readers or customers discover it in a way that suits their needs.
How to build a KOS
Building a KOS involves three perspectives: the modeling choices, the coverage question and the lexical selection.
Modeling perspective: keep it simple
The modeling choice involves the selection of the types of relationships and properties you will need in your KOS. Relationships are predicates between entities (such as “treatment of” between a drug and a disease), while properties are attributes of a given entity (such as molecular weight or mol file representation for a molecule). The main types of relationships in a KOS are hierarchical, “related” and domain-specific.
The way to define the type of relationships or attributes you will need is based on your product’s or tool’s use case and your user’s information needs. Do you need to build a hierarchy or is a flat list going to serve your needs? If the latter is the case, keep it simple and go for a plain list. How would I know whether I need a hierarchy?
Besides being necessary for use cases such as query expansion or classification roll-up (use cases where the system needs access to broader notions than the current hits), a hierarchy also helps with KOS management. The hierarchy gives you an intuitive way to organize the notions in your KOS, which often number in the thousands or more, and helps you identify a knowledge gap or keeps you from adding a duplicate notion (creating a new ID for a notion that already exists in your KOS).
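Both benefits can be sketched in Python under simple assumptions: a hypothetical child-to-parent mapping stands in for the hierarchy, `roll_up` credits each tagged document to the broader notions as well, and `find_existing` is a naive label check to run before minting a new ID. All names and data here are illustrative.

```python
from collections import Counter

# Hypothetical hierarchy: child notion -> parent notion
PARENT = {"strike_slip_fault": "fault", "normal_fault": "fault", "fault": "crack"}


def ancestors(notion):
    """Walk up the hierarchy and return every broader notion, nearest first."""
    chain = []
    while notion in PARENT:
        notion = PARENT[notion]
        chain.append(notion)
    return chain


def roll_up(doc_tags):
    """Count documents per notion, crediting each document to its tags' ancestors too."""
    counts = Counter()
    for tags in doc_tags.values():
        covered = set()
        for tag in tags:
            covered.add(tag)
            covered.update(ancestors(tag))
        counts.update(covered)  # each document counts once per notion
    return counts


def find_existing(label, labels_by_notion):
    """Return the ID of a notion that already carries `label`, or None."""
    wanted = label.strip().lower()
    for notion, labels in labels_by_notion.items():
        if wanted in (known.lower() for known in labels):
            return notion
    return None
```

A real duplicate check would also need to handle spelling variants and near-synonyms, which is where browsing the hierarchy by hand still helps.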
Do you need domain-specific relationships, such as “treatment of” to link Drugs to Diseases in your content? Then implement them in your KOS; if not, stick to the minimal possible structure and add any type of relationship intentionally.
Marie Kondo your KOS: always make sure any addition to the structure brings you joy or purpose.
The same goes for properties linked to a notion: just because you know the boiling point of a liquid doesn’t mean you need to add it to your KOS. Only add the properties that will bring value to your tool or that will be useful to your customers. Do you have an interface to search for material properties? Then go wild and add as many as make sense for each entity.
Properties can have no value, or no known value, for some entities: your KOS only contains as much information as is available to you at any point. The fact that some entities cannot get a value for a property does not mean that no entity should have one.
Some modeling choices also relate to restrictions. For example, you might want to allow for just one birthdate for any instance of a person in your KOS. But for historical figures, several dates may be stated by different sources. Base your model on the reality of your data and as little as possible on preconceived ideas, unless you have strict constraints from your tool (such as: it allows for only one value for the birthdate of a person).
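One possible way to model these two points, optional properties and multi-valued birthdates, is sketched below with Python dataclasses. The class and field names (`SourcedValue`, `Person`, `occupation`) are hypothetical, and the single-birthdate check is only relevant if your tool imposes that constraint.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourcedValue:
    value: str
    source: str  # where the statement comes from


@dataclass
class Person:
    notion_id: str
    pref_label: str
    # Several birthdates may coexist when sources disagree.
    birth_dates: List[SourcedValue] = field(default_factory=list)
    # A property may simply be unknown for some entities.
    occupation: Optional[str] = None


def has_single_birth_date(person):
    """Optional restriction for a tool that accepts only one birthdate."""
    return len({d.value for d in person.birth_dates}) <= 1


# Placeholder data: two sources disagree about the same (fictional) person.
disputed = Person(
    "P1",
    "Example Person",
    [SourcedValue("1450", "Source A"), SourcedValue("1452", "Source B")],
)
```

Keeping the source next to each value lets the model follow the data, while `has_single_birth_date` flags the records a stricter tool could not ingest as they are.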
About future-proofing your KOS: making the model modular is a good practice that helps make the KOS reusable for purposes other than its original use case. Adding explicit metadata such as creation date and last update date helps with the maintenance and update of the KOS.
The modeling choices are mostly driven by the application use-cases:
What will my tool/product do with that information?
What type of pieces of information can it handle?
What will my customers need for the use case my tool/product is serving?
What metadata will I need to keep my KOS up to date with the field’s advances and my data’s expansion?
Once you have defined what relationships and properties you want to model, the question is: how should you represent them?
There are standard vocabularies used by the Linked Data community: using them makes your data standard, its semantics understood by other users and tools, and your tool compatible with other resources developed by Linked Data contributors. A great entry point to search for published vocabularies that can represent your modeling needs is the Linked Open Vocabularies (LOV) page and the recommendations of the World Wide Web Consortium (W3C) for publishing (linked) data on the Web.
In terms of editorial metadata, Dublin Core offers standard properties for most needs, together with the PROV family of standards to document provenance. Schema.org also has a set of properties to represent structured data on the Internet. Elsevier also has a set of properties defined to represent our company’s content in a standard way, making annotations and KOS compatible across the company.
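To make this concrete, here is a minimal sketch that serializes one notion as a JSON-LD-style dictionary using only the Python standard library. It assumes SKOS (a W3C vocabulary not named above, chosen here for labels and hierarchy) alongside Dublin Core terms for the creation and update dates; the `https://example.org/kos/...` identifiers and the function name are hypothetical.

```python
import json

# Namespace prefixes for the two standard vocabularies used in this sketch.
CONTEXT = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dcterms": "http://purl.org/dc/terms/",
}


def notion_to_jsonld(notion_id, pref_label, alt_labels,
                     broader=None, created=None, modified=None):
    """Describe one notion with SKOS labels and Dublin Core dates."""
    doc = {
        "@context": CONTEXT,
        "@id": notion_id,
        "@type": "skos:Concept",
        "skos:prefLabel": pref_label,
        "skos:altLabel": list(alt_labels),
    }
    # Only emit optional statements that are actually known.
    if broader:
        doc["skos:broader"] = {"@id": broader}
    if created:
        doc["dcterms:created"] = created
    if modified:
        doc["dcterms:modified"] = modified
    return doc


example = notion_to_jsonld(
    "https://example.org/kos/fault",        # hypothetical identifier
    "geological fault",
    ["fault", "fracture"],
    broader="https://example.org/kos/crack",  # hypothetical identifier
    created="2022-07-21",
)
print(json.dumps(example, indent=2))
```

A dedicated RDF library would be the more usual choice in production; plain dictionaries are used here only to keep the sketch free of dependencies.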
Coverage perspective: keep it tight
Orthogonal to the questions around modeling is the question of coverage: what entities should I put in my KOS? There are two aspects to that question: the content aspect and the users’ aspect.
Content aspect: Build your KOS to fit your content as well as possible. Ideally, use automatic extraction methods or a full-text index to get candidates for your KOS. Methods such as Rake, KeyBERT or BERTopic can be used to generate such candidate lists from digital content. With these lists, you will be able to see the main branches or aspects of the domain you need to cover in your KOS to bring your users the best discoverability experience on your content.
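The sketch below shows the general shape of candidate extraction using only the Python standard library. It is a deliberately simplified stand-in, not an implementation of Rake, KeyBERT or BERTopic: it counts stopword-free word sequences and keeps the frequent ones. The stopword list, thresholds and sample sentences are assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are",
             "to", "for", "with", "by", "at"}


def candidate_terms(documents, max_words=3, min_count=2):
    """Count stopword-free word sequences and keep those seen at least `min_count` times."""
    counts = Counter()
    for text in documents:
        # Punctuation ends a phrase, so split the text into fragments first.
        for fragment in re.split(r"[^\w\s-]+", text.lower()):
            words = fragment.split()
            for n in range(1, max_words + 1):
                for i in range(len(words) - n + 1):
                    gram = words[i:i + n]
                    if not any(word in STOPWORDS for word in gram):
                        counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


docs = [
    "A strike-slip fault is a fault with horizontal motion.",
    "Earthquakes occur along a strike-slip fault.",
]
candidates = candidate_terms(docs)
```

On real content, the dedicated methods named above should give better-ranked candidates; the point here is only that the output is a list of terms with weights, ready for review.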
User aspect: Use search logs or user feedback to identify what your customers are looking for, and pay special attention to designing a KOS that includes their explicit informational needs.
The optimal coverage of your KOS is at the intersection of both aspects.
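Read literally, that intersection is a set operation. The short Python sketch below, with a hypothetical function name and placeholder inputs, also keeps the two leftovers, since terms found only in content or only in search logs are useful signals in their own right.

```python
def coverage_plan(content_terms, user_terms):
    """Split candidate terms by whether content, users, or both support them."""
    content = {term.lower() for term in content_terms}
    users = {term.lower() for term in user_terms}
    return {
        "core": sorted(content & users),          # in the content and searched for
        "content_only": sorted(content - users),  # present but not searched for
        "unmet_demand": sorted(users - content),  # searched for but absent from content
    }


plan = coverage_plan(
    ["fault", "strike-slip fault", "fold"],  # placeholder content candidates
    ["Fault", "earthquake"],                 # placeholder search-log terms
)
```

Exact string matching is a simplification: in practice the comparison works better once terms have been grouped into notions.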
Candidate lists and customer input will give you lists of terms. You then need to group them into semantic entities (notions); otherwise you will miss out on the semantic bridging and recall-enhancing capabilities of a KOS. You can use context-based clustering algorithms trained on your data to create proto-notions, and ideally have them revised by a Subject Matter Expert (SME). BERT-based methods typically perform poorly at linking acronyms with the fully expanded versions of terms, so you will likely have to pay particular attention to acronyms (through SME involvement or an acronym-specific automatic clustering method).
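As one example of an acronym-specific method, the Python sketch below pairs an all-caps term with any multi-word term whose initials spell it. This is a crude heuristic of my own choosing, not a method prescribed here: it misses acronyms that skip small words and can produce false matches, so its output is best treated as suggestions for SME review.

```python
def initials(term):
    """First letter of each word, treating hyphens as word breaks."""
    return "".join(word[0] for word in term.replace("-", " ").split()).upper()


def link_acronyms(terms):
    """Map each all-caps single-word term to the multi-word terms whose initials spell it."""
    acronyms = [t for t in terms if t.isupper() and " " not in t]
    expansions = [t for t in terms if " " in t or "-" in t]
    return {a: [e for e in expansions if initials(e) == a] for a in acronyms}


links = link_acronyms([
    "MRI", "magnetic resonance imaging",
    "CT", "computed tomography",
    "fracture",
])
```

An acronym left with an empty list is a good candidate to hand to an SME rather than to leave as its own notion.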
Lexical selection perspective: keep it standard
You will face a specific challenge with your clustered candidate lists: which entry in the cluster should be the representative label for that cluster, i.e., which term will be the preferred one for this notion? The preferred term is often the customer-facing tip-of-the-iceberg view for a given notion. You can base your selection on frequency information (in your content, in customer search logs, or both), but you can also use the input of existing researcher communities: identify standard vocabularies for the community of researchers your customers belong to and choose their preferred term to represent your notions. This way, your customers will recognize the notion and identify your KOS as an authoritative way to represent the domain of your tool/product, fostering strong acceptance.
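The two selection strategies can be combined in a small Python sketch. The priority order (reference vocabulary first, then summed frequency, then alphabetical order as a tie-breaker) is an assumption, as are the function name and the counts used in the example.

```python
def choose_preferred(cluster, content_freq, search_freq,
                     reference_preferred=frozenset()):
    """Pick the preferred label for a cluster of synonymous terms."""
    # A term the community's reference vocabulary already prefers wins outright.
    for term in cluster:
        if term in reference_preferred:
            return term

    # Otherwise fall back to combined frequency; ties go to alphabetical order.
    def rank(term):
        total = content_freq.get(term, 0) + search_freq.get(term, 0)
        return (-total, term)

    return min(cluster, key=rank)


# Placeholder counts, invented for illustration only.
cluster = ["heart attack", "myocardial infarction", "MI"]
content_freq = {"myocardial infarction": 40, "heart attack": 25, "MI": 10}
search_freq = {"heart attack": 30, "MI": 5}

by_frequency = choose_preferred(cluster, content_freq, search_freq)
by_reference = choose_preferred(
    cluster, content_freq, search_freq,
    reference_preferred={"myocardial infarction"},  # hypothetical reference choice
)
```

With these placeholder numbers the frequency-based pick is the lay term, while the reference-based pick is the clinical one, which illustrates why the choice depends on who your customers are.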
You can also use these reference vocabularies (such as MeSH in the medical domain) to cluster the candidate terms from the step dealing with the KOS content coverage (or to quality-control the automatic clusters); the reference vocabularies already group terms into notions and can be used as gold standards for any term matching your lists.
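A minimal Python sketch of that matching step follows. The `REFERENCE` mapping is a hypothetical hand-made extract with invented notion IDs (`R1`, `R2`), not actual MeSH data, and the lookup is a plain case-insensitive string match.

```python
from collections import defaultdict

# Hypothetical extract of a reference vocabulary: term -> notion ID
REFERENCE = {
    "myocardial infarction": "R1",
    "heart attack": "R1",
    "stroke": "R2",
    "cerebrovascular accident": "R2",
}


def cluster_with_reference(candidates, reference=REFERENCE):
    """Group candidate terms by the reference notion they match; list the rest separately."""
    clusters = defaultdict(list)
    unmatched = []
    for term in candidates:
        notion = reference.get(term.lower())
        if notion is None:
            unmatched.append(term)
        else:
            clusters[notion].append(term)
    return dict(clusters), unmatched


clusters, unmatched = cluster_with_reference(
    ["Heart attack", "stroke", "myocardial infarction", "fault"]
)
```

The unmatched terms are the ones that still need automatic clustering or SME review, and they hint at where your content goes beyond the reference vocabulary.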
Reference vocabularies can also (depending on license rights) be used directly to build your own KOS, but they might not perfectly reflect your coverage or modeling needs. Using them as a basis, however, gives you solid ground from which to expand into the areas that are specific to your own content pool and customers’ information needs.