Original Message:
Sent: 7/12/2024 7:24:00 PM
From: Keven Ates
Subject: RE: Thoughts and ideas
Bret,
The DoD/IC Ontology Working Group (DIOWG) has a DoD IC Ontology Foundry--a registry of "officially" recognized ontologies for its community. It is modeled on the OBO Foundry and structured as a hierarchy: the Basic Formal Ontology sits at the top, and the Common Core Ontology follows, encompassing all the domain-specific ontologies. One of them is the Cyber Ontology. Ideally, the Cyber Ontology Working Group would oversee our STIX ontology submission into that space. This would be a tremendous win for the CTI TC.
For implementation, the upper-level ontologies are generally not used for any real work that would interest us--it's primarily an ontology organization strategy. We would concentrate on our domain ontology work and its use with other intersecting domains. As a way to short-circuit to a STIX ontology, there is a rendition of the STIX protocol at the TAC TC (as you may be aware). While it is a fairly faithful rendition of the STIX protocol, this work needs review by the CTI TC to ensure alignment.
See the Pizza Ontology for a (humorous) intro to ontology work. It demonstrates three of many (what I call transport) formats: JSON, OWL, and TTL. The JSON format is, well, JSON; a better implementation would use JSON-LD to be more compact. The OWL format is an XML implementation and is generally considered the baseline standard for historical reasons. The TTL format (a.k.a. Turtle) is by far the most popular format--very human readable and generally more condensed. Each file is "graph equivalent" to the others of the same name. Reviewing a file, you can see how the various classes and properties are defined, along with their terms, labels, etc.
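To give a flavor, here is a minimal Turtle sketch in the spirit of the Pizza Ontology (the IRI and class names are illustrative, not taken from the actual ontology):

    @prefix : <http://example.org/pizza#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # A class, a subclass, and human-readable labels:
    :Pizza a owl:Class ;
        rdfs:label "Pizza" .
    :Margherita a owl:Class ;
        rdfs:subClassOf :Pizza ;
        rdfs:label "Margherita Pizza" .

The same graph could be serialized as RDF/XML or JSON-LD without any change in meaning.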
The related technology makes the transport format irrelevant. See this discussion for an overview of the format non-issue.
@JeffreyMates - Mixed, 2:
The linked article also says, "One of the lessons Doug would like to pass on is, don't be too tied down to the way you develop software. Embrace and explore new paradigms."
His recollection of getting JSON adopted over XML feels much like this discussion on namespaces.
Namespaces are a simplifier, not a complicator. They help organize code and data. Namespaces are used all over JavaScript. In fact, it's a primary function of the language--when you declare an object literal, you literally (pun intended) declare a namespace scope. Doug supports it when he references the "closure" issue. The Object Oriented paradigm is also in JavaScript with "class", so I don't get his statements on the subject since class is a "closure" concept.
By not using namespaces, we force the STIX protocol to apply other complexities internally to compensate. It forces implementers to apply a de facto namespace of their own when they mix STIX with non-STIX data--a STIX "indicator" (a thing, especially a trend or fact, that indicates the state or level of something) is very different from a vehicle "indicator" (a gauge or meter of a specified kind), but a STIX "indicator" may be related to a vehicle "indicator" (CAN bus attacks, anyone?). So, how do we reduce confusion? Namespaces...easy! We can play in a bigger world and freely mix STIX data with other data. Resistance is futile!
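As a sketch of how namespaces keep the two "indicator" senses apart (all prefixes, IRIs, and the relatedTo property here are illustrative, not official STIX terms):

    @prefix stix: <http://example.org/stix#> .
    @prefix auto: <http://example.org/automotive#> .

    # Same local name, different namespaces, no collision:
    stix:indicator-1 a stix:Indicator .    # a CTI indicator
    auto:indicator-7 a auto:Indicator .    # a dashboard gauge
    stix:indicator-1 stix:relatedTo auto:indicator-7 .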
------------------------------
Keven Ates
US Federal Bureau of Investigation
Washington DC
------------------------------
Original Message:
Sent: 07-11-2024 08:16
From: Bret Jordan
Subject: Thoughts and ideas
Keven,
Exactly. I also think there are a lot of work items that we need to address to accommodate the changing landscape of AI in CTI. I know some have already done some work here. I would also like to look at a registry for objects and vocabs.
All in all, I think there are a lot of things that we need to do. We have learned a lot since we did the first migration from STIX 1.0/1.1 to STIX 2.0/2.1. I also believe that the changes we need are not simple iterative changes, but things that will warrant a STIX and TAXII 3.0.
Keven, I do think it would be good to help everyone understand what would be involved in moving to an ontology, with concrete examples. But we can talk offline about what could be good there.
Bret
Original Message:
Sent: 7/10/2024 5:44:00 PM
From: Keven Ates
Subject: RE: Thoughts and ideas
I'm in a similar situation regarding involvement but am intending to devote more time to this area. I'll break down the basic Pros and Cons as I see them. I'll also provide some solution paths for the Cons (so as not to just complain about them).
Pros:
- The STIX and TAXII protocols have done fairly well with defining the various CTI things the community needs as far as a taxonomy is concerned.
- They have mostly structured a hierarchy from the taxonomy.
- They have mostly accomplished a way to define CTI documents and a method to share those documents.
- They propose a process for extending parts of the hierarchy.
Cons:
- The STIX protocol from its earliest conception attempts to use a graph analogy to structure a document but misses the mark on well-applied graph processes and theory.
- The STIX protocol is not machine readable / understandable. To effectively use CTI data, the STIX protocol must move from taxonomy and hierarchy to an ontology and the related standardized principles of ontology--description logic. This will enable a solution path for CTI knowledge processing and management concerns such as data governance, AI explainability, inferred knowledge, and other CTI analytics.
- The STIX protocol can apply better object-oriented principles for class-subclass and property-subproperty relationships. This would align the work with well-defined ontology principles. Problem areas in the current work include duplicative definitions and puns across the various STIX definitions.
- The STIX protocol does not adopt namespace solutions that help us "play well with others".
Solutions:
These solutions should be considered a primer for standardized knowledge graph technologies.
For Con 1:
The STIX graph analogy uses node-edge-node relations but, in some of the definitions, things you expect to be edges are described as nodes. For instance, a "relationship" is properly described (almost) universally as an edge that connects two nodes. Yet, STIX defines a "Relationship" as a node with a "source" and a "target". Using a directed graph (common for knowledge graphs) we can see the differences:
(SourceNode) -- relationship --> (TargetNode)
versus:
(SourceNode) <-- source -- (Relationship) -- target --> (TargetNode)
While this is technically not wrong (in graph theory, edges can be converted to nodes and all sorts of other blasphemy), it unnecessarily complicates the linked data and query solutions. Using a knowledge graph solution, the source (Domain) and target (Range) are a result of the directed edge, which describes a binary predicate. In a description logic ontology, information can be easily inferred from the edge logic. There is nothing gained by dividing the relationship into two node-edge-node statements as a directed edge already defines source and target.
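In Turtle, the contrast looks roughly like this (the stix: prefix and instance IRIs are illustrative; the property names mirror STIX 2.1's JSON fields):

    @prefix stix: <http://example.org/stix#> .

    # Direct edge: one triple; source and target fall out of the direction.
    stix:malware-1 stix:uses stix:tool-2 .

    # STIX 2.x style: the relationship is itself a node with two extra edges.
    stix:relationship-9 a stix:Relationship ;
        stix:relationship_type "uses" ;
        stix:source_ref stix:malware-1 ;
        stix:target_ref stix:tool-2 .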
However, a "relationship" as a node is sometimes beneficial as it can be used to apply additional data to the relationship. This is known a "reification" in the knowledge graph arena. Instead of using additional (Relationship) -- some property -> (Data)
statements, a knowledge graph can reference a whole node-edge-node statement as a node:
<< (SourceNode) -- relationship --> (TargetNode) >> -- some property --> (Data)
or
(SomeNode) -- some property --> << (SourceNode) -- relationship --> (TargetNode) >>
This is generally considered a better solution as the context is clearer--something is being said about the entire statement instead of just the edge of a statement.
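The RDF-star extension of Turtle makes this notation concrete (names again illustrative):

    @prefix stix: <http://example.org/stix#> .

    # Annotate the entire statement, not a separate Relationship node:
    << stix:malware-1 stix:uses stix:tool-2 >> stix:confidence 85 .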
There are other issues, but the above is the primary one.
For Con 2:
The solution is combined with Con 4 below as they are related.
For Con 3:
As a generic example, one might describe a directed edge relationship between a Person and an Email Address as:
(John) -- has a --> (email) -- defined as --> (john@example.org)
or as individual statements:
(John) -- has a --> (email_1)
(email_1) -- defined as --> (john@example.org)
The "has a" and "defined as" properties are highly generic properties that can be used in other kinds of relationships. These are the kind of definitions I see in STIX. This should be a clue that these properties are super-properties of some better defined sub-properties to represent specific relations. There is no well defined expectations of what the domain and range are for these properties. If we allow context alone to determine meaning for these properties, we fail to provide enough logic for effective use (and related governance) for the user base. It raises questions of who's definition of what context is appropriate.
A better solution might be via the following two blocks:
(Kind) -- type --> (Class)
(Individual) -- type --> (Kind)
(Organization) -- type --> (Kind)
hasEmail -- type --> (Property)
hasEmail -- range --> (Email)
These are common ontology definitions provided by a "specific domain" standards body (like CTI). The ontology definitions for "Class", "Property", "type" and "range" are given by another "general domain" standards body--the "specific domain" standard aligns with a "general domain" standard that helps define and share the specific work with specific domain users as well as a larger community of data users.
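In Turtle, those definitions might read as follows (the ex: prefix is illustrative; note that the class-to-class "type" statements become rdfs:subClassOf so that the subclass inference described later holds):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex: <http://example.org/cti#> .

    ex:Kind a owl:Class .
    ex:Individual rdfs:subClassOf ex:Kind .
    ex:Organization rdfs:subClassOf ex:Kind .
    ex:hasEmail a owl:ObjectProperty ;
        rdfs:range ex:Email .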
(John) -- type --> (Individual)
(John) -- hasEmail --> (_container1_)
(_container1_) -- type --> (Home)
(_container1_) -- hasValue --> (mailto:john@example.org)
The above is a user-created document using the combined standard definitions for the specific and general domains. The "hasEmail" property implies that "_container1_" is an "Email" class type per the prior ontology statement. It is also directly defined as a "Home" class type and, therefore, defined as a Home email address. In knowledge graph speak, "_container1_" is called a Blank Node as it's just a place to consolidate related data and does not need to be a Named Node.
It is also interesting to note that ontology statements are declared and defined in exactly the same way that data statements are defined. Redefinition and deprecation of ontology statements can then be automated in these solutions since everything is machine readable. For the current STIX protocol, change requires independent work by every implementer. By using standardized ontologies, we minimize the impact on implementations.
In fact, I've snuck in the W3C vCard Ontology for this example. This highlights the very relevant reason to adopt a namespace solution. STIX can use existing ontologies to augment its own ontology--we don't need to reinvent the wheel...again.
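For reference, here is the same data written directly against the W3C vCard Ontology in Turtle (the ex: prefix is mine; the vcard: terms are from the published ontology, to the best of my reading of it):

    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix ex: <http://example.org/people#> .

    # The [ ... ] is a blank node -- the "_container1_" of the example above.
    ex:John a vcard:Individual ;
        vcard:hasEmail [ a vcard:Home ;
            vcard:hasValue <mailto:john@example.org> ] .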
For Con 4:
Without a well-defined ontology, the STIX protocol is difficult to use with other data standards without a lot of extra implementation work by individual organizations. This forces them to develop bespoke data solutions in their own bubble. That, in turn, makes it difficult for them to effectively share their data with others and adds maintenance cost to their solutions. The status quo fosters a fractured user domain.
By adopting an ontology solution, we must also embrace namespace solutions to help us share data in the CTI domain with others. We become a resource for a larger community of data users. A STIX namespace (and possible sub-namespaces) organizes the ontology and compliant data to be shared and used with other ontologies. For instance, if NIEM and STIX adopt ontological solutions, the ontologies can be aligned by establishing equivalency statements--a NIEM ThingX class can be equivalent to a STIX ThingY class. This mitigates any need for us to agree on a common term or to develop any kind of data conversion process. The equivalencies allow the standardized ontology engines to query on one term and provide results that are otherwise defined by another term.
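An equivalency statement is a single triple (the niem: and stix: IRIs and class names here are hypothetical):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix niem: <http://example.org/niem#> .
    @prefix stix: <http://example.org/stix#> .

    # Under OWL reasoning, a query for niem:ThingX also returns stix:ThingY instances.
    niem:ThingX owl:equivalentClass stix:ThingY .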
This helps an analyst "Get Things Done"(tm). It helps our community to connect with other communities. The description logic that comes with an ontology provides inferencing--new data can be inferred from existing data. For instance, (John) -- type --> (Individual)
implies John is also a "Kind" by the subclass relation. If we query for things of type "Kind" using an inferencing engine, we get John as a result. While these are simple examples, the implications of this shift are HUGE for the user base.
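That inference is ordinary RDFS entailment; here is a minimal sketch reusing the vCard classes from above (the ex: prefix is illustrative):

    @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
    @prefix ex: <http://example.org/people#> .

    # Asserted by the vCard ontology: vcard:Individual rdfs:subClassOf vcard:Kind .
    ex:John a vcard:Individual .
    # Entailed by an RDFS reasoner: ex:John a vcard:Kind .
    # So a query for instances of vcard:Kind returns ex:John.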
This knowledge technology is also being used behind the scenes in the current AI movement. AI solutions have difficulty explaining how they arrive at a solution and proving that their solutions are correct. With knowledge graph technologies, the AIs have a way to "explain" how solutions are derived. They also have a ready-made ML feature selection set from the knowledge graph repositories.
Conclusion:
The point is that, by shifting to an ontology solution, we help a larger community get work done with lower costs to everyone and better, standardized, machine-ingestible definitions. It provides natural extensibility as a consequence--independent groups can define their own ontologies to use with the STIX ontology without concern. STIX bundles can include extension data using extension ontologies in a document without impacting end users who haven't implemented an extension--they are just extra data elements that can be ignored or processed for later use. If users want that data, they can ingest the related ontology at will and adjust their queries in short order.
I can present these points in more detail as requested.
------------------------------
Keven Ates
US Federal Bureau of Investigation
Washington DC
------------------------------