Data Provenance (DPS) TC

  • 1.  Naming for DP Model and Specification

    Posted 06-24-2025 11:44
    At the June meeting we decided:

    • Naming Decision: Specification renamed to "Data Provenance" (dropped "standards").
    That decision is related to the initial work product specification that uses property tables to define provenance metadata. As we discussed, the content of the tables is just a strawman to be replaced by actual content as determined by the TC. But the form of the specification should be agreed on early so we have a way to specify the content.

    Two approaches have been proposed:
    1. anonymous collections of properties as shown in the contributed YAML schema and its equivalent JSON schema
    2. named collections of properties ("Types") as shown in the property tables:

    The naming decision referred to the name of the specification document, but it also applies to the schema defined by the specification. In the typed approach, the top level of the schema has the name "DataProvenance" (circled in green), while in the anonymous approach the schema root doesn't have a Type, it's just a collection of three properties.

    Similarly, the first property is defined to be a named "Source" type (circled in orange) while in the anonymous approach all of the source content is defined as sets of properties nested under the source property.
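
To make the contrast between the two approaches concrete, here is a minimal Python sketch. It is only an illustration: the property names (issuer, format, license) are borrowed from the tree view of the contributed schema as placeholders, and the field types and sample values are assumptions, not TC decisions.

```python
from dataclasses import dataclass, asdict

# Approach 1: anonymous collections. The root and everything under it
# are unnamed, nested sets of properties.
anonymous = {
    "source": {"issuer": ["Example Org"]},
    "provenance": {"format": "text/csv"},
    "use": {"license": "CC-BY-4.0"},
}


# Approach 2: named collections ("Types"). Each collection has a name
# that other parts of a specification can refer to.
@dataclass
class Source:
    issuer: list[str]


@dataclass
class Provenance:
    format: str


@dataclass
class Use:
    license: str


@dataclass
class DataProvenance:
    source: Source
    provenance: Provenance
    use: Use


typed = DataProvenance(
    source=Source(issuer=["Example Org"]),
    provenance=Provenance(format="text/csv"),
    use=Use(license="CC-BY-4.0"),
)

# Both approaches carry the same data; only the typed one names its parts.
same_data = asdict(typed) == anonymous
root_name = type(typed).__name__
```

In this sketch the serialized content is identical either way; the difference is that "DataProvenance" and "Source" exist as names that a document or another schema can point to.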

    I discuss this further under Issue 16.  Most TC members are probably not interested in esoteric modeling details, but this decision has a couple of practical impacts:
    1. Specification readability: Named types can be referred to without defining all the details of their content 
    2. XML support: Unlike JSON, XML elements must have a tag, so if we want to allow XML metadata the model must define those tags:

    <DataProvenance>
        <Source>
            < ... >
        </Source>
        <Provenance>
            < ... >
        </Provenance>
        <Use>
            < ... >
        </Use>
    </DataProvenance>

    Therefore, I propose that the data provenance specification document and model schema use named types to define content.
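
The XML point can be sketched with Python's standard xml.etree.ElementTree module. The `to_xml` helper and the sample values are hypothetical; the sketch only shows that an XML serializer cannot start until someone supplies a name for the root element, whereas JSON can emit a bare object.

```python
import xml.etree.ElementTree as ET


def to_xml(tag: str, value) -> ET.Element:
    """Build an element named `tag`; nested dicts become child elements."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(to_xml(child_tag, child_value))
    else:
        element.text = str(value)
    return element


# Sample content; the property names and values are placeholders.
metadata = {
    "Source": {"Issuer": "Example Org"},
    "Provenance": {"Format": "text/csv"},
    "Use": {"License": "CC-BY-4.0"},
}

# There is no way to call ET.Element() without a tag, so the model has to
# supply the root name "DataProvenance".
root = to_xml("DataProvenance", metadata)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

With named types, the tag names come from the model itself rather than being invented by each implementer.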



  • 2.  RE: Naming for DP Model and Specification

    Posted 06-25-2025 10:55

    Hi Dave (and all!),

    I will summarize in my own language (sorry, I'm not a very technical person!) what I believe is your proposal, followed by my response. Would you please let me know if I went off track or misrepresented any aspect?

    The DPS specification will describe metadata, and we aim to represent this in a machine-readable way. We can either do this through:

    1. Anonymous collections of properties
    2. Named collections (or types)

    Option 1 is a flexible and straightforward solution that can utilize JSON, as you suggested. Option 2 would have us group properties explicitly, as the D&TA standards did (e.g., Source). Option 2 is easier to reuse and refer to, and easier to support in other, non-JSON formats (like XML).

    The TC should care for two reasons. First, we need to prioritize readability: with named types, you can refer to something like "Source" instead of having to spell it out every time. Second, unlike JSON, XML requires element tags (which need names), so named types are better if XML support is necessary.

    Given this understanding, we should adopt the named-types approach. This will make the structure more transparent, reusable, and compatible with XML and other systems that need named elements.

    Thanks,

    Kristina






  • 3.  RE: Naming for DP Model and Specification

    Posted 06-26-2025 11:42

    Kristina,

    You've got it.

    A couple of additional points:

    • An information model (like the property tables shown in the spec) supplies information that is needed for multiple data formats (like XML, JSON, and concise binary data)
    • A JSON schema generated from the information model automatically supplies names using $defs.  A JSON schema written by hand could also supply names using $defs, but the contributed JSON schema derived from YAML does not use $defs
    • A tree view can be generated from an information model or named JSON schema, in the format of a graphical tree or an ascii tree.  Here's what both look like when derived from the anonymous schema.  They'd look the same generated from a named schema except that the names would be more meaningful (e.g., "DataProvenance" instead of "Root").

     

    Regards,
    David


    data-provenance-standards-1.0.0.schema.yml-to-json-is-lossless-conceptual.atree:

     

    Root
    ├── Root.provenance
    │   ├── Format
    │   ├── Generation-method
    │   │   └── Generation-method-item
    │   ├── Root.provenance.generation-period
    │   ├── Origin
    │   │   └── Origin-item
    │   │       └── Address
    │   └── Origin-geography
    │       └── Origin-geography-item
    ├── Root.source
    │   └── Issuer
    │       └── Issuer-item
    │           └── Address
    └── Root.use
        ├── Classification
        │   └── Classification-item
        │       └── Classification-item.regulation
        ├── Consents
        ├── Copyright
        ├── Intended-purpose
        │   └── Intended-purpose-item
        ├── License
        ├── Patent
        ├── Privacy-enhancing
        │   └── Privacy-enhancing-item
        │       ├── Parameters
        │       ├── Result
        │       └── Privacy-enhancing-item.tool-category
        ├── Processing-excluded
        │   └── Processing-excluded-item
        ├── Processing-included
        │   └── Processing-included-item
        ├── Storage-allowed
        │   └── Storage-allowed-item
        ├── Storage-forbidden
        │   └── Storage-forbidden-item
        └── Trademark
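
For readers curious how an ASCII tree like the one above could be derived, here is a rough Python sketch. It is not the tool that produced the attachment: the `resolve` and `tree_lines` helpers and both tiny schemas are hypothetical, and the sketch handles only local `#/$defs/...` references on non-recursive schemas. It shows why a schema that names its types with `$defs` yields "DataProvenance" where an anonymous schema can only yield a fallback such as "Root".

```python
from typing import Optional


def resolve(schema: dict, node: dict) -> tuple[Optional[str], dict]:
    """Follow a local "#/$defs/Name" reference; return (type name, definition)."""
    ref = node.get("$ref", "")
    if ref.startswith("#/$defs/"):
        name = ref.split("/")[-1]
        return name, schema["$defs"][name]
    return None, node


def tree_lines(schema, node, label, prefix="", is_last=True, is_root=True):
    """Return the ASCII-tree lines for `node`, preferring type names as labels."""
    type_name, definition = resolve(schema, node)
    text = type_name or label  # fall back to the property name if unnamed
    if is_root:
        lines, child_prefix = [text], ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + text]
        child_prefix = prefix + ("    " if is_last else "│   ")
    properties = list(definition.get("properties", {}).items())
    for index, (name, child) in enumerate(properties):
        last = index == len(properties) - 1
        lines += tree_lines(schema, child, name, child_prefix, last, False)
    return lines


# A named schema: every collection is defined once under $defs.
named = {
    "$ref": "#/$defs/DataProvenance",
    "$defs": {
        "DataProvenance": {
            "type": "object",
            "properties": {
                "source": {"$ref": "#/$defs/Source"},
                "use": {"$ref": "#/$defs/Use"},
            },
        },
        "Source": {"type": "object", "properties": {"issuer": {"type": "array"}}},
        "Use": {"type": "object", "properties": {"license": {"type": "string"}}},
    },
}

# An anonymous schema: the same shape, but nothing has a type name.
anonymous = {"type": "object", "properties": {"source": {"type": "object"}}}

named_lines = tree_lines(named, named, "Root")
anonymous_lines = tree_lines(anonymous, anonymous, "Root")
print("\n".join(named_lines))
```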

     

     

     






  • 4.  RE: Naming for DP Model and Specification

    Posted 06-27-2025 09:21

    Thanks, David, for confirming and the additional clarifications. That really brings it all together for me.

    It sounds like we're aligned on the value of using named types moving forward, which is great. I'll keep this in mind as we continue refining the specification. Hopefully everyone else on the TC is in agreement as well.

    Thanks again for taking the time to explain!

    Kristina

     






  • 5.  RE: Naming for DP Model and Specification

    Posted 06-30-2025 15:30
    Dear members,

    On Fri, Jun 27, 2025, at 15:20, Kristina Podnar via OASIS wrote:
    > Thanks, David, for confirming and the additional clarifications. That really brings it all together for me. It sounds like we're aligned on the...

    I added some additional thoughts for your reading pleasure
    and consideration to the idea exchange around the ticket at:
    https://github.com/oasis-tcs/dps/issues/16#issuecomment-3020193224

    In my opinion there is no need to rush into fixing names and
    topologies yet, so enjoy.

    But I consider us to be in more of a leaf-type exploration phase,
    which started with the questions TC members brought up and which
    Kristina kindly started to answer and give feedback on from the
    perspective of a long-time participant in the DTA discussions.

    I am sure we have some interesting "functions" we want to represent
    in our specified structure, and we need to discuss result types
    (so to say).

    "Functions" here are, to me, members of the objects that provide
    answers to questions like:

    - to which dataset do "I" relate
    - when was "I" changed last time
    - "I" am the latest version
    - "I" have a history
    - what human language am "I" in
    - the rules of which region apply
    - "here" are further verifiable claims in this or that format
    requiring this or that transport protocol

    Rewritten in some pseudo code for a hosting structure dp
    (for data-provenance):

    - dp.data().link()
    - dp.data().changed()
    - dp.data().effective()
    - dp.data().history()
    - dp.provenance().lang()
    - dp.provenance().region()
    - dp.use().consents()
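
One possible reading of the pseudo code above, sketched in Python so that the open question of result types becomes visible. Everything beyond the method names is an assumption made for illustration: the stored field names, the return types (str, datetime, bool, tuple), and the example.org URLs are hypothetical and are exactly the kind of detail the TC would still need to discuss.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Data:
    dataset_url: str
    last_changed: datetime
    versions: tuple[str, ...]  # assumed ordering: oldest first
    version: str               # the version this record describes

    def link(self) -> str:
        """To which dataset do "I" relate?"""
        return self.dataset_url

    def changed(self) -> datetime:
        """When was "I" changed last time?"""
        return self.last_changed

    def effective(self) -> bool:
        """Am "I" the latest version?"""
        return self.version == self.versions[-1]

    def history(self) -> tuple[str, ...]:
        """Which versions make up "my" history?"""
        return self.versions


@dataclass(frozen=True)
class Provenance:
    language: str
    jurisdiction: str

    def lang(self) -> str:
        return self.language

    def region(self) -> str:
        return self.jurisdiction


@dataclass(frozen=True)
class Use:
    consent_claims: tuple[str, ...]

    def consents(self) -> tuple[str, ...]:
        """Pointers to further verifiable claims."""
        return self.consent_claims


@dataclass(frozen=True)
class DataProvenance:
    data_part: Data
    provenance_part: Provenance
    use_part: Use

    def data(self) -> Data:
        return self.data_part

    def provenance(self) -> Provenance:
        return self.provenance_part

    def use(self) -> Use:
        return self.use_part


dp = DataProvenance(
    Data(
        "https://example.org/datasets/42",
        datetime(2025, 6, 30, tzinfo=timezone.utc),
        ("1.0", "1.1"),
        "1.1",
    ),
    Provenance("en", "EU"),
    Use(("https://example.org/claims/consent-1",)),
)
print(dp.data().link(), dp.data().effective(), dp.provenance().lang())
```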

    Thanks.

    All the best,
    Stefan




  • 6.  RE: Naming for DP Model and Specification

    Posted 07-01-2025 01:54

    Thank you, Stefan. That is a perspective I was not considering.

     

    I propose to the co-chairs and TC that we take this up in today's conversation. While it may not seem pressing, it seems to me to be on the critical path if we are to aim to get the standards out in the coming month and a half for public comment. This fits into topic #3 on our agenda.

     

    Welcoming other perspectives and suggestions.

     

    Kristina