OASIS Static Analysis Results Interchange Format (SARIF) TC


partialFingerprints: the words the world has been waiting for

  • 1.  partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 00:25
    For a long time we’ve agreed that partialFingerprints shouldn’t include information that’s deducible from the SARIF file, but the spec has never said so. As part of the “fingerprints” draft that I just merged and pushed, Appendix B now says the magic words:
     
    An analysis tool SHALL NOT include in partialFingerprints information that a result management system could deduce from other information in the SARIF file, for example, file hashes. Rather, the result management would use such information, along with partialFingerprints, in its computation of fingerprints.
     
    I understand that our vision of partialFingerprints is still evolving, but this will do for now.
     
    Larry
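A minimal sketch of the division of labor that the Appendix B wording implies: the analysis tool supplies only information the result management system (RMS) cannot deduce, and the RMS combines it with what it can read from the log, such as a file hash. The function `compute_fingerprint`, the key `contextRegionHash/v1`, the rule id, and the hashing scheme are all hypothetical; the draft does not prescribe any of them.

```python
import hashlib
import json


def compute_fingerprint(rule_id: str, file_hash: str, partial_fingerprints: dict) -> str:
    """Hypothetical RMS-side computation; not defined by the SARIF draft."""
    payload = json.dumps(
        {"ruleId": rule_id, "fileHash": file_hash, "partial": partial_fingerprints},
        sort_keys=True,  # stable key order so the digest is repeatable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The tool emits only what the RMS cannot deduce; the file hash is NOT repeated here.
tool_partial = {"contextRegionHash/v1": "a1b2c3"}

# The RMS reads the file hash from elsewhere in the log and mixes it in itself.
fp_a = compute_fingerprint("RULE001", "sha256:1111", tool_partial)
fp_b = compute_fingerprint("RULE001", "sha256:2222", tool_partial)
```

Because the file hash enters only on the RMS side, the same partial fingerprint yields different fingerprints for different file contents, and the tool never has to duplicate data the log already carries.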


  • 2.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 00:47
    My thinking, which I tried to articulate in today’s discussion, more or less successfully, is that result matching is not a matter of comparing one previously computed fingerprint to another. Instead, result matching is a complex algorithm that tries to stitch various results together. If unsuccessful in producing an exact match, the algorithm may fall back to partial fingerprints, which are essentially logical- and physical-location-free things that may still help determine issue identity (in practice, a result matcher might still have a notion of two files that should be compared for the match, but have lost all other useful intra-file location details).
     
    With the definition above, a partial fingerprint is partial in the sense that it is a speculative match that doesn’t benefit from other data that would increase confidence in a match. It is also a contribution, as per our previous definition, in the sense that you might try to glue this information to whatever else you have (such as a file name, where you’ve lost the location details).
     
    I think the most significant impact of the reorientation above is on how we think of result.fingerprints. This data now truly becomes mostly a placeholder for putting data produced by legacy formats. We wouldn’t expect fingerprints to be populated by a result management system. Instead, this is what we’d see:
     
    1) A SARIF baseline is loaded, which the result management system has populated with instance ids (a guid, for example).
    2) A new SARIF log is loaded. The stable ids match between these, so they are candidates to compare.
    3) The result matcher runs an elaborate algorithm to try to correlate results, which includes things like remapping file names, loading the files, and running standard line-level diff algorithms to find matched, moved, new and deleted lines.
    4) After identifying exact matches based on file diff (and other precise locators such as fully qualified logical name), the result matching algorithm falls back to partial fingerprints (such as surrounding context region) to make a match.
    5) For all matches, when found, the instance id from the baseline flows to the newer SARIF. We also update the baseline state.
     
    And that’s it. At no point does it seem critical to populate the fingerprints object. You could imagine the fingerprints of the baseline log file containing some fingerprints that will always match if file name + physical location details haven’t changed. But how useful is that? (We already have file hashes to tell us this.) If you have to diff two files anyway to overcome line churn, the extra work of prepopulating and storing fingerprints might not pay for itself.
     
    Michael
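The matching flow described above can be sketched roughly as follows, assuming a heavily simplified model: one file, results reduced to a rule id, a line number, and a partialFingerprints dictionary, and Python's `difflib` standing in for the "standard line-level diff". The names `map_lines`, `match_results`, `line`, and `instanceId` are illustrative assumptions, and file-name remapping and baseline-state updates are left out.

```python
import difflib


def map_lines(old_lines, new_lines):
    """Map 1-based line numbers of unchanged lines from the old file to the new one."""
    mapping = {}
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping


def match_results(baseline, current, old_lines, new_lines):
    """Flow instance ids from baseline results to the matching current results."""
    line_map = map_lines(old_lines, new_lines)
    unmatched = list(baseline)
    for result in current:
        candidates = [old for old in unmatched if old["ruleId"] == result["ruleId"]]
        # Pass 1: precise match, the baseline line maps onto the new line via the diff.
        match = next(
            (old for old in candidates if line_map.get(old["line"]) == result["line"]),
            None,
        )
        # Pass 2: speculative match, fall back to non-empty partial fingerprints.
        if match is None:
            match = next(
                (
                    old
                    for old in candidates
                    if old["partialFingerprints"]
                    and old["partialFingerprints"] == result["partialFingerprints"]
                ),
                None,
            )
        if match is not None:
            result["instanceId"] = match["instanceId"]
            unmatched.remove(match)
    return current


old_file = ["int x;", "use(x);", "return;"]
new_file = ["// comment", "int x;", "use(x);", "return;"]
baseline = [
    {"ruleId": "R1", "line": 2, "partialFingerprints": {"ctx": "aaa"}, "instanceId": "id-1"},
    {"ruleId": "R2", "line": 3, "partialFingerprints": {"ctx": "bbb"}, "instanceId": "id-2"},
]
current = [
    {"ruleId": "R1", "line": 3, "partialFingerprints": {"ctx": "aaa"}},  # shifted by one line
    {"ruleId": "R2", "line": 1, "partialFingerprints": {"ctx": "bbb"}},  # diff cannot place it
    {"ruleId": "R3", "line": 2, "partialFingerprints": {}},              # new result
]
matched = match_results(baseline, current, old_file, new_file)
```

In this toy run the first result is matched by the line diff alone, the second only through its partial fingerprint, and the third stays without an instance id because it is new. No `fingerprints` value is computed or stored at any point.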


  • 3.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 17:04
    Hi all,
     
    Just want to chime in to explain what we do at Fortify. In SCA, our instance ids (result ids that allow us to track what’s happening to the result over time) are the “fingerprint” calculated by a complex algorithm that takes into consideration various things like file names, the sources/sinks involved in generating the result, the rule ids involved, etc. The idea is that, no matter where we run the tool and how many times, if the code hasn’t significantly changed and the result did not get fixed, the same exact instance id gets generated for the same exact issue.
     
    As mentioned, this is a complex algorithm, which sometimes fails to generate the same exact instance id for various reasons, and so our results management system tries to correlate results from multiple scans to indicate that something might be the same exact result as generated before. The user of the system has to verify that he/she agrees with these correlations. However, we never assign a different instance id to an already generated result.
     
    So, to me it looks like we would only be using the id property of the result object, and would use neither the fingerprints nor the partialFingerprints property.
     
    Do let me know if I’m missing something.
     
    Thanks!
    k
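To make the idea of a computed, repeatable instance id concrete, here is a toy sketch. It is not Fortify's algorithm, which the post describes only in outline; the function `instance_id`, its four inputs, the sample values, and the use of SHA-256 are assumptions chosen for illustration.

```python
import hashlib


def instance_id(rule_id: str, file_name: str, source: str, sink: str) -> str:
    """Derive a repeatable id from facts that do not depend on line numbers."""
    parts = [rule_id, file_name, source, sink]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8"))
    return digest.hexdigest()[:32]


first_scan = instance_id("sql-injection", "src/Db.java", "request.getParameter", "Statement.execute")
second_scan = instance_id("sql-injection", "src/Db.java", "request.getParameter", "Statement.execute")
after_move = instance_id("sql-injection", "src/data/Db.java", "request.getParameter", "Statement.execute")
```

Two scans of unchanged code produce the same id, while moving the file changes it. That second case is the kind of miss that leaves the results management system to propose a correlation for a user to confirm.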


  • 4.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 18:05
    Hi all,
     
    Yekaterina, thanks for the explanation. Since my memory is usually so poor, I was pleased to find that I remembered most of what you just wrote


  • 5.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 19:46
    Of course, we could dispense with result.fingerprints and just keep result.correlationId, documenting that it could either be an arbitrary identifier or a calculated fingerprint value.
     
    I’m still not quite clear on whether correlationId would need to be plural in that case.


  • 6.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 22:15




    Larry, I read your proposal around id and correlationId and it is clear and makes very good conceptual sense.
     
    These things are easy to describe in the spec: an id is a guid, generated on the fly, which is an identifier that is only good for a specific result in a single log file. The correlationId is a guid that correlates logically unique instances
    of a result across multiple log files.
     
    Btw – I am wondering whether we need an identifier object, which explicitly contains a guid, a readable id (an arbitrary namespaced label that provides some hierarchy), and a description. This id object would work well for automationId
    and for stableId. For id or correlationId, it looks less helpful. We might consider renaming these to result.instanceGuid and result.correlationGuid.
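A small sketch of the id/correlationId distinction as described here, assuming results are plain dictionaries and using a hypothetical helper `new_result`: `id` is freshly generated for every result in every log, while `correlationId` is carried over when a result is judged to be the same logical issue.

```python
import uuid


def new_result(rule_id, correlation_id=None):
    """Create a result with a fresh per-log id and a (possibly inherited) correlationId."""
    return {
        "ruleId": rule_id,
        "id": str(uuid.uuid4()),  # only meaningful within a single log file
        "correlationId": correlation_id or str(uuid.uuid4()),
    }


monday = new_result("RULE001")
# Tuesday's scan reports the same logical issue, so it inherits the correlationId.
tuesday = new_result("RULE001", correlation_id=monday["correlationId"])
```

Both values are opaque GUIDs, which is why a wrapper object with a readable label adds little for these two properties.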
     




  • 7.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 22:35
    Thanks for reading! Let’s put a pin in your “identifier object” question for a moment so I can ask you:
     
    What do you think of overloading correlationId so it can hold either an arbitrary identifier (as in your usage scenario) or a computed “fingerprint” (as in Yekaterina’s SCA scenario) – as opposed to separate correlationId and fingerprint(s) properties? Remind me again why “fingerprints” (however we name it or overload it) is plural.
     
    Now as to “identifier object”… as you say, it’s not useful for result.id or result.correlationId (because nobody’s going to generate a human-readable equivalent for those). As to run.automationId and run.stableId – I don’t see the point of a GUID associated with the namespaced human-readable values for these properties. GUIDs are fine when you have to guarantee uniqueness and there’s no central authority. IMO, within any given team’s engineering system, there would be no danger of choosing two identifiers with the same human-readable name, but different semantics requiring them to be distinguished – because you have a central authority. The complexity/value trade-off doesn’t work for me here, but I’m open to persuasion.
     
    Thoughts?


  • 8.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-04-2018 14:54




    We have a slot for computed fingerprints, result.fingerprints, so I suggest we don’t overload it. This property is an array in part because Yekaterina indicated that in rare cases a fingerprint would change slightly, and it may be
    valuable to retain the previously computed version.
     
    Let me repeat a comment that Henny made previously: our success here depends in part on clear semantics around what, exactly, lives in each property. If a correlationId could be either a computed fingerprint or a guid, we could just refer
    people to the fingerprints array. If you’re looking for an id, just grab the first one in the array.
     
    I really begin to think it’s cleaner if we redefine ids as guids, where guids make sense in the format, i.e., as entirely arbitrary (not computed from any results data), opaque, and unique values. And leave result.fingerprints and result.partialFingerprints
    dedicated to the fingerprints concept.
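The reason for making fingerprints an array can be sketched as follows, assuming results are dictionaries and using two hypothetical helpers, `add_fingerprint` and `same_issue`. The fingerprint strings are placeholders, not real computed values.

```python
def add_fingerprint(result, new_value):
    """Append a newly computed fingerprint while keeping earlier versions."""
    fingerprints = result.setdefault("fingerprints", [])
    if new_value not in fingerprints:
        fingerprints.append(new_value)
    return result


def same_issue(a, b):
    """Treat two results as the same issue if any of their fingerprints coincide."""
    return bool(set(a.get("fingerprints", [])) & set(b.get("fingerprints", [])))


stored = {"fingerprints": ["v1-abc"]}
add_fingerprint(stored, "v2-abd")  # the computation changed slightly; keep both

from_old_scan = {"fingerprints": ["v1-abc"]}
from_new_scan = {"fingerprints": ["v2-abd"]}
unrelated = {"fingerprints": ["v1-zzz"]}
```

Keeping the earlier value means results produced before and after the change in computation can both still be matched against the stored result.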
     




  • 9.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-04-2018 17:11
    Ok, so we’ve agreed on this so far:
     
    - result.id, a GUID, unique across every result reported in every run.
    - result.correlationId, a GUID, common to a set of results that are logically identical.
    - result.fingerprints, an array of calculated values, each of which captures a stable identifier for a set of logically identical results. An array, so that the RMS can improve a fingerprint without losing the old value.
     
    This supports both Michael’s and Yekaterina’s/SCA’s usage scenarios.
     
    Are we closed on this?
     
    Now as to ids: I think there’s only one other id in the spec we should require to be a GUID: run.id.
     
    As for all the others:
     
    - Not rule.id, for sure.
    - Not edge.id or node.id – they only need to be unique within a graph, and values like "e1" or "n2" are easier to read.
    - Not graph.id or graphTraversal.id – they only need to be unique within a result (or, in the case of graph.id, possibly within a run).
    - Not run.automationId – it’s intended to be a value that’s meaningful to the engineering system, like a build id.
    - Not run.stableId – that’s the namespaced thing like "Nightly security scan/x86".
    - Not physicalLocation.id – that’s the integer we use in message substitution sequences like "{2}".
    - Not threadFlow.id – it only needs to be unique within a codeFlow, so sequential numbers would be fine.
    - Not notification.id – that should be something human readable like "RunStarted".
    - Not message.resourceId – that should be human readable like "ErrorUnitializedVariable".
     
    Larry
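The split proposed in this message, GUIDs for run.id, result.id and result.correlationId and free-form values everywhere else, could be checked along these lines. The helper names `is_guid` and `check_id` and the set `GUID_PROPERTIES` are hypothetical; this is a sketch of the proposal as stated in the thread, not a validator for any published schema.

```python
import uuid

# Properties this message proposes to constrain to GUIDs.
GUID_PROPERTIES = {"run.id", "result.id", "result.correlationId"}


def is_guid(value):
    """Return True if value parses as a UUID string."""
    try:
        uuid.UUID(value)
    except (ValueError, TypeError, AttributeError):
        return False
    return True


def check_id(property_name, value):
    """Accept any value for readable ids; require a GUID for the constrained ones."""
    if property_name in GUID_PROPERTIES:
        return is_guid(value)
    return True
```

For example, `check_id("node.id", "n2")` accepts the readable value, while `check_id("run.id", "n2")` rejects it because run.id would have to be a GUID.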


  • 10.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-04-2018 17:45




    My proposal
     
    result.instanceGuid
    result.correlationGuid
    result.fingerprints
     
    Revisit run.automationId and run.stableId. Replace these with a ‘descriptor’ object that contains a guid, a readable id, and a descriptive string.
     
    [don’t get hung up on my names <g>]
    run.automationCorrelationDescriptor
    run.correlationDescriptor
     
     
    run.correlationDescriptor = {
      "guid": "2374285703485093480983459830938540",
      "readableId": "FxCop Nightly Run/Debug Non-Optimized",
      "description": "FxCop nightly run produced by XXX build lab for compliance certification process. Contact 'mybuildlab@contoso.com' with any questions."
    }
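As a rough illustration of the proposed descriptor, the three fields could be modeled as below. The class name `Descriptor`, the snake_case field names, and the `hierarchy` helper are assumptions made for this sketch; the proposal itself says not to get hung up on names.

```python
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical model of the proposed descriptor object."""

    guid: str
    readable_id: str
    description: str = ""

    def hierarchy(self):
        """Split the namespaced readable id into its levels."""
        return self.readable_id.split("/")


nightly = Descriptor(
    guid=str(uuid.uuid4()),
    readable_id="FxCop Nightly Run/Debug Non-Optimized",
    description="Nightly run produced by the build lab.",
)
levels = nightly.hierarchy()
as_json_ready = asdict(nightly)
```

The GUID gives uniqueness without a central authority, while the readable id carries the hierarchy that a person or dashboard would group by.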
     


    From: Larry Golding (Comcast) <larrygolding@comcast.net>

    Sent: Friday, May 4, 2018 10:09 AM
    To: Michael Fanning <Michael.Fanning@microsoft.com>; 'O'Neil, Yekaterina Tsipenyuk' <katrina@microfocus.com>; sarif@lists.oasis-open.org
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for


     
    Ok, so we’ve agreed on this so far:
     

    result.id , a GUID, unique across every result reported in every run. result.correlationId , a GUID, common to a set of results that are logically identical. result.fingerprints , and array of
    calculated values, each of which captures a stable identifier for a set of logically identical results. An array, so that the RMS can improve a fingerprint without losing the old value.
     
    This supports both Michael’s and Yekaterina’s/SCA’s usage scenarios.
     
    Are we closed on this?
     
    Now as to ids: I think there’s only one other id in the spec we should require to be a GUID:
    run.id .
     
    As for all the others:

    Not rule.id, for sure.
    Not edge.id or node.id – they only need to be unique within a graph, and values like "e1" or "n2" are easier to read.
    Not graph.id, or graphTraversal.id – they only need to be unique within a result (or, in the case of graph.id, possibly within a run).
    Not run.automationId – it’s intended to be a value that’s meaningful to the engineering system, like a build id.
    Not run.stableId – that’s the namespaced thing like "Nightly security scan/x86".
    Not physicalLocation.id – that’s the integer we use in message substitution sequences like "{2}".
    Not threadFlow.id – it only needs to be unique within a codeFlow, so sequential numbers would be fine.
    Not notification.id – that should be something human readable like "RunStarted".
    Not message.resourceId – that should be human readable like "ErrorUnitializedVariable".
     
    Larry
     


    From: Michael Fanning < Michael.Fanning@microsoft.com >

    Sent: Friday, May 4, 2018 7:54 AM
    To: Larry Golding (Comcast) < larrygolding@comcast.net >; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >;
    sarif@lists.oasis-open.org ; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for


     
    We have a slot for computed fingerprints, it is result.fingerprints, so I suggest we don’t overload it. This property is an array in part because Yekaterina indicated that in rare cases a fingerprint would change slightly, and it may be
    valuable to retain the previously computed version.
     
    Let me repeat a comment that Henny made previously: our success here depends in part on clear semantics around what, exactly, lives in each property. If a correlationId could be either a computed fingerprint or a guid, we could just refer people to the fingerprints array. If you’re looking for an id, just grab the first one in the array.
     
    I am really beginning to think it’s cleaner if we redefine ids as guids where guids make sense in the format, i.e., as entirely arbitrary (not computed from any results data), opaque, and unique values, and leave result.fingerprints and result.partialFingerprints dedicated to the fingerprints concept.
     


    From: Larry Golding (Comcast) < larrygolding@comcast.net >

    Sent: Thursday, May 3, 2018 3:33 PM
    To: Michael Fanning < Michael.Fanning@microsoft.com >; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >;
    sarif@lists.oasis-open.org ; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for


     
    Thanks for reading! Let’s put a pin in your “identifier object” question for a moment so I can ask you:
     

    What do you think of overloading correlationId so it can hold either an arbitrary identifier (as in your usage scenario) or a computed “fingerprint” (as in Yekaterina’s SCA scenario) – as opposed to separate correlationId and fingerprint(s) properties?
    Remind me again why “fingerprints” (however we name it or overload it) is plural.
     
    Now as to “identifier object”… as you say, it’s not useful for result.id or result.correlationId (because nobody’s going to generate a human-readable equivalent for those). As to run.automationId and run.stableId – I don’t see the point of a GUID associated with the namespaced human-readable values for these properties. GUIDs are fine when you have to guarantee uniqueness and there’s no central authority. IMO, within any given team’s engineering system, there would be no danger of choosing two identifiers with the same human-readable name, but different semantics requiring them to be distinguished – because you have a central authority. The complexity/value trade-off doesn’t work for me here, but I’m open to persuasion.
     
    Thoughts?
     


    From: Michael Fanning < Michael.Fanning@microsoft.com >

    Sent: Thursday, May 3, 2018 3:14 PM
    To: Larry Golding (Comcast) < larrygolding@comcast.net >; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >;
    sarif@lists.oasis-open.org ; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for


     
    Larry, I read your proposal around id and correlationId and it is clear and makes very good conceptual sense.
     
    These things are easy to describe in the spec: an id is a guid, generated on the fly, which is an identifier that is only good for a specific result in a single log file. The correlationId is a guid that correlates logically unique instances
    of a result across multiple log files.
     
    Btw – I am wondering whether we need an identifier object, which explicitly contains a guid, a readable id (which is an arbitrary namespaced label that provides some hierarchy), and a description. This id object would work well for automationId and for stableId. For id or correlationId, it looks less helpful. We might consider renaming these to result.instanceGuid and result.correlationGuid.
     


    From: Larry Golding (Comcast) < larrygolding@comcast.net >

    Sent: Thursday, May 3, 2018 12:43 PM
    To: 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >; Michael Fanning < Michael.Fanning@microsoft.com >;
    sarif@lists.oasis-open.org ; 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for


     
    Of course, we could dispense with result.fingerprints and just keep result.correlationId, documenting that it could either be an arbitrary identifier or a calculated fingerprint value.
     
    I’m still not quite clear on whether correlationId would need to be plural in that case.
     


    From: sarif@lists.oasis-open.org < sarif@lists.oasis-open.org >
    On Behalf Of Larry Golding (Comcast)
    Sent: Thursday, May 3, 2018 11:03 AM
    To: 'O'Neil, Yekaterina Tsipenyuk' < katrina@microfocus.com >; 'Michael Fanning' < Michael.Fanning@microsoft.com >;
    sarif@lists.oasis-open.org ; O'Neil, Yekaterina Tsipenyuk < katrina@microfocus.com >
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for


     
    Hi all,
     
    Yekaterina, thanks for the explanation. Since my memory is usually so poor, I was pleased to find that I remembered most of what you just wrote



  • 11.  RE: [sarif] partialFingerprints: the words the world has been waiting for

    Posted 05-03-2018 17:33
    Hi Michael,

    Before reading Yekaterina’s follow-on, I wanted to say that I understand your approach. You are saying that once you have decided that a result in today's build is logically the same as a result in yesterday's build, there's no need to persist a “fingerprint” that essentially captures the result of your comparison. Instead, you just stamp the two “logically identical” results with the same id.

    If we settle on this model, I suggest that we shouldn’t use result.id for this purpose. Instead, I would introduce a new property result.correlationId. Every single result in every single run would have a unique result.id. Otherwise, a result management system could store only one of a set of “logically identical” results.

    I would modify your step 5 as follows:

    5) For each result in the current run: if it does not match a result in the baseline run, generate a new GUID and assign it to result.correlationId. If it does match a result in the baseline run, copy the baseline result’s correlationId to the new result. In either case, update result.baselineState in the current result appropriately.

    Now to read Yekaterina’s message.

    Larry

    From: Michael Fanning <Michael.Fanning@microsoft.com>
    Sent: Wednesday, May 2, 2018 5:47 PM
    To: Larry Golding (Comcast) <larrygolding@comcast.net>; sarif@lists.oasis-open.org
    Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for

    My thinking, which I tried to articulate in today’s discussion, more or less successfully, is that result matching is not a matter of comparing a previously computed fingerprint to another. Instead, result matching is a complex algorithm that tries to stitch various results together.
    If unsuccessful in producing an exact match, the algorithm may fall back to partial fingerprints, which are essentially logical- and physical-location-free things that may still help determine issue identity (in practice, a result matcher might still have a notion of two files that should be compared for the match, but have lost all other useful intra-file location details).

    With the definition above, a partial fingerprint is partial in the sense that it is a speculative match that doesn’t benefit from other data that would increase confidence in a match. It is also a contribution, as per our previous definition, in the sense that you might try to glue this information to whatever else you have (such as a file name, where you’ve lost the location details).

    I think the most significant impact of the reorientation above is how we think of result.fingerprints. This data now truly becomes mostly a placeholder for putting data produced by legacy formats. We wouldn’t expect fingerprints to be populated by a result management system. Instead, this is what we’d see:

    1) SARIF baseline is loaded, which the result management system has populated with instance ids (a guid, for example).
    2) A new SARIF log is loaded. The stable ids match between these, so they are candidates to compare.
    3) The result matcher runs an elaborate algorithm to try to correlate results, which includes things like remapping file names, loading them, and running standard line-level diff algorithms to find matched, moved, new and deleted lines.
    4) After identifying exact matches based on file diff (and other precise locators such as fully qualified logical name), the result matching algorithm falls back to partial fingerprints (such as surrounding context region) to make a match.
    5) For all matches, when found, the instance id from the baseline flows to the newer SARIF. We also update the baseline state.

    And that’s it. At no point does it seem critical to populate the fingerprints object.
    You could imagine the fingerprints of the baseline log file containing some fingerprints that will always match if file name + physical location details haven’t changed. But how useful is that? (We already have file hashes to tell us this.) If you have to diff two files anyway to overcome line churn, the extra work of prepopulating and storing fingerprints might not provide cost ROI.

    Michael

    From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of Larry Golding (Comcast)
    Sent: Wednesday, May 2, 2018 5:23 PM
    To: sarif@lists.oasis-open.org
    Subject: [sarif] partialFingerprints: the words the world has been waiting for

    For a long time we’ve agreed that partialFingerprints shouldn’t include information that’s deducible from the SARIF file, but the spec has never said so. As part of the “fingerprints” draft that I just merged and pushed, Appendix B now says the magic words:

    An analysis tool SHALL NOT include in partialFingerprints information that a result management system could deduce from other information in the SARIF file, for example, file hashes. Rather, the result management would use such information, along with partialFingerprints, in its computation of fingerprints.

    I understand that our vision of partialFingerprints is still evolving, but this will do for now.

    Larry
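
Larry's modified step 5 from this post (copy the baseline correlationId on a match, otherwise mint a new GUID, and update baselineState either way) can be sketched in Python as below. This is an illustration under stated assumptions, not the matcher described in the thread: results are plain dicts, the function names are hypothetical, the baselineState strings "new" and "existing" are placeholder values, and `match_by_partial_fingerprints` is a deliberately naive stand-in for the elaborate diff-based matching Michael describes.

```python
import uuid
from typing import Callable, Dict, List, Optional

Result = Dict[str, object]  # simplified stand-in for a SARIF result object


def assign_correlation_ids(
    baseline: List[Result],
    current: List[Result],
    find_match: Callable[[Result, List[Result]], Optional[Result]],
) -> None:
    """Stamp each current result with a correlationId and a baselineState."""
    for result in current:
        match = find_match(result, baseline)
        if match is None:
            # No logically identical baseline result: mint a new GUID.
            result["correlationId"] = str(uuid.uuid4())
            result["baselineState"] = "new"  # assumed value
        else:
            # Logically identical: reuse the baseline's correlationId.
            result["correlationId"] = match["correlationId"]
            result["baselineState"] = "existing"  # assumed value


def match_by_partial_fingerprints(
    result: Result, baseline: List[Result]
) -> Optional[Result]:
    """Naive stand-in matcher: same ruleId and equal, non-empty partialFingerprints."""
    fingerprints = result.get("partialFingerprints")
    if not fingerprints:
        return None
    for candidate in baseline:
        if (candidate.get("ruleId") == result.get("ruleId")
                and candidate.get("partialFingerprints") == fingerprints):
            return candidate
    return None
```

The matcher is passed in as a parameter so that a real implementation (file remapping, line-level diff, then partial-fingerprint fallback) could replace the stand-in without changing how ids are assigned.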