OASIS Static Analysis Results Interchange Format (SARIF) TC

 View Only
Expand all | Collapse all

Please comment on #125

  • 1.  Please comment on #125

    Posted 04-06-2018 21:33
    #125 : Address corner case for generated files in run.files dictionary   This is the scenario where the same physical file is re-written in the course of an analysis. Please see my comments in the issue. What is the scenario here? – that is:   What tool is involved? What is the nature of the file that’s being re-written? Is it necessary to represent this file in the run.files dictionary? Is it necessary to represent multiple versions of this re-written file in the run.files dictionary? Would a viewer need access to any version of this file except the last one written ?   Thanks, Larry


  • 2.  RE: Please comment on #125

    Posted 04-10-2018 02:59
    I’ve thought about this issue a bit. We should be thinking about an analysis that provides a hit in any generated file that isn’t under source control. For example, a generated XAML code-behind file. The corner case covers something even more problematic, a single analysis run where generated files are, for example, overwritten on a per-project basis (to a common location in some build intermediates folder). To answer your questions:   This isn’t tool specific, it relates to scan targets which are themselves generated content not under source control (and which are fluid/overwritten even while some larger build analysis is taking place) The file is a valid scan target, whatever that means. A PCH file or other intermediate. A header file that is generated by some perl script. Etc. Producers SHOULD persist all files to run.files that aren’t managed by a version control system. This is just good general guidance. It may be necessary to represent multiple versions of this re-written file in the run.files dictionary, if multiple results instances exist that point to different versions of the generated content. Ditto, a viewer will need to access any version of the file referenced by any result.     From: Larry Golding (Comcast) <larrygolding@comcast.net> Sent: Friday, April 6, 2018 2:31 PM To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; sarif@lists.oasis-open.org Subject: Please comment on #125   #125 : Address corner case for generated files in run.files dictionary   This is the scenario where the same physical file is re-written in the course of an analysis. Please see my comments in the issue. What is the scenario here? – that is:   What tool is involved? What is the nature of the file that’s being re-written? Is it necessary to represent this file in the run.files dictionary? Is it necessary to represent multiple versions of this re-written file in the run.files dictionary? Would a viewer need access to any version of this file except the last one written ?   Thanks, Larr


  • 3.  RE: Please comment on #125

    Posted 04-10-2018 16:43
    I see!  A SARIF producer enables consumers to access previous versions of an overwritten file not just by mentioning each version in the run.files dictionary, but by persisting their contents there. It seems so obvious now


  • 4.  Re: Please comment on #125

    Posted 04-10-2018 18:31
    This scenario was mentioned by Paul at the in person meeting. Maybe there should be a separation of the path to the file when captured and the path to a copy of the file contents. Jim On 04/10/2018 11:41 AM, Larry Golding (Comcast) wrote: I see!  A SARIF producer enables consumers to access previous versions of an overwritten file not just by /mentioning/ each version in the run.files dictionary, but by /persisting their contents/ there. It seems so obvious now


  • 5.  RE: [sarif] Re: Please comment on #125

    Posted 04-10-2018 18:51
    I'm not sure what you mean by that. In the example below, the "path to the file when captured" is given by a combination of run.originalUriBaseIds and result.location.physicalLocation.fileLocation.uri. And the "copy of the file contents" is in run.files. Can you tell me, in terms of a modification to the sample JSON code I sent, what you're suggesting? Larry


  • 6.  Re: [sarif] Re: Please comment on #125

    Posted 04-10-2018 19:15
    As part of build monitoring, I might capture each file that is accessed by a build commands and store them in a file system with a unique name, while recording their original path. I could then create an archive of this directory or store this directory somewhere. Now I have files with the contents, but they are not located at the original path. It might be nice to be able to refer these locations instead of (or in addition to) storing the contents in-line in the fileContents value. Jim On 04/10/2018 01:48 PM, Larry Golding (Comcast) wrote: I'm not sure what you mean by that. In the example below, the "path to the file when captured" is given by a combination of run.originalUriBaseIds and result.location.physicalLocation.fileLocation.uri. And the "copy of the file contents" is in run.files. Can you tell me, in terms of a modification to the sample JSON code I sent, what you're suggesting? Larry


  • 7.  RE: [sarif] Re: Please comment on #125

    Posted 04-10-2018 19:29
    I see. So the `file` object might have a new property `originalFileLocation` in addition to its existing `fileLocation` property. [If that's not what you're proposing, stop reading now and correct me!] So we have two different proposals: Your proposal introduces a new property `file.originalFileLocation`. It relies on intervention from the build monitoring system to move the files around, and to populate `file.originalLocation`. It does _not_ require potentially large file contents to be embedded in the SARIF log file. My proposal (well, I suppose it's actually Michael's


  • 8.  Re: [sarif] Re: Please comment on #125

    Posted 04-10-2018 20:06
    I like it :). On 04/10/2018 02:26 PM, Larry Golding (Comcast) wrote: I see. So the `file` object might have a new property `originalFileLocation` in addition to its existing `fileLocation` property. [If that's not what you're proposing, stop reading now and correct me!] So we have two different proposals: Your proposal introduces a new property `file.originalFileLocation`. It relies on intervention from the build monitoring system to move the files around, and to populate `file.originalLocation`. It does _not_ require potentially large file contents to be embedded in the SARIF log file. My proposal (well, I suppose it's actually Michael's


  • 9.  RE: [sarif] Re: Please comment on #125

    Posted 04-10-2018 20:34
    Hold on. I didn't say you SHALL persist file contents in this case, I said that in many cases you should consider it (because that content isn't under source control). Please note, that if you don't emit the file contents, the uriBaseId and file table key name construction still makes it clear that separate results referred to a different version of a generated file that was persisted to a common location. Without introducing a new property. And so I prefer my solution.


  • 10.  RE: [sarif] Re: Please comment on #125

    Posted 04-11-2018 17:11
    I go back to a question I asked you on another thread: If you don't persist the overwritten contents to the log file, and you don't move a copy of the file to another location before overwriting it, then what's a viewer to do when a user tries to navigate to the site of the error? Consider this experience: 1. A tool detects a result in version 1 of the generated file foo.g.cs and logs it. 2. The build system overwrites the generated file with version 2. Later... 3. The user opens the log in a viewer and double clicks on a result from the overwritten file. 4. The viewer finds the file at the specified location, but it's not the right version, so the target location has nothing to do with the reported result. One way to avoid this is to introduce a convention and have a smart viewer: Convention: the uriBaseId has the form "someString-n", for example "generated-1", "generated-2"... [Note: I think Michael might have suggested this convention at one point.] Smart viewer: when the user attempts to navigate to a file whose uriBaseId is "generated-1", the viewer notices that run.originalUriBaseIds has an entry for a later version "generated-2", so it tells the user that the file was overwritten. I don't like this because it requires the analysis tool (or the build system's post-processor) to add an entry for "generated-2" to run.originalUriBaseIds, even if no results are found in version 2 of the overwritten file. This just feels really fragile to me. Larry


  • 11.  RE: [sarif] Re: Please comment on #125

    Posted 04-13-2018 16:06
    I don't think there's anything to do here. There's a well-established convention in viewers for situations where file contents can't be matched by hash, they throw up a file/open dialog and/or emit a message 'the file at the current location does not match the original content, would you like to open it anyway?' You see both of these when debugging. So, we don't need a smart convention. Maybe we DO need to strongly recommend that, minimally, the file hash information is emitted in cases where it is prohibitively expense to embed the contents. This gives the viewer enough information to know if it found the right content or not. Michael


  • 12.  RE: [sarif] Re: Please comment on #125

    Posted 04-13-2018 16:58
    Your suggestion to include the file hash is a good one. Two comments, though: 1. Important comment: Providing the hash does indeed avoid the experience I described, where a viewer unwittingly opens the current version of a file even though the result was found in a previous version. But it does not address my main concern, which is that unless you embed the contents, there is _no way at all_ for a viewer or any other tool to allow a user to see the result in its context, which will make it more difficult to correct. For example, if the repeatedly-overwritten file is the output of a code generator, and the code generator has a bug, not having the actual generated code will make the problem harder to fix. Not impossible; the error message itself might suffice ("Missing semicolon!"). But harder. I can certainly add text explaining the trade-off. 2. Unimportant comment: Although strictly speaking, there's no such thing as "strongly recommend", I can fake it the way I did when writing about hash functions. Recall that I wrote "To maximize interoperability, the array MAY contain algorithms from IANA-HASH ... The array MAY contain algorithms that do not appear in IANA-HASH, but at the expense of interoperability." So I used "MAY" for both, but I editorialized a little. Likewise here, I can say something like "The files object SHOULD contain an entry for the file that includes the hash property ... The files object MAY omit the entry, but that will make it impossible for a consumer to verify that it has the desired version of the file." Larry


  • 13.  RE: [sarif] Re: Please comment on #125

    Posted 04-13-2018 18:20
    You know, part of our proposed solution might not be needed at all for the "generated files" scenario. Consider this sequence: 1. The build system executes a build and creates generated file. 2. The analysis tool analyzes the output, including the generated file, detects a result in the generated file, and produces a SARIF log file. 3. The build system builds a different configuration, and overwrites the generated file. 4. The analysis tool analyzes the output, including the generated file, detects a result in the generated file, and produces a _different_ SARIF log file ... or, it inserts a new `run` into the existing log file. Either way, there's no need to generate unique uriBaseIds. The two incarnations of the generated file are mentioned in different `run.files` dictionaries, so there's no conflict. We still have the issue of how to enable the user to diagnose the issue, so we _should_ still provide guidance about when it makes sense to embed file contents, and when it makes more sense just mention the file hash. All this assumes that there is only one version of the generated file present for any given run of the analysis tool. The question is: Is it ever the case that multiple versions of the same file are created _during a single run of an analysis tool_? Jim, please comment. Thanks, Larry


  • 14.  RE: [sarif] Re: Please comment on #125

    Posted 04-13-2018 20:10
    The answer to your question is yes, a user could kick off a single build as part of a single tool run, each project of which could generate a new version of a file. If the tool intercepted all compiler calls to do some work, for example, you might get into this condition. file hash SHOULD be generated for generated files that may be overwritten. If the content is too copious to be embedded in the log file, we have the snippet feature to help mitigate against the possibility the original file hasn't been overwritten. Michael


  • 15.  RE: [sarif] Re: Please comment on #125

    Posted 04-13-2018 20:19
    Right, the snippet feature -- I'd forgotten about that. Ok, I have everything I need now to write the guidance. Larry


  • 16.  RE: [sarif] RE: Please comment on #125

    Posted 04-10-2018 20:30




    This is right. And it really does illustrate what an odd corner case this is. You have to use a uriBaseId to construct a reference for this because the otherwise identical absolute URLs don’t comprise unique keys for the files table.
     
    Btw – this example sheds light on a need for some general guidance for how to handle embedded file contents. In this corner case, you must absolutely prefer the embedded content because it might not exist anywhere else (i.e., it was never
    in source control or it was on disk previously but was subsequently overwritten).

     
    But does it make sense as a rule to tell people to prefer the embedded file contents if it exists? I think it does as I think about it. In cases where you are certain to have access to files properly matched to your SARIF results (such
    as if you just completed a local analysis or if your absolute URLs are versioned and point to accessible copies of the file), producers should not generate the embedded file contents. As a rule, injecting the file contents is a post-processing step that is
    explicitly completed because the log file is being prepared for ingestion into a results mgmt. system (attached to a work item, persisted to a common remote store, etc.).
     
    Michael


    From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org>
    On Behalf Of Larry Golding (Comcast)
    Sent: Tuesday, April 10, 2018 9:41 AM
    To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; sarif@lists.oasis-open.org
    Subject: [sarif] RE: Please comment on #125


     
    I see!  A SARIF producer enables consumers to access previous versions of an overwritten file not just by
    mentioning each version in the run.files dictionary, but by
    persisting their contents there. It seems so obvious now