OASIS Static Analysis Results Interchange Format (SARIF) TC

 View Only
  • 1.  Change draft for #178 (column interpretation property)

    Posted 06-02-2018 00:23
    I pushed a change draft for Issue #178 : Support a character or column interpretation property.   Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx   I will move its adoption at TC #19 on June 6 th .   Here is the change in its entirety (the property is on the run object): 3.11.19 columnsPerSurrogatePair property A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [ UNICODE10 ] is considered to occupy. If columnsPerSurrogatePair is absent, it defaults to 1.   Thanks, Larry


  • 2.  Re: [sarif] Change draft for #178 (column interpretation property)

    Posted 06-04-2018 13:17
    Hi Larry, I think this change draft doesn't quite solve the issue. In particular, I believe this restriction is wrong: > for results reported in UTF-16-encoded files    It's not UTF-16 encoded files that are (specifically) the problem. The problem is that many languages internally  represent strings as UTF-16 regardless  of the original encoding of the file. For ease of implementation, static analysis tools written in these languages will often produce column numbers which count surrogate pairs as 2 columns, and all other code points as 1 column - i.e. count columns based on utf-16 code units, regardless of the original encoding. Given that this can apply even when the file is not utf-16 encoded, I'm not sure specifying columns-per-surrogate pair is the right way to go. My thought when I submitted the issue (which I didn't fully explain, sorry!) was that we should introduce an enumerated property called something like "columnKind", with options of:  * unicodeCodePoint  * utf16CodeUnit This is more flexible than the column counting approach, as it allows us the option to add other column types later (utf8CodeUnit, for example). I also note that this text still occurs in the change draft (3.22.2): > In text files encoded in UTF-16, a “surrogate pair” [ UNICODE10 ]  SHALL  be considered as a single character.    It looks like it should just be removed. Also in the same section, we have: > A  column number represents a count of characters.  With the change, this is no longer true. This should be updated to reference the new property. I also think it should be made clear that the interpretation of the column count is not in any way reliant on the original encoding of the file. Apologies for not sending this out sooner, I was away over the weekend. Thanks, Luke On Sat, Jun 2, 2018 at 1:23 AM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: I pushed a change draft for Issue #178 : Support a character or column interpretation property.   Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx   I will move its adoption at TC #19 on June 6 th .   Here is the change in its entirety (the property is on the run object): 3.11.19 columnsPerSurrogatePair property A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [ UNICODE10 ] is considered to occupy. If columnsPerSurrogatePair is absent, it defaults to 1.   Thanks, Larry


  • 3.  RE: [sarif] Change draft for #178 (column interpretation property)

    Posted 06-04-2018 18:30
    Thanks Luke. Really, your original description of #178 explained this very well, and it’s my fault for misinterpreting it and making you explain it a second time. Sorry.   Yes, your enumerated columnKind property, along with the ancillary changes you suggest, seem like a good approach. I’ll put out another change draft.   In the issue, Michael wrote: I agree with this and that also raises the question of whether column is the appropriate term to use in the format . It is clear that the information that is persisted to SARIF should be data that isn't subject to, for example, the configuration of the editor (I'd forgotten about the tabs -> columns issue). The most reliable mechanisms are either to specify Unicode code units or code points. I'd propose that code units are the default and that we provide a property that allows to make this explicitly code unit or code points. I'd also suggest that we consider renaming startColumn/endColumn (despite the pervasive use of these terms in existing tool log file output). This agrees with what you said below, except that Michael also specifies the default (and I agree with utf16CodeUnits as the default).   The other issue Michael raises is naming. I agree that “column” isn’t accurate. “character”  is also wrong, both because of surrogate pairs and because of Unicode normalization issues with accented characters.   There is no English word for the concept we want, which is “the unit in which a language or tool measures string lengths”. We could coin “stringLengthUnit” or “stringUnit” but I don’t think it’s necessary.   IMO, we should continue to use column. As Michael said, many tools use this term in their output, even though it’s not strictly correct.   Larry   From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of Luke Cartey Sent: Monday, June 4, 2018 6:17 AM To: Larry Golding (Comcast) <larrygolding@comcast.net> Cc: sarif@lists.oasis-open.org Subject: Re: [sarif] Change draft for #178 (column interpretation property)   Hi Larry,   I think this change draft doesn't quite solve the issue. In particular, I believe this restriction is wrong:   > for results reported in UTF-16-encoded files      It's not UTF-16 encoded files that are (specifically) the problem. The problem is that many languages internally  represent strings as UTF-16 regardless  of the original encoding of the file. For ease of implementation, static analysis tools written in these languages will often produce column numbers which count surrogate pairs as 2 columns, and all other code points as 1 column - i.e. count columns based on utf-16 code units, regardless of the original encoding.   Given that this can apply even when the file is not utf-16 encoded, I'm not sure specifying columns-per-surrogate pair is the right way to go.   My thought when I submitted the issue (which I didn't fully explain, sorry!) was that we should introduce an enumerated property called something like "columnKind", with options of:    * unicodeCodePoint  * utf16CodeUnit   This is more flexible than the column counting approach, as it allows us the option to add other column types later (utf8CodeUnit, for example).   I also note that this text still occurs in the change draft (3.22.2):   > In text files encoded in UTF-16, a “surrogate pair” [ UNICODE10 ]  SHALL  be considered as a single character.      It looks like it should just be removed.   Also in the same section, we have:   > A  column number represents a count of characters.  With the change, this is no longer true. This should be updated to reference the new property. I also think it should be made clear that the interpretation of the column count is not in any way reliant on the original encoding of the file.   Apologies for not sending this out sooner, I was away over the weekend.   Thanks,   Luke   On Sat, Jun 2, 2018 at 1:23 AM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: I pushed a change draft for Issue #178 : Support a character or column interpretation property.   Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx   I will move its adoption at TC #19 on June 6 th .   Here is the change in its entirety (the property is on the run object): 3.11.19 columnsPerSurrogatePair property A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [ UNICODE10 ] is considered to occupy. If columnsPerSurrogatePair is absent, it defaults to 1.   Thanks, Larry


  • 4.  RE: [sarif] Change draft for #178 (column interpretation property)

    Posted 06-04-2018 18:57
    What do you think of this?   3.11.19 columnKind property A run object MAY contain a property named columnKind whose value is a string that specifies the unit in which the analysis tool measures columns. If present, the columnKind property SHALL have one of the following values, with the specified meanings: ·          "utf16CodeUnits" : Each UTF-16 code unit is considered to occupy one column. This means that a surrogate pair is considered to occupy two columns. ·          "unicodeCodePoints" : Each Unicode code point (abstract character) is considered to occupy one column. This means that even a character that is represented in UTF-16 by a surrogate pair is considered to occupy one column. If absent, columnKind SHALL default to "utf16CodeUnits" . If a SARIF consumer uses a column measurement unit other than that specified by columnKind , and if the consumer is required to interact with the file contents (for example, by displaying the file in an editor and highlighting a region), the consumer SHALL recompute column numbers in its (the consumer’s) native measurement unit.   Larry   From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of Larry Golding (Comcast) Sent: Monday, June 4, 2018 11:28 AM To: 'Luke Cartey' <luke@semmle.com> Cc: sarif@lists.oasis-open.org Subject: RE: [sarif] Change draft for #178 (column interpretation property)   Thanks Luke. Really, your original description of #178 explained this very well, and it’s my fault for misinterpreting it and making you explain it a second time. Sorry.   Yes, your enumerated columnKind property, along with the ancillary changes you suggest, seem like a good approach. I’ll put out another change draft.   In the issue, Michael wrote: I agree with this and that also raises the question of whether column is the appropriate term to use in the format . It is clear that the information that is persisted to SARIF should be data that isn't subject to, for example, the configuration of the editor (I'd forgotten about the tabs -> columns issue). The most reliable mechanisms are either to specify Unicode code units or code points. I'd propose that code units are the default and that we provide a property that allows to make this explicitly code unit or code points. I'd also suggest that we consider renaming startColumn/endColumn (despite the pervasive use of these terms in existing tool log file output). This agrees with what you said below, except that Michael also specifies the default (and I agree with utf16CodeUnits as the default).   The other issue Michael raises is naming. I agree that “column” isn’t accurate. “character”  is also wrong, both because of surrogate pairs and because of Unicode normalization issues with accented characters.   There is no English word for the concept we want, which is “the unit in which a language or tool measures string lengths”. We could coin “stringLengthUnit” or “stringUnit” but I don’t think it’s necessary.   IMO, we should continue to use column. As Michael said, many tools use this term in their output, even though it’s not strictly correct.   Larry   From: sarif@lists.oasis-open.org < sarif@lists.oasis-open.org > On Behalf Of Luke Cartey Sent: Monday, June 4, 2018 6:17 AM To: Larry Golding (Comcast) < larrygolding@comcast.net > Cc: sarif@lists.oasis-open.org Subject: Re: [sarif] Change draft for #178 (column interpretation property)   Hi Larry,   I think this change draft doesn't quite solve the issue. In particular, I believe this restriction is wrong:   > for results reported in UTF-16-encoded files      It's not UTF-16 encoded files that are (specifically) the problem. The problem is that many languages internally  represent strings as UTF-16 regardless  of the original encoding of the file. For ease of implementation, static analysis tools written in these languages will often produce column numbers which count surrogate pairs as 2 columns, and all other code points as 1 column - i.e. count columns based on utf-16 code units, regardless of the original encoding.   Given that this can apply even when the file is not utf-16 encoded, I'm not sure specifying columns-per-surrogate pair is the right way to go.   My thought when I submitted the issue (which I didn't fully explain, sorry!) was that we should introduce an enumerated property called something like "columnKind", with options of:    * unicodeCodePoint  * utf16CodeUnit   This is more flexible than the column counting approach, as it allows us the option to add other column types later (utf8CodeUnit, for example).   I also note that this text still occurs in the change draft (3.22.2):   > In text files encoded in UTF-16, a “surrogate pair” [ UNICODE10 ]  SHALL  be considered as a single character.      It looks like it should just be removed.   Also in the same section, we have:   > A  column number represents a count of characters.  With the change, this is no longer true. This should be updated to reference the new property. I also think it should be made clear that the interpretation of the column count is not in any way reliant on the original encoding of the file.   Apologies for not sending this out sooner, I was away over the weekend.   Thanks,   Luke   On Sat, Jun 2, 2018 at 1:23 AM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: I pushed a change draft for Issue #178 : Support a character or column interpretation property.   Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx   I will move its adoption at TC #19 on June 6 th .   Here is the change in its entirety (the property is on the run object): 3.11.19 columnsPerSurrogatePair property A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [ UNICODE10 ] is considered to occupy. If columnsPerSurrogatePair is absent, it defaults to 1.   Thanks, Larry


  • 5.  Re: [sarif] Change draft for #178 (column interpretation property)

    Posted 06-04-2018 19:05
    Hi Larry, That sounds good to me. I agree with your summary on the naming. Cheers, Luke On Mon, Jun 4, 2018 at 7:57 PM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: What do you think of this?   3.11.19 columnKind property A run object MAY contain a property named columnKind whose value is a string that specifies the unit in which the analysis tool measures columns. If present, the columnKind property SHALL have one of the following values, with the specified meanings: ·          "utf16CodeUnits" : Each UTF-16 code unit is considered to occupy one column. This means that a surrogate pair is considered to occupy two columns. ·          "unicodeCodePoints" : Each Unicode code point (abstract character) is considered to occupy one column. This means that even a character that is represented in UTF-16 by a surrogate pair is considered to occupy one column. If absent, columnKind SHALL default to "utf16CodeUnits" . If a SARIF consumer uses a column measurement unit other than that specified by columnKind , and if the consumer is required to interact with the file contents (for example, by displaying the file in an editor and highlighting a region), the consumer SHALL recompute column numbers in its (the consumer’s) native measurement unit.   Larry   From: sarif@lists.oasis-open.org < sarif@lists.oasis-open.org > On Behalf Of Larry Golding (Comcast) Sent: Monday, June 4, 2018 11:28 AM To: 'Luke Cartey' < luke@semmle.com > Cc: sarif@lists.oasis-open.org Subject: RE: [sarif] Change draft for #178 (column interpretation property)   Thanks Luke. Really, your original description of #178 explained this very well, and it’s my fault for misinterpreting it and making you explain it a second time. Sorry.   Yes, your enumerated columnKind property, along with the ancillary changes you suggest, seem like a good approach. I’ll put out another change draft.   In the issue, Michael wrote: I agree with this and that also raises the question of whether column is the appropriate term to use in the format . It is clear that the information that is persisted to SARIF should be data that isn't subject to, for example, the configuration of the editor (I'd forgotten about the tabs -> columns issue). The most reliable mechanisms are either to specify Unicode code units or code points. I'd propose that code units are the default and that we provide a property that allows to make this explicitly code unit or code points. I'd also suggest that we consider renaming startColumn/endColumn (despite the pervasive use of these terms in existing tool log file output). This agrees with what you said below, except that Michael also specifies the default (and I agree with utf16CodeUnits as the default).   The other issue Michael raises is naming. I agree that “column” isn’t accurate. “character”  is also wrong, both because of surrogate pairs and because of Unicode normalization issues with accented characters.   There is no English word for the concept we want, which is “the unit in which a language or tool measures string lengths”. We could coin “stringLengthUnit” or “stringUnit” but I don’t think it’s necessary.   IMO, we should continue to use column. As Michael said, many tools use this term in their output, even though it’s not strictly correct.   Larry   From: sarif@lists.oasis-open.org < sarif@lists.oasis-open.org > On Behalf Of Luke Cartey Sent: Monday, June 4, 2018 6:17 AM To: Larry Golding (Comcast) < larrygolding@comcast.net > Cc: sarif@lists.oasis-open.org Subject: Re: [sarif] Change draft for #178 (column interpretation property)   Hi Larry,   I think this change draft doesn't quite solve the issue. In particular, I believe this restriction is wrong:   > for results reported in UTF-16-encoded files      It's not UTF-16 encoded files that are (specifically) the problem. The problem is that many languages internally  represent strings as UTF-16 regardless  of the original encoding of the file. For ease of implementation, static analysis tools written in these languages will often produce column numbers which count surrogate pairs as 2 columns, and all other code points as 1 column - i.e. count columns based on utf-16 code units, regardless of the original encoding.   Given that this can apply even when the file is not utf-16 encoded, I'm not sure specifying columns-per-surrogate pair is the right way to go.   My thought when I submitted the issue (which I didn't fully explain, sorry!) was that we should introduce an enumerated property called something like "columnKind", with options of:    * unicodeCodePoint  * utf16CodeUnit   This is more flexible than the column counting approach, as it allows us the option to add other column types later (utf8CodeUnit, for example).   I also note that this text still occurs in the change draft (3.22.2):   > In text files encoded in UTF-16, a “surrogate pair” [ UNICODE10 ]  SHALL  be considered as a single character.      It looks like it should just be removed.   Also in the same section, we have:   > A  column number represents a count of characters.  With the change, this is no longer true. This should be updated to reference the new property. I also think it should be made clear that the interpretation of the column count is not in any way reliant on the original encoding of the file.   Apologies for not sending this out sooner, I was away over the weekend.   Thanks,   Luke   On Sat, Jun 2, 2018 at 1:23 AM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: I pushed a change draft for Issue #178 : Support a character or column interpretation property.   Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx   I will move its adoption at TC #19 on June 6 th .   Here is the change in its entirety (the property is on the run object): 3.11.19 columnsPerSurrogatePair property A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [ UNICODE10 ] is considered to occupy. If columnsPerSurrogatePair is absent, it defaults to 1.   Thanks, Larry


  • 6.  RE: [sarif] Change draft for #178 (column interpretation property)

    Posted 06-04-2018 20:47
    In a phone call just now, Michael agreed as well. I’ll have the revised change draft out in a few minutes.   Thanks, Larry   From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of Luke Cartey Sent: Monday, June 4, 2018 12:05 PM To: Larry Golding (Comcast) <larrygolding@comcast.net> Cc: sarif@lists.oasis-open.org Subject: Re: [sarif] Change draft for #178 (column interpretation property)   Hi Larry,   That sounds good to me. I agree with your summary on the naming.   Cheers,   Luke   On Mon, Jun 4, 2018 at 7:57 PM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: What do you think of this?   3.11.19 columnKind property A run object MAY contain a property named columnKind whose value is a string that specifies the unit in which the analysis tool measures columns. If present, the columnKind property SHALL have one of the following values, with the specified meanings: ·          "utf16CodeUnits" : Each UTF-16 code unit is considered to occupy one column. This means that a surrogate pair is considered to occupy two columns. ·          "unicodeCodePoints" : Each Unicode code point (abstract character) is considered to occupy one column. This means that even a character that is represented in UTF-16 by a surrogate pair is considered to occupy one column. If absent, columnKind SHALL default to "utf16CodeUnits" . If a SARIF consumer uses a column measurement unit other than that specified by columnKind , and if the consumer is required to interact with the file contents (for example, by displaying the file in an editor and highlighting a region), the consumer SHALL recompute column numbers in its (the consumer’s) native measurement unit.   Larry   From: sarif@lists.oasis-open.org < sarif@lists.oasis-open.org > On Behalf Of Larry Golding (Comcast) Sent: Monday, June 4, 2018 11:28 AM To: 'Luke Cartey' < luke@semmle.com > Cc: sarif@lists.oasis-open.org Subject: RE: [sarif] Change draft for #178 (column interpretation property)   Thanks Luke. Really, your original description of #178 explained this very well, and it’s my fault for misinterpreting it and making you explain it a second time. Sorry.   Yes, your enumerated columnKind property, along with the ancillary changes you suggest, seem like a good approach. I’ll put out another change draft.   In the issue, Michael wrote: I agree with this and that also raises the question of whether column is the appropriate term to use in the format . It is clear that the information that is persisted to SARIF should be data that isn't subject to, for example, the configuration of the editor (I'd forgotten about the tabs -> columns issue). The most reliable mechanisms are either to specify Unicode code units or code points. I'd propose that code units are the default and that we provide a property that allows to make this explicitly code unit or code points. I'd also suggest that we consider renaming startColumn/endColumn (despite the pervasive use of these terms in existing tool log file output). This agrees with what you said below, except that Michael also specifies the default (and I agree with utf16CodeUnits as the default).   The other issue Michael raises is naming. I agree that “column” isn’t accurate. “character”  is also wrong, both because of surrogate pairs and because of Unicode normalization issues with accented characters.   There is no English word for the concept we want, which is “the unit in which a language or tool measures string lengths”. We could coin “stringLengthUnit” or “stringUnit” but I don’t think it’s necessary.   IMO, we should continue to use column. As Michael said, many tools use this term in their output, even though it’s not strictly correct.   Larry   From: sarif@lists.oasis-open.org < sarif@lists.oasis-open.org > On Behalf Of Luke Cartey Sent: Monday, June 4, 2018 6:17 AM To: Larry Golding (Comcast) < larrygolding@comcast.net > Cc: sarif@lists.oasis-open.org Subject: Re: [sarif] Change draft for #178 (column interpretation property)   Hi Larry,   I think this change draft doesn't quite solve the issue. In particular, I believe this restriction is wrong:   > for results reported in UTF-16-encoded files      It's not UTF-16 encoded files that are (specifically) the problem. The problem is that many languages internally  represent strings as UTF-16 regardless  of the original encoding of the file. For ease of implementation, static analysis tools written in these languages will often produce column numbers which count surrogate pairs as 2 columns, and all other code points as 1 column - i.e. count columns based on utf-16 code units, regardless of the original encoding.   Given that this can apply even when the file is not utf-16 encoded, I'm not sure specifying columns-per-surrogate pair is the right way to go.   My thought when I submitted the issue (which I didn't fully explain, sorry!) was that we should introduce an enumerated property called something like "columnKind", with options of:    * unicodeCodePoint  * utf16CodeUnit   This is more flexible than the column counting approach, as it allows us the option to add other column types later (utf8CodeUnit, for example).   I also note that this text still occurs in the change draft (3.22.2):   > In text files encoded in UTF-16, a “surrogate pair” [ UNICODE10 ]  SHALL  be considered as a single character.      It looks like it should just be removed.   Also in the same section, we have:   > A  column number represents a count of characters.  With the change, this is no longer true. This should be updated to reference the new property. I also think it should be made clear that the interpretation of the column count is not in any way reliant on the original encoding of the file.   Apologies for not sending this out sooner, I was away over the weekend.   Thanks,   Luke   On Sat, Jun 2, 2018 at 1:23 AM Larry Golding (Comcast) < larrygolding@comcast.net > wrote: I pushed a change draft for Issue #178 : Support a character or column interpretation property.   Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx   I will move its adoption at TC #19 on June 6 th .   Here is the change in its entirety (the property is on the run object): 3.11.19 columnsPerSurrogatePair property A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [ UNICODE10 ] is considered to occupy. If columnsPerSurrogatePair is absent, it defaults to 1.   Thanks, Larry