Hi,

There is no way to automatically determine the encoding of a file. The same sequence of bytes can be a valid byte sequence in multiple encodings that result in different character sequences (what is rendered to the user). For instance, the byte value 0xC1 is GREEK CAPITAL LETTER ALPHA in ISO-8859-7, while it is LATIN CAPITAL LETTER A WITH ACUTE in ISO-8859-15. For files in these encodings there is no indicator sequence in the file, so it is not that an editor inspects the file and determines which one is correct; rather, a human being manually configures their editor, perhaps trying different encodings until one of them makes sense.

A reasonable strategy today might be to look for a BOM and use an encoding of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE; if no BOM is present, use UTF-8 (see the sketch after this message). This still gets it wrong if the file is encoded as BOM-less UTF-16 or UTF-32, ISO-8859-*, CP*, SJIS, ...

SARIF should say that the encoding is known only if it is specified in the SARIF file. If it is known, then that is the encoding. If it is unknown, then users and viewers will need to determine the encoding of files using heuristics, and the file may not be displayed correctly. I don't think that we need to say anything more than that (we don't need to talk about the BOM or how to guess).

The encoding is an attribute of the file, so I do not think there needs to be any encoding associated with a region (the encoding is the same as the file's). I think that we should encourage producers to include encoding information, and that if it is not present, we should note that a viewer may not be able to correctly display the file contents to users.

For snippets, we should say that if there is not a one-to-one mapping of character codes from the source encoding to Unicode, then the highlighting of results based on the snippet may not be correct.

Jim
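A minimal sketch, in Python, of the BOM-probing strategy Jim describes above. The function name and the UTF-8 fallback policy are illustrative assumptions, not anything the thread or the spec mandates:

    import codecs

    # (BOM constant, codec name) pairs. UTF-32LE must be tested before
    # UTF-16LE: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE
    # BOM (FF FE), so the longer mark has to win.
    _BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF8, "utf-8"),
    ]

    def guess_encoding(path: str) -> str:
        """Best-guess codec name for the file at `path`."""
        with open(path, "rb") as f:
            prefix = f.read(4)  # the longest BOM is 4 bytes
        for bom, name in _BOMS:
            if prefix.startswith(bom):
                return name
        # No BOM: assume UTF-8. This is wrong for BOM-less UTF-16/UTF-32,
        # ISO-8859-*, CP*, SJIS, ... -- which is exactly the point above.
        return "utf-8"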
On 05/14/2018 11:34 AM, Michael Fanning wrote:

I’m uncomfortable having our spec become a place where we provide definitive guidance on how text processors (such as code editors) determine file encoding. This isn’t the north star of our effort, nor have we pulled together the relevant industry experience. Leaving SARIF out of the equation, AFAIK every editor today has only a file’s contents available in order to determine its encoding. So I’m not clear on the urgency around persisting relevant information to SARIF and/or advising editors on how to go about this.

Larry, as we discussed offline, I’ve gone through several scenarios mentally to support the position above. Consider these:

1. Let’s say VS supports UTF-16. It can parse UTF-16 surrogate pairs and display them, but somehow VS skipped the part where it reads the UTF-16 BOM on file open to determine endianness and how to parse and display (so SARIF needs to provide it). This seems unlikely.

2. Let’s say VS doesn’t support UTF-16. The SARIF file helpfully provides this encoding information, but so what? VS can’t parse or display the SA results, so what’s the benefit?

3. Let’s say someone is building a new text file viewer. In 100% of non-SARIF cases, that viewer needs to inspect the file on file open to detect the encoding. So how is it that this information must be present in SARIF files or our scenarios fail?

A SARIF producer’s responsibility is to make sure region line/column details are in sync with a text file’s encoding. Any viewer that attempts to display SARIF results needs to be able to detect and handle that file’s encoding. Why does SARIF need to get involved?

I may be missing something, as I am not an expert in this area. To be clear, I am not averse to leaving placeholders for ‘encoding’ and even Jim’s newline-sequences data. Maybe it will give someone a leg up somewhere. I do object to spending lots of time explicating how to handle these things in the spec. We need to close on solid, well-tested dynamic code flows, graphs, etc. That’s the special value we’re adding.

Michael
*From:* Larry Golding (Comcast) <larrygolding@comcast.net>
*Sent:* Saturday, May 12, 2018 2:45 PM
*To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>; sarif@lists.oasis-open.org
*Subject:* RE: Determining file encoding +SARIF

*From:* Larry Golding (Comcast) <larrygolding@comcast.net>
*Sent:* Friday, May 11, 2018 5:33 PM
*To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
*Subject:* RE: Determining file encoding
I might be able to finesse this point. I could remove the whole part of the “text regions” section that presents this (old and busted) way of determining encoding. Then I could say something like this:

A SARIF producer *SHALL* emit text-related region properties only if it knows the character encoding of the file, in which case it *SHALL* also emit file.encoding (§3.17.9) or run.defaultFileEncoding (§3.11.17).

In the section on fixes I’d say something like:

If a SARIF consumer does not know the character encoding of a file, it *SHALL NOT* apply a fix unless the deletedRegion contains binary-related properties.

Larry
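For concreteness, a sketch of the consumer-side rule Larry proposes, with plain dicts standing in for the SARIF file and run objects. The property names file.encoding and run.defaultFileEncoding come from the spec sections he cites; byteOffset here stands in for the “binary-related properties” he mentions, and the helper names are hypothetical:

    from typing import Optional

    def resolve_encoding(file_obj: dict, run: dict) -> Optional[str]:
        """file.encoding wins; else run.defaultFileEncoding; else None (unknown)."""
        return file_obj.get("encoding") or run.get("defaultFileEncoding")

    def can_apply_fix(file_obj: dict, run: dict, deleted_region: dict) -> bool:
        """Refuse a fix when the encoding is unknown, unless the
        deletedRegion is addressed in binary-related properties, which
        do not depend on knowing the character encoding."""
        if resolve_encoding(file_obj, run) is not None:
            return True
        return "byteOffset" in deleted_region  # binary-addressed region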
*From:* Larry Golding (Comcast) <larrygolding@comcast.net>
*Sent:* Friday, May 11, 2018 3:27 PM
*To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
*Subject:* Determining file encoding
*Importance:* High

*The spec is inconsistent in how it tells a consumer to determine a file’s encoding.*

The sections on file.encoding and run.defaultFileEncoding say:

1. First use file.encoding.
2. If it’s missing, use run.defaultFileEncoding.
3. If that’s missing too, consider the encoding to be unknown.

The section on “Text regions” (which was written *before we introduced file.encoding and run.defaultFileEncoding*) has a different idea. The reason this section cares about encoding is that it wants consumers to know how many bytes each character occupies, so they can correctly identify (and highlight) a text region:

1. Look for a BOM.
2. If it’s absent, use external information (command line arguments, project settings, …).
3. If none of that helps, assume each character occupies one byte. (*NOTE*: Step 3 doesn’t actually identify an encoding, but it gives the consumer a best guess as to how to locate the region.)

We need to rationalize these. It might look like this (sketched in code after this message):

1. Look for a BOM. (IMO, the file is the final authority.)
2. If there’s no BOM, use file.encoding.
3. If it’s missing, use run.defaultFileEncoding.
4. If that’s missing too, use external information (command line arguments, project settings, …).
5. Otherwise, sniff the file and make your best guess.

A couple of things:

* Step 5 is inconsistent with Luke’s dictum “consider it to be unknown”. *Luke*, tell me again please why it is unknown, and what would go wrong if I sniffed the file and guessed wrong.
* In the case of fixes, you absolutely cannot afford to guess. You might even refuse to make the fix if the BOM is inconsistent with file.encoding.

Thoughts?

Larry
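A sketch of the five-step rationalization proposed above, with step 5 gated off by default to reflect Luke’s objection. The helper repeats the earlier BOM probe; the sniffing step is a deliberately naive stand-in, and the parameter names are assumptions:

    import codecs
    from typing import Optional

    def _encoding_from_bom(data: bytes) -> Optional[str]:
        # Step 1: the BOM probe from the earlier sketch, condensed.
        for bom, name in [
            (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
            (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be"),
            (codecs.BOM_UTF8, "utf-8"),
        ]:
            if data.startswith(bom):
                return name
        return None

    def determine_encoding(
        file_bytes: bytes,
        file_obj: dict,
        run: dict,
        external_hint: Optional[str] = None,  # step 4: command line, project settings, ...
        allow_sniffing: bool = False,         # step 5 is the contested one
    ) -> Optional[str]:
        encoding = (
            _encoding_from_bom(file_bytes)      # 1. BOM: the file is the final authority
            or file_obj.get("encoding")         # 2. file.encoding (§3.17.9)
            or run.get("defaultFileEncoding")   # 3. run.defaultFileEncoding (§3.11.17)
            or external_hint                    # 4. external information
        )
        if encoding is None and allow_sniffing:
            # 5. Sniff and guess -- naively: UTF-8 if the bytes decode as
            # UTF-8, else some single-byte encoding. Luke's position is to
            # skip this step and treat the encoding as unknown instead.
            try:
                file_bytes.decode("utf-8")
                encoding = "utf-8"
            except UnicodeDecodeError:
                encoding = "windows-1252"
        return encoding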