CTI STIX Subcommittee

Unicode, strings, and STIX

  • 1.  Unicode, strings, and STIX

    Posted 05-31-2016 19:55
    Hello,

    In attempting to nail down the definition of the type string, a few questions have been raised about the best definition.  I do not believe there is any disagreement that Unicode will be used for the string representation; the open question is how to handle a few aspects of the string type.

    You may have heard various talk about character vs code point vs glyph vs grapheme, and I found a good post explaining the distinction between them at http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme .  I will talk about encoding later.

    So, at the most basic, a string is a sequence of Unicode code points.  Two strings may have different numbers of code points even though they are equivalent, e.g. é (1) vs e w/ combining acute accent (2); when normalized (NFC), they will be equal.  Sadly, some other code points are ligatures, which are not expanded by NFC, so the fi ligature is not equal to the letter f followed by i after NFC normalization.  NFKC will make them equal, but will destroy the meaning of other symbols; for example, a superscript 2 becomes a normal 2.

    1) Should we add length restrictions to (some?) fields?  For example, should the title field be restricted in its length somehow?  Or should people be able to put unlimited-length text in the field?  Some fields, like description, I expect would possibly be unlimited, absent some other overriding limit such as total TLO size.

    2) If there are length limits, how should the length limit be defined?  Should it be the number of graphemes displayed?  Be careful of this, because things like Zalgo ( http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work ) make even a short ~25 grapheme string have ~292 code points, or 559 bytes when UTF-8 encoded.  Though no language will normally use so many combining code points, some languages require more than one.  Normalization can help reduce a string's number of code points, but does not always help.  Some languages, like Thai, will use more than one combining code point to make a single grapheme (consonant + vowel + tone mark: three code points for a single grapheme).

    If graphemes are used, a validator would need a detailed table to decide how many graphemes are in the string.  Using code points would not require as much work for the validator.

    There is an additional issue of encoding, but this should be easy.  It should use the underlying serialization format's encoding of Unicode.  In the case of JSON, the default is UTF-8.  In the case of XML, the encoding can be specified by the document itself, and may even be a non-UTF encoding, but it is assumed that if the document is in a different character set, the processor will convert to Unicode code points properly.

    Additional Reading:

    UNICODE TEXT SEGMENTATION http://unicode.org/reports/tr29/ -- has additional examples of graphemes and code points.
    Internationalization for Turkish: Dotted and Dotless Letter "I" http://www.i18nguy.com/unicode/turkish-i18n.html -- deals more w/ the complexities of locales than the above.
    Forms of Unicode http://www.icu-project.org/docs/papers/forms_of_unicode/ -- good description of glyphs vs characters vs ligatures, plus encoding info.

    My recommendations:

    1) I do believe that limits should be defined for some fields.  Things like title should not have the description in them, and leaving the limit undefined will allow that to happen.

    2) My personal view (as a programmer of many years) is to go the simple route and limit by code points.  This is easiest for a programmer to do w/ existing tools.  It also gives a clearer storage space limit (see the Zalgo example above).

    John-Mark
    New Context
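
The normalization and counting behaviour discussed in this post can be tried out with Python's standard `unicodedata` module. What follows is only a minimal sketch, not anything proposed for the STIX specification. The helper names `describe` and `approx_graphemes` are made up for illustration, and the composed/decomposed pair used is é and e + U+0301, which is known to compose under NFC. The grapheme estimate is a deliberate simplification (it just skips mark code points); real grapheme segmentation follows the rules in UAX #29 and needs the detailed tables mentioned above.

```python
import unicodedata


def describe(s: str) -> dict:
    """Length of s in code points and in UTF-8 bytes."""
    return {"code_points": len(s), "utf8_bytes": len(s.encode("utf-8"))}


def approx_graphemes(s: str) -> int:
    """Rough grapheme estimate: count code points that are not marks.

    Assumption for illustration only; real segmentation follows UAX #29.
    """
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Canonically equivalent, but unequal until normalized.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

ligature = "\ufb01"     # LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFC", ligature) != "fi"
assert unicodedata.normalize("NFKC", ligature) == "fi"
# NFKC also folds SUPERSCRIPT TWO into a plain digit.
assert unicodedata.normalize("NFKC", "x\u00b2") == "x2"

# Thai consonant + vowel sign + tone mark: three code points, one grapheme.
thai = "\u0e01\u0e35\u0e48"
print(describe(thai), approx_graphemes(thai))
```

A code point limit only needs `len(s)` on a decoded string, whereas even this rough grapheme estimate already depends on per-character Unicode property data.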


  • 2.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 10:09
    Hi John-Mark,

    My issue with this is whether it is simple enough language for people reading the STIX standard.  Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality.  As it stands, you have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard.  I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible.  And that got me thinking....

    We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementation details in normative statements, and this should be separate from the STIX v2.0 standards document.  JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.

    Doing things this way, we will achieve a few key benefits:

    · The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
    · The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
    · Detailed implementation requirements for the JSON MTI serialization will be in a JSON specific document. This will ensure
    · Using this structure will set us up for the future, enabling creation of additional serializations if we want (binary anyone?).

    Cheers

    Terry MacDonald   Chief Product Officer
    M:   +61-407-203-026
    E:   terry.macdonald@cosive.com
    W:   www.cosive.com


  • 3.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 14:40
    +1
     


  • 4.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 14:50
    RE the encoding language question, I posted some sample language to Slack that I think solves the problem: "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

    I do not think Terry's proposal solves some of the other key questions JMG poses. The most critical question we have is with regard to all of these "max length" properties in the spec and how they will be validated. These things actually *cannot* be validated in an encoding-independent way.

    I have asked a few times and will ask again: in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.

    Also, the idea that we should say, for example, "a title should only be 255 code points long" is completely arbitrary IMO and imposes undue limits on the analyst.

    -
    Jason Keirstead
    STSM, Product Architect, Security Intelligence, IBM Security Systems
    www.ibm.com/security
    www.securityintelligence.com

    Without data, all you are is just another person with an opinion - Unknown
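
How a "max length" could be validated depends on the unit the limit is stated in. As a rough sketch (the function name `lengths` and the sample title are made up for illustration and are not part of any STIX draft), the same string can be measured in code points and in the code units of several Unicode encodings:

```python
def lengths(s: str) -> dict:
    """Measure one string in several units.

    Only the code point count is the same regardless of encoding.
    """
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        # The "-le" codecs emit no byte order mark, so the counts are exact.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf32_bytes": len(s.encode("utf-32-le")),
    }


# "Café" plus a lock emoji, which lies outside the Basic Multilingual Plane.
title = "Caf\u00e9 \U0001F512"
print(lengths(title))
```

For this sample the byte and code unit counts differ between encodings while the code point count stays fixed, so a limit written as "N bytes" or "N characters" without naming the unit would be ambiguous across serializations.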


  • 5.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 15:08
    My +1 was for the idea that implementation details like this do not belong in the standard.

    In addition, I kinda agree that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.


  • 6.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 17:38
    If we do not define a max length then everyone will set their own. And we will have problems.

    Bret
    Sent from my Commodore 64

    On Jun 1, 2016, at 8:08 AM, Piazza, Rich < rpiazza@mitre.org > wrote:

    My +1 was for the idea that implementation details like this do not belong in the standard. In addition, I kinda agree that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.

    From: cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ] On Behalf Of Jason Keirstead
    Sent: Wednesday, June 01, 2016 10:48 AM
    To: Piazza, Rich < rpiazza@mitre.org >
    Cc: Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >; cti-stix@lists.oasis-open.org
    Subject: RE: [cti-stix] Unicode, strings, and STIX

    RE the encoding language question, I posted some sample language to slack that I think solves the problem: "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

    I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regards to all of these "max length" properties in the spec and how they will be validated. These things actually *cannot* be validated in an encoding-independent way.

    I have asked a few times and will ask again: in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.

    Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst.
    - Jason Keirstead
    STSM, Product Architect, Security Intelligence, IBM Security Systems
    www.ibm.com/security
    www.securityintelligence.com

    Without data, all you are is just another person with an opinion - Unknown

    From: "Piazza, Rich" < rpiazza@mitre.org >
    To: Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >
    Cc: " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org >
    Date: 06/01/2016 11:39 AM
    Subject: RE: [cti-stix] Unicode, strings, and STIX
    Sent by: < cti-stix@lists.oasis-open.org >

    +1

    From: cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ] On Behalf Of Terry MacDonald
    Sent: Wednesday, June 01, 2016 6:09 AM
    To: John-Mark Gurney < jmg@newcontext.com >
    Cc: cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX

    Hi John-Mark,

    My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. You have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible.

    And that got me thinking.... We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementations in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.
    Doing things this way we will achieve a few key benefits:

    · The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
    · The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
    · Detailed implementation requirements for the JSON MTI serialization will be in a JSON specific document. This will ensure
    · Using this structure will set ourselves up for the future, enabling creation of additional serializations if we want in the future (binary anyone?).

    Cheers
    Terry MacDonald
    Chief Product Officer
    M: +61-407-203-026
    E: terry.macdonald@cosive.com
    W: www.cosive.com
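On the point quoted above that a "max length" is hard to validate independently of the encoding, a minimal sketch may help show which measures actually move. The sample title is made up for illustration, and the three encodings are just the common Unicode encoding forms; nothing here comes from a STIX document.

```python
def byte_lengths(s: str) -> dict:
    # The "-le" codec variants are used so that no byte order mark
    # is prepended and the counts reflect the text alone.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(s.encode(enc)) for enc in encodings}


def round_trips(s: str) -> bool:
    # Decoding any of the encodings yields the same code points again.
    return all(
        s.encode(enc).decode(enc) == s
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )


# Hypothetical title: "naïve ✓" (7 code points).
title = "na\u00efve \u2713"

if __name__ == "__main__":
    print(len(title), byte_lengths(title), round_trips(title))
```

The byte length differs for each encoding, while the code point count is the same once a processor has decoded the document, which is the assumption the original post makes about XML processors.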


  • 7.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 18:17
    I think the spec would have to say something like: “Any length is permitted”

    Then, implementers would have to make sure they could support that.

    In STIX 1.2.1, the description field of all of the objects had this text in the specification documents. I’m not sure in which direction that will sway you :)


  • 8.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-01-2016 22:19
    I think having built in maximum field size is pragmatic. We don't want to design buffer overflow susceptibility into all STIX services just because we couldn't agree where to place text limiting field lengths.

    I personally think that maximum field length should be defined in the STIX standards doc for each STIX type (e.g. boolean, number), and that it should be sized in Unicode characters. Then in each serialisation document (e.g. in a JSON serialisation doc) we should convert that Unicode character length into whatever length definition makes sense for that serialisation format, e.g. JSON and the use of code points.

    I really don't want to be responsible for creating threat intelligence hacks in 2-5 years from now because of a decision we made today.

    Cheers
    Terry MacDonald
    Cosive
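The idea of a per-field limit counted in code points can be sketched as follows. This is only an illustration under stated assumptions: the `MAX_CODE_POINTS` table and the value 255 are hypothetical (255 is just the figure used as an example elsewhere in this thread), and the function names are invented here. The second helper relies on the fact that UTF-8 uses at most four bytes per code point, so a code point limit also implies an upper bound on UTF-8 storage.

```python
# Hypothetical per-field limits, counted in Unicode code points.
# These numbers are illustrative and not taken from any STIX document.
MAX_CODE_POINTS = {"title": 255}


def validate_field(name: str, value: str) -> bool:
    # Fields without an entry are treated as unlimited in this sketch.
    limit = MAX_CODE_POINTS.get(name)
    if limit is None:
        return True
    # len() on a Python 3 str counts code points, not bytes.
    return len(value) <= limit


def max_utf8_bytes(limit: int) -> int:
    # UTF-8 encodes every code point in at most 4 bytes.
    return 4 * limit


if __name__ == "__main__":
    print(validate_field("title", "a" * 256))
    print(max_utf8_bytes(MAX_CODE_POINTS["title"]))
```

A serialization-specific document could then translate the same limit into whatever unit suits that format, along the lines suggested in the message above.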


  • 9.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 12:17
    There is simply no logical way to define a "max length" in a way that protects against "buffer overflow" problems with Unicode... so if buffer overflow is the main motivation for this - If we say "max_length" of title means 255 *BYTES*, then in some languages that is going to result in a very short title than other languages - and furthermore, you could be truncating it in the middle of a character (grapheme) making it all the more invalid for the person entering it on their screen. - If we say "max_length" of title means 255 *code points*, then in some languages it will result in shorter titles being allowd than others, and it also could equal an arbitrary number of bytes, as it depends on the encoding and language being encoded. And you still have the problem of truncating in the middle of a character (grapheme) - If we say "max_length" of title means 255 *graphemes*, then all languages are allowed the same title length, and you have no problems truncating in the middle of a character. However, it means a title could equal an arbitrary number of bytes. I say throw it out. - Jason Keirstead STSM, Product Architect, Security Intelligence, IBM Security Systems www.ibm.com/security www.securityintelligence.com Without data, all you are is just another person with an opinion - Unknown Terry MacDonald ---06/01/2016 07:19:19 PM---I think having built in maximum field size is pragmatic. We don't want to design buffer overflow sus From: Terry MacDonald <terry.macdonald@cosive.com> To: Rich Piazza <rpiazza@mitre.org> Cc: John-Mark Gurney <jmg@newcontext.com>, Jason Keirstead/CanEast/IBM@IBMCA, "Jordan, Bret" <bret.jordan@bluecoat.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org> Date: 06/01/2016 07:19 PM Subject: RE: [cti-stix] Unicode, strings, and STIX Sent by: <cti-stix@lists.oasis-open.org> I think having built in maximum field size is pragmatic. 
We don't want to design buffer overflow susceptibility into all STIX services just because we couldn't agree where to place text limiting field lengths. I personally think that maximum field length should be defined in the STIX standards doc for each STIX type (e.g. boolean, number), and that it should be sized in Unicode characters. Then in each serialisation document (e.g. in a JSON serialisation doc) we should convert that Unicode character length into what ever length definition makes sense for that serialisation format e.g. JSON and the use of code points. I really don't want to be responsible for creating threat intelligence hacks in 2-5 years from now because of a decision we made today. Cheers Terry MacDonald Cosive On 2/06/2016 04:17, "Piazza, Rich" < rpiazza@mitre.org > wrote: I think the spec would have to say something like – “ Any length is permitted”   Then, implementers would have to make sure they could support that.   In STIX 1.2.1, the description field of all of the objects had this text in the specification documents.  I’m not sure in which direction that will sway you J   From: Jordan, Bret [mailto: bret.jordan@bluecoat.com ] Sent: Wednesday, June 01, 2016 1:38 PM To: Piazza, Rich < rpiazza@mitre.org > Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >; Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >; cti-stix@lists.oasis-open.org Subject: Re: [cti-stix] Unicode, strings, and STIX   If we do not define a max length then everyone will set their own.  And we will have problems.   Bret  Sent from my Commodore 64 On Jun 1, 2016, at 8:08 AM, Piazza, Rich < rpiazza@mitre.org > wrote: My +1 was for the idea that implementation details like this do not belong in the standard.   In addition, I kinda agree that that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.    

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Jason Keirstead
Sent: Wednesday, June 01, 2016 10:48 AM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: RE: [cti-stix] Unicode, strings, and STIX

RE the encoding language question, I posted some sample language to slack that I think solves the problem:
"Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regards to all of these "max length" properties in the spec and how they will be validated. These things actually *can not* be validated in an encoding-independent way. I have asked a few times and will ask again - in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.

Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst.

-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security
www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown

From: "Piazza, Rich" <rpiazza@mitre.org>
To: Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>
Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Date: 06/01/2016 11:39 AM
Subject: RE: [cti-stix] Unicode, strings, and STIX
Sent by: <cti-stix@lists.oasis-open.org>

+1

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Terry MacDonald
Sent: Wednesday, June 01, 2016 6:09 AM
To: John-Mark Gurney <jmg@newcontext.com>
Cc: cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

Hi John-Mark,

My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. You have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me thinking....

We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementations in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.

Doing things this way we will achieve a few key benefits:
· The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
· The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
· Detailed implementation requirements for the JSON MTI serialization will be in a JSON-specific document. This will ensure
· Using this structure will set ourselves up for the future, enabling creation of additional serializations if we want in the future (binary anyone?).

Cheers
Terry MacDonald
Chief Product Officer
M: +61-407-203-026
E: terry.macdonald@cosive.com
W: www.cosive.com

On Wed, Jun 1, 2016 at 5:55 AM, John-Mark Gurney <jmg@newcontext.com> wrote:

Hello,

In attempting to nail down the definition of the type string, there have been a few questions raised about the best definition. I do not believe there is any disagreement that Unicode will be used for the string representation, it is more how to address some of the things about handling the string type.

You may have heard various talk about character vs code point vs glyph vs grapheme, and I found a good post answering the distinction between them at http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme . I will talk about encoding later.

So, at the most basic, a string is a sequence of Unicode code points. Some strings may have more code points than others, though they are equivalent, ø (1) vs o w/ combining long solidus overlay (2), though when normalized (NFC), they will be equal. Sadly, some other code points are ligatures, which are not expanded when normalized (NFC), resulting in the fi ligature not being equal to the letters f followed by i when normalized (NFC). NFKC will make them equal, but will destroy the meaning of other symbols, like 2 superscript becomes a normal 2.

1) Should we add length restrictions to (some?) fields? For example, should the title field be restricted in its length somehow? Or should people be able to put unlimited length text in the field? Some fields like description, I expect would possibly be unlimited sans some other overriding limit, such as total TLO size, etc.

2) If there are length limits, how should the length limit be defined? Should it be number of graphemes displayed? (Be careful of this, because things like Zalgo ( http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work ) make even a short ~25 grapheme string have ~292 code points, or 559 bytes when UTF-8 encoded.) Though no language will normally use so many combining code points, it is required to use more than one in some languages. Normalization can help reduce a string's number of code points, but does not always help. Some languages, like Thai, will use more than one combining code point to make a single grapheme (consonant + vowel + tone mark for three code points for a single grapheme).

If graphemes are used, it would require a validator to have a detailed table to decide how many graphemes are in the string. Using code points would not require as much work for the validator.

There is an additional issue of encoding, but this should be easy. It should use the underlying serialization format's encoding of Unicode. In the case of JSON, the default is UTF-8. In the case of XML, it can be specified by the document itself, and may even be in a non-UTF encoding, but it is assumed that if the document is in a different character set, that the processor will convert to Unicode code points properly.

Additional Reading:
UNICODE TEXT SEGMENTATION
http://unicode.org/reports/tr29/ -- has additional examples of grapheme and code points.
Internationalization for Turkish: Dotted and Dotless Letter "I"
http://www.i18nguy.com/unicode/turkish-i18n.html -- Deals more w/ complexities of locales than the above
Forms of Unicode
http://www.icu-project.org/docs/papers/forms_of_unicode/ -- Good description of glyph vs characters vs ligatures and encoding info

My recommendations:
1) I do believe that limits should be defined for some fields. Things like title should not have the description in them, and leaving it undefined will allow it to happen.

2) My personal view (as a programmer of many years) is to go the simple route and limit it by code points. This is easiest for a programmer to do w/ existing tools. It also gives a more clear storage space limit (see the Zalgo example above).

John-Mark
New Context
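
For readers who want to try the distinctions discussed above, here is a minimal sketch in Python using only the standard `unicodedata` module. It is not part of the original thread. The helper names (`code_points`, `utf8_bytes`, `equal_under`) and the sample strings are illustrative assumptions. As the composed/decomposed pair it uses é and e + U+0301, a case where NFC is known to produce equal strings. It also shows the fi ligature and the superscript 2 under NFC and NFKC, and a Thai consonant + vowel + tone mark sequence counted in code points and UTF-8 bytes.

```python
import unicodedata


def code_points(s: str) -> int:
    """Number of Unicode code points; this is what len() returns for a Python 3 str."""
    return len(s)


def utf8_bytes(s: str) -> int:
    """Size of the string once encoded as UTF-8."""
    return len(s.encode("utf-8"))


def equal_under(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying the same normalization form (e.g. "NFC")."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Illustrative sample strings (assumptions, not taken from the thread).
composed = "\u00e9"            # é as a single code point
decomposed = "e\u0301"         # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"            # LATIN SMALL LIGATURE FI
squared = "x\u00b2"            # x followed by SUPERSCRIPT TWO
thai = "\u0e01\u0e35\u0e48"    # Thai consonant + vowel + tone mark

if __name__ == "__main__":
    print("composed/decomposed code points:", code_points(composed), code_points(decomposed))
    print("equal under NFC:", equal_under("NFC", composed, decomposed))
    print("ligature equals 'fi' under NFC:", equal_under("NFC", ligature, "fi"))
    print("ligature equals 'fi' under NFKC:", equal_under("NFKC", ligature, "fi"))
    print("NFKC of x-squared:", unicodedata.normalize("NFKC", squared))
    print("Thai sample code points / UTF-8 bytes:", code_points(thai), utf8_bytes(thai))
```

Counting graphemes is deliberately left out: the Python standard library has no grapheme segmentation, which matches the point above that a grapheme-based limit needs extra tables in the validator.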




  • 10.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 12:50




    There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify it, we’ll get SO posts like this: http://stackoverflow.com/questions/686217/maximum-on-http-header-values
     
    Thank you.
    -Mark
     


  • 11.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 12:53




    I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important fields. The drawback, however, is that the max length will end up
    being the shortest supported value from major implementations, and it will only be discovered through painful research.
     
    -Mark
     


  • 12.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 13:25




    Maybe say instead:
    Any length SHOULD be permitted
     
    Then maybe in the implementation guide say: suggested storage size is 8KB…
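
As a rough illustration of how a "SHOULD"-style limit might behave in a validator, the sketch below returns warnings instead of rejecting a value. It is not from the thread or from any STIX specification. The function name `check_string`, the constant `SUGGESTED_MAX_UTF8_BYTES`, and the choice of UTF-8 for the 8KB size check are all assumptions made for the example. The optional code-point limit reuses the 255 figure mentioned earlier in the discussion purely as a sample value.

```python
from typing import List, Optional

# Hypothetical value: the "8KB" storage suggestion, measured here in UTF-8 bytes.
SUGGESTED_MAX_UTF8_BYTES = 8 * 1024


def check_string(value: str, max_code_points: Optional[int] = None) -> List[str]:
    """Return advisory warnings for a string value; an empty list means no concerns.

    Nothing is rejected: any length is accepted, and limits are only reported.
    """
    warnings: List[str] = []

    # Code-point limit: len() on a Python 3 str counts code points.
    if max_code_points is not None and len(value) > max_code_points:
        warnings.append(
            f"{len(value)} code points exceeds suggested limit of {max_code_points}"
        )

    # Storage-size suggestion: depends on the encoding, UTF-8 assumed here.
    size = len(value.encode("utf-8"))
    if size > SUGGESTED_MAX_UTF8_BYTES:
        warnings.append(
            f"{size} UTF-8 bytes exceeds suggested storage size of {SUGGESTED_MAX_UTF8_BYTES}"
        )

    return warnings


if __name__ == "__main__":
    for sample in ("short title", "a" * 300):
        print(repr(sample[:20]), check_string(sample, max_code_points=255))
```

The two checks are kept separate because they can disagree: a string of 3000 Thai consonants is far below 8192 code points but exceeds 8KB once encoded as UTF-8.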
     


    From: Mark Davidson [mailto:mdavidson@soltra.com]

    Sent: Thursday, June 02, 2016 8:53 AM
    To: Piazza, Rich <rpiazza@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>
    Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX


     
    I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important fields. The drawback, however,
    is that the max length will end up being the shortest supported value from major implementations, and it will only be discovered through painful research.
     
    -Mark
     

    From:
    < cti-stix@lists.oasis-open.org > on behalf of Mark Davidson < mdavidson@soltra.com >
    Date: Thursday, June 2, 2016 at 8:49 AM
    To: "Piazza, Rich" < rpiazza@mitre.org >, "Jordan, Bret" < bret.jordan@bluecoat.com >
    Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >, Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >,
    " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org >
    Subject: Re: [cti-stix] Unicode, strings, and STIX


     



    There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify it, we’ll get SO posts like this:

    http://stackoverflow.com/questions/686217/maximum-on-http-header-values
     
    Thank you.
    -Mark
     

    From:
    < cti-stix@lists.oasis-open.org > on behalf of "Piazza, Rich" < rpiazza@mitre.org >
    Date: Wednesday, June 1, 2016 at 2:17 PM
    To: "Jordan, Bret" < bret.jordan@bluecoat.com >
    Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >, Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >,
    " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org >
    Subject: RE: [cti-stix] Unicode, strings, and STIX


     



    I think the spec would have to say something like – “ Any length is permitted”
     
    Then, implementers would have to make sure they could support that.
     
    In STIX 1.2.1, the description field of all of the objects had this text in the specification documents.  I’m not sure in which direction
    that will sway you J
     


    From: Jordan, Bret [ mailto:bret.jordan@bluecoat.com ]

    Sent: Wednesday, June 01, 2016 1:38 PM
    To: Piazza, Rich < rpiazza@mitre.org >
    Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >; Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >;
    cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX


     

    If we do not define a max length then everyone will set their own.  And we will have problems.


     


    Bret 

    Sent from my Commodore 64




    On Jun 1, 2016, at 8:08 AM, Piazza, Rich < rpiazza@mitre.org > wrote:



    My +1 was for the idea that implementation details like this do not belong in the standard.
     
    In addition, I kinda agree that that the length of strings isn’t a “standards” issue,
    or an implementation issue that we need to comment on anywhere. 
     


    From:
    cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ]
    On Behalf Of Jason Keirstead
    Sent: Wednesday, June 01, 2016 10:48 AM
    To: Piazza, Rich < rpiazza@mitre.org >
    Cc: Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >;
    cti-stix@lists.oasis-open.org
    Subject: RE: [cti-stix] Unicode, strings, and STIX


     
    RE the encoding language question, I posted some sample language to Slack that I think solves the problem:
     "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

    I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regard to all of these "max length" properties in the spec and how they will be validated. These things actually *cannot* be validated in an encoding-independent way. I have asked a few times and will ask again: in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.
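One way to read the point about encoding is that a limit expressed in bytes gives a different answer for each Unicode encoding, while a count of code points stays the same. The sketch below is only an illustration: the helper name `encoded_sizes` and the sample text are made up for this example and come from neither the thread nor any STIX document.

```python
def encoded_sizes(text):
    """Return the code point count of `text` and its size in bytes per encoding."""
    return {
        "code_points": len(text),  # Python 3 str length is counted in code points
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # the -le codec adds no byte order mark
        "utf-32": len(text.encode("utf-32-le")),
    }


# Nine ASCII characters plus U+00EF, U+00E9 and U+2615, written as escapes
# so the source file itself cannot change how they are composed.
sample = "na\u00efve caf\u00e9 \u2615"
print(encoded_sizes(sample))
```

For this sample the result is 12 code points, which take 16 bytes in UTF-8, 24 in UTF-16 and 48 in UTF-32, so a "max length" stated in bytes would have to name the encoding it assumes.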


    Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposes undue limits on the analyst.

    -
    Jason Keirstead
    STSM, Product Architect, Security Intelligence, IBM Security Systems
    www.ibm.com/security
    www.securityintelligence.com

    Without data, all you are is just another person with an opinion - Unknown



    From: "Piazza, Rich" < rpiazza@mitre.org >
    To: Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >
    Cc: " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org >
    Date: 06/01/2016 11:39 AM
    Subject: RE: [cti-stix] Unicode, strings, and STIX
    Sent by: < cti-stix@lists.oasis-open.org >














    +1
    From:
    cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ]
    On Behalf Of Terry MacDonald
    Sent: Wednesday, June 01, 2016 6:09 AM
    To: John-Mark Gurney < jmg@newcontext.com >
    Cc: cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX

    Hi John-Mark,

    My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. As written, you have to be a programmer
    and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me
    thinking....

    We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementation details in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.

    Doing things this way we will achieve a few key benefits:
    · The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
    · The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
    · Detailed implementation requirements for the JSON MTI serialization will be in a JSON specific document. This will ensure
    · Using this structure will set ourselves up for the future, enabling creation of additional serializations if we want in the future (binary anyone?).


    Cheers

    Terry MacDonald
    Chief Product Officer


    M: +61-407-203-026
    E: terry.macdonald@cosive.com
    W: www.cosive.com




    On Wed, Jun 1, 2016 at 5:55 AM, John-Mark Gurney < jmg@newcontext.com > wrote:
    Hello,

    In attempting to nail down the definition of the type string, a few questions have been raised about the best definition. I do not believe there is any disagreement that Unicode will be used for the string representation; the question is more how to address some
    of the details of handling the string type.

    You may have heard various talk about character vs code point vs glyph vs grapheme. I found a good post explaining the distinction between them at
    http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme . I will talk about encoding later.

    So, at the most basic, a string is a sequence of Unicode code points. Two strings may contain different numbers of code points and still be equivalent: é as a single code point (1) vs e followed by a combining acute accent (2) are equal once both are normalized (NFC). Sadly,
    some other code points are ligatures, which are not expanded by NFC normalization, so the fi ligature is not equal to the letter f followed by i under NFC. NFKC will make them equal, but it destroys the meaning of other symbols:
    a superscript 2, for example, becomes a normal 2.
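The normalization behaviour described above can be checked with Python's standard `unicodedata` module. This is a minimal sketch rather than a proposed STIX rule. It uses é and its decomposed form as the illustrative pair, and the helper `equal_under` is a name invented for this example.

```python
import unicodedata


def equal_under(form, a, b):
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "x\u00b2"  # x followed by SUPERSCRIPT TWO

print(precomposed == decomposed)                    # False: raw code points differ
print(equal_under("NFC", precomposed, decomposed))  # True: canonically equivalent
print(equal_under("NFC", ligature, "fi"))           # False: NFC keeps the ligature
print(equal_under("NFKC", ligature, "fi"))          # True: NFKC expands it
print(unicodedata.normalize("NFKC", superscript))   # x2: the superscript is lost
```

The last line shows the trade-off: the compatibility forms make more strings compare equal, but they do so by discarding distinctions that may carry meaning.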

    1) Should we add length restrictions to (some?) fields? For example, should the title field be restricted in its length somehow? Or should people be able to put unlimited length text in the field? Some fields like description, I expect would possibly be unlimited
    sans some other overriding limit, such as total TLO size, etc.

    2) If there are length limits, how should the length limit be defined? Should it be the number of graphemes displayed? Be careful of this, because things like Zalgo ( http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work )
    make even a short ~25 grapheme string have ~292 code points, or 559 bytes when UTF-8 encoded. Though no language will normally use so many combining code points, some languages do require more than one. Normalization can help reduce a string's
    number of code points, but does not always help. Some languages, like Thai, will use more than one combining code point to make a single grapheme (consonant + vowel + tone mark, so three code points for a single grapheme).

    If graphemes are used, it would require a validator to have a detailed table to decide how many graphemes are in the string. Using code points would not require as much work for the validator.
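To make the three candidate units concrete, here is a rough sketch that measures one string in code points, UTF-8 bytes, and an approximate grapheme count. The grapheme figure is an assumption-laden shortcut: it simply skips code points whose general category is a mark (M*). It is not the segmentation algorithm from UAX #29 (the "detailed table" a real validator would need) and it will miscount cases such as emoji joined with zero-width joiners. The function name `length_metrics` is hypothetical.

```python
import unicodedata


def length_metrics(text):
    """Return (code points, UTF-8 bytes, approximate graphemes) for `text`."""
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    # Approximation only: treat every non-mark code point as starting a grapheme.
    approx_graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return code_points, utf8_bytes, approx_graphemes


# Thai consonant + vowel sign + tone mark, displayed as one grapheme
thai = "\u0e01\u0e35\u0e48"
print(length_metrics(thai))    # (3, 9, 1)

# A small Zalgo-style string: one letter carrying ten combining accents
zalgo = "Z" + "\u0301" * 10
print(length_metrics(zalgo))   # (11, 21, 1)
```

Counting code points and bytes needs nothing beyond the language's built-ins, while even this crude grapheme estimate needs the Unicode character database, which is the extra validator work mentioned above.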

    There is an additional issue of encoding, but this should be easy. It should use the underlying serialization format's encoding of Unicode. In the case of JSON, the default is UTF-8. In the case of XML, it can be specified by the document itself, and may even
    be in a non-UTF encoding, but it is assumed that if the document is in a different character set, that the processor will convert to Unicode code points properly.
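As an illustration of the encoding point, the sketch below parses the same text from a UTF-8 JSON document and from an XML document that declares ISO-8859-1, using only Python's standard library. The `title` key and element are placeholders for this example, not taken from any STIX schema, and the two helper functions are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def title_from_json(data):
    """Parse UTF-8 encoded JSON bytes and return the 'title' value."""
    return json.loads(data)["title"]


def title_from_xml(data):
    """Parse XML bytes, honouring the encoding named in the XML declaration."""
    return ET.fromstring(data).text


# The same four characters, serialized two different ways
json_bytes = '{"title": "caf\u00e9"}'.encode("utf-8")  # é becomes bytes C3 A9
xml_bytes = b'<?xml version="1.0" encoding="ISO-8859-1"?><title>caf\xe9</title>'

json_title = title_from_json(json_bytes)
xml_title = title_from_xml(xml_bytes)

# After parsing, both are the same sequence of code points
print(json_title == xml_title, len(json_title))
```

The bytes on the wire differ, but once each parser has decoded its input the application sees the same four code points, which is the behaviour the paragraph above assumes of a conforming processor.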

    Additional Reading:
    UNICODE TEXT SEGMENTATION
    http://unicode.org/reports/tr29/ -- has additional examples of grapheme and code points.
    Internationalization for Turkish: Dotted and Dotless Letter "I"
    http://www.i18nguy.com/unicode/turkish-i18n.html -- Deals more with the complexities of locales than the above
    Forms of Unicode
    http://www.icu-project.org/docs/papers/forms_of_unicode/ -- Good description of glyph vs characters vs ligatures and encoding info

    My recommendations:
    1) I do believe that limits should be defined for some fields. Things like title should not have the description in them, and leaving it undefined will allow it to happen.

    2) My personal view (as a programmer of many years) is to go the simple route and limit it by code points. This is easiest for a programmer to do w/ existing tools. It also gives a clearer storage space limit (see the Zalgo example above).
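A code point limit of the kind recommended here could be checked along the following lines. This is a sketch under stated assumptions: the limit of 255 is only the figure used as an example elsewhere in this thread, and `MAX_TITLE_CODE_POINTS`, `within_limit` and `max_utf8_bytes` are names invented for illustration.

```python
MAX_TITLE_CODE_POINTS = 255  # hypothetical limit, for illustration only


def within_limit(value, limit=MAX_TITLE_CODE_POINTS):
    """Return True if `value` has at most `limit` Unicode code points."""
    if not isinstance(value, str):
        raise TypeError("value must be a str")
    # In Python 3, len() on a str counts code points, including those
    # outside the Basic Multilingual Plane.
    return len(value) <= limit


def max_utf8_bytes(limit=MAX_TITLE_CODE_POINTS):
    """Worst-case UTF-8 size: each code point takes at most 4 bytes."""
    return 4 * limit


print(within_limit("a" * 255))           # True
print(within_limit("a" * 256))           # False
print(within_limit("\U0001F600" * 255))  # True: 255 code points, 1020 bytes
print(max_utf8_bytes())                  # 1020
```

Because a code point never needs more than four bytes in UTF-8, a code point limit also bounds storage. Note that in languages whose strings are UTF-16 based, such as Java or JavaScript, the built-in length counts code units, so the same check needs an explicit code point count there.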

    John-Mark
    New Context




     












  • 13.  RE: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 14:11
    If the consensus is that we *must* specify some length recommendation, then this is a good way to attempt to do so.

    -
    Jason Keirstead
    STSM, Product Architect, Security Intelligence, IBM Security Systems
    www.ibm.com/security www.securityintelligence.com

    Without data, all you are is just another person with an opinion - Unknown




  • 14.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 14:18




    This struck me as the type of thing that must have been done before, so I did a little research on what other similar specifications (data models, not transport protocols) did:

    - IODEF: no max lengths specified
    - CIQ: no max lengths specified
    - HL7: specifies a minimum length that implementations have to be able to handle, but no maximum length (I could not find the actual language because the specs are not free, so I asked a colleague)
    - HDATA: no max lengths specified
    - SMTP: some fields have a max length (in characters), some don’t
    - OASIS CAP: no max lengths; they have a MAY requirement for some fields suggesting a max size that would be appropriate
    - EDXL: no max lengths

    To be honest I went into this thinking that we needed to specify max lengths, but based on this research maybe we shouldn’t? Rich’s approach below seems best to me.
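A suggested size, as opposed to a hard maximum, might be handled by a consumer as an advisory check that never rejects data. The sketch below assumes the suggestion is measured in UTF-8 bytes and borrows the 8 KB figure floated in this thread. Both are assumptions for illustration only, and `SUGGESTED_MAX_BYTES` and `length_advisory` are hypothetical names.

```python
SUGGESTED_MAX_BYTES = 8 * 1024  # the 8 KB figure suggested in this thread


def length_advisory(field, value, suggested=SUGGESTED_MAX_BYTES):
    """Return a warning string if `value` exceeds the suggested size, else None.

    The value is never rejected, since any length is still permitted.
    """
    size = len(value.encode("utf-8"))
    if size > suggested:
        return f"{field}: {size} bytes exceeds the suggested {suggested} bytes"
    return None


print(length_advisory("description", "short text"))  # None
print(length_advisory("description", "x" * 9000))
```

An implementation could log such a warning and still store or forward the object, which keeps the specification free of a hard limit while giving implementers a shared expectation.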
     
    Are there any other specs we could learn from? What did I miss?
     
    John
     
     

    From:
    <cti-stix@lists.oasis-open.org> on behalf of Jason Keirstead <Jason.Keirstead@ca.ibm.com>
    Date: Thursday, June 2, 2016 at 10:10 AM
    To: Rich Piazza <rpiazza@mitre.org>
    Cc: Mark Davidson <mdavidson@soltra.com>, "Jordan, Bret" <bret.jordan@bluecoat.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
    Subject: RE: [cti-stix] Unicode, strings, and STIX


     



    If the consensus is that we *must* specify some length recommendation, then this is a good way to attempt to do so.


    -
    Jason Keirstead
    STSM, Product Architect, Security Intelligence, IBM Security Systems
    www.ibm.com/security www.securityintelligence.com

    Without data, all you are is just another person with an opinion - Unknown


    From: "Piazza, Rich" <rpiazza@mitre.org>
    To: Mark Davidson <mdavidson@soltra.com>, "Jordan, Bret" <bret.jordan@bluecoat.com>
    Cc: Jason Keirstead/CanEast/IBM@IBMCA, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
    Date: 06/02/2016 10:24 AM
    Subject: RE: [cti-stix] Unicode, strings, and STIX
    Sent by: <cti-stix@lists.oasis-open.org>






    Maybe say instead: Any length SHOULD be permitted

    Then maybe in the implementation guide say: suggested storage size is 8KB…

    From: Mark Davidson [ mailto:mdavidson@soltra.com ]

    Sent: Thursday, June 02, 2016 8:53 AM
    To: Piazza, Rich <rpiazza@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>
    Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX

    I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important fields. The drawback, however, is that the max length will end up being the shortest supported value from
    major implementations, and it will only be discovered through painful research.

    -Mark

    From: < cti-stix@lists.oasis-open.org >
    on behalf of Mark Davidson < mdavidson@soltra.com >
    Date: Thursday, June 2, 2016 at 8:49 AM
    To: "Piazza, Rich" < rpiazza@mitre.org >, "Jordan, Bret" < bret.jordan@bluecoat.com >
    Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >, Terry MacDonald < terry.macdonald@cosive.com >,
    John-Mark Gurney < jmg@newcontext.com >, " cti-stix@lists.oasis-open.org "
    < cti-stix@lists.oasis-open.org >
    Subject: Re: [cti-stix] Unicode, strings, and STIX

    There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify it, we’ll get SO posts like this:
    http://stackoverflow.com/questions/686217/maximum-on-http-header-values

    Thank you.
    -Mark

    From: < cti-stix@lists.oasis-open.org >
    on behalf of "Piazza, Rich" < rpiazza@mitre.org >
    Date: Wednesday, June 1, 2016 at 2:17 PM
    To: "Jordan, Bret" < bret.jordan@bluecoat.com >
    Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >, Terry MacDonald < terry.macdonald@cosive.com >,
    John-Mark Gurney < jmg@newcontext.com >, " cti-stix@lists.oasis-open.org "
    < cti-stix@lists.oasis-open.org >
    Subject: RE: [cti-stix] Unicode, strings, and STIX

    I think the spec would have to say something like – “ Any length is permitted”

    Then, implementers would have to make sure they could support that.

    In STIX 1.2.1, the description field of all of the objects had this text in the specification documents. I’m not sure in which direction that will sway you
    J
    From: Jordan, Bret [ mailto:bret.jordan@bluecoat.com ]

    Sent: Wednesday, June 01, 2016 1:38 PM
    To: Piazza, Rich < rpiazza@mitre.org >
    Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >; Terry MacDonald < terry.macdonald@cosive.com >;
    John-Mark Gurney < jmg@newcontext.com >;
    cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX

    If we do not define a max length then everyone will set their own. And we will have problems.

    Bret

    Sent from my Commodore 64

    On Jun 1, 2016, at 8:08 AM, Piazza, Rich < rpiazza@mitre.org > wrote:

    My +1 was for the idea that implementation details like this do not belong in the standard.

    In addition, I kinda agree that that the length of strings isn’t a “standards” issue,
    or an implementation issue that we need to comment on anywhere.
    From:
    cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ]
    On Behalf Of Jason Keirstead
    Sent: Wednesday, June 01, 2016 10:48 AM
    To: Piazza, Rich < rpiazza@mitre.org >
    Cc: Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >;
    cti-stix@lists.oasis-open.org
    Subject: RE: [cti-stix] Unicode, strings, and STIX
    RE the encoding language question, I posted some sample language to slack that I think solves the problem:
    "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

    I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regards to all of these "max length" properties in the spec and how they will be validated. These things actually *can not* be validated
    in an encoding-independent way. I have asked a few times and will ask again - in 2016, is "max length" really anything we need to care about here. DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases
    do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.


    Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst.

    -
    Jason Keirstead
    STSM, Product Architect, Security Intelligence, IBM Security Systems
    www.ibm.com/security
    www.securityintelligence.com

    Without data, all you are is just another person with an opinion - Unknown


    <image001.gif> "Piazza, Rich" ---06/01/2016 11:39:45 AM---+1 From:
    cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ]
    On Behalf Of Terry Mac

    From: "Piazza, Rich" < rpiazza@mitre.org >
    To: Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >
    Cc: " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org >
    Date: 06/01/2016 11:39 AM
    Subject: RE: [cti-stix] Unicode, strings, and STIX
    Sent by: < cti-stix@lists.oasis-open.org >







    +1
    From:
    cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ]
    On Behalf Of Terry MacDonald
    Sent: Wednesday, June 01, 2016 6:09 AM
    To: John-Mark Gurney < jmg@newcontext.com >
    Cc: cti-stix@lists.oasis-open.org
    Subject: Re: [cti-stix] Unicode, strings, and STIX

    Hi John-Mark,

    My issue with this is that its simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmers mentality. You have to be a programmer and understand all these terms
    and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me thinking....

    We should create a STIX v2.0 JSON serialization document that specifies the JSON specific implementations in nomative statements, and this should be separate from the
    STIX v2.0 standards document . JSON examples should absolutely be kept in the
    STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document
    should only be for illustrative purposes.

    Doing things this way we will achieve a few key benefits:
    ·
    The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
    · The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
    · Detailed implementation requirements for the JSON MTI serialization will be in a JSON specific document. This will ensure

    · Using this structure will set ourselves up for the future, enabling creation of additional serializations if we want in the future (binary anyone?).


    Cheers

    Terry MacDonald Chief Product Officer

    <image002.png>

    M: +61-407-203-026
    E: terry.macdonald@cosive.com
    W: www.cosive.com




  • 15.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 16:02
    HTTP sez:

    "The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

    Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths."

    https://www.ietf.org/rfc/rfc2616.txt

    The focus of HTTP was not to define a schema (as in "how long is a String type"?), but to promote interoperability via a standard API (as in, the 4 standard verbs: GET/POST/PUT/DELETE). In this snippet, we see how HTTP addresses the situation when the length of something is unknown and potentially too long for one party in the conversation.

    Yes, as Mark Davidson points out, the globally-discoverable minimal length of HTTP Headers is really the smallest of any implementation. So, if you want your HTTP thing to interoperate with everybody with the least friction, you have to find and use that minimum length.

    But, HTTP has you covered if you don't know (or can't know, really) what that globally-discoverable minimal length is. Simply put: HTTP gives the communicating parties a means to say, "Too big! Sorry!"

    Here's the question that most compels me: How can we avoid arguing about schema (lengths of strings and other datatype questions) and allow the communicating parties to tell each other "I can't handle that! Sorry!"

    JSA
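As a rough illustration of the "Too big! Sorry!" idea, the sketch below shows a consumer that applies its own local limit to a string field and reports a structured rejection, loosely analogous to an HTTP 414 response. Nothing here comes from the STIX or TAXII specifications: the `Rejection` type, the `check_length` function, and the limit of 255 code points are all hypothetical names and values chosen for the example.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rejection:
    """Hypothetical 'too big' reply; not part of any STIX or TAXII spec."""
    field: str
    limit: int
    actual: int


def check_length(field: str, value: str, limit: int) -> Optional[Rejection]:
    """Return a Rejection if value exceeds this consumer's local limit."""
    # Normalize to NFC first so composed and decomposed spellings of the
    # same text are counted the same way.
    normalized = unicodedata.normalize("NFC", value)
    # len() on a Python str counts Unicode code points, not bytes.
    actual = len(normalized)
    if actual > limit:
        return Rejection(field, limit, actual)
    return None
```

A caller would treat `None` as "accepted" and anything else as the message to send back to the producer. Counting code points after NFC normalization is only one possible choice; a consumer could just as well measure encoded bytes.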


  • 16.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 16:12
    This is the difference though between STIX (data interchange format) and TAXII (protocol). HTTP is a protocol, and can thus "negotiate" things, as could TAXII. STIX can't have "negotiation" in it, it's just a set of bytes, either stored as a file on disk or being sent on a wire. - Jason Keirstead STSM, Product Architect, Security Intelligence, IBM Security Systems www.ibm.com/security www.securityintelligence.com Without data, all you are is just another person with an opinion - Unknown John Anderson ---06/02/2016 01:02:33 PM---HTTP sez: " From: John Anderson <janderson@soltra.com> To: "Wunder, John A." <jwunder@mitre.org>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org> Date: 06/02/2016 01:02 PM Subject: Re: [cti-stix] Unicode, strings, and STIX Sent by: <cti-stix@lists.oasis-open.org> HTTP sez: " The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15). Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths. " https://www.ietf.org/rfc/rfc2616.txt The focus of HTTP was not to define a schema (as in "how long is a String type"?), but to promote interoperability via a standard API (as in, the 4 standard verbs: GET/POST/PUT/DELETE). In this snippet, we see how HTTP addresses the situation when the length of something is unknown and potentially too long for one party in the conversation. Yes, as Mark Davidson points out, the globally-discoverable minimal length of HTTP Headers is really the smallest of any implementation. 
So, if you want your HTTP thing to interoperate with everybody with the least friction, you have to find and use that minimum length. But, HTTP has you covered if you don't know (or can't know, really) what that globally-discoverable minimal length is. Simply put: HTTP gives the communicating parties a means to say, "Too big! Sorry!" Here's the question that most compels me: How can we avoid arguing about schema (lengths of strings and other datatype questions) and allow the communicating parties to tell each other "I can't handle that! Sorry!" JSA From: cti-stix@lists.oasis-open.org <cti-stix@lists.oasis-open.org> on behalf of Wunder, John A. <jwunder@mitre.org> Sent: Thursday, June 2, 2016 10:17:30 AM To: cti-stix@lists.oasis-open.org Subject: Re: [cti-stix] Unicode, strings, and STIX This struck me as the type of thing that must have been done before, so I did a little research on what other similar specifications (data models, not transport protocols) did: - IODEF: no max lengths specified - CIQ: no max lengths specified - HL7: specify a minimum length that specifications have to be able to handle, but no maximum length (could not find actual language here due to specs not being free, I asked a colleague,) - HDATA: no max lengths specified - SMTP: some fields have max length (in characters), some don’t - OASIS CAP: no max lengths, they have a MAY requirement for some fields suggesting a max size that would be appropriate - EDXL: no max lengths To be honest I went into this thinking that we needed to specify max lengths, but based on this research maybe we shouldn’t? Rich’s approach below seems best to me. Are there any other specs we could learn from? What did I miss? 
John From: <cti-stix@lists.oasis-open.org> on behalf of Jason Keirstead <Jason.Keirstead@ca.ibm.com> Date: Thursday, June 2, 2016 at 10:10 AM To: Rich Piazza <rpiazza@mitre.org> Cc: Mark Davidson <mdavidson@soltra.com>, "Jordan, Bret" <bret.jordan@bluecoat.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org> Subject: RE: [cti-stix] Unicode, strings, and STIX If the consensus is that we *must* specify some length recommendation, then this is a good way to attempt to do so. - Jason Keirstead STSM, Product Architect, Security Intelligence, IBM Security Systems www.ibm.com/security www.securityintelligence.com Without data, all you are is just another person with an opinion - Unknown "Piazza, Rich" ---06/02/2016 10:24:57 AM---Maybe say instead: Any length SHOULD be permitted Then maybe in the implementation guide say: sugges From: "Piazza, Rich" <rpiazza@mitre.org> To: Mark Davidson <mdavidson@soltra.com>, "Jordan, Bret" <bret.jordan@bluecoat.com> Cc: Jason Keirstead/CanEast/IBM@IBMCA, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org> Date: 06/02/2016 10:24 AM Subject: RE: [cti-stix] Unicode, strings, and STIX Sent by: <cti-stix@lists.oasis-open.org> Maybe say instead: Any length SHOULD be permitted Then maybe in the implementation guide say: suggested storage size is 8KB… From: Mark Davidson [ mailto:mdavidson@soltra.com ] Sent: Thursday, June 02, 2016 8:53 AM To: Piazza, Rich <rpiazza@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com> Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org Subject: Re: [cti-stix] Unicode, strings, and STIX I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important 
fields. The drawback, however, is that the max length will end up being the shortest supported value from major implementations, and it will only be discovered through painful research. -Mark From: < cti-stix@lists.oasis-open.org > on behalf of Mark Davidson < mdavidson@soltra.com > Date: Thursday, June 2, 2016 at 8:49 AM To: "Piazza, Rich" < rpiazza@mitre.org >, "Jordan, Bret" < bret.jordan@bluecoat.com > Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >, Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >, " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org > Subject: Re: [cti-stix] Unicode, strings, and STIX There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify it, we’ll get SO posts like this: http://stackoverflow.com/questions/686217/maximum-on-http-header-values Thank you. -Mark From: < cti-stix@lists.oasis-open.org > on behalf of "Piazza, Rich" < rpiazza@mitre.org > Date: Wednesday, June 1, 2016 at 2:17 PM To: "Jordan, Bret" < bret.jordan@bluecoat.com > Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >, Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com >, " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org > Subject: RE: [cti-stix] Unicode, strings, and STIX I think the spec would have to say something like – “ Any length is permitted” Then, implementers would have to make sure they could support that. In STIX 1.2.1, the description field of all of the objects had this text in the specification documents. 
I’m not sure in which direction that will sway you J From: Jordan, Bret [ mailto:bret.jordan@bluecoat.com ] Sent: Wednesday, June 01, 2016 1:38 PM To: Piazza, Rich < rpiazza@mitre.org > Cc: Jason Keirstead < Jason.Keirstead@ca.ibm.com >; Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >; cti-stix@lists.oasis-open.org Subject: Re: [cti-stix] Unicode, strings, and STIX If we do not define a max length then everyone will set their own. And we will have problems. Bret Sent from my Commodore 64 On Jun 1, 2016, at 8:08 AM, Piazza, Rich < rpiazza@mitre.org > wrote: My +1 was for the idea that implementation details like this do not belong in the standard. In addition, I kinda agree that that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere. From: cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ] On Behalf Of Jason Keirstead Sent: Wednesday, June 01, 2016 10:48 AM To: Piazza, Rich < rpiazza@mitre.org > Cc: Terry MacDonald < terry.macdonald@cosive.com >; John-Mark Gurney < jmg@newcontext.com >; cti-stix@lists.oasis-open.org Subject: RE: [cti-stix] Unicode, strings, and STIX RE the encoding language question, I posted some sample language to slack that I think solves the problem: "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard". I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regards to all of these "max length" properties in the spec and how they will be validated. These things actually *can not* be validated in an encoding-independent way. I have asked a few times and will ask again - in 2016, is "max length" really anything we need to care about here. DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. 
Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler. Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst. - Jason Keirstead STSM, Product Architect, Security Intelligence, IBM Security Systems www.ibm.com/security www.securityintelligence.com Without data, all you are is just another person with an opinion - Unknown <image001.gif> "Piazza, Rich" ---06/01/2016 11:39:45 AM---+1 From: cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ] On Behalf Of Terry Mac From: "Piazza, Rich" < rpiazza@mitre.org > To: Terry MacDonald < terry.macdonald@cosive.com >, John-Mark Gurney < jmg@newcontext.com > Cc: " cti-stix@lists.oasis-open.org " < cti-stix@lists.oasis-open.org > Date: 06/01/2016 11:39 AM Subject: RE: [cti-stix] Unicode, strings, and STIX Sent by: < cti-stix@lists.oasis-open.org > +1 From: cti-stix@lists.oasis-open.org [ mailto:cti-stix@lists.oasis-open.org ] On Behalf Of Terry MacDonald Sent: Wednesday, June 01, 2016 6:09 AM To: John-Mark Gurney < jmg@newcontext.com > Cc: cti-stix@lists.oasis-open.org Subject: Re: [cti-stix] Unicode, strings, and STIX Hi John-Mark, My issue with this is that its simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmers mentality. You have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me thinking.... We should create a STIX v2.0 JSON serialization document that specifies the JSON specific implementations in nomative statements, and this should be separate from the STIX v2.0 standards document . 
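
The normalization points from the original post can be checked with a short Python sketch using only the standard `unicodedata` module. The sample strings and helper names are my own choices for illustration. I used é rather than ø for the composed/decomposed pair because, as far as I can tell, U+00F8 has no canonical decomposition in the Unicode data, so NFC does not turn o + U+0338 into ø.

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    """Compatibility composition (NFKC); folds ligatures, superscripts, etc."""
    return unicodedata.normalize("NFKC", s)


# Illustrative sample strings (assumptions, not taken from the STIX drafts).
composed = "\u00e9"            # é as a single code point
decomposed = "e\u0301"         # e + COMBINING ACUTE ACCENT
ligature = "\ufb01"            # LATIN SMALL LIGATURE FI
squared = "x\u00b2"            # x + SUPERSCRIPT TWO
thai = "\u0e01\u0e34\u0e48"    # consonant + vowel sign + tone mark

# Same text, different code point counts until normalized.
print(len(composed), len(decomposed))    # 1 2
print(nfc(decomposed) == composed)       # True

# NFC leaves the ligature alone; NFKC expands it but also flattens x².
print(nfc(ligature) == "fi", nfkc(ligature) == "fi")
print(nfkc(squared))

# Three code points and nine UTF-8 bytes for what displays as one grapheme.
print(len(thai), len(thai.encode("utf-8")))
```

Counting graphemes rather than code points would need the segmentation rules from UAX #29, which the Python standard library does not provide.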


  • 17.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 16:46
    Ah, yes. You're correct, Jason, about the conversation between STIX processors being specified in TAXII, not STIX.

    Hmm. Perhaps, rather than ask "how long is a String", could we rephrase it? Maybe look at it from a different perspective? Something like: How "SHOULD" a STIX processor handle strings that are too long for it to process?

    A lot of implications flow from that question: Is truncation allowed? Can truncated strings be passed on down the line, or do they need to be marked as "the first X bytes of the source data"? Can truncated strings be used for matching content?

    Would those questions be appropriate for a data interchange format spec?

    Thanks, JSA
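
To make the truncation questions at the top of this message concrete, here is a minimal Python sketch of one possible behaviour: a processor with a local limit truncates on a code point boundary, avoids cutting a base character off from its combining marks, and records that truncation happened. `FittedString`, `fit_string`, and the limit values are hypothetical names and numbers for illustration; nothing like them is defined in STIX or TAXII.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class FittedString:
    text: str                   # value the processor will actually keep
    truncated: bool             # True if the source value was cut
    original_code_points: int   # length of the source value


def fit_string(value: str, max_code_points: int) -> FittedString:
    """Fit `value` into a local, implementation-specific limit.

    The limit is counted in code points, not bytes, so the result does
    not depend on the serialization's encoding.
    """
    n = len(value)
    if n <= max_code_points:
        return FittedString(value, False, n)

    cut = max_code_points
    # Back off while the first dropped code point is a combining mark
    # (general category M*), so a base character is not separated from
    # the marks that follow it.
    while cut > 0 and unicodedata.category(value[cut]).startswith("M"):
        cut -= 1
    return FittedString(value[:cut], True, n)


short = fit_string("title", 10)
long_value = fit_string("e\u0301e\u0301e\u0301", 3)
print(short)
print(long_value)
```

Whether a value flagged as truncated may be forwarded or used for matching is exactly the open policy question; the flag only makes it possible for a downstream consumer to tell the difference.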


  • 18.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 18:55
    Wunder, John A. wrote this message on Thu, Jun 02, 2016 at 14:17 +0000:
    > - HL7: specify a minimum length that specifications have to be able to handle, but no maximum length (could not find actual language here due to specs not being free, I asked a colleague,)

    Just my opinion: if you specify a minimum length that has to be handled, you have also just specified the maximum length. An implementation that wants to interoperate now knows that anything longer may not be handled by all other implementations, so a mandatory minimum is really just another name for a maximum.

    -- John-Mark


  • 19.  Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 18:21
    Piazza, Rich wrote this message on Thu, Jun 02, 2016 at 13:24 +0000:
    > Maybe say instead: Any length SHOULD be permitted
    >
    > Then maybe in the implementation guide say: suggested storage size is 8KB…

    I'm against suggesting a storage size in bytes, because that means an object serialized one way could be valid while the same object serialized another way could be invalid, purely due to the encoding.

    -- John-Mark
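
A small Python sketch of the encoding problem with byte limits. The 8-byte limit and the sample strings are arbitrary assumptions chosen to keep the numbers small; the point is only that the byte size of the same string changes with the encoding, while its code point count does not.

```python
def sizes(text: str) -> dict:
    """Length of `text` in code points and in bytes under three encodings."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no BOM in the count
        "utf-32": len(text.encode("utf-32-le")),
    }


BYTE_LIMIT = 8  # hypothetical limit, for illustration only

ascii_title = "title"
thai_title = "\u0e01\u0e34\u0e48"  # three Thai code points

for label, text in (("ascii", ascii_title), ("thai", thai_title)):
    s = sizes(text)
    verdict = {enc: s[enc] <= BYTE_LIMIT for enc in ("utf-8", "utf-16", "utf-32")}
    print(label, s, verdict)
```

With these inputs the ASCII string passes the limit only in UTF-8, and the Thai string passes only in UTF-16, so neither encoding is uniformly the "smaller" one.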


  • 20.  RE: [Non-DoD Source] Re: [cti-stix] Unicode, strings, and STIX

    Posted 06-02-2016 18:44
    I don't think specifying a maximum size for a STIX file makes much sense. As was said before, almost no protocol specifies a maximum size. I'm not aware of any major file format that specifies a maximum size of under 4 GB, and the ones that do specify a limit only do so because they encode the file size as a 4-byte integer. If your system can't handle a file of a given size, then rejecting the file is always a valid option, but that isn't the job of the spec.

    Jeffrey Mates, Civ DC3/DCCI
    Computer Scientist, Defense Cyber Crime Institute
    jeffrey.mates@dc3.mil 410-694-4335