OASIS Open Document Format for Office Applications (OpenDocument) TC

  • 1.  Re: [office] Proposal for Spreadsheets: New sort option "natural sort"

    Posted 02-02-2007 18:00
    Hi,
    
    The following illustrates how two strings shall be compared under the 
    natural sort algorithm.
    
    Step 1. First of all, the two strings are compared by using the normal 
    string comparison algorithm to ensure that they are not equal. If they 
    are, the function will return immediately with equality.
    
    Step 2. Next, each of the two strings is divided into three parts: 
    1.Prefix substring 2.Numeric substring 3.Suffix substring The prefix 
    substring is determined by locating the first occurrence of a digit 
    character; the substring from the very first character through the 
    character preceding the first digit is considered the prefix. Now, if 
    the first digit happens to be the first character of the whole string, 
    the prefix substring becomes empty. If there is no digit in either one 
    of the compared strings, the natural sort process will end and the 
    normal string comparison will be performed instead. The digit determined 
    herein is locale-aware, and therefore is not limited to ASCII digits. A 
    decimal separator may also be considered a digit so that real numbers 
    can also be supported if the appropriate conditions are met (see "Note" 
    below).
    
    Step 3. After the prefix substring is extracted from both of the 
    compared strings, a normal string comparison is performed on the 
    extracted prefixes. If they differ, the result is returned and the 
    process will end. If they are equal, it will proceed to the next step of 
    numeric string comparison.
    
    Step 4. In this step, the numeric substring is determined by locating 
    the first occurrence of a non-digit character after the first digit 
    character; the substring from the first digit character through the 
    character preceding the first non-digit is considered the numeric 
    substring. This substring is then converted into a double-precision 
    variable. This step is performed on both of the compared strings, and 
    the converted values are compared by simple numeric comparison. If these 
    values differ, then the result will be returned and the process will 
    end. If they are equal to one another, then the process will proceed to 
    the next step.
    
    Step 5. After the numeric comparison returns equality, the suffix 
    substring, which is simply the rest of the string that occurs after the 
    last digit of the numeric substring, will be extracted. This suffix 
    substring will then replace the original string, and the whole process 
    will repeat (i.e. back to Step 1).
    
    This sorting process is illustrated in the picture below. Note that the 
    term "normal string comparison" repeatedly mentioned in the algorithm 
    description refers to a locale-specific string comparison; therefore the 
    term does not refer to a simple ASCII string comparison. This locale 
    setting is either explicitly given by the table:language and 
    table:country attributes, or the default locale when the language option 
    is not explicitly specified (current behavior).
    
    
    Note: Treatment of decimal separators:The treatment of a decimal 
    separator is context-dependent, that is, when a decimal separator occurs 
    adjacent to one or two digit characters, it is considered a digit 
    character as long as it's the only occurrence in that given numeric 
    substring. In other words, a second occurrence of a decimal separator in 
    any numeric substring is treated as a non-digit character; therefore the 
    character immediately preceding the separator becomes the last character 
    of the numeric substring, while the separator itself becomes the first 
    character of the suffix substring.
    
    
    Best regards
    
    Michael
    
    robert_weir@us.ibm.com wrote:
    > 
    > Interesting idea.
    > 
    > How far do we take it?  
    > 
    > For example do we allow multiple levels, as in:
    > 
    > A1.1, A1.2, A1.10, ... , A19.1, A20.3, etc.
    > 
    > -Rob
    > 
    > Michael.Brauer@Sun.COM wrote on 01/19/2007 05:26:03 AM:
    > 
    >  > Dear TC members,
    >  >
    >  > this is a proposal for a new attribute of the 


  • 2.  Re: [office] Proposal for Spreadsheets: New sort option "natural sort"

    Posted 02-05-2007 10:14
    On Friday 02 February 2007, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    > Step 4. In this step, the numeric substring is determined by locating 
    > the first occurrence of a non-digit character after the first digit 
    > character; the substring from the first digit character through the 
    > character preceding the first non-digit is considered the numeric 
    > substring. This substring is then converted into a double-precision 
    > variable. This step is performed on both of the compared strings, and 
    > the converted values are compared by simple numeric comparison. If these 
    > values differ, then the result will be returned and the process will 
    > end. If they are equal to one another, then the process will proceed to 
    > the next step.
    
    Here's a comment by Martin Pool, who implemented "natural sorting" in KDE.
    
    "
    If I'm reading this correctly, that means that "1.3" > "1.20", in a
    locale where "." is the decimal separator.  In typical software version
    strings that's not correct, and that was the case I was originally
    trying to handle, and also apparently the case Robert Weir describes.
    Obviously sometimes sorting as floats is best but I suggest that when
    numbers are intermixed with non-digits the other algorithm is better.
    That is, to basically follow this algorithm but just treat the decimal
    separator as non-numeric.
    
    Also, conversion to double might give unexpected results if there are
    very long runs of digits (barcodes?)  I'm not sure if that is a concern.
    
    Also it seems rather odd that 1.2.3.4 will be sorted as (1.2, 3.4)... 
    "
    
    -- 
    David Faure, faure@kde.org, sponsored by Trolltech to work on KDE,
    Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
    


  • 3.  Re: [office] Proposal for Spreadsheets: New sort option "natural sort"

    Posted 02-05-2007 10:25
    Hi David,
    
    I think we have some kind of conflicting requirements here: You either 
    want to be able to sort floating point numbers. You then need to 
    interpret the decimal delimiter. Or you want to be able to sort version 
    numbers. You then must not interpret the decimal delimiter.
    
    What about resolving this conflict by having two options (or three, if 
    the include the default character code based sorting) instead of one, 
    "natural-integer" and "natural-float", where the first one sorts only 
    integer values, while the 2nd one sorts floats?
    
    Michael
    
    David Faure wrote:
    > On Friday 02 February 2007, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    >> Step 4. In this step, the numeric substring is determined by locating 
    >> the first occurrence of a non-digit character after the first digit 
    >> character; the substring from the first digit character through the 
    >> character preceding the first non-digit is considered the numeric 
    >> substring. This substring is then converted into a double-precision 
    >> variable. This step is performed on both of the compared strings, and 
    >> the converted values are compared by simple numeric comparison. If these 
    >> values differ, then the result will be returned and the process will 
    >> end. If they are equal to one another, then the process will proceed to 
    >> the next step.
    > 
    > Here's a comment by Martin Pool, who implemented "natural sorting" in KDE.
    > 
    > "
    > If I'm reading this correctly, that means that "1.3" > "1.20", in a
    > locale where "." is the decimal separator.  In typical software version
    > strings that's not correct, and that was the case I was originally
    > trying to handle, and also apparently the case Robert Weir describes.
    > Obviously sometimes sorting as floats is best but I suggest that when
    > numbers are intermixed with non-digits the other algorithm is better.
    > That is, to basically follow this algorithm but just treat the decimal
    > separator as non-numeric.
    > 
    > Also, conversion to double might give unexpected results if there are
    > very long runs of digits (barcodes?)  I'm not sure if that is a concern.
    > 
    > Also it seems rather odd that 1.2.3.4 will be sorted as (1.2, 3.4)... 
    > "
    > 
    
    
    -- 
    Michael Brauer, Technical Architect Software Engineering
    StarOffice/OpenOffice.org
    Sun Microsystems GmbH             Nagelsweg 55
    D-20097 Hamburg, Germany          michael.brauer@sun.com
    http://sun.com/staroffice         +49 40 23646 500
    http://blogs.sun.com/GullFOSS
    
    


  • 4.  Re: [office] Proposal for Spreadsheets: New sort option "natural sort"

    Posted 02-05-2007 10:45
    On Monday 05 February 2007, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    > Hi David,
    > 
    > I think we have some kind of conflicting requirements here: You either 
    > want to be able to sort floating point numbers. You then need to 
    > interpret the decimal delimiter. Or you want to be able to sort version 
    > numbers. You then must not interpret the decimal delimiter.
    
    Indeed.
    
    I hadn't realized that difference, which means that any generic code that implements
    "natural sorting" must take into account those two different requirements/algorithms.
    
    > What about resolving this conflict by having two options (or three, if 
    > the include the default character code based sorting) instead of one, 
    > "natural-integer" and "natural-float", where the first one sorts only 
    > integer values, while the 2nd one sorts floats?
    
    Well, now that we identified two ways of doing natural sorting, we still need
    to answer: are both ways useful in spreadsheets? I admit that I'm the one who
    broadened the scope of the discussion, thinking about that code for filemanagers
    we have, but if the spreadsheet authors say that from a spreadsheet point of view
    we only need the natural-float mechanism, then that's fine with me. No point in
    specifying things that are not useful.
    
    -- 
    David Faure, faure@kde.org, sponsored by Trolltech to work on KDE,
    Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
    


  • 5.  Re: [office] Proposal for Spreadsheets: New sort option "natural sort"

    Posted 02-05-2007 11:02
    Hi David,
    
    David Faure wrote:
    > On Monday 05 February 2007, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    >> Hi David,
    >>
    >> I think we have some kind of conflicting requirements here: You either 
    >> want to be able to sort floating point numbers. You then need to 
    >> interpret the decimal delimiter. Or you want to be able to sort version 
    >> numbers. You then must not interpret the decimal delimiter.
    > 
    > Indeed.
    > 
    > I hadn't realized that difference, which means that any generic code that implements
    > "natural sorting" must take into account those two different requirements/algorithms.
    > 
    >> What about resolving this conflict by having two options (or three, if 
    >> the include the default character code based sorting) instead of one, 
    >> "natural-integer" and "natural-float", where the first one sorts only 
    >> integer values, while the 2nd one sorts floats?
    > 
    > Well, now that we identified two ways of doing natural sorting, we still need
    > to answer: are both ways useful in spreadsheets? I admit that I'm the one who
    > broadened the scope of the discussion, thinking about that code for filemanagers
    > we have, but if the spreadsheet authors say that from a spreadsheet point of view
    > we only need the natural-float mechanism, then that's fine with me. No point in
    > specifying things that are not useful.
    > 
    
    I've just talked to OOo's spreadsheet experts: They think there are use 
    cases where sorting by integers may be required (for instance the 
    version numbers, but also chapter numbers), but don't know how often 
    they occur. So, it is up to you. If you would like to have this options, 
    its okay for them, but it is also okay for them if this option does not 
    exist.
    
    Michael
    


  • 6.  Re: [office] Proposal for Spreadsheets: New sort option "naturalsort"

    Posted 02-06-2007 00:03
    Martin Pool:
    > > If I'm reading this correctly, that means that "1.3" > "1.20", in a
    > > locale where "." is the decimal separator.  In typical software version
    > > strings that's not correct, and that was the case I was originally
    > > trying to handle, and also apparently the case Robert Weir describes.
    > > Obviously sometimes sorting as floats is best but I suggest that when
    > > numbers are intermixed with non-digits the other algorithm is better.
    > > That is, to basically follow this algorithm but just treat the decimal
    > > separator as non-numeric.
    
    
    Michael Brauer:
    > I think we have some kind of conflicting requirements here: You either 
    > want to be able to sort floating point numbers. You then need to 
    > interpret the decimal delimiter. Or you want to be able to sort version 
    > numbers. You then must not interpret the decimal delimiter.
    > 
    > What about resolving this conflict by having two options (or three, if 
    > the include the default character code based sorting) instead of one, 
    > "natural-integer" and "natural-float", where the first one sorts only 
    > integer values, while the 2nd one sorts floats?
    
    In all the "Natural sort" algorithms I've seen, the data is expected to include
    trailing digits after a decimal point. So "1.30" > "1.20".
    
    If you truly want a "version sort", that's a different animal.  Perl, for example, includes separate routines for doing version sorts that are DIFFERENT from its natural sort implementation.
    
    If you're doing something significantly different than the usual "natural sort", it should have a new name.
    
    If we ARE going to include a "natural sort", is Michael's proposed definition the same as used elsewhere?
    
    --- David A. Wheeler