Hi,
On Monday, 2006-12-04 15:59:55 +0100, Eike Rathke wrote:
> XSL 1.1 narrows language/country to RFC 3066, which formally covers only
> ISO 639-1 and ISO 639-2 and IANA-registered languages, and does not
> cover the upcoming ISO/FDIS 639-3 nor other language designators that
> might be valid according to RFC 4646. I don't know whether that is
> really an obstacle and would affect us or not, just want to mention.
>
> > and could add a single attribute from our own namespaces
> > that contains the
> >
> > [region] *("-" variant) *("-" extension) ["-" privateuse]
> > fragment of RFC 4646.
>
> Note that the region is (most times? always? need to reread RFC 4646)
> the country.

Not always: it can also be a 3-digit UN M.49 code, and UN M.49 also
defines supranational regions like South America.

In fact, with the *:language definitions we are also unable to model the
'extlang' extended language part of the RFC 4646 'language' production
(see below), nor the 'privateuse' or 'grandfathered' tags. So Michael's
suggestion to keep using *:language and to place only the remainder in a
separate attribute doesn't work.

I think for backwards compatibility we should continue to place the
2*3ALPHA ISO 639 code in *:language and the 2ALPHA ISO 3166 code in
*:country, and create a *:script attribute for the 4ALPHA ISO 15924
code. That keeps the data easy for existing implementations to use.
However, the script attribute may be lost during roundtrips through
implementations that do not know it.

In all cases where a combination of those three ISO codes is not
sufficient, a single additional attribute should be added that contains
the entire RFC 4646 notation, including the information repeated from
the *:language, *:country and *:script attributes. The reason for
replicating that information is simply that parsing the attribute could
otherwise be a nightmare, whereas existing RFC 4646 parsers will know
how to handle a complete tag. Effectively, a language-tag attribute, if
present, would override the *:language, *:country and *:script
attributes.
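
To make the idea concrete, here is a rough Python sketch of the writer
and reader side. It is only an illustration, not specification text. The
attribute names (language, script, country, language-tag) and the helper
names are placeholders of mine, and filling the ISO attributes from the
leading subtags as a fallback for old readers is my assumption.

```python
import re

# Placeholder name for the proposed full-tag attribute (an assumption).
FULL_TAG = "language-tag"

# Leading ISO codes of a tag; 'rest' captures anything they cannot express.
PREFIX = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<country>[A-Za-z]{2}))?"
    r"(?P<rest>-.+)?"
)


def to_attributes(tag):
    """Writer side: split a tag into the proposed attributes."""
    m = PREFIX.fullmatch(tag)
    if m is None:
        # e.g. private-use tags such as "x-klingon": only the full tag fits.
        return {FULL_TAG: tag}
    attrs = {k: v for k, v in m.groupdict().items()
             if k != "rest" and v is not None}
    if m.group("rest") is not None:
        # The three ISO codes are not sufficient: repeat everything in full.
        attrs[FULL_TAG] = tag
    return attrs


def effective_tag(attrs):
    """Reader side: the full tag, if present, overrides the ISO attributes."""
    if FULL_TAG in attrs:
        return attrs[FULL_TAG]
    parts = (attrs.get("language"), attrs.get("script"), attrs.get("country"))
    return "-".join(p for p in parts if p)
```

With this, "sr-Latn-RS" needs only the three ISO attributes, while
"es-419" is written as language "es" plus the full tag, so an old reader
still sees the language and a new reader gets the complete tag back.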

I'll try to mock up some specification text for this.

Here is the RFC 4646 syntax of the language tag in ABNF [RFC4234]:

Language-Tag  = langtag
              / privateuse             ; private use tag
              / grandfathered          ; grandfathered registrations

langtag       = (language
                 ["-" script]
                 ["-" region]
                 *("-" variant)
                 *("-" extension)
                 ["-" privateuse])

language      = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
              / 4ALPHA                 ; reserved for future use
              / 5*8ALPHA               ; registered language subtag

extlang       = *3("-" 3ALPHA)         ; reserved for future use
                                       ; NOTE: RFC 4646bis, waiting
                                       ; for ISO 639-3 to be
                                       ; finalized, will replace this
                                       ; comment with:
                                       ; specific ISO 639-3 codes

script        = 4ALPHA                 ; ISO 15924 code

region        = 2ALPHA                 ; ISO 3166 code
              / 3DIGIT                 ; UN M.49 code

variant       = 5*8alphanum            ; registered variants
              / (DIGIT 3alphanum)

extension     = singleton 1*("-" (2*8alphanum))

singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
              ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
              ; Single letters: x/X is reserved for private use

privateuse    = ("x"/"X") 1*("-" (1*8alphanum))

grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
                ; grandfathered registration
                ; Note: i is the only singleton
                ; that starts a grandfathered tag

alphanum      = (ALPHA / DIGIT)        ; letters and numbers

                    Figure 1: Language Tag ABNF

Note: There is a subtlety in the ABNF for 'variant': variants
starting with a digit MAY be four characters long, while those
starting with a letter MUST be at least five characters long.
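
For anyone who wants to play with the grammar, the ABNF above can be
transcribed almost literally into regular expressions. The following
Python sketch is my own transcription, with made-up function names. It
checks well-formedness only and does not consult the IANA registry, so it
accepts syntactically valid subtags that were never registered. Note that
the 'grandfathered' production is very loose on its own.

```python
import re

# 'langtag' production; re.VERBOSE lets the layout mirror the ABNF.
LANGTAG = re.compile(
    r"""
    (?P<language> [A-Za-z]{2,3} (?:-[A-Za-z]{3}){0,3}   # 2*3ALPHA [extlang]
                | [A-Za-z]{4}                           # reserved
                | [A-Za-z]{5,8} )                       # registered subtag
    (?: - (?P<script> [A-Za-z]{4} ) )?                  # ISO 15924
    (?: - (?P<region> [A-Za-z]{2} | [0-9]{3} ) )?       # ISO 3166 / UN M.49
    (?P<variants>   (?: - (?: [A-Za-z0-9]{5,8} | [0-9][A-Za-z0-9]{3} ) )* )
    (?P<extensions> (?: - [0-9A-WY-Za-wy-z] (?: - [A-Za-z0-9]{2,8} )+ )* )
    (?: - (?P<privateuse> [xX] (?: - [A-Za-z0-9]{1,8} )+ ) )?
    """,
    re.VERBOSE,
)
PRIVATEUSE = re.compile(r"[xX](?:-[A-Za-z0-9]{1,8})+")
GRANDFATHERED = re.compile(r"[A-Za-z]{1,3}(?:-[A-Za-z0-9]{2,8}){1,2}")


def parse_langtag(tag):
    """Return the subtags of a 'langtag', or None if it does not match."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    parts = m.groupdict()
    # Variants become a list; extensions stay one string without the hyphen.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    parts["extensions"] = parts["extensions"].lstrip("-")
    return parts


def is_well_formed(tag):
    """True if the tag matches any alternative of 'Language-Tag'."""
    patterns = (LANGTAG, PRIVATEUSE, GRANDFATHERED)
    return any(p.fullmatch(tag) for p in patterns)
```

For example, parse_langtag("de-CH-1996") yields the variant "1996", while
parse_langtag("de-abc1") returns None, because a variant starting with a
letter needs at least five characters.
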
--
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS