Subject: [office] OpenDocument metadata and XMP
OpenDocument TC members,
This posting is a commentary on metadata issues for OpenDocument and
in particular how an XMP-like approach might address those issues. It
is meant to further the already ongoing metadata discussion within
the OpenDocument TC and only represents ideas, not any concrete
proposal. This was written by Alan Lillich with some welcome early
feedback from Duane Nickull and Bruce D'Arcus. The views expressed
here do not constitute the official opinion of Adobe Systems, nor of
Duane and Bruce.
Please post comments to the OpenDocument TC mailing list or to Alan
Lillich (alillich@adobe.com) and Duane Nickull (dnickull@adobe.com).
I'm coming into this discussion somewhat late, and there is a lot of
ground to cover about OpenDocument, metadata in general, and XMP.
This is a long posting; I hope it is coherent and useful. I have
tried not to specifically "make a case for XMP". Instead I have tried
to present an objective discussion of metadata issues in a manner
that will help the OpenDocument TC make decisions.
This posting is divided into sections:
1. Miscellaneous background
2. Decision factors for the OpenDocument TC
3. A suggested approach for OpenDocument
4. A description of XMP
======================================================================
1. Miscellaneous background
---------------------------------------------
Some background on the author:
I'm a software engineer with 27 years of work experience. I spent
almost 10 years working on commercial Ada compilers and related
software, and almost 10 years working for Apple on internals of the
PowerPC Mac OS. I've been with Adobe almost 5 years, hired after XMP
was first shipped with Acrobat 5 to take over development of the core
XMP toolkit and help the other Adobe application teams incorporate
support for XMP. I have a deep interest in shipping high quality,
high volume commercial software. While I can't speak for the original
design intentions behind XMP, I can address the value of XMP from the
view of implementing its internals and helping client applications
utilize it.
BTW - I have recently become a member of the OpenDocument TC,
specifically to participate in this debate. I will however abstain
from any vote concerning metadata in order to avoid the appearance of
Adobe attempting to "push" XMP into OpenDocument.
--------------------------------------
About the Adobe XMP SDK:
If you've looked at the XMP SDK in the past, please look again.
Earlier this year Adobe posted a significant update to the XMP
Specification. This did not introduce major changes to the XMP data
model, but did greatly improve how it is described. The
latest XMP spec has chapter 2 "XMP Data Model" and chapter 3 "XMP
Storage Model". Adobe recently (October?) posted an entirely new
implementation for the core XMP toolkit. This has a revamped API that
is similar to the old but much easier to use and more complete. The
code is a total rewrite; it is now smaller, faster, and more robust.
--------------------------------
Presumptions and bias:
A core part of making rational decisions is determining goals and
placing valuations on choices. I've tried to avoid outright advocacy,
but there certainly are presumptions and bias behind what is
presented here.
One presumption is that we're talking about a solution that can be
serialized as RDF. There is no presumption about how much of RDF is
allowed. I do have a bias for a subset that retains expressive power
while reducing implementation effort.
Perhaps the most significant presumption is that success of
OpenDocument depends on the availability of a variety of high quality
and low cost commercial applications. I suspect that business and
government in the US and Europe will insist on the stability and
support of commercial products. The completeness and quality of all
applications, commercial or open source, depends quite a bit on the
clarity and implementability of the OpenDocument specification. It
needs to be easily, reliably, and consistently implemented. The ill
effects that can arise if parts of the specification are unclear or
hard to implement include:
- Features might be too complex for mainstream users
- Applications might be fragile or buggy
- Applications might support private subsets, by intent or ignorance
- The cost of implementation might reduce the variety of choice
A bias related to this presumption is that pragmatic choices are
necessary. "Good enough" is not necessarily a four-letter word. Time
to market is important.
Pragmatic choices do not necessarily mean simplistic results. I have
a strong bias for formal models that are reasonably simple, robust,
and powerful.
Another presumption is that good software design will lead to
application layering. In the case of metadata this means a core
metadata toolkit that manages an application neutral metadata model,
with client application logic layered on top. The core metadata
toolkit provides a runtime model and API to the client code. The
strength of the underlying formal model has a big effect on the cost
of the core metadata toolkit, and on the richness of the client code
that can be created above it. The design of the runtime model and API
has a big effect on the cost to create rich and robust client code on
top of it.
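
As a concrete illustration of the layering idea, here is a minimal
C++ sketch. It assumes the revamped core toolkit API from the public
XMP SDK (SXMPMeta) underneath; the DocumentInfo class and its
property choices are hypothetical, invented just for this example:

#define TXMP_STRING_TYPE std::string
#include <string>
#include "XMP.hpp"

// Hypothetical client-application layer. The core toolkit (SXMPMeta)
// manages the application-neutral data model; this layer adds the
// application's own meaning on top of it.
class DocumentInfo {
public:
    explicit DocumentInfo ( SXMPMeta & xmp ) : xmp_(xmp) {}

    // The client layer knows which properties matter to this
    // application; the core toolkit only knows generic simple
    // values, structs, and arrays.
    void SetTitle ( const std::string & title ) {
        xmp_.SetLocalizedText ( kXMP_NS_DC, "title", "", "x-default",
                                title.c_str(), 0 );
    }
    void AddKeyword ( const std::string & keyword ) {
        xmp_.AppendArrayItem ( kXMP_NS_DC, "subject",
                               kXMP_PropArrayIsUnordered,
                               keyword.c_str(), 0 );
    }

private:
    SXMPMeta & xmp_;   // the core runtime model
};

The point is the division of knowledge: the toolkit never learns what
a "title" or "keyword" means, and the client never touches the
serialization.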
A final presumption is that a good data model with open extensibility
is crucial. By that I mean extensibility within a well defined data
model, not wide open anything-in-the-universe extensibility. End user
appreciation of metadata is growing rapidly in breadth and
sophistication. The value of OpenDocument to large organizations will
be enhanced by open metadata extensibility. Examples of significant
customer extension in the case of XMP include the ISO PDF/A standard
(http://www.aiim.org/documents/standards/ISO_19005-1_(E).doc), and
the IPTC extensions (http://www.iptc.org/IPTC4XMP/).
======================================================================
2. Decision factors for the OpenDocument TC
This section poses a bunch of questions that are hopefully relevant
in designing a metadata solution for OpenDocument. I've tried to
organize them in a more or less logical progression. Some of them
might make more sense after reading the following section describing
XMP.
- How quickly to move on new metadata?
There is an existing, albeit limited, metadata solution. Since a
change is being contemplated, there is a lot to gain by getting it
right. Is there a major release coming up that places a deadline or
urgency on defining a better metadata solution?
- Will the new metadata allow open extension?
Can end users freely create new metadata elements, provided that they
stay within a defined formal model?
- How are formal schema used?
Must end users provide a formal schema in order to use new metadata
elements? If not required, is it allowed/supported? If not provided,
what impact does this have on other aspects of general document
checking? If formal schemas are not used, is the underlying data
model explicit in the serialization? If formal schemas are not used,
where are various kinds of errors detected?
- If formal schemas are used, what is the schema language?
RELAX NG is clearly a better schema language than XML Schema. Can XML
Schema be used at all by those who insist on it?
- What is the formal model for the metadata?
What is the expressive capability of the formal model? Can it be
easily taught to general users? Does it contain enough power for
sophisticated users? Can sophisticated users reasonably work within
any perceived limitations? Can it be implemented reliably, cheaply,
and efficiently? Will it be easy for client applications to use? Are
there existing implementations?
- Is the formal model based on RDF, or can it be expressed in RDF? If
so, does it encompass all of RDF? If not all of RDF, what are the
model constraints? Can any equivalent serialization of RDF be used?
If so, what impact does that have on formal schemas?
- Does the formal model have a specific notion of reference? If so,
does it work broadly for general local file system use, networked
file use, Internet use? What happens to references as files are moved
into and out of asset management systems? If there is a formal notion
of reference, what characteristics of persistence and specificity
does it have? How well does it satisfy local workflow needs?
- What kinds of "user standard" metadata features are layered on top
of the formal model? Users want helpful visible features. They
generally don't care if things are part of a formal model or part of
conventions at higher levels. For example, a UI can make use of
standard metadata elements to provide a rich browsing, searching, and
discovery experience. It is not necessary to have every aspect
ensconced in the formal model.
- How important is interaction with XMP? Is it important to create a
document using OpenDocument then publish and distribute it as PDF? If
so, how is the OpenDocument metadata mapped into XMP in the PDF? Is
it important to import illustrations or images that contain XMP into
OpenDocument files? If so, how is the XMP in those files mapped into
the OpenDocument metadata? How does it return to XMP when published
as PDF? This "how" includes both how the mapping is defined (how well
do the formal models mesh?) and how the mapping is implemented (what
software must run?). Is it important to work seamlessly with 3rd
party asset management systems that recognize XMP?
- How important is interaction with other forms of metadata or other
metadata systems? What other systems? How would the metadata be mapped?
- Are there things in XMP that are absolutely intolerable? Things
that have no reasonable workaround? Does XMP place unacceptable
limitations on possible future directions? Are there undesirable
aspects of XMP that can reasonably be changed?
======================================================================
3. A suggested approach for OpenDocument
This is written with great trepidation. It is here for the sake of
being concrete and complete, and to provide an honest suggestion.
This is not a formal proposal from Adobe, nor an informal attempt to
twist anyone's arm. It is nothing but one software engineer's
suggestion - a software engineer with an obvious chance of being
biased by personal experience.
I think the OpenDocument metadata effort could succeed by starting
with XMP, understanding how to work within XMP, and only looking for
truly necessary changes. This could be done reasonably quickly and
easily. It saves a lot of abstract design effort, allowing the
OpenDocument TC to concentrate on more concrete issues.
It would provide an RDF-based metadata model that has demonstrated
practical value, one that can be reliably, cheaply, and efficiently
implemented, with an existing public C++ implementation that matches
internal use at Adobe (not a toy freebie). Adobe does not have a Java
implementation at this time, though.
This would provide a solution that exports seamlessly to PDF, imports
seamlessly from existing files containing XMP, and integrates
seamlessly with other systems recognizing XMP.
Since XMP can be serialized as legitimate RDF, there is an argument
for easy, if not seamless, incorporation into other RDF stores.
Slight decoration or modification of the XMP in these cases should be
reasonably easy, and is probably not unique to XMP, since the
universe of RDF usage is not uniform.
======================================================================
4. A description of XMP
This section primarily describes XMP as it exists today. The purpose
is to make sure everyone understands what the XMP specification
specifies, what it leaves unsaid, and what Adobe software can and
cannot do, so that well informed choices can be made. There is no
intent to imply that XMP is the best of all possible solutions.
You can break XMP into 4 distinct areas:
- The abstract data model, the kinds of metadata values and
structures.
- The specific data model used by standard properties.
- The serialization syntax.
- The rules for embedding in files.
The abstract data model is the most important part. It defines the
kind of metadata values and concepts that can be represented. The
data model used by standard properties is almost as important.
Common modeling of standard properties is important for reliable
data interchange.
The specific serialization syntax is not as important. As long as the
mapping to the data model is well defined, it is reasonably easy to
convert between different ways to write the metadata. Of course there
are benefits and costs to any specific serialization. What I mean
here is that the underlying formal data model defines what concepts
can be expressed. How the data model is serialized in XML is not as
important as the data model itself.
The file embedding rules are by far the least important here. It is
important that metadata is embedded consistently for each file
format, but these rules are specific to the format and not much
related to the other areas.
The following subsections discuss aspects of the abstract data model.
-------------------------------------
The basic XMP data model
I've taken to describing the XMP data model as "qualified data
structures". The basis is traditional C-like data structures: simple
values, structs containing named fields, and arrays containing
indexed items. These are natural concepts, easily explained even to
novices, and can be composed into rich and complex data structures.
Ignoring surrounding context and issues about alternative equivalent
forms of RDF, here are some simple examples:
<ns:UniqueID>74A9C2F643DC11DABBE284332F708B21</ns:UniqueID>

<ns:ImageSize rdf:parseType="Resource">
   <ns:Height>900</ns:Height>
   <ns:Width>1600</ns:Width>
</ns:ImageSize>

<dc:subject>
   <rdf:Bag>
      <rdf:li>XMP</rdf:li>
      <rdf:li>example</rdf:li>
   </rdf:Bag>
</dc:subject>
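
For what it's worth, here is a sketch of how these three values could
be built through the core toolkit API, using SXMPMeta from the public
XMP SDK as I understand it (exact signatures may differ in detail;
the "ns" namespace URI is a placeholder, as in the XML above):

#define TXMP_STRING_TYPE std::string
#include <string>
#include "XMP.hpp"
#include "XMP.incl_cpp"

int main()
{
    SXMPMeta::Initialize();

    // Placeholder namespace standing in for "ns" in the examples.
    const char * kNS = "http://example.com/ns/1.0/";
    std::string prefix;
    SXMPMeta::RegisterNamespace ( kNS, "ns", &prefix );

    SXMPMeta xmp;

    // Simple value.
    xmp.SetProperty ( kNS, "UniqueID",
                      "74A9C2F643DC11DABBE284332F708B21", 0 );

    // Struct with two fields.
    xmp.SetStructField ( kNS, "ImageSize", kNS, "Height", "900", 0 );
    xmp.SetStructField ( kNS, "ImageSize", kNS, "Width", "1600", 0 );

    // Unordered array (rdf:Bag).
    xmp.AppendArrayItem ( kXMP_NS_DC, "subject",
                          kXMP_PropArrayIsUnordered, "XMP", 0 );
    xmp.AppendArrayItem ( kXMP_NS_DC, "subject",
                          kXMP_PropArrayIsUnordered, "example", 0 );

    // Serialize to RDF along the lines of the examples above.
    std::string rdf;
    xmp.SerializeToBuffer ( &rdf, 0, 0 );

    SXMPMeta::Terminate();
    return 0;
}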
One of the main advantages of serializing XMP as RDF is that these
aspects of the data model become self-evident. The core XMP toolkit
knows that something is simple, or is a struct, or is an array,
directly from the serialized RDF; no additional schema knowledge is
necessary. This allows new metadata to be freely and easily created
by customers. Files can be shared without having to carry along
schema descriptions. Similarly, client applications can freely and
easily create new metadata without creating formal schemas or
requiring change in the core XMP toolkit. The client applications and
users understand their metadata; it is not necessary for the core
toolkit to do so. Granted, formal schemas are necessary for automated
checking, which is a good thing. The point here is that a lot of
effective work and sharing can be done without burdening everyone
with the overhead of creating formal schemas.
The notion of arrays in XMP seems to be often misunderstood, causing
controversy in the use of RDF Bag, Seq, or Alt containers. One point
is that within XMP these are just used to denote traditional arrays.
The broader aspects of RDF containers are not part of the XMP data
model. For XMP the difference between Bag, Seq, and Alt is simply a
sideband hint that the items in the array are an unordered
collection, an ordered collection, or a weakly ordered list of
alternatives. A common question is why use arrays at all instead of
repeated properties like:
<dc:subject>XMP</dc:subject>
<dc:subject>example</dc:subject>
The basic answer is the point about a self-evident data model in the
RDF serialization. What if a given file contained only one dc:subject
element? Is dc:subject a simple property or an array? Most humans
have a very specific notion about whether a property is supposed to
be unique (simple), or might have multiple values (an array). Using
explicit array notation in the serialization makes this clear, which
in turn makes it clear in the XMP toolkit API and in how client
applications use that API. Client application code becomes more
complex and UI design more difficult if everything is potentially an
array.
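
To make the API point concrete, here is a sketch of client code
reading dc:subject, again assuming the public SDK's SXMPMeta API.
Because dc:subject is always an explicit array, the code reads it the
same way whether a file holds one keyword or many:

#define TXMP_STRING_TYPE std::string
#include <string>
#include <vector>
#include "XMP.hpp"

std::vector<std::string> GetKeywords ( const SXMPMeta & xmp )
{
    std::vector<std::string> keywords;

    // One code path regardless of how many items the file contains.
    XMP_Index count = xmp.CountArrayItems ( kXMP_NS_DC, "subject" );
    for ( XMP_Index i = 1; i <= count; ++i ) {  // XMP arrays are 1-based
        std::string item;
        if ( xmp.GetArrayItem ( kXMP_NS_DC, "subject", i, &item, 0 ) ) {
            keywords.push_back ( item );
        }
    }
    return keywords;
}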
------------------------------
XML markup in values
A small aside: The XMP data model does allow XML markup in values,
but this is serialized with escaping. This is easier and more
efficient to parse than use of rdf:parseType="Literal". The main
difference is that with escaping the markup is not visible in the DOM
of a generic XML parse. Having that visibility does not seem like a
crucial feature. Having the markup be visible will also complicate
formal schemas.
For example, a call like:
xmp.SetProperty ( "Prop", "<elem>text</elem>" );
will get serialized as:
<Prop>&lt;elem&gt;text&lt;/elem&gt;</Prop>
------------------------
Qualifiers in XMP
Qualifiers in XMP come from RDF; they are not part of traditional
programming data structures. In the XMP data model qualifiers can be
viewed as properties of properties, and qualification is fully
general and recursive. Qualifiers seem to be easily understood by
users, fit easily into the core toolkit API, and provide a
significant mechanism for growth and evolution. They do this by
allowing later addition of information in a self-evident and well
structured way, without breaking clients using an earlier and simpler
view.
For an example I'll first use an XMP data model display instead of
RDF. Let's accept the notion of the XMP use of dc:creator as an
ordered array of names. This works for the vast majority of needs:
dc:creator (orderedArray)
   [1] = "Bruce D'Arcus"
Suppose we now want to add some annotation for Bruce's blog. By
adding this as a qualifier older clients still work just fine. In
fact they could even have been written to anticipate qualifiers and
display them when found:
dc:creator (isOrderedArray)
   [1] = "Bruce D'Arcus" (hasQualifiers)
      ns:blog = "http://netapps.muohio.edu/blogs/darcusb/darcusb/" (isQualifier isURI)
The RDF serialization of XMP uses the rdf:value notation for
qualifiers. This is unfortunately a bit ugly and complicates formal
schemas since it makes the qualified element look like a struct. The
presence of the rdf:value "field" is what says this is not really a
struct. The original unqualified array item:
<rdf:li>Bruce D'Arcus</rdf:li>
Adding the qualifier:
<rdf:li rdf:parseType="Resource">
   <rdf:value>Bruce D'Arcus</rdf:value>
   <ns:blog rdf:resource="http://netapps.muohio.edu/blogs/darcusb/darcusb/"/>
</rdf:li>
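
Through the toolkit API the qualifier is a single call. A sketch,
again assuming the public SDK's SXMPMeta API, with kNS standing in
for the example's ns namespace URI; the option flag usage is my
understanding of the SDK:

#define TXMP_STRING_TYPE std::string
#include <string>
#include "XMP.hpp"

void AddCreatorWithBlog ( SXMPMeta & xmp, const char * kNS )
{
    xmp.AppendArrayItem ( kXMP_NS_DC, "creator", kXMP_PropArrayIsOrdered,
                          "Bruce D'Arcus", 0 );
    // Qualify the first array item. kXMP_PropValueIsURI is the
    // sideband "this simple value is a URI" tagging discussed below.
    xmp.SetQualifier ( kXMP_NS_DC, "creator[1]", kNS, "blog",
                       "http://netapps.muohio.edu/blogs/darcusb/darcusb/",
                       kXMP_PropValueIsURI );
}

Note that older clients unaware of the qualifier keep working; they
simply see the unqualified array item.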
--------------------------
References in XMP
One aspect of programming data models (and many others) that is not
a first-class part of XMP is a notion of reference. By this I mean
that the XMP specification does not define references, and the Adobe
XMP toolkit does not contain specific API or logic for dealing with
references. References can be defined and used within XMP by clients;
they just are not a fundamental part of the data model. A reference
is some form of address along with a means to find what is at that
address. Having the address without being able to go there isn't of
much use.
The lack of a formal notion of reference does not at all say that
references cannot be represented or used within XMP. Specific kinds
of references can easily be used. The onus is on the users of those
references to define their semantics and representation.
In the qualifier example, the use of the rdf:resource notation does
not constitute a formal reference. That is just sideband information
that this particular simple value happens to be a URI. The XMP
specification does not require any specific action for this. The
Adobe XMP toolkit does not attempt to follow the URI, nor does it
allow rdf:resource to be used as a general inclusion or redirection
mechanism. All that said, the example qualifier is an informal form
of reference in the sense of an address that can be understood and
utilized by client software. A generic UI can even display it with a
nice OpenWebPage button.
By avoiding a formal notion of reference, XMP avoids being
over-constrained by picking a particular notion of address, or being
overly complex in order to support a totally generalized notion of
address. An important distinction between actual XMP usage and
typical RDF examples is that XMP operates primarily in a file system
world while RDF examples are almost always Internet oriented.
This is an important distinction with significant practical aspects.
Suppose a reference is stored as a file URL. What happens to that
reference as the file is copied around a network, or emailed, or
moved into and out of an asset management system? What are the
privacy issues related to putting file URLs in metadata without the
user's conscious knowledge?
There are other aspects of references that URIs, at any rate URIs in
the form of typical readable URLs, typically lose. Like machine
addresses, a URL references the current content at some location,
i.e. it is all about the location regardless of content. It cannot be
used for wider search, it breaks if the content moves, and it cannot
detect changes to the content. A typical URL is not persistent; it
can't identify the content through time and space. Nor is it
specific; it can't detect differences between altered forms of the
content. Yes, general URIs can contain arbitrary knowledge, but that
knowledge isn't of much use without an agent to perform lookup.
Consider a number of forms of reference to a book: title, which
edition, which printing, ISBN, Dewey Decimal number. Which of
these is useful depends on local context. XMP leaves the definition
and processing of references to clients. They are the ones with
specific knowledge of local context and workflow.
As a more concrete example, consider how compound documents are
typically created and published by InDesign. This isn't specifically
about InDesign's use of XMP, but does illustrate the changing nature
of a reference. During the creation process images and illustrations
are usually placed into the layout by file reference. This lets the
separate image file be updated by a graphic artist while an editorial
person is working on the layout. When published to PDF, the images
are physically incorporated. The file reference is no longer needed,
and often not even wanted because of privacy concerns - the PDF file
might be sent to places that have no business knowing what source
files were used. However, XMP from the images can be embedded in the
PDF and attached to the relevant image objects.
---------------------------------------------------
Interaction with metadata repositories
This section has looked inward at the XMP data model. There has been
no mention of RDF triples or broader RDF implications. This is
intentional. In terms of providing immediate customer benefit, the
first-order value of XMP is getting the metadata into files, viewing/
editing it within applications like Photoshop, and viewing/editing/
searching it with applications like Bridge. By focusing inwards a
number of simplifications can be made that make the metadata more
approachable, and that make implementations more robust and less
expensive.
That said, there is real value in being able to have interactions
between XMP and other metadata systems. What this means for a given
metadata repository depends on the internal data model of the
repository, how that relates to the XMP data model, and the
directions that metadata is moved between XMP and the repository.
Information is potentially lost when moved from a less to a more
constrained data model. Since XMP can be serialized using a subset of
RDF, XMP can be ingested fairly easily into a general RDF store. It
should be reasonably easy to transform the XMP if the particular
usage of RDF by XMP is not what is preferred.
--------------------------
Latitude for change
I've seen well intentioned suggestions like: "Enhance XMP to fit with
current RDF and XML best practices." People need to be very realistic
about the feasibility of various kinds of change. XMP is a shipping
technology, with hundreds of thousands if not millions of copies of
applications using it. This includes 3 major generations of Adobe
products. Backward compatibility is a major concern.
Global or implicit changes that would cause XMP to fail in existing
file formats and applications are unlikely to happen. There would
have to be some very compelling reason. Suppose a future version of
XMP in Photoshop started writing dc:subject as repeated elements
instead of an explicit array (rdf:Bag). The XMP in new files would
not be accepted by any existing Adobe software, and probably not by
any existing 3rd party software supporting XMP.
Global or implicit changes restricted to new file formats have a
better chance of success. Suppose OpenDocument files were "NewXMP",
using repeated elements and schema knowledge. No existing software
specifically looks for XMP in OpenDocument files, so the exposure is
less than the previous example. But there is software, especially 3rd
party asset management systems, that use byte-oriented packet
scanning to find XMP in arbitrary files. That software will not
handle these new files.
Changes to XMP that are restricted to actual metadata usage, or
otherwise under conscious user control, have a much better chance of
being accepted. One example might be a user preference for a custom
RDF serialization that is more amenable as input to a general RDF store.
------------------------
Plain XMP Syntax
I want to also give the OpenDocument TC a heads-up about something
called Plain XMP. We will be posting a paper about this for review
and discussion to the Adobe XMP web site in the near future. I want
to emphasize that Adobe has made no decisions about this, we are
simply looking for community review and feedback.
Plain XMP is being presented as a possible alternative serialization
for the XMP data model, one that happens to be describable using XML
Schema. The full XMP data model is represented, you can move back and
forth between the RDF form of XMP and Plain XMP without loss. This
does not signal any intent by Adobe to abandon RDF. This is purely an
attempt to satisfy conflicting customer desires.
Since XMP first shipped with Acrobat 5, Adobe has gotten feedback
from a number of customers or potential adopters of XMP that they
don't like RDF. Why they don't like RDF isn't really an issue here.
The Customer Is Always Right. There seem to be 3 common
"complaints" (pardon the term): general FUD about RDF, a dislike of
the RDF XML syntax, and a desire to use "standard XML tools". This
last generally means using W3C XML Schema.
Granted, RELAX NG is a vastly superior schema language. The conflict
between RDF and XML Schema can be viewed as the fault of shortcomings
in XML Schema. Again that isn't the point. The Customer Is Always Right.
A reasonable usage model for Plain XMP might be to put the RDF form
of XMP in current file types by default, and maybe let users choose
Plain XMP. New file types could go either way, realizing that
existing packet scanners won't recognize Plain XMP. Future XMP
toolkits would accept either. Client software could ask for a
serialization in either form.
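
A sketch of what that client choice might look like. To be clear,
the kXMP_SerializePlainXMP option bit is purely hypothetical, both
name and value invented for illustration; nothing like it exists in
the shipping SDK:

#define TXMP_STRING_TYPE std::string
#include <string>
#include "XMP.hpp"

// Hypothetical option bit; not part of the shipping SDK.
const XMP_OptionBits kXMP_SerializePlainXMP = 0x1000UL;

void SerializeBothWays ( SXMPMeta & xmp )
{
    std::string buffer;

    // Default: the RDF serialization, as today.
    xmp.SerializeToBuffer ( &buffer, 0, 0 );

    // Hypothetical: ask the toolkit for Plain XMP instead.
    xmp.SerializeToBuffer ( &buffer, kXMP_SerializePlainXMP, 0 );
}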
Plain XMP might also be more amenable to XSLT transformation than
RDF, especially when qualifiers are used. This could make it useful
for getting XMP into or out of metadata repositories.
Here are the previous examples serialized as Plain XMP, again
ignoring surrounding context. Yes, there is going to be controversy
about the use of an attribute versus character data for values. This
will be explained in the Plain XMP proposal. In essence, this avoids
some XML Schema problems. Arguably, XML used for data is distinctly
different from XML used for "traditional markup". The latter requires
character data, the former does not.
<ns:UniqueID value="74A9C2F643DC11DABBE284332F708B21"/>

<ns:ImageSize kind="struct">
   <ns:Height value="900"/>
   <ns:Width value="1600"/>
</ns:ImageSize>

<dc:subject kind="bag">
   <item value="XMP"/>
   <item value="example"/>
</dc:subject>

<!-- This form drops the isURI tagging of ns:blog. -->
<dc:creator kind="seq">
   <item value="Bruce D'Arcus"
         ns:blog="http://netapps.muohio.edu/blogs/darcusb/darcusb/"/>
</dc:creator>

<!-- This keeps the isURI tagging of ns:blog. -->
<dc:creator kind="seq">
   <item value="Bruce D'Arcus">
      <ns:blog value="http://netapps.muohio.edu/blogs/darcusb/darcusb/"
               rdf:resource=""/>
   </item>
</dc:creator>
======================================================================