OASIS XML Localisation Interchange File Format (XLIFF) TC

    Meeting minutes

    Posted 07-02-2025 11:35

    Dear all,

     

    Please find below this week's meeting minutes.

     

    Many thanks to Mathijs for providing the automated transcription of the meeting.

     

    Best,

     

    Lucía

     

     

     

     

    Attendance: Mathijs, Rodolfo, Yoshito, Lucia. We have quorum.

     

    Administration

     

    R: I move to approve the June 17 meeting minutes – https://groups.oasis-open.org/discussion/meeting-minutes-20

     

    Y: I second.

     

    R: Meeting minutes approved.

     

    Summer availability. Doodle. https://doodle.com/group-poll/participate/erVo3lBe

     

    Charter clarification.

     

    L: We have asked the OASIS administrator to start the process so we can vote on this. Rodolfo created the ticket in their system and we are still waiting for it. That's where we are now. Once the ballot is open, all of you who have voting rights will be able to vote on the charter clarification, and that will close the topic.

     

     

     

    Technical

     

     

     

    XLIFF 2.2.

     

    New translation memory standard. https://github.com/oasis-tcs/xliff-xliff-22/tree/master/memory

     

    R: We started with the project in the repository.

     

    M: This is something we are trying to work with in my daily work. I was recently at LocWorld, and in the Process Innovation Challenge there was a gentleman from Microsoft who presented that they are, of course, heavily using LLMs for translation. Their process innovation was that they introduced an extra element into every unit to include content context. So they explicitly added context to every unit that might be relevant to any processor of that XLIFF, especially an LLM-specific one.

     

    R: Yeah, I'm dealing with something like that, but instead of introducing a new element, I'm trying to use the standard XLIFF metadata module to provide the context, so it's done in a standard way.

     

    M: This leads to a question I had: what determines when this kind of metadata becomes codified in the standard? And when should you use metadata versus just an arbitrary element that you create and that XLIFF ignores? What are the principles behind that?

     

    R: The thing is, XLIFF is modular. So you can create your own elements and define them in what we call a module. You can have a custom module with your own elements, your choice, that's fine. But there is a big rule: if there is something already in XLIFF that provides what you're trying to do, you should use that. And if we are talking about introducing metadata, the right thing to do is use the Metadata module. Do not create your own, do not invent your own thing, otherwise you will not be able to exchange with other people.
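    To make the distinction concrete, here is a minimal sketch of a unit carrying extra context through the standard Metadata module rather than through a custom element. The element structure (mda:metadata, mda:metaGroup, mda:meta) comes from the XLIFF 2.x Metadata module; the category and type names below are invented examples, not defined by the standard.

```python
# Minimal sketch: a unit carrying LLM context through the XLIFF 2.x Metadata
# module. The category/type names ("llm-context", "domain", ...) are invented.
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"
MDA = "urn:oasis:names:tc:xliff:metadata:2.0"

unit = f"""
<unit xmlns="{XLF}" xmlns:mda="{MDA}" id="u1">
  <mda:metadata>
    <mda:metaGroup category="llm-context">
      <mda:meta type="domain">printer firmware UI</mda:meta>
      <mda:meta type="audience">end user</mda:meta>
    </mda:metaGroup>
  </mda:metadata>
  <segment>
    <source>Out of paper.</source>
  </segment>
</unit>
"""

root = ET.fromstring(unit)
for meta in root.iter(f"{{{MDA}}}meta"):
    print(meta.get("type"), "=", meta.text)
```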

     

    M: The thing about metadata is that other processors would not necessarily use it, because it's just metadata that you added and that is only relevant to you. But what if other processors say: no, I want to know which item is actually the context that I need to inject into my LLM prompt? Then it becomes meaningful semantic metadata, and at some point we need to say: okay, now it becomes standard, right?

     

    R: If you created something that is useful and not part of the standard, you can submit what you created and then we would add that as a module in the next version of XLIFF. That's the idea.

     

    Y: Yeah, I think it's challenging. I think the current Metadata module is sufficient to store any type of metadata; it's very flexible. The point is, if you think about exchange for an LLM, for model training, whatever, you define some specific group ID or key for the meta elements, and then two implementations need to agree on the same semantics for those groups. That's the tough part.

     

     

     

    R: We have XLIFF as a vehicle, and if we put that metadata in at every step of the process, you end up with a rich XLIFF that you can send to your LLMs. That's something we can even start discussing for the translation memory project: establishing some kind of metadata rules that could be used to improve the translation memory format. Mathijs, while you were out, we started the project in the repository. It's about writing a document, right now a Markdown skeleton, for using XLIFF as a translation memory, because TMX will not be reinvented. We were just talking about metadata, and there's no way you can implement flexible metadata in TMX.

     

    M: No. Very interesting. This is something we're also looking at at the company I work at: reinventing the TM in general, as in, you should be able to add custom attributes to all your translations. Because without annotations like where it lives, how it performs, how it was created, it's very dumb data. And having an interchange format that is the vehicle for that is very relevant.

     

    R: Just imagine: we have XLIFF, which is a vehicle that moves translation information from one place to another. If you put that metadata in at every step of the process, so you say, hey, I pre-translated using this engine, I'm using this LLM to annotate things, and you put all that information there, you have a useful XLIFF that you can send to an LLM and say, hey, finish this. "Yeah, but then you're losing everything." No: just create your TM with that XLIFF, drop TMX, and use XLIFF with all that metadata, all that process information, and start from there.
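    A hypothetical sketch of that idea: a small script that appends a provenance metaGroup to a unit at each processing step, so the XLIFF itself carries the process information. Only the mda:* element names come from the Metadata module; the "provenance" category and the step/engine/date types are invented for illustration.

```python
# Hypothetical sketch: record each pipeline step as a metaGroup on the unit,
# so the XLIFF can later serve as a TM record with process information.
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"
MDA = "urn:oasis:names:tc:xliff:metadata:2.0"
ET.register_namespace("", XLF)
ET.register_namespace("mda", MDA)

unit = ET.fromstring(
    f'<unit xmlns="{XLF}" id="u1"><segment>'
    '<source>Out of paper.</source><target>Sin papel.</target>'
    '</segment></unit>'
)

def record_step(unit, step, engine, date):
    # Reuse (or create) the unit's <mda:metadata> container, then add a group.
    md = unit.find(f"{{{MDA}}}metadata")
    if md is None:
        md = ET.Element(f"{{{MDA}}}metadata")
        unit.insert(0, md)  # module content goes before the segments
    group = ET.SubElement(md, f"{{{MDA}}}metaGroup", {"category": "provenance"})
    for t, v in (("step", step), ("engine", engine), ("date", date)):
        ET.SubElement(group, f"{{{MDA}}}meta", {"type": t}).text = v

record_step(unit, "pre-translation", "example-mt", "2025-07-01")
record_step(unit, "llm-annotation", "example-llm", "2025-07-01")
print(ET.tostring(unit, encoding="unicode"))
```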

     

    Y: Yeah, I'm just curious. In general, you wrote a lot of XLIFF tooling, right? Do you actually utilize metadata in your tooling? I actually came across this recently, and I'm using metadata to carry a little bit more context in our tooling.

     

    R: Two special situations. I've been converting from XLIFF to TMX back and forth for many years. The tools work with XLIFF: they take your Word document, convert it to XLIFF, and let you translate the XLIFF. Then, for simplicity of exchange, the memory works with TMX, which drops all the information that I already have in the XLIFF. If I'm losing that information converting to TMX, that's silly. And I have a customer now with a simple problem: they're using Japanese. When you use TMX for processing translation memory, you have to work with n-grams to index the Japanese so you can get reasonable speed.

     

    But n-grams are bad when you have full text in Japanese and you want to find a term or things like that, because a Japanese word is not a single kanji character; a word may have several. And we don't have good tools for tokenizing Japanese properly. In my latest experiments I've been using vector embeddings, the way LLMs process text. I did a simple example: I took a few phrases in Spanish, Japanese and English, generated vector embeddings for those sentences, and then searched for one of them. Magically, with one query I was able to retrieve them in all three languages. Now, remember we were talking about the translation memory, and the problem we were discussing last time was that it is multilingual. In our memory we put all the sentences in all languages; we really don't care that much. But if we add the metadata, we can search using the vectors, filter using the metadata, and retrieve the things we want.

     

    Y: So you store the tokenized text in the metadata?

     

    R: No, I'm creating tables that contain the vectors, the tokenized data, and also fields with the metadata. So in the database, whenever I want to search for the translation of a paragraph, I do the vector search and get text in any language that matches my query, but in the same table I have the metadata that allows me to filter: this is for this project, this was for that client, this is for this subject. That helps a lot. The best thing is when the user does a concordance search. For example, I'm translating from English into Spanish and I don't have that information in the data set, but I do have the French translation. That same storage can tell me: I don't have Spanish, but here's the French; it may help you, because French is somewhat similar to Spanish. Not the same, but it helps.
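    A toy sketch of the retrieval pattern described here: one table holding text in any language, its vector, and filterable metadata; retrieval is vector similarity followed by a metadata filter. The embed() function below is only a placeholder; a real system would call a multilingual embedding model.

```python
# Toy sketch of "vector search, then filter on metadata" (illustrative only).
import math

def embed(text: str) -> list[float]:
    # Placeholder: a real system would call a multilingual embedding model.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are normalized above, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# One table: text in any language, its vector, and filterable metadata fields.
memory = [
    {"text": "Sin papel.", "lang": "es", "client": "acme", "vector": embed("Sin papel.")},
    {"text": "Plus de papier.", "lang": "fr", "client": "acme", "vector": embed("Plus de papier.")},
    {"text": "Out of paper.", "lang": "en", "client": "other", "vector": embed("Out of paper.")},
]

def search(query: str, client: str | None = None, top: int = 3):
    q = embed(query)
    rows = [r for r in memory if client is None or r["client"] == client]
    return sorted(rows, key=lambda r: cosine(q, r["vector"]), reverse=True)[:top]

for row in search("Out of paper.", client="acme"):
    print(row["lang"], row["text"])
```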

     

    Then I have a company that works with 20 languages, and some of their translators use translations from other languages to get ideas, because they produce machinery and sometimes don't have the name for a new piece in the target language, so they take it from other languages and adjust it. Today they do that with TMX and it's a nightmare to handle. But I can simplify that a lot by using XLIFF, dropping the n-grams, and using just the vectors and the LLM technology to find things. I'm not saying use AI to translate for you. I'm using the base technology of the AI, which is to store everything as vectors and assign probabilities, to search in what I already have. I don't have to send my text to ChatGPT or whatever; you can do that internally.

     

    Y: Yeah. So people's concern is that XLIFF provides great flexibility for adding extra semantics or context. The good thing is that you can use your own defined metadata as you want.

     

    R: That's why I was saying we should define a set of metadata items that must be included in what we are going to write.

     

    Y: Yeah, we had a similar discussion a long time ago. Whether this would be part of XLIFF or just a recommendation or implementation guideline, I just don't know how those things can actually be useful for data exchange purposes.

     

    R: Well, today what we have is just a metadata format, but we don't have any suggestions for end users, like "put this and that if you want to exchange". We may provide a reduced vocabulary and say: this is the basic list. Like when you work with, I don't know if you're familiar with TBX, the TermBase eXchange standard: TBX lets you use multiple vocabularies they have created. If you want the simple version, you have TBX-Basic; if you want the full version, you have a huge set of properties that you can fill in. But we don't have any equivalent for metadata, actually saying: this is the basic set, this is the enhanced set. We don't have that.

     

    It doesn't mean it has to be part of the standard, but it can be documentation that we present to the user saying: hey, if you're going to use XLIFF and you're willing to use metadata, not just plain XLIFF, this is a suggested set of properties that you should be using and exchanging. If we come up with a suggestion like that, it would help exchange.

     

    M: I think, for the current standard, it would be really helpful if at least the Metadata module outlined some examples of metadata. Because it's so important to use that and not go off and invent your own standard, having those examples could be very meaningful, also in this whole context that you're explaining.

     

    R: Yes.

     

    Y: Actually, as an implementer, if there were some suggested properties or a defined set of properties, I would use that.

     

    R: Yes. And that's why I'm interested in having that kind of material. I don't know if you can share the metadata that you're using, so we can define some common things. Remember that XLIFF, TMX and SRX were all standards created following what people were already using. If you're using some kind of metadata, we can start with that.

     

    M: Before the next meeting I can actually share the metadata that we ourselves inject, because we have so many tools that we interchange with. It might be interesting to see.

     

    R: Yes.

     

    M: On another note about using XLIFF as a TM exchange format: wouldn't any meaningful TM exchange format also need to include the segmentation rules?

     

    R: Not necessarily. That was a problem we had when I was in LISA and SRX appeared. The problem was simple: TMX had all the text already split, already segmented. People from SDL later suggested we could use SRX, because SDL was using SRX. They invented it to segment the text they already had: if our SRX rules are applied to this document, we will get this segmentation, and segments like that are going to be in the TMX. If you use the same SRX when you're processing your files with other tools, you will get similar segmentation, and then we will have compatible TMX files. That was the idea. Then we found problems with SRX 1 and created SRX 2 to improve it. But the thing is, SRX splits the text.
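    For readers unfamiliar with SRX: it is an XML format of ordered break/no-break rules, each defined by regular expressions for the text before and after a candidate break point. The sketch below compresses that idea into code; it is not the actual SRX schema, and the two rules are invented examples.

```python
# Compressed sketch of SRX-style segmentation: ordered (break?, before, after)
# regex rules; the first matching rule decides. Not the real SRX XML schema.
import re

RULES = [
    (False, r"\bMr\.$", r"\s"),   # no break after "Mr."
    (True,  r"[.?!]$",  r"\s"),   # break after sentence punctuation + space
]

def segment(text: str) -> list[str]:
    segments, start = [], 0
    for i in range(1, len(text)):
        before, after = text[:i], text[i:]
        for breaks, b_re, a_re in RULES:
            if re.search(b_re, before) and re.match(a_re, after):
                if breaks:
                    segments.append(text[start:i].strip())
                    start = i
                break  # first matching rule decides; stop checking rules
    segments.append(text[start:].strip())
    return [s for s in segments if s]

print(segment("Mr. Smith arrived. He sat down."))
# -> ['Mr. Smith arrived.', 'He sat down.']
```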

     

    You get segments of fixed size in TMX. But in XLIFF we don't have that, because we have a unit, which could be a paragraph containing multiple sentences inside multiple segments. So if we do the segmentation with a set of rules, it would be interesting to share the SRX that we use. But at translation time, given that we have segments and ignorables, translators are able to rejoin and reorder sentences, and then the segmentation rules no longer match what the translator did at translation time.

     

    M: I see.

     

    R: In that case, what I found works for translators is that I pre-segment using SRX and then let them adjust: join segments, split segments as they need. Because each language has different needs, and you sometimes need to reorder. I display that information to the translator the way they are used to, but internally I work with units. I simply concatenate the content of all segments and ignorables and keep that in my unit, and I work with the unit, because translators can change the segmentation at runtime.

     

    M: But in the context of wanting to use XLIFF as a TM interchange format: what we found is that a lot of customers ask for interchanging TMs between different systems, and we always tell them it's possible, but in reality you lose segments. Do you think this is a problem that's solvable or not?

     

    R: Yes, it's solvable. Why are you losing segments? Because you have a unit with three sentences here, and in the database you have a sentence that is alone. You may be able to reuse one sentence of your current unit, but if you're doing a TM based on TMX, it may be hard to find if your segmentation doesn't match. What I do in this particular case: I have a unit with several segments. First I search for the full unit in the TM. If it's not there, then I search the TM segment by segment. Sometimes I can get the middle one that was already stored in the database. And using vector embeddings, you don't even have that problem.
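    A minimal sketch of that fallback lookup, assuming a simple exact-match TM keyed by text (an invented example, not the actual implementation):

```python
# Sketch of the fallback described above: look the whole unit up first, then
# fall back to segment-by-segment lookup for the pieces that did not match.
def translate_unit(segments: list[str], tm: dict[str, str]) -> list[str | None]:
    whole = " ".join(segments)
    if whole in tm:                       # 1) full-unit match
        return [tm[whole]]
    return [tm.get(s) for s in segments]  # 2) per-segment matches (None = miss)

tm = {"Press OK.": "Pulse Aceptar."}
print(translate_unit(["Open the menu.", "Press OK."], tm))
# -> [None, 'Pulse Aceptar.']  (a segment in the middle can still match)
```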

     

    M: Yeah, exactly.

     

    Y: That's an interesting idea. Actually, we don't have any implementation that stores the unit in the memory; only segmented sentences are stored, and we just use the memory in the traditional way, as pre-segmented sentence pairs. I didn't actually think about storing the entire unit.

     

    R: Yes. I'm using that because it happened to me. I was translating the manuals for my own tools. I'm not a translator; I simply translate because I want to make sure the tools work. I use the tools and notice, oh, I need this feature, and that's when I implement features. It happened to me because I was translating XLIFF Manager and I had another tool with full matches, but I was not able to use them because they were in the middle of a paragraph; I had a lot of things that were only partial matches. So I decided: okay, I will store everything at unit level. First try to locate matches at unit level; if I cannot find matches at unit level, search at segment level. Some segments will match. That's what I did.

     

    M: That's nice if you're in control of the TM. But if all you have is TMX...

     

    R: In my case, I am in control. I own the server where the remote TM is hosted, and I put everything there.

     

    M: Exactly. But if all you have is a TMX file, then again all that context is lost and you lose all that stuff.

     

    R: Yes, but if I have my units in that server, I'm happy.

     

    M: Never the case for us.

     

    R: Yes. That's why we need to tell people, hey, don't waste time. Store the unit and index at the level you want. Keep an index for the unit and then if that doesn't help, index each segment.

     

    M: Yep, if it was segmented in the first place. Sometimes you just have that paragraph, you put it in one segment and you say: yeah, someone else can resegment it.

     

    R: Yes. And once the translator does the translation, resegmenting at will, you can take the translation, compact the text into one unit and store its vector.

     

    M: Yeah, exactly. I actually used all the valid XLIFF files from the repository as test cases for us, and they're all green, which is very nice. CDATA is annoying to implement; subflows are even more annoying. But it's all done, basically, having implemented the whole standard myself now. There's only one element of the standard that I chose, out of necessity, not to adhere to, and that is the constraint that if you have a target segment with tags, the IDs should match the IDs of the source tags. If you have an AI translate, this is super hard to adhere to, and I was just wondering what your thoughts on that were. Because when I take a segment and give it to any machine translation, either NMT or LLM, whatever, I give it as the original: I put back the original value of the text, e.g. HTML. I send it as HTML, because it's really good at that. When I then get it back, I don't know which tag aligns with which tag anymore.
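    For reference, this is the constraint being discussed, shown on an invented segment: inline codes in the target are expected to reuse the ids of the matching codes in the source.

```python
# Sketch of the id-matching constraint: target inline codes should reuse the
# ids of the corresponding source codes. The segment text is invented.
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"

SEGMENT = f"""
<segment xmlns="{XLF}">
  <source>Click <ph id="1"/>Save<ph id="2"/> to finish.</source>
  <target>Haga clic en <ph id="1"/>Guardar<ph id="2"/> para terminar.</target>
</segment>
"""

def inline_ids(element):
    # Collect the ids of all <ph> codes below the given element.
    return {ph.get("id") for ph in element.iter(f"{{{XLF}}}ph")}

seg = ET.fromstring(SEGMENT)
source_ids = inline_ids(seg.find(f"{{{XLF}}}source"))
target_ids = inline_ids(seg.find(f"{{{XLF}}}target"))
print("target ids reuse source ids:", target_ids <= source_ids)  # True
```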

     

    R: Let me show you, because I solved that.

     

    (Rodolfo shares his screen) R: Okay, I have a segment here. I just cleared the translation. I have several tags: this is a ph, another ph, this one is actually two, so all the tags are just empty ph elements.

     

    So I want to translate this using some AI. Here I have a prompt telling the AI: given this source (I'm showing the XLIFF source with all the ph elements), translate the content from English into Spanish, with the list of requirements and the terminology that I already have on my screen. In the section on the bottom right I'm sending the terms as JSON. I can copy this prompt to the clipboard; I just did. I paste it into Copilot, execute, and Copilot gave me this: an XLIFF target with all the ph elements. I have the answer in the clipboard, I paste it, and I have all the tags in the right place.

     

    M: Yeah, of course I understand that if you ask an AI to just deal with XLIFF, you get XLIFF back and it can be valid. But we see that as an inferior option to sending the actual source tags. So if it's HTML, send the strong tags, send the spans, etc., because we see that it does a better job with that. And then it becomes a problem.

     

    R: Yes, because you sent the HTML. But if you send the XLIFF, in your prompt you can add more, for example. In this case I only had terms as context; I didn't have matches, the other boxes on the left side of the screen were empty. I could send the matches, and I could also say: this is a good match, an exact match.

     

    Y: I'm just wondering. You said it's a ph, right? In many cases, like HTML, it's going to be pc; it's going to be paired, and sometimes nested, and in some Asian languages reordered. In those cases machine translation sometimes messes up the ordering or the nesting relationships.

     

    R: This is XLIFF that is done like Mathijs said. This segment has a pc here with an ID, which is S1; this second one is another, and this is the closing of the pc. Here in the target, what I have is someone who was creative: in the target the pc ID, instead of S1, was T1. Bad, isn't it? That's the problem you're facing, Mathijs. So I can say: fix that with AI. This was supposed to be French, and it fixed the tags and adjusted my translation, because it was just the source copied into the target. But if I have a translation, it works: I can ask the AI to fix the tags, saying, this is the source text, this is the translation, but the tags are wrong.

     

    M: Yeah. The problem for us is that not every app speaks XLIFF. LLMs generally can, but not all NMT apps speak XLIFF. That's where we have to send the original tags, and it becomes a problem.

     

    R: Yep. I'm using a few: ChatGPT, Gemini, Copilot; they all handle XLIFF. I just showed you the prompt, but I don't need to use the prompt manually. I can simply click a segment and use a shortcut: copy the AI prompt to the clipboard. So I simply press Shift+Command+C to copy it to the clipboard and execute it in whatever LLM I want. Or I simply use "get machine translation", where I'm using Anthropic Claude, Azure, ChatGPT, DeepL, Google, ModernMT. And if you look at Claude, it's giving me the right tags.

     

    M: But Azure didn't.

     

    R: No, Azure is not capable of that. That is Azure machine translation.

     

    M: Exactly, and that's my point. If I were to use that, I cannot put the tags right back, so I cannot guarantee that the IDs match.

     

    R: Yeah, but I can take that result and then ask an LLM to fix the tags.

     

    M: Yeah, it's not always an option, unfortunately, but yeah.

     

    R: So it's a matter of selecting the tool, and this is way beyond XLIFF support. It's a problem with some engines.

     

    M: Of course, that's not our problem. I just wanted to highlight that for some tools this requirement is a problem.

     

    R: Yeah, but I let the users choose what engine they want to use. Some of them are good at placing tags, some are terrible. I find that Claude is the best one; personal preference.

     

    Y: So yeah, it would be great to share this type of experience and these solutions for how to tackle this, and to have some kind of community sharing this type of thing in the localization world.

     

    R: The thing we are going to implement is supposed to be useful. This is where we should be sharing our knowledge and creating some kind of standard saying: hey, use this. We know there is a lot involved in translating. We have a common problem: processing tags with an LLM. For that to work you need to make sure your prompt tells the LLM: do not invent your own tags, do not discard any of the tags, put them in the right order, preserve the attributes. That's why, for just a two-line sentence, I had a long prompt giving it the instructions.

     

    M: For a lot of enterprise usage that's the real problem: if you inject a prompt per segment, or send a request per segment, that's way too slow.

     

    R: No, that was a special case, just for fixing a single segment.

     

    M: Yeah. But generally you find a batch size that works for you.

     

    R: Yes. You create a larger prompt: instead of saying this is a segment, I say this is a set of units. When there is a target, use it as context; when there is no target, provide a translation. In your translation you should preserve the tags, not be creative, and use the rest of the text to make it fluent. I send 10 or 30 depending on the context, say 10 units at a time, but not more than that.

     

    M: What do you foresee as a problem, besides it ending up in a CAT tool? What is the problem if the IDs don't match?

     

    R: The problem when IDs don't match comes when you are trying to merge the XLIFF. Suppose, as in the example I showed you a little earlier, we have tags 1 and 2 in the source and 3 and 4 in the target. That's valid; actually, the standard says that if the inline elements are the same, they should use the same IDs. In this case the user didn't; they were creative. If you have the content of the tags stored in the originalData element, you can merge. If not, you have a problem, because when you don't have the original data, you have nowhere to get the tag content from.

     

    M: But if the ID is different, but the data ref is still pointing to the same value? For me, that works; I can just turn it back into the original.

     

    R: You have the start ref and the end ref. Yeah, that's fine.

     

    M: Yeah, exactly. So I don't have any problem; it's just the only part of the standard I don't officially follow.

     

    R: No, you must have the start ref and end ref stored in the original data.

     

    M: Yeah, I do that. It's just the ID itself; the start data ref still references the actual data. It's the ID that doesn't match between the source and the target, but they still point to the same data element that holds the value.

     

    R: Okay, but if I use the validator to check that file and it notices that two pc elements point to the same start and end data, they are in the same segment, one is in the source and one is in the target, and they have different IDs, that would be flagged as a mistake.
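    A sketch of the situation being described, on an invented unit: the source and target placeholders carry different ids, but both dataRef attributes point at the same originalData entry, so the original markup can still be recovered at merge time (a validator would still flag the id mismatch). The discussion above was about pc with start and end references; ph with a single dataRef is used here only to keep the example short.

```python
# Sketch: ids differ between source and target, but both dataRef values point
# at the same <data> entry, so the original markup can still be restored.
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"

UNIT = f"""
<unit xmlns="{XLF}" id="u1">
  <originalData>
    <data id="d1">&lt;br/&gt;</data>
  </originalData>
  <segment>
    <source>Line one<ph id="1" dataRef="d1"/>line two</source>
    <target>Línea uno<ph id="t1" dataRef="d1"/>línea dos</target>
  </segment>
</unit>
"""

unit = ET.fromstring(UNIT)
data = {d.get("id"): d.text for d in unit.iter(f"{{{XLF}}}data")}
for ph in unit.find(f"{{{XLF}}}segment").iter(f"{{{XLF}}}ph"):
    print(ph.get("id"), "->", data[ph.get("dataRef")])
# 1 -> <br/>
# t1 -> <br/>
```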

     

    Y: The problem is only about the original data references, but there are also inline tags without original data. It's a very rare case, right? Like the character one, the cp tag.

     

    R: But sometimes I've seen special-range Unicode characters there, something you cannot display. In that case, I simply convert the cp to the character and let the translator view the character.
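    A tiny sketch of that conversion: replacing an XLIFF cp element (which encodes a character that cannot appear literally in XML) with the actual character for display. The source content is invented and the snippet only handles this flat structure.

```python
# Tiny sketch: replace <cp hex="..."/> with the actual character for display.
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"
source = ET.fromstring(f'<source xmlns="{XLF}">before<cp hex="2028"/>after</source>')

text = source.text or ""
for cp in source.iter(f"{{{XLF}}}cp"):
    text += chr(int(cp.get("hex"), 16)) + (cp.tail or "")
print(repr(text))  # 'before\u2028after'
```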

     

    R: So, Mathijs, there is a template in the repository. Take a look at that.

     

    M: I will.

     

    R: And you should be able to write in there. You can create a pull request, so we can merge without stepping on each other's toes. Feel free to add anything there. And if you want to share your metadata, please do; if you have custom data that you want to share, that would be great.

     

    R: Lucía, years ago you were working on metadata and asked me to modify Swordfish to display that metadata. Is that project still relevant?

     

    L: That was my PhD research. I investigated which metadata was being used in XLIFF, identified those items, and then decided whether those items could carry provenance metadata, which was the topic of my PhD. The research question was whether this provenance metadata was helpful for translators. In plain words, we gave translators translation matches with information about the provenance, that is, the people who translated it, when it was translated, etc., and we wanted to investigate whether that had any impact on whether they reused it or not. That was before the LLM era; that was 2009-2012.

     

    R: Did it work?

     

    L: In the experiment with real translators we measured the time and the quality, and there was no difference between those that received a TM with or without provenance metadata. The only difference we saw: we had three groups, and those who translated from scratch got worse results than those with the TM matches (with or without metadata). So there was no difference between having metadata and no metadata for humans in terms of quantitative data, that is, time and quality. However, we interviewed them and they found it useful; they said it's something they always like to know, where that translation memory came from.

     

    Those were the results before LLMs. Now I'm trying to see if we are going to do something with LLMs in terms of metadata: how useful it could be to feed metadata to the LLMs and see if they can adapt the result based on it.

     

    But if you want the list of metadata items that I identified, I can share that. Or if Mathijs starts with that, we can always elaborate on it and redefine it. That was still for version 1.2; I haven't done any work on it since 2.0 or 2.1 came out.

     

    R: That would be interesting. Okay.

     

    M: That's science, right? That's also a good result.

     

    L: Yeah, but we saw that there was a big difference between translating from scratch and using translation memory matches, which was understandable and expected. We worked with Microsoft's official translation memories. I was working on the CNGL project, where Microsoft was one of the partners, so I got access to all these translation memories, and we used real data from their products. We used Rodolfo's tool, because he was able to, and kindly did, modify the tool so we had a small window where you could also see the metadata: not only the translation match but also the provenance information that was key in our investigation.

     

    R: Okay. So if you want to share anything to help me with the writing, or start with the writing, I would appreciate that. We're waiting for the charter clarification ballot and working on this new text. I suppose we'll meet again in two weeks.

     

    L: I'm not sure if I will be able to make the next meeting because I will still be on holiday, but I will prepare the agenda, that's for sure, and I will try my best to make that hour. If not, I will send my regrets beforehand. All the metadata work I will do after my holiday.

     

    R: Okay. Enjoy the holidays and hopefully speak to all of you in two weeks.

     

    *******

     

    Lucía Morado Vázquez

    Collaboratrice scientifique II // Senior Research Associate

    Bureau 6336 (Uni Mail) 

    Département de traitement informatique multilingue

    Faculté de traduction et d'interprétation

    Université de Genève

     

     


