OASIS Topology and Orchestration Specification for Cloud Applications (TOSCA) TC

  • 1.  Proposal: CSARs should be tarballs, not ZIPs

    Posted 04-15-2021 17:04
    In a conversation I had with someone who professes to "hate TOSCA", one of the issues that came up was how bad CSARs are. And one point made hit home.

    CSARs are currently defined as ZIP containers. Unfortunately, ZIP is not a streaming format; it requires random access to locations in the container. The entire container needs to be read in order to access an individual entry. Thus any processing of a CSAR has to take place on an accessible file system, which means that if the CSAR is at a URL then the whole package has to be downloaded first. If you're dealing with a CSAR with very big artifacts (virtual machine images), this quickly becomes a major burden on the parts of the system that only need to process specific parts of a CSAR. This is indeed a pain point with currently existing TOSCA solutions, e.g. ONAP.

    There's a reason why "tarballs" are so often preferred in packaging. A ".tar.gz" file is streamable for two reasons: gunzip is streaming decompression of a single file, and that single file is a "tape archive" (tar), which is a straightforward concatenation, likewise streaming. There is no random access. Thus a CSAR processor can choose to process just a specific entry and not have to download the entirety. It can throw away bytes that do not interest it.

    Note that if one can benefit from random access to a tarball's contents, it's easy enough to unpack it in its entirety, and indeed in a much more efficient way than a ZIP: the tarball can be unpacked while it is being streamed, directly to the filesystem. A ZIP would still have to be downloaded first to accomplish the same function, leading to more than double the storage requirement.

    So, it's very obvious to me that this needs to change in TOSCA 2.0 with a new CSAR specification. My specific recommendations:

    1. Let's first standardize on TAR. So a raw ".csar" extension would be exactly a "tape archive" (a tarball).
    2. Let's then standardize on GZIP for the supported compression algorithm. So a ".csar.gz" extension would imply a GZIPped CSAR. There are many other popular algorithms in use (bz2, xz), but in the interests of interoperability it's best to recommend one. The usefulness of adding the extra ".gz" is to clarify whether decompression is needed, and indeed many toolchains recognize that convention automatically.
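To illustrate the streaming argument above, here is a minimal sketch (not part of any CSAR specification) of pulling a single entry out of a gzipped tarball using Python's standard `tarfile` module in its forward-only stream mode (`"r|gz"`). The entry names, the sample archive, and the `ForwardOnly` wrapper are all hypothetical; the wrapper only exists to show that nothing beyond `read()` is required of the source, as would be the case for a network response body.

```python
import io
import tarfile


class ForwardOnly:
    """Hypothetical wrapper exposing only read(), like a network stream."""

    def __init__(self, data):
        self._raw = io.BytesIO(data)

    def read(self, size=-1):
        return self._raw.read(size)


def read_entry_from_stream(stream, wanted_name):
    """Return the bytes of one entry from a gzipped tar stream, or None."""
    # Mode "r|gz" makes tarfile treat the input as a non-seekable stream.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if member.isfile() and member.name == wanted_name:
                # In stream mode the data must be read while this member
                # is the current one.
                return archive.extractfile(member).read()
            # Any other entry is skipped; its bytes are read and discarded.
    return None


def build_sample_archive():
    """Build a small in-memory .tar.gz standing in for a tar-based CSAR."""
    buffer = io.BytesIO()
    entries = [
        ("metadata.txt", b"example metadata\n"),
        ("images/vm.img", b"\0" * 4096),  # stand-in for a large artifact
        ("service.yaml", b"example service template\n"),
    ]
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    sample = build_sample_archive()
    print(read_entry_from_stream(ForwardOnly(sample), "service.yaml"))
```

Note that the reader still has to pass over the bytes of earlier entries to reach the one it wants; the gain is that nothing has to be stored first, and reading can stop as soon as the entry is found.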


  • 2.  Re: Proposal: CSARs should be tarballs, not ZIPs

    Posted 04-16-2021 17:25
    Some additional thoughts:

    Remember that the CSAR version is separate from the TOSCA version. The current CSAR version is 1.1. So my proposal here would be for CSAR 2.0 (it's a significant enough change that I think it would warrant a major semantic version change). But backwards compatibility would mean that systems would still be able to support CSAR 1.1, which is a ZIP. To be 100% clear: you could write a TOSCA 2.0 service template and package it in CSAR 1.1. We would have to be clear in the TOSCA 2.0 spec that this is supported.

    Another thought regarding extensions -- if we move to tarballs, it might be a good idea to choose a different extension than ".csar" so that processors would easily know whether they're dealing with a new-style or old-style container. (This is a common problem with systems that upgrade their formats.) So, perhaps something like this:

    ".csar" extension: means CSAR 1.1 or CSAR 1.0, meaning it's a ZIP
    ".csar2" extension: means CSAR 2.0 (and beyond), meaning it's a TAR
    ".csar2.gz" extension: GZIPped TAR

    It's a bit awkward, but 100% deterministic.

    On Thu, Apr 15, 2021 at 12:03 PM Tal Liron <tliron@redhat.com> wrote:
    > In a conversation I had with someone who professes to "hate TOSCA", one of the issues that came up was how bad CSARs are. [...]
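The extension scheme proposed in this post could be checked along these lines. This is only a sketch of the idea: the function name and the returned labels are hypothetical, and matching extensions case-insensitively is an assumption that the proposal does not state.

```python
def csar_container_format(filename):
    """Map a file name to its container format under the proposed scheme.

    Returns "zip" for ".csar" (CSAR 1.0 / 1.1), "tar" for ".csar2"
    (CSAR 2.0 and beyond), and "tar+gzip" for ".csar2.gz".
    """
    lowered = filename.lower()  # assumption: extensions are case-insensitive
    # Check the longest suffix first so ".csar2.gz" is not misread.
    if lowered.endswith(".csar2.gz"):
        return "tar+gzip"
    if lowered.endswith(".csar2"):
        return "tar"
    if lowered.endswith(".csar"):
        return "zip"
    raise ValueError(f"not a recognized CSAR extension: {filename!r}")


if __name__ == "__main__":
    for name in ("app.csar", "app.csar2", "app.csar2.gz"):
        print(name, "->", csar_container_format(name))
```

Because each extension maps to exactly one container format, a processor can pick a ZIP or TAR reader before looking at any bytes of the file.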


  • 3.  Re: [tosca] Re: Proposal: CSARs should be tarballs, not ZIPs

    Posted 04-16-2021 18:25
    +1

    On Fri, Apr 16, 2021 at 10:24 AM Tal Liron <tliron@redhat.com> wrote:
    > Some additional thoughts: Remember that the CSAR version is separate from the TOSCA version. [...]