virtio-comment


Hardware friendly proposals from Intel for packed-ring-layout

  • 1.  Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 11:53
    Hi all,

    Based on the packed-ring-layout proposal posted here:

    https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html

    We have the following proposals to make it more hardware friendly.

    Driver Signaling Available Descriptors
    ======================================

    ## Current proposal

    * Each descriptor has 1 bit flag DESC_HW
    * Driver creates descriptors and then sets DESC_HW flag
    * Device reads descriptors and can use them if DESC_HW is set

    ## New proposal

    * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    - Driver creates a suitable number of descriptors (i.e. a multiple
    of a cache line), then performs an MMIO write to the tail pointer.
    * For each virtio queue, there is a head pointer that lives in the
    device and is not used by the driver
    - Device compares tail pointer with head pointer to determine exactly
    how many new descriptors have been added to a specific queue
    * The descriptors in [head, tail) are available to device
    * The DESC_HW flag will be kept for device signaling used descriptors
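    The head/tail arithmetic above can be sketched as follows. This is a
    minimal illustration; the names, the ring size, and the helper are
    assumptions for the sketch, not taken from the proposal.

```c
/* Hypothetical sketch of the proposed head/tail scheme.  The driver
 * advances 'tail' and writes it to an MMIO register; the device keeps
 * 'head' privately and treats descriptors in [head, tail) as available. */
#define RING_SIZE 256   /* assumed ring size; indices are in [0, RING_SIZE) */

/* Device side: how many new descriptors are in [head, tail). */
static inline unsigned num_available(unsigned head, unsigned tail)
{
    return (tail + RING_SIZE - head) % RING_SIZE;
}
```

    Note the modular arithmetic handles the tail wrapping past the end of
    the ring, which is why indices must stay in [0, RING_SIZE).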

    Device Signaling Used Descriptors
    =================================

    ## Current proposal

    * Device clears each descriptor's DESC_HW flag (1 bit) after it has
    finished with the descriptor

    ## New proposal

    * Device does not need to clear DESC_HW flag for every descriptor
    * Driver controls which descriptors need to have their DESC_HW cleared:
    - Descriptor has an extra 1 bit flag, DESC_WB (Write-Back):
    * w/ DESC_WB set => Device must write back this descriptor
    after use; at a minimum, clear the
    DESC_HW flag.
    * w/o DESC_WB set => Device doesn't need to write back the
    descriptor.

    This proposal saves PCIe bandwidth:

    In many scenarios, descriptor data doesn't need to be written back;
    e.g. for network devices, the packet metadata is prepended to the
    packet data.

    An alternative would be to add a field with a number of used descriptors.
    This would give the same benefit but would use more bits in the descriptor.
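    The device-side handling of DESC_WB could look like the sketch below.
    The flag bit positions and the descriptor struct are illustrative
    assumptions; the real layout would come from the spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; actual positions would be defined by the spec. */
#define DESC_HW (1u << 7)
#define DESC_WB (1u << 6)

/* Assumed descriptor layout for the sketch. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Device side: after consuming a descriptor, write it back only if the
 * driver asked for it via DESC_WB.  The minimal write-back just clears
 * DESC_HW; without DESC_WB the device skips the PCIe write entirely. */
static bool complete_desc(struct desc *d)
{
    if (!(d->flags & DESC_WB))
        return false;          /* no write-back needed: save the bus write */
    d->flags &= ~DESC_HW;      /* minimal write-back: clear DESC_HW */
    return true;
}
```

    The saving comes from the `return false` path: every descriptor the
    driver marks without DESC_WB is one fewer write the device must issue
    over PCIe.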

    Indirect Chaining
    =================

    ## Current proposal

    * Indirect chaining is an optional feature

    ## New proposal

    * Remove this feature from this new ring layout

    It's very unlikely that hardware implementations would support this,
    due to the extra latency of fetching the actual descriptors.

    This is a totally new ring layout, and we don't need to worry about
    compatibility issues with the old one. So it's better not to include
    this feature in the new ring layout unless we find it necessary now.

    Rx Fixed Buffer Sizes
    =====================

    ## Current proposal

    * Driver is free to choose whatever buffer sizes it wishes for Tx and
    Rx buffers
    * Theoretically within a ring, a driver could have different buffer sizes

    ## New proposal

    * Driver negotiates with device the size of a Rx buffer for a ring
    - Each descriptor in that ring will have a buffer of the same size
    - Different rings can have different sized buffers

    Data Alignment Boundaries
    =========================

    ## Current proposal

    * Driver is free to choose data buffer alignment to any byte boundary

    ## New proposal

    * Stipulate a fixed alignment for the data buffer

    ----------------------------------------------------------------

    We have done a basic prototype for the packed-ring-layout in DPDK
    based on the v2 packed-ring-layout proposal [1].

    The prototype has been sent to the DPDK mailing list as RFC [2][3].
    And I also collected those public patches into my github repo [4]
    to help others be able to try it easily.

    Besides the v2 packed-ring-layout proposal posted on the mailing list,
    this prototype also includes the proposal that introduces the DESC_WB
    flag, which makes it possible for the driver to tell the device to
    update just the specified descriptors. You can find more details in
    this patch [5]. And we don't see a performance regression in the
    software implementation:

    64-byte iofwd loopback:

                    5c'virtio-1c'vhost   1c'virtio-5c'vhost
    virtio1.0       7.655 Mpps           11.48 Mpps
    virtio1.1 A     8.757 Mpps           11.70 Mpps
    virtio1.1 B     8.910 Mpps           11.66 Mpps

    The columns:
    * 5c'virtio-1c'vhost - use 5 cores to run testpmd/virtio-user and
      1 core to run testpmd/vhost-pmd (shows vhost performance)
    * 1c'virtio-5c'vhost - use 1 core to run testpmd/virtio-user and
      5 cores to run testpmd/vhost-pmd (shows virtio performance)

    The rows:
    * virtio1.0   - The current (simplified) virtio/vhost implementation
      in DPDK
    * virtio1.1 A - The prototype based on the v2 packed-ring-layout
      proposal
    * virtio1.1 B - Introduce DESC_WB, and adopt it on the Tx path

    [1] https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html
    [2] http://dpdk.org/ml/archives/dev/2017-June/068315.html
    [3] http://dpdk.org/ml/archives/dev/2017-July/071562.html
    [4] https://github.com/btw616/dpdk-virtio1.1
    [5] http://dpdk.org/ml/archives/dev/2017-July/071568.html

    Best regards,
    Tiwei Bie



  • 2.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 12:11
    On 24/08/2017 13:53, Tiwei Bie wrote:
    >
    > * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > then performs MMIO write to tail pointer.

    If I understand correctly, the tail pointer is the value that is written
    to the MMIO register. If that is the case, this is unfortunately bad
    for virtualization. Virt prefers a doorbell register where the value
    doesn't matter. This is because:

    1) the value is not available directly and computing it requires
    instruction decoding, which in turn requires walking page tables

    2) if the value doesn't matter, the hypervisor can simply wake up a
    userspace thread that processes the virtio queue without bothering to
    pass the value.

    On the other hand, writing a tail pointer _before_ the MMIO write may
    cost a cache miss. Hence the packed ring layout proposal replaced the
    tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    The idea is that lookahead is cheaper, because hopefully the first
    non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    buffer.
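
    The lookahead just described can be sketched as below. This is a
    minimal illustration with assumed names and flag values, not the
    actual software implementation.

```c
#include <stdint.h>

#define DESC_HW   (1u << 7)   /* assumed flag bit for the sketch */
#define RING_SIZE 8           /* small ring, for illustration only */

/* Consumer side: scan forward from 'head' and count how many
 * consecutive descriptors have DESC_HW set, i.e. how many the driver
 * has made available.  The hope is that the first clear-DESC_HW entry
 * sits in the same cache line as the last set one. */
static unsigned lookahead(const uint16_t *flags, unsigned head)
{
    unsigned n = 0;
    while (n < RING_SIZE && (flags[(head + n) % RING_SIZE] & DESC_HW))
        n++;
    return n;
}
```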


    > Indirect Chaining
    > =================
    >
    > ## Current proposal
    >
    > * Indirect chaining is an optional feature
    >
    > ## New proposal
    >
    > * Remove this feature from this new ring layout
    >
    > It's very unlikely that hardware implementations would support this
    > due to extra latency of fetching actual descriptors.
    >
    > This is a totally new ring layout, and we don't need to worry about the
    > compatibility issues with the old one. So it's better to not include this
    > feature in this new ring layout if we can't find it's necessary now.

    Indirect chaining is actually relatively common for storage devices.

    Hardware implementations are free not to support indirect chaining if it
    hurts latency.

    >
    > Rx Fixed Buffer Sizes
    > =====================
    >
    > ## Current proposal
    >
    > * Driver is free to choose whatever buffer sizes it wishes for Tx and
    > Rx buffers
    > * Theoretically within a ring, a driver could have different buffer sizes
    >
    > ## New proposal
    >
    > * Driver negotiates with device the size of a Rx buffer for a ring
    > - Each descriptor in that ring will have same size buffer
    > - Different rings can have different sized buffers

    This makes sense, but it's independent from the packed ring layout.

    Paolo



  • 3.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 13:12
    On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    > On 24/08/2017 13:53, Tiwei Bie wrote:
    > >
    > > * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > > - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > > then performs MMIO write to tail pointer.
    >
    > If I understand correctly, the tail pointer is the value that is written
    > to the MMIO register. If that is the case, this is unfortunately bad
    > for virtualization. Virt prefers a doorbell register where the value
    > doesn't matter. This is because:
    >
    > 1) the value is not available directly and computing it requires
    > instruction decoding, which in turn requires walking page tables
    >
    > 2) if the value doesn't matter, the hypervisor can simply wake up a
    > userspace thread that processes the virtio queue without bothering to
    > pass the value.
    >
    > On the other hand, writing a tail pointer _before_ the MMIO write may
    > cost a cache miss. Hence the packed ring layout proposal replaced the
    > tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    > The idea is that lookahead is cheaper, because hopefully the first
    > non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    > buffer.
    >

    Thank you so much for such a quick and detailed reply!

    Yeah, we know it's a bit tricky to support the tail pointer in
    software. But it's really helpful for the hardware implementation,
    so we'd like more discussion on this.

    How about having this feature be switchable at runtime, so it can
    be enabled after migrating to a hardware backend, or disabled after
    migrating to a software backend? That way a software backend can
    still use the DESC_HW based mechanism.

    These are just some rough thoughts, and we haven't thought about the
    implementation details. What are your thoughts on this?

    >
    > > Indirect Chaining
    > > =================
    > >
    > > ## Current proposal
    > >
    > > * Indirect chaining is an optional feature
    > >
    > > ## New proposal
    > >
    > > * Remove this feature from this new ring layout
    > >
    > > It's very unlikely that hardware implementations would support this
    > > due to extra latency of fetching actual descriptors.
    > >
    > > This is a totally new ring layout, and we don't need to worry about the
    > > compatibility issues with the old one. So it's better to not include this
    > > feature in this new ring layout if we can't find it's necessary now.
    >
    > Indirect chaining is actually relatively common for storage devices.
    >
    > Hardware implementations are free not to support indirect chaining if it
    > hurts latency.
    >

    We are proposing removing it if it's not really necessary.
    So if it's really necessary, let's just keep it. :)

    One problem that keeping this feature may introduce is that, if a
    software backend chooses to implement it, and a VM is running on
    that backend with the feature enabled, it could be a problem to
    live-migrate the VM to a hardware backend which doesn't support the
    feature. Yeah, it can be a general problem with live migration if
    some features are negotiable and some types of backend don't plan
    to support them at all. Do you have any thoughts on this?

    > >
    > > Rx Fixed Buffer Sizes
    > > =====================
    > >
    > > ## Current proposal
    > >
    > > * Driver is free to choose whatever buffer sizes it wishes for Tx and
    > > Rx buffers
    > > * Theoretically within a ring, a driver could have different buffer sizes
    > >
    > > ## New proposal
    > >
    > > * Driver negotiates with device the size of a Rx buffer for a ring
    > > - Each descriptor in that ring will have same size buffer
    > > - Different rings can have different sized buffers
    >
    > This makes sense, but it's independent from the packed ring layout.
    >

    MST mentioned the related things in the packed-ring-layout proposal:

    > From: https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html
    > * Descriptor length in device descriptors
    > ...
    > Some devices use identically-sized buffers in all descriptors.
    > Ignoring length for driver descriptors there could be an option too.

    So we are proposing making the statement clearer.

    Best regards,
    Tiwei Bie



  • 4.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 13:32
    On 24/08/2017 15:11, Tiwei Bie wrote:
    > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    >> On 24/08/2017 13:53, Tiwei Bie wrote:
    >>>
    >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    >>> then performs MMIO write to tail pointer.
    >>
    >> If I understand correctly, the tail pointer is the value that is written
    >> to the MMIO register. If that is the case, this is unfortunately bad
    >> for virtualization. Virt prefers a doorbell register where the value
    >> doesn't matter. This is because:
    >>
    >> 1) the value is not available directly and computing it requires
    >> instruction decoding, which in turn requires walking page tables
    >>
    >> 2) if the value doesn't matter, the hypervisor can simply wake up a
    >> userspace thread that processes the virtio queue without bothering to
    >> pass the value.
    >>
    >> On the other hand, writing a tail pointer _before_ the MMIO write may
    >> cost a cache miss. Hence the packed ring layout proposal replaced the
    >> tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    >> The idea is that lookahead is cheaper, because hopefully the first
    >> non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    >> buffer.
    >
    > Thank you so much for such quick and detailed reply!
    >
    > Yeah, we know it's a bit tricky to support the tail pointer in
    > software. But it's really helpful for the hardware implementation.
    > So we want more discussions on this.
    >
    > How about having this feature be switchable at runtime, so it's
    > possible to be enabled after migrating to a hardware backend, or
    > disabled after migrating to a software backend. So for the software
    > backend, it can still use the DESC_HW based mechanism.
    >
    > It's just some rough thoughts, and we haven't thought about the
    > implementation details. What's your thoughts on this?

    Why is lookahead bad for hardware? Can a PCIe device use burst reads to
    retrieve many 2-byte descriptors in a single TLP transaction?

    >>> Indirect Chaining
    >>> =================
    >>>
    >>> ## Current proposal
    >>>
    >>> * Indirect chaining is an optional feature
    >>>
    >>> ## New proposal
    >>>
    >>> * Remove this feature from this new ring layout
    >>>
    >>> It's very unlikely that hardware implementations would support this
    >>> due to extra latency of fetching actual descriptors.
    >>>
    >>> This is a totally new ring layout, and we don't need to worry about the
    >>> compatibility issues with the old one. So it's better to not include this
    >>> feature in this new ring layout if we can't find it's necessary now.
    >>
    >> Indirect chaining is actually relatively common for storage devices.
    >>
    >> Hardware implementations are free not to support indirect chaining if it
    >> hurts latency.
    >
    > We are proposing removing it if it's not really necessary.
    > So if it's really necessary, let's just keep it. :)
    >
    > One problem that keeping this feature may introduce is that, if a
    > software backend chooses to implement this feature, and a VM is
    > running on this backend with this feature enabled, it could be a
    > problem to be live-migrated to a hardware backend which doesn't
    > support this feature. Yeah, it can be a general problem about live
    > migration if we have some features be negotiable, and some types
    > of backend don't plan to support them at all. Do you have any
    > thoughts on this?

    If you are preparing for live migration to a hardware backend, you can
    disable indirect rings when starting the VM, even though it will start
    on a software backend.

    Paolo




  • 5.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 14:12
    On Thu, Aug 24, 2017 at 03:32:15PM +0200, Paolo Bonzini wrote:
    > On 24/08/2017 15:11, Tiwei Bie wrote:
    > > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    > >> On 24/08/2017 13:53, Tiwei Bie wrote:
    > >>>
    > >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > >>> then performs MMIO write to tail pointer.
    > >>
    > >> If I understand correctly, the tail pointer is the value that is written
    > >> to the MMIO register. If that is the case, this is unfortunately bad
    > >> for virtualization. Virt prefers a doorbell register where the value
    > >> doesn't matter. This is because:
    > >>
    > >> 1) the value is not available directly and computing it requires
    > >> instruction decoding, which in turn requires walking page tables
    > >>
    > >> 2) if the value doesn't matter, the hypervisor can simply wake up a
    > >> userspace thread that processes the virtio queue without bothering to
    > >> pass the value.
    > >>
    > >> On the other hand, writing a tail pointer _before_ the MMIO write may
    > >> cost a cache miss. Hence the packed ring layout proposal replaced the
    > >> tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    > >> The idea is that lookahead is cheaper, because hopefully the first
    > >> non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    > >> buffer.
    > >
    > > Thank you so much for such quick and detailed reply!
    > >
    > > Yeah, we know it's a bit tricky to support the tail pointer in
    > > software. But it's really helpful for the hardware implementation.
    > > So we want more discussions on this.
    > >
    > > How about having this feature be switchable at runtime, so it's
    > > possible to be enabled after migrating to a hardware backend, or
    > > disabled after migrating to a software backend. So for the software
    > > backend, it can still use the DESC_HW based mechanism.
    > >
    > > It's just some rough thoughts, and we haven't thought about the
    > > implementation details. What's your thoughts on this?
    >
    > Why is lookahead bad for hardware? Can a PCIe device use burst reads to
    > retrieve many 2-byte descriptor in a single TLP transaction?
    >

    I'm not a hardware engineer, so what I said may not be accurate.
    Kully (Cc'ed in this thread) can provide more details if necessary.

    From my understanding, in the current design hardware will need to
    keep issuing TLP transactions at a certain pace to check whether a
    descriptor is available or not. And it's possible that there is no
    available descriptor, in which case a lot of TLP transactions are
    wasted. That is to say, the doorbell (MMIO write) is helpful for
    the hardware.

    So maybe we can combine both of them: similar to the notification
    support in virtio 1.0, we can make the tail pointer optional, and
    the backend is free to enable (and use) it or disable it. A
    software backend doesn't need to support it and can just keep it
    disabled. Any thoughts?

    Best regards,
    Tiwei Bie



  • 6.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-25-2017 15:32
    On Thu, Aug 24, 2017 at 10:11:33PM +0800, Tiwei Bie wrote:
    > On Thu, Aug 24, 2017 at 03:32:15PM +0200, Paolo Bonzini wrote:
    > > On 24/08/2017 15:11, Tiwei Bie wrote:
    > > > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    > > >> On 24/08/2017 13:53, Tiwei Bie wrote:
    > > >>>
    > > >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > > >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > > >>> then performs MMIO write to tail pointer.
    > > >>
    > > >> If I understand correctly, the tail pointer is the value that is written
    > > >> to the MMIO register. If that is the case, this is unfortunately bad
    > > >> for virtualization. Virt prefers a doorbell register where the value
    > > >> doesn't matter. This is because:
    > > >>
    > > >> 1) the value is not available directly and computing it requires
    > > >> instruction decoding, which in turn requires walking page tables
    > > >>
    > > >> 2) if the value doesn't matter, the hypervisor can simply wake up a
    > > >> userspace thread that processes the virtio queue without bothering to
    > > >> pass the value.
    > > >>
    > > >> On the other hand, writing a tail pointer _before_ the MMIO write may
    > > >> cost a cache miss. Hence the packed ring layout proposal replaced the
    > > >> tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    > > >> The idea is that lookahead is cheaper, because hopefully the first
    > > >> non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    > > >> buffer.
    > > >
    > > > Thank you so much for such quick and detailed reply!
    > > >
    > > > Yeah, we know it's a bit tricky to support the tail pointer in
    > > > software. But it's really helpful for the hardware implementation.
    > > > So we want more discussions on this.
    > > >
    > > > How about having this feature be switchable at runtime, so it's
    > > > possible to be enabled after migrating to a hardware backend, or
    > > > disabled after migrating to a software backend. So for the software
    > > > backend, it can still use the DESC_HW based mechanism.
    > > >
    > > > It's just some rough thoughts, and we haven't thought about the
    > > > implementation details. What's your thoughts on this?
    > >
    > > Why is lookahead bad for hardware? Can a PCIe device use burst reads to
    > > retrieve many 2-byte descriptor in a single TLP transaction?
    > >
    >
    > I'm not a hardware engineer, so what I said may be not accurate.
    > Kully (Cc'ed in this thread) can provide more details if necessary.
    >
    > From my understanding, in current design hardware will need to keep
    > issuing TLP transactions at a certain pace to check whether the
    > descriptor is available or not. And it's possible that there is no
    > available descriptor. In this case, there will be a lot of TLP
    > transactions wasted. That is to say, the doorbell (MMIO write) is
    > helpful for the hardware.
    >
    > So maybe we can combine both of them, similar to the notification
    > support in virtio1.0, we can have the tail pointer be optional,
    > the backend is free to choose to enable (and use) it or disable it.
    > And the software backend doesn't need to support it and just need
    > to keep it disabled. Any thoughts?

    Isn't the VIRTIO 1.0 "4.1.2.3 Notification structure layout" still going
    to be available in the new ring layout?

    That means the device already has a doorbell and does not need to keep
    issuing bus transactions to poll the DESC_HW bit.

    Michael: It would help to see a full draft VIRTIO 1.1 spec with the
    proposed packed ring layout. That way everyone can get on the same
    page.

    Stefan



  • 7.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-25-2017 17:00
    On Fri, Aug 25, 2017 at 04:32:26PM +0100, Stefan Hajnoczi wrote:
    > Michael: It would help to see a full draft VIRTIO 1.1 spec with the
    > proposed packed ring layout. That way everyone can get on the same
    > page.

    Working on that.

    --
    MST



  • 8.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-31-2017 10:22
    On Fri, Aug 25, 2017 at 04:32:26PM +0100, Stefan Hajnoczi wrote:
    > On Thu, Aug 24, 2017 at 10:11:33PM +0800, Tiwei Bie wrote:
    > > On Thu, Aug 24, 2017 at 03:32:15PM +0200, Paolo Bonzini wrote:
    > > > On 24/08/2017 15:11, Tiwei Bie wrote:
    > > > > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    > > > >> On 24/08/2017 13:53, Tiwei Bie wrote:
    > > > >>>
    > > > >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > > > >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > > > >>> then performs MMIO write to tail pointer.
    > > > >>
    > > > >> If I understand correctly, the tail pointer is the value that is written
    > > > >> to the MMIO register. If that is the case, this is unfortunately bad
    > > > >> for virtualization. Virt prefers a doorbell register where the value
    > > > >> doesn't matter. This is because:
    > > > >>
    > > > >> 1) the value is not available directly and computing it requires
    > > > >> instruction decoding, which in turn requires walking page tables
    > > > >>
    > > > >> 2) if the value doesn't matter, the hypervisor can simply wake up a
    > > > >> userspace thread that processes the virtio queue without bothering to
    > > > >> pass the value.
    > > > >>
    > > > >> On the other hand, writing a tail pointer _before_ the MMIO write may
    > > > >> cost a cache miss. Hence the packed ring layout proposal replaced the
    > > > >> tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    > > > >> The idea is that lookahead is cheaper, because hopefully the first
    > > > >> non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    > > > >> buffer.
    > > > >
    > > > > Thank you so much for such quick and detailed reply!
    > > > >
    > > > > Yeah, we know it's a bit tricky to support the tail pointer in
    > > > > software. But it's really helpful for the hardware implementation.
    > > > > So we want more discussions on this.
    > > > >
    > > > > How about having this feature be switchable at runtime, so it's
    > > > > possible to be enabled after migrating to a hardware backend, or
    > > > > disabled after migrating to a software backend. So for the software
    > > > > backend, it can still use the DESC_HW based mechanism.
    > > > >
    > > > > It's just some rough thoughts, and we haven't thought about the
    > > > > implementation details. What's your thoughts on this?
    > > >
    > > > Why is lookahead bad for hardware? Can a PCIe device use burst reads to
    > > > retrieve many 2-byte descriptor in a single TLP transaction?
    > > >
    > >
    > > I'm not a hardware engineer, so what I said may be not accurate.
    > > Kully (Cc'ed in this thread) can provide more details if necessary.
    > >
    > > From my understanding, in current design hardware will need to keep
    > > issuing TLP transactions at a certain pace to check whether the
    > > descriptor is available or not. And it's possible that there is no
    > > available descriptor. In this case, there will be a lot of TLP
    > > transactions wasted. That is to say, the doorbell (MMIO write) is
    > > helpful for the hardware.
    > >
    > > So maybe we can combine both of them, similar to the notification
    > > support in virtio1.0, we can have the tail pointer be optional,
    > > the backend is free to choose to enable (and use) it or disable it.
    > > And the software backend doesn't need to support it and just need
    > > to keep it disabled. Any thoughts?
    >
    > Isn't the VIRTIO 1.0 "4.1.2.3 Notification structure layout" still going
    > to be available in the new ring layout?
    >
    > That means the device already has a doorbell and does not need to keep
    > issuing bus transactions to poll the DESC_HW bit.

    I think I misunderstood the discussion:

    The VIRTIO 1.0 "4.1.2.3 Notification structure layout" doorbell does not
    indicate how many descriptors are available for the device.

    This means the device needs to read the DESC_HW bit for each descriptor
    until it reaches the first descriptor without DESC_HW set.

    If the doorbell contained the number of descriptors then the device
    could fetch exactly N descriptors instead of N + 1. Maybe it could also
    use fewer bus transactions (I'm not familiar with low-level PCIe).

    So there does seem to be a performance advantage if the VIRTIO 1.0
    "4.1.2.3 Notification structure layout" is modified to include the
    number of descriptors available. Seems like a good idea.
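
    A back-of-the-envelope way to state the N versus N + 1 point,
    assuming one descriptor read per bus transaction (real devices batch
    reads into larger TLPs, so treat these as upper bounds, not
    PCIe-accurate costs):

```c
/* If the doorbell carries the count, the device fetches exactly N
 * descriptors. */
static unsigned reads_if_count_known(unsigned n_avail)
{
    return n_avail;
}

/* With DESC_HW polling, the device must read one extra descriptor to
 * find the first one with DESC_HW clear. */
static unsigned reads_if_polling_flags(unsigned n_avail)
{
    return n_avail + 1;
}
```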

    Stefan



  • 9.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 09-01-2017 05:35
    On Thu, Aug 31, 2017 at 11:21:50AM +0100, Stefan Hajnoczi wrote:
    > On Fri, Aug 25, 2017 at 04:32:26PM +0100, Stefan Hajnoczi wrote:
    > > On Thu, Aug 24, 2017 at 10:11:33PM +0800, Tiwei Bie wrote:
    > > > On Thu, Aug 24, 2017 at 03:32:15PM +0200, Paolo Bonzini wrote:
    > > > > On 24/08/2017 15:11, Tiwei Bie wrote:
    > > > > > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    > > > > >> On 24/08/2017 13:53, Tiwei Bie wrote:
    > > > > >>>
    > > > > >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > > > > >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > > > > >>> then performs MMIO write to tail pointer.
    > > > > >>
    > > > > >> If I understand correctly, the tail pointer is the value that is written
    > > > > >> to the MMIO register. If that is the case, this is unfortunately bad
    > > > > >> for virtualization. Virt prefers a doorbell register where the value
    > > > > >> doesn't matter. This is because:
    > > > > >>
    > > > > >> 1) the value is not available directly and computing it requires
    > > > > >> instruction decoding, which in turn requires walking page tables
    > > > > >>
    > > > > >> 2) if the value doesn't matter, the hypervisor can simply wake up a
    > > > > >> userspace thread that processes the virtio queue without bothering to
    > > > > >> pass the value.
    > > > > >>
    > > > > >> On the other hand, writing a tail pointer _before_ the MMIO write may
    > > > > >> cost a cache miss. Hence the packed ring layout proposal replaced the
    > > > > >> tail pointer write with lookahead on the ring buffer's DESC_HW flags.
    > > > > >> The idea is that lookahead is cheaper, because hopefully the first
    > > > > >> non-DESC_HW buffer will be in the same cache line as the last DESC_HW
    > > > > >> buffer.
    > > > > >
    > > > > > Thank you so much for such quick and detailed reply!
    > > > > >
    > > > > > Yeah, we know it's a bit tricky to support the tail pointer in
    > > > > > software. But it's really helpful for the hardware implementation.
    > > > > > So we want more discussions on this.
    > > > > >
    > > > > > How about having this feature be switchable at runtime, so it's
    > > > > > possible to be enabled after migrating to a hardware backend, or
    > > > > > disabled after migrating to a software backend. So for the software
    > > > > > backend, it can still use the DESC_HW based mechanism.
    > > > > >
    > > > > > It's just some rough thoughts, and we haven't thought about the
    > > > > > implementation details. What's your thoughts on this?
    > > > >
    > > > > Why is lookahead bad for hardware? Can a PCIe device use burst reads to
    > > > > retrieve many 2-byte descriptor in a single TLP transaction?
    > > > >
    > > >
    > > > I'm not a hardware engineer, so what I say may not be accurate.
    > > > Kully (Cc'ed on this thread) can provide more details if necessary.
    > > >
    > > > From my understanding, in the current design the hardware needs to
    > > > keep issuing TLP transactions at a certain pace to check whether a
    > > > descriptor is available or not. And it's possible that there is no
    > > > available descriptor, in which case a lot of TLP transactions are
    > > > wasted. That is to say, the doorbell (MMIO write) is helpful for
    > > > the hardware.
    > > >
    > > > So maybe we can combine both of them, similar to the notification
    > > > support in virtio1.0, we can have the tail pointer be optional,
    > > > the backend is free to choose to enable (and use) it or disable it.
    > > > And the software backend doesn't need to support it and just need
    > > > to keep it disabled. Any thoughts?
    > >
    > > Isn't the VIRTIO 1.0 "4.1.2.3 Notification structure layout" still going
    > > to be available in the new ring layout?
    > >
    > > That means the device already has a doorbell and does not need to keep
    > > issuing bus transactions to poll the DESC_HW bit.
    >
    > I think I misunderstood the discussion:
    >
    > The VIRTIO 1.0 "4.1.2.3 Notification structure layout" doorbell does not
    > indicate how many descriptors are available for the device.
    >
    > This means the device needs to read the DESC_HW bit for each descriptor
    > until it reaches the first descriptor without DESC_HW set.
    >
    > If the doorbell contained the number of descriptors then the device
    > could fetch exactly N descriptors instead of N + 1. Maybe it could also
    > use fewer bus transactions (I'm not familiar with low-level PCIe).
    >
    > So there does seem to be a performance advantage if the VIRTIO 1.0
    > "4.1.2.3 Notification structure layout" is modified to include the
    > number of descriptors available. Seems like a good idea.
    >

    Yeah, you are right! :-) We are proposing to make the doorbell
    indicate how many descriptors are available to the device. This
    idea was raised by our hardware engineer. I'm also not really
    familiar with low-level PCIe. Besides, sorry for not making a
    very clear/thorough proposal on this. And many thanks to Michael
    and Paolo for their great feedback and suggestions!
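    To make the idea concrete, here is a minimal sketch in plain C of a doorbell that carries the tail index. This simulates the MMIO write with an ordinary store; all names (`vq_sim`, `desc_sim`) and the DESC_HW bit position are illustrative assumptions, not from the virtio spec or any driver.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define RING_SIZE 256
    #define DESC_HW   (1u << 7)   /* illustrative bit position */

    struct desc_sim {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
    };

    struct vq_sim {
        struct desc_sim ring[RING_SIZE];
        uint16_t tail;        /* driver-owned; value written to the doorbell */
        uint16_t device_head; /* device-local, never read by the driver     */
    };

    /* Driver: fill a batch of descriptors, then "ring" the doorbell once. */
    static void driver_publish(struct vq_sim *vq, int n)
    {
        for (int i = 0; i < n; i++) {
            struct desc_sim *d = &vq->ring[(vq->tail + i) % RING_SIZE];
            d->addr  = 0x1000 + i;  /* dummy buffer address */
            d->len   = 64;
            d->flags = DESC_HW;
        }
        vq->tail = (uint16_t)((vq->tail + n) % RING_SIZE);
        /* Real hardware: a single MMIO write of vq->tail happens here. */
    }

    /* Device: one doorbell tells it exactly how many descriptors to fetch,
     * with no speculative polling of DESC_HW bits over PCIe. */
    static int device_fetch(struct vq_sim *vq)
    {
        int avail = (vq->tail - vq->device_head + RING_SIZE) % RING_SIZE;
        vq->device_head = vq->tail;
        return avail;
    }
    ```

    The descriptors in [device_head, tail) are exactly the ones available to the device, so it can issue one burst read of the right length.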

    Best regards,
    Tiwei Bie



  • 10.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 14:23
    On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    > On 24/08/2017 13:53, Tiwei Bie wrote:
    > >
    > > * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    > > - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    > > then performs MMIO write to tail pointer.
    >
    > If I understand correctly, the tail pointer is the value that is written
    > to the MMIO register. If that is the case, this is unfortunately bad
    > for virtualization. Virt prefers a doorbell register where the value
    > doesn't matter.

    I think it's fine actually, this tail pointer merely duplicates data
    that is already available in form of DESC_HW flag. Virt can just ignore
    the value.
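    As a rough sketch of Michael's point: a software backend can treat the tail write as a plain kick, never look at the written value, and recover the same information by scanning DESC_HW flags. Names and bit positions below are illustrative, not from the spec.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define RING_SIZE 256
    #define DESC_HW   (1u << 7)   /* illustrative bit position */

    struct sw_vq {
        uint16_t flags[RING_SIZE]; /* per-descriptor flags words */
        uint16_t head;             /* backend's private consume index */
    };

    /* Called on any doorbell write; the written value is never inspected.
     * A real implementation would clear the flag on use (or flip its
     * polarity per ring wrap) to avoid rescanning stale entries; that is
     * omitted here for brevity. */
    static int sw_backend_kick(struct sw_vq *vq)
    {
        int n = 0;
        while (n < RING_SIZE &&
               (vq->flags[(vq->head + n) % RING_SIZE] & DESC_HW))
            n++;
        vq->head = (uint16_t)((vq->head + n) % RING_SIZE);
        return n; /* descriptors made available since the last kick */
    }
    ```

    Because the lookahead recovers the count, a hypervisor can forward a doorbell exit to a worker thread without decoding the guest instruction to extract the value.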




  • 11.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 14:26
    On 24/08/2017 16:23, Michael S. Tsirkin wrote:
    > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    >> On 24/08/2017 13:53, Tiwei Bie wrote:
    >>>
    >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    >>> then performs MMIO write to tail pointer.
    >>
    >> If I understand correctly, the tail pointer is the value that is written
    >> to the MMIO register. If that is the case, this is unfortunately bad
    >> for virtualization. Virt prefers a doorbell register where the value
    >> doesn't matter.
    >
    > I think it's fine actually, this tail pointer merely duplicates data
    > that is already available in form of DESC_HW flag. Virt can just ignore
    > the value.

    If the spec says the hardware can read speculatively beyond the tail
    pointer, that's fine indeed (that was not in Tiwei's proposal).

    Paolo



  • 12.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-25-2017 14:24
    If the tail pointer is available to the device, can an implementation skip checking the DESC_HW flag on each individual descriptor to save some cycles? An implementation is also supposed to check VRING_DESC_F_WRITE prior to the write. However, I don't see either DPDK or VPP checking it. I don't know if they skip it on purpose to save some cycles or it's an oversight.

    Steven

    On 8/24/17, 7:25 AM, "virtio-comment@lists.oasis-open.org on behalf of Paolo Bonzini" <virtio-comment@lists.oasis-open.org on behalf of pbonzini@redhat.com> wrote:

    On 24/08/2017 16:23, Michael S. Tsirkin wrote:
    > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
    >> On 24/08/2017 13:53, Tiwei Bie wrote:
    >>>
    >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    >>> - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    >>> then performs MMIO write to tail pointer.
    >>
    >> If I understand correctly, the tail pointer is the value that is written
    >> to the MMIO register. If that is the case, this is unfortunately bad
    >> for virtualization. Virt prefers a doorbell register where the value
    >> doesn't matter.
    >
    > I think it's fine actually, this tail pointer merely duplicates data
    > that is already available in form of DESC_HW flag. Virt can just ignore
    > the value.

    If the spec says the hardware can read speculatively beyond the tail
    pointer, that's fine indeed (that was not in Tiwei's proposal).

    Paolo

    This publicly archived list offers a means to provide input to the

    OASIS Virtual I/O Device (VIRTIO) TC.



    In order to verify user consent to the Feedback License terms and

    to minimize spam in the list archive, subscription is required

    before posting.



    Subscribe: virtio-comment-subscribe@lists.oasis-open.org

    Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org

    List help: virtio-comment-help@lists.oasis-open.org

    List archive: https://lists.oasis-open.org/archives/virtio-comment/

    Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

    List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

    Committee: https://www.oasis-open.org/committees/virtio/

    Join OASIS: https://www.oasis-open.org/join/






  • 13.  Re: Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 14:05
    On Thu, Aug 24, 2017 at 07:53:15PM +0800, Tiwei Bie wrote:
    > Indirect Chaining
    > =================
    >
    > ## Current proposal
    >
    > * Indirect chaining is an optional feature
    >
    > ## New proposal
    >
    > * Remove this feature from this new ring layout
    >
    > It's very unlikely that hardware implementations would support this
    > due to extra latency of fetching actual descriptors.
    >
    > This is a totally new ring layout, and we don't need to worry about the
    > compatibility issues with the old one. So it's better to not include this
    > feature in this new ring layout if we can't find it's necessary now.

    Some form of indirect is really useful for software - as we have
    unbounded latencies we really benefit from very deep queues.

    As it's an optional feature, just don't implement it in hardware.

    --
    MST



  • 14.  Re: Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-24-2017 14:24
    On Thu, Aug 24, 2017 at 05:05:05PM +0300, Michael S. Tsirkin wrote:
    > On Thu, Aug 24, 2017 at 07:53:15PM +0800, Tiwei Bie wrote:
    > > Indirect Chaining
    > > =================
    > >
    > > ## Current proposal
    > >
    > > * Indirect chaining is an optional feature
    > >
    > > ## New proposal
    > >
    > > * Remove this feature from this new ring layout
    > >
    > > It's very unlikely that hardware implementations would support this
    > > due to extra latency of fetching actual descriptors.
    > >
    > > This is a totally new ring layout, and we don't need to worry about the
    > > compatibility issues with the old one. So it's better to not include this
    > > feature in this new ring layout if we can't find it's necessary now.
    >
    > Some form of indirect is really useful for software - as we have
    > unbounded latencies we really benefit from very deep queues.
    >

    Cool. Let's just keep it. :-)

    Best regards,
    Tiwei Bie



  • 15.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-31-2017 18:51
    I have a naïve question. Why do we need to invent two flags per descriptor, DESC_HW and DESC_WB? Why can't we keep things simple, like a circular queue? There is a tail pointer and a head pointer for the ring. The producer manages the tail pointer while the consumer manages the head pointer. From the two pointers, the consumer knows exactly how many descriptors are available to read, and the producer knows exactly how many more slots it has so that it can continue to write. There is no need to test each descriptor before the write, and no need to test each descriptor prior to the read.

    You already advocate the tail pointer, but fall short of introducing the head pointer. What did I miss?

    Steven

    On 8/24/17, 4:53 AM, "virtio-comment@lists.oasis-open.org on behalf of Tiwei Bie" <virtio-comment@lists.oasis-open.org on behalf of tiwei.bie@intel.com> wrote:

    Hi all,

    Based on the packed-ring-layout proposal posted here:

    https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html

    We have below proposals to make it more hardware friendly.

    Driver Signaling Available Descriptors
    ======================================

    ## Current proposal

    * Each descriptor has 1 bit flag DESC_HW
    * Driver creates descriptors and then sets DESC_HW flag
    * Device reads descriptors and can use it if DESC_HW is set

    ## New proposal

    * In addition to the DESC_HW flag, each virtio queue has a tail pointer
    - Driver creates suitable (i.e. multiple of cacheline) descriptors,
    then performs MMIO write to tail pointer.
    * For each virtio queue, there is a head pointer that lives in the
    device and is not used by the driver
    - Device compares tail pointer with head pointer to determine exactly
    how many new descriptors have been added to a specific queue
    * The descriptors in [head, tail) are available to device
    * The DESC_HW flag will be kept for device signaling used descriptors

    Device Signaling Used Descriptors
    =================================

    ## Current proposal

    * Device clears each descriptor's DESC_HW flag (1 bit) after it has
    finished with the descriptor

    ## New proposal

    * Device does not need to clear DESC_HW flag for every descriptor
    * Driver controls which descriptors need to have their DESC_HW cleared:
    - Descriptor has an extra 1 bit flag, DESC_WB (Write-Back):
    * w/ DESC_WB set => Device must write-back this descriptor
    after use. At the minimum, clear the
    DESC_HW flag.
    * w/o DESC_WB set => Device doesn't need to write-back the
    descriptor.

    This proposal saves PCIe bandwidth:

    In many scenarios, descriptor data doesn't need to be written back,
    e.g. for network devices, the packet metadata is prepended to the
    packet data.

    An alternative would be to add a field with a number of used descriptors.
    This would give the same benefit but would use more bits in the descriptor.

    Indirect Chaining
    =================

    ## Current proposal

    * Indirect chaining is an optional feature

    ## New proposal

    * Remove this feature from this new ring layout

    It's very unlikely that hardware implementations would support this
    due to extra latency of fetching actual descriptors.

    This is a totally new ring layout, and we don't need to worry about
    compatibility issues with the old one. So it's better not to include
    this feature in the new ring layout if we don't find it necessary now.

    Rx Fixed Buffer Sizes
    =====================

    ## Current proposal

    * Driver is free to choose whatever buffer sizes it wishes for Tx and
    Rx buffers
    * Theoretically within a ring, a driver could have different buffer sizes

    ## New proposal

    * Driver negotiates with device the size of a Rx buffer for a ring
    - Each descriptor in that ring will have same size buffer
    - Different rings can have different sized buffers

    Data Alignment Boundaries
    =========================

    ## Current proposal

    * Driver is free to choose data buffer alignment to any byte boundary

    ## New proposal

    * Stipulate a fixed alignment for the data buffer

    ----------------------------------------------------------------

    We have done a basic prototype for the packed-ring-layout in DPDK
    based on the v2 packed-ring-layout proposal [1].

    The prototype has been sent to the DPDK mailing list as RFC [2][3].
    And I also collected those public patches into my github repo [4]
    to help others be able to try it easily.

    Besides the v2 packed-ring-layout proposal posted on the mailing list,
    this prototype also includes the proposal that introduces the DESC_WB
    flag, which makes it possible for the driver to tell the device to
    update only the specified descriptors. You can find more details in
    this patch [5]. And we don't see any performance regression in the
    software implementation:

    64-byte iofwd loopback:
                  5c'virtio-1c'vhost   1c'virtio-5c'vhost
    virtio1.0     7.655 Mpps           11.48 Mpps
    virtio1.1 A   8.757 Mpps           11.70 Mpps
    virtio1.1 B   8.910 Mpps           11.66 Mpps
    The columns:
    5c'virtio-1c'vhost - use 5 cores to run testpmd/virtio-user and
                         use 1 core to run testpmd/vhost-pmd (shows
                         vhost performance)
    1c'virtio-5c'vhost - use 1 core to run testpmd/virtio-user and
                         use 5 cores to run testpmd/vhost-pmd (shows
                         virtio performance)
    The rows:
    virtio1.0   - The current (simplified) virtio/vhost implementation in DPDK
    virtio1.1 A - The prototype based on the v2 packed-ring-layout proposal
    virtio1.1 B - Introduce DESC_WB, and adopt it on the Tx path

    [1] https://lists.oasis-open.org/archives/virtio-dev/201702/msg00010.html
    [2] http://dpdk.org/ml/archives/dev/2017-June/068315.html
    [3] http://dpdk.org/ml/archives/dev/2017-July/071562.html
    [4] https://github.com/btw616/dpdk-virtio1.1
    [5] http://dpdk.org/ml/archives/dev/2017-July/071568.html

    Best regards,
    Tiwei Bie



  • 16.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 08-31-2017 19:50
    On Thu, Aug 31, 2017 at 06:51:09PM +0000, Steven Luong (sluong) wrote:
    > I have a naïve question. Why do we need to invent two flags per descriptor, DESC_HW and DESC_WB? Why can’t we keep things simple like a circular queue. There is a tail pointer and a head pointer for the ring. The producer manages the tail pointer while the consumer manages the head pointer. From the two pointers, the consumer knows exactly how many descriptors are available to read and the producer also knows exactly how many more slots that it has so that it can continue to write. There is no need to test each descriptor before the write and no need to test each descriptor prior to the read.

    Looks like you are reinventing virtio 1.0. Head/tail pointers cause a
    lot of cache line bounces.

    > You already advocate the tail pointer, but fall short of introducing the head pointer. What did I miss?
    >
    > Steven

    Some considerations going into the design can be seen here:
    https://www.youtube.com/watch?v=5QIE0F7nU3U

    > On 8/24/17, 4:53 AM, "virtio-comment@lists.oasis-open.org on behalf of Tiwei Bie" <tiwei.bie@intel.com> wrote:
    >
    > [full proposal quoted in #15 above, snipped]


  • 17.  RE: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 09-01-2017 09:07
    Hi Steven
    I will try to explain from a hardware perspective why from a guest perspective having a headpointer or using DESC_HW flag can be very similar.

    The hardware will maintain a head pointer (local to it) for each queue and use this along with the tail pointer to determine how many valid descriptors are in the ring.

    Potentially (not proposed) the hardware could write the head pointer value into guest memory. As you mentioned, the guest could use this along with the tail pointer to determine how many descriptors have been consumed by the hardware.

    If instead of writing the head pointer value into guest memory, the hardware writes DESC_HW flag for a descriptor, then this can achieve the same result. If the hardware writes back every descriptor's DESC_HW flag, then yes this can be inefficient. The DESC_WB flag indicates which descriptors need to be written back to guest memory (as a minimum, only the DESC_HW flag needs to be written). This ensures (as with a head pointer writeback) that hardware is only writing back a single descriptor's DESC_HW flag after a batch of n descriptors have been consumed.

    The guest (as it set the DESC_WB flags) would know that it only needs to poll every nth (8/16?) descriptor's HW_flag and if it is clear then all n-1 previous descriptors have also been consumed by the hardware.

    So given that both head pointer writeback or DESC_HW flag writeback allow the same functionality, why chose the latter?
    Well in many cases the descriptor may have to be written back with extra information (packet length) and so if this mechanism has to be supported then we may as well use the DESC_HW flag as well.

    Admittedly, currently most packet metadata is prepended to the packet buffer.

    Also, to allow out-of-order processing, the descriptors would have to be written back (once consumed by hardware) to the guest memory.
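    The batching scheme described above could be sketched roughly as follows (a plain C simulation; the names, batch size, and bit positions are illustrative assumptions, not from the spec):

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define RING_SIZE 256
    #define BATCH     8           /* driver's chosen write-back interval */
    #define DESC_HW   (1u << 7)   /* illustrative bit positions */
    #define DESC_WB   (1u << 6)

    /* Driver: mark a batch available, requesting write-back only on the
     * last descriptor of the batch. */
    static void driver_post_batch(uint16_t *flags, int start)
    {
        for (int i = 0; i < BATCH; i++) {
            int idx = (start + i) % RING_SIZE;
            flags[idx] = DESC_HW;
            if (i == BATCH - 1)
                flags[idx] |= DESC_WB;
        }
    }

    /* Device: consume descriptors; write back (clear DESC_HW) only where
     * DESC_WB was set, saving one PCIe write per non-WB descriptor. */
    static void device_consume(uint16_t *flags, int start, int n)
    {
        for (int i = 0; i < n; i++) {
            int idx = (start + i) % RING_SIZE;
            if (flags[idx] & DESC_WB)
                flags[idx] &= (uint16_t)~DESC_HW; /* the only write-back */
        }
    }

    /* Driver: the whole batch is done iff the last descriptor's DESC_HW
     * flag has been cleared by the device. */
    static int driver_batch_done(const uint16_t *flags, int start)
    {
        return !(flags[(start + BATCH - 1) % RING_SIZE] & DESC_HW);
    }
    ```

    With BATCH = 8, the device performs one descriptor write-back per eight descriptors consumed, which is where the PCIe bandwidth saving comes from.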


    Rgds
    Kully



  • 18.  Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 09-10-2017 02:30
    On Thu, Aug 24, 2017 at 07:53:15PM +0800, Tiwei Bie wrote:
    > Rx Fixed Buffer Sizes
    > =====================
    >
    > ## Current proposal
    >
    > * Driver is free to choose whatever buffer sizes it wishes for Tx and
    > Rx buffers
    > * Theoretically within a ring, a driver could have different buffer sizes
    >
    > ## New proposal
    >
    > * Driver negotiates with device the size of a Rx buffer for a ring
    > - Each descriptor in that ring will have same size buffer
    > - Different rings can have different sized buffers

    What's the motivation for this? In our testing, dynamically
    sized entries perform better in constrained environments
    such as the Linux kernel, where packets are queued at
    a huge number of independent application sockets.

    It seems that the device can easily cache the last size to speed up
    operation.


    > Data Alignment Boundaries
    > =========================
    >
    > ## Current proposal
    >
    > * Driver is free to choose data buffer alignment to any byte boundary
    >
    > ## New proposal
    >
    > * Stipulate a fixed alignment for the data buffer

    Again, the motivation seems to be missing. Saving PCI bandwidth
    isn't going to help if it means the driver will then incur more
    cache misses on access.


    --
    MST



  • 19.  RE: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

    Posted 09-11-2017 06:56
    Hi

    I've embedded my comments below.

    Rgds
    Kully


    On Thu, Aug 24, 2017 at 07:53:15PM +0800, Tiwei Bie wrote:
    > Rx Fixed Buffer Sizes
    > =====================
    >
    > ## Current proposal
    >
    > * Driver is free to choose whatever buffer sizes it wishes for Tx and
    > Rx buffers
    > * Theoretically within a ring, a driver could have different buffer
    > sizes
    >
    > ## New proposal
    >
    > * Driver negotiates with device the size of a Rx buffer for a ring
    > - Each descriptor in that ring will have same size buffer
    > - Different rings can have different sized buffers

    What's the motivation for this? In our testing, dynamically sized entries perform better in constrained environments such as the Linux kernel, where packets are queued at a huge number of independent application sockets.

    It seems that the device can easily cache the last size to speed up operation.

    [Kully]:
    The device incurs around 1 us of delay fetching each descriptor. In situations where memory is limited on the device and many queues are being supported, the device would probably fetch descriptors (for Rx) after packets have been received from the network.
    Knowing upfront the buffer size associated with a ring would allow the device to accurately determine how many descriptors are required.

    Yes, agreed that overall system performance is important. Are sockets intended to be used with virtio drivers? If so, would the driver not allocate a different queue per socket? The proposal was to have fixed buffer sizes per ring, but different rings can have different buffer sizes. Would this compromise (different sizes per ring, as opposed to per entry) work?


    > Data Alignment Boundaries
    > =========================
    >
    > ## Current proposal
    >
    > * Driver is free to choose data buffer alignment to any byte boundary
    >
    > ## New proposal
    >
    > * Stipulate a fixed alignment for the data buffer

    Again, the motivation seems to be missing. Saving PCI bandwidth isn't going to help if it means the driver will then incur more cache misses on access.

    [Kully] Would s/w not benefit from buffers which start aligned to a cache line boundary (e.g. 64B)? This would also benefit hardware.

    --
    MST