virtio-comment


RE: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

  • 1.  RE: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-12-2023 14:24


    > From: Jason Wang <jasowang@redhat.com>
    > Sent: Wednesday, April 12, 2023 2:15 AM
    >
    > On Wed, Apr 12, 2023 at 1:55 PM Parav Pandit <parav@nvidia.com> wrote:
    > >
    > >
    > >
    > > > From: Jason Wang <jasowang@redhat.com>
    > > > Sent: Wednesday, April 12, 2023 1:38 AM
    > >
    > > > > Modern device says FEAETURE_1 must be offered and must be
    > > > > negotiated by
    > > > driver.
    > > > > Legacy has Mac as RW area. (hypervisor can do it).
    > > > > Reset flow is difference between the legacy and modern.
    > > >
    > > > Just to make sure we're at the same page. We're talking in the
    > > > context of mediation. Without mediation, your proposal can't work.
    > > >
    > > Right.
    > >
    > > > So in this case, the guest driver is not talking with the device
    > > > directly. Qemu needs to traps whatever it wants to achieve the
    > > > mediation:
    > > >
    > > I prefer to avoid picking specific sw component here, but yes. QEMU can trap.
    > >
    > > > 1) It's perfectly fine that Qemu negotiated VERSION_1 but presented
    > > > a mediated legacy device to guests.
    > > Right but if VERSION_1 is negotiated, device will work as V_1 with 12B
    > virtio_net_hdr.
    >
    > Shadow virtqueue could be used here. And we have much more issues without
    > shadow virtqueue, more below.
    >
    > >
    > > > 2) For MAC and Reset, Qemu can trap and do anything it wants.
    > > >
    > > The idea is not to poke in the fields even though such sw can.
    > > MAC is RW in legacy.
    > > Mac ia RO in 1.x.
    > >
    > > So QEMU cannot make RO register into RW.
    >
    > It can be done via using the control vq. Trap the MAC write and forward it via
    > control virtqueue.
    >
    This proposal is not implementing a vDPA-style mediator, which requires far deeper device understanding in the hypervisor.
    Such mediation works fine for vDPA, and it is up to the vDPA layer to do it. It is not relevant here.

    > >
    > > The proposed solution in this series enables it and avoid per field sw
    > interpretation and mediation in parsing values etc.
    >
    > I don't think it's possible. See the discussion about ORDER_PLATFORM and
    > ACCESS_PLATFORM in previous threads.
    >
    I have read the previous thread.
    The hypervisor will be limited to those platforms where ORDER_PLATFORM is not needed.
    And this is a PCI transitional device that uses the standard platform DMA anyway, so ACCESS_PLATFORM is not related.

    > >
    > > What is proposed here, that
    > > a. legacy registers are emulated as MMIO in a BAR.
    > > b. This can be either be BAR0 or some other BAR
    > >
    > > Your question was why this flexibility?
    >
    > Yes.
    >
    > >
    > > The reason is:
    > > a. if device prefers to implement only two BARs, it can do so and have window
    > for this 60+ config registers in an existing BAR.
    > > b. if device prefers to implement a new BAR dedicated for legacy registers
    > emulation, it is fine too.
    > >
    > > A mediating sw will be able to forward them regardless.
    >
    > I'm not sure I fully understand this. The only difference is that for b, it can only
    > use BAR0.
    Why do you say it can use only BAR 0?

    For example, a device may have implemented, say, only BAR2, with a small portion of BAR2 pointing to the legacy MMIO config registers.
    Mediating hypervisor software will be able to read/write it when BAR0 is exposed towards the guest VM as I/O BAR 0.
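
    To make that forwarding concrete, here is a minimal sketch (not part of the patch series; the handler signatures and the name legacy_mmio_base are assumptions) of how mediating hypervisor software could service the guest's emulated I/O BAR 0 by touching the legacy MMIO window in BAR2:

    /* Hypothetical sketch: forward guest legacy I/O BAR0 accesses to the
     * device's legacy MMIO register window, assumed here to live in BAR2.
     * legacy_mmio_base is assumed to be a mapping of BAR2 plus the offset
     * advertised by the capability. */
    #include <stdint.h>

    static volatile uint8_t *legacy_mmio_base;

    static uint32_t legacy_io_read(uint32_t off, unsigned int size)
    {
            switch (size) {
            case 1:  return *(volatile uint8_t  *)(legacy_mmio_base + off);
            case 2:  return *(volatile uint16_t *)(legacy_mmio_base + off);
            default: return *(volatile uint32_t *)(legacy_mmio_base + off);
            }
    }

    static void legacy_io_write(uint32_t off, uint32_t val, unsigned int size)
    {
            switch (size) {
            case 1:  *(volatile uint8_t  *)(legacy_mmio_base + off) = val; break;
            case 2:  *(volatile uint16_t *)(legacy_mmio_base + off) = val; break;
            default: *(volatile uint32_t *)(legacy_mmio_base + off) = val; break;
            }
    }

    The same two handlers work no matter which BAR the device chose for the window, which is the flexibility being argued for here.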

    > Unless there's a new feature that mandates
    > BAR0 (which I think is impossible since all the features are advertised via
    > capabilities now). We're fine.
    >
    No new feature. Legacy BAR emulation is exposed via the extended capability we discussed, which provides its location.
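
    As a rough illustration only (the actual layout is defined by the patch series, not here; the structure and field names below are assumptions), such a capability could locate the window much like the existing struct virtio_pci_cap does:

    /* Hypothetical sketch, modeled on struct virtio_pci_cap; the real
     * definition comes from the patch series. */
    struct virtio_pci_lgcy_mmio_cap {
            u8   cap_vndr;   /* PCI_CAP_ID_VNDR */
            u8   cap_next;
            u8   cap_len;
            u8   cfg_type;   /* identifies the legacy MMIO register window */
            u8   bar;        /* which BAR holds the window, e.g. BAR2 */
            u8   padding[3];
            le32 offset;     /* offset of the window within the BAR */
            le32 length;     /* size of the window in bytes */
    };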

    > >
    > > > > Right, it doesn’t. But spec shouldn’t write BAR0 is only for
    > > > > legacy MMIO
    > > > emulation, that would prevent BAR0 usage.
    > > >
    > > > How can it be prevented? Can you give me an example?
    > >
    > > I mean to say, that say if we write a spec like below,
    > >
    > > A device exposes BAR 0 of size X bytes for supporting legacy configuration
    > and device specific registers as memory mapped region.
    > >
    >
    > Ok, it looks just a matter of how the spec is written. The problematic part is that
    > it tries to enforce a size which is suboptimal.
    >
    > What's has been done is:
    >
    > "
    > Transitional devices MUST expose the Legacy Interface in I/O space in BAR0.
    > "
    >
    > Without mentioning the size.

    The new legacy MMIO registers can be implemented in BAR0 with the same size. But it is better not to place such a restriction as in the above wording.



  • 2.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 01:48
    On Wed, Apr 12, 2023 at 10:23 PM Parav Pandit <parav@nvidia.com> wrote:
    >
    >
    >
    > > From: Jason Wang <jasowang@redhat.com>
    > > Sent: Wednesday, April 12, 2023 2:15 AM
    > >
    > > On Wed, Apr 12, 2023 at 1:55 PM Parav Pandit <parav@nvidia.com> wrote:
    > > >
    > > >
    > > >
    > > > > From: Jason Wang <jasowang@redhat.com>
    > > > > Sent: Wednesday, April 12, 2023 1:38 AM
    > > >
    > > > > > Modern device says FEAETURE_1 must be offered and must be
    > > > > > negotiated by
    > > > > driver.
    > > > > > Legacy has Mac as RW area. (hypervisor can do it).
    > > > > > Reset flow is difference between the legacy and modern.
    > > > >
    > > > > Just to make sure we're at the same page. We're talking in the
    > > > > context of mediation. Without mediation, your proposal can't work.
    > > > >
    > > > Right.
    > > >
    > > > > So in this case, the guest driver is not talking with the device
    > > > > directly. Qemu needs to traps whatever it wants to achieve the
    > > > > mediation:
    > > > >
    > > > I prefer to avoid picking specific sw component here, but yes. QEMU can trap.
    > > >
    > > > > 1) It's perfectly fine that Qemu negotiated VERSION_1 but presented
    > > > > a mediated legacy device to guests.
    > > > Right but if VERSION_1 is negotiated, device will work as V_1 with 12B
    > > virtio_net_hdr.
    > >
    > > Shadow virtqueue could be used here. And we have much more issues without
    > > shadow virtqueue, more below.
    > >
    > > >
    > > > > 2) For MAC and Reset, Qemu can trap and do anything it wants.
    > > > >
    > > > The idea is not to poke in the fields even though such sw can.
    > > > MAC is RW in legacy.
    > > > Mac ia RO in 1.x.
    > > >
    > > > So QEMU cannot make RO register into RW.
    > >
    > > It can be done via using the control vq. Trap the MAC write and forward it via
    > > control virtqueue.
    > >
    > This proposal Is not implementing about vdpa mediator that requires far higher understanding in hypervisor.

    It's not related to vDPA, it's about a common technique that is used
    in virtualization. You already trap and emulate the device status, so why
    can't you do that for the other fields?

    > Such mediation works fine for vdpa and it is upto vdpa layer to do. Not relevant here.
    >
    > > >
    > > > The proposed solution in this series enables it and avoid per field sw
    > > interpretation and mediation in parsing values etc.
    > >
    > > I don't think it's possible. See the discussion about ORDER_PLATFORM and
    > > ACCESS_PLATFORM in previous threads.
    > >
    > I have read the previous thread.
    > Hypervisor will be limiting to those platforms where ORDER_PLATFORM is not needed.

    So you introduce a bunch of new facilities that only work on some
    specific archs. This breaks the architecture independence virtio has had
    since 1.0. The root cause is that legacy is not fit for a hardware
    implementation; any hardware that tries to offer the legacy
    function will eventually run into corner cases which require extra
    interfaces, and which may finally end up as a (partial) duplication of
    the modern interface.

    > And this is a pci transitional device that uses the standard platform dma anyway so ACCESS_PLATFORM is not related.

    So which type of transactions does this device use when it is accessed via the
    legacy MMIO BAR? Translated requests or not?

    >
    > > >
    > > > What is proposed here, that
    > > > a. legacy registers are emulated as MMIO in a BAR.
    > > > b. This can be either be BAR0 or some other BAR
    > > >
    > > > Your question was why this flexibility?
    > >
    > > Yes.
    > >
    > > >
    > > > The reason is:
    > > > a. if device prefers to implement only two BARs, it can do so and have window
    > > for this 60+ config registers in an existing BAR.
    > > > b. if device prefers to implement a new BAR dedicated for legacy registers
    > > emulation, it is fine too.
    > > >
    > > > A mediating sw will be able to forward them regardless.
    > >
    > > I'm not sure I fully understand this. The only difference is that for b, it can only
    > > use BAR0.
    > Why do say it can use only BAR 0?

    Because:

    1) It's the way the current transitional device works.
    2) It's simple: a small extension to the transitional device instead
    of a bunch of facilities that can do much less than this.
    3) It works for legacy drivers on some OSes such as Linux and DPDK, which
    means it works for bare metal, which can't be achieved by your proposal
    here.

    >
    > For example, a device may have implemented say only BAR2, and small portion of the BAR2 is pointing to legacy MMIO config registers.

    We're discussing spec changes, not a specific implementation here. Why
    can't the device use BAR0? Do you see any restriction in the spec?

    > A mediator hypervisor sw will be able to read/write to it when BAR0 is exposed towards the guest VM as IOBAR 0.

    So I don't think it can work:

    1) This is very dangerous unless the spec mandates the size (which is
    also tricky since the page size varies among arches) for every
    BAR/capability, which is not what virtio wants; the spec leaves that
    flexibility to the implementation:

    E.g

    """
    The driver MUST accept a cap_len value which is larger than specified here.
    """

    2) It is a blocker for live migration (and compatibility): the hypervisor
    should not assume the size of any capability, so in any case it
    should have a fallback for when the BAR can't be assigned.

    >
    > > Unless there's a new feature that mandates
    > > BAR0 (which I think is impossible since all the features are advertised via
    > > capabilities now). We're fine.
    > >
    > No new feature. Legacy BAR emulation is exposed via the extended capability we discussed providing the location.
    >
    > > >
    > > > > > Right, it doesn’t. But spec shouldn’t write BAR0 is only for
    > > > > > legacy MMIO
    > > > > emulation, that would prevent BAR0 usage.
    > > > >
    > > > > How can it be prevented? Can you give me an example?
    > > >
    > > > I mean to say, that say if we write a spec like below,
    > > >
    > > > A device exposes BAR 0 of size X bytes for supporting legacy configuration
    > > and device specific registers as memory mapped region.
    > > >
    > >
    > > Ok, it looks just a matter of how the spec is written. The problematic part is that
    > > it tries to enforce a size which is suboptimal.
    > >
    > > What's has been done is:
    > >
    > > "
    > > Transitional devices MUST expose the Legacy Interface in I/O space in BAR0.
    > > "
    > >
    > > Without mentioning the size.
    >
    > For new legacy MMIO registers can be implemented as BAR0 with same size. But better to not place such restriction like above wording.

    Let me summarize; we have three ways currently:

    1) legacy MMIO BAR via capability:

    Pros:
    - allows some flexibility to place the MMIO BAR somewhere other than BAR0
    Cons:
    - a new device ID
    - non-trivial spec changes, which end up in tricky cases that
    try to work around legacy to fit a hardware implementation
    - works only for the virtualization case, with the help of
    mediation; can't work for bare metal
    - only works on some specific archs without SVQ

    2) allow BAR0 to be MMIO for transitional device

    Pros:
    - a very minor change to the spec
    - works for virtualization (and it works even without dedicated
    mediation for some setups)
    - works for bare metal for some setups (without mediation)
    Cons:
    - only works on some specific archs without SVQ
    - BAR0 is required

    3) modern device mediation for legacy

    Pros:
    - no changes in the spec
    Cons:
    - requires a mediation layer in order to work on bare metal
    - requires datapath mediation like SVQ to work for virtualization

    Compared to method 2), the only advantage of method 1) is the
    flexibility of not requiring BAR0, but it has too many disadvantages.
    If we only care about virtualization, modern devices are sufficient.
    So why bother with it?

    Thanks




  • 3.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 03:31


    On 4/12/2023 9:48 PM, Jason Wang wrote:
    > On Wed, Apr 12, 2023 at 10:23 PM Parav Pandit <parav@nvidia.com> wrote:
    >>
    >>
    >>
    >>> From: Jason Wang <jasowang@redhat.com>
    >>> Sent: Wednesday, April 12, 2023 2:15 AM
    >>>
    >>> On Wed, Apr 12, 2023 at 1:55 PM Parav Pandit <parav@nvidia.com> wrote:
    >>>>
    >>>>
    >>>>
    >>>>> From: Jason Wang <jasowang@redhat.com>
    >>>>> Sent: Wednesday, April 12, 2023 1:38 AM
    >>>>
    >>>>>> Modern device says FEAETURE_1 must be offered and must be
    >>>>>> negotiated by
    >>>>> driver.
    >>>>>> Legacy has Mac as RW area. (hypervisor can do it).
    >>>>>> Reset flow is difference between the legacy and modern.
    >>>>>
    >>>>> Just to make sure we're at the same page. We're talking in the
    >>>>> context of mediation. Without mediation, your proposal can't work.
    >>>>>
    >>>> Right.
    >>>>
    >>>>> So in this case, the guest driver is not talking with the device
    >>>>> directly. Qemu needs to traps whatever it wants to achieve the
    >>>>> mediation:
    >>>>>
    >>>> I prefer to avoid picking specific sw component here, but yes. QEMU can trap.
    >>>>
    >>>>> 1) It's perfectly fine that Qemu negotiated VERSION_1 but presented
    >>>>> a mediated legacy device to guests.
    >>>> Right but if VERSION_1 is negotiated, device will work as V_1 with 12B
    >>> virtio_net_hdr.
    >>>
    >>> Shadow virtqueue could be used here. And we have much more issues without
    >>> shadow virtqueue, more below.
    >>>
    >>>>
    >>>>> 2) For MAC and Reset, Qemu can trap and do anything it wants.
    >>>>>
    >>>> The idea is not to poke in the fields even though such sw can.
    >>>> MAC is RW in legacy.
    >>>> Mac ia RO in 1.x.
    >>>>
    >>>> So QEMU cannot make RO register into RW.
    >>>
    >>> It can be done via using the control vq. Trap the MAC write and forward it via
    >>> control virtqueue.
    >>>
    >> This proposal Is not implementing about vdpa mediator that requires far higher understanding in hypervisor.
    >
    > It's not related to vDPA, it's about a common technology that is used
    > in virtualization. You do a trap and emulate the status, why can't you
    > do that for others?
    >
    >> Such mediation works fine for vdpa and it is upto vdpa layer to do. Not relevant here.
    >>
    >>>>
    >>>> The proposed solution in this series enables it and avoid per field sw
    >>> interpretation and mediation in parsing values etc.
    >>>
    >>> I don't think it's possible. See the discussion about ORDER_PLATFORM and
    >>> ACCESS_PLATFORM in previous threads.
    >>>
    >> I have read the previous thread.
    >> Hypervisor will be limiting to those platforms where ORDER_PLATFORM is not needed.
    >
    > So you introduce a bunch of new facilities that only work on some
    > specific archs. This breaks the architecture independence of virtio
    > since 1.0.
    The spec as defined today does not work for a transitional PCI
    device in virtualization. It only works in the limited PF case.
    Hence this update. More below.

    > The root cause is legacy is not fit for hardware
    > implementation, any kind of hardware that tries to offer legacy
    > function will finally run into those corner cases which require extra
    > interfaces which may finally end up with a (partial) duplication of
    > the modern interface.
    >
    I agree with you. We cannot change legacy.
    What is being added here is the ability to do legacy transport via MMIO or AQ,
    using the notification region.

    I will comment below where you list the 3 options.

    >> And this is a pci transitional device that uses the standard platform dma anyway so ACCESS_PLATFORM is not related.
    >
    > So which type of transactions did this device use when it is used via
    > legacy MMIO BAR? Translated request or not?
    >
    The device uses the PCI transport level addresses as configured, because it is a
    PCI device.

    >> For example, a device may have implemented say only BAR2, and small portion of the BAR2 is pointing to legacy MMIO config registers.
    >
    > We're discussing spec changes, not a specific implementation here. Why
    > is the device can't use BAR0, do you see any restriction in the spec?
    >
    No restriction.
    Forcing it to use BAR0 is the restrictive method.
    >> A mediator hypervisor sw will be able to read/write to it when BAR0 is exposed towards the guest VM as IOBAR 0.
    >
    > So I don't think it can work:
    >
    > 1) This is very dangerous unless the spec mandates the size (this is
    > also tricky since page size varies among arches) for any
    > BAR/capability which is not what virtio wants, the spec leave those
    > flexibility to the implementation:
    >
    > E.g
    >
    > """
    > The driver MUST accept a cap_len value which is larger than specified here.
    > """
    cap_len is the length of the PCI capability structure itself, as defined by
    the PCI spec. The region length is located in the le32 length field.

    So the new MMIO region can be of any size and placed anywhere in the BAR.
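
    For reference, this is the generic layout from the spec being discussed
    (quoted from memory, so double-check the field comments against the spec):

    struct virtio_pci_cap {
            u8 cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
            u8 cap_next;    /* Generic PCI field: next ptr. */
            u8 cap_len;     /* Generic PCI field: capability length */
            u8 cfg_type;    /* Identifies the structure. */
            u8 bar;         /* Where to find it. */
            u8 padding[3];  /* Pad to full dword. */
            le32 offset;    /* Offset within bar. */
            le32 length;    /* Length of the structure, in bytes. */
    };

    cap_len covers only the capability structure in config space; bar, offset
    and length locate the region that the capability points to.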

    For LM, the BAR length and number should be the same between the two PCI VFs,
    but that is orthogonal to this point. Such checks will be done anyway.

    >
    > 2) A blocker for live migration (and compatibility), the hypervisor
    > should not assume the size for any capability so for whatever case it
    > should have a fallback for the case where the BAR can't be assigned.
    >
    I agree that the hypervisor should not assume.
    For LM such compatibility checks will be done anyway.
    So it is not a blocker; all that is needed is that the two sides match.

    > Let me summarize, we had three ways currently:
    >
    > 1) legacy MMIO BAR via capability:
    >
    > Pros:
    > - allow some flexibility to place MMIO BAR other than 0
    > Cons:
    > - new device ID
    Not needed, as Michael suggested. An existing transitional or non-transitional
    device can expose this optional capability and its attached MMIO region.

    Spec changes are similar to #2.
    > - non trivial spec changes which ends up of the tricky cases that
    > tries to workaround legacy to fit for a hardware implementation
    > - work only for the case of virtualization with the help of
    > meditation, can't work for bare metal
    For bare-metal PFs, usually thin hypervisors are used that do very
    minimal setup. But I agree that bare metal is relatively less important.

    > - only work for some specific archs without SVQ
    >
    That is the legacy limitation that we don't worry about.

    > 2) allow BAR0 to be MMIO for transitional device
    >
    > Pros:
    > - very minor change for the spec
    Spec-changes-wise, they are similar to #1.
    > - work for virtualization (and it work even without dedicated
    > mediation for some setups)
    I am not aware of where it can work without mediation. Do you know any
    specific kernel version where it actually works?

    > - work for bare metal for some setups (without mediation)
    > Cons:
    > - only work for some specific archs without SVQ
    > - BAR0 is required
    >
    Neither is a limitation, as they mainly come from the legacy side
    of things.

    > 3) modern device mediation for legacy
    >
    > Pros:
    > - no changes in the spec
    > Cons:
    > - require mediation layer in order to work in bare metal
    > - require datapath mediation like SVQ to work for virtualization
    >
    A spec change is still required for net and blk because a modern device does
    not understand legacy, even with a mediation layer:
    FEATURE_1, and the RW fields via a CVQ which is not really owned by the hypervisor.
    A guest may be legacy or non-legacy, so mediation shouldn't always be done.

    > Compared to method 2) the only advantages of method 1) is the
    > flexibility of BAR0 but it has too many disadvantages. If we only care
    > about virtualization, modern devices are sufficient. Then why bother
    > for that?

    So that a single stack, which doesn't always know which
    driver version is running in the guest, can utilize it. Otherwise 1.x also
    ends up doing mediation when the guest driver = 1.x and the device = transitional
    PCI VF.

    So (1) and (2) are both equivalent; one is more flexible. If you know
    more valid cases where BAR0 as MMIO can work as-is, that option is open.

    We can draft the spec so that the MMIO BAR SHOULD be exposed in BAR0.



  • 4.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 05:14
    On Thu, Apr 13, 2023 at 11:31 AM Parav Pandit <parav@nvidia.com> wrote:
    >
    >
    >
    > On 4/12/2023 9:48 PM, Jason Wang wrote:
    > > On Wed, Apr 12, 2023 at 10:23 PM Parav Pandit <parav@nvidia.com> wrote:
    > >>
    > >>
    > >>
    > >>> From: Jason Wang <jasowang@redhat.com>
    > >>> Sent: Wednesday, April 12, 2023 2:15 AM
    > >>>
    > >>> On Wed, Apr 12, 2023 at 1:55 PM Parav Pandit <parav@nvidia.com> wrote:
    > >>>>
    > >>>>
    > >>>>
    > >>>>> From: Jason Wang <jasowang@redhat.com>
    > >>>>> Sent: Wednesday, April 12, 2023 1:38 AM
    > >>>>
    > >>>>>> Modern device says FEAETURE_1 must be offered and must be
    > >>>>>> negotiated by
    > >>>>> driver.
    > >>>>>> Legacy has Mac as RW area. (hypervisor can do it).
    > >>>>>> Reset flow is difference between the legacy and modern.
    > >>>>>
    > >>>>> Just to make sure we're at the same page. We're talking in the
    > >>>>> context of mediation. Without mediation, your proposal can't work.
    > >>>>>
    > >>>> Right.
    > >>>>
    > >>>>> So in this case, the guest driver is not talking with the device
    > >>>>> directly. Qemu needs to traps whatever it wants to achieve the
    > >>>>> mediation:
    > >>>>>
    > >>>> I prefer to avoid picking specific sw component here, but yes. QEMU can trap.
    > >>>>
    > >>>>> 1) It's perfectly fine that Qemu negotiated VERSION_1 but presented
    > >>>>> a mediated legacy device to guests.
    > >>>> Right but if VERSION_1 is negotiated, device will work as V_1 with 12B
    > >>> virtio_net_hdr.
    > >>>
    > >>> Shadow virtqueue could be used here. And we have much more issues without
    > >>> shadow virtqueue, more below.
    > >>>
    > >>>>
    > >>>>> 2) For MAC and Reset, Qemu can trap and do anything it wants.
    > >>>>>
    > >>>> The idea is not to poke in the fields even though such sw can.
    > >>>> MAC is RW in legacy.
    > >>>> Mac ia RO in 1.x.
    > >>>>
    > >>>> So QEMU cannot make RO register into RW.
    > >>>
    > >>> It can be done via using the control vq. Trap the MAC write and forward it via
    > >>> control virtqueue.
    > >>>
    > >> This proposal Is not implementing about vdpa mediator that requires far higher understanding in hypervisor.
    > >
    > > It's not related to vDPA, it's about a common technology that is used
    > > in virtualization. You do a trap and emulate the status, why can't you
    > > do that for others?
    > >
    > >> Such mediation works fine for vdpa and it is upto vdpa layer to do. Not relevant here.
    > >>
    > >>>>
    > >>>> The proposed solution in this series enables it and avoid per field sw
    > >>> interpretation and mediation in parsing values etc.
    > >>>
    > >>> I don't think it's possible. See the discussion about ORDER_PLATFORM and
    > >>> ACCESS_PLATFORM in previous threads.
    > >>>
    > >> I have read the previous thread.
    > >> Hypervisor will be limiting to those platforms where ORDER_PLATFORM is not needed.
    > >
    > > So you introduce a bunch of new facilities that only work on some
    > > specific archs. This breaks the architecture independence of virtio
    > > since 1.0.
    > The defined spec for PCI device does not work today for transitional
    > device for virtualization. Only works in limited PF case.
    > Hence this update.

    I fully understand the motivation. I just want to say:

    1) compared to MMIO at BAR0, this proposal doesn't provide many advantages
    2) mediating on top of modern devices allows us to not worry about the
    device design, which is hard for legacy

    > More below.
    >
    > > The root cause is legacy is not fit for hardware
    > > implementation, any kind of hardware that tries to offer legacy
    > > function will finally run into those corner cases which require extra
    > > interfaces which may finally end up with a (partial) duplication of
    > > the modern interface.
    > >
    > I agree with you. We cannot change the legacy.
    > What is being added here it to enable legacy transport via MMIO or AQ
    > and using notification region.
    >
    > Will comment where you listed 3 options.
    >
    > >> And this is a pci transitional device that uses the standard platform dma anyway so ACCESS_PLATFORM is not related.
    > >
    > > So which type of transactions did this device use when it is used via
    > > legacy MMIO BAR? Translated request or not?
    > >
    > Device uses the PCI transport level addresses configured because its a
    > PCI device.
    >
    > >> For example, a device may have implemented say only BAR2, and small portion of the BAR2 is pointing to legacy MMIO config registers.
    > >
    > > We're discussing spec changes, not a specific implementation here. Why
    > > is the device can't use BAR0, do you see any restriction in the spec?
    > >
    > No restriction.
    > Forcing it to use BAR0 is the restrictive method.
    > >> A mediator hypervisor sw will be able to read/write to it when BAR0 is exposed towards the guest VM as IOBAR 0.
    > >
    > > So I don't think it can work:
    > >
    > > 1) This is very dangerous unless the spec mandates the size (this is
    > > also tricky since page size varies among arches) for any
    > > BAR/capability which is not what virtio wants, the spec leave those
    > > flexibility to the implementation:
    > >
    > > E.g
    > >
    > > """
    > > The driver MUST accept a cap_len value which is larger than specified here.
    > > """
    > cap_len talks about length of the PCI capability structure as defined by
    > the PCI spec. BAR length is located in the le32 length.
    >
    > So new MMIO region can be of any size and anywhere in the BAR.
    >
    > For LM BAR length and number should be same between two PCI VFs. But its
    > orthogonal to this point. Such checks will be done anyway.

    I quoted the wrong section; I think it should be:

    "
    length MAY include padding, or fields unused by the driver, or future
    extensions. Note: For example, a future device might present a large
    structure size of several MBytes. As current devices never utilize
    structures larger than 4KBytes in size, driver MAY limit the mapped
    structure size to e.g. 4KBytes (thus ignoring parts of structure after
    the first 4KBytes) to allow forward compatibility with such devices
    without loss of functionality and without wasting resources.
    "

    >
    > >
    > > 2) A blocker for live migration (and compatibility), the hypervisor
    > > should not assume the size for any capability so for whatever case it
    > > should have a fallback for the case where the BAR can't be assigned.
    > >
    > I agree that hypervisor should not assume.
    > for LM such compatibility checks will be done anyway.
    > So not a blocker, they should match on two sides is all needed.
    >
    > > Let me summarize, we had three ways currently:
    > >
    > > 1) legacy MMIO BAR via capability:
    > >
    > > Pros:
    > > - allow some flexibility to place MMIO BAR other than 0
    > > Cons:
    > > - new device ID
    > Not needed as Michael suggest. Existing transitional or non transitional

    If it's a transitional device but the legacy interface is not placed at BAR0,
    it might have side effects for Linux drivers, which assume BAR0 for legacy.

    I don't see how it could easily be a non-transitional device:

    "
    Devices or drivers with no legacy compatibility are referred to as
    non-transitional devices and drivers, respectively.
    "

    > device can expose this optional capability and its attached MMIO region.
    >
    > Spec changes are similar to #2.
    > > - non trivial spec changes which ends up of the tricky cases that
    > > tries to workaround legacy to fit for a hardware implementation
    > > - work only for the case of virtualization with the help of
    > > meditation, can't work for bare metal
    > For bare-metal PFs usually thin hypervisors are used that does very
    > minimal setup. But I agree that bare-metal is relatively less important.

    This is not what I understand. I know several vendors that are using
    virtio devices for bare metal.

    >
    > > - only work for some specific archs without SVQ
    > >
    > That is the legacy limitation that we don't worry about.
    >
    > > 2) allow BAR0 to be MMIO for transitional device
    > >
    > > Pros:
    > > - very minor change for the spec
    > Spec changes wise they are similar to #1.

    This is different since the changes for this are trivial.

    > > - work for virtualization (and it work even without dedicated
    > > mediation for some setups)
    > I am not aware where can it work without mediation. Do you know any
    > specific kernel version where it actually works?

    E.g. the current Linux driver does:

    rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");

    It doesn't differentiate I/O from memory. It means if you had a
    "transitional" device with a legacy MMIO BAR0, it just works.

    >
    > > - work for bare metal for some setups (without mediation)
    > > Cons:
    > > - only work for some specific archs without SVQ
    > > - BAR0 is required
    > >
    > Both are not limitation as they are mainly coming from the legacy side
    > of things.
    >
    > > 3) modern device mediation for legacy
    > >
    > > Pros:
    > > - no changes in the spec
    > > Cons:
    > > - require mediation layer in order to work in bare metal
    > > - require datapath mediation like SVQ to work for virtualization
    > >
    > Spec change is still require for net and blk because modern device do
    > not understand legacy, even with mediation layer.

    That's fine and easy since we work on top of modern devices.

    > FEATURE_1, RW cap via CVQ which is not really owned by the hypervisor.

    Hypervisors can trap if they wish.

    > A guest may be legacy or non legacy, so mediation shouldn't be always done.

    Yes, so mediation can work only if we find it's a legacy driver.

    >
    > > Compared to method 2) the only advantages of method 1) is the
    > > flexibility of BAR0 but it has too many disadvantages. If we only care
    > > about virtualization, modern devices are sufficient. Then why bother
    > > for that?
    >
    > So that a single stack which doesn't always have the knowledge of which
    > driver version is running is guest can utilize it. Otherwise 1.x also
    > end up doing mediation when guest driver = 1.x and device = transitional
    > PCI VF.

    I don't see how this can be solved in your proposal.

    >
    > so (1) and (2) both are equivalent, one is more flexible, if you know
    > more valid cases where BAR0 as MMIO can work as_is, such option is open.

    As said in previous threads, this has been used by several vendors for years.

    E.g I have a handy transitional hardware virtio device that has:

    Region 0: Memory at f5ff0000 (64-bit, prefetchable) [size=8K]
    Region 2: Memory at f5fe0000 (64-bit, prefetchable) [size=4K]
    Region 4: Memory at f5800000 (64-bit, prefetchable) [size=4M]

    And:

    Capabilities: [64] Vendor Specific Information: VirtIO: CommonCfg
    BAR=0 offset=00000888 size=00000078
    Capabilities: [74] Vendor Specific Information: VirtIO: Notify
    BAR=0 offset=00001800 size=00000020 multiplier=00000000
    Capabilities: [88] Vendor Specific Information: VirtIO: ISR
    BAR=0 offset=00000820 size=00000020
    Capabilities: [98] Vendor Specific Information: VirtIO: DeviceCfg
    BAR=0 offset=00000840 size=00000020

    >
    > We can draft the spec that MMIO BAR SHOULD be exposes in BAR0.
    >

    Thanks




  • 7.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 05:14
    On Thu, Apr 13, 2023 at 11:31?AM Parav Pandit <parav@nvidia.com> wrote:
    >
    >
    >
    > On 4/12/2023 9:48 PM, Jason Wang wrote:
    > > On Wed, Apr 12, 2023 at 10:23?PM Parav Pandit <parav@nvidia.com> wrote:
    > >>
    > >>
    > >>
    > >>> From: Jason Wang <jasowang@redhat.com>
    > >>> Sent: Wednesday, April 12, 2023 2:15 AM
    > >>>
    > >>> On Wed, Apr 12, 2023 at 1:55?PM Parav Pandit <parav@nvidia.com> wrote:
    > >>>>
    > >>>>
    > >>>>
    > >>>>> From: Jason Wang <jasowang@redhat.com>
    > >>>>> Sent: Wednesday, April 12, 2023 1:38 AM
    > >>>>
    > >>>>>> Modern device says FEAETURE_1 must be offered and must be
    > >>>>>> negotiated by
    > >>>>> driver.
    > >>>>>> Legacy has Mac as RW area. (hypervisor can do it).
    > >>>>>> Reset flow is difference between the legacy and modern.
    > >>>>>
    > >>>>> Just to make sure we're at the same page. We're talking in the
    > >>>>> context of mediation. Without mediation, your proposal can't work.
    > >>>>>
    > >>>> Right.
    > >>>>
    > >>>>> So in this case, the guest driver is not talking with the device
    > >>>>> directly. Qemu needs to traps whatever it wants to achieve the
    > >>>>> mediation:
    > >>>>>
    > >>>> I prefer to avoid picking specific sw component here, but yes. QEMU can trap.
    > >>>>
    > >>>>> 1) It's perfectly fine that Qemu negotiated VERSION_1 but presented
    > >>>>> a mediated legacy device to guests.
    > >>>> Right but if VERSION_1 is negotiated, device will work as V_1 with 12B
    > >>> virtio_net_hdr.
    > >>>
    > >>> Shadow virtqueue could be used here. And we have much more issues without
    > >>> shadow virtqueue, more below.
    > >>>
    > >>>>
    > >>>>> 2) For MAC and Reset, Qemu can trap and do anything it wants.
    > >>>>>
    > >>>> The idea is not to poke in the fields even though such sw can.
    > >>>> MAC is RW in legacy.
    > >>>> Mac ia RO in 1.x.
    > >>>>
    > >>>> So QEMU cannot make RO register into RW.
    > >>>
    > >>> It can be done via using the control vq. Trap the MAC write and forward it via
    > >>> control virtqueue.
    > >>>
    > >> This proposal Is not implementing about vdpa mediator that requires far higher understanding in hypervisor.
    > >
    > > It's not related to vDPA, it's about a common technology that is used
    > > in virtualization. You do a trap and emulate the status, why can't you
    > > do that for others?
    > >
    > >> Such mediation works fine for vdpa and it is upto vdpa layer to do. Not relevant here.
    > >>
    > >>>>
    > >>>> The proposed solution in this series enables it and avoid per field sw
    > >>> interpretation and mediation in parsing values etc.
    > >>>
    > >>> I don't think it's possible. See the discussion about ORDER_PLATFORM and
    > >>> ACCESS_PLATFORM in previous threads.
    > >>>
    > >> I have read the previous thread.
    > >> Hypervisor will be limiting to those platforms where ORDER_PLATFORM is not needed.
    > >
    > > So you introduce a bunch of new facilities that only work on some
    > > specific archs. This breaks the architecture independence of virtio
    > > since 1.0.
    > The defined spec for PCI device does not work today for transitional
    > device for virtualization. Only works in limited PF case.
    > Hence this update.

    I fully understand the motivation. I just want to say

    1) compare to the MMIO ar BAR0, this proposal doesn't provide much advantages
    2) mediate on top of modern devices allows us to not worry about the
    device design which is hard for legacy

    > More below.
    >
    > > The root cause is legacy is not fit for hardware
    > > implementation, any kind of hardware that tries to offer legacy
    > > function will finally run into those corner cases which require extra
    > > interfaces which may finally end up with a (partial) duplication of
    > > the modern interface.
    > >
    > I agree with you. We cannot change the legacy.
    > What is being added here it to enable legacy transport via MMIO or AQ
    > and using notification region.
    >
    > Will comment where you listed 3 options.
    >
    > >> And this is a pci transitional device that uses the standard platform dma anyway so ACCESS_PLATFORM is not related.
    > >
    > > So which type of transactions did this device use when it is used via
    > > legacy MMIO BAR? Translated request or not?
    > >
    > Device uses the PCI transport level addresses configured because its a
    > PCI device.
    >
    > >> For example, a device may have implemented say only BAR2, and small portion of the BAR2 is pointing to legacy MMIO config registers.
    > >
    > > We're discussing spec changes, not a specific implementation here. Why
    > > is the device can't use BAR0, do you see any restriction in the spec?
    > >
    > No restriction.
    > Forcing it to use BAR0 is the restrictive method.
    > >> A mediator hypervisor sw will be able to read/write to it when BAR0 is exposed towards the guest VM as IOBAR 0.
    > >
    > > So I don't think it can work:
    > >
    > > 1) This is very dangerous unless the spec mandates the size (this is
    > > also tricky since page size varies among arches) for any
    > > BAR/capability which is not what virtio wants, the spec leave those
    > > flexibility to the implementation:
    > >
    > > E.g
    > >
    > > """
    > > The driver MUST accept a cap_len value which is larger than specified here.
    > > """
    > cap_len talks about length of the PCI capability structure as defined by
    > the PCI spec. BAR length is located in the le32 length.
    >
    > So new MMIO region can be of any size and anywhere in the BAR.
    >
    > For LM BAR length and number should be same between two PCI VFs. But its
    > orthogonal to this point. Such checks will be done anyway.

    Quoted the wrong sections, I think it should be:

    "
    length MAY include padding, or fields unused by the driver, or future
    extensions. Note: For example, a future device might present a large
    structure size of several MBytes. As current devices never utilize
    structures larger than 4KBytes in size, driver MAY limit the mapped
    structure size to e.g. 4KBytes (thus ignoring parts of structure after
    the first 4KBytes) to allow forward compatibility with such devices
    without loss of functionality and without wasting resources.
    "

    >
    > >
    > > 2) A blocker for live migration (and compatibility), the hypervisor
    > > should not assume the size for any capability so for whatever case it
    > > should have a fallback for the case where the BAR can't be assigned.
    > >
    > I agree that the hypervisor should not assume.
    > For LM such compatibility checks will be done anyway.
    > So it's not a blocker; all that's needed is that the two sides match.
    >
    > > Let me summarize, we had three ways currently:
    > >
    > > 1) legacy MMIO BAR via capability:
    > >
    > > Pros:
    > > - allow some flexibility to place MMIO BAR other than 0
    > > Cons:
    > > - new device ID
    > Not needed, as Michael suggested. An existing transitional or non-transitional

    If it's a transitional device but not placed at BAR0, it might have
    side effects for Linux drivers, which assume BAR0 for legacy.

    I don't see how it could easily be a non-transitional device:

    "
    Devices or drivers with no legacy compatibility are referred to as
    non-transitional devices and drivers, respectively.
    "

    > device can expose this optional capability and its attached MMIO region.
    >
    > Spec changes are similar to #2.
    > > - non trivial spec changes which ends up of the tricky cases that
    > > tries to workaround legacy to fit for a hardware implementation
    > > - work only for the case of virtualization with the help of
    > > meditation, can't work for bare metal
    > For bare-metal PFs usually thin hypervisors are used that do very
    > minimal setup. But I agree that bare-metal is relatively less important.

    This is not what I understand. I know several vendors that are using
    virtio devices for bare metal.

    >
    > > - only work for some specific archs without SVQ
    > >
    > That is the legacy limitation that we don't worry about.
    >
    > > 2) allow BAR0 to be MMIO for transitional device
    > >
    > > Pros:
    > > - very minor change for the spec
    > Spec changes wise they are similar to #1.

    This is different since the changes for this are trivial.

    > > - work for virtualization (and it work even without dedicated
    > > mediation for some setups)
    > I am not aware of where it can work without mediation. Do you know any
    > specific kernel version where it actually works?

    E.g. the current Linux driver does:

    rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");

    The accessors don't differ between I/O ports and memory. It means if you had a
    "transitional" device with a legacy MMIO BAR0, it just works.

    >
    > > - work for bare metal for some setups (without mediation)
    > > Cons:
    > > - only work for some specific archs without SVQ
    > > - BAR0 is required
    > >
    > Both are not limitations as they are mainly coming from the legacy side
    > of things.
    >
    > > 3) modern device mediation for legacy
    > >
    > > Pros:
    > > - no changes in the spec
    > > Cons:
    > > - require mediation layer in order to work in bare metal
    > > - require datapath mediation like SVQ to work for virtualization
    > >
    > A spec change is still required for net and blk because a modern device does
    > not understand legacy, even with a mediation layer.

    That's fine and easy since we work on top of modern devices.

    > FEATURE_1, RW cap via CVQ which is not really owned by the hypervisor.

    Hypervisors can trap if they wish.

    > A guest may be legacy or non-legacy, so mediation shouldn't always be done.

    Yes, so mediation can work only if we find it's a legacy driver.

    >
    > > Compared to method 2) the only advantages of method 1) is the
    > > flexibility of BAR0 but it has too many disadvantages. If we only care
    > > about virtualization, modern devices are sufficient. Then why bother
    > > for that?
    >
    > So that a single stack which doesn't always have the knowledge of which
    > driver version is running in the guest can utilize it. Otherwise 1.x also
    > ends up doing mediation when guest driver = 1.x and device = transitional
    > PCI VF.

    I don't see how this can be solved in your proposal.

    >
    > so (1) and (2) are equivalent, one is just more flexible; if you know
    > more valid cases where BAR0 as MMIO can work as-is, that option is open.

    As said in previous threads, this has been used by several vendors for years.

    E.g I have a handy transitional hardware virtio device that has:

    Region 0: Memory at f5ff0000 (64-bit, prefetchable) [size=8K]
    Region 2: Memory at f5fe0000 (64-bit, prefetchable) [size=4K]
    Region 4: Memory at f5800000 (64-bit, prefetchable) [size=4M]

    And:

    Capabilities: [64] Vendor Specific Information: VirtIO: CommonCfg
    BAR=0 offset=00000888 size=00000078
    Capabilities: [74] Vendor Specific Information: VirtIO: Notify
    BAR=0 offset=00001800 size=00000020 multiplier=00000000
    Capabilities: [88] Vendor Specific Information: VirtIO: ISR
    BAR=0 offset=00000820 size=00000020
    Capabilities: [98] Vendor Specific Information: VirtIO: DeviceCfg
    BAR=0 offset=00000840 size=00000020
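    Those BAR/offset/size tuples are carried in the standard vendor-specific
    capability; for reference, its layout (per virtio 1.0; later spec
    revisions carve an id byte out of the padding):

    #include <linux/types.h>

    struct virtio_pci_cap {
        __u8   cap_vndr;    /* generic PCI field: PCI_CAP_ID_VNDR */
        __u8   cap_next;    /* generic PCI field: next capability offset */
        __u8   cap_len;     /* length of this capability structure */
        __u8   cfg_type;    /* COMMON_CFG, NOTIFY_CFG, ISR_CFG, DEVICE_CFG, ... */
        __u8   bar;         /* which BAR holds the structure (0..5) */
        __u8   padding[3];
        __le32 offset;      /* offset of the structure within the BAR */
        __le32 length;      /* length of the structure, in bytes */
    };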

    >
    > We can draft the spec so that the MMIO region SHOULD be exposed in BAR0.
    >

    Thanks




  • 8.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 17:20
    On Thu, Apr 13, 2023 at 01:14:15PM +0800, Jason Wang wrote:
    > > >>>> The proposed solution in this series enables it and avoid per field sw
    > > >>> interpretation and mediation in parsing values etc.

    ... except for reset, notifications, and maybe more down the road.


    > > >>> I don't think it's possible. See the discussion about ORDER_PLATFORM and
    > > >>> ACCESS_PLATFORM in previous threads.
    > > >>>
    > > >> I have read the previous thread.
    > > >> Hypervisor will be limiting to those platforms where ORDER_PLATFORM is not needed.
    > > >
    > > > So you introduce a bunch of new facilities that only work on some
    > > > specific archs. This breaks the architecture independence of virtio
    > > > since 1.0.
    > > The defined spec for PCI device does not work today for transitional
    > > device for virtualization. Only works in limited PF case.
    > > Hence this update.
    >
    > I fully understand the motivation. I just want to say
    >
    > 1) compare to the MMIO ar BAR0, this proposal doesn't provide much advantages
    > 2) mediate on top of modern devices allows us to not worry about the
    > device design which is hard for legacy

    I begin to think so, too. When I proposed this it looked like just a
    single capability would be enough, without a lot of mess. But it seems
    that addressing this fully is getting more and more complex.
    The one thing we can't do in software is the different header size for
    virtio net. For starters, let's add a capability to address that?
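    For reference, the difference in question is between the 10-byte legacy
    header and the 12-byte header used once VIRTIO_NET_F_MRG_RXBUF or
    VIRTIO_F_VERSION_1 is negotiated; a sketch of the two layouts (struct
    names here are illustrative, the fields match the spec):

    #include <linux/virtio_types.h>

    /* Legacy virtio-net header: 10 bytes. */
    struct vnet_hdr_legacy {
        __u8       flags;
        __u8       gso_type;
        __virtio16 hdr_len;
        __virtio16 gso_size;
        __virtio16 csum_start;
        __virtio16 csum_offset;
    };

    /* With MRG_RXBUF or VERSION_1 negotiated the header grows to 12 bytes;
     * a device that only implements 1.x cannot be told to drop
     * num_buffers, which is what software alone cannot fix. */
    struct vnet_hdr_modern {
        struct vnet_hdr_legacy hdr;
        __virtio16 num_buffers;
    };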

    --
    MST




  • 10.  RE: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 19:39

    > From: Michael S. Tsirkin <mst@redhat.com>
    > Sent: Thursday, April 13, 2023 1:20 PM
    >
    > On Thu, Apr 13, 2023 at 01:14:15PM +0800, Jason Wang wrote:
    > > > >>>> The proposed solution in this series enables it and avoid per
    > > > >>>> field sw
    > > > >>> interpretation and mediation in parsing values etc.
    >
    > ... except for reset, notifications, and maybe more down the road.
    >
    Your AQ proposal addresses reset too.
    Nothing extra is needed for the notifications, as that comes for free from the device side.
    >
    > > > >>> I don't think it's possible. See the discussion about
    > > > >>> ORDER_PLATFORM and ACCESS_PLATFORM in previous threads.
    > > > >>>
    > > > >> I have read the previous thread.
    > > > >> Hypervisor will be limiting to those platforms where ORDER_PLATFORM
    > is not needed.
    > > > >
    > > > > So you introduce a bunch of new facilities that only work on some
    > > > > specific archs. This breaks the architecture independence of
    > > > > virtio since 1.0.
    > > > The defined spec for PCI device does not work today for transitional
    > > > device for virtualization. Only works in limited PF case.
    > > > Hence this update.
    > >
    > > I fully understand the motivation. I just want to say
    > >
    > > 1) compare to the MMIO ar BAR0, this proposal doesn't provide much
    > > advantages
    > > 2) mediate on top of modern devices allows us to not worry about the
    > > device design which is hard for legacy
    >
    > I begin to think so, too. When I proposed this it looked like just a single
    > capability will be enough, without a lot of mess. But it seems that addressing
    > this fully is getting more and more complex.
    > The one thing we can't do in software is different header size for virtio net. For
    > starters, let's add a capability to address that?

    The hdr bit doesn't solve it because the hypervisor is not involved in any trapping of feature bits, cvq or other vqs.
    It is unified code for 1.x and transitional in the hypervisor.

    We have two options to satisfy the requirements
    (partly taken/repeated from Jason's email from yesterday).

    1. AQ (solves reset) + notification region for building a non-transitional device that performs well and is both backward and forward compatible
    Pros:
    a. efficient device reset.
    b. efficient notifications from OS to device
    c. device vendor doesn't need to build transitional configuration space.
    d. works without any mediation in hv for 1.x and non 1.x for all non-legacy interfaces (vqs, config space, cvq, and future features).
    e. can work with non-Linux guest VMs too

    Cons:
    a. More AQ commands work in sw
    b. Does not work for bare metal PFs

    2. Allowing MMIO BAR0 on a transitional device as a SHOULD requirement with a larger BAR size.
    Pros:
    a. Can work with Linux bare-metal and Linux guest VMs as one of the wider uses case
    b. in-efficient device handling for notifications
    c. Works without mediation like 1.d.
    d. Also works without HV mediation.

    Cons:
    a. device reset implementation is very hard for the hw.
    b. requires transitional device to be built.
    c. Notification performance may suffer.

    For Marvell and us #1 works well.
    I am evaluating #2 and will get back.



  • 12.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-14-2023 03:10
    On Fri, Apr 14, 2023 at 3:39 AM Parav Pandit <parav@nvidia.com> wrote:
    >
    >
    > > From: Michael S. Tsirkin <mst@redhat.com>
    > > Sent: Thursday, April 13, 2023 1:20 PM
    > >
    > > On Thu, Apr 13, 2023 at 01:14:15PM +0800, Jason Wang wrote:
    > > > > >>>> The proposed solution in this series enables it and avoid per
    > > > > >>>> field sw
    > > > > >>> interpretation and mediation in parsing values etc.
    > >
    > > ... except for reset, notifications, and maybe more down the road.
    > >
    > Your AQ proposal addresses reset too.
    > Nothing extra for the notifications, as comes for free from the device side.
    > >
    > > > > >>> I don't think it's possible. See the discussion about
    > > > > >>> ORDER_PLATFORM and ACCESS_PLATFORM in previous threads.
    > > > > >>>
    > > > > >> I have read the previous thread.
    > > > > >> Hypervisor will be limiting to those platforms where ORDER_PLATFORM
    > > is not needed.
    > > > > >
    > > > > > So you introduce a bunch of new facilities that only work on some
    > > > > > specific archs. This breaks the architecture independence of
    > > > > > virtio since 1.0.
    > > > > The defined spec for PCI device does not work today for transitional
    > > > > device for virtualization. Only works in limited PF case.
    > > > > Hence this update.
    > > >
    > > > I fully understand the motivation. I just want to say
    > > >
    > > > 1) compare to the MMIO ar BAR0, this proposal doesn't provide much
    > > > advantages
    > > > 2) mediate on top of modern devices allows us to not worry about the
    > > > device design which is hard for legacy
    > >
    > > I begin to think so, too. When I proposed this it looked like just a single
    > > capability will be enough, without a lot of mess. But it seems that addressing
    > > this fully is getting more and more complex.
    > > The one thing we can't do in software is different header size for virtio net. For
    > > starters, let's add a capability to address that?
    >
    > Hdr bit doesn't solve it because hypervisor is not involved in any trapping of feature bits, cvq or other vqs.
    > It is unified code for 1.x and transitional in hypervisor.
    >
    > We have two options to satisfy the requirements.
    > (partly taken/repeated from Jason's yday email).
    >
    > 1. AQ (solves reset) + notification for building non transitional device that support perform well and it is both backward and forward compat
    > Pros:
    > a. efficient device reset.
    > b. efficient notifications from OS to device
    > c. device vendor doesn't need to build transitional configuration space.
    > d. works without any mediation in hv for 1.x and non 1.x for all non-legacy interfaces (vqs, config space, cvq, and future features).

    Without mediation, how could you forward guest config accesses to the admin
    virtqueue? Or do you mean:

    1) the hypervisor mediates for legacy
    2) otherwise the modern BARs are assigned to the guest

    For 2) as we discussed, we can't have such an assumption as

    1) the spec doesn't enforce the size of a specific structure
    2) it is vendor locked and thus a blocker for live migration, as it mandates
    the layout for the guest; a mediation layer is a must in this case to
    maintain cross-vendor compatibility

    Hypervisor needs to start from a mediation method and do BAR
    assignment only when possible.

    > e. can work with non-Linux guest VMs too
    >
    > Cons:
    > a. More AQ commands work in sw

    Note that this needs to be done on top of the transport virtqueue. And
    we need to carefully design the command sets since they could be
    mutually exclusive.

    Thanks


    > b. Does not work for bare metal PFs
    >
    > 2. Allowing MMIO BAR0 on transitional device as SHOULD requirement with larger BAR size.
    > Pros:
    > a. Can work with Linux bare-metal and Linux guest VMs as one of the wider uses case
    > b. in-efficient device handling for notifications
    > c. Works without mediation like 1.d.
    > d. Also works without HV mediation.
    >
    > Cons:
    > a. device reset implementation is very for the hw.
    > b. requires transitional device to be built.
    > c. Notification performance may suffer.
    >
    > For Marvell and us #1 works well.
    > I am evaluating #2 and get back.
    >




  • 14.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 17:24


    On 4/13/2023 1:14 AM, Jason Wang wrote:

    >> For LM BAR length and number should be same between two PCI VFs. But its
    >> orthogonal to this point. Such checks will be done anyway.
    >
    > Quoted the wrong sections, I think it should be:
    >
    > "
    > length MAY include padding, or fields unused by the driver, or future
    > extensions. Note: For example, a future device might present a large
    > structure size of several MBytes. As current devices never utilize
    > structures larger than 4KBytes in size, driver MAY limit the mapped
    > structure size to e.g. 4KBytes (thus ignoring parts of structure after
    > the first 4KBytes) to allow forward compatibility with such devices
    > without loss of functionality and without wasting resources.
    > "
    yes. This is the one.

    > If it's a transitional device but not placed at BAR0, it might have
    > side effects for Linux drivers which assumes BAR0 for legacy.
    >
    True. Transitional can be at BAR0.

    > I don't see how easy it could be a non transitional device:
    >
    > "
    > Devices or drivers with no legacy compatibility are referred to as
    > non-transitional devices and drivers, respectively.
    > "
    Michael has suggested rewording of the text.
    It is new text anyway, so let's park it aside for now.
    It is mostly a matter of tweaking the text.

    >
    >> device can expose this optional capability and its attached MMIO region.
    >>
    >> Spec changes are similar to #2.
    >>> - non trivial spec changes which ends up of the tricky cases that
    >>> tries to workaround legacy to fit for a hardware implementation
    >>> - work only for the case of virtualization with the help of
    >>> meditation, can't work for bare metal
    >> For bare-metal PFs usually thin hypervisors are used that does very
    >> minimal setup. But I agree that bare-metal is relatively less important.
    >
    > This is not what I understand. I know several vendors that are using
    > virtio devices for bare metal.
    >
    I was saying the case for legacy bare metal is less of a problem because
    PCIe does not limit functionality; perf is still limited due to the IOBAR.

    >>
    >>> - only work for some specific archs without SVQ
    >>>
    >> That is the legacy limitation that we don't worry about.
    >>
    >>> 2) allow BAR0 to be MMIO for transitional device
    >>>
    >>> Pros:
    >>> - very minor change for the spec
    >> Spec changes wise they are similar to #1.
    >
    > This is different since the changes for this are trivial.
    >
    >>> - work for virtualization (and it work even without dedicated
    >>> mediation for some setups)
    >> I am not aware where can it work without mediation. Do you know any
    >> specific kernel version where it actually works?
    >
    > E.g current Linux driver did:
    >
    > rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
    >
    > It doesn't differ from I/O with memory. It means if you had a
    > "transitional" device with legacy MMIO BAR0, it just works.
    >

    Thanks to the abstract PCI API in Linux.

    >>> - work for bare metal for some setups (without mediation)
    >>> Cons:
    >>> - only work for some specific archs without SVQ
    >>> - BAR0 is required
    >>>
    >> Both are not limitation as they are mainly coming from the legacy side
    >> of things.
    >>
    >>> 3) modern device mediation for legacy
    >>>
    >>> Pros:
    >>> - no changes in the spec
    >>> Cons:
    >>> - require mediation layer in order to work in bare metal
    >>> - require datapath mediation like SVQ to work for virtualization
    >>>
    >> Spec change is still require for net and blk because modern device do
    >> not understand legacy, even with mediation layer.
    >
    > That's fine and easy since we work on top of modern devices.
    >
    >> FEATURE_1, RW cap via CVQ which is not really owned by the hypervisor.
    >
    > Hypervisors can trap if they wish.
    >
    Trapping non-legacy accesses for 1.x doesn't make sense.

    >> A guest may be legacy or non legacy, so mediation shouldn't be always done.
    >
    > Yes, so mediation can work only if we found it's a legacy driver.
    >
    Mediation will be done only for legacy accesses, without cvq; the rest will
    go as-is without any cvq or other mediation.

    >>
    >>> Compared to method 2) the only advantages of method 1) is the
    >>> flexibility of BAR0 but it has too many disadvantages. If we only care
    >>> about virtualization, modern devices are sufficient. Then why bother
    >>> for that?
    >>
    >> So that a single stack which doesn't always have the knowledge of which
    >> driver version is running is guest can utilize it. Otherwise 1.x also
    >> end up doing mediation when guest driver = 1.x and device = transitional
    >> PCI VF.
    >
    > I don't see how this can be solved in your proposal.
    >
    This proposal only traps the legacy accesses and doesn't require another
    giant framework.

    I think we can make BAR0 work for transitional with a spec change and
    with an optional notification region.
    I am evaluating further.

    >>
    >> so (1) and (2) both are equivalent, one is more flexible, if you know
    >> more valid cases where BAR0 as MMIO can work as_is, such option is open.
    >
    > As said in previous threads, this has been used by several vendors for years.
    >
    > E.g I have a handy transitional hardware virtio device that has:
    >
    > Region 0: Memory at f5ff0000 (64-bit, prefetchable) [size=8K]
    > Region 2: Memory at f5fe0000 (64-bit, prefetchable) [size=4K]
    > Region 4: Memory at f5800000 (64-bit, prefetchable) [size=4M]
    >
    > And:
    >
    > Capabilities: [64] Vendor Specific Information: VirtIO: CommonCfg
    > BAR=0 offset=00000888 size=00000078
    > Capabilities: [74] Vendor Specific Information: VirtIO: Notify
    > BAR=0 offset=00001800 size=00000020 multiplier=00000000
    > Capabilities: [88] Vendor Specific Information: VirtIO: ISR
    > BAR=0 offset=00000820 size=00000020
    > Capabilities: [98] Vendor Specific Information: VirtIO: DeviceCfg
    > BAR=0 offset=00000840 size=00000020
    >
    >>
    >> We can draft the spec that MMIO BAR SHOULD be exposes in BAR0.
    >>
    Yes, above one.



  • 16.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 21:02
    On Thu, Apr 13, 2023 at 01:24:24PM -0400, Parav Pandit wrote:
    > > > > - work for virtualization (and it work even without dedicated
    > > > > mediation for some setups)
    > > > I am not aware where can it work without mediation. Do you know any
    > > > specific kernel version where it actually works?
    > >
    > > E.g current Linux driver did:
    > >
    > > rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
    > >
    > > It doesn't differ from I/O with memory. It means if you had a
    > > "transitional" device with legacy MMIO BAR0, it just works.
    > >
    >
    > Thanks to the abstract PCI API in Linux.

    Right. I do however at least see the point of what Jason is proposing,
    which is to enable some legacy guests without mediation in software.

    This thing ... you move some code to the card and reduce the amount of
    virtio knowledge in software but do not eliminate it completely.
    Seems kind of pointless. Minimal hardware changes make more sense to
    me, I'd say. Talking about that, what is a minimal hardware change
    to allow a vdpa based solution?
    I think that's VIRTIO_NET_F_LEGACY_HEADER, right?
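    (VIRTIO_NET_F_LEGACY_HEADER is only a proposed bit at this point; a
    purely hypothetical sketch of how a vDPA parent could key off it, with
    a placeholder bit number:)

    #include <linux/types.h>

    /* Hypothetical: not in the spec. A device offering this bit would
     * accept the 10-byte legacy header directly, so no datapath mediation
     * (e.g. shadow virtqueue) would be needed. */
    #define VIRTIO_NET_F_LEGACY_HEADER_BIT 56 /* placeholder */

    static inline unsigned int vnet_hdr_size(__u64 negotiated_features)
    {
        if (negotiated_features & (1ULL << VIRTIO_NET_F_LEGACY_HEADER_BIT))
            return 10; /* legacy header, no num_buffers */
        return 12;     /* modern header with num_buffers */
    }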

    --
    MST




  • 18.  RE: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-13-2023 21:08


    > From: Michael S. Tsirkin <mst@redhat.com>
    > Sent: Thursday, April 13, 2023 5:02 PM

    > This thing ... you move some code to the card and reduce the amount of virtio
    > knowledge in software but do not eliminate it completely.
    Sure. Practically there is no knowledge other than transporting, like vxlan encapsulation; here it's the AQ.
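    The encapsulation would be an admin queue command carrying the legacy
    register access; a purely hypothetical sketch of such a command (no
    such command is defined in the spec at this point, field names are
    illustrative):

    #include <linux/types.h>

    /* Hypothetical: the hypervisor wraps the guest's legacy register
     * access and sends it to the owner device over the AQ, much like a
     * tunnel encapsulates an inner packet. */
    struct aq_legacy_reg_access {
        __le16 opcode;      /* e.g. legacy common-config read or write */
        __le16 vf_number;   /* which member VF the access targets */
        __le32 offset;      /* offset within the legacy register layout */
        __le32 length;      /* access size in bytes */
        __u8   data[];      /* write payload; empty for a read */
    };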

    > Seems kind of pointless. Minimal hardware changes make more sense to me, I'd
    > say. Talking about that, what is a minimal hardware change to allow a vdpa
    > based solution?
    > I think that's VIRTIO_NET_F_LEGACY_HEADER, right?

    The main requirement/point is that there is a virtio PCI VF to be mapped to the guest VM.
    Hence, no vdpa-type hypervisor layer exists in this use case.

    This same VF needs to be transitional as the guest kernel may not be known;
    hence "sometimes vdpa, sometimes a regular 1.x VF" is not an option.
    Hence for the next few years, a transitional VF will be plugged into the guest VM when the user is using PCI VF devices.



  • 20.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-14-2023 02:37
    On Fri, Apr 14, 2023 at 5:08 AM Parav Pandit <parav@nvidia.com> wrote:
    >
    >
    >
    > > From: Michael S. Tsirkin <mst@redhat.com>
    > > Sent: Thursday, April 13, 2023 5:02 PM
    >
    > > This thing ... you move some code to the card and reduce the amount of virtio
    > > knowledge in software but do not eliminate it completely.
    > Sure. Practically there is no knowledge other than transporting like a vxlan encapsulating, here its AQ.
    >
    > > Seems kind of pointless. Minimal hardware changes make more sense to me, I'd
    > > say. Talking about that, what is a minimal hardware change to allow a vdpa
    > > based solution?
    > > I think that's VIRTIO_NET_F_LEGACY_HEADER, right?

    I think it is. It would be much easier if we do this.

    >
    > The main requirement/point is that there is virito PCI VF that to be mapped to the guest VM.
    > Hence, there is no vdpa type of hypervisor layer exists in this use case.

    It's about mediation which is a must for things like legacy. If
    there's a way that helps the vendor to get rid of the tricky legacy
    completely, then why not?

    >
    > This same VF need to be transitional as the guest kernel may not be known.
    > hence sometimes vdpa sometime regular 1.x VF is not an option.
    > Hence for next few years, transitional VF will be plugged in into the guest VM when user is using the PCI VF devices.

    I'm not sure I get this. With VIRTIO_NET_F_LEGACY_HEADER, we don't
    need mediation for the datapath. For the control path, mediation is a must
    for legacy and it's very easy to keep it working for modern; what's wrong
    with that?

    Thanks





  • 22.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-14-2023 06:59
    On Fri, Apr 14, 2023 at 10:36:52AM +0800, Jason Wang wrote:
    > On Fri, Apr 14, 2023 at 5:08 AM Parav Pandit <parav@nvidia.com> wrote:
    > >
    > >
    > >
    > > > From: Michael S. Tsirkin <mst@redhat.com>
    > > > Sent: Thursday, April 13, 2023 5:02 PM
    > >
    > > > This thing ... you move some code to the card and reduce the amount of virtio
    > > > knowledge in software but do not eliminate it completely.
    > > Sure. Practically there is no knowledge other than transporting like a vxlan encapsulating, here its AQ.
    > >
    > > > Seems kind of pointless. Minimal hardware changes make more sense to me, I'd
    > > > say. Talking about that, what is a minimal hardware change to allow a vdpa
    > > > based solution?
    > > > I think that's VIRTIO_NET_F_LEGACY_HEADER, right?
    >
    > I think it is. It would be much easier if we do this.

    It does not look like Parav is interested in this approach but
    if you like it feel free to propose it.

    > >
    > > The main requirement/point is that there is virito PCI VF that to be mapped to the guest VM.
    > > Hence, there is no vdpa type of hypervisor layer exists in this use case.
    >
    > It's about mediation which is a must for things like legacy. If
    > there's a way that helps the vendor to get rid of the tricky legacy
    > completely, then why not?
    >
    > >
    > > This same VF need to be transitional as the guest kernel may not be known.
    > > hence sometimes vdpa sometime regular 1.x VF is not an option.
    > > Hence for next few years, transitional VF will be plugged in into the guest VM when user is using the PCI VF devices.
    >
    > I'm not sure I get this. With VIRTIO_NET_F_LEGACY_HEADER, we don't
    > need mediation for datapath. For the control path, mediation is a must
    > for legacy and it's very easy to keep it work for modern, what's wrong
    > with that?
    >
    > Thanks
    >




  • 24.  Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers

    Posted 04-14-2023 03:09
    On Fri, Apr 14, 2023 at 1:24 AM Parav Pandit <parav@nvidia.com> wrote:
    >
    >
    >
    > On 4/13/2023 1:14 AM, Jason Wang wrote:
    >
    > >> For LM BAR length and number should be same between two PCI VFs. But its
    > >> orthogonal to this point. Such checks will be done anyway.
    > >
    > > Quoted the wrong sections, I think it should be:
    > >
    > > "
    > > length MAY include padding, or fields unused by the driver, or future
    > > extensions. Note: For example, a future device might present a large
    > > structure size of several MBytes. As current devices never utilize
    > > structures larger than 4KBytes in size, driver MAY limit the mapped
    > > structure size to e.g. 4KBytes (thus ignoring parts of structure after
    > > the first 4KBytes) to allow forward compatibility with such devices
    > > without loss of functionality and without wasting resources.
    > > "
    > yes. This is the one.
    >
    > > If it's a transitional device but not placed at BAR0, it might have
    > > side effects for Linux drivers which assumes BAR0 for legacy.
    > >
    > True. Transitional can be at BAR0.
    >
    > > I don't see how easy it could be a non transitional device:
    > >
    > > "
    > > Devices or drivers with no legacy compatibility are referred to as
    > > non-transitional devices and drivers, respectively.
    > > "
    > Michael has suggested rewording of the text.
    > It is anyway new text so lets park it aside for now.
    > It is mostly tweaking the text.
    >
    > >
    > >> device can expose this optional capability and its attached MMIO region.
    > >>
    > >> Spec changes are similar to #2.
    > >>> - non trivial spec changes which ends up of the tricky cases that
    > >>> tries to workaround legacy to fit for a hardware implementation
    > >>> - work only for the case of virtualization with the help of
    > >>> meditation, can't work for bare metal
    > >> For bare-metal PFs usually thin hypervisors are used that does very
    > >> minimal setup. But I agree that bare-metal is relatively less important.
    > >
    > > This is not what I understand. I know several vendors that are using
    > > virtio devices for bare metal.
    > >
    > I was saying the case for legacy bare metal is less of a problem because
    > PCIe does not limit functionality, perf is still limited due to IOBAR.
    >
    > >>
    > >>> - only work for some specific archs without SVQ
    > >>>
    > >> That is the legacy limitation that we don't worry about.
    > >>
    > >>> 2) allow BAR0 to be MMIO for transitional device
    > >>>
    > >>> Pros:
    > >>> - very minor change for the spec
    > >> Spec changes wise they are similar to #1.
    > >
    > > This is different since the changes for this are trivial.
    > >
    > >>> - work for virtualization (and it work even without dedicated
    > >>> mediation for some setups)
    > >> I am not aware where can it work without mediation. Do you know any
    > >> specific kernel version where it actually works?
    > >
    > > E.g current Linux driver did:
    > >
    > > rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
    > >
    > > It doesn't differ from I/O with memory. It means if you had a
    > > "transitional" device with legacy MMIO BAR0, it just works.
    > >
    >
    > Thanks to the abstract PCI API in Linux.

    And this (legacy MMIO bar) has been supported by DPDK as well for a while.

    >
    > >>> - work for bare metal for some setups (without mediation)
    > >>> Cons:
    > >>> - only work for some specific archs without SVQ
    > >>> - BAR0 is required
    > >>>
    > >> Both are not limitation as they are mainly coming from the legacy side
    > >> of things.
    > >>
    > >>> 3) modern device mediation for legacy
    > >>>
    > >>> Pros:
    > >>> - no changes in the spec
    > >>> Cons:
    > >>> - require mediation layer in order to work in bare metal
    > >>> - require datapath mediation like SVQ to work for virtualization
    > >>>
    > >> A spec change is still required for net and blk because modern devices do
    > >> not understand legacy, even with a mediation layer.
    > >
    > > That's fine and easy since we work on top of modern devices.
    > >
    > >> FEATURE_1, RW cap via CVQ which is not really owned by the hypervisor.
    > >
    > > Hypervisors can trap if they wish.
    > >
    > Trapping non-legacy accesses for 1.x doesn't make sense.

    Actually not; I think I've mentioned the reasons several times:

    It's a must for ABI and migration compatibility (a small alignment sketch
    follows after this message):

    1) The offset is not necessarily on a page boundary
    2) The length is not necessarily a multiple of PAGE_SIZE
    3) PAGE_SIZE varies among different archs
    4) Two vendors may have two different layouts for the structure

    Thanks
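    A minimal sketch of the alignment concern in points 1)-3) above; the function
    name is hypothetical and only the host page-size check is shown. A hypervisor
    can map a structure straight into a guest only when its BAR offset and length
    are page-granular on the host (and the host page size itself differs between
    archs); otherwise every access to that structure has to be trapped:

        #include <stdbool.h>
        #include <stdint.h>
        #include <unistd.h>

        /* Can this capability's region be passed through by mapping alone,
         * or does the hypervisor have to trap accesses to it? */
        static bool can_map_without_trapping(uint64_t bar_offset, uint64_t length)
        {
                long page = sysconf(_SC_PAGESIZE);      /* differs between archs */

                return (bar_offset % page) == 0 && (length % page) == 0;
        }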



