OASIS Virtual I/O Device (VIRTIO) TC

Problems with VIRTIO-4 and writeback only disks

  • 1.  Problems with VIRTIO-4 and writeback only disks

    Posted 09-10-2013 17:27
    [resending to virtio-comment; it looks like I'm not subscribed to
    virtio-dev ... how do you subscribe?]

    Sorry, I don't have a copy of the original email to reply to:

    https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html

    The part that concerns me is this:

    > +5. The cache mode should be read from the writeback field of the configuration
    > + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    > + write to the field in order to toggle the cache between writethrough (0)
    > + and writeback (1) mode.
    > + If the feature is not available, the driver can instead look at the result
    > + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    > + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
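
    (A minimal sketch of the rule quoted above, not taken from any driver.
    The bit numbers 9 and 11 are the ones listed in the spec patch later in
    this thread; the function name, the 64-bit feature mask and the
    config_writeback parameter are assumptions made for illustration.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as given in the spec patch posted later in this thread. */
#define VIRTIO_BLK_F_WCE        9   /* also called VIRTIO_BLK_F_FLUSH */
#define VIRTIO_BLK_F_CONFIG_WCE 11

enum cache_mode { CACHE_WRITETHROUGH = 0, CACHE_WRITEBACK = 1 };

/* Hypothetical helper: test one bit in a negotiated-feature mask. */
static bool has_feature(uint64_t features, unsigned bit)
{
    assert(bit < 64);
    return ((features >> bit) & 1u) != 0;
}

/*
 * Cache mode right after reset, following the quoted rule.
 * `negotiated` is the mask of feature bits both sides accepted;
 * `config_writeback` is the value read from the config "writeback" field
 * and is only consulted when VIRTIO_BLK_F_CONFIG_WCE was negotiated.
 */
enum cache_mode cache_mode_after_reset(uint64_t negotiated,
                                       uint8_t config_writeback)
{
    if (has_feature(negotiated, VIRTIO_BLK_F_CONFIG_WCE))
        return config_writeback ? CACHE_WRITEBACK : CACHE_WRITETHROUGH;

    /* No config field: writeback if and only if WCE was negotiated. */
    return has_feature(negotiated, VIRTIO_BLK_F_WCE) ? CACHE_WRITEBACK
                                                     : CACHE_WRITETHROUGH;
}
```

    With neither bit negotiated this reports writethrough, which is the
    case a writeback-only backend cannot honour without extra work on the
    host.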

    The questions are twofold and have to do with Write Back only disks (to
    date we've seen quite a few ATA devices like this and a huge number of
    USB devices):

    1. If the guest doesn't negotiate WCE, what do you do on the host?
    (Flushing on every write is one possible option; running unsafe and
    hoping the host doesn't crash is another.)
    2. If the guest asks to toggle the device from writeback (1) to
    writethrough (0) mode, what do you do? Refusing the toggle would
    be reasonable; flipping back into whatever mode you were using to
    handle 1. is also possible.

    James





  • 2.  Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-10-2013 17:49
    Redirecting discussion to virtio-dev@lists.oasis-open.org
    as per oasis rules.

    On Tue, Sep 10, 2013 at 05:27:26PM +0000, James Bottomley wrote:
    > [resending to virtio-comment; it looks like I'm not subscribed to
    > virtio-dev ... how do you subscribe?]
    >
    > Sorry, I don't have a copy of the original email to reply to:
    >
    > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >
    > The part that concerns me is this:
    >
    > > +5. The cache mode should be read from the writeback field of the configuration
    > > + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    > > + write to the field in order to toggle the cache between writethrough (0)
    > > + and writeback (1) mode.
    > > + If the feature is not available, the driver can instead look at the result
    > > + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    > > + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    >
    > The questions are twofold and have to do with Write Back only disks (to
    > date we've seen quite a few ATA devices like this and a huge number of
    > USB devices):
    >
    > 1. If the guest doesn't negotiate WCE, what do you do on the host
    > (flush on every write is one possible option; run unsafe and
    > hope the host doesn't crash is another).
    > 2. If the guest asks to toggle the device from writeback (1) to
    > writethrough (0) mode, what do you do? Refuse the toggle would
    > be reasonable or flip back into whatever mode you were using to
    > handle 1. is also possible.
    >
    > James


    Basically if I understand what you are saying, we should
    just require devices to make WCE state always present in config space,
    and ask drivers to always read the value from config space
    instead of assuming tricky logic "the cache will be in writeback mode
    after reset if and only if VIRTIO_BLK_F_WCE is negotiated"





  • 3.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-16-2013 07:20
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > Redirecting discussion to virtio-dev@lists.oasis-open.org
    > as per oasis rules.
    >
    > On Tue, Sep 10, 2013 at 05:27:26PM +0000, James Bottomley wrote:
    >> [resending to virtio-comment; it looks like I'm not subscribed to
    >> virtio-dev ... how do you subscribe?]
    >>
    >> Sorry, I don't have a copy of the original email to reply to:
    >>
    >> https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >>
    >> The part that concerns me is this:
    >>
    >> > +5. The cache mode should be read from the writeback field of the configuration
    >> > + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    >> > + write to the field in order to toggle the cache between writethrough (0)
    >> > + and writeback (1) mode.
    >> > + If the feature is not available, the driver can instead look at the result
    >> > + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    >> > + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    >>
    >> The questions are twofold and have to do with Write Back only disks (to
    >> date we've seen quite a few ATA devices like this and a huge number of
    >> USB devices):
    >>
    >> 1. If the guest doesn't negotiate WCE, what do you do on the host
    >> (flush on every write is one possible option; run unsafe and
    >> hope the host doesn't crash is another).
    >> 2. If the guest asks to toggle the device from writeback (1) to
    >> writethrough (0) mode, what do you do? Refuse the toggle would
    >> be reasonable or flip back into whatever mode you were using to
    >> handle 1. is also possible.
    >>
    >> James
    >
    >
    > Basically if I understand what you are saying, we should
    > just require devices to make WCE state always present in config space,
    > and ask drivers to always read the value from config space
    > instead of assuming tricky logic "the cache will be in writeback mode
    > after reset if and only if VIRTIO_BLK_F_WCE is negotiated"

    I think we should do the reverse:
    1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    writeback). Otherwise it's write-through.
    2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.

    But I'm not convinced that #2 should exist at all. Does someone have a
    real use case?

    So the advice would be as follows:
    Driver authors:
    If you want simplicity, ignore VIRTIO_BLK_F_WCE.
    If you want performance, you should negotiate VIRTIO_BLK_F_WCE.
    And if you also want to expose a toggle (eg. for testing), you
    can negotiate VIRTIO_BLK_F_CONFIG_WCE.
    Device authors:
    If you want simplicity, don't offer VIRTIO_BLK_F_WCE.
    If you want performance, offer VIRTIO_BLK_F_WCE.
    And if you're masochistic, offer VIRTIO_BLK_F_CONFIG_WCE
    so they can toggle.
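
    (As a rough sketch of the driver-side advice above, assuming feature
    sets are carried as a 64-bit mask. The function name and the two
    "want" flags are hypothetical; only the dependency of
    VIRTIO_BLK_F_CONFIG_WCE on VIRTIO_BLK_F_WCE comes from the text.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_BLK_F_WCE        9
#define VIRTIO_BLK_F_CONFIG_WCE 11
#define FEATURE_BIT(n) (UINT64_C(1) << (n))

/*
 * Pick the cache-related features a driver accepts from what the device
 * offers. want_writeback: the driver can send flushes and wants the
 * performance. want_toggle: the driver also wants to expose a toggle.
 */
uint64_t driver_select_cache_features(uint64_t offered,
                                      bool want_writeback,
                                      bool want_toggle)
{
    uint64_t accepted = 0;

    if (want_writeback && (offered & FEATURE_BIT(VIRTIO_BLK_F_WCE))) {
        accepted |= FEATURE_BIT(VIRTIO_BLK_F_WCE);

        /* The toggle is only meaningful once WCE itself is accepted. */
        if (want_toggle && (offered & FEATURE_BIT(VIRTIO_BLK_F_CONFIG_WCE)))
            accepted |= FEATURE_BIT(VIRTIO_BLK_F_CONFIG_WCE);
    }

    /* A driver must never accept a bit the device did not offer. */
    assert((accepted & ~offered) == 0);
    return accepted;
}
```

    A "simple" driver in the sense above calls this with both flags false
    and gets 0 back, i.e. write-through semantics and no flush commands.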

    Cheers,
    Rusty.




  • 4.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-16-2013 08:01
    On Mon, Sep 16, 2013 at 04:50:14PM +0930, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > Redirecting discussion to virtio-dev@lists.oasis-open.org
    > > as per oasis rules.
    > >
    > > On Tue, Sep 10, 2013 at 05:27:26PM +0000, James Bottomley wrote:
    > >> [resending to virtio-comment; it looks like I'm not subscribed to
    > >> virtio-dev ... how do you subscribe?]
    > >>
    > >> Sorry, I don't have a copy of the original email to reply to:
    > >>
    > >> https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    > >>
    > >> The part that concerns me is this:
    > >>
    > >> > +5. The cache mode should be read from the writeback field of the configuration
    > >> > + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    > >> > + write to the field in order to toggle the cache between writethrough (0)
    > >> > + and writeback (1) mode.
    > >> > + If the feature is not available, the driver can instead look at the result
    > >> > + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    > >> > + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    > >>
    > >> The questions are twofold and have to do with Write Back only disks (to
    > >> date we've seen quite a few ATA devices like this and a huge number of
    > >> USB devices):
    > >>
    > >> 1. If the guest doesn't negotiate WCE, what do you do on the host
    > >> (flush on every write is one possible option; run unsafe and
    > >> hope the host doesn't crash is another).
    > >> 2. If the guest asks to toggle the device from writeback (1) to
    > >> writethrough (0) mode, what do you do? Refuse the toggle would
    > >> be reasonable or flip back into whatever mode you were using to
    > >> handle 1. is also possible.
    > >>
    > >> James
    > >
    > >
    > > Basically if I understand what you are saying, we should
    > > just require devices to make WCE state always present in config space,
    > > and ask drivers to always read the value from config space
    > > instead of assuming tricky logic "the cache will be in writeback mode
    > > after reset if and only if VIRTIO_BLK_F_WCE is negotiated"
    >
    > I think we should do the reverse:
    > 1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    > writeback). Otherwise it's write-through.
    > 2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    > VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.
    >
    > But I'm not convinced that #2 should exist at all. Does someone have a
    > real use case?
    >
    > So the advice would be as follows:
    > Driver authors:
    > If you want simplicity, ignore VIRTIO_BLK_F_WCE.
    > If you want performance, you should negotiate VIRTIO_BLK_F_WCE.
    > And if you also want to expose a toggle (eg. for testing), you
    > can negotiate VIRTIO_BLK_F_CONFIG_WCE.
    > Device authors:
    > If you want simplicity, don't offer VIRTIO_BLK_F_WCE.
    > If you want performance, offer VIRTIO_BLK_F_WCE.
    > And if you're masochistic, offer VIRTIO_BLK_F_CONFIG_WCE
    > so they can toggle.
    >
    > Cheers,
    > Rusty.

    Okay so IIUC with hosts which use WCE storage as the back-end, not using
    WCE will basically require host to flush after each transaction.
    If we resolve VIRTIO-30, hosts will be able to make WCE mandatory if
    they wish.
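
    (A hedged sketch of what "flush after each transaction" could look like
    on a POSIX host with a file-backed disk. backend_complete_write and its
    parameters are hypothetical names; pwrite() and fsync() are the standard
    POSIX calls. EINTR handling and asynchronous I/O are left out.)

```c
#define _POSIX_C_SOURCE 200809L  /* for pwrite() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one guest request to the backing file and return 0 only when it
 * is safe to report completion to the guest, -1 on failure.
 * If WCE was not negotiated the guest will never send a flush, so the
 * host has to push the data out itself before completing the write.
 */
int backend_complete_write(int fd, const void *buf, size_t len, off_t off,
                           bool wce_negotiated)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;          /* error, or no progress: give up */
        p += n;
        len -= (size_t)n;
        off += n;
    }

    /* Write-through emulation on a writeback backend: sync every write. */
    if (!wce_negotiated && fsync(fd) != 0)
        return -1;

    return 0;
}
```

    The fsync() per request is where the cost of not negotiating WCE shows
    up on such hosts.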

    --
    MST



  • 5.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-16-2013 12:42
    On Mon, 2013-09-16 at 16:50 +0930, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > Redirecting discussion to virtio-dev@lists.oasis-open.org
    > > as per oasis rules.
    > >
    > > On Tue, Sep 10, 2013 at 05:27:26PM +0000, James Bottomley wrote:
    > >> [resending to virtio-comment; it looks like I'm not subscribed to
    > >> virtio-dev ... how do you subscribe?]
    > >>
    > >> Sorry, I don't have a copy of the original email to reply to:
    > >>
    > >> https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    > >>
    > >> The part that concerns me is this:
    > >>
    > >> > +5. The cache mode should be read from the writeback field of the configuration
    > >> > + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    > >> > + write to the field in order to toggle the cache between writethrough (0)
    > >> > + and writeback (1) mode.
    > >> > + If the feature is not available, the driver can instead look at the result
    > >> > + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    > >> > + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    > >>
    > >> The questions are twofold and have to do with Write Back only disks (to
    > >> date we've seen quite a few ATA devices like this and a huge number of
    > >> USB devices):
    > >>
    > >> 1. If the guest doesn't negotiate WCE, what do you do on the host
    > >> (flush on every write is one possible option; run unsafe and
    > >> hope the host doesn't crash is another).
    > >> 2. If the guest asks to toggle the device from writeback (1) to
    > >> writethrough (0) mode, what do you do? Refuse the toggle would
    > >> be reasonable or flip back into whatever mode you were using to
    > >> handle 1. is also possible.
    > >>
    > >> James
    > >
    > >
    > > Basically if I understand what you are saying, we should
    > > just require devices to make WCE state always present in config space,
    > > and ask drivers to always read the value from config space
    > > instead of assuming tricky logic "the cache will be in writeback mode
    > > after reset if and only if VIRTIO_BLK_F_WCE is negotiated"
    >
    > I think we should do the reverse:
    > 1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    > writeback). Otherwise it's write-through.
    > 2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    > VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.
    >
    > But I'm not convinced that #2 should exist at all. Does someone have a
    > real use case?

    There is no use case in Linux. We provide knobs to toggle the cache
    type in the SCSI disk class, but the kernel doesn't do it by default, it
    just runs with the cache type because the toggle can be unreliable.

    James





  • 6.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-17-2013 05:34
    James Bottomley <jbottomley@parallels.com> writes:
    > On Mon, 2013-09-16 at 16:50 +0930, Rusty Russell wrote:
    >> I think we should do the reverse:
    >> 1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    >> writeback). Otherwise it's write-through.
    >> 2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    >> VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.
    >>
    >> But I'm not convinced that #2 should exist at all. Does someone have a
    >> real use case?
    >
    > There is no use case in Linux. We provide knobs to toggle the cache
    > type in the SCSI disk class, but the kernel doesn't do it by default, it
    > just runs with the cache type because the toggle can be unreliable.

    OK. How about this as a strawman:

    1) Kill VIRTIO_BLK_F_CONFIG_WCE as a bad idea. You either negotiate
    WCE or not during device setup; no toggling.

    2) If device offers VIRTIO_BLK_F_WCE, and driver negotiates it:
    - Completed writes should be persistent if guest crashes.
    - No ordering guarantees are given except where flush command sent.
    - Flush SHOULD push all preceding writes to permanent storage.
    (in the RFC SHOULD sense, which is weaker than MUST).

    3) If device does not offer VIRTIO_BLK_F_WCE, or driver doesn't
    negotiate it:
    - Completed writes should be persistent if guest crashes.
    - No flush commands are supported.
    - No guarantee about writes hitting permanent storage.

    This pretty neatly divides it into complex and simple cases. If you
    want more fine-grained, you know where to find virtio-scsi...
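
    (One possible device-side reading of cases 2 and 3, as a sketch only.
    Answering a flush with VIRTIO_BLK_S_UNSUPP when WCE was not negotiated
    is an assumption; the strawman only says flush commands are not
    supported. The status values are those of the virtio block spec; the
    callback type and the two stand-in backends are hypothetical.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BLK_F_WCE    9
#define VIRTIO_BLK_S_OK     0
#define VIRTIO_BLK_S_IOERR  1
#define VIRTIO_BLK_S_UNSUPP 2

/* Hypothetical backend hook: push cached writes out, return 0 on success. */
typedef int (*backend_sync_fn)(void *ctx);

/* Status byte the device would return for a flush request. */
uint8_t device_handle_flush(uint64_t negotiated, backend_sync_fn sync,
                            void *ctx)
{
    /* Case 3: WCE not negotiated, so flush is not part of the contract. */
    if (!(negotiated & (UINT64_C(1) << VIRTIO_BLK_F_WCE)))
        return VIRTIO_BLK_S_UNSUPP;

    /* Case 2: flush SHOULD push preceding writes to permanent storage. */
    assert(sync != NULL);
    return sync(ctx) == 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
}

/* Stand-in backends, for demonstration only. */
int sync_ok(void *ctx)   { (void)ctx; return 0; }
int sync_fail(void *ctx) { (void)ctx; return -1; }
```

    A backend such as qemu -snapshot could plug in a sync hook that does
    nothing, which is why the wording is SHOULD rather than MUST.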

    Cheers,
    Rusty.




  • 7.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-18-2013 06:00
    On Tue, Sep 17, 2013 at 03:03:47PM +0930, Rusty Russell wrote:
    > James Bottomley <jbottomley@parallels.com> writes:
    > > On Mon, 2013-09-16 at 16:50 +0930, Rusty Russell wrote:
    > >> I think we should do the reverse:
    > >> 1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    > >> writeback). Otherwise it's write-through.
    > >> 2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    > >> VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.
    > >>
    > >> But I'm not convinced that #2 should exist at all. Does someone have a
    > >> real use case?
    > >
    > > There is no use case in Linux. We provide knobs to toggle the cache
    > > type in the SCSI disk class, but the kernel doesn't do it by default, it
    > > just runs with the cache type because the toggle can be unreliable.
    >
    > OK. How about this as a strawman:
    >
    > 1) Kill VIRTIO_BLK_F_CONFIG_WCE as a bad idea. You either negotiate
    > WCE or not during device setup; no toggling.

    It's not really useful as guest-controlled, I agree.
    But I have this vague idea:
    - host wants to enable WCE on backend
    - host notifies guest
    - guest enables WCE
    - host can now enable WCE on backend

    > 2) If device offers VIRTIO_BLK_F_WCE, and driver negotiates it:
    > - Completed writes should be persistent if guest crashes.
    > - No ordering guarantees are given except where flush command sent.
    > - Flush SHOULD push all preceding writes to permanent storage.
    > (in the RFC SHOULD sense, which is weaker than MUST).

    We can also document that it's legal for device to
    make the feature required (using the new handshake).


    > 3) If device does not offer VIRTIO_BLK_F_WCE, or driver doesn't
    > negotiate it:
    > - Completed writes should be persistent if guest crashes.
    > - No flush commands are supported.
    > - No guarantee about writes hitting permanent storage.
    >
    > This pretty neatly divides it into complex and simple cases. If you
    > want more fine-grained, you know where to find virtio-scsi...
    >
    > Cheers,
    > Rusty.



  • 8.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-18-2013 07:12
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > On Tue, Sep 17, 2013 at 03:03:47PM +0930, Rusty Russell wrote:
    >> James Bottomley <jbottomley@parallels.com> writes:
    >> > On Mon, 2013-09-16 at 16:50 +0930, Rusty Russell wrote:
    >> >> I think we should do the reverse:
    >> >> 1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    >> >> writeback). Otherwise it's write-through.
    >> >> 2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    >> >> VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.
    >> >>
    >> >> But I'm not convinced that #2 should exist at all. Does someone have a
    >> >> real use case?
    >> >
    >> > There is no use case in Linux. We provide knobs to toggle the cache
    >> > type in the SCSI disk class, but the kernel doesn't do it by default, it
    >> > just runs with the cache type because the toggle can be unreliable.
    >>
    >> OK. How about this as a strawman:
    >>
    >> 1) Kill VIRTIO_BLK_F_CONFIG_WCE as a bad idea. You either negotiate
    >> WCE or not during device setup; no toggling.
    >
    > It's not really useful as guest-controlled, I agree.
    > But I have this vague idea:
    > - host wants to enable WCE on backend
    > - host notifies guest
    > - guest enables WCE
    > - host can now enable WCE on backend

    That seems convoluted to me. This kind of complexity seems best in
    virtio-scsi.

    >> 2) If device offers VIRTIO_BLK_F_WCE, and driver negotiates it:
    >> - Completed writes should be persistent if guest crashes.
    >> - No ordering guarantees are given except where flush command sent.
    >> - Flush SHOULD push all preceding writes to permanent storage.
    >> (in the RFC SHOULD sense, which is weaker than MUST).
    >
    > We can also document that it's legal for device to
    > make the feature required (using the new handshake).

    I dislike compulsory features. Most features are a
    complexity/performance tradeoff. Making a feature compulsory means that
    only the device gets to decide the tradeoff, not the driver.

    >> 3) If device does not offer VIRTIO_BLK_F_WCE, or driver doesn't
    >> negotiate it:
    >> - Completed writes should be persistent if guest crashes.
    >> - No flush commands are supported.
    >> - No guarantee about writes hitting permanent storage.
    >>
    >> This pretty neatly divides it into complex and simple cases. If you
    >> want more fine-grained, you know where to find virtio-scsi...

    Subnote:
    This means that quiescing a non-WCE device forms a write
    barrier. This is not quite the same as strictly ordered, as
    in-flight requests can be reordered.

    Quiescing a WCE device doesn't do anything (though a
    straightforward aio implementation would cause a write barrier
    too).

    My main question remains: is this non-WCE vs WCE split sane? Non-WCE
    means no host persistence, WCE gives you a hope.

    Cheers,
    Rusty.




  • 9.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-19-2013 13:51
    On Tue, Sep 17, 2013 at 03:03:47PM +0930, Rusty Russell wrote:
    > James Bottomley <jbottomley@parallels.com> writes:
    > > On Mon, 2013-09-16 at 16:50 +0930, Rusty Russell wrote:
    > >> I think we should do the reverse:
    > >> 1) If you negotiate VIRTIO_BLK_F_WCE, you need to flush (ie. it's
    > >> writeback). Otherwise it's write-through.
    > >> 2) If (and only if) you negotiate VIRTIO_BLK_F_WCE, you may also negotiate
    > >> VIRTIO_BLK_F_CONFIG_WCE, in which case you can toggle it.
    > >>
    > >> But I'm not convinced that #2 should exist at all. Does someone have a
    > >> real use case?
    > >
    > > There is no use case in Linux. We provide knobs to toggle the cache
    > > type in the SCSI disk class, but the kernel doesn't do it by default, it
    > > just runs with the cache type because the toggle can be unreliable.

    The virtio_blk driver provides a cache_type sysfs attr that userspace
    can toggle. I guess this was inspired by how the SCSI disk class does
    it.

    > OK. How about this as a strawman:
    >
    > 1) Kill VIRTIO_BLK_F_CONFIG_WCE as a bad idea. You either negotiate
    > WCE or not during device setup; no toggling.
    >
    > 2) If device offers VIRTIO_BLK_F_WCE, and driver negotiates it:
    > - Completed writes should be persistent if guest crashes.
    > - No ordering guarantees are given except where flush command sent.
    > - Flush SHOULD push all preceding writes to permanent storage.
    > (in the RFC SHOULD sense, which is weaker than MUST).
    >
    > 3) If device does not offer VIRTIO_BLK_F_WCE, or driver doesn't
    > negotiate it:
    > - Completed writes should be persistent if guest crashes.
    > - No flush commands are supported.
    > - No guarantee about writes hitting permanent storage.
    >
    > This pretty neatly divides it into complex and simple cases. If you
    > want more fine-grained, you know where to find virtio-scsi...

    #3 is worse than what we had with VIRTIO_BLK_F_WCE semantics. In order
    to keep things simple you weakened the guarantees to the point where you
    have to look at your hypervisor implementation instead of the virtio
    standard. We're trying to define a standard so guests and hypervisors can
    work together - undefined behavior doesn't further that goal, it
    actually prevents virtio implementations from working universally.

    When VIRTIO_BLK_F_WCE is not offered by the device or negotiated by the
    guest it makes sense to guarantee that every write hits permanent
    storage.

    Stefan



  • 10.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-23-2013 00:11
    Stefan Hajnoczi <stefanha@redhat.com> writes:
    > On Tue, Sep 17, 2013 at 03:03:47PM +0930, Rusty Russell wrote:
    >> 3) If device does not offer VIRTIO_BLK_F_WCE, or driver doesn't
    >> negotiate it:
    >> - Completed writes should be persistent if guest crashes.
    >> - No flush commands are supported.
    >> - No guarantee about writes hitting permanent storage.
    >>
    >> This pretty neatly divides it into complex and simple cases. If you
    >> want more fine-grained, you know where to find virtio-scsi...
    >
    > #3 is worse than what we had with VIRTIO_BLK_F_WCE semantics. In order
    > to keep things simple you weakened the guarantees to the point where you
    > have to look at your hypervisor implementation instead of the virtio
    > standard.

    Let's be absolutely clear here, the spec can *never* say:

        Writes MUST be committed to persistent storage.

    Because there are real use cases which violate that: consider qemu
    -snapshot. So you will *always* have to consider the hypervisor.

    > We're trying to define standard so guests and hypervisors can
    > work together - undefined behavior doesn't further that goal, it
    > actually prevents virtio implementations from working universally.

    It's a quality of implementation issue, not a core compatibility issue.
    And I think it's perfectly reasonable not to flush to permanent storage.
    Bryan, have there been any complaints about bhyve not doing it?

    > When VIRTIO_BLK_F_WCE is not offered by the device or negotiated by the
    > guest it makes sense to guarantee that every write hits permanent
    > storage.

    Perhaps conflating the two (WCE <=> permanence) is a mistake. But I
    think we need a way for fast, simple implementations to exist: so far,
    that's the norm. And I'm reluctant to weaken SHOULD to MAY.

    Cheers,
    Rusty.


  • 11.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-23-2013 00:11
    Rusty Russell <rusty@au1.ibm.com> writes:
    > Stefan Hajnoczi <stefanha@redhat.com> writes:
    >> On Tue, Sep 17, 2013 at 03:03:47PM +0930, Rusty Russell wrote:
    >>> 3) If device does not offer VIRTIO_BLK_F_WCE, or driver doesn't
    >>> negotiate it:
    >>> - Completed writes should be persistent if guest crashes.
    >>> - No flush commands are supported.
    >>> - No guarantee about writes hitting permanent storage.
    >>>
    >>> This pretty neatly divides it into complex and simple cases. If you
    >>> want more fine-grained, you know where to find virtio-scsi...
    >>
    >> #3 is worse than what we had with VIRTIO_BLK_F_WCE semantics. In order
    >> to keep things simple you weakened the guarantees to the point where you
    >> have to look at your hypervisor implementation instead of the virtio
    >> standard.
    >
    > Let's be absolutely clear here, the spec can *never* say:
    >
    > Writes MUST be committed to persistent storage.
    >
    > Because there are real use cases which violate that: consider qemu
    > -snapshot. So you will *always* have to consider the hypervisor.
    >
    >> We're trying to define standard so guests and hypervisors can
    >> work together - undefined behavior doesn't further that goal, it
    >> actually prevents virtio implementations from working universally.
    >
    > It's a quality of implementation issue, not a core compatibility issue.
    > And I think it's perfectly reasonable not to flush to permanent storage.
    > Bryan, have there been any complaints about bhyve not doing it?
    >
    >> When VIRTIO_BLK_F_WCE is not offered by the device or negotiated by the
    >> guest it makes sense to guarantee that every write hits permanent
    >> storage.
    >
    > Perhaps conflating the two (WCE <=> permanence) is a mistake. But I
    > think we need a way for fast, simple implementations to exist: so far,
    > that's the norm. And I'm reluctant to weaken SHOULD to MAY.

    Thinking about this some more: why not make WCE the only option?

    (1) It's simple.
    (2) We say flush SHOULD hit the disk.

    Now, I *think* we should remove "Completed (unflushed) writes should be
    persistent if guest crashes.". This is a nice property for simple
    guests, but it disallows aggressive internal caching.

    eg. consider a device implementation which compresses into blocks. It
    might want to cache writes aggressively internally in the hope of
    aggregating them before compressing and writing back.

    OTOH, it could do so by deferring completion until that has been done...

    Thoughts?
    Rusty.


  • 12.  Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-01-2013 05:27
    Rusty Russell <rusty@au1.ibm.com> writes:
    > Thinking about this some more: why not make WCE the only option?
    >
    > (1) It's simple.
    > (2) We say flush SHOULD hit the disk.

    OK, here's the actual patch. It applies most of Paolo's patch,
    but to the legacy section, and removes VIRTIO_BLK_F_FLUSH. It also
    documents what FLUSH should do (is more clarity required?)

    Cheers,
    Rusty.

    diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    index 5efeba7..9b5c29a 100644
    --- a/virtio-v1.0-wd01-part1-specification.txt
    +++ b/virtio-v1.0-wd01-part1-specification.txt
    @@ -1817,8 +1817,6 @@ device except where noted.

     VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.

    - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    -
     VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
     alignment.

    @@ -1843,6 +1841,7 @@ device except where noted.
     u16 min_io_size;
     u32 opt_io_size;
     } topology;
    + u8 reserved;
     };

     2.4.2.3.1 Legacy Interface: Feature bits
    @@ -1851,6 +1850,15 @@ device except where noted.

     VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.

    + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    +
    + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    + and writethrough modes.
    +
    +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    +should only negotiate this feature if they are capable of sending
    +VIRTIO_BLK_T_FLUSH commands.
    +
     2.4.2.4. Device Initialization
     -----------------------------

    @@ -1872,6 +1880,22 @@ device except where noted.
     I/O lengths for the driver to use. This also does not affect the units
     in the protocol, only performance.

    +2.4.2.4.1. Legacy Interface: Device Initialization
    +-----------------------------
    +
    +The reserved field used to be called writeback. If the
    +VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode should be
    +read from the writeback field of the configuration if available; the
    +driver can also write to the field in order to toggle the cache
    +between writethrough (0) and writeback (1) mode. If the feature is
    +not available, the driver can instead look at the result of
    +negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode
    +after reset if and only if VIRTIO_BLK_F_WCE is negotiated.
    +
    +Note that buggy legacy devices are common, which always operate in
    +write back mode even if VIRTIO_BLK_F_WCE isn't accepted, or writeback
    +is turned off.
    +
     2.4.2.5. Device Operation
     ------------------------

    @@ -1907,6 +1931,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
     #define VIRTIO_BLK_S_IOERR 1
     #define VIRTIO_BLK_S_UNSUPP 2

    +Any writes completed before the submission of the flush command should
    +be committed to permanent storage by the device.
    +
     2.4.2.5.1 Legacy Interface: Device Operation
     ------------------------
     The 'reserved' field was previously called ioprio. The ioprio field


  • 13.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-01-2013 07:17
    On 01/10/2013 07:22, Rusty Russell wrote:
    > Rusty Russell <rusty@au1.ibm.com> writes:
    >> Thinking about this some more: why not make WCE the only option?
    >>
    >> (1) It's simple.
    >> (2) We say flush SHOULD hit the disk.
    >
    > OK, here's the actual patch. It applies most of Paolo's patch,
    > but to the legacy section, and removes VIRTIO_BLK_F_FLUSH. It also
    > documents what FLUSH should do (is more clarity required?)

    Actually that's a pretty smart thing to do. :) Just a couple comments
    below.

    > diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    > index 5efeba7..9b5c29a 100644
    > --- a/virtio-v1.0-wd01-part1-specification.txt
    > +++ b/virtio-v1.0-wd01-part1-specification.txt
    > @@ -1817,8 +1817,6 @@ device except where noted.
    >
    > VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
    >
    > - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    > -
    > VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
    > alignment.
    >
    > @@ -1843,6 +1841,7 @@ device except where noted.
    > u16 min_io_size;
    > u32 opt_io_size;
    > } topology;
    > + u8 reserved;
    > };
    >
    > 2.4.2.3.1 Legacy Interface: Feature bits
    > @@ -1851,6 +1850,15 @@ device except where noted.
    >
    > VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
    >
    > + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    > +
    > + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    > + and writethrough modes.

    Do you need to document where the wce flag is in the configuration?

    > +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    > +should only negotiate this feature if they are capable of sending
    > +VIRTIO_BLK_T_FLUSH commands.
    > +
    > 2.4.2.4. Device Initialization
    > -----------------------------
    >
    > @@ -1872,6 +1880,22 @@ device except where noted.
    > I/O lengths for the driver to use. This also does not affect the units
    > in the protocol, only performance.
    >
    > +2.4.2.4.1. Legacy Interface: Device Initialization
    > +-----------------------------
    > +
    > +The reserved field used to be called writeback. If the
    > +VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode should be
    > +read from the writeback field of the configuration if available; the
    > +driver can also write to the field in order to toggle the cache
    > +between writethrough (0) and writeback (1) mode. If the feature is
    > +not available, the driver can instead look at the result of
    > +negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode
    > +after reset if and only if VIRTIO_BLK_F_WCE is negotiated.

    s/VIRTIO_BLK_F_WCE/VIRTIO_BLK_F_FLUSH/ everywhere

    > +Note that buggy legacy devices are common, which always operate in
    > +write back mode even if VIRTIO_BLK_F_WCE isn't accepted, or writeback
    > +is turned off.

    I would remove this paragraph. You found bugs, true, but it's not that
    the affected versions of QEMU are common---they are neither in Debian
    nor in RHEL/CentOS, and frankly using anything else in production would
    be crazy (also if you do I hope you'll update fast, and those hosts will
    not be "common" anymore by the time standardization is finished).

    Paolo

    > 2.4.2.5. Device Operation
    > ------------------------
    >
    > @@ -1907,6 +1931,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
    > #define VIRTIO_BLK_S_IOERR 1
    > #define VIRTIO_BLK_S_UNSUPP 2
    >
    > +Any writes completed before the submission of the flush command should
    > +be committed to permanent storage by the device.
    > +
    > 2.4.2.5.1 Legacy Interface: Device Operation
    > ------------------------
    > The 'reserved' field was previously called ioprio. The ioprio field


  • 14.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-01-2013 10:15
    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 01/10/2013 07:22, Rusty Russell wrote:
    >> Rusty Russell <rusty@au1.ibm.com> writes:
    >>> Thinking about this some more: why not make WCE the only option?
    >>>
    >>> (1) It's simple.
    >>> (2) We say flush SHOULD hit the disk.
    >>
    >> OK, here's the actual patch. It applies most of Paolo's patch,
    >> but to the legacy section, and remove VIRTIO_BLK_F_FLUSH. It also
    >> documents what FLUSH should do (is more clarity required?)
    >
    > Actually that's a pretty smart thing to do. :) Just a couple comments
    > below.
    >
    >> diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    >> index 5efeba7..9b5c29a 100644
    >> --- a/virtio-v1.0-wd01-part1-specification.txt
    >> +++ b/virtio-v1.0-wd01-part1-specification.txt
    >> @@ -1817,8 +1817,6 @@ device except where noted.
    >>
    >> VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
    >>
    >> - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    >> -
    >> VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
    >> alignment.
    >>
    >> @@ -1843,6 +1841,7 @@ device except where noted.
    >> u16 min_io_size;
    >> u32 opt_io_size;
    >> } topology;
    >> + u8 reserved;
    >> };
    >>
    >> 2.4.2.3.1 Legacy Interface: Feature bits
    >> @@ -1851,6 +1850,15 @@ device except where noted.
    >>
    >> VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
    >>
    >> + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    >> +
    >> + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    >> + and writethrough modes.
    >
    > Do you need to document where the wce flag is in the configuration?

    Yes, it's the 'u8 reserved' field above, and referred to below:

    >> +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    >> +should only negotiate this feature if they are capable of sending
    >> +VIRTIO_BLK_T_FLUSH commands.
    >> +
    >> 2.4.2.4. Device Initialization
    >> -----------------------------
    >>
    >> @@ -1872,6 +1880,22 @@ device except where noted.
    >> I/O lengths for the driver to use. This also does not affect the units
    >> in the protocol, only performance.
    >>
    >> +2.4.2.4.1. Legacy Interface: Device Initialization
    >> +-----------------------------
    >> +
    >> +The reserved field used to be called writeback. If the

    Here.

    >> +VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode should be
    >> +read from the writeback field of the configuration if available; the
    >> +driver can also write to the field in order to toggle the cache
    >> +between writethrough (0) and writeback (1) mode. If the feature is
    >> +not available, the driver can instead look at the result of
    >> +negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode
    >> +after reset if and only if VIRTIO_BLK_F_WCE is negotiated.
    >
    > s/VIRTIO_BLK_F_WCE/VIRTIO_BLK_F_FLUSH/ everywhere

    Yep, thanks.

    >> +Note that buggy legacy devices are common, which always operate in
    >> +write back mode even if VIRTIO_BLK_F_WCE isn't accepted, or writeback
    >> +is turned off.
    >
    > I would remove this paragraph. You found bugs, true, but it's not that
    > the affected versions of QEMU are common---they are neither in Debian
    > nor in RHEL/CentOS, and frankly using anything else in production would
    > be crazy (also if you do I hope you'll update fast, and those hosts will
    > not be "common" anymore by the time standardization is finished).

    I was also thinking of this:

    [30] Until version 1.1, QEMU remained in writeback mode even after a
    guest announced lack of support for VIRTIO_BLK_F_FLUSH.

    Cheers,
    Rusty.


  • 15.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-01-2013 11:35
    On 01/10/2013 12:13, Rusty Russell wrote:
    >>> +Note that buggy legacy devices are common, which always operate in
    >>> +write back mode even if VIRTIO_BLK_F_WCE isn't accepted, or writeback
    >>> +is turned off.
    >>
    >> I would remove this paragraph. You found bugs, true, but it's not that
    >> the affected versions of QEMU are common---they are neither in Debian
    >> nor in RHEL/CentOS, and frankly using anything else in production would
    >> be crazy (also if you do I hope you'll update fast, and those hosts will
    >> not be "common" anymore by the time standardization is finished).
    > I was also thinking of this:
    >
    > [30] Until version 1.1, QEMU remained in writeback mode even after a
    > guest announced lack of support for VIRTIO_BLK_F_FLUSH.

    I would make this even stronger, like "It is undefined whether a disk
    operates in writeback or writethrough mode after a guest announces lack
    of support for VIRTIO_BLK_F_FLUSH".

    Paolo


  • 16.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-02-2013 04:40
    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 01/10/2013 12:13, Rusty Russell wrote:
    >>>> +Note that buggy legacy devices are common, which always operate in
    >>>> +write back mode even if VIRTIO_BLK_F_WCE isn't accepted, or writeback
    >>>> +is turned off.
    >>>
    >>> I would remove this paragraph. You found bugs, true, but it's not that
    >>> the affected versions of QEMU are common---they are neither in Debian
    >>> nor in RHEL/CentOS, and frankly using anything else in production would
    >>> be crazy (also if you do I hope you'll update fast, and those hosts will
    >>> not be "common" anymore by the time standardization is finished).
    >> I was also thinking of this:
    >>
    >> [30] Until version 1.1, QEMU remained in writeback mode even after a
    >> guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    >
    > I would make this even stronger, like "It is undefined whether a disk
    > operates in writeback or writethrough mode after a guest announces lack
    > of support for VIRTIO_BLK_F_FLUSH".

    At this point it's documenting historical behavior, rather than
    proscribing it. How about we simply note:

    Some older legacy devices did not operate in writethrough mode even
    after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.

    Cheers,
    Rusty.


  • 17.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-02-2013 10:25
    On 02/10/2013 02:50, Rusty Russell wrote:
    >> I would make this even stronger, like "It is undefined whether a disk
    >> operates in writeback or writethrough mode after a guest announces lack
    >> of support for VIRTIO_BLK_F_FLUSH".
    > At this point it's documenting historical behavior, rather than
    > proscribing it. How about we simply note:
    >
    > Some older legacy devices did not operate in writethrough mode even
    > after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.

    That's fine.

    Paolo


  • 18.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-03-2013 05:07
    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 02/10/2013 02:50, Rusty Russell wrote:
    >>> I would make this even stronger, like "It is undefined whether a disk
    >>> operates in writeback or writethrough mode after a guest announces lack
    >>> of support for VIRTIO_BLK_F_FLUSH".
    >> At this point it's documenting historical behavior, rather than
    >> proscribing it. How about we simply note:
    >>
    >> Some older legacy devices did not operate in writethrough mode even
    >> after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    >
    > That's fine.
    >
    > Paolo

    Here's the final patch then:

    commit ff7486a8699c06743473c0078e9b511275d65306
    Author: Rusty Russell <rusty@au1.ibm.com>
    Date: Tue Oct 1 14:53:17 2013 +0930

    virtio_blk: deprecate flush.

    This is partially a port of Paolo Bonzini <pbonzini@redhat.com>'s
    WCE spec changes, but puts them in the legacy section. It reserves
    the writeback field for compatibility.

    Flush is now required.

    See: https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html

    Signed-off-by: Rusty Russell <rusty@au1.ibm.com>

    diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    index 5efeba7..a9c0cce 100644
    --- a/virtio-v1.0-wd01-part1-specification.txt
    +++ b/virtio-v1.0-wd01-part1-specification.txt
    @@ -1817,8 +1817,6 @@ device except where noted.

     VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.

    - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    -
     VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
     alignment.

    @@ -1843,6 +1841,7 @@ device except where noted.
     u16 min_io_size;
     u32 opt_io_size;
     } topology;
    + u8 reserved;
     };

     2.4.2.3.1 Legacy Interface: Feature bits
    @@ -1851,6 +1850,15 @@ device except where noted.

     VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.

    + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    +
    + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    + and writethrough modes.
    +
    +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    +should only negotiate this feature if they are capable of sending
    +VIRTIO_BLK_T_FLUSH commands.
    +
     2.4.2.4. Device Initialization
     -----------------------------

    @@ -1872,6 +1880,21 @@ device except where noted.
     I/O lengths for the driver to use. This also does not affect the units
     in the protocol, only performance.

    +2.4.2.4.1. Legacy Interface: Device Initialization
    +-----------------------------
    +
    +The reserved field used to be called writeback. If the
    +VIRTIO_BLK_F_CONFIG_FLUSH feature is offered, the cache mode should be
    +read from the writeback field of the configuration if available; the
    +driver can also write to the field in order to toggle the cache
    +between writethrough (0) and writeback (1) mode. If the feature is
    +not available, the driver can instead look at the result of
    +negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode
    +after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated.
    +
    +Some older legacy devices did not operate in writethrough mode even
    +after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    +
     2.4.2.5. Device Operation
     ------------------------

    @@ -1907,6 +1930,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
     #define VIRTIO_BLK_S_IOERR 1
     #define VIRTIO_BLK_S_UNSUPP 2

    +Any writes completed before the submission of the flush command should
    +be committed to permanent storage by the device.
    +
     2.4.2.5.1 Legacy Interface: Device Operation
     ------------------------
     The 'reserved' field was previously called ioprio. The ioprio field


  • 19.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-03-2013 08:26
    On 03/10/2013 07:05, Rusty Russell wrote:
    > Paolo Bonzini <pbonzini@redhat.com> writes:
    >> On 02/10/2013 02:50, Rusty Russell wrote:
    >>>> I would make this even stronger, like "It is undefined whether a disk
    >>>> operates in writeback or writethrough mode after a guest announces lack
    >>>> of support for VIRTIO_BLK_F_FLUSH".
    >>> At this point it's documenting historical behavior, rather than
    >>> proscribing it. How about we simply note:
    >>>
    >>> Some older legacy devices did not operate in writethrough mode even
    >>> after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    >>
    >> That's fine.
    >>
    >> Paolo
    >
    > Here's the final patch then:
    >
    > commit ff7486a8699c06743473c0078e9b511275d65306
    > Author: Rusty Russell <rusty@au1.ibm.com>
    > Date: Tue Oct 1 14:53:17 2013 +0930
    >
    > virtio_blk: deprecate flush.
    >
    > This is partially a port of Paolo Bonzini <pbonzini@redhat.com>'s
    > WCE spec changes, but puts them in the legacy section. It reserves
    > the writeback field for compatibility.
    >
    > Flush is now required.
    >
    > See: https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >
    > Signed-off-by: Rusty Russell <rusty@au1.ibm.com>
    >
    > diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    > index 5efeba7..a9c0cce 100644
    > --- a/virtio-v1.0-wd01-part1-specification.txt
    > +++ b/virtio-v1.0-wd01-part1-specification.txt
    > @@ -1817,8 +1817,6 @@ device except where noted.
    >
    > VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
    >
    > - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    > -
    > VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
    > alignment.
    >
    > @@ -1843,6 +1841,7 @@ device except where noted.
    > u16 min_io_size;
    > u32 opt_io_size;
    > } topology;
    > + u8 reserved;
    > };
    >
    > 2.4.2.3.1 Legacy Interface: Feature bits
    > @@ -1851,6 +1850,15 @@ device except where noted.
    >
    > VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
    >
    > + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    > +
    > + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    > + and writethrough modes.
    > +
    > +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    > +should only negotiate this feature if they are capable of sending
    > +VIRTIO_BLK_T_FLUSH commands.
    > +
    > 2.4.2.4. Device Initialization
    > -----------------------------
    >
    > @@ -1872,6 +1880,21 @@ device except where noted.
    > I/O lengths for the driver to use. This also does not affect the units
    > in the protocol, only performance.
    >
    > +2.4.2.4.1. Legacy Interface: Device Initialization
    > +-----------------------------
    > +
    > +The reserved field used to be called writeback. If the
    > +VIRTIO_BLK_F_CONFIG_FLUSH feature is offered, the cache mode should be

    CONFIG_WCE?

    > +read from the writeback field of the configuration if available; the
    > +driver can also write to the field in order to toggle the cache
    > +between writethrough (0) and writeback (1) mode. If the feature is
    > +not available, the driver can instead look at the result of
    > +negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode
    > +after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated.
    > +
    > +Some older legacy devices did not operate in writethrough mode even
    > +after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    > +
    > 2.4.2.5. Device Operation
    > ------------------------
    >
    > @@ -1907,6 +1930,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
    > #define VIRTIO_BLK_S_IOERR 1
    > #define VIRTIO_BLK_S_UNSUPP 2
    >
    > +Any writes completed before the submission of the flush command should
    > +be committed to permanent storage by the device.

    s/permanent/non-volatile/

    Putting them in a non-volatile cache is fine.

    Paolo

    > +
    > 2.4.2.5.1 Legacy Interface: Device Operation
    > ------------------------
    > The 'reserved' field was previously called ioprio. The ioprio field


  • 20.  [PATCH v2] virtio_blk: deprecate flush.

    Posted 10-04-2013 01:16
    This is partially a port of Paolo Bonzini <pbonzini@redhat.com>'s
    WCE spec changes, but puts them in the legacy section. It reserves
    the writeback field for compatibility.

    Flush is now required.

    See: https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html

    Signed-off-by: Rusty Russell <rusty@au1.ibm.com>
    ---
     virtio-v1.0-wd01-part1-specification.txt | 30 ++++++++++++++++++++++++++++--
     1 file changed, 28 insertions(+), 2 deletions(-)

    diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    index 5efeba7..2676e28 100644
    --- a/virtio-v1.0-wd01-part1-specification.txt
    +++ b/virtio-v1.0-wd01-part1-specification.txt
    @@ -1817,8 +1817,6 @@ device except where noted.

     VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.

    - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    -
     VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
     alignment.

    @@ -1843,6 +1841,7 @@ device except where noted.
     u16 min_io_size;
     u32 opt_io_size;
     } topology;
    + u8 reserved;
     };

     2.4.2.3.1 Legacy Interface: Feature bits
    @@ -1851,6 +1850,15 @@ device except where noted.

     VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.

    + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    +
    + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    + and writethrough modes.
    +
    +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    +should only negotiate this feature if they are capable of sending
    +VIRTIO_BLK_T_FLUSH commands.
    +
     2.4.2.4. Device Initialization
     -----------------------------

    @@ -1872,6 +1880,21 @@ device except where noted.
     I/O lengths for the driver to use. This also does not affect the units
     in the protocol, only performance.

    +2.4.2.4.1. Legacy Interface: Device Initialization
    +-----------------------------
    +
    +The reserved field used to be called writeback. If the
    +VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode should be
    +read from the writeback field of the configuration if available; the
    +driver can also write to the field in order to toggle the cache
    +between writethrough (0) and writeback (1) mode. If the feature is
    +not available, the driver can instead look at the result of
    +negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode
    +after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated.
    +
    +Some older legacy devices did not operate in writethrough mode even
    +after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    +
     2.4.2.5. Device Operation
     ------------------------

    @@ -1907,6 +1930,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
     #define VIRTIO_BLK_S_IOERR 1
     #define VIRTIO_BLK_S_UNSUPP 2

    +Any writes completed before the submission of the flush command should
    +be committed to non-volatile storage by the device.
    +
     2.4.2.5.1 Legacy Interface: Device Operation
     ------------------------
     The 'reserved' field was previously called ioprio. The ioprio field
    --
    1.8.1.2
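
    The legacy rule in the v2 patch above (read the writeback field when
    VIRTIO_BLK_F_CONFIG_WCE is offered, otherwise infer the mode from
    whether VIRTIO_BLK_F_FLUSH was negotiated) might be sketched in C as
    follows. The bit numbers come from the patch; the names has_feature(),
    legacy_cache_mode() and enum cache_mode are hypothetical and are not
    taken from any real driver. As the patch itself notes, some older
    legacy devices did not honor the fallback rule, so this is only what
    the text says a driver may assume.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as listed in the patch. */
#define VIRTIO_BLK_F_FLUSH       9
#define VIRTIO_BLK_F_CONFIG_WCE 11

/* Values match the patch text: writethrough (0), writeback (1). */
enum cache_mode { CACHE_WRITETHROUGH = 0, CACHE_WRITEBACK = 1 };

/* Hypothetical helper: test one bit in a feature bitmap. */
bool has_feature(uint64_t features, unsigned bit)
{
    return ((features >> bit) & 1u) != 0;
}

/*
 * Hypothetical helper, not a real driver API.
 *   offered:    feature bits the device offers
 *   negotiated: feature bits the driver accepted
 *   writeback:  value read from the legacy 'writeback' config field
 */
enum cache_mode legacy_cache_mode(uint64_t offered, uint64_t negotiated,
                                  uint8_t writeback)
{
    /* With CONFIG_WCE offered, the config field is authoritative. */
    if (has_feature(offered, VIRTIO_BLK_F_CONFIG_WCE))
        return writeback ? CACHE_WRITEBACK : CACHE_WRITETHROUGH;

    /* Otherwise: writeback if and only if FLUSH was negotiated. */
    return has_feature(negotiated, VIRTIO_BLK_F_FLUSH)
               ? CACHE_WRITEBACK
               : CACHE_WRITETHROUGH;
}
```

    Passing the offered and negotiated sets separately keeps the two
    branches of the rule visible; a real driver would read both from its
    transport rather than take them as plain integers.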


  • 21.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-04-2013 01:16
    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 03/10/2013 07:05, Rusty Russell wrote:
    >> Paolo Bonzini <pbonzini@redhat.com> writes:
    >>> On 02/10/2013 02:50, Rusty Russell wrote:
    >>>>> I would make this even stronger, like "It is undefined whether a disk
    >>>>> operates in writeback or writethrough mode after a guest announces lack
    >>>>> of support for VIRTIO_BLK_F_FLUSH".
    >>>> At this point it's documenting historical behavior, rather than
    >>>> proscribing it. How about we simply note:
    >>>>
    >>>> Some older legacy devices did not operate in writethrough mode even
    >>>> after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    >>>
    >>> That's fine.
    >>>
    >>> Paolo
    >>
    >> Here's the final patch then:
    >>
    >> commit ff7486a8699c06743473c0078e9b511275d65306
    >> Author: Rusty Russell <rusty@au1.ibm.com>
    >> Date: Tue Oct 1 14:53:17 2013 +0930
    >>
    >> virtio_blk: deprecate flush.
    >>
    >> This is partially a port of Paolo Bonzini <pbonzini@redhat.com>'s
    >> WCE spec changes, but puts them in the legacy section. It reserves
    >> the writeback field for compatibility.
    >>
    >> Flush is now required.
    >>
    >> See: https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >>
    >> Signed-off-by: Rusty Russell <rusty@au1.ibm.com>
    >>
    >> diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    >> index 5efeba7..a9c0cce 100644
    >> --- a/virtio-v1.0-wd01-part1-specification.txt
    >> +++ b/virtio-v1.0-wd01-part1-specification.txt
    >> @@ -1817,8 +1817,6 @@ device except where noted.
    >>
    >> VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
    >>
    >> - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    >> -
    >> VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
    >> alignment.
    >>
    >> @@ -1843,6 +1841,7 @@ device except where noted.
    >> u16 min_io_size;
    >> u32 opt_io_size;
    >> } topology;
    >> + u8 reserved;
    >> };
    >>
    >> 2.4.2.3.1 Legacy Interface: Feature bits
    >> @@ -1851,6 +1850,15 @@ device except where noted.
    >>
    >> VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
    >>
    >> + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    >> +
    >> + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    >> + and writethrough modes.
    >> +
    >> +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    >> +should only negotiate this feature if they are capable of sending
    >> +VIRTIO_BLK_T_FLUSH commands.
    >> +
    >> 2.4.2.4. Device Initialization
    >> -----------------------------
    >>
    >> @@ -1872,6 +1880,21 @@ device except where noted.
    >> I/O lengths for the driver to use. This also does not affect the units
    >> in the protocol, only performance.
    >>
    >> +2.4.2.4.1. Legacy Interface: Device Initialization
    >> +-----------------------------
    >> +
    >> +The reserved field used to be called writeback. If the
    >> +VIRTIO_BLK_F_CONFIG_FLUSH feature is offered, the cache mode should be
    >
    > CONFIG_WCE?

    Oops, oversed. Thanks.

    >> +read from the writeback field of the configuration if available; the
    >> +driver can also write to the field in order to toggle the cache
    >> +between writethrough (0) and writeback (1) mode. If the feature is
    >> +not available, the driver can instead look at the result of
    >> +negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode
    >> +after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated.
    >> +
    >> +Some older legacy devices did not operate in writethrough mode even
    >> +after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    >> +
    >> 2.4.2.5. Device Operation
    >> ------------------------
    >>
    >> @@ -1907,6 +1930,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
    >> #define VIRTIO_BLK_S_IOERR 1
    >> #define VIRTIO_BLK_S_UNSUPP 2
    >>
    >> +Any writes completed before the submission of the flush command should
    >> +be committed to permanent storage by the device.
    >
    > s/permanent/non-volatile/
    >
    > Putting them in a non-volatile cache is fine.

    Thanks, fixed.

    Cheers,
    Rusty.


  • 22.  Re: [virtio] Re: [virtio-dev] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 10-03-2013 13:57
    On Thu, Oct 03, 2013 at 02:35:04PM +0930, Rusty Russell wrote:
    > Paolo Bonzini <pbonzini@redhat.com> writes:
    > > On 02/10/2013 02:50, Rusty Russell wrote:
    > >>> I would make this even stronger, like "It is undefined whether a disk
    > >>> operates in writeback or writethrough mode after a guest announces lack
    > >>> of support for VIRTIO_BLK_F_FLUSH".
    > >> At this point it's documenting historical behavior, rather than
    > >> proscribing it. How about we simply note:
    > >>
    > >> Some older legacy devices did not operate in writethrough mode even
    > >> after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    > >
    > > That's fine.
    > >
    > > Paolo
    >
    > Here's the final patch then:
    >
    > commit ff7486a8699c06743473c0078e9b511275d65306
    > Author: Rusty Russell <rusty@au1.ibm.com>
    > Date: Tue Oct 1 14:53:17 2013 +0930
    >
    > virtio_blk: deprecate flush.
    >
    > This is partially a port of Paolo Bonzini <pbonzini@redhat.com>'s
    > WCE spec changes, but puts them in the legacy section. It reserves
    > the writeback field for compatibility.
    >
    > Flush is now required.
    >
    > See: https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >
    > Signed-off-by: Rusty Russell <rusty@au1.ibm.com>

    Can you do git format-patch -v2/git send-email for patch submission
    please? Makes it much easier to spot when there's a new proposal, and
    git am produces a messy log if you don't.

    > diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    > index 5efeba7..a9c0cce 100644
    > --- a/virtio-v1.0-wd01-part1-specification.txt
    > +++ b/virtio-v1.0-wd01-part1-specification.txt
    > @@ -1817,8 +1817,6 @@ device except where noted.
    >
    > VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
    >
    > - VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    > -
    > VIRTIO_BLK_F_TOPOLOGY (10) Device exports information on optimal I/O
    > alignment.
    >
    > @@ -1843,6 +1841,7 @@ device except where noted.
    > u16 min_io_size;
    > u32 opt_io_size;
    > } topology;
    > + u8 reserved;
    > };
    >
    > 2.4.2.3.1 Legacy Interface: Feature bits
    > @@ -1851,6 +1850,15 @@ device except where noted.
    >
    > VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
    >
    > + VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
    > +
    > + VIRTIO_BLK_F_CONFIG_WCE (11) Device can toggle its cache between writeback
    > + and writethrough modes.
    > +
    > +VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
    > +should only negotiate this feature if they are capable of sending
    > +VIRTIO_BLK_T_FLUSH commands.
    > +
    > 2.4.2.4. Device Initialization
    > -----------------------------
    >
    > @@ -1872,6 +1880,21 @@ device except where noted.
    > I/O lengths for the driver to use. This also does not affect the units
    > in the protocol, only performance.
    >
    > +2.4.2.4.1. Legacy Interface: Device Initialization
    > +-----------------------------
    > +
    > +The reserved field used to be called writeback. If the
    > +VIRTIO_BLK_F_CONFIG_FLUSH feature is offered, the cache mode should be
    > +read from the writeback field of the configuration if available; the
    > +driver can also write to the field in order to toggle the cache
    > +between writethrough (0) and writeback (1) mode. If the feature is
    > +not available, the driver can instead look at the result of
    > +negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode
    > +after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated.
    > +
    > +Some older legacy devices did not operate in writethrough mode even
    > +after a guest announced lack of support for VIRTIO_BLK_F_FLUSH.
    > +
    > 2.4.2.5. Device Operation
    > ------------------------
    >
    > @@ -1907,6 +1930,9 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:
    > #define VIRTIO_BLK_S_IOERR 1
    > #define VIRTIO_BLK_S_UNSUPP 2
    >
    > +Any writes completed before the submission of the flush command should
    > +be committed to permanent storage by the device.
    > +
    > 2.4.2.5.1 Legacy Interface: Device Operation
    > ------------------------
    > The 'reserved' field was previously called ioprio. The ioprio field


  • 23.  Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-11-2013 03:19
    James Bottomley <jbottomley@parallels.com> writes:
    > [resending to virtio-comment; it looks like I'm not subscribed to
    > virtio-dev ... how do you subscribe?]

    Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
    https://www.oasis-open.org/mlmanage/

    BTW, I've moved this to virtio@ since it's core business, with virtio-comment
    cc'd.

    > Sorry, I don't have a copy of the original email to reply to:
    >
    > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >
    > The part that concerns me is this:
    >
    >> +5. The cache mode should be read from the writeback field of the configuration
    >> + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    >> + write to the field in order to toggle the cache between writethrough (0)
    >> + and writeback (1) mode.
    >> + If the feature is not available, the driver can instead look at the result
    >> + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    >> + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    >
    > The questions are twofold and have to do with Write Back only disks (to
    > date we've seen quite a few ATA devices like this and a huge number of
    > USB devices):
    >
    > 1. If the guest doesn't negotiate WCE, what do you do on the host
    > (flush on every write is one possible option; run unsafe and
    > hope the host doesn't crash is another).
    > 2. If the guest asks to toggle the device from writeback (1) to
    > writethrough (0) mode, what do you do? Refuse the toggle would
    > be reasonable or flip back into whatever mode you were using to
    > handle 1. is also possible.
    >
    > James

    I thought about this more after the call. If we look at block device
    implementations on the host:

    1) Dumb device (ie. no flush support).
    - Get write request, write() to backing file. Repeat.
    - If guest crashes it always sees in order, if host crashes you're
    out of luck.

    2) Dumb device which tries to handle host crashes.
    - No one wants this: requires a fdatasync() after every write.

    3) Smart device. Uses AIO/threads to service requests.
    - Needs flushes otherwise if guest crashes it can see out of order.
    - Flushes must wait for outstanding requests.

    4) Smart device which tries to handle host crashes.
    - Flushes must fdatasync() after waiting.

    The interesting question is between 3 & 4:
    - Do we differentiate 3 and 4 from the guest side?
    - Or do we ban 3 and insist on 4? Knowing that there are no guarantees that an
    implementation will actually hit the metal (eg. crappy underlying
    device or crappy non-barrier filesystem).

    Whatever we do, I don't see why we'd want to toggle WCE after
    negotiation. If you implement a smart device, you'd need to drop to a
    single thread, but you'd definitely lose host-crash reliability.

    Cheers,
    Rusty.
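
    The four host-side designs listed above differ only in when the
    backend syncs and whether a flush first waits for outstanding
    requests. A minimal C sketch of that distinction follows, using the
    POSIX calls pwrite() and fdatasync(). struct backend, backend_write(),
    backend_flush() and the drain callback are hypothetical names made up
    for illustration; the AIO/thread machinery of designs 3 and 4 is
    reduced to that callback, and waiting for all outstanding requests is
    a simplification.

```c
#define _XOPEN_SOURCE 700
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical backend description; all field names are illustrative. */
struct backend {
    int fd;                  /* backing file */
    bool sync_every_write;   /* design 2: fdatasync() after each write */
    bool durable_flush;      /* design 4: flush also calls fdatasync() */
    void (*drain)(void *);   /* designs 3/4: wait for outstanding requests;
                                NULL for the single-threaded designs 1/2 */
    void *drain_arg;
};

/* Service one guest write; returns 0 on success, -1 on error. */
int backend_write(const struct backend *b, const void *buf, size_t len,
                  off_t off)
{
    if (pwrite(b->fd, buf, len, off) != (ssize_t)len)
        return -1;
    /* Design 2 pays for host-crash safety on every write. */
    return b->sync_every_write ? fdatasync(b->fd) : 0;
}

/* Service one guest flush; returns 0 on success, -1 on error. */
int backend_flush(const struct backend *b)
{
    /* Designs 3 and 4: earlier requests must have finished first. */
    if (b->drain)
        b->drain(b->drain_arg);
    /* Design 4 only: push the data out of the host page cache. */
    return b->durable_flush ? fdatasync(b->fd) : 0;
}
```

    In this sketch, the question of banning design 3 in favor of design 4
    comes down to whether durable_flush may ever be false.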




  • 25.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-11-2013 04:49
    On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
    > James Bottomley <jbottomley@parallels.com> writes:
    > > [resending to virtio-comment; it looks like I'm not subscribed to
    > > virtio-dev ... how do you subscribe?]
    >
    > Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
    > https://www.oasis-open.org/mlmanage/
    >
    > BTW, I've moved this to virtio@ since it's core business, with virtio-comment
    > cc'd.
    >

    Hmm I don't think we should Cc both - this just makes most people
    get two copies.

    > > Sorry, I don't have a copy of the original email to reply to:
    > >
    > > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    > >
    > > The part that concerns me is this:
    > >
    > >> +5. The cache mode should be read from the writeback field of the configuration
    > >> + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    > >> + write to the field in order to toggle the cache between writethrough (0)
    > >> + and writeback (1) mode.
    > >> + If the feature is not available, the driver can instead look at the result
    > >> + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    > >> + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    > >
    > > The questions are twofold and have to do with Write Back only disks (to
    > > date we've seen quite a few ATA devices like this and a huge number of
    > > USB devices):
    > >
    > > 1. If the guest doesn't negotiate WCE, what do you do on the host
    > > (flush on every write is one possible option; run unsafe and
    > > hope the host doesn't crash is another).
    > > 2. If the guest asks to toggle the device from writeback (1) to
    > > writethrough (0) mode, what do you do? Refuse the toggle would
    > > be reasonable or flip back into whatever mode you were using to
    > > handle 1. is also possible.
    > >
    > > James
    >
    > I thought about this more after the call. If we look at block device
    > implementations on the host:
    >
    > 1) Dumb device (ie. no flush support).
    > - Get write request, write() to backing file. Repeat.
    > - If guest crashes it always sees in order, if host crashes you're
    > out of luck.
    >
    > 2) Dumb device which tries to handle host crashes.
    > - Noone wants this: requires a fdatasync() after every write.
    >
    > 3) Smart device. Uses AIO/threads to service requests.
    > - Needs flushes otherwise if guest crashes it can see out of order.
    > - Flushes can must wait for outstanding requests.
    >
    > 4) Smart device which tries to handle host crashes.
    > - Flushes must fdatasync() after waiting.
    >
    > The interesting question is between 3 & 4:
    > - Do we differentiate 3 and 4 from the guest side?
    > - Or do we ban 3 and insist on 4? Knowing that there are no guarantees that an
    > implementation will actually hit the metal (eg. crappy underlying
    > device or crappy non-barrier filesystem).
    >
    > Whatever we do, I don't see why we'd want to toggle WCE after
    > negotiation.

    I think it's mostly because sdparm let you tweak it for normal disks ...

    > If you implement a smart device, you'd need to drop to a
    > single thread, but you'd definitely lose host-crash reliability.
    >
    > Cheers,
    > Rusty.


  • 26.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-11-2013 08:21
    On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
    > James Bottomley <jbottomley@parallels.com> writes:
    > > [resending to virtio-comment; it looks like I'm not subscribed to
    > > virtio-dev ... how do you subscribe?]
    >
    > Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
    > https://www.oasis-open.org/mlmanage/
    >
    > BTW, I've moved this to virtio@ since it's core business, with virtio-comment
    > cc'd.
    >
    > > Sorry, I don't have a copy of the original email to reply to:
    > >
    > > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    > >
    > > The part that concerns me is this:
    > >
    > >> +5. The cache mode should be read from the writeback field of the configuration
    > >> + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    > >> + write to the field in order to toggle the cache between writethrough (0)
    > >> + and writeback (1) mode.
    > >> + If the feature is not available, the driver can instead look at the result
    > >> + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    > >> + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    > >
    > > The questions are twofold and have to do with Write Back only disks (to
    > > date we've seen quite a few ATA devices like this and a huge number of
    > > USB devices):
    > >
    > > 1. If the guest doesn't negotiate WCE, what do you do on the host
    > > (flush on every write is one possible option; run unsafe and
    > > hope the host doesn't crash is another).

    The default WCE=0 semantics should be that the host ensures every write
    reaches stable storage.

    This can optionally be overridden in the host. It might be useful, for
    example during guest OS installation where you throw away the image if
    installation is interrupted by a power failure.

    > > 2. If the guest asks to toggle the device from writeback (1) to
    > > writethrough (0) mode, what do you do? Refuse the toggle would
    > > be reasonable or flip back into whatever mode you were using to
    > > handle 1. is also possible.

    I don't think there is a reasonable way to refuse since the WCE toggle
    is implemented as a configuration space field. It's hard to return an
    error from configuration space stores - virtio-net moved to a control
    virtqueue in order to support configuration updates properly.

    The transition from writeback (1) to writethrough (0) mode should be
    allowed and the host uses the same solution as for #1. I think your
    suggestion is a good idea.

    > I thought about this more after the call. If we look at block device
    > implementations on the host:
    >
    > 1) Dumb device (ie. no flush support).
    > - Get write request, write() to backing file. Repeat.
    > - If guest crashes it always sees in order, if host crashes you're
    > out of luck.
    >
    > 2) Dumb device which tries to handle host crashes.
    > - Noone wants this: requires a fdatasync() after every write.
    >
    > 3) Smart device. Uses AIO/threads to service requests.
    > - Needs flushes otherwise if guest crashes it can see out of order.
    > - Flushes can must wait for outstanding requests.
    >
    > 4) Smart device which tries to handle host crashes.
    > - Flushes must fdatasync() after waiting.
    >
    > The interesting question is between 3 & 4:
    > - Do we differentiate 3 and 4 from the guest side?
    > - Or do we ban 3 and insist on 4? Knowing that there are no guarantees that an
    > implementation will actually hit the metal (eg. crappy underlying
    > device or crappy non-barrier filesystem).
    >
    > Whatever we do, I don't see why we'd want to toggle WCE after
    > negotiation. If you implement a smart device, you'd need to drop to a
    > single thread, but you'd definitely lose host-crash reliability.

    I think this classification doesn't correspond to the actual semantics
    of disks. My understanding is that:

    If the host submits multiple requests then ordering is not guaranteed.
    WCE=0 does not imply that requests become ordered. Therefore comments
    about dropping to a single thread don't appear correct to me.

    For example, the host wants to ensure that write A reaches the disk
    before write B. With WCE=0 the host must wait for write A to complete
    before submitting write B.
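A minimal sketch of the ordering rule described above, assuming a POSIX host and a plain file as the backing store. The helper names `write_through` and `write_a_then_b` are hypothetical and do not come from any hypervisor; the point is only that, under WCE=0, a write is not reported complete until it has been synced, and B is not submitted until A has completed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report a write as complete only after the data has
 * been pushed to stable storage (the WCE=0 behaviour discussed here). */
int write_through(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;          /* error or no progress: give up */
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return fdatasync(fd);       /* completion implies stable storage */
}

/* Submit B only after A has completed, so A reaches the disk first. */
int write_a_then_b(int fd,
                   const void *a, size_t alen, off_t aoff,
                   const void *b, size_t blen, off_t boff)
{
    if (write_through(fd, a, alen, aoff) != 0)
        return -1;              /* A failed: B must not be issued */
    return write_through(fd, b, blen, boff);
}
```

If the two writes were instead submitted concurrently, nothing in this sketch (or, per the argument above, in WCE=0 itself) would order them.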

    I also don't think you lose host-crash reliability by dropping to WCE=0.
    The guest initiated the WCE 1 -> 0 change and therefore it understands
    the rules for reaching stable storage. The guest OS or application
    would wait for write A to complete before issuing write B if A -> B
    ordering is necessary.

    Finally, let's not worry about broken storage stacks that do not
    propagate flushes. Let's specify virtio-blk WCE to work like real disks
    and then hypervisors can let users restrict themselves to safe modes if
    the stack doesn't support all modes.

    For example, it was typical to run legacy guests (old LVM) with WCE=0
    since the guest storage stack did not propagate flushes. That's a
    *configuration* choice but at the spec level all we need to do is:
    1. Make guests that are unaware of WCE default to WCE=0.
    2. Expose WCE toggling to guests that are aware.

    Stefan



  • 28.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-16-2013 01:35
    Stefan Hajnoczi <stefanha@redhat.com> writes:
    > On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
    >> James Bottomley <jbottomley@parallels.com> writes:
    >> > [resending to virtio-comment; it looks like I'm not subscribed to
    >> > virtio-dev ... how do you subscribe?]
    >>
    >> Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
    >> https://www.oasis-open.org/mlmanage/
    >>
    >> BTW, I've moved this to virtio@ since it's core business, with virtio-comment
    >> cc'd.
    >>
    >> > Sorry, I don't have a copy of the original email to reply to:
    >> >
    >> > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
    >> >
    >> > The part that concerns me is this:
    >> >
    >> >> +5. The cache mode should be read from the writeback field of the configuration
    >> >> + if the VIRTIO_BLK_F_CONFIG_WCE feature if available; the driver can also
    >> >> + write to the field in order to toggle the cache between writethrough (0)
    >> >> + and writeback (1) mode.
    >> >> + If the feature is not available, the driver can instead look at the result
    >> >> + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
    >> >> + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
    >> >
    >> > The questions are twofold and have to do with Write Back only disks (to
    >> > date we've seen quite a few ATA devices like this and a huge number of
    >> > USB devices):
    >> >
    >> > 1. If the guest doesn't negotiate WCE, what do you do on the host
    >> > (flush on every write is one possible option; run unsafe and
    >> > hope the host doesn't crash is another).
    >
    > The default WCE=0 semantics should be that the host ensures every write
    > reaches stable storage.

    Here's the problem: I don't think anyone will really implement this.

    lguest certainly doesn't flush every write, nor does bhyve. Xen famously
    didn't. I can't see where qemu does it either, but it could be buried
    in the aio stuff?

    > This can optionally be overridden in the host. It might be useful, for
    > example during guest OS installation where you throw away the image if
    > installation is interrupted by a power failure.

    If no one does it, and they don't have to, let's just be honest in the
    spec and specify that we don't expect them to do sync writes.

    >> > 2. If the guest asks to toggle the device from writeback (1) to
    >> > writethrough (0) mode, what do you do? Refuse the toggle would
    >> > be reasonable or flip back into whatever mode you were using to
    >> > handle 1. is also possible.
    >
    > I don't think there is a reasonable way to refuse since the WCE toggle
    > is implemented as a configuration space field. It's hard to return an
    > error from configuration space stores - virtio-net moved to a control
    > virtqueue in order to support configuration updates properly.
    >
    > The transition from writeback (1) to writethrough (0) mode should be
    > allowed and the host uses the same solution as for #1. I think your
    > suggestion is a good idea.
    >
    >> I thought about this more after the call. If we look at block device
    >> implementations on the host:
    >>
    >> 1) Dumb device (ie. no flush support).
    >> - Get write request, write() to backing file. Repeat.
    >> - If guest crashes it always sees in order, if host crashes you're
    >> out of luck.
    >>
    >> 2) Dumb device which tries to handle host crashes.
    >> - Noone wants this: requires a fdatasync() after every write.
    >>
    >> 3) Smart device. Uses AIO/threads to service requests.
    >> - Needs flushes otherwise if guest crashes it can see out of order.
    >> - Flushes can must wait for outstanding requests.
    >>
    >> 4) Smart device which tries to handle host crashes.
    >> - Flushes must fdatasync() after waiting.
    >>
    >> The interesting question is between 3 & 4:
    >> - Do we differentiate 3 and 4 from the guest side?
    >> - Or do we ban 3 and insist on 4? Knowing that there are no guarantees that an
    >> implementation will actually hit the metal (eg. crappy underlying
    >> device or crappy non-barrier filesystem).
    >>
    >> Whatever we do, I don't see why we'd want to toggle WCE after
    >> negotiation. If you implement a smart device, you'd need to drop to a
    >> single thread, but you'd definitely lose host-crash reliability.
    >
    > I think this classification doesn't correspond to the actual semantics
    > of disks. My understanding is that:
    >
    > If the host submits multiple requests then ordering is not guaranteed.
    > WCE=0 does not imply that requests become ordered. Therefore comments
    > about dropping to a single thread don't appear correct to me.
    >
    > For example, the host wants to ensure that write A reaches the disk
    > before write B. With WCE=0 the host must wait for write A to complete
    > before submitting write B.

    Right, I had missed that subtlety.

    > I also don't think you lose host-crash reliability by dropping to WCE=0.
    > The guest initiated the WCE 1 -> 0 change and therefore it understands
    > the rules for reaching stable storage. The guest OS or application
    > would wait for write A to complete before issuing write B if A -> B
    > ordering is necessary.

    But how would this guarantee be implemented on the host without syncing
    after every write? Ok, technically it could batch updates to the used
    ring and do a single fsync before that, but that doesn't seem much of a
    win.
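A rough sketch of the batching idea mentioned above, assuming a POSIX host and a plain backing file. `struct blk_req` and `complete_batch` are hypothetical names, and the `done` flag merely stands in for publishing a request in the used ring. It shows the mechanics only; whether one sync per batch is actually a win is left open, as in the message.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request descriptor; not a virtio or QEMU structure. */
struct blk_req {
    const void *buf;
    size_t len;
    off_t off;
    bool done;      /* stands in for "published in the used ring" */
};

/* Write every request in the batch, sync once, then complete them all.
 * Returns the number of fdatasync() calls issued (0 or 1), or -1 on error. */
int complete_batch(int fd, struct blk_req *reqs, size_t nreqs)
{
    assert(reqs != NULL || nreqs == 0);
    if (nreqs == 0)
        return 0;

    for (size_t i = 0; i < nreqs; i++) {
        const char *p = reqs[i].buf;
        size_t left = reqs[i].len;
        off_t off = reqs[i].off;

        while (left > 0) {
            ssize_t n = pwrite(fd, p, left, off);
            if (n <= 0)
                return -1;      /* nothing is marked done on failure */
            p += n;
            left -= (size_t)n;
            off += n;
        }
    }

    if (fdatasync(fd) != 0)     /* one sync covers the whole batch */
        return -1;

    for (size_t i = 0; i < nreqs; i++)
        reqs[i].done = true;    /* only now may the guest see completion */
    return 1;
}
```

The cost is latency: the first request in a batch is not completed until the last one has been written and synced.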

    > Finally, let's not worry about broken storage stacks that do not
    > propagate flushes. Let's specify virtio-blk WCE to work like real disks
    > and then hypervisors can let users restrict themselves to safe modes if
    > the stack doesn't support all modes.

    But they won't get host-crash resilience under any circumstances, right?
    Certainly if the host fs doesn't support barriers they won't...

    > For example, it was typical to run legacy guests (old LVM) with WCE=0
    > since the guest storage stack did not propagate flushes. That's a
    > *configuration* choice but at the spec level all we need to do is:
    > 1. Make guests that are unaware of WCE default to WCE=0.
    > 2. Expose WCE toggling to guests that are aware.

    That's one reason I prefer the simplified version: no-WCE means no
    host-crash guarantees, with-WCE means it hits the metal.

    Whether it's really sane to toggle WCE is another question, but it's
    currently a feature bit so we can just not offer it. Qemu seems not to
    offer it by default.

    Cheers,
    Rusty.




  • 31.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-16-2013 09:35
    On 16/09/2013 03:34, Rusty Russell wrote:
    >> > The default WCE=0 semantics should be that the host ensures every write
    >> > reaches stable storage.
    > Here's the problem: I don't think anyone will really implement this.
    >
    > lguest certainly doesn't flush every write, not bhyve. Xen famously
    > didn't. I can't see where qemu does it either, but it could be buried
    > in the aio stuff?
    >

    It's here in block.c's bdrv_co_do_writev:

    if (ret < 0) {
        /* Do nothing, write notifier decided to fail this request */
    } else if (flags & BDRV_REQ_ZERO_WRITE) {
        ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
    } else {
        ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
    }

    if (ret == 0 && !bs->enable_write_cache) {
        ret = bdrv_co_flush(bs);
    }

    Paolo



  • 32.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-17-2013 01:21
    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 16/09/2013 03:34, Rusty Russell wrote:
    >>> > The default WCE=0 semantics should be that the host ensures every write
    >>> > reaches stable storage.
    >> Here's the problem: I don't think anyone will really implement this.
    >>
    >> lguest certainly doesn't flush every write, not bhyve. Xen famously
    >> didn't. I can't see where qemu does it either, but it could be buried
    >> in the aio stuff?
    >>
    >
    > It's here in block.c's bdrv_co_do_writev:
    >
    > if (ret < 0) {
    > /* Do nothing, write notifier decided to fail this request */
    > } else if (flags & BDRV_REQ_ZERO_WRITE) {
    > ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
    > } else {
    > ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
    > }
    >
    > if (ret == 0 && !bs->enable_write_cache) {
    > ret = bdrv_co_flush(bs);
    > }

    The truth is more complex than that.

    bdrv_co_flush() calls:

    bs->drv->bdrv_co_flush_to_disk

    or if that's NULL:

    bs->drv->bdrv_aio_flush

    or if that's NULL, does nothing.

    Now, qcow2 doesn't set bdrv_co_flush_to_disk or bdrv_aio_flush. I tried
    to follow the others, eg. vmdk, but it's a completely runtime-determined
    maze of function pointers so I can't tell if anyone *actually* flushes
    to disk.

    It seems a raw file will get an fdatasync, via bdrv_aio_flush() ->
    raw_aio_flush(), but strace shows no fsync/fdatasync calls, even when I
    just use a raw file, and type "sync" in the guest:

    strace -e trace=file,desc -o /tmp/trace qemu-system-i386 -machine pc,accel=kvm -m 512 -drive file=/home/rusty/qemu-images/ubuntu-copy,index=0,media=disk,if=virtio -kernel arch/x86/boot/bzImage -append "ro root=/dev/vda1 single"

    So perhaps if you use the right back ends and set your non-default
    options just right you can have your data safe on disk? But you're
    certainly not convincing me anyone wants it...

    Cheers,
    Rusty.




  • 33.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-17-2013 05:07
    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 16/09/2013 03:34, Rusty Russell wrote:
    >>> > The default WCE=0 semantics should be that the host ensures every write
    >>> > reaches stable storage.
    >> Here's the problem: I don't think anyone will really implement this.
    >>
    >> lguest certainly doesn't flush every write, nor does bhyve. Xen famously
    >> didn't. I can't see where qemu does it either, but it could be buried
    >> in the aio stuff?
    >>
    >
    > It's here in block.c's bdrv_co_do_writev:
    >
    > if (ret < 0) {
    >     /* Do nothing, write notifier decided to fail this request */
    > } else if (flags & BDRV_REQ_ZERO_WRITE) {
    >     ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
    > } else {
    >     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
    > }
    >
    > if (ret == 0 && !bs->enable_write_cache) {
    >     ret = bdrv_co_flush(bs);
    > }

    The truth is more complex than that.

    bdrv_co_flush() calls:

    bs->drv->bdrv_co_flush_to_disk

    or if that's NULL:

    bs->drv->bdrv_aio_flush

    or if that's NULL, does nothing.

    Now, qcow2 doesn't set bdrv_co_flush_to_disk or bdrv_aio_flush. I tried
    to follow the others, e.g. vmdk, but it's a completely runtime-determined
    maze of function pointers, so I can't tell if anyone *actually* flushes
    to disk.

    It seems a raw file will get an fdatasync, via bdrv_aio_flush() ->
    raw_aio_flush(), but strace shows no fsync/fdatasync calls, even when I
    just use a raw file and type "sync" in the guest:

    strace -e trace=file,desc -o /tmp/trace qemu-system-i386 -machine pc,accel=kvm -m 512 -drive file=/home/rusty/qemu-images/ubuntu-copy,index=0,media=disk,if=virtio -kernel arch/x86/boot/bzImage -append "ro root=/dev/vda1 single"

    So perhaps if you use the right back ends and set your non-default
    options just right you can have your data safe on disk? But you're
    certainly not convincing me anyone wants it...

    Cheers,
    Rusty.


  • 34.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-18-2013 15:39
    On 17/09/2013 03:20, Rusty Russell wrote:
    > bdrv_co_flush() calls:
    >
    > bs->drv->bdrv_co_flush_to_disk
    >
    > or if that's NULL:
    >
    > bs->drv->bdrv_aio_flush
    >
    > or if that's NULL, does nothing.
    >
    > Now, qcow2 doesn't set bdrv_co_flush_to_disk or bdrv_aio_flush.

    It also does

    bdrv_co_flush(bs->file);

    and that's where bdrv_co_flush_to_disk gets called.

    > It seems a raw file will get an fdatasync, via bdrv_aio_flush() ->
    > raw_aio_flush(), but strace shows no fsync/fdatasync calls, even when I
    > just use a raw file and type "sync" in the guest:

    That's the same path that gets qcow2 flushed, actually. You need
    "strace -ff" because the flush is done in a thread pool.

    > So perhaps if you use the right back ends and set your non-default
    > options just right you can have your data safe on disk? But you're
    > certainly not convincing me anyone wants it...

    I'm 100% sure that QEMU gets data safely on disk. :)

    Paolo


  • 35.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-18-2013 15:40
    On 17/09/2013 03:20, Rusty Russell wrote:
    > bdrv_co_flush() calls:
    >
    > bs->drv->bdrv_co_flush_to_disk
    >
    > or if that's NULL:
    >
    > bs->drv->bdrv_aio_flush
    >
    > or if that's NULL, does nothing.
    >
    > Now, qcow2 doesn't set bdrv_co_flush_to_disk or bdrv_aio_flush.

    It also does

    bdrv_co_flush(bs->file);

    and that's where bdrv_co_flush_to_disk gets called.
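
A rough model of that recursion, assuming a two-layer stack in which a format layer (such as qcow2) has no flush callback of its own and sits on top of a file layer that does. The names `struct node`, `node_flush` and `mark_flushed` are hypothetical; this illustrates the call pattern described here and is not QEMU's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layer in a block stack. */
struct node {
    int (*flush_to_disk)(struct node *n);  /* NULL if this layer has none */
    struct node *file;                     /* underlying layer, or NULL */
    int flushed;                           /* bookkeeping for the example */
};

/* Flush this layer if it has a callback, then always recurse into the
 * underlying layer, so a layer without callbacks still reaches the file. */
static int node_flush(struct node *n)
{
    if (n == NULL) {
        return 0;  /* bottom of the stack */
    }
    assert(n->file != n);  /* guard against a trivially cyclic stack */
    if (n->flush_to_disk != NULL) {
        int ret = n->flush_to_disk(n);
        if (ret < 0) {
            return ret;
        }
    }
    return node_flush(n->file);
}

/* Example callback that only records that it ran. */
static int mark_flushed(struct node *n)
{
    n->flushed++;
    return 0;
}
```

In this model, flushing the top layer reaches the bottom layer's callback even though the top layer defines none, which is the behaviour being claimed for qcow2 over a raw file.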

    > It seems a raw file will get an fdatasync, via bdrv_aio_flush() ->
    > raw_aio_flush(), but strace shows no fsync/fdatasync calls, even when I
    > just use a raw file and type "sync" in the guest:

    That's the same path that gets qcow2 flushed, actually. You need
    "strace -ff" because the flush is done in a thread pool.

    > So perhaps if you use the right back ends and set your non-default
    > options just right you can have your data safe on disk? But you're
    > certainly not convincing me anyone wants it...

    I'm 100% sure that QEMU gets data safely on disk. :)

    Paolo



  • 36.  Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks

    Posted 09-19-2013 01:25
    [ Moved to virtio-dev, since we're now talking about an implementation ]

    Paolo Bonzini <pbonzini@redhat.com> writes:
    > On 17/09/2013 03:20, Rusty Russell wrote:
    >> bdrv_co_flush() calls:
    >>
    >> bs->drv->bdrv_co_flush_to_disk
    >>
    >> or if that's NULL:
    >>
    >> bs->drv->bdrv_aio_flush
    >>
    >> or if that's NULL, does nothing.
    >>
    >> Now, qcow2 doesn't set bdrv_co_flush_to_disk or bdrv_aio_flush.
    >
    > It also does
    >
    > bdrv_co_flush(bs->file);
    >
    > and that's where bdrv_co_flush_to_disk gets called.
    >
    >> It seems a raw file will get an fdatasync, via bdrv_aio_flush() ->
    >> raw_aio_flush(), but strace shows no fsync/fdatasync calls, even when I
    >> just use a raw file and type "sync" in the guest:
    >
    > That's the same path that gets qcow2 flushed, actually. You need
    > "strace -ff" because the flush is done in a thread pool.

    Oops, I missed the -f. Unfortunately, it doesn't help. I only get
    fdatasync on shutdown, not when I do ops in the guest.
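
For context on why the -f matters: if the flush is issued from a worker thread, the fdatasync() system call is made by that thread and not by the process's main thread, so a trace that does not follow threads will not show it. Below is a minimal sketch with a single worker standing in for a thread pool; `struct flush_job`, `flush_worker` and `flush_in_worker` are hypothetical names, not QEMU functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical job record handed to the worker thread. */
struct flush_job {
    int fd;      /* file descriptor to flush */
    int result;  /* return value of fdatasync() */
};

/* Runs in the worker thread: this is where the syscall is actually made. */
static void *flush_worker(void *arg)
{
    struct flush_job *job = arg;
    assert(job != NULL);
    job->result = fdatasync(job->fd);
    return NULL;
}

/* Issue a flush from a separate thread and wait for it to finish.
 * Returns the fdatasync() result, or -1 if the thread could not be run. */
static int flush_in_worker(int fd)
{
    struct flush_job job = { fd, -1 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, flush_worker, &job) != 0) {
        return -1;
    }
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    return job.result;
}
```

Build with -pthread where the toolchain requires it.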

    So I ended up tracking this into the virtio_blk driver: the guest isn't
    submitting any flush commands. It's accepted both the VIRTIO_BLK_F_WCE
    and VIRTIO_BLK_F_CONFIG_WCE flags, and
    /sys/devices/pci0000:00/0000:00:04.0/virtio0/block/vda/cache_type says
    "write back".
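
For reference, the rule from the draft text quoted at the start of this thread can be sketched as follows: if VIRTIO_BLK_F_CONFIG_WCE is negotiated, the cache mode comes from the writeback config field; otherwise the cache is in writeback mode if and only if VIRTIO_BLK_F_WCE was negotiated. The bit numbers are those used in Linux's virtio_blk.h; the helper names `has_feature` and `cache_is_writeback` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as defined in Linux's virtio_blk.h. */
#define VIRTIO_BLK_F_WCE        9
#define VIRTIO_BLK_F_CONFIG_WCE 11

/* True if the given feature bit is set in the negotiated feature mask. */
static bool has_feature(uint64_t features, unsigned bit)
{
    assert(bit < 64);
    return (features >> bit) & 1;
}

/* config_writeback is the value of the writeback config field; it is only
 * consulted when VIRTIO_BLK_F_CONFIG_WCE has been negotiated. */
static bool cache_is_writeback(uint64_t negotiated, uint8_t config_writeback)
{
    if (has_feature(negotiated, VIRTIO_BLK_F_CONFIG_WCE)) {
        return config_writeback != 0;
    }
    return has_feature(negotiated, VIRTIO_BLK_F_WCE);
}
```

With both features accepted, as in the setup described here, the result is decided by the config field alone, which is consistent with cache_type reporting "write back".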

    It's mounted using ext4 with default options, on kernel 3.12-rc1.

    Clearly, I'm missing some Linux block layer thing. How do I get it to
    submit a flush to the virtio_blk driver? I would expect "touch
    some-new-file; sync" to do that.

    Seriously confused,
    Rusty.