Stefan Hajnoczi <stefanha@redhat.com> writes:
> On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
>> James Bottomley <jbottomley@parallels.com> writes:
>> > [resending to virtio-comment; it looks like I'm not subscribed to
>> > virtio-dev ... how do you subscribe?]
>>
>> Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
>> https://www.oasis-open.org/mlmanage/
>>
>> BTW, I've moved this to virtio@ since it's core business, with virtio-comment
>> cc'd.
>>
>> > Sorry, I don't have a copy of the original email to reply to:
>> >
>> > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
>> >
>> > The part that concerns me is this:
>> >
>> >> +5. The cache mode should be read from the writeback field of the configuration
>> >> + if the VIRTIO_BLK_F_CONFIG_WCE feature is available; the driver can also
>> >> + write to the field in order to toggle the cache between writethrough (0)
>> >> + and writeback (1) mode.
>> >> + If the feature is not available, the driver can instead look at the result
>> >> + of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
>> >> + reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
>> >
>> > The questions are twofold and have to do with Write Back only disks (to
>> > date we've seen quite a few ATA devices like this and a huge number of
>> > USB devices):
>> >
>> > 1. If the guest doesn't negotiate WCE, what do you do on the host?
>> > (Flushing on every write is one possible option; running unsafe
>> > and hoping the host doesn't crash is another.)
>
> The default WCE=0 semantics should be that the host ensures every write
> reaches stable storage.
Here's the problem: I don't think anyone will really implement this.
lguest certainly doesn't flush every write, nor does bhyve. Xen famously
didn't. I can't see where qemu does it either, but it could be buried
in the aio stuff?
> This can optionally be overridden in the host. It might be useful, for
> example during guest OS installation where you throw away the image if
> installation is interrupted by a power failure.
If no one does it, and they don't have to, let's just be honest in the
spec and specify that we don't expect them to do sync writes.
>> > 2. If the guest asks to toggle the device from writeback (1) to
>> > writethrough (0) mode, what do you do? Refusing the toggle would
>> > be reasonable; flipping back into whatever mode you were using to
>> > handle 1. is also possible.
>
> I don't think there is a reasonable way to refuse since the WCE toggle
> is implemented as a configuration space field. It's hard to return an
> error from configuration space stores - virtio-net moved to a control
> virtqueue in order to support configuration updates properly.
>
> The transition from writeback (1) to writethrough (0) mode should be
> allowed and the host uses the same solution as for #1. I think your
> suggestion is a good idea.
>
>> I thought about this more after the call. If we look at block device
>> implementations on the host:
>>
>> 1) Dumb device (ie. no flush support).
>> - Get write request, write() to backing file. Repeat.
>> - If the guest crashes it always sees writes in order; if the host crashes
>> you're out of luck.
>>
>> 2) Dumb device which tries to handle host crashes.
>> - No one wants this: requires a fdatasync() after every write.
>>
>> 3) Smart device. Uses AIO/threads to service requests.
>> - Needs flushes, otherwise if the guest crashes it can see writes out of order.
>> - Flushes must wait for outstanding requests.
>>
>> 4) Smart device which tries to handle host crashes.
>> - Flushes must fdatasync() after waiting.
>>
>> The interesting question is between 3 & 4:
>> - Do we differentiate 3 and 4 from the guest side?
>> - Or do we ban 3 and insist on 4, knowing that there are no guarantees that an
>> implementation will actually hit the metal (eg. crappy underlying
>> device or crappy non-barrier filesystem)?
>>
>> Whatever we do, I don't see why we'd want to toggle WCE after
>> negotiation. If you implement a smart device, you'd need to drop to a
>> single thread, but you'd definitely lose host-crash reliability.
>
> I think this classification doesn't correspond to the actual semantics
> of disks. My understanding is that:
>
> If the host submits multiple requests then ordering is not guaranteed.
> WCE=0 does not imply that requests become ordered. Therefore comments
> about dropping to a single thread don't appear correct to me.
>
> For example, the host wants to ensure that write A reaches the disk
> before write B. With WCE=0 the host must wait for write A to complete
> before submitting write B.
Right, I had missed that subtlety.
> I also don't think you lose host-crash reliability by dropping to WCE=0.
> The guest initiated the WCE 1 -> 0 change and therefore it understands
> the rules for reaching stable storage. The guest OS or application
> would wait for write A to complete before issuing write B if A -> B
> ordering is necessary.
But how would this guarantee be implemented on the host without syncing
after every write? Ok, technically it could batch updates to the used
ring and do a single fsync before that, but that doesn't seem much of a
win.
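
To make the batching idea concrete, here is a rough sketch in Python. It is not taken from any real hypervisor: complete_batch, requests and used_ring are hypothetical names, a plain list stands in for the used ring, and the backing file is assumed to be an ordinary file descriptor. It writes every request in a batch, issues one fsync, and only then publishes the completions:

```python
import os

def complete_batch(fd, requests, used_ring, sync=os.fsync):
    """Write each (offset, data) request, sync once, then publish completions.

    Hypothetical sketch: short writes and I/O errors are not handled.
    """
    for offset, data in requests:
        os.pwrite(fd, data, offset)      # data may still sit in the host page cache
    sync(fd)                             # one sync covers every write in the batch
    for index in range(len(requests)):
        used_ring.append(index)          # guest sees completions only after the sync
```

The guest still waits for a sync before it sees any completion, so the saving over a sync per write depends entirely on how many requests happen to be in a batch.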
> Finally, let's not worry about broken storage stacks that do not
> propagate flushes. Let's specify virtio-blk WCE to work like real disks
> and then hypervisors can let users restrict themselves to safe modes if
> the stack doesn't support all modes.
But they won't get host-crash resilience under any circumstances, right?
Certainly if the host fs doesn't support barriers they won't...
> For example, it was typical to run legacy guests (old LVM) with WCE=0
> since the guest storage stack did not propagate flushes. That's a
> *configuration* choice but at the spec level all we need to do is:
> 1. Make guests that are unaware of WCE default to WCE=0.
> 2. Expose WCE toggling to guests that are aware.
That's one reason I prefer the simplified version: no-WCE means no
host-crash guarantees, with-WCE means it hits the metal.
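
One possible reading of that simplified rule, sketched in Python from the device side. The function name handle_flush and its return value are made up for illustration, and "hits the metal" is assumed here to mean that a completed flush request has been fsync'd on the host:

```python
import os

def handle_flush(fd, wce_negotiated, sync=os.fsync):
    """Service a guest flush request under the simplified rule (hypothetical)."""
    if wce_negotiated:
        sync(fd)       # with-WCE: a completed flush has reached stable storage
        return True    # durable, as far as the host stack honours fsync
    return False       # no-WCE: request completes, no host-crash guarantee
```

The return value only says whether the device made a durability promise; it is not something the guest would see in the ring.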
Whether it's really sane to toggle WCE is another question, but it's
currently a feature bit so we can just not offer it. Qemu seems not to
offer it by default.
Cheers,
Rusty.