On Thu, Jul 02, 2015 at 11:48:15AM +0200, Paolo Bonzini wrote: > > > On 02/07/2015 11:27, Michael S. Tsirkin wrote: > > On Thu, Jul 02, 2015 at 09:57:07AM +0200, Paolo Bonzini wrote: > >> VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance > >> (up to 2x, though more realistically +30-40%) in latency-bound workloads. > >> However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. > >> > >> In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great > >> idea, because it simplifies simple drivers (e.g. firmware) that are okay > >> with a writethrough cache but still need data to persist after power loss. > >> What really should have been removed is just the possibility that devices > >> not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in > >> the new world of conformance statements. > >> > >> Restore these, with the following changes: > >> > >> * clarify and use conformance statements in order to define writeback > >> and writethrough caching according to what is commonly done by high-end > >> storage. > >> > >> * clarify (with conformance statements) the influence of the > >> VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of > >> VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. > >> > >> * strengthen the requirement for persisting writes after a > >> VIRTIO_BLK_T_FLUSH request. This is now a MUST. > >> > >> This fixes VIRTIO-144. > >> > >> Signed-off-by: Paolo Bonzini <
pbonzini@redhat.com> > >> --- > >> content.tex 76 ++++++++++++++++++++++++++++++++++++++++++------------------- > >> 1 file changed, 52 insertions(+), 24 deletions(-) > >> > >> diff --git a/content.tex b/content.tex > >> index 8de905f..4e03844 100644 > >> --- a/content.tex > >> +++ b/content.tex > >> @@ -3723,26 +3723,31 @@ device except where noted. > >> > >> item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in field{blk_size}. > >> > >> +item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > >> + > >> item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O > >> alignment. > >> + > >> +item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > >> + and writethrough modes. > >> end{description} > >> > >> +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it > >> +if they propose VIRTIO_BLK_F_CONFIG_WCE. Drivers SHOULD only negotiate > >> +these features if they are capable of sending VIRTIO_BLK_T_FLUSH commands. > >> + > > > > Sorry this is a non-normative section, so you can't say MUST and SHOULD > > here. You'll need to add a normative one, and link from conformance > > sections. > > It's ok to have a bit of duplication between normative > > and non-normative ones if this helps readability. > > Ok, will move this to a normative subsubsection. > > >> subsubsection{Legacy Interface: Feature bits}label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} > >> > >> egin{description} > >> item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. > >> > >> item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. > >> - > >> -item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > >> - > > > > Did you mean to remove this one? > > Yes, it's not legacy anymore. I don't see it readded anywhere though. If I missed it, feel free to ignore. > >> subsection{Device configuration layout}label{sec:Device Types / Block Device / Device configuration layout} > >> > >> @@ -3771,7 +3776,7 @@ struct virtio_blk_config { > >> // optimal (suggested maximum) I/O size in blocks > >> le32 opt_io_size; > >> } topology; > >> - u8 reserved; > >> + u8 writeback; > >> }; > >> end{lstlisting} > >> > >> @@ -3801,21 +3806,27 @@ according to the native endian of the guest rather than > >> field{topology} struct can be read to determine the physical block size and optimal > >> I/O lengths for the driver to use. This also does not affect the units > >> in the protocol, only performance. > >> -end{enumerate} > >> - > >> -subsubsection{Legacy Interface: Device Initialization}label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} > >> > >> -The field{reserved} field used to be called field{writeback}. If the > >> -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be > >> -read from field{writeback}; the > >> -driver can also write to the field in order to toggle the cache > >> -between writethrough (0) and writeback (1) mode. If the feature is > >> -not available, the driver can instead look at the result of > >> -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode > >> -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. > >> +item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache > >> + mode can be read from field{writeback}. 0 corresponds to a writethrough > >> + cache, 1 to a writeback cachefootnote{Consistent with > >> +
ef{sec:Device Types / Block Device / Device Operation}, > >> + a writethrough cache can be defined broadly as a cache that commits > >> + writes to non-volatile storage before reporting their completion. > >> + For example, a battery-backed writeback cache actually counts as > >> + writethrough according to this definition.}. > > > > This seems a bit too much text for a footnote, I'd still prefer it > > as a simple text note. > > I can't do this inside an {enumerate}. It is long-ish, but it's just > documenting what existing hardware does. Why not? I just tried this: note nests fine within enumerate. > >> + > >> +item If the VIRTIO_BLK_F_CONFIG_WCE feature is not available, the driver > >> + can deduct the cache mode by looking at the result of negotiating > >> + VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode if and only if > >> + VIRTIO_BLK_F_FLUSH is negotiated. > >> +end{enumerate} > >> > >> -Some older legacy devices did not operate in writethrough mode even > >> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. > >> +If the VIRTIO_BLK_F_FLUSH feature is not negotiated while > >> +VIRTIO_BLK_F_CONFIG_WCE is, field{writeback} MUST read 0 immediately > >> +after feature negotiation. > > > > This is confusing. > > Previously you simply said writeback is only valid if > > VIRTIO_BLK_F_CONFIG_WCE is negotiated. > > This paragraph says "If the VIRTIO_BLK_F_FLUSH feature is not negotiated > *while VIRTIO_BLK_F_CONFIG_WCE is*". Would be clearer to revert the order then. Is this logic something that legacy devices implemented? I don't see anything like this in QEMU. > > Also, what's the reset value if it is negotiated? > > It is not specified. The device can make it persistent or reset it to a > user-specified value. Does driver have to read it on init then? > > Also, should we prohibit driver from writing the field > > if VIRTIO_BLK_F_FLUSH is off? > > It doesn't really matter I think. > > >> Alternatively, a device MAY simply fail > >> +feature negotiation if the driver proposes VIRTIO_BLK_F_CONFIG_WCE but > >> +does not propose VIRTIO_BLK_F_FLUSH. > > > > Alternatively to what? > > I'll change the previous sentence to "field{writeback} MUST read 0 > immediately after successful feature negotiation" (adding "successful"). > Then, failing feature negotiation can be an alternative because you > don't reach successful negotiation. I think this text could benefit from separating between two concepts: - device writenback mode - writeback field operation should refer to current mode, separately field might or might not make it discoverable. I see this in QEMU: /* A guest that supports VIRTIO_BLK_F_CONFIG_WCE must be able to send * cache flushes. Thus, the "auto writethrough" behavior is never * necessary for guests that support the VIRTIO_BLK_F_CONFIG_WCE feature. * Leaving it enabled would break the following sequence: * * Guest started with "-drive cache=writethrough" * Guest sets status to 0 * Guest sets DRIVER bit in status field * Guest reads host features (WCE=0, CONFIG_WCE=1) * Guest writes guest features (WCE=0, CONFIG_WCE=1) * Guest writes 1 to the WCE configuration field (writeback mode) * Guest sets DRIVER_OK bit in status field * * s->blk would erroneously be placed in writethrough mode. */ if (!virtio_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE)) { aio_context_acquire(blk_get_aio_context(s->blk)); blk_set_enable_write_cache(s->blk, virtio_has_feature(vdev, VIRTIO_BLK_F_WCE)); aio_context_release(blk_get_aio_context(s->blk)); } is this what this text is trying to say? > > We can't have a mandatory behaviour but also > > make it optional. And driver does not propose features, it negotiates > > them. > > Will fix. > > > Or maybe device must fail negotiation if it meets some criteria? > > Yes, but just "MAY". Devices may fail negotiation if any dependencies are not met. Is this the case? See Feature bit requirements that we have for the net device, I think that's formulated in a clearer way. > >> subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} > >> > >> @@ -3871,8 +3882,25 @@ A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR > >> for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT > >> write any data. > >> > >> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure > >> -that any writes which were completed are committed to non-volatile storage. > >> +If the field{VIRTIO_BLK_F_FLUSH} feature is not negotiated, the > >> +device SHOULD ensure that all writes are committed to non-volatile > >> +storage before reporting completion. It MUST do so if it proposed > >> +field{VIRTIO_BLK_F_FLUSH}. Failure to do so can cause data loss. > > > > The 1st one is a SHOULD because not everyone cares about data loss. > > But why is the second one a MUST? Same logic seems to apply. > > Because then the driver can _know_ that it is safe from data losses, by > looking at proposed features. I still don't see a difference. the point of the 1st SHOULD is that maybe it's used with e.g. -snapshot so data will be discarded on qemu exit anyway. Maybe we should specify that data loss is in case of a crash. This seems to apply to all cases that might lead to data loss. IOW driver doesn't really know it's safe from losses since backend can still lose data. All driver cares about is what is required from driver to avoid losses. > > Same below. > > > >> + > >> +If the field{VIRTIO_BLK_F_FLUSH} feature is negotiated, upon receipt > >> +of a VIRTIO_BLK_T_FLUSH request, the device MUST ensure that any writes > >> +which were completed are committed to non-volatile storage. > > > > Aren't there any driver requirements? > > E.g. with one mode driver may assume data is safe > > when complete, in another is should issue flush requests > > to commit it to storage. > > I don't follow this. Can you rephrase? If writeback is enabled, device can cache requests, and driver must issue explicit flushes, correct? If writeback is disabled, driver may assume data is committed after request completes. Above should be conformance statements for the driver. > >> + > >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated and the > >> +field{writeback} field in configuration space is 0, the device MUST > >> +ensure that all writes are committed to non-volatile storage before > >> +reporting completion. > >> + > >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY > >> +switch to writethrough or writeback mode by writing respectively 0 and > >> +1 to the field{writeback} field. After writing a 0 to field{writeback}, > >> +the driver MUST NOT assume that writes submitted while the field was 1 > > > > since the last VIRTIO_BLK_F_FLUSH completed, correct? > > More precisely "submitted while the field was 1 and completed after the > last VIRTIO_BLK_F_FLUSH was submitted". Does it have to be specified > explicitly? > > Paolo If we want it understood correctly, I guess yes. > >> have +been committed to non-volatile storage. > > > > > > > >> > >> subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} > >> When using the legacy interface, transitional devices and drivers > >> -- > >> 2.4.3 > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe from this mail list, you must leave the OASIS TC that > >> generates this mail. Follow this link to all your TCs in OASIS at: > >>
https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php