OASIS Virtual I/O Device (VIRTIO) TC

Expand all | Collapse all

[PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

  • 1.  [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-02-2015 14:20
    VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance (up to 2x, though more realistically +30-40%) in latency-bound workloads. However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great idea, because it simplifies simple drivers (e.g. firmware) that are okay with a writethrough cache but still need data to persist after power loss. What really should have been removed is just the possibility that devices not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in the new world of conformance statements. Restore these, with the following changes: * clarify and use conformance statements in order to define writeback and writethrough caching according to what is commonly done by high-end storage. * clarify (with conformance statements) the influence of the VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. * strengthen the requirement for persisting writes to MUST after a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the new features). The suggested behavior upon feature negotiation is okay for the Linux implementation of virtio1, even after the implementation is modified to support the two new features. This fixes VIRTIO-144. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- conformance.tex 2 + content.tex 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 111 insertions(+), 21 deletions(-) diff --git a/conformance.tex b/conformance.tex index 29c6ba8..8753cd6 100644 --- a/conformance.tex +++ b/conformance.tex @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements: A block driver MUST conform to the following normative statements: egin{itemize} +item
    ef{drivernormative:Device Types / Block Device / Device Initialization} item
    ef{drivernormative:Device Types / Block Device / Device Operation} end{itemize} @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements: A block device MUST conform to the following normative statements: egin{itemize} +item
    ef{devicenormative:Device Types / Block Device / Device Initialization} item
    ef{devicenormative:Device Types / Block Device / Device Operation} end{itemize} diff --git a/content.tex b/content.tex index 3b12263..e58a0bd 100644 --- a/content.tex +++ b/content.tex @@ -3723,8 +3723,13 @@ device except where noted. item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in field{blk_size}. +item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. + item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O alignment. + +item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback + and writethrough modes. end{description} subsubsection{Legacy Interface: Feature bits}label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} @@ -3733,16 +3738,12 @@ device except where noted. item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. - -item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. - -item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback - and writethrough modes. end{description} -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers -MUST only negotiate this feature if they are capable of sending -VIRTIO_BLK_T_FLUSH commands. +egin{note} + In the legacy interface, VIRTIO_BLK_F_FLUSH was also + called VIRTIO_BLK_F_WCE. +end{note} subsection{Device configuration layout}label{sec:Device Types / Block Device / Device configuration layout} @@ -3771,7 +3772,7 @@ struct virtio_blk_config { // optimal (suggested maximum) I/O size in blocks le32 opt_io_size; } topology; - u8 reserved; + u8 writeback; }; end{lstlisting} @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than field{topology} struct can be read to determine the physical block size and optimal I/O lengths for the driver to use. This also does not affect the units in the protocol, only performance. + +item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache + mode can be read or set through the field{writeback} field. 0 corresponds + to a writethrough cache, 1 to a writeback cachefootnote{Consistent with +
    ef{devicenormative:Device Types / Block Device / Device Operation}, + a writethrough cache can be defined broadly as a cache that commits + writes to non-volatile storage before reporting their completion. + For example, a battery-backed writeback cache actually counts as + writethrough according to this definition.}. The cache mode after + reset is undefined. end{enumerate} +drivernormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} + +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of +sending VIRTIO_BLK_T_FLUSH commands. + +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce +the presence of a writethrough cache. Otherwise, the driver SHOULD +assume presence of a writeback cache. + +devicenormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} + +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it +if they propose VIRTIO_BLK_F_CONFIG_WCE. + +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature +is not, the device MUST set field{writeback} to 0 as soon as the driver +sets the FEATURES_OK status bit. + subsubsection{Legacy Interface: Device Initialization}label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} -The field{reserved} field used to be called field{writeback}. If the -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be -read from field{writeback}; the -driver can also write to the field in order to toggle the cache -between writethrough (0) and writeback (1) mode. If the feature is -not available, the driver can instead look at the result of -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. +Because legacy devices do not have FEATURES_OK, legacy device behavior +differs around feature negotiation; legacy devices never modify +field{writeback} as a result of a driver's write to the status register. +In particular, some legacy drivers wrote to field{writeback} between +setting the DRIVER status bit and setting the DRIVER_OK status bit. +It would be a bug for the device to overwrite field{writeback} when +the DRIVER_OK status bit is set. + +Legacy devices must also never modify field{writeback} as a result of +a driver's write to the guest features registers, for example if the +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not +set the VIRTIO_BLK_F_FLUSH bit. -Some older legacy devices did not operate in writethrough mode even -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY +switch to writethrough or writeback mode by writing respectively 0 and +1 to the field{writeback} field. After writing a 0 to field{writeback}, +the driver MUST NOT assume that any volatile writes have been committed +to non-volatile storage. + devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT write any data. -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure -that any writes which were completed are committed to non-volatile storage. +A write is considered volatile when it is submitted; the contents of +sectors covered by a volatile are undefined until the write becomes stable. +A write becomes stable once it is completed and one or more of the +following conditions is true: + +egin{enumerate} +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was + proposed by the device; + +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the + field{writeback} field in configuration space was 0 extbf{all the time between + the submission of the write and its completion}; + +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is + completed} and is completed itself. +end{enumerate} + +The device MUST ensure that stable writes are committed to +non-volatile storage, before reporting completion of the write +(cases~
    ef{item:flush1} and~
    ef{item:flush2}) or the flush +(case~
    ef{item:flush3}). Failure to do so can cause data loss +in case of a crash. + +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** +If the driver changes field{writeback} between the submission of the write +and its completion, the write can be either volatile or stable when +its completion is reported; the exact behavior is undefined. + +% According to the device requirements for device initialization: +% Propose(CONFIG_WCE) => Propose(FLUSH). +% +% After reversing the implication: +% not Propose(FLUSH) => not Propose(CONFIG_WCE). + +If VIRTIO_BLK_F_FLUSH was not proposed by the + devicefootnote{Note that in this case, according to +
    ef{devicenormative:Device Types / Block Device / Device Initialization}, + the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the +device MAY also commit writes to non-volatile storage before reporting +their completion. Unlike case~
    ef{item:flush1}, however, this is not +an absolute requirement of the specification. + +egin{note} + An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit + completed writes will not be resilient to data loss in case of crashes. + Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement + for implementations that do not wish to be safe against such data losses. +end{note} subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} When using the legacy interface, transitional devices and drivers @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request does that. end{note} +Some older legacy devices did not commit completed writes to non-volatile +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. + If the device has VIRTIO_BLK_F_SCSI feature, it can also support scsi packet command requests, each of these requests is of form: -- 2.4.3


  • 2.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-02-2015 14:21
    This is v3. Paolo On 02/07/2015 16:20, Paolo Bonzini wrote: > VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance > (up to 2x, though more realistically +30-40%) in latency-bound workloads. > However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. > > In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great > idea, because it simplifies simple drivers (e.g. firmware) that are okay > with a writethrough cache but still need data to persist after power loss. > What really should have been removed is just the possibility that devices > not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in > the new world of conformance statements. > > Restore these, with the following changes: > > * clarify and use conformance statements in order to define writeback > and writethrough caching according to what is commonly done by high-end > storage. > > * clarify (with conformance statements) the influence of the > VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of > VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. > > * strengthen the requirement for persisting writes to MUST after > a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the > new features). > > The suggested behavior upon feature negotiation is okay for the Linux > implementation of virtio1, even after the implementation is modified to > support the two new features. > > This fixes VIRTIO-144. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > conformance.tex 2 + > content.tex 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- > 2 files changed, 111 insertions(+), 21 deletions(-) > > diff --git a/conformance.tex b/conformance.tex > index 29c6ba8..8753cd6 100644 > --- a/conformance.tex > +++ b/conformance.tex > @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements: > A block driver MUST conform to the following normative statements: > > egin{itemize} > +item
    ef{drivernormative:Device Types / Block Device / Device Initialization} > item
    ef{drivernormative:Device Types / Block Device / Device Operation} > end{itemize} > > @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements: > A block device MUST conform to the following normative statements: > > egin{itemize} > +item
    ef{devicenormative:Device Types / Block Device / Device Initialization} > item
    ef{devicenormative:Device Types / Block Device / Device Operation} > end{itemize} > > diff --git a/content.tex b/content.tex > index 3b12263..e58a0bd 100644 > --- a/content.tex > +++ b/content.tex > @@ -3723,8 +3723,13 @@ device except where noted. > > item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in field{blk_size}. > > +item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > + > item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O > alignment. > + > +item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > + and writethrough modes. > end{description} > > subsubsection{Legacy Interface: Feature bits}label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} > @@ -3733,16 +3738,12 @@ device except where noted. > item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. > > item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. > - > -item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > - > -item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > - and writethrough modes. > end{description} > > -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers > -MUST only negotiate this feature if they are capable of sending > -VIRTIO_BLK_T_FLUSH commands. > +egin{note} > + In the legacy interface, VIRTIO_BLK_F_FLUSH was also > + called VIRTIO_BLK_F_WCE. > +end{note} > > subsection{Device configuration layout}label{sec:Device Types / Block Device / Device configuration layout} > > @@ -3771,7 +3772,7 @@ struct virtio_blk_config { > // optimal (suggested maximum) I/O size in blocks > le32 opt_io_size; > } topology; > - u8 reserved; > + u8 writeback; > }; > end{lstlisting} > > @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than > field{topology} struct can be read to determine the physical block size and optimal > I/O lengths for the driver to use. This also does not affect the units > in the protocol, only performance. > + > +item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache > + mode can be read or set through the field{writeback} field. 0 corresponds > + to a writethrough cache, 1 to a writeback cachefootnote{Consistent with > +
    ef{devicenormative:Device Types / Block Device / Device Operation}, > + a writethrough cache can be defined broadly as a cache that commits > + writes to non-volatile storage before reporting their completion. > + For example, a battery-backed writeback cache actually counts as > + writethrough according to this definition.}. The cache mode after > + reset is undefined. > end{enumerate} > > +drivernormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} > + > +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of > +sending VIRTIO_BLK_T_FLUSH commands. > + > +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but > +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce > +the presence of a writethrough cache. Otherwise, the driver SHOULD > +assume presence of a writeback cache. > + > +devicenormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} > + > +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it > +if they propose VIRTIO_BLK_F_CONFIG_WCE. > + > +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature > +is not, the device MUST set field{writeback} to 0 as soon as the driver > +sets the FEATURES_OK status bit. > + > subsubsection{Legacy Interface: Device Initialization}label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} > > -The field{reserved} field used to be called field{writeback}. If the > -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be > -read from field{writeback}; the > -driver can also write to the field in order to toggle the cache > -between writethrough (0) and writeback (1) mode. If the feature is > -not available, the driver can instead look at the result of > -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode > -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. > +Because legacy devices do not have FEATURES_OK, legacy device behavior > +differs around feature negotiation; legacy devices never modify > +field{writeback} as a result of a driver's write to the status register. > +In particular, some legacy drivers wrote to field{writeback} between > +setting the DRIVER status bit and setting the DRIVER_OK status bit. > +It would be a bug for the device to overwrite field{writeback} when > +the DRIVER_OK status bit is set. > + > +Legacy devices must also never modify field{writeback} as a result of > +a driver's write to the guest features registers, for example if the > +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not > +set the VIRTIO_BLK_F_FLUSH bit. > > -Some older legacy devices did not operate in writethrough mode even > -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. > > subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} > > @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. > A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. > A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. > > +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY > +switch to writethrough or writeback mode by writing respectively 0 and > +1 to the field{writeback} field. After writing a 0 to field{writeback}, > +the driver MUST NOT assume that any volatile writes have been committed > +to non-volatile storage. > + > devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} > > A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR > for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT > write any data. > > -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure > -that any writes which were completed are committed to non-volatile storage. > +A write is considered volatile when it is submitted; the contents of > +sectors covered by a volatile are undefined until the write becomes stable. > +A write becomes stable once it is completed and one or more of the > +following conditions is true: > + > +egin{enumerate} > +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor > + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was > + proposed by the device; > + > +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the > + field{writeback} field in configuration space was 0 extbf{all the time between > + the submission of the write and its completion}; > + > +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is > + completed} and is completed itself. > +end{enumerate} > + > +The device MUST ensure that stable writes are committed to > +non-volatile storage, before reporting completion of the write > +(cases~
    ef{item:flush1} and~
    ef{item:flush2}) or the flush > +(case~
    ef{item:flush3}). Failure to do so can cause data loss > +in case of a crash. > + > +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** > +If the driver changes field{writeback} between the submission of the write > +and its completion, the write can be either volatile or stable when > +its completion is reported; the exact behavior is undefined. > + > +% According to the device requirements for device initialization: > +% Propose(CONFIG_WCE) => Propose(FLUSH). > +% > +% After reversing the implication: > +% not Propose(FLUSH) => not Propose(CONFIG_WCE). > + > +If VIRTIO_BLK_F_FLUSH was not proposed by the > + devicefootnote{Note that in this case, according to > +
    ef{devicenormative:Device Types / Block Device / Device Initialization}, > + the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the > +device MAY also commit writes to non-volatile storage before reporting > +their completion. Unlike case~
    ef{item:flush1}, however, this is not > +an absolute requirement of the specification. > + > +egin{note} > + An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit > + completed writes will not be resilient to data loss in case of crashes. > + Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement > + for implementations that do not wish to be safe against such data losses. > +end{note} > > subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} > When using the legacy interface, transitional devices and drivers > @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request > does that. > end{note} > > +Some older legacy devices did not commit completed writes to non-volatile > +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. > + > If the device has VIRTIO_BLK_F_SCSI feature, it can also support > scsi packet command requests, each of these requests is of form: > >


  • 3.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-02-2015 14:41
    On Thu, Jul 02, 2015 at 04:20:22PM +0200, Paolo Bonzini wrote: > VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance > (up to 2x, though more realistically +30-40%) in latency-bound workloads. > However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. > > In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great > idea, because it simplifies simple drivers (e.g. firmware) that are okay > with a writethrough cache but still need data to persist after power loss. > What really should have been removed is just the possibility that devices > not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in > the new world of conformance statements. > > Restore these, with the following changes: > > * clarify and use conformance statements in order to define writeback > and writethrough caching according to what is commonly done by high-end > storage. > > * clarify (with conformance statements) the influence of the > VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of > VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. > > * strengthen the requirement for persisting writes to MUST after > a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the > new features). > > The suggested behavior upon feature negotiation is okay for the Linux > implementation of virtio1, even after the implementation is modified to > support the two new features. > > This fixes VIRTIO-144. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > conformance.tex 2 + > content.tex 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- > 2 files changed, 111 insertions(+), 21 deletions(-) Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Attachment: pgp64quXZ4MMS.pgp Description: PGP signature


  • 4.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 09:31
    On Thu, Jul 02, 2015 at 04:20:22PM +0200, Paolo Bonzini wrote: > VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance > (up to 2x, though more realistically +30-40%) in latency-bound workloads. > However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. > > In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great > idea, because it simplifies simple drivers (e.g. firmware) that are okay > with a writethrough cache but still need data to persist after power loss. > What really should have been removed is just the possibility that devices > not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in > the new world of conformance statements. > > Restore these, with the following changes: > > * clarify and use conformance statements in order to define writeback > and writethrough caching according to what is commonly done by high-end > storage. > > * clarify (with conformance statements) the influence of the > VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of > VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. > > * strengthen the requirement for persisting writes to MUST after > a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the > new features). > The suggested behavior upon feature negotiation is okay for the Linux > implementation of virtio1, even after the implementation is modified to > support the two new features. > > This fixes VIRTIO-144. > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > conformance.tex 2 + > content.tex 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- > 2 files changed, 111 insertions(+), 21 deletions(-) > > diff --git a/conformance.tex b/conformance.tex > index 29c6ba8..8753cd6 100644 > --- a/conformance.tex > +++ b/conformance.tex > @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements: > A block driver MUST conform to the following normative statements: > > egin{itemize} > +item
    ef{drivernormative:Device Types / Block Device / Device Initialization} > item
    ef{drivernormative:Device Types / Block Device / Device Operation} > end{itemize} > > @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements: > A block device MUST conform to the following normative statements: > > egin{itemize} > +item
    ef{devicenormative:Device Types / Block Device / Device Initialization} > item
    ef{devicenormative:Device Types / Block Device / Device Operation} > end{itemize} > > diff --git a/content.tex b/content.tex > index 3b12263..e58a0bd 100644 > --- a/content.tex > +++ b/content.tex > @@ -3723,8 +3723,13 @@ device except where noted. > > item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in field{blk_size}. > > +item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > + > item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O > alignment. > + > +item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > + and writethrough modes. > end{description} > > subsubsection{Legacy Interface: Feature bits}label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} > @@ -3733,16 +3738,12 @@ device except where noted. > item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. > > item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. > - > -item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > - > -item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > - and writethrough modes. > end{description} > > -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers > -MUST only negotiate this feature if they are capable of sending > -VIRTIO_BLK_T_FLUSH commands. > +egin{note} > + In the legacy interface, VIRTIO_BLK_F_FLUSH was also > + called VIRTIO_BLK_F_WCE. > +end{note} > > subsection{Device configuration layout}label{sec:Device Types / Block Device / Device configuration layout} > > @@ -3771,7 +3772,7 @@ struct virtio_blk_config { > // optimal (suggested maximum) I/O size in blocks > le32 opt_io_size; > } topology; > - u8 reserved; > + u8 writeback; > }; > end{lstlisting} > > @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than > field{topology} struct can be read to determine the physical block size and optimal > I/O lengths for the driver to use. This also does not affect the units > in the protocol, only performance. > + > +item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache > + mode can be read or set through the field{writeback} field. 0 corresponds > + to a writethrough cache, 1 to a writeback cachefootnote{Consistent with > +
    ef{devicenormative:Device Types / Block Device / Device Operation}, > + a writethrough cache can be defined broadly as a cache that commits > + writes to non-volatile storage before reporting their completion. > + For example, a battery-backed writeback cache actually counts as > + writethrough according to this definition.}. The cache mode after > + reset is undefined. > end{enumerate} > > +drivernormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} > + > +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of > +sending VIRTIO_BLK_T_FLUSH commands. > + > +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but > +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce > +the presence of a writethrough cache. Otherwise, the driver SHOULD > +assume presence of a writeback cache. > + > +devicenormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} > + > +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it > +if they propose VIRTIO_BLK_F_CONFIG_WCE. s/propose/offer/ > + > +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature > +is not, the device MUST set field{writeback} to 0 as soon as the driver > +sets the FEATURES_OK status bit. So this field is dual use, sometimes it's written by driver, sometimes - by a device. A bit messy I think, will cause races or at least confusion. What is this in aid of? If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes. Why also set writeback to 0? > + > subsubsection{Legacy Interface: Device Initialization}label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} > > -The field{reserved} field used to be called field{writeback}. If the > -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be > -read from field{writeback}; the > -driver can also write to the field in order to toggle the cache > -between writethrough (0) and writeback (1) mode. If the feature is > -not available, the driver can instead look at the result of > -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode > -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. > +Because legacy devices do not have FEATURES_OK, legacy device behavior > +differs around feature negotiation; legacy devices never modify > +field{writeback} as a result of a driver's write to the status register. > +In particular, some legacy drivers wrote to field{writeback} between > +setting the DRIVER status bit and setting the DRIVER_OK status bit. > +It would be a bug for the device to overwrite field{writeback} when > +the DRIVER_OK status bit is set. Can this be a MUST NOT statement? And when is it ok to write then? > + > +Legacy devices must also never modify field{writeback} as a result of We can not require anything from legacy devices. They are out there. > +a driver's write to the guest features registers, for example if the > +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not > +set the VIRTIO_BLK_F_FLUSH bit. Is it ever ok for device to modify the field? > > -Some older legacy devices did not operate in writethrough mode even > -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. > > subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} > > @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. > A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. > A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. > > +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY > +switch to writethrough or writeback mode by writing respectively 0 and > +1 to the field{writeback} field. After writing a 0 to field{writeback}, > +the driver MUST NOT assume that any volatile writes have been committed > +to non-volatile storage. I think volatile here clashes with its use for buffers. How about we say e.g. "permanent device backend storage" here? Same below. > + > devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} > > A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR > for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT > write any data. > > -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure > -that any writes which were completed are committed to non-volatile storage. > +A write is considered volatile when it is submitted; the contents of > +sectors covered by a volatile are undefined until the write becomes stable. > +A write becomes stable once it is completed and one or more of the > +following conditions is true: > + > +egin{enumerate} > +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor > + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was > + proposed by the device; > + > +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the > + field{writeback} field in configuration space was 0 extbf{all the time between > + the submission of the write and its completion}; > + > +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is > + completed} and is completed itself. > +end{enumerate} > + > +The device MUST ensure that stable writes are committed to > +non-volatile storage, So in particular at the moment QEMU violates this requirement. How would it avoid violating it? E.g. I can use a file in tmpfs as a backend. Why do drivers have to know? I think it's useful to pretend to guest data is stored, and then cheat. In case of tmpfs flushes are cheap, so I don't see what do we gain by requiring that drivers skip it with such force. How about we say something like "permanent device backend storage" or even "device backend storage"? > before reporting completion of the write > +(cases~
    ef{item:flush1} and~
    ef{item:flush2}) or the flush > +(case~
    ef{item:flush3}). Failure to do so can cause data loss > +in case of a crash. > + > +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** > +If the driver changes field{writeback} between the submission of the write > +and its completion, the write can be either volatile or stable when > +its completion is reported; the exact behavior is undefined. So MUST then? > + > +% According to the device requirements for device initialization: > +% Propose(CONFIG_WCE) => Propose(FLUSH). > +% > +% After reversing the implication: > +% not Propose(FLUSH) => not Propose(CONFIG_WCE). > + > +If VIRTIO_BLK_F_FLUSH was not proposed by the > + devicefootnote{Note that in this case, according to > +
    ef{devicenormative:Device Types / Block Device / Device Initialization}, > + the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the > +device MAY also commit writes to non-volatile storage before reporting > +their completion. Unlike case~
    ef{item:flush1}, however, this is not > +an absolute requirement of the specification. Pls drop "absolute", let's use specific terms from an RFC. Maybe this one is a SHOULD? > + > +egin{note} > + An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit > + completed writes will not be resilient to data loss in case of crashes. > + Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement > + for implementations that do not wish to be safe against such data losses. I think this note confuses more than it clarifies. It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose your data on crash". That requirement is just too strong, devices often have no way to know, and guests often don't care. Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization? > +end{note} > > subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} > When using the legacy interface, transitional devices and drivers > @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request > does that. > end{note} > > +Some older legacy devices did not commit completed writes to non-volatile > +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. > + Anything transitional drivers can do it work around that? If no, this doesn't seem like a useful bit of info to me. > If the device has VIRTIO_BLK_F_SCSI feature, it can also support > scsi packet command requests, each of these requests is of form: > > -- > 2.4.3 > > > --------------------------------------------------------------------- > To unsubscribe from this mail list, you must leave the OASIS TC that > generates this mail. Follow this link to all your TCs in OASIS at: > https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php


  • 5.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 09:58
    On 08/07/2015 11:30, Michael S. Tsirkin wrote: > On Thu, Jul 02, 2015 at 04:20:22PM +0200, Paolo Bonzini wrote: >> VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance >> (up to 2x, though more realistically +30-40%) in latency-bound workloads. >> However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. >> >> In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great >> idea, because it simplifies simple drivers (e.g. firmware) that are okay >> with a writethrough cache but still need data to persist after power loss. >> What really should have been removed is just the possibility that devices >> not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in >> the new world of conformance statements. >> >> Restore these, with the following changes: >> >> * clarify and use conformance statements in order to define writeback >> and writethrough caching according to what is commonly done by high-end >> storage. >> >> * clarify (with conformance statements) the influence of the >> VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of >> VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. >> >> * strengthen the requirement for persisting writes to MUST after >> a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the >> new features). >> The suggested behavior upon feature negotiation is okay for the Linux >> implementation of virtio1, even after the implementation is modified to >> support the two new features. >> >> This fixes VIRTIO-144. >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> conformance.tex 2 + >> content.tex 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- >> 2 files changed, 111 insertions(+), 21 deletions(-) >> >> diff --git a/conformance.tex b/conformance.tex >> index 29c6ba8..8753cd6 100644 >> --- a/conformance.tex >> +++ b/conformance.tex >> @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements: >> A block driver MUST conform to the following normative statements: >> >> egin{itemize} >> +item
    ef{drivernormative:Device Types / Block Device / Device Initialization} >> item
    ef{drivernormative:Device Types / Block Device / Device Operation} >> end{itemize} >> >> @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements: >> A block device MUST conform to the following normative statements: >> >> egin{itemize} >> +item
    ef{devicenormative:Device Types / Block Device / Device Initialization} >> item
    ef{devicenormative:Device Types / Block Device / Device Operation} >> end{itemize} >> >> diff --git a/content.tex b/content.tex >> index 3b12263..e58a0bd 100644 >> --- a/content.tex >> +++ b/content.tex >> @@ -3723,8 +3723,13 @@ device except where noted. >> >> item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in field{blk_size}. >> >> +item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. >> + >> item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O >> alignment. >> + >> +item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback >> + and writethrough modes. >> end{description} >> >> subsubsection{Legacy Interface: Feature bits}label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} >> @@ -3733,16 +3738,12 @@ device except where noted. >> item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. >> >> item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. >> - >> -item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. >> - >> -item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback >> - and writethrough modes. >> end{description} >> >> -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers >> -MUST only negotiate this feature if they are capable of sending >> -VIRTIO_BLK_T_FLUSH commands. >> +egin{note} >> + In the legacy interface, VIRTIO_BLK_F_FLUSH was also >> + called VIRTIO_BLK_F_WCE. >> +end{note} >> >> subsection{Device configuration layout}label{sec:Device Types / Block Device / Device configuration layout} >> >> @@ -3771,7 +3772,7 @@ struct virtio_blk_config { >> // optimal (suggested maximum) I/O size in blocks >> le32 opt_io_size; >> } topology; >> - u8 reserved; >> + u8 writeback; >> }; >> end{lstlisting} >> >> @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than >> field{topology} struct can be read to determine the physical block size and optimal >> I/O lengths for the driver to use. This also does not affect the units >> in the protocol, only performance. >> + >> +item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache >> + mode can be read or set through the field{writeback} field. 0 corresponds >> + to a writethrough cache, 1 to a writeback cachefootnote{Consistent with >> +
    ef{devicenormative:Device Types / Block Device / Device Operation}, >> + a writethrough cache can be defined broadly as a cache that commits >> + writes to non-volatile storage before reporting their completion. >> + For example, a battery-backed writeback cache actually counts as >> + writethrough according to this definition.}. The cache mode after >> + reset is undefined. >> end{enumerate} >> >> +drivernormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} >> + >> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of >> +sending VIRTIO_BLK_T_FLUSH commands. >> + >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but >> +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce >> +the presence of a writethrough cache. Otherwise, the driver SHOULD >> +assume presence of a writeback cache. >> + >> +devicenormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} >> + >> +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it >> +if they propose VIRTIO_BLK_F_CONFIG_WCE. > > s/propose/offer/ Okay. >> + >> +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature >> +is not, the device MUST set field{writeback} to 0 as soon as the driver >> +sets the FEATURES_OK status bit. > > So this field is dual use, sometimes it's written by driver, sometimes - > by a device. A bit messy I think, will cause races or at least > confusion. It doesn't cause races, because the device can set it before returning from the instruction that wrote FEATURES_OK. > What is this in aid of? > > If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes. > > Why also set writeback to 0? So that it's consistent with the status of the device. The driver may set it to 1 to prevent the device from committing the writes. >> + >> subsubsection{Legacy Interface: Device Initialization}label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} >> >> -The field{reserved} field used to be called field{writeback}. If the >> -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be >> -read from field{writeback}; the >> -driver can also write to the field in order to toggle the cache >> -between writethrough (0) and writeback (1) mode. If the feature is >> -not available, the driver can instead look at the result of >> -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode >> -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. >> +Because legacy devices do not have FEATURES_OK, legacy device behavior >> +differs around feature negotiation; legacy devices never modify >> +field{writeback} as a result of a driver's write to the status register. >> +In particular, some legacy drivers wrote to field{writeback} between >> +setting the DRIVER status bit and setting the DRIVER_OK status bit. >> +It would be a bug for the device to overwrite field{writeback} when >> +the DRIVER_OK status bit is set. > > Can this be a MUST NOT statement? > And when is it ok to write then? For legacy, never. For modern, when you get FEATURES_OK. >> + >> +Legacy devices must also never modify field{writeback} as a result of > > We can not require anything from legacy devices. They are out there. They didn't as far as I know, and if you want to implement both modern and legacy this is a deviation from the requirements of modern devices. >> +a driver's write to the guest features registers, for example if the >> +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not >> +set the VIRTIO_BLK_F_FLUSH bit. > > Is it ever ok for device to modify the field? Yes, see above regarding FEATURES_OK. That's the only case. >> >> -Some older legacy devices did not operate in writethrough mode even >> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. >> >> subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} >> >> @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. >> A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. >> A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. >> >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY >> +switch to writethrough or writeback mode by writing respectively 0 and >> +1 to the field{writeback} field. After writing a 0 to field{writeback}, >> +the driver MUST NOT assume that any volatile writes have been committed >> +to non-volatile storage. > > I think volatile here clashes with its use for buffers. > How about we say e.g. "permanent device backend storage" here? > Same below. Okay, or even just s/non-volatile/permanent/. >> + >> devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} >> >> A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR >> for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT >> write any data. >> >> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure >> -that any writes which were completed are committed to non-volatile storage. >> +A write is considered volatile when it is submitted; the contents of >> +sectors covered by a volatile are undefined until the write becomes stable. >> +A write becomes stable once it is completed and one or more of the >> +following conditions is true: >> + >> +egin{enumerate} >> +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor >> + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was >> + proposed by the device; >> + >> +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the >> + field{writeback} field in configuration space was 0 extbf{all the time between >> + the submission of the write and its completion}; >> + >> +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is >> + completed} and is completed itself. >> +end{enumerate} >> + >> +The device MUST ensure that stable writes are committed to >> +non-volatile storage, > > So in particular at the moment QEMU violates this requirement. It doesn't. Why do you say it does? > How would it avoid violating it? E.g. I can use a file in tmpfs as a backend. > Why do drivers have to know? > I think it's useful to pretend to guest data is stored, and then cheat. > In case of tmpfs flushes are cheap, so I don't see what do we > gain by requiring that drivers skip it with such force. > > How about we say something like "permanent device backend storage" > or even "device backend storage"? > >> before reporting completion of the write >> +(cases~
    ef{item:flush1} and~
    ef{item:flush2}) or the flush >> +(case~
    ef{item:flush3}). Failure to do so can cause data loss >> +in case of a crash. >> + >> +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** >> +If the driver changes field{writeback} between the submission of the write >> +and its completion, the write can be either volatile or stable when >> +its completion is reported; the exact behavior is undefined. > > So MUST then? No, definitely not MUST. "The write MUST be either X or not X" is a tautology. "The write MAY be either X or not X" also does not make sense, because there's no way for an implementation to do neither X nor not X. Tertium non datur. It's describing a case where the behavior is undefined. The RFC does not include a case for this, so I used "can". >> + >> +% According to the device requirements for device initialization: >> +% Propose(CONFIG_WCE) => Propose(FLUSH). >> +% >> +% After reversing the implication: >> +% not Propose(FLUSH) => not Propose(CONFIG_WCE). >> + >> +If VIRTIO_BLK_F_FLUSH was not proposed by the >> + devicefootnote{Note that in this case, according to >> +
    ef{devicenormative:Device Types / Block Device / Device Initialization}, >> + the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the >> +device MAY also commit writes to non-volatile storage before reporting >> +their completion. Unlike case~
    ef{item:flush1}, however, this is not >> +an absolute requirement of the specification. > > Pls drop "absolute", let's use specific terms from an RFC. "absolute requirement" comes from RFC2119. It means "it's not a MUST", except it's valid English. > Maybe this one is a SHOULD? No, it's definitely a MAY. "One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item" is a perfect description of this. Also, the driver can recognize the situation (and though it cannot do anything about it, it can log something). >> + >> +egin{note} >> + An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit >> + completed writes will not be resilient to data loss in case of crashes. >> + Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement >> + for implementations that do not wish to be safe against such data losses. > > I think this note confuses more than it clarifies. > > It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose > your data on crash". That requirement is just too strong, devices often have > no way to know, and guests often don't care. I don't think it's too strong. Of course it is more like "I will not lose your data on crash, unless the user screwed up". > Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization? > > >> +end{note} >> >> subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} >> When using the legacy interface, transitional devices and drivers >> @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request >> does that. >> end{note} >> >> +Some older legacy devices did not commit completed writes to non-volatile >> +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. >> + > > Anything transitional drivers can do it work around that? Always accept VIRTIO_BLK_F_FLUSH. > If no, this doesn't seem like a useful bit of info to me. I can remove it, but it was in the existing document and there is a workaround. Paolo


  • 6.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 10:24
    On Wed, Jul 08, 2015 at 11:58:05AM +0200, Paolo Bonzini wrote: > On 08/07/2015 11:30, Michael S. Tsirkin wrote: > > On Thu, Jul 02, 2015 at 04:20:22PM +0200, Paolo Bonzini wrote: > >> VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance > >> (up to 2x, though more realistically +30-40%) in latency-bound workloads. > >> However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. > >> > >> In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great > >> idea, because it simplifies simple drivers (e.g. firmware) that are okay > >> with a writethrough cache but still need data to persist after power loss. > >> What really should have been removed is just the possibility that devices > >> not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in > >> the new world of conformance statements. > >> > >> Restore these, with the following changes: > >> > >> * clarify and use conformance statements in order to define writeback > >> and writethrough caching according to what is commonly done by high-end > >> storage. > >> > >> * clarify (with conformance statements) the influence of the > >> VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of > >> VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. > >> > >> * strengthen the requirement for persisting writes to MUST after > >> a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the > >> new features). > >> The suggested behavior upon feature negotiation is okay for the Linux > >> implementation of virtio1, even after the implementation is modified to > >> support the two new features. > >> > >> This fixes VIRTIO-144. > >> > >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > >> --- > >> conformance.tex 2 + > >> content.tex 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- > >> 2 files changed, 111 insertions(+), 21 deletions(-) > >> > >> diff --git a/conformance.tex b/conformance.tex > >> index 29c6ba8..8753cd6 100644 > >> --- a/conformance.tex > >> +++ b/conformance.tex > >> @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements: > >> A block driver MUST conform to the following normative statements: > >> > >> egin{itemize} > >> +item
    ef{drivernormative:Device Types / Block Device / Device Initialization} > >> item
    ef{drivernormative:Device Types / Block Device / Device Operation} > >> end{itemize} > >> > >> @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements: > >> A block device MUST conform to the following normative statements: > >> > >> egin{itemize} > >> +item
    ef{devicenormative:Device Types / Block Device / Device Initialization} > >> item
    ef{devicenormative:Device Types / Block Device / Device Operation} > >> end{itemize} > >> > >> diff --git a/content.tex b/content.tex > >> index 3b12263..e58a0bd 100644 > >> --- a/content.tex > >> +++ b/content.tex > >> @@ -3723,8 +3723,13 @@ device except where noted. > >> > >> item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in field{blk_size}. > >> > >> +item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > >> + > >> item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O > >> alignment. > >> + > >> +item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > >> + and writethrough modes. > >> end{description} > >> > >> subsubsection{Legacy Interface: Feature bits}label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} > >> @@ -3733,16 +3738,12 @@ device except where noted. > >> item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. > >> > >> item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. > >> - > >> -item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. > >> - > >> -item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback > >> - and writethrough modes. > >> end{description} > >> > >> -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers > >> -MUST only negotiate this feature if they are capable of sending > >> -VIRTIO_BLK_T_FLUSH commands. > >> +egin{note} > >> + In the legacy interface, VIRTIO_BLK_F_FLUSH was also > >> + called VIRTIO_BLK_F_WCE. > >> +end{note} > >> > >> subsection{Device configuration layout}label{sec:Device Types / Block Device / Device configuration layout} > >> > >> @@ -3771,7 +3772,7 @@ struct virtio_blk_config { > >> // optimal (suggested maximum) I/O size in blocks > >> le32 opt_io_size; > >> } topology; > >> - u8 reserved; > >> + u8 writeback; > >> }; > >> end{lstlisting} > >> > >> @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than > >> field{topology} struct can be read to determine the physical block size and optimal > >> I/O lengths for the driver to use. This also does not affect the units > >> in the protocol, only performance. > >> + > >> +item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache > >> + mode can be read or set through the field{writeback} field. 0 corresponds > >> + to a writethrough cache, 1 to a writeback cachefootnote{Consistent with > >> +
    ef{devicenormative:Device Types / Block Device / Device Operation}, > >> + a writethrough cache can be defined broadly as a cache that commits > >> + writes to non-volatile storage before reporting their completion. > >> + For example, a battery-backed writeback cache actually counts as > >> + writethrough according to this definition.}. The cache mode after > >> + reset is undefined. > >> end{enumerate} > >> > >> +drivernormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} > >> + > >> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of > >> +sending VIRTIO_BLK_T_FLUSH commands. > >> + > >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but > >> +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce > >> +the presence of a writethrough cache. Otherwise, the driver SHOULD > >> +assume presence of a writeback cache. > >> + > >> +devicenormative{subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} > >> + > >> +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it > >> +if they propose VIRTIO_BLK_F_CONFIG_WCE. > > > > s/propose/offer/ > > Okay. > > >> + > >> +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature > >> +is not, the device MUST set field{writeback} to 0 as soon as the driver > >> +sets the FEATURES_OK status bit. > > > > So this field is dual use, sometimes it's written by driver, sometimes - > > by a device. A bit messy I think, will cause races or at least > > confusion. > > It doesn't cause races, because the device can set it before returning > from the instruction that wrote FEATURES_OK. SMP guests exist :) > > What is this in aid of? > > > > If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes. > > > > Why also set writeback to 0? > > So that it's consistent with the status of the device. The driver may > set it to 1 to prevent the device from committing the writes. Fields should have a single owner. RO fields can change, RW ones can't. This one is writeable, let's ask driver to set it consistently. > >> + > >> subsubsection{Legacy Interface: Device Initialization}label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} > >> > >> -The field{reserved} field used to be called field{writeback}. If the > >> -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be > >> -read from field{writeback}; the > >> -driver can also write to the field in order to toggle the cache > >> -between writethrough (0) and writeback (1) mode. If the feature is > >> -not available, the driver can instead look at the result of > >> -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode > >> -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. > >> +Because legacy devices do not have FEATURES_OK, legacy device behavior > >> +differs around feature negotiation; legacy devices never modify > >> +field{writeback} as a result of a driver's write to the status register. > >> +In particular, some legacy drivers wrote to field{writeback} between > >> +setting the DRIVER status bit and setting the DRIVER_OK status bit. > >> +It would be a bug for the device to overwrite field{writeback} when > >> +the DRIVER_OK status bit is set. > > > > Can this be a MUST NOT statement? > > And when is it ok to write then? > > For legacy, never. For modern, when you get FEATURES_OK. Spec should say this then :) > >> + > >> +Legacy devices must also never modify field{writeback} as a result of > > > > We can not require anything from legacy devices. They are out there. > > They didn't as far as I know, and if you want to implement both modern > and legacy this is a deviation from the requirements of modern devices. We don't tell people how to implement legacy devices. We support transitional devices with legacy interfaces. Is this what you are trying to say? If yes must is MUST, etc. > >> +a driver's write to the guest features registers, for example if the > >> +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not > >> +set the VIRTIO_BLK_F_FLUSH bit. > > > > Is it ever ok for device to modify the field? > > Yes, see above regarding FEATURES_OK. That's the only case. So let's just say so :) But I'd rather just tell driver to do it. Keeps ownership in one place. > >> > >> -Some older legacy devices did not operate in writethrough mode even > >> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. > >> > >> subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} > >> > >> @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. > >> A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. > >> A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. > >> > >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY > >> +switch to writethrough or writeback mode by writing respectively 0 and > >> +1 to the field{writeback} field. After writing a 0 to field{writeback}, > >> +the driver MUST NOT assume that any volatile writes have been committed > >> +to non-volatile storage. > > > > I think volatile here clashes with its use for buffers. > > How about we say e.g. "permanent device backend storage" here? > > Same below. > > Okay, or even just s/non-volatile/permanent/. I added "device backend" to stress this depends on device. It's not absoutely permanent - as permanent as device can make it. > >> + > >> devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} > >> > >> A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR > >> for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT > >> write any data. > >> > >> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure > >> -that any writes which were completed are committed to non-volatile storage. > >> +A write is considered volatile when it is submitted; the contents of > >> +sectors covered by a volatile are undefined until the write becomes stable. > >> +A write becomes stable once it is completed and one or more of the > >> +following conditions is true: > >> + > >> +egin{enumerate} > >> +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor > >> + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was > >> + proposed by the device; > >> + > >> +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the > >> + field{writeback} field in configuration space was 0 extbf{all the time between > >> + the submission of the write and its completion}; > >> + > >> +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is > >> + completed} and is completed itself. > >> +end{enumerate} > >> + > >> +The device MUST ensure that stable writes are committed to > >> +non-volatile storage, > > > > So in particular at the moment QEMU violates this requirement. > > It doesn't. Why do you say it does? Below 2 paragraphs tell you why :) Because it does not always *have* non-volatile storage. > > How would it avoid violating it? E.g. I can use a file in tmpfs as a backend. > > Why do drivers have to know? > > I think it's useful to pretend to guest data is stored, and then cheat. > > In case of tmpfs flushes are cheap, so I don't see what do we > > gain by requiring that drivers skip it with such force. > > > > How about we say something like "permanent device backend storage" > > or even "device backend storage"? > > > >> before reporting completion of the write > >> +(cases~
    ef{item:flush1} and~
    ef{item:flush2}) or the flush > >> +(case~
    ef{item:flush3}). Failure to do so can cause data loss > >> +in case of a crash. > >> + > >> +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** > >> +If the driver changes field{writeback} between the submission of the write > >> +and its completion, the write can be either volatile or stable when > >> +its completion is reported; the exact behavior is undefined. > > > > So MUST then? > > No, definitely not MUST. "The write MUST be either X or not X" is a > tautology. "The write MAY be either X or not X" also does not make > sense, because there's no way for an implementation to do neither X nor > not X. Tertium non datur. Missed this fact. As would other readers :) I would s/can be/is/. And drop the exact behavior is undefined - this is just clarifying the definition, not saying anything about the behaviour. Maybe prefix with "in other words". > It's describing a case where the behavior is undefined. The RFC does > not include a case for this, so I used "can". So why describe it? to clarify the terms, right? let's just say so. > >> + > >> +% According to the device requirements for device initialization: > >> +% Propose(CONFIG_WCE) => Propose(FLUSH). > >> +% > >> +% After reversing the implication: > >> +% not Propose(FLUSH) => not Propose(CONFIG_WCE). > >> + > >> +If VIRTIO_BLK_F_FLUSH was not proposed by the > >> + devicefootnote{Note that in this case, according to > >> +
    ef{devicenormative:Device Types / Block Device / Device Initialization}, > >> + the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the > >> +device MAY also commit writes to non-volatile storage before reporting > >> +their completion. Unlike case~
    ef{item:flush1}, however, this is not > >> +an absolute requirement of the specification. > > > > Pls drop "absolute", let's use specific terms from an RFC. > > "absolute requirement" comes from RFC2119. It means "it's not a MUST", > except it's valid English. > > > Maybe this one is a SHOULD? > > No, it's definitely a MAY. "One vendor may choose to include the item > because a particular marketplace requires it or because the vendor feels > that it enhances the product while another vendor may omit the same > item" is a perfect description of this. Also, the driver can recognize > the situation (and though it cannot do anything about it, it can log > something). But SHOULD is an even better match: SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course. implications such as data loss should definitely be well understood. MAY just means "ok not to do". > >> + > >> +egin{note} > >> + An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit > >> + completed writes will not be resilient to data loss in case of crashes. > >> + Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement > >> + for implementations that do not wish to be safe against such data losses. > > > > I think this note confuses more than it clarifies. > > > > It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose > > your data on crash". That requirement is just too strong, devices often have > > no way to know, and guests often don't care. > > I don't think it's too strong. Of course it is more like "I will not > lose your data on crash, unless the user screwed up". As I explained above, this blocks using tmpfs for storage - that loses data on a crash. The gain is minimal - at best a log message in the guest. It's not a confirmance statement, just a note, it should clarify something. If instead it only confuses, as this one does, should not be there. > > Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization? > > > > > >> +end{note} > >> > >> subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} > >> When using the legacy interface, transitional devices and drivers > >> @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request > >> does that. > >> end{note} > >> > >> +Some older legacy devices did not commit completed writes to non-volatile > >> +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. > >> + > > > > Anything transitional drivers can do it work around that? > > Always accept VIRTIO_BLK_F_FLUSH. I thought they also did this on flush. So simply "when", not "even". Maybe suggest what transitional drivers should do. > > > If no, this doesn't seem like a useful bit of info to me. > > I can remove it, but it was in the existing document and there is a > workaround. > > Paolo That one talked about using writethrough mode. So it was clear isn't unrelated to flush. -- MST


  • 7.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 10:37
    On 08/07/2015 12:24, Michael S. Tsirkin wrote: >>>> +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature >>>> +is not, the device MUST set field{writeback} to 0 as soon as the driver >>>> +sets the FEATURES_OK status bit. >>> >>> So this field is dual use, sometimes it's written by driver, sometimes - >>> by a device. A bit messy I think, will cause races or at least >>> confusion. >> >> It doesn't cause races, because the device can set it before returning >> from the instruction that wrote FEATURES_OK. > > SMP guests exist :) See below. >>> What is this in aid of? >>> >>> If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes. >>> >>> Why also set writeback to 0? >> >> So that it's consistent with the status of the device. The driver may >> set it to 1 to prevent the device from committing the writes. > > Fields should have a single owner. RO fields can change, RW ones can't. > This one is writeable, let's ask driver to set it consistently. Until FEATURES_OK is set and the guest has checked that the FEATURES_OK bit was not reset by the device, the driver does not even _know that the field exists_. So it is a non problem. Until FEATURES_OK is set, all fields are owned by the device. Doing as you suggest would make things a lot more complicated for legacy devices. >>>> +Legacy devices must also never modify field{writeback} as a result of >>> >>> We can not require anything from legacy devices. They are out there. >> >> They didn't as far as I know, and if you want to implement both modern >> and legacy this is a deviation from the requirements of modern devices. > > We don't tell people how to implement legacy devices. We support transitional devices > with legacy interfaces. Is this what you are trying to say? > If yes must is MUST, etc. Okay, will fix. >>>> -Some older legacy devices did not operate in writethrough mode even >>>> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. >>>> >>>> subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} >>>> >>>> @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. >>>> A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. >>>> A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. >>>> >>>> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY >>>> +switch to writethrough or writeback mode by writing respectively 0 and >>>> +1 to the field{writeback} field. After writing a 0 to field{writeback}, >>>> +the driver MUST NOT assume that any volatile writes have been committed >>>> +to non-volatile storage. >>> >>> I think volatile here clashes with its use for buffers. >>> How about we say e.g. "permanent device backend storage" here? >>> Same below. >> >> Okay, or even just s/non-volatile/permanent/. > > I added "device backend" to stress this depends on device. > It's not absoutely permanent - as permanent as device can make it. Okay. >>>> + >>>> devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} >>>> >>>> A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR >>>> for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT >>>> write any data. >>>> >>>> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure >>>> -that any writes which were completed are committed to non-volatile storage. >>>> +A write is considered volatile when it is submitted; the contents of >>>> +sectors covered by a volatile are undefined until the write becomes stable. >>>> +A write becomes stable once it is completed and one or more of the >>>> +following conditions is true: >>>> + >>>> +egin{enumerate} >>>> +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor >>>> + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was >>>> + proposed by the device; >>>> + >>>> +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the >>>> + field{writeback} field in configuration space was 0 extbf{all the time between >>>> + the submission of the write and its completion}; >>>> + >>>> +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is >>>> + completed} and is completed itself. >>>> +end{enumerate} >>>> + >>>> +The device MUST ensure that stable writes are committed to >>>> +non-volatile storage, >>> >>> So in particular at the moment QEMU violates this requirement. >> >> It doesn't. Why do you say it does? > > Below 2 paragraphs tell you why :) > Because it does not always *have* non-volatile storage. So are we still discussing what MUST means? If the guest cannot see the difference, QEMU is not violating anything. >>>> +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** >>>> +If the driver changes field{writeback} between the submission of the write >>>> +and its completion, the write can be either volatile or stable when >>>> +its completion is reported; the exact behavior is undefined. >>> >>> So MUST then? >> >> No, definitely not MUST. "The write MUST be either X or not X" is a >> tautology. "The write MAY be either X or not X" also does not make >> sense, because there's no way for an implementation to do neither X nor >> not X. Tertium non datur. > > Missed this fact. As would other readers :) > > I would s/can be/is/. And drop the exact behavior is undefined - > this is just clarifying the definition, not saying anything > about the behaviour. Maybe prefix with "in other words". That's certainly okay. >> No, it's definitely a MAY. "One vendor may choose to include the item >> because a particular marketplace requires it or because the vendor feels >> that it enhances the product while another vendor may omit the same >> item" is a perfect description of this. Also, the driver can recognize >> the situation (and though it cannot do anything about it, it can log >> something). > > But SHOULD is an even better match: > SHOULD This word, or the adjective "RECOMMENDED", mean that there > may exist valid reasons in particular circumstances to ignore a > particular item, but the full implications must be understood and > carefully weighed before choosing a different course. > > implications such as data loss should definitely be well understood. > > MAY just means "ok not to do". Ah I see what you mean. I think what you want is already covered above: "Devices SHOULD always propose VIRTIO_BLK_F_FLUSH". If they don't, they've already done something not recommended, and a MAY is just fine in this paragraph. This paragraph only applies to implementation that have already disregarded that other "SHOULD". It makes little sense to do yet another recommendation that they would disregard. This "MAY" is here to explicitly allow the behavior. >>> It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose >>> your data on crash". That requirement is just too strong, devices often have >>> no way to know, and guests often don't care. >> >> I don't think it's too strong. Of course it is more like "I will not >> lose your data on crash, unless the user screwed up". > > As I explained above, this blocks using tmpfs for storage - > that loses data on a crash. tmpfs for storage loses way more data than just unflushed things. The whole backend _does not exist anymore_ after a crash. Since today I'm in the mood of using Latin, "ex falso quodlibet". :) >>> Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization? >>> >>> >>>> +end{note} >>>> >>>> subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} >>>> When using the legacy interface, transitional devices and drivers >>>> @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request >>>> does that. >>>> end{note} >>>> >>>> +Some older legacy devices did not commit completed writes to non-volatile >>>> +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. >>> >>> Anything transitional drivers can do it work around that? >> >> Always accept VIRTIO_BLK_F_FLUSH. > > I thought they also did this on flush. So simply "when", not "even". > Maybe suggest what transitional drivers should do. Okay, I'll take a look. >>> If no, this doesn't seem like a useful bit of info to me. >> >> I can remove it, but it was in the existing document and there is a >> workaround. > > That one talked about using writethrough mode. > So it was clear isn't unrelated to flush. VIRTIO_BLK_F_FLUSH proposed but not negotiated _is_ writethrough mode. Paolo


  • 8.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 10:54
    On Wed, Jul 08, 2015 at 12:36:21PM +0200, Paolo Bonzini wrote: > > > On 08/07/2015 12:24, Michael S. Tsirkin wrote: > >>>> +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature > >>>> +is not, the device MUST set field{writeback} to 0 as soon as the driver > >>>> +sets the FEATURES_OK status bit. > >>> > >>> So this field is dual use, sometimes it's written by driver, sometimes - > >>> by a device. A bit messy I think, will cause races or at least > >>> confusion. > >> > >> It doesn't cause races, because the device can set it before returning > >> from the instruction that wrote FEATURES_OK. > > > > SMP guests exist :) > > See below. > > >>> What is this in aid of? > >>> > >>> If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes. > >>> > >>> Why also set writeback to 0? > >> > >> So that it's consistent with the status of the device. The driver may > >> set it to 1 to prevent the device from committing the writes. > > > > Fields should have a single owner. RO fields can change, RW ones can't. > > This one is writeable, let's ask driver to set it consistently. > > Until FEATURES_OK is set and the guest has checked that the FEATURES_OK > bit was not reset by the device, the driver does not even _know that the > field exists_. So it is a non problem. > > Until FEATURES_OK is set, all fields are owned by the device. Doing as > you suggest would make things a lot more complicated for legacy devices. OK so this is really just an init value. Anything before that is a tree falling in the forest. +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature +is not, the device MUST initialize field{writeback} to 0. We implicitly rely on driver not reading or writing this before FEATURES_OK, so no need to mention FEATURES_OK here either. > >>>> +Legacy devices must also never modify field{writeback} as a result of > >>> > >>> We can not require anything from legacy devices. They are out there. > >> > >> They didn't as far as I know, and if you want to implement both modern > >> and legacy this is a deviation from the requirements of modern devices. > > > > We don't tell people how to implement legacy devices. We support transitional devices > > with legacy interfaces. Is this what you are trying to say? > > If yes must is MUST, etc. > > Okay, will fix. > > >>>> -Some older legacy devices did not operate in writethrough mode even > >>>> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. > >>>> > >>>> subsection{Device Operation}label{sec:Device Types / Block Device / Device Operation} > >>>> > >>>> @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. > >>>> A driver MUST set field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. > >>>> A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. > >>>> > >>>> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY > >>>> +switch to writethrough or writeback mode by writing respectively 0 and > >>>> +1 to the field{writeback} field. After writing a 0 to field{writeback}, > >>>> +the driver MUST NOT assume that any volatile writes have been committed > >>>> +to non-volatile storage. > >>> > >>> I think volatile here clashes with its use for buffers. > >>> How about we say e.g. "permanent device backend storage" here? > >>> Same below. > >> > >> Okay, or even just s/non-volatile/permanent/. > > > > I added "device backend" to stress this depends on device. > > It's not absoutely permanent - as permanent as device can make it. > > Okay. > > >>>> + > >>>> devicenormative{subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} > >>>> > >>>> A device MUST set the field{status} byte to VIRTIO_BLK_S_IOERR > >>>> for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT > >>>> write any data. > >>>> > >>>> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure > >>>> -that any writes which were completed are committed to non-volatile storage. > >>>> +A write is considered volatile when it is submitted; the contents of > >>>> +sectors covered by a volatile are undefined until the write becomes stable. > >>>> +A write becomes stable once it is completed and one or more of the > >>>> +following conditions is true: > >>>> + > >>>> +egin{enumerate} > >>>> +itemlabel{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor > >>>> + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was > >>>> + proposed by the device; > >>>> + > >>>> +itemlabel{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the > >>>> + field{writeback} field in configuration space was 0 extbf{all the time between > >>>> + the submission of the write and its completion}; > >>>> + > >>>> +itemlabel{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent extbf{after the write is > >>>> + completed} and is completed itself. > >>>> +end{enumerate} > >>>> + > >>>> +The device MUST ensure that stable writes are committed to > >>>> +non-volatile storage, > >>> > >>> So in particular at the moment QEMU violates this requirement. > >> > >> It doesn't. Why do you say it does? > > > > Below 2 paragraphs tell you why :) > > Because it does not always *have* non-volatile storage. > > So are we still discussing what MUST means? I'm discussing what non-volatile means. > If the guest cannot see the > difference, QEMU is not violating anything. Of course it can see the difference after reset. Easy to detect if you have e.g. some other storage to compare against. > >>>> +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** > >>>> +If the driver changes field{writeback} between the submission of the write > >>>> +and its completion, the write can be either volatile or stable when > >>>> +its completion is reported; the exact behavior is undefined. > >>> > >>> So MUST then? > >> > >> No, definitely not MUST. "The write MUST be either X or not X" is a > >> tautology. "The write MAY be either X or not X" also does not make > >> sense, because there's no way for an implementation to do neither X nor > >> not X. Tertium non datur. > > > > Missed this fact. As would other readers :) > > > > I would s/can be/is/. And drop the exact behavior is undefined - > > this is just clarifying the definition, not saying anything > > about the behaviour. Maybe prefix with "in other words". > > That's certainly okay. > > >> No, it's definitely a MAY. "One vendor may choose to include the item > >> because a particular marketplace requires it or because the vendor feels > >> that it enhances the product while another vendor may omit the same > >> item" is a perfect description of this. Also, the driver can recognize > >> the situation (and though it cannot do anything about it, it can log > >> something). > > > > But SHOULD is an even better match: > > SHOULD This word, or the adjective "RECOMMENDED", mean that there > > may exist valid reasons in particular circumstances to ignore a > > particular item, but the full implications must be understood and > > carefully weighed before choosing a different course. > > > > implications such as data loss should definitely be well understood. > > > > MAY just means "ok not to do". > > Ah I see what you mean. I think what you want is already covered above: > "Devices SHOULD always propose VIRTIO_BLK_F_FLUSH". > > If they don't, they've already done something not recommended, and a MAY > is just fine in this paragraph. This paragraph only applies to > implementation that have already disregarded that other "SHOULD". It > makes little sense to do yet another recommendation that they would > disregard. > > This "MAY" is here to explicitly allow the behavior. Well ignoring that SHOULD leads to bad performance, ignoring this one leads to data loss. So you need to pause and think at both points ... > >>> It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose > >>> your data on crash". That requirement is just too strong, devices often have > >>> no way to know, and guests often don't care. > >> > >> I don't think it's too strong. Of course it is more like "I will not > >> lose your data on crash, unless the user screwed up". > > > > As I explained above, this blocks using tmpfs for storage - > > that loses data on a crash. > > tmpfs for storage loses way more data than just unflushed things. The > whole backend _does not exist anymore_ after a crash. Since today I'm > in the mood of using Latin, "ex falso quodlibet". :) Yea. And it's still a valid case, in particular we can't prohibit it. > >>> Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization? > >>> > >>> > >>>> +end{note} > >>>> > >>>> subsubsection{Legacy Interface: Device Operation}label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} > >>>> When using the legacy interface, transitional devices and drivers > >>>> @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request > >>>> does that. > >>>> end{note} > >>>> > >>>> +Some older legacy devices did not commit completed writes to non-volatile > >>>> +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. > >>> > >>> Anything transitional drivers can do it work around that? > >> > >> Always accept VIRTIO_BLK_F_FLUSH. > > > > I thought they also did this on flush. So simply "when", not "even". > > Maybe suggest what transitional drivers should do. > > Okay, I'll take a look. > > >>> If no, this doesn't seem like a useful bit of info to me. > >> > >> I can remove it, but it was in the existing document and there is a > >> workaround. > > > > That one talked about using writethrough mode. > > So it was clear isn't unrelated to flush. > > VIRTIO_BLK_F_FLUSH proposed but not negotiated _is_ writethrough mode. > > Paolo Agree here. -- MST


  • 9.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 11:05
    On 08/07/2015 12:53, Michael S. Tsirkin wrote: >>> Below 2 paragraphs tell you why :) >>> Because it does not always *have* non-volatile storage. >> >> So are we still discussing what MUST means? > > I'm discussing what non-volatile means. Since we're down to just 1 or 2 contentious points, I'm sending v4. Let's reboot the discussion on top of that version. Just a couple comments below: >> If the guest cannot see the >> difference, QEMU is not violating anything. > > Of course it can see the difference after reset. Not after reset, only after a cold restart (there was a crash---restoring from migration). A lot more changes could happen between shutdown (doesn't matter if clean or crashy) and the next cold restart, and the guest can detect them the same way. E.g. in one case the guest could detect that the kernel was updated via libguestfs. In another the guest could detect that the storage (formerly in tmpfs, though the guest didn't know that) is now all zero. >> If they don't, they've already done something not recommended, and a MAY >> is just fine in this paragraph. This paragraph only applies to >> implementation that have already disregarded that other "SHOULD". It >> makes little sense to do yet another recommendation that they would >> disregard. >> >> This "MAY" is here to explicitly allow the behavior. > > Well ignoring that SHOULD leads to bad performance, ignoring > this one leads to data loss. So you need to pause and think > at both points ... I see your point. However, whoever ignores that SHOULD is very likely not caring about data safety anyway. This MAY allows them to keep the data safe at the price of bad performance, in the unlikely case that they do care about data safety. Paolo


  • 10.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 11:07
    On 08/07/2015 13:04, Paolo Bonzini wrote: > Not after reset, only after a cold restart (there was a > crash---restoring from migration is not an option ). A lot more changes could happen > between shutdown (doesn't matter if clean or crashy) and the next cold > restart, and the guest can detect them the same way. > > E.g. in one case the guest could detect that the kernel was updated via > libguestfs. In another the guest could detect that the storage > (formerly in tmpfs, though the guest didn't know that) is now all zero. >


  • 11.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 11:11
    On Wed, Jul 08, 2015 at 01:04:40PM +0200, Paolo Bonzini wrote: > > > On 08/07/2015 12:53, Michael S. Tsirkin wrote: > >>> Below 2 paragraphs tell you why :) > >>> Because it does not always *have* non-volatile storage. > >> > >> So are we still discussing what MUST means? > > > > I'm discussing what non-volatile means. > > Since we're down to just 1 or 2 contentious points, I'm sending v4. > Let's reboot the discussion on top of that version. Just a couple > comments below: > > >> If the guest cannot see the > >> difference, QEMU is not violating anything. > > > > Of course it can see the difference after reset. > > Not after reset, only after a cold restart (there was a > crash---restoring from migration). A lot more changes could happen > between shutdown (doesn't matter if clean or crashy) and the next cold > restart, and the guest can detect them the same way. > > E.g. in one case the guest could detect that the kernel was updated via > libguestfs. In another the guest could detect that the storage > (formerly in tmpfs, though the guest didn't know that) is now all zero. Absolutely. I think this means we can't say "spec tells device MUST NOT lose data, but qemu loses it and that is ok since no one ever can observe that". It's observable, and if it's sometimes okay, appropriate statement is SHOULD NOT lose data. > >> If they don't, they've already done something not recommended, and a MAY > >> is just fine in this paragraph. This paragraph only applies to > >> implementation that have already disregarded that other "SHOULD". It > >> makes little sense to do yet another recommendation that they would > >> disregard. > >> > >> This "MAY" is here to explicitly allow the behavior. > > > > Well ignoring that SHOULD leads to bad performance, ignoring > > this one leads to data loss. So you need to pause and think > > at both points ... > > I see your point. However, whoever ignores that SHOULD is very likely > not caring about data safety anyway. This MAY allows them to keep the > data safe at the price of bad performance, in the unlikely case that > they do care about data safety. > > Paolo Looks like I was unclear. What I was suggesting is reversing the case. Asking that data SHOULD be safe. -- MST


  • 12.  Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE

    Posted 07-08-2015 11:32
    On 08/07/2015 13:10, Michael S. Tsirkin wrote: > On Wed, Jul 08, 2015 at 01:04:40PM +0200, Paolo Bonzini wrote: >> >> >> On 08/07/2015 12:53, Michael S. Tsirkin wrote: >>>>> Below 2 paragraphs tell you why :) >>>>> Because it does not always *have* non-volatile storage. >>>> >>>> So are we still discussing what MUST means? >>> >>> I'm discussing what non-volatile means. >> >> Since we're down to just 1 or 2 contentious points, I'm sending v4. >> Let's reboot the discussion on top of that version. Just a couple >> comments below: >> >>>> If the guest cannot see the >>>> difference, QEMU is not violating anything. >>> >>> Of course it can see the difference after reset. >> >> Not after reset, only after a cold restart (there was a >> crash---restoring from migration). A lot more changes could happen >> between shutdown (doesn't matter if clean or crashy) and the next cold >> restart, and the guest can detect them the same way. >> >> E.g. in one case the guest could detect that the kernel was updated via >> libguestfs. In another the guest could detect that the storage >> (formerly in tmpfs, though the guest didn't know that) is now all zero. > > Absolutely. I think this means we can't say "spec tells device MUST NOT > lose data, but qemu loses it and that is ok since no one ever can > observe that". It's observable, and if it's sometimes okay, > appropriate statement is SHOULD NOT lose data. I think you're putting yourself out of the spec by using tmpfs, because there is no "permanent device backend storage" to talk of. The user option to eliminate flushes is the same thing: the user is telling you to treat device backend storage as transient (not permanent). As long as there is permanent device backend storage, QEMU is fine. Perhaps you can propose a patch on top of v4 that adds the "device backend storage is transient" case? I don't think it's necessary, but perhaps it's a small change. >>>> If they don't, they've already done something not recommended, and a MAY >>>> is just fine in this paragraph. This paragraph only applies to >>>> implementation that have already disregarded that other "SHOULD". It >>>> makes