virtio-comment


[PATCH RFC] virtio-pci: new configuration layout

  • 1.  [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-26-2013 23:46
    This is the new configuration layout.

    Notes:
    - Everything is LE
    - There's a feature bit that means spec 1.0 compliant.
    - Both devices and drivers can either require the 1.0 interface
    or try to include compatibility support. The spec isn't forcing
    this decision.
    - I kept documentation of the legacy interface around, and added notes
    on transition inline. They are in separate sections each clearly marked
    "Legacy Interface" so we'll be able to separate them out
    from the final document as necessary - for now I think it's easier
    to keep it all together.

    Only virtio PCI has been converted.
    Let's discuss this at the meeting tonight; once we figure out PCI,
    we can do something similar for MMIO and CCW.

    This is on top of
    [PATCH] virtio: fix feature bit thinko

    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

    ---

    diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    index 3c4260a..50e5f47 100644
    --- a/virtio-v1.0-wd01-part1-specification.txt
    +++ b/virtio-v1.0-wd01-part1-specification.txt
    @@ -87,6 +87,8 @@ o Feature bits
    o Configuration space
    o One or more virtqueues

    +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
    +
    2.1.1 Device Status Field
    -------------------------

    @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    24 to 31: Feature bits reserved for extensions to the queue and
    feature negotiation mechanisms

    + 32: Feature bit must be set for any device compliant with this
    + revision of the specification, and acknowledged by all device drivers.
    +
    + 33 to 63: Feature bits reserved for future extensions
    +
    For example, feature bit 0 for a network device (i.e. Subsystem
    Device ID 1) indicates that the device supports checksumming of
    packets.
    @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
    indicated by offering a feature bit, so the guest can check
    before accessing that part of the configuration space.

    +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
    +--------------------------------------
    +
    +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
    +different interface between the hypervisor and the guest.
    +Since these are widely deployed in the field, this specification
    +accommodates optional features to simplify transition
    +from these earlier draft interfaces. Specifically:
    +
    +Legacy Interface
    + is an interface specified by an earlier draft of this specification
    + (up to 0.9.X)
    +Legacy Device
    + is a device implemented before this specification was released,
    + and implementing a legacy interface on the host side
    +Legacy Driver
    + is a driver implemented before this specification was released,
    + and implementing a legacy interface on the guest side
    +
    +To simplify transition from these earlier draft interfaces,
    +it is possible to implement:
    +
    +Transitional Device
    + a device supporting both drivers conforming to this
    + specification, and legacy drivers
    +
    +Transitional Driver
    + a driver supporting both devices conforming to this
    + specification, and legacy devices
    +
    +Devices and drivers that require support for revision 1.0 or newer of
    +the specification to function are called non-transitional devices and
    +drivers, respectively.
    +
    +Transitional Drivers can detect Legacy Devices by detecting that
    +Feature bit 32 is not offered.
    +Transitional Devices can detect Legacy Drivers by detecting that
    +Feature bit 32 has not been acknowledged by the driver.
    +
    +To make them easier to locate, specification sections documenting these
    +transitional features are all explicitly marked with
    +'Legacy Interface' in the section title.
    +
    +
    2.1.3 Configuration Space
    -------------------------

    Configuration space is generally used for rarely-changing or
    initialization-time parameters.

    -Note that this space is generally the guest's native endian,
    +Note that configuration space generally uses the little-endian format
    +for multi-byte fields.
    +
    +2.1.3.1 Legacy Interface: A Note on Configuration Space Endianness
    +--------------------------------------
    +
    +Note that for legacy interfaces, configuration space is generally the guest's native endian,
    rather than PCI's little-endian.

    2.1.4 Virtqueues
    @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size
    parameter, which sets the number of entries and implies the total size
    of the queue.

    +Each virtqueue consists of three parts:
    +
    + Descriptor Table
    + Available Ring
    + Used Ring
    +
    +where each part is physically-contiguous in guest memory,
    +and has different alignment requirements.
    +
    +The Queue Size field controls the total number of bytes
    +required for each part of the virtqueue.
    +
    +The memory alignment and size requirements, in bytes, of each part of the
    +virtqueue are summarized in the following table (qsz is the Queue Size field):
    +
    ++------------------+-----------+-------------+
    +| Virtqueue Part   | Alignment | Size        |
    ++------------------+-----------+-------------+
    ++------------------+-----------+-------------+
    +| Descriptor Table | 16        | 16 * qsz    |
    ++------------------+-----------+-------------+
    +| Available Ring   | 2         | 6 + 2 * qsz |
    ++------------------+-----------+-------------+
    +| Used Ring        | 4         | 6 + 4 * qsz |
    ++------------------+-----------+-------------+
    +
    +When the driver wants to send a buffer to the device, it fills in
    +a slot in the descriptor table (or chains several together), and
    +writes the descriptor index into the available ring. It then
    +notifies the device. When the device has finished a buffer, it
    +writes the descriptor into the used ring, and sends an interrupt.
    +
    +
    +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout
    +--------------------------------------
    +
    +For Legacy Interfaces, several additional
    +restrictions are placed on the virtqueue layout:
    +
    Each virtqueue occupies two or more physically-contiguous pages
    (usually defined as 4096 bytes, but depending on the transport)
    and consists of three parts:
    @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula:
    + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
    }

    -This currently wastes some space with padding, but also allows future
    -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The
    -virtqueue layout structure looks like this:
    +This wastes some space with padding.
    +The legacy virtqueue layout structure therefore looks like this:

    struct vring {
    // The actual descriptors (16 bytes each)
    @@ -200,25 +295,17 @@ virtqueue layout structure looks like this:
    struct vring_used used;
    };

    -When the driver wants to send a buffer to the device, it fills in
    -a slot in the descriptor table (or chains several together), and
    -writes the descriptor index into the available ring. It then
    -notifies the device. When the device has finished a buffer, it
    -writes the descriptor into the used ring, and sends an interrupt.
    -
    -2.1.4.1 A Note on Virtqueue Endianness
    +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness
    --------------------------------------

    Note that the endian of fields and in the virtqueue is the native
    -endian of the guest, not little-endian as PCI normally is. This makes
    -for simpler guest code, and it is assumed that the host already has to
    -be deeply aware of the guest endian so such an “endian-aware” device
    -is not a significant issue.
    +endian of the guest, not little-endian as PCI normally is.
    +It is assumed that the host is already aware of the guest endian.

    2.1.4.2 Message Framing
    -----------------------
    -The original intent of the specification was that message framing (the
    -particular layout of descriptors) be independent of the contents of
    +Generally, the intent of the specification is for message framing (the
    +particular layout of descriptors) to be independent of the contents of
    the buffers. For example, a network transmit buffer consists of a 12
    byte header followed by the network packet. This could be most simply
    placed in the descriptor table as a 12 byte output descriptor followed
    @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and
    packet are adjacent, or even three or more descriptors (possibly with
    loss of efficiency in that case).

    -Regrettably, initial driver implementations used simple layouts, and
    -devices came to rely on it, despite this specification wording[10]. It
    -is thus recommended that drivers be conservative in their assumptions,
    -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some
    +In addition, some
    implementations may have large-but-reasonable restrictions on total
    descriptor size (such as based on IOV_MAX in the host OS). This has
    not been a problem in practice: little sympathy will be given to
    drivers which create unreasonably-sized descriptors such as by
    dividing a network packet into 1500 single-byte descriptors!

    +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
    +-----------------------
    +Regrettably, initial driver implementations used simple layouts, and
    +devices came to rely on it, despite this specification wording[10]. It
    +is thus recommended that when using legacy interfaces,
    +drivers should be conservative in their assumptions,
    +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
    +
    2.1.4.3 The Virtqueue Descriptor Table
    --------------------------------------

    @@ -386,23 +478,27 @@ how to communicate with the specific device.
    2.2.1 Device Initialization
    ---------------------------

    -1. Reset the device. This is not required on initial start up.
    +1. Device discovery. This is only required for some transports.
    +
    +2. Reset the device. This is not required on initial start up.

    -2. The ACKNOWLEDGE status bit is set: we have noticed the device.
    +3. Device layout detection. This is only required for some transports.

    -3. The DRIVER status bit is set: we know how to drive the device.
    +4. The ACKNOWLEDGE status bit is set: we have noticed the device.

    -4. Device-specific setup, including reading the device feature
    +5. The DRIVER status bit is set: we know how to drive the device.
    +
    +6. Device-specific setup, including reading the device feature
    bits, discovery of virtqueues for the device, optional per-bus
    setup, and reading and possibly writing the device's virtio
    configuration space.

    -5. The subset of device feature bits understood by the driver is
    +7. The subset of device feature bits understood by the driver is
    written to the device.

    -6. The DRIVER_OK status bit is set.
    +8. The DRIVER_OK status bit is set.

    -7. The device can now be used (ie. buffers added to the
    +9. The device can now be used (ie. buffers added to the
    virtqueues)[4]

    If any of these steps go irrecoverably wrong, the guest should
    @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices.

    Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through
    0x103F inclusive is a virtio device[3]. The device must also have a
    -Revision ID of 0 to match this specification.
    +Revision ID of 0 or Revision ID of 1 to match this specification.

    The Subsystem Device ID indicates which virtio device is
    supported by the device. The Subsystem Vendor ID should reflect
    the PCI Vendor ID of the environment (it's currently only used
    for informational purposes by the guest).

    +Drivers must not match devices where Revision ID does not match 0 or 1.
    +
    +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
    +----------------------------
    +Transitional devices must have a Revision ID of 0.
    +
    +Non-transitional devices must have a Revision ID of 1.
    +
    +Transitional drivers must match a Revision ID of 0 or 1.
    +
    +Non-transitional drivers must only match a Revision ID of 1.
    +
    2.4.1.2 PCI Device Layout
    -------------------------

    -To configure the device, we use the first I/O region of the PCI
    -device. This contains a virtio header followed by a
    -device-specific region.
    +To configure the device,
    +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
    +These contain the virtio header registers, the notification register, the
    +ISR status register and device specific registers, as specified by Virtio
    +Structure PCI Capabilities.
    +
    +There may be different widths of accesses to the I/O region; the
    +“natural” access method for each field must be
    +used (i.e. 32-bit accesses for 32-bit fields, etc).
    +
    +PCI Device Configuration Layout includes the common configuration,
    +ISR, notification and device specific configuration
    +structures.
    +
    +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
    +
    +
    +2.4.1.2.1 Common configuration structure layout
    +-------------------------
    +Common configuration structure layout is documented below:
    +
    +struct virtio_pci_common_cfg {
    + /* About the whole device. */
    + __le32 device_feature_select; /* read-write */
    + __le32 device_feature; /* read-only */
    + __le32 guest_feature_select; /* read-write */
    + __le32 guest_feature; /* read-write */
    + __le16 msix_config; /* read-write */
    + __le16 num_queues; /* read-only */
    + __u8 device_status; /* read-write */
    + __u8 unused1;
    +
    + /* About a specific virtqueue. */
    + __le16 queue_select; /* read-write */
    + __le16 queue_size; /* read-write, power of 2, or 0. */
    + __le16 queue_msix_vector; /* read-write */
    + __le16 queue_enable; /* read-write */
    + __le16 queue_notify_off; /* read-only */
    + __le64 queue_desc; /* read-write */
    + __le64 queue_avail; /* read-write */
    + __le64 queue_used; /* read-write */
    +};
    +
    +device_feature_select
    +
    + Selects which Feature Bits the device_feature field refers to.
    + Value 0x0 selects Feature Bits 0 to 31
    + Value 0x1 selects Feature Bits 32 to 63
    + All other values cause reads from device_feature to return 0.
    +
    +device_feature
    +
    + Used by Device to report Feature Bits to Driver.
    + Device Feature Bits selected by device_feature_select.
    +
    +guest_feature_select
    +
    + Selects which Feature Bits the guest_feature field refers to.
    + Value 0x0 selects Feature Bits 0 to 31
    + Value 0x1 selects Feature Bits 32 to 63
    + All other values cause writes to guest_feature to be ignored,
    + and reads to return 0.
    +
    +guest_feature
    +
    + Used by Driver to acknowledge Feature Bits to Device.
    + Guest Feature Bits selected by guest_feature_select.
    +
    +msix_config
    +
    + Configuration Vector for MSI-X.
    +
    +num_queues
    +
    + Specifies the maximum number of virtqueues supported by device.
    +
    +device_status
    +
    + Device Status field.
    +
    +queue_select
    +
    + Queue Select. Selects which virtqueue the other fields refer to.
    +
    +queue_size
    +
    + Queue Size. On reset, specifies the maximum queue size supported by
    + the hypervisor. This can be modified by driver to reduce memory requirements.
    + Set to 0 if this virtqueue is unused.
    +
    +queue_msix_vector
    +
    + Queue Vector for MSI-X.
    +
    +queue_enable
    +
    + Used to selectively prevent host from executing requests from this virtqueue.
    + 1 - enabled; 0 - disabled
    +
    +queue_notify_off
    +
    + Used to calculate the offset from start of Notification structure at
    + which this virtqueue is located.
    + Note: this is *not* an offset in bytes. See notify_off_multiplier below.
    +
    +queue_desc
    +
    + Physical address of Descriptor Table.
    +
    +queue_avail
    +
    + Physical address of Available Ring.
    +
    +queue_used
    +
    + Physical address of Used Ring.
    +
    +
    +2.4.1.2.2 ISR status structure layout
    +-------------------------
    +ISR status structure includes a single 8-bit ISR status field.
    +
    +2.4.1.2.3 Notification structure layout
    +-------------------------
    +Notification structure is always a multiple of 2 bytes in size.
    +It includes 2-byte Queue Notify fields for each virtqueue of
    +the device. Note that multiple virtqueues can use the same
    +Queue Notify field, if necessary.
    +
    +2.4.1.2.4 Device specific structure
    +-------------------------
    +
    +Device specific structure is optional.
    +
    +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
    +-------------------------
    +
    +Transitional devices should present part of configuration
    +registers in a legacy configuration structure in BAR0 in the first I/O
    +region of the PCI device, as documented below.

    There may be different widths of accesses to the I/O region; the
    “natural” access method for each field in the virtio header must be
    -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
    +used (i.e. 32-bit accesses for 32-bit fields, etc), but
    +when accessed through the legacy interface, the
    device-specific region can be accessed using any width accesses, and
    should obtain the same results.

    Note that this is possible because while the virtio header is PCI
    -(i.e. little) endian, the device-specific region is encoded in
    -the native endian of the guest (where such distinction is
    +(i.e. little) endian, when using the legacy interface the device-specific
    +region is encoded in the native endian of the guest (where such distinction is
    applicable).

    -2.4.1.2.1 PCI Device Virtio Header
    -----------------------------------

    -The virtio header looks as follows:
    +When used through the legacy interface, the virtio header looks as follows:

    +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
    | Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 |
    @@ -661,7 +905,6 @@ The virtio header looks as follows:
    | || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status |
    +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+

    -
    If MSI-X is enabled for the device, two additional fields
    immediately follow this header:[5]

    @@ -689,25 +932,154 @@ device-specific headers:
    | || |
    +------------++--------------------+

    +Note that only Feature Bits 0 to 31 are accessible through the
    +Legacy Interface. When used through the Legacy Interface,
    +Transitional Devices must assume that Feature Bits 32 to 63
    +are not acknowledged by Driver.
    +
    +
    2.4.1.3 PCI-specific Initialization And Device Operation
    --------------------------------------------------------

    -The page size for a virtqueue on a PCI virtio device is defined as
    -4096 bytes.
    -
    2.4.1.3.1 Device Initialization
    -------------------------------

    -2.4.1.3.1.1 Queue Vector Configuration
    +This documents PCI-specific steps executed during Device Initialization.
    +As the first step, driver must detect device configuration layout
    +to locate configuration fields in memory, I/O or configuration space of the
    +device.
    +
    +2.4.1.3.1.1 Virtio Device Configuration Layout Detection
    +-------------------------------
    +
    +As a prerequisite to device initialization, driver executes a
    +PCI capability list scan, detecting virtio configuration layout using Virtio
    +Structure PCI capabilities.
    +
    +Virtio Device Configuration Layout includes virtio configuration header, Notification
    +and ISR Status and device configuration structures.
    +Each structure can be mapped by a Base Address register (BAR) belonging to
    +the function, located beginning at 10h in Configuration Space,
    +or accessed through PCI configuration space.
    +
    +Actual location of each structure is specified using vendor-specific PCI capability located
    +on capability list in PCI configuration space of the device.
    +This virtio structure capability uses little-endian format; all bits are
    +read-only:
    +
    +struct virtio_pci_cap {
    + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */
    + __u8 cap_next; /* Generic PCI field: next ptr. */
    + __u8 cap_len; /* Generic PCI field: capability length */
    + __u8 cfg_type; /* Identifies the structure. */
    + __u8 bar; /* Where to find it. */
    + __u8 padding[3];/* Pad to full dword. */
    + __le32 offset; /* Offset within bar. */
    + __le32 length; /* Length of the structure, in bytes. */
    +};
    +
    +This structure can optionally be followed by extra data, depending on
    +other fields, as documented below.
    +
    +The fields are interpreted as follows:
    +
    +cap_vndr
    + 0x09; Identifies a vendor-specific capability.
    +
    +cap_next
    + Link to next capability in the capability list in the configuration space.
    +
    +cap_len
    + Length of the capability structure, including the whole of
    + struct virtio_pci_cap, and extra data if any.
    + This length might include padding, or fields unused by the driver.
    +
    +cfg_type
    + identifies the structure, according to the following table.
    +
    + /* Common configuration */
    + #define VIRTIO_PCI_CAP_COMMON_CFG 1
    + /* Notifications */
    + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2
    + /* ISR Status */
    + #define VIRTIO_PCI_CAP_ISR_CFG 3
    + /* Device specific configuration */
    + #define VIRTIO_PCI_CAP_DEVICE_CFG 4
    +
    + More than one capability can identify the same structure - this makes it
    + possible for the device to expose multiple interfaces to drivers. The order of
    + the capabilities in the capability list specifies the order of preference
    + suggested by the device; drivers should use the first interface that they can
    + support. For example, on some hypervisors, notifications using IO accesses are
    + faster than memory accesses. In this case, hypervisor can expose two
    + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
    + the first one addressing an I/O BAR, the second one addressing a memory BAR.
    + Driver will use the I/O BAR if I/O resources are available, and fall back on
    + memory BAR when I/O resources are unavailable.
    +
    +bar
    +
    + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
    + the function located beginning at 10h in Configuration Space
    + and used to map the structure into Memory or I/O Space.
    + The BAR is permitted to be either 32-bit or 64-bit, and it can map either
    + Memory Space or I/O Space.
    +
    + The value 0xF specifies that the structure is in PCI configuration space
    + inline with this capability structure, following (not necessarily immediately)
    + the length field.
    +
    +offset
    + indicates where the structure begins relative to the base address associated
    + with the BAR. If bar specifies configuration space, offset is relative
    + to start of virtio_pci_cap structure.
    +
    +length
    + indicates the length of the structure.
    + This size might include padding, or fields unused by the driver.
    + Drivers are also recommended to only map part of configuration structure
    + large enough for device operation.
    + For example, a future device might present a large structure size of several
    + MBytes.
    + As current devices never utilize structures larger than 4KBytes in size,
    + driver can limit the mapped structure size to e.g.
    + 4KBytes to allow forward compatibility with such devices without loss of
    + functionality and without wasting resources.
    +
    +
    +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed
    +by additional fields:
    +
    +struct virtio_pci_notify_cap {
    + struct virtio_pci_cap cap;
    + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
    +};
    +
    +notify_off_multiplier
    +
    + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0.
    + Value 0x1 is reserved.
    + For a given virtqueue, the address to use for notifications is calculated as follows:
    +
    + queue_notify_off * notify_off_multiplier + offset
    +
    + If notify_off_multiplier is 0, all virtqueues use the same address in
    + the Notifications structure!
    +
    +
    +2.4.1.3.1.2 Legacy Interface: A Note on Device Layout Detection
    +-------------------------------
    +
    +Legacy drivers skipped Device Layout Detection step, assuming legacy
    +configuration space in BAR0 in I/O space unconditionally.
    +
    +2.4.1.3.1.3 Queue Vector Configuration
    --------------------------------------

    When MSI-X capability is present and enabled in the device
    -(through standard PCI configuration space) 4 bytes at byte offset
    -20 are used to map configuration change and queue interrupts to
    -MSI-X vectors. In this case, the ISR Status field is unused, and
    -device specific configuration starts at byte offset 24 in virtio
    -header structure. When MSI-X capability is not enabled, device
    -specific configuration starts at byte offset 20 in virtio header.
    +(through standard PCI configuration space) Configuration/Queue
    +MSI-X Vector registers are used to map configuration change and queue
    +interrupts to MSI-X vectors. In this case, the ISR Status is unused.

    Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
    Configuration/Queue Vector registers, maps interrupts triggered
    @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on
    failure, NO_VECTOR is returned. If a mapping failure is detected,
    the driver can retry mapping with fewer vectors, or disable MSI-X.

    -2.4.1.3.1.2 Virtqueue Configuration
    +2.4.1.3.1.4 Virtqueue Configuration
    -----------------------------------

    As a device can have zero or more virtqueues for bulk data
    @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has:
    always a power of 2. This controls how big the virtqueue is
    (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist.

    -3. Allocate and zero virtqueue in contiguous physical memory, on
    - a 4096 byte alignment. Write the physical address, divided by
    - 4096 to the Queue Address field.[6]
    +3. Optionally, select a smaller virtqueue size and write it in the Queue Size
    + field.
    +
    +3. Allocate and zero Descriptor Table, Available and Used rings for the
    + virtqueue in contiguous physical memory.

    4. Optionally, if MSI-X capability is present and enabled on the
    device, select a vector to use to request interrupts triggered
    @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has:
    Queue Vector field: on success, previously written value is
    returned; on failure, NO_VECTOR value is returned.

    +
    +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration
    +-----------------------------------
    +When using the legacy interface, the page size for a virtqueue on a PCI virtio
    +device is defined as 4096 bytes. Driver writes the physical address, divided
    +by 4096 to the Queue Address field [6].
    +
    2.4.1.3.2 Notifying The Device
    ------------------------------

    Device notification occurs by writing the 16-bit virtqueue index
    -of this virtqueue to the Queue Notify field of the virtio header
    -in the first I/O region of the PCI device.
    +of this virtqueue to the Queue Notify field.

    2.4.1.3.3 Receiving Used Buffers From The Device
    +------------------------------

    If an interrupt is necessary:

    @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
    This is only allowed if the driver does not use any features
    which would alter this early use of the device.

    -[5] ie. once you enable MSI-X on the device, the other fields move.
    +[5] When MSI-X capability is enabled, device specific configuration starts at
    +byte offset 24 in virtio header structure. When MSI-X capability is not
    +enabled, device specific configuration starts at byte offset 20 in virtio
    +header. ie. once you enable MSI-X on the device, the other fields move.
    If you turn it off again, they move back!

    [6] The 4096 is based on the x86 page size, but it's also large
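
Editorial sketch (not part of the posted patch): the feature window described in 2.4.1.2.1 and the bit 32 check from 2.1.2.1 could be exercised roughly as follows. The struct mock_device and all function names are hypothetical stand-ins for real register accessors; a real driver would read device_feature through MMIO or port I/O. Also note that a legacy device has no common configuration structure at all, so this only models the "bit 32 not offered" outcome.

```c
#include <stdint.h>

/* Hypothetical in-memory stand-in for the device side of the
 * device_feature_select / device_feature register pair. */
struct mock_device {
    uint64_t offered; /* all 64 feature bits the device offers */
    uint32_t select;  /* last value written to device_feature_select */
};

static void write_device_feature_select(struct mock_device *d, uint32_t v)
{
    d->select = v;
}

/* Value 0 selects bits 0 to 31, value 1 selects bits 32 to 63,
 * any other value makes reads return 0 (as the patch text says). */
static uint32_t read_device_feature(const struct mock_device *d)
{
    if (d->select == 0)
        return (uint32_t)(d->offered & 0xFFFFFFFFu);
    if (d->select == 1)
        return (uint32_t)(d->offered >> 32);
    return 0;
}

/* Driver side: read both 32-bit windows and combine them. */
static uint64_t read_all_device_features(struct mock_device *d)
{
    uint64_t lo, hi;

    write_device_feature_select(d, 0);
    lo = read_device_feature(d);
    write_device_feature_select(d, 1);
    hi = read_device_feature(d);
    return lo | (hi << 32);
}

/* Feature bit 32 marks a device compliant with this revision. */
#define FEATURE_BIT_SPEC_1_0 32

static int device_is_legacy(uint64_t offered)
{
    return ((offered >> FEATURE_BIT_SPEC_1_0) & 1) == 0;
}
```

A transitional driver would take the legacy path when device_is_legacy() returns nonzero, and otherwise acknowledge bit 32 through guest_feature.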

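
Editorial sketch (not part of the posted patch): the Revision ID matching rules in 2.4.1.1.1 reduce to a small predicate. The enum and function names below are made up for illustration; only the Vendor ID, the Device ID range and the Revision ID rules come from the patch text.

```c
#include <stdint.h>

/* Hypothetical classification of the driver doing the matching. */
enum driver_kind {
    DRIVER_TRANSITIONAL,
    DRIVER_NON_TRANSITIONAL
};

/* Vendor ID 0x1AF4, Device ID 0x1000 through 0x103F inclusive. */
static int is_virtio_pci_id(uint16_t vendor_id, uint16_t device_id)
{
    return vendor_id == 0x1AF4 &&
           device_id >= 0x1000 && device_id <= 0x103F;
}

/* Transitional drivers match Revision ID 0 or 1;
 * non-transitional drivers match Revision ID 1 only. */
static int driver_should_bind(enum driver_kind kind, uint16_t vendor_id,
                              uint16_t device_id, uint8_t revision_id)
{
    if (!is_virtio_pci_id(vendor_id, device_id))
        return 0;
    if (kind == DRIVER_NON_TRANSITIONAL)
        return revision_id == 1;
    return revision_id == 0 || revision_id == 1;
}
```

With these rules a non-transitional driver never binds to a transitional device (Revision ID 0), which is what keeps it away from the legacy-only layout.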

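
Editorial sketch (not part of the posted patch): the capability list scan from 2.4.1.3.1.1 could look roughly like this when run over a snapshot of the first 256 bytes of PCI configuration space. The function name and the byte-array interface are assumptions for illustration; the register offsets (Status at 0x06, Capabilities Pointer at 0x34) and the vendor-specific ID 0x09 are standard PCI, and the cfg_type byte offset follows struct virtio_pci_cap as posted.

```c
#include <stdint.h>

#define PCI_STATUS          0x06 /* status register, low byte */
#define PCI_STATUS_CAP_LIST 0x10 /* bit 4: capability list present */
#define PCI_CAPABILITY_LIST 0x34 /* pointer to the first capability */
#define PCI_CAP_ID_VNDR     0x09 /* vendor-specific capability ID */

/* Byte offsets inside struct virtio_pci_cap. */
#define VPC_CAP_VNDR 0
#define VPC_CAP_NEXT 1
#define VPC_CFG_TYPE 3

/* Returns the configuration space offset of the first vendor-specific
 * capability whose cfg_type matches, or 0 if there is none. Returning
 * the first match follows the "order of preference" rule in the patch. */
static uint8_t find_virtio_cap(const uint8_t cfg[256], uint8_t cfg_type)
{
    uint8_t pos;
    int guard = 48; /* bound the walk in case the list loops */

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    /* The low two bits of capability pointers are reserved. */
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    while (pos >= 0x40 && guard-- > 0) {
        if (cfg[pos + VPC_CAP_VNDR] == PCI_CAP_ID_VNDR &&
            cfg[pos + VPC_CFG_TYPE] == cfg_type)
            return pos;
        pos = cfg[pos + VPC_CAP_NEXT] & 0xFC;
    }
    return 0;
}
```

A driver would call this once per cfg_type (1 for common configuration, 2 for notifications, and so on) and then read bar, offset and length from the capability it found.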

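
Editorial sketch (not part of the posted patch): the notification address formula given under notify_off_multiplier, plus the stated constraint on the multiplier (0, or an even power of two, with 1 reserved). Function names are hypothetical; the result is an offset relative to the base of the BAR named by the capability.

```c
#include <stdint.h>

/* The multiplier must be 0 or an even power of two; 1 is reserved. */
static int notify_multiplier_valid(uint32_t m)
{
    return m == 0 || (m >= 2 && (m & (m - 1)) == 0);
}

/* queue_notify_off * notify_off_multiplier + offset, computed in
 * 64 bits so the multiplication cannot wrap. */
static uint64_t queue_notify_offset(uint32_t cap_offset,
                                    uint32_t notify_off_multiplier,
                                    uint16_t queue_notify_off)
{
    return (uint64_t)queue_notify_off * notify_off_multiplier + cap_offset;
}
```

For example, with offset 0x3000 and a multiplier of 4, a queue whose queue_notify_off is 2 is notified at BAR offset 0x3008; with a multiplier of 0 every queue shares offset 0x3000.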
  • 2.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-27-2013 07:04
    "Michael S. Tsirkin" <mst@redhat.com> writes: > This is the new configuration layout. > > Notes: > - Everything is LE > - There's a feature bit that means spec 1.0 compliant. > - Both devices and drivers can either require the 1.0 interface > or try to include compatibility support. The spec isn't forcing > this decision. Hmm, this kind includes other changes already proposed, like the LE change and the framing change. I think this conceptually splits nicely: 1) Feature bit 32 proposal. 2) Endian change. 3) Framing change. 4) PCI layout change. > - I kept documentation of the legacy interface around, and added notes > on transition inline. They are in separate sections each clearly marked > "Legacy Interface" so we'll be able to separate them out > from the final document as necessary - for now I think it's easier > to keep it all together. Good thinking: most of us know the current spec so it's definitely clearer. And makes sure we're thinking about the transition. > Only virtio PCI has been converted. > Let's discuss this on the meeting tonight, once we figure out PCI > we can do something similar for MMIO and CCW. > @@ -137,6 +139,11 @@ Feature bits are allocated as follows: > 24 to 31: Feature bits reserved for extensions to the queue and > feature negotiation mechanisms > > + 32: Feature bit must be set for any device compliant with this > + revision of the specification, and acknowledged by all device drivers. > + > + 33 to 63: Feature bits reserved for future extensions > + > For example, feature bit 0 for a network device (i.e. Subsystem > Device ID 1) indicates that the device supports checksumming of > packets. Why stop at 63? If we go to a more decentralized feature-assignment model, we'll run through those very fast. > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are > indicated by offering a feature bit, so the guest can check > before accessing that part of the configuration space. 
> > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts > +-------------------------------------- > + > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but > +different interface between the hypervisor and the guest. > +Since these are widely deployed in the field, this specification > +accomodates optional features to simplify transition > +from these earlier draft interfaces. Specifically: > + > +Legacy Interface > + is an interface specified by an earlier draft of this specification > + (up to 0.9.X) > +Legacy Device > + is a device implemented before this specification was released, > + and implementing a legacy interface on the host side > +Legacy Driver > + is a driver implemented before this specification was released, > + and implementing a legacy interface on the guest side > + > +to simplify transition from these earlier draft interfaces, > +it is possible to implement > + > +Transitional Device > + a device supporting both drivers conforming to this > + specification, and legacy drivers > + > +Transitional Driver > + a driver supporting both devices conforming to this > + specification, and legacy devices > + > +Device and driver that require support for revision 1.0 or newer of > +the specification to function, are called non-transitional device and driver, > +respectively. > + > +Transitional Drivers can detect Legacy Devices by detecting that > +Feature bit 32 is not offered. > +Transitional devices can detect Legacy drivers by detecting that > +Feature bit 32 has not been acknowledged by driver. > + > +To make them easier to locate, specification sections documenting these > +transitional features all explicitly marked with > +'Legacy Interface' in the section title. > + > + > 2.1.3 Configuration Space > ------------------------- > > Configuration space is generally used for rarely-changing or > initialization-time parameters. 
> > -Note that this space is generally the guest's native endian, > +Note that configuration space generally uses the little-endian format > +for multi-byte fields. > + > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness > +-------------------------------------- > + > +Note that for legacy interfaces, configuration space is generally the guest's native endian, > rather than PCI's little-endian. > > 2.1.4 Virtqueues > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size > parameter, which sets the number of entries and implies the total size > of the queue. > > +Each virtqueue consists of three parts: > + > + Descriptor Table > + Available Ring > + Used Ring > + > +where each part is physically-contiguous in guest memory, > +and has different alignment requirements. > + > +The Queue Size field controls the total number of bytes > +required for each part of the virtqueue. > + > +The memory aligment and size requirements, in bytes, of each part of the > +virtqueue are summarized in the following table (qsz is the Queue Size field): > + > ++------------+---------------------------------+ > + Virtqueue Part Alignment Size > ++------------+---------------------------------+ > ++------------+---------------------------------+ > + Descriptor Table 16 16 * qsz > ++------------+---------------------------------+ > + Available Ring 2 6 + 2 * qsz > ++------------+---------------------------------+ > + Used Ring 4 6 + 4 * qsz > ++------------+---------------------------------+ > + > +When the driver wants to send a buffer to the device, it fills in > +a slot in the descriptor table (or chains several together), and > +writes the descriptor index into the available ring. It then > +notifies the device. When the device has finished a buffer, it > +writes the descriptor into the used ring, and sends an interrupt. 
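For readers following along, the alignment and size table quoted above can be expressed as a small C sketch. This is only an illustration under stated assumptions: the names vq_part, vq_rules, vq_part_size and vq_part_aligned are hypothetical and do not come from the patch or from any existing header, and the coefficients are copied from the quoted table as-is. Note that struct vring_used_elem in the legacy layout holds two 32-bit fields, so the per-entry figure for the Used Ring may deserve a second look before anyone relies on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the numbers are copied from the table quoted above. */
enum vq_part { VQ_DESC, VQ_AVAIL, VQ_USED };

struct vq_part_rule {
    size_t align;     /* required alignment, in bytes */
    size_t fixed;     /* fixed overhead, in bytes */
    size_t per_entry; /* bytes per queue entry */
};

static const struct vq_part_rule vq_rules[] = {
    [VQ_DESC]  = { 16, 0, 16 }, /* Descriptor Table: 16 * qsz */
    [VQ_AVAIL] = {  2, 6,  2 }, /* Available Ring: 6 + 2 * qsz */
    [VQ_USED]  = {  4, 6,  4 }, /* Used Ring: 6 + 4 * qsz, as quoted */
};

/* Size in bytes of one virtqueue part for a given Queue Size. */
size_t vq_part_size(enum vq_part part, uint16_t qsz)
{
    const struct vq_part_rule *r = &vq_rules[part];
    return r->fixed + r->per_entry * (size_t)qsz;
}

/* Returns 1 if addr meets the alignment requirement of the part. */
int vq_part_aligned(enum vq_part part, uint64_t addr)
{
    return (addr % vq_rules[part].align) == 0;
}
```

With a Queue Size of 256, for instance, the Descriptor Table comes to exactly 4096 bytes, and because the three parts no longer need to share one page-aligned block they can be allocated separately.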
> + > + > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > +-------------------------------------- > + > +For Legacy Interfaces, several additional > +restrictions are placed on the virtqueue layout: > + > Each virtqueue occupies two or more physically-contiguous pages > (usually defined as 4096 bytes, but depending on the transport) > and consists of three parts: > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > } > > -This currently wastes some space with padding, but also allows future > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The > -virtqueue layout structure looks like this: > +This wastes some space with padding. > +The legacy virtqueue layout structure therefore looks like this: > > struct vring { > // The actual descriptors (16 bytes each) > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > struct vring_used used; > }; > > -When the driver wants to send a buffer to the device, it fills in > -a slot in the descriptor table (or chains several together), and > -writes the descriptor index into the available ring. It then > -notifies the device. When the device has finished a buffer, it > -writes the descriptor into the used ring, and sends an interrupt. > - > -2.1.4.1 A Note on Virtqueue Endianness > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > -------------------------------------- > > Note that the endian of fields and in the virtqueue is the native > -endian of the guest, not little-endian as PCI normally is. This makes > -for simpler guest code, and it is assumed that the host already has to > -be deeply aware of the guest endian so such an “endian-aware” device > -is not a significant issue. > +endian of the guest, not little-endian as PCI normally is. > +It is assumed that the host is already aware of the guest endian. 
> > 2.1.4.2 Message Framing > ----------------------- > -The original intent of the specification was that message framing (the > -particular layout of descriptors) be independent of the contents of > +Generally, the intent of the specification is for message framing (the > +particular layout of descriptors) to be independent of the contents of > the buffers. For example, a network transmit buffer consists of a 12 > byte header followed by the network packet. This could be most simply > placed in the descriptor table as a 12 byte output descriptor followed > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and > packet are adjacent, or even three or more descriptors (possibly with > loss of efficiency in that case). > > -Regrettably, initial driver implementations used simple layouts, and > -devices came to rely on it, despite this specification wording[10]. It > -is thus recommended that drivers be conservative in their assumptions, > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some > +In addition, some > implementations may have large-but-reasonable restrictions on total > descriptor size (such as based on IOV_MAX in the host OS). This has > not been a problem in practice: little sympathy will be given to > drivers which create unreasonably-sized descriptors such as by > dividing a network packet into 1500 single-byte descriptors! > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing > +----------------------- > +Regrettably, initial driver implementations used simple layouts, and > +devices came to rely on it, despite this specification wording[10]. It > +is thus recommended that when using legacy interfaces, > +drivers should be conservative in their assumptions, > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. > + > 2.1.4.3 The Virtqueue Descriptor Table > -------------------------------------- > > @@ -386,23 +478,27 @@ how to communicate with the specific device. 
> 2.2.1 Device Initialization > --------------------------- > > -1. Reset the device. This is not required on initial start up. > +1. Device discovery. This is only required for some transports. > + > +2. Reset the device. This is not required on initial start up. > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device. > +3. Device layout detection. This is only required for some transports. > > -3. The DRIVER status bit is set: we know how to drive the device. > +4. The ACKNOWLEDGE status bit is set: we have noticed the device. > > -4. Device-specific setup, including reading the device feature > +5. The DRIVER status bit is set: we know how to drive the device. > + > +6. Device-specific setup, including reading the device feature > bits, discovery of virtqueues for the device, optional per-bus > setup, and reading and possibly writing the device's virtio > configuration space. > > -5. The subset of device feature bits understood by the driver is > +7. The subset of device feature bits understood by the driver is > written to the device. > > -6. The DRIVER_OK status bit is set. > +8. The DRIVER_OK status bit is set. > > -7. The device can now be used (ie. buffers added to the > +9. The device can now be used (ie. buffers added to the > virtqueues)[4] > > If any of these steps go irrecoverably wrong, the guest should > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > 0x103F inclusive is a virtio device[3]. The device must also have a > -Revision ID of 0 to match this specification. > +Revision ID of 0 or Revision ID of 1 to match this specification. > > The Subsystem Device ID indicates which virtio device is > supported by the device. The Subsystem Vendor ID should reflect > the PCI Vendor ID of the environment (it's currently only used > for informational purposes by the guest). 
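As an aid to reading the discovery rule quoted above, here is a minimal C sketch of the ID check. It assumes only what the quoted text states (Vendor ID 0x1AF4, Device ID 0x1000 through 0x103F, Revision ID 0 or 1); the macro and function names are hypothetical, and the Revision ID rule reflects the wording proposed in this RFC only.

```c
#include <stdint.h>

/* Hypothetical names; the values come from the quoted text. */
#define VIRTIO_PCI_VENDOR_ID     0x1AF4
#define VIRTIO_PCI_DEVICE_ID_MIN 0x1000
#define VIRTIO_PCI_DEVICE_ID_MAX 0x103F

/* Returns 1 if the PCI IDs identify a virtio device under the quoted rule. */
int virtio_pci_id_matches(uint16_t vendor, uint16_t device, uint8_t revision)
{
    if (vendor != VIRTIO_PCI_VENDOR_ID)
        return 0;
    if (device < VIRTIO_PCI_DEVICE_ID_MIN || device > VIRTIO_PCI_DEVICE_ID_MAX)
        return 0;
    /* Revision ID must be 0 or 1 as proposed in this RFC. */
    return revision == 0 || revision == 1;
}
```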
    >
    > +Drivers must not match devices where Revision ID does not match 0 or 1.
    > +
    > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
    > +----------------------------
    > +Transitional devices must have a Revision ID of 0.
    > +
    > +Non-transitional devices must have a Revision ID of 1.
    > +
    > +Transitional drivers must match a Revision ID of 0 or 1.
    > +
    > +Non-transitional drivers must only match a Revision ID of 1.
    > +

    I think we should stop abusing Revision IDs, and start using them to
    reflect device version changes as intended.

    We could reserve revision id 0 for legacy devices, however, which should
    work nicely.

    > 2.4.1.2 PCI Device Layout
    > -------------------------
    >
    > -To configure the device, we use the first I/O region of the PCI
    > -device. This contains a virtio header followed by a
    > -device-specific region.
    > +To configure the device,
    > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
    > +These contain the virtio header registers, the notification register, the
    > +ISR status register and device specific registers, as specified by Virtio
    > ++ Structure PCI Capabilities
    > +
    > +There may be different widths of accesses to the I/O region; the
    > +“natural” access method for each field must be
    > +used (i.e. 32-bit accesses for 32-bit fields, etc).
    > +
    > +PCI Device Configuration Layout includes the common configuration,
    > +ISR, notification and device specific configuration
    > +structures.
    > +
    > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
    > +
    > +
    > +2.4.1.2.1 Common configuration structure layout
    > +-------------------------
    > +Common configuration structure layout is documented below:
    > +
    > +struct virtio_pci_common_cfg {
    > +        /* About the whole device.
*/ > + __le32 device_feature_select; /* read-write */ > + __le32 device_feature; /* read-only */ > + __le32 guest_feature_select; /* read-write */ > + __le32 guest_feature; /* read-write */ > + __le16 msix_config; /* read-write */ > + __le16 num_queues; /* read-only */ > + __u8 device_status; /* read-write */ > + __u8 unused1; > + > + /* About a specific virtqueue. */ > + __le16 queue_select; /* read-write */ > + __le16 queue_size; /* read-write, power of 2, or 0. */ > + __le16 queue_msix_vector; /* read-write */ > + __le16 queue_enable; /* read-write */ > + __le16 queue_notify_off; /* read-only */ > + __le64 queue_desc; /* read-write */ > + __le64 queue_avail; /* read-write */ > + __le64 queue_used; /* read-write */ > +}; > + > +device_feature_select > + > + Selects which Feature Bits does device_feature field refer to. > + Value 0x0 selects Feature Bits 0 to 31 > + Value 0x1 selects Feature Bits 32 to 63 > + All other values cause reads from device_feature to return 0. > + > +device_feature > + > + Used by Device to report Feature Bits to Driver. > + Device Feature Bits selected by device_feature_select. > + > +guest_feature_select > + > + Selects which Feature Bits does guest_feature field refer to. > + Value 0x0 selects Feature Bits 0 to 31 > + Value 0x1 selects Feature Bits 32 to 63 > + All other values cause writes to guest_feature to be ignored, > + and reads to return 0. > + > +guest_feature > + > + Used by Driver to acknowledge Feature Bits to Device. > + Guest Feature Bits selected by guest_feature_select. > + > +msix_config > + > + Configuration Vector for MSI-X. > + > +num_queues > + > + Specifies the maximum number of virtqueues supported by device. > + > +device_status > + > + Device Status field. > + > +queue_select > + > + Queue Select. Selects which virtqueue do other fields refer to. > + > +queue_size > + > + Queue Size. On reset, specifies the maximum queue size supported by > + the hypervisor. 
This can be modified by driver to reduce memory requirements.
    > +        Set to 0 if this virtqueue is unused.
    > +
    > +queue_msix_vector
    > +
    > +        Queue Vector for MSI-X.
    > +
    > +queue_enable
    > +
    > +        Used to selectively prevent host from executing requests from this virtqueue.
    > +        1 - enabled; 0 - disabled
    > +
    > +queue_notify_off
    > +
    > +        Used to calculate the offset from start of Notification structure at
    > +        which this virtqueue is located.
    > +        Note: this is *not* an offset in bytes. See notify_off_multiplier below.
    > +
    > +queue_desc
    > +
    > +        Physical address of Descriptor Table.
    > +
    > +queue_avail
    > +
    > +        Physical address of Available Ring.
    > +
    > +queue_used
    > +
    > +        Physical address of Used Ring.
    > +
    > +
    > +2.4.1.2.2 ISR status structure layout
    > +-------------------------
    > +ISR status structure includes a single 8-bite ISR status field 8-bit
    > +
    > +2.4.1.2.3 Notification structure layout
    > +-------------------------
    > +Notification structure is always a multiple of 2 bytes in size.
    > +It includes 2-byte Queue Notify fields for each virtqueue of
    > +the device. Note that multiple virtqueues can use the same
    > +Queue Notify field, if necessary.

    Hmm, maybe move this down, so you can have a section which starts with
    "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all
    together.

    > +
    > +2.4.1.2.4 Device specific structure
    > +-------------------------
    > +
    > +Device specific structure is optional.
    > +
    > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
    > +-------------------------
    > +
    > +Transitional devices should present part of configuration
    > +registers in a legacy configuration structure in BAR0 in the first I/O
    > +region of the PCI device, as documented below.
    >
    > There may be different widths of accesses to the I/O region; the
    > “natural” access method for each field in the virtio header must be
    > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
    > +used (i.e.
32-bit accesses for 32-bit fields, etc), but > +When accessed through the legacy interface the > device-specific region can be accessed using any width accesses, and > should obtain the same results. > > Note that this is possible because while the virtio header is PCI > -(i.e. little) endian, the device-specific region is encoded in > -the native endian of the guest (where such distinction is > +(i.e. little) endian, when using the legacy interface the device-specific > +region is encoded in the native endian of the guest (where such distinction is > applicable). > > -2.4.1.2.1 PCI Device Virtio Header > ----------------------------------- > > -The virtio header looks as follows: > +When used through the legacy interface, the virtio header looks as follows: > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > Bits 32 32 32 16 16 16 8 8 > @@ -661,7 +905,6 @@ The virtio header looks as follows: > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > - > If MSI-X is enabled for the device, two additional fields > immediately follow this header:[5] > > @@ -689,25 +932,154 @@ device-specific headers: > > +------------++--------------------+ > > +Note that only Feature Bits 0 to 31 are accessible through the > +Legacy Interface. When used through the Legacy Interface, > +Transitional Devices must assume that Feature Bits 32 to 63 > +are not acknowledged by Driver. > + > + > 2.4.1.3 PCI-specific Initialization And Device Operation > -------------------------------------------------------- > > -The page size for a virtqueue on a PCI virtio device is defined as > -4096 bytes. 
> - > 2.4.1.3.1 Device Initialization > ------------------------------- > > -2.4.1.3.1.1 Queue Vector Configuration > +This documents PCI-specific steps executed during Device Initialization. > +As the first step, driver must detect device configuration layout > +to locate configuration fields in memory,I/O or configuration space of the > +device. > + > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > +------------------------------- > + > +As a prerequisite to device initialization, driver executes a > +PCI capability list scan, detecting virtio configuration layout using Virtio > +Structure PCI capabilities. > + > +Virtio Device Configuration Layout includes virtio configuration header, Notification > +and ISR Status and device configuration structures. > +Each structure can be mapped by a Base Address register (BAR) belonging to > +the function, located beginning at 10h in Configuration Space, > +or accessed though PCI configuration space. > + > +Actual location of each structure is specified using vendor-specific PCI capability located > +on capability list in PCI configuration space of the device. > +This virtio structure capability uses little-endian format; all bits are > +read-only: > + > +struct virtio_pci_cap { > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > + __u8 cap_next; /* Generic PCI field: next ptr. */ > + __u8 cap_len; /* Generic PCI field: capability length */ > + __u8 cfg_type; /* Identifies the structure. */ > + __u8 bar; /* Where to find it. */ > + __u8 padding[3];/* Pad to full dword. */ > + __le32 offset; /* Offset within bar. */ > + __le32 length; /* Length of the structure, in bytes. */ > +}; > + > +This structure can optionally followed by extra data, depending on > +other fields, as documented below. > + > +The fields are interpreted as follows: > + > +cap_vndr > + 0x09; Identifies a vendor-specific capability. > + > +cap_next > + Link to next capability in the capability list in the configuration space. 
    > +
    > +cap_len
    > +        Length of the capability structure, including the whole of
    > +        struct virtio_pci_cap, and extra data if any.
    > +        This length might include padding, or fields unused by the driver.
    > +
    > +cfg_type
    > +        identifies the structure, according to the following table.
    > +
    > +        /* Common configuration */
    > +        #define VIRTIO_PCI_CAP_COMMON_CFG 1
    > +        /* Notifications */
    > +        #define VIRTIO_PCI_CAP_NOTIFY_CFG 2
    > +        /* ISR Status */
    > +        #define VIRTIO_PCI_CAP_ISR_CFG 3
    > +        /* Device specific configuration */
    > +        #define VIRTIO_PCI_CAP_DEVICE_CFG 4
    > +
    > +        More than one capability can identify the same structure - this makes it
    > +        possible for the device to expose multiple interfaces to drivers. The order of
    > +        the capabilities in the capability list specifies the order of preference
    > +        suggested by the device; drivers should use the first interface that they can
    > +        support. For example, on some hypervisors, notifications using IO accesses are
    > +        faster than memory accesses. In this case, hypervisor can expose two
    > +        capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
    > +        the first one addressing an I/O BAR, the second one addressing a memory BAR.
    > +        Driver will use the I/O BAR if I/O resources are available, and fall back on
    > +        memory BAR when I/O resources are unavailable.
    > +
    > +bar
    > +
    > +        values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
    > +        the function located beginning at 10h in Configuration Space
    > +        and used to map the structure into Memory or I/O Space.
    > +        The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
    > +        or I/O Space.
    > +
    > +        The value 0xF specifies that the structure is in PCI configuration space
    > +        inline with this capability structure, following (not necessarily immediately)
    > +        the length field.

    Why not immediately? Or how would the driver know where it is?

    > +
    > +offset
    > +        indicates where the structure begins relative to the base address associated
    > +        with the BAR.
If bar specifies configuration space, offset is relative > + to start of virtio_pci_cap structure. > + > +length > + indicates the length of the structure. > + This size might include padding, or fields unused by the driver. > + Drivers are also recommended to only map part of configuration structure > + large enough for device operation. > + For example, a future device might present a large structure size of several > + MBytes. > + As current devices never utilize structures larger than 4KBytes in size, > + driver can limit the mapped structure size to e.g. > + 4KBytes to allow forward compatibility with such devices without loss of > + functionality and without wasting resources. > + > + > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed > +by additional fields: > + > +struct virtio_pci_notify_cap { > + struct virtio_pci_cap cap; > + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ > +}; > + > +notify_off_multiplier > + > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0. > + Value 0x1 is reserved. > + For a given virtqueue, the address to use for notifications is calculated as follows: > + > + queue_notify_off * notify_off_multiplier + offset > + > + If notify_off_multiplier is 0, all virtqueues use the same address in > + the Notifications structure! > + > + > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection > +------------------------------- > + > +Legacy drivers skipped Device Layout Detection step, assuming legacy > +configuration space in BAR0 in I/O space unconditionally. > + > +2.4.1.3.1.3 Queue Vector Configuration > -------------------------------------- > > When MSI-X capability is present and enabled in the device > -(through standard PCI configuration space) 4 bytes at byte offset > -20 are used to map configuration change and queue interrupts to > -MSI-X vectors. 
In this case, the ISR Status field is unused, and > -device specific configuration starts at byte offset 24 in virtio > -header structure. When MSI-X capability is not enabled, device > -specific configuration starts at byte offset 20 in virtio header. > +(through standard PCI configuration space) Configuration/Queue > +MSI-X Vector registers are used to map configuration change and queue > +interrupts to MSI-X vectors. In this case, the ISR Status is unused. > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of > Configuration/Queue Vector registers, maps interrupts triggered > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on > failure, NO_VECTOR is returned. If a mapping failure is detected, > the driver can retry mapping with fewervectors, or disable MSI-X. > > -2.4.1.3.1.2 Virtqueue Configuration > +2.4.1.3.1.4 Virtqueue Configuration > ----------------------------------- > > As a device can have zero or more virtqueues for bulk data > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has: > always a power of 2. This controls how big the virtqueue is > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. > > -3. Allocate and zero virtqueue in contiguous physical memory, on > - a 4096 byte alignment. Write the physical address, divided by > - 4096 to the Queue Address field.[6] > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size > + field. > + > +3. Allocate and zero Descriptor Table, Available and Used rings for the > + virtqueue in contiguous physical memory. > > 4. Optionally, if MSI-X capability is present and enabled on the > device, select a vector to use to request interrupts triggered > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has: > Queue Vector field: on success, previously written value is > returned; on failure, NO_VECTOR value is returned. 
    >
    > +
    > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration
    > +-----------------------------------
    > +When using the legacy interface, the page size for a virtqueue on a PCI virtio
    > +device is defined as 4096 bytes. Driver writes the physical address, divided
    > +by 4096 to the Queue Address field [6].
    > +
    > 2.4.1.3.2 Notifying The Device
    > ------------------------------
    >
    > Device notification occurs by writing the 16-bit virtqueue index
    > -of this virtqueue to the Queue Notify field of the virtio header
    > -in the first I/O region of the PCI device.
    > +of this virtqueue to the Queue Notify field.
    >
    > 2.4.1.3.3 Receiving Used Buffers From The Device
    > +------------------------------
    >
    > If an interrupt is necessary:
    >
    > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
    > This is only allowed if the driver does not use any features
    > which would alter this early use of the device.
    >
    > -[5] ie. once you enable MSI-X on the device, the other fields move.
    > +[5] When MSI-X capability is enabled, device specific configuration starts at
    > +byte offset 24 in virtio header structure. When MSI-X capability is not
    > +enabled, device specific configuration starts at byte offset 20 in virtio
    > +header. ie. once you enable MSI-X on the device, the other fields move.
    > If you turn it off again, they move back!

    Thanks,
    Rusty.
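For readers of this thread, the notification address rule quoted in the patch (queue_notify_off * notify_off_multiplier + offset, with the multiplier being 0 or an even power of two and the value 1 reserved) can be sketched in C as below. This is an illustrative sketch, not code from the patch: the function names are hypothetical, and the result is assumed to be an offset relative to the start of the BAR named by the capability.

```c
#include <stdint.h>

/*
 * Returns 1 if the multiplier satisfies the rule quoted in the patch:
 * either 0, or an even power of two (so 1 is excluded as reserved).
 */
int notify_multiplier_valid(uint32_t mult)
{
    if (mult == 0)
        return 1;
    if (mult == 1)
        return 0;
    return (mult & (mult - 1)) == 0;
}

/*
 * Offset of a virtqueue's Queue Notify field within the BAR, following
 * the quoted formula. cap_offset is the 'offset' field of the capability.
 * The arithmetic is done in 64 bits to avoid wrapping on large inputs.
 */
uint64_t queue_notify_addr(uint32_t cap_offset, uint16_t queue_notify_off,
                           uint32_t mult)
{
    return (uint64_t)queue_notify_off * mult + cap_offset;
}
```

With a multiplier of 0 every virtqueue resolves to the same offset, which matches the case the patch calls out where all virtqueues share one address in the Notification structure.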


  • 3.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-27-2013 07:37
    On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > This is the new configuration layout.
    > >
    > > Notes:
    > > - Everything is LE
    > > - There's a feature bit that means spec 1.0 compliant.
    > > - Both devices and drivers can either require the 1.0 interface
    > > or try to include compatibility support. The spec isn't forcing
    > > this decision.
    >
    > Hmm, this kind includes other changes already proposed, like the LE
    > change and the framing change. I think this conceptually splits nicely:
    >
    > 1) Feature bit 32 proposal.
    > 2) Endian change.
    > 3) Framing change.
    > 4) PCI layout change.

    Right - they are mostly in different parts of the document.
    I put it all together so it's easy to see how we intend to handle the
    transition.

    So is everyone OK with keeping this in a single patch?

    > > - I kept documentation of the legacy interface around, and added notes
    > > on transition inline. They are in separate sections each clearly marked
    > > "Legacy Interface" so we'll be able to separate them out
    > > from the final document as necessary - for now I think it's easier
    > > to keep it all together.
    >
    > Good thinking: most of us know the current spec so it's definitely
    > clearer. And makes sure we're thinking about the transition.
    >
    > > Only virtio PCI has been converted.
    > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > we can do something similar for MMIO and CCW.
    >
    > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > feature negotiation mechanisms
    > >
    > > + 32: Feature bit must be set for any device compliant with this
    > > + revision of the specification, and acknowledged by all device drivers.
    > > +
    > > + 33 to 63: Feature bits reserved for future extensions
    > > +
    > > For example, feature bit 0 for a network device (i.e. Subsystem
    > > Device ID 1) indicates that the device supports checksumming of
    > > packets.
    >
    > Why stop at 63? If we go to a more decentralized feature-assignment
    > model, we'll run through those very fast.

    Then we'll just document more, but driver needs to know
    where to stop looking for features.

    >
    > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
    > > indicated by offering a feature bit, so the guest can check
    > > before accessing that part of the configuration space.
    > >
    > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
    > > +--------------------------------------
    > > +
    > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
    > > +different interface between the hypervisor and the guest.
    > > +Since these are widely deployed in the field, this specification
    > > +accomodates optional features to simplify transition
    > > +from these earlier draft interfaces. Specifically:
    > > +
    > > +Legacy Interface
    > > +        is an interface specified by an earlier draft of this specification
    > > +        (up to 0.9.X)
    > > +Legacy Device
    > > +        is a device implemented before this specification was released,
    > > +        and implementing a legacy interface on the host side
    > > +Legacy Driver
    > > +        is a driver implemented before this specification was released,
    > > +        and implementing a legacy interface on the guest side
    > > +
    > > +to simplify transition from these earlier draft interfaces,
    > > +it is possible to implement
    > > +
    > > +Transitional Device
    > > +        a device supporting both drivers conforming to this
    > > +        specification, and legacy drivers
    > > +
    > > +Transitional Driver
    > > +        a driver supporting both devices conforming to this
    > > +        specification, and legacy devices
    > > +
    > > +Device and driver that require support for revision 1.0 or newer of
    > > +the specification to function, are called non-transitional device and driver,
    > > +respectively.
    > > +
    > > +Transitional Drivers can detect Legacy Devices by detecting that
    > > +Feature bit 32 is not offered.
> > +Transitional devices can detect Legacy drivers by detecting that > > +Feature bit 32 has not been acknowledged by driver. > > + > > +To make them easier to locate, specification sections documenting these > > +transitional features all explicitly marked with > > +'Legacy Interface' in the section title. > > + > > + > > 2.1.3 Configuration Space > > ------------------------- > > > > Configuration space is generally used for rarely-changing or > > initialization-time parameters. > > > > -Note that this space is generally the guest's native endian, > > +Note that configuration space generally uses the little-endian format > > +for multi-byte fields. > > + > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness > > +-------------------------------------- > > + > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, > > rather than PCI's little-endian. > > > > 2.1.4 Virtqueues > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size > > parameter, which sets the number of entries and implies the total size > > of the queue. > > > > +Each virtqueue consists of three parts: > > + > > + Descriptor Table > > + Available Ring > > + Used Ring > > + > > +where each part is physically-contiguous in guest memory, > > +and has different alignment requirements. > > + > > +The Queue Size field controls the total number of bytes > > +required for each part of the virtqueue. 
> > + > > +The memory aligment and size requirements, in bytes, of each part of the > > +virtqueue are summarized in the following table (qsz is the Queue Size field): > > + > > ++------------+---------------------------------+ > > + Virtqueue Part Alignment Size > > ++------------+---------------------------------+ > > ++------------+---------------------------------+ > > + Descriptor Table 16 16 * qsz > > ++------------+---------------------------------+ > > + Available Ring 2 6 + 2 * qsz > > ++------------+---------------------------------+ > > + Used Ring 4 6 + 4 * qsz > > ++------------+---------------------------------+ > > + > > +When the driver wants to send a buffer to the device, it fills in > > +a slot in the descriptor table (or chains several together), and > > +writes the descriptor index into the available ring. It then > > +notifies the device. When the device has finished a buffer, it > > +writes the descriptor into the used ring, and sends an interrupt. > > + > > + > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > > +-------------------------------------- > > + > > +For Legacy Interfaces, several additional > > +restrictions are placed on the virtqueue layout: > > + > > Each virtqueue occupies two or more physically-contiguous pages > > (usually defined as 4096 bytes, but depending on the transport) > > and consists of three parts: > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > > } > > > > -This currently wastes some space with padding, but also allows future > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The > > -virtqueue layout structure looks like this: > > +This wastes some space with padding. 
> > +The legacy virtqueue layout structure therefore looks like this: > > > > struct vring { > > // The actual descriptors (16 bytes each) > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > > struct vring_used used; > > }; > > > > -When the driver wants to send a buffer to the device, it fills in > > -a slot in the descriptor table (or chains several together), and > > -writes the descriptor index into the available ring. It then > > -notifies the device. When the device has finished a buffer, it > > -writes the descriptor into the used ring, and sends an interrupt. > > - > > -2.1.4.1 A Note on Virtqueue Endianness > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > > -------------------------------------- > > > > Note that the endian of fields and in the virtqueue is the native > > -endian of the guest, not little-endian as PCI normally is. This makes > > -for simpler guest code, and it is assumed that the host already has to > > -be deeply aware of the guest endian so such an “endian-aware” device > > -is not a significant issue. > > +endian of the guest, not little-endian as PCI normally is. > > +It is assumed that the host is already aware of the guest endian. > > > > 2.1.4.2 Message Framing > > ----------------------- > > -The original intent of the specification was that message framing (the > > -particular layout of descriptors) be independent of the contents of > > +Generally, the intent of the specification is for message framing (the > > +particular layout of descriptors) to be independent of the contents of > > the buffers. For example, a network transmit buffer consists of a 12 > > byte header followed by the network packet. 
This could be most simply > > placed in the descriptor table as a 12 byte output descriptor followed > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and > > packet are adjacent, or even three or more descriptors (possibly with > > loss of efficiency in that case). > > > > -Regrettably, initial driver implementations used simple layouts, and > > -devices came to rely on it, despite this specification wording[10]. It > > -is thus recommended that drivers be conservative in their assumptions, > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some > > +In addition, some > > implementations may have large-but-reasonable restrictions on total > > descriptor size (such as based on IOV_MAX in the host OS). This has > > not been a problem in practice: little sympathy will be given to > > drivers which create unreasonably-sized descriptors such as by > > dividing a network packet into 1500 single-byte descriptors! > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing > > +----------------------- > > +Regrettably, initial driver implementations used simple layouts, and > > +devices came to rely on it, despite this specification wording[10]. It > > +is thus recommended that when using legacy interfaces, > > +drivers should be conservative in their assumptions, > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. > > + > > 2.1.4.3 The Virtqueue Descriptor Table > > -------------------------------------- > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device. > > 2.2.1 Device Initialization > > --------------------------- > > > > -1. Reset the device. This is not required on initial start up. > > +1. Device discovery. This is only required for some transports. > > + > > +2. Reset the device. This is not required on initial start up. > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device. > > +3. Device layout detection. This is only required for some transports. 
> > > > -3. The DRIVER status bit is set: we know how to drive the device. > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > -4. Device-specific setup, including reading the device feature > > +5. The DRIVER status bit is set: we know how to drive the device. > > + > > +6. Device-specific setup, including reading the device feature > > bits, discovery of virtqueues for the device, optional per-bus > > setup, and reading and possibly writing the device's virtio > > configuration space. > > > > -5. The subset of device feature bits understood by the driver is > > +7. The subset of device feature bits understood by the driver is > > written to the device. > > > > -6. The DRIVER_OK status bit is set. > > +8. The DRIVER_OK status bit is set. > > > > -7. The device can now be used (ie. buffers added to the > > +9. The device can now be used (ie. buffers added to the > > virtqueues)[4] > > > > If any of these steps go irrecoverably wrong, the guest should > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > > 0x103F inclusive is a virtio device[3]. The device must also have a > > -Revision ID of 0 to match this specification. > > +Revision ID of 0 or Revision ID of 1 to match this specification. > > > > The Subsystem Device ID indicates which virtio device is > > supported by the device. The Subsystem Vendor ID should reflect > > the PCI Vendor ID of the environment (it's currently only used > > for informational purposes by the guest). > > > > +Drivers must not match devices where Revision ID does not match 0 or 1. > > + > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery > > +---------------------------- > > +Transitional devices must have a Revision ID of 0. > > + > > +Non-transitional devices must have a Revision ID of 1. > > + > > +Transitional drivers must match a Revision ID of 0 or 1. 
    > > +
    > > +Non-transitional drivers must only match a Revision ID of 1.
    > > +
    >
    > I think we should stop abusing Revision IDs, and start using them
    > to reflect device version changes as intended.
    >
    > We could reserve revision id 0 for legacy devices, however, which should
    > work nicely.

    Hmm I am not sure I agree - what does it buy us that feature bits
    don't already supply?

    More concerns:

    We are using revision ID now exactly as was intended to disable old
    drivers - it served us well for 0.X-1.X and would be as useful if we
    ever have 1.X->2.0 transition.

    Another worry with using revision numbering for features is that it
    does not play well with downstreams. E.g. RHEL might want to
    cherry-pick a feature without implementing other features that
    happened to land in the same revision.

    Also Revision ID is only 8 bit - it's designed for hardware where
    making a new revision is expensive. In software we'll run out of
    that eventually.

    >
    > > 2.4.1.2 PCI Device Layout
    > > -------------------------
    > >
    > > -To configure the device, we use the first I/O region of the PCI
    > > -device. This contains a virtio header followed by a
    > > -device-specific region.
    > > +To configure the device,
    > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
    > > +These contain the virtio header registers, the notification register, the
    > > +ISR status register and device specific registers, as specified by Virtio
    > > ++ Structure PCI Capabilities
    > > +
    > > +There may be different widths of accesses to the I/O region; the
    > > +“natural” access method for each field must be
    > > +used (i.e. 32-bit accesses for 32-bit fields, etc).
    > > +
    > > +PCI Device Configuration Layout includes the common configuration,
    > > +ISR, notification and device specific configuration
    > > +structures.
    > > +
    > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
> > + > > + > > +2.4.1.2.1 Common configuration structure layout > > +------------------------- > > +Common configuration structure layout is documented below: > > + > > +struct virtio_pci_common_cfg { > > + /* About the whole device. */ > > + __le32 device_feature_select; /* read-write */ > > + __le32 device_feature; /* read-only */ > > + __le32 guest_feature_select; /* read-write */ > > + __le32 guest_feature; /* read-write */ > > + __le16 msix_config; /* read-write */ > > + __le16 num_queues; /* read-only */ > > + __u8 device_status; /* read-write */ > > + __u8 unused1; > > + > > + /* About a specific virtqueue. */ > > + __le16 queue_select; /* read-write */ > > + __le16 queue_size; /* read-write, power of 2, or 0. */ > > + __le16 queue_msix_vector; /* read-write */ > > + __le16 queue_enable; /* read-write */ > > + __le16 queue_notify_off; /* read-only */ > > + __le64 queue_desc; /* read-write */ > > + __le64 queue_avail; /* read-write */ > > + __le64 queue_used; /* read-write */ > > +}; > > + > > +device_feature_select > > + > > + Selects which Feature Bits does device_feature field refer to. > > + Value 0x0 selects Feature Bits 0 to 31 > > + Value 0x1 selects Feature Bits 32 to 63 > > + All other values cause reads from device_feature to return 0. > > + > > +device_feature > > + > > + Used by Device to report Feature Bits to Driver. > > + Device Feature Bits selected by device_feature_select. > > + > > +guest_feature_select > > + > > + Selects which Feature Bits does guest_feature field refer to. > > + Value 0x0 selects Feature Bits 0 to 31 > > + Value 0x1 selects Feature Bits 32 to 63 > > + All other values cause writes to guest_feature to be ignored, > > + and reads to return 0. > > + > > +guest_feature > > + > > + Used by Driver to acknowledge Feature Bits to Device. > > + Guest Feature Bits selected by guest_feature_select. > > + > > +msix_config > > + > > + Configuration Vector for MSI-X. 
> > + > > +num_queues > > + > > + Specifies the maximum number of virtqueues supported by device. > > + > > +device_status > > + > > + Device Status field. > > + > > +queue_select > > + > > + Queue Select. Selects which virtqueue do other fields refer to. > > + > > +queue_size > > + > > + Queue Size. On reset, specifies the maximum queue size supported by > > + the hypervisor. This can be modified by driver to reduce memory requirements. > > + Set to 0 if this virtqueue is unused. > > + > > +queue_msix_vector > > + > > + Queue Vector for MSI-X. > > + > > +queue_enable > > + > > + Used to selectively prevent host from executing requests from this virtqueue. > > + 1 - enabled; 0 - disabled > > + > > +queue_notify_off > > + > > + Used to calculate the offset from start of Notification structure at > > + which this virtqueue is located. > > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > > + > > +queue_desc > > + > > + Physical address of Descriptor Table. > > + > > +queue_avail > > + > > + Physical address of Available Ring. > > + > > +queue_used > > + > > + Physical address of Used Ring. > > + > > + > > +2.4.1.2.2 ISR status structure layout > > +------------------------- > > +ISR status structure includes a single 8-bite ISR status field > > 8-bit Right :) > > + > > +2.4.1.2.3 Notification structure layout > > +------------------------- > > +Notification structure is always a multiple of 2 bytes in size. > > +It includes 2-byte Queue Notify fields for each virtqueue of > > +the device. Note that multiple virtqueues can use the same > > +Queue Notify field, if necessary. > > Hmm, maybe move this down, so you can have a section which starts with > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all > together. so Move PCI Device Layout to within PCI-specific Initialization And Device Operation? > > + > > +2.4.1.2.4 Device specific structure > > +------------------------- > > + > > +Device specific structure is optional. 
> > + > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout > > +------------------------- > > + > > +Transitional devices should present part of configuration > > +registers in a legacy configuration structure in BAR0 in the first I/O > > +region of the PCI device, as documented below. > > > > There may be different widths of accesses to the I/O region; the > > “natural” access method for each field in the virtio header must be > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but > > +When accessed through the legacy interface the > > device-specific region can be accessed using any width accesses, and > > should obtain the same results. > > > > Note that this is possible because while the virtio header is PCI > > -(i.e. little) endian, the device-specific region is encoded in > > -the native endian of the guest (where such distinction is > > +(i.e. little) endian, when using the legacy interface the device-specific > > +region is encoded in the native endian of the guest (where such distinction is > > applicable). 
> > > > -2.4.1.2.1 PCI Device Virtio Header > > ----------------------------------- > > > > -The virtio header looks as follows: > > +When used through the legacy interface, the virtio header looks as follows: > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > Bits 32 32 32 16 16 16 8 8 > > @@ -661,7 +905,6 @@ The virtio header looks as follows: > > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > - > > If MSI-X is enabled for the device, two additional fields > > immediately follow this header:[5] > > > > @@ -689,25 +932,154 @@ device-specific headers: > > > > +------------++--------------------+ > > > > +Note that only Feature Bits 0 to 31 are accessible through the > > +Legacy Interface. When used through the Legacy Interface, > > +Transitional Devices must assume that Feature Bits 32 to 63 > > +are not acknowledged by Driver. > > + > > + > > 2.4.1.3 PCI-specific Initialization And Device Operation > > -------------------------------------------------------- > > > > -The page size for a virtqueue on a PCI virtio device is defined as > > -4096 bytes. > > - > > 2.4.1.3.1 Device Initialization > > ------------------------------- > > > > -2.4.1.3.1.1 Queue Vector Configuration > > +This documents PCI-specific steps executed during Device Initialization. > > +As the first step, driver must detect device configuration layout > > +to locate configuration fields in memory,I/O or configuration space of the > > +device. > > + > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > > +------------------------------- > > + > > +As a prerequisite to device initialization, driver executes a > > +PCI capability list scan, detecting virtio configuration layout using Virtio > > +Structure PCI capabilities. 
> > + > > +Virtio Device Configuration Layout includes virtio configuration header, Notification > > +and ISR Status and device configuration structures. > > +Each structure can be mapped by a Base Address register (BAR) belonging to > > +the function, located beginning at 10h in Configuration Space, > > +or accessed though PCI configuration space. > > + > > +Actual location of each structure is specified using vendor-specific PCI capability located > > +on capability list in PCI configuration space of the device. > > +This virtio structure capability uses little-endian format; all bits are > > +read-only: > > + > > +struct virtio_pci_cap { > > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > + __u8 cap_next; /* Generic PCI field: next ptr. */ > > + __u8 cap_len; /* Generic PCI field: capability length */ > > + __u8 cfg_type; /* Identifies the structure. */ > > + __u8 bar; /* Where to find it. */ > > + __u8 padding[3];/* Pad to full dword. */ > > + __le32 offset; /* Offset within bar. */ > > + __le32 length; /* Length of the structure, in bytes. */ > > +}; > > + > > +This structure can optionally followed by extra data, depending on > > +other fields, as documented below. > > + > > +The fields are interpreted as follows: > > + > > +cap_vndr > > + 0x09; Identifies a vendor-specific capability. > > + > > +cap_next > > + Link to next capability in the capability list in the configuration space. > > + > > +cap_len > > + Length of the capability structure, including the whole of > > + struct virtio_pci_cap, and extra data if any. > > + This length might include padding, or fields unused by the driver. > > + > > +cfg_type > > + identifies the structure, according to the following table. 
    > > +
    > > + /* Common configuration */
    > > + #define VIRTIO_PCI_CAP_COMMON_CFG 1
    > > + /* Notifications */
    > > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2
    > > + /* ISR Status */
    > > + #define VIRTIO_PCI_CAP_ISR_CFG 3
    > > + /* Device specific configuration */
    > > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4
    > > +
    > > + More than one capability can identify the same structure - this makes it
    > > + possible for the device to expose multiple interfaces to drivers. The order of
    > > + the capabilities in the capability list specifies the order of preference
    > > + suggested by the device; drivers should use the first interface that they can
    > > + support. For example, on some hypervisors, notifications using IO accesses are
    > > + faster than memory accesses. In this case, hypervisor can expose two
    > > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
    > > + the first one addressing an I/O BAR, the second one addressing a memory BAR.
    > > + Driver will use the I/O BAR if I/O resources are available, and fall back on
    > > + memory BAR when I/O resources are unavailable.
    > > +
    > > +bar
    > > +
    > > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
    > > + the function located beginning at 10h in Configuration Space
    > > + and used to map the structure into Memory or I/O Space.
    > > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
    > > + or I/O Space.
    > > +
    > > + The value 0xF specifies that the structure is in PCI configuration space
    > > + inline with this capability structure, following (not necessarily immediately)
    > > + the length field.
    >
    > Why not immediately?
    > Or how would the driver know where it is?

    It's at the offset. E.g. for notification we stick multiplier after
    length. Further, we might extend virtio_pci_cap in the future, and we
    don't want to move stuff around like we had to with MSI-X.

    > > +
    > > +offset
    > > + indicates where the structure begins relative to the base address associated
    > > + with the BAR. If bar specifies configuration space, offset is relative
    > > + to start of virtio_pci_cap structure.
    > > +
    > > +length
    > > + indicates the length of the structure.
    > > + This size might include padding, or fields unused by the driver.
    > > + Drivers are also recommended to only map part of configuration structure
    > > + large enough for device operation.
    > > + For example, a future device might present a large structure size of several
    > > + MBytes.
    > > + As current devices never utilize structures larger than 4KBytes in size,
    > > + driver can limit the mapped structure size to e.g.
    > > + 4KBytes to allow forward compatibility with such devices without loss of
    > > + functionality and without wasting resources.
    > > +
    > > +
    > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed
    > > +by additional fields:
    > > +
    > > +struct virtio_pci_notify_cap {
    > > +        struct virtio_pci_cap cap;
    > > +        __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
    > > +};
    > > +
    > > +notify_off_multiplier
    > > +
    > > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0.
    > > + Value 0x1 is reserved.
    > > + For a given virtqueue, the address to use for notifications is calculated as follows:
    > > +
    > > + queue_notify_off * notify_off_multiplier + offset
    > > +
    > > + If notify_off_multiplier is 0, all virtqueues use the same address in
    > > + the Notifications structure!
    > > +
    > > +
    > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection
    > > +-------------------------------
    > > +
    > > +Legacy drivers skipped Device Layout Detection step, assuming legacy
    > > +configuration space in BAR0 in I/O space unconditionally.
    > > +
    > > +2.4.1.3.1.3 Queue Vector Configuration
    > > --------------------------------------
    > >
    > > When MSI-X capability is present and enabled in the device
    > > -(through standard PCI configuration space) 4 bytes at byte offset
    > > -20 are used to map configuration change and queue interrupts to
    > > -MSI-X vectors. In this case, the ISR Status field is unused, and
    > > -device specific configuration starts at byte offset 24 in virtio
    > > -header structure. When MSI-X capability is not enabled, device
    > > -specific configuration starts at byte offset 20 in virtio header.
    > > +(through standard PCI configuration space) Configuration/Queue
    > > +MSI-X Vector registers are used to map configuration change and queue
    > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused.
    > >
    > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
    > > Configuration/Queue Vector registers, maps interrupts triggered
    > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on
    > > failure, NO_VECTOR is returned. If a mapping failure is detected,
    > > the driver can retry mapping with fewervectors, or disable MSI-X.
    > >
    > > -2.4.1.3.1.2 Virtqueue Configuration
    > > +2.4.1.3.1.4 Virtqueue Configuration
    > > -----------------------------------
    > >
    > > As a device can have zero or more virtqueues for bulk data
    > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has:
    > > always a power of 2. This controls how big the virtqueue is
    > > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist.
    > >
    > > -3. Allocate and zero virtqueue in contiguous physical memory, on
    > > - a 4096 byte alignment. Write the physical address, divided by
    > > - 4096 to the Queue Address field.[6]
    > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size
    > > + field.
    > > +
    > > +3. Allocate and zero Descriptor Table, Available and Used rings for the
    > > + virtqueue in contiguous physical memory.
    > >
    > > 4. Optionally, if MSI-X capability is present and enabled on the
    > > device, select a vector to use to request interrupts triggered
    > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has:
    > > Queue Vector field: on success, previously written value is
    > > returned; on failure, NO_VECTOR value is returned.
    > >
    > > +
    > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration
    > > +-----------------------------------
    > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio
    > > +device is defined as 4096 bytes. Driver writes the physical address, divided
    > > +by 4096 to the Queue Address field [6].
    > > +
    > > 2.4.1.3.2 Notifying The Device
    > > ------------------------------
    > >
    > > Device notification occurs by writing the 16-bit virtqueue index
    > > -of this virtqueue to the Queue Notify field of the virtio header
    > > -in the first I/O region of the PCI device.
    > > +of this virtqueue to the Queue Notify field.
    > >
    > > 2.4.1.3.3 Receiving Used Buffers From The Device
    > > +------------------------------
    > >
    > > If an interrupt is necessary:
    > >
    > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
    > > This is only allowed if the driver does not use any features
    > > which would alter this early use of the device.
    > >
    > > -[5] ie. once you enable MSI-X on the device, the other fields move.
    > > +[5] When MSI-X capability is enabled, device specific configuration starts at
    > > +byte offset 24 in virtio header structure. When MSI-X capability is not
    > > +enabled, device specific configuration starts at byte offset 20 in virtio
    > > +header. ie. once you enable MSI-X on the device, the other fields move.
    > > If you turn it off again, they move back!
    >
    > Thanks,
    > Rusty.
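
A minimal sketch, in C, of the notification address rule quoted in the patch above (queue_notify_off * notify_off_multiplier + offset). The helper names notify_multiplier_valid and notify_offset are hypothetical and do not come from the patch; the sketch only computes an offset within the BAR and performs no register access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: checks the constraints the RFC text places on
 * notify_off_multiplier (0, or an even power of two; 0x1 is reserved). */
static bool notify_multiplier_valid(uint32_t mult)
{
    if (mult == 0)
        return true;                 /* all virtqueues share one address */
    if (mult == 1)
        return false;                /* reserved value */
    return (mult & (mult - 1)) == 0; /* power of two >= 2, hence even */
}

/* Hypothetical helper: offset of a virtqueue's Queue Notify field from the
 * start of the BAR, following the formula in the RFC:
 *   queue_notify_off * notify_off_multiplier + offset
 * where cap_offset is the offset field of the notify capability. */
static uint64_t notify_offset(uint32_t cap_offset,
                              uint16_t queue_notify_off,
                              uint32_t mult)
{
    assert(notify_multiplier_valid(mult));
    return (uint64_t)queue_notify_off * mult + cap_offset;
}
```

With a capability offset of 0x3000, a multiplier of 4 and a queue_notify_off of 2, the sketch places the notify field at 0x3008 within the BAR; with a multiplier of 0, every virtqueue maps to 0x3000.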

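A minimal sketch, in C, of how a driver might assemble all 64 feature bits through the device_feature_select / device_feature pair described in the patch. It uses an in-memory model in place of real register accesses; struct feature_window and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory model of the feature window; a real driver
 * would go through MMIO or port I/O accessors instead. */
struct feature_window {
    uint32_t select;      /* models device_feature_select */
    uint64_t device_bits; /* feature bits the modelled device offers */
};

/* Models a read of device_feature: select 0 gives bits 0 to 31,
 * select 1 gives bits 32 to 63, any other value reads as 0. */
static uint32_t read_device_feature(const struct feature_window *w)
{
    assert(w);
    if (w->select > 1)
        return 0;
    return (uint32_t)(w->device_bits >> (32 * w->select));
}

/* Reads both 32-bit windows and combines them into one 64-bit value. */
static uint64_t read_all_features(struct feature_window *w)
{
    uint64_t bits;

    w->select = 0;
    bits = read_device_feature(w);
    w->select = 1;
    bits |= (uint64_t)read_device_feature(w) << 32;
    return bits;
}

/* Per the RFC, feature bit 32 marks a device compliant with the new
 * revision; a transitional driver treats its absence as "legacy". */
static int offers_bit32(uint64_t features)
{
    return (int)((features >> 32) & 1);
}
```

A transitional driver could call read_all_features once during setup and fall back to the legacy interface when offers_bit32 returns 0.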

  • 4.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-27-2013 15:10
    Some remarks from my side...

    On Tue, 27 Aug 2013 10:38:59 +0300
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > This is the new configuration layout.
    > > >
    > > > Notes:
    > > > - Everything is LE
    > > > - There's a feature bit that means spec 1.0 compliant.
    > > > - Both devices and drivers can either require the 1.0 interface
    > > > or try to include compatibility support. The spec isn't forcing
    > > > this decision.
    > >
    > > Hmm, this kind includes other changes already proposed, like the LE
    > > change and the framing change. I think this conceptually splits nicely:
    > >
    > > 1) Feature bit 32 proposal.
    > > 2) Endian change.
    > > 3) Framing change.
    > > 4) PCI layout change.
    >
    > Right - they are mostly in different parts of the document.
    > I put it all together so it's easy to see how we intend to
    > handle the transition.
    > So is everyone OK with keeping this in a single patch?

    The new feature bit is supposed to cover all of this, right? Then this
    should be one patch.

    >
    > > > - I kept documentation of the legacy interface around, and added notes
    > > > on transition inline. They are in separate sections each clearly marked
    > > > "Legacy Interface" so we'll be able to separate them out
    > > > from the final document as necessary - for now I think it's easier
    > > > to keep it all together.
    > >
    > > Good thinking: most of us know the current spec so it's definitely
    > > clearer. And makes sure we're thinking about the transition.
    > >
    > > > Only virtio PCI has been converted.
    > > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > > we can do something similar for MMIO and CCW.
    > >
    > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > > feature negotiation mechanisms
    > > >
    > > > + 32: Feature bit must be set for any device compliant with this
    > > > + revision of the specification, and acknowledged by all device drivers.

    Would it make sense to have a bit 33 "rings big endian" whose validity
    depends on bit 32 set? This would make it possible for ccw to keep its
    current endianness.

    > > > +
    > > > + 33 to 63: Feature bits reserved for future extensions
    > > > +
    > > > For example, feature bit 0 for a network device (i.e. Subsystem
    > > > Device ID 1) indicates that the device supports checksumming of
    > > > packets.
    > >
    > > Why stop at 63? If we go to a more decentralized feature-assignment
    > > model, we'll run through those very fast.
    >
    > Then we'll just document more, but driver needs to know where to stop
    > looking for features.
    >
    > >
    > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
    > > > indicated by offering a feature bit, so the guest can check
    > > > before accessing that part of the configuration space.
    > > >
    > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
    > > > +--------------------------------------
    > > > +
    > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
    > > > +different interface between the hypervisor and the guest.
    > > > +Since these are widely deployed in the field, this specification
    > > > +accomodates optional features to simplify transition
    > > > +from these earlier draft interfaces. Specifically:
    > > > +
    > > > +Legacy Interface
    > > > + is an interface specified by an earlier draft of this specification
    > > > + (up to 0.9.X)
    > > > +Legacy Device
    > > > + is a device implemented before this specification was released,
    > > > + and implementing a legacy interface on the host side
    > > > +Legacy Driver
    > > > + is a driver implemented before this specification was released,
    > > > + and implementing a legacy interface on the guest side
    > > > +
    > > > +to simplify transition from these earlier draft interfaces,
    > > > +it is possible to implement
    > > > +
    > > > +Transitional Device
    > > > + a device supporting both drivers conforming to this
    > > > + specification, and legacy drivers
    > > > +
    > > > +Transitional Driver
    > > > + a driver supporting both devices conforming to this
    > > > + specification, and legacy devices

    What happens to legacy devices in the future? Current implementers will
    obviously expose legacy devices, which means future drivers need to be
    transitional or they won't work with what is currently out there. Will
    legacy stay around (for the foreseeable future)? Will legacy devices
    still be considered standard compliant (as in "compliant to the legacy
    standard")?

    > > > +
    > > > +Device and driver that require support for revision 1.0 or newer of
    > > > +the specification to function, are called non-transitional device and driver,
    > > > +respectively.
    > > > +
    > > > +Transitional Drivers can detect Legacy Devices by detecting that
    > > > +Feature bit 32 is not offered.
    > > > +Transitional devices can detect Legacy drivers by detecting that
    > > > +Feature bit 32 has not been acknowledged by driver.

    Will we use new feature bits for new, incompatible revisions? Or will
    we try to stay backwards compatible?

    > > > +
    > > > +To make them easier to locate, specification sections documenting these
    > > > +transitional features all explicitly marked with
    > > > +'Legacy Interface' in the section title.
> > > + > > > + > > > 2.1.3 Configuration Space > > > ------------------------- > > > > > > Configuration space is generally used for rarely-changing or > > > initialization-time parameters. > > > > > > -Note that this space is generally the guest's native endian, > > > +Note that configuration space generally uses the little-endian format > > > +for multi-byte fields. > > > + > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness > > > +-------------------------------------- > > > + > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, > > > rather than PCI's little-endian. > > > > > > 2.1.4 Virtqueues > > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size > > > parameter, which sets the number of entries and implies the total size > > > of the queue. > > > > > > +Each virtqueue consists of three parts: > > > + > > > + Descriptor Table > > > + Available Ring > > > + Used Ring > > > + > > > +where each part is physically-contiguous in guest memory, > > > +and has different alignment requirements. > > > + > > > +The Queue Size field controls the total number of bytes > > > +required for each part of the virtqueue. 
> > > + > > > +The memory aligment and size requirements, in bytes, of each part of the > > > +virtqueue are summarized in the following table (qsz is the Queue Size field): > > > + > > > ++------------+---------------------------------+ > > > + Virtqueue Part Alignment Size > > > ++------------+---------------------------------+ > > > ++------------+---------------------------------+ > > > + Descriptor Table 16 16 * qsz > > > ++------------+---------------------------------+ > > > + Available Ring 2 6 + 2 * qsz > > > ++------------+---------------------------------+ > > > + Used Ring 4 6 + 4 * qsz > > > ++------------+---------------------------------+ > > > + > > > +When the driver wants to send a buffer to the device, it fills in > > > +a slot in the descriptor table (or chains several together), and > > > +writes the descriptor index into the available ring. It then > > > +notifies the device. When the device has finished a buffer, it > > > +writes the descriptor into the used ring, and sends an interrupt. > > > + > > > + > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > > > +-------------------------------------- > > > + > > > +For Legacy Interfaces, several additional > > > +restrictions are placed on the virtqueue layout: > > > + > > > Each virtqueue occupies two or more physically-contiguous pages > > > (usually defined as 4096 bytes, but depending on the transport) > > > and consists of three parts: > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > > > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > > > } > > > > > > -This currently wastes some space with padding, but also allows future > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The > > > -virtqueue layout structure looks like this: > > > +This wastes some space with padding. 
> > > +The legacy virtqueue layout structure therefore looks like this: > > > > > > struct vring { > > > // The actual descriptors (16 bytes each) > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > > > struct vring_used used; > > > }; > > > > > > -When the driver wants to send a buffer to the device, it fills in > > > -a slot in the descriptor table (or chains several together), and > > > -writes the descriptor index into the available ring. It then > > > -notifies the device. When the device has finished a buffer, it > > > -writes the descriptor into the used ring, and sends an interrupt. > > > - > > > -2.1.4.1 A Note on Virtqueue Endianness > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > > > -------------------------------------- > > > > > > Note that the endian of fields and in the virtqueue is the native > > > -endian of the guest, not little-endian as PCI normally is. This makes > > > -for simpler guest code, and it is assumed that the host already has to > > > -be deeply aware of the guest endian so such an “endian-aware” device > > > -is not a significant issue. > > > +endian of the guest, not little-endian as PCI normally is. > > > +It is assumed that the host is already aware of the guest endian. > > > > > > 2.1.4.2 Message Framing > > > ----------------------- > > > -The original intent of the specification was that message framing (the > > > -particular layout of descriptors) be independent of the contents of > > > +Generally, the intent of the specification is for message framing (the > > > +particular layout of descriptors) to be independent of the contents of > > > the buffers. For example, a network transmit buffer consists of a 12 > > > byte header followed by the network packet. 
This could be most simply > > > placed in the descriptor table as a 12 byte output descriptor followed > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and > > > packet are adjacent, or even three or more descriptors (possibly with > > > loss of efficiency in that case). > > > > > > -Regrettably, initial driver implementations used simple layouts, and > > > -devices came to rely on it, despite this specification wording[10]. It > > > -is thus recommended that drivers be conservative in their assumptions, > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some > > > +In addition, some > > > implementations may have large-but-reasonable restrictions on total > > > descriptor size (such as based on IOV_MAX in the host OS). This has > > > not been a problem in practice: little sympathy will be given to > > > drivers which create unreasonably-sized descriptors such as by > > > dividing a network packet into 1500 single-byte descriptors! > > > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing > > > +----------------------- > > > +Regrettably, initial driver implementations used simple layouts, and > > > +devices came to rely on it, despite this specification wording[10]. It > > > +is thus recommended that when using legacy interfaces, > > > +drivers should be conservative in their assumptions, > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. So ANY_LAYOUT and feature bit 32 are mutually exclusive? > > > + > > > 2.1.4.3 The Virtqueue Descriptor Table > > > -------------------------------------- > > > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device. > > > 2.2.1 Device Initialization > > > --------------------------- > > > > > > -1. Reset the device. This is not required on initial start up. > > > +1. Device discovery. This is only required for some transports. > > > + > > > +2. Reset the device. This is not required on initial start up. > > > > > > -2. 
The ACKNOWLEDGE status bit is set: we have noticed the device. > > > +3. Device layout detection. This is only required for some transports. > > > > > > -3. The DRIVER status bit is set: we know how to drive the device. > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > > > -4. Device-specific setup, including reading the device feature > > > +5. The DRIVER status bit is set: we know how to drive the device. > > > + > > > +6. Device-specific setup, including reading the device feature > > > bits, discovery of virtqueues for the device, optional per-bus > > > setup, and reading and possibly writing the device's virtio > > > configuration space. > > > > > > -5. The subset of device feature bits understood by the driver is > > > +7. The subset of device feature bits understood by the driver is > > > written to the device. > > > > > > -6. The DRIVER_OK status bit is set. > > > +8. The DRIVER_OK status bit is set. > > > > > > -7. The device can now be used (ie. buffers added to the > > > +9. The device can now be used (ie. buffers added to the > > > virtqueues)[4] > > > > > > If any of these steps go irrecoverably wrong, the guest should > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > > > > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > > > 0x103F inclusive is a virtio device[3]. The device must also have a > > > -Revision ID of 0 to match this specification. > > > +Revision ID of 0 or Revision ID of 1 to match this specification. > > > > > > The Subsystem Device ID indicates which virtio device is > > > supported by the device. The Subsystem Vendor ID should reflect > > > the PCI Vendor ID of the environment (it's currently only used > > > for informational purposes by the guest). > > > > > > +Drivers must not match devices where Revision ID does not match 0 or 1. 
> > > +
> > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
> > > +----------------------------
> > > +Transitional devices must have a Revision ID of 0.
> > > +
> > > +Non-transitional devices must have a Revision ID of 1.
> > > +
> > > +Transitional drivers must match a Revision ID of 0 or 1.
> > > +
> > > +Non-transitional drivers must only match a Revision ID of 1.
> > > +
> >
> > I think we should stop abusing Revision IDs, and start using them
> > to reflect device version changes as intended.
> >
> > We could reserve revision id 0 for legacy devices, however, which should
> > work nicely.
>
> Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
>
> More concerns:
>
> We are using revision ID now exactly as was intended to disable old
> drivers - it served us well for 0.X-1.X and would be as useful if we
> ever have 1.X->2.0 transition.
>
> Another worry with using revision numbering for features is that
> it does not play well with downstreams.
> E.g. RHEL might want to cherry-pick a feature without implementing
> other features that happened to land in the same revision.
>
> Also Revision ID is only 8 bit - it's designed for hardware where
> making a new revision is expensive. In software we'll run out of that
> eventually.

So Revision ID is a PCI-specific thing, right? Not all transports will
necessarily have something equivalent, so they would need to depend on
the feature bit.

>
> > >
> > > 2.4.1.2 PCI Device Layout
> > > -------------------------
> > >
> > > -To configure the device, we use the first I/O region of the PCI
> > > -device. This contains a virtio header followed by a
> > > -device-specific region.
> > > +To configure the device,
> > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
> > > +These contain the virtio header registers, the notification register, the > > > +ISR status register and device specific registers, as specified by Virtio > > > ++ Structure PCI Capabilities > > > + > > > +There may be different widths of accesses to the I/O region; the > > > +“natural” access method for each field must be > > > +used (i.e. 32-bit accesses for 32-bit fields, etc). > > > + > > > +PCI Device Configuration Layout includes the common configuration, > > > +ISR, notification and device specific configuration > > > +structures. > > > + > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian. > > > + > > > + > > > +2.4.1.2.1 Common configuration structure layout > > > +------------------------- > > > +Common configuration structure layout is documented below: > > > + > > > +struct virtio_pci_common_cfg { > > > + /* About the whole device. */ > > > + __le32 device_feature_select; /* read-write */ > > > + __le32 device_feature; /* read-only */ > > > + __le32 guest_feature_select; /* read-write */ > > > + __le32 guest_feature; /* read-write */ > > > + __le16 msix_config; /* read-write */ > > > + __le16 num_queues; /* read-only */ > > > + __u8 device_status; /* read-write */ > > > + __u8 unused1; > > > + > > > + /* About a specific virtqueue. */ > > > + __le16 queue_select; /* read-write */ > > > + __le16 queue_size; /* read-write, power of 2, or 0. */ > > > + __le16 queue_msix_vector; /* read-write */ > > > + __le16 queue_enable; /* read-write */ > > > + __le16 queue_notify_off; /* read-only */ > > > + __le64 queue_desc; /* read-write */ > > > + __le64 queue_avail; /* read-write */ > > > + __le64 queue_used; /* read-write */ > > > +}; > > > + > > > +device_feature_select > > > + > > > + Selects which Feature Bits does device_feature field refer to. > > > + Value 0x0 selects Feature Bits 0 to 31 > > > + Value 0x1 selects Feature Bits 32 to 63 > > > + All other values cause reads from device_feature to return 0. 
> > > + > > > +device_feature > > > + > > > + Used by Device to report Feature Bits to Driver. > > > + Device Feature Bits selected by device_feature_select. > > > + > > > +guest_feature_select > > > + > > > + Selects which Feature Bits does guest_feature field refer to. > > > + Value 0x0 selects Feature Bits 0 to 31 > > > + Value 0x1 selects Feature Bits 32 to 63 > > > + All other values cause writes to guest_feature to be ignored, > > > + and reads to return 0. > > > + > > > +guest_feature > > > + > > > + Used by Driver to acknowledge Feature Bits to Device. > > > + Guest Feature Bits selected by guest_feature_select. > > > + > > > +msix_config > > > + > > > + Configuration Vector for MSI-X. > > > + > > > +num_queues > > > + > > > + Specifies the maximum number of virtqueues supported by device. > > > + > > > +device_status > > > + > > > + Device Status field. > > > + > > > +queue_select > > > + > > > + Queue Select. Selects which virtqueue do other fields refer to. > > > + > > > +queue_size > > > + > > > + Queue Size. On reset, specifies the maximum queue size supported by > > > + the hypervisor. This can be modified by driver to reduce memory requirements. > > > + Set to 0 if this virtqueue is unused. > > > + > > > +queue_msix_vector > > > + > > > + Queue Vector for MSI-X. > > > + > > > +queue_enable > > > + > > > + Used to selectively prevent host from executing requests from this virtqueue. > > > + 1 - enabled; 0 - disabled > > > + > > > +queue_notify_off > > > + > > > + Used to calculate the offset from start of Notification structure at > > > + which this virtqueue is located. > > > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > > > + > > > +queue_desc > > > + > > > + Physical address of Descriptor Table. > > > + > > > +queue_avail > > > + > > > + Physical address of Available Ring. > > > + > > > +queue_used > > > + > > > + Physical address of Used Ring. 
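As an illustration of the device_feature_select / device_feature window described above, here is a minimal sketch of how a driver might assemble all 64 feature bits from two 32-bit reads. The struct fake_common_cfg and both helper names are hypothetical and only stand in for a device so the logic can be followed; a real driver would go through little-endian accessors on the mapped common configuration structure instead.

```c
#include <stdint.h>

/* Hypothetical stand-in for the device side of the feature window.
 * all_features models what the device offers; it is not a spec field. */
struct fake_common_cfg {
    uint32_t device_feature_select;  /* written by the driver */
    uint64_t all_features;           /* offered by the (fake) device */
};

/* Models a read of device_feature: select 0 gives bits 0..31,
 * select 1 gives bits 32..63, any other select value reads as 0. */
static uint32_t read_device_feature(const struct fake_common_cfg *cfg)
{
    switch (cfg->device_feature_select) {
    case 0:
        return (uint32_t)(cfg->all_features & 0xffffffffu);
    case 1:
        return (uint32_t)(cfg->all_features >> 32);
    default:
        return 0;
    }
}

/* Driver side: read both 32-bit windows and combine them. */
static uint64_t get_device_features(struct fake_common_cfg *cfg)
{
    uint64_t features;

    cfg->device_feature_select = 0;
    features = read_device_feature(cfg);
    cfg->device_feature_select = 1;
    features |= (uint64_t)read_device_feature(cfg) << 32;
    return features;
}
```

With this shape, checking for the proposed feature bit 32 amounts to testing bit 32 of the combined value; guest_feature would be written through guest_feature_select in the same two-step way.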
> > > +
> > > +
> > > +2.4.1.2.2 ISR status structure layout
> > > +-------------------------
> > > +ISR status structure includes a single 8-bite ISR status field
> >
> > 8-bit
>
> Right :)
>
> > > +
> > > +2.4.1.2.3 Notification structure layout
> > > +-------------------------
> > > +Notification structure is always a multiple of 2 bytes in size.
> > > +It includes 2-byte Queue Notify fields for each virtqueue of
> > > +the device. Note that multiple virtqueues can use the same
> > > +Queue Notify field, if necessary.
> >
> > Hmm, maybe move this down, so you can have a section which starts with
> > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all
> > together.
>
> so Move PCI Device Layout to within
> PCI-specific Initialization And Device Operation?
>
> > > +
> > > +2.4.1.2.4 Device specific structure
> > > +-------------------------
> > > +
> > > +Device specific structure is optional.
> > > +
> > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
> > > +-------------------------
> > > +
> > > +Transitional devices should present part of configuration
> > > +registers in a legacy configuration structure in BAR0 in the first I/O
> > > +region of the PCI device, as documented below.
> > >
> > > There may be different widths of accesses to the I/O region; the
> > > “natural” access method for each field in the virtio header must be
> > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
> > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but
> > > +When accessed through the legacy interface the
> > > device-specific region can be accessed using any width accesses, and
> > > should obtain the same results.
> > >
> > > Note that this is possible because while the virtio header is PCI
> > > -(i.e. little) endian, the device-specific region is encoded in
> > > -the native endian of the guest (where such distinction is
> > > +(i.e.
little) endian, when using the legacy interface the device-specific > > > +region is encoded in the native endian of the guest (where such distinction is > > > applicable). > > > > > > -2.4.1.2.1 PCI Device Virtio Header > > > ----------------------------------- > > > > > > -The virtio header looks as follows: > > > +When used through the legacy interface, the virtio header looks as follows: > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > Bits 32 32 32 16 16 16 8 8 > > > @@ -661,7 +905,6 @@ The virtio header looks as follows: > > > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > - > > > If MSI-X is enabled for the device, two additional fields > > > immediately follow this header:[5] > > > > > > @@ -689,25 +932,154 @@ device-specific headers: > > > > > > +------------++--------------------+ > > > > > > +Note that only Feature Bits 0 to 31 are accessible through the > > > +Legacy Interface. When used through the Legacy Interface, > > > +Transitional Devices must assume that Feature Bits 32 to 63 > > > +are not acknowledged by Driver. > > > + > > > + > > > 2.4.1.3 PCI-specific Initialization And Device Operation > > > -------------------------------------------------------- > > > > > > -The page size for a virtqueue on a PCI virtio device is defined as > > > -4096 bytes. > > > - > > > 2.4.1.3.1 Device Initialization > > > ------------------------------- > > > > > > -2.4.1.3.1.1 Queue Vector Configuration > > > +This documents PCI-specific steps executed during Device Initialization. > > > +As the first step, driver must detect device configuration layout > > > +to locate configuration fields in memory,I/O or configuration space of the > > > +device. 
> > > + > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > > > +------------------------------- > > > + > > > +As a prerequisite to device initialization, driver executes a > > > +PCI capability list scan, detecting virtio configuration layout using Virtio > > > +Structure PCI capabilities. > > > + > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification > > > +and ISR Status and device configuration structures. > > > +Each structure can be mapped by a Base Address register (BAR) belonging to > > > +the function, located beginning at 10h in Configuration Space, > > > +or accessed though PCI configuration space. > > > + > > > +Actual location of each structure is specified using vendor-specific PCI capability located > > > +on capability list in PCI configuration space of the device. > > > +This virtio structure capability uses little-endian format; all bits are > > > +read-only: > > > + > > > +struct virtio_pci_cap { > > > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > > + __u8 cap_next; /* Generic PCI field: next ptr. */ > > > + __u8 cap_len; /* Generic PCI field: capability length */ > > > + __u8 cfg_type; /* Identifies the structure. */ > > > + __u8 bar; /* Where to find it. */ > > > + __u8 padding[3];/* Pad to full dword. */ > > > + __le32 offset; /* Offset within bar. */ > > > + __le32 length; /* Length of the structure, in bytes. */ > > > +}; > > > + > > > +This structure can optionally followed by extra data, depending on > > > +other fields, as documented below. > > > + > > > +The fields are interpreted as follows: > > > + > > > +cap_vndr > > > + 0x09; Identifies a vendor-specific capability. > > > + > > > +cap_next > > > + Link to next capability in the capability list in the configuration space. > > > + > > > +cap_len > > > + Length of the capability structure, including the whole of > > > + struct virtio_pci_cap, and extra data if any. 
> > > + This length might include padding, or fields unused by the driver. > > > + > > > +cfg_type > > > + identifies the structure, according to the following table. > > > + > > > + /* Common configuration */ > > > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 > > > + /* Notifications */ > > > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > > > + /* ISR Status */ > > > + #define VIRTIO_PCI_CAP_ISR_CFG 3 > > > + /* Device specific configuration */ > > > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > > + > > > + More than one capability can identify the same structure - this makes it > > > + possible for the device to expose multiple interfaces to drivers. The order of > > > + the capabilities in the capability list specifies the order of preference > > > + suggested by the device; drivers should use the first interface that they can > > > + support. For example, on some hypervisors, notifications using IO accesses are > > > + faster than memory accesses. In this case, hypervisor can expose two > > > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: > > > + the first one addressing an I/O BAR, the second one addressing a memory BAR. > > > + Driver will use the I/O BAR if I/O resources are available, and fall back on > > > + memory BAR when I/O resources are unavailable. > > > + > > > +bar > > > + > > > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to > > > + the function located beginning at 10h in Configuration Space > > > + and used to map the structure into Memory or I/O Space. > > > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space > > > + or I/O Space. > > > + > > > + The value 0xF specifies that the structure is in PCI configuration space > > > + inline with this capability structure, following (not necessarily immediately) > > > + the length field. > > > > Why not immediately? > > Or how would the driver know where it is? > > It's at the offset. > > E.g. for notification we stick multiplier after length. 
> Further, we might extend virtio_pci_cap in the future,
> and we don't want to move stuff around like we
> had to with MSI-X.
>
> > > +
> > > +offset
> > > + indicates where the structure begins relative to the base address associated
> > > + with the BAR. If bar specifies configuration space, offset is relative
> > > + to start of virtio_pci_cap structure.
> > > +
> > > +length
> > > + indicates the length of the structure.
> > > + This size might include padding, or fields unused by the driver.
> > > + Drivers are also recommended to only map part of configuration structure
> > > + large enough for device operation.
> > > + For example, a future device might present a large structure size of several
> > > + MBytes.
> > > + As current devices never utilize structures larger than 4KBytes in size,
> > > + driver can limit the mapped structure size to e.g.
> > > + 4KBytes to allow forward compatibility with such devices without loss of
> > > + functionality and without wasting resources.
> > > +
> > > +
> > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed
> > > +by additional fields:
> > > +
> > > +struct virtio_pci_notify_cap {
> > > + struct virtio_pci_cap cap;
> > > + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
> > > +};
> > > +
> > > +notify_off_multiplier
> > > +
> > > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0.
> > > + Value 0x1 is reserved.
> > > + For a given virtqueue, the address to use for notifications is calculated as follows:
> > > +
> > > + queue_notify_off * notify_off_multiplier + offset
> > > +
> > > + If notify_off_multiplier is 0, all virtqueues use the same address in
> > > + the Notifications structure!
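The notification address rule quoted above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names are hypothetical, cap_offset is taken to be the offset field of the VIRTIO_PCI_CAP_NOTIFY_CFG capability, and the result is therefore relative to the base of the BAR named by that capability.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the multiplier follows the quoted rule: either 0, or an even
 * power of two. Every power of two except 1 is even, so it is enough to
 * reject the reserved value 1 explicitly. */
static bool notify_multiplier_valid(uint32_t m)
{
    return m == 0 || (m != 1 && (m & (m - 1)) == 0);
}

/* queue_notify_off * notify_off_multiplier + offset, computed in 64 bits
 * so the 16-bit by 32-bit product cannot overflow. */
static uint64_t queue_notify_addr(uint32_t cap_offset,
                                  uint16_t queue_notify_off,
                                  uint32_t notify_off_multiplier)
{
    return (uint64_t)queue_notify_off * notify_off_multiplier + cap_offset;
}
```

For example, with offset 0x3000 and a multiplier of 4, a queue whose queue_notify_off is 2 is notified at 0x3008; with a multiplier of 0 every queue resolves to 0x3000, which matches the "all virtqueues use the same address" case.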
> > > + > > > + > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection > > > +------------------------------- > > > + > > > +Legacy drivers skipped Device Layout Detection step, assuming legacy > > > +configuration space in BAR0 in I/O space unconditionally. > > > + > > > +2.4.1.3.1.3 Queue Vector Configuration > > > -------------------------------------- > > > > > > When MSI-X capability is present and enabled in the device > > > -(through standard PCI configuration space) 4 bytes at byte offset > > > -20 are used to map configuration change and queue interrupts to > > > -MSI-X vectors. In this case, the ISR Status field is unused, and > > > -device specific configuration starts at byte offset 24 in virtio > > > -header structure. When MSI-X capability is not enabled, device > > > -specific configuration starts at byte offset 20 in virtio header. > > > +(through standard PCI configuration space) Configuration/Queue > > > +MSI-X Vector registers are used to map configuration change and queue > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused. > > > > > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of > > > Configuration/Queue Vector registers, maps interrupts triggered > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on > > > failure, NO_VECTOR is returned. If a mapping failure is detected, > > > the driver can retry mapping with fewervectors, or disable MSI-X. > > > > > > -2.4.1.3.1.2 Virtqueue Configuration > > > +2.4.1.3.1.4 Virtqueue Configuration > > > ----------------------------------- > > > > > > As a device can have zero or more virtqueues for bulk data > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has: > > > always a power of 2. This controls how big the virtqueue is > > > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. > > > > > > -3. 
Allocate and zero virtqueue in contiguous physical memory, on > > > - a 4096 byte alignment. Write the physical address, divided by > > > - 4096 to the Queue Address field.[6] > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size > > > + field. > > > + > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the > > > + virtqueue in contiguous physical memory. > > > > > > 4. Optionally, if MSI-X capability is present and enabled on the > > > device, select a vector to use to request interrupts triggered > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has: > > > Queue Vector field: on success, previously written value is > > > returned; on failure, NO_VECTOR value is returned. > > > > > > + > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration > > > +----------------------------------- > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio > > > +device is defined as 4096 bytes. Driver writes the physical address, divided > > > +by 4096 to the Queue Address field [6]. > > > + > > > 2.4.1.3.2 Notifying The Device > > > ------------------------------ > > > > > > Device notification occurs by writing the 16-bit virtqueue index > > > -of this virtqueue to the Queue Notify field of the virtio header > > > -in the first I/O region of the PCI device. > > > +of this virtqueue to the Queue Notify field. > > > > > > 2.4.1.3.3 Receiving Used Buffers From The Device > > > +------------------------------ > > > > > > If an interrupt is necessary: > > > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390). > > > This is only allowed if the driver does not use any features > > > which would alter this early use of the device. > > > > > > -[5] ie. once you enable MSI-X on the device, the other fields move. 
> > > +[5] When MSI-X capability is enabled, device specific configuration starts at
> > > +byte offset 24 in virtio header structure. When MSI-X capability is not
> > > +enabled, device specific configuration starts at byte offset 20 in virtio
> > > +header. ie. once you enable MSI-X on the device, the other fields move.
> > > If you turn it off again, they move back!
> >
> > Thanks,
> > Rusty.

Cornelia


  • 5.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-27-2013 15:35
    On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote: > Some remarks from my side... > > On Tue, 27 Aug 2013 10:38:59 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote: > > > "Michael S. Tsirkin" <mst@redhat.com> writes: > > > > This is the new configuration layout. > > > > > > > > Notes: > > > > - Everything is LE > > > > - There's a feature bit that means spec 1.0 compliant. > > > > - Both devices and drivers can either require the 1.0 interface > > > > or try to include compatibility support. The spec isn't forcing > > > > this decision. > > > > > > Hmm, this kind includes other changes already proposed, like the LE > > > change and the framing change. I think this conceptually splits nicely: > > > > > > 1) Feature bit 32 proposal. > > > 2) Endian change. > > > 3) Framing change. > > > 4) PCI layout change. > > > > Right - they are mostly in different parts of the document. > > I put it all together so it's easy to see how we intend to > > handle the transition. > > So is everyone OK with keeping this in a single patch? > > The new feature bit is supposed to cover all of this, right? Then this > should be one patch. > > > > > > > - I kept documentation of the legacy interface around, and added notes > > > > on transition inline. They are in separate sections each clearly marked > > > > "Legacy Interface" so we'll be able to separate them out > > > > from the final document as necessary - for now I think it's easier > > > > to keep it all together. > > > > > > Good thinking: most of us know the current spec so it's definitely > > > clearer. And makes sure we're thinking about the transition. > > > > > > > Only virtio PCI has been converted. > > > > Let's discuss this on the meeting tonight, once we figure out PCI > > > > we can do something similar for MMIO and CCW. 
> > > > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows: > > > > 24 to 31: Feature bits reserved for extensions to the queue and > > > > feature negotiation mechanisms > > > > > > > > + 32: Feature bit must be set for any device compliant with this > > > > + revision of the specification, and acknowledged by all device drivers. > > Would it make sense to have a bit 33 "rings big endian" whose validity > depends on bit 32 set? This would make it possible for ccw to keep its > current endianness. I didn't go over ccw or MMIO yet - only PCI. I think ccw registers will just be explicitly BE, with no need for a feature bit. Does this sound right? > > > > + > > > > + 33 to 63: Feature bits reserved for future extensions > > > > + > > > > For example, feature bit 0 for a network device (i.e. Subsystem > > > > Device ID 1) indicates that the device supports checksumming of > > > > packets. > > > > > > Why stop at 63? If we go to a more decentralized feature-assignment > > > model, we'll run through those very fast. > > > > Then we'll just document more, but driver needs to know where to stop > > looking for features. > > > > > > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are > > > > indicated by offering a feature bit, so the guest can check > > > > before accessing that part of the configuration space. > > > > > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts > > > > +-------------------------------------- > > > > + > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but > > > > +different interface between the hypervisor and the guest. > > > > +Since these are widely deployed in the field, this specification > > > > +accomodates optional features to simplify transition > > > > +from these earlier draft interfaces. 
Specifically:
> > > > +
> > > > +Legacy Interface
> > > > + is an interface specified by an earlier draft of this specification
> > > > + (up to 0.9.X)
> > > > +Legacy Device
> > > > + is a device implemented before this specification was released,
> > > > + and implementing a legacy interface on the host side
> > > > +Legacy Driver
> > > > + is a driver implemented before this specification was released,
> > > > + and implementing a legacy interface on the guest side
> > > > +
> > > > +to simplify transition from these earlier draft interfaces,
> > > > +it is possible to implement
> > > > +
> > > > +Transitional Device
> > > > + a device supporting both drivers conforming to this
> > > > + specification, and legacy drivers
> > > > +
> > > > +Transitional Driver
> > > > + a driver supporting both devices conforming to this
> > > > + specification, and legacy devices
>
> What happens to legacy devices in the future? Current implementers
> will obviously expose legacy devices, which means future drivers need
> to be transitional or they won't work with what is currently out there.

You are right. It's a bug in what I wrote: non-transitional drivers
should work with transitional devices. This way a transitional device
can change to non-transitional after drivers are updated.

> Will legacy stay around (for the foreseeable future)?

That's up to implementers, I think: as long as they implement the new
standard, we should not prevent them from bundling in the old virtio,
coffee-making capabilities, etc.

> Will legacy
> devices still be considered standard compliant (as in "compliant to the
> legacy standard")?

I don't think they are compliant. We'll split the legacy sections out
of the spec into a separate transition guide before we release the spec.

> > > > +
> > > > +Device and driver that require support for revision 1.0 or newer of
> > > > +the specification to function, are called non-transitional device and driver,
> > > > +respectively.
> > > > + > > > > +Transitional Drivers can detect Legacy Devices by detecting that > > > > +Feature bit 32 is not offered. > > > > +Transitional devices can detect Legacy drivers by detecting that > > > > +Feature bit 32 has not been acknowledged by driver. > > Will we use new feature bits for new, incompatible revisions? Or will > we try to stay backwards compatible? So an incompatible change needs to increment revision ID to prevent drivers from loading. MMIO and PCI both have revision IDs. CCW will need to add something like a revision ID, we discussed this already. > > > > + > > > > +To make them easier to locate, specification sections documenting these > > > > +transitional features all explicitly marked with > > > > +'Legacy Interface' in the section title. > > > > + > > > > + > > > > 2.1.3 Configuration Space > > > > ------------------------- > > > > > > > > Configuration space is generally used for rarely-changing or > > > > initialization-time parameters. > > > > > > > > -Note that this space is generally the guest's native endian, > > > > +Note that configuration space generally uses the little-endian format > > > > +for multi-byte fields. > > > > + > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness > > > > +-------------------------------------- > > > > + > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, > > > > rather than PCI's little-endian. > > > > > > > > 2.1.4 Virtqueues > > > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size > > > > parameter, which sets the number of entries and implies the total size > > > > of the queue. > > > > > > > > +Each virtqueue consists of three parts: > > > > + > > > > + Descriptor Table > > > > + Available Ring > > > > + Used Ring > > > > + > > > > +where each part is physically-contiguous in guest memory, > > > > +and has different alignment requirements. 
> > > > + > > > > +The Queue Size field controls the total number of bytes > > > > +required for each part of the virtqueue. > > > > + > > > > +The memory aligment and size requirements, in bytes, of each part of the > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field): > > > > + > > > > ++------------+---------------------------------+ > > > > + Virtqueue Part Alignment Size > > > > ++------------+---------------------------------+ > > > > ++------------+---------------------------------+ > > > > + Descriptor Table 16 16 * qsz > > > > ++------------+---------------------------------+ > > > > + Available Ring 2 6 + 2 * qsz > > > > ++------------+---------------------------------+ > > > > + Used Ring 4 6 + 4 * qsz > > > > ++------------+---------------------------------+ > > > > + > > > > +When the driver wants to send a buffer to the device, it fills in > > > > +a slot in the descriptor table (or chains several together), and > > > > +writes the descriptor index into the available ring. It then > > > > +notifies the device. When the device has finished a buffer, it > > > > +writes the descriptor into the used ring, and sends an interrupt. > > > > + > > > > + > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > > > > +-------------------------------------- > > > > + > > > > +For Legacy Interfaces, several additional > > > > +restrictions are placed on the virtqueue layout: > > > > + > > > > Each virtqueue occupies two or more physically-contiguous pages > > > > (usually defined as 4096 bytes, but depending on the transport) > > > > and consists of three parts: > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > > > > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > > > > } > > > > > > > > -This currently wastes some space with padding, but also allows future > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. 
The > > > > -virtqueue layout structure looks like this: > > > > +This wastes some space with padding. > > > > +The legacy virtqueue layout structure therefore looks like this: > > > > > > > > struct vring { > > > > // The actual descriptors (16 bytes each) > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > > > > struct vring_used used; > > > > }; > > > > > > > > -When the driver wants to send a buffer to the device, it fills in > > > > -a slot in the descriptor table (or chains several together), and > > > > -writes the descriptor index into the available ring. It then > > > > -notifies the device. When the device has finished a buffer, it > > > > -writes the descriptor into the used ring, and sends an interrupt. > > > > - > > > > -2.1.4.1 A Note on Virtqueue Endianness > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > > > > -------------------------------------- > > > > > > > > Note that the endian of fields and in the virtqueue is the native > > > > -endian of the guest, not little-endian as PCI normally is. This makes > > > > -for simpler guest code, and it is assumed that the host already has to > > > > -be deeply aware of the guest endian so such an “endian-aware” device > > > > -is not a significant issue. > > > > +endian of the guest, not little-endian as PCI normally is. > > > > +It is assumed that the host is already aware of the guest endian. > > > > > > > > 2.1.4.2 Message Framing > > > > ----------------------- > > > > -The original intent of the specification was that message framing (the > > > > -particular layout of descriptors) be independent of the contents of > > > > +Generally, the intent of the specification is for message framing (the > > > > +particular layout of descriptors) to be independent of the contents of > > > > the buffers. For example, a network transmit buffer consists of a 12 > > > > byte header followed by the network packet. 
This could be most simply
> > > > placed in the descriptor table as a 12 byte output descriptor followed
> > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and
> > > > packet are adjacent, or even three or more descriptors (possibly with
> > > > loss of efficiency in that case).
> > > >
> > > > -Regrettably, initial driver implementations used simple layouts, and
> > > > -devices came to rely on it, despite this specification wording[10]. It
> > > > -is thus recommended that drivers be conservative in their assumptions,
> > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some
> > > > +In addition, some
> > > > implementations may have large-but-reasonable restrictions on total
> > > > descriptor size (such as based on IOV_MAX in the host OS). This has
> > > > not been a problem in practice: little sympathy will be given to
> > > > drivers which create unreasonably-sized descriptors such as by
> > > > dividing a network packet into 1500 single-byte descriptors!
> > > >
> > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
> > > > +-----------------------
> > > > +Regrettably, initial driver implementations used simple layouts, and
> > > > +devices came to rely on it, despite this specification wording[10]. It
> > > > +is thus recommended that when using legacy interfaces,
> > > > +drivers should be conservative in their assumptions,
> > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
>
> So ANY_LAYOUT and feature bit 32 are mutually exclusive?

Hmm. I wonder what gives this impression.
What I tried to say is bit 32 should imply ANY_LAYOUT.

> > > > +
> > > > 2.1.4.3 The Virtqueue Descriptor Table
> > > > --------------------------------------
> > > >
> > > > @@ -386,23 +478,27 @@ how to communicate with the specific device.
> > > > 2.2.1 Device Initialization
> > > > ---------------------------
> > > >
> > > > -1. Reset the device. This is not required on initial start up.
> > > > +1.
Device discovery. This is only required for some transports. > > > > + > > > > +2. Reset the device. This is not required on initial start up. > > > > > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > +3. Device layout detection. This is only required for some transports. > > > > > > > > -3. The DRIVER status bit is set: we know how to drive the device. > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > > > > > -4. Device-specific setup, including reading the device feature > > > > +5. The DRIVER status bit is set: we know how to drive the device. > > > > + > > > > +6. Device-specific setup, including reading the device feature > > > > bits, discovery of virtqueues for the device, optional per-bus > > > > setup, and reading and possibly writing the device's virtio > > > > configuration space. > > > > > > > > -5. The subset of device feature bits understood by the driver is > > > > +7. The subset of device feature bits understood by the driver is > > > > written to the device. > > > > > > > > -6. The DRIVER_OK status bit is set. > > > > +8. The DRIVER_OK status bit is set. > > > > > > > > -7. The device can now be used (ie. buffers added to the > > > > +9. The device can now be used (ie. buffers added to the > > > > virtqueues)[4] > > > > > > > > If any of these steps go irrecoverably wrong, the guest should > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > > > > > > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > > > > 0x103F inclusive is a virtio device[3]. The device must also have a > > > > -Revision ID of 0 to match this specification. > > > > +Revision ID of 0 or Revision ID of 1 to match this specification. > > > > > > > > The Subsystem Device ID indicates which virtio device is > > > > supported by the device. 
The Subsystem Vendor ID should reflect
> > > > the PCI Vendor ID of the environment (it's currently only used
> > > > for informational purposes by the guest).
> > > >
> > > > +Drivers must not match devices where Revision ID does not match 0 or 1.
> > > > +
> > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
> > > > +----------------------------
> > > > +Transitional devices must have a Revision ID of 0.
> > > > +
> > > > +Non-transitional devices must have a Revision ID of 1.
> > > > +
> > > > +Transitional drivers must match a Revision ID of 0 or 1.
> > > > +
> > > > +Non-transitional drivers must only match a Revision ID of 1.
> > > > +
> > >
> > > I think we should stop abusing Revision IDs, and start using them
> > > to reflect device version changes as intended.
> > >
> > > We could reserve revision id 0 for legacy devices, however, which should
> > > work nicely.
> >
> > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
> >
> > More concerns:
> >
> > We are using revision ID now exactly as was intended to disable old
> > drivers - it served us well for 0.X-1.X and would be as useful if we
> > ever have 1.X->2.0 transition.
> >
> > Another worry with using revision numbering for features is that
> > it does not play well with downstreams.
> > E.g. RHEL might want to cherry-pick a feature without implementing
> > other features that happened to land in the same revision.
> >
> > Also Revision ID is only 8 bit - it's designed for hardware where
> > making a new revision is expensive. In software we'll run out of that
> > eventually.
>
> So Revision ID is a PCI-specific thing, right? Not all transports will
> necessarily have something equivalent, so they would need to depend on
> the feature bit.

They can't do this reliably - for example you might want to move
feature bits around.

For 0.9.X drivers and non-transitional devices, I'd like to find some
hack to make probe fail. Any idea?
But let's plan ahead and add a way to do this in the future if we make an incompatible change again. > > > > > > > > > > > 2.4.1.2 PCI Device Layout > > > > ------------------------- > > > > > > > > -To configure the device, we use the first I/O region of the PCI > > > > -device. This contains a virtio header followed by a > > > > -device-specific region. > > > > +To configure the device, > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device. > > > > +These contain the virtio header registers, the notification register, the > > > > +ISR status register and device specific registers, as specified by Virtio > > > > ++ Structure PCI Capabilities > > > > + > > > > +There may be different widths of accesses to the I/O region; the > > > > +“natural” access method for each field must be > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc). > > > > + > > > > +PCI Device Configuration Layout includes the common configuration, > > > > +ISR, notification and device specific configuration > > > > +structures. > > > > + > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian. > > > > + > > > > + > > > > +2.4.1.2.1 Common configuration structure layout > > > > +------------------------- > > > > +Common configuration structure layout is documented below: > > > > + > > > > +struct virtio_pci_common_cfg { > > > > + /* About the whole device. */ > > > > + __le32 device_feature_select; /* read-write */ > > > > + __le32 device_feature; /* read-only */ > > > > + __le32 guest_feature_select; /* read-write */ > > > > + __le32 guest_feature; /* read-write */ > > > > + __le16 msix_config; /* read-write */ > > > > + __le16 num_queues; /* read-only */ > > > > + __u8 device_status; /* read-write */ > > > > + __u8 unused1; > > > > + > > > > + /* About a specific virtqueue. */ > > > > + __le16 queue_select; /* read-write */ > > > > + __le16 queue_size; /* read-write, power of 2, or 0. 
*/ > > > > + __le16 queue_msix_vector; /* read-write */ > > > > + __le16 queue_enable; /* read-write */ > > > > + __le16 queue_notify_off; /* read-only */ > > > > + __le64 queue_desc; /* read-write */ > > > > + __le64 queue_avail; /* read-write */ > > > > + __le64 queue_used; /* read-write */ > > > > +}; > > > > + > > > > +device_feature_select > > > > + > > > > + Selects which Feature Bits does device_feature field refer to. > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > + All other values cause reads from device_feature to return 0. > > > > + > > > > +device_feature > > > > + > > > > + Used by Device to report Feature Bits to Driver. > > > > + Device Feature Bits selected by device_feature_select. > > > > + > > > > +guest_feature_select > > > > + > > > > + Selects which Feature Bits does guest_feature field refer to. > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > + All other values cause writes to guest_feature to be ignored, > > > > + and reads to return 0. > > > > + > > > > +guest_feature > > > > + > > > > + Used by Driver to acknowledge Feature Bits to Device. > > > > + Guest Feature Bits selected by guest_feature_select. > > > > + > > > > +msix_config > > > > + > > > > + Configuration Vector for MSI-X. > > > > + > > > > +num_queues > > > > + > > > > + Specifies the maximum number of virtqueues supported by device. > > > > + > > > > +device_status > > > > + > > > > + Device Status field. > > > > + > > > > +queue_select > > > > + > > > > + Queue Select. Selects which virtqueue do other fields refer to. > > > > + > > > > +queue_size > > > > + > > > > + Queue Size. On reset, specifies the maximum queue size supported by > > > > + the hypervisor. This can be modified by driver to reduce memory requirements. > > > > + Set to 0 if this virtqueue is unused. > > > > + > > > > +queue_msix_vector > > > > + > > > > + Queue Vector for MSI-X. 
> > > > + > > > > +queue_enable > > > > + > > > > + Used to selectively prevent host from executing requests from this virtqueue. > > > > + 1 - enabled; 0 - disabled > > > > + > > > > +queue_notify_off > > > > + > > > > + Used to calculate the offset from start of Notification structure at > > > > + which this virtqueue is located. > > > > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > > > > + > > > > +queue_desc > > > > + > > > > + Physical address of Descriptor Table. > > > > + > > > > +queue_avail > > > > + > > > > + Physical address of Available Ring. > > > > + > > > > +queue_used > > > > + > > > > + Physical address of Used Ring. > > > > + > > > > + > > > > +2.4.1.2.2 ISR status structure layout > > > > +------------------------- > > > > +ISR status structure includes a single 8-bite ISR status field > > > > > > 8-bit > > > > Right :) > > > > > > + > > > > +2.4.1.2.3 Notification structure layout > > > > +------------------------- > > > > +Notification structure is always a multiple of 2 bytes in size. > > > > +It includes 2-byte Queue Notify fields for each virtqueue of > > > > +the device. Note that multiple virtqueues can use the same > > > > +Queue Notify field, if necessary. > > > > > > Hmm, maybe move this down, so you can have a section which starts with > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all > > > together. > > > > so Move PCI Device Layout to within > > PCI-specific Initialization And Device Operation? > > > > > > + > > > > +2.4.1.2.4 Device specific structure > > > > +------------------------- > > > > + > > > > +Device specific structure is optional. > > > > + > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout > > > > +------------------------- > > > > + > > > > +Transitional devices should present part of configuration > > > > +registers in a legacy configuration structure in BAR0 in the first I/O > > > > +region of the PCI device, as documented below. 
> > > > > > > > There may be different widths of accesses to the I/O region; the > > > > “natural” access method for each field in the virtio header must be > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but > > > > +When accessed through the legacy interface the > > > > device-specific region can be accessed using any width accesses, and > > > > should obtain the same results. > > > > > > > > Note that this is possible because while the virtio header is PCI > > > > -(i.e. little) endian, the device-specific region is encoded in > > > > -the native endian of the guest (where such distinction is > > > > +(i.e. little) endian, when using the legacy interface the device-specific > > > > +region is encoded in the native endian of the guest (where such distinction is > > > > applicable). > > > > > > > > -2.4.1.2.1 PCI Device Virtio Header > > > > ----------------------------------- > > > > > > > > -The virtio header looks as follows: > > > > +When used through the legacy interface, the virtio header looks as follows: > > > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > Bits 32 32 32 16 16 16 8 8 > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows: > > > > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > > > - > > > > If MSI-X is enabled for the device, two additional fields > > > > immediately follow this header:[5] > > > > > > > > @@ -689,25 +932,154 @@ device-specific headers: > > > > > > > > +------------++--------------------+ > > > > > > > > +Note that only Feature Bits 0 to 31 are accessible through the > > > > +Legacy Interface. 
When used through the Legacy Interface, > > > > +Transitional Devices must assume that Feature Bits 32 to 63 > > > > +are not acknowledged by Driver. > > > > + > > > > + > > > > 2.4.1.3 PCI-specific Initialization And Device Operation > > > > -------------------------------------------------------- > > > > > > > > -The page size for a virtqueue on a PCI virtio device is defined as > > > > -4096 bytes. > > > > - > > > > 2.4.1.3.1 Device Initialization > > > > ------------------------------- > > > > > > > > -2.4.1.3.1.1 Queue Vector Configuration > > > > +This documents PCI-specific steps executed during Device Initialization. > > > > +As the first step, driver must detect device configuration layout > > > > +to locate configuration fields in memory,I/O or configuration space of the > > > > +device. > > > > + > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > > > > +------------------------------- > > > > + > > > > +As a prerequisite to device initialization, driver executes a > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio > > > > +Structure PCI capabilities. > > > > + > > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification > > > > +and ISR Status and device configuration structures. > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to > > > > +the function, located beginning at 10h in Configuration Space, > > > > +or accessed though PCI configuration space. > > > > + > > > > +Actual location of each structure is specified using vendor-specific PCI capability located > > > > +on capability list in PCI configuration space of the device. > > > > +This virtio structure capability uses little-endian format; all bits are > > > > +read-only: > > > > + > > > > +struct virtio_pci_cap { > > > > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > > > + __u8 cap_next; /* Generic PCI field: next ptr. 
*/ > > > > + __u8 cap_len; /* Generic PCI field: capability length */ > > > > + __u8 cfg_type; /* Identifies the structure. */ > > > > + __u8 bar; /* Where to find it. */ > > > > + __u8 padding[3];/* Pad to full dword. */ > > > > + __le32 offset; /* Offset within bar. */ > > > > + __le32 length; /* Length of the structure, in bytes. */ > > > > +}; > > > > + > > > > +This structure can optionally followed by extra data, depending on > > > > +other fields, as documented below. > > > > + > > > > +The fields are interpreted as follows: > > > > + > > > > +cap_vndr > > > > + 0x09; Identifies a vendor-specific capability. > > > > + > > > > +cap_next > > > > + Link to next capability in the capability list in the configuration space. > > > > + > > > > +cap_len > > > > + Length of the capability structure, including the whole of > > > > + struct virtio_pci_cap, and extra data if any. > > > > + This length might include padding, or fields unused by the driver. > > > > + > > > > +cfg_type > > > > + identifies the structure, according to the following table. > > > > + > > > > + /* Common configuration */ > > > > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 > > > > + /* Notifications */ > > > > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > > > > + /* ISR Status */ > > > > + #define VIRTIO_PCI_CAP_ISR_CFG 3 > > > > + /* Device specific configuration */ > > > > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > > > + > > > > + More than one capability can identify the same structure - this makes it > > > > + possible for the device to expose multiple interfaces to drivers. The order of > > > > + the capabilities in the capability list specifies the order of preference > > > > + suggested by the device; drivers should use the first interface that they can > > > > + support. For example, on some hypervisors, notifications using IO accesses are > > > > + faster than memory accesses. 
In this case, hypervisor can expose two > > > > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: > > > > + the first one addressing an I/O BAR, the second one addressing a memory BAR. > > > > + Driver will use the I/O BAR if I/O resources are available, and fall back on > > > > + memory BAR when I/O resources are unavailable. > > > > + > > > > +bar > > > > + > > > > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to > > > > + the function located beginning at 10h in Configuration Space > > > > + and used to map the structure into Memory or I/O Space. > > > > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space > > > > + or I/O Space. > > > > + > > > > + The value 0xF specifies that the structure is in PCI configuration space > > > > + inline with this capability structure, following (not necessarily immediately) > > > > + the length field. > > > > > > Why not immediately? > > > Or how would the driver know where it is? > > > > It's at the offset. > > > > E.g. for notification we stick multiplier after length. > > Further, we might extend virtio_pci_cap in the future, > > and we don't want to move stuff around like we > > had to with MSI-X. > > > > > > + > > > > +offset > > > > + indicates where the structure begins relative to the base address associated > > > > + with the BAR. If bar specifies configuration space, offset is relative > > > > + to start of virtio_pci_cap structure. > > > > + > > > > +length > > > > + indicates the length of the structure. > > > > + This size might include padding, or fields unused by the driver. > > > > + Drivers are also recommended to only map part of configuration structure > > > > + large enough for device operation. > > > > + For example, a future device might present a large structure size of several > > > > + MBytes. > > > > + As current devices never utilize structures larger than 4KBytes in size, > > > > + driver can limit the mapped structure size to e.g. 
> > > > + 4KBytes to allow forward compatibility with such devices without loss of > > > > + functionality and without wasting resources. > > > > + > > > > + > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed > > > > +by additional fields: > > > > + > > > > +struct virtio_pci_notify_cap { > > > > + struct virtio_pci_cap cap; > > > > + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ > > > > +}; > > > > + > > > > +notify_off_multiplier > > > > + > > > > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0. > > > > + Value 0x1 is reserved. > > > > + For a given virtqueue, the address to use for notifications is calculated as follows: > > > > + > > > > + queue_notify_off * notify_off_multiplier + offset > > > > + > > > > + If notify_off_multiplier is 0, all virtqueues use the same address in > > > > + the Notifications structure! > > > > + > > > > + > > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection > > > > +------------------------------- > > > > + > > > > +Legacy drivers skipped Device Layout Detection step, assuming legacy > > > > +configuration space in BAR0 in I/O space unconditionally. > > > > + > > > > +2.4.1.3.1.3 Queue Vector Configuration > > > > -------------------------------------- > > > > > > > > When MSI-X capability is present and enabled in the device > > > > -(through standard PCI configuration space) 4 bytes at byte offset > > > > -20 are used to map configuration change and queue interrupts to > > > > -MSI-X vectors. In this case, the ISR Status field is unused, and > > > > -device specific configuration starts at byte offset 24 in virtio > > > > -header structure. When MSI-X capability is not enabled, device > > > > -specific configuration starts at byte offset 20 in virtio header. 
> > > > +(through standard PCI configuration space) Configuration/Queue > > > > +MSI-X Vector registers are used to map configuration change and queue > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused. > > > > > > > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of > > > > Configuration/Queue Vector registers, maps interrupts triggered > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on > > > > failure, NO_VECTOR is returned. If a mapping failure is detected, > > > > the driver can retry mapping with fewervectors, or disable MSI-X. > > > > > > > > -2.4.1.3.1.2 Virtqueue Configuration > > > > +2.4.1.3.1.4 Virtqueue Configuration > > > > ----------------------------------- > > > > > > > > As a device can have zero or more virtqueues for bulk data > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has: > > > > always a power of 2. This controls how big the virtqueue is > > > > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. > > > > > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on > > > > - a 4096 byte alignment. Write the physical address, divided by > > > > - 4096 to the Queue Address field.[6] > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size > > > > + field. > > > > + > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the > > > > + virtqueue in contiguous physical memory. > > > > > > > > 4. Optionally, if MSI-X capability is present and enabled on the > > > > device, select a vector to use to request interrupts triggered > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has: > > > > Queue Vector field: on success, previously written value is > > > > returned; on failure, NO_VECTOR value is returned. 
> > > > > > > > + > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration > > > > +----------------------------------- > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio > > > > +device is defined as 4096 bytes. Driver writes the physical address, divided > > > > +by 4096 to the Queue Address field [6]. > > > > + > > > > 2.4.1.3.2 Notifying The Device > > > > ------------------------------ > > > > > > > > Device notification occurs by writing the 16-bit virtqueue index > > > > -of this virtqueue to the Queue Notify field of the virtio header > > > > -in the first I/O region of the PCI device. > > > > +of this virtqueue to the Queue Notify field. > > > > > > > > 2.4.1.3.3 Receiving Used Buffers From The Device > > > > +------------------------------ > > > > > > > > If an interrupt is necessary: > > > > > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390). > > > > This is only allowed if the driver does not use any features > > > > which would alter this early use of the device. > > > > > > > > -[5] ie. once you enable MSI-X on the device, the other fields move. > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not > > > > +enabled, device specific configuration starts at byte offset 20 in virtio > > > > +header. ie. once you enable MSI-X on the device, the other fields move. > > > > If you turn it off again, they move back! > > > > > > Thanks, > > > Rusty. > > Cornelia
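
For readers following the thread, the two formulas in the quoted patch can be sketched in C: the per-part virtqueue sizes from the table (qsz is the Queue Size field), and the notification address calculation `queue_notify_off * notify_off_multiplier + offset`. This is only an illustration under stated assumptions. The names `vq_part_sizes`, `vq_parts_for` and `vq_notify_offset` are hypothetical and appear neither in the patch nor in any driver, and the result of the second function is assumed to be an offset from the start of the BAR named by the capability.

```c
#include <stdint.h>

/* Hypothetical helper type: byte sizes of the three virtqueue parts.
 * The formulas come from the table in the quoted patch. */
struct vq_part_sizes {
    uint64_t desc;   /* Descriptor Table: 16 * qsz     (alignment 16) */
    uint64_t avail;  /* Available Ring:   6 + 2 * qsz  (alignment 2)  */
    uint64_t used;   /* Used Ring:        6 + 4 * qsz  (alignment 4)  */
};

/* Compute the size of each part for a given Queue Size field value. */
static struct vq_part_sizes vq_parts_for(uint16_t qsz)
{
    struct vq_part_sizes s;

    s.desc  = 16u * (uint64_t)qsz;
    s.avail = 6u + 2u * (uint64_t)qsz;
    s.used  = 6u + 4u * (uint64_t)qsz;
    return s;
}

/* Notification location for one virtqueue, per the quoted formula:
 *   queue_notify_off * notify_off_multiplier + offset
 * queue_notify_off is a per-queue index, not a byte offset.
 * cap_offset stands for the `offset` field of struct virtio_pci_cap.
 * With a multiplier of 0, every queue maps to the same location. */
static uint64_t vq_notify_offset(uint16_t queue_notify_off,
                                 uint32_t notify_off_multiplier,
                                 uint32_t cap_offset)
{
    return (uint64_t)queue_notify_off * notify_off_multiplier + cap_offset;
}
```

For example, with a Queue Size of 256 the three parts come to 4096, 518 and 1030 bytes. With queue_notify_off = 3, notify_off_multiplier = 4 and offset = 0x1000, the notify field sits at byte 4108 of the BAR.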


  • 6.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-27-2013 17:02
    On Tue, 27 Aug 2013 18:36:29 +0300
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote:
    > > Some remarks from my side...
    > >
    > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > > > This is the new configuration layout.
    > > > > >
    > > > > > Notes:
    > > > > > - Everything is LE
    > > > > > - There's a feature bit that means spec 1.0 compliant.
    > > > > > - Both devices and drivers can either require the 1.0 interface
    > > > > >   or try to include compatibility support. The spec isn't forcing
    > > > > >   this decision.
    > > > >
    > > > > Hmm, this kind includes other changes already proposed, like the LE
    > > > > change and the framing change. I think this conceptually splits nicely:
    > > > >
    > > > > 1) Feature bit 32 proposal.
    > > > > 2) Endian change.
    > > > > 3) Framing change.
    > > > > 4) PCI layout change.
    > > >
    > > > Right - they are mostly in different parts of the document.
    > > > I put it all together so it's easy to see how we intend to
    > > > handle the transition.
    > > > So is everyone OK with keeping this in a single patch?
    > >
    > > The new feature bit is supposed to cover all of this, right? Then this
    > > should be one patch.
    > >
    > > >
    > > > > > - I kept documentation of the legacy interface around, and added notes
    > > > > >   on transition inline. They are in separate sections each clearly marked
    > > > > >   "Legacy Interface" so we'll be able to separate them out
    > > > > >   from the final document as necessary - for now I think it's easier
    > > > > >   to keep it all together.
    > > > >
    > > > > Good thinking: most of us know the current spec so it's definitely
    > > > > clearer. And makes sure we're thinking about the transition.
    > > > >
    > > > > > Only virtio PCI has been converted.
> > > > > Let's discuss this on the meeting tonight, once we figure out PCI > > > > > we can do something similar for MMIO and CCW. > > > > > > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows: > > > > > 24 to 31: Feature bits reserved for extensions to the queue and > > > > > feature negotiation mechanisms > > > > > > > > > > + 32: Feature bit must be set for any device compliant with this > > > > > + revision of the specification, and acknowledged by all device drivers. > > > > Would it make sense to have a bit 33 "rings big endian" whose validity > > depends on bit 32 set? This would make it possible for ccw to keep its > > current endianness. > > I didn't go over ccw or MMIO yet - only PCI. > I think ccw registers will just > be explicitly BE, with no need for a feature bit. > Does this sound right? Sure, that would be even better. > > > > > > + > > > > > + 33 to 63: Feature bits reserved for future extensions > > > > > + > > > > > For example, feature bit 0 for a network device (i.e. Subsystem > > > > > Device ID 1) indicates that the device supports checksumming of > > > > > packets. > > > > > > > > Why stop at 63? If we go to a more decentralized feature-assignment > > > > model, we'll run through those very fast. > > > > > > Then we'll just document more, but driver needs to know where to stop > > > looking for features. > > > > > > > > > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are > > > > > indicated by offering a feature bit, so the guest can check > > > > > before accessing that part of the configuration space. > > > > > > > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts > > > > > +-------------------------------------- > > > > > + > > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but > > > > > +different interface between the hypervisor and the guest. 
    > > > > > +Since these are widely deployed in the field, this specification
    > > > > > +accomodates optional features to simplify transition
    > > > > > +from these earlier draft interfaces. Specifically:
    > > > > > +
    > > > > > +Legacy Interface
    > > > > > + is an interface specified by an earlier draft of this specification
    > > > > > + (up to 0.9.X)
    > > > > > +Legacy Device
    > > > > > + is a device implemented before this specification was released,
    > > > > > + and implementing a legacy interface on the host side
    > > > > > +Legacy Driver
    > > > > > + is a driver implemented before this specification was released,
    > > > > > + and implementing a legacy interface on the guest side
    > > > > > +
    > > > > > +to simplify transition from these earlier draft interfaces,
    > > > > > +it is possible to implement
    > > > > > +
    > > > > > +Transitional Device
    > > > > > + a device supporting both drivers conforming to this
    > > > > > + specification, and legacy drivers
    > > > > > +
    > > > > > +Transitional Driver
    > > > > > + a driver supporting both devices conforming to this
    > > > > > + specification, and legacy devices
    > >
    > > What happens to legacy devices in the future? Current implementers
    > > will obviously expose legacy devices, which means future drivers need
    > > to be transitional or they won't work with what is currently out there.
    >
    > You are right. It's a bug in what I wrote: non transitional drivers
    > should work with transitional devices.
    > This way a transitional device can change to non-transitional
    > after drivers are updated.
    >
    > > Will legacy stay around (for the forseeable furture)?
    >
    > That's up to implementers I think as long as they
    > implement the new standard we should not prevent them from
    > bundling in the old virtio, coffee making capabilities etc.
    >
    >
    > > Will legacy
    > > devices still be considered standard compliant (as in "compliant to the
    > > legacy standard")?
    >
    > I don't think they are compliant. We'll split the legacy sections
    > from spec out to a separate transition guide before we release
    > the spec.

    What I'm worried about is probably the transitional nature of this.
    There is a framework we have now, so there will be users - and not on
    all platforms they expect needing to upgrade, especially if traditional
    I/O has always been backwards compatible for decades...

    >
    > > > > > +
    > > > > > +Device and driver that require support for revision 1.0 or newer of
    > > > > > +the specification to function, are called non-transitional device and driver,
    > > > > > +respectively.
    > > > > > +
    > > > > > +Transitional Drivers can detect Legacy Devices by detecting that
    > > > > > +Feature bit 32 is not offered.
    > > > > > +Transitional devices can detect Legacy drivers by detecting that
    > > > > > +Feature bit 32 has not been acknowledged by driver.
    > >
    > > Will we use new feature bits for new, incompatible revisions? Or will
    > > we try to stay backwards compatible?
    >
    > So an incompatible change needs to increment revision ID
    > to prevent drivers from loading.
    > MMIO and PCI both have revision IDs.
    > CCW will need to add something like a revision ID,
    > we discussed this already.

    Command rejects?
    I think it is a good idea to try to stay as compatible as possible;
    this should really be a last measure.

    >
    > > > > > +
    > > > > > +To make them easier to locate, specification sections documenting these
    > > > > > +transitional features all explicitly marked with
    > > > > > +'Legacy Interface' in the section title.
    > > > > > +
    > > > > > +
    > > > > > 2.1.3 Configuration Space
    > > > > > -------------------------
    > > > > >
    > > > > > Configuration space is generally used for rarely-changing or
    > > > > > initialization-time parameters.
    > > > > >
    > > > > > -Note that this space is generally the guest's native endian,
    > > > > > +Note that configuration space generally uses the little-endian format
    > > > > > +for multi-byte fields.
> > > > > + > > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness > > > > > +-------------------------------------- > > > > > + > > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, > > > > > rather than PCI's little-endian. > > > > > > > > > > 2.1.4 Virtqueues > > > > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size > > > > > parameter, which sets the number of entries and implies the total size > > > > > of the queue. > > > > > > > > > > +Each virtqueue consists of three parts: > > > > > + > > > > > + Descriptor Table > > > > > + Available Ring > > > > > + Used Ring > > > > > + > > > > > +where each part is physically-contiguous in guest memory, > > > > > +and has different alignment requirements. > > > > > + > > > > > +The Queue Size field controls the total number of bytes > > > > > +required for each part of the virtqueue. > > > > > + > > > > > +The memory aligment and size requirements, in bytes, of each part of the > > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field): > > > > > + > > > > > ++------------+---------------------------------+ > > > > > + Virtqueue Part Alignment Size > > > > > ++------------+---------------------------------+ > > > > > ++------------+---------------------------------+ > > > > > + Descriptor Table 16 16 * qsz > > > > > ++------------+---------------------------------+ > > > > > + Available Ring 2 6 + 2 * qsz > > > > > ++------------+---------------------------------+ > > > > > + Used Ring 4 6 + 4 * qsz > > > > > ++------------+---------------------------------+ > > > > > + > > > > > +When the driver wants to send a buffer to the device, it fills in > > > > > +a slot in the descriptor table (or chains several together), and > > > > > +writes the descriptor index into the available ring. It then > > > > > +notifies the device. 
When the device has finished a buffer, it > > > > > +writes the descriptor into the used ring, and sends an interrupt. > > > > > + > > > > > + > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > > > > > +-------------------------------------- > > > > > + > > > > > +For Legacy Interfaces, several additional > > > > > +restrictions are placed on the virtqueue layout: > > > > > + > > > > > Each virtqueue occupies two or more physically-contiguous pages > > > > > (usually defined as 4096 bytes, but depending on the transport) > > > > > and consists of three parts: > > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > > > > > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > > > > > } > > > > > > > > > > -This currently wastes some space with padding, but also allows future > > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The > > > > > -virtqueue layout structure looks like this: > > > > > +This wastes some space with padding. > > > > > +The legacy virtqueue layout structure therefore looks like this: > > > > > > > > > > struct vring { > > > > > // The actual descriptors (16 bytes each) > > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > > > > > struct vring_used used; > > > > > }; > > > > > > > > > > -When the driver wants to send a buffer to the device, it fills in > > > > > -a slot in the descriptor table (or chains several together), and > > > > > -writes the descriptor index into the available ring. It then > > > > > -notifies the device. When the device has finished a buffer, it > > > > > -writes the descriptor into the used ring, and sends an interrupt. 
> > > > > - > > > > > -2.1.4.1 A Note on Virtqueue Endianness > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > > > > > -------------------------------------- > > > > > > > > > > Note that the endian of fields and in the virtqueue is the native > > > > > -endian of the guest, not little-endian as PCI normally is. This makes > > > > > -for simpler guest code, and it is assumed that the host already has to > > > > > -be deeply aware of the guest endian so such an “endian-aware” device > > > > > -is not a significant issue. > > > > > +endian of the guest, not little-endian as PCI normally is. > > > > > +It is assumed that the host is already aware of the guest endian. > > > > > > > > > > 2.1.4.2 Message Framing > > > > > ----------------------- > > > > > -The original intent of the specification was that message framing (the > > > > > -particular layout of descriptors) be independent of the contents of > > > > > +Generally, the intent of the specification is for message framing (the > > > > > +particular layout of descriptors) to be independent of the contents of > > > > > the buffers. For example, a network transmit buffer consists of a 12 > > > > > byte header followed by the network packet. This could be most simply > > > > > placed in the descriptor table as a 12 byte output descriptor followed > > > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and > > > > > packet are adjacent, or even three or more descriptors (possibly with > > > > > loss of efficiency in that case). > > > > > > > > > > -Regrettably, initial driver implementations used simple layouts, and > > > > > -devices came to rely on it, despite this specification wording[10]. It > > > > > -is thus recommended that drivers be conservative in their assumptions, > > > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. 
In addition, some > > > > > +In addition, some > > > > > implementations may have large-but-reasonable restrictions on total > > > > > descriptor size (such as based on IOV_MAX in the host OS). This has > > > > > not been a problem in practice: little sympathy will be given to > > > > > drivers which create unreasonably-sized descriptors such as by > > > > > dividing a network packet into 1500 single-byte descriptors! > > > > > > > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing > > > > > +----------------------- > > > > > +Regrettably, initial driver implementations used simple layouts, and > > > > > +devices came to rely on it, despite this specification wording[10]. It > > > > > +is thus recommended that when using legacy interfaces, > > > > > +drivers should be conservative in their assumptions, > > > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. > > > > So ANY_LAYOUT and feature bit 32 are mutually exclusive? > > Hmm. I wonder what gives this impression. > What I tried to say is bit 32 should imply ANY_LAYOUT. Better to spell it out, then. > > > > > > > + > > > > > 2.1.4.3 The Virtqueue Descriptor Table > > > > > -------------------------------------- > > > > > > > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device. > > > > > 2.2.1 Device Initialization > > > > > --------------------------- > > > > > > > > > > -1. Reset the device. This is not required on initial start up. > > > > > +1. Device discovery. This is only required for some transports. > > > > > + > > > > > +2. Reset the device. This is not required on initial start up. > > > > > > > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > > +3. Device layout detection. This is only required for some transports. > > > > > > > > > > -3. The DRIVER status bit is set: we know how to drive the device. > > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > > > > > > > -4. 
Device-specific setup, including reading the device feature > > > > > +5. The DRIVER status bit is set: we know how to drive the device. > > > > > + > > > > > +6. Device-specific setup, including reading the device feature > > > > > bits, discovery of virtqueues for the device, optional per-bus > > > > > setup, and reading and possibly writing the device's virtio > > > > > configuration space. > > > > > > > > > > -5. The subset of device feature bits understood by the driver is > > > > > +7. The subset of device feature bits understood by the driver is > > > > > written to the device. > > > > > > > > > > -6. The DRIVER_OK status bit is set. > > > > > +8. The DRIVER_OK status bit is set. > > > > > > > > > > -7. The device can now be used (ie. buffers added to the > > > > > +9. The device can now be used (ie. buffers added to the > > > > > virtqueues)[4] > > > > > > > > > > If any of these steps go irrecoverably wrong, the guest should > > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > > > > > > > > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > > > > > 0x103F inclusive is a virtio device[3]. The device must also have a > > > > > -Revision ID of 0 to match this specification. > > > > > +Revision ID of 0 or Revision ID of 1 to match this specification. > > > > > > > > > > The Subsystem Device ID indicates which virtio device is > > > > > supported by the device. The Subsystem Vendor ID should reflect > > > > > the PCI Vendor ID of the environment (it's currently only used > > > > > for informational purposes by the guest). > > > > > > > > > > +Drivers must not match devices where Revision ID does not match 0 or 1. > > > > > + > > > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery > > > > > +---------------------------- > > > > > +Transitional devices must have a Revision ID of 0. > > > > > + > > > > > +Non-transitional devices must have a Revision ID of 1. 
    > > > > > +
    > > > > > +Transitional drivers must match a Revision ID of 0 or 1.
    > > > > > +
    > > > > > +Non-transitional drivers must only match a Revision ID of 1.
    > > > > > +
    > > > >
    > > > > I think we should stop abusing Revision IDs, and start using them
    > > > > to reflect device version changes as intended.
    > > > >
    > > > > We could reserve revision id 0 for legacy devices, however, which should
    > > > > work nicely.
    > > >
    > > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
    > > >
    > > > More concerns:
    > > >
    > > > We are using revision ID now exactly as was intended to disable old
    > > > drivers - it served us well for 0.X-1.X and would be as useful if we
    > > > ever have 1.X->2.0 transition.
    > > >
    > > > Another worry with using revision numbering for features is that
    > > > it does not play well with downstreams.
    > > > E.g. RHEL might want to cherry-pick a feature without implementing
    > > > other features that happened to land in the same revision.
    > > >
    > > > Also Revision ID is only 8 bit - it's designed for hardware where
    > > > making a new revision is expensive. In software we'll run out of that
    > > > eventually.
    > >
    > > So Revision ID is a PCI-specific thing, right? Not all transports will
    > > necessarily have something equivalent, so they would need to depend on
    > > the feature bit.
    >
    > They can't do this reliably - for example you might want to move feature
    > bits around.

    That sounds like setting yourself up for problems. If you want to
    deprecate bits, it would be better to define them as "reserved" and use
    a new bit for your new feature. The s390 architecture is full of
    "reserved" bits like that.

    > For 0.9.X drivers and non-transitional devices,
    > I'd like to find some hack to make probe fail.
    >
    > Any idea?

    Not really, sorry.

    >
    > But let's plan ahead and add a way to do this
    > in the future if we make an incompatible change again.

    I'd rather have an architecture that allows us to be backwards
    compatible for a long time and introduce a new device id/cu type for a
    new kind of device if we want to do things differently and ditch old
    baggage.

    >
    > >
    > > >
    > > > > >
    > > > > > 2.4.1.2 PCI Device Layout
    > > > > > -------------------------
    > > > > >
    > > > > > -To configure the device, we use the first I/O region of the PCI
    > > > > > -device. This contains a virtio header followed by a
    > > > > > -device-specific region.
    > > > > > +To configure the device,
    > > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
    > > > > > +These contain the virtio header registers, the notification register, the
    > > > > > +ISR status register and device specific registers, as specified by Virtio
    > > > > > ++ Structure PCI Capabilities
    > > > > > +
    > > > > > +There may be different widths of accesses to the I/O region; the
    > > > > > +“natural” access method for each field must be
    > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc).
    > > > > > +
    > > > > > +PCI Device Configuration Layout includes the common configuration,
    > > > > > +ISR, notification and device specific configuration
    > > > > > +structures.
    > > > > > +
    > > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
    > > > > > +
    > > > > > +
    > > > > > +2.4.1.2.1 Common configuration structure layout
    > > > > > +-------------------------
    > > > > > +Common configuration structure layout is documented below:
    > > > > > +
    > > > > > +struct virtio_pci_common_cfg {
    > > > > > + /* About the whole device.
*/ > > > > > + __le32 device_feature_select; /* read-write */ > > > > > + __le32 device_feature; /* read-only */ > > > > > + __le32 guest_feature_select; /* read-write */ > > > > > + __le32 guest_feature; /* read-write */ > > > > > + __le16 msix_config; /* read-write */ > > > > > + __le16 num_queues; /* read-only */ > > > > > + __u8 device_status; /* read-write */ > > > > > + __u8 unused1; > > > > > + > > > > > + /* About a specific virtqueue. */ > > > > > + __le16 queue_select; /* read-write */ > > > > > + __le16 queue_size; /* read-write, power of 2, or 0. */ > > > > > + __le16 queue_msix_vector; /* read-write */ > > > > > + __le16 queue_enable; /* read-write */ > > > > > + __le16 queue_notify_off; /* read-only */ > > > > > + __le64 queue_desc; /* read-write */ > > > > > + __le64 queue_avail; /* read-write */ > > > > > + __le64 queue_used; /* read-write */ > > > > > +}; > > > > > + > > > > > +device_feature_select > > > > > + > > > > > + Selects which Feature Bits does device_feature field refer to. > > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > > + All other values cause reads from device_feature to return 0. > > > > > + > > > > > +device_feature > > > > > + > > > > > + Used by Device to report Feature Bits to Driver. > > > > > + Device Feature Bits selected by device_feature_select. > > > > > + > > > > > +guest_feature_select > > > > > + > > > > > + Selects which Feature Bits does guest_feature field refer to. > > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > > + All other values cause writes to guest_feature to be ignored, > > > > > + and reads to return 0. > > > > > + > > > > > +guest_feature > > > > > + > > > > > + Used by Driver to acknowledge Feature Bits to Device. > > > > > + Guest Feature Bits selected by guest_feature_select. > > > > > + > > > > > +msix_config > > > > > + > > > > > + Configuration Vector for MSI-X. 
> > > > > + > > > > > +num_queues > > > > > + > > > > > + Specifies the maximum number of virtqueues supported by device. > > > > > + > > > > > +device_status > > > > > + > > > > > + Device Status field. > > > > > + > > > > > +queue_select > > > > > + > > > > > + Queue Select. Selects which virtqueue do other fields refer to. > > > > > + > > > > > +queue_size > > > > > + > > > > > + Queue Size. On reset, specifies the maximum queue size supported by > > > > > + the hypervisor. This can be modified by driver to reduce memory requirements. > > > > > + Set to 0 if this virtqueue is unused. > > > > > + > > > > > +queue_msix_vector > > > > > + > > > > > + Queue Vector for MSI-X. > > > > > + > > > > > +queue_enable > > > > > + > > > > > + Used to selectively prevent host from executing requests from this virtqueue. > > > > > + 1 - enabled; 0 - disabled > > > > > + > > > > > +queue_notify_off > > > > > + > > > > > + Used to calculate the offset from start of Notification structure at > > > > > + which this virtqueue is located. > > > > > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > > > > > + > > > > > +queue_desc > > > > > + > > > > > + Physical address of Descriptor Table. > > > > > + > > > > > +queue_avail > > > > > + > > > > > + Physical address of Available Ring. > > > > > + > > > > > +queue_used > > > > > + > > > > > + Physical address of Used Ring. > > > > > + > > > > > + > > > > > +2.4.1.2.2 ISR status structure layout > > > > > +------------------------- > > > > > +ISR status structure includes a single 8-bite ISR status field > > > > > > > > 8-bit > > > > > > Right :) > > > > > > > > + > > > > > +2.4.1.2.3 Notification structure layout > > > > > +------------------------- > > > > > +Notification structure is always a multiple of 2 bytes in size. > > > > > +It includes 2-byte Queue Notify fields for each virtqueue of > > > > > +the device. Note that multiple virtqueues can use the same > > > > > +Queue Notify field, if necessary. 
> > > > > > > > Hmm, maybe move this down, so you can have a section which starts with > > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all > > > > together. > > > > > > so Move PCI Device Layout to within > > > PCI-specific Initialization And Device Operation? > > > > > > > > + > > > > > +2.4.1.2.4 Device specific structure > > > > > +------------------------- > > > > > + > > > > > +Device specific structure is optional. > > > > > + > > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout > > > > > +------------------------- > > > > > + > > > > > +Transitional devices should present part of configuration > > > > > +registers in a legacy configuration structure in BAR0 in the first I/O > > > > > +region of the PCI device, as documented below. > > > > > > > > > > There may be different widths of accesses to the I/O region; the > > > > > “natural” access method for each field in the virtio header must be > > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but > > > > > +When accessed through the legacy interface the > > > > > device-specific region can be accessed using any width accesses, and > > > > > should obtain the same results. > > > > > > > > > > Note that this is possible because while the virtio header is PCI > > > > > -(i.e. little) endian, the device-specific region is encoded in > > > > > -the native endian of the guest (where such distinction is > > > > > +(i.e. little) endian, when using the legacy interface the device-specific > > > > > +region is encoded in the native endian of the guest (where such distinction is > > > > > applicable). 
> > > > > > > > > > -2.4.1.2.1 PCI Device Virtio Header > > > > > ----------------------------------- > > > > > > > > > > -The virtio header looks as follows: > > > > > +When used through the legacy interface, the virtio header looks as follows: > > > > > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > Bits 32 32 32 16 16 16 8 8 > > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows: > > > > > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > > > > > - > > > > > If MSI-X is enabled for the device, two additional fields > > > > > immediately follow this header:[5] > > > > > > > > > > @@ -689,25 +932,154 @@ device-specific headers: > > > > > > > > > > +------------++--------------------+ > > > > > > > > > > +Note that only Feature Bits 0 to 31 are accessible through the > > > > > +Legacy Interface. When used through the Legacy Interface, > > > > > +Transitional Devices must assume that Feature Bits 32 to 63 > > > > > +are not acknowledged by Driver. > > > > > + > > > > > + > > > > > 2.4.1.3 PCI-specific Initialization And Device Operation > > > > > -------------------------------------------------------- > > > > > > > > > > -The page size for a virtqueue on a PCI virtio device is defined as > > > > > -4096 bytes. > > > > > - > > > > > 2.4.1.3.1 Device Initialization > > > > > ------------------------------- > > > > > > > > > > -2.4.1.3.1.1 Queue Vector Configuration > > > > > +This documents PCI-specific steps executed during Device Initialization. > > > > > +As the first step, driver must detect device configuration layout > > > > > +to locate configuration fields in memory,I/O or configuration space of the > > > > > +device. 
> > > > > + > > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > > > > > +------------------------------- > > > > > + > > > > > +As a prerequisite to device initialization, driver executes a > > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio > > > > > +Structure PCI capabilities. > > > > > + > > > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification > > > > > +and ISR Status and device configuration structures. > > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to > > > > > +the function, located beginning at 10h in Configuration Space, > > > > > +or accessed though PCI configuration space. > > > > > + > > > > > +Actual location of each structure is specified using vendor-specific PCI capability located > > > > > +on capability list in PCI configuration space of the device. > > > > > +This virtio structure capability uses little-endian format; all bits are > > > > > +read-only: > > > > > + > > > > > +struct virtio_pci_cap { > > > > > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > > > > + __u8 cap_next; /* Generic PCI field: next ptr. */ > > > > > + __u8 cap_len; /* Generic PCI field: capability length */ > > > > > + __u8 cfg_type; /* Identifies the structure. */ > > > > > + __u8 bar; /* Where to find it. */ > > > > > + __u8 padding[3];/* Pad to full dword. */ > > > > > + __le32 offset; /* Offset within bar. */ > > > > > + __le32 length; /* Length of the structure, in bytes. */ > > > > > +}; > > > > > + > > > > > +This structure can optionally followed by extra data, depending on > > > > > +other fields, as documented below. > > > > > + > > > > > +The fields are interpreted as follows: > > > > > + > > > > > +cap_vndr > > > > > + 0x09; Identifies a vendor-specific capability. > > > > > + > > > > > +cap_next > > > > > + Link to next capability in the capability list in the configuration space. 
> > > > > + > > > > > +cap_len > > > > > + Length of the capability structure, including the whole of > > > > > + struct virtio_pci_cap, and extra data if any. > > > > > + This length might include padding, or fields unused by the driver. > > > > > + > > > > > +cfg_type > > > > > + identifies the structure, according to the following table. > > > > > + > > > > > + /* Common configuration */ > > > > > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 > > > > > + /* Notifications */ > > > > > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > > > > > + /* ISR Status */ > > > > > + #define VIRTIO_PCI_CAP_ISR_CFG 3 > > > > > + /* Device specific configuration */ > > > > > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > > > > + > > > > > + More than one capability can identify the same structure - this makes it > > > > > + possible for the device to expose multiple interfaces to drivers. The order of > > > > > + the capabilities in the capability list specifies the order of preference > > > > > + suggested by the device; drivers should use the first interface that they can > > > > > + support. For example, on some hypervisors, notifications using IO accesses are > > > > > + faster than memory accesses. In this case, hypervisor can expose two > > > > > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: > > > > > + the first one addressing an I/O BAR, the second one addressing a memory BAR. > > > > > + Driver will use the I/O BAR if I/O resources are available, and fall back on > > > > > + memory BAR when I/O resources are unavailable. > > > > > + > > > > > +bar > > > > > + > > > > > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to > > > > > + the function located beginning at 10h in Configuration Space > > > > > + and used to map the structure into Memory or I/O Space. > > > > > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space > > > > > + or I/O Space. 
> > > > > + > > > > > + The value 0xF specifies that the structure is in PCI configuration space > > > > > + inline with this capability structure, following (not necessarily immediately) > > > > > + the length field. > > > > > > > > Why not immediately? > > > > Or how would the driver know where it is? > > > > > > It's at the offset. > > > > > > E.g. for notification we stick multiplier after length. > > > Further, we might extend virtio_pci_cap in the future, > > > and we don't want to move stuff around like we > > > had to with MSI-X. > > > > > > > > + > > > > > +offset > > > > > + indicates where the structure begins relative to the base address associated > > > > > + with the BAR. If bar specifies configuration space, offset is relative > > > > > + to start of virtio_pci_cap structure. > > > > > + > > > > > +length > > > > > + indicates the length of the structure. > > > > > + This size might include padding, or fields unused by the driver. > > > > > + Drivers are also recommended to only map part of configuration structure > > > > > + large enough for device operation. > > > > > + For example, a future device might present a large structure size of several > > > > > + MBytes. > > > > > + As current devices never utilize structures larger than 4KBytes in size, > > > > > + driver can limit the mapped structure size to e.g. > > > > > + 4KBytes to allow forward compatibility with such devices without loss of > > > > > + functionality and without wasting resources. > > > > > + > > > > > + > > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed > > > > > +by additional fields: > > > > > + > > > > > +struct virtio_pci_notify_cap { > > > > > + struct virtio_pci_cap cap; > > > > > + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ > > > > > +}; > > > > > + > > > > > +notify_off_multiplier > > > > > + > > > > > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0. 
> > > > > + Value 0x1 is reserved. > > > > > + For a given virtqueue, the address to use for notifications is calculated as follows: > > > > > + > > > > > + queue_notify_off * notify_off_multiplier + offset > > > > > + > > > > > + If notify_off_multiplier is 0, all virtqueues use the same address in > > > > > + the Notifications structure! > > > > > + > > > > > + > > > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection > > > > > +------------------------------- > > > > > + > > > > > +Legacy drivers skipped Device Layout Detection step, assuming legacy > > > > > +configuration space in BAR0 in I/O space unconditionally. > > > > > + > > > > > +2.4.1.3.1.3 Queue Vector Configuration > > > > > -------------------------------------- > > > > > > > > > > When MSI-X capability is present and enabled in the device > > > > > -(through standard PCI configuration space) 4 bytes at byte offset > > > > > -20 are used to map configuration change and queue interrupts to > > > > > -MSI-X vectors. In this case, the ISR Status field is unused, and > > > > > -device specific configuration starts at byte offset 24 in virtio > > > > > -header structure. When MSI-X capability is not enabled, device > > > > > -specific configuration starts at byte offset 20 in virtio header. > > > > > +(through standard PCI configuration space) Configuration/Queue > > > > > +MSI-X Vector registers are used to map configuration change and queue > > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused. > > > > > > > > > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of > > > > > Configuration/Queue Vector registers, maps interrupts triggered > > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on > > > > > failure, NO_VECTOR is returned. If a mapping failure is detected, > > > > > the driver can retry mapping with fewervectors, or disable MSI-X. 
> > > > > > > > > > -2.4.1.3.1.2 Virtqueue Configuration > > > > > +2.4.1.3.1.4 Virtqueue Configuration > > > > > ----------------------------------- > > > > > > > > > > As a device can have zero or more virtqueues for bulk data > > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has: > > > > > always a power of 2. This controls how big the virtqueue is > > > > > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. > > > > > > > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on > > > > > - a 4096 byte alignment. Write the physical address, divided by > > > > > - 4096 to the Queue Address field.[6] > > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size > > > > > + field. > > > > > + > > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the > > > > > + virtqueue in contiguous physical memory. > > > > > > > > > > 4. Optionally, if MSI-X capability is present and enabled on the > > > > > device, select a vector to use to request interrupts triggered > > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has: > > > > > Queue Vector field: on success, previously written value is > > > > > returned; on failure, NO_VECTOR value is returned. > > > > > > > > > > + > > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration > > > > > +----------------------------------- > > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio > > > > > +device is defined as 4096 bytes. Driver writes the physical address, divided > > > > > +by 4096 to the Queue Address field [6]. 
    > > > > > +
    > > > > > 2.4.1.3.2 Notifying The Device
    > > > > > ------------------------------
    > > > > >
    > > > > > Device notification occurs by writing the 16-bit virtqueue index
    > > > > > -of this virtqueue to the Queue Notify field of the virtio header
    > > > > > -in the first I/O region of the PCI device.
    > > > > > +of this virtqueue to the Queue Notify field.
    > > > > >
    > > > > > 2.4.1.3.3 Receiving Used Buffers From The Device
    > > > > > +------------------------------
    > > > > >
    > > > > > If an interrupt is necessary:
    > > > > >
    > > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
    > > > > > This is only allowed if the driver does not use any features
    > > > > > which would alter this early use of the device.
    > > > > >
    > > > > > -[5] ie. once you enable MSI-X on the device, the other fields move.
    > > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at
    > > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not
    > > > > > +enabled, device specific configuration starts at byte offset 20 in virtio
    > > > > > +header. ie. once you enable MSI-X on the device, the other fields move.
    > > > > > If you turn it off again, they move back!
    > > > >
    > > > > Thanks,
    > > > > Rusty.
    > >
    > > Cornelia
    > >
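
For readers following the quoted "Virtio Device Configuration Layout Detection" text, the capability scan it describes could look roughly like the sketch below. This is only an illustration under stated assumptions, not code from the patch: it works on a 256-byte snapshot of PCI configuration space instead of real config accesses, and the helper names (find_virtio_cap, virtio_cap_offset, read_le32) are hypothetical. The byte positions of cfg_type and offset follow struct virtio_pci_cap as quoted above; the capability pointer at 0x34 and the vendor-specific capability ID 0x09 are standard PCI. Returning the first match mirrors the "use the first interface that they can support" wording.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CFG_SPACE_SIZE        256
#define PCI_CAPABILITY_LIST       0x34  /* standard PCI: pointer to first capability */
#define PCI_CAP_ID_VNDR           0x09  /* vendor-specific capability */
#define VIRTIO_PCI_CAP_SIZE       16    /* sizeof(struct virtio_pci_cap) as quoted */

#define VIRTIO_PCI_CAP_COMMON_CFG 1
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
#define VIRTIO_PCI_CAP_ISR_CFG    3
#define VIRTIO_PCI_CAP_DEVICE_CFG 4

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Hypothetical helper: return the config-space position of the first
 * vendor-specific capability whose cfg_type matches, or -1 if none. */
static int find_virtio_cap(const uint8_t cfg[PCI_CFG_SPACE_SIZE], uint8_t cfg_type)
{
    unsigned pos = cfg[PCI_CAPABILITY_LIST] & 0xFCu;
    int guard = 48; /* bound the walk in case the list loops */

    while (pos >= 0x40 && guard-- > 0) {
        if (cfg[pos] == PCI_CAP_ID_VNDR &&
            pos + VIRTIO_PCI_CAP_SIZE <= PCI_CFG_SPACE_SIZE &&
            cfg[pos + 3] == cfg_type)      /* cfg_type is byte 3 of the cap */
            return (int)pos;
        pos = cfg[pos + 1] & 0xFCu;        /* cap_next */
    }
    return -1;
}

/* Hypothetical helper: the "offset" field sits 8 bytes into the cap. */
static uint32_t virtio_cap_offset(const uint8_t cfg[PCI_CFG_SPACE_SIZE], int pos)
{
    return read_le32(cfg + pos + 8);
}
```

A driver would then read the bar field (byte 4 of the capability) to decide whether the structure is mapped through a BAR or, for value 0xF in this draft, lives in configuration space itself.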


  • 7.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-27-2013 17:17
    On Tue, Aug 27, 2013 at 07:01:23PM +0200, Cornelia Huck wrote:
    > On Tue, 27 Aug 2013 18:36:29 +0300
    > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    >
    > > On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote:
    > > > Some remarks from my side...
    > > >
    > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > >
    > > > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > > > > This is the new configuration layout.
    > > > > > >
    > > > > > > Notes:
    > > > > > > - Everything is LE
    > > > > > > - There's a feature bit that means spec 1.0 compliant.
    > > > > > > - Both devices and drivers can either require the 1.0 interface
    > > > > > > or try to include compatibility support. The spec isn't forcing
    > > > > > > this decision.
    > > > > >
    > > > > > Hmm, this kind includes other changes already proposed, like the LE
    > > > > > change and the framing change. I think this conceptually splits nicely:
    > > > > >
    > > > > > 1) Feature bit 32 proposal.
    > > > > > 2) Endian change.
    > > > > > 3) Framing change.
    > > > > > 4) PCI layout change.
    > > > >
    > > > > Right - they are mostly in different parts of the document.
    > > > > I put it all together so it's easy to see how we intend to
    > > > > handle the transition.
    > > > > So is everyone OK with keeping this in a single patch?
    > > >
    > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > should be one patch.
    > > >
    > > > >
    > > > > > > - I kept documentation of the legacy interface around, and added notes
    > > > > > > on transition inline. They are in separate sections each clearly marked
    > > > > > > "Legacy Interface" so we'll be able to separate them out
    > > > > > > from the final document as necessary - for now I think it's easier
    > > > > > > to keep it all together.
    > > > > >
    > > > > > Good thinking: most of us know the current spec so it's definitely
    > > > > > clearer. And makes sure we're thinking about the transition.
    > > > > >
    > > > > > > Only virtio PCI has been converted.
    > > > > > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > > > > > we can do something similar for MMIO and CCW.
    > > > > >
    > > > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > > > > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > > > > > feature negotiation mechanisms
    > > > > > >
    > > > > > > + 32: Feature bit must be set for any device compliant with this
    > > > > > > + revision of the specification, and acknowledged by all device drivers.
    > > >
    > > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > > current endianness.
    > >
    > > I didn't go over ccw or MMIO yet - only PCI.
    > > I think ccw registers will just
    > > be explicitly BE, with no need for a feature bit.
    > > Does this sound right?
    >
    > Sure, that would be even better.
    >
    > >
    > > > > > > +
    > > > > > > + 33 to 63: Feature bits reserved for future extensions
    > > > > > > +
    > > > > > > For example, feature bit 0 for a network device (i.e. Subsystem
    > > > > > > Device ID 1) indicates that the device supports checksumming of
    > > > > > > packets.
    > > > > >
    > > > > > Why stop at 63? If we go to a more decentralized feature-assignment
    > > > > > model, we'll run through those very fast.
    > > > >
    > > > > Then we'll just document more, but driver needs to know where to stop
    > > > > looking for features.
    > > > >
    > > > > >
    > > > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
    > > > > > > indicated by offering a feature bit, so the guest can check
    > > > > > > before accessing that part of the configuration space.
    > > > > > >
    > > > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
    > > > > > > +--------------------------------------
    > > > > > > +
    > > > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
    > > > > > > +different interface between the hypervisor and the guest.
    > > > > > > +Since these are widely deployed in the field, this specification
    > > > > > > +accommodates optional features to simplify transition
    > > > > > > +from these earlier draft interfaces. Specifically:
    > > > > > > +
    > > > > > > +Legacy Interface
    > > > > > > +    is an interface specified by an earlier draft of this specification
    > > > > > > +    (up to 0.9.X)
    > > > > > > +Legacy Device
    > > > > > > +    is a device implemented before this specification was released,
    > > > > > > +    and implementing a legacy interface on the host side
    > > > > > > +Legacy Driver
    > > > > > > +    is a driver implemented before this specification was released,
    > > > > > > +    and implementing a legacy interface on the guest side
    > > > > > > +
    > > > > > > +to simplify transition from these earlier draft interfaces,
    > > > > > > +it is possible to implement
    > > > > > > +
    > > > > > > +Transitional Device
    > > > > > > +    a device supporting both drivers conforming to this
    > > > > > > +    specification, and legacy drivers
    > > > > > > +
    > > > > > > +Transitional Driver
    > > > > > > +    a driver supporting both devices conforming to this
    > > > > > > +    specification, and legacy devices
    > > >
    > > > What happens to legacy devices in the future? Current implementers
    > > > will obviously expose legacy devices, which means future drivers need
    > > > to be transitional or they won't work with what is currently out there.
    > >
    > > You are right. It's a bug in what I wrote: non transitional drivers
    > > should work with transitional devices.
    > > This way a transitional device can change to non-transitional
    > > after drivers are updated.
    > >
    > > > Will legacy stay around (for the foreseeable future)?
    > >
    > > That's up to implementers I think as long as they
    > > implement the new standard we should not prevent them from
    > > bundling in the old virtio, coffee making capabilities etc.
    > >
    > >
    > > > Will legacy
    > > > devices still be considered standard compliant (as in "compliant to the
    > > > legacy standard")?
    > >
    > > I don't think they are compliant. We'll split the legacy sections
    > > from spec out to a separate transition guide before we release
    > > the spec.
    >
    > What I'm worried about is probably the transitional nature of this.
    > There is a framework we have now, so there will be users - and not on
    > all platforms they expect needing to upgrade, especially if traditional
    > I/O has always been backwards compatible for decades...

    I'm not sure I understand the suggestion.
    You want us to push devices harder to implement legacy interfaces?
    You want us to push drivers harder to switch to new interfaces?
    The proposal is basically trying hard to supply a mechanism,
    not force a policy.

    > >
    > > > > > > +
    > > > > > > +Device and driver that require support for revision 1.0 or newer of
    > > > > > > +the specification to function, are called non-transitional device and driver,
    > > > > > > +respectively.
    > > > > > > +
    > > > > > > +Transitional Drivers can detect Legacy Devices by detecting that
    > > > > > > +Feature bit 32 is not offered.
    > > > > > > +Transitional devices can detect Legacy drivers by detecting that
    > > > > > > +Feature bit 32 has not been acknowledged by driver.
    > > >
    > > > Will we use new feature bits for new, incompatible revisions? Or will
    > > > we try to stay backwards compatible?
    > >
    > > So an incompatible change needs to increment revision ID
    > > to prevent drivers from loading.
    > > MMIO and PCI both have revision IDs.
    > > CCW will need to add something like a revision ID,
    > > we discussed this already.
    >
    > Command rejects?

    Which command would you reject?

    > I think it is a good idea to try to stay as compatible as possible;
    > this should really be a last measure.

    Again, I think that at some point, e.g. 10-15 years in the future,
    devices will want to say "I require new drivers and that's it".
    I think it's useful to have a mechanism for this, so old drivers
    fail gracefully.

    > >
    > > > > > > +
    > > > > > > +To make them easier to locate, specification sections documenting these
    > > > > > > +transitional features all explicitly marked with
    > > > > > > +'Legacy Interface' in the section title.
    > > > > > > +
    > > > > > > +
    > > > > > > 2.1.3 Configuration Space
    > > > > > > -------------------------
    > > > > > >
    > > > > > > Configuration space is generally used for rarely-changing or
    > > > > > > initialization-time parameters.
    > > > > > >
    > > > > > > -Note that this space is generally the guest's native endian,
    > > > > > > +Note that configuration space generally uses the little-endian format
    > > > > > > +for multi-byte fields.
    > > > > > > +
    > > > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness
    > > > > > > +--------------------------------------
    > > > > > > +
    > > > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian,
    > > > > > > rather than PCI's little-endian.
    > > > > > >
    > > > > > > 2.1.4 Virtqueues
    > > > > > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size
    > > > > > > parameter, which sets the number of entries and implies the total size
    > > > > > > of the queue.
    > > > > > >
    > > > > > > +Each virtqueue consists of three parts:
    > > > > > > +
    > > > > > > +    Descriptor Table
    > > > > > > +    Available Ring
    > > > > > > +    Used Ring
    > > > > > > +
    > > > > > > +where each part is physically-contiguous in guest memory,
    > > > > > > +and has different alignment requirements.
    > > > > > > +
    > > > > > > +The Queue Size field controls the total number of bytes
    > > > > > > +required for each part of the virtqueue.
> > > > > > + > > > > > > +The memory aligment and size requirements, in bytes, of each part of the > > > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field): > > > > > > + > > > > > > ++------------+---------------------------------+ > > > > > > + Virtqueue Part Alignment Size > > > > > > ++------------+---------------------------------+ > > > > > > ++------------+---------------------------------+ > > > > > > + Descriptor Table 16 16 * qsz > > > > > > ++------------+---------------------------------+ > > > > > > + Available Ring 2 6 + 2 * qsz > > > > > > ++------------+---------------------------------+ > > > > > > + Used Ring 4 6 + 4 * qsz > > > > > > ++------------+---------------------------------+ > > > > > > + > > > > > > +When the driver wants to send a buffer to the device, it fills in > > > > > > +a slot in the descriptor table (or chains several together), and > > > > > > +writes the descriptor index into the available ring. It then > > > > > > +notifies the device. When the device has finished a buffer, it > > > > > > +writes the descriptor into the used ring, and sends an interrupt. > > > > > > + > > > > > > + > > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > > > > > > +-------------------------------------- > > > > > > + > > > > > > +For Legacy Interfaces, several additional > > > > > > +restrictions are placed on the virtqueue layout: > > > > > > + > > > > > > Each virtqueue occupies two or more physically-contiguous pages > > > > > > (usually defined as 4096 bytes, but depending on the transport) > > > > > > and consists of three parts: > > > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > > > > > > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > > > > > > } > > > > > > > > > > > > -This currently wastes some space with padding, but also allows future > > > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. 
The > > > > > > -virtqueue layout structure looks like this: > > > > > > +This wastes some space with padding. > > > > > > +The legacy virtqueue layout structure therefore looks like this: > > > > > > > > > > > > struct vring { > > > > > > // The actual descriptors (16 bytes each) > > > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > > > > > > struct vring_used used; > > > > > > }; > > > > > > > > > > > > -When the driver wants to send a buffer to the device, it fills in > > > > > > -a slot in the descriptor table (or chains several together), and > > > > > > -writes the descriptor index into the available ring. It then > > > > > > -notifies the device. When the device has finished a buffer, it > > > > > > -writes the descriptor into the used ring, and sends an interrupt. > > > > > > - > > > > > > -2.1.4.1 A Note on Virtqueue Endianness > > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > > > > > > -------------------------------------- > > > > > > > > > > > > Note that the endian of fields and in the virtqueue is the native > > > > > > -endian of the guest, not little-endian as PCI normally is. This makes > > > > > > -for simpler guest code, and it is assumed that the host already has to > > > > > > -be deeply aware of the guest endian so such an “endian-aware” device > > > > > > -is not a significant issue. > > > > > > +endian of the guest, not little-endian as PCI normally is. > > > > > > +It is assumed that the host is already aware of the guest endian. > > > > > > > > > > > > 2.1.4.2 Message Framing > > > > > > ----------------------- > > > > > > -The original intent of the specification was that message framing (the > > > > > > -particular layout of descriptors) be independent of the contents of > > > > > > +Generally, the intent of the specification is for message framing (the > > > > > > +particular layout of descriptors) to be independent of the contents of > > > > > > the buffers. 
For example, a network transmit buffer consists of a 12 > > > > > > byte header followed by the network packet. This could be most simply > > > > > > placed in the descriptor table as a 12 byte output descriptor followed > > > > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and > > > > > > packet are adjacent, or even three or more descriptors (possibly with > > > > > > loss of efficiency in that case). > > > > > > > > > > > > -Regrettably, initial driver implementations used simple layouts, and > > > > > > -devices came to rely on it, despite this specification wording[10]. It > > > > > > -is thus recommended that drivers be conservative in their assumptions, > > > > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some > > > > > > +In addition, some > > > > > > implementations may have large-but-reasonable restrictions on total > > > > > > descriptor size (such as based on IOV_MAX in the host OS). This has > > > > > > not been a problem in practice: little sympathy will be given to > > > > > > drivers which create unreasonably-sized descriptors such as by > > > > > > dividing a network packet into 1500 single-byte descriptors! > > > > > > > > > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing > > > > > > +----------------------- > > > > > > +Regrettably, initial driver implementations used simple layouts, and > > > > > > +devices came to rely on it, despite this specification wording[10]. It > > > > > > +is thus recommended that when using legacy interfaces, > > > > > > +drivers should be conservative in their assumptions, > > > > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. > > > > > > So ANY_LAYOUT and feature bit 32 are mutually exclusive? > > > > Hmm. I wonder what gives this impression. > > What I tried to say is bit 32 should imply ANY_LAYOUT. > > Better to spell it out, then. 

    Well it says (in unchanged text)

        Generally, the intent of the specification is for message framing (the
        particular layout of descriptors) to be independent of the contents of
        the buffers.

    how would you make it clearer?

    > >
    > >
    > > > > > > +
    > > > > > > 2.1.4.3 The Virtqueue Descriptor Table
    > > > > > > --------------------------------------
    > > > > > >
    > > > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device.
    > > > > > > 2.2.1 Device Initialization
    > > > > > > ---------------------------
    > > > > > >
    > > > > > > -1. Reset the device. This is not required on initial start up.
    > > > > > > +1. Device discovery. This is only required for some transports.
    > > > > > > +
    > > > > > > +2. Reset the device. This is not required on initial start up.
    > > > > > >
    > > > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device.
    > > > > > > +3. Device layout detection. This is only required for some transports.
    > > > > > >
    > > > > > > -3. The DRIVER status bit is set: we know how to drive the device.
    > > > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device.
    > > > > > >
    > > > > > > -4. Device-specific setup, including reading the device feature
    > > > > > > +5. The DRIVER status bit is set: we know how to drive the device.
    > > > > > > +
    > > > > > > +6. Device-specific setup, including reading the device feature
    > > > > > > bits, discovery of virtqueues for the device, optional per-bus
    > > > > > > setup, and reading and possibly writing the device's virtio
    > > > > > > configuration space.
    > > > > > >
    > > > > > > -5. The subset of device feature bits understood by the driver is
    > > > > > > +7. The subset of device feature bits understood by the driver is
    > > > > > > written to the device.
    > > > > > >
    > > > > > > -6. The DRIVER_OK status bit is set.
    > > > > > > +8. The DRIVER_OK status bit is set.
    > > > > > >
    > > > > > > -7. The device can now be used (ie. buffers added to the
    > > > > > > +9. The device can now be used (ie.
buffers added to the > > > > > > virtqueues)[4] > > > > > > > > > > > > If any of these steps go irrecoverably wrong, the guest should > > > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > > > > > > > > > > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > > > > > > 0x103F inclusive is a virtio device[3]. The device must also have a > > > > > > -Revision ID of 0 to match this specification. > > > > > > +Revision ID of 0 or Revision ID of 1 to match this specification. > > > > > > > > > > > > The Subsystem Device ID indicates which virtio device is > > > > > > supported by the device. The Subsystem Vendor ID should reflect > > > > > > the PCI Vendor ID of the environment (it's currently only used > > > > > > for informational purposes by the guest). > > > > > > > > > > > > +Drivers must not match devices where Revision ID does not match 0 or 1. > > > > > > + > > > > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery > > > > > > +---------------------------- > > > > > > +Transitional devices must have a Revision ID of 0. > > > > > > + > > > > > > +Non-transitional devices must have a Revision ID of 1. > > > > > > + > > > > > > +Transitional drivers must match a Revision ID of 0 or 1. > > > > > > + > > > > > > +Non-transitional drivers must only match a Revision ID of 1. > > > > > > + > > > > > > > > > > I think we should stop abusing Revision IDs, and start using them > > > > > to reflect device version changes as intended. > > > > > > > > > > We could reserve revision id 0 for legacy devices, however, which should > > > > > work nicely. > > > > > > > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply? > > > > > > > > More concerns: > > > > > > > > We are using revision ID now exactly as was intended to disable old > > > > drivers - it served us well for 0.X-1.X and would be as useful if we > > > > ever have 1.X->2.0 transition. 
    > > > >
    > > > > Another worry with using revision numbering for features is that
    > > > > it does not play well with downstreams.
    > > > > E.g. RHEL might want to cherry-pick a feature without implementing
    > > > > other features that happened to land in the same revision.
    > > > >
    > > > > Also Revision ID is only 8 bit - it's designed for hardware where
    > > > > making a new revision is expensive. In software we'll run out of that
    > > > > eventually.
    > > >
    > > > So Revision ID is a PCI-specific thing, right? Not all transports will
    > > > necessarily have something equivalent, so they would need to depend on
    > > > the feature bit.
    > >
    > > They can't do this reliably - for example you might want to move feature
    > > bits around.
    >
    > That sounds like setting yourself up for problems. If you want to
    > deprecate bits, it would be better to define them as "reserved" and use
    > a new bit for your new feature. The s390 architecture is full of
    > "reserved" bits like that.

    That's exactly what PCI does here though, and it does this without
    problems exactly because we have a way to make old drivers fail if we
    want to. So IMO it would be good to add a revision field to ccw so we
    can do this there in the future.

    > > For 0.9.X drivers and non-transitional devices,
    > > I'd like to find some hack to make probe fail.
    > >
    > > Any idea?
    >
    > Not really, sorry.
    >
    > >
    > > But let's plan ahead and add a way to do this
    > > in the future if we make an incompatible change again.
    >
    > I'd rather have an architecture that allows us to be backwards
    > compatible for a long time and introduce a new device id/cu type for
    > a new kind of device if we want to do things differently and ditch old
    > baggage.

    device ids are transport independent so we can't do this.
    What's a cu type? Hard to add? If no let's do that, and add a
    revision to future-proof it.

> > > > > > > > > > > > > > > > > > > > > 2.4.1.2 PCI Device Layout > > > > > > ------------------------- > > > > > > > > > > > > -To configure the device, we use the first I/O region of the PCI > > > > > > -device. This contains a virtio header followed by a > > > > > > -device-specific region. > > > > > > +To configure the device, > > > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device. > > > > > > +These contain the virtio header registers, the notification register, the > > > > > > +ISR status register and device specific registers, as specified by Virtio > > > > > > ++ Structure PCI Capabilities > > > > > > + > > > > > > +There may be different widths of accesses to the I/O region; the > > > > > > +“natural” access method for each field must be > > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc). > > > > > > + > > > > > > +PCI Device Configuration Layout includes the common configuration, > > > > > > +ISR, notification and device specific configuration > > > > > > +structures. > > > > > > + > > > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian. > > > > > > + > > > > > > + > > > > > > +2.4.1.2.1 Common configuration structure layout > > > > > > +------------------------- > > > > > > +Common configuration structure layout is documented below: > > > > > > + > > > > > > +struct virtio_pci_common_cfg { > > > > > > + /* About the whole device. */ > > > > > > + __le32 device_feature_select; /* read-write */ > > > > > > + __le32 device_feature; /* read-only */ > > > > > > + __le32 guest_feature_select; /* read-write */ > > > > > > + __le32 guest_feature; /* read-write */ > > > > > > + __le16 msix_config; /* read-write */ > > > > > > + __le16 num_queues; /* read-only */ > > > > > > + __u8 device_status; /* read-write */ > > > > > > + __u8 unused1; > > > > > > + > > > > > > + /* About a specific virtqueue. 
*/ > > > > > > + __le16 queue_select; /* read-write */ > > > > > > + __le16 queue_size; /* read-write, power of 2, or 0. */ > > > > > > + __le16 queue_msix_vector; /* read-write */ > > > > > > + __le16 queue_enable; /* read-write */ > > > > > > + __le16 queue_notify_off; /* read-only */ > > > > > > + __le64 queue_desc; /* read-write */ > > > > > > + __le64 queue_avail; /* read-write */ > > > > > > + __le64 queue_used; /* read-write */ > > > > > > +}; > > > > > > + > > > > > > +device_feature_select > > > > > > + > > > > > > + Selects which Feature Bits does device_feature field refer to. > > > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > > > + All other values cause reads from device_feature to return 0. > > > > > > + > > > > > > +device_feature > > > > > > + > > > > > > + Used by Device to report Feature Bits to Driver. > > > > > > + Device Feature Bits selected by device_feature_select. > > > > > > + > > > > > > +guest_feature_select > > > > > > + > > > > > > + Selects which Feature Bits does guest_feature field refer to. > > > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > > > + All other values cause writes to guest_feature to be ignored, > > > > > > + and reads to return 0. > > > > > > + > > > > > > +guest_feature > > > > > > + > > > > > > + Used by Driver to acknowledge Feature Bits to Device. > > > > > > + Guest Feature Bits selected by guest_feature_select. > > > > > > + > > > > > > +msix_config > > > > > > + > > > > > > + Configuration Vector for MSI-X. > > > > > > + > > > > > > +num_queues > > > > > > + > > > > > > + Specifies the maximum number of virtqueues supported by device. > > > > > > + > > > > > > +device_status > > > > > > + > > > > > > + Device Status field. > > > > > > + > > > > > > +queue_select > > > > > > + > > > > > > + Queue Select. Selects which virtqueue do other fields refer to. 
> > > > > > + > > > > > > +queue_size > > > > > > + > > > > > > + Queue Size. On reset, specifies the maximum queue size supported by > > > > > > + the hypervisor. This can be modified by driver to reduce memory requirements. > > > > > > + Set to 0 if this virtqueue is unused. > > > > > > + > > > > > > +queue_msix_vector > > > > > > + > > > > > > + Queue Vector for MSI-X. > > > > > > + > > > > > > +queue_enable > > > > > > + > > > > > > + Used to selectively prevent host from executing requests from this virtqueue. > > > > > > + 1 - enabled; 0 - disabled > > > > > > + > > > > > > +queue_notify_off > > > > > > + > > > > > > + Used to calculate the offset from start of Notification structure at > > > > > > + which this virtqueue is located. > > > > > > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > > > > > > + > > > > > > +queue_desc > > > > > > + > > > > > > + Physical address of Descriptor Table. > > > > > > + > > > > > > +queue_avail > > > > > > + > > > > > > + Physical address of Available Ring. > > > > > > + > > > > > > +queue_used > > > > > > + > > > > > > + Physical address of Used Ring. > > > > > > + > > > > > > + > > > > > > +2.4.1.2.2 ISR status structure layout > > > > > > +------------------------- > > > > > > +ISR status structure includes a single 8-bite ISR status field > > > > > > > > > > 8-bit > > > > > > > > Right :) > > > > > > > > > > + > > > > > > +2.4.1.2.3 Notification structure layout > > > > > > +------------------------- > > > > > > +Notification structure is always a multiple of 2 bytes in size. > > > > > > +It includes 2-byte Queue Notify fields for each virtqueue of > > > > > > +the device. Note that multiple virtqueues can use the same > > > > > > +Queue Notify field, if necessary. > > > > > > > > > > Hmm, maybe move this down, so you can have a section which starts with > > > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all > > > > > together. 
> > > > > > > > so Move PCI Device Layout to within > > > > PCI-specific Initialization And Device Operation? > > > > > > > > > > + > > > > > > +2.4.1.2.4 Device specific structure > > > > > > +------------------------- > > > > > > + > > > > > > +Device specific structure is optional. > > > > > > + > > > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout > > > > > > +------------------------- > > > > > > + > > > > > > +Transitional devices should present part of configuration > > > > > > +registers in a legacy configuration structure in BAR0 in the first I/O > > > > > > +region of the PCI device, as documented below. > > > > > > > > > > > > There may be different widths of accesses to the I/O region; the > > > > > > “natural” access method for each field in the virtio header must be > > > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the > > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but > > > > > > +When accessed through the legacy interface the > > > > > > device-specific region can be accessed using any width accesses, and > > > > > > should obtain the same results. > > > > > > > > > > > > Note that this is possible because while the virtio header is PCI > > > > > > -(i.e. little) endian, the device-specific region is encoded in > > > > > > -the native endian of the guest (where such distinction is > > > > > > +(i.e. little) endian, when using the legacy interface the device-specific > > > > > > +region is encoded in the native endian of the guest (where such distinction is > > > > > > applicable). 
> > > > > > > > > > > > -2.4.1.2.1 PCI Device Virtio Header > > > > > > ----------------------------------- > > > > > > > > > > > > -The virtio header looks as follows: > > > > > > +When used through the legacy interface, the virtio header looks as follows: > > > > > > > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > Bits 32 32 32 16 16 16 8 8 > > > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows: > > > > > > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > > > > > > > - > > > > > > If MSI-X is enabled for the device, two additional fields > > > > > > immediately follow this header:[5] > > > > > > > > > > > > @@ -689,25 +932,154 @@ device-specific headers: > > > > > > > > > > > > +------------++--------------------+ > > > > > > > > > > > > +Note that only Feature Bits 0 to 31 are accessible through the > > > > > > +Legacy Interface. When used through the Legacy Interface, > > > > > > +Transitional Devices must assume that Feature Bits 32 to 63 > > > > > > +are not acknowledged by Driver. > > > > > > + > > > > > > + > > > > > > 2.4.1.3 PCI-specific Initialization And Device Operation > > > > > > -------------------------------------------------------- > > > > > > > > > > > > -The page size for a virtqueue on a PCI virtio device is defined as > > > > > > -4096 bytes. > > > > > > - > > > > > > 2.4.1.3.1 Device Initialization > > > > > > ------------------------------- > > > > > > > > > > > > -2.4.1.3.1.1 Queue Vector Configuration > > > > > > +This documents PCI-specific steps executed during Device Initialization. 
> > > > > > +As the first step, driver must detect device configuration layout > > > > > > +to locate configuration fields in memory,I/O or configuration space of the > > > > > > +device. > > > > > > + > > > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > > > > > > +------------------------------- > > > > > > + > > > > > > +As a prerequisite to device initialization, driver executes a > > > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio > > > > > > +Structure PCI capabilities. > > > > > > + > > > > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification > > > > > > +and ISR Status and device configuration structures. > > > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to > > > > > > +the function, located beginning at 10h in Configuration Space, > > > > > > +or accessed though PCI configuration space. > > > > > > + > > > > > > +Actual location of each structure is specified using vendor-specific PCI capability located > > > > > > +on capability list in PCI configuration space of the device. > > > > > > +This virtio structure capability uses little-endian format; all bits are > > > > > > +read-only: > > > > > > + > > > > > > +struct virtio_pci_cap { > > > > > > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > > > > > + __u8 cap_next; /* Generic PCI field: next ptr. */ > > > > > > + __u8 cap_len; /* Generic PCI field: capability length */ > > > > > > + __u8 cfg_type; /* Identifies the structure. */ > > > > > > + __u8 bar; /* Where to find it. */ > > > > > > + __u8 padding[3];/* Pad to full dword. */ > > > > > > + __le32 offset; /* Offset within bar. */ > > > > > > + __le32 length; /* Length of the structure, in bytes. */ > > > > > > +}; > > > > > > + > > > > > > +This structure can optionally followed by extra data, depending on > > > > > > +other fields, as documented below. 
> > > > > > + > > > > > > +The fields are interpreted as follows: > > > > > > + > > > > > > +cap_vndr > > > > > > + 0x09; Identifies a vendor-specific capability. > > > > > > + > > > > > > +cap_next > > > > > > + Link to next capability in the capability list in the configuration space. > > > > > > + > > > > > > +cap_len > > > > > > + Length of the capability structure, including the whole of > > > > > > + struct virtio_pci_cap, and extra data if any. > > > > > > + This length might include padding, or fields unused by the driver. > > > > > > + > > > > > > +cfg_type > > > > > > + identifies the structure, according to the following table. > > > > > > + > > > > > > + /* Common configuration */ > > > > > > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 > > > > > > + /* Notifications */ > > > > > > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > > > > > > + /* ISR Status */ > > > > > > + #define VIRTIO_PCI_CAP_ISR_CFG 3 > > > > > > + /* Device specific configuration */ > > > > > > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > > > > > + > > > > > > + More than one capability can identify the same structure - this makes it > > > > > > + possible for the device to expose multiple interfaces to drivers. The order of > > > > > > + the capabilities in the capability list specifies the order of preference > > > > > > + suggested by the device; drivers should use the first interface that they can > > > > > > + support. For example, on some hypervisors, notifications using IO accesses are > > > > > > + faster than memory accesses. In this case, hypervisor can expose two > > > > > > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: > > > > > > + the first one addressing an I/O BAR, the second one addressing a memory BAR. > > > > > > + Driver will use the I/O BAR if I/O resources are available, and fall back on > > > > > > + memory BAR when I/O resources are unavailable. 
> > > > > > + > > > > > > +bar > > > > > > + > > > > > > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to > > > > > > + the function located beginning at 10h in Configuration Space > > > > > > + and used to map the structure into Memory or I/O Space. > > > > > > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space > > > > > > + or I/O Space. > > > > > > + > > > > > > + The value 0xF specifies that the structure is in PCI configuration space > > > > > > + inline with this capability structure, following (not necessarily immediately) > > > > > > + the length field. > > > > > > > > > > Why not immediately? > > > > > Or how would the driver know where it is? > > > > > > > > It's at the offset. > > > > > > > > E.g. for notification we stick multiplier after length. > > > > Further, we might extend virtio_pci_cap in the future, > > > > and we don't want to move stuff around like we > > > > had to with MSI-X. > > > > > > > > > > + > > > > > > +offset > > > > > > + indicates where the structure begins relative to the base address associated > > > > > > + with the BAR. If bar specifies configuration space, offset is relative > > > > > > + to start of virtio_pci_cap structure. > > > > > > + > > > > > > +length > > > > > > + indicates the length of the structure. > > > > > > + This size might include padding, or fields unused by the driver. > > > > > > + Drivers are also recommended to only map part of configuration structure > > > > > > + large enough for device operation. > > > > > > + For example, a future device might present a large structure size of several > > > > > > + MBytes. > > > > > > + As current devices never utilize structures larger than 4KBytes in size, > > > > > > + driver can limit the mapped structure size to e.g. > > > > > > + 4KBytes to allow forward compatibility with such devices without loss of > > > > > > + functionality and without wasting resources. 
> > > > > > + > > > > > > + > > > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed > > > > > > +by additional fields: > > > > > > + > > > > > > +struct virtio_pci_notify_cap { > > > > > > + struct virtio_pci_cap cap; > > > > > > + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ > > > > > > +}; > > > > > > + > > > > > > +notify_off_multiplier > > > > > > + > > > > > > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0. > > > > > > + Value 0x1 is reserved. > > > > > > + For a given virtqueue, the address to use for notifications is calculated as follows: > > > > > > + > > > > > > + queue_notify_off * notify_off_multiplier + offset > > > > > > + > > > > > > + If notify_off_multiplier is 0, all virtqueues use the same address in > > > > > > + the Notifications structure! > > > > > > + > > > > > > + > > > > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection > > > > > > +------------------------------- > > > > > > + > > > > > > +Legacy drivers skipped Device Layout Detection step, assuming legacy > > > > > > +configuration space in BAR0 in I/O space unconditionally. > > > > > > + > > > > > > +2.4.1.3.1.3 Queue Vector Configuration > > > > > > -------------------------------------- > > > > > > > > > > > > When MSI-X capability is present and enabled in the device > > > > > > -(through standard PCI configuration space) 4 bytes at byte offset > > > > > > -20 are used to map configuration change and queue interrupts to > > > > > > -MSI-X vectors. In this case, the ISR Status field is unused, and > > > > > > -device specific configuration starts at byte offset 24 in virtio > > > > > > -header structure. When MSI-X capability is not enabled, device > > > > > > -specific configuration starts at byte offset 20 in virtio header. 
> > > > > > +(through standard PCI configuration space) Configuration/Queue > > > > > > +MSI-X Vector registers are used to map configuration change and queue > > > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused. > > > > > > > > > > > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of > > > > > > Configuration/Queue Vector registers, maps interrupts triggered > > > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on > > > > > > failure, NO_VECTOR is returned. If a mapping failure is detected, > > > > > > the driver can retry mapping with fewervectors, or disable MSI-X. > > > > > > > > > > > > -2.4.1.3.1.2 Virtqueue Configuration > > > > > > +2.4.1.3.1.4 Virtqueue Configuration > > > > > > ----------------------------------- > > > > > > > > > > > > As a device can have zero or more virtqueues for bulk data > > > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has: > > > > > > always a power of 2. This controls how big the virtqueue is > > > > > > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. > > > > > > > > > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on > > > > > > - a 4096 byte alignment. Write the physical address, divided by > > > > > > - 4096 to the Queue Address field.[6] > > > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size > > > > > > + field. > > > > > > + > > > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the > > > > > > + virtqueue in contiguous physical memory. > > > > > > > > > > > > 4. 
Optionally, if MSI-X capability is present and enabled on the > > > > > > device, select a vector to use to request interrupts triggered > > > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has: > > > > > > Queue Vector field: on success, previously written value is > > > > > > returned; on failure, NO_VECTOR value is returned. > > > > > > > > > > > > + > > > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration > > > > > > +----------------------------------- > > > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio > > > > > > +device is defined as 4096 bytes. Driver writes the physical address, divided > > > > > > +by 4096 to the Queue Address field [6]. > > > > > > + > > > > > > 2.4.1.3.2 Notifying The Device > > > > > > ------------------------------ > > > > > > > > > > > > Device notification occurs by writing the 16-bit virtqueue index > > > > > > -of this virtqueue to the Queue Notify field of the virtio header > > > > > > -in the first I/O region of the PCI device. > > > > > > +of this virtqueue to the Queue Notify field. > > > > > > > > > > > > 2.4.1.3.3 Receiving Used Buffers From The Device > > > > > > +------------------------------ > > > > > > > > > > > > If an interrupt is necessary: > > > > > > > > > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390). > > > > > > This is only allowed if the driver does not use any features > > > > > > which would alter this early use of the device. > > > > > > > > > > > > -[5] ie. once you enable MSI-X on the device, the other fields move. > > > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at > > > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not > > > > > > +enabled, device specific configuration starts at byte offset 20 in virtio > > > > > > +header. ie. once you enable MSI-X on the device, the other fields move. 
    > > > > > > If you turn it off again, they move back!
    > > > > > Thanks,
    > > > > > Rusty.
    > > > Cornelia
    > > > >
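The capability walk and the notification address formula from the quoted patch can be sketched in C roughly as follows. This is only a sketch under assumptions, not driver code: configuration space is modelled as a plain 256-byte array, `find_virtio_cap` and `notify_address` are hypothetical helper names, and the byte offsets simply follow the `struct virtio_pci_cap` and `struct virtio_pci_notify_cap` layouts quoted above. The case where `bar` is 0xF (structure inline in configuration space) is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CAPABILITY_LIST        0x34 /* standard pointer to the first capability */
#define PCI_CAP_ID_VNDR            0x09 /* vendor-specific capability ID */
#define VIRTIO_PCI_CAP_NOTIFY_CFG  2    /* cfg_type value from the quoted patch */

/* Byte offsets inside the capability, following the quoted struct layout. */
enum {
    CAP_VNDR = 0, CAP_NEXT = 1, CAP_LEN = 2, CAP_CFG_TYPE = 3,
    CAP_BAR = 4, CAP_OFFSET = 8, CAP_LENGTH = 12, CAP_NOTIFY_MULT = 16
};

/* Read a little-endian 32-bit field regardless of host endianness. */
static uint32_t le32_at(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: walk the capability list in a 256-byte snapshot of
 * configuration space. Returns the offset of the first vendor-specific
 * capability whose cfg_type matches, or 0 if there is none. The first
 * match wins, mirroring the "order of preference" rule in the patch.
 */
static uint8_t find_virtio_cap(const uint8_t cfg[256], uint8_t cfg_type)
{
    uint8_t pos = cfg[PCI_CAPABILITY_LIST] & 0xFC; /* low two bits are reserved */
    int guard = 48;                                /* bound the walk if the list loops */

    while (pos >= 0x40 && pos <= 256 - 16 && guard-- > 0) {
        if (cfg[pos + CAP_VNDR] == PCI_CAP_ID_VNDR &&
            cfg[pos + CAP_CFG_TYPE] == cfg_type)
            return pos;
        pos = cfg[pos + CAP_NEXT] & 0xFC;
    }
    return 0;
}

/*
 * Offset within the BAR to use for notifying a queue:
 *   queue_notify_off * notify_off_multiplier + offset
 */
static uint64_t notify_address(const uint8_t cfg[256], uint8_t cap_pos,
                               uint16_t queue_notify_off)
{
    /* The notify capability is 20 bytes long and must fit in the snapshot. */
    assert(cap_pos >= 0x40 && cap_pos <= 256 - 20);

    uint32_t offset = le32_at(&cfg[cap_pos + CAP_OFFSET]);
    uint32_t mult = le32_at(&cfg[cap_pos + CAP_NOTIFY_MULT]);

    return (uint64_t)queue_notify_off * mult + offset;
}
```

A real driver would read configuration space through its platform's PCI accessors and would also validate `bar`, `cap_len` and `length` before mapping anything.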


  • 8.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 08-28-2013 08:23
    On Tue, 27 Aug 2013 20:18:21 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > On Tue, Aug 27, 2013 at 07:01:23PM +0200, Cornelia Huck wrote:
    > > On Tue, 27 Aug 2013 18:36:29 +0300
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > > > On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote:
    > > > > Some remarks from my side...
    > > > >
    > > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > > >
    > > > > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > > > > > This is the new configuration layout.
    > > > > > > >
    > > > > > > > Notes:
    > > > > > > > - Everything is LE
    > > > > > > > - There's a feature bit that means spec 1.0 compliant.
    > > > > > > > - Both devices and drivers can either require the 1.0 interface
    > > > > > > > or try to include compatibility support. The spec isn't forcing
    > > > > > > > this decision.
    > > > > > >
    > > > > > > Hmm, this kind includes other changes already proposed, like the LE
    > > > > > > change and the framing change. I think this conceptually splits nicely:
    > > > > > >
    > > > > > > 1) Feature bit 32 proposal.
    > > > > > > 2) Endian change.
    > > > > > > 3) Framing change.
    > > > > > > 4) PCI layout change.
    > > > > >
    > > > > > Right - they are mostly in different parts of the document.
    > > > > > I put it all together so it's easy to see how we intend to
    > > > > > handle the transition.
    > > > > > So is everyone OK with keeping this in a single patch?
    > > > >
    > > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > > should be one patch.
    > > > >
    > > > > >
    > > > > > > > - I kept documentation of the legacy interface around, and added notes
    > > > > > > > on transition inline.
They are in separate sections each clearly marked > > > > > > > "Legacy Interface" so we'll be able to separate them out > > > > > > > from the final document as necessary - for now I think it's easier > > > > > > > to keep it all together. > > > > > > > > > > > > Good thinking: most of us know the current spec so it's definitely > > > > > > clearer. And makes sure we're thinking about the transition. > > > > > > > > > > > > > Only virtio PCI has been converted. > > > > > > > Let's discuss this on the meeting tonight, once we figure out PCI > > > > > > > we can do something similar for MMIO and CCW. > > > > > > > > > > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows: > > > > > > > 24 to 31: Feature bits reserved for extensions to the queue and > > > > > > > feature negotiation mechanisms > > > > > > > > > > > > > > + 32: Feature bit must be set for any device compliant with this > > > > > > > + revision of the specification, and acknowledged by all device drivers. > > > > > > > > Would it make sense to have a bit 33 "rings big endian" whose validity > > > > depends on bit 32 set? This would make it possible for ccw to keep its > > > > current endianness. > > > > > > I didn't go over ccw or MMIO yet - only PCI. > > > I think ccw registers will just > > > be explicitly BE, with no need for a feature bit. > > > Does this sound right? > > > > Sure, that would be even better. > > > > > > > > > > > > + > > > > > > > + 33 to 63: Feature bits reserved for future extensions > > > > > > > + > > > > > > > For example, feature bit 0 for a network device (i.e. Subsystem > > > > > > > Device ID 1) indicates that the device supports checksumming of > > > > > > > packets. > > > > > > > > > > > > Why stop at 63? If we go to a more decentralized feature-assignment > > > > > > model, we'll run through those very fast. > > > > > > > > > > Then we'll just document more, but driver needs to know where to stop > > > > > looking for features. 
> > > > > > > > > > > > > > > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are > > > > > > > indicated by offering a feature bit, so the guest can check > > > > > > > before accessing that part of the configuration space. > > > > > > > > > > > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts > > > > > > > +-------------------------------------- > > > > > > > + > > > > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but > > > > > > > +different interface between the hypervisor and the guest. > > > > > > > +Since these are widely deployed in the field, this specification > > > > > > > +accomodates optional features to simplify transition > > > > > > > +from these earlier draft interfaces. Specifically: > > > > > > > + > > > > > > > +Legacy Interface > > > > > > > + is an interface specified by an earlier draft of this specification > > > > > > > + (up to 0.9.X) > > > > > > > +Legacy Device > > > > > > > + is a device implemented before this specification was released, > > > > > > > + and implementing a legacy interface on the host side > > > > > > > +Legacy Driver > > > > > > > + is a driver implemented before this specification was released, > > > > > > > + and implementing a legacy interface on the guest side > > > > > > > + > > > > > > > +to simplify transition from these earlier draft interfaces, > > > > > > > +it is possible to implement > > > > > > > + > > > > > > > +Transitional Device > > > > > > > + a device supporting both drivers conforming to this > > > > > > > + specification, and legacy drivers > > > > > > > + > > > > > > > +Transitional Driver > > > > > > > + a driver supporting both devices conforming to this > > > > > > > + specification, and legacy devices > > > > > > > > What happens to legacy devices in the future? 
Current implementers > > > > will obviously expose legacy devices, which means future drivers need > > > > to be transitional or they won't work with what is currently out there. > > > > > > You are right. It's a bug in what I wrote: non transitional drivers > > > should work with transitional devices. > > > This way a transitional device can change to non-transitional > > > after drivers are updated. > > > > > > > Will legacy stay around (for the forseeable furture)? > > > > > > That's up to implementers I think as long as they > > > implement the new standard we should not prevent them from > > > bundling in the old virtio, coffee making capabilities etc. > > > > > > > > > > Will legacy > > > > devices still be considered standard compliant (as in "compliant to the > > > > legacy standard")? > > > > > > I don't think they are compliant. We'll split the legacy sections > > > from spec out to a separate transition guide before we release > > > the spec. > > > > What I'm worried about is probably the transitional nature of this. > > There is a framework we have now, so there will be users - and not on > > all platforms they expect needing to upgrade, especially if traditional > > I/O has always been backwards compatible for decades... > > I'm not sure I understand the suggestion. > You want us to push devices harder to implement legacy interfaces? > You want us to push drivers harder to switch to new interfaces? > > The proposal is basically trying hard to supply a mechanism, > not force a policy. So we may be in violent agreement there :) If the legacy mechanism can stay, I'm fine. > > > > > > > > > > > + > > > > > > > +Device and driver that require support for revision 1.0 or newer of > > > > > > > +the specification to function, are called non-transitional device and driver, > > > > > > > +respectively. > > > > > > > + > > > > > > > +Transitional Drivers can detect Legacy Devices by detecting that > > > > > > > +Feature bit 32 is not offered. 
    > > > > > > > +Transitional devices can detect Legacy drivers by detecting that
    > > > > > > > +Feature bit 32 has not been acknowledged by driver.
    > > > >
    > > > > Will we use new feature bits for new, incompatible revisions? Or will
    > > > > we try to stay backwards compatible?
    > > >
    > > > So an incompatible change needs to increment revision ID
    > > > to prevent drivers from loading.
    > > > MMIO and PCI both have revision IDs.
    > > > CCW will need to add something like a revision ID,
    > > > we discussed this already.
    > >
    > > Command rejects?
    >
    > Which command would you reject?

    Whatever was incompatible or unknown.

    >
    > > I think it is a good idea to try to stay as compatible as possible;
    > > this should really be a last measure.
    >
    > Again, I think that at some point, e.g. 10-15 years in
    > the future, devices will want to say "I require new drivers
    > and that's it".

    Then they should probably present themselves as different devices, no?
    (We're not talking about minor changes here, I guess.)

    >
    > I think it's useful to have a mechanism for this, so
    > old drivers fail gracefully.

    See below for some more thoughts I had on this.

    >
    > > >
    > > > > > > > +
    > > > > > > > +To make them easier to locate, specification sections documenting these
    > > > > > > > +transitional features all explicitly marked with
    > > > > > > > +'Legacy Interface' in the section title.
    > > > > > > > +
    > > > > > > > +
    > > > > > > > 2.1.3 Configuration Space
    > > > > > > > -------------------------
    > > > > > > >
    > > > > > > > Configuration space is generally used for rarely-changing or
    > > > > > > > initialization-time parameters.
    > > > > > > >
    > > > > > > > -Note that this space is generally the guest's native endian,
    > > > > > > > +Note that configuration space generally uses the little-endian format
    > > > > > > > +for multi-byte fields.
> > > > > > > + > > > > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness > > > > > > > +-------------------------------------- > > > > > > > + > > > > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, > > > > > > > rather than PCI's little-endian. > > > > > > > > > > > > > > 2.1.4 Virtqueues > > > > > > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size > > > > > > > parameter, which sets the number of entries and implies the total size > > > > > > > of the queue. > > > > > > > > > > > > > > +Each virtqueue consists of three parts: > > > > > > > + > > > > > > > + Descriptor Table > > > > > > > + Available Ring > > > > > > > + Used Ring > > > > > > > + > > > > > > > +where each part is physically-contiguous in guest memory, > > > > > > > +and has different alignment requirements. > > > > > > > + > > > > > > > +The Queue Size field controls the total number of bytes > > > > > > > +required for each part of the virtqueue. 
> > > > > > > + > > > > > > > +The memory aligment and size requirements, in bytes, of each part of the > > > > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field): > > > > > > > + > > > > > > > ++------------+---------------------------------+ > > > > > > > + Virtqueue Part Alignment Size > > > > > > > ++------------+---------------------------------+ > > > > > > > ++------------+---------------------------------+ > > > > > > > + Descriptor Table 16 16 * qsz > > > > > > > ++------------+---------------------------------+ > > > > > > > + Available Ring 2 6 + 2 * qsz > > > > > > > ++------------+---------------------------------+ > > > > > > > + Used Ring 4 6 + 4 * qsz > > > > > > > ++------------+---------------------------------+ > > > > > > > + > > > > > > > +When the driver wants to send a buffer to the device, it fills in > > > > > > > +a slot in the descriptor table (or chains several together), and > > > > > > > +writes the descriptor index into the available ring. It then > > > > > > > +notifies the device. When the device has finished a buffer, it > > > > > > > +writes the descriptor into the used ring, and sends an interrupt. 
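As a rough cross-check of the size table quoted above, the per-part sizes can be computed in C. This is only a sketch under assumptions: `vq_sizes_for` is a hypothetical helper, and the used-ring element is taken to be 8 bytes (a 32-bit id plus a 32-bit length, as in the legacy `struct vring_used_elem`). That gives 6 + 8 * qsz for the Used Ring, rather than the 6 + 4 * qsz listed in the table as quoted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes in bytes of the three virtqueue parts for a given Queue Size. */
struct vq_part_sizes {
    size_t desc;   /* Descriptor Table, 16-byte alignment */
    size_t avail;  /* Available Ring, 2-byte alignment */
    size_t used;   /* Used Ring, 4-byte alignment */
};

static struct vq_part_sizes vq_sizes_for(uint16_t qsz)
{
    struct vq_part_sizes s;

    /* Queue Size is a power of 2; 0 means the virtqueue does not exist. */
    assert(qsz != 0 && (qsz & (qsz - 1)) == 0);

    s.desc = 16 * (size_t)qsz;      /* 16-byte descriptors */
    s.avail = 6 + 2 * (size_t)qsz;  /* three u16 fields plus a u16 per entry */
    s.used = 6 + 8 * (size_t)qsz;   /* three u16 fields plus 8 bytes per entry (assumption) */
    return s;
}
```

For a 256-entry queue this gives a 4096-byte Descriptor Table, a 518-byte Available Ring and a 2054-byte Used Ring, each of which can be allocated separately at its own alignment.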
> > > > > > > + > > > > > > > + > > > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout > > > > > > > +-------------------------------------- > > > > > > > + > > > > > > > +For Legacy Interfaces, several additional > > > > > > > +restrictions are placed on the virtqueue layout: > > > > > > > + > > > > > > > Each virtqueue occupies two or more physically-contiguous pages > > > > > > > (usually defined as 4096 bytes, but depending on the transport) > > > > > > > and consists of three parts: > > > > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula: > > > > > > > + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz); > > > > > > > } > > > > > > > > > > > > > > -This currently wastes some space with padding, but also allows future > > > > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The > > > > > > > -virtqueue layout structure looks like this: > > > > > > > +This wastes some space with padding. > > > > > > > +The legacy virtqueue layout structure therefore looks like this: > > > > > > > > > > > > > > struct vring { > > > > > > > // The actual descriptors (16 bytes each) > > > > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this: > > > > > > > struct vring_used used; > > > > > > > }; > > > > > > > > > > > > > > -When the driver wants to send a buffer to the device, it fills in > > > > > > > -a slot in the descriptor table (or chains several together), and > > > > > > > -writes the descriptor index into the available ring. It then > > > > > > > -notifies the device. When the device has finished a buffer, it > > > > > > > -writes the descriptor into the used ring, and sends an interrupt. 
> > > > > > > - > > > > > > > -2.1.4.1 A Note on Virtqueue Endianness > > > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness > > > > > > > -------------------------------------- > > > > > > > > > > > > > > Note that the endian of fields and in the virtqueue is the native > > > > > > > -endian of the guest, not little-endian as PCI normally is. This makes > > > > > > > -for simpler guest code, and it is assumed that the host already has to > > > > > > > -be deeply aware of the guest endian so such an “endian-aware” device > > > > > > > -is not a significant issue. > > > > > > > +endian of the guest, not little-endian as PCI normally is. > > > > > > > +It is assumed that the host is already aware of the guest endian. > > > > > > > > > > > > > > 2.1.4.2 Message Framing > > > > > > > ----------------------- > > > > > > > -The original intent of the specification was that message framing (the > > > > > > > -particular layout of descriptors) be independent of the contents of > > > > > > > +Generally, the intent of the specification is for message framing (the > > > > > > > +particular layout of descriptors) to be independent of the contents of > > > > > > > the buffers. For example, a network transmit buffer consists of a 12 > > > > > > > byte header followed by the network packet. This could be most simply > > > > > > > placed in the descriptor table as a 12 byte output descriptor followed > > > > > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and > > > > > > > packet are adjacent, or even three or more descriptors (possibly with > > > > > > > loss of efficiency in that case). > > > > > > > > > > > > > > -Regrettably, initial driver implementations used simple layouts, and > > > > > > > -devices came to rely on it, despite this specification wording[10]. It > > > > > > > -is thus recommended that drivers be conservative in their assumptions, > > > > > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. 
In addition, some > > > > > > > +In addition, some > > > > > > > implementations may have large-but-reasonable restrictions on total > > > > > > > descriptor size (such as based on IOV_MAX in the host OS). This has > > > > > > > not been a problem in practice: little sympathy will be given to > > > > > > > drivers which create unreasonably-sized descriptors such as by > > > > > > > dividing a network packet into 1500 single-byte descriptors! > > > > > > > > > > > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing > > > > > > > +----------------------- > > > > > > > +Regrettably, initial driver implementations used simple layouts, and > > > > > > > +devices came to rely on it, despite this specification wording[10]. It > > > > > > > +is thus recommended that when using legacy interfaces, > > > > > > > +drivers should be conservative in their assumptions, > > > > > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. > > > > > > > > So ANY_LAYOUT and feature bit 32 are mutually exclusive? > > > > > > Hmm. I wonder what gives this impression. > > > What I tried to say is bit 32 should imply ANY_LAYOUT. > > > > Better to spell it out, then. > > Well it says (in unchanged text) > Generally, the intent of the specification is for message framing (the > particular layout of descriptors) to be independent of the contents of > the buffers. > > how would you make it clearer? "Note that bit 32 implies ANY_LAYOUT"? > > > > > > > > > > > > > > + > > > > > > > 2.1.4.3 The Virtqueue Descriptor Table > > > > > > > -------------------------------------- > > > > > > > > > > > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device. > > > > > > > 2.2.1 Device Initialization > > > > > > > --------------------------- > > > > > > > > > > > > > > -1. Reset the device. This is not required on initial start up. > > > > > > > +1. Device discovery. This is only required for some transports. > > > > > > > + > > > > > > > +2. Reset the device. 
This is not required on initial start up. > > > > > > > > > > > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > > > > +3. Device layout detection. This is only required for some transports. > > > > > > > > > > > > > > -3. The DRIVER status bit is set: we know how to drive the device. > > > > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device. > > > > > > > > > > > > > > -4. Device-specific setup, including reading the device feature > > > > > > > +5. The DRIVER status bit is set: we know how to drive the device. > > > > > > > + > > > > > > > +6. Device-specific setup, including reading the device feature > > > > > > > bits, discovery of virtqueues for the device, optional per-bus > > > > > > > setup, and reading and possibly writing the device's virtio > > > > > > > configuration space. > > > > > > > > > > > > > > -5. The subset of device feature bits understood by the driver is > > > > > > > +7. The subset of device feature bits understood by the driver is > > > > > > > written to the device. > > > > > > > > > > > > > > -6. The DRIVER_OK status bit is set. > > > > > > > +8. The DRIVER_OK status bit is set. > > > > > > > > > > > > > > -7. The device can now be used (ie. buffers added to the > > > > > > > +9. The device can now be used (ie. buffers added to the > > > > > > > virtqueues)[4] > > > > > > > > > > > > > > If any of these steps go irrecoverably wrong, the guest should > > > > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices. > > > > > > > > > > > > > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > > > > > > > 0x103F inclusive is a virtio device[3]. The device must also have a > > > > > > > -Revision ID of 0 to match this specification. > > > > > > > +Revision ID of 0 or Revision ID of 1 to match this specification. > > > > > > > > > > > > > > The Subsystem Device ID indicates which virtio device is > > > > > > > supported by the device. 
The Subsystem Vendor ID should reflect > > > > > > > the PCI Vendor ID of the environment (it's currently only used > > > > > > > for informational purposes by the guest). > > > > > > > > > > > > > > +Drivers must not match devices where Revision ID does not match 0 or 1. > > > > > > > + > > > > > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery > > > > > > > +---------------------------- > > > > > > > +Transitional devices must have a Revision ID of 0. > > > > > > > + > > > > > > > +Non-transitional devices must have a Revision ID of 1. > > > > > > > + > > > > > > > +Transitional drivers must match a Revision ID of 0 or 1. > > > > > > > + > > > > > > > +Non-transitional drivers must only match a Revision ID of 1. > > > > > > > + > > > > > > > > > > > > I think we should stop abusing Revision IDs, and start using them > > > > > > to reflect device version changes as intended. > > > > > > > > > > > > We could reserve revision id 0 for legacy devices, however, which should > > > > > > work nicely. > > > > > > > > > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply? > > > > > > > > > > More concerns: > > > > > > > > > > We are using revision ID now exactly as was intended to disable old > > > > > drivers - it served us well for 0.X-1.X and would be as useful if we > > > > > ever have 1.X->2.0 transition. > > > > > > > > > > Another worry with using revision numbering for features is that > > > > > it does not play well with downstreams. > > > > > E.g. RHEL might want to cherry-pick a feature without implementing > > > > > other features that happened to land in the same revision. > > > > > > > > > > Also Revision ID is only 8 bit - it's designed for hardware where > > > > > making a new revision is expensive. In software we'll run out of that > > > > > eventually. > > > > > > > > So Revision ID is a PCI-specific thing, right? 
Not all transports will
    > > > > necessarily have something equivalent, so they would need to depend on
    > > > > the feature bit.
    > > >
    > > > They can't do this reliably - for example you might want to move feature
    > > > bits around.
    > >
    > > That sounds like setting yourself up for problems.
    > > If you want to
    > > deprecate bits, it would be better to define them as "reserved" and use
    > > a new bit for your new feature. The s390 architecture is full of
    > > "reserved" bits like that.
    >
    > That's exactly what PCI does here though, and it does this
    > without problems exactly because we have a way to
    > make old drivers fail if we want to.

    You fail if a reserved bit is to be negotiated?

    >
    > So IMO it would be good to add a revision field to ccw so we
    > can do this there in the future.

    An idea I just had:

    - Add a new channel command "set virtio configuration". This can set:
      - a revision id; 0 for legacy, 1 for the proposal, possibly more later
      - a format field indicating the format of the following data area
      - a data area (unused for now, but can be used for all kinds of
        configuration parameters)
    - A transitional or modern driver will issue this command when starting
      to probe.
      - A legacy device will reject the command, prompting the transitional
        driver to use the legacy interface and the modern driver to fail.
      - Transitional/modern devices will either accept the configuration or
        reject it if they don't support that particular configuration. The
        driver may then either fail or re-try with a different configuration.
    - A legacy driver will not issue this command (obviously).
      - A legacy device will work as before.
      - A transitional device will notice that a virtio-ccw command is issued
        without any configuration set. It will therefore operate in legacy
        mode.
      - A modern device will reject any virtio-ccw command without any
        configuration set, causing the legacy driver to fail.
    - A set configuration command is rejected after the first virtio-ccw
      command has been issued.
      - Obviously true for legacy devices.
      - Allows transitional/modern devices to fence off misbehaving drivers.
      - No dynamic change of the configuration; you'll always have to tear
        down and re-init for that.

    I think this should allow us to accommodate future changes without having
    to change the control unit type, unless we'd really do something
    radically different.

    >
    > > > For 0.9.X drivers and non-transitional devices,
    > > > I'd like to find some hack to make probe fail.
    > > >
    > > > Any idea?
    > >
    > > Not really, sorry.
    > >
    > > >
    > > > But let's plan ahead and add a way to do this
    > > > in the future if we make an incompatible change again.
    > >
    > > I'd rather have an architecture that allows us to be backwards
    > > compatible for a long time and introduce a new device id/cu type for
    > > a new kind of device if we want to do things differently and ditch old
    > > baggage.
    >
    > device ids are transport independent so we can't do this.
    > What's a cu type? Hard to add?
    > If no let's do that, and add a revision to future-proof it.

    Ah, terminology fail. The device id (net, block, ...) is reflected for
    virtio-ccw devices in the control unit model (8 bit value). The control
    unit type (16 bit value) is what identifies the control unit as a
    virtio-ccw control unit, accepting virtio-ccw channel commands. So you get

    3832/01 -> virtio-net via virtio-ccw
    3832/02 -> virtio-blk via virtio-ccw

    (all of which is discovered via a common channel-io mechanism)

    For something radically incompatible, this could become 3833/01,
    3833/02, ...

    What I *meant* with device id above was the pci id.

    A new cu type (3833) would be easy to code, but I'd have to get it
    reserved with the folks handling known ids. So if my idea from above
    worked, that would be way better.

    > >
    > > >
    > > > >
    > > > > >
    > > > > > >
    > > > > > > > 2.4.1.2 PCI Device Layout
    > > > > > > > -------------------------
    > > > > > > >
    > > > > > > > -To configure the device, we use the first I/O region of the PCI
    > > > > > > > -device.
This contains a virtio header followed by a > > > > > > > -device-specific region. > > > > > > > +To configure the device, > > > > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device. > > > > > > > +These contain the virtio header registers, the notification register, the > > > > > > > +ISR status register and device specific registers, as specified by Virtio > > > > > > > ++ Structure PCI Capabilities > > > > > > > + > > > > > > > +There may be different widths of accesses to the I/O region; the > > > > > > > +“natural” access method for each field must be > > > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc). > > > > > > > + > > > > > > > +PCI Device Configuration Layout includes the common configuration, > > > > > > > +ISR, notification and device specific configuration > > > > > > > +structures. > > > > > > > + > > > > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian. > > > > > > > + > > > > > > > + > > > > > > > +2.4.1.2.1 Common configuration structure layout > > > > > > > +------------------------- > > > > > > > +Common configuration structure layout is documented below: > > > > > > > + > > > > > > > +struct virtio_pci_common_cfg { > > > > > > > + /* About the whole device. */ > > > > > > > + __le32 device_feature_select; /* read-write */ > > > > > > > + __le32 device_feature; /* read-only */ > > > > > > > + __le32 guest_feature_select; /* read-write */ > > > > > > > + __le32 guest_feature; /* read-write */ > > > > > > > + __le16 msix_config; /* read-write */ > > > > > > > + __le16 num_queues; /* read-only */ > > > > > > > + __u8 device_status; /* read-write */ > > > > > > > + __u8 unused1; > > > > > > > + > > > > > > > + /* About a specific virtqueue. */ > > > > > > > + __le16 queue_select; /* read-write */ > > > > > > > + __le16 queue_size; /* read-write, power of 2, or 0. 
*/ > > > > > > > + __le16 queue_msix_vector; /* read-write */ > > > > > > > + __le16 queue_enable; /* read-write */ > > > > > > > + __le16 queue_notify_off; /* read-only */ > > > > > > > + __le64 queue_desc; /* read-write */ > > > > > > > + __le64 queue_avail; /* read-write */ > > > > > > > + __le64 queue_used; /* read-write */ > > > > > > > +}; > > > > > > > + > > > > > > > +device_feature_select > > > > > > > + > > > > > > > + Selects which Feature Bits does device_feature field refer to. > > > > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > > > > + All other values cause reads from device_feature to return 0. > > > > > > > + > > > > > > > +device_feature > > > > > > > + > > > > > > > + Used by Device to report Feature Bits to Driver. > > > > > > > + Device Feature Bits selected by device_feature_select. > > > > > > > + > > > > > > > +guest_feature_select > > > > > > > + > > > > > > > + Selects which Feature Bits does guest_feature field refer to. > > > > > > > + Value 0x0 selects Feature Bits 0 to 31 > > > > > > > + Value 0x1 selects Feature Bits 32 to 63 > > > > > > > + All other values cause writes to guest_feature to be ignored, > > > > > > > + and reads to return 0. > > > > > > > + > > > > > > > +guest_feature > > > > > > > + > > > > > > > + Used by Driver to acknowledge Feature Bits to Device. > > > > > > > + Guest Feature Bits selected by guest_feature_select. > > > > > > > + > > > > > > > +msix_config > > > > > > > + > > > > > > > + Configuration Vector for MSI-X. > > > > > > > + > > > > > > > +num_queues > > > > > > > + > > > > > > > + Specifies the maximum number of virtqueues supported by device. > > > > > > > + > > > > > > > +device_status > > > > > > > + > > > > > > > + Device Status field. > > > > > > > + > > > > > > > +queue_select > > > > > > > + > > > > > > > + Queue Select. Selects which virtqueue do other fields refer to. 
> > > > > > > + > > > > > > > +queue_size > > > > > > > + > > > > > > > + Queue Size. On reset, specifies the maximum queue size supported by > > > > > > > + the hypervisor. This can be modified by driver to reduce memory requirements. > > > > > > > + Set to 0 if this virtqueue is unused. > > > > > > > + > > > > > > > +queue_msix_vector > > > > > > > + > > > > > > > + Queue Vector for MSI-X. > > > > > > > + > > > > > > > +queue_enable > > > > > > > + > > > > > > > + Used to selectively prevent host from executing requests from this virtqueue. > > > > > > > + 1 - enabled; 0 - disabled > > > > > > > + > > > > > > > +queue_notify_off > > > > > > > + > > > > > > > + Used to calculate the offset from start of Notification structure at > > > > > > > + which this virtqueue is located. > > > > > > > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > > > > > > > + > > > > > > > +queue_desc > > > > > > > + > > > > > > > + Physical address of Descriptor Table. > > > > > > > + > > > > > > > +queue_avail > > > > > > > + > > > > > > > + Physical address of Available Ring. > > > > > > > + > > > > > > > +queue_used > > > > > > > + > > > > > > > + Physical address of Used Ring. > > > > > > > + > > > > > > > + > > > > > > > +2.4.1.2.2 ISR status structure layout > > > > > > > +------------------------- > > > > > > > +ISR status structure includes a single 8-bite ISR status field > > > > > > > > > > > > 8-bit > > > > > > > > > > Right :) > > > > > > > > > > > > + > > > > > > > +2.4.1.2.3 Notification structure layout > > > > > > > +------------------------- > > > > > > > +Notification structure is always a multiple of 2 bytes in size. > > > > > > > +It includes 2-byte Queue Notify fields for each virtqueue of > > > > > > > +the device. Note that multiple virtqueues can use the same > > > > > > > +Queue Notify field, if necessary. 
> > > > > > > > > > > > Hmm, maybe move this down, so you can have a section which starts with > > > > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all > > > > > > together. > > > > > > > > > > so Move PCI Device Layout to within > > > > > PCI-specific Initialization And Device Operation? > > > > > > > > > > > > + > > > > > > > +2.4.1.2.4 Device specific structure > > > > > > > +------------------------- > > > > > > > + > > > > > > > +Device specific structure is optional. > > > > > > > + > > > > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout > > > > > > > +------------------------- > > > > > > > + > > > > > > > +Transitional devices should present part of configuration > > > > > > > +registers in a legacy configuration structure in BAR0 in the first I/O > > > > > > > +region of the PCI device, as documented below. > > > > > > > > > > > > > > There may be different widths of accesses to the I/O region; the > > > > > > > “natural” access method for each field in the virtio header must be > > > > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the > > > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but > > > > > > > +When accessed through the legacy interface the > > > > > > > device-specific region can be accessed using any width accesses, and > > > > > > > should obtain the same results. > > > > > > > > > > > > > > Note that this is possible because while the virtio header is PCI > > > > > > > -(i.e. little) endian, the device-specific region is encoded in > > > > > > > -the native endian of the guest (where such distinction is > > > > > > > +(i.e. little) endian, when using the legacy interface the device-specific > > > > > > > +region is encoded in the native endian of the guest (where such distinction is > > > > > > > applicable). 
> > > > > > > > > > > > > > -2.4.1.2.1 PCI Device Virtio Header > > > > > > > ----------------------------------- > > > > > > > > > > > > > > -The virtio header looks as follows: > > > > > > > +When used through the legacy interface, the virtio header looks as follows: > > > > > > > > > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > > Bits 32 32 32 16 16 16 8 8 > > > > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows: > > > > > > > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > > > > > > > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > > > > > > > > > > > > > > - > > > > > > > If MSI-X is enabled for the device, two additional fields > > > > > > > immediately follow this header:[5] > > > > > > > > > > > > > > @@ -689,25 +932,154 @@ device-specific headers: > > > > > > > > > > > > > > +------------++--------------------+ > > > > > > > > > > > > > > +Note that only Feature Bits 0 to 31 are accessible through the > > > > > > > +Legacy Interface. When used through the Legacy Interface, > > > > > > > +Transitional Devices must assume that Feature Bits 32 to 63 > > > > > > > +are not acknowledged by Driver. > > > > > > > + > > > > > > > + > > > > > > > 2.4.1.3 PCI-specific Initialization And Device Operation > > > > > > > -------------------------------------------------------- > > > > > > > > > > > > > > -The page size for a virtqueue on a PCI virtio device is defined as > > > > > > > -4096 bytes. > > > > > > > - > > > > > > > 2.4.1.3.1 Device Initialization > > > > > > > ------------------------------- > > > > > > > > > > > > > > -2.4.1.3.1.1 Queue Vector Configuration > > > > > > > +This documents PCI-specific steps executed during Device Initialization. 
> > > > > > > +As the first step, driver must detect device configuration layout > > > > > > > +to locate configuration fields in memory,I/O or configuration space of the > > > > > > > +device. > > > > > > > + > > > > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > > > > > > > +------------------------------- > > > > > > > + > > > > > > > +As a prerequisite to device initialization, driver executes a > > > > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio > > > > > > > +Structure PCI capabilities. > > > > > > > + > > > > > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification > > > > > > > +and ISR Status and device configuration structures. > > > > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to > > > > > > > +the function, located beginning at 10h in Configuration Space, > > > > > > > +or accessed though PCI configuration space. > > > > > > > + > > > > > > > +Actual location of each structure is specified using vendor-specific PCI capability located > > > > > > > +on capability list in PCI configuration space of the device. > > > > > > > +This virtio structure capability uses little-endian format; all bits are > > > > > > > +read-only: > > > > > > > + > > > > > > > +struct virtio_pci_cap { > > > > > > > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > > > > > > > + __u8 cap_next; /* Generic PCI field: next ptr. */ > > > > > > > + __u8 cap_len; /* Generic PCI field: capability length */ > > > > > > > + __u8 cfg_type; /* Identifies the structure. */ > > > > > > > + __u8 bar; /* Where to find it. */ > > > > > > > + __u8 padding[3];/* Pad to full dword. */ > > > > > > > + __le32 offset; /* Offset within bar. */ > > > > > > > + __le32 length; /* Length of the structure, in bytes. 
*/ > > > > > > > +}; > > > > > > > + > > > > > > > +This structure can optionally followed by extra data, depending on > > > > > > > +other fields, as documented below. > > > > > > > + > > > > > > > +The fields are interpreted as follows: > > > > > > > + > > > > > > > +cap_vndr > > > > > > > + 0x09; Identifies a vendor-specific capability. > > > > > > > + > > > > > > > +cap_next > > > > > > > + Link to next capability in the capability list in the configuration space. > > > > > > > + > > > > > > > +cap_len > > > > > > > + Length of the capability structure, including the whole of > > > > > > > + struct virtio_pci_cap, and extra data if any. > > > > > > > + This length might include padding, or fields unused by the driver. > > > > > > > + > > > > > > > +cfg_type > > > > > > > + identifies the structure, according to the following table. > > > > > > > + > > > > > > > + /* Common configuration */ > > > > > > > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 > > > > > > > + /* Notifications */ > > > > > > > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > > > > > > > + /* ISR Status */ > > > > > > > + #define VIRTIO_PCI_CAP_ISR_CFG 3 > > > > > > > + /* Device specific configuration */ > > > > > > > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > > > > > > > + > > > > > > > + More than one capability can identify the same structure - this makes it > > > > > > > + possible for the device to expose multiple interfaces to drivers. The order of > > > > > > > + the capabilities in the capability list specifies the order of preference > > > > > > > + suggested by the device; drivers should use the first interface that they can > > > > > > > + support. For example, on some hypervisors, notifications using IO accesses are > > > > > > > + faster than memory accesses. In this case, hypervisor can expose two > > > > > > > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: > > > > > > > + the first one addressing an I/O BAR, the second one addressing a memory BAR. 
> > > > > > > + Driver will use the I/O BAR if I/O resources are available, and fall back on > > > > > > > + memory BAR when I/O resources are unavailable. > > > > > > > + > > > > > > > +bar > > > > > > > + > > > > > > > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to > > > > > > > + the function located beginning at 10h in Configuration Space > > > > > > > + and used to map the structure into Memory or I/O Space. > > > > > > > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space > > > > > > > + or I/O Space. > > > > > > > + > > > > > > > + The value 0xF specifies that the structure is in PCI configuration space > > > > > > > + inline with this capability structure, following (not necessarily immediately) > > > > > > > + the length field. > > > > > > > > > > > > Why not immediately? > > > > > > Or how would the driver know where it is? > > > > > > > > > > It's at the offset. > > > > > > > > > > E.g. for notification we stick multiplier after length. > > > > > Further, we might extend virtio_pci_cap in the future, > > > > > and we don't want to move stuff around like we > > > > > had to with MSI-X. > > > > > > > > > > > > + > > > > > > > +offset > > > > > > > + indicates where the structure begins relative to the base address associated > > > > > > > + with the BAR. If bar specifies configuration space, offset is relative > > > > > > > + to start of virtio_pci_cap structure. > > > > > > > + > > > > > > > +length > > > > > > > + indicates the length of the structure. > > > > > > > + This size might include padding, or fields unused by the driver. > > > > > > > + Drivers are also recommended to only map part of configuration structure > > > > > > > + large enough for device operation. > > > > > > > + For example, a future device might present a large structure size of several > > > > > > > + MBytes. 
> > > > > > > + As current devices never utilize structures larger than 4KBytes in size, > > > > > > > + driver can limit the mapped structure size to e.g. > > > > > > > + 4KBytes to allow forward compatibility with such devices without loss of > > > > > > > + functionality and without wasting resources. > > > > > > > + > > > > > > > + > > > > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed > > > > > > > +by additional fields: > > > > > > > + > > > > > > > +struct virtio_pci_notify_cap { > > > > > > > + struct virtio_pci_cap cap; > > > > > > > + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ > > > > > > > +}; > > > > > > > + > > > > > > > +notify_off_multiplier > > > > > > > + > > > > > > > + Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0. > > > > > > > + Value 0x1 is reserved. > > > > > > > + For a given virtqueue, the address to use for notifications is calculated as follows: > > > > > > > + > > > > > > > + queue_notify_off * notify_off_multiplier + offset > > > > > > > + > > > > > > > + If notify_off_multiplier is 0, all virtqueues use the same address in > > > > > > > + the Notifications structure! > > > > > > > + > > > > > > > + > > > > > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection > > > > > > > +------------------------------- > > > > > > > + > > > > > > > +Legacy drivers skipped Device Layout Detection step, assuming legacy > > > > > > > +configuration space in BAR0 in I/O space unconditionally. > > > > > > > + > > > > > > > +2.4.1.3.1.3 Queue Vector Configuration > > > > > > > -------------------------------------- > > > > > > > > > > > > > > When MSI-X capability is present and enabled in the device > > > > > > > -(through standard PCI configuration space) 4 bytes at byte offset > > > > > > > -20 are used to map configuration change and queue interrupts to > > > > > > > -MSI-X vectors. 
In this case, the ISR Status field is unused, and > > > > > > > -device specific configuration starts at byte offset 24 in virtio > > > > > > > -header structure. When MSI-X capability is not enabled, device > > > > > > > -specific configuration starts at byte offset 20 in virtio header. > > > > > > > +(through standard PCI configuration space) Configuration/Queue > > > > > > > +MSI-X Vector registers are used to map configuration change and queue > > > > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused. > > > > > > > > > > > > > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of > > > > > > > Configuration/Queue Vector registers, maps interrupts triggered > > > > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on > > > > > > > failure, NO_VECTOR is returned. If a mapping failure is detected, > > > > > > > the driver can retry mapping with fewervectors, or disable MSI-X. > > > > > > > > > > > > > > -2.4.1.3.1.2 Virtqueue Configuration > > > > > > > +2.4.1.3.1.4 Virtqueue Configuration > > > > > > > ----------------------------------- > > > > > > > > > > > > > > As a device can have zero or more virtqueues for bulk data > > > > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has: > > > > > > > always a power of 2. This controls how big the virtqueue is > > > > > > > (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. > > > > > > > > > > > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on > > > > > > > - a 4096 byte alignment. Write the physical address, divided by > > > > > > > - 4096 to the Queue Address field.[6] > > > > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size > > > > > > > + field. > > > > > > > + > > > > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the > > > > > > > + virtqueue in contiguous physical memory. > > > > > > > > > > > > > > 4. 
Optionally, if MSI-X capability is present and enabled on the > > > > > > > device, select a vector to use to request interrupts triggered > > > > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has: > > > > > > > Queue Vector field: on success, previously written value is > > > > > > > returned; on failure, NO_VECTOR value is returned. > > > > > > > > > > > > > > + > > > > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration > > > > > > > +----------------------------------- > > > > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio > > > > > > > +device is defined as 4096 bytes. Driver writes the physical address, divided > > > > > > > +by 4096 to the Queue Address field [6]. > > > > > > > + > > > > > > > 2.4.1.3.2 Notifying The Device > > > > > > > ------------------------------ > > > > > > > > > > > > > > Device notification occurs by writing the 16-bit virtqueue index > > > > > > > -of this virtqueue to the Queue Notify field of the virtio header > > > > > > > -in the first I/O region of the PCI device. > > > > > > > +of this virtqueue to the Queue Notify field. > > > > > > > > > > > > > > 2.4.1.3.3 Receiving Used Buffers From The Device > > > > > > > +------------------------------ > > > > > > > > > > > > > > If an interrupt is necessary: > > > > > > > > > > > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390). > > > > > > > This is only allowed if the driver does not use any features > > > > > > > which would alter this early use of the device. > > > > > > > > > > > > > > -[5] ie. once you enable MSI-X on the device, the other fields move. > > > > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at > > > > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not > > > > > > > +enabled, device specific configuration starts at byte offset 20 in virtio > > > > > > > +header. ie. 
once you enable MSI-X on the device, the other fields move. > > > > > > > If you turn it off again, they move back! > > > > > > > > > > > > Thanks, > > > > > > Rusty. > > > > > > > > Cornelia > > > > > > > >
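
The "set virtio configuration" idea in the message above can be read as a small handshake between driver and device. The following C sketch models that handshake only as a thought experiment: every name in it (`ccw_dev`, `set_virtio_configuration`, `other_virtio_ccw_command`, `probe`, `legacy_probe`) is hypothetical, nothing here is real virtio-ccw code, and the format field and data area from the proposal are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names are hypothetical; this only models the proposal's rules. */
enum dev_kind { DEV_LEGACY, DEV_TRANSITIONAL, DEV_MODERN };

struct ccw_dev {
    enum dev_kind kind;
    int  max_revision;   /* highest revision id the device accepts */
    bool config_set;     /* a "set virtio configuration" was accepted */
    bool command_seen;   /* another virtio-ccw command was executed */
    int  revision;       /* negotiated revision id; 0 means legacy */
};

/* Models "set virtio configuration"; true means the device accepted it. */
static bool set_virtio_configuration(struct ccw_dev *d, int revision)
{
    assert(d != NULL);
    if (d->kind == DEV_LEGACY)
        return false;    /* legacy devices do not know the command */
    if (d->command_seen)
        return false;    /* rejected after the first virtio-ccw command */
    if (revision < 1 || revision > d->max_revision)
        return false;    /* configuration not supported */
    d->config_set = true;
    d->revision = revision;
    return true;
}

/* Models any other virtio-ccw command; true means it was executed. */
static bool other_virtio_ccw_command(struct ccw_dev *d)
{
    assert(d != NULL);
    if (d->kind == DEV_MODERN && !d->config_set)
        return false;    /* modern device fences off legacy drivers */
    d->command_seen = true;  /* without a configuration: legacy mode */
    return true;
}

/* Transitional or modern driver: returns the revision in use, or -1. */
static int probe(struct ccw_dev *d, bool transitional_driver)
{
    if (set_virtio_configuration(d, 1))
        return 1;
    if (!transitional_driver)
        return -1;       /* modern driver fails on a legacy device */
    return other_virtio_ccw_command(d) ? 0 : -1;
}

/* Legacy driver: never issues the new command. */
static int legacy_probe(struct ccw_dev *d)
{
    return other_virtio_ccw_command(d) ? 0 : -1;
}
```

Under these assumptions, `probe()` on a transitional device yields revision 1, a transitional driver falls back to revision 0 on a legacy device, and `legacy_probe()` fails only against a modern device. Whether a rejected command should also count as "the first virtio-ccw command" is left open by the proposal; the sketch counts only executed commands.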


  • 9.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 02:16
    Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > Some remarks from my side...
    >
    > On Tue, 27 Aug 2013 10:38:59 +0300
    > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    >
    >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    >> > > This is the new configuration layout.
    >> > >
    >> > > Notes:
    >> > > - Everything is LE
    >> > > - There's a feature bit that means spec 1.0 compliant.
    >> > > - Both devices and drivers can either require the 1.0 interface
    >> > > or try to include compatibility support. The spec isn't forcing
    >> > > this decision.
    >> >
    >> > Hmm, this kind includes other changes already proposed, like the LE
    >> > change and the framing change. I think this conceptually splits nicely:
    >> >
    >> > 1) Feature bit 32 proposal.
    >> > 2) Endian change.
    >> > 3) Framing change.
    >> > 4) PCI layout change.
    >>
    >> Right - they are mostly in different parts of the document.
    >> I put it all together so it's easy to see how we intend to
    >> handle the transition.
    >> So is everyone OK with keeping this in a single patch?
    >
    > The new feature bit is supposed to cover all of this, right? Then this
    > should be one patch.

    Well, the new feature would be "1.0 compliant" which will cover much
    more than this. We'll have a whole document on the differences between
    legacy and 1.0, in fact.

    >> > > - I kept documentation of the legacy interface around, and added notes
    >> > > on transition inline. They are in separate sections each clearly marked
    >> > > "Legacy Interface" so we'll be able to separate them out
    >> > > from the final document as necessary - for now I think it's easier
    >> > > to keep it all together.
    >> >
    >> > Good thinking: most of us know the current spec so it's definitely
    >> > clearer. And makes sure we're thinking about the transition.
    >> >
    >> > > Only virtio PCI has been converted.
    >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    >> > > we can do something similar for MMIO and CCW.
    >> >
    >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    >> > > feature negotiation mechanisms
    >> > >
    >> > > + 32: Feature bit must be set for any device compliant with this
    >> > > + revision of the specification, and acknowledged by all device drivers.
    >
    > Would it make sense to have a bit 33 "rings big endian" whose validity
    > depends on bit 32 set? This would make it possible for ccw to keep its
    > current endianness.

    Very awkward. Our experience with bi-endian devices on PowerPC suggests
    it's far better to not negotiate endian. It also avoids a branch
    everywhere in the driver and host.

    >> > > +Legacy Interface
    >> > > + is an interface specified by an earlier draft of this specification
    >> > > + (up to 0.9.X)
    >> > > +Legacy Device
    >> > > + is a device implemented before this specification was released,
    >> > > + and implementing a legacy interface on the host side
    >> > > +Legacy Driver
    >> > > + is a driver implemented before this specification was released,
    >> > > + and implementing a legacy interface on the guest side
    >> > > +
    >> > > +to simplify transition from these earlier draft interfaces,
    >> > > +it is possible to implement
    >> > > +
    >> > > +Transitional Device
    >> > > + a device supporting both drivers conforming to this
    >> > > + specification, and legacy drivers
    >> > > +
    >> > > +Transitional Driver
    >> > > + a driver supporting both devices conforming to this
    >> > > + specification, and legacy devices
    >
    > What happens to legacy devices in the future? Current implementers
    > will obviously expose legacy devices, which means future drivers need
    > to be transitional or they won't work with what is currently out there.
    > Will legacy stay around (for the forseeable furture)? Will legacy
    > devices still be considered standard compliant (as in "compliant to the
    > legacy standard")?

    Legacy devices will not be spec compliant, though we will map out the
    differences for implementors and to make sure that transitional devices
    & drivers are still possible. This legacy part will eventually be a
    separate document.

    I expect them legacy to be firmly deprecated within a few years: new
    features will probably only be offered for standard devices, and for PCI
    there are some nice proposals in the standard which make life easier,
    eg. for bootloaders.

    >> > > +
    >> > > +Device and driver that require support for revision 1.0 or newer of
    >> > > +the specification to function, are called non-transitional device and driver,
    >> > > +respectively.
    >> > > +
    >> > > +Transitional Drivers can detect Legacy Devices by detecting that
    >> > > +Feature bit 32 is not offered.
    >> > > +Transitional devices can detect Legacy drivers by detecting that
    >> > > +Feature bit 32 has not been acknowledged by driver.
    >
    > Will we use new feature bits for new, incompatible revisions? Or will
    > we try to stay backwards compatible?

    My intent is to stay backwards compatible, using feature bits. At some
    future point we may deprecate older features I suppose.

    >> > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
    >> > > +-----------------------
    >> > > +Regrettably, initial driver implementations used simple layouts, and
    >> > > +devices came to rely on it, despite this specification wording[10]. It
    >> > > +is thus recommended that when using legacy interfaces,
    >> > > +drivers should be conservative in their assumptions,
    >> > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
    >
    > So ANY_LAYOUT and feature bit 32 are mutually exclusive?

    I was thinking feature 32 would imply ANY_LAYOUT; there's a feature
    request for that:
    https://tools.oasis-open.org/issues/browse/VIRTIO-10

    Since there seems to be consensus that we remove the SCSI commands from
    virtio-blk, it seems sensible to do that (which is a blocker for
    ANY_LAYOUT) and then remove ANY_LAYOUT from the non-legacy wording and
    make it clear that we're always independent of framing.

    >> Also Revision ID is only 8 bit - it's designed for hardware where
    >> making a new revision is expensive. In software we'll run out of that
    >> eventually.
    >
    > So Revision ID is a PCI-specific thing, right? Not all transports will
    > necessarily have something equivalent, so they would need to depend on
    > the feature bit.

    Yes.

    Cheers,
    Rusty.
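
The detection rule quoted in this message (a transitional driver treats a device as legacy when feature bit 32 is not offered, and a transitional device does the same for a driver that does not acknowledge it) can be sketched in C as below. This is an illustration under assumptions: the names `SPEC_1_0_FEATURE_BIT`, `feature_window`, `read_features` and `has_spec_1_0_bit` are invented here, and the 32-bit windowing follows the `device_feature_select` description in the quoted PCI patch rather than any transport-independent rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the "spec 1.0 compliant" bit from the patch. */
#define SPEC_1_0_FEATURE_BIT 32

/*
 * Simulates reading a 64-bit feature set through a 32-bit window, as
 * described for device_feature_select in the quoted patch: select 0
 * gives bits 0 to 31, select 1 gives bits 32 to 63, anything else
 * reads as 0.
 */
static uint32_t feature_window(uint64_t features, uint32_t select)
{
    if (select == 0)
        return (uint32_t)(features & UINT32_MAX);
    if (select == 1)
        return (uint32_t)(features >> 32);
    return 0;
}

/*
 * Assembles the feature set visible to the other side. Through the
 * legacy interface only bits 0 to 31 are reachable, so the upper half
 * is treated as all zeroes.
 */
static uint64_t read_features(uint64_t features, bool legacy_interface)
{
    uint64_t lo = feature_window(features, 0);
    uint64_t hi = legacy_interface ? 0 : feature_window(features, 1);
    return lo | (hi << 32);
}

/* True when bit 32 is set, i.e. the peer is not legacy. */
static bool has_spec_1_0_bit(uint64_t features)
{
    return ((features >> SPEC_1_0_FEATURE_BIT) & 1u) != 0;
}
```

The same `has_spec_1_0_bit()` check serves both directions: a driver applies it to the offered features, a device applies it to the acknowledged ones.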


  • 10.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 14:51
    On Mon, 02 Sep 2013 10:56:20 +0930
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > Some remarks from my side...
    > >
    > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > >> > > This is the new configuration layout.
    > >> > >
    > >> > > Notes:
    > >> > > - Everything is LE
    > >> > > - There's a feature bit that means spec 1.0 compliant.
    > >> > > - Both devices and drivers can either require the 1.0 interface
    > >> > > or try to include compatibility support. The spec isn't forcing
    > >> > > this decision.
    > >> >
    > >> > Hmm, this kind includes other changes already proposed, like the LE
    > >> > change and the framing change. I think this conceptually splits nicely:
    > >> >
    > >> > 1) Feature bit 32 proposal.
    > >> > 2) Endian change.
    > >> > 3) Framing change.
    > >> > 4) PCI layout change.
    > >>
    > >> Right - they are mostly in different parts of the document.
    > >> I put it all together so it's easy to see how we intend to
    > >> handle the transition.
    > >> So is everyone OK with keeping this in a single patch?
    > >
    > > The new feature bit is supposed to cover all of this, right? Then this
    > > should be one patch.
    >
    > Well, the new feature would be "1.0 compliant" which will cover much
    > more than this. We'll have a whole document on the differences between
    > legacy and 1.0, in fact.
    >
    > >> > > - I kept documentation of the legacy interface around, and added notes
    > >> > > on transition inline. They are in separate sections each clearly marked
    > >> > > "Legacy Interface" so we'll be able to separate them out
    > >> > > from the final document as necessary - for now I think it's easier
    > >> > > to keep it all together.
    > >> >
    > >> > Good thinking: most of us know the current spec so it's definitely
    > >> > clearer. And makes sure we're thinking about the transition.
    > >> >
    > >> > > Only virtio PCI has been converted.
    > >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    > >> > > we can do something similar for MMIO and CCW.
    > >> >
    > >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    > >> > > feature negotiation mechanisms
    > >> > >
    > >> > > + 32: Feature bit must be set for any device compliant with this
    > >> > > + revision of the specification, and acknowledged by all device drivers.
    > >
    > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > current endianness.
    >
    > Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > it's far better to not negotiate endian. It also avoids a branch
    > everywhere in the driver and host.

    Ok, so let's skip this feature bit and make virtio-ccw big endian for
    the ring?

    >
    > >> > > +Legacy Interface
    > >> > > + is an interface specified by an earlier draft of this specification
    > >> > > + (up to 0.9.X)
    > >> > > +Legacy Device
    > >> > > + is a device implemented before this specification was released,
    > >> > > + and implementing a legacy interface on the host side
    > >> > > +Legacy Driver
    > >> > > + is a driver implemented before this specification was released,
    > >> > > + and implementing a legacy interface on the guest side
    > >> > > +
    > >> > > +to simplify transition from these earlier draft interfaces,
    > >> > > +it is possible to implement
    > >> > > +
    > >> > > +Transitional Device
    > >> > > + a device supporting both drivers conforming to this
    > >> > > + specification, and legacy drivers
    > >> > > +
    > >> > > +Transitional Driver
    > >> > > + a driver supporting both devices conforming to this
    > >> > > + specification, and legacy devices
    > >
    > > What happens to legacy devices in the future? Current implementers
    > > will obviously expose legacy devices, which means future drivers need
    > > to be transitional or they won't work with what is currently out there.
    > > Will legacy stay around (for the forseeable furture)? Will legacy
    > > devices still be considered standard compliant (as in "compliant to the
    > > legacy standard")?
    >
    > Legacy devices will not be spec compliant, though we will map out the
    > differences for implementors and to make sure that transitional devices
    > & drivers are still possible. This legacy part will eventually be a
    > separate document.
    >
    > I expect them legacy to be firmly deprecated within a few years: new
    > features will probably only be offered for standard devices, and for PCI
    > there are some nice proposals in the standard which make life easier,
    > eg. for bootloaders.

    Is there any reason the legacy version can't stay in, with a big fat
    warning "new devices are not supposed to implement this"?

    Another thing we need to keep in mind is that switching a device driver
    to support a new version of a device is not the way it works on every
    system. On s390, there's generally no device driver as such - Linux is
    the oddball there.

    >
    > >> > > +
    > >> > > +Device and driver that require support for revision 1.0 or newer of
    > >> > > +the specification to function, are called non-transitional device and driver,
    > >> > > +respectively.
    > >> > > +
    > >> > > +Transitional Drivers can detect Legacy Devices by detecting that
    > >> > > +Feature bit 32 is not offered.
    > >> > > +Transitional devices can detect Legacy drivers by detecting that
    > >> > > +Feature bit 32 has not been acknowledged by driver.
    > >
    > > Will we use new feature bits for new, incompatible revisions? Or will
    > > we try to stay backwards compatible?
    >
    > My intent is to stay backwards compatible, using feature bits. At some
    > future point we may deprecate older features I suppose.

    ok

    >
    > >> > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
    > >> > > +-----------------------
    > >> > > +Regrettably, initial driver implementations used simple layouts, and
    > >> > > +devices came to rely on it, despite this specification wording[10]. It
    > >> > > +is thus recommended that when using legacy interfaces,
    > >> > > +drivers should be conservative in their assumptions,
    > >> > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
    > >
    > > So ANY_LAYOUT and feature bit 32 are mutually exclusive?
    >
    > I was thinking feature 32 would imply ANY_LAYOUT; there's a feature
    > request for that:
    > https://tools.oasis-open.org/issues/browse/VIRTIO-10
    >
    > Since there seems to be consensus that we remove the SCSI commands from
    > virtio-blk, it seems sensible to do that (which is a blocker for
    > ANY_LAYOUT) and then remove ANY_LAYOUT from the non-legacy wording and
    > make it clear that we're always independent of framing.

    I was a bit confused by the diff, but that sounds sensible.

    >
    > >> Also Revision ID is only 8 bit - it's designed for hardware where
    > >> making a new revision is expensive. In software we'll run out of that
    > >> eventually.
    > >
    > > So Revision ID is a PCI-specific thing, right? Not all transports will
    > > necessarily have something equivalent, so they would need to depend on
    > > the feature bit.
    >
    > Yes.
    >
    > Cheers,
    > Rusty.

    Cornelia
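The detection rule quoted in this message (a transitional driver treats a device as legacy when feature bit 32 is not offered, and a transitional device treats a driver as legacy when bit 32 is not acknowledged) can be sketched in C. This is only an illustration, not text from the patch: the macro `SPEC_1_0_FEATURE_BIT` and the helper names are hypothetical, and the sketch assumes the feature words have already been assembled into a single 64-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the patch only refers to "feature bit 32". */
#define SPEC_1_0_FEATURE_BIT 32u

/* Test one bit in a 64-bit feature mask (bits 0..63, as in the patch). */
static bool feature_test(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (features >> bit) & 1u;
}

/* Transitional driver side: bit 32 not offered means a legacy device. */
static bool device_is_legacy(uint64_t offered)
{
    return !feature_test(offered, SPEC_1_0_FEATURE_BIT);
}

/* Transitional device side: bit 32 not acknowledged means a legacy driver. */
static bool driver_is_legacy(uint64_t acked)
{
    return !feature_test(acked, SPEC_1_0_FEATURE_BIT);
}
```

A transitional driver would call `device_is_legacy()` on the offered features before deciding which interface to use; a non-transitional driver would simply refuse a device for which it returns true.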


  • 11.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 15:15
    On Mon, Sep 02, 2013 at 04:50:24PM +0200, Cornelia Huck wrote:
    > On Mon, 02 Sep 2013 10:56:20 +0930
    > Rusty Russell <rusty@au1.ibm.com> wrote:
    >
    > > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > > Some remarks from my side...
    > > >
    > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > >
    > > >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > >> > > This is the new configuration layout.
    > > >> > >
    > > >> > > Notes:
    > > >> > > - Everything is LE
    > > >> > > - There's a feature bit that means spec 1.0 compliant.
    > > >> > > - Both devices and drivers can either require the 1.0 interface
    > > >> > > or try to include compatibility support. The spec isn't forcing
    > > >> > > this decision.
    > > >> >
    > > >> > Hmm, this kind includes other changes already proposed, like the LE
    > > >> > change and the framing change. I think this conceptually splits nicely:
    > > >> >
    > > >> > 1) Feature bit 32 proposal.
    > > >> > 2) Endian change.
    > > >> > 3) Framing change.
    > > >> > 4) PCI layout change.
    > > >>
    > > >> Right - they are mostly in different parts of the document.
    > > >> I put it all together so it's easy to see how we intend to
    > > >> handle the transition.
    > > >> So is everyone OK with keeping this in a single patch?
    > > >
    > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > should be one patch.
    > >
    > > Well, the new feature would be "1.0 compliant" which will cover much
    > > more than this. We'll have a whole document on the differences between
    > > legacy and 1.0, in fact.
    > >
    > > >> > > - I kept documentation of the legacy interface around, and added notes
    > > >> > > on transition inline. They are in separate sections each clearly marked
    > > >> > > "Legacy Interface" so we'll be able to separate them out
    > > >> > > from the final document as necessary - for now I think it's easier
    > > >> > > to keep it all together.
    > > >> >
    > > >> > Good thinking: most of us know the current spec so it's definitely
    > > >> > clearer. And makes sure we're thinking about the transition.
    > > >> >
    > > >> > > Only virtio PCI has been converted.
    > > >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > >> > > we can do something similar for MMIO and CCW.
    > > >> >
    > > >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > >> > > feature negotiation mechanisms
    > > >> > >
    > > >> > > + 32: Feature bit must be set for any device compliant with this
    > > >> > > + revision of the specification, and acknowledged by all device drivers.
    > > >
    > > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > > current endianness.
    > >
    > > Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > > it's far better to not negotiate endian. It also avoids a branch
    > > everywhere in the driver and host.
    >
    > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > the ring?

    No, that just makes it impossible to have common ring parsing code
    used by both ccw and other transports.

    Plan is that ccw registers stay big endian, what was previously native
    endian becomes little endian.

    But it's hard to handle this since Appendix Y isn't included
    ATM. Is someone going to merge it?


  • 12.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 15:32
    On Mon, 2 Sep 2013 18:16:45 +0300
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > On Mon, Sep 02, 2013 at 04:50:24PM +0200, Cornelia Huck wrote:
    > > On Mon, 02 Sep 2013 10:56:20 +0930
    > > Rusty Russell <rusty@au1.ibm.com> wrote:
    > >
    > > > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > > > Some remarks from my side...
    > > > >
    > > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > > >
    > > > >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > >> > > This is the new configuration layout.
    > > > >> > >
    > > > >> > > Notes:
    > > > >> > > - Everything is LE
    > > > >> > > - There's a feature bit that means spec 1.0 compliant.
    > > > >> > > - Both devices and drivers can either require the 1.0 interface
    > > > >> > > or try to include compatibility support. The spec isn't forcing
    > > > >> > > this decision.
    > > > >> >
    > > > >> > Hmm, this kind includes other changes already proposed, like the LE
    > > > >> > change and the framing change. I think this conceptually splits nicely:
    > > > >> >
    > > > >> > 1) Feature bit 32 proposal.
    > > > >> > 2) Endian change.
    > > > >> > 3) Framing change.
    > > > >> > 4) PCI layout change.
    > > > >>
    > > > >> Right - they are mostly in different parts of the document.
    > > > >> I put it all together so it's easy to see how we intend to
    > > > >> handle the transition.
    > > > >> So is everyone OK with keeping this in a single patch?
    > > > >
    > > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > > should be one patch.
    > > >
    > > > Well, the new feature would be "1.0 compliant" which will cover much
    > > > more than this. We'll have a whole document on the differences between
    > > > legacy and 1.0, in fact.
    > > >
    > > > >> > > - I kept documentation of the legacy interface around, and added notes
    > > > >> > > on transition inline. They are in separate sections each clearly marked
    > > > >> > > "Legacy Interface" so we'll be able to separate them out
    > > > >> > > from the final document as necessary - for now I think it's easier
    > > > >> > > to keep it all together.
    > > > >> >
    > > > >> > Good thinking: most of us know the current spec so it's definitely
    > > > >> > clearer. And makes sure we're thinking about the transition.
    > > > >> >
    > > > >> > > Only virtio PCI has been converted.
    > > > >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > > >> > > we can do something similar for MMIO and CCW.
    > > > >> >
    > > > >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > > >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > > >> > > feature negotiation mechanisms
    > > > >> > >
    > > > >> > > + 32: Feature bit must be set for any device compliant with this
    > > > >> > > + revision of the specification, and acknowledged by all device drivers.
    > > > >
    > > > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > > > current endianness.
    > > >
    > > > Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > > > it's far better to not negotiate endian. It also avoids a branch
    > > > everywhere in the driver and host.
    > >
    > > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > > the ring?
    >
    > No, that just makes it impossible to have common ring parsing code
    > used by both ccw and other transports.
    >
    > Plan is that ccw registers stay big endian, what was previously native
    > endian becomes little endian.

    I assume by "registers" you mean the ccw payload? These obviously need
    to stay big endian for architectural reasons anyway.

    But the ring code? I'm really worried about performance hits here. Have
    not come around to test this, but I'll try to get some numbers with a
    prototype.

    >
    > But it's hard to handle this since Appendix Y isn't included
    > ATM. Is someone going to merge it?
    >


  • 13.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 15:50
    On Mon, Sep 02, 2013 at 05:32:12PM +0200, Cornelia Huck wrote:
    > On Mon, 2 Sep 2013 18:16:45 +0300
    > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    >
    > > On Mon, Sep 02, 2013 at 04:50:24PM +0200, Cornelia Huck wrote:
    > > > On Mon, 02 Sep 2013 10:56:20 +0930
    > > > Rusty Russell <rusty@au1.ibm.com> wrote:
    > > >
    > > > > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > > > > Some remarks from my side...
    > > > > >
    > > > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > > > >
    > > > > >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > > >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > > >> > > This is the new configuration layout.
    > > > > >> > >
    > > > > >> > > Notes:
    > > > > >> > > - Everything is LE
    > > > > >> > > - There's a feature bit that means spec 1.0 compliant.
    > > > > >> > > - Both devices and drivers can either require the 1.0 interface
    > > > > >> > > or try to include compatibility support. The spec isn't forcing
    > > > > >> > > this decision.
    > > > > >> >
    > > > > >> > Hmm, this kind includes other changes already proposed, like the LE
    > > > > >> > change and the framing change. I think this conceptually splits nicely:
    > > > > >> >
    > > > > >> > 1) Feature bit 32 proposal.
    > > > > >> > 2) Endian change.
    > > > > >> > 3) Framing change.
    > > > > >> > 4) PCI layout change.
    > > > > >>
    > > > > >> Right - they are mostly in different parts of the document.
    > > > > >> I put it all together so it's easy to see how we intend to
    > > > > >> handle the transition.
    > > > > >> So is everyone OK with keeping this in a single patch?
    > > > > >
    > > > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > > > should be one patch.
    > > > >
    > > > > Well, the new feature would be "1.0 compliant" which will cover much
    > > > > more than this. We'll have a whole document on the differences between
    > > > > legacy and 1.0, in fact.
    > > > >
    > > > > >> > > - I kept documentation of the legacy interface around, and added notes
    > > > > >> > > on transition inline. They are in separate sections each clearly marked
    > > > > >> > > "Legacy Interface" so we'll be able to separate them out
    > > > > >> > > from the final document as necessary - for now I think it's easier
    > > > > >> > > to keep it all together.
    > > > > >> >
    > > > > >> > Good thinking: most of us know the current spec so it's definitely
    > > > > >> > clearer. And makes sure we're thinking about the transition.
    > > > > >> >
    > > > > >> > > Only virtio PCI has been converted.
    > > > > >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > > > >> > > we can do something similar for MMIO and CCW.
    > > > > >> >
    > > > > >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > > > >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > > > >> > > feature negotiation mechanisms
    > > > > >> > >
    > > > > >> > > + 32: Feature bit must be set for any device compliant with this
    > > > > >> > > + revision of the specification, and acknowledged by all device drivers.
    > > > > >
    > > > > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > > > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > > > > current endianness.
    > > > >
    > > > > Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > > > > it's far better to not negotiate endian. It also avoids a branch
    > > > > everywhere in the driver and host.
    > > >
    > > > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > > > the ring?
    > >
    > > No, that just makes it impossible to have common ring parsing code
    > > used by both ccw and other transports.
    > >
    > > Plan is that ccw registers stay big endian, what was previously native
    > > endian becomes little endian.
    >
    > I assume by "registers" you mean the ccw payload? These obviously need
    > to stay big endian for architectural reasons anyway.
    >
    > But the ring code? I'm really worried about performance hits here.

    Why would you expect the performance hit be bigger than for
    the PCI bus?

    > Have
    > not come around to test this, but I'll try to get some numbers with a
    > prototype.

    That would be very nice. All my boxes with KVM support are
    little endian.

    > >
    > > But it's hard to handle this since Appendix Y isn't included
    > > ATM. Is someone going to merge it?
    > >


  • 14.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 16:24
    On Mon, 2 Sep 2013 18:52:07 +0300
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > On Mon, Sep 02, 2013 at 05:32:12PM +0200, Cornelia Huck wrote:
    > > On Mon, 2 Sep 2013 18:16:45 +0300
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > > > On Mon, Sep 02, 2013 at 04:50:24PM +0200, Cornelia Huck wrote:
    > > > > On Mon, 02 Sep 2013 10:56:20 +0930
    > > > > Rusty Russell <rusty@au1.ibm.com> wrote:
    > > > >
    > > > > > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > > > > > Some remarks from my side...
    > > > > > >
    > > > > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > > > > >
    > > > > > >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > > > >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > > > >> > > This is the new configuration layout.
    > > > > > >> > >
    > > > > > >> > > Notes:
    > > > > > >> > > - Everything is LE
    > > > > > >> > > - There's a feature bit that means spec 1.0 compliant.
    > > > > > >> > > - Both devices and drivers can either require the 1.0 interface
    > > > > > >> > > or try to include compatibility support. The spec isn't forcing
    > > > > > >> > > this decision.
    > > > > > >> >
    > > > > > >> > Hmm, this kind includes other changes already proposed, like the LE
    > > > > > >> > change and the framing change. I think this conceptually splits nicely:
    > > > > > >> >
    > > > > > >> > 1) Feature bit 32 proposal.
    > > > > > >> > 2) Endian change.
    > > > > > >> > 3) Framing change.
    > > > > > >> > 4) PCI layout change.
    > > > > > >>
    > > > > > >> Right - they are mostly in different parts of the document.
    > > > > > >> I put it all together so it's easy to see how we intend to
    > > > > > >> handle the transition.
    > > > > > >> So is everyone OK with keeping this in a single patch?
    > > > > > >
    > > > > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > > > > should be one patch.
    > > > > >
    > > > > > Well, the new feature would be "1.0 compliant" which will cover much
    > > > > > more than this. We'll have a whole document on the differences between
    > > > > > legacy and 1.0, in fact.
    > > > > >
    > > > > > >> > > - I kept documentation of the legacy interface around, and added notes
    > > > > > >> > > on transition inline. They are in separate sections each clearly marked
    > > > > > >> > > "Legacy Interface" so we'll be able to separate them out
    > > > > > >> > > from the final document as necessary - for now I think it's easier
    > > > > > >> > > to keep it all together.
    > > > > > >> >
    > > > > > >> > Good thinking: most of us know the current spec so it's definitely
    > > > > > >> > clearer. And makes sure we're thinking about the transition.
    > > > > > >> >
    > > > > > >> > > Only virtio PCI has been converted.
    > > > > > >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > > > > >> > > we can do something similar for MMIO and CCW.
    > > > > > >> >
    > > > > > >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > > > > >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > > > > >> > > feature negotiation mechanisms
    > > > > > >> > >
    > > > > > >> > > + 32: Feature bit must be set for any device compliant with this
    > > > > > >> > > + revision of the specification, and acknowledged by all device drivers.
    > > > > > >
    > > > > > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > > > > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > > > > > current endianness.
    > > > > >
    > > > > > Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > > > > > it's far better to not negotiate endian. It also avoids a branch
    > > > > > everywhere in the driver and host.
    > > > >
    > > > > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > > > > the ring?
    > > >
    > > > No, that just makes it impossible to have common ring parsing code
    > > > used by both ccw and other transports.
    > > >
    > > > Plan is that ccw registers stay big endian, what was previously native
    > > > endian becomes little endian.
    > >
    > > I assume by "registers" you mean the ccw payload? These obviously need
    > > to stay big endian for architectural reasons anyway.
    > >
    > > But the ring code? I'm really worried about performance hits here.
    >
    >
    > Why would you expect the performance hit be bigger than for
    > the PCI bus?

    We can't really compare with virtio-pci on s390 (won't work with s390
    pci), but as both host and guest are always big endian, we'd need to do
    a byteswap on both sides where we previously needed none.

    >
    > > Have
    > > not come around to test this, but I'll try to get some numbers with a
    > > prototype.
    >
    > That would be very nice. All my boxes with KVM support are
    > little endian.
    >
    > > >
    > > > But it's hard to handle this since Appendix Y isn't included
    > > > ATM. Is someone going to merge it?
    > > >
    >


  • 15.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 16:32
    On Mon, Sep 02, 2013 at 06:24:17PM +0200, Cornelia Huck wrote:
    > On Mon, 2 Sep 2013 18:52:07 +0300
    > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    >
    > > On Mon, Sep 02, 2013 at 05:32:12PM +0200, Cornelia Huck wrote:
    > > > On Mon, 2 Sep 2013 18:16:45 +0300
    > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > >
    > > > > On Mon, Sep 02, 2013 at 04:50:24PM +0200, Cornelia Huck wrote:
    > > > > > On Mon, 02 Sep 2013 10:56:20 +0930
    > > > > > Rusty Russell <rusty@au1.ibm.com> wrote:
    > > > > >
    > > > > > > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > > > > > > Some remarks from my side...
    > > > > > > >
    > > > > > > > On Tue, 27 Aug 2013 10:38:59 +0300
    > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > > > > > > >
    > > > > > > >> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
    > > > > > > >> > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > > > > > >> > > This is the new configuration layout.
    > > > > > > >> > >
    > > > > > > >> > > Notes:
    > > > > > > >> > > - Everything is LE
    > > > > > > >> > > - There's a feature bit that means spec 1.0 compliant.
    > > > > > > >> > > - Both devices and drivers can either require the 1.0 interface
    > > > > > > >> > > or try to include compatibility support. The spec isn't forcing
    > > > > > > >> > > this decision.
    > > > > > > >> >
    > > > > > > >> > Hmm, this kind includes other changes already proposed, like the LE
    > > > > > > >> > change and the framing change. I think this conceptually splits nicely:
    > > > > > > >> >
    > > > > > > >> > 1) Feature bit 32 proposal.
    > > > > > > >> > 2) Endian change.
    > > > > > > >> > 3) Framing change.
    > > > > > > >> > 4) PCI layout change.
    > > > > > > >>
    > > > > > > >> Right - they are mostly in different parts of the document.
    > > > > > > >> I put it all together so it's easy to see how we intend to
    > > > > > > >> handle the transition.
    > > > > > > >> So is everyone OK with keeping this in a single patch?
    > > > > > > >
    > > > > > > > The new feature bit is supposed to cover all of this, right? Then this
    > > > > > > > should be one patch.
    > > > > > >
    > > > > > > Well, the new feature would be "1.0 compliant" which will cover much
    > > > > > > more than this. We'll have a whole document on the differences between
    > > > > > > legacy and 1.0, in fact.
    > > > > > >
    > > > > > > >> > > - I kept documentation of the legacy interface around, and added notes
    > > > > > > >> > > on transition inline. They are in separate sections each clearly marked
    > > > > > > >> > > "Legacy Interface" so we'll be able to separate them out
    > > > > > > >> > > from the final document as necessary - for now I think it's easier
    > > > > > > >> > > to keep it all together.
    > > > > > > >> >
    > > > > > > >> > Good thinking: most of us know the current spec so it's definitely
    > > > > > > >> > clearer. And makes sure we're thinking about the transition.
    > > > > > > >> >
    > > > > > > >> > > Only virtio PCI has been converted.
    > > > > > > >> > > Let's discuss this on the meeting tonight, once we figure out PCI
    > > > > > > >> > > we can do something similar for MMIO and CCW.
    > > > > > > >> >
    > > > > > > >> > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > > > > > > >> > > 24 to 31: Feature bits reserved for extensions to the queue and
    > > > > > > >> > > feature negotiation mechanisms
    > > > > > > >> > >
    > > > > > > >> > > + 32: Feature bit must be set for any device compliant with this
    > > > > > > >> > > + revision of the specification, and acknowledged by all device drivers.
    > > > > > > >
    > > > > > > > Would it make sense to have a bit 33 "rings big endian" whose validity
    > > > > > > > depends on bit 32 set? This would make it possible for ccw to keep its
    > > > > > > > current endianness.
    > > > > > >
    > > > > > > Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > > > > > > it's far better to not negotiate endian. It also avoids a branch
    > > > > > > everywhere in the driver and host.
    > > > > >
    > > > > > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > > > > > the ring?
    > > > >
    > > > > No, that just makes it impossible to have common ring parsing code
    > > > > used by both ccw and other transports.
    > > > >
    > > > > Plan is that ccw registers stay big endian, what was previously native
    > > > > endian becomes little endian.
    > > >
    > > > I assume by "registers" you mean the ccw payload? These obviously need
    > > > to stay big endian for architectural reasons anyway.
    > > >
    > > > But the ring code? I'm really worried about performance hits here.
    > >
    > >
    > > Why would you expect the performance hit be bigger than for
    > > the PCI bus?
    >
    > We can't really compare with virtio-pci on s390 (won't work with s390
    > pci), but as both host and guest are always big endian, we'd need to do
    > a byteswap on both sides where we previously needed none.

    Well that's exactly the same with pci if both host and guest are big
    endian.

    > >
    > > > Have
    > > > not come around to test this, but I'll try to get some numbers with a
    > > > prototype.
    > >
    > > That would be very nice. All my boxes with KVM support are
    > > little endian.
    > >
    > > > >
    > > > > But it's hard to handle this since Appendix Y isn't included
    > > > > ATM. Is someone going to merge it?
    > > > >
    > >
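The byte swap debated in this exchange is the conversion a big-endian guest or host would perform on every little-endian ring field. A minimal sketch in C, assuming byte-wise access helpers with hypothetical names (`load_le16`, `store_le16`, `load_le32` and `store_le32` are illustrations, not part of the patch or of any existing header):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit little-endian field, independent of host byte order. */
static uint16_t load_le16(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Write a 16-bit value as a little-endian field. */
static void store_le16(uint8_t *p, uint16_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Read a 32-bit little-endian field, independent of host byte order. */
static uint32_t load_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value as a little-endian field. */
static void store_le32(uint8_t *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)(v >> 24);
}
```

On a little-endian machine a compiler can typically reduce helpers like these to plain loads and stores, while on a big-endian machine each call implies a byte swap. That per-field cost, paid on both the guest and the host side, is what the prototype measurements mentioned above would quantify.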


  • 16.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 16:34
    On Mon, Sep 02, 2013 at 06:16:45PM +0300, Michael S. Tsirkin wrote:
    > But it's hard to handle this since Appendix Y isn't included
    > ATM. Is someone going to merge it?

    This is kind of blocking progress on the endian-ness thing ATM.
    OTOH if there's no plan to merge CCW, we won't need to
    bother about it.
    Maybe we should discuss this on the next TC meeting.

    --
    MST


  • 17.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-04-2013 07:59
    "Michael S. Tsirkin" <mst@redhat.com> writes: > On Mon, Sep 02, 2013 at 06:16:45PM +0300, Michael S. Tsirkin wrote: >> But it's hard to handle this since Appendix Y isn't included >> ATM. Is someone going to merge it? > > This is kind of blocking progress on the endian-ness thing ATM. > OTOH if there's no plan to merge CCW, we won't need to > bother about it. > Maybe we should discuss this on the next TC meeting. Even if we don't merge CCW, if we specify LE always, it will effect CCW if they want to be compliant. I'm going to implement the LE stuff for lguest (lguest is x86, so it's already LE, so I'll test by making it BE, but it'll show what an transition implementation would look like). Probably not this week though, hopefully before next meeting. Cheers, Rusty.


  • 18.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-03-2013 06:52
    Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > On Mon, 02 Sep 2013 10:56:20 +0930
    > Rusty Russell <rusty@au1.ibm.com> wrote:
    >
    >> Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    >> > Would it make sense to have a bit 33 "rings big endian" whose validity
    >> > depends on bit 32 set? This would make it possible for ccw to keep its
    >> > current endianness.
    >>
    >> Very awkward. Our experience with bi-endian devices on PowerPC suggests
    >> it's far better to not negotiate endian. It also avoids a branch
    >> everywhere in the driver and host.
    >
    > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > the ring?

    That's one option: it would allow current Linux guests to do a
    compile-time endian switch because we currently assume ARCH=s390 is
    equivalent to "using virtio-ccw only". QEMU may be able to make the
    same assumption.

    I worry about future virtio-pci on s390, where this issue will become
    more complicated. "LE everywhere" is a simple rule.

    We still have the question of the endian of per-device headers
    (eg. virtio-blk header) and per-device config space. I'm assuming we'll
    use the same endian as for the ring.

    >> Legacy devices will not be spec compliant, though we will map out the
    >> differences for implementors and to make sure that transitional devices
    >> & drivers are still possible. This legacy part will eventually be a
    >> separate document.
    >>
    >> I expect them legacy to be firmly deprecated within a few years: new
    >> features will probably only be offered for standard devices, and for PCI
    >> there are some nice proposals in the standard which make life easier,
    >> eg. for bootloaders.
    >
    > Is there any reason the legacy version can't stay in, with a big fat
    > warning "new devices are not supposed to implement this"?

    In the long term, a simpler spec will increase adoption. That's why I
    proposed a separate document. That looks better than a document studded
    with examples of our previous mistakes :)

    > Another thing we need to keep in mind is that switching a device driver
    > to support a new version of a device is not the way it works on every
    > system. On s390, there's generally no device driver as such - Linux is
    > the oddball there.

    Yes, this makes the transition more difficult. But I believe we are
    still at the stage where there are more implementations ahead of us than
    exist now, so my preference is for immediate pain and long-term gain.

    Cheers,
    Rusty.


  • 19.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-03-2013 10:50
    On Tue, 03 Sep 2013 13:27:58 +0930
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > > On Mon, 02 Sep 2013 10:56:20 +0930
    > > Rusty Russell <rusty@au1.ibm.com> wrote:
    > >
    > >> Cornelia Huck <cornelia.huck@de.ibm.com> writes:
    > >> > Would it make sense to have a bit 33 "rings big endian" whose validity
    > >> > depends on bit 32 set? This would make it possible for ccw to keep its
    > >> > current endianness.
    > >>
    > >> Very awkward. Our experience with bi-endian devices on PowerPC suggests
    > >> it's far better to not negotiate endian. It also avoids a branch
    > >> everywhere in the driver and host.
    > >
    > > Ok, so let's skip this feature bit and make virtio-ccw big endian for
    > > the ring?
    >
    > That's one option: it would allow current Linux guests to do a
    > compile-time endian switch because we currently assume ARCH=s390 is
    > equivalent to "using virtio-ccw only". QEMU may be able to make the
    > same assumption.
    >
    > I worry about future virtio-pci on s390, where this issue will become
    > more complicated. "LE everywhere" is a simple rule.

    I don't see how virtio-pci could be made to work with the s390 pci
    implementation, so I don't think we will see anything but virtio-ccw on
    s390 machines.

    >
    > We still have the question of the endian of per-device headers
    > (eg. virtio-blk header) and per-device config space. I'm assuming we'll
    > use the same endian as for the ring.

    Config "space" for virtio-ccw is not accessed through memory writes but
    through special channel commands. Whether it is big or little endian is
    completely orthogonal to the endianness of the ring etc. - both would be
    fine.

    Cornelia


  • 20.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 02:16
    "Michael S. Tsirkin" <mst@redhat.com> writes: > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote: >> "Michael S. Tsirkin" <mst@redhat.com> writes: >> > This is the new configuration layout. >> > >> > Notes: >> > - Everything is LE >> > - There's a feature bit that means spec 1.0 compliant. >> > - Both devices and drivers can either require the 1.0 interface >> > or try to include compatibility support. The spec isn't forcing >> > this decision. >> >> Hmm, this kind includes other changes already proposed, like the LE >> change and the framing change. I think this conceptually splits nicely: >> >> 1) Feature bit 32 proposal. >> 2) Endian change. >> 3) Framing change. >> 4) PCI layout change. > > Right - they are mostly in different parts of the document. > I put it all together so it's easy to see how we intend to > handle the transition. > So is everyone OK with keeping this in a single patch? > >> > - I kept documentation of the legacy interface around, and added notes >> > on transition inline. They are in separate sections each clearly marked >> > "Legacy Interface" so we'll be able to separate them out >> > from the final document as necessary - for now I think it's easier >> > to keep it all together. >> >> Good thinking: most of us know the current spec so it's definitely >> clearer. And makes sure we're thinking about the transition. >> >> > Only virtio PCI has been converted. >> > Let's discuss this on the meeting tonight, once we figure out PCI >> > we can do something similar for MMIO and CCW. >> >> > @@ -137,6 +139,11 @@ Feature bits are allocated as follows: >> > 24 to 31: Feature bits reserved for extensions to the queue and >> > feature negotiation mechanisms >> > >> > + 32: Feature bit must be set for any device compliant with this >> > + revision of the specification, and acknowledged by all device drivers. 
    >> > +
    >> > + 33 to 63: Feature bits reserved for future extensions
    >> > +
    >> > For example, feature bit 0 for a network device (i.e. Subsystem
    >> > Device ID 1) indicates that the device supports checksumming of
    >> > packets.
    >>
    >> Why stop at 63? If we go to a more decentralized feature-assignment
    >> model, we'll run through those very fast.
    >
    > Then we'll just document more, but driver needs to know where to stop
    > looking for features.

    Shouldn't it stop looking at the highest feature it understands? Yes,
    this makes a problem for diagnostics, but they can use a simple
    heuristic (stop at first word which is all zeroes). Or we need a
    register with an explicit limit.

    If the spec says there won't be features above 63, we'd need a revision
    to change that.

    >> I think we should stop abusing Revision IDs, and start using them
    >> to reflect device version changes as intended.
    >>
    >> We could reserve revision id 0 for legacy devices, however, which should
    >> work nicely.
    >
    > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?

    AFAICT it's a PCI question more than a virtio question. The use of
    revision IDs is left open in the PCI spec, but Microsoft's requirements
    here seem reasonable:

    http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx

    > More concerns:
    >
    > We are using revision ID now exactly as was intended to disable old
    > drivers - it served us well for 0.X-1.X and would be as useful if we
    > ever have 1.X->2.0 transition.

    I think we'll use feature bit 64 for such a transition, now we don't
    have a feature bit limit.

    > Another worry with using revision numbering for features is that
    > it does not play well with downstreams.
    > E.g. RHEL might want to cherry-pick a feature without implementing
    > other features that happened to land in the same revision.
    >
    > Also Revision ID is only 8 bit - it's designed for hardware where
    > making a new revision is expensive. In software we'll run out of that
    > eventually.
This is why I think we leave it open in the standard. If people want to use it as another way for them to implement driver workarounds, they can (probably a combination of device ID, rev id and features offered). >> > 2.4.1.2 PCI Device Layout >> > ------------------------- >> > >> > -To configure the device, we use the first I/O region of the PCI >> > -device. This contains a virtio header followed by a >> > -device-specific region. >> > +To configure the device, >> > +use I/O and/or memory regions and/or PCI configuration space of the PCI device. >> > +These contain the virtio header registers, the notification register, the >> > +ISR status register and device specific registers, as specified by Virtio >> > ++ Structure PCI Capabilities >> > + >> > +There may be different widths of accesses to the I/O region; the >> > +“natural” access method for each field must be >> > +used (i.e. 32-bit accesses for 32-bit fields, etc). >> > + >> > +PCI Device Configuration Layout includes the common configuration, >> > +ISR, notification and device specific configuration >> > +structures. >> > + >> > +Unless explicitly specified otherwise, all multi-byte fields are little-endian. >> > + >> > + >> > +2.4.1.2.1 Common configuration structure layout >> > +------------------------- >> > +Common configuration structure layout is documented below: >> > + >> > +struct virtio_pci_common_cfg { >> > + /* About the whole device. */ >> > + __le32 device_feature_select; /* read-write */ >> > + __le32 device_feature; /* read-only */ >> > + __le32 guest_feature_select; /* read-write */ >> > + __le32 guest_feature; /* read-write */ >> > + __le16 msix_config; /* read-write */ >> > + __le16 num_queues; /* read-only */ >> > + __u8 device_status; /* read-write */ >> > + __u8 unused1; >> > + >> > + /* About a specific virtqueue. */ >> > + __le16 queue_select; /* read-write */ >> > + __le16 queue_size; /* read-write, power of 2, or 0. 
*/ >> > + __le16 queue_msix_vector; /* read-write */ >> > + __le16 queue_enable; /* read-write */ >> > + __le16 queue_notify_off; /* read-only */ >> > + __le64 queue_desc; /* read-write */ >> > + __le64 queue_avail; /* read-write */ >> > + __le64 queue_used; /* read-write */ >> > +}; >> > + >> > +device_feature_select >> > + >> > + Selects which Feature Bits does device_feature field refer to. >> > + Value 0x0 selects Feature Bits 0 to 31 >> > + Value 0x1 selects Feature Bits 32 to 63 >> > + All other values cause reads from device_feature to return 0. >> > + >> > +device_feature >> > + >> > + Used by Device to report Feature Bits to Driver. >> > + Device Feature Bits selected by device_feature_select. >> > + >> > +guest_feature_select >> > + >> > + Selects which Feature Bits does guest_feature field refer to. >> > + Value 0x0 selects Feature Bits 0 to 31 >> > + Value 0x1 selects Feature Bits 32 to 63 >> > + All other values cause writes to guest_feature to be ignored, >> > + and reads to return 0. >> > + >> > +guest_feature >> > + >> > + Used by Driver to acknowledge Feature Bits to Device. >> > + Guest Feature Bits selected by guest_feature_select. >> > + >> > +msix_config >> > + >> > + Configuration Vector for MSI-X. >> > + >> > +num_queues >> > + >> > + Specifies the maximum number of virtqueues supported by device. >> > + >> > +device_status >> > + >> > + Device Status field. >> > + >> > +queue_select >> > + >> > + Queue Select. Selects which virtqueue do other fields refer to. >> > + >> > +queue_size >> > + >> > + Queue Size. On reset, specifies the maximum queue size supported by >> > + the hypervisor. This can be modified by driver to reduce memory requirements. >> > + Set to 0 if this virtqueue is unused. >> > + >> > +queue_msix_vector >> > + >> > + Queue Vector for MSI-X. >> > + >> > +queue_enable >> > + >> > + Used to selectively prevent host from executing requests from this virtqueue. 
>> > + 1 - enabled; 0 - disabled >> > + >> > +queue_notify_off >> > + >> > + Used to calculate the offset from start of Notification structure at >> > + which this virtqueue is located. >> > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. >> > + >> > +queue_desc >> > + >> > + Physical address of Descriptor Table. >> > + >> > +queue_avail >> > + >> > + Physical address of Available Ring. >> > + >> > +queue_used >> > + >> > + Physical address of Used Ring. >> > + >> > + >> > +2.4.1.2.2 ISR status structure layout >> > +------------------------- >> > +ISR status structure includes a single 8-bite ISR status field >> >> 8-bit > > Right :) > >> > + >> > +2.4.1.2.3 Notification structure layout >> > +------------------------- >> > +Notification structure is always a multiple of 2 bytes in size. >> > +It includes 2-byte Queue Notify fields for each virtqueue of >> > +the device. Note that multiple virtqueues can use the same >> > +Queue Notify field, if necessary. >> >> Hmm, maybe move this down, so you can have a section which starts with >> "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all >> together. > > so Move PCI Device Layout to within > PCI-specific Initialization And Device Operation? OK, maybe not. This is simple, we can always reshuffle after we've made all the changes we want. >> > + >> > +2.4.1.2.4 Device specific structure >> > +------------------------- >> > + >> > +Device specific structure is optional. >> > + >> > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout >> > +------------------------- >> > + >> > +Transitional devices should present part of configuration >> > +registers in a legacy configuration structure in BAR0 in the first I/O >> > +region of the PCI device, as documented below. >> > >> > There may be different widths of accesses to the I/O region; the >> > “natural” access method for each field in the virtio header must be >> > -used (i.e. 
32-bit accesses for 32-bit fields, etc), but the >> > +used (i.e. 32-bit accesses for 32-bit fields, etc), but >> > +When accessed through the legacy interface the >> > device-specific region can be accessed using any width accesses, and >> > should obtain the same results. >> > >> > Note that this is possible because while the virtio header is PCI >> > -(i.e. little) endian, the device-specific region is encoded in >> > -the native endian of the guest (where such distinction is >> > +(i.e. little) endian, when using the legacy interface the device-specific >> > +region is encoded in the native endian of the guest (where such distinction is >> > applicable). >> > >> > -2.4.1.2.1 PCI Device Virtio Header >> > ----------------------------------- >> > >> > -The virtio header looks as follows: >> > +When used through the legacy interface, the virtio header looks as follows: >> > >> > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ >> > Bits 32 32 32 16 16 16 8 8 >> > @@ -661,7 +905,6 @@ The virtio header looks as follows: >> > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status >> > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ >> > >> > - >> > If MSI-X is enabled for the device, two additional fields >> > immediately follow this header:[5] >> > >> > @@ -689,25 +932,154 @@ device-specific headers: >> > >> > +------------++--------------------+ >> > >> > +Note that only Feature Bits 0 to 31 are accessible through the >> > +Legacy Interface. When used through the Legacy Interface, >> > +Transitional Devices must assume that Feature Bits 32 to 63 >> > +are not acknowledged by Driver. 
>> > + >> > + >> > 2.4.1.3 PCI-specific Initialization And Device Operation >> > -------------------------------------------------------- >> > >> > -The page size for a virtqueue on a PCI virtio device is defined as >> > -4096 bytes. >> > - >> > 2.4.1.3.1 Device Initialization >> > ------------------------------- >> > >> > -2.4.1.3.1.1 Queue Vector Configuration >> > +This documents PCI-specific steps executed during Device Initialization. >> > +As the first step, driver must detect device configuration layout >> > +to locate configuration fields in memory,I/O or configuration space of the >> > +device. >> > + >> > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection >> > +------------------------------- >> > + >> > +As a prerequisite to device initialization, driver executes a >> > +PCI capability list scan, detecting virtio configuration layout using Virtio >> > +Structure PCI capabilities. >> > + >> > +Virtio Device Configuration Layout includes virtio configuration header, Notification >> > +and ISR Status and device configuration structures. >> > +Each structure can be mapped by a Base Address register (BAR) belonging to >> > +the function, located beginning at 10h in Configuration Space, >> > +or accessed though PCI configuration space. >> > + >> > +Actual location of each structure is specified using vendor-specific PCI capability located >> > +on capability list in PCI configuration space of the device. >> > +This virtio structure capability uses little-endian format; all bits are >> > +read-only: >> > + >> > +struct virtio_pci_cap { >> > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ >> > + __u8 cap_next; /* Generic PCI field: next ptr. */ >> > + __u8 cap_len; /* Generic PCI field: capability length */ >> > + __u8 cfg_type; /* Identifies the structure. */ >> > + __u8 bar; /* Where to find it. */ >> > + __u8 padding[3];/* Pad to full dword. */ >> > + __le32 offset; /* Offset within bar. 
*/ >> > + __le32 length; /* Length of the structure, in bytes. */ >> > +}; >> > + >> > +This structure can optionally followed by extra data, depending on >> > +other fields, as documented below. >> > + >> > +The fields are interpreted as follows: >> > + >> > +cap_vndr >> > + 0x09; Identifies a vendor-specific capability. >> > + >> > +cap_next >> > + Link to next capability in the capability list in the configuration space. >> > + >> > +cap_len >> > + Length of the capability structure, including the whole of >> > + struct virtio_pci_cap, and extra data if any. >> > + This length might include padding, or fields unused by the driver. >> > + >> > +cfg_type >> > + identifies the structure, according to the following table. >> > + >> > + /* Common configuration */ >> > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 >> > + /* Notifications */ >> > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 >> > + /* ISR Status */ >> > + #define VIRTIO_PCI_CAP_ISR_CFG 3 >> > + /* Device specific configuration */ >> > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 >> > + >> > + More than one capability can identify the same structure - this makes it >> > + possible for the device to expose multiple interfaces to drivers. The order of >> > + the capabilities in the capability list specifies the order of preference >> > + suggested by the device; drivers should use the first interface that they can >> > + support. For example, on some hypervisors, notifications using IO accesses are >> > + faster than memory accesses. In this case, hypervisor can expose two >> > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: >> > + the first one addressing an I/O BAR, the second one addressing a memory BAR. >> > + Driver will use the I/O BAR if I/O resources are available, and fall back on >> > + memory BAR when I/O resources are unavailable. 
    >> > +
    >> > +bar
    >> > +
    >> > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
    >> > + the function located beginning at 10h in Configuration Space
    >> > + and used to map the structure into Memory or I/O Space.
    >> > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
    >> > + or I/O Space.
    >> > +
    >> > + The value 0xF specifies that the structure is in PCI configuration space
    >> > + inline with this capability structure, following (not necessarily immediately)
    >> > + the length field.
    >>
    >> Why not immediately?
    >> Or how would the driver know where it is?
    >
    > It's at the offset.
    >
    > E.g. for notification we stick multiplier after length.
    > Further, we might extend virtio_pci_cap in the future,
    > and we don't want to move stuff around like we
    > had to with MSI-X.

    So, offset is described as "/* Offset within bar. */", but if bar is
    0xF, it's offset within the config space?

    This adds complexity: do we really want this?

    Cheers,
    Rusty.
    PS. Sorry for delay, took some time off.
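
    The diagnostics heuristic mentioned earlier in this message ("stop at
    first word which is all zeroes") could be sketched roughly as follows.
    This is an illustration only: read_feature_word_fn stands in for
    "write device_feature_select, then read device_feature", and struct
    fake_dev is a made-up in-memory device; neither comes from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor: select a 32-bit feature word, then read it. */
typedef uint32_t (*read_feature_word_fn)(void *dev, uint32_t select);

/* Copy feature words into out[] until the first all-zero word or until
 * max_words have been read. Returns the number of words stored. */
static size_t scan_feature_words(void *dev, read_feature_word_fn read_word,
                                 uint32_t *out, size_t max_words)
{
    size_t n = 0;

    assert(read_word != NULL && out != NULL);
    while (n < max_words) {
        uint32_t word = read_word(dev, (uint32_t)n);

        if (word == 0)
            break;              /* heuristic: treat an empty word as the end */
        out[n++] = word;
    }
    return n;
}

/* Made-up device model, used only to exercise the scan. */
struct fake_dev {
    const uint32_t *words;
    size_t count;
};

static uint32_t fake_read_word(void *dev, uint32_t select)
{
    const struct fake_dev *d = dev;

    /* Out-of-range selects read as 0, as the quoted patch text describes. */
    return select < d->count ? d->words[select] : 0;
}
```

    One caveat of the heuristic: a device that offers nothing in bits 0 to
    31 but does set bit 32 would be cut off at the first word, which is one
    argument for the explicit limit register suggested as the alternative.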


  • 21.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 04:07
    On Mon, Sep 02, 2013 at 10:45:11AM +0930, Rusty Russell wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote: > >> "Michael S. Tsirkin" <mst@redhat.com> writes: > >> > This is the new configuration layout. > >> > > >> > Notes: > >> > - Everything is LE > >> > - There's a feature bit that means spec 1.0 compliant. > >> > - Both devices and drivers can either require the 1.0 interface > >> > or try to include compatibility support. The spec isn't forcing > >> > this decision. > >> > >> Hmm, this kind includes other changes already proposed, like the LE > >> change and the framing change. I think this conceptually splits nicely: > >> > >> 1) Feature bit 32 proposal. > >> 2) Endian change. > >> 3) Framing change. > >> 4) PCI layout change. > > > > Right - they are mostly in different parts of the document. > > I put it all together so it's easy to see how we intend to > > handle the transition. > > So is everyone OK with keeping this in a single patch? > > > >> > - I kept documentation of the legacy interface around, and added notes > >> > on transition inline. They are in separate sections each clearly marked > >> > "Legacy Interface" so we'll be able to separate them out > >> > from the final document as necessary - for now I think it's easier > >> > to keep it all together. > >> > >> Good thinking: most of us know the current spec so it's definitely > >> clearer. And makes sure we're thinking about the transition. > >> > >> > Only virtio PCI has been converted. > >> > Let's discuss this on the meeting tonight, once we figure out PCI > >> > we can do something similar for MMIO and CCW. 
    > >>
    > >> > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
    > >> > 24 to 31: Feature bits reserved for extensions to the queue and
    > >> > feature negotiation mechanisms
    > >> >
    > >> > + 32: Feature bit must be set for any device compliant with this
    > >> > + revision of the specification, and acknowledged by all device drivers.
    > >> > +
    > >> > + 33 to 63: Feature bits reserved for future extensions
    > >> > +
    > >> > For example, feature bit 0 for a network device (i.e. Subsystem
    > >> > Device ID 1) indicates that the device supports checksumming of
    > >> > packets.
    > >>
    > >> Why stop at 63? If we go to a more decentralized feature-assignment
    > >> model, we'll run through those very fast.
    > >
    > > Then we'll just document more, but driver needs to know where to stop
    > > looking for features.
    >
    > Shouldn't it stop looking at the highest feature it understands? Yes,
    > this makes a problem for diagnostics, but they can use a simple
    > heuristic (stop at first word which is all zeroes). Or we need a
    > register with an explicit limit.
    >
    > If the spec says there won't be features above 63, we'd need a revision
    > to change that.
    >
    > >> I think we should stop abusing Revision IDs, and start using them
    > >> to reflect device version changes as intended.
    > >>
    > >> We could reserve revision id 0 for legacy devices, however, which should
    > >> work nicely.
    > >
    > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
    >
    > AFAICT it's a PCI question more than a virtio question. The use of
    > revision IDs is left open in the PCI spec, but Microsoft's requirements
    > here seem reasonable:
    >
    > http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx

    They are only reasonable if you consider a requirement that users
    load new drivers for new revision of devices reasonable.
    > > More concerns:
    > >
    > > We are using revision ID now exactly as was intended to disable old
    > > drivers - it served us well for 0.X-1.X and would be as useful if we
    > > ever have 1.X->2.0 transition.
    >
    > I think we'll use feature bit 64 for such a transition, now we don't
    > have a feature bit limit.

    Windows drivers only match on device, vendor id, subsystem ids and revision.

    Once you match on that you can fail after checking feature bits
    but it's too late to have another driver loaded.

    > > Another worry with using revision numbering for features is that
    > > it does not play well with downstreams.
    > > E.g. RHEL might want to cherry-pick a feature without implementing
    > > other features that happened to land in the same revision.
    > >
    > > Also Revision ID is only 8 bit - it's designed for hardware where
    > > making a new revision is expensive. In software we'll run out of that
    > > eventually.
    >
    > This is why I think we leave it open in the standard. If people want to
    > use it as another way for them to implement driver workarounds, they
    > can (probably a combination of device ID, rev id and features offered).
    >
    > >> > 2.4.1.2 PCI Device Layout
    > >> > -------------------------
    > >> >
    > >> > -To configure the device, we use the first I/O region of the PCI
    > >> > -device. This contains a virtio header followed by a
    > >> > -device-specific region.
    > >> > +To configure the device,
    > >> > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
    > >> > +These contain the virtio header registers, the notification register, the
    > >> > +ISR status register and device specific registers, as specified by Virtio
    > >> > ++ Structure PCI Capabilities
    > >> > +
    > >> > +There may be different widths of accesses to the I/O region; the
    > >> > +“natural” access method for each field must be
    > >> > +used (i.e. 32-bit accesses for 32-bit fields, etc).
> >> > + > >> > +PCI Device Configuration Layout includes the common configuration, > >> > +ISR, notification and device specific configuration > >> > +structures. > >> > + > >> > +Unless explicitly specified otherwise, all multi-byte fields are little-endian. > >> > + > >> > + > >> > +2.4.1.2.1 Common configuration structure layout > >> > +------------------------- > >> > +Common configuration structure layout is documented below: > >> > + > >> > +struct virtio_pci_common_cfg { > >> > + /* About the whole device. */ > >> > + __le32 device_feature_select; /* read-write */ > >> > + __le32 device_feature; /* read-only */ > >> > + __le32 guest_feature_select; /* read-write */ > >> > + __le32 guest_feature; /* read-write */ > >> > + __le16 msix_config; /* read-write */ > >> > + __le16 num_queues; /* read-only */ > >> > + __u8 device_status; /* read-write */ > >> > + __u8 unused1; > >> > + > >> > + /* About a specific virtqueue. */ > >> > + __le16 queue_select; /* read-write */ > >> > + __le16 queue_size; /* read-write, power of 2, or 0. */ > >> > + __le16 queue_msix_vector; /* read-write */ > >> > + __le16 queue_enable; /* read-write */ > >> > + __le16 queue_notify_off; /* read-only */ > >> > + __le64 queue_desc; /* read-write */ > >> > + __le64 queue_avail; /* read-write */ > >> > + __le64 queue_used; /* read-write */ > >> > +}; > >> > + > >> > +device_feature_select > >> > + > >> > + Selects which Feature Bits does device_feature field refer to. > >> > + Value 0x0 selects Feature Bits 0 to 31 > >> > + Value 0x1 selects Feature Bits 32 to 63 > >> > + All other values cause reads from device_feature to return 0. > >> > + > >> > +device_feature > >> > + > >> > + Used by Device to report Feature Bits to Driver. > >> > + Device Feature Bits selected by device_feature_select. > >> > + > >> > +guest_feature_select > >> > + > >> > + Selects which Feature Bits does guest_feature field refer to. 
> >> > + Value 0x0 selects Feature Bits 0 to 31 > >> > + Value 0x1 selects Feature Bits 32 to 63 > >> > + All other values cause writes to guest_feature to be ignored, > >> > + and reads to return 0. > >> > + > >> > +guest_feature > >> > + > >> > + Used by Driver to acknowledge Feature Bits to Device. > >> > + Guest Feature Bits selected by guest_feature_select. > >> > + > >> > +msix_config > >> > + > >> > + Configuration Vector for MSI-X. > >> > + > >> > +num_queues > >> > + > >> > + Specifies the maximum number of virtqueues supported by device. > >> > + > >> > +device_status > >> > + > >> > + Device Status field. > >> > + > >> > +queue_select > >> > + > >> > + Queue Select. Selects which virtqueue do other fields refer to. > >> > + > >> > +queue_size > >> > + > >> > + Queue Size. On reset, specifies the maximum queue size supported by > >> > + the hypervisor. This can be modified by driver to reduce memory requirements. > >> > + Set to 0 if this virtqueue is unused. > >> > + > >> > +queue_msix_vector > >> > + > >> > + Queue Vector for MSI-X. > >> > + > >> > +queue_enable > >> > + > >> > + Used to selectively prevent host from executing requests from this virtqueue. > >> > + 1 - enabled; 0 - disabled > >> > + > >> > +queue_notify_off > >> > + > >> > + Used to calculate the offset from start of Notification structure at > >> > + which this virtqueue is located. > >> > + Note: this is *not* an offset in bytes. See notify_off_multiplier below. > >> > + > >> > +queue_desc > >> > + > >> > + Physical address of Descriptor Table. > >> > + > >> > +queue_avail > >> > + > >> > + Physical address of Available Ring. > >> > + > >> > +queue_used > >> > + > >> > + Physical address of Used Ring. 
> >> > + > >> > + > >> > +2.4.1.2.2 ISR status structure layout > >> > +------------------------- > >> > +ISR status structure includes a single 8-bite ISR status field > >> > >> 8-bit > > > > Right :) > > > >> > + > >> > +2.4.1.2.3 Notification structure layout > >> > +------------------------- > >> > +Notification structure is always a multiple of 2 bytes in size. > >> > +It includes 2-byte Queue Notify fields for each virtqueue of > >> > +the device. Note that multiple virtqueues can use the same > >> > +Queue Notify field, if necessary. > >> > >> Hmm, maybe move this down, so you can have a section which starts with > >> "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all > >> together. > > > > so Move PCI Device Layout to within > > PCI-specific Initialization And Device Operation? > > OK, maybe not. This is simple, we can always reshuffle after we've made > all the changes we want. > > >> > + > >> > +2.4.1.2.4 Device specific structure > >> > +------------------------- > >> > + > >> > +Device specific structure is optional. > >> > + > >> > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout > >> > +------------------------- > >> > + > >> > +Transitional devices should present part of configuration > >> > +registers in a legacy configuration structure in BAR0 in the first I/O > >> > +region of the PCI device, as documented below. > >> > > >> > There may be different widths of accesses to the I/O region; the > >> > “natural” access method for each field in the virtio header must be > >> > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the > >> > +used (i.e. 32-bit accesses for 32-bit fields, etc), but > >> > +When accessed through the legacy interface the > >> > device-specific region can be accessed using any width accesses, and > >> > should obtain the same results. > >> > > >> > Note that this is possible because while the virtio header is PCI > >> > -(i.e. 
little) endian, the device-specific region is encoded in > >> > -the native endian of the guest (where such distinction is > >> > +(i.e. little) endian, when using the legacy interface the device-specific > >> > +region is encoded in the native endian of the guest (where such distinction is > >> > applicable). > >> > > >> > -2.4.1.2.1 PCI Device Virtio Header > >> > ----------------------------------- > >> > > >> > -The virtio header looks as follows: > >> > +When used through the legacy interface, the virtio header looks as follows: > >> > > >> > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > >> > Bits 32 32 32 16 16 16 8 8 > >> > @@ -661,7 +905,6 @@ The virtio header looks as follows: > >> > Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status > >> > +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ > >> > > >> > - > >> > If MSI-X is enabled for the device, two additional fields > >> > immediately follow this header:[5] > >> > > >> > @@ -689,25 +932,154 @@ device-specific headers: > >> > > >> > +------------++--------------------+ > >> > > >> > +Note that only Feature Bits 0 to 31 are accessible through the > >> > +Legacy Interface. When used through the Legacy Interface, > >> > +Transitional Devices must assume that Feature Bits 32 to 63 > >> > +are not acknowledged by Driver. > >> > + > >> > + > >> > 2.4.1.3 PCI-specific Initialization And Device Operation > >> > -------------------------------------------------------- > >> > > >> > -The page size for a virtqueue on a PCI virtio device is defined as > >> > -4096 bytes. > >> > - > >> > 2.4.1.3.1 Device Initialization > >> > ------------------------------- > >> > > >> > -2.4.1.3.1.1 Queue Vector Configuration > >> > +This documents PCI-specific steps executed during Device Initialization. 
> >> > +As the first step, driver must detect device configuration layout > >> > +to locate configuration fields in memory,I/O or configuration space of the > >> > +device. > >> > + > >> > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection > >> > +------------------------------- > >> > + > >> > +As a prerequisite to device initialization, driver executes a > >> > +PCI capability list scan, detecting virtio configuration layout using Virtio > >> > +Structure PCI capabilities. > >> > + > >> > +Virtio Device Configuration Layout includes virtio configuration header, Notification > >> > +and ISR Status and device configuration structures. > >> > +Each structure can be mapped by a Base Address register (BAR) belonging to > >> > +the function, located beginning at 10h in Configuration Space, > >> > +or accessed though PCI configuration space. > >> > + > >> > +Actual location of each structure is specified using vendor-specific PCI capability located > >> > +on capability list in PCI configuration space of the device. > >> > +This virtio structure capability uses little-endian format; all bits are > >> > +read-only: > >> > + > >> > +struct virtio_pci_cap { > >> > + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ > >> > + __u8 cap_next; /* Generic PCI field: next ptr. */ > >> > + __u8 cap_len; /* Generic PCI field: capability length */ > >> > + __u8 cfg_type; /* Identifies the structure. */ > >> > + __u8 bar; /* Where to find it. */ > >> > + __u8 padding[3];/* Pad to full dword. */ > >> > + __le32 offset; /* Offset within bar. */ > >> > + __le32 length; /* Length of the structure, in bytes. */ > >> > +}; > >> > + > >> > +This structure can optionally followed by extra data, depending on > >> > +other fields, as documented below. > >> > + > >> > +The fields are interpreted as follows: > >> > + > >> > +cap_vndr > >> > + 0x09; Identifies a vendor-specific capability. 
> >> > + > >> > +cap_next > >> > + Link to next capability in the capability list in the configuration space. > >> > + > >> > +cap_len > >> > + Length of the capability structure, including the whole of > >> > + struct virtio_pci_cap, and extra data if any. > >> > + This length might include padding, or fields unused by the driver. > >> > + > >> > +cfg_type > >> > + identifies the structure, according to the following table. > >> > + > >> > + /* Common configuration */ > >> > + #define VIRTIO_PCI_CAP_COMMON_CFG 1 > >> > + /* Notifications */ > >> > + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 > >> > + /* ISR Status */ > >> > + #define VIRTIO_PCI_CAP_ISR_CFG 3 > >> > + /* Device specific configuration */ > >> > + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 > >> > + > >> > + More than one capability can identify the same structure - this makes it > >> > + possible for the device to expose multiple interfaces to drivers. The order of > >> > + the capabilities in the capability list specifies the order of preference > >> > + suggested by the device; drivers should use the first interface that they can > >> > + support. For example, on some hypervisors, notifications using IO accesses are > >> > + faster than memory accesses. In this case, hypervisor can expose two > >> > + capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG: > >> > + the first one addressing an I/O BAR, the second one addressing a memory BAR. > >> > + Driver will use the I/O BAR if I/O resources are available, and fall back on > >> > + memory BAR when I/O resources are unavailable. > >> > + > >> > +bar > >> > + > >> > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to > >> > + the function located beginning at 10h in Configuration Space > >> > + and used to map the structure into Memory or I/O Space. > >> > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space > >> > + or I/O Space. 
    > >> > +
    > >> > + The value 0xF specifies that the structure is in PCI configuration space
    > >> > + inline with this capability structure, following (not necessarily immediately)
    > >> > + the length field.
    > >>
    > >> Why not immediately?
    > >> Or how would the driver know where it is?
    > >
    > > It's at the offset.
    > >
    > > E.g. for notification we stick multiplier after length.
    > > Further, we might extend virtio_pci_cap in the future,
    > > and we don't want to move stuff around like we
    > > had to with MSI-X.
    >
    > So, offset is described as "/* Offset within bar. */", but if bar is
    > 0xF, it's offset within the config space?

    Yes. So I'll fix the wording?

    > This adds complexity: do we really want this?

    I think making stuff part of a vendor specific cap is important,
    and we might add stuff to capabilities, so what are some other options?

    > Cheers,
    > Rusty.
    > PS. Sorry for delay, took some time off.
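
    To make the two meanings of "offset" discussed above concrete, here is
    a rough sketch of how a driver might classify a capability once its
    fields are decoded. It assumes the reading confirmed in this message
    (bar 0xF means the offset is within PCI configuration space), which is
    still under discussion. The names cap_fields, cap_location,
    locate_structure and VIRTIO_PCI_CAP_BAR_IN_CFG_SPACE are hypothetical;
    only the values 0x0 to 0x5 and 0xF come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; the value 0xF comes from the quoted patch text. */
#define VIRTIO_PCI_CAP_BAR_IN_CFG_SPACE 0xF

/* Subset of struct virtio_pci_cap, already converted to host endian. */
struct cap_fields {
    uint8_t  bar;
    uint32_t offset;
    uint32_t length;
};

enum cap_space {
    CAP_SPACE_BAR,      /* offset is relative to BAR 0..5 */
    CAP_SPACE_PCI_CFG,  /* offset is within PCI configuration space */
    CAP_SPACE_INVALID   /* bar value not described by the patch */
};

struct cap_location {
    enum cap_space space;
    uint8_t  bar;       /* meaningful only for CAP_SPACE_BAR */
    uint32_t offset;
    uint32_t length;
};

static struct cap_location locate_structure(const struct cap_fields *cap)
{
    struct cap_location loc;

    assert(cap != NULL);
    loc.space = CAP_SPACE_INVALID;
    loc.bar = 0;
    loc.offset = cap->offset;
    loc.length = cap->length;

    if (cap->bar <= 5) {
        loc.space = CAP_SPACE_BAR;
        loc.bar = cap->bar;
    } else if (cap->bar == VIRTIO_PCI_CAP_BAR_IN_CFG_SPACE) {
        loc.space = CAP_SPACE_PCI_CFG;
    }
    return loc;
}
```

    The extra branch is the complexity being questioned: every consumer of
    offset has to know which address space it refers to.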


  • 22.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-04-2013 07:59
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > On Mon, Sep 02, 2013 at 10:45:11AM +0930, Rusty Russell wrote:
    >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    >> > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
    >>
    >> AFAICT it's a PCI question more than a virtio question. The use of
    >> revision IDs is left open in the PCI spec, but Microsoft's requirements
    >> here seem reasonable:
    >>
    >> http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx
    >
    > They are only reasonable if you consider a requirement that users
    > load new drivers for new revision of devices reasonable.

    Well, I think it's reasonable for a driver to contain revision-specific
    workarounds. That's kind of what the field is for.

    >> > More concerns:
    >> >
    >> > We are using revision ID now exactly as was intended to disable old
    >> > drivers - it served us well for 0.X-1.X and would be as useful if we
    >> > ever have 1.X->2.0 transition.
    >>
    >> I think we'll use feature bit 64 for such a transition, now we don't
    >> have a feature bit limit.
    >
    > Windows drivers only match on device,vendor id, subsystem ids and revision.
    >
    > Once you match on that you can fail after checking feature bits
    > but it's too late to have another driver loaded.

    So can Windows drivers not match multiple revids? If not, you're saying
    that transitional drivers are impossible, and they will need separate
    legacy and v1.0 drivers?

    >> >> > +bar
    >> >> > +
    >> >> > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
    >> >> > + the function located beginning at 10h in Configuration Space
    >> >> > + and used to map the structure into Memory or I/O Space.
    >> >> > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
    >> >> > + or I/O Space.
    >> >> > +
    >> >> > + The value 0xF specifies that the structure is in PCI configuration space
    >> >> > + inline with this capability structure, following (not necessarily immediately)
    >> >> > + the length field.
    >> >>
    >> >> Why not immediately?
    >> >> Or how would the driver know where it is?
    >> >
    >> > It's at the offset.
    >> >
    >> > E.g. for notification we stick multiplier after length.
    >> > Further, we might extend virtio_pci_cap in the future,
    >> > and we don't want to move stuff around like we
    >> > had to with MSI-X.
    >>
    >> So, offset is described as "/* Offset within bar. */", but if bar is
    >> 0xF, it's offset within the config space?
    >
    > Yes. So I'll fix the wording?

    Please. And specify if the offset is from start of config space, or
    offset from start or end of the virtio_pci_cap.

    >> This adds complexity: do we really want this?
    >
    > I think making stuff part of a vendor specific cap is important,
    > and we might add stuff to capabilities, so what are
    > some other options?

    If we want to embed the notification within the PCI config space, this
    makes sense. But I'd rather hear from someone who wants that.

    Otherwise we can reserve invalid BAR values and tell them to keep
    iterating until they find a cap they understand. That would let us add
    this later.

    Cheers,
    Rusty.
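
    The "keep iterating until they find a cap they understand" approach can be
    illustrated with a minimal C sketch. It is only an illustration under
    stated assumptions: the byte positions of cfg_type and bar inside the
    capability are assumed (this thread does not fix them), the function name
    virtio_pci_find_cap and the usable_bars bitmask are hypothetical, and
    config space is modelled as a plain 256-byte array rather than real PCI
    accesses. The sketch does not handle the bar == 0xF inline case; it skips
    any bar value outside 0x0 to 0x5, along with any cfg_type it was not
    asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI values. */
#define PCI_CAPABILITY_LIST 0x34 /* config-space offset of the first capability pointer */
#define PCI_CAP_ID_VNDR     0x09 /* vendor-specific capability ID */

/* cfg_type values from the patch under discussion. */
#define VIRTIO_PCI_CAP_COMMON_CFG 1
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
#define VIRTIO_PCI_CAP_ISR_CFG    3
#define VIRTIO_PCI_CAP_DEVICE_CFG 4

/* ASSUMPTION: byte positions of the fields inside the capability. */
enum { CAP_VNDR = 0, CAP_NEXT = 1, CAP_LEN = 2, CAP_CFG_TYPE = 3, CAP_BAR = 4 };

/*
 * Walk the capability list in a 256-byte config-space snapshot and return
 * the config-space offset of the first vendor-specific capability whose
 * cfg_type equals `wanted` and whose BAR the driver can use.
 * `usable_bars` is a hypothetical bitmask: bit n set means BAR n is mappable.
 * Capabilities with another cfg_type or a BAR outside 0..5 are skipped.
 * Returns 0 if nothing suitable is found.
 */
static uint8_t virtio_pci_find_cap(const uint8_t *cfg, uint8_t wanted,
                                   uint8_t usable_bars)
{
    assert(cfg != NULL);

    uint8_t pos = cfg[PCI_CAPABILITY_LIST];

    /* Bound the walk so a malformed, looping list cannot hang the driver. */
    for (int guard = 0; pos != 0 && guard < 64; guard++) {
        /* Capabilities live after the 64-byte header and must fit in 256 bytes. */
        if (pos < 0x40 || pos > 256 - (CAP_BAR + 1))
            break;

        const uint8_t *cap = cfg + pos;
        uint8_t bar = cap[CAP_BAR];

        if (cap[CAP_VNDR] == PCI_CAP_ID_VNDR &&
            cap[CAP_CFG_TYPE] == wanted &&
            bar <= 5 && (usable_bars & (1u << bar)))
            return pos; /* first match wins: list order is device preference */

        pos = cap[CAP_NEXT];
    }
    return 0;
}
```

    With two VIRTIO_PCI_CAP_NOTIFY_CFG capabilities, the first on an I/O BAR
    and the second on a memory BAR, clearing the I/O BAR's bit in usable_bars
    makes the walk fall through to the memory BAR, which matches the
    preference-order example in the quoted spec text.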


  • 23.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-04-2013 08:29
    On Wed, Sep 04, 2013 at 05:21:25PM +0930, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Mon, Sep 02, 2013 at 10:45:11AM +0930, Rusty Russell wrote:
    > >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    > >> > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
    > >>
    > >> AFAICT it's a PCI question more than a virtio question. The use of
    > >> revision IDs is left open in the PCI spec, but Microsoft's requirements
    > >> here seem reasonable:
    > >>
    > >> http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx
    > >
    > > They are only reasonable if you consider a requirement that users
    > > load new drivers for new revision of devices reasonable.
    >
    > Well, I think it's reasonable for a driver to contain revision-specific
    > workarounds. That's kind of what the field is for.
    >
    > >> > More concerns:
    > >> >
    > >> > We are using revision ID now exactly as was intended to disable old
    > >> > drivers - it served us well for 0.X-1.X and would be as useful if we
    > >> > ever have 1.X->2.0 transition.
    > >>
    > >> I think we'll use feature bit 64 for such a transition, now we don't
    > >> have a feature bit limit.
    > >
    > > Windows drivers only match on device,vendor id, subsystem ids and revision.
    > >
    > > Once you match on that you can fail after checking feature bits
    > > but it's too late to have another driver loaded.
    >
    > So can Windows drivers not match multiple revids?

    My understanding is that you can either match a specific list of
    rev ids, or match all rev IDs (using *).

    > If not, you're saying
    > that transitional drivers are impossible, and they will need separate
    > legacy and v1.0 drivers?

    A transitional driver for windows will just have to match both rev IDs 0 and 1.

    > >> >> > +bar
    > >> >> > +
    > >> >> > + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
    > >> >> > + the function located beginning at 10h in Configuration Space
    > >> >> > + and used to map the structure into Memory or I/O Space.
    > >> >> > + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
    > >> >> > + or I/O Space.
    > >> >> > +
    > >> >> > + The value 0xF specifies that the structure is in PCI configuration space
    > >> >> > + inline with this capability structure, following (not necessarily immediately)
    > >> >> > + the length field.
    > >> >>
    > >> >> Why not immediately?
    > >> >> Or how would the driver know where it is?
    > >> >
    > >> > It's at the offset.
    > >> >
    > >> > E.g. for notification we stick multiplier after length.
    > >> > Further, we might extend virtio_pci_cap in the future,
    > >> > and we don't want to move stuff around like we
    > >> > had to with MSI-X.
    > >>
    > >> So, offset is described as "/* Offset within bar. */", but if bar is
    > >> 0xF, it's offset within the config space?
    > >
    > > Yes. So I'll fix the wording?
    >
    > Please. And specify if the offset is from start of config space, or
    > offset from start or end of the virtio_pci_cap.
    >
    > >> This adds complexity: do we really want this?
    > >
    > > I think making stuff part of a vendor specific cap is important,
    > > and we might add stuff to capabilities, so what are
    > > some other options?
    >
    > If we want to embed the notification within the PCI config space, this
    > makes sense. But I'd rather hear from someone who wants that.

    I think PXE drivers will want to do this. I'll check with Gleb.

    > Otherwise we can reserve invalid BAR values and tell them to keep
    > iterating until they find a cap they understand. That would let us add
    > this later.
    >
    > Cheers,
    > Rusty.


  • 24.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-04-2013 08:53
    On Wed, Sep 04, 2013 at 05:21:25PM +0930, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Mon, Sep 02, 2013 at 10:45:11AM +0930, Rusty Russell wrote:
    > >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    > >> > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
    > >>
    > >> AFAICT it's a PCI question more than a virtio question. The use of
    > >> revision IDs is left open in the PCI spec, but Microsoft's requirements
    > >> here seem reasonable:
    > >>
    > >> http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx
    > >
    > > They are only reasonable if you consider a requirement that users
    > > load new drivers for new revision of devices reasonable.
    >
    > Well, I think it's reasonable for a driver to contain revision-specific
    > workarounds. That's kind of what the field is for.

    Right. And when talking about hardware, it's a safe bet there will be
    work-arounds for each new revision.
    It's hopefully not applicable for virtio.


  • 25.  Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-02-2013 06:18
    On Mon, Sep 02, 2013 at 10:45:11AM +0930, Rusty Russell wrote:
    > > More concerns:
    > >
    > > We are using revision ID now exactly as was intended to disable old
    > > drivers - it served us well for 0.X-1.X and would be as useful if we
    > > ever have 1.X->2.0 transition.
    >
    > I think we'll use feature bit 64 for such a transition, now we don't
    > have a feature bit limit.

    Two issues:

    1. Basically on windows this would interfere with the idea of using
       different drivers for legacy and non-legacy devices
       http://msdn.microsoft.com/en-us/library/windows/hardware/ff549553%28v=vs.85%29.aspx
       windows build up device identification

    2. We'd have to declare this use for a feature bit right now,
       make current drivers check this bit and bail out if set.


  • 26.  Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout

    Posted 09-09-2013 05:45
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > This is the new configuration layout.

    OK, I separated the bit-32-means-1.0-compliant into its own issue,
    since I really like that idea and I want to separate it from the
    endianness issue.

    I'll steal that part to place on that thread...

    Cheers,
    Rusty.


  • 27.  [PATCH RFC incremental] virtio-pci: new PCI layout clarification

    Posted 09-09-2013 06:01
    Here's a fixup for spec issue noted by Cornelia,
    and some more clarifications on how transitional and
    non-transitional drivers will work.

    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

    ---

    diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    index 4533fac..72323ba 100644
    --- a/virtio-v1.0-wd01-part1-specification.txt
    +++ b/virtio-v1.0-wd01-part1-specification.txt
    @@ -739,9 +739,8 @@ Transitional devices must have a Revision ID of 0.

    Non-transitional devices must have a Revision ID of 1.

    -Transitional drivers must match a Revision ID of 0 or 1.
    -
    -Non-transitional drivers must only match a Revision ID of 1.
    +Both transitional and non-transitional drivers must match a
    +Revision ID of 0 or 1.

    2.3.1.2. PCI Device Layout
    -------------------------
    @@ -1009,6 +1008,10 @@ cfg_type
    /* Device specific configuration */
    #define VIRTIO_PCI_CAP_DEVICE_CFG 4

    + Any other value - reserved for future use. Drivers must
    + ignore any vendor-specific capability structure which has
    + a reserved cfg_type value.
    +
    More than one capability can identify the same structure - this makes it
    possible for the device to expose multiple interfaces to drivers. The order of
    the capabilities in the capability list specifies the order of preference
    @@ -1076,6 +1079,18 @@ notify_off_multiplier
    Legacy drivers skipped Device Layout Detection step, assuming legacy
    configuration space in BAR0 in I/O space unconditionally.

    +Legacy devices did not have the Virtio PCI Capability in their
    +capability list.
    +
    +Therefore:
    +
    +Transitional devices should expose the Legacy Interface in I/O
    +space in BAR0.
    +
    +Transitional drivers should look for the Virtio PCI
    +Capabilities on the capability list.
    +If these are not present, the driver should assume a legacy device.
    +
    2.3.1.3.1.1. Queue Vector Configuration
    --------------------------------------