OASIS Virtual I/O Device (VIRTIO) TC


[OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

  • 1.  [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 09-11-2013 04:51
    Implement new balloon device (ID 13)
    ------------------------------------

    Key: VIRTIO-28
    URL: http://tools.oasis-open.org/issues/browse/VIRTIO-28
    Project: OASIS Virtual I/O Device (VIRTIO) TC
    Issue Type: New Feature
    Reporter: Rusty Russell

    The current virtio balloon device is bad enough that fixing it yields
    something completely incompatible:

    1) The page size is 4096, always, rather than some configurable page
       size.
    2) Page numbers are 32 bits.
    3) The deflate queue is not actually used by any device implementation
       (see issue VIRTIO-25).
    4) The "actual" number is kind of useless, and (along with the deflate
       queue) requires the guest to track pages, rather than just faulting
       them in.
    5) The config is little-endian by mistake (may not matter).

    We could just move the entire thing to legacy, and have a new balloon
    device, like:

    1) An incoming queue for balloon size updates or stats requests.
    2) An outgoing queue for giving 64-bit address/length pairs to the
       balloon, and for answering stats.
    3) No config space, no features.

    Thoughts?


  • 2.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 09-24-2013 07:14
    On (Wed) 11 Sep 2013 [04:51:39], OASIS Issues Tracker wrote:
    > Implement new balloon device (ID 13)
    > ------------------------------------
    >
    > Key: VIRTIO-28
    > URL: http://tools.oasis-open.org/issues/browse/VIRTIO-28
    > Project: OASIS Virtual I/O Device (VIRTIO) TC
    > Issue Type: New Feature
    > Reporter: Rusty Russell
    >
    > The current virtio balloon device is bad enough that fixing it yields
    > something completely incompatible:
    > 1) The page size is 4096, always, rather than some configurable page size.
    > 2) Page numbers are 32 bits.
    > 3) The deflate queue is not actually used by any device implementation
    >    (see issue virtio-25).
    > 4) The "actual" number is kind of useless, and (along with the deflate
    >    queue) requires the guest to track pages, rather than just faulting
    >    them in.
    > 5) The config is little-endian by mistake (may not matter).

    The current design is weird: the host tells the guest the offset by
    which to inflate or deflate the balloon. Instead, the host should tell
    the guest the absolute amount of memory it should adjust to, and the
    guest will decide whether to inflate, to deflate, or to take no action
    at all.

    For example, with the current scheme, if a guest is started with 4G of
    memory but the guest kernel is booted with 'mem=2G', the guest is only
    going to use 2G of memory. Ballooning to 3G from the host will try to
    inflate the balloon by 1G on the guest, when really no action is
    needed. This also happens with > 4G RAM and 32-bit guests.

    > We could just move the entire thing to legacy, and have a new balloon
    > device, like:
    > 1) An incoming queue for balloon size updates or stats request.
    > 2) An outgoing queue for giving 64 bit address/length pairs to
    >    balloon, answering stats.
    > 3) No config space, no features.

    ACK

    Amit


  • 3.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 09-26-2013 00:55
    Amit Shah <amit.shah@redhat.com> writes:
    > On (Wed) 11 Sep 2013 [04:51:39], OASIS Issues Tracker wrote:
    >> Implement new balloon device (ID 13)
    >> ------------------------------------
    >> [...]
    >
    > The current design is weird: the host tells the guest the offset by
    > which to inflate or deflate the balloon. Instead, the host should let
    > the guest know how much in absolute terms the memory should be
    > adjusted to, and the guest will take the decision on whether to
    > inflate, or deflate, or not take any action at all.

    I see your point. I wonder how easy it is to define how much memory
    the guest has (from a guest perspective). Things like memory holes,
    device memory maps, and other platform-specific weirdness I'm sure I
    don't understand. I could be wrong though?

    Rusty.


  • 4.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 09-27-2013 09:12
    On (Wed) 25 Sep 2013 [16:32:51], Rusty Russell wrote:
    > Amit Shah <amit.shah@redhat.com> writes:
    > > The current design is weird: the host tells the guest the offset by
    > > which to inflate or deflate the balloon. [...]
    >
    > I see your point. I wonder how easy it is to define how much memory
    > the guest has (from a guest perspective). Things like memory holes,
    > device memory maps, and other platform-specific weirdness I'm sure I
    > don't understand.

    Right now the driver directly grabs pages from the RAM area. It's a
    crude approach, but it's the one we use, and it works right now. We
    can continue doing the same...

    However, I see your point, too. So far the host doesn't ask the
    questions you raise either; the guests just comply with whatever the
    hosts ask for (even if it means processes within the guest start to
    OOM).

    Amit


  • 5.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-01-2013 10:15
    Ok, so here's what I ended up with. Any feedback welcome...

    Rusty.

    diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt
    index b2d3cd9..4e795d2 100644
    --- a/virtio-v1.0-wd01-part1-specification.txt
    +++ b/virtio-v1.0-wd01-part1-specification.txt
    @@ -1713,6 +1713,15 @@ Discovering what devices are available and their type is bus-dependent.
        12          virtio CAIF
     +------------+--------------------+

    +100.4.1. Legacy Interface: Device Types
    +================
    ++------------+--------------------+
    + Device ID    Virtio Device
    ++------------+--------------------+
    ++------------+--------------------+
    +   5          memory ballooning
    ++------------+--------------------+
    +
     2.4.1. Network Device
     ====================
    @@ -2468,7 +2477,7 @@
     When the driver requires random bytes, it places the descriptor
     of one or more buffers in the queue. It will be completely filled
     by random data by the device.

    -2.4.5. Memory Balloon Device
    +2.4.5. Legacy Interface: Memory Balloon Device
     ===========================

     The virtio memory balloon device is a primitive device for
    @@ -2479,17 +2488,17 @@ changes in allowance of underlying physical memory.
     If the feature is negotiated, the device can also be used to
     communicate guest memory statistics to the host.

    -2.4.5.1. Device ID
    +2.4.5.1. Legacy Interface: Device ID
     -----------------
       5

    -2.4.5.2. Virtqueues
    +2.4.5.2. Legacy Interface: Virtqueues
     ------------------
       0:inflateq. 1:deflateq. 2:statsq.
       Virtqueue 2 only exists if VIRTIO_BALLON_F_STATS_VQ set.

    -2.4.5.3. Feature bits
    +2.4.5.3. Legacy Interface: Feature bits
     --------------------
     VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
     pages from the balloon are used.
    @@ -2497,7 +2506,7 @@ guest memory statistics to the host.
     VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
     memory statistics is present.

    -2.4.5.4. Device configuration layout
    +2.4.5.4. Legacy Interface: Device configuration layout
     -----------------------------------

     Both fields of this configuration are always available. Note
     that they are little endian, despite
    @@ -2508,7 +2517,7 @@ guest memory statistics to the host.
       u32 actual;
     };

    -2.4.5.5. Device Initialization
    +2.4.5.5. Legacy Interface: Device Initialization
     -----------------------------

     1. The inflate and deflate virtqueues are identified.
    @@ -2522,7 +2531,7 @@ guest memory statistics to the host.
     Device operation begins immediately.

    -2.4.5.6. Device Operation
    +2.4.5.6. Legacy Interface: Device Operation
     ------------------------

     Memory Ballooning  The device is driven by the receipt of a
    @@ -2558,7 +2567,7 @@ configuration change interrupt.
     deflation, the "actual" field of the configuration should be
     updated to reflect the new number of pages in the balloon.[27]

    -2.4.5.6.1. Memory Statistics
    +2.4.5.6.1. Legacy Interface: Memory Statistics
     ---------------------------

     The stats virtqueue is atypical because communication is driven
    @@ -2598,7 +2607,7 @@ as follows:
       u64 val;
     } __attribute__((packed));

    -2.4.5.6.2. Memory Statistics Tags
    +2.4.5.6.2. Legacy Interface: Memory Statistics Tags
     --------------------------------

     VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
    @@ -3141,6 +3150,124 @@ contents of the event field. The following events are defined:
     For simplicity, as of this version of the specification the host
     must never report this event for MMC devices.

    +100.2.4.5. Memory Balloon Device
    +===========================
    +
    +The virtio memory balloon device is a primitive device for managing
    +guest memory: the device asks for a certain amount of memory, and the
    +guest supplies it. This allows the guest to adapt to changes in
    +allowance of underlying physical memory. The device can also be used
    +to communicate guest memory statistics to the host.
    +
    +100.2.4.5.1. Device ID
    +-----------------
    +  13
    +
    +100.2.4.5.2. Virtqueues
    +------------------
    +  0:inputq. 1:outputq.
    +
    +
    +100.2.4.5.3. Feature bits
    +--------------------
    +  None defined.
    +
    +100.2.4.5.4. Device configuration layout
    +-----------------------------------
    +
    +The read-only configuration field indicates the granularity of memory
    +which can be added to the balloon. This typically reflects the
    +page size of the host (e.g. 12 for 4096-byte pages).
    +
    +  struct virtio_balloon_config {
    +          u32 page_bits;
    +  }
    +
    +100.2.4.5.5. Device Initialization
    +-----------------------------
    +
    +1. At least one struct virtio_balloon_req buffer should be placed
    +   in the inputq.
    +
    +2. The balloon starts empty (size 0).
    +
    +100.2.4.5.6. Device Operation
    +------------------------
    +
    +The device is driven by receipt of a command in the input queue:
    +
    +  struct virtio_balloon_req {
    +#define VIRTIO_BALLOON_REQ_RESIZE 0
    +#define VIRTIO_BALLOON_REQ_STATS 1
    +          u32 type;
    +          u32 reserved;
    +          u64 value;
    +  }
    +
    +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target
    +   size (in bytes) in the value field. If the current balloon size is
    +   smaller than the target, the guest should add pages to the balloon
    +   as soon as possible. If the current balloon is larger than the
    +   target, the guest may withdraw pages.
    +
    +2. To add pages to the balloon, the physical addresses of the pages
    +   are sent using the output queue. The number of pages is implied in
    +   the message length, and each page value must be a multiple of the
    +   page size indicated in struct virtio_balloon_config.
    +
    +  struct virtio_balloon_pages {
    +#define VIRTIO_BALLOON_RESP_PAGES 0
    +          u32 type; // VIRTIO_BALLOON_RESP_PAGES
    +          u64 page[];
    +  };
    +
    +3. To withdraw a page from the balloon, it can simply be accessed.
    +   The contents at this point will be undefined. The device should
    +   keep count of how many pages remain in the balloon so it can
    +   correctly respond to future resize requests.
    +
    +4. A VIRTIO_BALLOON_REQ_STATS command indicates that the balloon
    +   should report what stats are available.
    +
    +5. To report stats, the following message is sent to the output queue.
    +   Indeterminable stats must not be reported.
    +
    +  struct virtio_balloon_stats {
    +#define VIRTIO_BALLOON_RESP_STATS 1
    +          u32 type; // VIRTIO_BALLOON_RESP_STATS
    +          struct {
    +#define VIRTIO_BALLOON_TAG_SWAP_IN 0
    +#define VIRTIO_BALLOON_TAG_SWAP_OUT 1
    +#define VIRTIO_BALLOON_TAG_MAJFLT 2
    +#define VIRTIO_BALLOON_TAG_MINFLT 3
    +#define VIRTIO_BALLOON_TAG_MEMFREE 4
    +#define VIRTIO_BALLOON_TAG_MEMTOT 5
    +                  u64 tag;
    +                  u64 value;
    +          }[];
    +  };
    +
    +100.2.4.5.6.2. Memory Statistics Tags
    +--------------------------------
    +
    +  VIRTIO_BALLOON_TAG_SWAP_IN   The amount of memory that has been
    +  swapped in (in bytes).
    +
    +  VIRTIO_BALLOON_TAG_SWAP_OUT  The amount of memory that has been
    +  swapped out to disk (in bytes).
    +
    +  VIRTIO_BALLOON_TAG_MAJFLT    The number of major page faults that
    +  have occurred.
    +
    +  VIRTIO_BALLOON_TAG_MINFLT    The number of minor page faults that
    +  have occurred.
    +
    +  VIRTIO_BALLOON_TAG_MEMFREE   The amount of memory not being used
    +  for any purpose (in bytes).
    +
    +  VIRTIO_BALLOON_TAG_MEMTOT    The total amount of memory available
    +  (in bytes).
    +
     2.5. Reserved Feature Bits
     =========================


  • 6.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-03-2013 07:15
    On (Tue) 01 Oct 2013 [17:40:12], Rusty Russell wrote:
    > Ok, so here's what I ended up with.
    >
    > Any feedback welcome...

    Looks good, one small nit:

    > +4. A VIRTIO_BALLOON_REQ_STATS command indicates that the balloon
    > +   should report what stats are available.

    balloon should report -> guest should report?

    Thanks,
    Amit


  • 7.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-04-2013 01:16
    Amit Shah <amit.shah@redhat.com> writes:
    > On (Tue) 01 Oct 2013 [17:40:12], Rusty Russell wrote:
    >> Ok, so here's what I ended up with.
    >>
    >> Any feedback welcome...
    >
    > Looks good, one small nit:
    >
    >> +4. A VIRTIO_BALLOON_REQ_STATS command indicates that the balloon
    >> +   should report what stats are available.
    >
    > balloon should report -> guest should report?

    Thanks, I changed to "driver".

    Cheers,
    Rusty.


  • 8.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-24-2013 12:26
    Hi,

    Sorry for the late reply, but I am very busy now.

    On Tue, Oct 01, 2013 at 05:40:12PM +0930, Rusty Russell wrote:
    > Ok, so here's what I ended up with.
    >
    > Any feedback welcome...
    > Rusty.

    [...]

    > +100.2.4.5. Memory Balloon Device
    > +===========================
    > +
    > +The virtio memory balloon device is a primitive device for managing
    > +guest memory: the device asks for a certain amount of memory, and the
    > +guest supplies it. This allows the guest to adapt to changes in
    > +allowance of underlying physical memory. The device can also be used
    > +to communicate guest memory statistics to the host.

    Maybe the guest should also be able to set the balloon target. However,
    in this situation the host should establish limits which could not be
    exceeded, and the device should enforce them. This way the balloon
    could be controlled from the host and/or the guest as needed. So the
    balloon device would be responsible just for passing requests between
    guest and host and for limit enforcement. This way even memory hotplug
    could be easily implemented. However, in this situation the device
    should not be called a balloon. Memory manager or memory allocator?
    Any better ideas?

    > +100.2.4.5.1. Device ID
    > +-----------------
    > +  13
    > +
    > +100.2.4.5.2. Virtqueues
    > +------------------
    > +  0:inputq. 1:outputq.
    > +
    > +100.2.4.5.3. Feature bits
    > +--------------------
    > +  None defined.
    > +
    > +100.2.4.5.4. Device configuration layout
    > +-----------------------------------
    > +
    > +The read-only configuration field indicates the granularity of memory
    > +which can be added to the balloon. This typically reflects the
    > +page size of the host (e.g. 12 for 4096-byte pages).
    > +
    > +  struct virtio_balloon_config {
    > +          u32 page_bits;
    > +  }

    Why must the balloon device be forced to use only one page size? I
    think that the configuration area should list which page sizes could
    be requested by the device. The device should be able to request any
    allowed/defined size, but the driver could reject a request or fail
    partially. Additionally, maybe the device should describe allowed page
    sizes explicitly as sizes instead of as numbers of bits, so it may be
    worth storing page sizes as u64. That way we could also store page
    sizes which are not a full power of 2 (which could be useful for some
    strange superpage sizes in the future, if something crazy happens).
    However, if we store page sizes as numbers of bits we can represent
    larger sizes. Hmmm...

    Probably we would not implement the above feature at first, but it
    gives us a chance to do so later.

    > +100.2.4.5.5. Device Initialization
    > +-----------------------------
    > +
    > +1. At least one struct virtio_balloon_req buffer should be placed
    > +   in the inputq.
    > +
    > +2. The balloon starts empty (size 0).
    > +
    > +100.2.4.5.6. Device Operation
    > +------------------------
    > +
    > +The device is driven by receipt of a command in the input queue:
    > +
    > +  struct virtio_balloon_req {
    > +#define VIRTIO_BALLOON_REQ_RESIZE 0
    > +#define VIRTIO_BALLOON_REQ_STATS 1
    > +          u32 type;
    > +          u32 reserved;
    > +          u64 value;
    > +  }

      struct virtio_balloon_pages {
      #define VIRTIO_BALLOON_REQ_RESIZE 0
      #define VIRTIO_BALLOON_REQ_STATS 1
              u32 type;
              u32 reserved;
              u64 guest_memory_size;
              u64 page_sizes[];
      }; ???

    > +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target
    > +   size (in bytes) in the value field. If the current balloon size is
    > +   smaller than the target, the guest should add pages to the balloon
    > +   as soon as possible. If the current balloon is larger than the
    > +   target, the guest may withdraw pages.
    > +
    > +2. To add pages to the balloon, the physical addresses of the pages

    frames ???

    > +   are sent using the output queue. The number of pages is implied in
    > +   the message length, and each page value must be a multiple of the
    > +   page size indicated in struct virtio_balloon_config.
    > +
    > +  struct virtio_balloon_pages {
    > +#define VIRTIO_BALLOON_RESP_PAGES 0
    > +          u32 type; // VIRTIO_BALLOON_RESP_PAGES
    > +          u64 page[];
    > +  };

      struct virtio_balloon_pages {
              u32 type; // VIRTIO_BALLOON_RESP_PAGES
              u64 page_size;
              u64 frames[];
      }; ???

    > +3. To withdraw a page from the balloon, it can simply be accessed.

    IIRC, ballooned pages are first reserved and then their frames are
    returned to the host. So if you would like to use the pages, you must
    do the above steps in reverse. Hence, "it can simply be accessed" is a
    bit misleading. Maybe it should be phrased in the following way: the
    page should have a frame number reassigned, and later it should be
    returned to the pool of free pages.

    Additionally, some hypervisors may require additional steps to
    add/remove a page to/from the pool (e.g. Xen PV guests must add/remove
    frames to the P2M and M2P lists/trees too). So the implementation
    should be able to call hypervisor-specific code in such situations.

    Daniel


  • 9.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-25-2013 03:57
    Daniel Kiper <daniel.kiper@oracle.com> writes:
    > Hi,
    >
    > Sorry for the late reply, but I am very busy now.

    That's OK, thanks for making the time.

    > On Tue, Oct 01, 2013 at 05:40:12PM +0930, Rusty Russell wrote:
    >> Ok, so here's what I ended up with.
    >>
    >> Any feedback welcome...
    >> Rusty.

    [...]

    > Maybe the guest should also be able to set the balloon target.
    > However, in this situation the host should establish limits which
    > could not be exceeded, and the device should enforce them. This way
    > the balloon could be controlled from the host and/or the guest as
    > needed. So the balloon device would be responsible just for passing
    > requests between guest and host and for limit enforcement. This way
    > even memory hotplug could be easily implemented. However, in this
    > situation the device should not be called a balloon. Memory manager
    > or memory allocator? Any better ideas?

    If it's a purely guest-driven device, you don't need a target at all.
    You just have a driver which hands pages to the device.

    You could operate the device in that way, of course, treating the
    target as a ceiling. Is it useful to have a way of telling the device
    you're operating in such a "self-regulating" mode? Or should you just
    do it? ie. should this be a feature bit?

    >> +The read-only configuration field indicates the granularity of memory
    >> +which can be added to the balloon. This typically reflects the
    >> +page size of the host (e.g. 12 for 4096-byte pages).
    >> +
    >> +  struct virtio_balloon_config {
    >> +          u32 page_bits;
    >> +  }
    >
    > Why must the balloon device be forced to use only one page size? I
    > think that the configuration area should list which page sizes could
    > be requested by the device. [...]
    >
    > Probably we would not implement the above feature at first, but it
    > gives us a chance to do so later.

    I don't see non-power-of-two pages happening.

    But it makes sense to put the page size in each request. A bit more
    painful to implement, since the driver can't know in advance that it
    doesn't support a request.

    >   struct virtio_balloon_pages {
    >   #define VIRTIO_BALLOON_REQ_RESIZE 0
    >   #define VIRTIO_BALLOON_REQ_STATS 1
    >           u32 type;
    >           u32 reserved;
    >           u64 guest_memory_size;
    >           u64 page_sizes[];
    >   }; ???

    This doesn't make sense. It's possible that the host has some memory
    in hugepages and some in smaller pages. But if so, it needs to be able
    to say "give me 5 small pages and 1 huge page please".

    >> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target
    >> +   size (in bytes) in the value field. If the current balloon size is
    >> +   smaller than the target, the guest should add pages to the balloon
    >> +   as soon as possible. If the current balloon is larger than the
    >> +   target, the guest may withdraw pages.

    So let's drop this, and have two commands:

      /* Give me more pages! */
      VIRTIO_BALLOON_REQ_FILL:
              u32 type;
              u32 page_bits; // eg 12 == 4096.
              u64 num_pages;

    And:

      /* You can take some back. */
      VIRTIO_BALLOON_REQ_RELEASE:
              u32 type;
              u32 page_bits; // eg 12 == 4096.
              u64 num_pages;

    >> +2. To add pages to the balloon, the physical addresses of the pages
    >
    > frames ???

    Define frame? It's an array of page physical addresses. Was that
    unclear?

    >> +3. To withdraw a page from the balloon, it can simply be accessed.
    >
    > IIRC, ballooned pages are first reserved and then their frames are
    > returned to the host. So if you would like to use the pages, you must
    > do the above steps in reverse. Hence, "it can simply be accessed" is
    > a bit misleading. [...]

    Yet this requirement that pages be re-requested blocked one
    implementation attempt in Linux. The old spec said you had to, and yet
    QEMU didn't actually care. Nor does any existing implementation.

    > Additionally, some hypervisors may require additional steps to
    > add/remove a page to/from the pool (e.g. Xen PV guests must
    > add/remove frames to the P2M and M2P lists/trees too). So the
    > implementation should be able to call hypervisor-specific code in
    > such situations.

    The underlying assumption is that the hypervisor controls the mapping,
    so it can remove the page and fault one back in appropriately. This
    isn't true for PV Xen of course. Yet we can't put "do some hypervisor
    specific stuff here" in the spec.

    Let's step back a moment to look at the goals. It's nice for PV Xen to
    have portable drivers, but by definition you can't run a generic guest
    in PV Xen. So it's more about reducing the differences than trying to
    get a completely standardized guest.

    So it doesn't bother me a huge amount that a generic balloon driver
    won't work in a Xen PV guest. In practice, the Linux driver might have
    hooks to support Xen PV, but it might be better to keep the
    Xen-specific balloon driver until Xen PV finally dies.

    Cheers,
    Rusty.


  • 10.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-25-2013 11:38
    On Fri, Oct 25, 2013 at 02:25:58PM +1030, Rusty Russell wrote: > Daniel Kiper <daniel.kiper@oracle.com> writes: > > Hi, > > > > Sorry for late reply but I am very busy now. > > That's OK, thanks for making the time. > > > On Tue, Oct 01, 2013 at 05:40:12PM +0930, Rusty Russell wrote: > >> Ok, so here's what I ended up with. > >> > >> Any feedback welcome... > >> Rusty. > > > > [...] > > > >> +100.2.4.5. Memory Balloon Device > >> +=========================== > >> + > >> +The virtio memory balloon device is a primitive device for managing > >> +guest memory: the device asks for a certain amount of memory, and the > >> +guest supplies it. This allows the guest to adapt to changes in > >> +allowance of underlying physical memory. The device can also be used > >> +to communicate guest memory statistics to the host. > > > > Maybe guest should also be able to set balloon target. However, in this > > situation host should establish limits which could not be exceeded and > > device should enforce them. This way balloon could be controlled from host > > and/or guest if needed. So balloon device will be responsible just for > > passing requests to/from guest/host and limits enforcement. This way even > > memory hotplug could be easily implemented. However, in this situation > > device should not be called balloon. Memory manager or memory allocator? > > Any better ideas? > > If it's purely guest-driven device, you don't need a target at all. You > just have a driver which hands pages to the device. OK. Could host send requests too? I think that sometimes it could be useful if host could ask guest to limit its memory usage (guest should return to host only free pages). So if we assume that it is "purely guest-driven device" we could not do that. > You could operate the device in that way, of course, treating the target > as a ceiling. Is it useful to have a way of telling the device you're > operating in such a "self-regulating" mode? Or should you just do it? > > ie. 
should this be a feature bit? I think that target and ceiling are separate things. Target should describe "current" or "desired" allocation. Ceiling should be a limit enforced by device. Guest could not allocate more memory than ceiling but should be able to set target above ceiling. In that case host should allow guest allocation until ceiling but no more. > >> +The read-only configuration field indicates the granularity of memory > >> +which can be added to the balloon. This is typically reflects the > >> +page size of the host (eg. 12 for 4096-byte pages). > >> + > >> + struct virtio_balloon_config { > >> + u32 page_bits; > >> + } > > > > Why balloon device must be forced to use only one page size? I think that > > configuration area should list which page sizes could be requested by > > device. Device should be able to request any allowed/defined size but driver > > could reject request or fail partially. Additionally, maybe device should > > inform about allowed page sizes using size explicitly instead of number > > of bits. So maybe it is worth storing page sizes as u64. By the way we could > > store page sizes which are not equal full power of 2 (could be useful for > > some strange sizes of superpages in the future if something crazy happens). > > However, if we store page sizes as number of bits we could represent > > larger sizes. Hmmm... > > > > Probably we do not implement above mentioned feature at first but > > it gives us a chance to do that later. > > I don't see non-power-of-two pages happening. OK. > But it makes sense to put the page size in each request. A bit more > painful to implement, since the driver can't know in advance that it > doesn't support a request. > > >> +100.2.4.5.5. Device Initialization > >> +----------------------------- > >> + > >> +1. At least one struct virtio_balloon_request buffer should be placed > >> + in the inputq. > >> + > >> +2. The balloon starts empty (size 0). > >> + > >> +100.2.4.5.6. 
Device Operation > >> +------------------------ > >> + > >> +The device is driven by receipt of a command in the input queue: > >> + > >> + struct virtio_balloon_req { > >> +#define VIRTIO_BALLOON_REQ_RESIZE 0 > >> +#define VIRTIO_BALLOON_REQ_STATS 1 > >> + u32 type; > >> + u32 reserved; > >> + u64 value; > >> + } > > > > struct virtio_balloon_pages { > > #define VIRTIO_BALLOON_REQ_RESIZE 0 > > #define VIRTIO_BALLOON_REQ_STATS 1 > > u32 type; > > u32 reserved; > > u64 guest_memory_size; > > u64 page_sizes[]; > > }; ??? > > This doesn't make sense. It's possible that the host has some memory in > hugepages and some in smaller pages. But if so, it needs to be able to > say "give me 5 small pages and 1 huge page please". > > >> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target > >> + size (in bytes) in the value field. If the current balloon size is > >> + smaller than the target, the guest should add pages to the balloon > >> + as soon as possible. If the current balloon is larger than the > >> + target, the guest may withdraw pages. > > So let's drop this, and have two commands: > > /* Give me more pages! */ > VIRTIO_BALLOON_REQ_FILL: > u32 type; > u32 page_bits; // eg 12 == 4096. > u64 num_pages; > > And: > > /* You can take some back. */ > VIRTIO_BALLOON_REQ_RELEASE: > u32 type; > u32 page_bits; // eg 12 == 4096. > u64 num_pages; General idea is OK but this way we back to relative requests. IIRC we would like to avoid them. > >> +2. To add pages to the balloon, the physical addresses of the pages > > > > frames ??? > > Define frame? It's an array of page physical addresses. Was that unclear? Frame is a page with physical address. Page is a page with virtual address. > >> + are sent using the output queue. The number of pages is implied in > >> + the message length, and each page value must be a multiple of the > >> + page size indicated in struct virtio_balloon_config. 
> >> + > >> + struct virtio_balloon_pages { > >> +#define VIRTIO_BALLOON_RESP_PAGES 0 > >> + u32 type; // VIRTIO_BALLOON_RESP_PAGES > >> + u64 page[]; > >> + }; > > > > struct virtio_balloon_pages { > > u32 type; // VIRTIO_BALLOON_RESP_PAGES > > u64 page_size; > > u64 frames[]; > > }; ??? > > > >> +3. To withdraw a page from the balloon, it can simply be accessed. > > > > IIRC, ballooned pages are at first reserved and later frames are returned > > to a host. So if you would like to use pages you must do above steps > > in revers. Hence, "it can simply be accessed" is a bit misleading. May it > > should be phrased in following way: it should have reassigned a frame > > number and later it should be returned to a pool of free pages. > > Yet this requirement that pages be re-requested blocked one > implementation attempt in Linux. They old spec said you had to, and yet > QEMU didn't actually care. Nor any existing implementation. From which point of view? Guest or host? My earlier description is from guest point of view. IIRC, old VIRTIO balloon driver works in the same way (from guest point of view). > > Additionally, some hypervisors my require additional steps to add/remove > > page to/from the pool (e.g. Xen PV guests must add/remove frames to P2M and M2P > > lists/trees too). So implementation should be able to call hypervisor > > specific stuff in such situations. > > The underlying assumption is that the hypervisor controls the mapping, > so it can remove the page and fault one back in appropriately. This > isn't true for PV Xen of course. Yet we can't put "do some hypervisor > specific stuff here" in the spec. > > Let's step back a moment to look at the goals. It's nice for PV Xen to > have portable drivers, but by definition you can't run a generic guest > in PV Xen. So it's more about reducing the differences than trying to > get a completely standardized guest. 
> > So it doesn't bother me a huge amount that a generic balloon driver > won't work in a Xen PV guest. In practice, the Linux driver might have > hooks to support Xen PV, but it might be better to keep the Xen-specific > balloon driver until Xen PV finally dies. So, as I can see from this and other conversations, we assume that PV guests should die sooner or later and we are not going to support them. As far as I can tell, the Linux kernel and Xen developers share the same goal. However, just in case, I will confirm this with the Xen guys. Daniel


  • 11.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-28-2013 03:36
    Daniel Kiper <daniel.kiper@oracle.com> writes: > On Fri, Oct 25, 2013 at 02:25:58PM +1030, Rusty Russell wrote: >> Daniel Kiper <daniel.kiper@oracle.com> writes: >> > Maybe guest should also be able to set balloon target. However, in this >> > situation host should establish limits which could not be exceeded and >> > device should enforce them. This way balloon could be controlled from host >> > and/or guest if needed. So balloon device will be responsible just for >> > passing requests to/from guest/host and limits enforcement. This way even >> > memory hotplug could be easily implemented. However, in this situation >> > device should not be called balloon. Memory manager or memory allocator? >> > Any better ideas? >> >> If it's purely guest-driven device, you don't need a target at all. You >> just have a driver which hands pages to the device. > > OK. Could host send requests too? I think that sometimes it could > be useful if host could ask guest to limit its memory usage (guest > should return to host only free pages). So if we assume that it is > "purely guest-driven device" we could not do that. Dropping the target *is* how the host asks, surely. >> You could operate the device in that way, of course, treating the target >> as a ceiling. Is it useful to have a way of telling the device you're >> operating in such a "self-regulating" mode? Or should you just do it? >> >> ie. should this be a feature bit? > > I think that target and ceiling are separate things. Target should > describe "current" or "desired" allocation. Ceiling should be a limit > enforced by device. Guest could not allocate more memory than ceiling > but should be able to set target above ceiling. In that case host > should allow guest allocation until ceiling but no more. Hard and soft limits are an interesting idea, but I'm not sure that it withstands scrutiny. There is already a (platform-specific) ceiling mechanism for a guest, defined by its memory layout. 
Varying that hard limit is an interesting problem. To reduce it, you need to decide how long to give it to reach that limit, and what happens if they don't do it. With a soft limit, the implication is that pages will be swapped if/while it is above that limit. I think that's slightly simpler than having both mechanisms. >> This doesn't make sense. It's possible that the host has some memory in >> hugepages and some in smaller pages. But if so, it needs to be able to >> say "give me 5 small pages and 1 huge page please". >> >> >> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target >> >> + size (in bytes) in the value field. If the current balloon size is >> >> + smaller than the target, the guest should add pages to the balloon >> >> + as soon as possible. If the current balloon is larger than the >> >> + target, the guest may withdraw pages. >> >> So let's drop this, and have two commands: >> >> /* Give me more pages! */ >> VIRTIO_BALLOON_REQ_FILL: >> u32 type; >> u32 page_bits; // eg 12 == 4096. >> u64 num_pages; >> >> And: >> >> /* You can take some back. */ >> VIRTIO_BALLOON_REQ_RELEASE: >> u32 type; >> u32 page_bits; // eg 12 == 4096. >> u64 num_pages; > > General idea is OK but this way we back to relative requests. > IIRC we would like to avoid them. They do kind of suck, but if we want to deal with more than one page size at once we either need to publish several values, or use this mechanism. In practice there'll probably only be a couple of page sizes, but this seems the sweet-spot for simplicity. >> Yet this requirement that pages be re-requested blocked one >> implementation attempt in Linux. They old spec said you had to, and yet >> QEMU didn't actually care. Nor any existing implementation. > >>From which point of view? Guest or host? My earlier description > is from guest point of view. IIRC, old VIRTIO balloon driver > works in the same way (from guest point of view). From the guest. 
Damn, I can't find the email thread, but my vague memory was that there was some advantage in having Linux simply use the pages without asking permission (as it has to now). I'd ping Paolo Bonzini <pbonzini@redhat.com>, "Sasha Levin" <sasha.levin@oracle.com> and Rafael Aquini <aquini@redhat.com>. >> > Additionally, some hypervisors my require additional steps to add/remove >> > page to/from the pool (e.g. Xen PV guests must add/remove frames to P2M and M2P >> > lists/trees too). So implementation should be able to call hypervisor >> > specific stuff in such situations. >> >> The underlying assumption is that the hypervisor controls the mapping, >> so it can remove the page and fault one back in appropriately. This >> isn't true for PV Xen of course. Yet we can't put "do some hypervisor >> specific stuff here" in the spec. >> >> Let's step back a moment to look at the goals. It's nice for PV Xen to >> have portable drivers, but by definition you can't run a generic guest >> in PV Xen. So it's more about reducing the differences than trying to >> get a completely standardized guest. >> >> So it doesn't bother me a huge amount that a generic balloon driver >> won't work in a Xen PV guest. In practice, the Linux driver might have >> hooks to support Xen PV, but it might be better to keep the Xen-specific >> balloon driver until Xen PV finally dies. > > So as I can see from this and others conversation we assume that PV guest > should die sooner or later and we are not going to support them. As I can > see the same goal is for Linux Kernel and Xen developers. However, just > in case I will confirm this with Xen guys. To be clear, as lguest author I mourn the loss of PV as much as anyone! But the page table tricks which made Xen-PV so brilliant make abstraction of a balloon device impossible, AFAICT :( Cheers, Rusty.


  • 12.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 10:01
    On Mon, Oct 28, 2013 at 01:12:58PM +1030, Rusty Russell wrote: > Daniel Kiper <daniel.kiper@oracle.com> writes: > > On Fri, Oct 25, 2013 at 02:25:58PM +1030, Rusty Russell wrote: > >> Daniel Kiper <daniel.kiper@oracle.com> writes: > >> > Maybe guest should also be able to set balloon target. However, in this > >> > situation host should establish limits which could not be exceeded and > >> > device should enforce them. This way balloon could be controlled from host > >> > and/or guest if needed. So balloon device will be responsible just for > >> > passing requests to/from guest/host and limits enforcement. This way even > >> > memory hotplug could be easily implemented. However, in this situation > >> > device should not be called balloon. Memory manager or memory allocator? > >> > Any better ideas? > >> > >> If it's purely guest-driven device, you don't need a target at all. You > >> just have a driver which hands pages to the device. > > > > OK. Could host send requests too? I think that sometimes it could > > be useful if host could ask guest to limit its memory usage (guest > > should return to host only free pages). So if we assume that it is > > "purely guest-driven device" we could not do that. > > Dropping the target *is* how the host asks, surely. AIUC we assume here that host and guest could put requests into input queue. If yes then I think that it should be clearly stated in spec. > >> You could operate the device in that way, of course, treating the target > >> as a ceiling. Is it useful to have a way of telling the device you're > >> operating in such a "self-regulating" mode? Or should you just do it? > >> > >> ie. should this be a feature bit? > > > > I think that target and ceiling are separate things. Target should > > describe "current" or "desired" allocation. Ceiling should be a limit > > enforced by device. Guest could not allocate more memory than ceiling > > but should be able to set target above ceiling. 
In that case host > > should allow guest allocation until ceiling but no more. > > Hard and soft limits are an interesting idea, but I'm not sure that it > withstands scrutiny. There is already a (platform-specific) ceiling > mechanism for a guest, defined by its memory layout. > > Varying that hard limit is an interesting problem. To reduce it, you > need to decide how long to give it to reach that limit, and what happens > if they don't do it. With a soft limit, the implication is that pages > will be swapped if/while it is above that limit. I think that's > slightly simpler than having both mechanisms. Hmmm... I think that there is a misunderstanding here. I looked at this issue from host point of view. Anyway, my main concern here is that spec should not block development of memory hotplug on the base of this device. In the case of Linux Kernel its implementation is quiet easy and I did that for Xen balloon driver. Hence, I think that similar thing could be done for this device. However, as I understand balloon driver definition it deals only with memory allocated when a given VM was created. So that is why I asked, should we call it balloon device if we are going to support memory hotplug in driver? If yes then definition should be changed or we should use another name. Additionally, I think that host should have mechanism to limit guest memory usage which will prevent against host memory exhaustion. Of course later issue is out of scope of this spec but I think that we should at least mention about such problem. > >> This doesn't make sense. It's possible that the host has some memory in > >> hugepages and some in smaller pages. But if so, it needs to be able to > >> say "give me 5 small pages and 1 huge page please". > >> > >> >> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target > >> >> + size (in bytes) in the value field. 
If the current balloon size is > >> >> + smaller than the target, the guest should add pages to the balloon > >> >> + as soon as possible. If the current balloon is larger than the > >> >> + target, the guest may withdraw pages. > >> > >> So let's drop this, and have two commands: > >> > >> /* Give me more pages! */ > >> VIRTIO_BALLOON_REQ_FILL: > >> u32 type; > >> u32 page_bits; // eg 12 == 4096. > >> u64 num_pages; > >> > >> And: > >> > >> /* You can take some back. */ > >> VIRTIO_BALLOON_REQ_RELEASE: > >> u32 type; > >> u32 page_bits; // eg 12 == 4096. > >> u64 num_pages; > > > > General idea is OK but this way we back to relative requests. > > IIRC we would like to avoid them. > > They do kind of suck, but if we want to deal with more than one page > size at once we either need to publish several values, or use this > mechanism. > > In practice there'll probably only be a couple of page sizes, but this > seems the sweet-spot for simplicity. My first idea was that request should contain list of desired page sizes than a hard statement like "give me X pages of size Y". Here, at least we are not able to predict which pages are unused in guest. So I think this soft statement is more useful. But if we choose your proposal I think that there should be a way to explicitly refuse such hard request or fail partially (e.g. host requests 5 superpages and guest is able to free 2 only). > >> Yet this requirement that pages be re-requested blocked one > >> implementation attempt in Linux. They old spec said you had to, and yet > >> QEMU didn't actually care. Nor any existing implementation. > > > >>From which point of view? Guest or host? My earlier description > > is from guest point of view. IIRC, old VIRTIO balloon driver > > works in the same way (from guest point of view). > > From the guest. Damn, I can't find the email thread, but my vague > memory was that there was some advantage in having Linux simply use the > pages without asking permission (as it has to now). 
> > I'd ping Paolo Bonzini <pbonzini@redhat.com>, "Sasha Levin" > <sasha.levin@oracle.com> and Rafael Aquini <aquini@redhat.com>. Please check linux/mm/balloon_compaction.c:balloon_page_enqueue(), linux/mm/balloon_compaction.c:balloon_page_dequeue(), drivers/virtio/virtio_balloon.c:release_pages_by_pfn() and include/linux/balloon_compaction.h:balloon_page_free(). It does this thing in a bit different way than Xen balloon driver but idea is the same. > >> > Additionally, some hypervisors my require additional steps to add/remove > >> > page to/from the pool (e.g. Xen PV guests must add/remove frames to P2M and M2P > >> > lists/trees too). So implementation should be able to call hypervisor > >> > specific stuff in such situations. > >> > >> The underlying assumption is that the hypervisor controls the mapping, > >> so it can remove the page and fault one back in appropriately. This > >> isn't true for PV Xen of course. Yet we can't put "do some hypervisor > >> specific stuff here" in the spec. > >> > >> Let's step back a moment to look at the goals. It's nice for PV Xen to > >> have portable drivers, but by definition you can't run a generic guest > >> in PV Xen. So it's more about reducing the differences than trying to > >> get a completely standardized guest. > >> > >> So it doesn't bother me a huge amount that a generic balloon driver > >> won't work in a Xen PV guest. In practice, the Linux driver might have > >> hooks to support Xen PV, but it might be better to keep the Xen-specific > >> balloon driver until Xen PV finally dies. > > > > So as I can see from this and others conversation we assume that PV guest > > should die sooner or later and we are not going to support them. As I can > > see the same goal is for Linux Kernel and Xen developers. However, just > > in case I will confirm this with Xen guys. > > To be clear, as lguest author I mourn the loss of PV as much as anyone! 
> But the page table tricks which made Xen-PV so brilliant make > abstraction of a balloon device impossible, AFAICT :( After some discussion with Konrad Wilk yesterday, I suppose that we will be forced to have some mechanism to cooperate with different types of hypervisors. However, let's wait for the first version of the draft and we will see what is needed. Daniel


  • 13.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-28-2013 19:17
    On Fri, Oct 25, 2013 at 02:25:58PM +1030, Rusty Russell wrote: > Daniel Kiper <daniel.kiper@oracle.com> writes: > > Hi, > > > > Sorry for late reply but I am very busy now. > > That's OK, thanks for making the time. > > > On Tue, Oct 01, 2013 at 05:40:12PM +0930, Rusty Russell wrote: > >> Ok, so here's what I ended up with. > >> > >> Any feedback welcome... > >> Rusty. > > > > [...] > > > >> +100.2.4.5. Memory Balloon Device > >> +=========================== > >> + > >> +The virtio memory balloon device is a primitive device for managing > >> +guest memory: the device asks for a certain amount of memory, and the > >> +guest supplies it. This allows the guest to adapt to changes in > >> +allowance of underlying physical memory. The device can also be used > >> +to communicate guest memory statistics to the host. > > > > Maybe guest should also be able to set balloon target. However, in this > > situation host should establish limits which could not be exceeded and > > device should enforce them. This way balloon could be controlled from host > > and/or guest if needed. So balloon device will be responsible just for > > passing requests to/from guest/host and limits enforcement. This way even > > memory hotplug could be easily implemented. However, in this situation > > device should not be called balloon. Memory manager or memory allocator? > > Any better ideas? > > If it's purely guest-driven device, you don't need a target at all. You > just have a driver which hands pages to the device. > > You could operate the device in that way, of course, treating the target > as a ceiling. Is it useful to have a way of telling the device you're > operating in such a "self-regulating" mode? Or should you just do it? > > ie. should this be a feature bit? > > >> +The read-only configuration field indicates the granularity of memory > >> +which can be added to the balloon. This is typically reflects the > >> +page size of the host (eg. 12 for 4096-byte pages). 
> >> + > >> + struct virtio_balloon_config { > >> + u32 page_bits; > >> + } > > > > Why balloon device must be forced to use only one page size? I think that > > configuration area should list which page sizes could be requested by > > device. Device should be able to request any allowed/defined size but driver > > could reject request or fail partially. Additionally, maybe device should > > inform about allowed page sizes using size explicitly instead of number > > of bits. So maybe it is worth storing page sizes as u64. By the way we could > > store page sizes which are not equal full power of 2 (could be useful for > > some strange sizes of superpages in the future if something crazy happens). > > However, if we store page sizes as number of bits we could represent > > larger sizes. Hmmm... > > > > Probably we do not implement above mentioned feature at first but > > it gives us a chance to do that later. > > I don't see non-power-of-two pages happening. > > But it makes sense to put the page size in each request. A bit more > painful to implement, since the driver can't know in advance that it > doesn't support a request. > > >> +100.2.4.5.5. Device Initialization > >> +----------------------------- > >> + > >> +1. At least one struct virtio_balloon_request buffer should be placed > >> + in the inputq. > >> + > >> +2. The balloon starts empty (size 0). > >> + > >> +100.2.4.5.6. Device Operation > >> +------------------------ > >> + > >> +The device is driven by receipt of a command in the input queue: > >> + > >> + struct virtio_balloon_req { > >> +#define VIRTIO_BALLOON_REQ_RESIZE 0 > >> +#define VIRTIO_BALLOON_REQ_STATS 1 > >> + u32 type; > >> + u32 reserved; > >> + u64 value; > >> + } > > > > struct virtio_balloon_pages { > > #define VIRTIO_BALLOON_REQ_RESIZE 0 > > #define VIRTIO_BALLOON_REQ_STATS 1 > > u32 type; > > u32 reserved; > > u64 guest_memory_size; > > u64 page_sizes[]; > > }; ??? > > This doesn't make sense. 
It's possible that the host has some memory in > hugepages and some in smaller pages. But if so, it needs to be able to > say "give me 5 small pages and 1 huge page please". > > >> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target > >> + size (in bytes) in the value field. If the current balloon size is > >> + smaller than the target, the guest should add pages to the balloon > >> + as soon as possible. If the current balloon is larger than the > >> + target, the guest may withdraw pages. > > So let's drop this, and have two commands: > > /* Give me more pages! */ > VIRTIO_BALLOON_REQ_FILL: > u32 type; > u32 page_bits; // eg 12 == 4096. > u64 num_pages; > > And: > > /* You can take some back. */ > VIRTIO_BALLOON_REQ_RELEASE: > u32 type; > u32 page_bits; // eg 12 == 4096. > u64 num_pages; > > >> +2. To add pages to the balloon, the physical addresses of the pages > > > > frames ??? > > Define frame? It's an array of page physical addresses. Was that unclear? > > >> + are sent using the output queue. The number of pages is implied in > >> + the message length, and each page value must be a multiple of the > >> + page size indicated in struct virtio_balloon_config. > >> + > >> + struct virtio_balloon_pages { > >> +#define VIRTIO_BALLOON_RESP_PAGES 0 > >> + u32 type; // VIRTIO_BALLOON_RESP_PAGES > >> + u64 page[]; > >> + }; > > > > struct virtio_balloon_pages { > > u32 type; // VIRTIO_BALLOON_RESP_PAGES > > u64 page_size; > > u64 frames[]; > > }; ??? > > > >> +3. To withdraw a page from the balloon, it can simply be accessed. > > > > IIRC, ballooned pages are at first reserved and later frames are returned > > to a host. So if you would like to use pages you must do above steps > > in revers. Hence, "it can simply be accessed" is a bit misleading. May it > > should be phrased in following way: it should have reassigned a frame > > number and later it should be returned to a pool of free pages. 
> > Yet this requirement that pages be re-requested blocked one > implementation attempt in Linux. They old spec said you had to, and yet > QEMU didn't actually care. Nor any existing implementation. > > > Additionally, some hypervisors my require additional steps to add/remove > > page to/from the pool (e.g. Xen PV guests must add/remove frames to P2M and M2P > > lists/trees too). So implementation should be able to call hypervisor > > specific stuff in such situations. > > The underlying assumption is that the hypervisor controls the mapping, > so it can remove the page and fault one back in appropriately. This > isn't true for PV Xen of course. Yet we can't put "do some hypervisor > specific stuff here" in the spec. > > Let's step back a moment to look at the goals. It's nice for PV Xen to > have portable drivers, but by definition you can't run a generic guest > in PV Xen. So it's more about reducing the differences than trying to > get a completely standardized guest. > > So it doesn't bother me a huge amount that a generic balloon driver > won't work in a Xen PV guest. In practice, the Linux driver might have > hooks to support Xen PV, but it might be better to keep the Xen-specific > balloon driver until Xen PV finally dies. > > Cheers, > Rusty. OK there are some more issues to consider around deflate. 1. On KVM, we actually want to change QEMU so that pagefaults don't work either. Specifically, we want to skip pages in the balloon for migration. However, migration is done in userspace while pagefaults are done in kernel. I think the implication is that - you should be able to ask guest to inflate balloon with pages that can be paged in (when you don't want to migrate and want max local performance) or with pages that can not be paged in (when you want to migrate faster), dynamically, not through a device feature - "will notify before use" feature should be per a bunch or pages actually. 2. 
Assuming we need the deflate command, I think it's a problem that we need to allocate memory for deflate. Can't we just stick page pointers in VQ directly? If we address these, I think Xen PV becomes simple, you just never ask for pageable-in pages, and you ignore them if guest gives them to you.


  • 14.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 01:58
    "Michael S. Tsirkin" <mst@redhat.com> writes: > OK there are some more issues to consider around deflate. > > > 1. On KVM, we actually want to change QEMU so that pagefaults don't > work either. Specifically, we want to skip pages in the balloon for > migration. > However, migration is done in userspace while pagefaults > are done in kernel. > I think the implication is that > - you should be able to ask guest to inflate balloon > with pages that can be paged in (when you don't want to migrate > and want max local performance) or with pages that can not be paged in > (when you want to migrate faster), dynamically, not through a > device feature > - "will notify before use" feature should be per a bunch or pages actually. I am always reluctant to implement a spec for things which don't exist. This is the cause of the current "negative feature" mess with VIRTIO_BALLOON_F_MUST_TELL_HOST. So if we *ever* want to ask for pages, let's make the the driver always ask for pages. You place a buffer in the queue and the device fills it with page addresses you can now use. See below. > 2. Assuming we need the deflate command, > I think it's a problem that we need to allocate memory for deflate. > Can't we just stick page pointers in VQ directly? That's an internal implementation issue; we can avoid the allocation fairly trivially using a static buffer. > If we address these, I think Xen PV becomes simple, you just > never ask for pageable-in pages, and you ignore them if guest > gives them to you. Simple, yes. But even with this design, there's no way for the Xen PV model to implement this without a xen-specific hook in the driver. At the very least, "get_a_page_for_the_balloon()" and "return_a_page_from_the_balloon()" will need to do Xen-specific stuff. We *could* argue that's outside the scope of the spec, but we should note it. Cheers, Rusty. This revises the previous proposal. It renames the queues for more clarity, and insists that you request pages. 
diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt index 426a3cb..173130e 100644 --- a/virtio-v1.0-wd01-part1-specification.txt +++ b/virtio-v1.0-wd01-part1-specification.txt @@ -3165,7 +3165,7 @@ to communicate guest memory statistics to the host. 100.2.4.5.2. Virtqueues ------------------ - 0:inputq. 1:outputq. + 0:fromdevq. 1:todevq. 100.2.4.5.3. Feature bits @@ -3187,14 +3187,14 @@ page size of the host (eg. 12 for 4096-byte pages). ----------------------------- 1. At least one struct virtio_balloon_request buffer should be placed - in the inputq. + in the fromdevq. 2. The balloon starts empty (size 0). 100.2.4.5.6. Device Operation ------------------------ -The device is driven by receipt of a command in the input queue: +The device is driven by receipt of a command in the fromdevq queue: struct virtio_balloon_req { #define VIRTIO_BALLOON_REQ_RESIZE 0 @@ -3211,7 +3211,7 @@ The device is driven by receipt of a command in the input queue: target, the guest may withdraw pages. 2. To add pages to the balloon, the physical addresses of the pages - are sent using the output queue. The number of pages is implied in + are sent using the todevq queue. The number of pages is implied in the message length, and each page value must be a multiple of the page size indicated in struct virtio_balloon_config. @@ -3221,15 +3221,24 @@ The device is driven by receipt of a command in the input queue: u64 page[]; }; -3. To withdraw a page from the balloon, it can simply be accessed. - The contents at this point will be undefined. The device should +3. To withdraw pages from the balloon, the same structure should be + placed in the todevq queue, with the page array writable: + + struct virtio_balloon_pages { +#define VIRTIO_BALLOON_REQ_PAGES 2 + u32 type; // VIRTIO_BALLOON_REQ_PAGES + u64 page[]; + }; + + The device may not fill the entire page array. The contents + of the pages received will be undefined. 
The device should keep count of how many pages remain in the balloon so it can correctly respond to future resize requests. 4. A VIRTIO_BALLOON_REQ_STATS command indicates that the driver should report what stats are available. -5. To report stats, the following message is sent to the output queue. +5. To report stats, the following message is sent to the todevq queue. Indeterminable stats must not be reported. struct virtio_balloon_stats {


  • 15.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 11:45
    On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > OK there are some more issues to consider around deflate. > > > > > > 1. On KVM, we actually want to change QEMU so that pagefaults don't > > work either. Specifically, we want to skip pages in the balloon for > > migration. > > However, migration is done in userspace while pagefaults > > are done in kernel. > > I think the implication is that > > - you should be able to ask guest to inflate balloon > > with pages that can be paged in (when you don't want to migrate > > and want max local performance) or with pages that can not be paged in > > (when you want to migrate faster), dynamically, not through a > > device feature > > - "will notify before use" feature should be per a bunch or pages actually. > > I am always reluctant to implement a spec for things which don't exist. > This is the cause of the current "negative feature" mess with > VIRTIO_BALLOON_F_MUST_TELL_HOST. > > So if we *ever* want to ask for pages, let's make the the driver always > ask for pages. You place a buffer in the queue and the device fills it > with page addresses you can now use. You mean PFNs? > See below. > > > 2. Assuming we need the deflate command, > > I think it's a problem that we need to allocate memory for deflate. > > Can't we just stick page pointers in VQ directly? > > That's an internal implementation issue; we can avoid the allocation > fairly trivially using a static buffer. > > > If we address these, I think Xen PV becomes simple, you just > > never ask for pageable-in pages, and you ignore them if guest > > gives them to you. > > Simple, yes. > > But even with this design, there's no way for the Xen PV model to > implement this without a xen-specific hook in the driver. > > At the very least, "get_a_page_for_the_balloon()" and > "return_a_page_from_the_balloon()" will need to do Xen-specific stuff. 
Yep, it looks that even HVM needs some Xen specific hooks. > We *could* argue that's outside the scope of the spec, but we should > note it. I am not sure right now. > Cheers, > Rusty. > > This revises the previous proposal. It renames the queues for > more clarity, and insists that you request pages. > > diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt > index 426a3cb..173130e 100644 > --- a/virtio-v1.0-wd01-part1-specification.txt > +++ b/virtio-v1.0-wd01-part1-specification.txt > @@ -3165,7 +3165,7 @@ to communicate guest memory statistics to the host. > > 100.2.4.5.2. Virtqueues > ------------------ > - 0:inputq. 1:outputq. > + 0:fromdevq. 1:todevq. > > > 100.2.4.5.3. Feature bits > @@ -3187,14 +3187,14 @@ page size of the host (eg. 12 for 4096-byte pages). > ----------------------------- > > 1. At least one struct virtio_balloon_request buffer should be placed > - in the inputq. > + in the fromdevq. > > 2. The balloon starts empty (size 0). > > 100.2.4.5.6. Device Operation > ------------------------ > > -The device is driven by receipt of a command in the input queue: > +The device is driven by receipt of a command in the fromdevq queue: > > struct virtio_balloon_req { > #define VIRTIO_BALLOON_REQ_RESIZE 0 > @@ -3211,7 +3211,7 @@ The device is driven by receipt of a command in the input queue: > target, the guest may withdraw pages. > > 2. To add pages to the balloon, the physical addresses of the pages > - are sent using the output queue. The number of pages is implied in > + are sent using the todevq queue. The number of pages is implied in > the message length, and each page value must be a multiple of the > page size indicated in struct virtio_balloon_config. > > @@ -3221,15 +3221,24 @@ The device is driven by receipt of a command in the input queue: > u64 page[]; > }; > > -3. To withdraw a page from the balloon, it can simply be accessed. > - The contents at this point will be undefined. 
The device should > +3. To withdraw pages from the balloon, the same structure should be > + placed in the todevq queue, with the page array writable: > + > + struct virtio_balloon_pages { > +#define VIRTIO_BALLOON_REQ_PAGES 2 > + u32 type; // VIRTIO_BALLOON_REQ_PAGES > + u64 page[]; What is the size of this array? > + }; > + > + The device may not fill the entire page array. The contents > + of the pages received will be undefined. The device should keep count of how many pages remain in the balloon so it can correctly respond to future resize requests. What happens if the driver requests more pages than are in the balloon? Are we going to support such cases? I am asking in the context of memory hotplug support. Daniel
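The sizing question is answered later in the thread, but the rule is easy to make concrete: in the draft quoted above the page array carries no explicit count, so its length is implied by the size of the buffer placed on the queue. A minimal sketch under that assumption (the helper name is invented, and the struct layout is illustrative, not normative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Draft structure from the proposal quoted above; layout illustrative. */
struct virtio_balloon_pages {
#define VIRTIO_BALLOON_REQ_PAGES 2
    uint32_t type;      /* VIRTIO_BALLOON_REQ_PAGES */
    uint64_t page[];    /* entry count implied by the message length */
};

/* Hypothetical helper: derive the number of page entries from the total
 * length of the buffer placed on the queue. */
static size_t balloon_page_count(size_t msg_len)
{
    size_t hdr = offsetof(struct virtio_balloon_pages, page);
    return msg_len < hdr ? 0 : (msg_len - hdr) / sizeof(uint64_t);
}
```

A buffer of header plus four `u64` entries thus carries four page addresses; no count field is needed.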


  • 16.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 12:53
    Daniel Kiper <daniel.kiper@oracle.com> writes: > On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: >> "Michael S. Tsirkin" <mst@redhat.com> writes: >> > OK there are some more issues to consider around deflate. >> > >> > >> > 1. On KVM, we actually want to change QEMU so that pagefaults don't >> > work either. Specifically, we want to skip pages in the balloon for >> > migration. >> > However, migration is done in userspace while pagefaults >> > are done in kernel. >> > I think the implication is that >> > - you should be able to ask guest to inflate balloon >> > with pages that can be paged in (when you don't want to migrate >> > and want max local performance) or with pages that can not be paged in >> > (when you want to migrate faster), dynamically, not through a >> > device feature >> > - "will notify before use" feature should be per a bunch or pages actually. >> >> I am always reluctant to implement a spec for things which don't exist. >> This is the cause of the current "negative feature" mess with >> VIRTIO_BALLOON_F_MUST_TELL_HOST. >> >> So if we *ever* want to ask for pages, let's make the the driver always >> ask for pages. You place a buffer in the queue and the device fills it >> with page addresses you can now use. > > You mean PFNs? PFN << PAGE_BITS. Since we're dealing with different size pages, using exact addresses is clearer, I think. >> +3. To withdraw pages from the balloon, the same structure should be >> + placed in the todevq queue, with the page array writable: >> + >> + struct virtio_balloon_pages { >> +#define VIRTIO_BALLOON_REQ_PAGES 2 >> + u32 type; // VIRTIO_BALLOON_REQ_PAGES >> + u64 page[]; > > What is the size of this array? It's implied by the length of the request. >> + }; >> + >> + The device may not fill the entire page array. The contents >> + of the pages received will be undefined. The device should >> keep count of how many pages remain in the balloon so it can >> correctly respond to future resize requests. 
> > What happen if driver request more pages than are in balloon? > Are we going to support such cases? I am asking in context > of memory hotplug support. I don't think so. The device won't fill the entire array in that case (remember, virtio gets a "used" field returned, which says how many bytes were written by the device). Memory hotplug is properly the realm of platform-specific methods (eg. ACPI), so I think it's outside the virtio spec. Ballooning is simpler, but has been shown to be useful. Both can coexist. Cheers, Rusty.
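Rusty's two points here can be sketched: an address on the queue is simply the PFN shifted by the configured page size, and the virtqueue "used" length tells the driver how many pages the device actually returned when the request exceeded the balloon's contents. The helper names below are invented:

```c
#include <assert.h>
#include <stdint.h>

/* An exact address is the PFN shifted by page_bits (from config space,
 * e.g. 12 for 4096-byte pages), so it is unambiguous for any page size. */
static uint64_t pfn_to_addr(uint64_t pfn, unsigned page_bits)
{
    return pfn << page_bits;
}

/* On withdraw, the device reports the bytes it wrote via the virtqueue
 * "used" length; asking for more pages than the balloon holds just means
 * the array comes back partially filled.  hdr_len is the number of bytes
 * before the page array. */
static uint64_t pages_returned(uint32_t used_len, uint32_t hdr_len)
{
    return used_len <= hdr_len ? 0 : (used_len - hdr_len) / sizeof(uint64_t);
}
```

So a driver asking for 256 pages from a balloon holding 5 sees a used length of the header plus five `u64` entries, and nothing else.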


  • 17.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 16:38
    On Tue, Oct 29, 2013 at 11:15:59PM +1030, Rusty Russell wrote: > Daniel Kiper <daniel.kiper@oracle.com> writes: > > On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: > >> "Michael S. Tsirkin" <mst@redhat.com> writes: > >> > OK there are some more issues to consider around deflate. > >> > > >> > > >> > 1. On KVM, we actually want to change QEMU so that pagefaults don't > >> > work either. Specifically, we want to skip pages in the balloon for > >> > migration. > >> > However, migration is done in userspace while pagefaults > >> > are done in kernel. > >> > I think the implication is that > >> > - you should be able to ask guest to inflate balloon > >> > with pages that can be paged in (when you don't want to migrate > >> > and want max local performance) or with pages that can not be paged in > >> > (when you want to migrate faster), dynamically, not through a > >> > device feature > >> > - "will notify before use" feature should be per a bunch or pages actually. > >> > >> I am always reluctant to implement a spec for things which don't exist. > >> This is the cause of the current "negative feature" mess with > >> VIRTIO_BALLOON_F_MUST_TELL_HOST. > >> > >> So if we *ever* want to ask for pages, let's make the the driver always > >> ask for pages. You place a buffer in the queue and the device fills it > >> with page addresses you can now use. > > > > You mean PFNs? > > PFN << PAGE_BITS. Since we're dealing with different size pages, using > exact addresses is clearer, I think. OK, it makes sens. However, I have done quick review of existing Xen balloon driver and PFNs or exact addresses probably will not work on Xen. Even on HVM. It must be confirmed. I will do that after releasing first draft. > >> +3. 
To withdraw pages from the balloon, the same structure should be > >> + placed in the todevq queue, with the page array writable: > >> + > >> + struct virtio_balloon_pages { > >> +#define VIRTIO_BALLOON_REQ_PAGES 2 > >> + u32 type; // VIRTIO_BALLOON_REQ_PAGES > >> + u64 page[]; > > > > What is the size of this array? > > It's implied by the length of the request. OK. > >> + }; > >> + > >> + The device may not fill the entire page array. The contents > >> + of the pages received will be undefined. The device should > >> keep count of how many pages remain in the balloon so it can > >> correctly respond to future resize requests. > > > > What happen if driver request more pages than are in balloon? > > Are we going to support such cases? I am asking in context > > of memory hotplug support. > > I don't think so. The device won't fill the entire array in that case > (remember, virtio gets a "used" field returned, which says how many > bytes were written by the device). > > Memory hotplug is properly the realm of platform specific methods > (eg. ACPI), so I think it's outside the virtio spec. In general ACPI is used as a notifier which says that new memory was installed in system. If memory hotplug is built in system it creates all structures needed to use newly added memory. However, memory must be activated by user. This process works in similar way like balloon deflate. Instead of ACPI another notification mechanism could be used but memory hotplug mechanism itself is quite generic and could be used everywhere. We use that features in Xen implementation. Hence, we could use the same in VIRTIO balloon driver. We just need device which gives pages to VM without any limitations (i.e. accordingly to limits established by admin and not limited by amount of memory assigned at boot). This way guest memory could be extended without stopping. If we just use balloon it would not work in that way. > Ballooning is simpler, but has shown to be useful. Both can coexist. Right. 
We did that in the Xen balloon driver. It supports ballooning, but memory hotplug is used when more memory is requested than was assigned to the VM at boot. After that, everything works as usual; even the balloon driver works the same way on hotplugged memory as on memory allocated at boot. So we just need a device which gives pages to the VM without any limitations. Of course we could consider another device for memory hotplug, but I do not think that makes sense. Daniel


  • 18.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 11-05-2013 10:47
    On Tue, Oct 29, 2013 at 05:37:26PM +0100, Daniel Kiper wrote: > On Tue, Oct 29, 2013 at 11:15:59PM +1030, Rusty Russell wrote: > > Daniel Kiper <daniel.kiper@oracle.com> writes: > > > On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: > > >> "Michael S. Tsirkin" <mst@redhat.com> writes: > > >> > OK there are some more issues to consider around deflate. > > >> > > > >> > > > >> > 1. On KVM, we actually want to change QEMU so that pagefaults don't > > >> > work either. Specifically, we want to skip pages in the balloon for > > >> > migration. > > >> > However, migration is done in userspace while pagefaults > > >> > are done in kernel. > > >> > I think the implication is that > > >> > - you should be able to ask guest to inflate balloon > > >> > with pages that can be paged in (when you don't want to migrate > > >> > and want max local performance) or with pages that can not be paged in > > >> > (when you want to migrate faster), dynamically, not through a > > >> > device feature > > >> > - "will notify before use" feature should be per a bunch or pages actually. > > >> > > >> I am always reluctant to implement a spec for things which don't exist. > > >> This is the cause of the current "negative feature" mess with > > >> VIRTIO_BALLOON_F_MUST_TELL_HOST. > > >> > > >> So if we *ever* want to ask for pages, let's make the the driver always > > >> ask for pages. You place a buffer in the queue and the device fills it > > >> with page addresses you can now use. > > > > > > You mean PFNs? > > > > PFN << PAGE_BITS. Since we're dealing with different size pages, using > > exact addresses is clearer, I think. > > OK, it makes sens. However, I have done quick review of existing Xen balloon driver > and PFNs or exact addresses probably will not work on Xen. Even on HVM. It must > be confirmed. I will do that after releasing first draft. > > > >> +3. 
To withdraw pages from the balloon, the same structure should be > > >> + placed in the todevq queue, with the page array writable: > > >> + > > >> + struct virtio_balloon_pages { > > >> +#define VIRTIO_BALLOON_REQ_PAGES 2 > > >> + u32 type; // VIRTIO_BALLOON_REQ_PAGES > > >> + u64 page[]; > > > > > > What is the size of this array? > > > > It's implied by the length of the request. > > OK. > > > >> + }; > > >> + > > >> + The device may not fill the entire page array. The contents > > >> + of the pages received will be undefined. The device should > > >> keep count of how many pages remain in the balloon so it can > > >> correctly respond to future resize requests. > > > > > > What happen if driver request more pages than are in balloon? > > > Are we going to support such cases? I am asking in context > > > of memory hotplug support. > > > > I don't think so. The device won't fill the entire array in that case > > (remember, virtio gets a "used" field returned, which says how many > > bytes were written by the device). > > > > Memory hotplug is properly the realm of platform specific methods > > (eg. ACPI), so I think it's outside the virtio spec. > > In general ACPI is used as a notifier which says that new memory was > installed in system. If memory hotplug is built in system it creates > all structures needed to use newly added memory. However, memory must > be activated by user. This process works in similar way like > balloon deflate. > > Instead of ACPI another notification mechanism could be used but memory > hotplug mechanism itself is quite generic and could be used everywhere. > We use that features in Xen implementation. Hence, we could use the > same in VIRTIO balloon driver. We just need device which gives pages > to VM without any limitations (i.e. accordingly to limits established > by admin and not limited by amount of memory assigned at boot). This > way guest memory could be extended without stopping. 
If we just use > balloon it would not work in that way. > > > Ballooning is simpler, but has shown to be useful. Both can coexist. > > Right. We did that in Xen balloon driver. It supports ballooning but > memory hotplug is used when more memory is requested than it was > assigned to VM at boot. Later all stuff works as usual. Even balloon > driver works in the same way on hotplugged memory like on memory > allocated at boot. So we just need a device which gives pages > to VM without any limitations. > > Of course we could consider another device for memory hotplug but > I do not think it makes sens. Any comments? Daniel


  • 19.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 12-11-2013 09:00
    On Tue, Oct 29, 2013 at 05:37:26PM +0100, Daniel Kiper wrote: > On Tue, Oct 29, 2013 at 11:15:59PM +1030, Rusty Russell wrote: > > Daniel Kiper <daniel.kiper@oracle.com> writes: > > > On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: > > >> "Michael S. Tsirkin" <mst@redhat.com> writes: > > >> > OK there are some more issues to consider around deflate. > > >> > > > >> > > > >> > 1. On KVM, we actually want to change QEMU so that pagefaults don't > > >> > work either. Specifically, we want to skip pages in the balloon for > > >> > migration. > > >> > However, migration is done in userspace while pagefaults > > >> > are done in kernel. > > >> > I think the implication is that > > >> > - you should be able to ask guest to inflate balloon > > >> > with pages that can be paged in (when you don't want to migrate > > >> > and want max local performance) or with pages that can not be paged in > > >> > (when you want to migrate faster), dynamically, not through a > > >> > device feature > > >> > - "will notify before use" feature should be per a bunch or pages actually. > > >> > > >> I am always reluctant to implement a spec for things which don't exist. > > >> This is the cause of the current "negative feature" mess with > > >> VIRTIO_BALLOON_F_MUST_TELL_HOST. > > >> > > >> So if we *ever* want to ask for pages, let's make the the driver always > > >> ask for pages. You place a buffer in the queue and the device fills it > > >> with page addresses you can now use. > > > > > > You mean PFNs? > > > > PFN << PAGE_BITS. Since we're dealing with different size pages, using > > exact addresses is clearer, I think. > > OK, it makes sens. However, I have done quick review of existing Xen balloon driver > and PFNs or exact addresses probably will not work on Xen. Even on HVM. It must > be confirmed. I will do that after releasing first draft. > > > >> +3. 
To withdraw pages from the balloon, the same structure should be > > >> + placed in the todevq queue, with the page array writable: > > >> + > > >> + struct virtio_balloon_pages { > > >> +#define VIRTIO_BALLOON_REQ_PAGES 2 > > >> + u32 type; // VIRTIO_BALLOON_REQ_PAGES > > >> + u64 page[]; > > > > > > What is the size of this array? > > > > It's implied by the length of the request. > > OK. > > > >> + }; > > >> + > > >> + The device may not fill the entire page array. The contents > > >> + of the pages received will be undefined. The device should > > >> keep count of how many pages remain in the balloon so it can > > >> correctly respond to future resize requests. > > > > > > What happen if driver request more pages than are in balloon? > > > Are we going to support such cases? I am asking in context > > > of memory hotplug support. > > > > I don't think so. The device won't fill the entire array in that case > > (remember, virtio gets a "used" field returned, which says how many > > bytes were written by the device). > > > > Memory hotplug is properly the realm of platform specific methods > > (eg. ACPI), so I think it's outside the virtio spec. > > In general ACPI is used as a notifier which says that new memory was > installed in system. If memory hotplug is built in system it creates > all structures needed to use newly added memory. However, memory must > be activated by user. This process works in similar way like > balloon deflate. > > Instead of ACPI another notification mechanism could be used but memory > hotplug mechanism itself is quite generic and could be used everywhere. > We use that features in Xen implementation. Hence, we could use the > same in VIRTIO balloon driver. We just need device which gives pages > to VM without any limitations (i.e. accordingly to limits established > by admin and not limited by amount of memory assigned at boot). This > way guest memory could be extended without stopping. 
If we just use > balloon it would not work in that way. > > > Ballooning is simpler, but has shown to be useful. Both can coexist. > > Right. We did that in Xen balloon driver. It supports ballooning but > memory hotplug is used when more memory is requested than it was > assigned to VM at boot. Later all stuff works as usual. Even balloon > driver works in the same way on hotplugged memory like on memory > allocated at boot. So we just need a device which gives pages > to VM without any limitations. > > Of course we could consider another device for memory hotplug but > I do not think it makes sens. > > Daniel Rusty, any comments on this? Daniel


  • 20.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 12-12-2013 02:11
    Daniel Kiper <daniel.kiper@oracle.com> writes: > On Tue, Oct 29, 2013 at 11:15:59PM +1030, Rusty Russell wrote: >> Daniel Kiper <daniel.kiper@oracle.com> writes: >> > On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: >> >> "Michael S. Tsirkin" <mst@redhat.com> writes: >> >> > OK there are some more issues to consider around deflate. >> >> > >> >> > >> >> > 1. On KVM, we actually want to change QEMU so that pagefaults don't >> >> > work either. Specifically, we want to skip pages in the balloon for >> >> > migration. >> >> > However, migration is done in userspace while pagefaults >> >> > are done in kernel. >> >> > I think the implication is that >> >> > - you should be able to ask guest to inflate balloon >> >> > with pages that can be paged in (when you don't want to migrate >> >> > and want max local performance) or with pages that can not be paged in >> >> > (when you want to migrate faster), dynamically, not through a >> >> > device feature >> >> > - "will notify before use" feature should be per a bunch or pages actually. >> >> >> >> I am always reluctant to implement a spec for things which don't exist. >> >> This is the cause of the current "negative feature" mess with >> >> VIRTIO_BALLOON_F_MUST_TELL_HOST. >> >> >> >> So if we *ever* want to ask for pages, let's make the the driver always >> >> ask for pages. You place a buffer in the queue and the device fills it >> >> with page addresses you can now use. >> > >> > You mean PFNs? >> >> PFN << PAGE_BITS. Since we're dealing with different size pages, using >> exact addresses is clearer, I think. > > OK, it makes sens. However, I have done quick review of existing Xen balloon driver > and PFNs or exact addresses probably will not work on Xen. Even on HVM. It must > be confirmed. I will do that after releasing first draft. If this is true, it means Xen cannot offer this device. We can't have Xen offer a device which a generic virtio-spec compliant OS can't drive. 
We can, however, have a differently-numbered "Xen Balloon Device" which takes xen-specific PFNs instead, and is otherwise identical. This is perfectly sane, and even the Linux driver (which will support both) will look clean. >> > What happen if driver request more pages than are in balloon? >> > Are we going to support such cases? I am asking in context >> > of memory hotplug support. >> >> I don't think so. The device won't fill the entire array in that case >> (remember, virtio gets a "used" field returned, which says how many >> bytes were written by the device). >> >> Memory hotplug is properly the realm of platform specific methods >> (eg. ACPI), so I think it's outside the virtio spec. > > In general ACPI is used as a notifier which says that new memory was > installed in system. If memory hotplug is built in system it creates > all structures needed to use newly added memory. However, memory must > be activated by user. This process works in similar way like > balloon deflate. > > Instead of ACPI another notification mechanism could be used but memory > hotplug mechanism itself is quite generic and could be used everywhere. > We use that features in Xen implementation. Hence, we could use the > same in VIRTIO balloon driver. We just need device which gives pages > to VM without any limitations (i.e. accordingly to limits established > by admin and not limited by amount of memory assigned at boot). This > way guest memory could be extended without stopping. If we just use > balloon it would not work in that way. It would have to be a feature bit: VIRTIO_BALLOON_F_UNLIMITED ? I think it would create an extra burden for non-Xen balloon devices: - The PFN extracted from the balloon would either be a PFN which had been put into the balloon, or the lowest available PFN (ie. just above current top of mem) Otherwise a guest might get a crazily sparse memory map. Thoughts? Rusty.
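The constraint Rusty proposes can be sketched device-side: hand back a previously ballooned PFN when one exists, and only otherwise the lowest PFN just above the current top of memory, so the guest's memory map stays dense. This is a sketch of the proposed policy, not anything in the draft spec:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the policy floated for a hypothetical
 * VIRTIO_BALLOON_F_UNLIMITED: prefer PFNs already in the balloon, then
 * extend memory contiguously from the current top of memory. */
static uint64_t pick_deflate_pfn(uint64_t *ballooned, size_t *nballooned,
                                 uint64_t *top_of_mem_pfn)
{
    if (*nballooned > 0)
        return ballooned[--*nballooned];  /* reuse a ballooned page */
    return (*top_of_mem_pfn)++;          /* lowest PFN above top of mem */
}
```

Daniel's follow-up points out the weakness of the second branch: it breaks if an I/O region sits above the current top of memory.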


  • 21.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 12-12-2013 20:31
    On Thu, Dec 12, 2013 at 11:40:11AM +1030, Rusty Russell wrote: > Daniel Kiper <daniel.kiper@oracle.com> writes: > > On Tue, Oct 29, 2013 at 11:15:59PM +1030, Rusty Russell wrote: > >> Daniel Kiper <daniel.kiper@oracle.com> writes: > >> > On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: > >> >> "Michael S. Tsirkin" <mst@redhat.com> writes: > >> >> > OK there are some more issues to consider around deflate. > >> >> > > >> >> > > >> >> > 1. On KVM, we actually want to change QEMU so that pagefaults don't > >> >> > work either. Specifically, we want to skip pages in the balloon for > >> >> > migration. > >> >> > However, migration is done in userspace while pagefaults > >> >> > are done in kernel. > >> >> > I think the implication is that > >> >> > - you should be able to ask guest to inflate balloon > >> >> > with pages that can be paged in (when you don't want to migrate > >> >> > and want max local performance) or with pages that can not be paged in > >> >> > (when you want to migrate faster), dynamically, not through a > >> >> > device feature > >> >> > - "will notify before use" feature should be per a bunch or pages actually. > >> >> > >> >> I am always reluctant to implement a spec for things which don't exist. > >> >> This is the cause of the current "negative feature" mess with > >> >> VIRTIO_BALLOON_F_MUST_TELL_HOST. > >> >> > >> >> So if we *ever* want to ask for pages, let's make the the driver always > >> >> ask for pages. You place a buffer in the queue and the device fills it > >> >> with page addresses you can now use. > >> > > >> > You mean PFNs? > >> > >> PFN << PAGE_BITS. Since we're dealing with different size pages, using > >> exact addresses is clearer, I think. > > > > OK, it makes sens. However, I have done quick review of existing Xen balloon driver > > and PFNs or exact addresses probably will not work on Xen. Even on HVM. It must > > be confirmed. I will do that after releasing first draft. 
> > If this is true, it means Xen cannot offer this device. We can't have > Xen offer a device which a generic virtio-spec compliant OS can't drive. > > We can, however, have a differently-numbered "Xen Balloon Device" which > takes xen-specific PFNs instead, and is otherwise identical. This is > perfectly sane, and even the Linux driver (which will support both) will > look clean. This make sens but I think that it is last resort solution. IMO all devices/drivers should work on any hypervisor. I will play with new VIRTIO balloon device/driver and we will see what is possible. > >> > What happen if driver request more pages than are in balloon? > >> > Are we going to support such cases? I am asking in context > >> > of memory hotplug support. > >> > >> I don't think so. The device won't fill the entire array in that case > >> (remember, virtio gets a "used" field returned, which says how many > >> bytes were written by the device). > >> > >> Memory hotplug is properly the realm of platform specific methods > >> (eg. ACPI), so I think it's outside the virtio spec. > > > > In general ACPI is used as a notifier which says that new memory was > > installed in system. If memory hotplug is built in system it creates > > all structures needed to use newly added memory. However, memory must > > be activated by user. This process works in similar way like > > balloon deflate. > > > > Instead of ACPI another notification mechanism could be used but memory > > hotplug mechanism itself is quite generic and could be used everywhere. > > We use that features in Xen implementation. Hence, we could use the > > same in VIRTIO balloon driver. We just need device which gives pages > > to VM without any limitations (i.e. accordingly to limits established > > by admin and not limited by amount of memory assigned at boot). This > > way guest memory could be extended without stopping. If we just use > > balloon it would not work in that way. 
> > It would have to be a feature bit: VIRTIO_BALLOON_F_UNLIMITED ? OK. > I think it would create an extra burden for non-Xen balloon devices: > - The PFN extracted from the balloon would either be a PFN which had > been put into the balloon, or the lowest available PFN (ie. just > above current top of mem) > > Otherwise a guest might get a crazily sparse memory map. > > Thoughts? I used this simplistic solution in the Xen balloon driver and it does not work well if there is an I/O region above max_pfn, so we should not make this mistake here. I think we could use a solution similar to the one used on real hardware with ACPI: in the memory hotplug case, the VIRTIO balloon device should send the start address of the hotplugged region and its size. After that, everything should work as in a pure balloon driver. Daniel
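Daniel's ACPI-style suggestion could look roughly like this: the device announces a hot-added region by start address and size, and the driver does not assume the region sits directly above max_pfn (the I/O-hole problem he mentions). Everything here is hypothetical; none of these names appear in the draft:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hotplug announcement: guest-physical start and size of
 * the newly added region, as Daniel suggests. */
struct virtio_balloon_hotplug {
    uint64_t region_start;
    uint64_t region_size;
};

/* Because the region is given explicitly rather than assumed to start
 * at max_pfn, the host can place it clear of I/O holes; a driver-side
 * sanity check against a known hole: */
static int region_overlaps(const struct virtio_balloon_hotplug *r,
                           uint64_t hole_start, uint64_t hole_end)
{
    return r->region_start < hole_end &&
           hole_start < r->region_start + r->region_size;
}
```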


  • 22.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 12-17-2013 04:55
    Daniel Kiper <daniel.kiper@oracle.com> writes: > On Thu, Dec 12, 2013 at 11:40:11AM +1030, Rusty Russell wrote: >> If this is true, it means Xen cannot offer this device. We can't have >> Xen offer a device which a generic virtio-spec compliant OS can't drive. >> >> We can, however, have a differently-numbered "Xen Balloon Device" which >> takes xen-specific PFNs instead, and is otherwise identical. This is >> perfectly sane, and even the Linux driver (which will support both) will >> look clean. > > This make sens but I think that it is last resort solution. IMO all > devices/drivers should work on any hypervisor. I will play with new > VIRTIO balloon device/driver and we will see what is possible. Agreed. >> It would have to be a feature bit: VIRTIO_BALLOON_F_UNLIMITED ? > > OK. > >> I think it would create an extra burden for non-Xen balloon devices: >> - The PFN extracted from the balloon would either be a PFN which had >> been put into the balloon, or the lowest available PFN (ie. just >> above current top of mem) >> >> Otherwise a guest might get a crazily sparse memory map. >> >> Thoughts? > > I used this simplistic solution in Xen balloon driver and it does not > work well if there is an I/O region above max_pfn. So we should not > make this mistake here. I think that we could use solution similar > to used on real hardware with ACPI. In memory hotplug case VIRTIO > balloon device should send start address of hotplugged region and its > size. Later everything should work as in pure balloon driver. That makes perfect sense (though the name "UNLIMITED" now sucks... Hmm, VIRTIO_BALLOON_F_EXTRAMEM maybe?). So if that feature is negotiated, then two new RO config fields become available, which specify the range where the extra memory will appear? Cheers, Rusty.
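Rusty's closing question could translate into config space roughly as follows; the feature name comes from his message, but the field names and layout are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Two hypothetical read-only config fields, visible only once a
 * VIRTIO_BALLOON_F_EXTRAMEM feature bit is negotiated: they specify the
 * range where the extra (hot-added) memory will appear. */
struct virtio_balloon_extramem_config {
    uint64_t extra_start;   /* guest-physical base of the extra region */
    uint64_t extra_size;    /* maximum size of the extra region */
};

/* Driver-side check that an address handed out under the feature falls
 * inside the advertised range. */
static int in_extra_region(const struct virtio_balloon_extramem_config *c,
                           uint64_t addr)
{
    return addr >= c->extra_start && addr - c->extra_start < c->extra_size;
}
```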


  • 23.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 22:07
    On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > OK there are some more issues to consider around deflate. > > > > > > 1. On KVM, we actually want to change QEMU so that pagefaults don't > > work either. Specifically, we want to skip pages in the balloon for > > migration. > > However, migration is done in userspace while pagefaults > > are done in kernel. > > I think the implication is that > > - you should be able to ask guest to inflate balloon > > with pages that can be paged in (when you don't want to migrate > > and want max local performance) or with pages that can not be paged in > > (when you want to migrate faster), dynamically, not through a > > device feature > > - "will notify before use" feature should be per a bunch or pages actually. > > I am always reluctant to implement a spec for things which don't exist. > This is the cause of the current "negative feature" mess with > VIRTIO_BALLOON_F_MUST_TELL_HOST. Well it's no longer negative now that we made it optional. The name might need to be changed to VIRTIO_BALLOON_F_GUEST_TELLS_HOST. > So if we *ever* want to ask for pages, let's make the the driver always > ask for pages. You place a buffer in the queue and the device fills it > with page addresses you can now use. > > See below. Hmm but then, this does not look so different from the current balloon anymore ... > > 2. Assuming we need the deflate command, > > I think it's a problem that we need to allocate memory for deflate. > > Can't we just stick page pointers in VQ directly? > > That's an internal implementation issue; we can avoid the allocation > fairly trivially using a static buffer. > > > If we address these, I think Xen PV becomes simple, you just > > never ask for pageable-in pages, and you ignore them if guest > > gives them to you. > > Simple, yes. 
> > But even with this design, there's no way for the Xen PV model to > implement this without a xen-specific hook in the driver. > > At the very least, "get_a_page_for_the_balloon()" and > "return_a_page_from_the_balloon()" will need to do Xen-specific stuff. > We *could* argue that's outside the scope of the spec, but we should > note it. > > Cheers, > Rusty. > > This revises the previous proposal. It renames the queues for > more clarity, and insists that you request pages. > > diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt > index 426a3cb..173130e 100644 > --- a/virtio-v1.0-wd01-part1-specification.txt > +++ b/virtio-v1.0-wd01-part1-specification.txt > @@ -3165,7 +3165,7 @@ to communicate guest memory statistics to the host. > > 100.2.4.5.2. Virtqueues > ------------------ > - 0:inputq. 1:outputq. > + 0:fromdevq. 1:todevq. > > > 100.2.4.5.3. Feature bits > @@ -3187,14 +3187,14 @@ page size of the host (eg. 12 for 4096-byte pages). > ----------------------------- > > 1. At least one struct virtio_balloon_request buffer should be placed > - in the inputq. > + in the fromdevq. > > 2. The balloon starts empty (size 0). > > 100.2.4.5.6. Device Operation > ------------------------ > > -The device is driven by receipt of a command in the input queue: > +The device is driven by receipt of a command in the fromdevq queue: > > struct virtio_balloon_req { > #define VIRTIO_BALLOON_REQ_RESIZE 0 > @@ -3211,7 +3211,7 @@ The device is driven by receipt of a command in the input queue: > target, the guest may withdraw pages. > > 2. To add pages to the balloon, the physical addresses of the pages > - are sent using the output queue. The number of pages is implied in > + are sent using the todevq queue. The number of pages is implied in > the message length, and each page value must be a multiple of the > page size indicated in struct virtio_balloon_config. 
> > @@ -3221,15 +3221,24 @@ The device is driven by receipt of a command in the input queue: > u64 page[]; > }; > > -3. To withdraw a page from the balloon, it can simply be accessed. > - The contents at this point will be undefined. The device should > +3. To withdraw pages from the balloon, the same structure should be > + placed in the todevq queue, with the page array writable: > + > + struct virtio_balloon_pages { > +#define VIRTIO_BALLOON_REQ_PAGES 2 > + u32 type; // VIRTIO_BALLOON_REQ_PAGES > + u64 page[]; > + }; > + > + The device may not fill the entire page array. The contents > + of the pages received will be undefined. On a Linux host they are actually zeroed out. Maybe document this? > The device should > keep count of how many pages remain in the balloon so it can > correctly respond to future resize requests. > > 4. A VIRTIO_BALLOON_REQ_STATS command indicates that the driver > should report what stats are available. > > -5. To report stats, the following message is sent to the output queue. > +5. To report stats, the following message is sent to the todevq queue. > Indeterminable stats must not be reported. > > struct virtio_balloon_stats {
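One rule in the quoted draft is easy to check mechanically: each page value sent on the queue must be a multiple of the page size given by page_bits in config space (e.g. 12 for 4096-byte pages). A sketch of that validity check (the helper name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* True iff addr is aligned to the page size implied by page_bits. */
static int page_addr_valid(uint64_t addr, unsigned page_bits)
{
    return (addr & ((UINT64_C(1) << page_bits) - 1)) == 0;
}
```

The same check works unchanged for hosts with larger pages, which is the point of carrying page_bits in config space instead of hard-coding 4096.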


  • 24.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-30-2013 04:50
"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Tue, Oct 29, 2013 at 10:44:53AM +1030, Rusty Russell wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> writes:
>> > OK there are some more issues to consider around deflate.
>> >
>> > 1. On KVM, we actually want to change QEMU so that pagefaults don't
>> > work either.  Specifically, we want to skip pages in the balloon for
>> > migration.  However, migration is done in userspace while pagefaults
>> > are done in kernel.
>> > I think the implication is that
>> > - you should be able to ask guest to inflate balloon
>> >   with pages that can be paged in (when you don't want to migrate
>> >   and want max local performance) or with pages that can not be paged in
>> >   (when you want to migrate faster), dynamically, not through a
>> >   device feature
>> > - "will notify before use" feature should be per a bunch of pages actually.
>>
>> I am always reluctant to implement a spec for things which don't exist.
>> This is the cause of the current "negative feature" mess with
>> VIRTIO_BALLOON_F_MUST_TELL_HOST.
>
> Well it's no longer negative now that we made it optional.
> The name might need to be changed to VIRTIO_BALLOON_F_GUEST_TELLS_HOST.

It's still a problem: if a host *needs* this feature, and a guest
doesn't accept it, we're in trouble.

>> So if we *ever* want to ask for pages, let's make the driver always
>> ask for pages.  You place a buffer in the queue and the device fills it
>> with page addresses you can now use.
>>
>> See below.
>
> Hmm but then, this does not look so different from the current
> balloon anymore ...

I don't necessarily want different.  I want to fix the various virtio
balloon bugs, eg:

  VIRTIO-37 allow >4G 4K pages in balloon
  VIRTIO-36 balloon: align val in virtio_balloon_stat
  VIRTIO-25 balloon: remove VIRTIO_BALLOON_F_MUST_TELL_HOST and deflate queue.
  VIRTIO-24 balloon: don't define endianness separately for config space.

I want the balloon to return to the simple device it was meant to be.

>> +3. To withdraw pages from the balloon, the same structure should be
>> +   placed in the todevq queue, with the page array writable:
>> +
>> +   struct virtio_balloon_pages {
>> +#define VIRTIO_BALLOON_REQ_PAGES 2
>> +      u32 type; // VIRTIO_BALLOON_REQ_PAGES
>> +      u64 page[];
>> +   };
>> +
>> +   The device may not fill the entire page array.  The contents
>> +   of the pages received will be undefined.
>
> On a Linux host they are actually zeroed out.
> Maybe document this?

Absolutely not, I don't want anyone relying on it!  This can certainly
change in a future implementation.  In fact, I'm not even sure it's
true on a hugepage-backed guest.

Cheers,
Rusty.
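The point that the device may fill only part of the writable page array has a concrete driver-side consequence: only the bytes the device reports as used contain valid addresses. A minimal sketch of that bookkeeping, assuming `used_len` is the byte count the device returned through the used ring (the helper name is made up here, not from the draft):

```c
#include <stddef.h>
#include <stdint.h>

/* How many page addresses did the device actually hand back?
 * The writable buffer starts with a 4-byte type field, followed
 * by 8-byte page addresses; anything past used_len is untouched. */
size_t pages_returned(size_t used_len)
{
    if (used_len < sizeof(uint32_t))
        return 0;  /* device wrote nothing useful */
    return (used_len - sizeof(uint32_t)) / sizeof(uint64_t);
}
```

Per the thread, the contents of those pages are undefined, so a driver must re-initialize them rather than rely on the Linux host's current zeroing behaviour.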


  • 25.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 01:58
Rusty Russell <rusty@au1.ibm.com> writes:
> But it makes sense to put the page size in each request.  A bit more
> painful to implement, since the driver can't know in advance that it
> doesn't support a request.

Actually, I changed my mind.

It isn't useful to ask for different size pages.  QEMU currently does
all-or-nothing.  Even if it did a mix of both, it would need to tell
the guest which areas were which (I really want one of *these* huge
pages).

That's too much complexity for a feature which doesn't exist.  So I
prefer the current single-page-size model.

Cheers,
Rusty.
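The single-page-size model settled on here keeps the driver trivial: the draft config space carries the host page size as a power of two ("eg. 12 for 4096-byte pages" in the quoted patch), so recovering the byte size is one shift. The helper name below is illustrative, not from the spec:

```c
#include <stdint.h>

/* Convert the config space's log2 page size to bytes,
 * e.g. a shift of 12 means 4096-byte pages.
 * (Illustrative helper, not in the draft.) */
uint64_t balloon_page_bytes(uint32_t page_shift)
{
    return UINT64_C(1) << page_shift;
}
```

Every address placed in a `virtio_balloon_pages` message must then be a multiple of this value.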


  • 26.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 11:47
On Tue, Oct 29, 2013 at 11:31:58AM +1030, Rusty Russell wrote:
> Rusty Russell <rusty@au1.ibm.com> writes:
> > But it makes sense to put the page size in each request.  A bit more
> > painful to implement, since the driver can't know in advance that it
> > doesn't support a request.
>
> Actually, I changed my mind.
>
> It isn't useful to ask for different size pages.  QEMU currently does
> all-or-nothing.  Even if it did a mix of both, it would need to tell the
> guest which areas were which (I really want one of *these* huge pages).
>
> That's too much complexity for a feature which doesn't exist.  So I
> prefer the current single-page-size model.

If we assume that we could add this feature in the future then it is OK
for me.

Daniel


  • 27.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 10-29-2013 12:53
Daniel Kiper <daniel.kiper@oracle.com> writes:
> On Tue, Oct 29, 2013 at 11:31:58AM +1030, Rusty Russell wrote:
>> Rusty Russell <rusty@au1.ibm.com> writes:
>> > But it makes sense to put the page size in each request.  A bit more
>> > painful to implement, since the driver can't know in advance that it
>> > doesn't support a request.
>>
>> Actually, I changed my mind.
>>
>> It isn't useful to ask for different size pages.  QEMU currently does
>> all-or-nothing.  Even if it did a mix of both, it would need to tell the
>> guest which areas were which (I really want one of *these* huge pages).
>>
>> That's too much complexity for a feature which doesn't exist.  So I
>> prefer the current single-page-size model.
>
> If we assume that we could add this feature in the future
> then it is OK for me.

Yes, I don't quite know what the feature would look like, so we should
add it when we do.  It may involve specifying pfn ranges, as well as
sizes which the balloon wants.

Cheers,
Rusty.


  • 28.  Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

    Posted 02-21-2014 04:24
OASIS Issues Tracker <workgroup_mailer@lists.oasis-open.org> writes:
> Implement new balloon device (ID 13)
> ------------------------------------
>
>             Key: VIRTIO-28
>             URL: http://tools.oasis-open.org/issues/browse/VIRTIO-28
>         Project: OASIS Virtual I/O Device (VIRTIO) TC
>      Issue Type: New Feature
>        Reporter: Rusty Russell

For the moment, this just deprecates the existing one and reserves a new
ID for the updated one: