OASIS Virtual I/O Device (VIRTIO) TC

New virtio balloon...

  • 1.  New virtio balloon...

    Posted 01-30-2014 09:07
    Hi,

    I tried to write a new balloon driver; it's completely untested
    (as I need to write the device). The protocol is basically two vqs, one
    for the guest to send commands, one for the host to send commands.

    Some interesting things come out:
    1) We do need to explicitly tell the host where the page is we want.
       This is required for compaction, for example.

    2) We need to be able to exceed the balloon target, especially for page
       migration. Thus there's no mechanism for the device to refuse to
       give us the pages.

    3) The device can offer multiple page sizes, but the driver can only
       accept one. I'm not sure if this is useful, as guests are either
       huge page backed or not, and returning sub-pages isn't useful.

    Linux demo code follows.

    Cheers,
    Rusty.

    diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
    index 9076635697bb..1dd45691b618 100644
    --- a/drivers/virtio/Makefile
    +++ b/drivers/virtio/Makefile
    @@ -1,4 +1,4 @@
     obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
     obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
     obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
    -obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
    +obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o virtio_balloon2.o
    diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    new file mode 100644
    index 000000000000..93f13e7c561d
    --- /dev/null
    +++ b/drivers/virtio/virtio_balloon2.c
    @@ -0,0 +1,566 @@
    +/*
    + * Virtio balloon implementation, inspired by Dor Laor and Marcelo
    + * Tosatti's implementations.
    + *
    + * Copyright 2008, 2014 Rusty Russell IBM Corporation
    + *
    + * This program is free software; you can redistribute it and/or modify
    + * it under the terms of the GNU General Public License as published by
    + * the Free Software Foundation; either version 2 of the License, or
    + * (at your option) any later version.
    + *
    + * This program is distributed in the hope that it will be useful,
    + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    + * GNU General Public License for more details.
    + *
    + * You should have received a copy of the GNU General Public License
    + * along with this program; if not, write to the Free Software
    + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
    + */
    +
    +#include <linux/virtio.h>
    +#include <linux/virtio_balloon.h>
    +#include <linux/swap.h>
    +#include <linux/kthread.h>
    +#include <linux/freezer.h>
    +#include <linux/delay.h>
    +#include <linux/slab.h>
    +#include <linux/module.h>
    +#include <linux/balloon_compaction.h>
    +
    +struct gcmd_get_pages {
    +	__le64 type; /* VIRTIO_BALLOON_GCMD_GET_PAGES */
    +	__le64 pages[256];
    +};
    +
    +struct gcmd_give_pages {
    +	__le64 type; /* VIRTIO_BALLOON_GCMD_GIVE_PAGES */
    +	__le64 pages[256];
    +};
    +
    +struct gcmd_need_mem {
    +	__le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    +};
    +
    +struct gcmd_stats_reply {
    +	__le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    +	struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    +};
    +
    +struct hcmd_set_balloon {
    +	__le64 type; /* VIRTIO_BALLOON_HCMD_SET_BALLOON */
    +	__le64 target;
    +};
    +
    +struct hcmd_get_stats {
    +	__le64 type; /* VIRTIO_BALLOON_HCMD_GET_STATS */
    +};
    +
    +struct virtio_balloon {
    +	/* Protects contents of entire structure. */
    +	struct mutex lock;
    +
    +	struct virtio_device *vdev;
    +	struct virtqueue *gcmd_vq, *hcmd_vq;
    +
    +	/* The thread servicing the balloon. */
    +	struct task_struct *thread;
    +
    +	/* For interrupt/suspend to wake balloon thread. */
    +	wait_queue_head_t wait;
    +
    +	/* How many pages are we supposed to have in balloon? */
    +	s64 target;
    +
    +	/* How many do we have in the balloon? */
    +	u64 num_pages;
    +
    +	/* This reminds me of Eeyore. */
    +	bool broken;
    +
    +	/*
    +	 * The pages we've told the Host we're not using are enqueued
    +	 * at vb_dev_info->pages list.
    +	 */
    +	struct balloon_dev_info *vb_dev_info;
    +
    +	/* To avoid kmalloc, we use single hcmd and gcmd buffers. */
    +	union gcmd {
    +		__le64 type;
    +		struct gcmd_get_pages get_pages;
    +		struct gcmd_give_pages give_pages;
    +		struct gcmd_need_mem need_mem;
    +		struct gcmd_stats_reply stats_reply;
    +	} gcmd;
    +
    +	union hcmd {
    +		__le64 type;
    +		struct hcmd_set_balloon set_balloon;
    +		struct hcmd_get_stats get_stats;
    +	} hcmd;
    +};
    +
    +static struct virtio_device_id id_table[] = {
    +	{ VIRTIO_ID_MEMBALLOON, VIRTIO_DEV_ANY_ID },
    +	{ 0 },
    +};
    +
    +static void wake_balloon(struct virtqueue *vq)
    +{
    +	struct virtio_balloon *vb = vq->vdev->priv;
    +
    +	wake_up(&vb->wait);
    +}
    +
    +/* Command is in vb->gcmd, lock is held. */
    +static bool send_gcmd(struct virtio_balloon *vb, size_t len)
    +{
    +	struct scatterlist sg;
    +
    +	BUG_ON(len > sizeof(vb->gcmd));
    +	sg_init_one(&sg, &vb->gcmd, len);
    +
    +	/*
    +	 * We should always be able to add one buffer to an empty queue.
    +	 * If not, it's a broken device.
    +	 */
    +	if (virtqueue_add_outbuf(vb->gcmd_vq, &sg, 1, vb, GFP_KERNEL) != 0
    +	    || virtqueue_kick(vb->gcmd_vq) != 0) {
    +		vb->broken = true;
    +		return false;
    +	}
    +
    +	/* When host has read buffer, this completes via wake_balloon */
    +	wait_event(vb->wait,
    +		   virtqueue_get_buf(vb->gcmd_vq, &len)
    +		   || (vb->broken = virtqueue_is_broken(vb->gcmd_vq)));
    +	return !vb->broken;
    +}
    +
    +static void give_to_balloon(struct virtio_balloon *vb, u64 num)
    +{
    +	struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    +	u64 i;
    +
    +	/* We can only do one array worth at a time. */
    +	num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.give_pages.pages));
    +
    +	vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    +
    +	for (i = 0; i < num; i++) {
    +		struct page *page = balloon_page_enqueue(vb_dev_info);
    +
    +		if (!page) {
    +			dev_info_ratelimited(&vb->vdev->dev,
    +					     "Out of puff! Can't get page\n");
    +			/* Sleep for at least 1/5 of a second before retry. */
    +			msleep(200);
    +			break;
    +		}
    +
    +		vb->gcmd.give_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    +		vb->num_pages++;
    +		adjust_managed_page_count(page, -1);
    +	}
    +
    +	/* Did we get any? */
    +	if (i)
    +		send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[i]));
    +}
    +
    +static void take_from_balloon(struct virtio_balloon *vb, u64 num)
    +{
    +	struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    +	size_t i;
    +
    +	/* We can only do one array worth at a time. */
    +	num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.get_pages.pages));
    +
    +	vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    +
    +	for (i = 0; i < num; i++) {
    +		struct page *page = balloon_page_dequeue(vb_dev_info);
    +
    +		/* In case we ran out of pages (compaction) */
    +		if (!page)
    +			break;
    +
    +		vb->gcmd.get_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    +		vb->num_pages--;
    +	}
    +	num = i;
    +	if (num)
    +		send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[num]));
    +
    +	/* Now release those pages. */
    +	for (i = 0; i < num; i++) {
    +		struct page *page;
    +
    +		page = pfn_to_page(vb->gcmd.get_pages.pages[i] >> PAGE_SHIFT);
    +		balloon_page_free(page);
    +		adjust_managed_page_count(page, 1);
    +	}
    +	mutex_unlock(&vb->lock);
    +}
    +
    +static inline void set_stat(struct gcmd_stats_reply *stats, int idx,
    +			    u64 tag, u64 val)
    +{
    +	BUG_ON(idx >= ARRAY_SIZE(stats->stats));
    +	stats->stats[idx].tag = cpu_to_le64(tag);
    +	stats->stats[idx].val = cpu_to_le64(val);
    +}
    +
    +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
    +
    +static void get_stats(struct gcmd_stats_reply *stats)
    +{
    +	unsigned long events[NR_VM_EVENT_ITEMS];
    +	struct sysinfo i;
    +	int idx = 0;
    +
    +	all_vm_events(events);
    +	si_meminfo(&i);
    +
    +	stats->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_STATS_REPLY);
    +	set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_IN,
    +		 pages_to_bytes(events[PSWPIN]));
    +	set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
    +		 pages_to_bytes(events[PSWPOUT]));
    +	set_stat(stats, idx++, VIRTIO_BALLOON_S_MAJFLT,
    +		 events[PGMAJFAULT]);
    +	set_stat(stats, idx++, VIRTIO_BALLOON_S_MINFLT,
    +		 events[PGFAULT]);
    +	set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMFREE,
    +		 pages_to_bytes(i.freeram));
    +	set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMTOT,
    +		 pages_to_bytes(i.totalram));
    +}
    +
    +static bool move_towards_target(struct virtio_balloon *vb)
    +{
    +	bool moved = false;
    +
    +	if (vb->broken)
    +		return false;
    +
    +	mutex_lock(&vb->lock);
    +	if (vb->num_pages < vb->target) {
    +		give_to_balloon(vb, vb->target - vb->num_pages);
    +		moved = true;
    +	} else if (vb->num_pages > vb->target) {
    +		take_from_balloon(vb, vb->num_pages - vb->target);
    +		moved = true;
    +	}
    +	mutex_unlock(&vb->lock);
    +	return moved;
    +}
    +
    +static bool process_hcmd(struct virtio_balloon *vb)
    +{
    +	union hcmd *hcmd = NULL;
    +	unsigned int cmdlen;
    +	struct scatterlist sg;
    +
    +	if (vb->broken)
    +		return false;
    +
    +	mutex_lock(&vb->lock);
    +	hcmd = virtqueue_get_buf(vb->hcmd_vq, &cmdlen);
    +	if (!hcmd) {
    +		mutex_unlock(&vb->lock);
    +		return false;
    +	}
    +
    +	switch (hcmd->type) {
    +	case cpu_to_le64(VIRTIO_BALLOON_HCMD_SET_BALLOON):
    +		vb->target = le64_to_cpu(hcmd->set_balloon.target);
    +		break;
    +	case cpu_to_le64(VIRTIO_BALLOON_HCMD_GET_STATS):
    +		get_stats(&vb->gcmd.stats_reply);
    +		send_gcmd(vb, sizeof(vb->gcmd.stats_reply));
    +		break;
    +	default:
    +		dev_err_ratelimited(&vb->vdev->dev, "Unknown hcmd %llu\n",
    +				    le64_to_cpu(hcmd->type));
    +		break;
    +	}
    +
    +	/* Re-queue the hcmd for next time. */
    +	sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd));
    +	virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL);
    +
    +	mutex_unlock(&vb->lock);
    +	return true;
    +}
    +
    +static int balloon(void *_vballoon)
    +{
    +	struct virtio_balloon *vb = _vballoon;
    +
    +	set_freezable();
    +	while (!kthread_should_stop()) {
    +		try_to_freeze();
    +
    +		wait_event_interruptible(vb->wait,
    +					 kthread_should_stop()
    +					 || freezing(current)
    +					 || process_hcmd(vb)
    +					 || move_towards_target(vb));
    +	}
    +	return 0;
    +}
    +
    +static int init_vqs(struct virtio_balloon *vb)
    +{
    +	struct virtqueue *vqs[2];
    +	vq_callback_t *callbacks[] = { wake_balloon, wake_balloon };
    +	const char *names[] = { "gcmd", "hcmd" };
    +	struct scatterlist sg;
    +	int err;
    +
    +	err = vb->vdev->config->find_vqs(vb->vdev, 2, vqs, callbacks, names);
    +	if (err)
    +		return err;
    +
    +	vb->gcmd_vq = vqs[0];
    +	vb->hcmd_vq = vqs[1];
    +
    +	/*
    +	 * Prime this virtqueue with one buffer so the hypervisor can
    +	 * use it to signal us later (it can't be broken yet!).
    +	 */
    +	sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd));
    +	if (virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL) < 0)
    +		BUG();
    +	virtqueue_kick(vb->hcmd_vq);
    +
    +	return 0;
    +}
    +
    +static const struct address_space_operations virtio_balloon_aops;
    +#ifdef CONFIG_BALLOON_COMPACTION
    +/*
    + * virtballoon_migratepage - perform the balloon page migration on behalf of
    + *			     a compation thread. (called under page lock)
    + * @mapping: the page->mapping which will be assigned to the new migrated page.
    + * @newpage: page that will replace the isolated page after migration finishes.
    + * @page   : the isolated (old) page that is about to be migrated to newpage.
    + * @mode   : compaction mode -- not used for balloon page migration.
    + *
    + * After a ballooned page gets isolated by compaction procedures, this is the
    + * function that performs the page migration on behalf of a compaction thread
    + * The page migration for virtio balloon is done in a simple swap fashion which
    + * follows these two macro steps:
    + *  1) insert newpage into vb->pages list and update the host about it;
    + *  2) update the host about the old page removed from vb->pages list;
    + *
    + * This function preforms the balloon page migration task.
    + * Called through balloon_mapping->a_ops->migratepage
    + */
    +static int virtballoon_migratepage(struct address_space *mapping,
    +		struct page *newpage, struct page *page, enum migrate_mode mode)
    +{
    +	struct balloon_dev_info *vb_dev_info = balloon_page_device(page);
    +	struct virtio_balloon *vb;
    +	unsigned long flags;
    +	int err;
    +
    +	BUG_ON(!vb_dev_info);
    +
    +	vb = vb_dev_info->balloon_device;
    +
    +	/*
    +	 * In order to avoid lock contention while migrating pages concurrently
    +	 * to leak_balloon() or fill_balloon() we just give up the balloon_lock
    +	 * this turn, as it is easier to retry the page migration later.
    +	 * This also prevents fill_balloon() getting stuck into a mutex
    +	 * recursion in the case it ends up triggering memory compaction
    +	 * while it is attempting to inflate the ballon.
    +	 */
    +	if (!mutex_trylock(&vb->lock))
    +		return -EAGAIN;
    +
    +	/* Try to get the page out of the balloon. */
    +	vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    +	vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT;
    +	if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) {
    +		err = -EIO;
    +		goto unlock;
    +	}
    +
    +	/* Now put newpage into balloon. */
    +	vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    +	vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT;
    +	if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) {
    +		/* We leak a page here, but only happens if balloon broken. */
    +		err = -EIO;
    +		goto unlock;
    +	}
    +
    +	spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
    +	balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
    +	vb_dev_info->isolated_pages--;
    +	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
    +
    +	/*
    +	 * It's safe to delete page->lru here because this page is at
    +	 * an isolated migration list, and this step is expected to happen here
    +	 */
    +	balloon_page_delete(page);
    +	err = MIGRATEPAGE_BALLOON_SUCCESS;
    +
    +unlock:
    +	mutex_unlock(&vb->lock);
    +	return err;
    +}
    +
    +/* define the balloon_mapping->a_ops callback to allow balloon page migration */
    +static const struct address_space_operations virtio_balloon_aops = {
    +	.migratepage = virtballoon_migratepage,
    +};
    +#endif /* CONFIG_BALLOON_COMPACTION */
    +
    +static int virtballoon_probe(struct virtio_device *vdev)
    +{
    +	struct virtio_balloon *vb;
    +	struct address_space *vb_mapping;
    +	struct balloon_dev_info *vb_devinfo;
    +	__le64 v;
    +	int err;
    +
    +	virtio_cread(vdev, struct virtio_balloon_config_space, pagesizes, &v);
    +	/* FIXME: Support large pages. */
    +	if (!(le64_to_cpu(v) & PAGE_SIZE)) {
    +		dev_warn(&vdev->dev, "Unacceptable pagesize %llu\n",
    +			 (long long)le64_to_cpu(v));
    +		err = -EINVAL;
    +		goto out;
    +	}
    +	v = cpu_to_le64(PAGE_SIZE);
    +	virtio_cwrite(vdev, struct virtio_balloon_config_space, page_size, &v);
    +
    +	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
    +	if (!vb) {
    +		err = -ENOMEM;
    +		goto out;
    +	}
    +
    +	vb->target = 0;
    +	vb->num_pages = 0;
    +	mutex_init(&vb->lock);
    +	init_waitqueue_head(&vb->wait);
    +	vb->vdev = vdev;
    +
    +	vb_devinfo = balloon_devinfo_alloc(vb);
    +	if (IS_ERR(vb_devinfo)) {
    +		err = PTR_ERR(vb_devinfo);
    +		goto out_free_vb;
    +	}
    +
    +	vb_mapping = balloon_mapping_alloc(vb_devinfo,
    +					   (balloon_compaction_check()) ?
    +					   &virtio_balloon_aops : NULL);
    +	if (IS_ERR(vb_mapping)) {
    +		/*
    +		 * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP
    +		 * This means !CONFIG_BALLOON_COMPACTION, otherwise we get off.
    +		 */
    +		err = PTR_ERR(vb_mapping);
    +		if (err != -EOPNOTSUPP)
    +			goto out_free_vb_devinfo;
    +	}
    +
    +	vb->vb_dev_info = vb_devinfo;
    +
    +	err = init_vqs(vb);
    +	if (err)
    +		goto out_free_vb_mapping;
    +
    +	vb->thread = kthread_run(balloon, vb, "vballoon");
    +	if (IS_ERR(vb->thread)) {
    +		err = PTR_ERR(vb->thread);
    +		goto out_del_vqs;
    +	}
    +
    +	return 0;
    +
    +out_del_vqs:
    +	vdev->config->del_vqs(vdev);
    +out_free_vb_mapping:
    +	balloon_mapping_free(vb_mapping);
    +out_free_vb_devinfo:
    +	balloon_devinfo_free(vb_devinfo);
    +out_free_vb:
    +	kfree(vb);
    +out:
    +	return err;
    +}
    +
    +/* FIXME: Leave pages alone during suspend, rather than taking them
    + * all back! */
    +static void remove_common(struct virtio_balloon *vb)
    +{
    +	/* There might be pages left in the balloon: free them. */
    +	while (vb->num_pages)
    +		take_from_balloon(vb, vb->num_pages);
    +
    +	/* Now we reset the device so we can clean up the queues. */
    +	vb->vdev->config->reset(vb->vdev);
    +	vb->vdev->config->del_vqs(vb->vdev);
    +}
    +
    +static void virtballoon_remove(struct virtio_device *vdev)
    +{
    +	struct virtio_balloon *vb = vdev->priv;
    +
    +	kthread_stop(vb->thread);
    +	remove_common(vb);
    +	balloon_mapping_free(vb->vb_dev_info->mapping);
    +	balloon_devinfo_free(vb->vb_dev_info);
    +	kfree(vb);
    +}
    +
    +#ifdef CONFIG_PM_SLEEP
    +static int virtballoon_freeze(struct virtio_device *vdev)
    +{
    +	struct virtio_balloon *vb = vdev->priv;
    +
    +	/*
    +	 * The kthread is already frozen by the PM core before this
    +	 * function is called.
    +	 */
    +
    +	remove_common(vb);
    +	return 0;
    +}
    +
    +static int virtballoon_restore(struct virtio_device *vdev)
    +{
    +	return init_vqs(vdev->priv);
    +}
    +#endif
    +
    +static unsigned int features[] = {
    +	/* FIXME: Support VIRTIO_BALLOON_F_EXTRA_MEM! */
    +};
    +
    +static struct virtio_driver virtio_balloon_driver = {
    +	.feature_table = features,
    +	.feature_table_size = ARRAY_SIZE(features),
    +	.driver.name = KBUILD_MODNAME,
    +	.driver.owner = THIS_MODULE,
    +	.id_table = id_table,
    +	.probe = virtballoon_probe,
    +	.remove = virtballoon_remove,
    +#ifdef CONFIG_PM_SLEEP
    +	.freeze = virtballoon_freeze,
    +	.restore = virtballoon_restore,
    +#endif
    +};
    +
    +module_virtio_driver(virtio_balloon_driver);
    +MODULE_DEVICE_TABLE(virtio, id_table);
    +MODULE_DESCRIPTION("Virtio balloon driver");
    +MODULE_LICENSE("GPL");
    diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
    index 5e26f61b5df5..cdca2934668a 100644
    --- a/include/uapi/linux/virtio_balloon.h
    +++ b/include/uapi/linux/virtio_balloon.h
    @@ -28,32 +28,45 @@
     #include <linux/virtio_ids.h>
     #include <linux/virtio_config.h>
     
    -/* The feature bitmap for virtio balloon */
    -#define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
    -#define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
    -
    -/* Size of a PFN in the balloon interface. */
    -#define VIRTIO_BALLOON_PFN_SHIFT 12
    -
    -struct virtio_balloon_config
    -{
    -	/* Number of pages host wants Guest to give up. */
    -	__le32 num_pages;
    -	/* Number of pages we've actually got in balloon. */
    -	__le32 actual;
    +/* This means the balloon can go negative (ie. add memory to system) */
    +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    +
    +struct virtio_balloon_config_space {
    +	/* Set by device: bits indicate what page sizes supported. */
    +	__le64 pagesizes;
    +	/* Set by driver: only a single bit is set! */
    +	__le64 page_size;
    +
    +	/* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    +	__le64 extra_mem_start;
    +	__le64 extra_mem_end;
    +};
    +
    +struct virtio_balloon_statistic {
    +	__le64 tag; /* VIRTIO_BALLOON_S_* */
    +	__le64 val;
     };
     
    -#define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
    -#define VIRTIO_BALLOON_S_SWAP_OUT 1   /* Amount of memory swapped out */
    -#define VIRTIO_BALLOON_S_MAJFLT   2   /* Number of major faults */
    -#define VIRTIO_BALLOON_S_MINFLT   3   /* Number of minor faults */
    -#define VIRTIO_BALLOON_S_MEMFREE  4   /* Total amount of free memory */
    -#define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
    -#define VIRTIO_BALLOON_S_NR       6
    -
    -struct virtio_balloon_stat {
    -	__u16 tag;
    -	__u64 val;
    -} __attribute__((packed));
    +/* Guest->host command queue. */
    +/* Ask the host for more pages.
    +   Followed by array of 1 or more readable le64 pageaddr's. */
    +#define VIRTIO_BALLOON_GCMD_GET_PAGES	((__le64)0)
    +/* Give the host more pages.
    +   Followed by array of 1 or more readable le64 pageaddr's */
    +#define VIRTIO_BALLOON_GCMD_GIVE_PAGES	((__le64)1)
    +/* Dear host: I need more memory. */
    +#define VIRTIO_BALLOON_GCMD_NEEDMEM	((__le64)2)
    +/* Dear host: here are your stats.
    + * Followed by 0 or more struct virtio_balloon_statistic structs. */
    +#define VIRTIO_BALLOON_GCMD_STATS_REPLY	((__le64)3)
    +
    +/* Host->guest command queue. */
    +/* Followed by s64 of new balloon target size (only negative if
    + * VIRTIO_BALLOON_F_EXTRA_MEM). */
    +#define VIRTIO_BALLOON_HCMD_SET_BALLOON	((__le64)0x8000)
    +/* Ask for statistics */
    +#define VIRTIO_BALLOON_HCMD_GET_STATS	((__le64)0x8001)
    +
    +#include <linux/virtio_balloon_legacy.h>
     
     #endif /* _LINUX_VIRTIO_BALLOON_H */
    diff --git a/include/uapi/linux/virtio_balloon_legacy.h b/include/uapi/linux/virtio_balloon_legacy.h
    new file mode 100644
    index 000000000000..cbf77bc1aee3
    --- /dev/null
    +++ b/include/uapi/linux/virtio_balloon_legacy.h
    @@ -0,0 +1,59 @@
    +#ifndef _LINUX_VIRTIO_BALLOON_LEGACY_H
    +#define _LINUX_VIRTIO_BALLOON_LEGACY_H
    +/* This header is BSD licensed so anyone can use the definitions to implement
    + * compatible drivers/servers.
    + *
    + * Redistribution and use in source and binary forms, with or without
    + * modification, are permitted provided that the following conditions
    + * are met:
    + * 1. Redistributions of source code must retain the above copyright
    + *    notice, this list of conditions and the following disclaimer.
    + * 2. Redistributions in binary form must reproduce the above copyright
    + *    notice, this list of conditions and the following disclaimer in the
    + *    documentation and/or other materials provided with the distribution.
    + * 3. Neither the name of IBM nor the names of its contributors
    + *    may be used to endorse or promote products derived from this software
    + *    without specific prior written permission.
    + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
    + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
    + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    + * SUCH DAMAGE. */
    +#include <linux/virtio_ids.h>
    +#include <linux/virtio_config.h>
    +
    +/* The feature bitmap for virtio balloon */
    +#define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
    +#define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
    +
    +/* Size of a PFN in the balloon interface. */
    +#define VIRTIO_BALLOON_PFN_SHIFT 12
    +
    +struct virtio_balloon_config
    +{
    +	/* Number of pages host wants Guest to give up. */
    +	__le32 num_pages;
    +	/* Number of pages we've actually got in balloon. */
    +	__le32 actual;
    +};
    +
    +#define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
    +#define VIRTIO_BALLOON_S_SWAP_OUT 1   /* Amount of memory swapped out */
    +#define VIRTIO_BALLOON_S_MAJFLT   2   /* Number of major faults */
    +#define VIRTIO_BALLOON_S_MINFLT   3   /* Number of minor faults */
    +#define VIRTIO_BALLOON_S_MEMFREE  4   /* Total amount of free memory */
    +#define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of memory */
    +#define VIRTIO_BALLOON_S_NR       6
    +
    +struct virtio_balloon_stat {
    +	__u16 tag;
    +	__u64 val;
    +} __attribute__((packed));
    +
    +#endif /* _LINUX_VIRTIO_BALLOON_LEGACY_H */
    diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
    index 284fc3a05f7b..8b5ac0047190 100644
    --- a/include/uapi/linux/virtio_ids.h
    +++ b/include/uapi/linux/virtio_ids.h
    @@ -33,11 +33,12 @@
     #define VIRTIO_ID_BLOCK		2 /* virtio block */
     #define VIRTIO_ID_CONSOLE	3 /* virtio console */
     #define VIRTIO_ID_RNG		4 /* virtio rng */
    -#define VIRTIO_ID_BALLOON	5 /* virtio balloon */
    +#define VIRTIO_ID_BALLOON	5 /* virtio balloon (legacy) */
     #define VIRTIO_ID_RPMSG		7 /* virtio remote processor messaging */
     #define VIRTIO_ID_SCSI		8 /* virtio scsi */
     #define VIRTIO_ID_9P		9 /* 9p virtio console */
     #define VIRTIO_ID_RPROC_SERIAL	11 /* virtio remoteproc serial link */
     #define VIRTIO_ID_CAIF		12 /* Virtio caif */
    +#define VIRTIO_ID_MEMBALLOON	13 /* virtio balloon */
     
     #endif /* _LINUX_VIRTIO_IDS_H */
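
A note on the wire layout implied by the structs in the patch above: each guest command is a little-endian 64-bit type, followed (for GET_PAGES and GIVE_PAGES) by one or more little-endian 64-bit page addresses, and the driver sends only the used prefix of the buffer (`offsetof(..., pages[n])` bytes). The following is a minimal user-space sketch of that encoding, not the driver's code. `put_le64`, `encode_page_cmd` and the `GCMD_*` macro names are hypothetical helpers introduced here for illustration; the numeric values and the 256-entry limit are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command values as proposed in the patch's virtio_balloon.h. */
#define GCMD_GET_PAGES   0u
#define GCMD_GIVE_PAGES  1u
#define GCMD_MAX_PAGES   256u   /* matches pages[256] in the demo driver */

/* Store a 64-bit value in little-endian order, whatever the CPU byte order. */
static void put_le64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Encode one guest command: an le64 type followed by n le64 page addresses.
 * Returns the number of bytes written, or 0 if n is out of range or the
 * buffer is too small. (Hypothetical helper, for illustration only.)
 */
static size_t encode_page_cmd(uint8_t *buf, size_t buflen, uint64_t type,
                              const uint64_t *page_addrs, size_t n)
{
    size_t need = 8 * (n + 1);

    assert(buf != NULL && page_addrs != NULL);
    if (n == 0 || n > GCMD_MAX_PAGES || buflen < need)
        return 0;

    put_le64(buf, type);
    for (size_t i = 0; i < n; i++)
        put_le64(buf + 8 * (i + 1), page_addrs[i]);
    return need;
}
```

Giving two pages at guest-physical addresses 0x1000 and 0x2000 produces a 24-byte message under this layout: 8 bytes of type plus two 8-byte addresses.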


  • 2.  Re: [virtio] New virtio balloon...

    Posted 01-30-2014 10:16
    Also copying virtio-dev, since this is clearly implementation ...

    On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > Hi,
    >
    > I tried to write a new balloon driver; it's completely untested
    > (as I need to write the device). The protocol is basically two vqs, one
    > for the guest to send commands, one for the host to send commands.
    >
    > Some interesting things come out:
    > 1) We do need to explicitly tell the host where the page is we want.
    > This is required for compaction, for example.
    >
    > 2) We need to be able to exceed the balloon target, especially for page
    > migration. Thus there's no mechanism for the device to refuse to
    > give us the pages.
    >
    > 3) The device can offer multiple page sizes, but the driver can only
    > accept one. I'm not sure if this is useful, as guests are either
    > huge page backed or not, and returning sub-pages isn't useful.
    >
    > Linux demo code follows.
    >
    > Cheers,
    > Rusty.
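
Point 3 above (the device offers several page sizes, the driver accepts exactly one) corresponds to the `pagesizes` / `page_size` config fields in the patch, where the probe code tests `le64_to_cpu(v) & PAGE_SIZE`. A minimal sketch of that selection, assuming each set bit in the bitmap is itself the page size in bytes; `choose_page_size` is a hypothetical name, not a function from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the single page size the driver will use.
 * offered: bitmap read from the device's `pagesizes` field; a set bit with
 *          value N is taken to mean "pages of N bytes are supported"
 *          (so 4 KiB pages are the bit 0x1000).
 * wanted:  the guest's own page size; must be a power of two.
 * Returns the value to write to `page_size`, or 0 if the device does not
 * offer the wanted size.
 */
static uint64_t choose_page_size(uint64_t offered, uint64_t wanted)
{
    assert(wanted != 0 && (wanted & (wanted - 1)) == 0);
    return (offered & wanted) ? wanted : 0;
}
```

A device offering both 4 KiB and 2 MiB pages (`0x1000 | 0x200000`) would be accepted by a 4 KiB guest; a device offering only 2 MiB pages would be rejected, which is the case the patch reports as "Unacceptable pagesize".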

    More comments:
    - for projects like auto-ballooning that Luiz works on,
    it's not nice that to swap page 1 for page 2
    you have to inflate then deflate.
    Besides the overhead, this confuses the host:
    imagine you tell QEMU to increase target,
    meanwhile guest inflates temporarily,
    QEMU thinks okay done, now you suddenly deflate.


    - what's the status of page returned from balloon?
    is it zeroed or can it have old data in there?
    I think in practice Linux will sometimes map in a zero page,
    so guest can save cycles and avoid zeroing it out.
    I think we should tell this to guest when returning
    pages.


    - I am guessing EXTRA_MEM is for uses like the ones proposed by
    Frank Swiderski from Google that inflate/deflate balloon
    whenever guest wants (look for "Add a page cache-backed balloon
    device driver").

    this is useful but - we need to distinguish pages
    like this from regular inflate.
    it's not just a counter, and host needs a way to know
    that its target is reached


    - do we even want to allow guest not telling host when it wants
    to reuse the page?
    if yes, I think this should be per-page somehow: when balloon
    is inflated guest should tell host whether it
    expects to use this page.
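
The first comment above (swapping one ballooned page for another by inflating and then deflating) can be illustrated with a toy model of the host's bookkeeping. This is only a sketch under stated assumptions: `struct host_view` and every function below are hypothetical, and real QEMU accounting is not modelled. It shows the host briefly seeing its target reached in the middle of a swap, then seeing the balloon fall below target again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy host-side bookkeeping; not QEMU's actual data structures. */
struct host_view {
    uint64_t target;      /* pages the host asked the guest to give up */
    uint64_t in_balloon;  /* pages the guest has handed over so far */
};

static bool target_reached(const struct host_view *h)
{
    return h->in_balloon >= h->target;
}

/* Guest gives one page to the host (inflate by one). */
static void inflate_one(struct host_view *h)
{
    h->in_balloon++;
}

/* Guest takes one page back (deflate by one). */
static void deflate_one(struct host_view *h)
{
    assert(h->in_balloon > 0);
    h->in_balloon--;
}

/*
 * Swap one ballooned page for another by inflating with the new page and
 * then deflating the old one. Returns what the host would have concluded
 * had it checked the balloon between the two steps.
 */
static bool swap_page_inflate_first(struct host_view *h)
{
    bool seen_mid_swap;

    inflate_one(h);
    seen_mid_swap = target_reached(h);
    deflate_one(h);
    return seen_mid_swap;
}
```

With a target of 11 and 10 pages in the balloon, the mid-swap check reports "target reached" even though the balloon is back at 10 pages once the swap completes.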


    So I think we should accommodate these uses, and so we want the following flags:

    - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    flag that specifies pages do not count against target,
    can be taken out of balloon.
    EXTRA_MEM suggests there's an upper limit on balloon size
    but IMHO that's just extra work for host: host does not care
    I think, give it as much as you want.
    set by guest, used by host

    - TELL_HOST flag that specifies guest will tell host before using pages
    (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    at the moment, listed here for completeness)
    set by guest, used by host

    - ZEROED
    flag that specifies that page returned to guest
    is zeroed
    set by host, used by guest



    Each of the flags can be just a feature flag, and then
    if we want a mix of them host can create multiple
    balloon devices with different flags, and guest looks for best
    balloon for its purposes.

    Alternatively flags can be set and reported per page.
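
One way the per-page alternative could look, as a sketch only: page addresses in the proposed commands are page-aligned, so with 4 KiB pages the low 12 bits of each le64 entry are free to carry flags. The bit assignments and the names `PAGE_F_*`, `pack_page`, `page_addr_of` and `page_flags_of` are hypothetical; nothing in the thread or the patch defines them.

```c
#include <assert.h>
#include <stdint.h>

#define BALLOON_PAGE_SHIFT 12u  /* assumes 4 KiB balloon pages */
#define BALLOON_FLAG_MASK  ((UINT64_C(1) << BALLOON_PAGE_SHIFT) - 1)

/* Hypothetical bit assignments for the three flags discussed above. */
#define PAGE_F_WEAK_TARGET (UINT64_C(1) << 0) /* does not count against target */
#define PAGE_F_TELL_HOST   (UINT64_C(1) << 1) /* guest will tell host before reuse */
#define PAGE_F_ZEROED      (UINT64_C(1) << 2) /* page comes back zero-filled */

/* Combine a page-aligned address and flag bits into one 64-bit entry. */
static uint64_t pack_page(uint64_t page_addr, uint64_t flags)
{
    assert((page_addr & BALLOON_FLAG_MASK) == 0); /* must be page-aligned */
    assert((flags & ~BALLOON_FLAG_MASK) == 0);    /* flags must fit low bits */
    return page_addr | flags;
}

static uint64_t page_addr_of(uint64_t entry)
{
    return entry & ~BALLOON_FLAG_MASK;
}

static uint64_t page_flags_of(uint64_t entry)
{
    return entry & BALLOON_FLAG_MASK;
}
```

The trade-off against per-device feature bits is that every entry must be masked before use, but a single balloon device can then mix page kinds.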


    A couple of other suggestions:

    - how to accommodate memory pressure in guest?
    Let's add a field telling host how hard we
    want our memory back

    - assume you want to over-commit host and start
    inflating balloon.
    If low on memory it might be better for guest to
    wait a bit before inflating.
    Also, if host asks for a lot of memory a ton of
    allocations will slow guest significantly.
    But for guest to do the right thing we need host to tell guest what
    are its memory and time constraints.
    Let's add a field telling guest how hard we
    want it to give us memory (e.g. time limit)
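
As a rough illustration of the time-limit idea, suppose the host's set-target command carried a deadline in milliseconds (a hypothetical field, here called `deadline_ms`; the patch has no such field). The guest could then pace its inflation rather than allocating everything at once. `pages_per_tick` is likewise a hypothetical helper, and the tick length is whatever interval the guest's balloon thread runs at:

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many pages the guest must move per tick to finish within the
 * host's deadline. Rounds up so the deadline is met. Returns 0 when
 * there is nothing to move or no deadline was given (deadline_ms == 0),
 * meaning the guest is free to choose its own pace.
 */
static uint64_t pages_per_tick(uint64_t pages_to_move, uint32_t deadline_ms,
                               uint32_t tick_ms)
{
    uint64_t ticks;

    assert(tick_ms > 0);
    if (deadline_ms == 0 || pages_to_move == 0)
        return 0;

    ticks = deadline_ms / tick_ms;
    if (ticks == 0)
        ticks = 1; /* deadline shorter than one tick: do it all at once */
    return (pages_to_move + ticks - 1) / ticks;
}
```

For example, moving 1000 pages within 2000 ms at a 200 ms tick works out to 100 pages per tick, well under the 256-page batch limit of the demo driver's command buffer.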



    > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
    > index 9076635697bb..1dd45691b618 100644
    > --- a/drivers/virtio/Makefile
    > +++ b/drivers/virtio/Makefile
    > @@ -1,4 +1,4 @@
    > obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
    > obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
    > obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
    > -obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
    > +obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o virtio_balloon2.o
    > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > new file mode 100644
    > index 000000000000..93f13e7c561d
    > --- /dev/null
    > +++ b/drivers/virtio/virtio_balloon2.c
    > @@ -0,0 +1,566 @@
    > +/*
    > + * Virtio balloon implementation, inspired by Dor Laor and Marcelo
    > + * Tosatti's implementations.
    > + *
    > + * Copyright 2008, 2014 Rusty Russell IBM Corporation
    > + *
    > + * This program is free software; you can redistribute it and/or modify
    > + * it under the terms of the GNU General Public License as published by
    > + * the Free Software Foundation; either version 2 of the License, or
    > + * (at your option) any later version.
    > + *
    > + * This program is distributed in the hope that it will be useful,
    > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    > + * GNU General Public License for more details.
    > + *
    > + * You should have received a copy of the GNU General Public License
    > + * along with this program; if not, write to the Free Software
    > + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
    > + */
    > +
    > +#include <linux/virtio.h>
    > +#include <linux/virtio_balloon.h>
    > +#include <linux/swap.h>
    > +#include <linux/kthread.h>
    > +#include <linux/freezer.h>
    > +#include <linux/delay.h>
    > +#include <linux/slab.h>
    > +#include <linux/module.h>
    > +#include <linux/balloon_compaction.h>
    > +
    > +struct gcmd_get_pages {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_GET_PAGES */
    > + __le64 pages[256];
    > +};
    > +
    > +struct gcmd_give_pages {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_GIVE_PAGES */
    > + __le64 pages[256];
    > +};
    > +
    > +struct gcmd_need_mem {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    > +};
    > +
    > +struct gcmd_stats_reply {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    > + struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    > +};
    > +
    > +struct hcmd_set_balloon {
    > + __le64 type; /* VIRTIO_BALLOON_HCMD_SET_BALLOON */
    > + __le64 target;
    > +};
    > +
    > +struct hcmd_get_stats {
    > + __le64 type; /* VIRTIO_BALLOON_HCMD_GET_STATS */
    > +};
    > +
    > +struct virtio_balloon {
    > + /* Protects contents of entire structure. */
    > + struct mutex lock;
    > +
    > + struct virtio_device *vdev;
    > + struct virtqueue *gcmd_vq, *hcmd_vq;
    > +
    > + /* The thread servicing the balloon. */
    > + struct task_struct *thread;
    > +
    > + /* For interrupt/suspend to wake balloon thread. */
    > + wait_queue_head_t wait;
    > +
    > + /* How many pages are we supposed to have in balloon? */
    > + s64 target;
    > +
    > + /* How many do we have in the balloon? */
    > + u64 num_pages;
    > +
    > + /* This reminds me of Eeyore. */
    > + bool broken;
    > +
    > + /*
    > + * The pages we've told the Host we're not using are enqueued
    > + * at vb_dev_info->pages list.
    > + */
    > + struct balloon_dev_info *vb_dev_info;
    > +
    > + /* To avoid kmalloc, we use single hcmd and gcmd buffers. */
    > + union gcmd {
    > + __le64 type;
    > + struct gcmd_get_pages get_pages;
    > + struct gcmd_give_pages give_pages;
    > + struct gcmd_need_mem need_mem;
    > + struct gcmd_stats_reply stats_reply;
    > + } gcmd;
    > +
    > + union hcmd {
    > + __le64 type;
    > + struct hcmd_set_balloon set_balloon;
    > + struct hcmd_get_stats get_stats;
    > + } hcmd;
    > +};
    > +
    > +static struct virtio_device_id id_table[] = {
    > + { VIRTIO_ID_MEMBALLOON, VIRTIO_DEV_ANY_ID },
    > + { 0 },
    > +};
    > +
    > +static void wake_balloon(struct virtqueue *vq)
    > +{
    > + struct virtio_balloon *vb = vq->vdev->priv;
    > +
    > + wake_up(&vb->wait);
    > +}
    > +
    > +/* Command is in vb->gcmd, lock is held. */
    > +static bool send_gcmd(struct virtio_balloon *vb, size_t len)
    > +{
    > + struct scatterlist sg;
    > +
    > + BUG_ON(len > sizeof(vb->gcmd));
    > + sg_init_one(&sg, &vb->gcmd, len);
    > +
    > + /*
    > + * We should always be able to add one buffer to an empty queue.
    > + * If not, it's a broken device.
    > + */
    > + if (virtqueue_add_outbuf(vb->gcmd_vq, &sg, 1, vb, GFP_KERNEL) != 0
    > + || virtqueue_kick(vb->gcmd_vq) != 0) {
    > + vb->broken = true;
    > + return false;
    > + }
    > +
    > + /* When host has read buffer, this completes via wake_balloon */
    > + wait_event(vb->wait,
    > + virtqueue_get_buf(vb->gcmd_vq, &len)
    > + || (vb->broken = virtqueue_is_broken(vb->gcmd_vq)));
    > + return !vb->broken;
    > +}
    > +
    > +static void give_to_balloon(struct virtio_balloon *vb, u64 num)
    > +{
    > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    > + u64 i;
    > +
    > + /* We can only do one array worth at a time. */
    > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.give_pages.pages));
    > +
    > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > +
    > + for (i = 0; i < num; i++) {
    > + struct page *page = balloon_page_enqueue(vb_dev_info);
    > +
    > + if (!page) {
    > + dev_info_ratelimited(&vb->vdev->dev,
    > + "Out of puff! Can't get page\n");
    > + /* Sleep for at least 1/5 of a second before retry. */
    > + msleep(200);
    > + break;
    > + }
    > +
    > + vb->gcmd.give_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    > + vb->num_pages++;
    > + adjust_managed_page_count(page, -1);
    > + }
    > +
    > + /* Did we get any? */
    > + if (i)
    > + send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[i]));
    > +}
    > +
    > +static void take_from_balloon(struct virtio_balloon *vb, u64 num)
    > +{
    > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    > + size_t i;
    > +
    > + /* We can only do one array worth at a time. */
    > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.get_pages.pages));
    > +
    > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > +
    > + for (i = 0; i < num; i++) {
    > + struct page *page = balloon_page_dequeue(vb_dev_info);
    > +
    > + /* In case we ran out of pages (compaction) */
    > + if (!page)
    > + break;
    > +
    > + vb->gcmd.get_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    > + vb->num_pages--;
    > + }
    > + num = i;
    > + if (num)
    > + send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[num]));
    > +
    > + /* Now release those pages. */
    > + for (i = 0; i < num; i++) {
    > + struct page *page;
    > +
    > + page = pfn_to_page(vb->gcmd.get_pages.pages[i] >> PAGE_SHIFT);
    > + balloon_page_free(page);
    > + adjust_managed_page_count(page, 1);
    > + }
    > + mutex_unlock(&vb->lock);
    > +}
    > +
    > +static inline void set_stat(struct gcmd_stats_reply *stats, int idx,
    > + u64 tag, u64 val)
    > +{
    > + BUG_ON(idx >= ARRAY_SIZE(stats->stats));
    > + stats->stats[idx].tag = cpu_to_le64(tag);
    > + stats->stats[idx].val = cpu_to_le64(val);
    > +}
    > +
    > +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
    > +
    > +static void get_stats(struct gcmd_stats_reply *stats)
    > +{
    > + unsigned long events[NR_VM_EVENT_ITEMS];
    > + struct sysinfo i;
    > + int idx = 0;
    > +
    > + all_vm_events(events);
    > + si_meminfo(&i);
    > +
    > + stats->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_STATS_REPLY);
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_IN,
    > + pages_to_bytes(events[PSWPIN]));
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
    > + pages_to_bytes(events[PSWPOUT]));
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MAJFLT,
    > + events[PGMAJFAULT]);
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MINFLT,
    > + events[PGFAULT]);
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMFREE,
    > + pages_to_bytes(i.freeram));
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMTOT,
    > + pages_to_bytes(i.totalram));
    > +}
    > +
    > +static bool move_towards_target(struct virtio_balloon *vb)
    > +{
    > + bool moved = false;
    > +
    > + if (vb->broken)
    > + return false;
    > +
    > + mutex_lock(&vb->lock);
    > + if (vb->num_pages < vb->target) {
    > + give_to_balloon(vb, vb->target - vb->num_pages);
    > + moved = true;
    > + } else if (vb->num_pages > vb->target) {
    > + take_from_balloon(vb, vb->num_pages - vb->target);
    > + moved = true;
    > + }
    > + mutex_unlock(&vb->lock);
    > + return moved;
    > +}
    > +
    > +static bool process_hcmd(struct virtio_balloon *vb)
    > +{
    > + union hcmd *hcmd = NULL;
    > + unsigned int cmdlen;
    > + struct scatterlist sg;
    > +
    > + if (vb->broken)
    > + return false;
    > +
    > + mutex_lock(&vb->lock);
    > + hcmd = virtqueue_get_buf(vb->hcmd_vq, &cmdlen);
    > + if (!hcmd) {
    > + mutex_unlock(&vb->lock);
    > + return false;
    > + }
    > +
    > + switch (hcmd->type) {
    > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_SET_BALLOON):
    > + vb->target = le64_to_cpu(hcmd->set_balloon.target);
    > + break;
    > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_GET_STATS):
    > + get_stats(&vb->gcmd.stats_reply);
    > + send_gcmd(vb, sizeof(vb->gcmd.stats_reply));
    > + break;
    > + default:
    > + dev_err_ratelimited(&vb->vdev->dev, "Unknown hcmd %llu\n",
    > + le64_to_cpu(hcmd->type));
    > + break;
    > + }
    > +
    > + /* Re-queue the hcmd for next time. */
    > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd));
    > + virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL);
    > +
    > + mutex_unlock(&vb->lock);
    > + return true;
    > +}
    > +
    > +static int balloon(void *_vballoon)
    > +{
    > + struct virtio_balloon *vb = _vballoon;
    > +
    > + set_freezable();
    > + while (!kthread_should_stop()) {
    > + try_to_freeze();
    > +
    > + wait_event_interruptible(vb->wait,
    > + kthread_should_stop()
    > + || freezing(current)
    > + || process_hcmd(vb)
    > + || move_towards_target(vb));
    > + }
    > + return 0;
    > +}
    > +
    > +static int init_vqs(struct virtio_balloon *vb)
    > +{
    > + struct virtqueue *vqs[2];
    > + vq_callback_t *callbacks[] = { wake_balloon, wake_balloon };
    > + const char *names[] = { "gcmd", "hcmd" };
    > + struct scatterlist sg;
    > + int err;
    > +
    > + err = vb->vdev->config->find_vqs(vb->vdev, 2, vqs, callbacks, names);
    > + if (err)
    > + return err;
    > +
    > + vb->gcmd_vq = vqs[0];
    > + vb->hcmd_vq = vqs[1];
    > +
    > + /*
    > + * Prime this virtqueue with one buffer so the hypervisor can
    > + * use it to signal us later (it can't be broken yet!).
    > + */
    > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd));
    > + if (virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL) < 0)
    > + BUG();
    > + virtqueue_kick(vb->hcmd_vq);
    > +
    > + return 0;
    > +}
    > +
    > +static const struct address_space_operations virtio_balloon_aops;
    > +#ifdef CONFIG_BALLOON_COMPACTION
    > +/*
    > + * virtballoon_migratepage - perform the balloon page migration on behalf of
    > + * a compaction thread. (called under page lock)
    > + * @mapping: the page->mapping which will be assigned to the new migrated page.
    > + * @newpage: page that will replace the isolated page after migration finishes.
    > + * @page : the isolated (old) page that is about to be migrated to newpage.
    > + * @mode : compaction mode -- not used for balloon page migration.
    > + *
    > + * After a ballooned page gets isolated by compaction procedures, this is the
    > + * function that performs the page migration on behalf of a compaction thread
    > + * The page migration for virtio balloon is done in a simple swap fashion which
    > + * follows these two macro steps:
    > + * 1) insert newpage into vb->pages list and update the host about it;
    > + * 2) update the host about the old page removed from vb->pages list;
    > + *
    > + * This function performs the balloon page migration task.
    > + * Called through balloon_mapping->a_ops->migratepage
    > + */
    > +static int virtballoon_migratepage(struct address_space *mapping,
    > + struct page *newpage, struct page *page, enum migrate_mode mode)
    > +{
    > + struct balloon_dev_info *vb_dev_info = balloon_page_device(page);
    > + struct virtio_balloon *vb;
    > + unsigned long flags;
    > + int err;
    > +
    > + BUG_ON(!vb_dev_info);
    > +
    > + vb = vb_dev_info->balloon_device;
    > +
    > + /*
    > + * In order to avoid lock contention while migrating pages concurrently
    > + * to leak_balloon() or fill_balloon() we just give up the balloon_lock
    > + * this turn, as it is easier to retry the page migration later.
    > + * This also prevents fill_balloon() getting stuck into a mutex
    > + * recursion in the case it ends up triggering memory compaction
    > + * while it is attempting to inflate the balloon.
    > + */
    > + if (!mutex_trylock(&vb->lock))
    > + return -EAGAIN;
    > +
    > + /* Try to get the page out of the balloon. */
    > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > + vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT;
    > + if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) {
    > + err = -EIO;
    > + goto unlock;
    > + }
    > +
    > + /* Now put newpage into balloon. */
    > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > + vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT;
    > + if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) {
    > + /* We leak a page here, but only happens if balloon broken. */
    > + err = -EIO;
    > + goto unlock;
    > + }
    > +
    > + spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
    > + balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
    > + vb_dev_info->isolated_pages--;
    > + spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
    > +
    > + /*
    > + * It's safe to delete page->lru here because this page is at
    > + * an isolated migration list, and this step is expected to happen here
    > + */
    > + balloon_page_delete(page);
    > + err = MIGRATEPAGE_BALLOON_SUCCESS;
    > +
    > +unlock:
    > + mutex_unlock(&vb->lock);
    > + return err;
    > +}
    > +
    > +/* define the balloon_mapping->a_ops callback to allow balloon page migration */
    > +static const struct address_space_operations virtio_balloon_aops = {
    > + .migratepage = virtballoon_migratepage,
    > +};
    > +#endif /* CONFIG_BALLOON_COMPACTION */
    > +
    > +static int virtballoon_probe(struct virtio_device *vdev)
    > +{
    > + struct virtio_balloon *vb;
    > + struct address_space *vb_mapping;
    > + struct balloon_dev_info *vb_devinfo;
    > + __le64 v;
    > + int err;
    > +
    > + virtio_cread(vdev, struct virtio_balloon_config_space, pagesizes, &v);
    > + /* FIXME: Support large pages. */
    > + if (!(le64_to_cpu(v) & PAGE_SIZE)) {
    > + dev_warn(&vdev->dev, "Unacceptable pagesize %llu\n",
    > + (long long)le64_to_cpu(v));
    > + err = -EINVAL;
    > + goto out;
    > + }
    > + v = cpu_to_le64(PAGE_SIZE);
    > + virtio_cwrite(vdev, struct virtio_balloon_config_space, page_size, &v);
    > +
    > + vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
    > + if (!vb) {
    > + err = -ENOMEM;
    > + goto out;
    > + }
    > +
    > + vb->target = 0;
    > + vb->num_pages = 0;
    > + mutex_init(&vb->lock);
    > + init_waitqueue_head(&vb->wait);
    > + vb->vdev = vdev;
    > +
    > + vb_devinfo = balloon_devinfo_alloc(vb);
    > + if (IS_ERR(vb_devinfo)) {
    > + err = PTR_ERR(vb_devinfo);
    > + goto out_free_vb;
    > + }
    > +
    > + vb_mapping = balloon_mapping_alloc(vb_devinfo,
    > + (balloon_compaction_check()) ?
    > + &virtio_balloon_aops : NULL);
    > + if (IS_ERR(vb_mapping)) {
    > + /*
    > + * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP
    > + * This means !CONFIG_BALLOON_COMPACTION, otherwise we get off.
    > + */
    > + err = PTR_ERR(vb_mapping);
    > + if (err != -EOPNOTSUPP)
    > + goto out_free_vb_devinfo;
    > + }
    > +
    > + vb->vb_dev_info = vb_devinfo;
    > +
    > + err = init_vqs(vb);
    > + if (err)
    > + goto out_free_vb_mapping;
    > +
    > + vb->thread = kthread_run(balloon, vb, "vballoon");
    > + if (IS_ERR(vb->thread)) {
    > + err = PTR_ERR(vb->thread);
    > + goto out_del_vqs;
    > + }
    > +
    > + return 0;
    > +
    > +out_del_vqs:
    > + vdev->config->del_vqs(vdev);
    > +out_free_vb_mapping:
    > + balloon_mapping_free(vb_mapping);
    > +out_free_vb_devinfo:
    > + balloon_devinfo_free(vb_devinfo);
    > +out_free_vb:
    > + kfree(vb);
    > +out:
    > + return err;
    > +}
    > +
    > +/* FIXME: Leave pages alone during suspend, rather than taking them
    > + * all back! */
    > +static void remove_common(struct virtio_balloon *vb)
    > +{
    > + /* There might be pages left in the balloon: free them. */
    > + while (vb->num_pages)
    > + take_from_balloon(vb, vb->num_pages);
    > +
    > + /* Now we reset the device so we can clean up the queues. */
    > + vb->vdev->config->reset(vb->vdev);
    > + vb->vdev->config->del_vqs(vb->vdev);
    > +}
    > +
    > +static void virtballoon_remove(struct virtio_device *vdev)
    > +{
    > + struct virtio_balloon *vb = vdev->priv;
    > +
    > + kthread_stop(vb->thread);
    > + remove_common(vb);
    > + balloon_mapping_free(vb->vb_dev_info->mapping);
    > + balloon_devinfo_free(vb->vb_dev_info);
    > + kfree(vb);
    > +}
    > +
    > +#ifdef CONFIG_PM_SLEEP
    > +static int virtballoon_freeze(struct virtio_device *vdev)
    > +{
    > + struct virtio_balloon *vb = vdev->priv;
    > +
    > + /*
    > + * The kthread is already frozen by the PM core before this
    > + * function is called.
    > + */
    > +
    > + remove_common(vb);
    > + return 0;
    > +}
    > +
    > +static int virtballoon_restore(struct virtio_device *vdev)
    > +{
    > + return init_vqs(vdev->priv);
    > +}
    > +#endif
    > +
    > +static unsigned int features[] = {
    > + /* FIXME: Support VIRTIO_BALLOON_F_EXTRA_MEM! */
    > +};
    > +
    > +static struct virtio_driver virtio_balloon_driver = {
    > + .feature_table = features,
    > + .feature_table_size = ARRAY_SIZE(features),
    > + .driver.name = KBUILD_MODNAME,
    > + .driver.owner = THIS_MODULE,
    > + .id_table = id_table,
    > + .probe = virtballoon_probe,
    > + .remove = virtballoon_remove,
    > +#ifdef CONFIG_PM_SLEEP
    > + .freeze = virtballoon_freeze,
    > + .restore = virtballoon_restore,
    > +#endif
    > +};
    > +
    > +module_virtio_driver(virtio_balloon_driver);
    > +MODULE_DEVICE_TABLE(virtio, id_table);
    > +MODULE_DESCRIPTION("Virtio balloon driver");
    > +MODULE_LICENSE("GPL");
    > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
    > index 5e26f61b5df5..cdca2934668a 100644
    > --- a/include/uapi/linux/virtio_balloon.h
    > +++ b/include/uapi/linux/virtio_balloon.h
    > @@ -28,32 +28,45 @@
    > #include <linux/virtio_ids.h>
    > #include <linux/virtio_config.h>
    >
    > -/* The feature bitmap for virtio balloon */
    > -#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
    > -#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
    > -
    > -/* Size of a PFN in the balloon interface. */
    > -#define VIRTIO_BALLOON_PFN_SHIFT 12
    > -
    > -struct virtio_balloon_config
    > -{
    > - /* Number of pages host wants Guest to give up. */
    > - __le32 num_pages;
    > - /* Number of pages we've actually got in balloon. */
    > - __le32 actual;
    > +/* This means the balloon can go negative (ie. add memory to system) */
    > +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    > +
    > +struct virtio_balloon_config_space {
    > + /* Set by device: bits indicate what page sizes supported. */
    > + __le64 pagesizes;
    > + /* Set by driver: only a single bit is set! */
    > + __le64 page_size;
    > +
    > + /* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    > + __le64 extra_mem_start;
    > + __le64 extra_mem_end;
    > +};
    > +
    > +struct virtio_balloon_statistic {
    > + __le64 tag; /* VIRTIO_BALLOON_S_* */
    > + __le64 val;
    > };
    >
    > -#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
    > -#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */
    > -#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */
    > -#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */
    > -#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */
    > -#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */
    > -#define VIRTIO_BALLOON_S_NR 6
    > -
    > -struct virtio_balloon_stat {
    > - __u16 tag;
    > - __u64 val;
    > -} __attribute__((packed));
    > +/* Guest->host command queue. */
    > +/* Ask the host for more pages.
    > + Followed by array of 1 or more readable le64 pageaddr's. */
    > +#define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0)
    > +/* Give the host more pages.
    > + Followed by array of 1 or more readable le64 pageaddr's */
    > +#define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1)
    > +/* Dear host: I need more memory. */
    > +#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2)
    > +/* Dear host: here are your stats.
    > + * Followed by 0 or more struct virtio_balloon_statistic structs. */
    > +#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3)
    > +
    > +/* Host->guest command queue. */
    > +/* Followed by s64 of new balloon target size (only negative if
    > + * VIRTIO_BALLOON_F_EXTRA_MEM). */
    > +#define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000)
    > +/* Ask for statistics */
    > +#define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001)
    > +
    > +#include <linux/virtio_balloon_legacy.h>
    >
    > #endif /* _LINUX_VIRTIO_BALLOON_H */
    > diff --git a/include/uapi/linux/virtio_balloon_legacy.h b/include/uapi/linux/virtio_balloon_legacy.h
    > new file mode 100644
    > index 000000000000..cbf77bc1aee3
    > --- /dev/null
    > +++ b/include/uapi/linux/virtio_balloon_legacy.h
    > @@ -0,0 +1,59 @@
    > +#ifndef _LINUX_VIRTIO_BALLOON_LEGACY_H
    > +#define _LINUX_VIRTIO_BALLOON_LEGACY_H
    > +/* This header is BSD licensed so anyone can use the definitions to implement
    > + * compatible drivers/servers.
    > + *
    > + * Redistribution and use in source and binary forms, with or without
    > + * modification, are permitted provided that the following conditions
    > + * are met:
    > + * 1. Redistributions of source code must retain the above copyright
    > + * notice, this list of conditions and the following disclaimer.
    > + * 2. Redistributions in binary form must reproduce the above copyright
    > + * notice, this list of conditions and the following disclaimer in the
    > + * documentation and/or other materials provided with the distribution.
    > + * 3. Neither the name of IBM nor the names of its contributors
    > + * may be used to endorse or promote products derived from this software
    > + * without specific prior written permission.
    > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
    > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
    > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    > + * SUCH DAMAGE. */
    > +#include <linux/virtio_ids.h>
    > +#include <linux/virtio_config.h>
    > +
    > +/* The feature bitmap for virtio balloon */
    > +#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
    > +#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
    > +
    > +/* Size of a PFN in the balloon interface. */
    > +#define VIRTIO_BALLOON_PFN_SHIFT 12
    > +
    > +struct virtio_balloon_config
    > +{
    > + /* Number of pages host wants Guest to give up. */
    > + __le32 num_pages;
    > + /* Number of pages we've actually got in balloon. */
    > + __le32 actual;
    > +};
    > +
    > +#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
    > +#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */
    > +#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */
    > +#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */
    > +#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */
    > +#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */
    > +#define VIRTIO_BALLOON_S_NR 6
    > +
    > +struct virtio_balloon_stat {
    > + __u16 tag;
    > + __u64 val;
    > +} __attribute__((packed));
    > +
    > +#endif /* _LINUX_VIRTIO_BALLOON_LEGACY_H */
    > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
    > index 284fc3a05f7b..8b5ac0047190 100644
    > --- a/include/uapi/linux/virtio_ids.h
    > +++ b/include/uapi/linux/virtio_ids.h
    > @@ -33,11 +33,12 @@
    > #define VIRTIO_ID_BLOCK 2 /* virtio block */
    > #define VIRTIO_ID_CONSOLE 3 /* virtio console */
    > #define VIRTIO_ID_RNG 4 /* virtio rng */
    > -#define VIRTIO_ID_BALLOON 5 /* virtio balloon */
    > +#define VIRTIO_ID_BALLOON 5 /* virtio balloon (legacy) */
    > #define VIRTIO_ID_RPMSG 7 /* virtio remote processor messaging */
    > #define VIRTIO_ID_SCSI 8 /* virtio scsi */
    > #define VIRTIO_ID_9P 9 /* 9p virtio console */
    > #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
    > #define VIRTIO_ID_CAIF 12 /* Virtio caif */
    > +#define VIRTIO_ID_MEMBALLOON 13 /* virtio balloon */
    >
    > #endif /* _LINUX_VIRTIO_IDS_H */

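    The guest command format in the patch above is an le64 type followed by an array of le64 page addresses, and the driver sends only offsetof(struct gcmd_give_pages, pages[n]) bytes. A minimal user-space sketch of that encoding follows. It assumes 4 KiB pages; the helper names (put_le64, get_le64, encode_page_cmd) and SKETCH_PAGE_SHIFT are illustrative and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command codes as proposed in the patched virtio_balloon.h. */
#define GCMD_GET_PAGES   0u
#define GCMD_GIVE_PAGES  1u

#define GCMD_MAX_PAGES    256 /* array size used by the demo driver */
#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Store v at p in little-endian byte order, independent of host endianness. */
static void put_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Load a little-endian 64-bit value from p. */
static uint64_t get_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/*
 * Encode one GET_PAGES/GIVE_PAGES command into buf: an le64 type followed
 * by n le64 page addresses (pfn << page shift). Returns the number of bytes
 * to place on the guest command queue, i.e. the equivalent of
 * offsetof(struct gcmd_give_pages, pages[n]) in the driver.
 * buf must hold at least 8 * (n + 1) bytes.
 */
static size_t encode_page_cmd(uint8_t *buf, uint64_t type,
                              const uint64_t *pfns, size_t n)
{
    assert(n >= 1 && n <= GCMD_MAX_PAGES);

    put_le64(buf, type);
    for (size_t i = 0; i < n; i++)
        put_le64(buf + 8 * (i + 1), pfns[i] << SKETCH_PAGE_SHIFT);
    return 8 * (n + 1);
}
```

    Sending a partial array this way means the device learns the page count from the buffer length, so no separate count field is needed.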


  • 3.  Re: [virtio] New virtio balloon...

    Posted 01-30-2014 10:17
    Also copy virtio-dev since this is clearly implementation ...

    On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > Hi,
    >
    > I tried to write a new balloon driver; it's completely untested
    > (as I need to write the device). The protocol is basically two vqs, one
    > for the guest to send commands, one for the host to send commands.
    >
    > Some interesting things come out:
    > 1) We do need to explicitly tell the host where the page is we want.
    > This is required for compaction, for example.
    >
    > 2) We need to be able to exceed the balloon target, especially for page
    > migration. Thus there's no mechanism for the device to refuse to
    > give us the pages.
    >
    > 3) The device can offer multiple page sizes, but the driver can only
    > accept one. I'm not sure if this is useful, as guests are either
    > huge page backed or not, and returning sub-pages isn't useful.
    >
    > Linux demo code follows.
    >
    > Cheers,
    > Rusty.

    More comments:
    - for projects like auto-ballooning that Luiz works on, it's not nice
      that to swap page 1 for page 2 you have to inflate then deflate;
      besides overhead, this confuses the host: imagine you tell QEMU to
      increase target, meanwhile guest inflates temporarily, QEMU thinks
      okay done, now you suddenly deflate.
    - what's the status of a page returned from balloon? is it zeroed or
      can it have old data in there? I think in practice Linux will
      sometimes map in a zero page, so guest can save cycles and avoid
      zeroing it out. I think we should tell this to guest when returning
      pages.
    - I am guessing EXTRA_MEM is for uses like the ones proposed by
      Frank Swiderski from Google that inflate/deflate balloon whenever
      guest wants (look for "Add a page cache-backed balloon device
      driver"). this is useful but
      - we need to distinguish pages like this from regular inflate.
        it's not just a counter, and host needs a way to know that its
        target is reached
      - do we even want to allow guest not telling host when it wants to
        reuse the page? if yes, I think this should be per-page somehow:
        when balloon is inflated guest should tell host whether it
        expects to use this page.

    So I think we should accommodate these uses, and so we want the
    following flags:
    - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
      flag that specifies pages do not count against target, can be taken
      out of balloon. EXTRA_MEM suggests there's an upper limit on balloon
      size but IMHO that's just extra work for host: host does not care
      I think, give it as much as you want.
      set by guest, used by host
    - TELL_HOST
      flag that specifies guest will tell host before using pages
      (that's VIRTIO_BALLOON_F_MUST_TELL_HOST at the moment, listed here
      for completeness)
      set by guest, used by host
    - ZEROED
      flag that specifies that page returned to guest is zeroed
      set by host, used by guest

    Each of the flags can be just a feature flag, and then if we want
    a mix of them host can create multiple balloon devices with different
    flags, and guest looks for the best balloon for its purposes.
    Alternatively flags can be set and reported per page.

    A couple of other suggestions:
    - how to accommodate memory pressure in guest?
      Let's add a field telling host how hard we want our memory back
    - assume you want to over-commit host and start inflating balloon.
      If low on memory it might be better for guest to wait a bit before
      inflating. Also, if host asks for a lot of memory a ton of
      allocations will slow guest significantly. But for guest to do the
      right thing we need host to tell guest what its memory and time
      constraints are.
      Let's add a field telling guest how hard we want it to give us
      memory (e.g. time limit)

    > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
    > index 9076635697bb..1dd45691b618 100644
    > --- a/drivers/virtio/Makefile
    > +++ b/drivers/virtio/Makefile
    > @@ -1,4 +1,4 @@
    > obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
    > obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
    > obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
    > -obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
    > +obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o virtio_balloon2.o
    > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > new file mode 100644
    > index 000000000000..93f13e7c561d
    > --- /dev/null
    > +++ b/drivers/virtio/virtio_balloon2.c
    > @@ -0,0 +1,566 @@
    > +/*
    > + * Virtio balloon implementation, inspired by Dor Laor and Marcelo
    > + * Tosatti's implementations.
    > + *
    > + * Copyright 2008, 2014 Rusty Russell IBM Corporation
    > + *
    > + * This program is free software; you can redistribute it and/or modify
    > + * it under the terms of the GNU General Public License as published by
    > + * the Free Software Foundation; either version 2 of the License, or
    > + * (at your option) any later version.
    > + *
    > + * This program is distributed in the hope that it will be useful,
    > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    > + * GNU General Public License for more details.
    > + *
    > + * You should have received a copy of the GNU General Public License
    > + * along with this program; if not, write to the Free Software
    > + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
    > + */
    > +
    > +#include <linux/virtio.h>
    > +#include <linux/virtio_balloon.h>
    > +#include <linux/swap.h>
    > +#include <linux/kthread.h>
    > +#include <linux/freezer.h>
    > +#include <linux/delay.h>
    > +#include <linux/slab.h>
    > +#include <linux/module.h>
    > +#include <linux/balloon_compaction.h>
    > +
    > +struct gcmd_get_pages {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_GET_PAGES */
    > + __le64 pages[256];
    > +};
    > +
    > +struct gcmd_give_pages {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_GIVE_PAGES */
    > + __le64 pages[256];
    > +};
    > +
    > +struct gcmd_need_mem {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    > +};
    > +
    > +struct gcmd_stats_reply {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    > + struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    > +};
    > +
    > +struct hcmd_set_balloon {
    > + __le64 type; /* VIRTIO_BALLOON_HCMD_SET_BALLOON */
    > + __le64 target;
    > +};
    > +
    > +struct hcmd_get_stats {
    > + __le64 type; /* VIRTIO_BALLOON_HCMD_GET_STATS */
    > +};
    > +
    > +struct virtio_balloon {
    > + /* Protects contents of entire structure. */
    > + struct mutex lock;
    > +
    > + struct virtio_device *vdev;
    > + struct virtqueue *gcmd_vq, *hcmd_vq;
    > +
    > + /* The thread servicing the balloon. */
    > + struct task_struct *thread;
    > +
    > + /* For interrupt/suspend to wake balloon thread. */
    > + wait_queue_head_t wait;
    > +
    > + /* How many pages are we supposed to have in balloon? */
    > + s64 target;
    > +
    > + /* How many do we have in the balloon? */
    > + u64 num_pages;
    > +
    > + /* This reminds me of Eeyore. */
    > + bool broken;
    > +
    > + /*
    > + * The pages we've told the Host we're not using are enqueued
    > + * at vb_dev_info->pages list.
    > + */
    > + struct balloon_dev_info *vb_dev_info;
    > +
    > + /* To avoid kmalloc, we use single hcmd and gcmd buffers. */
    > + union gcmd {
    > + __le64 type;
    > + struct gcmd_get_pages get_pages;
    > + struct gcmd_give_pages give_pages;
    > + struct gcmd_need_mem need_mem;
    > + struct gcmd_stats_reply stats_reply;
    > + } gcmd;
    > +
    > + union hcmd {
    > + __le64 type;
    > + struct hcmd_set_balloon set_balloon;
    > + struct hcmd_get_stats get_stats;
    > + } hcmd;
    > +};
    > +
    > +static struct virtio_device_id id_table[] = {
    > + { VIRTIO_ID_MEMBALLOON, VIRTIO_DEV_ANY_ID },
    > + { 0 },
    > +};
    > +
    > +static void wake_balloon(struct virtqueue *vq)
    > +{
    > + struct virtio_balloon *vb = vq->vdev->priv;
    > +
    > + wake_up(&vb->wait);
    > +}
    > +
    > +/* Command is in vb->gcmd, lock is held. */
    > +static bool send_gcmd(struct virtio_balloon *vb, size_t len)
    > +{
    > + struct scatterlist sg;
    > +
    > + BUG_ON(len > sizeof(vb->gcmd));
    > + sg_init_one(&sg, &vb->gcmd, len);
    > +
    > + /*
    > + * We should always be able to add one buffer to an empty queue.
    > + * If not, it's a broken device.
    > + */
    > + if (virtqueue_add_outbuf(vb->gcmd_vq, &sg, 1, vb, GFP_KERNEL) != 0
    > + || virtqueue_kick(vb->gcmd_vq) != 0) {
    > + vb->broken = true;
    > + return false;
    > + }
    > +
    > + /* When host has read buffer, this completes via wake_balloon */
    > + wait_event(vb->wait,
    > + virtqueue_get_buf(vb->gcmd_vq, &len)
    > + || (vb->broken = virtqueue_is_broken(vb->gcmd_vq)));
    > + return !vb->broken;
    > +}
    > +
    > +static void give_to_balloon(struct virtio_balloon *vb, u64 num)
    > +{
    > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    > + u64 i;
    > +
    > + /* We can only do one array worth at a time. */
    > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.give_pages.pages));
    > +
    > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > +
    > + for (i = 0; i < num; i++) {
    > + struct page *page = balloon_page_enqueue(vb_dev_info);
    > +
    > + if (!page) {
    > + dev_info_ratelimited(&vb->vdev->dev,
    > + "Out of puff! Can't get page\n");
    > + /* Sleep for at least 1/5 of a second before retry. */
    > + msleep(200);
    > + break;
    > + }
    > +
    > + vb->gcmd.give_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    > + vb->num_pages++;
    > + adjust_managed_page_count(page, -1);
    > + }
    > +
    > + /* Did we get any? */
    > + if (i)
    > + send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[i]));
    > +}
    > +
    > +static void take_from_balloon(struct virtio_balloon *vb, u64 num)
    > +{
    > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    > + size_t i;
    > +
    > + /* We can only do one array worth at a time. */
    > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.get_pages.pages));
    > +
    > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > +
    > + for (i = 0; i < num; i++) {
    > + struct page *page = balloon_page_dequeue(vb_dev_info);
    > +
    > + /* In case we ran out of pages (compaction) */
    > + if (!page)
    > + break;
    > +
    > + vb->gcmd.get_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    > + vb->num_pages--;
    > + }
    > + num = i;
    > + if (num)
    > + send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[num]));
    > +
    > + /* Now release those pages. */
    > + for (i = 0; i < num; i++) {
    > + struct page *page;
    > +
    > + page = pfn_to_page(vb->gcmd.get_pages.pages[i] >> PAGE_SHIFT);
    > + balloon_page_free(page);
    > + adjust_managed_page_count(page, 1);
    > + }
    > + mutex_unlock(&vb->lock);
    > +}
    > +
    > +static inline void set_stat(struct gcmd_stats_reply *stats, int idx,
    > + u64 tag, u64 val)
    > +{
    > + BUG_ON(idx >= ARRAY_SIZE(stats->stats));
    > + stats->stats[idx].tag = cpu_to_le64(tag);
    > + stats->stats[idx].val = cpu_to_le64(val);
    > +}
    > +
    > +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
    > +
    > +static void get_stats(struct gcmd_stats_reply *stats)
    > +{
    > + unsigned long events[NR_VM_EVENT_ITEMS];
    > + struct sysinfo i;
    > + int idx = 0;
    > +
    > + all_vm_events(events);
    > + si_meminfo(&i);
    > +
    > + stats->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_STATS_REPLY);
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_IN,
    > + pages_to_bytes(events[PSWPIN]));
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
    > + pages_to_bytes(events[PSWPOUT]));
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MAJFLT,
    > + events[PGMAJFAULT]);
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MINFLT,
    > + events[PGFAULT]);
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMFREE,
    > + pages_to_bytes(i.freeram));
    > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMTOT,
    > + pages_to_bytes(i.totalram));
    > +}
    > +
    > +static bool move_towards_target(struct virtio_balloon *vb)
    > +{
    > + bool moved = false;
    > +
    > + if (vb->broken)
    > + return false;
    > +
    > + mutex_lock(&vb->lock);
    > + if (vb->num_pages < vb->target) {
    > + give_to_balloon(vb, vb->target - vb->num_pages);
    > + moved = true;
    > + } else if (vb->num_pages > vb->target) {
    > + take_from_balloon(vb, vb->num_pages - vb->target);
    > + moved = true;
    > + }
    > + mutex_unlock(&vb->lock);
    > + return moved;
    > +}
    > +
    > +static bool process_hcmd(struct virtio_balloon *vb)
    > +{
    > + union hcmd *hcmd = NULL;
    > + unsigned int cmdlen;
    > + struct scatterlist sg;
    > +
    > + if (vb->broken)
    > + return false;
    > +
    > + mutex_lock(&vb->lock);
    > + hcmd =
virtqueue_get_buf(vb->hcmd_vq, &cmdlen); > + if (!hcmd) { > + mutex_unlock(&vb->lock); > + return false; > + } > + > + switch (hcmd->type) { > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_SET_BALLOON): > + vb->target = le64_to_cpu(hcmd->set_balloon.target); > + break; > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_GET_STATS): > + get_stats(&vb->gcmd.stats_reply); > + send_gcmd(vb, sizeof(vb->gcmd.stats_reply)); > + break; > + default: > + dev_err_ratelimited(&vb->vdev->dev, "Unknown hcmd %llu
    ", > + le64_to_cpu(hcmd->type)); > + break; > + } > + > + /* Re-queue the hcmd for next time. */ > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd)); > + virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL); > + > + mutex_unlock(&vb->lock); > + return true; > +} > + > +static int balloon(void *_vballoon) > +{ > + struct virtio_balloon *vb = _vballoon; > + > + set_freezable(); > + while (!kthread_should_stop()) { > + try_to_freeze(); > + > + wait_event_interruptible(vb->wait, > + kthread_should_stop() > + freezing(current) > + process_hcmd(vb) > + move_towards_target(vb)); > + } > + return 0; > +} > + > +static int init_vqs(struct virtio_balloon *vb) > +{ > + struct virtqueue *vqs[2]; > + vq_callback_t *callbacks[] = { wake_balloon, wake_balloon }; > + const char *names[] = { "gcmd", "hcmd" }; > + struct scatterlist sg; > + int err; > + > + err = vb->vdev->config->find_vqs(vb->vdev, 2, vqs, callbacks, names); > + if (err) > + return err; > + > + vb->gcmd_vq = vqs[0]; > + vb->hcmd_vq = vqs[1]; > + > + /* > + * Prime this virtqueue with one buffer so the hypervisor can > + * use it to signal us later (it can't be broken yet!). > + */ > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd)); > + if (virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL) < 0) > + BUG(); > + virtqueue_kick(vb->hcmd_vq); > + > + return 0; > +} > + > +static const struct address_space_operations virtio_balloon_aops; > +#ifdef CONFIG_BALLOON_COMPACTION > +/* > + * virtballoon_migratepage - perform the balloon page migration on behalf of > + * a compation thread. (called under page lock) > + * @mapping: the page->mapping which will be assigned to the new migrated page. > + * @newpage: page that will replace the isolated page after migration finishes. > + * @page : the isolated (old) page that is about to be migrated to newpage. > + * @mode : compaction mode -- not used for balloon page migration. 
> + * > + * After a ballooned page gets isolated by compaction procedures, this is the > + * function that performs the page migration on behalf of a compaction thread > + * The page migration for virtio balloon is done in a simple swap fashion which > + * follows these two macro steps: > + * 1) insert newpage into vb->pages list and update the host about it; > + * 2) update the host about the old page removed from vb->pages list; > + * > + * This function preforms the balloon page migration task. > + * Called through balloon_mapping->a_ops->migratepage > + */ > +static int virtballoon_migratepage(struct address_space *mapping, > + struct page *newpage, struct page *page, enum migrate_mode mode) > +{ > + struct balloon_dev_info *vb_dev_info = balloon_page_device(page); > + struct virtio_balloon *vb; > + unsigned long flags; > + int err; > + > + BUG_ON(!vb_dev_info); > + > + vb = vb_dev_info->balloon_device; > + > + /* > + * In order to avoid lock contention while migrating pages concurrently > + * to leak_balloon() or fill_balloon() we just give up the balloon_lock > + * this turn, as it is easier to retry the page migration later. > + * This also prevents fill_balloon() getting stuck into a mutex > + * recursion in the case it ends up triggering memory compaction > + * while it is attempting to inflate the ballon. > + */ > + if (!mutex_trylock(&vb->lock)) > + return -EAGAIN; > + > + /* Try to get the page out of the balloon. */ > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES); > + vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT; > + if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) { > + err = -EIO; > + goto unlock; > + } > + > + /* Now put newpage into balloon. 
*/ > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES); > + vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT; > + if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) { > + /* We leak a page here, but only happens if balloon broken. */ > + err = -EIO; > + goto unlock; > + } > + > + spin_lock_irqsave(&vb_dev_info->pages_lock, flags); > + balloon_page_insert(newpage, mapping, &vb_dev_info->pages); > + vb_dev_info->isolated_pages--; > + spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); > + > + /* > + * It's safe to delete page->lru here because this page is at > + * an isolated migration list, and this step is expected to happen here > + */ > + balloon_page_delete(page); > + err = MIGRATEPAGE_BALLOON_SUCCESS; > + > +unlock: > + mutex_unlock(&vb->lock); > + return err; > +} > + > +/* define the balloon_mapping->a_ops callback to allow balloon page migration */ > +static const struct address_space_operations virtio_balloon_aops = { > + .migratepage = virtballoon_migratepage, > +}; > +#endif /* CONFIG_BALLOON_COMPACTION */ > + > +static int virtballoon_probe(struct virtio_device *vdev) > +{ > + struct virtio_balloon *vb; > + struct address_space *vb_mapping; > + struct balloon_dev_info *vb_devinfo; > + __le64 v; > + int err; > + > + virtio_cread(vdev, struct virtio_balloon_config_space, pagesizes, &v); > + /* FIXME: Support large pages. */ > + if (!(le64_to_cpu(v) & PAGE_SIZE)) { > + dev_warn(&vdev->dev, "Unacceptable pagesize %llu
    ", > + (long long)le64_to_cpu(v)); > + err = -EINVAL; > + goto out; > + } > + v = cpu_to_le64(PAGE_SIZE); > + virtio_cwrite(vdev, struct virtio_balloon_config_space, page_size, &v); > + > + vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL); > + if (!vb) { > + err = -ENOMEM; > + goto out; > + } > + > + vb->target = 0; > + vb->num_pages = 0; > + mutex_init(&vb->lock); > + init_waitqueue_head(&vb->wait); > + vb->vdev = vdev; > + > + vb_devinfo = balloon_devinfo_alloc(vb); > + if (IS_ERR(vb_devinfo)) { > + err = PTR_ERR(vb_devinfo); > + goto out_free_vb; > + } > + > + vb_mapping = balloon_mapping_alloc(vb_devinfo, > + (balloon_compaction_check()) ? > + &virtio_balloon_aops : NULL); > + if (IS_ERR(vb_mapping)) { > + /* > + * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP > + * This means !CONFIG_BALLOON_COMPACTION, otherwise we get off. > + */ > + err = PTR_ERR(vb_mapping); > + if (err != -EOPNOTSUPP) > + goto out_free_vb_devinfo; > + } > + > + vb->vb_dev_info = vb_devinfo; > + > + err = init_vqs(vb); > + if (err) > + goto out_free_vb_mapping; > + > + vb->thread = kthread_run(balloon, vb, "vballoon"); > + if (IS_ERR(vb->thread)) { > + err = PTR_ERR(vb->thread); > + goto out_del_vqs; > + } > + > + return 0; > + > +out_del_vqs: > + vdev->config->del_vqs(vdev); > +out_free_vb_mapping: > + balloon_mapping_free(vb_mapping); > +out_free_vb_devinfo: > + balloon_devinfo_free(vb_devinfo); > +out_free_vb: > + kfree(vb); > +out: > + return err; > +} > + > +/* FIXME: Leave pages alone during suspend, rather than taking them > + * all back! */ > +static void remove_common(struct virtio_balloon *vb) > +{ > + /* There might be pages left in the balloon: free them. */ > + while (vb->num_pages) > + take_from_balloon(vb, vb->num_pages); > + > + /* Now we reset the device so we can clean up the queues. 
*/ > + vb->vdev->config->reset(vb->vdev); > + vb->vdev->config->del_vqs(vb->vdev); > +} > + > +static void virtballoon_remove(struct virtio_device *vdev) > +{ > + struct virtio_balloon *vb = vdev->priv; > + > + kthread_stop(vb->thread); > + remove_common(vb); > + balloon_mapping_free(vb->vb_dev_info->mapping); > + balloon_devinfo_free(vb->vb_dev_info); > + kfree(vb); > +} > + > +#ifdef CONFIG_PM_SLEEP > +static int virtballoon_freeze(struct virtio_device *vdev) > +{ > + struct virtio_balloon *vb = vdev->priv; > + > + /* > + * The kthread is already frozen by the PM core before this > + * function is called. > + */ > + > + remove_common(vb); > + return 0; > +} > + > +static int virtballoon_restore(struct virtio_device *vdev) > +{ > + return init_vqs(vdev->priv); > +} > +#endif > + > +static unsigned int features[] = { > + /* FIXME: Support VIRTIO_BALLOON_F_EXTRA_MEM! */ > +}; > + > +static struct virtio_driver virtio_balloon_driver = { > + .feature_table = features, > + .feature_table_size = ARRAY_SIZE(features), > + .driver.name = KBUILD_MODNAME, > + .driver.owner = THIS_MODULE, > + .id_table = id_table, > + .probe = virtballoon_probe, > + .remove = virtballoon_remove, > +#ifdef CONFIG_PM_SLEEP > + .freeze = virtballoon_freeze, > + .restore = virtballoon_restore, > +#endif > +}; > + > +module_virtio_driver(virtio_balloon_driver); > +MODULE_DEVICE_TABLE(virtio, id_table); > +MODULE_DESCRIPTION("Virtio balloon driver"); > +MODULE_LICENSE("GPL"); > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index 5e26f61b5df5..cdca2934668a 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -28,32 +28,45 @@ > #include <linux/virtio_ids.h> > #include <linux/virtio_config.h> > > -/* The feature bitmap for virtio balloon */ > -#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > -#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > - > -/* Size of a PFN in 
the balloon interface. */ > -#define VIRTIO_BALLOON_PFN_SHIFT 12 > - > -struct virtio_balloon_config > -{ > - /* Number of pages host wants Guest to give up. */ > - __le32 num_pages; > - /* Number of pages we've actually got in balloon. */ > - __le32 actual; > +/* This means the balloon can go negative (ie. add memory to system) */ > +#define VIRTIO_BALLOON_F_EXTRA_MEM 0 > + > +struct virtio_balloon_config_space { > + /* Set by device: bits indicate what page sizes supported. */ > + __le64 pagesizes; > + /* Set by driver: only a single bit is set! */ > + __le64 page_size; > + > + /* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */ > + __le64 extra_mem_start; > + __le64 extra_mem_end; > +}; > + > +struct virtio_balloon_statistic { > + __le64 tag; /* VIRTIO_BALLOON_S_* */ > + __le64 val; > }; > > -#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ > -#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */ > -#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */ > -#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */ > -#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */ > -#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ > -#define VIRTIO_BALLOON_S_NR 6 > - > -struct virtio_balloon_stat { > - __u16 tag; > - __u64 val; > -} __attribute__((packed)); > +/* Guest->host command queue. */ > +/* Ask the host for more pages. > + Followed by array of 1 or more readable le64 pageaddr's. */ > +#define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0) > +/* Give the host more pages. > + Followed by array of 1 or more readable le64 pageaddr's */ > +#define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1) > +/* Dear host: I need more memory. */ > +#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2) > +/* Dear host: here are your stats. > + * Followed by 0 or more struct virtio_balloon_statistic structs. */ > +#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3) > + > +/* Host->guest command queue. 
*/ > +/* Followed by s64 of new balloon target size (only negative if > + * VIRTIO_BALLOON_F_EXTRA_MEM). */ > +#define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000) > +/* Ask for statistics */ > +#define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001) > + > +#include <linux/virtio_balloon_legacy.h> > > #endif /* _LINUX_VIRTIO_BALLOON_H */ > diff --git a/include/uapi/linux/virtio_balloon_legacy.h b/include/uapi/linux/virtio_balloon_legacy.h > new file mode 100644 > index 000000000000..cbf77bc1aee3 > --- /dev/null > +++ b/include/uapi/linux/virtio_balloon_legacy.h > @@ -0,0 +1,59 @@ > +#ifndef _LINUX_VIRTIO_BALLOON_LEGACY_H > +#define _LINUX_VIRTIO_BALLOON_LEGACY_H > +/* This header is BSD licensed so anyone can use the definitions to implement > + * compatible drivers/servers. > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions > + * are met: > + * 1. Redistributions of source code must retain the above copyright > + * notice, this list of conditions and the following disclaimer. > + * 2. Redistributions in binary form must reproduce the above copyright > + * notice, this list of conditions and the following disclaimer in the > + * documentation and/or other materials provided with the distribution. > + * 3. Neither the name of IBM nor the names of its contributors > + * may be used to endorse or promote products derived from this software > + * without specific prior written permission. > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE > + * ARE DISCLAIMED. 
IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF > + * SUCH DAMAGE. */ > +#include <linux/virtio_ids.h> > +#include <linux/virtio_config.h> > + > +/* The feature bitmap for virtio balloon */ > +#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > +#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > + > +/* Size of a PFN in the balloon interface. */ > +#define VIRTIO_BALLOON_PFN_SHIFT 12 > + > +struct virtio_balloon_config > +{ > + /* Number of pages host wants Guest to give up. */ > + __le32 num_pages; > + /* Number of pages we've actually got in balloon. 
*/ > + __le32 actual; > +}; > + > +#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ > +#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */ > +#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */ > +#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */ > +#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */ > +#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ > +#define VIRTIO_BALLOON_S_NR 6 > + > +struct virtio_balloon_stat { > + __u16 tag; > + __u64 val; > +} __attribute__((packed)); > + > +#endif /* _LINUX_VIRTIO_BALLOON_LEGACY_H */ > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h > index 284fc3a05f7b..8b5ac0047190 100644 > --- a/include/uapi/linux/virtio_ids.h > +++ b/include/uapi/linux/virtio_ids.h > @@ -33,11 +33,12 @@ > #define VIRTIO_ID_BLOCK 2 /* virtio block */ > #define VIRTIO_ID_CONSOLE 3 /* virtio console */ > #define VIRTIO_ID_RNG 4 /* virtio rng */ > -#define VIRTIO_ID_BALLOON 5 /* virtio balloon */ > +#define VIRTIO_ID_BALLOON 5 /* virtio balloon (legacy) */ > #define VIRTIO_ID_RPMSG 7 /* virtio remote processor messaging */ > #define VIRTIO_ID_SCSI 8 /* virtio scsi */ > #define VIRTIO_ID_9P 9 /* 9p virtio console */ > #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */ > #define VIRTIO_ID_CAIF 12 /* Virtio caif */ > +#define VIRTIO_ID_MEMBALLOON 13 /* virtio balloon */ > > #endif /* _LINUX_VIRTIO_IDS_H */ > > > --------------------------------------------------------------------- > To unsubscribe from this mail list, you must leave the OASIS TC that > generates this mail. Follow this link to all your TCs in OASIS at: > https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php


  • 4.  Re: [virtio] New virtio balloon...

    Posted 01-30-2014 16:47
    On Thu, 30 Jan 2014 12:16:29 +0200
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > Also copy virtio-dev since this is clearly implementation ...
    >
    > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > > Hi,
    > >
    > > I tried to write a new balloon driver; it's completely untested
    > > (as I need to write the device). The protocol is basically two vqs, one
    > > for the guest to send commands, one for the host to send commands.
    > >
    > > Some interesting things come out:
    > > 1) We do need to explicitly tell the host where the page is we want.
    > > This is required for compaction, for example.
    > >
    > > 2) We need to be able to exceed the balloon target, especially for page
    > > migration. Thus there's no mechanism for the device to refuse to
    > > give us the pages.
    > >
    > > 3) The device can offer multiple page sizes, but the driver can only
    > > accept one. I'm not sure if this is useful, as guests are either
    > > huge page backed or not, and returning sub-pages isn't useful.
    > >
    > > Linux demo code follows.
    > >
    > > Cheers,
    > > Rusty.
    >
    > More comments:
    > - for projects like auto-ballooning that Luiz works on,
    > it's not nice that to swap page 1 for page 2
    > you have to inflate then deflate
    > besides overhead this confuses the host:
    > imagine you tell QEMU to increase target,
    > meanwhile guest inflates temporarily,
    > QEMU thinks okay done, now you suddenly deflate.

    Yes. Just to give more context: one of my auto-ballooning versions broke
    when virtballoon_migratepage() ran. The reason was that my host-side code
    in the balloon device did not expect guest-initiated operations, and the
    current spec does imply that all operations are initiated by the host.

    So, first suggestion: if the current spec is still valid, we have to add
    a note there that balloon operations can be initiated by the guest.

    My current code is different, but one thing it does that could also break
    due to guest-initiated inflate/deflate is keeping track of the current
    balloon size. This is done with a counter which is incremented on inflate
    and decremented on deflate. I did that because the device just doesn't
    have this information ('actual' is unreliable; besides, it's only updated
    every 256 pages inflated/deflated).

    Second suggestion: I think we need a reliable way to know the current
    balloon size on the host. My counter does work, btw.
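
    A minimal sketch of such a host-side counter, in plain userspace C. The
    names (struct host_balloon, host_on_give_pages(), host_on_get_pages(),
    host_target_reached()) are hypothetical and are not taken from QEMU or
    from any posted code; the point is only that the counter follows what
    the guest actually sends, whoever initiated it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side bookkeeping; names are illustrative only. */
struct host_balloon {
    uint64_t pages;   /* pages the host believes are in the balloon */
    uint64_t target;  /* last target the host asked the guest for */
};

/* Guest gave n pages (inflate), whether or not the host asked for it. */
static void host_on_give_pages(struct host_balloon *hb, uint64_t n)
{
    assert(hb->pages <= UINT64_MAX - n); /* counter must not wrap */
    hb->pages += n;
}

/*
 * Guest took n pages back (deflate). Returns false, leaving the counter
 * untouched, if the guest claims more pages than the host thinks it holds.
 */
static bool host_on_get_pages(struct host_balloon *hb, uint64_t n)
{
    if (n > hb->pages)
        return false;
    hb->pages -= n;
    return true;
}

/* Meaningful only between commands, not in the middle of a page swap. */
static bool host_target_reached(const struct host_balloon *hb)
{
    return hb->pages == hb->target;
}
```

    With this, a migration swap (one page taken back, one page given) leaves
    the counter where it started, but the value seen between the two
    commands is off by one, so the host should not act on it immediately.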

    As far as the guest is concerned, my current code just informs the host
    that the guest is facing pressure. This is done through a "message" virtqueue,
    but I think this could just use the guest command virtqueue.
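
    As an illustration of what that would look like on the wire, here is a
    small userspace sketch that builds the 8-byte "need memory" command. The
    value 2 is VIRTIO_BALLOON_GCMD_NEEDMEM from the header in Rusty's patch;
    the helper names put_le64() and build_need_mem() are made up for this
    example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command value as defined in the proposed virtio_balloon.h in this thread. */
#define GCMD_NEEDMEM 2ULL

/* Store v as 8 little-endian bytes, independent of the CPU's byte order. */
static void put_le64(uint8_t out[8], uint64_t v)
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Fill out[] with a NEEDMEM command, which is just the le64 type field
 * with no payload. Returns the number of bytes to place on the queue.
 */
static size_t build_need_mem(uint8_t out[8])
{
    put_le64(out, GCMD_NEEDMEM);
    return 8;
}
```

    The driver in the patch would do the equivalent by setting
    vb->gcmd.need_mem.type and calling send_gcmd() with the size of that
    struct.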

    > A couple of other suggestions:
    >
    > - how to accommodate memory pressure in guest?
    > Let's add a field telling host how hard do we
    > want our memory back

    I agree we have to accommodate pressure in the guest some way, but what
    you proposed is more or less related to auto-ballooning.

    My suggestion would be for the host to tell the guest what to do in
    case of pressure. Like, it could tell the guest to just keep trying like
    it does today or it could ask the guest to stop inflation on pressure
    (which would require an ack from the host, which complicates the
    protocol a bit).
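
    A rough sketch of the guest-side decision, assuming the host has already
    communicated one of two policies. Nothing here is defined by the patch or
    the spec: the enum values, struct inflate_state and the three functions
    are hypothetical, and the ack handling is just one possible reading of
    "stop inflation and wait for the host".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy values; the thread does not define an encoding. */
enum pressure_policy {
    PRESSURE_KEEP_TRYING,   /* today's behaviour: sleep, then retry */
    PRESSURE_STOP_INFLATE,  /* stop inflating until the host acks */
};

struct inflate_state {
    enum pressure_policy policy; /* chosen by the host */
    bool stopped;                /* guest stopped, host has not acked yet */
};

/*
 * Called when the guest hits memory pressure while inflating.
 * Returns true if the guest should keep allocating pages for the balloon.
 */
static bool on_pressure(struct inflate_state *st)
{
    assert(st != NULL);
    if (st->policy == PRESSURE_STOP_INFLATE) {
        st->stopped = true;
        return false;
    }
    return true;
}

/* Host acknowledged the stop; inflation may resume on a later target. */
static void on_host_ack(struct inflate_state *st)
{
    st->stopped = false;
}

/* Checked by the balloon thread before each inflate attempt. */
static bool may_inflate(const struct inflate_state *st)
{
    return !st->stopped;
}
```

    The extra state (stopped, plus the ack) is the protocol complication
    mentioned above; the keep-trying policy needs none of it.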

    Also, there are two ways to know the guest is under pressure: 1. when
    alloc_page() fails or 2. use in-kernel vmpressure notification like
    auto-balloon does.

    > - assume you want to over-commit host and start
    > inflating balloon.
    > If low on memory it might be better for guest to
    > wait a bit before inflating.
    > Also, if host asks for a lot of memory a ton of
    > allocations will slow guest significantly.
    > But for guest to do the right thing we need host to tell guest what
    > are its memory and time constraints.
    > Let's add a field telling guest how hard do we
    > want it to give us memory (e.g. time limit)

    I think this is also related to auto-ballooning. Maybe we should start
    with a simple device/driver and add all these features on top.

    >
    >
    >
    > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
    > > index 9076635697bb..1dd45691b618 100644
    > > --- a/drivers/virtio/Makefile
    > > +++ b/drivers/virtio/Makefile
    > > @@ -1,4 +1,4 @@
    > > obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
    > > obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
    > > obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
    > > -obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
    > > +obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o virtio_balloon2.o
    > > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > > new file mode 100644
    > > index 000000000000..93f13e7c561d
    > > --- /dev/null
    > > +++ b/drivers/virtio/virtio_balloon2.c
    > > @@ -0,0 +1,566 @@
    > > +/*
    > > + * Virtio balloon implementation, inspired by Dor Laor and Marcelo
    > > + * Tosatti's implementations.
    > > + *
    > > + * Copyright 2008, 2014 Rusty Russell IBM Corporation
    > > + *
    > > + * This program is free software; you can redistribute it and/or modify
    > > + * it under the terms of the GNU General Public License as published by
    > > + * the Free Software Foundation; either version 2 of the License, or
    > > + * (at your option) any later version.
    > > + *
    > > + * This program is distributed in the hope that it will be useful,
    > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
    > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    > > + * GNU General Public License for more details.
    > > + *
    > > + * You should have received a copy of the GNU General Public License
    > > + * along with this program; if not, write to the Free Software
    > > + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
    > > + */
    > > +
    > > +#include <linux/virtio.h>
    > > +#include <linux/virtio_balloon.h>
    > > +#include <linux/swap.h>
    > > +#include <linux/kthread.h>
    > > +#include <linux/freezer.h>
    > > +#include <linux/delay.h>
    > > +#include <linux/slab.h>
    > > +#include <linux/module.h>
    > > +#include <linux/balloon_compaction.h>
    > > +
    > > +struct gcmd_get_pages {
    > > + __le64 type; /* VIRTIO_BALLOON_GCMD_GET_PAGES */
    > > + __le64 pages[256];
    > > +};
    > > +
    > > +struct gcmd_give_pages {
    > > + __le64 type; /* VIRTIO_BALLOON_GCMD_GIVE_PAGES */
    > > + __le64 pages[256];
    > > +};
    > > +
    > > +struct gcmd_need_mem {
    > > + __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    > > +};
    > > +
    > > +struct gcmd_stats_reply {
    > > + __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    > > + struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    > > +};
    > > +
    > > +struct hcmd_set_balloon {
    > > + __le64 type; /* VIRTIO_BALLOON_HCMD_SET_BALLOON */
    > > + __le64 target;
    > > +};
    > > +
    > > +struct hcmd_get_stats {
    > > + __le64 type; /* VIRTIO_BALLOON_HCMD_GET_STATS */
    > > +};
    > > +
    > > +struct virtio_balloon {
    > > + /* Protects contents of entire structure. */
    > > + struct mutex lock;
    > > +
    > > + struct virtio_device *vdev;
    > > + struct virtqueue *gcmd_vq, *hcmd_vq;
    > > +
    > > + /* The thread servicing the balloon. */
    > > + struct task_struct *thread;
    > > +
    > > + /* For interrupt/suspend to wake balloon thread. */
    > > + wait_queue_head_t wait;
    > > +
    > > + /* How many pages are we supposed to have in balloon? */
    > > + s64 target;
    > > +
    > > + /* How many do we have in the balloon? */
    > > + u64 num_pages;
    > > +
    > > + /* This reminds me of Eeyore. */
    > > + bool broken;
    > > +
    > > + /*
    > > + * The pages we've told the Host we're not using are enqueued
    > > + * at vb_dev_info->pages list.
    > > + */
    > > + struct balloon_dev_info *vb_dev_info;
    > > +
    > > + /* To avoid kmalloc, we use single hcmd and gcmd buffers. */
    > > + union gcmd {
    > > + __le64 type;
    > > + struct gcmd_get_pages get_pages;
    > > + struct gcmd_give_pages give_pages;
    > > + struct gcmd_need_mem need_mem;
    > > + struct gcmd_stats_reply stats_reply;
    > > + } gcmd;
    > > +
    > > + union hcmd {
    > > + __le64 type;
    > > + struct hcmd_set_balloon set_balloon;
    > > + struct hcmd_get_stats get_stats;
    > > + } hcmd;
    > > +};
    > > +
    > > +static struct virtio_device_id id_table[] = {
    > > + { VIRTIO_ID_MEMBALLOON, VIRTIO_DEV_ANY_ID },
    > > + { 0 },
    > > +};
    > > +
    > > +static void wake_balloon(struct virtqueue *vq)
    > > +{
    > > + struct virtio_balloon *vb = vq->vdev->priv;
    > > +
    > > + wake_up(&vb->wait);
    > > +}
    > > +
    > > +/* Command is in vb->gcmd, lock is held. */
    > > +static bool send_gcmd(struct virtio_balloon *vb, size_t len)
    > > +{
    > > + struct scatterlist sg;
    > > +
    > > + BUG_ON(len > sizeof(vb->gcmd));
    > > + sg_init_one(&sg, &vb->gcmd, len);
    > > +
    > > + /*
    > > + * We should always be able to add one buffer to an empty queue.
    > > + * If not, it's a broken device.
    > > + */
    > > + if (virtqueue_add_outbuf(vb->gcmd_vq, &sg, 1, vb, GFP_KERNEL) != 0
    > > + || virtqueue_kick(vb->gcmd_vq) != 0) {
    > > + vb->broken = true;
    > > + return false;
    > > + }
    > > +
    > > + /* When host has read buffer, this completes via wake_balloon */
    > > + wait_event(vb->wait,
    > > + virtqueue_get_buf(vb->gcmd_vq, &len)
    > > + || (vb->broken = virtqueue_is_broken(vb->gcmd_vq)));
    > > + return !vb->broken;
    > > +}
    > > +
    > > +static void give_to_balloon(struct virtio_balloon *vb, u64 num)
    > > +{
    > > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    > > + u64 i;
    > > +
    > > + /* We can only do one array worth at a time. */
    > > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.give_pages.pages));
    > > +
    > > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > > +
    > > + for (i = 0; i < num; i++) {
    > > + struct page *page = balloon_page_enqueue(vb_dev_info);
    > > +
    > > + if (!page) {
    > > + dev_info_ratelimited(&vb->vdev->dev,
    > > + "Out of puff! Can't get page\n");
    > > + /* Sleep for at least 1/5 of a second before retry. */
    > > + msleep(200);
    > > + break;
    > > + }
    > > +
    > > + vb->gcmd.give_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    > > + vb->num_pages++;
    > > + adjust_managed_page_count(page, -1);
    > > + }
    > > +
    > > + /* Did we get any? */
    > > + if (i)
    > > + send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[i]));
    > > +}
    > > +
    > > +static void take_from_balloon(struct virtio_balloon *vb, u64 num)
    > > +{
    > > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
    > > + size_t i;
    > > +
    > > + /* We can only do one array worth at a time. */
    > > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.get_pages.pages));
    > > +
    > > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > > +
    > > + for (i = 0; i < num; i++) {
    > > + struct page *page = balloon_page_dequeue(vb_dev_info);
    > > +
    > > + /* In case we ran out of pages (compaction) */
    > > + if (!page)
    > > + break;
    > > +
    > > + vb->gcmd.get_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT;
    > > + vb->num_pages--;
    > > + }
    > > + num = i;
    > > + if (num)
    > > + send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[num]));
    > > +
    > > + /* Now release those pages. */
    > > + for (i = 0; i < num; i++) {
    > > + struct page *page;
    > > +
    > > + page = pfn_to_page(vb->gcmd.get_pages.pages[i] >> PAGE_SHIFT);
    > > + balloon_page_free(page);
    > > + adjust_managed_page_count(page, 1);
    > > + }
    > > + mutex_unlock(&vb->lock);
    > > +}
    > > +
    > > +static inline void set_stat(struct gcmd_stats_reply *stats, int idx,
    > > + u64 tag, u64 val)
    > > +{
    > > + BUG_ON(idx >= ARRAY_SIZE(stats->stats));
    > > + stats->stats[idx].tag = cpu_to_le64(tag);
    > > + stats->stats[idx].val = cpu_to_le64(val);
    > > +}
    > > +
    > > +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
    > > +
    > > +static void get_stats(struct gcmd_stats_reply *stats)
    > > +{
    > > + unsigned long events[NR_VM_EVENT_ITEMS];
    > > + struct sysinfo i;
    > > + int idx = 0;
    > > +
    > > + all_vm_events(events);
    > > + si_meminfo(&i);
    > > +
    > > + stats->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_STATS_REPLY);
    > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_IN,
    > > + pages_to_bytes(events[PSWPIN]));
    > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
    > > + pages_to_bytes(events[PSWPOUT]));
    > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MAJFLT,
    > > + events[PGMAJFAULT]);
    > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MINFLT,
    > > + events[PGFAULT]);
    > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMFREE,
    > > + pages_to_bytes(i.freeram));
    > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMTOT,
    > > + pages_to_bytes(i.totalram));
    > > +}
    > > +
    > > +static bool move_towards_target(struct virtio_balloon *vb)
    > > +{
    > > + bool moved = false;
    > > +
    > > + if (vb->broken)
    > > + return false;
    > > +
    > > + mutex_lock(&vb->lock);
    > > + if (vb->num_pages < vb->target) {
    > > + give_to_balloon(vb, vb->target - vb->num_pages);
    > > + moved = true;
    > > + } else if (vb->num_pages > vb->target) {
    > > + take_from_balloon(vb, vb->num_pages - vb->target);
    > > + moved = true;
    > > + }
    > > + mutex_unlock(&vb->lock);
    > > + return moved;
    > > +}
    > > +
    > > +static bool process_hcmd(struct virtio_balloon *vb)
    > > +{
    > > + union hcmd *hcmd = NULL;
    > > + unsigned int cmdlen;
    > > + struct scatterlist sg;
    > > +
    > > + if (vb->broken)
    > > + return false;
    > > +
    > > + mutex_lock(&vb->lock);
    > > + hcmd = virtqueue_get_buf(vb->hcmd_vq, &cmdlen);
    > > + if (!hcmd) {
    > > + mutex_unlock(&vb->lock);
    > > + return false;
    > > + }
    > > +
    > > + switch (hcmd->type) {
    > > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_SET_BALLOON):
    > > + vb->target = le64_to_cpu(hcmd->set_balloon.target);
    > > + break;
    > > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_GET_STATS):
    > > + get_stats(&vb->gcmd.stats_reply);
    > > + send_gcmd(vb, sizeof(vb->gcmd.stats_reply));
    > > + break;
    > > + default:
    > > + dev_err_ratelimited(&vb->vdev->dev, "Unknown hcmd %llu\n",
    > > + le64_to_cpu(hcmd->type));
    > > + break;
    > > + }
    > > +
    > > + /* Re-queue the hcmd for next time. */
    > > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd));
    > > + virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL);
    > > +
    > > + mutex_unlock(&vb->lock);
    > > + return true;
    > > +}
    > > +
    > > +static int balloon(void *_vballoon)
    > > +{
    > > + struct virtio_balloon *vb = _vballoon;
    > > +
    > > + set_freezable();
    > > + while (!kthread_should_stop()) {
    > > + try_to_freeze();
    > > +
    > > + wait_event_interruptible(vb->wait,
    > > + kthread_should_stop()
    > > + || freezing(current)
    > > + || process_hcmd(vb)
    > > + || move_towards_target(vb));
    > > + }
    > > + return 0;
    > > +}
    > > +
    > > +static int init_vqs(struct virtio_balloon *vb)
    > > +{
    > > + struct virtqueue *vqs[2];
    > > + vq_callback_t *callbacks[] = { wake_balloon, wake_balloon };
    > > + const char *names[] = { "gcmd", "hcmd" };
    > > + struct scatterlist sg;
    > > + int err;
    > > +
    > > + err = vb->vdev->config->find_vqs(vb->vdev, 2, vqs, callbacks, names);
    > > + if (err)
    > > + return err;
    > > +
    > > + vb->gcmd_vq = vqs[0];
    > > + vb->hcmd_vq = vqs[1];
    > > +
    > > + /*
    > > + * Prime this virtqueue with one buffer so the hypervisor can
    > > + * use it to signal us later (it can't be broken yet!).
    > > + */
    > > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd));
    > > + if (virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL) < 0)
    > > + BUG();
    > > + virtqueue_kick(vb->hcmd_vq);
    > > +
    > > + return 0;
    > > +}
    > > +
    > > +static const struct address_space_operations virtio_balloon_aops;
    > > +#ifdef CONFIG_BALLOON_COMPACTION
    > > +/*
    > > + * virtballoon_migratepage - perform the balloon page migration on behalf of
    > > + * a compaction thread. (called under page lock)
    > > + * @mapping: the page->mapping which will be assigned to the new migrated page.
    > > + * @newpage: page that will replace the isolated page after migration finishes.
    > > + * @page : the isolated (old) page that is about to be migrated to newpage.
    > > + * @mode : compaction mode -- not used for balloon page migration.
    > > + *
    > > + * After a ballooned page gets isolated by compaction procedures, this is the
    > > + * function that performs the page migration on behalf of a compaction thread
    > > + * The page migration for virtio balloon is done in a simple swap fashion which
    > > + * follows these two macro steps:
    > > + * 1) insert newpage into vb->pages list and update the host about it;
    > > + * 2) update the host about the old page removed from vb->pages list;
    > > + *
    > > + * This function performs the balloon page migration task.
    > > + * Called through balloon_mapping->a_ops->migratepage
    > > + */
    > > +static int virtballoon_migratepage(struct address_space *mapping,
    > > + struct page *newpage, struct page *page, enum migrate_mode mode)
    > > +{
    > > + struct balloon_dev_info *vb_dev_info = balloon_page_device(page);
    > > + struct virtio_balloon *vb;
    > > + unsigned long flags;
    > > + int err;
    > > +
    > > + BUG_ON(!vb_dev_info);
    > > +
    > > + vb = vb_dev_info->balloon_device;
    > > +
    > > + /*
    > > + * In order to avoid lock contention while migrating pages concurrently
    > > + * to leak_balloon() or fill_balloon() we just give up the balloon_lock
    > > + * this turn, as it is easier to retry the page migration later.
    > > + * This also prevents fill_balloon() getting stuck into a mutex
    > > + * recursion in the case it ends up triggering memory compaction
    > > + * while it is attempting to inflate the balloon.
    > > + */
    > > + if (!mutex_trylock(&vb->lock))
    > > + return -EAGAIN;
    > > +
    > > + /* Try to get the page out of the balloon. */
    > > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > > + vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT;
    > > + if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) {
    > > + err = -EIO;
    > > + goto unlock;
    > > + }
    > > +
    > > + /* Now put newpage into balloon. */
    > > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > > + vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT;
    > > + if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) {
    > > + /* We leak a page here, but only happens if balloon broken. */
    > > + err = -EIO;
    > > + goto unlock;
    > > + }
    > > +
    > > + spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
    > > + balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
    > > + vb_dev_info->isolated_pages--;
    > > + spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
    > > +
    > > + /*
    > > + * It's safe to delete page->lru here because this page is at
    > > + * an isolated migration list, and this step is expected to happen here
    > > + */
    > > + balloon_page_delete(page);
    > > + err = MIGRATEPAGE_BALLOON_SUCCESS;
    > > +
    > > +unlock:
    > > + mutex_unlock(&vb->lock);
    > > + return err;
    > > +}
    > > +
    > > +/* define the balloon_mapping->a_ops callback to allow balloon page migration */
    > > +static const struct address_space_operations virtio_balloon_aops = {
    > > + .migratepage = virtballoon_migratepage,
    > > +};
    > > +#endif /* CONFIG_BALLOON_COMPACTION */
    > > +
    > > +static int virtballoon_probe(struct virtio_device *vdev)
    > > +{
    > > + struct virtio_balloon *vb;
    > > + struct address_space *vb_mapping;
    > > + struct balloon_dev_info *vb_devinfo;
    > > + __le64 v;
    > > + int err;
    > > +
    > > + virtio_cread(vdev, struct virtio_balloon_config_space, pagesizes, &v);
    > > + /* FIXME: Support large pages. */
    > > + if (!(le64_to_cpu(v) & PAGE_SIZE)) {
    > > + dev_warn(&vdev->dev, "Unacceptable pagesize %llu\n",
    > > + (long long)le64_to_cpu(v));
    > > + err = -EINVAL;
    > > + goto out;
    > > + }
    > > + v = cpu_to_le64(PAGE_SIZE);
    > > + virtio_cwrite(vdev, struct virtio_balloon_config_space, page_size, &v);
    > > +
    > > + vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
    > > + if (!vb) {
    > > + err = -ENOMEM;
    > > + goto out;
    > > + }
    > > +
    > > + vb->target = 0;
    > > + vb->num_pages = 0;
    > > + mutex_init(&vb->lock);
    > > + init_waitqueue_head(&vb->wait);
    > > + vb->vdev = vdev;
    > > +
    > > + vb_devinfo = balloon_devinfo_alloc(vb);
    > > + if (IS_ERR(vb_devinfo)) {
    > > + err = PTR_ERR(vb_devinfo);
    > > + goto out_free_vb;
    > > + }
    > > +
    > > + vb_mapping = balloon_mapping_alloc(vb_devinfo,
    > > + (balloon_compaction_check()) ?
    > > + &virtio_balloon_aops : NULL);
    > > + if (IS_ERR(vb_mapping)) {
    > > + /*
    > > + * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP
    > > + * This means !CONFIG_BALLOON_COMPACTION, otherwise we get off.
    > > + */
    > > + err = PTR_ERR(vb_mapping);
    > > + if (err != -EOPNOTSUPP)
    > > + goto out_free_vb_devinfo;
    > > + }
    > > +
    > > + vb->vb_dev_info = vb_devinfo;
    > > +
    > > + err = init_vqs(vb);
    > > + if (err)
    > > + goto out_free_vb_mapping;
    > > +
    > > + vb->thread = kthread_run(balloon, vb, "vballoon");
    > > + if (IS_ERR(vb->thread)) {
    > > + err = PTR_ERR(vb->thread);
    > > + goto out_del_vqs;
    > > + }
    > > +
    > > + return 0;
    > > +
    > > +out_del_vqs:
    > > + vdev->config->del_vqs(vdev);
    > > +out_free_vb_mapping:
    > > + balloon_mapping_free(vb_mapping);
    > > +out_free_vb_devinfo:
    > > + balloon_devinfo_free(vb_devinfo);
    > > +out_free_vb:
    > > + kfree(vb);
    > > +out:
    > > + return err;
    > > +}
    > > +
    > > +/* FIXME: Leave pages alone during suspend, rather than taking them
    > > + * all back! */
    > > +static void remove_common(struct virtio_balloon *vb)
    > > +{
    > > + /* There might be pages left in the balloon: free them. */
    > > + while (vb->num_pages)
    > > + take_from_balloon(vb, vb->num_pages);
    > > +
    > > + /* Now we reset the device so we can clean up the queues. */
    > > + vb->vdev->config->reset(vb->vdev);
    > > + vb->vdev->config->del_vqs(vb->vdev);
    > > +}
    > > +
    > > +static void virtballoon_remove(struct virtio_device *vdev)
    > > +{
    > > + struct virtio_balloon *vb = vdev->priv;
    > > +
    > > + kthread_stop(vb->thread);
    > > + remove_common(vb);
    > > + balloon_mapping_free(vb->vb_dev_info->mapping);
    > > + balloon_devinfo_free(vb->vb_dev_info);
    > > + kfree(vb);
    > > +}
    > > +
    > > +#ifdef CONFIG_PM_SLEEP
    > > +static int virtballoon_freeze(struct virtio_device *vdev)
    > > +{
    > > + struct virtio_balloon *vb = vdev->priv;
    > > +
    > > + /*
    > > + * The kthread is already frozen by the PM core before this
    > > + * function is called.
    > > + */
    > > +
    > > + remove_common(vb);
    > > + return 0;
    > > +}
    > > +
    > > +static int virtballoon_restore(struct virtio_device *vdev)
    > > +{
    > > + return init_vqs(vdev->priv);
    > > +}
    > > +#endif
    > > +
    > > +static unsigned int features[] = {
    > > + /* FIXME: Support VIRTIO_BALLOON_F_EXTRA_MEM! */
    > > +};
    > > +
    > > +static struct virtio_driver virtio_balloon_driver = {
    > > + .feature_table = features,
    > > + .feature_table_size = ARRAY_SIZE(features),
    > > + .driver.name = KBUILD_MODNAME,
    > > + .driver.owner = THIS_MODULE,
    > > + .id_table = id_table,
    > > + .probe = virtballoon_probe,
    > > + .remove = virtballoon_remove,
    > > +#ifdef CONFIG_PM_SLEEP
    > > + .freeze = virtballoon_freeze,
    > > + .restore = virtballoon_restore,
    > > +#endif
    > > +};
    > > +
    > > +module_virtio_driver(virtio_balloon_driver);
    > > +MODULE_DEVICE_TABLE(virtio, id_table);
    > > +MODULE_DESCRIPTION("Virtio balloon driver");
    > > +MODULE_LICENSE("GPL");
    > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
    > > index 5e26f61b5df5..cdca2934668a 100644
    > > --- a/include/uapi/linux/virtio_balloon.h
    > > +++ b/include/uapi/linux/virtio_balloon.h
    > > @@ -28,32 +28,45 @@
    > > #include <linux/virtio_ids.h>
    > > #include <linux/virtio_config.h>
    > >
    > > -/* The feature bitmap for virtio balloon */
    > > -#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
    > > -#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
    > > -
    > > -/* Size of a PFN in the balloon interface. */
    > > -#define VIRTIO_BALLOON_PFN_SHIFT 12
    > > -
    > > -struct virtio_balloon_config
    > > -{
    > > - /* Number of pages host wants Guest to give up. */
    > > - __le32 num_pages;
    > > - /* Number of pages we've actually got in balloon. */
    > > - __le32 actual;
    > > +/* This means the balloon can go negative (ie. add memory to system) */
    > > +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    > > +
    > > +struct virtio_balloon_config_space {
    > > + /* Set by device: bits indicate what page sizes supported. */
    > > + __le64 pagesizes;
    > > + /* Set by driver: only a single bit is set! */
    > > + __le64 page_size;
    > > +
    > > + /* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    > > + __le64 extra_mem_start;
    > > + __le64 extra_mem_end;
    > > +};
    > > +
    > > +struct virtio_balloon_statistic {
    > > + __le64 tag; /* VIRTIO_BALLOON_S_* */
    > > + __le64 val;
    > > };
    > >
    > > -#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
    > > -#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */
    > > -#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */
    > > -#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */
    > > -#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */
    > > -#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */
    > > -#define VIRTIO_BALLOON_S_NR 6
    > > -
    > > -struct virtio_balloon_stat {
    > > - __u16 tag;
    > > - __u64 val;
    > > -} __attribute__((packed));
    > > +/* Guest->host command queue. */
    > > +/* Ask the host for more pages.
    > > + Followed by array of 1 or more readable le64 pageaddr's. */
    > > +#define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0)
    > > +/* Give the host more pages.
    > > + Followed by array of 1 or more readable le64 pageaddr's */
    > > +#define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1)
    > > +/* Dear host: I need more memory. */
    > > +#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2)
    > > +/* Dear host: here are your stats.
    > > + * Followed by 0 or more struct virtio_balloon_statistic structs. */
    > > +#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3)
    > > +
    > > +/* Host->guest command queue. */
    > > +/* Followed by s64 of new balloon target size (only negative if
    > > + * VIRTIO_BALLOON_F_EXTRA_MEM). */
    > > +#define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000)
    > > +/* Ask for statistics */
    > > +#define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001)
    > > +
    > > +#include <linux/virtio_balloon_legacy.h>
    > >
    > > #endif /* _LINUX_VIRTIO_BALLOON_H */
    > > diff --git a/include/uapi/linux/virtio_balloon_legacy.h b/include/uapi/linux/virtio_balloon_legacy.h
    > > new file mode 100644
    > > index 000000000000..cbf77bc1aee3
    > > --- /dev/null
    > > +++ b/include/uapi/linux/virtio_balloon_legacy.h
    > > @@ -0,0 +1,59 @@
    > > +#ifndef _LINUX_VIRTIO_BALLOON_LEGACY_H
    > > +#define _LINUX_VIRTIO_BALLOON_LEGACY_H
    > > +/* This header is BSD licensed so anyone can use the definitions to implement
    > > + * compatible drivers/servers.
    > > + *
    > > + * Redistribution and use in source and binary forms, with or without
    > > + * modification, are permitted provided that the following conditions
    > > + * are met:
    > > + * 1. Redistributions of source code must retain the above copyright
    > > + * notice, this list of conditions and the following disclaimer.
    > > + * 2. Redistributions in binary form must reproduce the above copyright
    > > + * notice, this list of conditions and the following disclaimer in the
    > > + * documentation and/or other materials provided with the distribution.
    > > + * 3. Neither the name of IBM nor the names of its contributors
    > > + * may be used to endorse or promote products derived from this software
    > > + * without specific prior written permission.
    > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
    > > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    > > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
    > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    > > + * SUCH DAMAGE. */
    > > +#include <linux/virtio_ids.h>
    > > +#include <linux/virtio_config.h>
    > > +
    > > +/* The feature bitmap for virtio balloon */
    > > +#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
    > > +#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
    > > +
    > > +/* Size of a PFN in the balloon interface. */
    > > +#define VIRTIO_BALLOON_PFN_SHIFT 12
    > > +
    > > +struct virtio_balloon_config
    > > +{
    > > + /* Number of pages host wants Guest to give up. */
    > > + __le32 num_pages;
    > > + /* Number of pages we've actually got in balloon. */
    > > + __le32 actual;
    > > +};
    > > +
    > > +#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
    > > +#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */
    > > +#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */
    > > +#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */
    > > +#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */
    > > +#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */
    > > +#define VIRTIO_BALLOON_S_NR 6
    > > +
    > > +struct virtio_balloon_stat {
    > > + __u16 tag;
    > > + __u64 val;
    > > +} __attribute__((packed));
    > > +
    > > +#endif /* _LINUX_VIRTIO_BALLOON_LEGACY_H */
    > > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
    > > index 284fc3a05f7b..8b5ac0047190 100644
    > > --- a/include/uapi/linux/virtio_ids.h
    > > +++ b/include/uapi/linux/virtio_ids.h
    > > @@ -33,11 +33,12 @@
    > > #define VIRTIO_ID_BLOCK 2 /* virtio block */
    > > #define VIRTIO_ID_CONSOLE 3 /* virtio console */
    > > #define VIRTIO_ID_RNG 4 /* virtio rng */
    > > -#define VIRTIO_ID_BALLOON 5 /* virtio balloon */
    > > +#define VIRTIO_ID_BALLOON 5 /* virtio balloon (legacy) */
    > > #define VIRTIO_ID_RPMSG 7 /* virtio remote processor messaging */
    > > #define VIRTIO_ID_SCSI 8 /* virtio scsi */
    > > #define VIRTIO_ID_9P 9 /* 9p virtio console */
    > > #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
    > > #define VIRTIO_ID_CAIF 12 /* Virtio caif */
    > > +#define VIRTIO_ID_MEMBALLOON 13 /* virtio balloon */
    > >
    > > #endif /* _LINUX_VIRTIO_IDS_H */
    > >
    > >
    >
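The guest command layout in the quoted header (an le64 type followed by one or more le64 page addresses) can be illustrated with a small user-space sketch. It is not part of the patch: `put_le64()`, `encode_give_pages()` and `SKETCH_PAGE_SHIFT` are hypothetical names, 4 KiB pages are assumed, and only the command value and the 256-entry limit are taken from the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GCMD_GIVE_PAGES   1u    /* value from the proposed header */
#define GCMD_MAX_PAGES    256u  /* array size used by the demo driver */
#define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/* Store v at p as 8 little-endian bytes, independent of host endianness. */
static void put_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Encode a GIVE_PAGES command for n page frame numbers into buf.
 * Returns the number of bytes to place on the virtqueue: 8 bytes of
 * type plus 8 bytes per page, which is the length the driver computes
 * with offsetof(struct gcmd_give_pages, pages[n]).
 */
static size_t encode_give_pages(uint8_t *buf, const uint64_t *pfns, size_t n)
{
    assert(n >= 1 && n <= GCMD_MAX_PAGES);
    put_le64(buf, GCMD_GIVE_PAGES);
    for (size_t i = 0; i < n; i++)
        put_le64(buf + 8 + 8 * i, pfns[i] << SKETCH_PAGE_SHIFT);
    return 8 + 8 * n;
}
```

The sketch converts each address to little-endian explicitly. The demo driver assigns `page_to_pfn(page) << PAGE_SHIFT` straight into an `__le64` field, which would presumably need a `cpu_to_le64()` on a big-endian guest.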




  • 5.  Re: [virtio] New virtio balloon...

    Posted 01-30-2014 16:47
    On Thu, 30 Jan 2014 12:16:29 +0200
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > Also copy virtio-dev since this is clearly implementation ...
    >
    > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > > Hi,
    > >
    > > I tried to write a new balloon driver; it's completely untested
    > > (as I need to write the device). The protocol is basically two vqs, one
    > > for the guest to send commands, one for the host to send commands.
    > >
    > > Some interesting things come out:
    > > 1) We do need to explicitly tell the host where the page is we want.
    > > This is required for compaction, for example.
    > >
    > > 2) We need to be able to exceed the balloon target, especially for page
    > > migration. Thus there's no mechanism for the device to refuse to
    > > give us the pages.
    > >
    > > 3) The device can offer multiple page sizes, but the driver can only
    > > accept one. I'm not sure if this is useful, as guests are either
    > > huge page backed or not, and returning sub-pages isn't useful.
    > >
    > > Linux demo code follows.
    > >
    > > Cheers,
    > > Rusty.
    >
    > More comments:
    > - for projects like auto-ballooning that Luiz works on,
    > it's not nice that to swap page 1 for page 2
    > you have to inflate then deflate
    > besides overhead this confuses the host:
    > imagine you tell QEMU to increase target,
    > meanwhile guest inflates temporarily,
    > QEMU thinks okay done, now you suddenly deflate.

    Yes. Just to give more context: one of my auto-ballooning versions broke
    when virtballoon_migratepage() ran. The reason was that my host-side code
    in the balloon device did not expect guest-initiated operations. And the
    current spec does imply that all operations are initiated by the host.

    So, first suggestion: if the current spec is still valid, we have to add
    a note there that balloon operations can be initiated by the guest.

    My current code is different, but something it does that could also break
    due to guest-initiated inflate/deflate is that it keeps track of the
    current balloon size. This is done by a counter which is incremented on
    inflate and decremented on deflate. I did that because the device just
    doesn't have this information ('actual' is unreliable, besides it's only
    updated every 256 pages inflated/deflated).

    Second suggestion: I think we need a reliable way to know the current
    balloon size on the host. My counter does work, btw.

    As far as the guest is concerned, my current code just informs the host
    that the guest is facing pressure. This is done through a "message"
    virtqueue, but I think this could just use the guest command virtqueue.

    > A couple of other suggestions:
    >
    > - how to accommodate memory pressure in guest?
    > Let's add a field telling host how hard do we
    > want our memory back

    I agree we have to accommodate pressure in the guest some way, but what
    you proposed is more or less related to auto-ballooning.

    My suggestion would be for the host to tell the guest what to do in case
    of pressure. Like, it could tell the guest to just keep trying like it
    does today, or it could ask the guest to stop inflation on pressure (which
    would require an ack from the host, which complicates the protocol a bit).

    Also, there are two ways to know the guest is under pressure:

    1. when alloc_page() fails, or
    2. use in-kernel vmpressure notification like auto-balloon does.

    > - assume you want to over-commit host and start
    > inflating balloon.
    > If low on memory it might be better for guest to
    > wait a bit before inflating.
    > Also, if host asks for a lot of memory a ton of
    > allocations will slow guest significantly.
    > But for guest to do the right thing we need host to tell guest what
    > are its memory and time constraints.
    > Let's add a field telling guest how hard do we
    > want it to give us memory (e.g. time limit)

    I think this is also related to auto-ballooning. Maybe we should start
    with a simple device/driver and add all these features on top.
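The host-side counter described in this message (incremented on inflate, decremented on deflate) can be sketched in plain C as follows. This is only an illustration under assumptions: `struct host_balloon` and `host_account_gcmd()` are hypothetical names, not part of the patch or of QEMU, and the two command values are copied from the proposed header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command values copied from the proposed virtio_balloon.h. */
#define GCMD_GET_PAGES  0u  /* guest takes pages back (deflate) */
#define GCMD_GIVE_PAGES 1u  /* guest hands pages over (inflate) */

/* Hypothetical host-side state; not part of the patch. */
struct host_balloon {
    uint64_t actual_pages;  /* pages currently held by the balloon */
};

/*
 * Account for one guest command that carried npages page addresses.
 * Returns 0 on success, -1 for an unknown command type or for a
 * deflate that would take the counter below zero.
 */
static int host_account_gcmd(struct host_balloon *hb, uint64_t type,
                             uint64_t npages)
{
    assert(hb != NULL);
    switch (type) {
    case GCMD_GIVE_PAGES:
        hb->actual_pages += npages;
        return 0;
    case GCMD_GET_PAGES:
        if (npages > hb->actual_pages)
            return -1;
        hb->actual_pages -= npages;
        return 0;
    default:
        return -1;
    }
}
```

Because the counter is driven by the commands themselves, a guest-initiated page migration (one GET_PAGES followed by one GIVE_PAGES) leaves it unchanged, whoever started the operation.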
> > > > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile > > index 9076635697bb..1dd45691b618 100644 > > --- a/drivers/virtio/Makefile > > +++ b/drivers/virtio/Makefile > > @@ -1,4 +1,4 @@ > > obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o > > obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o > > obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o > > -obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o > > +obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o virtio_balloon2.o > > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c > > new file mode 100644 > > index 000000000000..93f13e7c561d > > --- /dev/null > > +++ b/drivers/virtio/virtio_balloon2.c > > @@ -0,0 +1,566 @@ > > +/* > > + * Virtio balloon implementation, inspired by Dor Laor and Marcelo > > + * Tosatti's implementations. > > + * > > + * Copyright 2008, 2014 Rusty Russell IBM Corporation > > + * > > + * This program is free software; you can redistribute it and/or modify > > + * it under the terms of the GNU General Public License as published by > > + * the Free Software Foundation; either version 2 of the License, or > > + * (at your option) any later version. > > + * > > + * This program is distributed in the hope that it will be useful, > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > + * GNU General Public License for more details. 
> > + * > > + * You should have received a copy of the GNU General Public License > > + * along with this program; if not, write to the Free Software > > + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > > + */ > > + > > +#include <linux/virtio.h> > > +#include <linux/virtio_balloon.h> > > +#include <linux/swap.h> > > +#include <linux/kthread.h> > > +#include <linux/freezer.h> > > +#include <linux/delay.h> > > +#include <linux/slab.h> > > +#include <linux/module.h> > > +#include <linux/balloon_compaction.h> > > + > > +struct gcmd_get_pages { > > + __le64 type; /* VIRTIO_BALLOON_GCMD_GET_PAGES */ > > + __le64 pages[256]; > > +}; > > + > > +struct gcmd_give_pages { > > + __le64 type; /* VIRTIO_BALLOON_GCMD_GIVE_PAGES */ > > + __le64 pages[256]; > > +}; > > + > > +struct gcmd_need_mem { > > + __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */ > > +}; > > + > > +struct gcmd_stats_reply { > > + __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */ > > + struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR]; > > +}; > > + > > +struct hcmd_set_balloon { > > + __le64 type; /* VIRTIO_BALLOON_HCMD_SET_BALLOON */ > > + __le64 target; > > +}; > > + > > +struct hcmd_get_stats { > > + __le64 type; /* VIRTIO_BALLOON_HCMD_GET_STATS */ > > +}; > > + > > +struct virtio_balloon { > > + /* Protects contents of entire structure. */ > > + struct mutex lock; > > + > > + struct virtio_device *vdev; > > + struct virtqueue *gcmd_vq, *hcmd_vq; > > + > > + /* The thread servicing the balloon. */ > > + struct task_struct *thread; > > + > > + /* For interrupt/suspend to wake balloon thread. */ > > + wait_queue_head_t wait; > > + > > + /* How many pages are we supposed to have in balloon? */ > > + s64 target; > > + > > + /* How many do we have in the balloon? */ > > + u64 num_pages; > > + > > + /* This reminds me of Eeyore. */ > > + bool broken; > > + > > + /* > > + * The pages we've told the Host we're not using are enqueued > > + * at vb_dev_info->pages list. 
> > + */ > > + struct balloon_dev_info *vb_dev_info; > > + > > + /* To avoid kmalloc, we use single hcmd and gcmd buffers. */ > > + union gcmd { > > + __le64 type; > > + struct gcmd_get_pages get_pages; > > + struct gcmd_give_pages give_pages; > > + struct gcmd_need_mem need_mem; > > + struct gcmd_stats_reply stats_reply; > > + } gcmd; > > + > > + union hcmd { > > + __le64 type; > > + struct hcmd_set_balloon set_balloon; > > + struct hcmd_get_stats get_stats; > > + } hcmd; > > +}; > > + > > +static struct virtio_device_id id_table[] = { > > + { VIRTIO_ID_MEMBALLOON, VIRTIO_DEV_ANY_ID }, > > + { 0 }, > > +}; > > + > > +static void wake_balloon(struct virtqueue *vq) > > +{ > > + struct virtio_balloon *vb = vq->vdev->priv; > > + > > + wake_up(&vb->wait); > > +} > > + > > +/* Command is in vb->gcmd, lock is held. */ > > +static bool send_gcmd(struct virtio_balloon *vb, size_t len) > > +{ > > + struct scatterlist sg; > > + > > + BUG_ON(len > sizeof(vb->gcmd)); > > + sg_init_one(&sg, &vb->gcmd, len); > > + > > + /* > > + * We should always be able to add one buffer to an empty queue. > > + * If not, it's a broken device. > > + */ > > + if (virtqueue_add_outbuf(vb->gcmd_vq, &sg, 1, vb, GFP_KERNEL) != 0 > > + virtqueue_kick(vb->gcmd_vq) != 0) { > > + vb->broken = true; > > + return false; > > + } > > + > > + /* When host has read buffer, this completes via wake_balloon */ > > + wait_event(vb->wait, > > + virtqueue_get_buf(vb->gcmd_vq, &len) > > + (vb->broken = virtqueue_is_broken(vb->gcmd_vq))); > > + return !vb->broken; > > +} > > + > > +static void give_to_balloon(struct virtio_balloon *vb, u64 num) > > +{ > > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info; > > + u64 i; > > + > > + /* We can only do one array worth at a time. 
*/ > > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.give_pages.pages)); > > + > > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES); > > + > > + for (i = 0; i < num; i++) { > > + struct page *page = balloon_page_enqueue(vb_dev_info); > > + > > + if (!page) { > > + dev_info_ratelimited(&vb->vdev->dev, > > + "Out of puff! Can't get page
    "); > > + /* Sleep for at least 1/5 of a second before retry. */ > > + msleep(200); > > + break; > > + } > > + > > + vb->gcmd.give_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT; > > + vb->num_pages++; > > + adjust_managed_page_count(page, -1); > > + } > > + > > + /* Did we get any? */ > > + if (i) > > + send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[i])); > > +} > > + > > +static void take_from_balloon(struct virtio_balloon *vb, u64 num) > > +{ > > + struct balloon_dev_info *vb_dev_info = vb->vb_dev_info; > > + size_t i; > > + > > + /* We can only do one array worth at a time. */ > > + num = min_t(u64, num, ARRAY_SIZE(vb->gcmd.get_pages.pages)); > > + > > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES); > > + > > + for (i = 0; i < num; i++) { > > + struct page *page = balloon_page_dequeue(vb_dev_info); > > + > > + /* In case we ran out of pages (compaction) */ > > + if (!page) > > + break; > > + > > + vb->gcmd.get_pages.pages[i] = page_to_pfn(page) << PAGE_SHIFT; > > + vb->num_pages--; > > + } > > + num = i; > > + if (num) > > + send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[num])); > > + > > + /* Now release those pages. 
*/ > > + for (i = 0; i < num; i++) { > > + struct page *page; > > + > > + page = pfn_to_page(vb->gcmd.get_pages.pages[i] >> PAGE_SHIFT); > > + balloon_page_free(page); > > + adjust_managed_page_count(page, 1); > > + } > > + mutex_unlock(&vb->lock); > > +} > > + > > +static inline void set_stat(struct gcmd_stats_reply *stats, int idx, > > + u64 tag, u64 val) > > +{ > > + BUG_ON(idx >= ARRAY_SIZE(stats->stats)); > > + stats->stats[idx].tag = cpu_to_le64(tag); > > + stats->stats[idx].val = cpu_to_le64(val); > > +} > > + > > +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT) > > + > > +static void get_stats(struct gcmd_stats_reply *stats) > > +{ > > + unsigned long events[NR_VM_EVENT_ITEMS]; > > + struct sysinfo i; > > + int idx = 0; > > + > > + all_vm_events(events); > > + si_meminfo(&i); > > + > > + stats->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_STATS_REPLY); > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_IN, > > + pages_to_bytes(events[PSWPIN])); > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_OUT, > > + pages_to_bytes(events[PSWPOUT])); > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MAJFLT, > > + events[PGMAJFAULT]); > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MINFLT, > > + events[PGFAULT]); > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMFREE, > > + pages_to_bytes(i.freeram)); > > + set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMTOT, > > + pages_to_bytes(i.totalram)); > > +} > > + > > +static bool move_towards_target(struct virtio_balloon *vb) > > +{ > > + bool moved = false; > > + > > + if (vb->broken) > > + return false; > > + > > + mutex_lock(&vb->lock); > > + if (vb->num_pages < vb->target) { > > + give_to_balloon(vb, vb->target - vb->num_pages); > > + moved = true; > > + } else if (vb->num_pages > vb->target) { > > + take_from_balloon(vb, vb->num_pages - vb->target); > > + moved = true; > > + } > > + mutex_unlock(&vb->lock); > > + return moved; > > +} > > + > > +static bool process_hcmd(struct virtio_balloon *vb) > > +{ > > + union hcmd *hcmd = NULL; > > 
+ unsigned int cmdlen; > > + struct scatterlist sg; > > + > > + if (vb->broken) > > + return false; > > + > > + mutex_lock(&vb->lock); > > + hcmd = virtqueue_get_buf(vb->hcmd_vq, &cmdlen); > > + if (!hcmd) { > > + mutex_unlock(&vb->lock); > > + return false; > > + } > > + > > + switch (hcmd->type) { > > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_SET_BALLOON): > > + vb->target = le64_to_cpu(hcmd->set_balloon.target); > > + break; > > + case cpu_to_le64(VIRTIO_BALLOON_HCMD_GET_STATS): > > + get_stats(&vb->gcmd.stats_reply); > > + send_gcmd(vb, sizeof(vb->gcmd.stats_reply)); > > + break; > > + default: > > + dev_err_ratelimited(&vb->vdev->dev, "Unknown hcmd %llu\n", > > 
+ le64_to_cpu(hcmd->type)); > > + break; > > + } > > + > > + /* Re-queue the hcmd for next time. */ > > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd)); > > + virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL); > > + > > + mutex_unlock(&vb->lock); > > + return true; > > +} > > + > > +static int balloon(void *_vballoon) > > +{ > > + struct virtio_balloon *vb = _vballoon; > > + > > + set_freezable(); > > + while (!kthread_should_stop()) { > > + try_to_freeze(); > > + > > + wait_event_interruptible(vb->wait, > > + kthread_should_stop() > > + || freezing(current) > > + || process_hcmd(vb) > > + || move_towards_target(vb)); > > + } > > + return 0; > > +} > > + > > +static int init_vqs(struct virtio_balloon *vb) > > +{ > > + struct virtqueue *vqs[2]; > > + vq_callback_t *callbacks[] = { wake_balloon, wake_balloon }; > > + const char *names[] = { "gcmd", "hcmd" }; > > + struct scatterlist sg; > > + int err; > > + > > + err = vb->vdev->config->find_vqs(vb->vdev, 2, vqs, callbacks, names); > > + if (err) > > + return err; > > + > > + vb->gcmd_vq = vqs[0]; > > + vb->hcmd_vq = vqs[1]; > > + > > + /* > > + * Prime this virtqueue with one buffer so the hypervisor can > > + * use it to signal us later (it can't be broken yet!). > > + */ > > + sg_init_one(&sg, &vb->hcmd, sizeof(vb->hcmd)); > > + if (virtqueue_add_inbuf(vb->hcmd_vq, &sg, 1, vb, GFP_KERNEL) < 0) > > + BUG(); > > + virtqueue_kick(vb->hcmd_vq); > > + > > + return 0; > > +} > > + > > +static const struct address_space_operations virtio_balloon_aops; > > +#ifdef CONFIG_BALLOON_COMPACTION > > +/* > > + * virtballoon_migratepage - perform the balloon page migration on behalf of > > + * a compation thread. (called under page lock) > > + * @mapping: the page->mapping which will be assigned to the new migrated page. > > + * @newpage: page that will replace the isolated page after migration finishes. > > + * @page : the isolated (old) page that is about to be migrated to newpage. 
> > + * @mode : compaction mode -- not used for balloon page migration. > > + * > > + * After a ballooned page gets isolated by compaction procedures, this is the > > + * function that performs the page migration on behalf of a compaction thread > > + * The page migration for virtio balloon is done in a simple swap fashion which > > + * follows these two macro steps: > > + * 1) insert newpage into vb->pages list and update the host about it; > > + * 2) update the host about the old page removed from vb->pages list; > > + * > > + * This function preforms the balloon page migration task. > > + * Called through balloon_mapping->a_ops->migratepage > > + */ > > +static int virtballoon_migratepage(struct address_space *mapping, > > + struct page *newpage, struct page *page, enum migrate_mode mode) > > +{ > > + struct balloon_dev_info *vb_dev_info = balloon_page_device(page); > > + struct virtio_balloon *vb; > > + unsigned long flags; > > + int err; > > + > > + BUG_ON(!vb_dev_info); > > + > > + vb = vb_dev_info->balloon_device; > > + > > + /* > > + * In order to avoid lock contention while migrating pages concurrently > > + * to leak_balloon() or fill_balloon() we just give up the balloon_lock > > + * this turn, as it is easier to retry the page migration later. > > + * This also prevents fill_balloon() getting stuck into a mutex > > + * recursion in the case it ends up triggering memory compaction > > + * while it is attempting to inflate the ballon. > > + */ > > + if (!mutex_trylock(&vb->lock)) > > + return -EAGAIN; > > + > > + /* Try to get the page out of the balloon. */ > > + vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES); > > + vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT; > > + if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) { > > + err = -EIO; > > + goto unlock; > > + } > > + > > + /* Now put newpage into balloon. 
*/ > > + vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES); > > + vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT; > > + if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) { > > + /* We leak a page here, but only happens if balloon broken. */ > > + err = -EIO; > > + goto unlock; > > + } > > + > > + spin_lock_irqsave(&vb_dev_info->pages_lock, flags); > > + balloon_page_insert(newpage, mapping, &vb_dev_info->pages); > > + vb_dev_info->isolated_pages--; > > + spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); > > + > > + /* > > + * It's safe to delete page->lru here because this page is at > > + * an isolated migration list, and this step is expected to happen here > > + */ > > + balloon_page_delete(page); > > + err = MIGRATEPAGE_BALLOON_SUCCESS; > > + > > +unlock: > > + mutex_unlock(&vb->lock); > > + return err; > > +} > > + > > +/* define the balloon_mapping->a_ops callback to allow balloon page migration */ > > +static const struct address_space_operations virtio_balloon_aops = { > > + .migratepage = virtballoon_migratepage, > > +}; > > +#endif /* CONFIG_BALLOON_COMPACTION */ > > + > > +static int virtballoon_probe(struct virtio_device *vdev) > > +{ > > + struct virtio_balloon *vb; > > + struct address_space *vb_mapping; > > + struct balloon_dev_info *vb_devinfo; > > + __le64 v; > > + int err; > > + > > + virtio_cread(vdev, struct virtio_balloon_config_space, pagesizes, &v); > > + /* FIXME: Support large pages. 
*/ > > + if (!(le64_to_cpu(v) & PAGE_SIZE)) { > > + dev_warn(&vdev->dev, "Unacceptable pagesize %llu\n", > > 
+ (long long)le64_to_cpu(v)); > > + err = -EINVAL; > > + goto out; > > + } > > + v = cpu_to_le64(PAGE_SIZE); > > + virtio_cwrite(vdev, struct virtio_balloon_config_space, page_size, &v); > > + > > + vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL); > > + if (!vb) { > > + err = -ENOMEM; > > + goto out; > > + } > > + > > + vb->target = 0; > > + vb->num_pages = 0; > > + mutex_init(&vb->lock); > > + init_waitqueue_head(&vb->wait); > > + vb->vdev = vdev; > > + > > + vb_devinfo = balloon_devinfo_alloc(vb); > > + if (IS_ERR(vb_devinfo)) { > > + err = PTR_ERR(vb_devinfo); > > + goto out_free_vb; > > + } > > + > > + vb_mapping = balloon_mapping_alloc(vb_devinfo, > > + (balloon_compaction_check()) ? > > + &virtio_balloon_aops : NULL); > > + if (IS_ERR(vb_mapping)) { > > + /* > > + * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP > > + * This means !CONFIG_BALLOON_COMPACTION, otherwise we get off. > > + */ > > + err = PTR_ERR(vb_mapping); > > + if (err != -EOPNOTSUPP) > > + goto out_free_vb_devinfo; > > + } > > + > > + vb->vb_dev_info = vb_devinfo; > > + > > + err = init_vqs(vb); > > + if (err) > > + goto out_free_vb_mapping; > > + > > + vb->thread = kthread_run(balloon, vb, "vballoon"); > > + if (IS_ERR(vb->thread)) { > > + err = PTR_ERR(vb->thread); > > + goto out_del_vqs; > > + } > > + > > + return 0; > > + > > +out_del_vqs: > > + vdev->config->del_vqs(vdev); > > +out_free_vb_mapping: > > + balloon_mapping_free(vb_mapping); > > +out_free_vb_devinfo: > > + balloon_devinfo_free(vb_devinfo); > > +out_free_vb: > > + kfree(vb); > > +out: > > + return err; > > +} > > + > > +/* FIXME: Leave pages alone during suspend, rather than taking them > > + * all back! */ > > +static void remove_common(struct virtio_balloon *vb) > > +{ > > + /* There might be pages left in the balloon: free them. */ > > + while (vb->num_pages) > > + take_from_balloon(vb, vb->num_pages); > > + > > + /* Now we reset the device so we can clean up the queues. 
*/ > > + vb->vdev->config->reset(vb->vdev); > > + vb->vdev->config->del_vqs(vb->vdev); > > +} > > + > > +static void virtballoon_remove(struct virtio_device *vdev) > > +{ > > + struct virtio_balloon *vb = vdev->priv; > > + > > + kthread_stop(vb->thread); > > + remove_common(vb); > > + balloon_mapping_free(vb->vb_dev_info->mapping); > > + balloon_devinfo_free(vb->vb_dev_info); > > + kfree(vb); > > +} > > + > > +#ifdef CONFIG_PM_SLEEP > > +static int virtballoon_freeze(struct virtio_device *vdev) > > +{ > > + struct virtio_balloon *vb = vdev->priv; > > + > > + /* > > + * The kthread is already frozen by the PM core before this > > + * function is called. > > + */ > > + > > + remove_common(vb); > > + return 0; > > +} > > + > > +static int virtballoon_restore(struct virtio_device *vdev) > > +{ > > + return init_vqs(vdev->priv); > > +} > > +#endif > > + > > +static unsigned int features[] = { > > + /* FIXME: Support VIRTIO_BALLOON_F_EXTRA_MEM! */ > > +}; > > + > > +static struct virtio_driver virtio_balloon_driver = { > > + .feature_table = features, > > + .feature_table_size = ARRAY_SIZE(features), > > + .driver.name = KBUILD_MODNAME, > > + .driver.owner = THIS_MODULE, > > + .id_table = id_table, > > + .probe = virtballoon_probe, > > + .remove = virtballoon_remove, > > +#ifdef CONFIG_PM_SLEEP > > + .freeze = virtballoon_freeze, > > + .restore = virtballoon_restore, > > +#endif > > +}; > > + > > +module_virtio_driver(virtio_balloon_driver); > > +MODULE_DEVICE_TABLE(virtio, id_table); > > +MODULE_DESCRIPTION("Virtio balloon driver"); > > +MODULE_LICENSE("GPL"); > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > > index 5e26f61b5df5..cdca2934668a 100644 > > --- a/include/uapi/linux/virtio_balloon.h > > +++ b/include/uapi/linux/virtio_balloon.h > > @@ -28,32 +28,45 @@ > > #include <linux/virtio_ids.h> > > #include <linux/virtio_config.h> > > > > -/* The feature bitmap for virtio balloon */ > > -#define 
VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > > -#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > > - > > -/* Size of a PFN in the balloon interface. */ > > -#define VIRTIO_BALLOON_PFN_SHIFT 12 > > - > > -struct virtio_balloon_config > > -{ > > - /* Number of pages host wants Guest to give up. */ > > - __le32 num_pages; > > - /* Number of pages we've actually got in balloon. */ > > - __le32 actual; > > +/* This means the balloon can go negative (ie. add memory to system) */ > > +#define VIRTIO_BALLOON_F_EXTRA_MEM 0 > > + > > +struct virtio_balloon_config_space { > > + /* Set by device: bits indicate what page sizes supported. */ > > + __le64 pagesizes; > > + /* Set by driver: only a single bit is set! */ > > + __le64 page_size; > > + > > + /* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */ > > + __le64 extra_mem_start; > > + __le64 extra_mem_end; > > +}; > > + > > +struct virtio_balloon_statistic { > > + __le64 tag; /* VIRTIO_BALLOON_S_* */ > > + __le64 val; > > }; > > > > -#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ > > -#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */ > > -#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */ > > -#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */ > > -#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */ > > -#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ > > -#define VIRTIO_BALLOON_S_NR 6 > > - > > -struct virtio_balloon_stat { > > - __u16 tag; > > - __u64 val; > > -} __attribute__((packed)); > > +/* Guest->host command queue. */ > > +/* Ask the host for more pages. > > + Followed by array of 1 or more readable le64 pageaddr's. */ > > +#define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0) > > +/* Give the host more pages. > > + Followed by array of 1 or more readable le64 pageaddr's */ > > +#define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1) > > +/* Dear host: I need more memory. 
*/ > > +#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2) > > +/* Dear host: here are your stats. > > + * Followed by 0 or more struct virtio_balloon_statistic structs. */ > > +#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3) > > + > > +/* Host->guest command queue. */ > > +/* Followed by s64 of new balloon target size (only negative if > > + * VIRTIO_BALLOON_F_EXTRA_MEM). */ > > +#define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000) > > +/* Ask for statistics */ > > +#define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001) > > + > > +#include <linux/virtio_balloon_legacy.h> > > > > #endif /* _LINUX_VIRTIO_BALLOON_H */ > > diff --git a/include/uapi/linux/virtio_balloon_legacy.h b/include/uapi/linux/virtio_balloon_legacy.h > > new file mode 100644 > > index 000000000000..cbf77bc1aee3 > > --- /dev/null > > +++ b/include/uapi/linux/virtio_balloon_legacy.h > > @@ -0,0 +1,59 @@ > > +#ifndef _LINUX_VIRTIO_BALLOON_LEGACY_H > > +#define _LINUX_VIRTIO_BALLOON_LEGACY_H > > +/* This header is BSD licensed so anyone can use the definitions to implement > > + * compatible drivers/servers. > > + * > > + * Redistribution and use in source and binary forms, with or without > > + * modification, are permitted provided that the following conditions > > + * are met: > > + * 1. Redistributions of source code must retain the above copyright > > + * notice, this list of conditions and the following disclaimer. > > + * 2. Redistributions in binary form must reproduce the above copyright > > + * notice, this list of conditions and the following disclaimer in the > > + * documentation and/or other materials provided with the distribution. > > + * 3. Neither the name of IBM nor the names of its contributors > > + * may be used to endorse or promote products derived from this software > > + * without specific prior written permission. 
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND > > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE > > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF > > + * SUCH DAMAGE. */ > > +#include <linux/virtio_ids.h> > > +#include <linux/virtio_config.h> > > + > > +/* The feature bitmap for virtio balloon */ > > +#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */ > > +#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */ > > + > > +/* Size of a PFN in the balloon interface. */ > > +#define VIRTIO_BALLOON_PFN_SHIFT 12 > > + > > +struct virtio_balloon_config > > +{ > > + /* Number of pages host wants Guest to give up. */ > > + __le32 num_pages; > > + /* Number of pages we've actually got in balloon. 
*/ > > + __le32 actual; > > +}; > > + > > +#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */ > > +#define VIRTIO_BALLOON_S_SWAP_OUT 1 /* Amount of memory swapped out */ > > +#define VIRTIO_BALLOON_S_MAJFLT 2 /* Number of major faults */ > > +#define VIRTIO_BALLOON_S_MINFLT 3 /* Number of minor faults */ > > +#define VIRTIO_BALLOON_S_MEMFREE 4 /* Total amount of free memory */ > > +#define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */ > > +#define VIRTIO_BALLOON_S_NR 6 > > + > > +struct virtio_balloon_stat { > > + __u16 tag; > > + __u64 val; > > +} __attribute__((packed)); > > + > > +#endif /* _LINUX_VIRTIO_BALLOON_LEGACY_H */ > > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h > > index 284fc3a05f7b..8b5ac0047190 100644 > > --- a/include/uapi/linux/virtio_ids.h > > +++ b/include/uapi/linux/virtio_ids.h > > @@ -33,11 +33,12 @@ > > #define VIRTIO_ID_BLOCK 2 /* virtio block */ > > #define VIRTIO_ID_CONSOLE 3 /* virtio console */ > > #define VIRTIO_ID_RNG 4 /* virtio rng */ > > -#define VIRTIO_ID_BALLOON 5 /* virtio balloon */ > > +#define VIRTIO_ID_BALLOON 5 /* virtio balloon (legacy) */ > > #define VIRTIO_ID_RPMSG 7 /* virtio remote processor messaging */ > > #define VIRTIO_ID_SCSI 8 /* virtio scsi */ > > #define VIRTIO_ID_9P 9 /* 9p virtio console */ > > #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */ > > #define VIRTIO_ID_CAIF 12 /* Virtio caif */ > > +#define VIRTIO_ID_MEMBALLOON 13 /* virtio balloon */ > > > > #endif /* _LINUX_VIRTIO_IDS_H */ > > > > > > --------------------------------------------------------------------- > > To unsubscribe from this mail list, you must leave the OASIS TC that > > generates this mail. Follow this link to all your TCs in OASIS at: > > https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php >
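
The page-size handshake in the quoted probe code (the device advertises a bitmap in `pagesizes`, the driver writes back exactly one bit in `page_size`) can be sketched in plain C. This is only a userspace illustration under stated assumptions, not the driver's code: `pick_page_size` and `GUEST_PAGE_SIZE` are hypothetical names, and a 4096-byte guest page is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: the guest uses 4 KiB pages. */
#define GUEST_PAGE_SIZE UINT64_C(4096)

/*
 * Hypothetical helper: given the device's bitmap of supported page
 * sizes (a set bit with value 2^N means 2^N-byte pages are offered),
 * return the single page size the driver accepts, or 0 if the device
 * does not offer it.
 */
static uint64_t pick_page_size(uint64_t offered)
{
	uint64_t choice = offered & GUEST_PAGE_SIZE;

	/* The driver's reply must have at most one bit set. */
	assert((choice & (choice - 1)) == 0);
	return choice;
}
```

A return value of 0 corresponds to the "Unacceptable pagesize" path in the patch, where probing fails with -EINVAL.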


  • 6.  Re: [virtio-dev] Re: [virtio] New virtio balloon...

    Posted 01-30-2014 18:42
    On Thu, Jan 30, 2014 at 11:47:19AM -0500, Luiz Capitulino wrote:
    > On Thu, 30 Jan 2014 12:16:29 +0200
    > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    >
    > > Also copy virtio-dev since this is clearly implementation ...
    > >
    > > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > > > Hi,
    > > >
    > > > I tried to write a new balloon driver; it's completely untested
    > > > (as I need to write the device). The protocol is basically two vqs, one
    > > > for the guest to send commands, one for the host to send commands.
    > > >
    > > > Some interesting things come out:
    > > > 1) We do need to explicitly tell the host where the page is we want.
    > > > This is required for compaction, for example.
    > > >
    > > > 2) We need to be able to exceed the balloon target, especially for page
    > > > migration. Thus there's no mechanism for the device to refuse to
    > > > give us the pages.
    > > >
    > > > 3) The device can offer multiple page sizes, but the driver can only
    > > > accept one. I'm not sure if this is useful, as guests are either
    > > > huge page backed or not, and returning sub-pages isn't useful.
    > > >
    > > > Linux demo code follows.
    > > >
    > > > Cheers,
    > > > Rusty.
    > >
    > > More comments:
    > > - for projects like auto-ballooning that Luiz works on,
    > > it's not nice that to swap page 1 for page 2
    > > you have to inflate then deflate
    > > besides overhead this confuses the host:
    > > imagine you tell QEMU to increase target,
    > > meanwhile guest inflates temporarily,
    > > QEMU thinks okay done, now you suddenly deflate.
    >
    > Yes. Just to give more context: one of my auto-ballooning versions broke
    > when virtballoon_migratepage() ran. The reason was that my host-side code
    > in the balloon device did not expect guest initiated operations. And the
    > current spec does imply that all operations are initiated by the host.
    >
    > So, first suggestion: if the current spec is still valid, we have to add
    > a note there that balloon operations can be initiated by the guest.
    >
    > My current code is different, but something it does that could also break
    > due to guest initiated inflate/deflate is that it keeps track of the
    > current balloon size. This is done by a counter which is incremented
    > on inflate and decremented on deflate. I did that because the device just
    > doesn't have this information ('actual' is unreliable, besides it's
    > only updated every 256 pages inflated/deflated).
    >
    > Second suggestion: I think we need a reliable way to know the current
    > balloon size on the host. My counter does work, btw.
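
The host-side counter described above (incremented on inflate, decremented on deflate, regardless of who initiated the operation) might look roughly like the following. This is a hedged userspace sketch, not code from QEMU or from the auto-ballooning patches: `struct balloon_account` and the three helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side bookkeeping; not taken from QEMU. */
struct balloon_account {
	uint64_t size_pages;   /* pages currently held by the balloon */
	uint64_t target_pages; /* size the host last asked for */
};

/* Guest handed pages to the host (inflate), host- or guest-initiated. */
static void account_inflate(struct balloon_account *acct, uint64_t npages)
{
	acct->size_pages += npages;
}

/* Guest took pages back (deflate). */
static void account_deflate(struct balloon_account *acct, uint64_t npages)
{
	/* A deflate larger than the balloon would indicate a protocol bug. */
	assert(npages <= acct->size_pages);
	acct->size_pages -= npages;
}

/* Signed distance still to go; positive means the guest must inflate. */
static int64_t account_remaining(const struct balloon_account *acct)
{
	return (int64_t)acct->target_pages - (int64_t)acct->size_pages;
}
```

With this bookkeeping, a guest-initiated page migration (inflate one page, then deflate one page) briefly overshoots the target by one page and then returns to it, which is the transient that host code must tolerate.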
    >
    > As far as the guest is concerned, my current code just informs the host
    > that the guest is facing pressure. This is done through a "message" virtqueue,
    > but I think this could just use the guest command virtqueue.
    >
    > > A couple of other suggestions:
    > >
    > > - how to accommodate memory pressure in guest?
    > > Let's add a field telling host how hard we
    > > want our memory back
    >
    > I agree we have to accommodate pressure in the guest some way, but what
    > you proposed is more or less related to auto-ballooning.
    >
    > My suggestion would be for the host to tell the guest what to do in
    > case of pressure. Like, it could tell the guest to just keep trying like
    > it does today or it could ask the guest to stop inflation on pressure
    > (which would require an ack from the host, which complicates the
    > protocol a bit).

    If we need ack anyway, it seems enough to notify host
    and then host can ask guest to stop inflating?

    > Also, there are two ways to know the guest is under pressure: 1. when
    > alloc_page() fails or 2. use in-kernel vmpressure notification like
    > auto-balloon does.

    2 can detect pressure earlier so seems nicer.

    > > - assume you want to over-commit host and start
    > > inflating balloon.
    > > If low on memory it might be better for guest to
    > > wait a bit before inflating.
    > > Also, if host asks for a lot of memory a ton of
    > > allocations will slow guest significantly.
    > > But for guest to do the right thing we need host to tell guest what
    > > are its memory and time constraints.
    > > Let's add a field telling guest how hard we
    > > want it to give us memory (e.g. time limit)
    >
    > I think this is also related to auto-ballooning. Maybe we should start
    > with a simple device/driver and add all these features on top.


    Just shows these are good ideas :).

    I tried to propose building blocks in a generic way that will
    be useful for multiple features.
    autoballoon can build on top but these features
    are useful even with manual inflate.
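
One way to read the "time limit" building block is that the host sends a target together with a deadline, and the guest spreads its allocations over the time available instead of inflating in one burst. The following is a sketch under that assumption only: `struct inflate_request`, its fields, and `pages_this_tick` are hypothetical and are not part of the posted patch or of any spec.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical request: "reach target_pages within deadline_ms".
 * Neither the field layout nor the names come from the posted patch.
 */
struct inflate_request {
	uint64_t target_pages;
	uint64_t deadline_ms;
};

/*
 * Pages to inflate during the next tick so that the remaining work is
 * spread evenly over the remaining time. Rounds up so the target is
 * reached by the deadline; returns everything left once time is up.
 */
static uint64_t pages_this_tick(const struct inflate_request *req,
				uint64_t current_pages,
				uint64_t elapsed_ms, uint64_t tick_ms)
{
	uint64_t remaining_pages, remaining_ms, ticks_left;

	if (current_pages >= req->target_pages)
		return 0;
	remaining_pages = req->target_pages - current_pages;

	if (tick_ms == 0 || elapsed_ms >= req->deadline_ms)
		return remaining_pages;
	remaining_ms = req->deadline_ms - elapsed_ms;

	ticks_left = (remaining_ms + tick_ms - 1) / tick_ms;
	return (remaining_pages + ticks_left - 1) / ticks_left;
}
```

The same pacing applies to a manually requested inflate, which fits the point that such fields are useful without auto-ballooning.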

    --
    MST





  • 8.  Re: [virtio-dev] Re: [virtio] New virtio balloon...

    Posted 01-30-2014 20:42
    On Thu, 30 Jan 2014 20:42:00 +0200
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > On Thu, Jan 30, 2014 at 11:47:19AM -0500, Luiz Capitulino wrote:
    > > On Thu, 30 Jan 2014 12:16:29 +0200
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > > > Also copy virtio-dev since this is clearly implementation ...
    > > >
    > > > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > > > > Hi,
    > > > >
    > > > > I tried to write a new balloon driver; it's completely untested
    > > > > (as I need to write the device). The protocol is basically two vqs, one
    > > > > for the guest to send commands, one for the host to send commands.
    > > > >
    > > > > Some interesting things come out:
    > > > > 1) We do need to explicitly tell the host where the page is we want.
    > > > > This is required for compaction, for example.
    > > > >
    > > > > 2) We need to be able to exceed the balloon target, especially for page
    > > > > migration. Thus there's no mechanism for the device to refuse to
    > > > > give us the pages.
    > > > >
    > > > > 3) The device can offer multiple page sizes, but the driver can only
    > > > > accept one. I'm not sure if this is useful, as guests are either
    > > > > huge page backed or not, and returning sub-pages isn't useful.
    > > > >
    > > > > Linux demo code follows.
    > > > >
    > > > > Cheers,
    > > > > Rusty.
    > > >
    > > > More comments:
    > > > - for projects like auto-ballooning that Luiz works on,
    > > > it's not nice that to swap page 1 for page 2
    > > > you have to inflate then deflate
    > > > besides overhead this confuses the host:
    > > > imagine you tell QEMU to increase target,
    > > > meanwhile guest inflates temporarily,
    > > > QEMU thinks okay done, now you suddenly deflate.
    > >
    > > Yes. Just to give more context: one of my auto-ballooning versions broke
    > > when virtballoon_migratepage() ran. The reason was that my host-side code
    > > in the balloon device did not expect guest initiated operations. And the
    > > current spec does imply that all operations are initiated by the host.
    > >
    > > So, first suggestion: if the current spec is still valid, we have to add
    > > a note there that balloon operations can be initiated by the guest.
    > >
    > > My current code is different, but something it does that could also break
    > > due to guest initiated inflate/deflate is that it keeps track of the
    > > current balloon size. This is done by a counter which is incremented
    > > on inflate and decremented on deflate. I did that because the device just
    > > doesn't have this information ('actual' is unreliable, besides it's
    > > only updated every 256 pages inflated/deflated).
    > >
    > > Second suggestion: I think we need a reliable way to know the current
    > > balloon size on the host. My counter does work, btw.
    > >
    > > As far as the guest is concerned, my current code just informs the host
    > > that the guest is facing pressure. This is done through a "message" virtqueue,
    > > but I think this could just use the guest command virtqueue.
    > >
    > > > A couple of other suggestions:
    > > >
    > > > - how to accommodate memory pressure in guest?
    > > > Let's add a field telling host how hard we
    > > > want our memory back
    > >
    > > I agree we have to accommodate pressure in the guest some way, but what
    > > you proposed is more or less related to auto-ballooning.
    > >
    > > My suggestion would be for the host to tell the guest what to do in
    > > case of pressure. Like, it could tell the guest to just keep trying like
    > > it does today or it could ask the guest to stop inflation on pressure
    > > (which would require an ack from the host, which complicates the
    > > protocol a bit).
    >
    > If we need ack anyway, it seems enough to notify host
    > and then host can ask guest to stop inflating?

    Yep, that's exactly what auto-ballooning does. When the guest comes under
    pressure while inflating, it pauses the inflation and sends a message
    to QEMU saying "hey, I'm under pressure". The guest doesn't continue until
    QEMU acks that message. This gives QEMU enough time to reset num_pages to
    the current balloon size (which cancels the inflate); QEMU then acks the
    message and the guest stops inflating, as num_pages == balloon size.
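
The handshake just described can be modelled as a small state machine. The sketch below is a userspace simulation under stated assumptions, not the auto-ballooning code: `struct balloon_sim`, `guest_inflate_step` and `host_handle_pressure` are hypothetical names, and the message virtqueue is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state; num_pages mirrors the legacy config field. */
struct balloon_sim {
	uint64_t num_pages;     /* target set by the host */
	uint64_t balloon_size;  /* pages the guest has inflated so far */
	bool waiting_for_ack;   /* guest sent "under pressure", no ack yet */
};

/* Guest: try to inflate one page; returns true if a page was added. */
static bool guest_inflate_step(struct balloon_sim *s, bool under_pressure)
{
	if (s->waiting_for_ack)
		return false;               /* paused until the host acks */
	if (s->balloon_size >= s->num_pages)
		return false;               /* target reached, nothing to do */
	if (under_pressure) {
		s->waiting_for_ack = true;  /* notify host and hold inflation */
		return false;
	}
	s->balloon_size++;
	return true;
}

/* Host: cancel the inflate by pinning the target, then ack. */
static void host_handle_pressure(struct balloon_sim *s)
{
	assert(s->waiting_for_ack);
	s->num_pages = s->balloon_size; /* reset target before acking */
	s->waiting_for_ack = false;     /* the ack releases the guest */
}
```

The ordering inside `host_handle_pressure` is the important part: because the target is rewritten before the ack, the guest's next step sees `num_pages == balloon_size` and does not resume inflating.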

    > > Also, there are two ways to know the guest is under pressure: 1. when
    > > alloc_page() fails or 2. use in-kernel vmpressure notification like
    > > auto-balloon does.
    >
    > 2 can detect pressure earlier so seems nicer.

    Yeah, exactly.

    > > > - assume you want to over-commit host and start
    > > > inflating balloon.
    > > > If low on memory it might be better for guest to
    > > > wait a bit before inflating.
    > > > Also, if host asks for a lot of memory a ton of
    > > > allocations will slow guest significantly.
    > > > But for guest to do the right thing we need host to tell guest what
    > > > are its memory and time constraints.
    > > > Let's add a field telling guest how hard we
    > > > want it to give us memory (e.g. time limit)
    > >
    > > I think this is also related to auto-ballooning. Maybe we should start
    > > with a simple device/driver and add all these features on top.
    >
    >
    > Just shows these are good ideas :).
    >
    > I tried to propose building blocks in a generic way that will
    > be useful for multiple features.
    > autoballoon can build on top but these features
    > are useful even with manual inflate.

    Agreed.



  • 9.  Re: [virtio-dev] Re: [virtio] New virtio balloon...

    Posted 01-30-2014 20:43
    On Thu, 30 Jan 2014 20:42:00 +0200
    "Michael S. Tsirkin" <mst@redhat.com> wrote:

    > On Thu, Jan 30, 2014 at 11:47:19AM -0500, Luiz Capitulino wrote:
    > > On Thu, 30 Jan 2014 12:16:29 +0200
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > > > Also copy virtio-dev since this is clearly implementation ...
    > > >
    > > > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > > > > Hi,
    > > > >
    > > > > I tried to write a new balloon driver; it's completely untested
    > > > > (as I need to write the device). The protocol is basically two vqs, one
    > > > > for the guest to send commands, one for the host to send commands.
    > > > >
    > > > > Some interesting things come out:
    > > > > 1) We do need to explicitly tell the host where the page is we want.
    > > > > This is required for compaction, for example.
    > > > >
    > > > > 2) We need to be able to exceed the balloon target, especially for page
    > > > > migration. Thus there's no mechanism for the device to refuse to
    > > > > give us the pages.
    > > > >
    > > > > 3) The device can offer multiple page sizes, but the driver can only
    > > > > accept one. I'm not sure if this is useful, as guests are either
    > > > > huge page backed or not, and returning sub-pages isn't useful.
    > > > >
    > > > > Linux demo code follows.
    > > > >
    > > > > Cheers,
    > > > > Rusty.
    > > >
    > > > More comments:
    > > > - for projects like auto-ballooning that Luiz works on,
    > > > it's not nice that to swap page 1 for page 2
    > > > you have to inflate then deflate
    > > > besides overhead this confuses the host:
    > > > imagine you tell QEMU to increase target,
    > > > meanwhile guest inflates temporarily,
    > > > QEMU thinks okay done, now you suddenly deflate.
    > >
    > > Yes. Just to give more context: one of my auto-ballooning versions broke
    > > when virtballoon_migratepage() ran. The reason was that my host-side code
    > > in the balloon device did not expect guest initiated operations. And the
    > > current spec does imply that all operations are initiated by the host.
    > >
    > > So, first suggestion: if the current spec is still valid, we have to add
    > > a note there that balloon operations can be initiated by the guest.
    > >
    > > My current code is different, but something it does that could also break
    > > due to guest initiated inflate/deflate is that it keeps track of the
    > > current balloon size. This is done by a counter which is incremented
    > > on inflate and decremented on deflate. I did that because the device just
    > > doesn't have this information ('actual' is unreliable, besides it's
    > > only updated every 256 pages inflated/deflated).
    > >
    > > Second suggestion: I think we need a reliable way to know the current
    > > balloon size on the host. My counter does work, btw.
    > >
    > > As far as the guest is concerned, my current code just informs the host
    > > that the guest is facing pressure. This is done through a "message" virtqueue,
    > > but I think this could just use the guest command virtqueue.
    > >
    > > > A couple of other suggestions:
    > > >
    > > > - how to accommodate memory pressure in guest?
    > > > Let's add a field telling host how hard do we
    > > > want our memory back
    > >
    > > I agree we have to accommodate pressure in the guest some way, but what
    > > you proposed is more or less related to auto-ballooning.
    > >
    > > My suggestion would be for the host to tell the guest what to do in
    > > case of pressure. Like, it could tell the guest to just keep trying like
    > > it does today or it could ask the guest to stop inflation on pressure
    > > (which would require an ack from the host, which complicates the
    > > protocol a bit).
    >
    > If we need ack anyway, it seems enough to notify host
    > and then host can ask guest to stop inflating?

    Yep, that's exactly what auto-ballooning does. When the guest gets into
    pressure while inflating, it holds the inflation and sends a message to
    QEMU saying "hey, I'm under pressure". The guest doesn't continue until
    QEMU acks that message. This gives QEMU enough time to reset num_pages
    to the current balloon size (which cancels inflate), then QEMU acks the
    message and the guest stops inflating, as num_pages == balloon size.

    > > Also, there are two ways to know the guest is under pressure: 1. when
    > > alloc_page() fails or 2. use in-kernel vmpressure notification like
    > > auto-balloon does.
    >
    > 2 can detect pressure earlier so seems nicer.

    Yeah, exactly.

    > > > - assume you want to over-commit host and start
    > > > inflating balloon.
    > > > If low on memory it might be better for guest to
    > > > wait a bit before inflating.
    > > > Also, if host asks for a lot of memory a ton of
    > > > allocations will slow guest significantly.
    > > > But for guest to do the right thing we need host to tell guest what
    > > > are its memory and time constraints.
    > > > Let's add a field telling guest how hard do we
    > > > want it to give us memory (e.g. time limit)
    > >
    > > I think this is also related to auto-ballooning.
    > > Maybe we should start
    > > with a simple device/driver and add all these features on top.
    > >
    > Just shows these are good ideas :).
    >
    > I tried to propose building blocks in a generic way that will
    > be useful for multiple features.
    > autoballoon can build on top but these features
    > are useful even with manual inflate.

    Agreed.
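
The pressure handshake described in this message (guest pauses and reports pressure, QEMU resets num_pages to the current balloon size and then acks) can be modelled in a few lines of C. This is only a sketch under assumptions: the struct, the function names and the idea of folding guest and host state into one object are hypothetical, and real code would be split between the guest driver and QEMU and would pass messages over a virtqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: guest and host state folded into one struct. */
struct balloon_model {
    uint64_t num_pages;     /* target set by the host */
    uint64_t balloon_size;  /* pages the guest has actually inflated */
    bool waiting_for_ack;   /* guest reported pressure, no ack yet */
};

/* Guest: inflate by at most `batch` pages; returns pages inflated. */
uint64_t guest_inflate_step(struct balloon_model *m, uint64_t batch)
{
    uint64_t n;

    /* Inflation is held while an ack is pending or the target is met. */
    if (m->waiting_for_ack || m->balloon_size >= m->num_pages)
        return 0;
    n = m->num_pages - m->balloon_size;
    if (n > batch)
        n = batch;
    m->balloon_size += n;
    return n;
}

/* Guest: "hey, I'm under pressure". */
void guest_report_pressure(struct balloon_model *m)
{
    m->waiting_for_ack = true;
}

/* Host: reset the target to the current size (cancels inflate), then ack. */
void host_handle_pressure(struct balloon_model *m)
{
    assert(m->waiting_for_ack); /* an ack only follows a pressure message */
    m->num_pages = m->balloon_size;
    m->waiting_for_ack = false;
}
```

After the ack, `num_pages == balloon_size`, so the next inflate step does nothing, which matches the behaviour described above.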


  • 10.  Re: [virtio] New virtio balloon...

    Posted 01-31-2014 05:37
    Luiz Capitulino <lcapitulino@redhat.com> writes:
    > On Thu, 30 Jan 2014 12:16:29 +0200
    > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    >
    >> Also copy virtio-dev since this is clearly implementation ...
    >>
    >> On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    >> > Hi,
    >> >
    >> > I tried to write a new balloon driver; it's completely untested
    >> > (as I need to write the device). The protocol is basically two vqs, one
    >> > for the guest to send commands, one for the host to send commands.
    >> >
    >> > Some interesting things come out:
    >> > 1) We do need to explicitly tell the host where the page is we want.
    >> > This is required for compaction, for example.
    >> >
    >> > 2) We need to be able to exceed the balloon target, especially for page
    >> > migration. Thus there's no mechanism for the device to refuse to
    >> > give us the pages.
    >> >
    >> > 3) The device can offer multiple page sizes, but the driver can only
    >> > accept one. I'm not sure if this is useful, as guests are either
    >> > huge page backed or not, and returning sub-pages isn't useful.
    >> >
    >> > Linux demo code follows.
    >> >
    >> > Cheers,
    >> > Rusty.
    >>
    >> More comments:
    >> - for projects like auto-ballooning that Luiz works on,
    >> it's not nice that to swap page 1 for page 2
    >> you have to inflate then deflate
    >> besides overhead this confuses the host:
    >> imagine you tell QEMU to increase target,
    >> meanwhile guest inflates temporarily,
    >> QEMU thinks okay done, now you suddenly deflate.
    >
    > Yes. Just to give more context: one of my auto-ballooning versions broke
    > when virtballoon_migratepage() ran. The reason was that my host-side code
    > in the balloon device did not expect guest initiated operations. And the
    > current spec does imply that all operations are initiated by the host.
    >
    > So, first suggestion: if the current spec is still valid, we have to add
    > a note there that balloon operations can be initiated by the guest.

    Or we could have an explicit exchange operation. That may make more
    sense?

    > My current code is different, but something it does that could also break
    > due to guest initiated inflate/deflate is that it keeps track of the
    > current balloon size. This is done by a counter which is incremented
    > on inflate and decremented on deflate. I did that because the device just
    > doesn't have this information ('actual' is unreliable, besides it's
    > only updated every 256 pages inflated/deflated).
    >
    > Second suggestion: I think we need a reliable way to know the current
    > balloon size on the host. My counter does work, btw.

    Yes, this patch means we explicitly tell the host what we're doing with
    the balloon, in the assumption that it keeps track.
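
A minimal sketch of the host-side counter discussed here, assuming the host sees every give/get command from the guest. The struct and function names are hypothetical and are not taken from QEMU or from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side bookkeeping for the balloon size. */
struct host_balloon {
    uint64_t num_pages; /* pages the host believes are in the balloon */
};

/* Guest gave `n` pages to the host (inflate). */
void host_on_give_pages(struct host_balloon *hb, uint64_t n)
{
    assert(hb->num_pages <= UINT64_MAX - n); /* counter must not wrap */
    hb->num_pages += n;
}

/*
 * Guest took `n` pages back (deflate). Returns 0 on success, or -1 if
 * the guest claims more pages than the host thinks the balloon holds.
 */
int host_on_get_pages(struct host_balloon *hb, uint64_t n)
{
    if (n > hb->num_pages)
        return -1;
    hb->num_pages -= n;
    return 0;
}
```

Because each command carries an explicit page list, the counter changes by the exact number of pages in the command rather than being refreshed every 256 pages.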

    > As far as the guest is concerned, my current code just informs the host
    > that the guest is facing pressure. This is done through a "message" virtqueue,
    > but I think this could just use the guest command virtqueue.

    Yes, this rewrite has only two queues, a guest command queue (get pages,
    give pages, need mem and stats reply) and a host command queue (set
    target, get stats).
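
The two-queue layout can be sketched as follows. The command names follow the comments in the patch, where every command begins with a little-endian 64-bit `type` field; the numeric values and the helper functions are placeholders invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Guest command queue. Numeric values are placeholders. */
enum {
    VIRTIO_BALLOON_GCMD_GET_PAGES   = 0,
    VIRTIO_BALLOON_GCMD_GIVE_PAGES  = 1,
    VIRTIO_BALLOON_GCMD_NEED_MEM    = 2,
    VIRTIO_BALLOON_GCMD_STATS_REPLY = 3
};

/* Host command queue. Numeric values are placeholders. */
enum {
    VIRTIO_BALLOON_HCMD_SET_BALLOON = 0,
    VIRTIO_BALLOON_HCMD_GET_STATS   = 1
};

/* Decode the leading little-endian 64-bit type field of a command. */
uint64_t cmd_type(const unsigned char *buf, size_t len)
{
    uint64_t type = 0;
    int i;

    assert(len >= 8); /* every command is at least a type field */
    for (i = 7; i >= 0; i--)
        type = (type << 8) | buf[i];
    return type;
}

/* Name of a command arriving on the guest command queue. */
const char *gcmd_name(uint64_t type)
{
    switch (type) {
    case VIRTIO_BALLOON_GCMD_GET_PAGES:   return "get pages";
    case VIRTIO_BALLOON_GCMD_GIVE_PAGES:  return "give pages";
    case VIRTIO_BALLOON_GCMD_NEED_MEM:    return "need mem";
    case VIRTIO_BALLOON_GCMD_STATS_REPLY: return "stats reply";
    default:                              return "unknown";
    }
}

/* Name of a command arriving on the host command queue. */
const char *hcmd_name(uint64_t type)
{
    switch (type) {
    case VIRTIO_BALLOON_HCMD_SET_BALLOON: return "set target";
    case VIRTIO_BALLOON_HCMD_GET_STATS:   return "get stats";
    default:                              return "unknown";
    }
}
```

Keeping the two namespaces separate means the same numeric value can be reused on each queue, since the queue a buffer arrives on already says who sent it.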

    >> A couple of other suggestions:
    >>
    >> - how to accommodate memory pressure in guest?
    >> Let's add a field telling host how hard do we
    >> want our memory back
    >
    > I agree we have to accommodate pressure in the guest some way, but what
    > you proposed is more or less related to auto-ballooning.
    >
    > My suggestion would be for the host to tell the guest what to do in
    > case of pressure. Like, it could tell the guest to just keep trying like
    > it does today or it could ask the guest to stop inflation on pressure
    > (which would require an ack from the host, which complicates the
    > protocol a bit).

    How about we allow the guest to send gratuitous stats? (i.e. not just
    when the host asks). The host should be able to tell from that whether
    to change the target (rather than a simple need_mem).

    If the host wants the guest to stop inflating, it should alter the
    target.
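
To illustrate why altering the target is enough to stop inflation, here is a sketch of how a guest might decide its next step from the target and the current balloon size. The function name and the batch parameter are hypothetical; the patch's commands carry up to 256 pages each, and the sketch assumes page counts fit in a signed 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns how many pages the guest should move next:
 *   > 0  give that many pages to the host (inflate),
 *   < 0  take that many pages back (deflate),
 *   == 0 the balloon is at its target.
 * The result is clamped to +/- batch pages per command.
 */
int64_t pages_to_move(int64_t target, uint64_t num_pages, int64_t batch)
{
    int64_t diff;

    assert(batch > 0);
    diff = target - (int64_t)num_pages;
    if (diff > batch)
        return batch;
    if (diff < -batch)
        return -batch;
    return diff;
}
```

If the host lowers the target to the current balloon size, `pages_to_move()` returns 0 on the next pass and the guest simply stops inflating; no separate "stop" command is needed.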

    > Also, there are two ways to know the guest is under pressure: 1. when
    > alloc_page() fails or 2. use in-kernel vmpressure notification like
    > auto-balloon does.

    Both should probably warn the host. We might need extra stats?

    Cheers,
    Rusty.




  • 12.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 16:56
    On Fri, 31 Jan 2014 16:07:28 +1030
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > Luiz Capitulino <lcapitulino@redhat.com> writes:
    > > On Thu, 30 Jan 2014 12:16:29 +0200
    > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
    > >
    > >> Also copy virtio-dev since this is clearly implementation ...
    > >>
    > >> On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > >> > Hi,
    > >> >
    > >> > I tried to write a new balloon driver; it's completely untested
    > >> > (as I need to write the device). The protocol is basically two vqs, one
    > >> > for the guest to send commands, one for the host to send commands.
    > >> >
    > >> > Some interesting things come out:
    > >> > 1) We do need to explicitly tell the host where the page is we want.
    > >> > This is required for compaction, for example.
    > >> >
    > >> > 2) We need to be able to exceed the balloon target, especially for page
    > >> > migration. Thus there's no mechanism for the device to refuse to
    > >> > give us the pages.
    > >> >
    > >> > 3) The device can offer multiple page sizes, but the driver can only
    > >> > accept one. I'm not sure if this is useful, as guests are either
    > >> > huge page backed or not, and returning sub-pages isn't useful.
    > >> >
    > >> > Linux demo code follows.
    > >> >
    > >> > Cheers,
    > >> > Rusty.
    > >>
    > >> More comments:
    > >> - for projects like auto-ballooning that Luiz works on,
    > >> it's not nice that to swap page 1 for page 2
    > >> you have to inflate then deflate
    > >> besides overhead this confuses the host:
    > >> imagine you tell QEMU to increase target,
    > >> meanwhile guest inflates temporarily,
    > >> QEMU thinks okay done, now you suddenly deflate.
    > >
    > > Yes. Just to give more context: one of my auto-ballooning versions broke
    > > when virtballoon_migratepage() ran. The reason was that my host-side code
    > > in the balloon device did not expect guest initiated operations. And the
    > > current spec does imply that all operations are initiated by the host.
    > >
    > > So, first suggestion: if the current spec is still valid, we have to add
    > > a note there that balloon operations can be initiated by the guest.
    >
    > Or we could have an explicit exchange operation. That may make more
    > sense?

    I agree, it does.
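
A small sketch of why an explicit exchange helps the host. With inflate followed by deflate, the host briefly sees the balloon at a size it never really settles at; with an exchange, the size never changes. The struct and all function names are hypothetical, since the thread only proposes the operation and does not define it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side view of the balloon. */
struct host_view {
    uint64_t target;    /* size the host asked for */
    uint64_t num_pages; /* size the host has counted */
};

bool host_target_reached(const struct host_view *h)
{
    return h->num_pages >= h->target;
}

void host_on_inflate(struct host_view *h, uint64_t n)
{
    h->num_pages += n;
}

void host_on_deflate(struct host_view *h, uint64_t n)
{
    assert(n <= h->num_pages);
    h->num_pages -= n;
}

/*
 * Hypothetical exchange: the guest hands over n new pages and takes
 * n old pages back in a single command, so the size is unchanged.
 */
void host_on_exchange(struct host_view *h, uint64_t n)
{
    assert(n <= h->num_pages);
    (void)h;
    (void)n;
}
```

With a target of 101 and 100 pages in the balloon, migrating one page via inflate then deflate makes `host_target_reached()` return true for a moment and then false again, which is the confusion described in the quoted comments. The exchange path never reports the target as reached.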

    > > My current code is different, but something it does that could also break
    > > due to guest initiated inflate/deflate is that it keeps track of the
    > > current balloon size. This is done by a counter which is incremented
    > > on inflate and decremented on deflate. I did that because the device just
    > > doesn't have this information ('actual' is unreliable, besides it's
    > > only updated every 256 pages inflated/deflated).
    > >
    > > Second suggestion: I think we need a reliable way to know the current
    > > balloon size on the host. My counter does work, btw.
    >
    > Yes, this patch means we explicitly tell the host what we're doing with
    > the balloon, in the assumption that it keeps track.
    >
    > > As far as the guest is concerned, my current code just informs the host
    > > that the guest is facing pressure. This is done through a "message" virtqueue,
    > > but I think this could just use the guest command virtqueue.
    >
    > Yes, this rewrite has only two queues, a guest command queue (get pages,
    > give pages, need mem and stats reply) and a host command queue (set
    > target, get stats).
    >
    > >> A couple of other suggestions:
    > >>
    > >> - how to accommodate memory pressure in guest?
    > >> Let's add a field telling host how hard do we
    > >> want our memory back
    > >
    > > I agree we have to accommodate pressure in the guest some way, but what
    > > you proposed is more or less related to auto-ballooning.
    > >
    > > My suggestion would be for the host to tell the guest what to do in
    > > case of pressure. Like, it could tell the guest to just keep trying like
    > > it does today or it could ask the guest to stop inflation on pressure
    > > (which would require an ack from the host, which complicates the
    > > protocol a bit).
    >
    > How about we allow the guest to send gratuitous stats? (i.e. not just
    > when the host asks). The host should be able to tell from that whether
    > to change the target (rather than a simple need_mem).

    Might be useful, but I think we have to decide how far we want
    to go in solving this problem (i.e. handling guest pressure) without
    the other auto-balloon bits I'm working on.

    > If the host wants the guest to stop inflating, it should alter the
    > target.
    >
    > > Also, there are two ways to know the guest is under pressure: 1. when
    > > alloc_page() fails or 2. use in-kernel vmpressure notification like
    > > auto-balloon does.
    >
    > Both should probably warn the host. We might need extra stats?
    >
    > Cheers,
    > Rusty.
    >




  • 14.  Re: [virtio] New virtio balloon...

    Posted 01-31-2014 05:32
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > Also copy virtio-dev since this is clearly implementation ...
    >
    > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    >> Hi,
    >>
    >> I tried to write a new balloon driver; it's completely untested
    >> (as I need to write the device). The protocol is basically two vqs, one
    >> for the guest to send commands, one for the host to send commands.
    >>
    >> Some interesting things come out:
    >> 1) We do need to explicitly tell the host where the page is we want.
    >> This is required for compaction, for example.
    >>
    >> 2) We need to be able to exceed the balloon target, especially for page
    >> migration. Thus there's no mechanism for the device to refuse to
    >> give us the pages.
    >>
    >> 3) The device can offer multiple page sizes, but the driver can only
    >> accept one. I'm not sure if this is useful, as guests are either
    >> huge page backed or not, and returning sub-pages isn't useful.
    >>
    >> Linux demo code follows.
    >>
    >> Cheers,
    >> Rusty.
    >
    > More comments:
    > - for projects like auto-ballooning that Luiz works on,
    > it's not nice that to swap page 1 for page 2
    > you have to inflate then deflate
    > besides overhead this confuses the host:
    > imagine you tell QEMU to increase target,
    > meanwhile guest inflates temporarily,
    > QEMU thinks okay done, now you suddenly deflate.

    I originally allowed the host to deny the deflate, which was why I
    reversed it. Then I realized that was a bad idea. I can switch it back.

    > - what's the status of page returned from balloon?
    > is it zeroed or can it have old data in there?
    > I think in practice Linux will sometimes map in a zero page,
    > so guest can save cycles and avoid zeroing it out.
    > I think we should tell this to guest when returning
    > pages.

    QEMU may not know, since the kernel may not tell it. We should assume
    nothing, and let the guest zero if it needs to. Seems like a premature
    optimization.

    > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    > Frank Swiderski from google that inflate/deflate balloon
    > whenever guest wants (look for "Add a page cache-backed balloon
    > device driver").
    >
    > this is useful but - we need to distinguish pages
    > like this from regular inflate.
    > it's not just counter and host needs a way to know
    > that its target is reached

    The driver needs to explicitly ask for pages in that region.

    > - do we even want to allow guest not telling host when it wants
    > to reuse the page?
    > if yes, I think this should be per-page somehow: when balloon
    > is inflated guest should tell host whether it
    > expects to use this page.

    I decided against it. Making that optional got us into a mess, so now
    it's compulsory. That also fits better with the idea of a negative
    balloon.

    > So I think we should accommodate these uses, and so we want the following flags:
    >
    > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    > flag that specifies pages do not count against target,
    > can be taken out of balloon.
    > EXTRA_MEM suggests there's an upper limit on balloon size
    > but IMHO that's just extra work for host: host does not care
    > I think, give it as much as you want.
    > set by guest, used by host

    I think that Daniel really does want more memory than the guest starts
    with. And I think he still wants to use the balloon to control it.
    Daniel?

    > - TELL_HOST flag that specifies guest will tell host before using pages
    > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    > at the moment, listed here for completeness)
    > set by guest, used by host

    Dislike.

    > - ZEROED
    > flag that specifies that page returned to guest
    > is zeroed
    > set by host, used by guest

    I think that's silly. Under Linux the guest doesn't need to know whether
    it's zeroed or not, it just frees the page.

    > Each of the flags can be just a feature flag, and then
    > if we want a mix of them host can create multiple
    > balloon devices with different flags, and guest looks for best
    > balloon for its purposes.
    >
    > Alternatively flags can be set and reported per page.
    >
    >
    > A couple of other suggestions:
    >
    > - how to accommodate memory pressure in guest?
    > Let's add a field telling host how hard do we
    > want our memory back

    That's very hard to define across guests. Should we be using stats for
    that instead? In fact, should we allow gratuitous stats sending,
    instead of a simple NEED_MEM flag?

    > - assume you want to over-commit host and start
    > inflating balloon.
    > If low on memory it might be better for guest to
    > wait a bit before inflating.
    > Also, if host asks for a lot of memory a ton of
    > allocations will slow guest significantly.
    > But for guest to do the right thing we need host to tell guest what
    > are its memory and time constraints.
    > Let's add a field telling guest how hard do we
    > want it to give us memory (e.g. time limit)

    We can't have intelligence at both ends, I think. We've chosen a
    host-led model, so we should stick to that unless someone has an
    implementation which proves it's worth doing otherwise.

    Cheers,
    Rusty.




  • 16.  Re: [virtio] New virtio balloon...

    Posted 02-02-2014 16:21
    On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > Also copy virtio-dev since this is clearly implementation ...
    > >
    > > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > >> Hi,
    > >>
    > >> I tried to write a new balloon driver; it's completely untested
    > >> (as I need to write the device). The protocol is basically two vqs, one
    > >> for the guest to send commands, one for the host to send commands.
    > >>
    > >> Some interesting things come out:
    > >> 1) We do need to explicitly tell the host where the page is we want.
    > >> This is required for compaction, for example.
    > >>
    > >> 2) We need to be able to exceed the balloon target, especially for page
    > >> migration. Thus there's no mechanism for the device to refuse to
    > >> give us the pages.
    > >>
    > >> 3) The device can offer multiple page sizes, but the driver can only
    > >> accept one. I'm not sure if this is useful, as guests are either
    > >> huge page backed or not, and returning sub-pages isn't useful.
    > >>
    > >> Linux demo code follows.
    > >>
    > >> Cheers,
    > >> Rusty.
    > >
    > > More comments:
    > > - for projects like auto-ballooning that Luiz works on,
    > > it's not nice that to swap page 1 for page 2
    > > you have to inflate then deflate
    > > besides overhead this confuses the host:
    > > imagine you tell QEMU to increase target,
    > > meanwhile guest inflates temporarily,
    > > QEMU thinks okay done, now you suddenly deflate.
    >
    > I originally allowed the host to deny the deflate, which was why I
    > reversed it. Then I realized that was a bad idea. I can switch it back.

    I think explicit swap that you suggested sounds better to me.

    > > - what's the status of page returned from balloon?
    > > is it zeroed or can it have old data in there?
    > > I think in practice Linux will sometimes map in a zero page,
    > > so guest can save cycles and avoid zeroing it out.
    > > I think we should tell this to guest when returning
    > > pages.
    >
    > QEMU may not know, since the kernel may not tell it.

    Depends on what QEMU does.
    I think kernel always gives us zero pages when we allocate
    memory, they must be initialized otherwise it's an information leak.


    > We should assume
    > nothing, and let the guest zero if it needs to. Seems like a premature
    > optimization.

    Possibly.

    > > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    > > Frank Swiderski from google that inflate/deflate balloon
    > > whenever guest wants (look for "Add a page cache-backed balloon
    > > device driver").
    > >
    > > this is useful but - we need to distinguish pages
    > > like this from regular inflate.
    > > it's not just counter and host needs a way to know
    > > that its target is reached
    >
    > The driver needs to explicitly ask for pages in that region.

    OK so we'll have an extra flag for that?


    > > - do we even want to allow guest not telling host when it wants
    > > to reuse the page?
    > > if yes, I think this should be per-page somehow: when balloon
    > > is inflated guest should tell host whether it
    > > expects to use this page.
    >
    > I decided against it. Making that optional got us into a mess, so now
    > it's compulsory. That also fits better with the idea of a negative
    > balloon.
    >
    > > So I think we should accommodate these uses, and so we want the following flags:
    > >
    > > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    > > flag that specifies pages do not count against target,
    > > can be taken out of balloon.
    > > EXTRA_MEM suggests there's an upper limit on balloon size
    > > but IMHO that's just extra work for host: host does not care
    > > I think, give it as much as you want.
    > > set by guest, used by host
    >
    > I think that Daniel really does want more memory than the guest starts
    > with. And I think he still wants to use the balloon to control it.
    > Daniel?
    >
    > > - TELL_HOST flag that specifies guest will tell host before using pages
    > > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    > > at the moment, listed here for completeness)
    > > set by guest, used by host
    >
    > Dislike.
    >
    > > - ZEROED
    > > flag that specifies that page returned to guest
    > > is zeroed
    > > set by host, used by guest
    >
    > I think that's silly. Under Linux the guest doesn't need to know it's
    > zeroed or not, it just frees the page.

    Yes, but it's possible that Linux will try to zero the page right
    after free. It won't be too hard to set a flag saying it's
    zeroed when we free it.


    > > Each of the flags can be just a feature flag, and then
    > > if we want a mix of them host can create multiple
    > > balloon devices with different flags, and guest looks for best
    > > balloon for its purposes.
    > >
    > > Alternatively flags can be set and reported per page.
    > >
    > >
    > > A couple of other suggestions:
    > >
    > > - how to accommodate memory pressure in guest?
    > > Let's add a field telling host how hard do we
    > > want our memory back
    >
    > That's very hard to define across guests. Should we be using stats for
    > that instead? In fact, should we allow gratuitous stats sending,
    > instead of a simple NEED_MEM flag?
    >
    > > - assume you want to over-commit host and start
    > > inflating balloon.
    > > If low on memory it might be better for guest to
    > > wait a bit before inflating.
    > > Also, if host asks for a lot of memory a ton of
    > > allocations will slow guest significantly.
    > > But for guest to do the right thing we need host to tell guest what
    > > are its memory and time constraints.
    > > Let's add a field telling guest how hard do we
    > > want it to give us memory (e.g. time limit)
    >
    > We can't have intelligence at both ends, I think. We've chosen a
    > host-led model, so we should stick to that

    I'm saying let's control speed of allocations from host,
    that's still host-led?

    > unless someone has an
    > implementation which proves it's worth doing otherwise.
    >
    > Cheers,
    > Rusty.



  • 18.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 03:07
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    >> I originally allowed the host to deny the deflate, which was why I
    >> reversed it. Then I realized that was a bad idea. I can switch it back.
    >
    > I think explicit swap that you suggested sounds better to me.

    OK, I've added this. It takes 2N PFNs, though Linux only uses 2.

    >> > - what's the status of page returned from balloon?
    >> > is it zeroed or can it have old data in there?
    >> > I think in practice Linux will sometimes map in a zero page,
    >> > so guest can save cycles and avoid zeroing it out.
    >> > I think we should tell this to guest when returning
    >> > pages.
    >>
    >> QEMU may not know, since the kernel may not tell it.
    >
    > Depends on what QEMU does.
    > I think kernel always gives us zero pages when we allocate
    > memory, they must be initialized otherwise it's an information leak.

    MADV_DONTNEED is a bit of a mess. Under Linux it will zero pages,
    under POSIX (POSIX_MADV_DONTNEED) it will keep them intact. Under
    FreeBSD we'd really want MADV_FREE, but Linux doesn't support it.

    Let's not design this for today's QEMU.
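
The Linux behaviour described above can be observed from userspace. The following is a minimal sketch, assuming a Linux host and a private anonymous mapping (the case relevant to guest RAM backed by anonymous memory); the helper name byte_after_dontneed is hypothetical and not part of any patch in this thread. It dirties a page, drops it with MADV_DONTNEED, and reads it back. Other systems, and POSIX_MADV_DONTNEED, may legitimately leave the data intact, which is the portability problem being pointed out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map 'len' bytes of private anonymous memory, fill it
 * with 0xAA, drop it with MADV_DONTNEED and return the first byte seen on
 * the next read. Returns -1 if mmap() or madvise() fails.
 */
static int byte_after_dontneed(size_t len)
{
	unsigned char *p;
	int val;

	assert(len > 0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAA, len);		/* dirty the page(s) */

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	val = p[0];			/* on Linux this refaults a zero page */
	munmap(p, len);
	return val;
}
```

Calling byte_after_dontneed(4096) on Linux is expected to return 0 rather than 0xAA. A device that wanted to promise zeroed pages to the guest would have to rely on this platform-specific behaviour.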

    >> We should assume
    >> nothing, and let the guest zero if it needs to. Seems like a premature
    >> optimization.
    >
    > Possibly.
    >
    >> > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    >> > Frank Swiderski from google that inflate/deflate balloon
    >> > whenever guest wants (look for "Add a page cache-backed balloon
    >> > device driver").
    >> >
    >> > this is useful but - we need to distinguish pages
    >> > like this from regular inflate.
    >> > it's not just counter and host needs a way to know
    >> > that its target is reached
    >>
    >> The driver needs to explicitly ask for pages in that region.
    >
    > OK so we'll have an extra flag for that?

    No, I mean that the driver explicitly requests every page by PFN. If it
    requests a page in the extramem region, it will get one.

    >> > - do we even want to allow guest not telling host when it wants
    >> > to reuse the page?
    >> > if yes, I think this should be per-page somehow: when balloon
    >> > is inflated guest should tell host whether it
    >> > expects to use this page.
    >>
    >> I decided against it. Making that optional got us into a mess, so now
    >> it's compulsory. That also fits better with the idea of a negative
    >> balloon.
    >>
    >> > So I think we should accommodate these uses, and so we want the following flags:
    >> >
    >> > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    >> > flag that specifies pages do not count against target,
    >> > can be taken out of balloon.
    >> > EXTRA_MEM suggests there's an upper limit on balloon size
    >> > but IMHO that's just extra work for host: host does not care
    >> > I think, give it as much as you want.
    >> > set by guest, used by host
    >>
    >> I think that Daniel really does want more memory than the guest starts
    >> with. And I think he still wants to use the balloon to control it.
    >> Daniel?
    >>
    >> > - TELL_HOST flag that specifies guest will tell host before using pages
    >> > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    >> > at the moment, listed here for completeness)
    >> > set by guest, used by host
    >>
    >> Dislike.
    >>
    >> > - ZEROED
    >> > flag that specifies that page returned to guest
    >> > is zeroed
    >> > set by host, used by guest
    >>
    >> I think that's silly. Under Linux the guest doesn't need to know it's
    >> zeroed or not, it just frees the page.
    >
    > Yes, but it's possible that Linux will try to zero the page right
    > after free. It won't be too hard to set a flag saying it's
    > zeroed when we free it.

    We could, but I don't see that this is a bottleneck. See above.

    >> > - how to accommodate memory pressure in guest?
    >> > Let's add a field telling host how hard do we
    >> > want our memory back
    >>
    >> That's very hard to define across guests. Should we be using stats for
    >> that instead? In fact, should we allow gratuitous stats sending,
    >> instead of a simple NEED_MEM flag?
    >>
    >> > - assume you want to over-commit host and start
    >> > inflating balloon.
    >> > If low on memory it might be better for guest to
    >> > wait a bit before inflating.
    >> > Also, if host asks for a lot of memory a ton of
    >> > allocations will slow guest significantly.
    >> > But for guest to do the right thing we need host to tell guest what
    >> > are its memory and time constraints.
    >> > Let's add a field telling guest how hard do we
    >> > want it to give us memory (e.g. time limit)
    >>
    >> We can't have intelligence at both ends, I think. We've chosen a
    >> host-led model, so we should stick to that
    >
    > I'm saying let's control speed of allocations from host,
    > that's still host-led?

    You want the guest to wait a bit, and control the rate at which it
    allocates memory. If that's what we want, let's get the host to delay
    telling it to inflate, and then inflate slowly. Otherwise we have to
    debug both host and guest sides when we hit performance problems.
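
As a rough illustration of keeping the pacing on the host side, here is a minimal sketch of a host that walks the balloon target toward its goal in bounded steps, sending one SET_BALLOON update per step. The function names and the max_step parameter are hypothetical and appear nowhere in the patch; how long the host waits between updates is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical host-side helper: move 'current' toward 'goal' by at most
 * 'max_step' pages. Targets are signed because the protocol allows a
 * negative balloon with VIRTIO_BALLOON_F_EXTRA_MEM.
 */
static int64_t next_target(int64_t current, int64_t goal, int64_t max_step)
{
	assert(max_step > 0);

	if (goal > current)
		return (goal - current > max_step) ? current + max_step : goal;
	return (current - goal > max_step) ? current - max_step : goal;
}

/* Count how many target updates the host would send to reach 'goal'. */
static unsigned int count_updates(int64_t current, int64_t goal,
				  int64_t max_step)
{
	unsigned int n = 0;

	while (current != goal) {
		current = next_target(current, goal, max_step);
		n++;
	}
	return n;
}
```

With this shape the guest stays simple: it only ever chases the most recent target, and any rate limiting can be debugged on the host alone.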

    I changed the STATS_REPLY to STATS, and included a "want more mem"
    flag. The implication is that the host compares stats across different
    guests.

    Cheers,
    Rusty.

    diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    index 93f13e7c561d..6c0151f12f8b 100644
    --- a/drivers/virtio/virtio_balloon2.c
    +++ b/drivers/virtio/virtio_balloon2.c
    @@ -39,12 +39,15 @@ struct gcmd_give_pages {
    __le64 pages[256];
    };

    -struct gcmd_need_mem {
    - __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    +struct gcmd_exchange_pages {
    + __le64 type; /* VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES */
    + __le64 from_balloon;
    + __le64 to_balloon;
    };

    -struct gcmd_stats_reply {
    +struct gcmd_stats {
    __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    + __le64 need_more;
    struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    };

    @@ -90,8 +93,8 @@ struct virtio_balloon {
    __le64 type;
    struct gcmd_get_pages get_pages;
    struct gcmd_give_pages give_pages;
    - struct gcmd_need_mem need_mem;
    - struct gcmd_stats_reply stats_reply;
    + struct gcmd_exchange_pages exchange_pages;
    + struct gcmd_stats stats;
    } gcmd;

    union hcmd {
    @@ -382,19 +385,11 @@ static int virtballoon_migratepage(struct address_space *mapping,
    if (!mutex_trylock(&vb->lock))
    return -EAGAIN;

    - /* Try to get the page out of the balloon. */
    - vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    - vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT;
    - if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) {
    - err = -EIO;
    - goto unlock;
    - }
    -
    - /* Now put newpage into balloon. */
    - vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    - vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT;
    - if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) {
    - /* We leak a page here, but only happens if balloon broken. */
    + vb->gcmd.exchange_pages.type =
    + cpu_to_le64(VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES);
    + vb->gcmd.exchange_pages.from_balloon = page_to_pfn(page) << PAGE_SHIFT;
    + vb->gcmd.exchange_pages.to_balloon = page_to_pfn(newpage) << PAGE_SHIFT;
    + if (!send_gcmd(vb, sizeof(vb->gcmd.exchange_pages))) {
    err = -EIO;
    goto unlock;
    }
    diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
    index cdca2934668a..925d79ad5c90 100644
    --- a/include/uapi/linux/virtio_balloon.h
    +++ b/include/uapi/linux/virtio_balloon.h
    @@ -48,21 +48,36 @@ struct virtio_balloon_statistic {
    };

    /* Guest->host command queue. */
    -/* Ask the host for more pages.
    - Followed by array of 1 or more readable le64 pageaddr's. */
    +
    +/*
    + * Ask the host for more pages.
    + * Followed by array of 1 or more readable le64 pageaddr's.
    + */
    #define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0)
    -/* Give the host more pages.
    - Followed by array of 1 or more readable le64 pageaddr's */
    +/*
    + * Give the host more pages.
    + * Followed by array of 1 or more readable le64 pageaddr's
    + */
    #define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1)
    -/* Dear host: I need more memory. */
    -#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2)
    -/* Dear host: here are your stats.
    - * Followed by 0 or more struct virtio_balloon_statistic structs. */
    -#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3)
    +/*
    + * Exchange pages in the balloon.
    + * Followed by array of 2N readable le64 pageaddr's. First N: to extract from
    + * balloon, next N: to add to the balloon
    + */
    +#define VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES ((__le64)2)
    +/*
    + * Stats, and optional request for memory.
    + * __le64: 0 if we don't want target increased, 1 if we do.
    + * Followed by 0 or more struct virtio_balloon_statistic structs.
    + */
    +#define VIRTIO_BALLOON_GCMD_STATS ((__le64)3)

    /* Host->guest command queue. */
    -/* Followed by s64 of new balloon target size (only negative if
    - * VIRTIO_BALLOON_F_EXTRA_MEM). */
    +
    +/*
    + * Followed by s64 of new balloon target size (only negative if
    + * VIRTIO_BALLOON_F_EXTRA_MEM).
    + */
    #define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000)
    /* Ask for statistics */
    #define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001)
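
The header comment for VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES describes a command word followed by 2N little-endian page addresses: first the N pages to extract from the balloon, then the N pages to add. A minimal userspace sketch of that layout follows. The encoder, its name and its buffer handling are assumptions made for illustration; only the command value (2) and the field order come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES in the patch above. */
#define GCMD_EXCHANGE_PAGES 2

/* Store v at dst in little-endian order, whatever the host endianness. */
static void put_le64(uint8_t *dst, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Hypothetical encoder: writes the command type, then n page addresses
 * leaving the balloon, then n page addresses entering it.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t encode_exchange(uint8_t *buf, size_t buflen,
			      const uint64_t *from_balloon,
			      const uint64_t *to_balloon, size_t n)
{
	size_t need = 8 * (1 + 2 * n);
	size_t i;

	assert(n > 0);
	if (buflen < need)
		return 0;

	put_le64(buf, GCMD_EXCHANGE_PAGES);
	for (i = 0; i < n; i++) {
		put_le64(buf + 8 * (1 + i), from_balloon[i]);
		put_le64(buf + 8 * (1 + n + i), to_balloon[i]);
	}
	return need;
}
```

For N = 1 this yields the same 24-byte layout as struct gcmd_exchange_pages in the driver (type, from_balloon, to_balloon), which matches the remark that Linux only ever sends two addresses.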





  • 20.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 09:22
    On Mon, Feb 03, 2014 at 01:37:17PM +1030, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > >> I originally allowed the host to deny the deflate, which was why I
    > >> reversed it. Then I realized that was a bad idea. I can switch it back.
    > >
    > > I think explicit swap that you suggested sounds better to me.
    >
    > OK, I've added this. It takes 2N PFNs, though Linux only uses 2.
    >
    > >> > - what's the status of page returned from balloon?
    > >> > is it zeroed or can it have old data in there?
    > >> > I think in practice Linux will sometimes map in a zero page,
    > >> > so guest can save cycles and avoid zeroing it out.
    > >> > I think we should tell this to guest when returning
    > >> > pages.
    > >>
    > >> QEMU may not know, since the kernel may not tell it.
    > >
    > > Depends on what QEMU does.
    > > I think kernel always gives us zero pages when we allocate
    > > memory, they must be initialized otherwise it's an information leak.
    >
    > MADV_DONTNEED is a bit of a mess. Under Linux it will zero pages,
    > under POSIX (POSIX_MADV_DONTNEED) it will keep them intact. Under
    > FreeBSD we'd really want MADV_FREE, but Linux doesn't support it.
    >
    > Let's not design this for today's QEMU.
    >
    > >> We should assume
    > >> nothing, and let the guest zero if it needs to. Seems like a premature
    > >> optimization.
    > >
    > > Possibly.
    > >
    > >> > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    > >> > Frank Swiderski from google that inflate/deflate balloon
    > >> > whenever guest wants (look for "Add a page cache-backed balloon
    > >> > device driver").
    > >> >
    > >> > this is useful but - we need to distinguish pages
    > >> > like this from regular inflate.
    > >> > it's not just counter and host needs a way to know
    > >> > that its target is reached
    > >>
    > >> The driver needs to explicitly ask for pages in that region.
    > >
    > > OK so we'll have an extra flag for that?
    >
    > No, I mean that the driver explicitly requests every page by PFN. If it
    > requests a page in the extramem region, it will get one.

    Maybe I misunderstand. What is EXTRA_MEM in aid of?

    > >> > - do we even want to allow guest not telling host when it wants
    > >> > to reuse the page?
    > >> > if yes, I think this should be per-page somehow: when balloon
    > >> > is inflated guest should tell host whether it
    > >> > expects to use this page.
    > >>
    > >> I decided against it. Making that optional got us into a mess, so now
    > >> it's compulsory. That also fits better with the idea of a negative
    > >> balloon.
    > >>
    > >> > So I think we should accommodate these uses, and so we want the following flags:
    > >> >
    > >> > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    > >> > flag that specifies pages do not count against target,
    > >> > can be taken out of balloon.
    > >> > EXTRA_MEM suggests there's an upper limit on balloon size
    > >> > but IMHO that's just extra work for host: host does not care
    > >> > I think, give it as much as you want.
    > >> > set by guest, used by host
    > >>
    > >> I think that Daniel really does want more memory than the guest starts
    > >> with. And I think he still wants to use the balloon to control it.
    > >> Daniel?
    > >>
    > >> > - TELL_HOST flag that specifies guest will tell host before using pages
    > >> > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    > >> > at the moment, listed here for completeness)
    > >> > set by guest, used by host
    > >>
    > >> Dislike.
    > >>
    > >> > - ZEROED
    > >> > flag that specifies that page returned to guest
    > >> > is zeroed
    > >> > set by host, used by guest
    > >>
    > >> I think that's silly. Under Linux the guest doesn't need to know it's
    > >> zeroed or not, it just frees the page.
    > >
    > > Yes, but it's possible that Linux will try to zero the page right
    > > after free. It won't be too hard to set a flag saying it's
    > > zeroed when we free it.
    >
    > We could, but I don't see that this is a bottleneck. See above.
    >
    > >> > - how to accommodate memory pressure in guest?
    > >> > Let's add a field telling host how hard do we
    > >> > want our memory back
    > >>
    > >> That's very hard to define across guests. Should we be using stats for
    > >> that instead? In fact, should we allow gratuitous stats sending,
    > >> instead of a simple NEED_MEM flag?
    > >>
    > >> > - assume you want to over-commit host and start
    > >> > inflating balloon.
    > >> > If low on memory it might be better for guest to
    > >> > wait a bit before inflating.
    > >> > Also, if host asks for a lot of memory a ton of
    > >> > allocations will slow guest significantly.
    > >> > But for guest to do the right thing we need host to tell guest what
    > >> > are its memory and time constraints.
    > >> > Let's add a field telling guest how hard do we
    > >> > want it to give us memory (e.g. time limit)
    > >>
    > >> We can't have intelligence at both ends, I think. We've chosen a
    > >> host-led model, so we should stick to that
    > >
    > > I'm saying let's control speed of allocations from host,
    > > that's still host-led?
    >
    > You want the guest to wait a bit, and control the rate at which it
    > allocates memory. If that's what we want, let's get the host to delay
    > telling it to inflate, and then inflate slowly. Otherwise we have to
    > debug both host and guest sides when we hit performance problems.

    I wonder if a single bit will be enough.

    >
    > I changed the STATS_REPLY to STATS, and included a "want more mem"
    > flag. The implication is that the host compares stats across different
    > guests.
    >
    > Cheers,
    > Rusty.
> diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c > index 93f13e7c561d..6c0151f12f8b 100644 > --- a/drivers/virtio/virtio_balloon2.c > +++ b/drivers/virtio/virtio_balloon2.c > @@ -39,12 +39,15 @@ struct gcmd_give_pages { > __le64 pages[256]; > }; > > -struct gcmd_need_mem { > - __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */ > +struct gcmd_exchange_pages { > + __le64 type; /* VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES */ > + __le64 from_balloon; > + __le64 to_balloon; > }; > > -struct gcmd_stats_reply { > +struct gcmd_stats { > __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */ > + __le64 need_more; > struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR]; > }; > > @@ -90,8 +93,8 @@ struct virtio_balloon { > __le64 type; > struct gcmd_get_pages get_pages; > struct gcmd_give_pages give_pages; > - struct gcmd_need_mem need_mem; > - struct gcmd_stats_reply stats_reply; > + struct gcmd_exchange_pages exchange_pages; > + struct gcmd_stats stats; > } gcmd; > > union hcmd { > @@ -382,19 +385,11 @@ static int virtballoon_migratepage(struct address_space *mapping, > if (!mutex_trylock(&vb->lock)) > return -EAGAIN; > > - /* Try to get the page out of the balloon. */ > - vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES); > - vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT; > - if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) { > - err = -EIO; > - goto unlock; > - } > - > - /* Now put newpage into balloon. */ > - vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES); > - vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT; > - if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) { > - /* We leak a page here, but only happens if balloon broken. 
*/ > + vb->gcmd.exchange_pages.type = > + cpu_to_le64(VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES); > + vb->gcmd.exchange_pages.from_balloon = page_to_pfn(page) << PAGE_SHIFT; > + vb->gcmd.exchange_pages.to_balloon = page_to_pfn(newpage) << PAGE_SHIFT; > + if (!send_gcmd(vb, sizeof(vb->gcmd.exchange_pages))) { > err = -EIO; > goto unlock; > } > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h > index cdca2934668a..925d79ad5c90 100644 > --- a/include/uapi/linux/virtio_balloon.h > +++ b/include/uapi/linux/virtio_balloon.h > @@ -48,21 +48,36 @@ struct virtio_balloon_statistic { > }; > > /* Guest->host command queue. */ > -/* Ask the host for more pages. > - Followed by array of 1 or more readable le64 pageaddr's. */ > + > +/* > + * Ask the host for more pages. > + * Followed by array of 1 or more readable le64 pageaddr's. > + */ > #define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0) > -/* Give the host more pages. > - Followed by array of 1 or more readable le64 pageaddr's */ > +/* > + * Give the host more pages. > + * Followed by array of 1 or more readable le64 pageaddr's > + */ > #define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1) > -/* Dear host: I need more memory. */ > -#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2) > -/* Dear host: here are your stats. > - * Followed by 0 or more struct virtio_balloon_statistic structs. */ > -#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3) > +/* > + * Exchange pages in the ballon. > + * Followed by array of 2N readable le64 pageaddr's. First N: to extract from > + * balloon, next N: to add to the balloon > +*/ > +#define VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES ((__le64)2) > +/* > + * Stats, and optional request for memory. > + * __le64: 0 if we don't want target increased, 1 if we do. > + * Followed by 0 or more struct virtio_balloon_statistic structs. > + */ > +#define VIRTIO_BALLOON_GCMD_STATS ((__le64)3) > > /* Host->guest command queue. 
*/ > -/* Followed by s64 of new balloon target size (only negative if > - * VIRTIO_BALLOON_F_EXTRA_MEM). */ > + > +/* > + * Followed by s64 of new balloon target size (only negative if > + * VIRTIO_BALLOON_F_EXTRA_MEM). > + */ > #define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000) > /* Ask for statistics */ > #define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001) > One other comment: for NUMA VMs we might want to have separate counters for each node, and separate memory pressure notification. Maybe just separate balloon for each node? -- MST


  • 21.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 09:26
    On Mon, Feb 03, 2014 at 01:37:17PM +1030, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > >> I originally allowed the host to deny the deflate, which was why I
    > >> reversed it. Then I realized that was a bad idea. I can switch it back.
    > >
    > > I think explicit swap that you suggested sounds better to me.
    >
    > OK, I've added this. It takes 2N PFNS, though Linux only uses 2.
    >
    > >> > - what's the status of page returned from balloon?
    > >> > is it zeroed or can it have old data in there?
    > >> > I think in practice Linux will sometimes map in a zero page,
    > >> > so guest can save cycles and avoid zeroing it out.
    > >> > I think we should tell this to guest when returning
    > >> > pages.
    > >>
    > >> QEMU may not know, since the kernel may not tell it.
    > >
    > > Depends on what QEMU does.
    > > I think kernel always gives us zero pages when we allocate
    > > memory, they must be initialized otherwise it's an information leak.
    >
    > MADV_DONTNEED is a bit of a mess. Under Linux it will zero pages,
    > under POSIX (POSIX_MADV_DONTNEED) it will keep them intact. Under
    > FreeBSD we'd really want MADV_FREE, but linux doesn't support it.
    >
    > Let's not design this for today's QEMU.
    >
    > >> We should assume
    > >> nothing, and let the guest zero if it needs to. Seems like a premature
    > >> optimization.
    > >
    > > Possibly.
    > >
    > >> > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    > >> > Frank Swiderski from google that inflate/deflate balloon
    > >> > whenever guest wants (look for "Add a page cache-backed balloon
    > >> > device driver").
    > >> >
    > >> > this is useful but - we need to distinguish pages
    > >> > like this from regular inflate.
    > >> > it's not just counter and host needs a way to know
    > >> > that its target is reached
    > >>
    > >> The driver needs to explicitly ask for pages in that region.
    > >
    > > OK so we'll have an extra flag for that?
    >
    > No, I mean that the driver explicitly requests every page by PFN. If it
    > requests a page in the extramem region, it will get one.

    Maybe I misunderstand. What is EXTRA_MEM in aid of?

    > >> > - do we even want to allow guest not telling host when it wants
    > >> > to reuse the page?
    > >> > if yes, I think this should be per-page somehow: when balloon
    > >> > is inflated guest should tell host whether it
    > >> > expects to use this page.
    > >>
    > >> I decided against it. Making that optional got us into a mess, so now
    > >> it's compulsory. That also fits better with the idea of a negative
    > >> balloon.
    > >>
    > >> > So I think we should accommodate these uses, and so we want the following flags:
    > >> >
    > >> > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    > >> > flag that specifies pages do not count against target,
    > >> > can be taken out of balloon.
    > >> > EXTRA_MEM suggests there's an upper limit on balloon size
    > >> > but IMHO that's just extra work for host: host does not care
    > >> > I think, give it as much as you want.
    > >> > set by guest, used by host
    > >>
    > >> I think that Daniel really does want more memory than the guest starts
    > >> with. And I think he still wants to use the balloon to control it.
    > >> Daniel?
    > >>
    > >> > - TELL_HOST flag that specifies guest will tell host before using pages
    > >> > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    > >> > at the moment, listed here for completeness)
    > >> > set by guest, used by host
    > >>
    > >> Dislike.
    > >>
    > >> > - ZEROED
    > >> > flag that specifies that page returned to guest
    > >> > is zeroed
    > >> > set by host, used by guest
    > >>
    > >> I think that's silly. Under Linux the guest doesn't need to know it's
    > >> zeroed or not, it just frees the page.
    > >
    > > Yes but it's possible that linux will try to zero page right
    > > after free. It won't be too hard to set a flag that it's
    > > zeroed when we free it.
    >
    > We could, but I don't see that this is a bottleneck. See above.
    >
    > >> > - how to accommodate memory pressure in guest?
    > >> > Let's add a field telling host how hard do we
    > >> > want our memory back
    > >>
    > >> That's very hard to define across guests. Should we be using stats for
    > >> that instead? In fact, should we allow gratuitous stats sending,
    > >> instead of a simple NEED_MEM flag?
    > >>
    > >> > - assume you want to over-commit host and start
    > >> > inflating balloon.
    > >> > If low on memory it might be better for guest to
    > >> > wait a bit before inflating.
    > >> > Also, if host asks for a lot of memory a ton of
    > >> > allocations will slow guest significantly.
    > >> > But for guest to do the right thing we need host to tell guest what
    > >> > are its memory and time constraints.
    > >> > Let's add a field telling guest how hard do we
    > >> > want it to give us memory (e.g. time limit)
    > >>
    > >> We can't have intelligence at both ends, I think. We've chosen a
    > >> host-led model, so we should stick to that
    > >
    > > I'm saying let's control speed of allocations from host,
    > > that's still host-led?
    >
    > You want the guest to wait a bit, and control the rate at which it
    > allocates memory. If that's what we want, let's get the host to delay
    > telling it to inflate, and then inflate slowly. Otherwise we have to
    > debug both host and guest sides when we hit performance problems.

    I wonder if a single bit will be enough.

    >
    > I changed the STATS_REPLY to STATS, and included a "want more mem"
    > flag. The implication is that the host compares stats across different
    > guests.
    >
    > Cheers,
    > Rusty.
    > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > index 93f13e7c561d..6c0151f12f8b 100644
    > --- a/drivers/virtio/virtio_balloon2.c
    > +++ b/drivers/virtio/virtio_balloon2.c
    > @@ -39,12 +39,15 @@ struct gcmd_give_pages {
    > __le64 pages[256];
    > };
    >
    > -struct gcmd_need_mem {
    > - __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    > +struct gcmd_exchange_pages {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES */
    > + __le64 from_balloon;
    > + __le64 to_balloon;
    > };
    >
    > -struct gcmd_stats_reply {
    > +struct gcmd_stats {
    > __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    > + __le64 need_more;
    > struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    > };
    >
    > @@ -90,8 +93,8 @@ struct virtio_balloon {
    > __le64 type;
    > struct gcmd_get_pages get_pages;
    > struct gcmd_give_pages give_pages;
    > - struct gcmd_need_mem need_mem;
    > - struct gcmd_stats_reply stats_reply;
    > + struct gcmd_exchange_pages exchange_pages;
    > + struct gcmd_stats stats;
    > } gcmd;
    >
    > union hcmd {
    > @@ -382,19 +385,11 @@ static int virtballoon_migratepage(struct address_space *mapping,
    > if (!mutex_trylock(&vb->lock))
    > return -EAGAIN;
    >
    > - /* Try to get the page out of the balloon. */
    > - vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > - vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT;
    > - if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) {
    > - err = -EIO;
    > - goto unlock;
    > - }
    > -
    > - /* Now put newpage into balloon. */
    > - vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > - vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT;
    > - if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) {
    > - /* We leak a page here, but only happens if balloon broken. */
    > + vb->gcmd.exchange_pages.type =
    > + cpu_to_le64(VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES);
    > + vb->gcmd.exchange_pages.from_balloon = page_to_pfn(page) << PAGE_SHIFT;
    > + vb->gcmd.exchange_pages.to_balloon = page_to_pfn(newpage) << PAGE_SHIFT;
    > + if (!send_gcmd(vb, sizeof(vb->gcmd.exchange_pages))) {
    > err = -EIO;
    > goto unlock;
    > }
    > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
    > index cdca2934668a..925d79ad5c90 100644
    > --- a/include/uapi/linux/virtio_balloon.h
    > +++ b/include/uapi/linux/virtio_balloon.h
    > @@ -48,21 +48,36 @@ struct virtio_balloon_statistic {
    > };
    >
    > /* Guest->host command queue. */
    > -/* Ask the host for more pages.
    > - Followed by array of 1 or more readable le64 pageaddr's. */
    > +
    > +/*
    > + * Ask the host for more pages.
    > + * Followed by array of 1 or more readable le64 pageaddr's.
    > + */
    > #define VIRTIO_BALLOON_GCMD_GET_PAGES ((__le64)0)
    > -/* Give the host more pages.
    > - Followed by array of 1 or more readable le64 pageaddr's */
    > +/*
    > + * Give the host more pages.
    > + * Followed by array of 1 or more readable le64 pageaddr's
    > + */
    > #define VIRTIO_BALLOON_GCMD_GIVE_PAGES ((__le64)1)
    > -/* Dear host: I need more memory. */
    > -#define VIRTIO_BALLOON_GCMD_NEEDMEM ((__le64)2)
    > -/* Dear host: here are your stats.
    > - * Followed by 0 or more struct virtio_balloon_statistic structs. */
    > -#define VIRTIO_BALLOON_GCMD_STATS_REPLY ((__le64)3)
    > +/*
    > + * Exchange pages in the ballon.
    > + * Followed by array of 2N readable le64 pageaddr's. First N: to extract from
    > + * balloon, next N: to add to the balloon
    > +*/
    > +#define VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES ((__le64)2)
    > +/*
    > + * Stats, and optional request for memory.
    > + * __le64: 0 if we don't want target increased, 1 if we do.
    > + * Followed by 0 or more struct virtio_balloon_statistic structs.
    > + */
    > +#define VIRTIO_BALLOON_GCMD_STATS ((__le64)3)
    >
    > /* Host->guest command queue. */
    > -/* Followed by s64 of new balloon target size (only negative if
    > - * VIRTIO_BALLOON_F_EXTRA_MEM). */
    > +
    > +/*
    > + * Followed by s64 of new balloon target size (only negative if
    > + * VIRTIO_BALLOON_F_EXTRA_MEM).
    > + */
    > #define VIRTIO_BALLOON_HCMD_SET_BALLOON ((__le64)0x8000)
    > /* Ask for statistics */
    > #define VIRTIO_BALLOON_HCMD_GET_STATS ((__le64)0x8001)
    >


    One other comment: for NUMA VMs we might want
    to have separate counters for each node,
    and separate memory pressure notification.

    Maybe just separate balloon for each node?

    --
    MST
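
The header comment in the quoted patch describes EXCHANGE_PAGES as a type word followed by 2N page addresses (first N to extract, next N to add), while the demo struct carries only one pair. A minimal userspace sketch of the 2N layout follows. The helper names (put_le64, build_exchange) and the 4 KiB page shift are assumptions made for illustration; none of them come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define GCMD_EXCHANGE_PAGES 2u  /* value of VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES in the patch */
#define SKETCH_PAGE_SHIFT   12  /* assumption: 4 KiB pages */

/* Store v at p in little-endian byte order, whatever the host endianness. */
static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Fill buf with the type word, then n addresses to extract from the
 * balloon, then n addresses to add to it.
 * Returns the number of bytes written, or 0 if n is 0 or buf is too small.
 */
static size_t build_exchange(uint8_t *buf, size_t buflen,
			     const uint64_t *from_pfn,
			     const uint64_t *to_pfn, size_t n)
{
	size_t need = 8 * (1 + 2 * n);

	if (n == 0 || buflen < need)
		return 0;

	put_le64(buf, GCMD_EXCHANGE_PAGES);
	for (size_t i = 0; i < n; i++) {
		/* The patch sends page addresses (pfn << PAGE_SHIFT), not raw PFNs. */
		put_le64(buf + 8 * (1 + i), from_pfn[i] << SKETCH_PAGE_SHIFT);
		put_le64(buf + 8 * (1 + n + i), to_pfn[i] << SKETCH_PAGE_SHIFT);
	}
	return need;
}
```

With n = 1 this yields the same three 64-bit words as the struct gcmd_exchange_pages used by the Linux demo code.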



  • 22.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 01:30
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > On Mon, Feb 03, 2014 at 01:37:17PM +1030, Rusty Russell wrote:
    >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    >> > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    >> >> The driver needs to explicitly ask for pages in that region.
    >> >
    >> > OK so we'll have an extra flag for that?
    >>
    >> No, I mean that the driver explicitly requests every page by PFN. If it
    >> requests a page in the extramem region, it will get one.
    >
    > Maybe I misunderstand. What is EXTRA_MEM in aid of?

    The VIRTIO_BALLOON_F_EXTRA_MEM feature enables the extra memory region.
    It's a simple form of memory hotplug. Daniel wanted it.

    It might make sense for some guests to allocate metadata (eg. array of
    struct pages) for that extra memory if it's required.

    > One other comment: for NUMA VMs we might want
    > to have separate counters for each node,
    > and separate memory pressure notification.
    >
    > Maybe just separate balloon for each node?

    Good question. We could add a range field in the config space, such
    that the different balloons cover different areas.

    On the other hand, we can add this in future with a feature bit. Though
    it would mean offering a guest multiple balloons, and then ignoring
    all but one if it didn't understand the range feature.

    Cheers,
    Rusty.
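
One way to picture the range idea above: each balloon would advertise the area of guest-physical memory it covers, and the driver would pick the balloon whose range contains a given page address. The sketch below is purely illustrative. The struct balloon_range layout and both function names are hypothetical; no such config field exists in the posted patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-balloon config field; not part of the posted patch. */
struct balloon_range {
	uint64_t start; /* first guest-physical address covered */
	uint64_t len;   /* length of the covered area in bytes */
};

/* True if addr lies inside [start, start + len). */
static bool range_covers(const struct balloon_range *r, uint64_t addr)
{
	return addr >= r->start && addr - r->start < r->len;
}

/* Return the index of the balloon whose range covers addr, or -1 if none does. */
static int balloon_for_addr(const struct balloon_range *r, size_t n,
			    uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (range_covers(&r[i], addr))
			return (int)i;
	return -1;
}
```

A guest that did not understand the range feature would, as noted above, have to ignore all but one of the offered balloons.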




  • 24.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 21:16
    On Tue, Feb 04, 2014 at 12:00:21PM +1030, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Mon, Feb 03, 2014 at 01:37:17PM +1030, Rusty Russell wrote:
    > >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    > >> > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > >> >> The driver needs to explicitly ask for pages in that region.
    > >> >
    > >> > OK so we'll have an extra flag for that?
    > >>
    > >> No, I mean that the driver explicitly requests every page by PFN. If it
    > >> requests a page in the extramem region, it will get one.
    > >
    > > Maybe I misunderstand. What is EXTRA_MEM in aid of?
    >
    > The VIRTIO_BALLOON_F_EXTRA_MEM feature enables the extra memory region.
    > It's a simple form of memory hotplug. Daniel wanted it.
    >
    > It might make sense for some guests to allocate metadata (eg. array of
    > struct pages) for that extra memory if it's required.

    Linux does this dynamically during memory hotadd phase.

    Daniel



  • 26.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 13:36
    On Mon, Feb 03, 2014 at 01:37:17PM +1030, Rusty Russell wrote:

    [...]

    > I changed the STATS_REPLY to STATS, and included a "want more mem"
    > flag. The implication is that the host compares stats across different
    > guests.
    >
    > Cheers,
    > Rusty.
    >
    > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > index 93f13e7c561d..6c0151f12f8b 100644
    > --- a/drivers/virtio/virtio_balloon2.c
    > +++ b/drivers/virtio/virtio_balloon2.c
    > @@ -39,12 +39,15 @@ struct gcmd_give_pages {
    > __le64 pages[256];
    > };
    >
    > -struct gcmd_need_mem {
    > - __le64 type; /* VIRTIO_BALLOON_GCMD_NEED_MEM */
    > +struct gcmd_exchange_pages {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES */
    > + __le64 from_balloon;
    > + __le64 to_balloon;
    > };
    >
    > -struct gcmd_stats_reply {
    > +struct gcmd_stats {
    > __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */

    VIRTIO_BALLOON_GCMD_STATS? Please see above.

    Daniel



  • 28.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 00:56
    Daniel Kiper <daniel.kiper@oracle.com> writes:
    > On Mon, Feb 03, 2014 at 01:37:17PM +1030, Rusty Russell wrote:
    >> -struct gcmd_stats_reply {
    >> +struct gcmd_stats {
    >> __le64 type; /* VIRTIO_BALLOON_GCMD_STATS_REPLY */
    >
    > VIRTIO_BALLOON_GCMD_STATS? Please see above.

    Thanks, fixed.

    Cheers,
    Rusty.




  • 30.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 20:15
    On Mon, 03 Feb 2014 13:37:17 +1030
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > >> > - how to accommodate memory pressure in guest?
    > >> > Let's add a field telling host how hard do we
    > >> > want our memory back
    > >>
    > >> That's very hard to define across guests. Should we be using stats for
    > >> that instead? In fact, should we allow gratuitous stats sending,
    > >> instead of a simple NEED_MEM flag?
    > >>
    > >> > - assume you want to over-commit host and start
    > >> > inflating balloon.
    > >> > If low on memory it might be better for guest to
    > >> > wait a bit before inflating.
    > >> > Also, if host asks for a lot of memory a ton of
    > >> > allocations will slow guest significantly.
    > >> > But for guest to do the right thing we need host to tell guest what
    > >> > are its memory and time constraints.
    > >> > Let's add a field telling guest how hard do we
    > >> > want it to give us memory (e.g. time limit)
    > >>
    > >> We can't have intelligence at both ends, I think. We've chosen a
    > >> host-led model, so we should stick to that
    > >
    > > I'm saying let's control speed of allocations from host,
    > > that's still host-led?
    >
    > You want the guest to wait a bit, and control the rate at which it
    > allocates memory. If that's what we want, let's get the host to delay
    > telling it to inflate, and then inflate slowly. Otherwise we have to
    > debug both host and guest sides when we hit performance problems.
    >
    > I changed the STATS_REPLY to STATS, and included a "want more mem"
    > flag. The implication is that the host compares stats across different
    > guests.

    When would the host do that? Can you elaborate a bit how this would
    be used?

    I feel that what you're proposing is not far away from automatic
    ballooning. Basically, my current idea for automatic ballooning more or
    less is:

    1. QEMU registers for vmpressure events in the host
    (see Documentation/cgroups/memory.txt "Memory Pressure" section)

    2. The virtio-balloon driver in the guest registers for
    in-kernel memory pressure notification (not upstream yet)

    3. When the host is under pressure, QEMU is notified and it asks the
    guest to inflate its balloon by some amount

    4. When the guest is under pressure, QEMU is notified by the
    virtio-balloon driver and QEMU asks the guest to deflate by
    some value

    Now, doing one inflate/deflate per event is not very good. I'm trying
    to find a way to use the number of events to determine:

    A. When memory should be moved from the guest to the host

    B. When memory should be moved from the host to the guest

    C. When memory shouldn't move (ie. when both guest and host experience
    similar pressure)

    Note that there's no "central authority" that has information about
    all guests to decide how to do that. Each QEMU instance has to decide for
    itself, based on the information it has about the host and about its guest.
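
The A/B/C decision above could be sketched, under heavy assumptions, as a comparison of pressure-event counts seen on each side over some window. The enum, the function name decide_move and the margin parameter are all hypothetical; the post does not specify how events would be counted or compared.

```c
/* Hypothetical outcome of one decision round; names are illustrative only. */
enum balloon_move {
	MOVE_NONE,     /* case C: similar pressure on both sides */
	MOVE_TO_HOST,  /* case A: inflate, guest gives memory to the host */
	MOVE_TO_GUEST  /* case B: deflate, host gives memory to the guest */
};

/*
 * Compare the number of pressure events seen on each side in some window.
 * Memory moves only when one side exceeds the other by more than margin,
 * so that roughly equal pressure leaves the balloon alone.
 */
static enum balloon_move decide_move(unsigned int host_events,
				     unsigned int guest_events,
				     unsigned int margin)
{
	if (host_events > guest_events && host_events - guest_events > margin)
		return MOVE_TO_HOST;
	if (guest_events > host_events && guest_events - host_events > margin)
		return MOVE_TO_GUEST;
	return MOVE_NONE;
}
```

Since there is no central authority, each QEMU instance would run a decision like this on its own, seeing only host pressure and the pressure of its one guest.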



  • 32.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 01:24
    Luiz Capitulino <lcapitulino@redhat.com> writes:
    > On Mon, 03 Feb 2014 13:37:17 +1030
    >> I changed the STATS_REPLY to STATS, and included a "want more mem"
    >> flag. The implication is that the host compare stats across different
    >> guests.
    >
    > When would the host do that? Can you elaborate a bit how this would
    > be used?
    >
    > I feel that what you're proposing is not far away from automatic
    > ballooning. Basically, my current idea for automatic ballooning more or
    > less is:

    Yes, this is to enable automatic ballooning, as per your lkml posting.
    Here's how it maps onto your case:

    > 1. QEMU registers for vmpressure events in the host
    > (see Documentation/cgroups/memory.txt "Memory Pressure" section)
    >
    > 2. The virtio-balloon driver in the guest registers for
    > in-kernel memory pressure notification (not upstream yet)
    >
    > 3. When the host is under pressure, QEMU is notified and it asks the
    > guest to inflate its balloon by some amount

    This would mean the host changing the balloon target, ie
    VIRTIO_BALLOON_HCMD_SET_BALLOON.

    > 4. When the guest is under pressure, QEMU is notified by the
    > virtio-balloon driver and QEMU asks the guest to deflate by
    > some value

    This would mean the guest sending stats with the "need_mem" flag set to true,
    and the host changing the balloon target.

    > Now, doing one inflate/deflate per event is not very good. I'm trying
    > to find a way where we use the amount of events to determine:
    >
    > A. When memory should be moved from the guest to the host
    >
    > B. When memory should be moved from the host to the guest
    >
    > C. When memory shouldn't move (ie. when both guest and host experience
    > similar pressure)

    That's what the stats are for, to give QEMU an indication of the size of
    the crisis. If that's not sufficient, the stats are not doing their
    job.

    Even the "need_mem" flag is a reluctant crutch: QEMU should be able to
    see that the target should be increased from the stats alone.

    > Note that there's no "central authority" that has information about
    > all guests to decide how do that. Each qemu instance has to decide it
    > itself, based on the information it has about the host and about its guest.

    True today, but higher layers may eventually exist which can do this, or
    may exist on other platforms.

    Cheers,
    Rusty.
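
A minimal sketch of how the A/B/C decision discussed above could map onto a balloon target change on the host side. Everything here is an assumption made for illustration: the names `balloon_move`, `decide_move`, `next_target`, the `slack` threshold and the fixed `step` are hypothetical and appear in neither the posted driver nor QEMU. The only idea taken from the thread is that raising the target moves memory from guest to host, and lowering it moves memory back.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; not part of the posted driver or QEMU. */
enum balloon_move {
    MOVE_NONE,          /* case C: similar pressure on both sides */
    MOVE_GUEST_TO_HOST, /* case A: inflate, i.e. raise the balloon target */
    MOVE_HOST_TO_GUEST  /* case B: deflate, i.e. lower the balloon target */
};

/*
 * Compare the number of pressure events seen on each side during one
 * sampling window. Differences of at most 'slack' events count as
 * "similar pressure" (assumed policy, not from the thread).
 */
static enum balloon_move decide_move(unsigned int host_events,
                                     unsigned int guest_events,
                                     unsigned int slack)
{
    if (host_events > guest_events && host_events - guest_events > slack)
        return MOVE_GUEST_TO_HOST;
    if (guest_events > host_events && guest_events - host_events > slack)
        return MOVE_HOST_TO_GUEST;
    return MOVE_NONE;
}

/*
 * Target (in pages) the host would then send with a set-balloon command.
 * The target is signed, matching the s64 target in the posted driver,
 * so it may go below zero.
 */
static int64_t next_target(int64_t target, enum balloon_move move,
                           int64_t step)
{
    assert(step >= 0);
    switch (move) {
    case MOVE_GUEST_TO_HOST:
        return target + step;
    case MOVE_HOST_TO_GUEST:
        return target - step;
    default:
        return target;
    }
}
```

A caller would sample both event counters once per window, call `decide_move()`, and send a new target only when the result is not `MOVE_NONE`, which avoids one inflate/deflate per event.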




  • 34.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 21:58
    On Tue, 04 Feb 2014 11:53:57 +1030
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > Luiz Capitulino <lcapitulino@redhat.com> writes:
    > > On Mon, 03 Feb 2014 13:37:17 +1030
    > >> I changed the STATS_REPLY to STATS, and included a "want more mem"
    > >> flag. The implication is that the host compare stats across different
    > >> guests.
    > >
    > > When would the host do that? Can you elaborate a bit how this would
    > > be used?
    > >
    > > I feel that what you're proposing is not far away from automatic
    > > ballooning. Basically, my current idea for automatic ballooning more or
    > > less is:
    >
    > Yes, this is to enable automatic ballooning, as per your lkml posting.
    > Here's how it maps onto your case:
    >
    > > 1. QEMU registers for vmpressure events in the host
    > > (see Documentation/cgroups/memory.txt "Memory Pressure" section)
    > >
    > > 2. The virtio-balloon driver in the guest registers for
    > > in-kernel memory pressure notification (not upstream yet)
    > >
    > > 3. When the host is under pressure, QEMU is notified and it asks the
    > > guest to inflate its balloon by some amount
    >
    > This would mean the host changing the balloon target, ie
    > VIRTIO_BALLOON_HCMD_SET_BALLOON.

    OK.

    > > 4. When the guest is under pressure, QEMU is notified by the
    > > virtio-balloon driver and QEMU asks the guest to deflate by
    > > some value
    >
    > This would mean guest sending stats with the "need_mem" flag to true,
    > and the host changing the balloon target.
    >
    > > Now, doing one inflate/deflate per event is not very good. I'm trying
    > > to find a way where we use the amount of events to determine:
    > >
    > > A. When memory should be moved from the guest to the host
    > >
    > > B. When memory should be moved from the host to the guest
    > >
    > > C. When memory shouldn't move (ie. when both guest and host experience
    > > similar pressure)
    >
    > That's what the stats are for, to give QEMU an indication of the size of
    > the crisis. If that's not sufficient, the stats are not doing their
    > job.

    I'm not against having the stats, but I'm not sure we can use them to do what
    you propose, mainly because by the time QEMU receives the stats and decides
    what to do, the picture in the guest may already be different.

    Besides, I agree with mst. If we're going to use the stats to refuse an
    inflate, for example, I think the guest is in a better position to do that.

    > Even the "need_mem" flag is a reluctant crutch: QEMU should be able to
    > see that the target should be increased from the stats alone.
    >
    > > Note that there's no "central authority" that has information about
    > > all guests to decide how do that. Each qemu instance has to decide it
    > > itself, based on the information it has about the host and about its guest.
    >
    > True today, but higher layers may eventually exist which can do this, or
    > may exist on other platforms.
    >
    > Cheers,
    > Rusty.
    >




  • 36.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 13:30
    On Sun, Feb 02, 2014 at 06:21:14PM +0200, Michael S. Tsirkin wrote:
    > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > > "Michael S. Tsirkin" <mst@redhat.com> writes:

    [...]

    > > > - what's the status of page returned from balloon?
    > > > is it zeroed or can it have old data in there?
    > > > I think in practice Linux will sometimes map in a zero page,
    > > > so guest can save cycles and avoid zeroing it out.
    > > > I think we should tell this to guest when returning
    > > > pages.
    > >
    > > QEMU may not know, since the kernel may not tell it.
    >
    > Depends on what QEMU does.
    > I think kernel always gives us zero pages when we allocate
    > memory, they must be initialized otherwise it's an information leak.
    >
    >
    > > We should assume
    > > nothing, and let the guest zero if it needs to. Seems like a premature
    > > optimization.
    >
    > Possibly.

    I think that at this stage the driver should not make any assumptions about the
    contents of pages returned from the balloon. However, I think there should be an
    option to clear (zero) pages put into the balloon. The Xen balloon driver has
    such a build-time option, but I think it should be a runtime option that is
    on by default. Anyone who would like to save some cycles
    could turn this feature off.

    > > > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    > > > Frank Swiderski from google that inflate/deflate balloon
    > > > whenever guest wants (look for "Add a page cache-backed balloon
    > > > device driver").
    > > >
    > > > this is useful but - we need to distinguish pages
    > > > like this from regular inflate.
    > > > it's not just counter and host needs a way to know
    > > > that its target is reached
    > >
    > > The driver needs to explicitly ask for pages in that region.
    >
    > OK so we'll have an extra flag for that?
    >
    >
    > > > - do we even want to allow guest not telling host when it wants
    > > > to reuse the page?
    > > > if yes, I think this should be per-page somehow: when balloon
    > > > is inflated guest should tell host whether it
    > > > expects to use this page.
    > >
    > > I decided against it. Making that optional got us into a mess, so now
    > > it's compulsory. That also fits better with the idea of a negative
    > > balloon.
    > >
    > > > So I think we should accommodate these uses, and so we want the following flags:
    > > >
    > > > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    > > > flag that specifies pages do not count against target,
    > > > can be taken out of balloon.
    > > > EXTRA_MEM suggests there's an upper limit on balloon size
    > > > but IMHO that's just extra work for host: host does not care
    > > > I think, give it as much as you want.
    > > > set by guest, used by host
    > >
    > > I think that Daniel really does want more memory than the guest starts
    > > with. And I think he still wants to use the balloon to control it.
    > > Daniel?

    Yep, I posted my comments to that stuff earlier.

    > > > - TELL_HOST flag that specifies guest will tell host before using pages
    > > > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    > > > at the moment, listed here for completeness)
    > > > set by guest, used by host
    > >
    > > Dislike.
    > >
    > > > - ZEROED
    > > > flag that specifies that page returned to guest
    > > > is zeroed
    > > > set by host, used by guest
    > >
    > > I think that's silly. Under Linux the guest doesn't need to know it's
    > > zeroed or not, it just frees the page.
    >
    > Yes but it's possible that linux will try to zero page right
    > after free. It won't be too hard to set a flag that it's
    > zeroed when we free it.
    >
    >
    > > > Each of the flags can be just a feature flag, and then
    > > > if we want a mix of them host can create multiple
    > > > balloon devices with different flags, and guest looks for best
    > > > balloon for its purposes.
    > > >
    > > > Alternatively flags can be set and reported per page.
    > > >
    > > >
    > > > A couple of other suggestions:
    > > >
    > > > - how to accommodate memory pressure in guest?
    > > > Let's add a field telling host how hard do we
    > > > want our memory back
    > >
    > > That's very hard to define across guests. Should we be using stats for
    > > that instead? In fact, should we allow gratuitous stats sending,
    > > instead of a simple NEED_MEM flag?

    I think that it should be as simple as possible. The guest just sets a new target and the host
    fulfills the request or not. The guest slows down its requests to the balloon if they cannot
    be fulfilled for some time. That is all. The guest has the best knowledge of how to calculate
    its memory needs. Remember that guests are not always Linux.
    The host should know how to prioritize requests among guests. However, I think
    that this is not directly related to balloon device/driver design.

    Daniel
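
One way to read "the guest slows down its requests if they cannot be fulfilled" is an exponential backoff on the guest side. The sketch below is only an illustration under that assumption: `struct req_backoff`, `backoff_update()` and the doubling policy are hypothetical and do not come from the thread or from any existing balloon driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical guest-side state; all delays are in milliseconds. */
struct req_backoff {
    unsigned int delay_ms; /* delay before the next request */
    unsigned int min_ms;   /* delay used while requests succeed */
    unsigned int max_ms;   /* upper bound on the delay */
};

/*
 * Update the delay after one request to the host.
 * A fulfilled request resets the delay to the minimum; an unfulfilled
 * one doubles it, capped at max_ms. Returns the new delay.
 */
static unsigned int backoff_update(struct req_backoff *b, bool fulfilled)
{
    assert(b->min_ms > 0 && b->min_ms <= b->max_ms);

    if (fulfilled || b->delay_ms < b->min_ms)
        b->delay_ms = b->min_ms;
    else if (b->delay_ms > b->max_ms / 2)
        b->delay_ms = b->max_ms; /* doubling would exceed the cap */
    else
        b->delay_ms *= 2;

    return b->delay_ms;
}
```

With `min_ms = 100` and `max_ms = 1000`, repeated refusals give delays of 200, 400, 800 and then 1000 ms, and the first fulfilled request drops the delay back to 100 ms.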



  • 39.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 13:58
    On Mon, Feb 03, 2014 at 02:29:36PM +0100, Daniel Kiper wrote:
    > On Sun, Feb 02, 2014 at 06:21:14PM +0200, Michael S. Tsirkin wrote:
    > > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
    >
    > [...]
    >
    > > > > - what's the status of page returned from balloon?
    > > > > is it zeroed or can it have old data in there?
    > > > > I think in practice Linux will sometimes map in a zero page,
    > > > > so guest can save cycles and avoid zeroing it out.
    > > > > I think we should tell this to guest when returning
    > > > > pages.
    > > >
    > > > QEMU may not know, since the kernel may not tell it.
    > >
    > > Depends on what QEMU does.
    > > I think kernel always gives us zero pages when we allocate
    > > memory, they must be initialized otherwise it's an information leak.
    > >
    > >
    > > > We should assume
    > > > nothing, and let the guest zero if it needs to. Seems like a premature
    > > > optimization.
    > >
    > > Possibly.
    >
    > I think that at this stage driver should not make any assumption on page
    > contents returned from balloon. However, I think that it should be an
    > option to clear (zero) pages put into balloon. Xen balloon driver has
    > such build time option however I think that it should be runtime option
    > on by default. If somebody would like to save some cycles then he/she
    > could turn this feature off.
    >
    > > > > - I am guessing EXTRA_MEM is for uses like the ones proposed by
    > > > > Frank Swiderski from google that inflate/deflate balloon
    > > > > whenever guest wants (look for "Add a page cache-backed balloon
    > > > > device driver").
    > > > >
    > > > > this is useful but - we need to distinguish pages
    > > > > like this from regular inflate.
    > > > > it's not just counter and host needs a way to know
    > > > > that its target is reached
    > > >
    > > > The driver needs to explicitly ask for pages in that region.
    > >
    > > OK so we'll have an extra flag for that?
    > >
    > >
    > > > > - do we even want to allow guest not telling host when it wants
    > > > > to reuse the page?
    > > > > if yes, I think this should be per-page somehow: when balloon
    > > > > is inflated guest should tell host whether it
    > > > > expects to use this page.
    > > >
    > > > I decided against it. Making that optional got us into a mess, so now
    > > > it's compulsory. That also fits better with the idea of a negative
    > > > balloon.
    > > >
    > > > > So I think we should accommodate these uses, and so we want the following flags:
    > > > >
    > > > > - WEAK_TARGET (that's the EXTRA_MEM but I think done in a better way)
    > > > > flag that specifies pages do not count against target,
    > > > > can be taken out of balloon.
    > > > > EXTRA_MEM suggests there's an upper limit on balloon size
    > > > > but IMHO that's just extra work for host: host does not care
    > > > > I think, give it as much as you want.
    > > > > set by guest, used by host
    > > >
    > > > I think that Daniel really does want more memory than the guest starts
    > > > with. And I think he still wants to use the balloon to control it.
    > > > Daniel?
    >
    > Yep, I posted my comments to that stuff earlier.
    >
    > > > > - TELL_HOST flag that specifies guest will tell host before using pages
    > > > > (that's VIRTIO_BALLOON_F_MUST_TELL_HOST
    > > > > at the moment, listed here for completeness)
    > > > > set by guest, used by host
    > > >
    > > > Dislike.
    > > >
    > > > > - ZEROED
    > > > > flag that specifies that page returned to guest
    > > > > is zeroed
    > > > > set by host, used by guest
    > > >
    > > > I think that's silly. Under Linux the guest doesn't need to know it's
    > > > zeroed or not, it just frees the page.
    > >
    > > Yes but it's possible that linux will try to zero page right
    > > after free. It won't be too hard to set a flag that it's
    > > zeroed when we free it.
    > >
    > >
    > > > > Each of the flags can be just a feature flag, and then
    > > > > if we want a mix of them host can create multiple
    > > > > balloon devices with different flags, and guest looks for best
    > > > > balloon for its purposes.
    > > > >
    > > > > Alternatively flags can be set and reported per page.
    > > > >
    > > > >
    > > > > A couple of other suggestions:
    > > > >
    > > > > - how to accommodate memory pressure in guest?
    > > > > Let's add a field telling host how hard do we
    > > > > want our memory back
    > > >
    > > > That's very hard to define across guests. Should we be using stats for
    > > > that instead? In fact, should we allow gratuitous stats sending,
    > > > instead of a simple NEED_MEM flag?
    >
    > I think that it should be simple as possible. Guest just set new target and host
    > fulfill request or not. Guest slow down requests from balloon if requests cannot
    > be fulfilled some time. That is all.

    Hmm that's exactly the reverse of what Rusty suggests.

    > Guest has best knowledge how to calculate
    > memory needs. You should remember that guests are not always Linux stuff.
    > Host should know how to prioritize requests among guests. However, I think
    > that it is not directly related to balloon device/driver design.
    >
    > Daniel



  • 40.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 01:55
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > On Mon, Feb 03, 2014 at 02:29:36PM +0100, Daniel Kiper wrote:
    >> On Sun, Feb 02, 2014 at 06:21:14PM +0200, Michael S. Tsirkin wrote:
    >> > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    >> > > That's very hard to define across guests. Should we be using stats for
    >> > > that instead? In fact, should we allow gratuitous stats sending,
    >> > > instead of a simple NEED_MEM flag?
    >>
    >> I think that it should be simple as possible. Guest just set new target and host
    >> fulfill request or not. Guest slow down requests from balloon if requests cannot
    >> be fulfilled some time. That is all.
    >
    > Hmm that's exactly the reverse of what Rusty suggests.

    Indeed. The current balloon is entirely host-led. There's no way for
    the guest to give feedback except by failing to meet the target.

    This proposal only modifies that so that the guest can send "need_mem"
    and associated stats. These stats are fairly generic, so should be
    OS-agnostic (and a guest can omit them if it can't tell).

    > Guest has best knowledge how to calculate
    > memory needs.

    True, but not actually useful.

    The guest has the most information, but it still can't reliably predict
    the effect of adding more memory. Because last time I read the
    literature, that problem was Very Hard. I would *love* a heuristic so
    that the guest can say "if you give me X MB, page faults will drop by
    50%", but think in practice everyone just increments a few MB at a time
    until the pain stops, right?

    But even if we had such a thing, the guest has NO IDEA how bad its
    problems are, because bad is relative. So it has to have a way of
    reporting its pain level to the host, who *can* balance things.

    I chose to use stats + "need_mem" flag reporting for that method. But
    if the host doesn't increase the target, the guest should not disobey.

    Cheers,
    Rusty.
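
A minimal sketch of the guest-led reporting described in this message: the guest raises a "need_mem" flag and attaches whatever stats it can, or none at all. The names `ex_update`, `ex_build_update`, `EX_GCMD_UPDATE`, and `EX_STATS_MAX` are hypothetical and exist only for illustration; the layout loosely follows the flag-plus-stats shape discussed in the thread, and the little-endian conversion a real driver would need on the wire is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_GCMD_UPDATE 4u	/* hypothetical command number */
#define EX_STATS_MAX   6u	/* hypothetical stand-in for VIRTIO_BALLOON_S_NR */

/* One tagged statistic; tag values are assumed, not taken from the spec. */
struct ex_stat {
	uint64_t tag;
	uint64_t val;
};

/* Guest-to-host message: a "need memory" hint plus optional stats. */
struct ex_update {
	uint64_t type;
	uint64_t need_more;
	struct ex_stat stats[EX_STATS_MAX];
};

/*
 * Fill in an update message. A guest that cannot produce stats passes
 * stats == NULL and the slots stay zeroed. Returns how many stats were
 * copied (at most EX_STATS_MAX).
 */
static size_t ex_build_update(struct ex_update *u, int need_mem,
			      const struct ex_stat *stats, size_t n)
{
	assert(u != NULL);
	memset(u, 0, sizeof(*u));
	u->type = EX_GCMD_UPDATE;
	u->need_more = need_mem ? 1 : 0;
	if (stats == NULL)
		return 0;
	if (n > EX_STATS_MAX)
		n = EX_STATS_MAX;
	memcpy(u->stats, stats, n * sizeof(*stats));
	return n;
}
```

The flag only expresses pain; whether the target moves is still up to the host, which is the point made above.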




  • 42.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 21:10
    On Tue, Feb 04, 2014 at 12:24:48PM +1030, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Mon, Feb 03, 2014 at 02:29:36PM +0100, Daniel Kiper wrote:
    > >> On Sun, Feb 02, 2014 at 06:21:14PM +0200, Michael S. Tsirkin wrote:
    > >> > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > >> > > That's very hard to define across guests. Should we be using stats for
    > >> > > that instead? In fact, should we allow gratuitous stats sending,
    > >> > > instead of a simple NEED_MEM flag?
    > >>
    > >> I think that it should be simple as possible. Guest just set new target and host
    > >> fulfill request or not. Guest slow down requests from balloon if requests cannot
    > >> be fulfilled some time. That is all.
    > >
    > > Hmm that's exactly the reverse of what Rusty suggests.
    >
    > Indeed. The current balloon is entirely host-led. There's no way for

    Maybe I am missing something, but I am curious why we would like to avoid
    a symmetric solution. I mean that both guest and host could control the
    target. Such a solution is used in the Xen balloon driver.

    > the guest to give feedback except by failing to meet the target.
    >
    > This proposal only modifies that so that the guest can send "need_mem"
    > and associated stats. These stats are fairly generic, so should be
    > OS-agnostic (and a guest can omit them if it can't tell).
    >
    > > Guest has best knowledge how to calculate
    > > memory needs.
    >
    > True, but not actually useful.
    >
    > The guest has the most information, but it still can't reliably predict
    > the effect of adding more memory. Because last time I read the
    > literature, that problem was Very Hard. I would *love* a heuristic so
    > that the guest can say "if you give me X MB, page faults will drop by
    > 50%", but think in practice everyone just increments a few MB at a time
    > until the pain stops, right?
    >
    > But even if we had such a thing, the guest has NO IDEA how bad its
    > problems are, because bad is relative. So it has to have a way of

    You mean in comparison to other guests?

    > reporting its pain level to the host, who *can* balance things.
    >
    > I chose to use stats + "need_mem" flag reporting for that method. But
    > if the host doesn't increase the target, the guest should not disobey.

    So if I understand correctly, we are going in a direction where only
    the host can control the target and the guest may only gently ask for
    more memory. Right? However, as far as I can see, we do not have a
    mechanism to directly reject guest requests for more memory. So how
    would the guest know that its needs will not be fulfilled at all, or
    will be fulfilled only partially? How long should the guest wait for a
    target change? If it does not get such information directly from the
    host, it has no chance to quickly limit the impact of memory pressure
    in some other way. So I think the host should be able to reject a
    guest's request directly, or to inform it that a given request will be
    fulfilled only partially. This way the guest will have a chance to take
    relevant action (if it uses such information).

    Daniel



  • 44.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 22:06
    On Tue, 4 Feb 2014 22:09:32 +0100
    Daniel Kiper <daniel.kiper@oracle.com> wrote:

    > > I chose to use stats + "need_mem" flag reporting for that method. But
    > > if the host doesn't increase the target, the guest should not disobey.
    >
    > So if I understand correctly we are going in such direction that only
    > host could control the target and guest may only gently ask for more
    > memory. Right? However, as I can see we do not have a mechanism to
    > directly reject guest requests for more memory.

    Honest question: do we need one? Or better, why would we want to reject
    a guest's request?

    With automatic ballooning for example, if the host is not able to supply
    a guest's need of memory, it would try to reclaim it from other guests.
    If this fails, the host goes through regular memory reclaim.

    Guests, on the other hand, need a way to refuse the host's request, at
    least temporarily, so that guests with enough free (or freeable cached)
    memory have a chance to return memory to the host.

    > So how guest would
    > know that its needs would not be fulfilled at all or they will be
    > fulfilled partially? How long guest should wait for target change?
    > If it does not have such information directly from a host then it
    > does not have a chance to quickly limit impact of memory pressure
    > in other way. So I think that the host should have a chance to reject
    > guests request directly or inform that a given request will be fulfilled
    > partially. This way guest will have a chance to make relevant actions
    > (if it use such information).
    >
    > Daniel
    >




  • 46.  Re: [virtio] New virtio balloon...

    Posted 02-05-2014 03:38
    Luiz Capitulino <lcapitulino@redhat.com> writes:
    > On Tue, 4 Feb 2014 22:09:32 +0100
    > Daniel Kiper <daniel.kiper@oracle.com> wrote:
    >
    >> > I chose to use stats + "need_mem" flag reporting for that method. But
    >> > if the host doesn't increase the target, the guest should not disobey.
    >>
    >> So if I understand correctly we are going in such direction that only
    >> host could control the target and guest may only gently ask for more
    >> memory. Right? However, as I can see we do not have a mechanism to
    >> directly reject guest requests for more memory.
    >
    > Honest question: do we need one? Or better, why would we want to reject
    > a guest's request?
    >
    > With automatic ballooning for example, if the host is not able to supply
    > a guest's need of memory, it would try to reclaim it from other guests.
    > If this fails, the host goes through regular memory reclaim.

    And if that fails? What if there really isn't memory to give? It might
    really be under stress.

    > Guests, on the other hand, need a way to refuse the host's request, at
    > least temporarily, so that guests with enough free (or freeable cached)
    > memory have a chance to return memory to the host.

    The guest doesn't know if the administrator has added a much more
    important guest which needs all the memory, or if the memory savings
    should be shared across multiple guests.

    So if the host wants it to meet the target slowly, it should lower the
    target slowly. And monitor the stats to see how it's progressing.

    Note that the committed_as variable (which is used by HyperV and xen's
    selfballoon apparently) *isn't* exposed through the stats, and it seems
    it should be. That seems to be the variable of choice?

    Cheers,
    Rusty.
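
As a rough illustration of "lower the target slowly, and monitor the stats", a host-side policy step could look like the sketch below. Everything here is assumed for the example: `ex_next_target`, the `max_step` bound, and the `guest_struggling` flag (standing in for whatever the host concludes from the reported stats) are hypothetical names, and pausing under pressure is just one possible policy, not something the thread settles on.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Move the balloon target from `current` toward `goal` by at most
 * `max_step` pages per round. If the guest's stats suggest it is
 * struggling (assumed to be decided elsewhere), hold the target for
 * this round instead of moving it.
 */
static uint64_t ex_next_target(uint64_t current, uint64_t goal,
			       uint64_t max_step, int guest_struggling)
{
	uint64_t gap;

	assert(max_step > 0);
	if (guest_struggling || current == goal)
		return current;

	if (current < goal) {
		gap = goal - current;
		return current + (gap < max_step ? gap : max_step);
	}
	gap = current - goal;
	return current - (gap < max_step ? gap : max_step);
}
```

The host would call this once per stats round and send the result as the new target, so the guest never sees a jump larger than `max_step`.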




  • 48.  Re: [virtio] New virtio balloon...

    Posted 02-05-2014 19:39
    On Wed, 05 Feb 2014 14:07:33 +1030
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > Luiz Capitulino <lcapitulino@redhat.com> writes:
    > > On Tue, 4 Feb 2014 22:09:32 +0100
    > > Daniel Kiper <daniel.kiper@oracle.com> wrote:
    > >
    > >> > I chose to use stats + "need_mem" flag reporting for that method. But
    > >> > if the host doesn't increase the target, the guest should not disobey.
    > >>
    > >> So if I understand correctly we are going in such direction that only
    > >> host could control the target and guest may only gently ask for more
    > >> memory. Right? However, as I can see we do not have a mechanism to
    > >> directly reject guest requests for more memory.
    > >
    > > Honest question: do we need one? Or better, why would we want to reject
    > > a guest's request?
    > >
    > > With automatic ballooning for example, if the host is not able to supply
    > > a guest's need of memory, it would try to reclaim it from other guests.
    > > If this fails, the host goes through regular memory reclaim.
    >
    > And if that fails? What if there really isn't memory to give? It might
    > really be under stress.

    In this case, would it be better for the guests to swap instead of
    having the host swap? If so, then yes I agree that maybe the guest
    should be able to say no.

    > > Guests, on the other hand, need a way to refuse the host's request, at
    > > least temporarily, so that guests with enough free (or freeable cached)
    > > memory have a chance to return memory to the host.
    >
    > The guest doesn't know if the administrator has added a much more
    > important guest which needs all the memory, or if the memory savings
    > should be shared across multiple guests.
    >
    > So if the host wants it to meet the target slowly, it should lower the
    > target slowly. And monitor the stats to see how it's progressing.
    >
    > Note that the committed_as variable (which is used by HyperV and xen's
    > selfballoon apparently) *isn't* exposed through the stats, and it seems
    > it should be. That seems to be the variable of choice?

    I honestly don't know whether it's a good idea for the host to take balloon
    decisions based on guest's stats. When I say host here I mean QEMU. IMO,
    the guest can do this faster and more accurately than QEMU.

    On another note, I'm getting a bit confused with this discussion. IMHO not
    all requirements are clear and I'm seeing you throwing code in... Maybe we
    should start by doing the spec instead? This might help us concentrate on
    providing the mechanisms only.




  • 50.  Re: [virtio] New virtio balloon...

    Posted 02-06-2014 01:01
    Luiz Capitulino <lcapitulino@redhat.com> writes:
    > On Wed, 05 Feb 2014 14:07:33 +1030
    > Rusty Russell <rusty@au1.ibm.com> wrote:
    >
    >> Luiz Capitulino <lcapitulino@redhat.com> writes:
    >> > On Tue, 4 Feb 2014 22:09:32 +0100
    >> > Daniel Kiper <daniel.kiper@oracle.com> wrote:
    >> >
    >> >> > I chose to use stats + "need_mem" flag reporting for that method. But
    >> >> > if the host doesn't increase the target, the guest should not disobey.
    >> >>
    >> >> So if I understand correctly we are going in such direction that only
    >> >> host could control the target and guest may only gently ask for more
    >> >> memory. Right? However, as I can see we do not have a mechanism to
    >> >> directly reject guest requests for more memory.
    >> >
    >> > Honest question: do we need one? Or better, why would we want to reject
    >> > a guest's request?
    >> >
    >> > With automatic ballooning for example, if the host is not able to supply
    >> > a guest's need of memory, it would try to reclaim it from other guests.
    >> > If this fails, the host goes through regular memory reclaim.
    >>
    >> And if that fails? What if there really isn't memory to give? It might
    >> really be under stress.
    >
    > In this case, would it be better for the guests to swap instead of
    > having the host to swap? If so, then yes I agree that maybe the guest
    > should be able to say no.

    s/guest/host/? I think we're agreeing.

    >> > Guests, on the other hand, need a way to refuse the host's request, at
    >> > least temporarily, so that guests with enough free (or freeable cached)
    >> > memory have a chance to return memory to the host.
    >>
    >> The guest doesn't know if the administrator has added a much more
    >> important guest which needs all the memory, or if the memory savings
    >> should be shared across multiple guests.
    >>
    >> So if the host wants it to meet the target slowly, it should lower the
    >> target slowly. And monitor the stats to see how it's progressing.
    >>
    >> Note that the committed_as variable (which is used by HyperV and xen's
    >> selfballoon apparently) *isn't* exposed through the stats, and it seems
    >> it should be. That seems to be the variable of choice?
    >
    > I honestly don't know whether it's a good idea for the host to take balloon
    > decisions based on guest's stats. When I say host here I mean QEMU. IMO,
    > the guest can do this faster and more accurately than QEMU.

    I don't know either. A call out to qemu is pretty fast, at least
    compared with swapping a page.

    > On another note, I'm getting a bit confused with this discussion. IMHO not
    > all requirements are clear and I'm seeing you throwing code in... Maybe we
    > should start by doing the spec instead? This might help us concentrate on
    > providing the mechanisms only.

    I thought we were closer to consensus than we are. But providing a
    strawman implementation highlighted our differences, so it's not a
    complete waste :)

    Cheers,
    Rusty.




  • 52.  Re: [virtio] New virtio balloon...

    Posted 02-05-2014 03:16
    Daniel Kiper <daniel.kiper@oracle.com> writes:
    > On Tue, Feb 04, 2014 at 12:24:48PM +1030, Rusty Russell wrote:
    >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    >> > On Mon, Feb 03, 2014 at 02:29:36PM +0100, Daniel Kiper wrote:
    >> >> On Sun, Feb 02, 2014 at 06:21:14PM +0200, Michael S. Tsirkin wrote:
    >> >> > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    >> >> > > That's very hard to define across guests. Should we be using stats for
    >> >> > > that instead? In fact, should we allow gratuitous stats sending,
    >> >> > > instead of a simple NEED_MEM flag?
    >> >>
    >> >> I think that it should be simple as possible. Guest just set new target and host
    >> >> fulfill request or not. Guest slow down requests from balloon if requests cannot
    >> >> be fulfilled some time. That is all.
    >> >
    >> > Hmm that's exactly the reverse of what Rusty suggests.
    >>
    >> Indeed. The current balloon is entirely host-led. There's no way for
    >
    > Maybe I am missing something but I am curious why we would like to avoid
    > symmetric solution. I mean that guest and host could control the target.
    > Such solution is used in Xen balloon driver.

    Interesting reading.

    Am I understanding this correctly? One part is simple (balloon.c),
    where the balloon_process tries to meet the target.

    The target comes from three places: guest sysfs, host xenstore, or
    xen-selfballoon.c. What does the host do if the guest is
    self-ballooning? It seems like they'll just fight over target values
    if they both try to act.

    >> But even if we had such a thing, the guest has NO IDEA how bad its
    >> problems are, because bad is relative. So it has to have a way of
    >
    > You mean in comparison to other guests?

    And the host itself.

    >> reporting its pain level to the host, who *can* balance things.
    >>
    >> I chose to use stats + "need_mem" flag reporting for that method. But
    >> if the host doesn't increase the target, the guest should not disobey.
    >
    > So if I understand correctly we are going in such direction that only
    > host could control the target and guest may only gently ask for more
    > memory. Right? However, as I can see we do not have a mechanism to
    > directly reject guest requests for more memory. So how guest would
    > know that its needs would not be fulfilled at all or they will be
    > fulfilled partially? How long guest should wait for target change?
    > If it does not have such information directly from a host then it
    > does not have a chance to quickly limit impact of memory pressure
    > in other way. So I think that the host should have a chance to reject
    > guests request directly or inform that a given request will be fulfilled
    > partially. This way guest will have a chance to make relevant actions
    > (if it use such information).

    Excellent point... hmmm, what if we put a writable 'le64 target' at the
    end of the (now increasingly badly named) VIRTIO_BALLOON_GCMD_STATS?

    The device MUST update this with the new target. This means that the
    driver will know as soon as the stats are digested (which should be
    pretty fast).

    Patch below against previous... renames to VIRTIO_BALLOON_GCMD_UPDATE.

    Cheers,
    Rusty.

    diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    index cbe552802f43..01ffecb0463d 100644
    --- a/drivers/virtio/virtio_balloon2.c
    +++ b/drivers/virtio/virtio_balloon2.c
    @@ -45,10 +45,11 @@ struct gcmd_exchange_pages {
     	__le64 to_balloon;
     };

    -struct gcmd_stats {
    -	__le64 type; /* VIRTIO_BALLOON_GCMD_STATS */
    +struct gcmd_update {
    +	__le64 type; /* VIRTIO_BALLOON_GCMD_UPDATE */
     	__le64 need_more;
     	struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    +	__le64 target;
     };

     struct hcmd_set_balloon {
    @@ -94,7 +95,7 @@ struct virtio_balloon {
     		struct gcmd_get_pages get_pages;
     		struct gcmd_give_pages give_pages;
     		struct gcmd_exchange_pages exchange_pages;
    -		struct gcmd_stats stats;
    +		struct gcmd_update update;
     	} gcmd;

     	union hcmd {
    @@ -207,17 +208,17 @@ static void take_from_balloon(struct virtio_balloon *vb, u64 num)
     	mutex_unlock(&vb->lock);
     }

    -static inline void set_stat(struct gcmd_stats_reply *stats, int idx,
    +static inline void set_stat(struct gcmd_update *update, int idx,
     			    u64 tag, u64 val)
     {
    -	BUG_ON(idx >= ARRAY_SIZE(stats->stats));
    -	stats->stats[idx].tag = cpu_to_le64(tag);
    -	stats->stats[idx].val = cpu_to_le64(val);
    +	BUG_ON(idx >= ARRAY_SIZE(update->stats));
    +	update->stats[idx].tag = cpu_to_le64(tag);
    +	update->stats[idx].val = cpu_to_le64(val);
     }

     #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)

    -static void get_stats(struct gcmd_stats_reply *stats)
    +static void get_stats(struct gcmd_update *update)
     {
     	unsigned long events[NR_VM_EVENT_ITEMS];
     	struct sysinfo i;
    @@ -226,18 +227,19 @@ static void get_stats(struct gcmd_stats_reply *stats)
     	all_vm_events(events);
     	si_meminfo(&i);

    -	stats->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_STATS_REPLY);
    -	set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_IN,
    +	update->type = cpu_to_le64(VIRTIO_BALLOON_GCMD_UPDATE);
    +	update->need_more = cpu_to_le64(0);
    +	set_stat(update, idx++, VIRTIO_BALLOON_S_SWAP_IN,
     		 pages_to_bytes(events[PSWPIN]));
    -	set_stat(stats, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
    +	set_stat(update, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
     		 pages_to_bytes(events[PSWPOUT]));
    -	set_stat(stats, idx++, VIRTIO_BALLOON_S_MAJFLT,
    +	set_stat(update, idx++, VIRTIO_BALLOON_S_MAJFLT,
     		 events[PGMAJFAULT]);
    -	set_stat(stats, idx++, VIRTIO_BALLOON_S_MINFLT,
    +	set_stat(update, idx++, VIRTIO_BALLOON_S_MINFLT,
     		 events[PGFAULT]);
    -	set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMFREE,
    +	set_stat(update, idx++, VIRTIO_BALLOON_S_MEMFREE,
     		 pages_to_bytes(i.freeram));
    -	set_stat(stats, idx++, VIRTIO_BALLOON_S_MEMTOT,
    +	set_stat(update, idx++, VIRTIO_BALLOON_S_MEMTOT,
     		 pages_to_bytes(i.totalram));
     }

    @@ -281,8 +283,9 @@ static bool process_hcmd(struct virtio_balloon *vb)
     		vb->target = le64_to_cpu(hcmd->set_balloon.target);
     		break;
     	case cpu_to_le64(VIRTIO_BALLOON_HCMD_GET_STATS):
    -		get_stats(&vb->gcmd.stats_reply);
    -		send_gcmd(vb, sizeof(vb->gcmd.stats_reply));
    +		get_stats(&vb->gcmd.update);
    +		send_gcmd(vb, sizeof(vb->gcmd.update));
    +		vb->target = le64_to_cpu(vb->gcmd.update.target);
     		break;
     	default:
     		dev_err_ratelimited(&vb->vdev->dev, "Unknown hcmd %llu\n",
    diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
    index 925d79ad5c90..dcb6494f138c 100644
    --- a/include/uapi/linux/virtio_balloon.h
    +++ b/include/uapi/linux/virtio_balloon.h
    @@ -69,8 +69,9 @@ struct virtio_balloon_statistic {
      * Stats, and optional request for memory.
      * __le64: 0 if we don't want target increased, 1 if we do.
      * Followed by 0 or more struct virtio_balloon_statistic structs.
    + * Ended by a writable __le64 with updated target.
      */
    -#define VIRTIO_BALLOON_GCMD_STATS ((__le64)3)
    +#define VIRTIO_BALLOON_GCMD_UPDATE ((__le64)3)

     /* Host->guest command queue. */





  • 54.  Re: [virtio] New virtio balloon...

    Posted 02-05-2014 13:18
    On Wed, Feb 05, 2014 at 01:45:38PM +1030, Rusty Russell wrote:
    > Daniel Kiper <daniel.kiper@oracle.com> writes:
    > > On Tue, Feb 04, 2014 at 12:24:48PM +1030, Rusty Russell wrote:
    > >> "Michael S. Tsirkin" <mst@redhat.com> writes:
    > >> > On Mon, Feb 03, 2014 at 02:29:36PM +0100, Daniel Kiper wrote:
    > >> >> On Sun, Feb 02, 2014 at 06:21:14PM +0200, Michael S. Tsirkin wrote:
    > >> >> > On Fri, Jan 31, 2014 at 04:01:39PM +1030, Rusty Russell wrote:
    > >> >> > > That's very hard to define across guests. Should we be using stats for
    > >> >> > > that instead? In fact, should we allow gratuitous stats sending,
    > >> >> > > instead of a simple NEED_MEM flag?
    > >> >>
    > >> >> I think that it should be simple as possible. Guest just set new target and host
    > >> >> fulfill request or not. Guest slow down requests from balloon if requests cannot
    > >> >> be fulfilled some time. That is all.
    > >> >
    > >> > Hmm that's exactly the reverse of what Rusty suggests.
    > >>
    > >> Indeed. The current balloon is entirely host-led. There's no way for
    > >
    > > Maybe I am missing something but I am curious why we would like to avoid
    > > symmetric solution. I mean that guest and host could control the target.
    > > Such solution is used in Xen balloon driver.
    >
    > Interesting reading.
    >
    > Am I understanding this correctly? One part is simple (balloon.c),
    > where the balloon_process tries to meet the target.

    Yep.

    > The target comes from three places: guest sysfs, host xenstore, or
    > xen-selfballoon.c. What does the host do if the guest is

    Yep.

    > self-ballooning? It seems like they'll just fight over target values
    > if they both try to act.

    Sadly, yep. I know about that issue. Do you mean that we would like to avoid
    such situation in new VIRTIO balloon driver? Any other reasons?

    [...]

    > >> I chose to use stats + "need_mem" flag reporting for that method. But
    > >> if the host doesn't increase the target, the guest should not disobey.
    > >
    > > So if I understand correctly we are going in such direction that only
    > > host could control the target and guest may only gently ask for more
    > > memory. Right? However, as I can see we do not have a mechanism to
    > > directly reject guest requests for more memory. So how guest would
    > > know that its needs would not be fulfilled at all or they will be
    > > fulfilled partially? How long guest should wait for target change?
    > > If it does not have such information directly from a host then it
    > > does not have a chance to quickly limit impact of memory pressure
    > > in other way. So I think that the host should have a chance to reject
    > > guests request directly or inform that a given request will be fulfilled
    > > partially. This way guest will have a chance to make relevant actions
    > > (if it use such information).
    >
    > Excellent point... hmmm, what if we put a writable 'le64 target' at the
    > end of the (now increasingly badly named) VIRTIO_BALLOON_GCMD_STATS?

    Good idea but...

    > The device MUST update this with the new target. This means that the
    > driver will know as soon as the stats are digested (which should be
    > pretty fast).

    Great...

    > Patch below against previous... renames to VIRTIO_BALLOON_GCMD_UPDATE.
    >
    > Cheers,
    > Rusty.
    >
    > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > index cbe552802f43..01ffecb0463d 100644
    > --- a/drivers/virtio/virtio_balloon2.c
    > +++ b/drivers/virtio/virtio_balloon2.c
    > @@ -45,10 +45,11 @@ struct gcmd_exchange_pages {
    > __le64 to_balloon;
    > };
    >
    > -struct gcmd_stats {
    > - __le64 type; /* VIRTIO_BALLOON_GCMD_STATS */
    > +struct gcmd_update {
    > + __le64 type; /* VIRTIO_BALLOON_GCMD_UPDATE */

    Hmmm... Update what? Maybe VIRTIO_BALLOON_GCMD_UPDATE_STATS
    or VIRTIO_BALLOON_GCMD_UPDATE_STATE? Both are not perfect
    but are better than simple update...

    > __le64 need_more;

    I do not think that it is needed if we have a target here. I hope that
    host is smart and can make simple comparison with current target.

    > struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    > + __le64 target;

    Nice but:
    - what to do if guest puts here target below current target;
      I think that a host should do what is asked for (if it is possible):
      just lower target. Otherwise host should reject target as usual,
    - 0 (zero) should be a special value; it should mean that guest
      asks for nothing or does not have any preference.
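One possible reading of these rules, written as a small device-side helper. This is only a sketch under stated assumptions: `resolve_target()` and the `host_max` cap are hypothetical and not part of the patch, "target" is treated as an opaque number exactly as in the list above, and partial fulfilment is modelled by clamping a request to what the host is willing to grant.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the target the device writes back.
 *
 *   requested == 0        -> guest has no preference, keep the current target
 *   requested <= current  -> guest asks for a lower target, honour it
 *   requested >  current  -> host policy decides: grant fully, grant
 *                            partially (clamped to host_max), or reject
 *                            by leaving the current target in place
 *
 * host_max is an assumed policy input: the largest target the host
 * is currently willing to grant this guest.
 */
static uint64_t resolve_target(uint64_t current, uint64_t requested,
			       uint64_t host_max)
{
	if (requested == 0)
		return current;
	if (requested <= current)
		return requested;
	if (host_max <= current)
		return current;		/* request rejected */
	return requested < host_max ? requested : host_max;
}
```

Because the reply always carries the resulting target, the guest can tell a rejection (target unchanged) from partial fulfilment (target moved, but less than requested) without waiting for a timeout.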

    Daniel



  • 56.  Re: [virtio] New virtio balloon...

    Posted 02-06-2014 01:42
    Daniel Kiper <daniel.kiper@oracle.com> writes:
    > On Wed, Feb 05, 2014 at 01:45:38PM +1030, Rusty Russell wrote:
    >> self-ballooning? It seems like they'll just fight over target values
    >> if they both try to act.
    >
    > Sadly, yep. I know about that issue. Do you mean that we would like to avoid
    > such situation in new VIRTIO balloon driver? Any other reasons?

    To step back a moment: I feel that you, Luiz and I are engaged in a
    search for enlightenment, and we are slowly making progress!

    The specification needs to cover how to implement both device and
    driver. Simple enough to implement correctly, complex enough to be
    useful.

    I see Luiz, with his self-ballooning patch (which was basically a "gimme
    more mem please!" from the guest) heading in the same direction that
    Xen's self-balloon ended up.

    Does this mean that the current (legacy) virtio balloon is completely
    backwards, and that experience shows that the guest must drive? Or does
    it really mean that there are cases for both, and that we need a balloon
    driver which does both?

    If both, it needs to be specified what side wins. A feature bit would
    be the classic method.
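A minimal sketch of how a feature bit could settle which side wins. The bit name `BALLOON_F_GUEST_DRIVES`, its position, and the helper functions are all hypothetical and do not appear in the patch or in any virtio header. Only the negotiation rule is standard virtio behaviour: the driver can accept no more than the device offers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit, invented for this example only. */
#define BALLOON_F_GUEST_DRIVES (1ULL << 0)

enum target_owner { HOST_DRIVES, GUEST_DRIVES };

/* Virtio negotiation: the driver accepts a subset of what the device offers. */
static uint64_t negotiate(uint64_t device_offers, uint64_t driver_wants)
{
	return device_offers & driver_wants;
}

/* Without the bit, fall back to the legacy host-led behaviour. */
static enum target_owner who_drives(uint64_t negotiated)
{
	return (negotiated & BALLOON_F_GUEST_DRIVES) ? GUEST_DRIVES : HOST_DRIVES;
}

/* Resolve a conflict between the two proposed targets. */
static uint64_t effective_target(uint64_t negotiated, uint64_t host_target,
				 uint64_t guest_target)
{
	return who_drives(negotiated) == GUEST_DRIVES ? guest_target : host_target;
}
```

Defaulting to host-led when the bit is absent keeps an old driver or device working unchanged, which is the usual reason to express this kind of choice as a feature bit.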

    If we really don't know, we should leave the balloon device out of the
    standard entirely for v1.0 rather than rushing it (except to reserve id
    13 for the future virtio balloon).

    >> -struct gcmd_stats {
    >> - __le64 type; /* VIRTIO_BALLOON_GCMD_STATS */
    >> +struct gcmd_update {
    >> + __le64 type; /* VIRTIO_BALLOON_GCMD_UPDATE */
    >
    > Hmmm... Update what? Maybe VIRTIO_BALLOON_GCMD_UPDATE_STATS
    > or VIRTIO_BALLOON_GCMD_UPDATE_STATE? Both are not perfect
    > but are better than simple update...

    Yes, it's vague :(

    >> __le64 need_more;
    >
    > I do not think that it is needed if we have a target here. I hope that
    > host is smart and can make simple comparison with current target.
    >
    >> struct virtio_balloon_statistic stats[VIRTIO_BALLOON_S_NR];
    >> + __le64 target;

    Ah, in my proposal this was written by host, not initialized by guest.

    > Nice but:
    > - what to do if guest puts here target below current target;
    > I think that a host should do what is asked for (if it is possible)
    > - just lower target; Otherwise host should reject target as usual,
    > - 0 (zero) should be a special value; it should mean that guest
    > asks for nothing or does not have any preference.

    We could use it as two way comms like this, in which case your proposal
    makes sense.

    Cheers,
    Rusty.




  • 58.  Re: [virtio] New virtio balloon...

    Posted 02-06-2014 13:13
    On Thu, Feb 06, 2014 at 12:12:21PM +1030, Rusty Russell wrote:
    > Daniel Kiper <daniel.kiper@oracle.com> writes:
    > > On Wed, Feb 05, 2014 at 01:45:38PM +1030, Rusty Russell wrote:
    > >> self-ballooning? It seems like they'll just fight over target values
    > >> if they both try to act.
    > >
    > > Sadly, yep. I know about that issue. Do you mean that we would like to avoid
    > > such situation in new VIRTIO balloon driver? Any other reasons?
    >
    > To step back a moment: I feel that you, Luiz and I are engaged in a
    > search for enlightenment, and we are slowly making progress!

    I have the same feeling.

    > The specification needs to cover how to implement both device and
    > driver. Simple enough to implement correctly, complex enough to be
    > useful.

    Yep.

    > I see Luiz, with his self-ballooning patch (which was basically a "gimme
    > more mem please!" from the guest) heading in the same direction that
    > Xen's self-balloon ended up.
    >
    > Does this mean that the current (legacy) virtio balloon is completely
    > backwards, and that experience shows that the guest must drive? Or does
    > it really mean that there are cases for both, and that we need a balloon
    > driver which does both?
    >
    > If both, it needs to be specified what side wins. A feature bit would
    > be the classic method.
    >
    > If we really don't know, we should leave the balloon device out of the
    > standard entirely for v1.0 rather than rushing it (except to reserve id
    > 13 for the future virtio balloon).

    Maybe we should not build any complexity into the balloon driver itself.
    Hence it should be just simple get/put pages from balloon device (with
    option to hotplug memory in the future). However, host and guest (maybe
    it should be configured at boot?) should provide knobs (via sysfs, ioctl,
    ...?) to drive balloon device/driver. This way anybody could build
    separate process/module with required logic and policies. There is one but...
    Such idea was once implemented in Xen by Dan Magenheimer (look for
    tools/xenballoon in Xen 4.2) but later it was dropped. I do not know why.
    So I am CC-ing Bob and Konrad who may know more about that.

    Daniel



  • 60.  Re: [virtio] New virtio balloon...

    Posted 02-06-2014 13:44
    On Thu, 06 Feb 2014 12:12:21 +1030
    Rusty Russell <rusty@au1.ibm.com> wrote:

    > Daniel Kiper <daniel.kiper@oracle.com> writes:
    > > On Wed, Feb 05, 2014 at 01:45:38PM +1030, Rusty Russell wrote:
    > >> self-ballooning? It seems like they'll just fight over target values
    > >> if they both try to act.
    > >
    > > Sadly, yep. I know about that issue. Do you mean that we would like to avoid
    > > such situation in new VIRTIO balloon driver? Any other reasons?
    >
    > To step back a moment: I feel that you, Luiz and I are engaged in a
    > search for enlightenment, and we are slowly making progress!
    >
    > The specification needs to cover how to implement both device and
    > driver. Simple enough to implement correctly, complex enough to be
    > useful.
    >
    > I see Luiz, with his self-ballooning patch (which was basically a "gimme
    > more mem please!" from the guest)

    Yes, and then the host tries to reclaim memory from all guests if it gets
    into pressure. Guests with idle memory and freeable caches should answer
    the call.

    But to be honest, my solution is still incomplete. The idea behind my project
    is that guest and host should balance each other some way, but my last RFC
    has an artificial parameter where a guest which is under memory pressure will
    ignore the host's call for memory for a fixed period of time. Another problem
    with my RFC is that it reclaims freeable caches way too slowly.

    > heading in the same direction that
    > Xen's self-balloon ended up.
    >
    > Does this mean that the current (legacy) virtio balloon is completely
    > backwards, and that experience shows that the guest must drive? Or does
    > it really mean that there are cases for both, and that we need a balloon
    > driver which does both?
    >
    > If both, it needs to be specified what side wins. A feature bit would
    > be the classic method.
    >
    > If we really don't know, we should leave the balloon device out of the
    > standard entirely for v1.0 rather than rushing it (except to reserve id
    > 13 for the future virtio balloon).

    I don't know which one should drive. My last RFC tries to match the current
    balloon design, but maybe having the host doing the inflate and the guest
    doing the deflate might be an interesting idea.

    I think we shouldn't rush. Actually, I'd very much appreciate it if we
    discussed automatic/self ballooning calmly on a different thread.

    On the other hand, what does it mean not having the balloon device for v1.0? Does
    it mean that we wouldn't have a balloon driver in Linux? I'm asking this
    because there are user-space solutions out there for automatic/self
    ballooning (like MoM) and we have support for ballooning in other virt stack
    components (like in libvirt). Those guys would break w/o a balloon device.



  • 63.  Re: [virtio] New virtio balloon...

    Posted 02-06-2014 14:16
    On Thu, Feb 06, 2014 at 08:44:22AM -0500, Luiz Capitulino wrote:
    > On Thu, 06 Feb 2014 12:12:21 +1030
    > Rusty Russell <rusty@au1.ibm.com> wrote:
    >
    > > Daniel Kiper <daniel.kiper@oracle.com> writes:
    > > > On Wed, Feb 05, 2014 at 01:45:38PM +1030, Rusty Russell wrote:
    > > >> self-ballooning? It seems like they'll just fight over target values
    > > >> if they both try to act.
    > > >
    > > > Sadly, yep. I know about that issue. Do you mean that we would like to avoid
    > > > such situation in new VIRTIO balloon driver? Any other reasons?
    > >
    > > To step back a moment: I feel that you, Luiz and I are engaged in a
    > > search for enlightenment, and we are slowly making progress!
    > >
    > > The specification needs to be cover how to implement both device and
    > > driver. Simple enough to implement correctly, complex enough to be
    > > useful.
    > >
    > > I see Luiz, with his self-ballooning patch (which was basically a "gimme
    > > more mem please!" from the guest)
    >
    > Yes, and then the host tries to reclaim memory from all guests if it gets
    > into pressure. Guests with idle memory and freeable caches should answer
    > the call.
    >
    > But to be honest, my solution is still incomplete. The idea behind my project
    > is that guest and host should balance each other some way, but my last RFC
    > has an artificial parameter where a guest which is under memory pressure will
    > ignore the host's call for memory for a fixed period of time. Another problem
    > with my RFC is that it reclaims freeable caches way too slowly.
    >
    > > heading in the same direction that
    > > Xen's self-balloon ended up.
    > >
    > > Does this mean that the current (legacy) virtio ballon is completely
    > > backwards, and that experience shows that the guest must drive? Or does
    > > it really mean that there are cases for both, and the we need a balloon
    > > driver which does both?
    > >
    > > If both, it needs to be specified what side wins. A feature bit would
    > > be the classic method.
    > >
    > > If we really don't know, we should leave the balloon device out of the
    > > standard entirely for v1.0 rather than rushing it (except to reserve id
    > > 13 for the future virtio balloon).
    >
    > I don't know which one should drive. My last RFC tries to match the current
    > balloon design, but maybe having the host doing the inflate and the guest
    > doing the deflate might be an interesting idea.
    >
    > I think we shouldn't rush. Actually, I'd very much appreciate if we discuss
    > automatic/self ballooning calmly on a different thread.
    >
    > On the hand, what does it mean not having the balloon device for v1.0? Does
    > it mean that we wouldn't have a balloon driver in Linux? I'm asking this
    > because there are user-space solutions out there for automatic/self
    > ballooning (like MoM) and we have support for ballooning in other virt stack
    > components (like in libvirt). Those guys would break w/o a balloon device.


    It merely means people will have to use the old balloon driver;
    in particular, the balloon would have to be a PCI device (not PCI Express).



  • 64.  Re: [virtio] New virtio balloon...

    Posted 02-11-2014 04:11
    "Michael S. Tsirkin" <mst@redhat.com> writes:
    > On Thu, Feb 06, 2014 at 08:44:22AM -0500, Luiz Capitulino wrote:
    >> I think we shouldn't rush. Actually, I'd very much appreciate if we discuss
    >> automatic/self ballooning calmly on a different thread.
    >>
    >> On the hand, what does it mean not having the balloon device for v1.0? Does
    >> it mean that we wouldn't have a balloon driver in Linux? I'm asking this
    >> because there are user-space solutions out there for automatic/self
    >> ballooning (like MoM) and we have support for ballooning in other virt stack
    >> components (like in libvirt). Those guys would break w/o a balloon device.

    Indeed. My feeling at the moment is for 1.0 to relegate the current
    balloon device to legacy and reserve #13 for a new balloon device. When we
    feel we have something solid, we can add it to the standard.

    Let's discuss it further at the meeting.

    Thanks,
    Rusty.




  • 67.  Re: [virtio] New virtio balloon...

    Posted 02-11-2014 05:24
    On Tue, Feb 11, 2014 at 02:40:40PM +1030, Rusty Russell wrote:
    > "Michael S. Tsirkin" <mst@redhat.com> writes:
    > > On Thu, Feb 06, 2014 at 08:44:22AM -0500, Luiz Capitulino wrote:
    > >> I think we shouldn't rush. Actually, I'd very much appreciate if we discuss
    > >> automatic/self ballooning calmly on a different thread.
    > >>
    > >> On the hand, what does it mean not having the balloon device for v1.0? Does
    > >> it mean that we wouldn't have a balloon driver in Linux? I'm asking this
    > >> because there are user-space solutions out there for automatic/self
    > >> ballooning (like MoM) and we have support for ballooning in other virt stack
    > >> components (like in libvirt). Those guys would break w/o a balloon device.
    >
    > Indeed. My feeling at the moment is for 1.0 to relegate the current
    > balloon device to legacy, reserve #13 for a new balloon device. When we
    > feel we have something solid, we can add it to the standard.
    >
    > Let's discuss it further at the meeting.
    >
    > Thanks,
    > Rusty.

    Basically this boils down to documenting what we have implemented/used
    but not adding new features, right?



  • 68.  Re: [virtio] New virtio balloon...

    Posted 02-03-2014 12:53
    On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > Hi,
    >
    > I tried to write a new balloon driver; it's completely untested
    > (as I need to write the device). The protocol is basically two vqs, one
    > for the guest to send commands, one for the host to send commands.
    >
    > Some interesting things come out:
    > 1) We do need to explicitly tell the host where the page is we want.
    > This is required for compaction, for example.
    >
    > 2) We need to be able to exceed the balloon target, especially for page
    > migration. Thus there's no mechanism for the device to refuse to
    > give us the pages.

    Admin should have a way to impose a mem limit on guest. However,
    he/she should be able to change it in any direction (up and down) and
    even increase it above limit established at guest boot (needed for
    memory hotplug). On the other hand guest should not be able to allocate
    more memory than was requested by admin at a given time.

    > 3) The device can offer multiple page sizes, but the driver can only
    > accept one. I'm not sure if this is useful, as guests are either
    > huge page backed or not, and returning sub-pages isn't useful.

    Hmmm... I suppose that even if guest is backed by huge pages then internally
    it uses standard page sizes (if not directed otherwise). So we have a
    problem here because I do not know what to do if guest backed by 1 GiB
    pages would like to inflate balloon with 4 KiB pages. Should we refuse
    that?

    > Linux demo code follows.
    >
    > Cheers,
    > Rusty.
    >
    > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
    > index 9076635697bb..1dd45691b618 100644
    > --- a/drivers/virtio/Makefile
    > +++ b/drivers/virtio/Makefile
    > @@ -1,4 +1,4 @@
    >  obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o
    >  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
    >  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
    > -obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
    > +obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o virtio_balloon2.o
    > diff --git a/drivers/virtio/virtio_balloon2.c b/drivers/virtio/virtio_balloon2.c
    > new file mode 100644
    > index 000000000000..93f13e7c561d
    > --- /dev/null
    > +++ b/drivers/virtio/virtio_balloon2.c
    > +static const struct address_space_operations virtio_balloon_aops;

    [...]

    > +#ifdef CONFIG_BALLOON_COMPACTION
    > +/*
    > + * virtballoon_migratepage - perform the balloon page migration on behalf of
    > + * a compation thread. (called under page lock)
            ^^^^^^^^^ -> compaction
    > + * @mapping: the page->mapping which will be assigned to the new migrated page.
    > + * @newpage: page that will replace the isolated page after migration finishes.
    > + * @page : the isolated (old) page that is about to be migrated to newpage.
    > + * @mode : compaction mode -- not used for balloon page migration.
    > + *
    > + * After a ballooned page gets isolated by compaction procedures, this is the
    > + * function that performs the page migration on behalf of a compaction thread
    > + * The page migration for virtio balloon is done in a simple swap fashion which
    > + * follows these two macro steps:
    > + * 1) insert newpage into vb->pages list and update the host about it;
    > + * 2) update the host about the old page removed from vb->pages list;
    > + *
    > + * This function preforms the balloon page migration task.
    > + * Called through balloon_mapping->a_ops->migratepage
    > + */
    > +static int virtballoon_migratepage(struct address_space *mapping,
    > +		struct page *newpage, struct page *page, enum migrate_mode mode)
    > +{
    > +	struct balloon_dev_info *vb_dev_info = balloon_page_device(page);
    > +	struct virtio_balloon *vb;
    > +	unsigned long flags;
    > +	int err;
    > +
    > +	BUG_ON(!vb_dev_info);
    > +
    > +	vb = vb_dev_info->balloon_device;
    > +
    > +	/*
    > +	 * In order to avoid lock contention while migrating pages concurrently
    > +	 * to leak_balloon() or fill_balloon() we just give up the balloon_lock
    > +	 * this turn, as it is easier to retry the page migration later.
    > +	 * This also prevents fill_balloon() getting stuck into a mutex
    > +	 * recursion in the case it ends up triggering memory compaction
    > +	 * while it is attempting to inflate the ballon.
    > +	 */
    > +	if (!mutex_trylock(&vb->lock))
    > +		return -EAGAIN;
    > +
    > +	/* Try to get the page out of the balloon. */
    > +	vb->gcmd.get_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GET_PAGES);
    > +	vb->gcmd.get_pages.pages[0] = page_to_pfn(page) << PAGE_SHIFT;
    > +	if (!send_gcmd(vb, offsetof(struct gcmd_get_pages, pages[1]))) {
    > +		err = -EIO;
    > +		goto unlock;
    > +	}
    > +
    > +	/* Now put newpage into balloon. */
    > +	vb->gcmd.give_pages.type = cpu_to_le64(VIRTIO_BALLOON_GCMD_GIVE_PAGES);
    > +	vb->gcmd.give_pages.pages[0] = page_to_pfn(newpage) << PAGE_SHIFT;
    > +	if (!send_gcmd(vb, offsetof(struct gcmd_give_pages, pages[1]))) {
    > +		/* We leak a page here, but only happens if balloon broken. */
    > +		err = -EIO;
    > +		goto unlock;
    > +	}
    > +
    > +	spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
    > +	balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
    > +	vb_dev_info->isolated_pages--;
    > +	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
    > +
    > +	/*
    > +	 * It's safe to delete page->lru here because this page is at
    > +	 * an isolated migration list, and this step is expected to happen here
    > +	 */
    > +	balloon_page_delete(page);
    > +	err = MIGRATEPAGE_BALLOON_SUCCESS;
    > +
    > +unlock:
    > +	mutex_unlock(&vb->lock);
    > +	return err;
    > +}
    > +
    > +/* define the balloon_mapping->a_ops callback to allow balloon page migration */
    > +static const struct address_space_operations virtio_balloon_aops = {
    > +	.migratepage = virtballoon_migratepage,
    > +};
    > +#endif /* CONFIG_BALLOON_COMPACTION */

    Do we really need this feature on guest?

    [...]

    > +/* This means the balloon can go negative (ie. add memory to system) */
    > +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    > +
    > +struct virtio_balloon_config_space {
    > +	/* Set by device: bits indicate what page sizes supported. */
    > +	__le64 pagesizes;
    > +	/* Set by driver: only a single bit is set! */
    > +	__le64 page_size;
    > +
    > +	/* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    > +	__le64 extra_mem_start;
    > +	__le64 extra_mem_end;

    This cannot be a part of config space. Guest should be able to hotplug
    memory many times. Hence it should be a part of reply from host. Additionally,
    we should remember that memory is hotplugged in chunks known as section size.
    They are usually quite big and architecture dependent (e.g. IIRC it is 128 MiB
    on x86_64). So maybe guest should tell host about its supported section size
    in config space. However, there should not be a requirement that target must
    be equal to multiple of section size in case of memory hotplug. It should be
    set as needed and balloon driver should reserve relevant memory region told
    by host (it should be rounded by host up to nearest multiple of section size)
    and later back relevant pages with PFNs up to a given target.

    Daniel
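
    The rounding described above can be sketched in C. This is an illustration only and not part of the posted driver: the helper names and the 128 MiB constant (the x86_64 figure recalled with an "IIRC" in the message) are assumptions, and the section size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 128 MiB, the x86_64 figure recalled above. */
#define EXAMPLE_SECTION_SIZE (UINT64_C(128) << 20)

/*
 * Size of the region the host would reserve: the requested target rounded
 * up to the nearest multiple of section_size (must be a power of two).
 */
static uint64_t round_up_to_section(uint64_t bytes, uint64_t section_size)
{
	assert(section_size != 0 && (section_size & (section_size - 1)) == 0);
	return (bytes + section_size - 1) & ~(section_size - 1);
}

/*
 * Number of pages the driver would actually back inside that region;
 * the target itself does not have to be a multiple of the section size.
 */
static uint64_t pages_to_back(uint64_t target_bytes, uint64_t page_size)
{
	assert(page_size != 0);
	return target_bytes / page_size;
}
```

    Under these assumptions a 200 MiB target reserves a 256 MiB region, of which only the first 51200 pages of 4 KiB are backed.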


  • 69.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 02:21
    Daniel Kiper <daniel.kiper@oracle.com> writes:
    > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    >> Hi,
    >>
    >> I tried to write a new balloon driver; it's completely untested
    >> (as I need to write the device). The protocol is basically two vqs, one
    >> for the guest to send commands, one for the host to send commands.
    >>
    >> Some interesting things come out:
    >> 1) We do need to explicitly tell the host where the page is we want.
    >> This is required for compaction, for example.
    >>
    >> 2) We need to be able to exceed the balloon target, especially for page
    >> migration. Thus there's no mechanism for the device to refuse to
    >> give us the pages.
    >
    > Admin should have a way to impose a mem limit on guest. However,
    > he/she should be able to change it in any direction (up and down) and
    > even increase it above limit established at guest boot (needed for
    > memory hotplug). On the other hand guest should not be able to allocate
    > more memory then it was requested by admin in a given time.

    Well, now we have VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES, one problem with a
    strict limit is gone. We still have the problem of a race of the host
    lowering the target while the guest makes a request for more pages, but
    perhaps we just allow a single such request?

    >> 3) The device can offer multiple page sizes, but the driver can only
    >> accept one. I'm not sure if this is useful, as guests are either
    >> huge page backed or not, and returning sub-pages isn't useful.
    >
    > Hmmm... I suppose that even if guest is backed by huge pages then internaly
    > it uses standard page sizes (if not directed otherwise). So we have a
    > problem here because I do not know what to do if guest backed by 1 GiB
    > pages would like to inflate balloon with 4 KiB pages. Should we refuse
    > that?

    Two choices: offer 1G pages to the guest. If it can't handle that, it's
    pretty useless anyway (and will fail initialization). Otherwise, offer
    both 1G and 4k pages, and it might accept 4k pages (you'd do this if you
    have the ability to split 1G pages into 4k pages, I guess).

    >> +/* define the balloon_mapping->a_ops callback to allow balloon page migration */
    >> +static const struct address_space_operations virtio_balloon_aops = {
    >> +	.migratepage = virtballoon_migratepage,
    >> +};
    >> +#endif /* CONFIG_BALLOON_COMPACTION */
    >
    > Do we really need this feature on guest?

    Well, it's really a Linux-specific thing, but yes, if you can't migrate
    pages then page compaction really suffers. Rafael Aquini
    <aquini@redhat.com> added this.

    >> +/* This means the balloon can go negative (ie. add memory to system) */
    >> +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    >> +
    >> +struct virtio_balloon_config_space {
    >> +	/* Set by device: bits indicate what page sizes supported. */
    >> +	__le64 pagesizes;
    >> +	/* Set by driver: only a single bit is set! */
    >> +	__le64 page_size;
    >> +
    >> +	/* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    >> +	__le64 extra_mem_start;
    >> +	__le64 extra_mem_end;
    >
    > This cannot be a part of config space. Guest should be able to hotplug
    > memory many times. Hence it should be a part of reply from host.

    This was to specify the upper limits of where the extra mem is. It was
    intended to represent one or more section sizes.

    > Additionally,
    > we should remember that memory is hotplugged in chunks known as section size.
    > They are usually quite big and architecture depended (e.g. IIRC it is 128 MiB
    > on x86_64). So maybe guest should tell host about its supported section size
    > in config space. However, there should not be a requirement that target must
    > be equal to multiple of section size in case of memory hotplug. It should be
    > set as needed and balloon driver should reserve relevant memory region told
    > by host (it should be rounded by host up to nearest multiple of section size)
    > and later back relevant pages with PFNs up to a given target.

    We could either have a special request (VIRTIO_BALLOON_GCMD_NEW_MEM)
    where the guest specifies where it wants another chunk of memory. Then
    after that, it can ask for those pages' PFNs in
    VIRTIO_BALLOON_GCMD_GET_PAGES.

    Or we could simply allow a guest to request (if the
    VIRTIO_BALLOON_F_EXTRA_MEM feature is negotiated) any PFN it wants, and
    let it handle its sections itself.

    The latter is simpler, but is it sufficient?

    Cheers,
    Rusty.
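
    The page-size negotiation discussed in this message (the device sets a `pagesizes` mask, the driver writes back a `page_size` with a single bit set) might look roughly like the sketch below. The bit encoding is an assumption made purely for illustration (bit n standing for a page size of 2^n bytes; the thread does not define it), as are the helper name and the policy of preferring the largest common size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding, not defined in the thread: bit n means 2^n-byte pages. */
#define PAGESIZE_4K (UINT64_C(1) << 12)
#define PAGESIZE_1G (UINT64_C(1) << 30)

/*
 * Pick the largest page size that the device offers and the driver can
 * handle. Returns a value with exactly one bit set, or 0 when there is no
 * common size, in which case the driver would fail initialization.
 */
static uint64_t choose_page_size(uint64_t device_pagesizes,
				 uint64_t driver_pagesizes)
{
	uint64_t common = device_pagesizes & driver_pagesizes;
	uint64_t choice = 0;

	/* Strip the lowest set bit until only the highest one is left. */
	while (common) {
		choice = common;
		common &= common - 1;
	}

	/* The driver may only ever write back a single bit. */
	assert(choice == 0 || (choice & (choice - 1)) == 0);
	return choice;
}
```

    A device offering both 1G and 4k to a driver that only handles 4k ends up with 4k; a device offering only 1G to the same driver yields 0, matching the "will fail initialization" case.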


  • 70.  Re: [virtio] New virtio balloon...

    Posted 02-04-2014 22:10
    On Tue, Feb 04, 2014 at 12:48:27PM +1030, Rusty Russell wrote:
    > Daniel Kiper <daniel.kiper@oracle.com> writes:
    > > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > >> Hi,
    > >>
    > >> I tried to write a new balloon driver; it's completely untested
    > >> (as I need to write the device). The protocol is basically two vqs, one
    > >> for the guest to send commands, one for the host to send commands.
    > >>
    > >> Some interesting things come out:
    > >> 1) We do need to explicitly tell the host where the page is we want.
    > >> This is required for compaction, for example.
    > >>
    > >> 2) We need to be able to exceed the balloon target, especially for page
    > >> migration. Thus there's no mechanism for the device to refuse to
    > >> give us the pages.
    > >
    > > Admin should have a way to impose a mem limit on guest. However,
    > > he/she should be able to change it in any direction (up and down) and
    > > even increase it above limit established at guest boot (needed for
    > > memory hotplug). On the other hand guest should not be able to allocate
    > > more memory then it was requested by admin in a given time.
    >
    > Well, now we have VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES, one problem with a
    > strict limit is gone. We still have the problem of a race of the host
    > lowering the target while the guest makes a request for more pages, but
    > perhaps we just allow a single such request?

    I do not see a lot of problem here if balloon is host-led. Host just
    does his job. Guest just gently asks but host is not forced to fulfill
    requests. However, we should consider direct rejects which I mentioned
    in earlier emails.

    > >> 3) The device can offer multiple page sizes, but the driver can only
    > >> accept one. I'm not sure if this is useful, as guests are either
    > >> huge page backed or not, and returning sub-pages isn't useful.
    > >
    > > Hmmm... I suppose that even if guest is backed by huge pages then internaly
    > > it uses standard page sizes (if not directed otherwise). So we have a
    > > problem here because I do not know what to do if guest backed by 1 GiB
    > > pages would like to inflate balloon with 4 KiB pages. Should we refuse
    > > that?
    >
    > Two choices: offer 1G pages to the guest. If it can't handle that, it's
    > pretty useless anyway (and will fail initialization). Otherwise, offer
    > both 1G and 4k pages, and it might accept 4k pages (you'd do this if you
    > have the ability to split 1G pages into 4k pages, I guess).

    Both make sense. However, the latter, if it is possible, could cause a performance hit.
    Additionally, it looks that in Linux case hugepages are created at boot time
    and could not be split into smaller chunks. Am I missing something?

    > >> +/* define the balloon_mapping->a_ops callback to allow balloon page migration */
    > >> +static const struct address_space_operations virtio_balloon_aops = {
    > >> +	.migratepage = virtballoon_migratepage,
    > >> +};
    > >> +#endif /* CONFIG_BALLOON_COMPACTION */
    > >
    > > Do we really need this feature on guest?
    >
    > Well, it's really a Linux-specific thing, but yes, if you can't migrate
    > pages then page compation really suffers. Rafael Aquini
    > <aquini@redhat.com> added this.

    What do we need high-order pages in guests for?

    > >> +/* This means the balloon can go negative (ie. add memory to system) */
    > >> +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    > >> +
    > >> +struct virtio_balloon_config_space {
    > >> +	/* Set by device: bits indicate what page sizes supported. */
    > >> +	__le64 pagesizes;
    > >> +	/* Set by driver: only a single bit is set! */
    > >> +	__le64 page_size;
    > >> +
    > >> +	/* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    > >> +	__le64 extra_mem_start;
    > >> +	__le64 extra_mem_end;
    > >
    > > This cannot be a part of config space. Guest should be able to hotplug
    > > memory many times. Hence it should be a part of reply from host.
    >
    > This was to specify the upper limits of where the extra mem is. It was

    If yes then it should be highest available address in a given guest architecture.

    > intended to represent one or more section sizes.

    Section size is needed for a host only when we assume that the host
    should establish hotplugged memory placement. Otherwise the host does
    not need it.

    > > Additionally,
    > > we should remember that memory is hotplugged in chunks known as section size.
    > > They are usually quite big and architecture depended (e.g. IIRC it is 128 MiB
    > > on x86_64). So maybe guest should tell host about its supported section size
    > > in config space. However, there should not be a requirement that target must
    > > be equal to multiple of section size in case of memory hotplug. It should be
    > > set as needed and balloon driver should reserve relevant memory region told
    > > by host (it should be rounded by host up to nearest multiple of section size)
    > > and later back relevant pages with PFNs up to a given target.
    >
    > We could either have a special request (VIRTIO_BALLOON_GCMD_NEW_MEM)
    > where the guest specifies where it wants another chunk of memory. Then
    > after that, it can ask for those pages' PFNs in
    > VIRTIO_BALLOON_GCMD_GET_PAGES.
    >
    > Or we could simply allow a guest to request (if the
    > VIRTIO_BALLOON_F_EXTRA_MEM feature is negotiated) any PFN it wants, and
    > let it handle its sections itself.
    >
    > The latter is simpler, but is it sufficient?

    Right, but we have some issues with the latter solution in Xen. Currently I think
    that the host should establish region for hotplugged memory because it
    constructed memory map at boot time. However, I am not sure that it has always
    knowledge about e.g. all IO regions and similar stuff. On the other hand guest
    after boot may not have access to boot memory map. Hmmm... I still think that
    the host should establish hotplugged memory placement. Am I wrong?

    Daniel


  • 71.  Re: [virtio] New virtio balloon...

    Posted 02-05-2014 04:07
    Daniel Kiper <daniel.kiper@oracle.com> writes:
    > On Tue, Feb 04, 2014 at 12:48:27PM +1030, Rusty Russell wrote:
    >> Daniel Kiper <daniel.kiper@oracle.com> writes:
    >> > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    >> >> Hi,
    >> >>
    >> >> I tried to write a new balloon driver; it's completely untested
    >> >> (as I need to write the device). The protocol is basically two vqs, one
    >> >> for the guest to send commands, one for the host to send commands.
    >> >>
    >> >> Some interesting things come out:
    >> >> 1) We do need to explicitly tell the host where the page is we want.
    >> >> This is required for compaction, for example.
    >> >>
    >> >> 2) We need to be able to exceed the balloon target, especially for page
    >> >> migration. Thus there's no mechanism for the device to refuse to
    >> >> give us the pages.
    >> >
    >> > Admin should have a way to impose a mem limit on guest. However,
    >> > he/she should be able to change it in any direction (up and down) and
    >> > even increase it above limit established at guest boot (needed for
    >> > memory hotplug). On the other hand guest should not be able to allocate
    >> > more memory then it was requested by admin in a given time.
    >>
    >> Well, now we have VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES, one problem with a
    >> strict limit is gone. We still have the problem of a race of the host
    >> lowering the target while the guest makes a request for more pages, but
    >> perhaps we just allow a single such request?
    >
    > I do not see a lot of problem here if balloon is host-led. Host just
    > does his job. Guest just gently asks but host is not forced to fulfill
    > requests. However, we should consider direct rejects which I mentioned
    > in earlier emails.

    It's fairly easy to implement rejection of VIRTIO_BALLOON_GCMD_GET_PAGES.
    We could just have the device write the pages it is giving back to the
    array. The virtio protocol tells us how many bytes the device has
    written.
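
    A minimal sketch of how a driver could interpret such a partial or rejected reply is shown below. It assumes, purely for illustration, that the device writes whole 64-bit entries from the start of the `pages[]` array and that the reported length counts only those bytes; the helper name `pages_granted` is hypothetical and not part of the demo code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GET_PAGES_MAX 256	/* size of pages[] in the demo struct */

/*
 * Given the number of bytes the device reports having written into the
 * pages[] array, return how many page addresses it actually granted.
 * A length of zero is a full rejection. A trailing partial entry is
 * ignored, and the result never exceeds what the driver asked for.
 */
static size_t pages_granted(size_t used_len, size_t requested)
{
	size_t n = used_len / sizeof(uint64_t);

	assert(requested <= GET_PAGES_MAX);
	if (n > requested)
		n = requested;
	return n;
}
```

    With this reading, a request for 10 pages answered with 24 written bytes means 3 pages were handed back and the other 7 were refused.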
>> >> 3) The device can offer multiple page sizes, but the driver can only >> >> accept one. I'm not sure if this is useful, as guests are either >> >> huge page backed or not, and returning sub-pages isn't useful. >> > >> > Hmmm... I suppose that even if guest is backed by huge pages then internaly >> > it uses standard page sizes (if not directed otherwise). So we have a >> > problem here because I do not know what to do if guest backed by 1 GiB >> > pages would like to inflate balloon with 4 KiB pages. Should we refuse >> > that? >> >> Two choices: offer 1G pages to the guest. If it can't handle that, it's >> pretty useless anyway (and will fail initialization). Otherwise, offer >> both 1G and 4k pages, and it might accept 4k pages (you'd do this if you >> have the ability to split 1G pages into 4k pages, I guess). > > Both make sens. However, later, if it is possible, could make a performance hit. > Additionally, it looks that in Linux case hugepages are created at boot time > and could not be split into smaller chunks. Am I missing something? Transparent huge pages will be assigned and split on demand, though I'm completely ignorant of how that works with KVM. >> >> +/* define the balloon_mapping->a_ops callback to allow balloon page migration */ >> >> +static const struct address_space_operations virtio_balloon_aops = { >> >> + .migratepage = virtballoon_migratepage, >> >> +}; >> >> +#endif /* CONFIG_BALLOON_COMPACTION */ >> > >> > Do we really need this feature on guest? >> >> Well, it's really a Linux-specific thing, but yes, if you can't migrate >> pages then page compation really suffers. Rafael Aquini >> <aquini@redhat.com> added this. > > For what we need high order pages in guests? Huge pages in the guest is as much of a win as it is in the host. Fortunately we don't have drivers requiring huge pages, but userspace will want them. >> >> +/* This means the balloon can go negative (ie. 
add memory to system) */ >> >> +#define VIRTIO_BALLOON_F_EXTRA_MEM 0 >> >> + >> >> +struct virtio_balloon_config_space { >> >> + /* Set by device: bits indicate what page sizes supported. */ >> >> + __le64 pagesizes; >> >> + /* Set by driver: only a single bit is set! */ >> >> + __le64 page_size; >> >> + >> >> + /* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */ >> >> + __le64 extra_mem_start; >> >> + __le64 extra_mem_end; >> > >> > This cannot be a part of config space. Guest should be able to hotplug >> > memory many times. Hence it should be a part of reply from host. >> >> This was to specify the upper limits of where the extra mem is. It was > > If yes then it should be highest available address in a given guest architecture. > >> intended to represent one or more section sizes. > > Section size is needed for a host only when we assume that the host > should establish hoplugged memory placement. Otherwise the host does > not need it. Indeed, but I thought we wanted the host to specify the region which could be used. >> > Additionally, >> > we should remember that memory is hotplugged in chunks known as section size. >> > They are usually quite big and architecture depended (e.g. IIRC it is 128 MiB >> > on x86_64). So maybe guest should tell host about its supported section size >> > in config space. However, there should not be a requirement that target must >> > be equal to multiple of section size in case of memory hotplug. It should be >> > set as needed and balloon driver should reserve relevant memory region told >> > by host (it should be rounded by host up to nearest multiple of section size) >> > and later back relevant pages with PFNs up to a given target. >> >> We could either have a special request (VIRTIO_BALLOON_GCMD_NEW_MEM) >> where the guest specifies where it wants another chunk of memory. Then >> after that, it can ask for those pages' PFNs in >> VIRTIO_BALLOON_GCMD_GET_PAGES. 
>> >> Or we could simply allow a guest to request (if the >> VIRTIO_BALLOON_F_EXTRA_MEM feature is negotiated) any PFN it wants, and >> let it handle its sections itself. >> >> The latter is simpler, but is it sufficient? > > Right, but we have some issues with later solution in Xen. Currently > I think that the host should establish region for hotplugged memory > because it constructed memory map at boot time. However, I am not sure > that it has always knowledge about e.g. all IO regions and similar stuff. > On the other hand guest after boot may not have access to boot memory map. > Hmmm... I still think that the host should establish hoplugged memory placement. > Am I wrong? I don't know. I tend to agree that it makes sense for the host to establish the hotplug memory region. I assume it would set this up front, and then the guest would request PFNs in that range. I'm not sure if we will know until implementations exist. This will not be established before then, so perhaps we should add this as an extra feature after v1.0? Cheers, Rusty.
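The reject idea from the post above (the device writes back only the pages it is actually giving, and the driver learns the count from the number of bytes written) might look roughly like this on the driver side. It is a sketch under assumptions, not the demo driver's code: `pages_granted` and `request_was_cut` are hypothetical helpers, and it assumes the device-writable part of the buffer is exactly the array of 64-bit page entries, so the used length reported by the virtqueue counts only those entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Matches pages[256] in the demo driver's struct gcmd_get_pages. */
#define GET_PAGES_MAX 256

/*
 * Hypothetical helper: number of page entries the device wrote back.
 * used_len is the byte count the virtqueue reports as written by the
 * device (assumed here to cover only the pages[] array). A trailing
 * partial entry is ignored, and the result is clamped to the array size.
 */
static size_t pages_granted(uint32_t used_len)
{
	size_t n = used_len / sizeof(uint64_t);

	return n > GET_PAGES_MAX ? GET_PAGES_MAX : n;
}

/* Returns 1 if the device granted fewer pages than the driver requested. */
static int request_was_cut(size_t requested, uint32_t used_len)
{
	assert(requested <= GET_PAGES_MAX);
	return pages_granted(used_len) < requested;
}
```

With this shape, a full reject is simply a used length of zero, and a partial grant needs no extra status field in the command.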


  • 72.  Re: [virtio] New virtio balloon...

    Posted 02-05-2014 15:09
    On Wed, Feb 05, 2014 at 02:26:43PM +1030, Rusty Russell wrote:
    > Daniel Kiper <daniel.kiper@oracle.com> writes:
    > > On Tue, Feb 04, 2014 at 12:48:27PM +1030, Rusty Russell wrote:
    > >> Daniel Kiper <daniel.kiper@oracle.com> writes:
    > >> > On Thu, Jan 30, 2014 at 07:34:30PM +1030, Rusty Russell wrote:
    > >> >> Hi,
    > >> >>
    > >> >> I tried to write a new balloon driver; it's completely untested
    > >> >> (as I need to write the device). The protocol is basically two vqs, one
    > >> >> for the guest to send commands, one for the host to send commands.
    > >> >>
    > >> >> Some interesting things come out:
    > >> >> 1) We do need to explicitly tell the host where the page is we want.
    > >> >>    This is required for compaction, for example.
    > >> >>
    > >> >> 2) We need to be able to exceed the balloon target, especially for page
    > >> >>    migration. Thus there's no mechanism for the device to refuse to
    > >> >>    give us the pages.
    > >> >
    > >> > Admin should have a way to impose a mem limit on guest. However,
    > >> > he/she should be able to change it in any direction (up and down) and
    > >> > even increase it above limit established at guest boot (needed for
    > >> > memory hotplug). On the other hand guest should not be able to allocate
    > >> > more memory than it was requested by admin in a given time.
    > >>
    > >> Well, now we have VIRTIO_BALLOON_GCMD_EXCHANGE_PAGES, one problem with a
    > >> strict limit is gone. We still have the problem of a race of the host
    > >> lowering the target while the guest makes a request for more pages, but
    > >> perhaps we just allow a single such request?
    > >
    > > I do not see a lot of problem here if balloon is host-led. Host just
    > > does his job. Guest just gently asks but host is not forced to fulfill
    > > requests. However, we should consider direct rejects which I mentioned
    > > in earlier emails.
    >
    > It's fairly easy to implement reject VIRTIO_BALLOON_GCMD_GET_PAGES. We
    > could just have the device write the pages it is giving back to the
    > array. The virtio protocol tells us how many bytes the device has
    > written.

    I thought more about the mechanism which we just established.

    > >> >> 3) The device can offer multiple page sizes, but the driver can only
    > >> >>    accept one. I'm not sure if this is useful, as guests are either
    > >> >>    huge page backed or not, and returning sub-pages isn't useful.
    > >> >
    > >> > Hmmm... I suppose that even if guest is backed by huge pages then internally
    > >> > it uses standard page sizes (if not directed otherwise). So we have a
    > >> > problem here because I do not know what to do if guest backed by 1 GiB
    > >> > pages would like to inflate balloon with 4 KiB pages. Should we refuse
    > >> > that?
    > >>
    > >> Two choices: offer 1G pages to the guest. If it can't handle that, it's
    > >> pretty useless anyway (and will fail initialization). Otherwise, offer
    > >> both 1G and 4k pages, and it might accept 4k pages (you'd do this if you
    > >> have the ability to split 1G pages into 4k pages, I guess).
    > >
    > > Both make sense. However, the latter, if it is possible, could make a performance hit.
    > > Additionally, it looks that in Linux case hugepages are created at boot time
    > > and could not be split into smaller chunks. Am I missing something?
    >
    > Transparent huge pages will be assigned and split on demand, though I'm
    > completely ignorant of how that works with KVM.

    I forgot about transparent huge pages.

    [...]

    > >> >> +/* This means the balloon can go negative (ie. add memory to system) */
    > >> >> +#define VIRTIO_BALLOON_F_EXTRA_MEM 0
    > >> >> +
    > >> >> +struct virtio_balloon_config_space {
    > >> >> + /* Set by device: bits indicate what page sizes supported. */
    > >> >> + __le64 pagesizes;
    > >> >> + /* Set by driver: only a single bit is set! */
    > >> >> + __le64 page_size;
    > >> >> +
    > >> >> + /* These set by device if VIRTIO_BALLOON_F_EXTRA_MEM. */
    > >> >> + __le64 extra_mem_start;
    > >> >> + __le64 extra_mem_end;
    > >> >
    > >> > This cannot be a part of config space. Guest should be able to hotplug
    > >> > memory many times. Hence it should be a part of reply from host.
    > >>
    > >> This was to specify the upper limits of where the extra mem is. It was
    > >
    > > If yes then it should be highest available address in a given guest architecture.
    > >
    > >> intended to represent one or more section sizes.
    > >
    > > Section size is needed for a host only when we assume that the host
    > > should establish hotplugged memory placement. Otherwise the host does
    > > not need it.
    >
    > Indeed, but I thought we wanted the host to specify the region which
    > could be used.
    >
    > >> > Additionally,
    > >> > we should remember that memory is hotplugged in chunks known as section size.
    > >> > They are usually quite big and architecture dependent (e.g. IIRC it is 128 MiB
    > >> > on x86_64). So maybe guest should tell host about its supported section size
    > >> > in config space. However, there should not be a requirement that target must
    > >> > be equal to multiple of section size in case of memory hotplug. It should be
    > >> > set as needed and balloon driver should reserve relevant memory region told
    > >> > by host (it should be rounded by host up to nearest multiple of section size)
    > >> > and later back relevant pages with PFNs up to a given target.
    > >>
    > >> We could either have a special request (VIRTIO_BALLOON_GCMD_NEW_MEM)
    > >> where the guest specifies where it wants another chunk of memory. Then
    > >> after that, it can ask for those pages' PFNs in
    > >> VIRTIO_BALLOON_GCMD_GET_PAGES.
    > >>
    > >> Or we could simply allow a guest to request (if the
    > >> VIRTIO_BALLOON_F_EXTRA_MEM feature is negotiated) any PFN it wants, and
    > >> let it handle its sections itself.
    > >>
    > >> The latter is simpler, but is it sufficient?
    > >
    > > Right, but we have some issues with the latter solution in Xen. Currently
    > > I think that the host should establish the region for hotplugged memory
    > > because it constructed the memory map at boot time. However, I am not sure
    > > that it always has knowledge about e.g. all IO regions and similar stuff.
    > > On the other hand, the guest after boot may not have access to the boot memory map.
    > > Hmmm... I still think that the host should establish hotplugged memory placement.
    > > Am I wrong?
    >
    > I don't know. I tend to agree that it makes sense for the host to
    > establish the hotplug memory region. I assume it would set this up
    > front, and then the guest would request PFNs in that range.

    It could be difficult. We could predict the memory hotplug region start
    address, but we are not able to predict how much memory will be hotplugged
    in the future.

    > I'm not sure if we will know until implementations exist. This will not
    > be established before then, so perhaps we should add this as an extra
    > feature after v1.0?

    You are right. I am going to improve memory hotplug for Xen later this year,
    so maybe after that we should return to the discussion of this feature. That way
    we will know more about how to do it, and then we could add a memory
    hotplug feature in the next VIRTIO release.

    Daniel
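The page-size negotiation quoted in this post (the device sets bits in `pagesizes`, the driver answers with exactly one bit in `page_size`, and a driver that cannot handle any offered size fails initialization) can be sketched as follows. The encoding is an assumption made for the example: bit n is taken to mean a page size of 2^n bytes, which the thread does not state, and `choose_page_size` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: bit n of `offered` stands for a page size of 2^n bytes,
 * so the bit value itself equals the page size in bytes
 * (4 KiB is 1 << 12, 1 GiB is 1 << 30).
 *
 * Returns the single bit the driver would write to page_size, or 0 if
 * the device did not offer the guest's page size, in which case the
 * driver would fail initialization.
 */
static uint64_t choose_page_size(uint64_t offered, uint64_t guest_page_size)
{
	/* The guest page size must be a single power of two. */
	assert(guest_page_size != 0 &&
	       (guest_page_size & (guest_page_size - 1)) == 0);

	return (offered & guest_page_size) ? guest_page_size : 0;
}
```

Under this encoding, a host that offers only 1 GiB pages to a 4 KiB guest yields 0, which corresponds to the "fail initialization" case, while a host offering both sizes lets the guest accept 4 KiB.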