From: Rusty Russell <rusty@rustcorp•com.au>
To: "Michael S. Tsirkin" <mst@redhat•com>
Cc: netdev@vger•kernel.org,
virtualization@lists•linux-foundation.org, kvm@vger•kernel.org,
linux-kernel@vger•kernel.org, mingo@elte•hu, linux-mm@kvack•org,
akpm@linux-foundation•org, hpa@zytor•com,
gregory.haskins@gmail•com, s.hetze@linux-ag•com,
Daniel Walker <dwalker@fifo99•com>,
Eric Dumazet <eric.dumazet@gmail•com>,
"Paul E. McKenney" <paulmck@linux•vnet.ibm.com>
Subject: Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
Date: Fri, 6 Nov 2009 15:29:17 +1030 [thread overview]
Message-ID: <200911061529.17500.rusty@rustcorp.com.au> (raw)
In-Reply-To: <20091104155724.GD32673@redhat.com>
On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.
Hi Michael,
Now everyone else has finally kicked all the tires and it seems to pass,
I've done a fairly complete review. Generally, it's really nice; just one
bug and a few minor suggestions for polishing.
> +/* Caller must have TX VQ lock */
> +static void tx_poll_stop(struct vhost_net *net)
> +{
> + if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
> + return;
likely? Really?
> + for (;;) {
> + head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in,
> + NULL, NULL);
Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size
you can handle. Otherwise, it's only limited by ring size, and a malicious
guest can overflow you here, and below:
> + /* Skip header. TODO: support TSO. */
> + s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
...
> +
> + use_mm(net->dev.mm);
> + mutex_lock(&vq->mutex);
> + vhost_no_notify(vq);
I prefer a name like "vhost_disable_notify()".
> + /* OK, now we need to know about added descriptors. */
> + if (head == vq->num && vhost_notify(vq))
> + /* They could have slipped one in as we were doing that:
> + * check again. */
> + continue;
> + /* Nothing new? Wait for eventfd to tell us they refilled. */
> + if (head == vq->num)
> + break;
> + /* We don't need to be notified again. */
> + vhost_no_notify(vq);
Similarly, vhost_enable_notify. This one is particularly misleading since
it doesn't actually notify anything!
In particular, this code would be neater as:
if (head == vq->num) {
if (vhost_enable_notify(vq)) {
/* Try again, they could have slipped one in. */
continue;
}
/* Nothing more to do. */
break;
}
vhost_disable_notify(vq);
Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
return true only when there's more pending. Saves a loop around here most
of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved
out of the loop entirely. (It could be under an if (unlikely(enabled)), not
sure if it's worth it).
> + len = err;
> + err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
That unsigned char * arg to memcpy_toiovec is annoying. A patch might be
nice, separate from this effort.
> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> + int r;
> + if (!n)
> + return -ENOMEM;
> + f->private_data = n;
> + n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> + n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
I have a personal dislike of calloc for structures. In userspace, it's
because valgrind can't spot uninitialized fields. These days a similar
argument applies in the kernel, because we have KMEMCHECK now. If someone
adds a field to the struct and forgets to initialize it, we can spot it.
> +static void vhost_net_enable_vq(struct vhost_net *n, int index)
> +{
> + struct socket *sock = n->vqs[index].private_data;
OK, I can't help but this that presenting the vqs as an array doesn't buy
us very much. Esp. if you change vhost_dev_init to take a NULL-terminated
varargs. I think readability would improve. It means passing a vq around
rather than an index.
Not completely sure it'll be a win tho.
> +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> +{
> + struct socket *sock, *oldsock = NULL;
...
> + sock = get_socket(fd);
> + if (IS_ERR(sock)) {
> + r = PTR_ERR(sock);
> + goto done;
> + }
> +
> + /* start polling new socket */
> + oldsock = vq->private_data;
...
> +done:
> + mutex_unlock(&n->dev.mutex);
> + if (oldsock) {
> + vhost_net_flush_vq(n, index);
> + fput(oldsock->file);
I dislike this style; I prefer multiple different goto points, one for when
oldsock is set, and one for when it's not.
That way, gcc warns us about uninitialized variables if we get it wrong.
> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> + struct socket *tx_sock = NULL;
> + struct socket *rx_sock = NULL;
> + long r;
This should be called "err", since that's what it is.
> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> + size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
> + sizeof(struct virtio_net_hdr) : 0;
> + int i;
> + mutex_lock(&n->dev.mutex);
> + n->dev.acked_features = features;
Why is this called "acked_features"? Not just "features"? I expected
to see code which exposed these back to userspace, and didn't.
> + case VHOST_GET_FEATURES:
> + features = VHOST_FEATURES;
> + return put_user(features, featurep);
> + case VHOST_ACK_FEATURES:
> + r = get_user(features, featurep);
> + /* No features for now */
> + if (r < 0)
> + return r;
> + if (features & ~VHOST_FEATURES)
> + return -EOPNOTSUPP;
> + vhost_net_set_features(n, features);
OK, from the userspace POV it's "get features" then "ack features". But
I think "VHOST_SET_FEATURES" is more consistent, despite this usage.
> + switch (ioctl) {
> + case VHOST_SET_VRING_NUM:
I haven't looked at your userspace implementation, but does a generic
VHOST_SET_VRING_STATE & VHOST_GET_VRING_STATE with a struct make more
sense? It'd be simpler here, but not sure if it'd be simpler to use?
(Not the fd-setting ioctls of course)
> + case VHOST_SET_VRING_LOG:
> + r = copy_from_user(&a, argp, sizeof a);
> + if (r < 0)
> + break;
> + if (a.padding) {
> + r = -EOPNOTSUPP;
> + break;
> + }
> + if (a.user_addr == VHOST_VRING_LOG_DISABLE) {
> + vq->log_used = false;
> + break;
> + }
> + if (a.user_addr & (sizeof *vq->used->ring - 1)) {
> + r = -EINVAL;
> + break;
> + }
> + vq->log_used = true;
> + vq->log_addr = a.user_addr;
> + break;
For future reference, this is *exactly* the kind of thing which would have
been nice as a followup patch. Easy to separate, easy to review, not critical
to the core.
> +/* TODO: This is really inefficient. We need something like get_user()
> + * (instruction directly accesses the data, with an exception table entry
> + * returning -EFAULT). See Documentation/x86/exception-tables.txt.
> + */
> +static int set_bit_to_user(int nr, void __user *addr)
> +{
I guess we won't be dealing with many contiguous pages, otherwise we could
get a cheap speedup making this set_bits_to_user(int nr, int num_bits...).
> +/* Each buffer in the virtqueues is actually a chain of descriptors. This
> + * function returns the next descriptor in the chain,
> + * or -1 if we're at the end. */
> +static unsigned next_desc(struct vring_desc *desc)
> +{
> + unsigned int next;
> +
> + /* If this descriptor says it doesn't chain, we're done. */
> + if (!(desc->flags & VRING_DESC_F_NEXT))
> + return -1;
Hmm, prefer s/-1/-1U/ in comment, here, and below. Clarifies a bit.
> +/* After we've used one of their buffers, we tell them about it. We'll then
> + * want to send them an interrupt, using vq->call. */
This comment has too much cut & paste:
... want to notify the guest, using the eventfd */
> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
Rename vhost_notify_eventfd() or something, and fix comments?
> +enum {
> + VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
+2? Believable, but is it correct?
> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {
This comment really helped while reading the code. Kudos!
Thanks!
Rusty.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack•org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack•org"> email@kvack•org </a>
next prev parent reply other threads:[~2009-11-06 4:59 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <cover.1257349249.git.mst@redhat.com>
2009-11-04 15:55 ` [PATCHv8 1/3] tun: export underlying socket Michael S. Tsirkin
2009-11-04 15:55 ` [PATCHv8 2/3] mm: export use_mm/unuse_mm to modules Michael S. Tsirkin
2009-11-04 15:57 ` [PATCHv8 3/3] vhost_net: a kernel-level virtio server Michael S. Tsirkin
2009-11-06 4:59 ` Rusty Russell [this message]
2009-11-08 11:35 ` Michael S. Tsirkin
2009-11-09 6:17 ` Rusty Russell
2009-11-09 7:10 ` Michael S. Tsirkin
2009-11-10 1:08 ` Rusty Russell
2009-11-09 7:20 ` Michael S. Tsirkin
2009-11-09 11:55 ` Michael S. Tsirkin
2010-05-04 18:22 ` virtio: put last_used and last_avail index into ring itself Michael S. Tsirkin
2010-05-06 0:52 ` Rusty Russell
2010-05-06 6:27 ` Michael S. Tsirkin
2010-05-07 3:05 ` Rusty Russell
2010-05-09 8:57 ` Michael S. Tsirkin
2010-05-10 3:11 ` Rusty Russell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200911061529.17500.rusty@rustcorp.com.au \
--to=rusty@rustcorp$(echo .)com.au \
--cc=akpm@linux-foundation$(echo .)org \
--cc=dwalker@fifo99$(echo .)com \
--cc=eric.dumazet@gmail$(echo .)com \
--cc=gregory.haskins@gmail$(echo .)com \
--cc=hpa@zytor$(echo .)com \
--cc=kvm@vger$(echo .)kernel.org \
--cc=linux-kernel@vger$(echo .)kernel.org \
--cc=linux-mm@kvack$(echo .)org \
--cc=mingo@elte$(echo .)hu \
--cc=mst@redhat$(echo .)com \
--cc=netdev@vger$(echo .)kernel.org \
--cc=paulmck@linux$(echo .)vnet.ibm.com \
--cc=s.hetze@linux-ag$(echo .)com \
--cc=virtualization@lists$(echo .)linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox