Re: [PATCH v12 07/26] nvme-tcp: Add DDP offload control path

public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed

From: Aurelien Aptel <aaptel@nvidia•com>
To: Sagi Grimberg <sagi@grimberg•me>,
	linux-nvme@lists•infradead.org, netdev@vger•kernel.org,
	hch@lst•de, kbusch@kernel•org, axboe@fb•com,
	chaitanyak@nvidia•com, davem@davemloft•net, kuba@kernel•org
Cc: Boris Pismenny <borisp@nvidia•com>,
	aurelien.aptel@gmail•com, smalin@nvidia•com, malin1024@gmail•com,
	ogerlitz@nvidia•com, yorayz@nvidia•com, galshalom@nvidia•com,
	mgurtovoy@nvidia•com
Subject: Re: [PATCH v12 07/26] nvme-tcp: Add DDP offload control path
Date: Wed, 16 Aug 2023 15:30:44 +0300	[thread overview]
Message-ID: <253wmxvuq2z.fsf@nvidia.com> (raw)
In-Reply-To: <bc5cd2a7-efc4-e4df-cae5-5c527dd704a6@grimberg.me>

Sagi Grimberg <sagi@grimberg•me> writes:
>>>> +     if (!netdev || !queue)
>>>> +             return false;
>>>
>>> Is it reasonable to be called here with !netdev or !queue ?
>>
>> The check is needed only for the IO queue case but we can move it
>> earlier in nvme_tcp_start_queue().
>
> I still don't understand even on io queues how this can happen.

In case where the netdev does not support DDP, when the admin queue is
started, netdev->offloading_netdev will not be set and therefore will be
NULL.

Later, when the IO queue is started:

    netdev = queue->ctrl->offloading_netdev; <== NULL
    if (is_netdev_ulp_offload_active(netdev, queue)) { <== we pass NULL

We can move the NULL check higher-up if you prefer, like so:

    if (queue->ctrl->offloading_netdev &&
        is_netdev_ulp_offload_active(queue->ctrl->offloading_netdev, queue)) {

>>>> +
>>>> +     /* If we cannot query the netdev limitations, do not offload */
>>>> +     if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>>> +             return false;
>>>> +
>>>> +     /* If netdev supports nvme-tcp ddp offload, we can offload */
>>>> +     if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>>> +             return true;
>>>
>>> This should be coming from the API itself, have the limits query
>>> api fail if this is off.
>>
>> We can move the function to the ULP DDP layer.
>>
>>> btw, what is the active thing? is this driven from ethtool enable?
>>> what happens if the user disables it while there is a ulp using it?
>>
>> The active bits are indeed driven by ethtool according to the design
>> Jakub suggested.
>> The nvme-tcp connection will have to be reconnected to see the effect of
>> changing the bit.
>
> It should move inside the api as well. Don't want to care about it in
> nvme.

Ok, we will move it there.

>>>> +
>>>> +     return false;
>>>
>>> This can be folded to the above function.
>>
>> We won't be able to check for TLS in a common wrapper. We think this
>> should be kept.
>
> Why? any tcp ddp need to be able to support tls. Nothing specific to
> nvme here.

True, we will move it to the ULP wrapper.

>>>> +     /*
>>>> +      * The atomic operation guarantees that we don't miss any NIC driver
>>>> +      * resync requests submitted after the above checks.
>>>> +      */
>>>> +     if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>>> +                          pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>>> +                          atomic64_read(&queue->resync_req))
>>>> +             netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>>> +                                                     queue->sock->sk,
>>>> +                                                     pdu_seq);
>>>
>>> Who else is doing an atomic on this value? and what happens
>>> if the cmpxchg fails?
>>
>> The driver thread can set queue->resync_req concurrently in patch
>> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
>> nvmeotcp_update_resync().
>>
>> If the cmpxchg fails it means a new resync request was triggered by the
>> HW, the old request will be dropped and the new one will be processed by
>> a later PDU.
>
> So resync_req is actually the current tcp sequence number or something?
> The name resync_req is very confusing.

queue->resync_req is the TCP sequence for which the HW requested a
resync operation. We can rename it with queue->resync_tcp_seq.

>>>> +                     ret = nvme_tcp_offload_socket(queue);
>>>> +                     if (ret) {
>>>> +                             dev_info(nctrl->device,
>>>> +                                      "failed to setup offload on queue %d ret=%d\n",
>>>> +                                      idx, ret);
>>>> +                     }
>>>> +             }
>>>> +     } else {
>>>>                ret = nvmf_connect_admin_queue(nctrl);
>>>> +             if (ret)
>>>> +                     goto err;
>>>>
>>>> -     if (!ret) {
>>>> -             set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>>> -     } else {
>>>> -             if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>>> -                     __nvme_tcp_stop_queue(queue);
>>>> -             dev_err(nctrl->device,
>>>> -                     "failed to connect queue: %d ret=%d\n", idx, ret);
>>>> +             netdev = get_netdev_for_sock(queue->sock->sk);
>>>
>>> Is there any chance that this is a different netdev than what is
>>> already recorded? doesn't make sense to me.
>>
>> The idea is that we are first starting the admin queue, which looks up
>> the netdev associated with the socket and stored in the queue. Later,
>> when the IO queues are started, we use the recorded netdev.
>>
>> In cases of bonding or vlan, a netdev can have lower device links, which
>> get_netdev_for_sock() will look up.
>
> I think the code should in high level do:
>         if (idx) {
>                 ret = nvmf_connect_io_queue(nctrl, idx);
>                 if (ret)
>                         goto err;
>                 if (nvme_tcp_ddp_query_limits(queue))
>                         nvme_tcp_offload_socket(queue);
>
>         } else {
>                 ret = nvmf_connect_admin_queue(nctrl);
>                 if (ret)
>                         goto err;
>                 ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
>                 if (nvme_tcp_ddp_query_limits(queue))
>                         nvme_tcp_offload_apply_limits(queue);
>         }

Ok, we will follow this design.

> ctrl->ddp_netdev should be cleared and put when the admin queue
> is stopped/freed, similar to how async_req is handled.

Thanks, we will clear ddp_netdev on queue stop/free.
This will also prevent reusing a potentially wrong netdev after a reconnection.

>>>> +             /*
>>>> +              * release the device as no offload context is
>>>> +              * established yet.
>>>> +              */
>>>> +             dev_put(netdev);
>>>
>>> the put is unclear, what does it pair with? the get_netdev_for_sock?
>>
>> Yes, get_netdev_for_sock() takes a reference, which we don't need at
>> that point so we put it.
>
> Well, you store a pointer to it, what happens if it goes away while
> the controller is being set up?

It's a problem. We will remove the dev_put() to keep the first
reference, and only release it when it is cleared from the admin queue.

next prev parent reply	other threads:[~2023-08-16 12:30 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-12 16:14 [PATCH v12 00/26] nvme-tcp receive offloads Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 01/26] net: Introduce direct data placement tcp offload Aurelien Aptel
2023-08-09  7:15   ` Sagi Grimberg
2023-08-10 14:46     ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 02/26] net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS} Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 03/26] net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats Aurelien Aptel
2023-07-15 10:14   ` Simon Horman
2023-07-17  9:45     ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 04/26] Documentation: document netlink ULP_DDP_GET/SET messages Aurelien Aptel
2023-07-15 10:17   ` Simon Horman
2023-07-17  9:47     ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 05/26] iov_iter: skip copy if src == dst for direct data placement Aurelien Aptel
2023-08-16  0:24   ` Max Gurtovoy
2023-07-12 16:14 ` [PATCH v12 06/26] net/tls,core: export get_netdev_for_sock Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 07/26] nvme-tcp: Add DDP offload control path Aurelien Aptel
2023-08-01  2:25   ` Chaitanya Kulkarni
2023-08-09  7:39     ` Sagi Grimberg
2023-08-11  5:28       ` Chaitanya Kulkarni
2023-08-16  0:50         ` Max Gurtovoy
2023-08-09  7:13   ` Sagi Grimberg
2023-08-14 16:11     ` Aurelien Aptel
2023-08-14 18:54       ` Sagi Grimberg
2023-08-16 12:30         ` Aurelien Aptel [this message]
2023-07-12 16:14 ` [PATCH v12 08/26] nvme-tcp: Add DDP data-path Aurelien Aptel
2023-08-09  7:35   ` Sagi Grimberg
2023-08-14 16:12     ` Aurelien Aptel
2023-08-14 19:01       ` Sagi Grimberg
2023-08-17 13:28         ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 09/26] nvme-tcp: RX DDGST offload Aurelien Aptel
2023-08-09  7:59   ` Sagi Grimberg
2023-08-10 14:48     ` Aurelien Aptel
2023-08-13 13:49       ` Sagi Grimberg
2023-07-12 16:14 ` [PATCH v12 10/26] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2023-08-09  8:00   ` Sagi Grimberg
2023-08-16 13:03     ` Aurelien Aptel
2023-08-16 14:10       ` Sagi Grimberg
2023-08-17 14:09         ` Aurelien Aptel
2023-08-20 10:50           ` Sagi Grimberg
2023-08-21 12:33             ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 11/26] nvme-tcp: Add modparam to control the ULP offload enablement Aurelien Aptel
2023-08-09  8:03   ` Sagi Grimberg
2023-08-10 14:50     ` Aurelien Aptel
2023-08-16  1:05     ` Max Gurtovoy
2023-07-12 16:14 ` [PATCH v12 12/26] nvme-tcp: Only enable offload with TLS if the driver supports it Aurelien Aptel
2023-08-09  8:05   ` Sagi Grimberg
2023-08-10 14:52     ` Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 13/26] Documentation: add ULP DDP offload documentation Aurelien Aptel
2023-07-15 10:32   ` Simon Horman
2023-07-17  9:48     ` Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 14/26] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 15/26] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 16/26] net/mlx5e: Have mdev pointer directly on the icosq structure Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 17/26] net/mlx5e: Refactor doorbell function to allow avoiding a completion Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 18/26] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 19/26] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 20/26] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 21/26] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 22/26] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 23/26] net/mlx5e: NVMEoTCP, ddp setup and resync Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 24/26] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 25/26] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 26/26] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=253wmxvuq2z.fsf@nvidia.com \
    --to=aaptel@nvidia$(echo .)com \
    --cc=aurelien.aptel@gmail$(echo .)com \
    --cc=axboe@fb$(echo .)com \
    --cc=borisp@nvidia$(echo .)com \
    --cc=chaitanyak@nvidia$(echo .)com \
    --cc=davem@davemloft$(echo .)net \
    --cc=galshalom@nvidia$(echo .)com \
    --cc=hch@lst$(echo .)de \
    --cc=kbusch@kernel$(echo .)org \
    --cc=kuba@kernel$(echo .)org \
    --cc=linux-nvme@lists$(echo .)infradead.org \
    --cc=malin1024@gmail$(echo .)com \
    --cc=mgurtovoy@nvidia$(echo .)com \
    --cc=netdev@vger$(echo .)kernel.org \
    --cc=ogerlitz@nvidia$(echo .)com \
    --cc=sagi@grimberg$(echo .)me \
    --cc=smalin@nvidia$(echo .)com \
    --cc=yorayz@nvidia$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox