From: Aurelien Aptel <aaptel@nvidia•com>
To: Sagi Grimberg <sagi@grimberg•me>,
linux-nvme@lists•infradead.org, netdev@vger•kernel.org,
hch@lst•de, kbusch@kernel•org, axboe@fb•com,
chaitanyak@nvidia•com, davem@davemloft•net, kuba@kernel•org
Cc: Boris Pismenny <borisp@nvidia•com>,
aurelien.aptel@gmail•com, smalin@nvidia•com, malin1024@gmail•com,
ogerlitz@nvidia•com, yorayz@nvidia•com, galshalom@nvidia•com,
mgurtovoy@nvidia•com
Subject: Re: [PATCH v12 07/26] nvme-tcp: Add DDP offload control path
Date: Wed, 16 Aug 2023 15:30:44 +0300 [thread overview]
Message-ID: <253wmxvuq2z.fsf@nvidia.com> (raw)
In-Reply-To: <bc5cd2a7-efc4-e4df-cae5-5c527dd704a6@grimberg.me>
Sagi Grimberg <sagi@grimberg•me> writes:
>>>> + if (!netdev || !queue)
>>>> + return false;
>>>
>>> Is it reasonable to be called here with !netdev or !queue ?
>>
>> The check is needed only for the IO queue case but we can move it
>> earlier in nvme_tcp_start_queue().
>
> I still don't understand even on io queues how this can happen.
In case where the netdev does not support DDP, when the admin queue is
started, netdev->offloading_netdev will not be set and therefore will be
NULL.
Later, when the IO queue is started:
netdev = queue->ctrl->offloading_netdev; <== NULL
if (is_netdev_ulp_offload_active(netdev, queue)) { <== we pass NULL
We can move the NULL check higher-up if you prefer, like so:
if (queue->ctrl->offloading_netdev &&
is_netdev_ulp_offload_active(queue->ctrl->offloading_netdev, queue)) {
>>>> +
>>>> + /* If we cannot query the netdev limitations, do not offload */
>>>> + if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>>> + return false;
>>>> +
>>>> + /* If netdev supports nvme-tcp ddp offload, we can offload */
>>>> + if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>>> + return true;
>>>
>>> This should be coming from the API itself, have the limits query
>>> api fail if this is off.
>>
>> We can move the function to the ULP DDP layer.
>>
>>> btw, what is the active thing? is this driven from ethtool enable?
>>> what happens if the user disables it while there is a ulp using it?
>>
>> The active bits are indeed driven by ethtool according to the design
>> Jakub suggested.
>> The nvme-tcp connection will have to be reconnected to see the effect of
>> changing the bit.
>
> It should move inside the api as well. Don't want to care about it in
> nvme.
Ok, we will move it there.
>>>> +
>>>> + return false;
>>>
>>> This can be folded to the above function.
>>
>> We won't be able to check for TLS in a common wrapper. We think this
>> should be kept.
>
> Why? any tcp ddp need to be able to support tls. Nothing specific to
> nvme here.
True, we will move it to the ULP wrapper.
>>>> + /*
>>>> + * The atomic operation guarantees that we don't miss any NIC driver
>>>> + * resync requests submitted after the above checks.
>>>> + */
>>>> + if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>>> + pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>>> + atomic64_read(&queue->resync_req))
>>>> + netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>>> + queue->sock->sk,
>>>> + pdu_seq);
>>>
>>> Who else is doing an atomic on this value? and what happens
>>> if the cmpxchg fails?
>>
>> The driver thread can set queue->resync_req concurrently in patch
>> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
>> nvmeotcp_update_resync().
>>
>> If the cmpxchg fails it means a new resync request was triggered by the
>> HW, the old request will be dropped and the new one will be processed by
>> a later PDU.
>
> So resync_req is actually the current tcp sequence number or something?
> The name resync_req is very confusing.
queue->resync_req is the TCP sequence for which the HW requested a
resync operation. We can rename it with queue->resync_tcp_seq.
>>>> + ret = nvme_tcp_offload_socket(queue);
>>>> + if (ret) {
>>>> + dev_info(nctrl->device,
>>>> + "failed to setup offload on queue %d ret=%d\n",
>>>> + idx, ret);
>>>> + }
>>>> + }
>>>> + } else {
>>>> ret = nvmf_connect_admin_queue(nctrl);
>>>> + if (ret)
>>>> + goto err;
>>>>
>>>> - if (!ret) {
>>>> - set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>>> - } else {
>>>> - if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>>> - __nvme_tcp_stop_queue(queue);
>>>> - dev_err(nctrl->device,
>>>> - "failed to connect queue: %d ret=%d\n", idx, ret);
>>>> + netdev = get_netdev_for_sock(queue->sock->sk);
>>>
>>> Is there any chance that this is a different netdev than what is
>>> already recorded? doesn't make sense to me.
>>
>> The idea is that we are first starting the admin queue, which looks up
>> the netdev associated with the socket and stored in the queue. Later,
>> when the IO queues are started, we use the recorded netdev.
>>
>> In cases of bonding or vlan, a netdev can have lower device links, which
>> get_netdev_for_sock() will look up.
>
> I think the code should in high level do:
> if (idx) {
> ret = nvmf_connect_io_queue(nctrl, idx);
> if (ret)
> goto err;
> if (nvme_tcp_ddp_query_limits(queue))
> nvme_tcp_offload_socket(queue);
>
> } else {
> ret = nvmf_connect_admin_queue(nctrl);
> if (ret)
> goto err;
> ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
> if (nvme_tcp_ddp_query_limits(queue))
> nvme_tcp_offload_apply_limits(queue);
> }
Ok, we will follow this design.
> ctrl->ddp_netdev should be cleared and put when the admin queue
> is stopped/freed, similar to how async_req is handled.
Thanks, we will clear ddp_netdev on queue stop/free.
This will also prevent reusing a potentially wrong netdev after a reconnection.
>>>> + /*
>>>> + * release the device as no offload context is
>>>> + * established yet.
>>>> + */
>>>> + dev_put(netdev);
>>>
>>> the put is unclear, what does it pair with? the get_netdev_for_sock?
>>
>> Yes, get_netdev_for_sock() takes a reference, which we don't need at
>> that point so we put it.
>
> Well, you store a pointer to it, what happens if it goes away while
> the controller is being set up?
It's a problem. We will remove the dev_put() to keep the first
reference, and only release it when it is cleared from the admin queue.
next prev parent reply other threads:[~2023-08-16 12:30 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-12 16:14 [PATCH v12 00/26] nvme-tcp receive offloads Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 01/26] net: Introduce direct data placement tcp offload Aurelien Aptel
2023-08-09 7:15 ` Sagi Grimberg
2023-08-10 14:46 ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 02/26] net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS} Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 03/26] net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats Aurelien Aptel
2023-07-15 10:14 ` Simon Horman
2023-07-17 9:45 ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 04/26] Documentation: document netlink ULP_DDP_GET/SET messages Aurelien Aptel
2023-07-15 10:17 ` Simon Horman
2023-07-17 9:47 ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 05/26] iov_iter: skip copy if src == dst for direct data placement Aurelien Aptel
2023-08-16 0:24 ` Max Gurtovoy
2023-07-12 16:14 ` [PATCH v12 06/26] net/tls,core: export get_netdev_for_sock Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 07/26] nvme-tcp: Add DDP offload control path Aurelien Aptel
2023-08-01 2:25 ` Chaitanya Kulkarni
2023-08-09 7:39 ` Sagi Grimberg
2023-08-11 5:28 ` Chaitanya Kulkarni
2023-08-16 0:50 ` Max Gurtovoy
2023-08-09 7:13 ` Sagi Grimberg
2023-08-14 16:11 ` Aurelien Aptel
2023-08-14 18:54 ` Sagi Grimberg
2023-08-16 12:30 ` Aurelien Aptel [this message]
2023-07-12 16:14 ` [PATCH v12 08/26] nvme-tcp: Add DDP data-path Aurelien Aptel
2023-08-09 7:35 ` Sagi Grimberg
2023-08-14 16:12 ` Aurelien Aptel
2023-08-14 19:01 ` Sagi Grimberg
2023-08-17 13:28 ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 09/26] nvme-tcp: RX DDGST offload Aurelien Aptel
2023-08-09 7:59 ` Sagi Grimberg
2023-08-10 14:48 ` Aurelien Aptel
2023-08-13 13:49 ` Sagi Grimberg
2023-07-12 16:14 ` [PATCH v12 10/26] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2023-08-09 8:00 ` Sagi Grimberg
2023-08-16 13:03 ` Aurelien Aptel
2023-08-16 14:10 ` Sagi Grimberg
2023-08-17 14:09 ` Aurelien Aptel
2023-08-20 10:50 ` Sagi Grimberg
2023-08-21 12:33 ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 11/26] nvme-tcp: Add modparam to control the ULP offload enablement Aurelien Aptel
2023-08-09 8:03 ` Sagi Grimberg
2023-08-10 14:50 ` Aurelien Aptel
2023-08-16 1:05 ` Max Gurtovoy
2023-07-12 16:14 ` [PATCH v12 12/26] nvme-tcp: Only enable offload with TLS if the driver supports it Aurelien Aptel
2023-08-09 8:05 ` Sagi Grimberg
2023-08-10 14:52 ` Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 13/26] Documentation: add ULP DDP offload documentation Aurelien Aptel
2023-07-15 10:32 ` Simon Horman
2023-07-17 9:48 ` Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 14/26] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 15/26] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 16/26] net/mlx5e: Have mdev pointer directly on the icosq structure Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 17/26] net/mlx5e: Refactor doorbell function to allow avoiding a completion Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 18/26] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 19/26] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 20/26] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 21/26] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 22/26] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 23/26] net/mlx5e: NVMEoTCP, ddp setup and resync Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 24/26] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 25/26] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 26/26] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=253wmxvuq2z.fsf@nvidia.com \
--to=aaptel@nvidia$(echo .)com \
--cc=aurelien.aptel@gmail$(echo .)com \
--cc=axboe@fb$(echo .)com \
--cc=borisp@nvidia$(echo .)com \
--cc=chaitanyak@nvidia$(echo .)com \
--cc=davem@davemloft$(echo .)net \
--cc=galshalom@nvidia$(echo .)com \
--cc=hch@lst$(echo .)de \
--cc=kbusch@kernel$(echo .)org \
--cc=kuba@kernel$(echo .)org \
--cc=linux-nvme@lists$(echo .)infradead.org \
--cc=malin1024@gmail$(echo .)com \
--cc=mgurtovoy@nvidia$(echo .)com \
--cc=netdev@vger$(echo .)kernel.org \
--cc=ogerlitz@nvidia$(echo .)com \
--cc=sagi@grimberg$(echo .)me \
--cc=smalin@nvidia$(echo .)com \
--cc=yorayz@nvidia$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox