From: luoxuanqiang <xuanqiang.luo@linux•dev>
To: Eric Dumazet <edumazet@google•com>
Cc: kuniyu@google•com, "Paul E. McKenney" <paulmck@kernel•org>,
kerneljasonxing@gmail•com, davem@davemloft•net, kuba@kernel•org,
netdev@vger•kernel.org, Xuanqiang Luo <luoxuanqiang@kylinos•cn>,
Frederic Weisbecker <frederic@kernel•org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel•org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
Date: Tue, 14 Oct 2025 19:40:13 +0800 [thread overview]
Message-ID: <f27bb6dc-b0ac-45aa-b748-4156531ca825@linux.dev> (raw)
In-Reply-To: <CANn89iKa9kTLSPLf+OBR=Tbs9SE=qpSMrR==L9sW9xc=Mgi0Fw@mail.gmail.com>
在 2025/10/14 18:02, Eric Dumazet 写道:
> On Tue, Oct 14, 2025 at 1:41 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>
>> 在 2025/10/14 16:09, Eric Dumazet 写道:
>>> On Tue, Oct 14, 2025 at 1:05 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>>> 在 2025/10/14 15:34, Eric Dumazet 写道:
>>>>> On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>>>>> 在 2025/10/13 17:49, Eric Dumazet 写道:
>>>>>>> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>>>>>>> 在 2025/10/13 15:31, Eric Dumazet 写道:
>>>>>>>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@linux•dev> wrote:
>>>>>>>>>> From: Xuanqiang Luo <luoxuanqiang@kylinos•cn>
>>>>>>>>>>
>>>>>>>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>>>>>>>>>
>>>>>>>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
>>>>>>>>>> mentioned in the patch below:
>>>>>>>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>>>> rculist_nulls")
>>>>>>>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>>>> hlist_nulls")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos•cn>
>>>>>>>>>> ---
>>>>>>>>>> include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
>>>>>>>>>> 1 file changed, 59 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>>>>>>>>>> index 89186c499dd4..c26cb83ca071 100644
>>>>>>>>>> --- a/include/linux/rculist_nulls.h
>>>>>>>>>> +++ b/include/linux/rculist_nulls.h
>>>>>>>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>>>>>>>> #define hlist_nulls_next_rcu(node) \
>>>>>>>>>> (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
>>>>>>>>>> + * @node: element of the list.
>>>>>>>>>> + */
>>>>>>>>>> +#define hlist_nulls_pprev_rcu(node) \
>>>>>>>>>> + (*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
>>>>>>>>>> +
>>>>>>>>>> /**
>>>>>>>>>> * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>>>>>>>>>> * @n: the element to delete from the hash list.
>>>>>>>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>>>>>>>>> n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
>>>>>>>>>> + * @old: the element to be replaced
>>>>>>>>>> + * @new: the new element to insert
>>>>>>>>>> + *
>>>>>>>>>> + * Description:
>>>>>>>>>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
>>>>>>>>>> + * permitting racing traversals.
>>>>>>>>>> + *
>>>>>>>>>> + * The caller must take whatever precautions are necessary (such as holding
>>>>>>>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>>>>>>>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>>>>>>>>>> + * list. However, it is perfectly legal to run concurrently with the _rcu
>>>>>>>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>>>>>>>>>> + */
>>>>>>>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>>>>>>>>>> + struct hlist_nulls_node *new)
>>>>>>>>>> +{
>>>>>>>>>> + struct hlist_nulls_node *next = old->next;
>>>>>>>>>> +
>>>>>>>>>> + WRITE_ONCE(new->next, next);
>>>>>>>>>> + WRITE_ONCE(new->pprev, old->pprev);
>>>>>>>>> I do not think these two WRITE_ONCE() are needed.
>>>>>>>>>
>>>>>>>>> At this point new is not yet visible.
>>>>>>>>>
>>>>>>>>> The following rcu_assign_pointer() is enough to make sure prior
>>>>>>>>> writes are committed to memory.
>>>>>>>> Dear Eric,
>>>>>>>>
>>>>>>>> I’m quoting your more detailed explanation from the other patch [0], thank
>>>>>>>> you for that!
>>>>>>>>
>>>>>>>> However, regarding new->next, if the new object is allocated with
>>>>>>>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
>>>>>>>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>> rculist_nulls”)?
>>>>>>>>
>>>>>>>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
>>>>>>>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
>>>>>>> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
>>>>>>>
>>>>>>> It is a bit weird to annotate the writes, but not the lockless reads,
>>>>>>> even if apparently KCSAN
>>>>>>> is okay with that.
>>>>>>>
>>>>>> Dear Eric,
>>>>>>
>>>>>> I’m sorry—I still haven’t fully grasped the scenario you mentioned where
>>>>>> sk_unhashed()/sk_hashed() can be called from lock‑less contexts. It seems
>>>>>> similar to the race described in commit 860c8802ace1 (“rcu: Use
>>>>>> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g.: [0].
>>>>>>
>>>>> inet_unhash() does a lockless sk_unhash(sk) call, while no lock is
>>>>> held in some cases (look at tcp_done())
>>>>>
>>>>> void inet_unhash(struct sock *sk)
>>>>> {
>>>>> struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
>>>>>
>>>>> if (sk_unhashed(sk)) // Here no lock is held
>>>>> return;
>>>>>
>>>>> Relevant lock (depending on (sk->sk_state == TCP_LISTEN)) is acquired
>>>>> a few lines later.
>>>>>
>>>>> Then
>>>>>
>>>>> __sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.
>>>>>
>>>> Dear Eric,
>>>>
>>>> Thanks for the quick response!
>>>>
>>>> In the call path:
>>>> tcp_retransmit_timer()
>>>> tcp_write_err()
>>>> tcp_done()
>>>>
>>>> tcp_retransmit_timer() already calls lockdep_sock_is_held(sk) to check the
>>>> socket‑lock state.
>>>>
>>>> void tcp_retransmit_timer(struct sock *sk)
>>>> {
>>>> struct tcp_sock *tp = tcp_sk(sk);
>>>> struct net *net = sock_net(sk);
>>>> struct inet_connection_sock *icsk = inet_csk(sk);
>>>> struct request_sock *req;
>>>> struct sk_buff *skb;
>>>>
>>>> req = rcu_dereference_protected(tp->fastopen_rsk,
>>>> lockdep_sock_is_held(sk)); // Check here
>>>>
>>>> Does that mean we’re already protected by lock_sock(sk) or
>>>> bh_lock_sock(sk)?
>>> But the socket lock is not protecting ehash buckets. These are other locks.
>>>
>>> Also, inet_unhash() can be called from other paths, without a socket
>>> lock being held.
>> Dear Eric,
>>
>> I understand the distinction now, but looking at the call stack in [0],
>> both CPUs reach inet_unhash() via the tcp_retransmit_timer() path, so only
>> one of them should pass the check, right?
>>
>> I’m still not clear how this race condition arises.
> Because that is two different sockets. This once again explains why
> holding or not the socket lock is not relevant.
>
> One of them is changing pointers in the chain, messing with
> surrounding pointers.
>
> The second one is reading sk->sk_node.pprev without using
> hlist_unhashed_lockless().
>
> I do not know how to explain this...
>
> Please look at the difference between hlist_unhashed_lockless() and
> hlist_unhashed().
Thank you very much for the patient explanation. :)
I think I understand now!
Best regards!
Xuanqiang.
next prev parent reply other threads:[~2025-10-14 11:41 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-26 7:40 [PATCH net-next v7 0/3] net: Avoid ehash lookup races xuanqiang.luo
2025-09-26 7:40 ` [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
2025-09-27 20:31 ` Kuniyuki Iwashima
2025-09-30 9:16 ` Paolo Abeni
2025-10-01 15:03 ` luoxuanqiang
2025-10-13 5:36 ` Jiayuan Chen
2025-10-13 6:26 ` Jason Xing
2025-10-13 7:04 ` luoxuanqiang
2025-10-13 12:08 ` Simon Horman
2025-10-14 2:29 ` luoxuanqiang
2025-10-01 12:19 ` Frederic Weisbecker
2025-10-13 7:31 ` Eric Dumazet
2025-10-13 8:25 ` luoxuanqiang
2025-10-13 9:49 ` Eric Dumazet
2025-10-14 7:20 ` luoxuanqiang
2025-10-14 7:34 ` Eric Dumazet
2025-10-14 8:04 ` luoxuanqiang
2025-10-14 8:09 ` Eric Dumazet
2025-10-14 8:40 ` luoxuanqiang
2025-10-14 10:02 ` Eric Dumazet
2025-10-14 11:40 ` luoxuanqiang [this message]
2025-09-26 7:40 ` [PATCH net-next v7 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
2025-09-26 7:40 ` [PATCH net-next v7 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
2025-09-27 2:56 ` [PATCH net-next v7 0/3] net: Avoid ehash lookup races Jiayuan Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f27bb6dc-b0ac-45aa-b748-4156531ca825@linux.dev \
--to=xuanqiang.luo@linux$(echo .)dev \
--cc=davem@davemloft$(echo .)net \
--cc=edumazet@google$(echo .)com \
--cc=frederic@kernel$(echo .)org \
--cc=kerneljasonxing@gmail$(echo .)com \
--cc=kuba@kernel$(echo .)org \
--cc=kuniyu@google$(echo .)com \
--cc=luoxuanqiang@kylinos$(echo .)cn \
--cc=neeraj.upadhyay@kernel$(echo .)org \
--cc=netdev@vger$(echo .)kernel.org \
--cc=paulmck@kernel$(echo .)org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox