public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed
From: luoxuanqiang <xuanqiang.luo@linux•dev>
To: Eric Dumazet <edumazet@google•com>
Cc: kuniyu@google•com, "Paul E. McKenney" <paulmck@kernel•org>,
	kerneljasonxing@gmail•com, davem@davemloft•net, kuba@kernel•org,
	netdev@vger•kernel.org, Xuanqiang Luo <luoxuanqiang@kylinos•cn>,
	Frederic Weisbecker <frederic@kernel•org>,
	Neeraj Upadhyay <neeraj.upadhyay@kernel•org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
Date: Tue, 14 Oct 2025 19:40:13 +0800	[thread overview]
Message-ID: <f27bb6dc-b0ac-45aa-b748-4156531ca825@linux.dev> (raw)
In-Reply-To: <CANn89iKa9kTLSPLf+OBR=Tbs9SE=qpSMrR==L9sW9xc=Mgi0Fw@mail.gmail.com>


在 2025/10/14 18:02, Eric Dumazet 写道:
> On Tue, Oct 14, 2025 at 1:41 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>
>> 在 2025/10/14 16:09, Eric Dumazet 写道:
>>> On Tue, Oct 14, 2025 at 1:05 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>>> 在 2025/10/14 15:34, Eric Dumazet 写道:
>>>>> On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>>>>> 在 2025/10/13 17:49, Eric Dumazet 写道:
>>>>>>> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@linux•dev> wrote:
>>>>>>>> 在 2025/10/13 15:31, Eric Dumazet 写道:
>>>>>>>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@linux•dev> wrote:
>>>>>>>>>> From: Xuanqiang Luo <luoxuanqiang@kylinos•cn>
>>>>>>>>>>
>>>>>>>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>>>>>>>>>
>>>>>>>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
>>>>>>>>>> mentioned in the patch below:
>>>>>>>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>>>> rculist_nulls")
>>>>>>>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>>>> hlist_nulls")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos•cn>
>>>>>>>>>> ---
>>>>>>>>>>       include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>       1 file changed, 59 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>>>>>>>>>> index 89186c499dd4..c26cb83ca071 100644
>>>>>>>>>> --- a/include/linux/rculist_nulls.h
>>>>>>>>>> +++ b/include/linux/rculist_nulls.h
>>>>>>>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>>>>>>>>       #define hlist_nulls_next_rcu(node) \
>>>>>>>>>>              (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
>>>>>>>>>> + * @node: element of the list.
>>>>>>>>>> + */
>>>>>>>>>> +#define hlist_nulls_pprev_rcu(node) \
>>>>>>>>>> +       (*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
>>>>>>>>>> +
>>>>>>>>>>       /**
>>>>>>>>>>        * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>>>>>>>>>>        * @n: the element to delete from the hash list.
>>>>>>>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>>>>>>>>>              n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
>>>>>>>>>> + * @old: the element to be replaced
>>>>>>>>>> + * @new: the new element to insert
>>>>>>>>>> + *
>>>>>>>>>> + * Description:
>>>>>>>>>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
>>>>>>>>>> + * permitting racing traversals.
>>>>>>>>>> + *
>>>>>>>>>> + * The caller must take whatever precautions are necessary (such as holding
>>>>>>>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>>>>>>>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>>>>>>>>>> + * list.  However, it is perfectly legal to run concurrently with the _rcu
>>>>>>>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>>>>>>>>>> + */
>>>>>>>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>>>>>>>>>> +                                          struct hlist_nulls_node *new)
>>>>>>>>>> +{
>>>>>>>>>> +       struct hlist_nulls_node *next = old->next;
>>>>>>>>>> +
>>>>>>>>>> +       WRITE_ONCE(new->next, next);
>>>>>>>>>> +       WRITE_ONCE(new->pprev, old->pprev);
>>>>>>>>> I do not think these two WRITE_ONCE() are needed.
>>>>>>>>>
>>>>>>>>> At this point new is not yet visible.
>>>>>>>>>
>>>>>>>>> The following  rcu_assign_pointer() is enough to make sure prior
>>>>>>>>> writes are committed to memory.
>>>>>>>> Dear Eric,
>>>>>>>>
>>>>>>>> I’m quoting your more detailed explanation from the other patch [0], thank
>>>>>>>> you for that!
>>>>>>>>
>>>>>>>> However, regarding new->next, if the new object is allocated with
>>>>>>>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
>>>>>>>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>> rculist_nulls”)?
>>>>>>>>
>>>>>>>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
>>>>>>>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
>>>>>>> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
>>>>>>>
>>>>>>> It is a bit weird to annotate the writes, but not the lockless reads,
>>>>>>> even if apparently KCSAN
>>>>>>> is okay with that.
>>>>>>>
>>>>>> Dear Eric,
>>>>>>
>>>>>> I’m sorry—I still haven’t fully grasped the scenario you mentioned where
>>>>>> sk_unhashed()/sk_hashed() can be called from lock‑less contexts. It seems
>>>>>> similar to the race described in commit 860c8802ace1 (“rcu: Use
>>>>>> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g.: [0].
>>>>>>
>>>>> inet_unhash() does a lockless sk_unhash(sk) call, while no lock is
>>>>> held in some cases (look at tcp_done())
>>>>>
>>>>> void inet_unhash(struct sock *sk)
>>>>> {
>>>>> struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
>>>>>
>>>>> if (sk_unhashed(sk))    // Here no lock is held
>>>>>        return;
>>>>>
>>>>> Relevant lock (depending on (sk->sk_state == TCP_LISTEN)) is acquired
>>>>> a few lines later.
>>>>>
>>>>> Then
>>>>>
>>>>> __sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.
>>>>>
>>>> Dear Eric,
>>>>
>>>> Thanks for the quick response!
>>>>
>>>> In the call path:
>>>>            tcp_retransmit_timer()
>>>>                    tcp_write_err()
>>>>                            tcp_done()
>>>>
>>>> tcp_retransmit_timer() already calls lockdep_sock_is_held(sk) to check the
>>>> socket‑lock state.
>>>>
>>>> void tcp_retransmit_timer(struct sock *sk)
>>>> {
>>>>            struct tcp_sock *tp = tcp_sk(sk);
>>>>            struct net *net = sock_net(sk);
>>>>            struct inet_connection_sock *icsk = inet_csk(sk);
>>>>            struct request_sock *req;
>>>>            struct sk_buff *skb;
>>>>
>>>>            req = rcu_dereference_protected(tp->fastopen_rsk,
>>>>                                     lockdep_sock_is_held(sk)); // Check here
>>>>
>>>> Does that mean we’re already protected by lock_sock(sk) or
>>>> bh_lock_sock(sk)?
>>> But the socket lock is not protecting ehash buckets. These are other locks.
>>>
>>> Also, inet_unhash() can be called from other paths, without a socket
>>> lock being held.
>> Dear Eric,
>>
>> I understand the distinction now, but looking at the call stack in [0],
>> both CPUs reach inet_unhash() via the tcp_retransmit_timer() path, so only
>> one of them should pass the check, right?
>>
>> I’m still not clear how this race condition arises.
> Because that is two different sockets. This once again explains why
> holding or not the socket lock is not relevant.
>
> One of them is changing pointers in the chain, messing with
> surrounding pointers.
>
> The second one is reading sk->sk_node.pprev without using
> hlist_unhashed_lockless().
>
> I do not know how to explain this...
>
> Please look at the difference between hlist_unhashed_lockless() and
> hlist_unhashed().

Thank you very much for the patient explanation. :)

I think I understand now!

Best regards!
Xuanqiang.


  reply	other threads:[~2025-10-14 11:41 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-26  7:40 [PATCH net-next v7 0/3] net: Avoid ehash lookup races xuanqiang.luo
2025-09-26  7:40 ` [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
2025-09-27 20:31   ` Kuniyuki Iwashima
2025-09-30  9:16   ` Paolo Abeni
2025-10-01 15:03     ` luoxuanqiang
2025-10-13  5:36     ` Jiayuan Chen
2025-10-13  6:26       ` Jason Xing
2025-10-13  7:04         ` luoxuanqiang
2025-10-13 12:08           ` Simon Horman
2025-10-14  2:29             ` luoxuanqiang
2025-10-01 12:19   ` Frederic Weisbecker
2025-10-13  7:31   ` Eric Dumazet
2025-10-13  8:25     ` luoxuanqiang
2025-10-13  9:49       ` Eric Dumazet
2025-10-14  7:20         ` luoxuanqiang
2025-10-14  7:34           ` Eric Dumazet
2025-10-14  8:04             ` luoxuanqiang
2025-10-14  8:09               ` Eric Dumazet
2025-10-14  8:40                 ` luoxuanqiang
2025-10-14 10:02                   ` Eric Dumazet
2025-10-14 11:40                     ` luoxuanqiang [this message]
2025-09-26  7:40 ` [PATCH net-next v7 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
2025-09-26  7:40 ` [PATCH net-next v7 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
2025-09-27  2:56 ` [PATCH net-next v7 0/3] net: Avoid ehash lookup races Jiayuan Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f27bb6dc-b0ac-45aa-b748-4156531ca825@linux.dev \
    --to=xuanqiang.luo@linux$(echo .)dev \
    --cc=davem@davemloft$(echo .)net \
    --cc=edumazet@google$(echo .)com \
    --cc=frederic@kernel$(echo .)org \
    --cc=kerneljasonxing@gmail$(echo .)com \
    --cc=kuba@kernel$(echo .)org \
    --cc=kuniyu@google$(echo .)com \
    --cc=luoxuanqiang@kylinos$(echo .)cn \
    --cc=neeraj.upadhyay@kernel$(echo .)org \
    --cc=netdev@vger$(echo .)kernel.org \
    --cc=paulmck@kernel$(echo .)org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox