From: John Fastabend <john.fastabend@gmail•com>
To: Or Gerlitz <gerlitz.or@gmail•com>
Cc: Or Gerlitz <ogerlitz@mellanox•com>, Jiri Pirko <jiri@resnulli•us>,
"Samudrala, Sridhar" <sridhar.samudrala@intel•com>,
Saeed Mahameed <saeedm@mellanox•com>,
"David S. Miller" <davem@davemloft•net>,
Linux Netdev List <netdev@vger•kernel.org>,
Hadar Hen-Zion <hadarh@mellanox•com>,
Jiri Pirko <jiri@mellanox•com>,
Andy Gospodarek <gospo@cumulusnetworks•com>,
Jesse Brandeburg <jesse.brandeburg@intel•com>,
John Fastabend <john.r.fastabend@intel•com>,
Ido Schimmel <idosch@mellanox•com>,
Tal Anker <Ankertal@mellanox•com>
Subject: Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control
Date: Wed, 29 Jun 2016 21:04:39 -0700
Message-ID: <57749A57.1030709@gmail.com>
In-Reply-To: <5774938D.7010902@gmail.com>

On 16-06-29 08:35 PM, John Fastabend wrote:
> On 16-06-29 03:09 PM, John Fastabend wrote:
>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>> <john.fastabend@gmail•com> wrote:
>>>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>>>>> On 6/28/2016 10:31 PM, John Fastabend wrote:
>>>>>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>>
>>>>>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>>>>>> implementing new features. Don't make things any more complicated :(
>>>
>>> [...]
>>>>>> Maybe I'm reading too much into the devlink flag names and if instead
>>>>>> you use a switch like the following,
>>>
>>>>>> VF representer : enable/disable the creation of VF netdevs to represent
>>>>>> the virtual functions on the PF
>>>
>>>>>> Much less complicated than magic switching between forwarding logic IMO
>>>>>> and you don't whack a default configuration that an entire stack (e.g.
>>>>>> libvirt) has been built to use.
>>>
>>>>> Re letting the user observe/modify the rules added by the
>>>>> driver/firmware while in legacy mode. Even if possible with bridge/fdb, it
>>>>> will be really pragmatical and doesn't make sense to get that done for
>>>>> the TC subsystem. So this isn't a well defined solution and anyway, as
>>>>> you said, legacy mode enhancements are a different exercise. Personally,
>>>>> I agree with Jiri that we should let legacy be legacy and focus on adding
>>>>> the new model.
>>>
>>>> The ixgbe driver already supports bridge and tc commands without the VF
>>>> representer. Adding the VF representer to these drivers just extends
>>>> the existing support so we have an identifier for VFs and now the
>>>> redirect action works and the fdb commands can specify the VF netdevs.
>>>> I don't see this as a problem because we already do it today with
>>>> 'ip' and bridge tools.
>>>
>>> To be precise, for both ixgbe and mlx5, the existing tc support
>>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>>> redirect to VF, but this is only for south --> north (wire --> VF)
>>> traffic, w.o the VF rep you can't do the other way around.
>>>
>>
>> Correct which is why we need the VF rep. So we are completely in
>> sync there.
>>
>>> Just to clarify, to what exact bridge command support did you refer for ixgbe?
>>
>> 'bridge fdb' commands are supported today on the PF. But it's the
>> same story as above: we need the VF rep to also use it on the
>> VF representer.
>>
>> Also 'bridge link' command for veb/vepa modes is supported and the
>> other link attributes could be supported with additional driver
>> support. No need for core changes here. But again yes only on the
>> PF so again we need the VF reps.
>>
>>>
>>> The forwarding done in the legacy mode is not well defined, and
>>> different across vendors, adding there the VF reps will not make it
>>> any better b/c some steering rules will be set by tc/bridge offloads
>>> while other rules will be put by the driver.
>>> I don't see how this takes us to better place.
>>
>> In legacy mode or any other mode you are defining some default policy
>> and rules.
>>
>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>> today. And similarly can be modified today using 'ip link' and 'bridge
>> fdb' at least on the intel devices. It's not undefined in any way: with
>> a quick query of the tools we can learn exactly what the configuration
>> is and even change it. This works fairly well with existing controllers
>> and stacks.
>>
>> The limitations are 'ip' only supports a single MAC address per VF and
>> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
>> or namespace we lose visibility of it. Providing a VF rep for this
>> solves both of those problems.
>>
>> In this new mode the default policy is to create a default miss rule
>> and implement no l2 forwarding rules. Unfortunately not all hardware
>> in use supports this default miss rule case but would still benefit
>> from having a VF rep. So we shouldn't make this a stipulation for
>> enabling VF reps. It also changes a default policy that has been in
>> place for years without IMO at least any compelling reason. It will
>> be easy enough to change the default l2 policy to a flow based model
>> with a few bridge/tc commands.
>>
>>>
>>>> We are also slightly in disagreement about what the default should be
>>>> with VF netdevs. I think the default should be the same L2 mac/vlan
>>>> switch behavior and see no reason to change it by default just because
>>>> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
>>>> around this default today. But I guess nothing in this series specifies
>>>> what the defaults of any given driver will be. VF netdevs are still
>>>> useful even on older hardware that only supports mac/vlan forwarding to
>>>> expose statistics and send/receive control frames such as lldp.
>>>
>>> Again, this is not about default engineering... and using the VF reps
>>> (not VF netdevs) in legacy mode only makes it more cryptic in my
>>> opinion. I agree some changes would be needed in openstack to support
>>> the new model, but this is how progress is made... you can't always
>>> keep all layers above you unchanged. Note that the VF reps behave the
>>> same as tap devices (v-switch doing xmit on tap --> recv in VM, VM
>>> sends --> recv on tap into the v-switch), so the change in open-stack
>>> would not be that big.
>>>
>>
>> But in this case we have no reason to break the stack above us. The
>> currently deployed usage is L2 mac/vlan. As soon as you bind a vSwitch
>> or whatever mgmt agent to the device it can go ahead and manage the
>> switch putting it in the correct mode using the tooling in 'bridge' and
>> 'tc'.
>>
>>
>>> [...]
>>>
>>>> Why I think the VF representer is a per port ethtool flag and not a
>>>> devlink option is my use case might be to assign a PF into a VM or
>>>> namespace where I don't want VF netdevs.
>>>
>>> again, we think the correct place to set how the eswitch is managed is
>>> through eswitch manager PCI devices and not net devices and hence
>>> ethtool is not the way to go.
>>>
>>> Also, how do you want your e-switch to be managed in this case?
>>>
>>
>> In the case where I don't create vf netdevs on one of the PFs I'll
>> manage the forwarding tables via the existing mechanisms 'ip' and
>> 'bridge'. However it's likely not a big deal because 'ip' and 'bridge'
>> will continue to work even if VF reps are around. The ethtool/devlink
>> comment was more about pointing out that creating VFs does not
>> require you to manage your switch any differently. It's useful even on
>> devices that can't support flow based forwarding for statistics and
>> setting port attributes like mtu, etc.
>>
>> .John
>>
>
> Probably bad form to respond to my own email but just to highlight how
> subtle the distinction is (hopefully not too much repeat),
>
> Today in "legacy" mode each VF mac address is automatically added to
> the fdb along with the PF mac address. If there is a miss in the table
> (an unknown mac) the packet is sent to the PF but unless the PF is in
> promisc mode the packet is dropped by the rx filter. I presume even
> with the proposed model you would want to continue to enforce the
> rx filter otherwise the instant you flip the mode you are open to
> receive unwanted traffic. The promisc mode semantics have been in place
> for a long time so certainly don't want to break that. Can we agree on
> the promisc point? Also bridges/vswitch/etc already set promisc mode
> once they attach to the netdevs.
>
> (assuming we agree on the promisc point?)
> In your proposed model the only difference I can see is when the mode is
> changed you don't want to add the VF mac address to the fdb table. How
> about rather than make this part of the mode selection pick one way to
> do this in all cases. Either add the VF mac addresses to the fdb or
> do not do this. I have a preference for adding the VF mac addresses
> because this is the current behavior. Then rename the devlink option
> "VF reps" or something because that is what it is controlling.
>
> The last thing to argue about is if it's a port attribute ala ethtool
> or a device attribute ala devlink. But maybe we can agree on everything
> up to this point?
>
> Thanks,
> John
>
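To make the legacy behavior described in the quoted text above concrete,
here is a rough user-space sketch of that lookup: known MACs hit the fdb,
a miss goes to the PF, and the PF rx filter drops it unless the PF is in
promisc mode. This is only an illustration, not driver code. The names
(legacy_forward, struct fdb_entry, PORT_PF, PORT_DROP, MAC_LEN) and the
port numbering are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical port numbering for this sketch: 0 is the PF, 1..N are VFs. */
enum { PORT_DROP = -1, PORT_PF = 0 };

/* One l2 forwarding entry: destination MAC -> port. */
struct fdb_entry {
	uint8_t mac[MAC_LEN];
	int port;
};

/*
 * Legacy-mode lookup as described above: the PF and every VF MAC are
 * in the table, so a known destination is forwarded to its port. An
 * unknown destination is a miss and goes to the PF, where the rx
 * filter drops it unless the PF is in promisc mode.
 */
int legacy_forward(const struct fdb_entry *fdb, size_t n,
		   const uint8_t dst[MAC_LEN], bool pf_promisc)
{
	for (size_t i = 0; i < n; i++) {
		if (memcmp(fdb[i].mac, dst, MAC_LEN) == 0)
			return fdb[i].port;
	}

	return pf_promisc ? PORT_PF : PORT_DROP;
}
```

With a table holding the PF MAC and one VF MAC, a frame for the VF MAC
goes to the VF port, and a frame for an unknown MAC is dropped unless
pf_promisc is set, in which case it lands on the PF.
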
FWIW reviewing devlink and items I want to put there in the future I've
decided it makes sense to keep it in devlink (sorry took me a day of
emails to get here). If you can agree to the above and rename it
something like,

+enum devlink_eswitch_mode {
+	DEVLINK_ESWITCH_MODE_NONE,
+	DEVLINK_ESWITCH_MODE_LEGACY,
+	DEVLINK_ESWITCH_MODE_CREATE_VF_NETDEVS,
+};

I'll Ack it and implement it on the drivers I tend to work on.
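
As a sketch of how a driver might consume the renamed values (assuming
the rename suggested above is taken as-is), the only thing the mode
decides is whether VF netdevs get created. The helper names
eswitch_mode_name() and eswitch_mode_creates_vf_netdevs() are
hypothetical and exist only for this example; they are not part of
devlink.

```c
#include <stdbool.h>

/* Names as suggested in this mail; not an existing uapi definition. */
enum devlink_eswitch_mode {
	DEVLINK_ESWITCH_MODE_NONE,
	DEVLINK_ESWITCH_MODE_LEGACY,
	DEVLINK_ESWITCH_MODE_CREATE_VF_NETDEVS,
};

/* Hypothetical helper: printable name for a mode. */
const char *eswitch_mode_name(enum devlink_eswitch_mode mode)
{
	switch (mode) {
	case DEVLINK_ESWITCH_MODE_NONE:
		return "none";
	case DEVLINK_ESWITCH_MODE_LEGACY:
		return "legacy";
	case DEVLINK_ESWITCH_MODE_CREATE_VF_NETDEVS:
		return "create_vf_netdevs";
	}
	return "unknown";
}

/*
 * Hypothetical helper: the mode only controls whether VF netdevs are
 * created. Forwarding defaults are not keyed off the mode here.
 */
bool eswitch_mode_creates_vf_netdevs(enum devlink_eswitch_mode mode)
{
	return mode == DEVLINK_ESWITCH_MODE_CREATE_VF_NETDEVS;
}
```
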
.John