From: Jakub Kicinski <jakub.kicinski@netronome•com>
To: Eran Ben Elisha <eranbe@mellanox•com>
Cc: netdev@vger•kernel.org, Jiri Pirko <jiri@mellanox•com>,
Andy Gospodarek <andrew.gospodarek@broadcom•com>,
Michael Chan <michael.chan@broadcom•com>,
Simon Horman <simon.horman@netronome•com>,
Alexander Duyck <alexander.duyck@gmail•com>,
Andrew Lunn <andrew@lunn•ch>,
Florian Fainelli <f.fainelli@gmail•com>,
Tal Alon <talal@mellanox•com>, Ariel Almog <ariela@mellanox•com>
Subject: Re: [RFC PATCH iproute2-next] System specification health API
Date: Thu, 13 Sep 2018 10:36:04 -0700
Message-ID: <20180913103604.0ef868f4@cakuba.netronome.com>
In-Reply-To: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>
On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted for Real Time Alerting, in order to know when
> something bad has happened to a PCI device
By spec, do you mean some standards body spec you implement, or is this
proposal itself the spec?
> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed debugging
> information.
>
> The health contains sensors which sense for malfunction. Once a sensor is
> triggered, actions such as logs and correction can be taken.
> Sensors sense the health state and can trigger a correction action.
>
> The sensors are divided into the following groups:
> - Hardware sensor - a sensor which is triggered by the device due to
>   malfunction.
> - Software sensor - a sensor which is triggered by the software due to
>   malfunction.
> Both groups of sensors can be triggered due to an error event or due to a
> periodic check.
>
> Actions are the way to handle sensor events. Action can be in one of the
> following groups:
> - Dump - SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> Actions can be performed by SW or HW.
>
> User is allowed to enable or disable sensors and sensor2action mapping.
>
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better? Are there going to be values reported?

I'm not so sure about HW sensors in relation to existing HWMON
infrastructure... I assume you're targeting things like, say, some HW
engine/block reporting that it encountered an error? Sounds good, too.

Are the actions all envisioned to be performed by the driver?
Firmware? Hardware? I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of
the setting, since it was only implemented for params :S

Is the dump option going to tie back into region snapshots?
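
To make the model under discussion concrete, here is a minimal sketch of
the quoted proposal as I read it: sensors in two groups (hardware and
software), actions in two groups (dump and reset), and a per-sensor
enable flag plus a sensor2action mapping. It is written in Python purely
for illustration. Every name in it (Health, Sensor, map_action,
"tx_timeout", and so on) is hypothetical and does not come from the RFC,
from devlink, or from any driver.

```python
from dataclasses import dataclass, field
from enum import Enum


class SensorKind(Enum):
    HARDWARE = "hw"  # triggered by the device
    SOFTWARE = "sw"  # triggered by software


class Action(Enum):
    DUMP = "dump"    # SW/HW trace or dump
    RESET = "reset"  # surgical correction, e.g. flushing a queue


@dataclass
class Sensor:
    """Hypothetical sensor record; not a devlink structure."""
    name: str
    kind: SensorKind
    enabled: bool = True
    actions: set = field(default_factory=set)  # sensor2action mapping


class Health:
    """Hypothetical registry of sensors and their mapped actions."""

    def __init__(self):
        self.sensors = {}

    def add(self, sensor):
        self.sensors[sensor.name] = sensor

    def set_enabled(self, name, enabled):
        self.sensors[name].enabled = enabled

    def map_action(self, name, action, enabled=True):
        # Add or remove one entry of the sensor2action mapping.
        if enabled:
            self.sensors[name].actions.add(action)
        else:
            self.sensors[name].actions.discard(action)

    def trigger(self, name):
        # Return the actions to run, in a stable order by action name.
        # A disabled sensor triggers nothing.
        sensor = self.sensors[name]
        if not sensor.enabled:
            return []
        return sorted(sensor.actions, key=lambda a: a.value)
```

With a sensor named "tx_timeout" mapped to both actions, trigger() would
return the dump action followed by the reset action, and an empty list
once the sensor is disabled. Who actually performs each action (driver,
FW or HW) is deliberately left out, since that is one of the open
questions above.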