public inbox for netdev@vger.kernel.org 
From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Eran Ben Elisha <eranbe@mellanox.com>
Cc: netdev@vger.kernel.org, Jiri Pirko <jiri@mellanox.com>,
	Andy Gospodarek <andrew.gospodarek@broadcom.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Simon Horman <simon.horman@netronome.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Andrew Lunn <andrew@lunn.ch>,
	Florian Fainelli <f.fainelli@gmail.com>,
	Tal Alon <talal@mellanox.com>, Ariel Almog <ariela@mellanox.com>
Subject: Re: [RFC PATCH iproute2-next] System specification health API
Date: Thu, 13 Sep 2018 10:36:04 -0700
Message-ID: <20180913103604.0ef868f4@cakuba.netronome.com>
In-Reply-To: <1536826696-9413-1-git-send-email-eranbe@mellanox.com>

On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted at real-time alerting, in order to know when
> something bad has happened to a PCI device.

By spec, do you mean some standards body spec that you implement, or is
this proposal itself the spec?

> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed debugging
>   information.
> 
> Health contains sensors that detect malfunctions. Once a sensor is
> triggered, actions such as logging and correction can be taken.
> Sensors sense the health state and can trigger a correction action.
> 
> The sensors are divided into the following groups
> - Hardware sensor - a sensor which is triggered by the device due to
>   malfunction.
> - Software sensor - a sensor which is triggered by the software due to
>   malfunction.
> Both groups of sensors can be triggered by an error event or by a periodic check.
> 
> Actions are the way to handle sensor events. An action can be in one of the
> following groups:
> - Dump - SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc.)
> Actions can be performed by SW or HW.
> 
> The user is allowed to enable or disable sensors and the sensor2action mapping.
> 
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.
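
For reference, the model described in the quoted text (sensors in two
groups, actions in two groups, and a user-controlled sensor2action
mapping) can be sketched roughly as below. This is only an illustration
in Python, not devlink or kernel code: every class, method, and sensor
name here (Health, set_action, trigger, "tx_timeout") is hypothetical
and does not come from the patch.

```python
from dataclasses import dataclass, field
from enum import Enum


class SensorKind(Enum):
    HARDWARE = "hw"   # triggered by the device
    SOFTWARE = "sw"   # triggered by software


class Action(Enum):
    DUMP = "dump"     # SW/HW trace or dump
    RESET = "reset"   # surgical correction, e.g. flush a queue


@dataclass
class Sensor:
    name: str
    kind: SensorKind
    enabled: bool = True
    # The sensor2action mapping: actions to take when this sensor fires.
    actions: set = field(default_factory=set)


class Health:
    """Hypothetical per-device registry of sensors (assumption, not an API)."""

    def __init__(self):
        self.sensors = {}

    def add(self, sensor):
        self.sensors[sensor.name] = sensor

    def set_action(self, name, action, on):
        """Enable or disable one action for the named sensor."""
        actions = self.sensors[name].actions
        if on:
            actions.add(action)
        else:
            actions.discard(action)

    def trigger(self, name):
        """Return the actions to run, in a stable order; none if disabled."""
        sensor = self.sensors[name]
        if not sensor.enabled:
            return []
        return sorted(sensor.actions, key=lambda a: a.value)
```

In this sketch a disabled sensor yields no actions at all, while an
enabled sensor with an empty mapping fires but does nothing. Whether
the proposed API distinguishes those two cases is not stated in the RFC.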

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better?  Are there going to be values reported?

I'm not so sure about HW sensors in relation to existing HWMON
infrastructure...  I assume you're targeting things like say some HW
engine/block reporting it encountered an error?  Sounds good, too.

Are the actions all envisioned to be performed by the driver?
Firmware?  Hardware?  I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of
the setting, since persistence was only implemented for params :S

Is the dump option going to tie back into region snapshots?
