From: Yuri Tikhonov <yur@emcraft•com>
To: "Dan Williams" <dan.j.williams@intel•com>
Cc: linux-raid@vger•kernel.org, linuxppc-dev@ozlabs•org, wd@denx•de,
dzu@denx•de, yanok@emcraft•com
Subject: Re[2]: [PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication
Date: Fri, 16 Jan 2009 14:41:56 +0300 [thread overview]
Message-ID: <1328000796.20090116144156@emcraft.com> (raw)
In-Reply-To: <e9c3a7c20901141656g2fec1802vcecc4ae433838bb3@mail.gmail.com>
=0D=0A Hello Dan,
Thanks for review. Some comments below.
On Thursday, January 15, 2009 you wrote:
[..]
>> +/**
>> + * do_async_pq - asynchronously calculate P and/or Q
>> + */
>> +static struct dma_async_tx_descriptor *
>> +do_async_pq(struct dma_chan *chan, struct page **blocks, unsigned char =
*scfs,
>> + unsigned int offset, int src_cnt, size_t len, enum async_tx_flag=
s flags,
>> + struct dma_async_tx_descriptor *depend_tx,
>> + dma_async_tx_callback cb_fn, void *cb_param)
>> +{
>> + struct dma_device *dma =3D chan->device;
>> + dma_addr_t dma_dest[2], dma_src[src_cnt];
>> + struct dma_async_tx_descriptor *tx =3D NULL;
>> + dma_async_tx_callback _cb_fn;
>> + void *_cb_param;
>> + unsigned char *scf =3D NULL;
>> + int i, src_off =3D 0;
>> + unsigned short pq_src_cnt;
>> + enum async_tx_flags async_flags;
>> + enum dma_ctrl_flags dma_flags =3D 0;
>> +
>> + /* If we won't handle src_cnt in one shot, then the following
>> + * flag(s) will be set only on the first pass of prep_dma
>> + */
>> + if (flags & ASYNC_TX_PQ_ZERO_P)
>> + dma_flags |=3D DMA_PREP_ZERO_P;
>> + if (flags & ASYNC_TX_PQ_ZERO_Q)
>> + dma_flags |=3D DMA_PREP_ZERO_Q;
>> +
>> + /* DMAs use destinations as sources, so use BIDIRECTIONAL mappin=
g */
>> + if (blocks[src_cnt]) {
>> + dma_dest[0] =3D dma_map_page(dma->dev, blocks[src_cnt],
>> + offset, len, DMA_BIDIRECTIONA=
L);
>> + dma_flags |=3D DMA_PREP_HAVE_P;
>> + }
>> + if (blocks[src_cnt+1]) {
>> + dma_dest[1] =3D dma_map_page(dma->dev, blocks[src_cnt+1],
>> + offset, len, DMA_BIDIRECTIONA=
L);
>> + dma_flags |=3D DMA_PREP_HAVE_Q;
>> + }
>> +
>> + for (i =3D 0; i < src_cnt; i++)
>> + dma_src[i] =3D dma_map_page(dma->dev, blocks[i],
>> + offset, len, DMA_TO_DEVICE);
>> +
>> + while (src_cnt) {
>> + async_flags =3D flags;
>> + pq_src_cnt =3D min(src_cnt, (int)dma->max_pq);
>> + /* if we are submitting additional pqs, leave the chain =
open,
>> + * clear the callback parameters, and leave the destinat=
ion
>> + * buffers mapped
>> + */
>> + if (src_cnt > pq_src_cnt) {
>> + async_flags &=3D ~ASYNC_TX_ACK;
>> + dma_flags |=3D DMA_COMPL_SKIP_DEST_UNMAP;
>> + _cb_fn =3D NULL;
>> + _cb_param =3D NULL;
>> + } else {
>> + _cb_fn =3D cb_fn;
>> + _cb_param =3D cb_param;
>> + }
>> + if (_cb_fn)
>> + dma_flags |=3D DMA_PREP_INTERRUPT;
>> + if (scfs)
>> + scf =3D &scfs[src_off];
>> +
>> + /* Since we have clobbered the src_list we are committed
>> + * to doing this asynchronously. Drivers force forward
>> + * progress in case they can not provide a descriptor
>> + */
>> + tx =3D dma->device_prep_dma_pq(chan, dma_dest,
>> + &dma_src[src_off], pq_src_c=
nt,
>> + scf, len, dma_flags);
>> + if (unlikely(!tx))
>> + async_tx_quiesce(&depend_tx);
>> +
>> + /* spin wait for the preceeding transactions to complete=
*/
>> + while (unlikely(!tx)) {
>> + dma_async_issue_pending(chan);
>> + tx =3D dma->device_prep_dma_pq(chan, dma_dest,
>> + &dma_src[src_off], pq_src_cnt,
>> + scf, len, dma_flags);
>> + }
>> +
>> + async_tx_submit(chan, tx, async_flags, depend_tx,
>> + _cb_fn, _cb_param);
>> +
>> + depend_tx =3D tx;
>> + flags |=3D ASYNC_TX_DEP_ACK;
>> +
>> + if (src_cnt > pq_src_cnt) {
>> + /* drop completed sources */
>> + src_cnt -=3D pq_src_cnt;
>> + src_off +=3D pq_src_cnt;
>> +
>> + /* use the intermediate result as a source; we
>> + * clear DMA_PREP_ZERO, so prep_dma_pq will
>> + * include destination(s) into calculations. Thus
>> + * keep DMA_PREP_HAVE_x in dma_flags only
>> + */
>> + dma_flags &=3D (DMA_PREP_HAVE_P | DMA_PREP_HAVE_=
Q);
> I don't think this will work as we will be mixing Q into the new P and
> P into the new Q. In order to support (src_cnt > device->max_pq) we
> need to explicitly tell the driver that the operation is being
> continued (DMA_PREP_CONTINUE) and to apply different coeffeicients to
> P and Q to cancel the effect of including them as sources.
With DMA_PREP_ZERO_P/Q approach, the Q isn't mixed into new P, and P=20
isn't mixed into new Q. For your example of max_pq=3D4:
p, q =3D PQ(src0, src1, src2, src3, src4, COEF({01}, {02}, {04}, {08}, {10=
}))
with the current implementation will be split into:
p, q =3D PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08})
p`,q` =3D PQ(src4, COEF({10}))
which will result to the following:
p =3D ((dma_flags & DMA_PREP_ZERO_P) ? 0 : old_p) + src0 + src1 + src2 + s=
rc3
q =3D ((dma_flags & DMA_PREP_ZERO_Q) ? 0 : old_q) + {01}*src0 + {02}*src1 =
+ {04}*src2 + {08}*src3
=20
p` =3D p + src4
q` =3D q + {10}*src4
But, if we get rid of DMA_PREP_ZERO_P/Q, then the mess with P/Q will=20
have a place indeed.
> Here is an
> example of supporting a 5 source pq operation where max_pq =3D=3D 4 (the
> minimum).
> p, q =3D PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
> p', q' =3D PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))
> p' =3D p + q + q + src4 =3D p + src4 =3D P
> q' =3D {00}*p + {01}*q + {00}*q + {10}*src4 =3D q + {10)*src4 =3D Q
> ...at no point do we need to zero P or Q. Yes, this requires a lot of
> extra work for incremental sources,
I would say, that 'very very lot'. In general this means that for=20
the cases of N sources > max_pq we'll have to do:
C =3D 1 + ceil((N-max_pq)/(max_pq - 3)) number of calls to ADMA.
E.g., for max_pq =3D 4:
N =3D 5 =3D> C =3D 2,
N =3D 6 =3D> C =3D 3,
..
N =3D 15 =3D> C =3D 12,
N =3D 16 =3D> C =3D 13,
..
N =3D 128 =3D> C =3D 125.
If we stay with the current approach of using DMA_PREP_ZERO_P/Q, then
C =3D 1 + ceil((N-max_pq)/max_pq)) number of calls to ADMA.
And the same series will result to:
N =3D 5 =3D> C =3D 2,
N =3D 6 =3D> C =3D 2,
..
N =3D 15 =3D> C =3D 4,
N =3D 16 =3D> C =3D 4,
..
N =3D 128 =3D> C =3D 32.
I'm afraid that the difference (13/4, 125/32) is very significant, so=20
getting rid of DMA_PREP_ZERO_P/Q will eat most of the improvement=20
which could be achieved with the current approach.
> but at this point I do not see a cleaner alternatve for engines like iop=
13xx.
I can't find any description of iop13xx processors at Intel's=20
web-site, only 3xx:
http://www.intel.com/design/iio/index.htm?iid=3Dipp_embed+embed_io
So, it's hard for me to do any suggestions. I just wonder - doesn't=20
iop13xx allow users to program destination addresses into the sources=20
fields of descriptors?
>> + } else
>> + break;
>> + }
>> +
>> + return tx;
>> +}
>> +
>> +/**
>> + * do_sync_pq - synchronously calculate P and Q
>> + */
>> +static void
>> +do_sync_pq(struct page **blocks, unsigned char *scfs, unsigned int offs=
et,
>> + int src_cnt, size_t len, enum async_tx_flags flags,
>> + struct dma_async_tx_descriptor *depend_tx,
>> + dma_async_tx_callback cb_fn, void *cb_param)
>> +{
>> + int i, pos;
>> + uint8_t *p =3D NULL, *q =3D NULL, *src;
>> +
>> + /* set destination addresses */
>> + if (blocks[src_cnt])
>> + p =3D (uint8_t *)(page_address(blocks[src_cnt]) + offset=
);
>> + if (blocks[src_cnt+1])
>> + q =3D (uint8_t *)(page_address(blocks[src_cnt+1]) + offs=
et);
>> +
>> + if (flags & ASYNC_TX_PQ_ZERO_P) {
>> + BUG_ON(!p);
>> + memset(p, 0, len);
>> + }
>> +
>> + if (flags & ASYNC_TX_PQ_ZERO_Q) {
>> + BUG_ON(!q);
>> + memset(q, 0, len);
>> + }
>> +
>> + for (i =3D 0; i < src_cnt; i++) {
>> + src =3D (uint8_t *)(page_address(blocks[i]) + offset);
>> + for (pos =3D 0; pos < len; pos++) {
>> + if (p)
>> + p[pos] ^=3D src[pos];
>> + if (q)
>> + q[pos] ^=3D raid6_gfmul[scfs[i]][src[pos=
]];
>> + }
>> + }
>> + async_tx_sync_epilog(cb_fn, cb_param);
>> +}
> sync_pq like sync_gensyndrome should not care about the current
> contents of p and q, just regenerate from the current sources. This
> kills another site where ASYNC_TX_PQ_ZERO_{P,Q} is used.
Well, perhaps you are right. The ASYNC_TX_PQ_ZERO_{P,Q} is set for=20
the most common cases of using async_pq, i.e. the parity generating.=20
The wrap-around async_gen_syndrome() function always set these flags=20
before calling async_pq().
The cases where ASYNC_TX_PQ_ZERO_{P,Q} isn't set are:
(a) async_pq can't process the sources in one short because of src_cnt >=20
max_pq, so it should re-use the intermediate results (destination) as=20
the sources;
(b) async_r6_dd_recov() does XOR with async_pq() assuming re-using the=20
destination as the source.
So, I would say that ASYNC_TX_PQ_ZERO_{P,Q} should definitely go=20
away, if there were no significant overheads in (a) implemented=20
without these flags (see above).
>> +
>> +/**
>> + * async_pq - attempt to do XOR and Galois calculations in parallel usi=
ng
>> + * a dma engine.
>> + * @blocks: source block array from 0 to (src_cnt-1) with the p destina=
tion
>> + * at blocks[src_cnt] and q at blocks[src_cnt + 1]. Only one of two
>> + * destinations may be present (another then has to be set to NULL).
>> + * By default, the result of calculations is XOR-ed with the initial
>> + * content of the destinationa buffers. Use ASYNC_TX_PQ_ZERO_x flags
>> + * to avoid this.
>> + * NOTE: client code must assume the contents of this array are des=
troyed
>> + * @scfs: array of source coefficients used in GF-multiplication
>> + * @offset: offset in pages to start transaction
>> + * @src_cnt: number of source pages
>> + * @len: length in bytes
>> + * @flags: ASYNC_TX_PQ_ZERO_P, ASYNC_TX_PQ_ZERO_Q, ASYNC_TX_ASSUME_COHE=
RENT,
>> + * ASYNC_TX_ACK, ASYNC_TX_DEP_ACK, ASYNC_TX_ASYNC_ONLY
>> + * @depend_tx: depends on the result of this transaction.
>> + * @cb_fn: function to call when the operation completes
>> + * @cb_param: parameter to pass to the callback routine
>> + */
>> +struct dma_async_tx_descriptor *
>> +async_pq(struct page **blocks, unsigned char *scfs, unsigned int offset,
>> + int src_cnt, size_t len, enum async_tx_flags flags,
>> + struct dma_async_tx_descriptor *depend_tx,
>> + dma_async_tx_callback cb_fn, void *cb_param)
>> +{
>> + struct dma_chan *chan =3D async_tx_find_channel(depend_tx, DMA_P=
Q,
>> + &blocks[src_cnt], 2,
>> + blocks, src_cnt, len);
>> + struct dma_device *device =3D chan ? chan->device : NULL;
>> + struct dma_async_tx_descriptor *tx =3D NULL;
>> +
>> + if (!device && (flags & ASYNC_TX_ASYNC_ONLY))
>> + return NULL;
>> +
>> + if (device) {
>> + /* run pq asynchronously */
>> + tx =3D do_async_pq(chan, blocks, scfs, offset, src_cnt,
>> + len, flags, depend_tx, cb_fn,cb_param);
>> + } else {
>> + /* run pq synchronously */
>> + if (!blocks[src_cnt+1]) {
>> + struct page *pdst =3D blocks[src_cnt];
>> + int i;
>> +
>> + /* Calculate P-parity only.
>> + * As opposite to async_xor(), async_pq() assumes
>> + * that destinations are included into calculati=
ons,
>> + * so we should re-arrange the xor src list to
>> + * achieve the similar behavior.
>> + */
>> + if (!(flags & ASYNC_TX_PQ_ZERO_P)) {
>> + /* If async_pq() user doesn't set ZERO f=
lag,
>> + * it's assumed that destination has some
>> + * reasonable data to include in calcula=
tions.
>> + * The destination must be at position 0=
, so
>> + * shift the sources and put pdst at the
>> + * beginning of the list.
>> + */
>> + for (i =3D src_cnt - 1; i >=3D 0; i--)
>> + blocks[i+1] =3D blocks[i];
>> + blocks[0] =3D pdst;
>> + src_cnt++;
>> + flags |=3D ASYNC_TX_XOR_DROP_DST;
>> + } else {
>> + /* If async_pq() user want to clear P, t=
hen
>> + * this will be done automatically in as=
ync
>> + * case, and with the help of ZERO_DST in
>> + * the sync one.
>> + */
>> + flags &=3D ~ASYNC_TX_PQ_ZERO_P;
>> + flags |=3D ASYNC_TX_XOR_ZERO_DST;
>> + }
>> +
>> + return async_xor(pdst, blocks, offset,
>> + src_cnt, len, flags, depend_tx,
>> + cb_fn, cb_param);
> If we assume that async_pq always regenerates parity and never reuses
> the old value then we can get gid of the !(flags & ASYNC_TX_PQ_ZERO_P)
> path. In the case where code does need to reuse the old P,
> async_r6recov.c, it should call async_xor directly since that routine
> provides this semantic.
Right. The question is - will we get rid of ZERO_P/Q or not.
[..]
>> @@ -81,14 +81,28 @@ enum dma_transaction_type {
>> * dependency chains
>> * @DMA_COMPL_SKIP_SRC_UNMAP - set to disable dma-unmapping the source b=
uffer(s)
>> * @DMA_COMPL_SKIP_DEST_UNMAP - set to disable dma-unmapping the destina=
tion(s)
>> + * @DMA_PREP_HAVE_P - set if the destination list includes the correct
>> + * address of P (P-parity should be handled)
>> + * @DMA_PREP_HAVE_Q - set if the destination list includes the correct
>> + * address of Q (Q-parity should be handled)
>> + * @DMA_PREP_ZERO_P - set if P has to be zeroed before proceeding
>> + * @DMA_PREP_ZERO_Q - set if Q has to be zeroed before proceeding
>> */
>> enum dma_ctrl_flags {
>> DMA_PREP_INTERRUPT =3D (1 << 0),
>> DMA_CTRL_ACK =3D (1 << 1),
>> DMA_COMPL_SKIP_SRC_UNMAP =3D (1 << 2),
>> DMA_COMPL_SKIP_DEST_UNMAP =3D (1 << 3),
>> +
>> + DMA_PREP_HAVE_P =3D (1 << 4),
>> + DMA_PREP_HAVE_Q =3D (1 << 5),
>> + DMA_PREP_ZERO_P =3D (1 << 6),
>> + DMA_PREP_ZERO_Q =3D (1 << 7),
>> };
>>
>> +#define DMA_PCHECK_FAILED (1 << 0)
>> +#define DMA_QCHECK_FAILED (1 << 1)
> Perhaps turn these into an enum such that we can pass around a enum
> pq_check_flags pointer rather than a non-descript u32 *.
Agree.
Regards, Yuri
--
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
next prev parent reply other threads:[~2009-01-16 11:41 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-13 0:43 [PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication Yuri Tikhonov
2009-01-15 0:56 ` Dan Williams
2009-01-16 11:41 ` Yuri Tikhonov [this message]
2009-01-16 18:28 ` Re[2]: " Dan Williams
2009-01-17 12:19 ` Re[4]: " Yuri Tikhonov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1328000796.20090116144156@emcraft.com \
--to=yur@emcraft$(echo .)com \
--cc=dan.j.williams@intel$(echo .)com \
--cc=dzu@denx$(echo .)de \
--cc=linux-raid@vger$(echo .)kernel.org \
--cc=linuxppc-dev@ozlabs$(echo .)org \
--cc=wd@denx$(echo .)de \
--cc=yanok@emcraft$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox