From: Yuri Tikhonov <yur@emcraft•com>
To: "Dan Williams" <dan.j.williams@intel•com>
Cc: linux-raid@vger•kernel.org, linuxppc-dev@ozlabs•org, wd@denx•de,
dzu@denx•de, yanok@emcraft•com
Subject: Re[2]: [PATCH 02/11][v2] async_tx: add support for asynchronous GF multiplication
Date: Fri, 19 Dec 2008 10:43:17 +0300 [thread overview]
Message-ID: <714415337.20081219104317@emcraft.com> (raw)
In-Reply-To: <e9c3a7c20812171034l7403c788y9fd1e7ce217cb41b@mail.gmail.com>
Hello Dan,
On Wednesday, December 17, 2008 you wrote:
[..]
>> + /* DMAs use destinations as sources, so use BIDIRECTIONAL mappin=
g */
>> + dma_dest[0] =3D !blocks[src_cnt] ? 0 :
>> + dma_map_page(dma->dev, blocks[src_cnt],
>> + offset, len, DMA_BIDIRECTIO=
NAL);
> "0" could be a valid dma address on some architectures.
> DMA_ERROR_CODE looks like the closest fit for what we are trying to do
> here, but that only exists on sparc and powerpc. We could add a
> "dest_mask" parameter to device_prep_dma_pq where the mask is 1 =3D
> p-only, 2 =3D q-only, and 3 =3D p and q.
Understood. We can just introduce new DMA_xxx flags and pass them=20
among the other ones passed with device_prep_dma_pq() to ADMA driver=20
instead of introducing a new "dest_mask" parameter. Though, I guess,=20
you meant exactly the same.
>> + dma_dest[1] =3D !blocks[src_cnt+1] ? 0 :
>> + dma_map_page(dma->dev, blocks[src_cnt+1],
>> + offset, len, DMA_BIDIRECTIO=
NAL);
>> +
>> + for (i =3D 0; i < src_cnt; i++)
>> + dma_src[i] =3D dma_map_page(dma->dev, blocks[i],
>> + offset, len, DMA_TO_DEVICE);
>> +
>> + while (src_cnt) {
>> + async_flags =3D flags;
>> + pq_src_cnt =3D min(src_cnt, dma->max_pq);
>> + /* if we are submitting additional pqs, leave the chain =
open,
>> + * clear the callback parameters, and leave the destinat=
ion
>> + * buffers mapped
>> + */
>> + if (src_cnt > pq_src_cnt) {
>> + async_flags &=3D ~ASYNC_TX_ACK;
>> + dma_flags |=3D DMA_COMPL_SKIP_DEST_UNMAP;
>> + _cb_fn =3D NULL;
>> + _cb_param =3D NULL;
>> + } else {
>> + _cb_fn =3D cb_fn;
>> + _cb_param =3D cb_param;
>> + }
>> + if (_cb_fn)
>> + dma_flags |=3D DMA_PREP_INTERRUPT;
>> +
>> + /* Since we have clobbered the src_list we are committed
>> + * to doing this asynchronously. Drivers force forward
>> + * progress in case they can not provide a descriptor
>> + */
>> + tx =3D dma->device_prep_dma_pq(chan, dma_dest,
>> + &dma_src[src_off], pq_src_c=
nt,
>> + scf_list ? &scf_list[src_of=
f] :
>> + NULL,
>> + len, dma_flags);
> ...one nit for readability can we replace these ternary conditionals
> with proper if-else statements? i.e.
> if (scf_list)
> scf =3D &scf_list[src_off];
> else
> scf =3D NULL;
> tx =3D dma->device_prep_dma_pq(chan, dma_dest,
> &dma_src[src_off], pq_src_cn=
t,
> scf, len, dma_flags);
Thanks for pointing this. Sure. Furthermore, it's additionally even a=20
question of performance: e.g. in do_async_pq() we do this "? : " in a=20
cycle, whereas there is absolutely no reason to think it changes.
[..]
>> +/**
>> + * async_pq_zero_sum - attempt a PQ parities check with a dma engine.
>> + * @blocks: array of source pages. The 0..src_cnt-1 are the sources, the
>> + * src_cnt and src_cnt+1 are the P and Q destinations to check, res=
p.
>> + * Only one of two destinations may be present.
>> + * NOTE: client code must assume the contents of this array are des=
troyed
>> + * @scf: coefficients to use in GF-multiplications
>> + * @offset: offset in pages to start transaction
>> + * @src_cnt: number of source pages
>> + * @len: length in bytes
>> + * @presult: where to store the result of P-ckeck, which is 0 if P-pari=
ty
>> + * OK, and non-zero otherwise.
>> + * @qresult: where to store the result of P-ckeck, which is 0 if Q-pari=
ty
>> + * OK, and non-zero otherwise.
>> + * @flags: ASYNC_TX_ASSUME_COHERENT, ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
>> + * @depend_tx: depends on the result of this transaction.
>> + * @cb_fn: function to call when the xor completes
>> + * @cb_param: parameter to pass to the callback routine
>> + */
>> +struct dma_async_tx_descriptor *
>> +async_pq_zero_sum(struct page **blocks, unsigned char *scf,
>> + unsigned int offset, int src_cnt, size_t len,
>> + u32 *presult, u32 *qresult, enum async_tx_flags flags,
>> + struct dma_async_tx_descriptor *depend_tx,
>> + dma_async_tx_callback cb_fn, void *cb_param)
>> +{
>> + struct dma_chan *chan =3D async_tx_find_channel(depend_tx,
>> + DMA_PQ_ZERO_SUM,
>> + &blocks[src_cnt], =
2,
>> + blocks, src_cnt, l=
en);
>> + struct dma_device *device =3D chan ? chan->device : NULL;
>> + struct dma_async_tx_descriptor *tx =3D NULL;
>> +
>> + BUG_ON(src_cnt < 2);
>> +
>> + if (device && src_cnt <=3D device->max_pq) {
>> + dma_addr_t dma_src[src_cnt + 2];
>> + enum dma_ctrl_flags dma_flags =3D cb_fn ? DMA_PREP_INTER=
RUPT : 0;
>> + int i;
>> +
>> + for (i =3D 0; i < src_cnt + 2; i++)
>> + dma_src[i] =3D blocks[i] ? dma_map_page(device->=
dev,
>> + blocks[i], offset, len,
>> + DMA_TO_DEVICE) : 0;
> If we go with the "dest_mask" approach to specifying p and q then we
> need to separate them into their own parameter here... although in
> this case it would be a "src_mask" to select p or q.
We shouldn't do this if enhance 'enum dma_ctrl_flags' with, say,=20
DMA_PREP_P_PRESENT, DMA_PREP_Q_PRESENT. The adma driver which support=20
device_prep_dma_pqzero_sum() then should use/or not first dma_src=20
(which are destinations) depending on dma_flags set.
[..]
>> diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
>> index 0f50d4c..5d6b639 100644
>> --- a/include/linux/async_tx.h
>> +++ b/include/linux/async_tx.h
>> @@ -42,6 +42,12 @@ struct dma_chan_ref {
>> * @ASYNC_TX_XOR_ZERO_DST: this flag must be used for xor operations whe=
re the
>> * the destination address is not a source. The asynchronous case handl=
es this
>> * implicitly, the synchronous case needs to zero the destination block.
>> + * @ASYNC_TX_PQ_ZERO_P: this flag must be used for async_pq operations =
since the
>> + * destination there is always the source (the result of P after async_=
pq is
>> + * xor-ed with the previous content of P block if this flag isn't set).
>> + * @ASYNC_TX_PQ_ZERO_Q: this flag must be used for async_pq operations =
since the
>> + * destination there is always the source (the result of Q after async_=
pq is
>> + * xor-ed with the previous content of Q block if this flag isn't set).
>> * @ASYNC_TX_XOR_DROP_DST: this flag must be used if the destination add=
ress is
>> * also one of the source addresses. In the synchronous case the destin=
ation
>> * address is an implied source, whereas the asynchronous case it must b=
e listed
>> @@ -50,12 +56,17 @@ struct dma_chan_ref {
>> * @ASYNC_TX_ACK: immediately ack the descriptor, precludes setting up a
>> * dependency chain
>> * @ASYNC_TX_DEP_ACK: ack the dependency descriptor. Useful for chainin=
g.
>> + * @ASYNC_TX_ASYNC_ONLY: if set then try to perform operation requested=
only in
>> + * the asynchronous mode.
>> */
>> enum async_tx_flags {
>> ASYNC_TX_XOR_ZERO_DST =3D (1 << 0),
>> - ASYNC_TX_XOR_DROP_DST =3D (1 << 1),
>> - ASYNC_TX_ACK =3D (1 << 3),
>> - ASYNC_TX_DEP_ACK =3D (1 << 4),
>> + ASYNC_TX_PQ_ZERO_P =3D (1 << 1),
>> + ASYNC_TX_PQ_ZERO_Q =3D (1 << 2),
>> + ASYNC_TX_XOR_DROP_DST =3D (1 << 3),
>> + ASYNC_TX_ACK =3D (1 << 4),
>> + ASYNC_TX_DEP_ACK =3D (1 << 5),
>> + ASYNC_TX_ASYNC_ONLY =3D (1 << 6),
>> };
>>
>> #ifdef CONFIG_DMA_ENGINE
>> @@ -146,5 +157,33 @@ async_trigger_callback(enum async_tx_flags flags,
>> struct dma_async_tx_descriptor *depend_tx,
>> dma_async_tx_callback cb_fn, void *cb_fn_param);
>>
>> +struct dma_async_tx_descriptor *
>> +async_pqxor(struct page *pdest, struct page *qdest,
>> + struct page **src_list, unsigned char *scoef_list,
>> + unsigned int offset, int src_cnt, size_t len, enum async_tx_flag=
s flags,
>> + struct dma_async_tx_descriptor *depend_tx,
>> + dma_async_tx_callback callback, void *callback_param);
>> +
> ...forgot to update the declartion.
Argh.. Missed this when re-generated my final internal patch version.
> In this case async_pq() can be declared static since nothing outside
> of async_pq.c calls it.
It's not true. async_r6_dd_recov() and async_r6_dp_recov() functions=20
actively utilize async_pq(). See crypto/async_tx/async_r6recov.c.
[..]
>> void async_tx_quiesce(struct dma_async_tx_descriptor **tx);
>> #endif /* _ASYNC_TX_H_ */
>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>> index adb0b08..84525c3 100644
>> --- a/include/linux/dmaengine.h
>> +++ b/include/linux/dmaengine.h
>> @@ -81,7 +81,7 @@ enum dma_status {
>> enum dma_transaction_type {
>> DMA_MEMCPY,
>> DMA_XOR,
>> - DMA_PQ_XOR,
>> + DMA_PQ,
>> DMA_DUAL_XOR,
>> DMA_PQ_UPDATE,
>> DMA_ZERO_SUM,
>> @@ -123,6 +123,8 @@ enum dma_ctrl_flags {
>> DMA_CTRL_ACK =3D (1 << 1),
>> DMA_COMPL_SKIP_SRC_UNMAP =3D (1 << 2),
>> DMA_COMPL_SKIP_DEST_UNMAP =3D (1 << 3),
>> + DMA_PREP_ZERO_P =3D (1 << 4),
>> + DMA_PREP_ZERO_Q =3D (1 << 5),
>> };
> I would rather not add operation-type-specific flags to
> dma_ctrl_flags.
But we need somehow:
1) point the ADMA driver should it clear the destination or not;
2) if (1), then what destination(s) to clear.
Above I even propose to add two more flags here :) Are there any=20
reasons why we should spare dma_ctrl_flags, and, instead of adding a=20
couple of new flag bits which are even do not lead to the sizeof(enum)=20
growth, increase the stack usage and, in general, the time of=20
functions calls by adding new parameters to ADMA methods ?
> In this case can we set up a dependency chain with
> async_memset()?
Well, we can. But wouldn't this be an overhead? For example,=20
ppc440spe DMA allows to do so-called RXOR which overwrites, and=20
doesn't take care of destinations. So, we can do ZERO_DST(s)+PQ in one=20
short on one DMA engine. Again, I'm not sure that keeping=20
dma_ctrl_flags unchanged is worthy of creating such a dependency;=20
it'll obviously lead both to degradation of performance & increasing=20
of CPU utilization.
>>
>> /**
>> @@ -299,6 +301,7 @@ struct dma_async_tx_descriptor {
>> * @global_node: list_head for global dma_device_list
>> * @cap_mask: one or more dma_capability flags
>> * @max_xor: maximum number of xor sources, 0 if no capability
>> + * @max_pq: maximum number of PQ sources, 0 if no capability
>> * @refcount: reference count
>> * @done: IO completion struct
>> * @dev_id: unique device ID
>> @@ -308,7 +311,9 @@ struct dma_async_tx_descriptor {
>> * @device_free_chan_resources: release DMA channel's resources
>> * @device_prep_dma_memcpy: prepares a memcpy operation
>> * @device_prep_dma_xor: prepares a xor operation
>> + * @device_prep_dma_pq: prepares a pq operation
>> * @device_prep_dma_zero_sum: prepares a zero_sum operation
>> + * @device_prep_dma_pqzero_sum: prepares a pqzero_sum operation
>> * @device_prep_dma_memset: prepares a memset operation
>> * @device_prep_dma_interrupt: prepares an end of chain interrupt operat=
ion
>> * @device_prep_slave_sg: prepares a slave dma operation
>> @@ -322,6 +327,7 @@ struct dma_device {
>> struct list_head global_node;
>> dma_cap_mask_t cap_mask;
>> int max_xor;
>> + int max_pq;
>>
> max_xor and max_pq can be changed to unsigned shorts to keep the size
> of the struct the same.
Right.
>> struct kref refcount;
>> struct completion done;
>> @@ -339,9 +345,17 @@ struct dma_device {
>> struct dma_async_tx_descriptor *(*device_prep_dma_xor)(
>> struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src,
>> unsigned int src_cnt, size_t len, unsigned long flags);
>> + struct dma_async_tx_descriptor *(*device_prep_dma_pq)(
>> + struct dma_chan *chan, dma_addr_t *dst, dma_addr_t *src,
>> + unsigned int src_cnt, unsigned char *scf,
>> + size_t len, unsigned long flags);
>> struct dma_async_tx_descriptor *(*device_prep_dma_zero_sum)(
>> struct dma_chan *chan, dma_addr_t *src, unsigned int src_=
cnt,
>> size_t len, u32 *result, unsigned long flags);
>> + struct dma_async_tx_descriptor *(*device_prep_dma_pqzero_sum)(
>> + struct dma_chan *chan, dma_addr_t *src, unsigned int src=
_cnt,
>> + unsigned char *scf,
>> + size_t len, u32 *presult, u32 *qresult, unsigned long fl=
ags);
> I would rather we turn the 'result' parameter into a pointer to flags
> where bit 0 is the xor/p result and bit1 is the q result.
Yes, this'll be better.
Thanks for reviewing. I'll re-generate ASYNC_TX patch (in the parts=20
where I absolutely agreed with you), and then re-post. Any comments=20
regarding RAID-6 part?
Regards, Yuri
--
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
prev parent reply other threads:[~2008-12-19 7:43 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-12-08 21:55 [PATCH 02/11][v2] async_tx: add support for asynchronous GF multiplication Yuri Tikhonov
2008-12-17 18:34 ` Dan Williams
2008-12-19 7:43 ` Yuri Tikhonov [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=714415337.20081219104317@emcraft.com \
--to=yur@emcraft$(echo .)com \
--cc=dan.j.williams@intel$(echo .)com \
--cc=dzu@denx$(echo .)de \
--cc=linux-raid@vger$(echo .)kernel.org \
--cc=linuxppc-dev@ozlabs$(echo .)org \
--cc=wd@denx$(echo .)de \
--cc=yanok@emcraft$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox