public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox•com>
To: Kirill Smelkov <kirr@mns•spb.ru>
Cc: git@vger•kernel.org
Subject: Re: [PATCH 1/2] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well
Date: Fri, 14 Feb 2014 09:37:00 -0800	[thread overview]
Message-ID: <xmqqppmpiojn.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <20140214121529.GB3416@tugrik.mns.mnsspb.ru> (Kirill Smelkov's message of "Fri, 14 Feb 2014 16:15:29 +0400")

Kirill Smelkov <kirr@mns•spb.ru> writes:

> ---- 8< ----
> From: Kirill Smelkov <kirr@mns•spb.ru>
> Subject: [PATCH v2 1/2] tree-diff: rework diff_tree() to generate diffs for
>  multiparent cases as well
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

The last three do not belong here.  Only From, Date and Subject are
relevant and taken as in-body headers.

For that matter, as the From and Subject above are exactly the same
as what you have on the e-mail header, you do not want any of them.

I'll strip them out here, so no need to resend.  Thanks.

> Previously diff_tree(), which is now named __diff_tree_sha1(), was

That name with two leading underscores is a rather unfortunate,
especially for a function that is not a file scope static.  No way
to rename this to something more sensible?

> That impedance mismatch *hurts* *performance* *badly* for generating
> combined diffs - in c839f1bd (combine-diff: optimize combine_diff_path

Please avoid referring to a commit that is not in 'master' by its
object name.  It can be reworked later and get a different name.

> That slowness comes from the fact that currently, while generating
> combined diff, a lot of time is spent computing diff(commit,commit^2)
> just to only then intersect that huge diff to almost small set of files
> from diff(commit,commit^1).

Good observation.

>    |a|   |b|    a < b   ->  a ∉ B   ->   D(A,B) +=  +a    a↓
>    |-|   |-|    a > b   ->  b ∉ A   ->   D(A,B) +=  -b    b↓
>    | |   | |    a = b   ->  investigate δ(a,b)            a↓ b↓

In both the "n-parallel" and "diff-tree", when an entry 'a' is a
tree, I take this "D(A,B) += +a" to mean (recursively) adding all
the paths within 'a' to the result as addition.  Sounds sensible.

> D(A,B)
>
> is by definition the same as combined diff
>
> D(A,[B]),
>
> so if we could rework the code for common case and make it be not slower
> for nparent=1 case, usual diff(t1,t2) generation will not be slower, and
> multiparent diff tree-walker would greatly benefit generating
> combine-diff.

OK.

> What we do is as follows:
>
> 1) diff tree-walker __diff_tree_sha1() is internally reworked to be
>    a paths generator (new name diff_tree_paths()), with each generated path
>    being `struct combine_diff_path` with info for path, new sha1,mode and for
>    every parent which sha1,mode it was in it.
>
> 2) From that info, we can still generate usual diff queue with
>    struct diff_filepairs, via "exporting" generated
>    combine_diff_path, if we know we run for nparent=1 case.
>    (see emit_diff() which is now named emit_diff_p0only())

s/p0/first_parent_/; perhaps?

> 3) In order for diff_can_quit_early(), which checks
>
>        DIFF_OPT_TST(opt, HAS_CHANGES))
>
>    to work, that exporting have to be happening not in bulk, but
>    incrementally, one diff path at a time.

Good thinking.

> Some notes(*):
>
> 1) For loops,
>
>  i = 0; do { ... } while (++i < nparent);
>
> is used instead of
>
>  for (i = 0; i < nparent; ++i)
>      ...
>
> because for the former case, the compiler have to emit additional
> prologue code which checks for i >= nparent case before entering the
> loop.
>
> As we require nparent must be >0, that additional overhead
> conflicts with the "runs not slower for nparent=1 case than before"
> goal.

Unfortunate.  I'd rather see us stick to more readable and familiar
form for maintainability if this were not measurable.

> 2) alloca(), for small arrays, is used for the same reason - if we change
> it to xmalloc()/free() the timings get worse

Do you see any use of it outside compat/?

I thought we specifically avoid alloca() for portability.  Also we
do not use variable-length-arrays on the stack either, I think.

> 3) For every parent tree, we need to keep a tag, whether entry from that
> parent equals to entry from minimal parent. For performance reasons I'm
> keeping that tag in entry's mode field in unused bit - see S_IFXMIN_NEQ.

Unfortunate, but I do not see another place to keep this
information offhand (nor implement this approach without keeping
that piece of information).

> P.S. and combined diff is not some exotic/for-play-only stuff - for

No need to convince us about that ;-)

> example for a program I write to represent Git archives as readonly
> filesystem, there is initial scan with
>
>     `git log --reverse --raw --no-abbrev --no-renames -c`
>
> to extract log of what was created/changed when, as a result building a
> map
>
>     {}  sha1    ->  in which commit (and date) a content was added
>
> that `-c` means also show combined diff for merges, and without them, if
> a merge is non-trivial (merges changes from two parents with both having
> separate changes to a file), or an evil one, the map will not be full,
> i.e. some valid sha1 would be absent from it.
>
> That case was my initial motivation for combined diffs speedup.

I wonder if this machinery can be reused for "log -m" as well (or
perhaps you do that already?).  After all, by performing a single
parallel scan, you are gathering all the necessary information to
let you pretend that you did N pairwise diff-tree.

> diff --git a/tree-diff.c b/tree-diff.c
> index ab61a0a..2b7c991 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -7,6 +7,25 @@
>  #include "tree.h"
>  
>  /*
> + * internal mode marker, saying a tree entry != entry of tp[imin]
> + * (see __diff_tree_paths for what it means there)
> + *
> + * it *must* not overlap with any valid modes, and we will update/use/emit
> + * entry for diff only with it unset. Only non-overlapping to valid modes is
> + * required, because mode in tree_desc, comes here canonicalized via
> + * canon_mode().
> + *
> + * the definition assumes unsigned is at least 32 bits.
> + */
> +#define S_IFXMIN_NEQ	0x80000000

To allow better coordination across multiple codepaths that deal
with modes, I am wondering if this should be defined in cache.h
where made-up S_FIGITLINK and S_IFINVALID are defined (note the
comment that is there, as well).

> +static struct combine_diff_path *__diff_tree_paths(
> +	struct combine_diff_path *p, const unsigned char *sha1,
> +	const unsigned char **parents_sha1, int nparent,
> +	struct strbuf *base, struct diff_options *opt);

Most of our code do not name private helper functions with leading
underscores.

I do like the direction this is going, but it looks to me that
"struct combine_diff" is now misnamed, because it no longer is about
combined diff.  You are introducing a good framework for n-way diff,
and producing combined diff (i.e. -c or --cc) is now merely one way
to use that framework.  We may want to clean these names up after
this series settles---perhaps "struct nway_diff" or something.

> +
> +/*
>   * Compare two tree entries, taking into account only path/S_ISDIR(mode),
>   * but not their sha1's.
>   *
> @@ -33,72 +52,152 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
>  }
>  
>  
> -/* convert path, t1/t2 -> opt->diff_*() callbacks */
> -static void emit_diff(struct diff_options *opt, struct strbuf *path,
> -		      struct tree_desc *t1, struct tree_desc *t2)
> +/*
> + * convert path -> opt->diff_*() callbacks
> + *
> + * emits diff to parent0 only.

Please call that "first parent".

> + */

"Returns 0 to tell the caller that we are done with p and it can be
freed" or something?

> +static int emit_diff_p0only(struct diff_options *opt, struct combine_diff_path *p)
>  {
> ...
> +
> +	return 0;	/* = no need to keep allocated combine_diff_path */

Curious; what is that equal sign in the comment?

  reply	other threads:[~2014-02-14 17:37 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-13 14:02 [PATCH 0/2] Multiparent diff tree-walker + combine-diff speedup Kirill Smelkov
2014-02-13 14:02 ` [PATCH 1/2] tree-diff: rework diff_tree() to generate diffs for multiparent cases as well Kirill Smelkov
2014-02-13 19:51   ` Junio C Hamano
2014-02-14 12:15     ` Kirill Smelkov
2014-02-14 17:37       ` Junio C Hamano [this message]
2014-02-16  8:08         ` Kirill Smelkov
2014-02-18 21:29           ` Junio C Hamano
2014-02-13 14:02 ` [PATCH 2/2] combine-diff: speed it up, by using multiparent diff tree-walker directly Kirill Smelkov
2014-02-13 19:55   ` Junio C Hamano
2014-02-14 12:16     ` Kirill Smelkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqppmpiojn.fsf@gitster.dls.corp.google.com \
    --to=gitster@pobox$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=kirr@mns$(echo .)spb.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox