public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Toon Claes <toon@iotcl•com>
To: Taylor Blau <me@ttaylorr•com>
Cc: git@vger•kernel.org,
	Kristoffer Haugsbakk <kristofferhaugsbakk@fastmail•com>,
	Derrick Stolee <stolee@gmail•com>,
	Junio C Hamano <gitster@pobox•com>
Subject: Re: [PATCH v5 3/6] last-modified: use Bloom filters when available
Date: Tue, 22 Jul 2025 18:02:37 +0200	[thread overview]
Message-ID: <871pq8kyci.fsf@iotcl.com> (raw)
In-Reply-To: <aHmSW1pOUF0U3KFk@nand.local>

Taylor Blau <me@ttaylorr•com> writes:

> On Wed, Jul 16, 2025 at 03:35:15PM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>>
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>>
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>>
>> Comparing the perf test results on git.git:
>>
>>     Test                                        HEAD~             HEAD
>>     ------------------------------------------------------------------------------------
>>     8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
>>     8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
>>     8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
>
> As an aside on 8020.3 (that I probably should have mentioned in the last
> commit), I think that our "| head -n1" heuristic for picking a sub-tree
> is skewing these results down. In git.git, the lexicographically
> earliest sub-tree is ".github", which is awfully tiny. I wonder if we
> should be grabbing the *last* sub-tree, or maybe the largest one by
> count of entries?

Ah yes, that's automatable. Good suggestion.

>
>> @@ -40,6 +43,12 @@ struct last_modified {
>>
>>  static void last_modified_release(struct last_modified *lm)
>>  {
>> +	struct hashmap_iter iter;
>> +	struct last_modified_entry *ent;
>> +
>> +	hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
>> +		clear_bloom_key(&ent->key);
>> +
>
> I did a double-take to make sure that ent->key would always be
> initialized here, but it is thanks to the FLEX_ALLOC_STR() call below.
>
>>  	hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>>  	release_revisions(&lm->rev);
>>  }
>> @@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
>>
>>  		FLEX_ALLOC_STR(ent, path, path);
>>  		oidcpy(&ent->oid, &p->two->oid);
>> +		if (lm->rev.bloom_filter_settings)
>> +			fill_bloom_key(path, strlen(path), &ent->key,
>> +				       lm->rev.bloom_filter_settings);
>>  		hashmap_entry_init(&ent->hashent, strhash(ent->path));
>>  		hashmap_add(&lm->paths, &ent->hashent);
>>  	}
>> @@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
>>  		data->callback(path, data->commit, data->callback_data);
>>
>>  	hashmap_remove(data->paths, &ent->hashent, path);
>> +	clear_bloom_key(&ent->key);
>
> OK, we're calling clear_bloom_key() here, too, but it uses
> FREE_AND_NULL(), so calling it again in last_modified_release() may be a
> noop, but won't ever be a double-free.
>
>> @@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
>>  		return argc;
>>  	}
>>
>> +	/*
>> +	 * We're not interested in generation numbers here,
>> +	 * but calling this function to prepare the commit-graph.
>> +	 */
>> +	(void)generation_numbers_enabled(lm->rev.repo);
>> +	lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>
> Hmmph. I think when I originally wrote this I was using the side-effect
> of calling generation_numbers_enabled() as a hack. But I think that it
> may be worth making "prepare_commit_graph()" a non-static function and
> calling that instead.

Well thanks for confirming this assumption. I wasn't 100%, and I agree
having a non-static "prepare_commit_graph()" would be better.

Somewhat related to that. In your branch you have this line in
"maybe_changed_path()": 

	if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
		return 1;

I excluded this line because if we ignore the result of
"generation_numbers_enabled()", why would it matter to have a generation
number?

The thing is "maybe_changed_path()" in blame.c also has this line too,
but unfortunately git-blaming that line didn't learn me anything why
it's there. Do you have any idea?

-- 
Cheers,
Toon

  reply	other threads:[~2025-07-22 16:02 UTC|newest]

Thread overview: 135+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
2025-04-24 16:19   ` Junio C Hamano
2025-05-07 13:13     ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 2/5] t/perf: add blame-tree perf script Toon Claes
2025-04-22 17:46 ` [PATCH RFC 3/5] blame-tree: use Bloom filters when available Toon Claes
2025-04-22 17:46 ` [PATCH RFC 4/5] blame-tree: implement faster algorithm Toon Claes
2025-04-22 17:46 ` [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk Toon Claes
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
2025-05-07 14:22   ` Toon Claes
2025-05-07 20:23     ` Marc Branchaud
2025-05-07 20:45       ` Junio C Hamano
2025-05-08 13:26         ` Marc Branchaud
2025-05-08 14:26           ` Junio C Hamano
2025-05-08 15:12             ` Marc Branchaud
2025-05-14 14:42               ` Toon Claes
2025-05-14 19:29                 ` Junio C Hamano
2025-05-14 21:15                   ` Marc Branchaud
2025-05-15 13:29                     ` Patrick Steinhardt
2025-05-15 16:39                       ` Junio C Hamano
2025-05-15 17:39                         ` Marc Branchaud
2025-05-15 19:30                           ` Jeff King
2025-05-16  4:38                             ` Patrick Steinhardt
2025-05-20  8:49                               ` Toon Claes
2025-05-15 17:30                       ` Marc Branchaud
2025-05-16  4:30                         ` Patrick Steinhardt
2025-05-14 21:15                 ` Marc Branchaud
2025-05-07 20:49       ` Kristoffer Haugsbakk
2025-05-08 13:20         ` D. Ben Knoble
2025-05-08 13:26         ` Marc Branchaud
2025-05-08 13:18       ` D. Ben Knoble
2025-05-23  9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-05-23  9:33   ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
2025-05-25 20:07     ` Justin Tobler
2025-06-05  8:32       ` Toon Claes
2025-05-27 10:39     ` Patrick Steinhardt
2025-06-13  9:34       ` Toon Claes
2025-06-13  9:52         ` Kristoffer Haugsbakk
2025-05-23  9:33   ` [PATCH RFC v2 2/5] t/perf: add last-modified perf script Toon Claes
2025-05-23  9:33   ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
2025-05-27 10:40     ` Patrick Steinhardt
2025-06-13 11:05       ` Toon Claes
2025-05-23  9:33   ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
2025-05-27 10:39     ` Patrick Steinhardt
2025-05-23  9:33   ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
2025-05-27 10:39     ` Patrick Steinhardt
2025-07-01 20:35   ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
2025-07-01 21:06     ` Junio C Hamano
2025-07-01 21:30       ` Kristoffer Haugsbakk
2025-07-02 13:00         ` Toon Claes
2025-07-09 15:53           ` Toon Claes
2025-07-09 17:00             ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
2025-06-30 18:49   ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-01 20:20     ` Kristoffer Haugsbakk
2025-07-02 11:51     ` Junio C Hamano
2025-06-30 18:49   ` [PATCH RFC v3 2/3] t/perf: add last-modified perf script Toon Claes
2025-06-30 18:49   ` [PATCH RFC v3 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-01 23:01   ` [PATCH RFC v3 0/3] Introduce git-last-modified(1) command Junio C Hamano
2025-07-09 15:26   ` [PATCH v4 " Toon Claes
2025-07-09 21:57     ` Junio C Hamano
2025-07-10 18:37       ` Junio C Hamano
2025-07-16 13:32     ` [PATCH v5 0/6] " Toon Claes
2025-07-16 13:35       ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-18  0:02         ` Taylor Blau
2025-07-19  6:44           ` Jeff King
2025-07-22 15:50           ` Toon Claes
2025-08-01  9:09           ` Christian Couder
2025-08-01 16:59             ` Junio C Hamano
2025-07-16 13:35       ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
2025-07-18  0:08         ` Taylor Blau
2025-07-22 15:52           ` Toon Claes
2025-07-16 13:35       ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
2025-07-18  0:16         ` Taylor Blau
2025-07-22 16:02           ` Toon Claes [this message]
2025-07-16 13:35       ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
2025-07-16 15:50         ` Junio C Hamano
2025-07-17 16:31           ` Toon Claes
2025-07-16 13:35       ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
2025-07-16 16:09         ` Junio C Hamano
2025-07-17 16:31           ` Toon Claes
2025-07-17 22:37         ` Junio C Hamano
2025-07-18 17:36           ` Junio C Hamano
2025-07-22 16:06             ` Toon Claes
2025-07-16 13:42       ` [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available Toon Claes
2025-07-17 23:39       ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
2025-07-22 15:35         ` Toon Claes
2025-07-30 17:59           ` Toon Claes
2025-07-31  7:45             ` Patrick Steinhardt
2025-07-30 17:55       ` [PATCH v6 0/4] " Toon Claes
2025-07-31 18:40         ` Junio C Hamano
2025-07-31 23:57           ` Junio C Hamano
2025-08-05  9:33         ` [PATCH v7 0/3] " Toon Claes
2025-08-05 14:34           ` Patrick Steinhardt
2025-08-05 16:21             ` Junio C Hamano
2025-08-05 16:34           ` Junio C Hamano
2025-08-05 16:55             ` Toon Claes
2025-08-05 17:20               ` Jean-Noël AVILA
2025-08-05 21:46                 ` Junio C Hamano
2025-08-06 12:01                   ` Toon Claes
2025-08-06 15:38                     ` Junio C Hamano
2025-08-28 22:44                       ` Junio C Hamano
2025-08-05 18:28               ` Junio C Hamano
2025-08-05  9:33         ` [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-08-05  9:33         ` [PATCH v7 2/3] t/perf: add last-modified perf script Toon Claes
2025-08-05  9:33         ` [PATCH v7 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-30 17:55       ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-31  6:42         ` Patrick Steinhardt
2025-08-01 16:22           ` Toon Claes
2025-08-01 17:09             ` Junio C Hamano
2025-08-04  6:34               ` Patrick Steinhardt
2025-08-04 17:14                 ` Junio C Hamano
2025-08-05  5:35                   ` Toon Claes
2025-08-01 20:34             ` Jean-Noël AVILA
2025-08-05  5:36               ` Toon Claes
2025-08-04  6:33             ` Patrick Steinhardt
2025-08-01 10:18         ` Christian Couder
2025-08-01 10:22           ` Patrick Steinhardt
2025-08-01 17:06             ` Junio C Hamano
2025-08-02  8:18               ` Christian Couder
2025-08-02 11:31                 ` Christian Couder
2025-08-02 13:38                   ` Christian Couder
2025-08-02 16:26                     ` Junio C Hamano
2025-08-04  6:35               ` Patrick Steinhardt
2025-07-30 17:55       ` [PATCH v6 2/4] t/perf: add last-modified perf script Toon Claes
2025-07-30 17:55       ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
2025-07-31  6:42         ` Patrick Steinhardt
2025-07-30 17:55       ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
2025-07-31  6:43         ` Patrick Steinhardt
2025-08-01 16:23           ` Toon Claes
2025-08-04  6:33             ` Patrick Steinhardt
2025-07-09 15:26   ` [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-09 15:26   ` [PATCH v4 2/3] t/perf: add last-modified perf script Toon Claes
2025-07-09 15:26   ` [PATCH v4 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-16 13:35   ` [PATCH v5 6/6] fixup! " Toon Claes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871pq8kyci.fsf@iotcl.com \
    --to=toon@iotcl$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitster@pobox$(echo .)com \
    --cc=kristofferhaugsbakk@fastmail$(echo .)com \
    --cc=me@ttaylorr$(echo .)com \
    --cc=stolee@gmail$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox