From: Toon Claes <toon@iotcl•com>
To: Patrick Steinhardt <ps@pks•im>
Cc: git@vger•kernel.org, "Jeff King" <peff@peff•net>,
"Taylor Blau" <me@ttaylorr•com>,
"Derrick Stolee" <stolee@gmail•com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail•com>
Subject: Re: [PATCH RFC v2 3/5] last-modified: use Bloom filters when available
Date: Fri, 13 Jun 2025 13:05:43 +0200 [thread overview]
Message-ID: <87h60j296g.fsf@iotcl.com> (raw)
In-Reply-To: <aDWWgj8gTS9EM7v6@pks.im>
Patrick Steinhardt <ps@pks•im> writes:
> On Fri, May 23, 2025 at 11:33:50AM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>>
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>>
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>>
>> This results in a substantial performance speed-up in common cases of
>> 'git last-modified'. In the kernel, here is the before and after (all
>> times computed with best-of-five):
>>
>> With commit-graphs (but no Bloom filters):
>>
>> real 0m5.133s
>> user 0m4.942s
>> sys 0m0.180s
>>
>> ...and with Bloom filters:
>>
>> real 0m0.936s
>> user 0m0.842s
>> sys 0m0.092s
>>
>> These times are with my development-version of Git, so it's compiled
>> without optimizations. Compiling instead with `-O3`, the results look
>> even better:
>>
>> real 0m0.754s
>> user 0m0.661s
>> sys 0m0.092s
>
> I'm sure that the old state without bloom filters will also improve a
> bit?
These are the benchmarks from the original commits I took over. They are
no longer really relevant, I'll remove them.
>> Signed-off-by: Toon Claes <toon@iotcl•com>
>> ---
>> last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 44 insertions(+)
>>
>> diff --git a/last-modified.c b/last-modified.c
>> index 9283f8fcae..f628434929 100644
>> --- a/last-modified.c
>> +++ b/last-modified.c
>> @@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm,
>> if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
>> die(_("unknown last-modified argument: %s"), argv[1]);
>>
>> + (void)generation_numbers_enabled(lm->rev.repo);
>
> Why the `(void)` cast? And why even call this in the first place? This
> definitely needs a comment and smells like funky design in our commit
> graph subsystem where we rely on side effects of one function to leak
> into a different function.
This function calls `prepare_commit_graph()` which I think is the
important side-effect. Let me add a comment. Or would you rather to see
a separate function?
>> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>> +
>> if (add_from_revs(lm) < 0)
>> die(_("unable to setup last-modified"));
>> }
>>
>> void last_modified_release(struct last_modified *lm)
>> {
>> + struct hashmap_iter iter;
>> + struct last_modified_entry *ent;
>> +
>> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> + clear_bloom_key(&ent->key);
>> + }
>
> The curly braces shouldn't be needed.
Okay.
>> @@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
>> }
>> }
>>
>> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
>> +{
>> + struct bloom_filter *filter;
>> + struct last_modified_entry *ent;
>> + struct hashmap_iter iter;
>> +
>> + if (!lm->rev.bloom_filter_settings)
>> + return 1;
>> +
>> + if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
>> + return 1;
>
> Hm, okay, so here we require generation numbers to exist. Why is that
> though? Shouldn't we only care about bloom filters? I don't quite get
> that part yet.
That's a good question. Because we're above ignoring the return value of
`generation_numbers_enabled()` we shouldn't rely on generation numbers.
I verified things and did some testing and it seems to me we can safely
remove this condition.
>> + filter = get_bloom_filter(lm->rev.repo, origin);
>> + if (!filter)
>> + return 1;
>> +
>> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> + if (bloom_filter_contains(filter, &ent->key,
>> + lm->rev.bloom_filter_settings))
>> + return 1;
>> + }
>> + return 0;
>> +}
>> +
>
> Okay, and here we check whether any of our desired paths may be
> contained in the bloom filter.
>
>> int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
>> {
>> struct last_modified_callback_data data;
>> @@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
>> if (!data.commit)
>> break;
>>
>> + if (!maybe_changed_path(lm, data.commit))
>> + continue;
>
> If there either are no bloom filters or in case none of them contain our
> commit we can safely skip over the commit indeed. Otherwise we'll have
> to check whether the commit really is interesting.
>
> Makes sense.
>
> Patrick
>
--
Cheers,
Toon
next prev parent reply other threads:[~2025-06-13 11:06 UTC|newest]
Thread overview: 135+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
2025-04-24 16:19 ` Junio C Hamano
2025-05-07 13:13 ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 2/5] t/perf: add blame-tree perf script Toon Claes
2025-04-22 17:46 ` [PATCH RFC 3/5] blame-tree: use Bloom filters when available Toon Claes
2025-04-22 17:46 ` [PATCH RFC 4/5] blame-tree: implement faster algorithm Toon Claes
2025-04-22 17:46 ` [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk Toon Claes
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
2025-05-07 14:22 ` Toon Claes
2025-05-07 20:23 ` Marc Branchaud
2025-05-07 20:45 ` Junio C Hamano
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 14:26 ` Junio C Hamano
2025-05-08 15:12 ` Marc Branchaud
2025-05-14 14:42 ` Toon Claes
2025-05-14 19:29 ` Junio C Hamano
2025-05-14 21:15 ` Marc Branchaud
2025-05-15 13:29 ` Patrick Steinhardt
2025-05-15 16:39 ` Junio C Hamano
2025-05-15 17:39 ` Marc Branchaud
2025-05-15 19:30 ` Jeff King
2025-05-16 4:38 ` Patrick Steinhardt
2025-05-20 8:49 ` Toon Claes
2025-05-15 17:30 ` Marc Branchaud
2025-05-16 4:30 ` Patrick Steinhardt
2025-05-14 21:15 ` Marc Branchaud
2025-05-07 20:49 ` Kristoffer Haugsbakk
2025-05-08 13:20 ` D. Ben Knoble
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 13:18 ` D. Ben Knoble
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
2025-05-25 20:07 ` Justin Tobler
2025-06-05 8:32 ` Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-06-13 9:34 ` Toon Claes
2025-06-13 9:52 ` Kristoffer Haugsbakk
2025-05-23 9:33 ` [PATCH RFC v2 2/5] t/perf: add last-modified perf script Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
2025-05-27 10:40 ` Patrick Steinhardt
2025-06-13 11:05 ` Toon Claes [this message]
2025-05-23 9:33 ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-05-23 9:33 ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-07-01 20:35 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
2025-07-01 21:06 ` Junio C Hamano
2025-07-01 21:30 ` Kristoffer Haugsbakk
2025-07-02 13:00 ` Toon Claes
2025-07-09 15:53 ` Toon Claes
2025-07-09 17:00 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-01 20:20 ` Kristoffer Haugsbakk
2025-07-02 11:51 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 2/3] t/perf: add last-modified perf script Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-01 23:01 ` [PATCH RFC v3 0/3] Introduce git-last-modified(1) command Junio C Hamano
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
2025-07-09 21:57 ` Junio C Hamano
2025-07-10 18:37 ` Junio C Hamano
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-18 0:02 ` Taylor Blau
2025-07-19 6:44 ` Jeff King
2025-07-22 15:50 ` Toon Claes
2025-08-01 9:09 ` Christian Couder
2025-08-01 16:59 ` Junio C Hamano
2025-07-16 13:35 ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
2025-07-18 0:08 ` Taylor Blau
2025-07-22 15:52 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
2025-07-18 0:16 ` Taylor Blau
2025-07-22 16:02 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
2025-07-16 15:50 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
2025-07-16 16:09 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-17 22:37 ` Junio C Hamano
2025-07-18 17:36 ` Junio C Hamano
2025-07-22 16:06 ` Toon Claes
2025-07-16 13:42 ` [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available Toon Claes
2025-07-17 23:39 ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
2025-07-22 15:35 ` Toon Claes
2025-07-30 17:59 ` Toon Claes
2025-07-31 7:45 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
2025-07-31 18:40 ` Junio C Hamano
2025-07-31 23:57 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
2025-08-05 14:34 ` Patrick Steinhardt
2025-08-05 16:21 ` Junio C Hamano
2025-08-05 16:34 ` Junio C Hamano
2025-08-05 16:55 ` Toon Claes
2025-08-05 17:20 ` Jean-Noël AVILA
2025-08-05 21:46 ` Junio C Hamano
2025-08-06 12:01 ` Toon Claes
2025-08-06 15:38 ` Junio C Hamano
2025-08-28 22:44 ` Junio C Hamano
2025-08-05 18:28 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-08-05 9:33 ` [PATCH v7 2/3] t/perf: add last-modified perf script Toon Claes
2025-08-05 9:33 ` [PATCH v7 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-08-01 16:22 ` Toon Claes
2025-08-01 17:09 ` Junio C Hamano
2025-08-04 6:34 ` Patrick Steinhardt
2025-08-04 17:14 ` Junio C Hamano
2025-08-05 5:35 ` Toon Claes
2025-08-01 20:34 ` Jean-Noël AVILA
2025-08-05 5:36 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2025-08-01 10:18 ` Christian Couder
2025-08-01 10:22 ` Patrick Steinhardt
2025-08-01 17:06 ` Junio C Hamano
2025-08-02 8:18 ` Christian Couder
2025-08-02 11:31 ` Christian Couder
2025-08-02 13:38 ` Christian Couder
2025-08-02 16:26 ` Junio C Hamano
2025-08-04 6:35 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 2/4] t/perf: add last-modified perf script Toon Claes
2025-07-30 17:55 ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
2025-07-31 6:43 ` Patrick Steinhardt
2025-08-01 16:23 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2025-07-09 15:26 ` [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-09 15:26 ` [PATCH v4 2/3] t/perf: add last-modified perf script Toon Claes
2025-07-09 15:26 ` [PATCH v4 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-16 13:35 ` [PATCH v5 6/6] fixup! " Toon Claes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87h60j296g.fsf@iotcl.com \
--to=toon@iotcl$(echo .)com \
--cc=avarab@gmail$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=me@ttaylorr$(echo .)com \
--cc=peff@peff$(echo .)net \
--cc=ps@pks$(echo .)im \
--cc=stolee@gmail$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox