From: Junio C Hamano <gitster@pobox•com>
To: "Victoria Dye via GitGitGadget" <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org, Eric Sunshine <sunshine@sunshineco•com>,
Patrick Steinhardt <ps@pks•im>, Victoria Dye <vdye@github•com>
Subject: Re: [PATCH v2 16/17] mktree: allow deeper paths in input
Date: Thu, 27 Jun 2024 12:29:42 -0700 [thread overview]
Message-ID: <xmqqed8itc9l.fsf@gitster.g> (raw)
In-Reply-To: <46756c4e3140d34838ad4cd5e7a070d1f9f46b53.1718834285.git.gitgitgadget@gmail.com> (Victoria Dye via GitGitGadget's message of "Wed, 19 Jun 2024 21:58:04 +0000")
"Victoria Dye via GitGitGadget" <gitgitgadget@gmail•com> writes:
> From: Victoria Dye <vdye@github•com>
>
> Update 'git mktree' to handle entries nested inside of directories (e.g.
> 'path/to/a/file.txt'). This functionality requires a series of changes:
>
> * In 'sort_and_dedup_tree_entry_array()', remove entries inside of
> directories that come after them in input order.
So if you feed "folder/file.txt" and then "folder/", then
"folder/file.txt" gets removed? It is unclear offhand why that is
the right thing to do.
> * Also in 'sort_and_dedup_tree_entry_array()', mark directories that contain
> entries that come after them in input order (e.g., 'folder/' followed by
> 'folder/file.txt') as "need to expand".
Makes me wonder what happens to the object name recorded in the
input for "folder/" when something like this happens. Ideally,
adding (or replacing) "folder/file.txt" to the set of files we
collected out of the base tree and the input stream for "folder/"
and writing that "folder/" out as a tree would result in a tree
whose object name exactly matches it (and we will error out if it
does not)? Or is "need to expand" a signal that we should ignore
the object name in the input and we need to recompute it ourselves?
Again, it is unclear offhand what we want the "need to expand" is
used for.
> * In 'add_tree_entry_to_index()', if a tree entry is marked as "need to
> expand", recurse into it with 'read_tree_at()' & 'build_index_from_tree'.
> * In 'build_index_from_tree()', if a user-specified tree entry is contained
> within the current iterated entry, return 'READ_TREE_RECURSIVE' to recurse
> into the iterated tree.
Surely, no matter what we choose to do to the object name given with
"folder/" when the input stream also talks about "folder/file.txt",
we'd need to recurse into the subtree. But I think we need a higher
level description of what exactly we want to do to the multi-level
pathnames (i.e., "we want to handle them this way") before going into
the implementation details of how we do so (i.e., "hence we deal with
a multi-level pathname this way at these places in the code") in
these bullet points.
Especially, I do not quite understand what semantics the first
bullet point is trying to achieve.
> +Entries may use full pathnames containing directory separators to specify
> +entries nested within one or more directories. These entries are inserted
> +into the appropriate tree in the base tree-ish if one exists. Otherwise,
> +empty parent trees are created to contain the entries.
This still does not answer "how overlapping entries are handled?",
which is more complex than "for two exactly the same paths, the last
one wins", which is mentioned in the next paragraph.
> The order of the tree entries is normalized by `mktree` so pre-sorting the
> input by path is not required. Multiple entries provided with the same path
> are deduplicated, with only the last one specified added to the tree.
> diff --git a/builtin/mktree.c b/builtin/mktree.c
> index 96f06547a2a..74cec92a517 100644
> --- a/builtin/mktree.c
> +++ b/builtin/mktree.c
> ...
> +static struct tree_entry *tree_entry_array_pop(struct tree_entry_array *arr)
> +{
> + if (!arr->nr)
> + return NULL;
> + return arr->entries[--arr->nr];
> +}
> +
> static void tree_entry_array_clear(struct tree_entry_array *arr, int free_entries)
> {
> if (free_entries) {
> @@ -109,8 +118,10 @@ static void append_to_tree(unsigned mode, struct object_id *oid, const char *pat
>
> if (!verify_path(ent->name, mode))
> die(_("invalid path '%s'"), path);
> - if (strchr(ent->name, '/'))
> - die("path %s contains slash", path);
> +
> + /* mark has_nested_entries if needed */
> + if (!arr->has_nested_entries && strchr(ent->name, '/'))
> + arr->has_nested_entries = 1;
OK.
> @@ -168,6 +179,46 @@ static void sort_and_dedup_tree_entry_array(struct tree_entry_array *arr)
> ignore_mode = 0;
> QSORT_S(arr->entries, arr->nr, ent_compare, &ignore_mode);
We have already sorted the array twice (once before simple deduping,
once after). So we now have a sorted array of "last one won" paths
and their object names.
> + if (arr->has_nested_entries) {
We need to deal with overlapping entries if "has-nested-entries" is
true. Even though our input here is sorted, we'd still pay
attention to the original input "order", which may be different from
the order in which we find these entries in arr->entries[].
OK.
> + struct tree_entry_array parent_dir_ents = { 0 };
> +
> + count = arr->nr;
> + arr->nr = 0;
> +
> + /* Remove any entries where one of its parent dirs has a higher 'order' */
Is "has a higher order" equivalent to "appears later in the input"?
More importantly, can the reason why they need to be removed be
clarified? For simple deduping, we can say "we will make the last
one of multiple entries talking about the same path be used", and
that would be a sufficient explanation why we discard the one that
we have seen earlier and replace it with the newly seen one for the
same path. Can a similar and simple explanation be given for the
behaviour this loop tries to achieve? Is it "children, which appear
earlier in the input, of a directory, which appears later than these
children, are discarded, because the entry for the directory has a
concrete object name, and there is no point talking about individual
paths inside the directory. We know what the tree object that would
contain these child paths hashes to in the end. This is a natural
extension of 'last one wins' rule---a directory that comes later
trumps paths contained within that come earlier"?
> + for (size_t i = 0; i < count; i++) {
> + const char *skipped_prefix;
> + struct tree_entry *parent;
> + struct tree_entry *curr = arr->entries[i];
> + int skip_entry = 0;
> +
> + while ((parent = tree_entry_array_pop(&parent_dir_ents))) {
> + if (!skip_prefix(curr->name, parent->name, &skipped_prefix))
> + continue;
> +
> + /* entry in dir, so we push the parent back onto the stack */
> + tree_entry_array_push(&parent_dir_ents, parent);
> +
> + if (parent->order > curr->order)
> + skip_entry = 1;
> + else
> + parent->expand_dir = 1;
> +
> + break;
> + }
> +
> + if (!skip_entry) {
> + arr->entries[arr->nr++] = curr;
> + if (S_ISDIR(curr->mode))
> + tree_entry_array_push(&parent_dir_ents, curr);
> + } else {
> + FREE_AND_NULL(curr);
> + }
> + }
> +
> + tree_entry_array_release(&parent_dir_ents, 0);
> + }
> +
> /* Finally, initialize the directory-file conflict hash map */
> for (size_t i = 0; i < count; i++) {
> struct tree_entry *curr = arr->entries[i];
next prev parent reply other threads:[~2024-06-27 19:29 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-11 18:24 [PATCH 00/16] mktree: support more flexible usage Victoria Dye via GitGitGadget
2024-06-11 18:24 ` [PATCH 01/16] mktree: use OPT_BOOL Victoria Dye via GitGitGadget
2024-06-11 18:24 ` [PATCH 02/16] mktree: rename treeent to tree_entry Victoria Dye via GitGitGadget
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-11 18:24 ` [PATCH 03/16] mktree: use non-static tree_entry array Victoria Dye via GitGitGadget
2024-06-11 18:45 ` Eric Sunshine
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-11 18:24 ` [PATCH 04/16] update-index: generalize 'read_index_info' Victoria Dye via GitGitGadget
2024-06-11 22:45 ` Junio C Hamano
2024-06-11 18:24 ` [PATCH 05/16] index-info.c: identify empty input lines in read_index_info Victoria Dye via GitGitGadget
2024-06-11 22:52 ` Junio C Hamano
2024-06-18 17:33 ` Victoria Dye
2024-06-11 18:24 ` [PATCH 06/16] index-info.c: parse object type in provided " Victoria Dye via GitGitGadget
2024-06-12 1:54 ` Junio C Hamano
2024-06-11 18:24 ` [PATCH 07/16] mktree: use read_index_info to read stdin lines Victoria Dye via GitGitGadget
2024-06-12 2:11 ` Junio C Hamano
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-12 18:35 ` Junio C Hamano
2024-06-11 18:24 ` [PATCH 08/16] mktree: add a --literally option Victoria Dye via GitGitGadget
2024-06-12 2:18 ` Junio C Hamano
2024-06-11 18:24 ` [PATCH 09/16] mktree: validate paths more carefully Victoria Dye via GitGitGadget
2024-06-12 2:26 ` Junio C Hamano
2024-06-12 19:01 ` Victoria Dye
2024-06-12 19:45 ` Junio C Hamano
2024-06-11 18:24 ` [PATCH 10/16] mktree: overwrite duplicate entries Victoria Dye via GitGitGadget
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-12 18:48 ` Victoria Dye
2024-06-11 18:24 ` [PATCH 11/16] mktree: create tree using an in-core index Victoria Dye via GitGitGadget
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-11 18:24 ` [PATCH 12/16] mktree: use iterator struct to add tree entries to index Victoria Dye via GitGitGadget
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-13 18:38 ` Victoria Dye
2024-06-11 18:24 ` [PATCH 13/16] mktree: add directory-file conflict hashmap Victoria Dye via GitGitGadget
2024-06-11 18:24 ` [PATCH 14/16] mktree: optionally add to an existing tree Victoria Dye via GitGitGadget
2024-06-12 9:40 ` Patrick Steinhardt
2024-06-12 19:50 ` Junio C Hamano
2024-06-17 19:23 ` Victoria Dye
2024-06-11 18:24 ` [PATCH 15/16] mktree: allow deeper paths in input Victoria Dye via GitGitGadget
2024-06-11 18:24 ` [PATCH 16/16] mktree: remove entries when mode is 0 Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 00/17] mktree: support more flexible usage Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 01/17] mktree: use OPT_BOOL Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 02/17] mktree: rename treeent to tree_entry Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 03/17] mktree: use non-static tree_entry array Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 04/17] update-index: generalize 'read_index_info' Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 05/17] index-info.c: return unrecognized lines to caller Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 06/17] index-info.c: parse object type in provided in read_index_info Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 07/17] mktree: use read_index_info to read stdin lines Victoria Dye via GitGitGadget
2024-06-20 20:18 ` Junio C Hamano
2024-06-19 21:57 ` [PATCH v2 08/17] mktree.c: do not fail on mismatched submodule type Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 09/17] mktree: add a --literally option Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 10/17] mktree: validate paths more carefully Victoria Dye via GitGitGadget
2024-06-19 21:57 ` [PATCH v2 11/17] mktree: overwrite duplicate entries Victoria Dye via GitGitGadget
2024-06-20 22:05 ` Junio C Hamano
2024-06-19 21:58 ` [PATCH v2 12/17] mktree: create tree using an in-core index Victoria Dye via GitGitGadget
2024-06-20 22:26 ` Junio C Hamano
2024-06-19 21:58 ` [PATCH v2 13/17] mktree: use iterator struct to add tree entries to index Victoria Dye via GitGitGadget
2024-06-26 21:10 ` Junio C Hamano
2024-06-19 21:58 ` [PATCH v2 14/17] mktree: add directory-file conflict hashmap Victoria Dye via GitGitGadget
2024-06-19 21:58 ` [PATCH v2 15/17] mktree: optionally add to an existing tree Victoria Dye via GitGitGadget
2024-06-26 21:23 ` Junio C Hamano
2024-06-19 21:58 ` [PATCH v2 16/17] mktree: allow deeper paths in input Victoria Dye via GitGitGadget
2024-06-27 19:29 ` Junio C Hamano [this message]
2024-06-19 21:58 ` [PATCH v2 17/17] mktree: remove entries when mode is 0 Victoria Dye via GitGitGadget
2024-06-25 23:26 ` [PATCH v2 00/17] mktree: support more flexible usage Junio C Hamano
2024-07-10 21:40 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqed8itc9l.fsf@gitster.g \
--to=gitster@pobox$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=gitgitgadget@gmail$(echo .)com \
--cc=ps@pks$(echo .)im \
--cc=sunshine@sunshineco$(echo .)com \
--cc=vdye@github$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox