public inbox for git@vger.kernel.org 
From: Junio C Hamano <gitster@pobox•com>
To: "Erik Elfström" <erik.elfstrom@gmail•com>
Cc: git@vger•kernel.org, Jens Lehmann <Jens.Lehmann@web•de>,
	Jeff King <peff@peff•net>
Subject: Re: [PATCH v2 3/3] clean: improve performance when removing lots of directories
Date: Wed, 15 Apr 2015 10:56:50 -0700
Message-ID: <xmqqpp75l1gd.fsf@gitster.dls.corp.google.com>
In-Reply-To: <1428770587-9674-5-git-send-email-erik.elfstrom@gmail.com> ("Erik Elfström"'s message of "Sat, 11 Apr 2015 18:43:07 +0200")

Erik Elfström <erik.elfstrom@gmail•com> writes:

> Before this change, clean used resolve_gitlink_ref to check for the
> presence of nested git repositories. This had the drawback of creating
> a ref_cache entry for every directory that should potentially be
> cleaned. The linear search through the ref_cache list caused a massive
> performance hit for large number of directories.

I'd prefer to see the "current state" described in the present
tense, e.g.

    "git clean" uses resolve_gitlink_ref() to check for the presence of
    nested git repositories, but it has the drawback of creating a
    ref_cache entry for every directory that should potentially be
    cleaned. The linear search through the ref_cache list causes a
    massive performance hit for a large number of directories.
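
To make the cost described above concrete, here is a rough standalone
model of "one cache entry per directory, found by walking a list". It
is only a sketch: struct cache_entry, lookup_or_add() and
total_compares() are made-up names for illustration and are not the
code in refs.c. With n distinct directories, each lookup misses and
walks every entry added so far, so the total number of string
comparisons is n * (n - 1) / 2.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-path cache entry (not git's ref_cache). */
struct cache_entry {
	struct cache_entry *next;
	char name[32];
};

/*
 * Look up "name" in a singly linked list, adding a new entry on a miss.
 * Returns the number of strcmp() calls the lookup needed.
 */
static size_t lookup_or_add(struct cache_entry **head, const char *name)
{
	size_t compares = 0;
	struct cache_entry *e;

	assert(strlen(name) < sizeof(e->name));
	for (e = *head; e; e = e->next) {
		compares++;
		if (!strcmp(e->name, name))
			return compares;
	}

	/* Miss: prepend a fresh entry, as a simple cache would. */
	e = calloc(1, sizeof(*e));
	if (!e)
		abort();
	strcpy(e->name, name);
	e->next = *head;
	*head = e;
	return compares;
}

/* Total comparisons when n distinct directory names are looked up once each. */
static size_t total_compares(size_t n)
{
	struct cache_entry *head = NULL, *next;
	size_t i, total = 0;
	char name[32];

	for (i = 0; i < n; i++) {
		snprintf(name, sizeof(name), "dir%zu", i);
		total += lookup_or_add(&head, name);
	}

	/* Release the list before returning. */
	for (; head; head = next) {
		next = head->next;
		free(head);
	}
	return total;
}
```

In this model total_compares(1000) comes to 499500 comparisons, and
the count grows quadratically from there, which is consistent with
the kind of slowdown the log message reports for 100000 directories.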

> Teach clean.c:remove_dirs to use setup.c:is_git_directory
> instead. is_git_directory will actually open HEAD and parse the HEAD
> ref but this implies a nested git repository and should be rare when
> cleaning.

I am not sure what you wanted to say in this paragraph.  What does
it being rare have to do with it?  Even if it is not rare (i.e. the
top-level project you are working with has many submodules checked
out without using the more recent "a file .git pointing into
.git/modules/ via 'gitdir: $overThere'" mechanism), if we found a
nested git repository, we treat it as special and exclude it from
cleaning, which is a good thing, no?

Doesn't this implementation get confused by modern submodule
checkouts and descend into and clean their working tree, though?
Module M with path P would have a directory P in the working tree of
the top-level project, and P/.git is a regular file that will fail
the is_git_directory() test but records the location of the real
submodule repository, i.e. ".git/modules/M", via the "gitdir:"
mechanism.
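
To illustrate the case I am worried about, here is a minimal sketch
in standalone C of a check for that layout. The helper name
has_gitfile() is hypothetical and this is not git's internal API; it
only tests that "<dir>/.git" is a regular file starting with
"gitdir: ", and does not verify that the path it names is a valid
repository.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return 1 if "<dir>/.git" is a regular file whose
 * content begins with "gitdir: ", and 0 otherwise.  The target of the
 * "gitdir:" line is not checked.
 */
static int has_gitfile(const char *dir)
{
	char path[4096], buf[8];
	struct stat st;
	FILE *fp;
	size_t n;
	int len;

	assert(dir && *dir);
	len = snprintf(path, sizeof(path), "%s/.git", dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return 0;	/* path too long for this sketch */

	/* A directory here is the old-style layout, not a gitfile. */
	if (stat(path, &st) || !S_ISREG(st.st_mode))
		return 0;

	fp = fopen(path, "r");
	if (!fp)
		return 0;
	n = fread(buf, 1, sizeof(buf), fp);
	fclose(fp);

	/* "gitdir: " is exactly 8 bytes, matching sizeof(buf). */
	return n == sizeof(buf) && !memcmp(buf, "gitdir: ", sizeof(buf));
}
```

A check along these lines, done in addition to is_git_directory(),
would let the caller notice P before descending into it.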

> Using is_git_directory should give a more standardized check for what
> is and what isn't a git repository but also gives a slight behavioral
> change. We will now detect and respect bare and empty nested git
> repositories (only init run). Update t7300 to reflect this.
>
> The time to clean an untracked directory containing 100000 sub
> directories went from 61s to 1.7s after this change.
>
> Helped-by: Jeff King <peff@peff•net>
> Signed-off-by: Erik Elfström <erik.elfstrom@gmail•com>
> ---
>  builtin/clean.c  | 24 ++++++++++++++++++++----
>  t/t7300-clean.sh |  4 ++--
>  2 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/clean.c b/builtin/clean.c
> index 98c103f..b679913 100644
> --- a/builtin/clean.c
> +++ b/builtin/clean.c
> @@ -10,7 +10,6 @@
>  #include "cache.h"
>  #include "dir.h"
>  #include "parse-options.h"
> -#include "refs.h"
>  #include "string-list.h"
>  #include "quote.h"
>  #include "column.h"
> @@ -148,6 +147,25 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset)
>  	return 0;
>  }
>  
> +static int is_git_repository(struct strbuf *path)
> +{
> +	int ret = 0;
> +	if (is_git_directory(path->buf))
> +		ret = 1;
> +	else {
> +		size_t orig_path_len = path->len;
> +		assert(orig_path_len != 0);
> +		if (path->buf[orig_path_len - 1] != '/')
> +			strbuf_addch(path, '/');
> +		strbuf_addstr(path, ".git");
> +		if (is_git_directory(path->buf))
> +			ret = 1;
> +		strbuf_setlen(path, orig_path_len);
> +	}
> +
> +	return ret;
> +}
> +
>  static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
>  		int dry_run, int quiet, int *dir_gone)
>  {
> @@ -155,13 +173,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
>  	struct strbuf quoted = STRBUF_INIT;
>  	struct dirent *e;
>  	int res = 0, ret = 0, gone = 1, original_len = path->len, len;
> -	unsigned char submodule_head[20];
>  	struct string_list dels = STRING_LIST_INIT_DUP;
>  
>  	*dir_gone = 1;
>  
> -	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
> -			!resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
> +	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && is_git_repository(path)) {
>  		if (!quiet) {
>  			quote_path_relative(path->buf, prefix, &quoted);
>  			printf(dry_run ?  _(msg_would_skip_git_dir) : _(msg_skip_git_dir),
> diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
> index 58e6b4a..da294fe 100755
> --- a/t/t7300-clean.sh
> +++ b/t/t7300-clean.sh
> @@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
>  	! test -d bar
>  '
>  
> -test_expect_failure 'nested git (only init) should be kept' '
> +test_expect_success 'nested git (only init) should be kept' '
>  	rm -fr foo bar &&
>  	git init foo &&
>  	mkdir bar &&
> @@ -465,7 +465,7 @@ test_expect_failure 'nested git (only init) should be kept' '
>  	test_path_is_missing bar
>  '
>  
> -test_expect_failure 'nested git (bare) should be kept' '
> +test_expect_success 'nested git (bare) should be kept' '
>  	rm -fr foo bar &&
>  	git init --bare foo &&
>  	mkdir bar &&
