From: Junio C Hamano <gitster@pobox•com>
To: Steffen Prohaska <prohaska@zib•de>
Cc: git@vger•kernel.org, Jeff King <peff@peff•net>,
Scott Chacon <schacon@gmail•com>
Subject: Re: [PATCH] convert: Stream from fd to required clean filter instead of mmap
Date: Mon, 04 Aug 2014 12:03:27 -0700 [thread overview]
Message-ID: <xmqq4mxsrsao.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <1407056176-8231-1-git-send-email-prohaska@zib.de> (Steffen Prohaska's message of "Sun, 3 Aug 2014 10:56:16 +0200")
Steffen Prohaska <prohaska@zib•de> writes:
> The data is streamed to the filter process anyway. Better avoid mapping
> the file if possible. This is especially useful if a clean filter
> reduces the size, for example if it computes a sha1 for binary data,
> like git media. The file size that the previous implementation could
> handle was limited by the available address space; large files for
> example could not be handled with (32-bit) msysgit. The new
> implementation can filter files of any size as long as the filter output
> is small enough.
>
> The new code path is only taken if the filter is required. The filter
> consumes data directly from the fd. The original data is not available
> to git, so it must fail if the filter fails.
Yay ;-)
>
> The test that exercises required filters is modified to verify that the
> data actually has been modified on its way from the file system to the
> object store.
>
> The expectation on the process size is tested using /usr/bin/time. An
> alternative would have been tcsh, which could be used to print memory
> information as follows:
>
> tcsh -c 'set time=(0 "%M"); <cmd>'
>
> Although the logic could perhaps be simplified with tcsh, I chose to use
> 'time' to avoid a dependency on tcsh.
>
> Signed-off-by: Steffen Prohaska <prohaska@zib•de>
> ---
> convert.c | 58 ++++++++++++++++++++++++++++++++++++++-----
> convert.h | 10 +++++---
> sha1_file.c | 29 ++++++++++++++++++++--
> t/t0021-conversion.sh | 68 +++++++++++++++++++++++++++++++++++++++++++++++----
> 4 files changed, 149 insertions(+), 16 deletions(-)
>
> diff --git a/convert.c b/convert.c
> index cb5fbb4..58a516a 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -312,11 +312,12 @@ static int crlf_to_worktree(const char *path, const char *src, size_t len,
> struct filter_params {
> const char *src;
> unsigned long size;
> + int fd;
> const char *cmd;
> const char *path;
> };
>
> -static int filter_buffer(int in, int out, void *data)
> +static int filter_buffer_or_fd(int in, int out, void *data)
> {
> /*
> * Spawn cmd and feed the buffer contents through its stdin.
> @@ -325,6 +326,7 @@ static int filter_buffer(int in, int out, void *data)
> struct filter_params *params = (struct filter_params *)data;
> int write_err, status;
> const char *argv[] = { NULL, NULL };
> + int fd;
>
> /* apply % substitution to cmd */
> struct strbuf cmd = STRBUF_INIT;
> @@ -355,7 +357,17 @@ static int filter_buffer(int in, int out, void *data)
>
> sigchain_push(SIGPIPE, SIG_IGN);
>
> - write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
> + if (params->src) {
> + write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
> + } else {
> + /* dup(), because copy_fd() closes the input fd. */
> + fd = dup(params->fd);
> + if (fd < 0)
> + write_err = error("failed to dup file descriptor.");
> + else
> + write_err = copy_fd(fd, child_process.in);
> + }
> +
> if (close(child_process.in))
> write_err = 1;
> if (write_err)
> @@ -371,7 +383,7 @@ static int filter_buffer(int in, int out, void *data)
> return (write_err || status);
> }
>
> -static int apply_filter(const char *path, const char *src, size_t len,
> +static int apply_filter(const char *path, const char *src, size_t len, int fd,
> struct strbuf *dst, const char *cmd)
> {
> /*
> @@ -392,11 +404,12 @@ static int apply_filter(const char *path, const char *src, size_t len,
> return 1;
>
> memset(&async, 0, sizeof(async));
> - async.proc = filter_buffer;
> + async.proc = filter_buffer_or_fd;
> async.data = ¶ms;
> async.out = -1;
> params.src = src;
> params.size = len;
> + params.fd = fd;
> params.cmd = cmd;
> params.path = path;
>
> @@ -747,6 +760,22 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
> }
> }
>
> +int would_convert_to_git_filter_fd(const char *path) {
Style; opening brace on its own line:
int would_convert_to_git_filter_fd(const char *path)
{
> + struct conv_attrs ca;
> + convert_attrs(&ca, path);
> +
> + if (!ca.drv)
As we do not allow decl-after-stmt, it is easier to read if the
first blank line in the function was between the local variable
definition and the first statement. I.e.
int would_convert_to_git_filter_fd(const char *path)
{
struct ...;
convert_attrs(...);
if (!ca.drv)
> + return 0;
> +
> + /* Apply a filter to an fd only if the filter is required to succeed.
> + * We must die if the filter fails, because the original data before
> + * filtering is not available. */
/*
* multi-line
* comment style.
*/
> + if (!ca.drv->required)
> + return 0;
> +
> + return apply_filter(path, 0, 0, -1, 0, ca.drv->clean);
What's the significance of "-1" here? Does somebody in the
callchain from apply_filter() check if fd < 0 and act differently
(not a complaint nor rhetoric question)?
Spell out the first "0" as "NULL" if you meant src == NULL
(similarly for dst).
> diff --git a/convert.h b/convert.h
> index 0c2143c..d9d853c 100644
> --- a/convert.h
> +++ b/convert.h
> @@ -40,11 +40,15 @@ extern int convert_to_working_tree(const char *path, const char *src,
> size_t len, struct strbuf *dst);
> extern int renormalize_buffer(const char *path, const char *src, size_t len,
> struct strbuf *dst);
> -static inline int would_convert_to_git(const char *path, const char *src,
> - size_t len, enum safe_crlf checksafe)
> +static inline int would_convert_to_git(const char *path)
> {
> - return convert_to_git(path, src, len, NULL, checksafe);
> + return convert_to_git(path, NULL, 0, NULL, 0);
> }
Is this change because there was only one caller that passed nothing
meaningful but "path"? It would have been nicer to see such a change
as a clean-up _before_ the real patch, if that was the case.
next prev parent reply other threads:[~2014-08-04 19:03 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-08-03 8:56 [PATCH] convert: Stream from fd to required clean filter instead of mmap Steffen Prohaska
2014-08-04 19:03 ` Junio C Hamano [this message]
2014-08-06 4:22 ` Steffen Prohaska
2014-08-06 4:32 ` prohaska
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqq4mxsrsao.fsf@gitster.dls.corp.google.com \
--to=gitster@pobox$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=peff@peff$(echo .)net \
--cc=prohaska@zib$(echo .)de \
--cc=schacon@gmail$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox