public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox•com>
To: Karsten Blees <karsten.blees@gmail•com>
Cc: Git List <git@vger•kernel.org>
Subject: Re: [PATCH v1 4/4] hashmap: add string interning API
Date: Mon, 07 Jul 2014 10:44:31 -0700	[thread overview]
Message-ID: <xmqqfviddpxc.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <53B4863E.1020701@gmail.com> (Karsten Blees's message of "Thu, 03 Jul 2014 00:22:54 +0200")

Karsten Blees <karsten.blees@gmail•com> writes:

> Interning short strings with high probability of duplicates can reduce the
> memory footprint and speed up comparisons.
>
> Add strintern() and memintern() APIs that use a hashmap to manage the pool
> of unique, interned strings.
>
> Note: strintern(getenv()) could be used to sanitize git's use of getenv(),
> in case we ever encounter a platform where a call to getenv() invalidates
> previous getenv() results (which is allowed by POSIX).

I think the attribute name/value strings are already interned, so
they may want to be converted to use this (or vice-versa).

>
> Signed-off-by: Karsten Blees <blees@dcon•de>
> ---
>  Documentation/technical/api-hashmap.txt | 15 +++++++++++++
>  hashmap.c                               | 38 +++++++++++++++++++++++++++++++++
>  hashmap.h                               |  8 +++++++
>  t/t0011-hashmap.sh                      | 13 +++++++++++
>  test-hashmap.c                          | 14 ++++++++++++
>  5 files changed, 88 insertions(+)
>
> diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
> index f9215d6..00c4c29 100644
> --- a/Documentation/technical/api-hashmap.txt
> +++ b/Documentation/technical/api-hashmap.txt
> @@ -193,6 +193,21 @@ more entries.
>  `hashmap_iter_first` is a combination of both (i.e. initializes the iterator
>  and returns the first entry, if any).
>  
> +`const char *strintern(const char *string)`::
> +`const void *memintern(const void *data, size_t len)`::
> +
> +	Returns the unique, interned version of the specified string or data,
> +	similar to the `String.intern` API in Java and .NET, respectively.
> +	Interned strings remain valid for the entire lifetime of the process.
> ++
> +Can be used as `[x]strdup()` or `xmemdupz` replacement, except that interned
> +strings / data must not be modified or freed.
> ++
> +Interned strings are best used for short strings with high probability of
> +duplicates.
> ++
> +Uses a hashmap to store the pool of interned strings.
> +
>  Usage example
>  -------------
>  
> diff --git a/hashmap.c b/hashmap.c
> index d1b8056..f693839 100644
> --- a/hashmap.c
> +++ b/hashmap.c
> @@ -226,3 +226,41 @@ void *hashmap_iter_next(struct hashmap_iter *iter)
>  		current = iter->map->table[iter->tablepos++];
>  	}
>  }
> +
> +struct pool_entry {
> +	struct hashmap_entry ent;
> +	size_t len;
> +	unsigned char data[FLEX_ARRAY];
> +};
> +
> +static int pool_entry_cmp(const struct pool_entry *e1,
> +			  const struct pool_entry *e2,
> +			  const unsigned char *keydata)
> +{
> +	return e1->data != keydata &&
> +	       (e1->len != e2->len || memcmp(e1->data, keydata, e1->len));
> +}
> +
> +const void *memintern(const void *data, size_t len)
> +{
> +	static struct hashmap map;
> +	struct pool_entry key, *e;
> +
> +	/* initialize string pool hashmap */
> +	if (!map.tablesize)
> +		hashmap_init(&map, (hashmap_cmp_fn) pool_entry_cmp, 0);
> +
> +	/* lookup interned string in pool */
> +	hashmap_entry_init(&key, memhash(data, len));
> +	key.len = len;
> +	e = hashmap_get(&map, &key, data);
> +	if (!e) {
> +		/* not found: create it */
> +		e = xmallocz(sizeof(struct pool_entry) + len);
> +		hashmap_entry_init(e, key.ent.hash);
> +		e->len = len;
> +		memcpy(e->data, data, len);
> +		hashmap_add(&map, e);
> +	}
> +	return e->data;
> +}
> diff --git a/hashmap.h b/hashmap.h
> index 12f0668..507884b 100644
> --- a/hashmap.h
> +++ b/hashmap.h
> @@ -87,4 +87,12 @@ static inline void *hashmap_iter_first(struct hashmap *map,
>  	return hashmap_iter_next(iter);
>  }
>  
> +/* string interning */
> +
> +extern const void *memintern(const void *data, size_t len);
> +static inline const char *strintern(const char *string)
> +{
> +	return memintern(string, strlen(string));
> +}
> +
>  #endif
> diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
> index 391e2b6..f97c805 100755
> --- a/t/t0011-hashmap.sh
> +++ b/t/t0011-hashmap.sh
> @@ -237,4 +237,17 @@ test_expect_success 'grow / shrink' '
>  
>  '
>  
> +test_expect_success 'string interning' '
> +
> +test_hashmap "intern value1
> +intern Value1
> +intern value2
> +intern value2
> +" "value1
> +Value1
> +value2
> +value2"
> +
> +'
> +
>  test_done
> diff --git a/test-hashmap.c b/test-hashmap.c
> index 3c9f67b..07aa7ec 100644
> --- a/test-hashmap.c
> +++ b/test-hashmap.c
> @@ -234,6 +234,20 @@ int main(int argc, char *argv[])
>  			/* print table sizes */
>  			printf("%u %u\n", map.tablesize, map.size);
>  
> +		} else if (!strcmp("intern", cmd) && l1) {
> +
> +			/* test that strintern works */
> +			const char *i1 = strintern(p1);
> +			const char *i2 = strintern(p1);
> +			if (strcmp(i1, p1))
> +				printf("strintern(%s) returns %s\n", p1, i1);
> +			else if (i1 == p1)
> +				printf("strintern(%s) returns input pointer\n", p1);
> +			else if (i1 != i2)
> +				printf("strintern(%s) != strintern(%s)", i1, i2);
> +			else
> +				printf("%s\n", i1);
> +
>  		} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
>  
>  			perf_hashmap(atoi(p1), atoi(p2));

  parent reply	other threads:[~2014-07-07 17:44 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-02 22:18 [PATCH v1 0/4] hashmap improvements Karsten Blees
2014-07-02 22:20 ` [PATCH v1 1/4] hashmap: factor out getting an int hash code from a, SHA1 Karsten Blees
2014-07-07 17:22   ` Junio C Hamano
2014-07-02 22:21 ` [PATCH v1 2/4] hashmap: improve struct hashmap member documentation Karsten Blees
2014-07-02 22:22 ` [PATCH v1 3/4] hashmap: add simplified hashmap_get_from_hash() API Karsten Blees
2014-07-07 17:43   ` Junio C Hamano
2014-07-11 19:11     ` Karsten Blees
2014-07-11 22:21       ` Junio C Hamano
2014-07-02 22:22 ` [PATCH v1 4/4] hashmap: add string interning API Karsten Blees
2014-07-03  7:22   ` Matthieu Moy
2014-07-07 17:44   ` Junio C Hamano [this message]
2014-07-03  7:23 ` [PATCH v1 0/4] hashmap improvements Matthieu Moy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqfviddpxc.fsf@gitster.dls.corp.google.com \
    --to=gitster@pobox$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=karsten.blees@gmail$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox