clean/smudge filters for pdf files

public inbox for git@vger.kernel.org 

* clean/smudge filters for pdf files
@ 2008-10-23 19:44 Leo Razoumov
  2008-10-23 21:32 ` Pierre Habouzit
  0 siblings, 1 reply; 5+ messages in thread
From: Leo Razoumov @ 2008-10-23 19:44 UTC
  To: git

I am trying to improve storage efficiency for PDF files in a git repo.
Following earlier discussions on this list, I am trying to set up
proper clean/smudge filters. My current setup is:

# in ~/.gitconfig
[filter "pdf"]
	clean  = "pdftk - output - uncompress"
	smudge = "pdftk - output - compress"

# in .gitattributes
*.pdf filter=pdf

Unfortunately, pdftk uncompress followed by pdftk compress does not
leave the file invariant. I tried several uncompress+compress
iterations and the file still keeps changing (though the size stays
the same).
Is there any other way to store PDF files in a git repo more
efficiently?
Is there any alternative to pdftk on Linux?

--Leo--
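
A minimal sketch of how the instability described above could be checked, assuming Python 3 and a POSIX shell. The helper names `run_filter` and `is_stable_pair` are hypothetical and not part of git or pdftk; the command strings are handed to the shell in the same form as they appear in the config. The check asks whether storing a file, checking it out, and storing it again yields the same stored bytes, which is the property a clean/smudge pair needs in order not to show spurious changes.

```python
import subprocess


def run_filter(command: str, data: bytes) -> bytes:
    """Pipe data through a shell command (stdin -> stdout), as a filter would."""
    result = subprocess.run(command, shell=True, input=data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


def is_stable_pair(clean: str, smudge: str, original: bytes) -> bool:
    """Return True if clean(smudge(clean(x))) == clean(x) for this input."""
    stored = run_filter(clean, original)           # bytes that would go into the repo
    checked_out = run_filter(smudge, stored)       # bytes that would land in the worktree
    stored_again = run_filter(clean, checked_out)  # re-adding the checked-out file
    return stored_again == stored
```

With the setup above, a call might look like `is_stable_pair("pdftk - output - uncompress", "pdftk - output - compress", data)`, where `data` holds the bytes of some PDF; a result of `False` would correspond to the behavior reported here.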


* Re: clean/smudge filters for pdf files
  2008-10-23 19:44 clean/smudge filters for pdf files Leo Razoumov
@ 2008-10-23 21:32 ` Pierre Habouzit
  2008-10-24  1:40   ` Leo Razoumov
  0 siblings, 1 reply; 5+ messages in thread
From: Pierre Habouzit @ 2008-10-23 21:32 UTC
  To: Leo Razoumov; +Cc: git


On Thu, Oct 23, 2008 at 07:44:39PM +0000, Leo Razoumov wrote:
> I am trying to improve storage efficiency for PDF files in a git repo.
> Following earlier discussions in this list I am trying to set up
> proper clean/smudge filters. What follows is my current setup
> 
> # in ~/.gitconfig
> [filter "pdf"]
> 	clean  = "pdftk - output - uncompress"
> 	smudge = "pdftk - output - compress"
> 
> # in .gitattributes
> *.pdf filter=pdf
> 
> Unfortunately, it seems as though that pdftk uncompress followed by
> pdftk compress do not leave the file invariant. I tried several
> uncompress+compress iterations and the file still keep changing (the
> size though stays the same).
> Is there any other alternative way to store PDF files in git repo more
> efficiently?
> Any alternative to pdftk on Linux?

Actually, it uses some kind of zlib algorithm, so it's pretty normal
that you don't get the same result with a packer. Maybe one could write
a tool like pristine-tar for that purpose.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian•org
OOO                                                http://www.madism.org



* Re: clean/smudge filters for pdf files
  2008-10-23 21:32 ` Pierre Habouzit
@ 2008-10-24  1:40   ` Leo Razoumov
  2008-10-24  8:10     ` Michael J Gruber
  2008-10-24  8:44     ` Michael J Gruber
  0 siblings, 2 replies; 5+ messages in thread
From: Leo Razoumov @ 2008-10-24  1:40 UTC
  To: Pierre Habouzit; +Cc: git

On 10/23/08, Pierre Habouzit <madcoder@debian•org> wrote:
> actually it uses some kind of zlib algorithm so that's pretty normal you
>  don't have the same result with a packer. Maybe one could write a tool
>  like pristine-tar for that purpose.
>

With zlib you get the same deterministic result as long as you use the
same zlib packer and unpacker. With pdftk, compress and uncompress do
not seem to form a bijective pair. This issue was briefly discussed on
this list back in April 2008, but no resolution emerged.

--Leo--
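
The determinism claim above can be illustrated with Python's `zlib` module. This is only a sketch of zlib's behavior under the stated condition (same library build, same level); it says nothing about how pdftk drives its compressor. The sample payload and the helper name `recompress` are made up for the example.

```python
import zlib


def recompress(stream: bytes, level: int = 6) -> bytes:
    """Inflate a zlib stream and deflate it again at the given level."""
    return zlib.compress(zlib.decompress(stream), level)


payload = b"BT /F1 12 Tf (Hello) Tj ET\n" * 200   # made-up sample content
first = zlib.compress(payload, 6)
second = recompress(first, 6)

# Same library build, same level, same input: the compressed bytes repeat.
same_bytes = first == second
# Whatever the level, decompression recovers the original payload.
same_content = zlib.decompress(zlib.compress(payload, 1)) == payload
```

If the compressed bytes repeat but the file as a whole still changes between iterations, the difference has to come from somewhere other than the compressor itself.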


* Re: clean/smudge filters for pdf files
  2008-10-24  1:40   ` Leo Razoumov
@ 2008-10-24  8:10     ` Michael J Gruber
  2008-10-24  8:44     ` Michael J Gruber
  1 sibling, 0 replies; 5+ messages in thread
From: Michael J Gruber @ 2008-10-24  8:10 UTC
  To: SLONIK.AZ; +Cc: Pierre Habouzit, git

Leo Razoumov venit, vidit, dixit 24.10.2008 03:40:
> With zlib you get the same deterministic result as long as you use the
> same zlib packer and unpacker. With pdftk compress/uncompress seem not
> to form a bijection pair. This issue was briefly discussed on this
> list back in April 2008 but no resolution emerged.

For a different file format I use the pair "gzip -c", "gunzip -c"
without any problems, so zlib is not the problem. I do see that
checkouts on different machines may produce different compressed files
(same gzip version), but this is a non-issue.
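
A sketch of why differing compressed checkouts can be harmless, assuming decompression is the clean step and compression the smudge step (the message does not spell out the direction). It uses Python's `gzip` module (the `mtime` argument needs Python 3.8 or later) rather than the gzip command line, and it picks the header timestamp as one way two gzip files with identical content can differ; that is an illustration, not a diagnosis of the difference reported above. The function names and timestamps are hypothetical.

```python
import gzip


def smudge(data: bytes, mtime: int) -> bytes:
    """Compress for the worktree; mtime is written into the gzip header."""
    return gzip.compress(data, mtime=mtime)


def clean(data: bytes) -> bytes:
    """Decompress for the repository."""
    return gzip.decompress(data)


content = b"example payload\n" * 100             # made-up sample content
machine_a = smudge(content, mtime=1224835800)    # hypothetical timestamps
machine_b = smudge(content, mtime=1224839040)

worktrees_differ = machine_a != machine_b
blobs_match = clean(machine_a) == clean(machine_b) == content
```

Only the cleaned bytes are hashed and stored, so as long as clean maps every variant back to the same content, the repository sees no change.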

Your experience with pdftk confirms mine. It shuffles things around
because it parses the files into objects and then writes them out again
in a possibly different order. This is no problem for PDF because it
uses "pointers" (it's a bijection up to reordering), but it's a weird
design, and it complicates things for us.
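
The phrase "a bijection up to reordering" can be made concrete with a toy model. The sketch below is not real PDF syntax and not how pdftk works internally; the object table, the `serialize` and `canonicalize` helpers, and the line format are all invented for illustration. Objects refer to each other by id, so writing them in a different order changes the bytes without changing the document, and sorting by id gives a form that no longer depends on the write order.

```python
from typing import Dict, List, Tuple

# Toy model, not real PDF syntax: object id -> (payload, ids it points to).
Objects = Dict[int, Tuple[str, List[int]]]


def serialize(objects: Objects, order: List[int]) -> str:
    """Write objects in the given order; references are by id, not position."""
    lines = []
    for obj_id in order:
        payload, refs = objects[obj_id]
        ref_text = " ".join(f"{r}R" for r in refs)
        lines.append(f"{obj_id} obj {payload} [{ref_text}]")
    return "\n".join(lines)


def canonicalize(text: str) -> str:
    """Sort serialized objects by numeric id so write order no longer matters."""
    lines = text.split("\n")
    return "\n".join(sorted(lines, key=lambda line: int(line.split(" ", 1)[0])))


doc: Objects = {1: ("catalog", [2]), 2: ("pages", [3]), 3: ("page", [])}
written_once = serialize(doc, [1, 2, 3])
written_again = serialize(doc, [3, 1, 2])
```

Here `written_once` and `written_again` differ as strings, while their canonical forms are equal. A real PDF normalizer would have more to handle, for instance the cross-reference table that records byte offsets of objects.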

I'm still looking for something viable; I'll let the list know when
I've found something.

Michael


* Re: clean/smudge filters for pdf files
  2008-10-24  1:40   ` Leo Razoumov
  2008-10-24  8:10     ` Michael J Gruber
@ 2008-10-24  8:44     ` Michael J Gruber
  1 sibling, 0 replies; 5+ messages in thread
From: Michael J Gruber @ 2008-10-24  8:44 UTC
  To: SLONIK.AZ; +Cc: Pierre Habouzit, git

Little addition to my previous reply:

Multivalent apparently almost gets there. After 2 iterations most of
the uncompressed file is stable, except for some binary blob at the end.
Alas, it's Java and not even completely open source.

Michael
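
A small sketch of how "stable except for a blob at the end" could be measured when comparing the outputs of two successive iterations. The helper names `first_difference` and `stable_fraction` are hypothetical, and the code makes no assumption about which tool produced the two byte strings.

```python
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))   # one input is a prefix of the other
    return None


def stable_fraction(a: bytes, b: bytes) -> float:
    """Fraction of the longer input that precedes the first difference."""
    offset = first_difference(a, b)
    longest = max(len(a), len(b))
    if offset is None or longest == 0:
        return 1.0
    return offset / longest
```

A fraction close to 1.0 with a non-`None` offset would match the situation described above: the bulk of the file repeats and only a trailing region keeps changing.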

