* Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-24 19:02 UTC
To: git

Hi,

I am considering Git to maintain a repository of approximately 300.000
files totaling 1 GB, with a number of ~100.000 commits per day, all in
one single branch. The only operations performed are "git commit", "git
show", and "git checkout", and all on one local machine. Does this sound
like a reasonable thing to do with Git?

-Samuel

* Re: Performance impact of a large number of commits
From: david @ 2008-10-24 19:43 UTC
To: Samuel Abels; +Cc: git

On Fri, 24 Oct 2008, Samuel Abels wrote:

> Hi,
>
> I am considering Git to maintain a repository of approximately 300.000
> files totaling 1 GB, with a number of ~100.000 commits per day, all in
> one single branch. The only operations performed are "git commit", "git
> show", and "git checkout", and all on one local machine. Does this sound
> like a reasonable thing to do with Git?

100,000 commits per day??

that's 1.5 commits/second. what is updating files that quickly?

I suspect that you will have some issues here, but it's going to depend on
how many files get updated each 3/4 of a second.

David Lang
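
As a rough check of the rate discussed above, the arithmetic can be sketched in Python. The function name `commit_rate` is made up for this illustration; the only input taken from the thread is the figure of 100,000 commits per day, assumed here to be spread evenly over 24 hours. Under that assumption the rate comes out nearer 1.2 commits per second, which leaves the underlying concern unchanged.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def commit_rate(commits_per_day):
    """Return (commits per second, seconds between commits).

    Assumes commits are spread evenly over the whole day.
    """
    per_second = commits_per_day / SECONDS_PER_DAY
    return per_second, 1.0 / per_second


rate, interval = commit_rate(100_000)
print(f"{rate:.2f} commits/second, one commit every {interval:.2f} s")
```

If the snapshots arrive in bursts rather than evenly, the peak rate would be higher than this average.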

* Re: Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-24 19:56 UTC
To: david; +Cc: git

On Fri, 2008-10-24 at 12:43 -0700, david@lang•hm wrote:
> 100,000 commits per day??
>
> that's 1.5 commits/second. what is updating files that quickly?

This is an automated process taking snapshots of rapidly changing files
containing statistical data. Unfortunately, our needs go beyond what a
versioning file system has to offer, and the data is an unstructured
text file (in other words, using a relational database is not a good
option).

> I suspect that you will have some issues here, but it's going to depend on
> how many files get updated each 3/4 of a second.

That would be 5 to 10 changed files per commit, and those are passed to
git commit explicitly (i.e., walking the tree to stat files for finding
changes is not necessary).

-Samuel

* Re: Performance impact of a large number of commits
From: david @ 2008-10-24 20:11 UTC
To: Samuel Abels; +Cc: git

On Fri, 24 Oct 2008, Samuel Abels wrote:

> On Fri, 2008-10-24 at 12:43 -0700, david@lang•hm wrote:
>> 100,000 commits per day??
>>
>> that's 1.5 commits/second. what is updating files that quickly?
>
> This is an automated process taking snapshots of rapidly changing files
> containing statistical data. Unfortunately, our needs go beyond what a
> versioning file system has to offer, and the data is an unstructured
> text file (in other words, using a relational database is not a good
> option).
>
>> I suspect that you will have some issues here, but it's going to depend on
>> how many files get updated each 3/4 of a second.
>
> That would be 5 to 10 changed files per commit, and those are passed to
> git commit explicitly (i.e., walking the tree to stat files for finding
> changes is not necessary).

I suspect that your limits would be filesystem/OS limits more than git
limits

at 5-10 files/commit you are going to be creating .5-1m files/day, even
spread across 256 directories this is going to be a _lot_ of files.

packing this may help (depending on how much the files change), but with
this many files the work of doing the packing would be expensive.

David Lang
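
To put rough numbers on the estimate above, here is a minimal sketch. It assumes every changed file yields exactly one new loose blob, that object names spread evenly over the 256 two-hex-digit directories under .git/objects, and that nothing is packed in between. The helper names are hypothetical. Each commit that changes files also writes a commit object and at least one new tree object, so the real count would be somewhat higher.

```python
FANOUT_DIRS = 256  # .git/objects/00 through .git/objects/ff


def loose_blobs_per_day(commits_per_day, files_per_commit):
    """New blob objects per day, assuming one new blob per changed file."""
    return commits_per_day * files_per_commit


def per_fanout_dir(objects):
    """Average objects per directory, assuming an even spread of hashes."""
    return objects / FANOUT_DIRS


for files in (5, 10):
    blobs = loose_blobs_per_day(100_000, files)
    print(f"{files} files/commit: {blobs} blobs/day, "
          f"about {round(per_fanout_dir(blobs))} per directory per day")
```

With these assumptions each object directory grows by roughly two to four thousand files per day until the objects are packed.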

* Re: Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-25 5:18 UTC
To: david; +Cc: git

On Fri, 2008-10-24 at 13:11 -0700, david@lang•hm wrote:
> > git commit explicitly (i.e., walking the tree to stat files for finding
> > changes is not necessary).
>
> I suspect that your limits would be filesystem/OS limits more than git
> limits
>
> at 5-10 files/commit you are going to be creating .5-1m files/day, even
> spread across 256 directories this is going to be a _lot_ of files.

The files are organized in a way that places no more than ~1.000 files
into each directory. Will Git create a directory containing a larger
number of object files? I can see that this would be a problem in our
use case.

> packing this may help (depending on how much the files change), but with
> this many files the work of doing the packing would be expensive.

We can probably do that even if it takes several hours.

-Samuel

* Re: Performance impact of a large number of commits
From: david @ 2008-10-25 5:29 UTC
To: Samuel Abels; +Cc: git

On Sat, 25 Oct 2008, Samuel Abels wrote:

> On Fri, 2008-10-24 at 13:11 -0700, david@lang•hm wrote:
>>> git commit explicitly (i.e., walking the tree to stat files for finding
>>> changes is not necessary).
>>
>> I suspect that your limits would be filesystem/OS limits more than git
>> limits
>>
>> at 5-10 files/commit you are going to be creating .5-1m files/day, even
>> spread across 256 directories this is going to be a _lot_ of files.
>
> The files are organized in a way that places no more than ~1.000 files
> into each directory. Will Git create a directory containing a larger
> number of object files? I can see that this would be a problem in our
> use case.

when git stores the copies of the files it does a sha1 hash of the file
contents and then stores the file in the directory
.git/objects/<first two digits of the hash>/<hash>

this means that if you have files that have the same content they all fold
together, but with lots of files changing rapidly the result is a lot of
files in these object directories.

it would be a pretty minor change to git to have it use more directories
(in fact, there's another thread going on today where people are looking
at making this configurable, in that case to reduce the number of
directories)

the other storage format that git has is the pack file. it takes a bunch
of the objects, does some comparisons between them (to find duplicate bits
of files), and then stores the result (base files plus deltas to re-create
other files). the resulting compression is _extremely_ efficient, and it
collapses many file objects into one pack file (addressing the issues of
many files in one directory)

>> packing this may help (depending on how much the files change), but with
>> this many files the work of doing the packing would be expensive.
>
> We can probably do that even if it takes several hours.

my concern is that spending time creating the pack files will mean that
you don't have time to insert the new files.

that being said, there may be other ways of dealing with this data rather
than putting it into files and then adding it to the git repository.

Git has a fast-import streaming format that is designed for programs to
use that are converting repositories from other SCM systems.

if you can tell more about what you are doing (how the data is being
gathered, are the files re-created for each commit, or are they being
modified? if they are being modified is it appending data, changing some
data, or randomly writing throughout the file? etc) there may be some
other options available.

at this point I don't know if git can work for you or not, but I'm pretty
sure nothing else will have a chance with your size.

David Lang
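
The object naming described above can be sketched in a few lines of Python. This is an illustration, not git's own code, and the function names are hypothetical. It assumes a repository using SHA-1 object names. Strictly speaking, git hashes a short header (`blob <size>` followed by a NUL byte) plus the file contents rather than the bare contents, and the file name inside the two-digit directory is the remaining 38 hex digits of the hash.

```python
import hashlib


def blob_object_id(content: bytes) -> str:
    """Object name git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Path of the loose object file relative to the repository root."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


oid = blob_object_id(b"hello world\n")
print(oid)
print(loose_object_path(oid))
```

Because the name depends only on the content, two files with identical contents map to the same object, which is the folding together mentioned above. The loose object file itself holds the zlib-compressed header and contents.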

* Re: Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-25 15:12 UTC
To: david; +Cc: git

On Fri, 2008-10-24 at 22:29 -0700, david@lang•hm wrote:
> when git stores the copies of the files it does a sha1 hash of the file
> contents and then stores the file in the directory
> .git/objects/<first two digits of the hash>/<hash>
> it would be a pretty minor change to git to have it use more directories

Ah, I see how this works. Well, I'll think of a way to cope with this (I
might patch my Git installation, or see how well it performs on an
indexed file system). If all else fails we'll have to slash the number
of commits even if this means that some files are not added to the
history.

> my concern is that spending time creating the pack files will mean that
> you don't have time to insert the new files.
>
> that being said, there may be other ways of dealing with this data rather
> than putting it into files and then adding it to the git repository.
>
> Git has a fast-import streaming format that is designed for programs to
> use that are converting repositories from other SCM systems.

I'm pretty sure that the streaming format won't do us much good, as the
files are re-created from scratch between commits.

Thanks a lot for the information, this was very helpful.

-Samuel
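
For readers unfamiliar with the streaming format mentioned in this exchange, here is a minimal sketch of one `commit` command in a fast-import stream carrying full file contents inline. The function name, its parameters, and the example identity are hypothetical. It assumes simple paths that need no quoting and a committer string of the form `Name <email>`. Whether this approach would suit the workload described in the thread is not something the sketch settles.

```python
import time


def fast_import_commit(ref, committer, message, files, when=None):
    """Build one fast-import 'commit' command as bytes.

    files maps repository paths to complete file contents (bytes).
    committer is a 'Name <email>' string; when is a Unix timestamp.
    """
    when = int(time.time()) if when is None else when
    msg = message.encode("utf-8")
    out = [
        b"commit %s\n" % ref.encode("utf-8"),
        b"committer %s %d +0000\n" % (committer.encode("utf-8"), when),
        b"data %d\n%s\n" % (len(msg), msg),
    ]
    for path, content in files.items():
        # 'M <mode> inline <path>' is followed by a data block with the file.
        out.append(b"M 100644 inline %s\n" % path.encode("utf-8"))
        out.append(b"data %d\n%s\n" % (len(content), content))
    out.append(b"\n")
    return b"".join(out)
```

The resulting bytes are meant to be written to the standard input of `git fast-import`, which reads the stream and writes objects directly into a pack file instead of creating one loose object per file.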