* GSoC 2009 Prospective student
@ 2009-02-22 19:58 Rohan Dhruva
2009-02-22 20:07 ` Sverre Rabbelier
2009-02-22 20:43 ` Miklos Vajna
0 siblings, 2 replies; 14+ messages in thread
From: Rohan Dhruva @ 2009-02-22 19:58 UTC (permalink / raw)
To: git
Hi,
I am a student from India. I am very interested in taking part in GSoC
2009, working under git project mentors. However, I am completely new
to git, I have never used it in the past. I have used svn, but only
for downloading source code, never to manage my own code. I am very
interested in open source in general, and I have been using Linux for
5-6 years.
That being said, I know C/C++ as it was taught to me in
school and college. I realize that my qualifications as such are not
very impressive, and hence I wish to start with a smaller project. I
read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
"jump-in" project might be the "Restartable Clones" proposal. Seeing
my capabilities, I would like to know whether I am "fit" to undertake
work on that project? I promise to put in a lot of hard work to learn
git, and its source code. However, I would also require a bit of
hand-holding, at least initially, to get me through.
I am very interested to know the opinion of all prospective mentors on
this issue. Thank you very much, and I do hope I am useful to the git
community.
--
Rohan Dhruva
PS: Please CC me, as I am not subscribed to the list. Thanks.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: GSoC 2009 Prospective student
2009-02-22 19:58 GSoC 2009 Prospective student Rohan Dhruva
@ 2009-02-22 20:07 ` Sverre Rabbelier
2009-02-22 20:29 ` Rohan Dhruva
2009-02-22 20:43 ` Miklos Vajna
1 sibling, 1 reply; 14+ messages in thread
From: Sverre Rabbelier @ 2009-02-22 20:07 UTC (permalink / raw)
To: Rohan Dhruva; +Cc: git
Heya,
On Sun, Feb 22, 2009 at 20:58, Rohan Dhruva <rohandhruva@gmail•com> wrote:
> I am a student from India. I am very interested in taking part in GSoC
> 2009, working under git project mentors. However, I am completely new
> to git, I have never used it in the past. I have used svn, but only
> for downloading source code, never to manage my own code. I am very
> interested in open source in general, and I have been using Linux from
> 5-6 years.
I was in a similar situation myself when I decided to apply for Git as
a GSoC student last year. Your description makes me wonder "why git",
though; any particular reason?
> That being said, I have knowledge of C/C++ what was taught to me in
> school and college. I realize that my qualifications as such are not
> very impressive, and hence I wish to start with a smaller project. I
> read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> "jump-in" project might be the "Restartable Clones" proposal. Seeing
> my capabilities, I would like to know whether I am "fit" to undertake
> work on that project? I promise to put in a lot of hard work to learn
> git, and its source code. However, I would also require a bit of
> hand-holding, at least initially, to get me through.
Almost all students require such handholding, that's what the mentors
are for ;).
> I am very interested to know the opinion of all prospective mentors on
> this issue. Thank you very much, and I do hope I am useful to the git
> community.
Showing your face early and asking around is a good thing to do as
prospective student, good luck :).
> PS: Please CC me, as I am not subscribed to the list. Thanks.
I'd say, step one in your path to being a GSoC student with git would
be to subscribe to the mailing list. Read the "A note from the
maintainer" mails Junio sends out, as well as his "What's cooking"
mails. Do some research on the topic you are interested in (e.g.,
search gmane.org's git archive for discussions on the topic, etc). You
might also want to hang out in #git on irc.freenode.net and get to
know people there.
--
Cheers,
Sverre Rabbelier
* Re: GSoC 2009 Prospective student
2009-02-22 20:07 ` Sverre Rabbelier
@ 2009-02-22 20:29 ` Rohan Dhruva
2009-02-22 20:38 ` Sverre Rabbelier
0 siblings, 1 reply; 14+ messages in thread
From: Rohan Dhruva @ 2009-02-22 20:29 UTC (permalink / raw)
To: Sverre Rabbelier; +Cc: git
Hi Sverre,
On Mon, Feb 23, 2009 at 1:37 AM, Sverre Rabbelier <srabbelier@gmail•com> wrote:
> Heya,
>
> On Sun, Feb 22, 2009 at 20:58, Rohan Dhruva <rohandhruva@gmail•com> wrote:
>> I am a student from India. I am very interested in taking part in GSoC
>> 2009, working under git project mentors. However, I am completely new
>> to git, I have never used it in the past. I have used svn, but only
>> for downloading source code, never to manage my own code. I am very
>> interested in open source in general, and I have been using Linux from
>> 5-6 years.
>
> I was in a similar situation myself when I decided to apply for Git as
> a GSoC student last year, your description makes me wonder "why git"
> though, any particular reason?
>
I have developed a particular interest in SCMs lately. Git is a widely
used SCM. Also, this project would require knowledge of C, and not
some other language which I am not familiar with. Seeing that you were
a student yourself, can you please give me some tips? Any things for
me to keep in mind?
>> That being said, I have knowledge of C/C++ what was taught to me in
>> school and college. I realize that my qualifications as such are not
>> very impressive, and hence I wish to start with a smaller project. I
>> read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
>> "jump-in" project might be the "Restartable Clones" proposal. Seeing
>> my capabilities, I would like to know whether I am "fit" to undertake
>> work on that project? I promise to put in a lot of hard work to learn
>> git, and its source code. However, I would also require a bit of
>> hand-holding, at least initially, to get me through.
>
> Almost all students require such handholding, that's what the mentors
> are for ;).
>
>> I am very interested to know the opinion of all prospective mentors on
>> this issue. Thank you very much, and I do hope I am useful to the git
>> community.
>
> Showing your face early and asking around is a good thing to do as
> prospective student, good luck :).
>
Thanks, I am encouraged :-)
>> PS: Please CC me, as I am not subscribed to the list. Thanks.
>
> I'd say, step one in your path to being a GSoC student with git would
> be to subscribe to the mailing list. Read the "A note from the
> maintainer" mails Junio sends out, as well as his "What's cooking"
> mails. Do some research on the topic you are interested in (e.g.,
> search gmane.org's git archive for discussions on the topic, etc). You
> might also want to hang out in #git on irc.freenode.net and get to
> know people there.
>
I will join the mailing list soon, thanks :)
Cheers,
--
Rohan Dhruva
* Re: GSoC 2009 Prospective student
2009-02-22 20:29 ` Rohan Dhruva
@ 2009-02-22 20:38 ` Sverre Rabbelier
0 siblings, 0 replies; 14+ messages in thread
From: Sverre Rabbelier @ 2009-02-22 20:38 UTC (permalink / raw)
To: Rohan Dhruva; +Cc: git
On Sun, Feb 22, 2009 at 21:29, Rohan Dhruva <rohandhruva@gmail•com> wrote:
> I have developed a particular interest in SCMs lately. Git is a widely
> used SCM. Also, this project would require knowledge of C, and not
> some other language which I am not familiar with.
Ah, that makes sense then. You should make sure you like using git
then, use it in some project for school, perhaps in combination with
'git svn'.
> Seeing that you were a student yourself, can you please give me
> some tips? Any things for me to keep in mind?
Hmmm, work on list! As soon as you have anything half-decent (this
will hopefully be after a week or two, maybe three), send your work to the
list for review! Work in the open as much as possible and profit from
the combined knowledge of the mailing list.
Before GSoC starts, get in contact with possible mentors, try to learn
about the area of the code you will be touching. Learn the coding
style, and learn how to send patches by reading
Documentation/SubmittingPatches and the list archive, but preferably
by sending one!
Most important is that you have a good time and learn from it though :).
--
Cheers,
Sverre Rabbelier
* Re: GSoC 2009 Prospective student
2009-02-22 19:58 GSoC 2009 Prospective student Rohan Dhruva
2009-02-22 20:07 ` Sverre Rabbelier
@ 2009-02-22 20:43 ` Miklos Vajna
2009-02-22 22:22 ` Nicolas Pitre
1 sibling, 1 reply; 14+ messages in thread
From: Miklos Vajna @ 2009-02-22 20:43 UTC (permalink / raw)
To: Rohan Dhruva; +Cc: git
On Mon, Feb 23, 2009 at 01:28:33AM +0530, Rohan Dhruva <rohandhruva@gmail•com> wrote:
> That being said, I have knowledge of C/C++ what was taught to me in
> school and college. I realize that my qualifications as such are not
> very impressive, and hence I wish to start with a smaller project. I
> read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> "jump-in" project might be the "Restartable Clones" proposal.
I would recommend reading this thread:
http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
Especially Shawn's message, which can be a base for your proposal, if
you want to work on this.
* Re: GSoC 2009 Prospective student
2009-02-22 20:43 ` Miklos Vajna
@ 2009-02-22 22:22 ` Nicolas Pitre
2009-02-23 0:46 ` Sitaram Chamarty
2009-02-23 15:37 ` Jakub Narebski
0 siblings, 2 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-02-22 22:22 UTC (permalink / raw)
To: Miklos Vajna; +Cc: Rohan Dhruva, git
On Sun, 22 Feb 2009, Miklos Vajna wrote:
> On Mon, Feb 23, 2009 at 01:28:33AM +0530, Rohan Dhruva <rohandhruva@gmail•com> wrote:
> > That being said, I have knowledge of C/C++ what was taught to me in
> > school and college. I realize that my qualifications as such are not
> > very impressive, and hence I wish to start with a smaller project. I
> > read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> > "jump-in" project might be the "Restartable Clones" proposal.
>
> I would recommend you to read this thread:
>
> http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
>
> Especially Shawn's message, which can be a base for your proposal, if
> you want to work in this.
I don't particularly agree with Shawn's proposal. Reliance on a stable
sorting on the server side is too fragile, restrictive and cumbersome.
Restartable clone is _hard_. Even I, who have quite a bit of knowledge in
the affected area, haven't found a satisfactory solution yet.
I think restartable clone is a really bad suggestion for SOC students.
After all we want successful SOC projects, not ones that even core git
developers did not yet find a good solution for.
IMHO of course.
Nicolas
* Re: GSoC 2009 Prospective student
2009-02-22 22:22 ` Nicolas Pitre
@ 2009-02-23 0:46 ` Sitaram Chamarty
2009-02-23 15:37 ` Jakub Narebski
1 sibling, 0 replies; 14+ messages in thread
From: Sitaram Chamarty @ 2009-02-23 0:46 UTC (permalink / raw)
To: git
On 2009-02-22, Nicolas Pitre <nico@cam•org> wrote:
> Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> the affected area didn't find a satisfactory solution yet.
I'm sorry I have not followed the earlier discussion. I
have a question. I know the rsync transport is not much
used, and I myself have never used it. But can there not be
a 'sorry, this repo is not yet open' flag that prevents
local git operations while the clone is going on, and then
the actual clone itself merely does an rsync of the
corresponding files? Because rsync is quite restartable.
I can see that this would be a problem if the remote were to
'git repack' in between 2 attempts by the client, because
the actual tree inside .git/objects would change, but that
is hardly a common occurrence I would think.
I'm sorry if I'm being naive and missing a lot of important
nuances -- but I was looking at it from an "if I had to do it
in shell how would I do it" mindset.
Or perhaps by 'restartable clone' you also mean 'restartable
fetch', etc, in which case of course you can't lock out the
repo if a fetch dies partway.
It is not necessary to reply in detail; even a gmane or
other link will do if this was already shot down :-)
* Re: GSoC 2009 Prospective student
2009-02-22 22:22 ` Nicolas Pitre
2009-02-23 0:46 ` Sitaram Chamarty
@ 2009-02-23 15:37 ` Jakub Narebski
2009-02-23 15:58 ` Shawn O. Pearce
1 sibling, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2009-02-23 15:37 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Miklos Vajna, Rohan Dhruva, git
Nicolas Pitre <nico@cam•org> writes:
> On Sun, 22 Feb 2009, Miklos Vajna wrote:
> > On Mon, Feb 23, 2009 at 01:28:33AM +0530, Rohan Dhruva <rohandhruva@gmail•com> wrote:
> > > That being said, I have knowledge of C/C++ what was taught to me in
> > > school and college. I realize that my qualifications as such are not
> > > very impressive, and hence I wish to start with a smaller project. I
> > > read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> > > "jump-in" project might be the "Restartable Clones" proposal.
> >
> > I would recommend you to read this thread:
> >
> > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> >
> > Especially Shawn's message, which can be a base for your proposal, if
> > you want to work in this.
>
> I don't particularly agree with Shawn's proposal. Reliance on a stable
> sorting on the server side is too fragile, restrictive and cumbersome.
>
> Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> the affected area didn't find a satisfactory solution yet.
I think it is possible for dumb protocols (using commit walkers) and
for (deprecated) rsync.
The only change needed would be for "git clone --continue" to bypass the check
that the directory to download the repository to is nonexistent or empty.
I guess that what code can do (or perhaps even does currently) for
commit walk based dumb protocols (like HTTP) is to do commit walk, and
for packfiles which are already downloaded or partially downloaded,
download rest of file (if web server supports it; if not, redownload
whole packfile, but do not redownload already existing packfiles).
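
The per-packfile decision described above can be sketched as follows. This
is only an illustration, not code from git: the function name
plan_pack_download and its parameters are hypothetical, and it assumes the
client can learn the remote file size and whether the web server accepts
byte-range requests (for example from a HEAD response).

```python
def plan_pack_download(local_bytes, remote_bytes, accepts_ranges):
    """Decide how to fetch one packfile when a dumb-protocol clone restarts.

    Hypothetical helper: returns an (action, extra_headers) pair.
    """
    if local_bytes == remote_bytes:
        # Already fully downloaded: do not fetch it again.
        return ("skip", {})
    if 0 < local_bytes < remote_bytes and accepts_ranges:
        # Partially downloaded and the server supports ranges:
        # ask only for the missing tail of the file.
        return ("resume", {"Range": f"bytes={local_bytes}-"})
    # Nothing usable locally, or no range support: redownload the whole pack.
    return ("restart", {})


print(plan_pack_download(4096, 10000, True))
```

The returned headers would be attached to the HTTP GET for that packfile;
whether the partial data is still valid has to be checked separately.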
For rsync:// it could be enough to just bypass the check... but the
probability of getting corrupted repository would be even higher,
unfortunately.
> I think restartable clone is a really bad suggestion for SOC students.
> After all we want successful SOC projects, not ones that even core git
> developers did not yet find a good solution for.
>
> IMHO of course.
But I agree that within current limits (as far as I know there is no
way to ask for a SHA-1; you can only ask for refs, for security reasons)
it would be difficult to very difficult to add restartable clone
support to native (smart) protocols.
If not for this limitation it would be, I think, possible to do a kind
of fsck, checking which commits in packfile are complete (i.e. have
all objects), and based on that ask for subset of objects. This would
require support only from a client... alas, this is not possible.
In mentioned post Shawn talks about a way for server to 1) generate
exactly the same packfile (the proposal is to replay want/have, but it
also requires stable sorting of objects); 2) transfer only the rest of
file (but server has to regenerate packfile anyway, as packfiles are
generated on-the-fly; well, unless it caches packfiles, which might be
good idea anyway).
I think that unless 'restartable clone' is limited to commit walkers
(HTTP protocol etc.) it should be moved up in difficulty from the "New to
Git?" section. I guess that mirror-sync, formerly GitTorrent, could be
easier to implement.
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: GSoC 2009 Prospective student
2009-02-23 15:37 ` Jakub Narebski
@ 2009-02-23 15:58 ` Shawn O. Pearce
2009-02-23 16:31 ` Nicolas Pitre
2009-02-24 15:38 ` Jakub Narebski
0 siblings, 2 replies; 14+ messages in thread
From: Shawn O. Pearce @ 2009-02-23 15:58 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git
Jakub Narebski <jnareb@gmail•com> wrote:
> Nicolas Pitre <nico@cam•org> writes:
> > On Sun, 22 Feb 2009, Miklos Vajna wrote:
> > >
> > > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> > >
> > > Especially Shawn's message, which can be a base for your proposal, if
> > > you want to work in this.
> >
> > I don't particularly agree with Shawn's proposal. Reliance on a stable
> > sorting on the server side is too fragile, restrictive and cumbersome.
We already rely on a stable sort in the tree format. Asking that
a stable sort be applied when a clone is started so that we can
later resume it isn't unreasonable. Hell, that tree format sort
is a B***H anyway, it's not a simple sort by memcmp(). Almost every
Git re-implementation gets it wrong the first time out.
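
To illustrate why the tree sort is not a plain memcmp() on names: entries
that are themselves trees compare as if their name ended in '/'. A minimal
sketch of that rule, with made-up entry names and a hypothetical helper
tree_sort_key (this is not git's own code):

```python
def tree_sort_key(name: bytes, is_tree: bool) -> bytes:
    # Subtrees sort as though their name had a trailing slash.
    return name + b"/" if is_tree else name


# (name, is_tree) pairs; the names are invented for the example.
entries = [(b"foo", True), (b"foo.c", False), (b"foo-bar", False)]

plain_order = sorted(entries, key=lambda e: e[0])
tree_order = sorted(entries, key=lambda e: tree_sort_key(*e))

print([name for name, _ in plain_order])
print([name for name, _ in tree_order])
```

Because '-' and '.' sort before '/', the directory "foo" ends up after
"foo-bar" and "foo.c" in tree order, while a naive byte-wise sort of the
names puts it first.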
> > Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> > the affected area didn't find a satisfactory solution yet.
Sure, it's difficult, but nobody has put effort into it either.
I think it could be done by enforcing a stable sort during clone
(and perhaps only during clone). That's the basis of that message
Miklos points to. Though I don't think I ever said anything about
the stable sort only being used during clone.
> I think it is possible for dumb protocols (using commit walkers) and
> for (deprecated) rsync.
Yes, it is possible for the commit walkers to implement a restart,
as they are actually beginning at the current root and walking back
in history. Resuming a large file like a pack is easy to do on HTTP
if the remote server supports byte range serving. It's also easy
to validate on the client that the pack wasn't repacked during the
idle period (between initial fetch and restart), just validate the
SHA-1 footer. If the pack was repacked and came up with the same
name you'll have a mismatch on the footer. Discard and try again.
And if you want to save bandwidth, always grab the last 20 bytes
of the file before getting any other parts, save it somewhere,
and revalidate that last 20 before resuming. If it's changed,
you should discard what you have and start over from the beginning.
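
A minimal sketch of the two checks described above, assuming a SHA-1 pack
whose last 20 bytes are the checksum of everything before them. The function
names are hypothetical and the "pack" in the example is a toy byte string,
not a real packfile.

```python
import hashlib


def pack_is_intact(data: bytes) -> bool:
    """Check that the trailing 20 bytes are the SHA-1 of the preceding bytes."""
    if len(data) < 20:
        return False
    return hashlib.sha1(data[:-20]).digest() == data[-20:]


def can_resume(saved_trailer: bytes, current_trailer: bytes) -> bool:
    """Resume only if the remote file still ends in the trailer saved earlier."""
    return len(saved_trailer) == 20 and saved_trailer == current_trailer


# Toy stand-in for a pack: some bytes followed by their SHA-1.
body = b"PACK" + bytes(8)
pack = body + hashlib.sha1(body).digest()
print(pack_is_intact(pack), can_resume(pack[-20:], pack[-20:]))
```

In a real client the current trailer would come from a byte-range request
for the last 20 bytes of the remote file, made before any other part is
fetched.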
> > I think restartable clone is a really bad suggestion for SOC students.
> > After all we want successful SOC projects, not ones that even core git
> > developers did not yet find a good solution for.
> >
> > IMHO of course.
>
> But I agree that within current limits (as far as I know there are no
> way to ask for SHA-1; you can only ask for refs for security reasons)
> it would be difficult to very difficult to add restartable clone
> support to native (smart) protocols.
>
> If not for this limitation it would be, I think, possible to do a kind
> of fsck, checking which commits in packfile are complete (i.e. have
> all objects), and based on that ask for subset of objects. This would
> require support only from a client... alas, this is not possible.
I think the current "must want advertised ref" restriction is
too strict. If you make the server check the reachability of the
wanted object, (assuming it can be resolved to a commit) then you
can pick up in the middle of history. We already (to some extent)
support that with the deepen thing in a shallow clone. Sure, it
may cause more server load when clients ask for this partial fetch.
But clients can already abuse a server far more by repeatedly doing
a clone, and then break the network connection as soon as the PACK
header comes down the wire. The server just spent a lot of CPU
and IO time building the complete list of the objects to transmit.
It's really a non-trivial load on the server side. And by having
the client break the pipe at the 'PACK' header, the client doesn't
have to absorb the large data transfer either, making it fairly
easy to DoS a Git daemon with a small botnet.
So, IMHO, the restriction that a commit must be advertised, and not
merely reachable, is overly strict and doesn't buy us a whole lot.
> I think that unless 'restartable clone' is limited to commit walkers
> (HTTP protocol etc.) it should be moved up in difficulty from the "New to
> Git?" section. I guess that mirror-sync, formerly GitTorrent, could be
> easier to implement.
Maybe. But a simple stable sort on the objects makes it easier,
perhaps within reach of "new to git".
That ideas page is a wiki for a reason. If folks feel differently
from me, please edit it to improve things! :-)
--
Shawn.
* Re: GSoC 2009 Prospective student
2009-02-23 15:58 ` Shawn O. Pearce
@ 2009-02-23 16:31 ` Nicolas Pitre
2009-02-24 15:38 ` Jakub Narebski
1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-02-23 16:31 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Jakub Narebski, Miklos Vajna, Rohan Dhruva, git
On Mon, 23 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail•com> wrote:
> > Nicolas Pitre <nico@cam•org> writes:
> > > On Sun, 22 Feb 2009, Miklos Vajna wrote:
> > > >
> > > > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> > > >
> > > > Especially Shawn's message, which can be a base for your proposal, if
> > > > you want to work in this.
> > >
> > > I don't particularly agree with Shawn's proposal. Reliance on a stable
> > > sorting on the server side is too fragile, restrictive and cumbersome.
>
> We already rely on a stable sort in the tree format. Asking that
> a stable sort be applied when a clone is started so that we can
> later resume it isn't unreasonable. Hell, that tree format sort
> is a B***H anyway, its not a simple sort by memcmp(). Almost every
> Git re-implementation gets it wrong the first time out.
That's not the issue at all. The sorting within a single tree object is
indeed well defined (even if it is arguably a bit odd). The object
order is not, and now with threaded delta the list of actually deltified
objects can and does vary between successive packings of the same repo.
Committing ourselves to determinism here just for the sake of a
restartable clone is not something I subscribe to.
> > > Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> > > the affected area didn't find a satisfactory solution yet.
>
> Sure, its difficult, but nobody has put effort into it either.
> I think it could be done by enforcing a stable sort during clone
> (and perhaps only during clone).
We should aim for a real solution, not something that is "special" for a
clone. After all, a clone is just a fetch, and large fetches may be
interrupted too.
> > I think it is possible for dumb protocols (using commit walkers) and
> > for (deprecated) rsync.
>
> Yes, it is possible for the commit walkers to implement a restart,
> as they are actually beginning at the current root and walking back
> in history. Resuming a large file like a pack is easy to do on HTTP
> if the remote server supports byte range serving. Its also easy
> to validate on the client that the pack wasn't repacked during the
> idle period (between initial fetch and restart), just validate the
> SHA-1 footer. If the pack was repacked and came up with the same
> name you'll have a mismatch on the footer. Discard and try again.
Sure, dumb protocols are easy. It's one of the few advantages they have
over the native protocol.
> But clients can already abuse a server far more by repeatedly doing
> a clone, and then break the network connection as soon as the PACK
> header comes down the wire. The server just spent a lot of CPU
> and IO time building the complete list of the objects to transmit.
> Its really a non-trivial load on the server side. And by having
> the client break the pipe at the 'PACK' header, the client doesn't
> have to absorb the large data transfer either. Making it fairly
> easy to DOS a Git daemon with a small botnet.
This is easy to fix, and something I've posted design notes about a
while ago. A cache of generated packs can be made, indexed by a hash of
the wanted/excluded refs used for pack generation. This way popular
fetches (say after Linus pushes stuff to his tree and everyone else
fetches it at night) would require computation only once. That is, I
think, something more suitable for a SOC student project.
Of course willfully abusing a git server can be done despite this,
but that is true for any other service as well.
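
The pack cache idea could be sketched roughly like this. It is an
illustration under assumptions, not the design from the notes mentioned
above: pack_cache_key, PackCache and the build_pack callback are
hypothetical names, and the cache lives in memory with no eviction.

```python
import hashlib


def pack_cache_key(wants, haves):
    """Hash the wanted/excluded object names into one cache key."""
    h = hashlib.sha1()
    for tag, ids in (("want", wants), ("have", haves)):
        # Sort and de-duplicate so equivalent requests share a key.
        for oid in sorted(set(ids)):
            h.update(f"{tag} {oid}\n".encode())
    return h.hexdigest()


class PackCache:
    def __init__(self, build_pack):
        self._build_pack = build_pack  # expensive: generates the pack bytes
        self._packs = {}

    def get(self, wants, haves):
        key = pack_cache_key(wants, haves)
        if key not in self._packs:
            self._packs[key] = self._build_pack(wants, haves)
        return self._packs[key]


calls = []
cache = PackCache(lambda w, h: calls.append(1) or b"pack-bytes")
cache.get(["b" * 40, "a" * 40], [])
cache.get(["a" * 40, "b" * 40], [])
print(len(calls))
```

With this keying, many clients fetching the same update after a push hit the
same entry, so the expensive pack generation runs once.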
> That ideas page is a wiki for a reason. If folks feel differently
> from me, please edit it to improve things! :-)
/me hates editing wiki pages... :-/
Nicolas
* Re: GSoC 2009 Prospective student
2009-02-23 15:58 ` Shawn O. Pearce
2009-02-23 16:31 ` Nicolas Pitre
@ 2009-02-24 15:38 ` Jakub Narebski
2009-02-24 15:55 ` Shawn O. Pearce
1 sibling, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2009-02-24 15:38 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git
On Mon, 23 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail•com> wrote:
>> Nicolas Pitre <nico@cam•org> writes:
>>> On Sun, 22 Feb 2009, Miklos Vajna wrote:
>>>>
>>>> http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
>>>>
>>>> Especially Shawn's message, which can be a base for your proposal, if
>>>> you want to work in this.
>>>
>>> I don't particularly agree with Shawn's proposal. Reliance on a stable
>>> sorting on the server side is too fragile, restrictive and cumbersome.
>
> We already rely on a stable sort in the tree format. [...]
By 'sorting order' I (and Nicolas) mean here the ordering of objects and
deltas in the pack file, i.e. whether we get _exactly_ the same (byte
for byte) packfile for the same want/have exchange (your proposal), or
even for the same arguments to git-pack-objects (which is a necessary,
although I think not sufficient condition).
[...]
>> I think it is possible for dumb protocols (using commit walkers) and
>> for (deprecated) rsync.
>
> Yes, it is possible for the commit walkers to implement a restart,
> as they are actually beginning at the current root and walking back
> in history. Resuming a large file like a pack is easy to do on HTTP
> if the remote server supports byte range serving. Its also easy
> to validate on the client that the pack wasn't repacked during the
> idle period (between initial fetch and restart), just validate the
> SHA-1 footer. If the pack was repacked and came up with the same
> name you'll have a mismatch on the footer. Discard and try again.
Can we assume that packfiles are named correctly (i.e. that the name of
the packfile matches the SHA-1 footer)?
>
> And if you want to save bandwidth, always grab the last 20 bytes
> of the file before getting any other parts, save it somewhere,
> and revalidate that last 20 before resuming. If its changed,
> you should discard what you have and start over from the beginning.
Therefore I think that restartable clone for "dumb" (commit walker)
protocols is an easy GSoC project, while restartable clone for "smart"
(generate packfile) protocols is at least of medium difficulty, and
might be harder.
>>> I think restartable clone is a really bad suggestion for SOC students.
>>> After all we want successful SOC projects, not ones that even core git
>>> developers did not yet find a good solution for.
>>>
>>> IMHO of course.
>>
>> But I agree that within current limits (as far as I know there are no
>> way to ask for SHA-1; you can only ask for refs for security reasons)
>> it would be difficult to very difficult to add restartable clone
>> support to native (smart) protocols.
>>
>> If not for this limitation it would be, I think, possible to do a kind
>> of fsck, checking which commits in packfile are complete (i.e. have
>> all objects), and based on that ask for subset of objects. This would
>> require support only from a client... alas, this is not possible.
>
> I think the current "must want advertised ref" restriction is
> too strict. If you make the server check the reachability of the
> wanted object, (assuming it can be resolved to a commit) then you
> can pick up in the middle of history. We already (to some extent)
> support that with the deepen thing in a shallow clone. Sure, it
> may cause more server load when clients ask for this partial fetch.
Hmmm... I forgot about shallow clone.
Still, we can have the following situation:
*---*---o---.---.---. .... .---o---*---* <-- some ref
^ ^
| |
a b
where '*' means that we have the commit and all its objects fully in the
packfile (i.e. if they are deltas, the base for each delta is in the
packfile), 'o' means incomplete, for example a commit with some of its
blobs missing, and '.' means a missing commit object.
Because git deals with continuous ranges, on restart of the clone we can
tell that we have 'a' and that we want 'b', but no more than that without
further extensions to the git protocols that would let us say that we have
some objects (to exclude) without assuming anything about their
prerequisites; something that, if I remember correctly, was implemented in
some floating 'lazy clone' patch (well, lazy loading of blobs patch)...
[...]
> So, IMHO, the restriction that a commit must be advertised, and not
> merely reachable, is overly strict and doesn't buy us a whole lot.
>
>> I think that unless 'restartable clone' is limited to commit walkers
>> (HTTP protocol etc.) it should be moved up in difficulty from the "New to
>> Git?" section. I guess that mirror-sync, formerly GitTorrent, could be
>> easier to implement.
>
> Maybe. But a simple stable sort on the objects makes it easier,
> perhaps within reach of "new to git".
As Nico said, in the presence of threaded packing the ordering of _objects_
in a _packfile_ might not be deterministic.
>
> That ideas page is a wiki for a reason. If folks feel differently
> from me, please edit it to improve things! :-)
I'll try to add 'pack file cache for git-daemon' proposal to
GSoC2009Ideas page... but I cannot be mentor (or even co-mentor) for
this idea.
--
Jakub Narebski
Poland
* Re: GSoC 2009 Prospective student
2009-02-24 15:38 ` Jakub Narebski
@ 2009-02-24 15:55 ` Shawn O. Pearce
2009-02-24 21:08 ` Jakub Narebski
0 siblings, 1 reply; 14+ messages in thread
From: Shawn O. Pearce @ 2009-02-24 15:55 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git
Jakub Narebski <jnareb@gmail•com> wrote:
>
> I (and Nicolas) by 'sorting order' mean here ordering of objects and
> deltas in the pack file, i.e. whether we get _exactly_ the same (byte
> for byte) packfile for the same want/have exchange (your proposal), or
> even for the same arguments to git-pack-objects (which is a necessary,
> although I think not sufficient condition).
I know.
My proposal though didn't require the same byte-for-byte pack file.
Only that the objects were in a predictable order. It didn't permit
resuming in the middle of an object. If the last object in the pack
was truncated the client would resume by getting that object again,
and may get a different byte sequence for that object representation.
It's a b**ch to know where you stopped though, as you could be in
a long string of deltas whose base is in the portion you didn't
yet receive. Which means you can't identify that string that you
already have, and pack-objects on resume can't assume you have
those objects, because you only have the deltas for them and are
lacking a way to restore them.
> Can we assume that packfiles are named correctly (i.e. name of packfile
> match SHA-1 footer)?
Wrong.
The hash in "pack-$hash.pack"/"pack-$hash.idx" is *not* the 20-byte
SHA-1 footer. It's the 20-byte SHA-1 of the sorted object names that
are in that pack.
We should try not to assume that the pack's file name matches the
sorted object names, but we can assume that the pack file name is
"pack-$hash.pack" where $hash is a 40 character hexadecimal string.
The dumb commit walkers already have this restriction built into
them, and have for quite some time.
Any pack writers, including fast-import, honor this naming standard
in order to ensure they are compatible with the existing dumb
commit walkers.
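
As a rough illustration of the naming rule above, the sketch below
checks the "pack-$hash.pack" pattern and derives a name from a set of
object names. The function names are made up for this example, and it
assumes the hash is taken over the raw 20-byte object names
concatenated in sorted order; treat that detail as an assumption
rather than a statement about git's exact implementation.

```python
import hashlib
import re
from typing import Iterable

# "pack-" followed by exactly 40 lowercase hex digits, then ".pack".
PACK_NAME_RE = re.compile(r"pack-([0-9a-f]{40})\.pack")


def is_valid_pack_name(filename: str) -> bool:
    """Check only the naming pattern, not what the hash was computed from."""
    return PACK_NAME_RE.fullmatch(filename) is not None


def pack_hash_from_object_names(hex_names: Iterable[str]) -> str:
    """SHA-1 over the sorted object names (assumed: raw 20-byte values)."""
    h = hashlib.sha1()
    for name in sorted(hex_names):
        h.update(bytes.fromhex(name))
    return h.hexdigest()


# Stand-in object names, generated only for the demonstration.
names = [hashlib.sha1(data).hexdigest() for data in (b"one", b"two", b"three")]
filename = "pack-" + pack_hash_from_object_names(names) + ".pack"
print(filename, is_valid_pack_name(filename))
```

A tool that only relies on the pattern check, as suggested above, keeps
working even when a pack was named by some other means.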
> Therefore I think that restartable clone for "dumb" (commit walker)
> protocols is easy GSoC project, while restartable clone for "smart"
> (generate packfile) protocols is at least of medium difficulty, and
> might be harder.
Probably quite right. Unfortunately the majority of the git
repositories out there are served with the smart protocol, because
it is more efficient. :)
> Still, we can have the following situation:
>
> *---*---o---.---.---. .... .---o---*---* <-- some ref
>
> ^ ^
> | |
> a b
>
> where '*' means that we have commit and all its object fully in packfile
> (i.e. if they are delta, there is base for delta in packfile), 'o' means
> incomplete, for example commit with some o blobs missing, and '.' means
> missing commit object.
>
> Because git deals with continuous range, we can tell on restart of clone
> that we have 'a', and that we want 'b', but without further extensions
> to git protocols, where we can tell that we have some objects (to
> exclude), but not assume anything about their requirements; something
> that if I remember correctly was implemented in some floating 'lazy
> clone' patch (well, lazy loading of blobs patch)...
Err, yes. Which is why I wanted to put a stable sort order on the
objects in the pack. If you do that then you can specify a range
within the range of objects being fetched.
E.g. in the diagram above if the client said "want b, have a" during
a "git fetch" we can apply the stable ordering to all objects in
that range "a..b", and then apply another subrange to that where the
client says "complete until Q", where "Q" denotes a position in that
sorted list. Thus we only need to transmit the remaining elements.
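
The "complete until Q" idea can be sketched as follows. This is only an
illustration under stated assumptions: stable_order() is a hypothetical
stand-in (plain sorting by name) for whatever deterministic ordering the
server would really use, and count_complete() assumes the client knows
the on-the-wire size of each entry it received, measured after the pack
header.

```python
from typing import List, Sequence


def stable_order(object_names: Sequence[str]) -> List[str]:
    """Hypothetical deterministic ordering; sorting by name is a stand-in."""
    return sorted(object_names)


def count_complete(entry_sizes: Sequence[int], bytes_received: int) -> int:
    """Number of entries fully received; a truncated last entry is dropped."""
    total = 0
    complete = 0
    for size in entry_sizes:
        total += size
        if total > bytes_received:
            break
        complete += 1
    return complete


def remaining_objects(object_names: Sequence[str], complete_until: int) -> List[str]:
    """Objects still to send when the client is complete up to position Q."""
    ordered = stable_order(object_names)
    if not 0 <= complete_until <= len(ordered):
        raise ValueError("position is outside the object list")
    return ordered[complete_until:]


objects = ["c", "a", "d", "b"]            # objects in the range a..b
q = count_complete([10, 20, 30, 40], 35)  # third entry was cut off
print(q, remaining_objects(objects, q))
```

Because the resume point is an object position rather than a byte
offset, the server is free to encode the remaining objects differently
the second time.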
> As Nico said in the presence of threaded packing ordering of _objects_
> on _packfile_ might be not deterministic.
Yea, ick. I haven't looked at the threaded code in enough detail
to know how it behaves. But from what I read in discussion on the
list it really makes it impossible to get a stable ordering because
the delta base selected for an object can differ depending on which
thread handled that object, and if OFS_DELTA is being used then the
base must go before the delta, making the order somewhat determined
by which thread handled which object.
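
A toy model of that effect, not git's actual writer: write_order() below
is a hypothetical helper that emits objects in a fixed candidate order
but always writes a delta's base first. Feeding it two different base
selections for the same objects, as two threaded runs might produce,
yields two different orders.

```python
from typing import Dict, List, Optional, Set


def write_order(candidates: List[str], base_of: Dict[str, Optional[str]]) -> List[str]:
    """Emit objects in candidate order, writing each delta's base before it."""
    written: List[str] = []
    seen: Set[str] = set()

    def emit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        base = base_of.get(name)
        if base is not None:
            emit(base)  # the base must precede the delta
        written.append(name)

    for name in candidates:
        emit(name)
    return written


candidates = ["A", "B", "C"]
run1 = {"A": None, "B": "A", "C": "B"}  # one possible delta selection
run2 = {"A": "C", "B": "A", "C": None}  # another selection, same objects
print(write_order(candidates, run1))
print(write_order(candidates, run2))
```

With the first selection the order is A, B, C; with the second it is
C, A, B, so a position in the list no longer means the same thing on a
later run.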
IIRC, my proposal predates the introduction of the threaded delta
code. Now that we have threaded delta code as the default on many
platforms... yea, this is likely *not* a good project for someone
who is new to Git. It's become a lot more difficult.
> I'll try to add 'pack file cache for git-daemon' proposal to
> GSoC2009Ideas page... but I cannot be mentor (or even co-mentor) for
> this idea.
The pack file cache project is likely easier than restarting a
pack file. Especially in the face of the threaded delta code.
There are difficult details about making the cache secure so we can't
overwrite repository data due to a buffer overflow. Or making
the cache prune itself so it doesn't run out of disk. Etc.
We've talked about a cache before on list.
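
For a feel of the pruning side, here is a small sketch. Everything in it
is an assumption made for illustration: the cache key is derived from
the sorted want/have lists, packs are held in memory rather than on
disk, and eviction is least-recently-used against a byte budget.
PackCache and cache_key are invented names, not anything in git-daemon.

```python
import hashlib
from collections import OrderedDict
from typing import Iterable, Optional


def cache_key(wants: Iterable[str], haves: Iterable[str]) -> str:
    """Key a cached pack by the request; ordering and duplicates are ignored."""
    h = hashlib.sha1()
    for label, names in (("want", wants), ("have", haves)):
        for name in sorted(set(names)):
            h.update(f"{label} {name}\n".encode())
    return h.hexdigest()


class PackCache:
    """Toy in-memory cache that evicts least-recently-used packs."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.total = 0
        self._packs: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        pack = self._packs.get(key)
        if pack is not None:
            self._packs.move_to_end(key)  # mark as recently used
        return pack

    def put(self, key: str, pack: bytes) -> None:
        if key in self._packs:
            self.total -= len(self._packs.pop(key))
        self._packs[key] = pack
        self.total += len(pack)
        # Prune the oldest entries until the cache fits the budget again.
        while self.total > self.max_bytes and self._packs:
            _, evicted = self._packs.popitem(last=False)
            self.total -= len(evicted)


cache = PackCache(max_bytes=10)
cache.put(cache_key(["b"], ["a"]), b"12345")
cache.put(cache_key(["c"], ["a"]), b"12345")
cache.get(cache_key(["b"], ["a"]))         # touch the first entry
cache.put(cache_key(["d"], ["a"]), b"123")  # evicts the "want c" pack
print(cache.total)
```

The security concerns mentioned above (keeping cache files separate from
repository data) are outside what this sketch tries to show.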
On a related note, I remember I wrote a patch that saved packs during
"git push", before we added "git gc --auto", as a crude attempt to
incrementally repack a repository during other operations.
--
Shawn.
* Re: GSoC 2009 Prospective student
2009-02-24 15:55 ` Shawn O. Pearce
@ 2009-02-24 21:08 ` Jakub Narebski
2009-02-24 21:17 ` Nicolas Pitre
0 siblings, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2009-02-24 21:08 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git
On Tue, 24 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail•com> wrote:
> >
> > I (and Nicolas) by 'sorting order' mean here ordering of objects and
> > deltas in the pack file, i.e. whether we get _exactly_ the same (byte
> > for byte) packfile for the same want/have exchange (your proposal), or
> > even for the same arguments to git-pack-objects (which is a necessary,
> > although I think not sufficient condition).
>
> I know.
>
> My proposal though didn't require the same byte-for-byte pack file.
> Only that the objects were in a predictable order. It didn't permit
> resuming in the middle of an object. If the last object in the pack
> was truncated the client would resume by getting that object again,
> and may get a different byte sequence for that object representation.
Ah, so you meant skipping the first N _objects_, and not the first N
_bytes_ of a re-generated pack. That's better.
Although in the case when packfiles are cached, I think you can support
resuming at a byte offset. But I guess only in that case (where exactly
the same packfile, byte for byte, is resent / reused).
>
> Its a b**ch to know where you stopped though, as you could be in
> a long string of deltas whose base is in the portion you didn't
> yet receive. Which means you can't identify that string that you
> already have, and pack-objects on resume can't assume you have
> those objects, because you only have the deltas for them and are
> lacking a way to restore them.
Moreover, from what I understand, the want/have exchange is about
_commits_, and it assumes that if you 'have' a commit, you have all
its ancestors, and all trees (including those of ancestors), and all
blobs (including those of ancestors). So the problem is not only
deltas without a base.
Besides, if I remember correctly we always write the base before the
delta; or am I mistaken here?
But one could take a look at the patches (present in the git mailing
list archive) which tried to add 'lazy clone' / 'remote alternates'
support. IIRC there was a 'haveonly' extension to the exchange protocol,
which was meant to say that you have (in full) only the given object,
but not necessarily its prerequisites. Then you could filter out those
'haveonly' objects from the list of objects to pack fed to
git-pack-objects, couldn't you?
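
To spell out the difference, here is a small sketch over a made-up
object graph (commits pointing at parents and trees, trees pointing at
blobs). It reads 'have' as "I have this and everything reachable from
it" and 'haveonly' as "I have exactly this object"; the latter is my
recollection of that patch series, so take the semantics as an
assumption. The helper names are invented for the example.

```python
from typing import Dict, Iterable, List, Set

Graph = Dict[str, List[str]]


def reachable(graph: Graph, tips: Iterable[str]) -> Set[str]:
    """All objects reachable from the given tips, tips included."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(graph.get(name, ()))
    return seen


def objects_to_pack(graph: Graph, wants: Iterable[str], haves: Iterable[str],
                    haveonly: Iterable[str] = ()) -> List[str]:
    """'have' excludes a whole closure; 'haveonly' excludes single objects."""
    needed = reachable(graph, wants) - reachable(graph, haves)
    return sorted(needed - set(haveonly))


graph: Graph = {
    "c3": ["c2", "t3"], "c2": ["c1", "t2"], "c1": ["t1"],
    "t3": ["b3"], "t2": ["b2"], "t1": ["b1"],
}
print(objects_to_pack(graph, wants=["c3"], haves=["c1"]))
print(objects_to_pack(graph, wants=["c3"], haves=["c1"], haveonly=["b2"]))
```

The second call drops only b2 from the list, without implying anything
about t2 or c2, which is what a partially received clone would need.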
>
> > Can we assume that packfiles are named correctly (i.e. name of packfile
> > match SHA-1 footer)?
>
> Wrong.
>
> The hash in "pack-$hash.pack"/"pack-$hash.idx" is *not* the 20 byte
> SHA-1 footer. Its the 20 byte SHA-1 of the sorted object names who
> are in that pack.
>
> We should try not to assume that the pack's file name matches the
> sorted object names, but we can assume that the pack file name is
> "pack-$hash.pack" where $hash is a 40 character hexadecimal string.
> The dumb commit walkers already have this restriction built into
> them, and have for quite some time.
>
> Any pack writers, including fast-import, honor this naming standard
> in order to ensure they are compatible with the existing dumb
> commit walkers.
Ah. So it is a _bit_ harder (for "dumb" protocols) than I thought.
Still much easier than resumable clone for smart (pack generating)
protocols.
>
> > Therefore I think that restartable clone for "dumb" (commit walker)
> > protocols is easy GSoC project, while restartable clone for "smart"
> > (generate packfile) protocols is at least of medium difficulty, and
> > might be harder.
>
> Probably quite right. Unfortunately the majority of the git
> repositories out there are served with the smart protocol, because
> it is more efficient. :)
A long, long time ago the rsync:// protocol was recommended for the
initial clone. It has the serious disadvantage of possibly returning a
silently corrupted repository, as it doesn't ensure that references and
objects are fetched in the correct sequence; it is thus deprecated, and
support for it has bit-rotted ;) in places...
I wonder if it is possible to make rsync:// more robust...
[...]
> > I'll try to add 'pack file cache for git-daemon' proposal to
> > GSoC2009Ideas page... but I cannot be mentor (or even co-mentor) for
> > this idea.
>
> The pack file cache project is likely easier than restarting a
> pack file. Especially in the face of the threaded delta code.
>
> There are difficult details about making the cache secure so we can't
> overwrite repository data due to a buffer overflow. Or making
> the cache prune itself so it doesn't run out of disk. Etc.
> We've talked about a cache before on list.
Well, this is a _cache_. OTOH having a pack cache would make it easy to
have resumable clone if you hit one of the cached packfiles on resume...
On the other hand I wonder what improvement it would give, as generating
packs with delta reuse is, I think, quite fast...
--
Jakub Narebski
Poland
* Re: GSoC 2009 Prospective student
2009-02-24 21:08 ` Jakub Narebski
@ 2009-02-24 21:17 ` Nicolas Pitre
0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-02-24 21:17 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Shawn O. Pearce, Miklos Vajna, Rohan Dhruva, git
On Tue, 24 Feb 2009, Jakub Narebski wrote:
> Well, this is _cache_. OTOH having pack cache would make it easy to have
> resumable clone if you hit one of cached packfiles on resume...
>
> On the other hand I wonder what improvements it would give, as generating
> packs with delta reuse is, I think, quite fast...
Object enumeration is still an issue. A cache would allow skipping that
part as well, making cached packs about the same as a simple file
server.
Nicolas
Thread overview: 14+ messages
2009-02-22 19:58 GSoC 2009 Prospective student Rohan Dhruva
2009-02-22 20:07 ` Sverre Rabbelier
2009-02-22 20:29 ` Rohan Dhruva
2009-02-22 20:38 ` Sverre Rabbelier
2009-02-22 20:43 ` Miklos Vajna
2009-02-22 22:22 ` Nicolas Pitre
2009-02-23 0:46 ` Sitaram Chamarty
2009-02-23 15:37 ` Jakub Narebski
2009-02-23 15:58 ` Shawn O. Pearce
2009-02-23 16:31 ` Nicolas Pitre
2009-02-24 15:38 ` Jakub Narebski
2009-02-24 15:55 ` Shawn O. Pearce
2009-02-24 21:08 ` Jakub Narebski
2009-02-24 21:17 ` Nicolas Pitre