* [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
@ 2026-03-13 10:17 Pablo
2026-03-14 5:58 ` Chandra Pratap
2026-03-16 16:05 ` [GSoC v3] " Pablo Sabater
0 siblings, 2 replies; 10+ messages in thread
From: Pablo @ 2026-03-13 10:17 UTC (permalink / raw)
To: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana, Chandra Pratap
## Synopsis
This project finishes Eric Ju's work on `remote-object-info` for `git
cat-file --batch-command` [1], resolves the pending feedback from
Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
`%(objecttype)`.
Expected project size: 350 hours (Medium)
## About Me and Contact
Name: Pablo Sabater Jiménez (he/him)
Age: 19
Education: Second-year Computer Science student at the University of
Murcia, Spain
Location: Murcia, Spain (CET, UTC+1)
Languages: C (solid), shell (bash) (good)
Tools: Git (proficient)
I've checked that I'm eligible for GSoC 2026.
Email: pabloosabaterr@gmail•com
GitHub: https://github.com/pabloosabaterr
## Relevant Projects
- 16 bit CPU emulator. Good example of C programming.
cpu: https://github.com/pabloosabaterr/CPU16
- Compiler. Good example of working on bigger projects.
compiler: https://github.com/pabloosabaterr/Orn
## Pre-GSoC Work
### Introduction
**[GSoC] Introduction Pablo Sabater**
https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com
A mailing list thread where I introduced myself to the git community.
### Microproject
**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**
https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/
Merged to `next` on 2026-03-12 at 8500bdf172. Replaces `test -f` with
the `test_path_is_file` helper, whose better failure reporting makes
failing tests easier to debug.
This was one of the suggested microprojects.
### Other contributions
**[GSoC PATCH v2] test-lib: print escape sequence names**
https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/
Marked as "Will merge to `next`". When a failed expected/actual check
was printed, escape sequences were shown as their octal codes. This
patch prints the actual escape sequence name instead, adds tests, and
updates the expected output.
**[GSoC PATCH] t9200: handle missing CVS with skip_all**
https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
Merged to `next` on 2026-03-12 at 8500bdf172. Wraps the CVS setup in a
`skip_all` check for clearer failure reporting and moves Git
initialization into its own `test_expect_success`.
**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
While testing Eric's v11, I found and reported a new bug: when
`remote-object-info` is preceded by a local query, `data->type` isn't
cleared, so the remote query returns the wrong type.
I have also studied the provided documentation and Eric Ju's work from
v0 to v11, including all the feedback he received from Junio Hamano and
Jeff King up to March 2025, and took notes on what is left to be done
and what else I can contribute to the proposed project. That is how I
identified everything I address in the Problem, Solution and Timeline
sections.
I built Eric Ju's v11 and tested the bugs reported against his patch [5].
I confirmed the segfault and the `die()`, and found a new bug:
- When a local `info` runs before `remote-object-info` with the same
format string, `data->type` isn't cleared. For a blob queried
remotely after a local commit query, `data->type` stays 'commit' and
no error is reported. I reported it on the mailing list [6].
I tried rebasing Eric Ju's v11 onto `master` and got conflicts in 4 of
the 8 commits:
- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
  - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
  - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
  - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
  - `object-file.c`, `object-store-ll.h` (deleted).
I'm active on the mailing list, learning Git's workflow and learning
from the feedback the maintainer (Junio) has given on my patches.
Following the project guidelines, I haven't done anything on the
project that could step on other candidates' work before being
accepted, and instead I'm focusing on understanding the project and
its needs, and independent patches that will make the Git project more
familiar and understandable to me.
## Availability
My classes end the first week of May. From then until September I
won't have any classes, which leaves me free to focus fully on the
project. I can dedicate 8+ hours a day, and at least 40 hours a week.
## The Problem
Git's partial clone allows cloning repositories without downloading
all objects (blobs, trees, ...). These objects are fetched on demand
from the remote when needed. However, when a user needs metadata about
these remote objects (size, type, hash, ...), Git has no efficient way
of doing this without downloading all the object content.
The server side support for `object-info` protocol was implemented by
Calvin Wan in 2021. Eric Ju built the client-side `remote-object-info`
for `cat-file --batch-command`. Eric Ju's work remains unmerged after
v11 because of these issues:
- The format validation uses `strstr()`, which only checks for
`%(objectsize)`. This causes two different errors:
  - For atoms that `expand_atom()` recognizes but the remote doesn't
  support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1,
  but `data->type` contains garbage when it is read, causing a
  segfault, as Jeff King noted [3].
  - For atoms that `expand_atom()` doesn't recognize, it returns 0, so
  `expand_format()` calls `strbuf_expand_bad_format()`, which calls
  `die()`, as Jeff King found [3].
Both cases abort the command, including local `info` queries that
share the same format string. Unsupported remote placeholders should
instead expand to an empty string, matching how `for-each-ref` returns
an empty string for known but inapplicable atoms such as `%(tagger)`
on non-tags [4] [5].
- When local and remote queries are mixed, `data->type` is not cleared
between commands, so `remote-object-info` returns the stale type left
by a previous local query [6].
- Style and code issues raised by Junio Hamano [2] and Jeff King [3]
[5] are still unaddressed:
  - comment style.
  - `#define` formatting.
  - line length.
  - misleading error messages.
  - missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
  - if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either client or server side.
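The stale `data->type` issue from the list above can be pictured with a small standalone sketch. Everything here is hypothetical and only mirrors the idea: the `sketch_*` names are not Git's, and Git's real `expand_data` and `type_name()` are more involved. The point is that resetting the type before each command, and treating an unset type as an empty string, keeps a previous local query from leaking into a remote one.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Git's object type enum and expand_data. */
enum sketch_object_type {
	SKETCH_OBJ_BAD = -1,
	SKETCH_OBJ_COMMIT = 1,
	SKETCH_OBJ_TREE = 2,
	SKETCH_OBJ_BLOB = 3,
	SKETCH_OBJ_TAG = 4
};

struct sketch_expand_data {
	enum sketch_object_type type;
	unsigned long size;
};

/* Reset per-object fields so nothing leaks from the previous command. */
static void sketch_reset(struct sketch_expand_data *data)
{
	data->type = SKETCH_OBJ_BAD;
	data->size = 0;
}

/* Return a type name, or NULL when the type is unset or invalid. */
static const char *sketch_type_name(enum sketch_object_type type)
{
	switch (type) {
	case SKETCH_OBJ_COMMIT: return "commit";
	case SKETCH_OBJ_TREE: return "tree";
	case SKETCH_OBJ_BLOB: return "blob";
	case SKETCH_OBJ_TAG: return "tag";
	default: return NULL;
	}
}

/* Expand to "" instead of dereferencing NULL when no type is known. */
static const char *sketch_type_or_empty(const struct sketch_expand_data *data)
{
	const char *name = sketch_type_name(data->type);
	return name ? name : "";
}
```

In this model, a caller would invoke `sketch_reset()` at the start of every `info` or `remote-object-info` command, so a type set by one command is never visible to the next.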
## The Solution
There are two main goals:
### Goal 1: Rebase and finish Eric's work
Starting from where Eric Ju left off, I will rebase his series on top
of the current `master` branch and address the outstanding feedback:
- Fix style in comments, `#define` formatting and line length.
- Fix misleading error message in the overflow check.
- Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert if/else on `get_remote_info()` to keep the small block first
(the error one) as Junio suggested.
#### Replace `strstr()` format validation with allow_list in `expand_atom()`
`strstr()` isn't enough to validate the placeholders: it only
searches for `%(objectsize)`, so unsupported placeholders cause
segfaults. The fix is to replace that validation with an `allow_list`
in `expand_atom()`. Jeff King suggested either `expand_atom()` or
`expand_format()` [4]; I chose `expand_atom()` for this reason:
- There are two failure cases: the first happens inside
`expand_atom()` before it returns (the segfault), and the second
happens when `expand_atom()` returns 0 and `expand_format()` calls
`die()`. Placing the `allow_list` check at the top of `expand_atom()`
prevents both: in remote mode, an unsupported atom appends nothing to
`sb` and returns 1, so `data->type` is never read and
`expand_format()` never reaches `die()`.
As extra safety, initializing `data->type` to `OBJ_BAD` and checking
for `NULL` from `type_name()` ensures that, even without the
`allow_list`, uninitialized data doesn't cause a segfault.
In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
`allow_list`. Goal 2 adds `%(objecttype)` support.
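A minimal standalone sketch of the allow-list idea, under stated assumptions: the function and array names below are hypothetical, and the output goes to a plain `char` buffer rather than Git's `strbuf`, so this is not the code of `expand_atom()`. It only shows the control flow described above: in remote mode, an atom outside the allow-list expands to nothing and is still reported as handled.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical allow-list of atoms the remote can answer in Goal 1. */
static const char *const remote_allowed_atoms[] = {
	"objectname",
	"objectsize",
};

/* Return 1 if `atom` (of length `len`) is in the allow-list, else 0. */
static int remote_atom_is_allowed(const char *atom, size_t len)
{
	size_t n = sizeof(remote_allowed_atoms) / sizeof(remote_allowed_atoms[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		const char *known = remote_allowed_atoms[i];
		if (strlen(known) == len && !memcmp(known, atom, len))
			return 1;
	}
	return 0;
}

/*
 * Simplified stand-in for the remote branch of an atom expander.
 * Always returns 1 ("handled"): unsupported atoms expand to an empty
 * string, so the caller never sees a failure and never reads fields
 * the remote did not fill in. `outsz` must be at least 1.
 */
static int expand_remote_atom_sketch(char *out, size_t outsz,
				     const char *atom, size_t len,
				     const char *oid_hex, unsigned long size)
{
	out[0] = '\0';
	if (!remote_atom_is_allowed(atom, len))
		return 1;
	if (len == strlen("objectname") && !memcmp(atom, "objectname", len))
		snprintf(out, outsz, "%s", oid_hex);
	else
		snprintf(out, outsz, "%lu", size);
	return 1;
}
```

With this shape, supporting a new placeholder in Goal 2 would amount to adding one entry to the list and one branch that prints the new field.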
### Goal 2: Adding `%(objecttype)`
Following what Calvin Wan did in 2021 for `%(objectsize)`, the v2
protocol needs to be extended on the server side to support the new
`%(objecttype)` placeholder:
- extend `object_info_advertise()` in `serve.c`
- add `.type` to the `requested_info` struct in `protocol-caps.c`
- support `type` in `cap_object_info()` in `protocol-caps.c`
- handle `type` in `send_info()` in `protocol-caps.c`
Following the object-info protocol docs [7], the response should look like:
```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
*PKT-LINE(obj-info LF)
```
`%(objecttype)` needs to be added to the `allow_list`. Client side
needs to learn to ask for `%(objecttype)` from remote, parse what has
been received and fill `expand_data` with the actual type. This makes
it return the object type instead of the empty string returned while
it was unsupported.
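To make the client-side parsing step concrete, here is a rough sketch of reading one `obj-info` line in the proposed `obj-id SP obj-size SP obj-type` layout. It is an illustration under assumptions, not Git code: the struct, enum and function names are hypothetical, the line is assumed to arrive with its trailing LF already stripped, and the object ID is only copied, not validated as hex.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type codes for this sketch (not Git's enum object_type). */
enum sketch_type { SK_BAD = -1, SK_COMMIT, SK_TREE, SK_BLOB, SK_TAG };

struct sketch_info {
	char oid[65];		/* room for a SHA-256 hex string plus NUL */
	unsigned long size;
	enum sketch_type type;
};

static enum sketch_type sketch_type_from_string(const char *s)
{
	if (!strcmp(s, "commit")) return SK_COMMIT;
	if (!strcmp(s, "tree")) return SK_TREE;
	if (!strcmp(s, "blob")) return SK_BLOB;
	if (!strcmp(s, "tag")) return SK_TAG;
	return SK_BAD;
}

/* Parse "<oid> <size> <type>"; return 0 on success, -1 on malformed input. */
static int parse_obj_info_line(const char *line, struct sketch_info *out)
{
	const char *sp = strchr(line, ' ');
	char *end;
	size_t oid_len;

	if (!sp)
		return -1;
	oid_len = (size_t)(sp - line);
	if (!oid_len || oid_len >= sizeof(out->oid))
		return -1;
	memcpy(out->oid, line, oid_len);
	out->oid[oid_len] = '\0';

	/* Require a digit so strtoul() cannot accept a sign or whitespace. */
	if (!isdigit((unsigned char)sp[1]))
		return -1;
	errno = 0;
	out->size = strtoul(sp + 1, &end, 10);
	if (errno || *end != ' ')
		return -1;

	out->type = sketch_type_from_string(end + 1);
	return out->type == SK_BAD ? -1 : 0;
}
```

A line that carries only `obj-id SP obj-size`, as an old server would send, is rejected by this strict parser; a real client would only expect the third field after it had requested `type`.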
Default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
Test and document new placeholder support and server side extension.
#### Backward Compatibility
There are four possible scenarios between client and server:
1. The server doesn't know `type` (new client, old server):
After receiving the server capabilities, the client only requests
what the server advertises. The `allow_list` handles this by
returning an empty string when the server doesn't support `type`.
2. The server knows `type` but the client doesn't (new server, old client):
`gitprotocol-v2.adoc` says "Clients must ignore all unknown
keys", so the client ignores `type` and requests only the
capabilities it knows.
3. Both know `type` (new client, new server):
The server advertises `type`, the client requests it and gets the
type data.
4. Both know `type` but protocol middleware doesn't (new client, new
server, old middleware):
If the server advertises `type` but the client doesn't receive the
advertisement, the client won't ask for anything unadvertised. If the
client asks for `type` but the server doesn't receive that request,
the server returns only the known capabilities.
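The rule shared by these scenarios, "request only what was advertised and ignore what you don't recognize", can be sketched as a bitmask intersection. This is a toy model under loud assumptions: the advertised attributes are represented as a space-separated string purely for illustration, which is not claimed to be the wire format, and all names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute bits for this sketch. */
#define SKETCH_ATTR_SIZE (1u << 0)
#define SKETCH_ATTR_TYPE (1u << 1)

/*
 * Turn a space-separated attribute list into a bitmask.
 * Unknown words are skipped, mirroring "ignore all unknown keys".
 */
static unsigned sketch_parse_advertised(const char *adv)
{
	unsigned mask = 0;

	while (*adv) {
		size_t len = strcspn(adv, " ");

		if (len == 4 && !memcmp(adv, "size", 4))
			mask |= SKETCH_ATTR_SIZE;
		else if (len == 4 && !memcmp(adv, "type", 4))
			mask |= SKETCH_ATTR_TYPE;
		adv += len;
		while (*adv == ' ')
			adv++;
	}
	return mask;
}

/* The client asks only for attributes it wants AND the server advertised. */
static unsigned sketch_attrs_to_request(unsigned wanted, const char *adv)
{
	return wanted & sketch_parse_advertised(adv);
}
```

Any attribute that drops out of the intersection would then expand to an empty string on the client, which is the behavior described for scenario 1.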
**Performance considerations**
Getting an object's type only requires reading its header. To get the
size, `oid_object_info()` in `object-file.c` is already being called,
and it returns the object type in the same call. Sending the type
string costs at most 6 bytes per object (the "commit" string), plus
the separating space.
## Timeline
I've planned this timeline with some slack, so the actual work may
take less time than stated here.
May 1-24: Community Bonding
- Meet with my assigned mentor to get feedback on my proposal, agree
on how I will report progress beyond the submitted code (possibly
through blog posts), and get tips for working effectively on Git.
- Confirm with my mentor that the `allow_list` approach is still the best option.
- Draft the commit structure.
Week 1-2: (May 26 - June 8)
- Rebase Eric Ju's v11 on top of current `master`.
- Work on style fixes: comments, `#define` formatting, line length.
- Fix the wrong error message in the overflow check.
- Add missing check `count > MAX_ALLOWED_OBJ_LIMIT` after `split_cmdline()`.
- Invert if/else in `get_remote_info()`.
- Send first patch.
Week 3-4: (June 9 - June 22)
- Implement `allow_list` in `expand_atom()` using `is_atom()` in remote-mode.
- Initialize `data->type` to `OBJ_BAD` and add null check at `type_name()`.
- Implement empty string return for unsupported placeholders.
- Tests for supported placeholders, unsupported, mix, and the intermix
case `info` + `remote-object-info` with the same format string.
- Address feedback on the first patch.
Week 5-6: (June 23 - July 6)
- Continue addressing review feedback.
- Goal 1 should be polished or close to its final form.
- Prepare the midterm report.
Midterm evaluation (July 7 - 11) as specified in the GSoC timeline docs
- Goal 1 submitted; keep addressing feedback.
Week 7-8: (July 14 - July 27)
- Begin Goal 2.
- Extend server side v2 protocol to serve `%(objecttype)`, following
`%(objectsize)` structure.
- Test server side.
Week 9-10: (July 28 - August 10)
- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Extend client side to ask for `%(objecttype)` from remote on `object-info`.
- Parse server answer and fill `expand_data` with the actual type.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.
Week 11-12: (August 11 - August 24)
- Address feedback on the Goal 2 patches.
- Polish everything: all tests pass, good test coverage, no
style/comment mistakes.
- Final documentation review.
- Prepare for final evaluation.
Final evaluation (August 18-24) as specified on GSoC timeline docs
### Additional objectives
If there is enough time, or as future work after the project, I have
some ideas on how this could evolve:
#### More placeholders support
I've checked that Eric's v11 patch only supports `%(objectsize)` on
the server side, but the client side has other placeholders that could
be supported too. With the `allow_list` and Goal 2 in place, adding
more placeholders becomes much simpler.
- `%(objectsize:disk)`: Returns the size on disk (compressed or as
a delta) instead of the uncompressed size that `%(objectsize)`
returns. To do this, the server would need to send the on-disk size.
- `%(deltabase)`: Returns the delta base object OID. Non-delta objects
return the zero OID, as they do locally.
#### Returning the missing blobs of a tree, ordered by size
In a partial clone, someone might want to know which blobs are missing
inside a specific tree, and their sizes, before fetching them.
The idea is to build on top of `remote-object-info`:
given a tree hash, return the missing blobs inside that tree, ordered by size.
Thanks for reading my proposal and considering my application. I'm
very excited about this opportunity.
Pablo
[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
"Eric Ju's v11 patch"
[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
Hamano feedback"
[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
"Jeff King feedback"
[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
"options for strstr() by Jeff King"
[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
"Jeff King follow-up"
[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
"data->type not being cleared bug"
[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
"object-info protocol docs"
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file 2026-03-13 10:17 [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file Pablo @ 2026-03-14 5:58 ` Chandra Pratap 2026-03-14 18:31 ` Pablo 2026-03-16 16:05 ` [GSoC v3] " Pablo Sabater 1 sibling, 1 reply; 10+ messages in thread From: Chandra Pratap @ 2026-03-14 5:58 UTC (permalink / raw) To: Pablo Cc: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar, Siddharth Asthana Hi Pablo, On Fri, 13 Mar 2026 at 15:47, Pablo <pabloosabaterr@gmail•com> wrote: > > ## Synopsis > > This project finishes Eric Ju's work on `remote-object-info` for `git > cat-file --batch-command` [1], resolves the pending feedback from > Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for > `%(objecttype)`. > > Expected project size: 350 hours (Medium) > ## About Me and Contact > > Name: Pablo Sabater Jiménez (he/him) > > Age: 19 > > Education: Currently on my second Computer Science year at University > of Murcia, Spain > > Location: Murcia, Spain (CET, UTC+1) > > Languages: C (solid), shell(bash) (good) > > Tools: git(proficient) > > I've checked that I'm eligible for GSoC 2026. > > Email: pabloosabaterr@gmail•com > GitHub: https://github.com/pabloosabaterr > > ## Relevant Projects > > - 16 bit CPU emulator. Good example of C programming. > > cpu: https://github.com/pabloosabaterr/CPU16 > > - Compiler. Good example of working on bigger projects. > > compiler: https://github.com/pabloosabaterr/Orn > Thanks for your interest in contributing to Git this GSoC! > ## Pre-GSoC Work > > ### Introduction > > **[GSoC] Introduction Pablo Sabater** > > https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com > > A mailing list thread where I introduced myself to the git community. Nit: Could use a newline here. 
> ### Microproject > > **[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers** > > https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/ > > Merged to `next` on 2026-03-12 at 8500bdf172. Replaces `test -f` with > helper `test_path_is_file`, which makes debugging failing tests easier > with better reporting. > As suggested as microproject. > > ### Other contributions > > **[GSoC PATCH v2] test-lib: print escape sequence names** > > https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/ > > Will merge to `next`, in failed expected/actual checks printing, the > escape sequences were shown as their octal code. This patch fixes that > to print the actual escape sequence name, adds tests, and updates the > expected output. > > **[GSoC PATCH] t9200: handle missing CVS with skip_all** > > https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/ > > Merged to `next` on 2026-03-12 at 8500bdf172, wraps CVS setup in a > skip_all for clearer failure reporting and moves Git initialization > into its own test_expect_success. > > **[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command** > > https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ > > While testing Eric's v11 I've found and reported a new bug. On > `remote-object-info` when it's preceded by a local query, `data->type` > isn't being cleared. Causing it to return the wrong type. > > I have also studied the documentation provided and Eric Ju's work from > v0 to v11 including all the feedback he got up to March 2025, the > feedback he got from Junio Hamano and Jeff King, taking notes about > what's left to be done and what else I can contribute to the already > proposed project. That's how I've identified everything that I will > address on the Problem, Solution and Timeline sections. 
> > I built Eric Ju's v11 and tested the bugs reported to his patch [5], > I've confirmed the segfault and the `die()`, and found a new one: > - When a local `info` runs before `remote-object-info` sharing the > same format string, `data->type` isn't being cleared. A blob queried > remotely after a local commit, `data->type` for blob becomes 'commit' > with no error. I reported it on the mailing list [6]. > > I attempted to test rebasing Eric Ju's v11 to master and got conflicts > on 4 out of the 8 commits: > - `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh". > - `t/t1006-cat-file.sh` > - `d918f720d8` fetch-pack: refactor packet writing. > - `fetch-pack.c` > - `2daf9ed803` transport: add client support for object-info. > - `Makefile` > - `c3ba4afaf6` cat-file: add remote-object-info to batch-command. > - `object-file.c`, `object-store-ll.h` (deleted). > > I'm being active on the mailing list and learning the Git flow of work > and from the feedback I've received from the maintainers (Junio) from > my patches. > > Following the project guidelines, I haven't done anything on the > project that could step on other candidates' work before being > accepted, and instead I'm focusing on understanding the project and > its needs, and independent patches that will make the Git project more > familiar and understandable to me. Great work! It would help if you could split the description of your patches into Status, Description, Comments, etc. It helps a lot when reviewing the proposal. > > ## Availability > > My classes end the first week of May. From then until September I > won't have any classes which leaves me free to fully focus on the > project. I can dedicate 8+ hours each day, and for sure 40 hours a > week. > > ## The Problem > > Git's partial clone allows cloning repositories without downloading > all objects (blobs, trees, ...). These objects are fetched on demand > from the remote when needed. 
However, when a user needs metadata about > these remote objects (size, type, hash, ...), Git has no efficient way > of doing this without downloading all the object content. > > The server side support for `object-info` protocol was implemented by > Calvin Wan in 2021. Eric Ju built the client-side `remote-object-info` > for `cat-file --batch-command`. This part is likely more relevant in the 'Synopsis' section up top. It provides important context that helps the reader tune their expectations for the rest of the proposal. From my experience, a good rule of thumb when writing a proposal is to assume the reader doesn't know anything about the project or the problem it tackles beforehand. > Eric Ju's work remains unmerged after > v11 because of these issues: > > - The format validation uses `strstr()` which only checks for > `%(objectsize)`. This causes two different errors: > - Atoms that `expand_atom()` recognizes but the remote doesn't > (`objecttype`,`deltabase`, ...), `expand_atom()` returns 1, but when > accessing `data->type` it only contains garbage, causing segfault. as > Jeff King noted [3]. Grammar nit: should be 'garbage causing segfault, as Jeff King noted[3].' The sentence could also use some restructuring for better clarity. It is great that you've referenced the relevant discussion thread here. > - Unknown atoms by `expand_atom()`, returns 0, calling > `strbuf_expand_bad_format` on `expand_format()`, which calls `die()`, > as Jeff King found [3]. > Both cases block the command, including local `info` queries if the > same format string is shared. Unsupported remote placeholders should > return an empty string, matching how `for-each-ref` returns empty for > known, but inapplicable atoms like `%(tagger)` on non-tags [4] [5]. > > - When local and remote queries are mixed, `data->type` is not being > cleared between commands. `remote-object-info` returns the wrong type > data from a previous local query [6]. 
> You've mentioned the outstanding issues and their implications for the end user. Good work. > - Style and code issues marked by Junio Hamano [2] and Jeff King [3] > [5] are still undone. > - comment style. > - `#define` formatting. > - line length. > - misleading error messages. > - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline().` > - if/else invert at `get_remote_info()`. > - `%(objecttype)` is not yet supported on either client or server side. > > ## The Solution > > There are two main goals: > > ### Goal 1: Rebase and finish Eric's work > > Starting from where Eric Ju left off, I will rebase it on top of the > current `master` branch and address the feedback left to do: > - Fix style in comments, `#define` formatting and line length. > - Fix misleading error message in the overflow check. > - Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`. > - Invert if/else on `get_remote_info()` to keep the small block first > (the error one) as Junio suggested. > #### Replace `strstr()` format validation with allow_list in `expand_atom()` Nit: Could use a newline here. > > `strstr()` isn't enough to fully validate the placeholders, it only > searches for `%(objectsize)` and unsupported placeholders cause > segfaults. The fix is to refactor the validation with an allow_list in > `expand_atom()`. It is great if this is your idea, but if not, it would help to credit the person who suggested this and link to the relevant discussion, if applicable. > But why `expand_atom()` when Jeff King suggested > `expand_atom()` or `expand_format()` [4] ? > - There are two cases, first, inside `expand_atom()` before returning > (segfault) and second, calls `die()` when `expand_atom()` returns 0. > Placing the `allow_list` at the top of `expand_atom()` prevents both > errors, on remote mode, append nothing to `sb` and return 1, accessing > `data->type` won't cause segfault and prevents `expand_format()` from > reaching `die()`. 
> As extra safety, initializing `data->type` to `OBJ_BAD` and check > for `NULL` from `type_name()` makes it that even without `allow_list`, > uninitialized data doesn't cause a segfault. > At Goal 1, only `%(objectname)` and `%(objectsize)` will be in the > allow_list. Goal 2 will bring `%(objecttype)` support. > ### Goal 2: Adding `%(objecttype)` Nit: Newline here as well. > > following what Calvin Wan did in 2021 for `%(objectsize)`, v2 protocol Grammar nit: [F]ollowing. > needs to be extended on the server side to support the new > `%(objecttype)` placeholder: > - extend `object_info_advertise()` at `serve.c` > - add .type to `requested_info` struct at `serve.c` > - support `type` in `cap_object_info()` at `protocol-caps.c` > - look for type at `send_info()` at `protocol-caps.c` > > following object-info protocol docs [7] it should look like: Here as well. > ``` > attrs = "size" SP "type" > obj-type = "blob" | "tree" | "commit" | "tag" > obj-info = obj-id SP obj-size SP obj-type > info = PKT-LINE(attrs LF) > *PKT-LINE(obj-info LF) > ``` > > `%(objecttype)` needs to be added to the `allow_list`. Client side > needs to learn to ask for `%(objecttype)` from remote, parse what has > been received and fill `expand_data` with the actual type. This makes > it return the object type instead of the empty string returned while > it was unsupported. > > Default format evolves to `%(objectname) %(objecttype) %(objectsize)`. > Test and document new placeholder support and server side extension. > Makes sense. > #### Backward Compatibility > > There are four possible scenarios to happen between client and server: > 1. The server doesn't know type (new client but old server): > > After receiving the server capabilities, a client will only request > what the server advertises. The `allow_list` would handle this, > returning an empty string when the server doesn't support it. > 2. 
The server knows type but the client doesn't (new server but old client): > > Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown > keys", it will ignore type, and request only the known capabilities. > 3. Both know type (new client and new server): > > Server advertises type, client requests it and gets the type data. > 4. Both know type but protocol middleware doesn't (new client, new > server but old middleware): > > If a server advertises type but client doesn't receive type, a > client won't ask for anything unadvertised, if a client asks for type > but the server doesn't receive it, it will only return the known > capabilities. > This section makes sense as well, could use better formatting though. > **performance considerations** > > To get an object type, we have to look only at the header, to get the > size `oid_object_info()` at `object-file.c` is being called which > already returns the object type in the same call. Sending the string > with the type will only be, worst case scenario 6 bytes for the > "commit" string. > ## Timeline > Nit: newline. > I've designed this to work with enough time so final work can be > shorter than what's said here > > May 1-24: Community Bonding > - Talk and meet with mentor that I'm assigned with, to get feedback > about my proposal, how I will report my progress apart from the code > submitted and possible blogs, and tips and tricks to work better at > Git. > - Confirm with mentor that the `allow_list` approach is still the best option. > - Draft commits structure. It would also be helpful if you continue working on your patches that haven't been merged yet from your pre-GSoC efforts. The goal of Community Bonding Period is to interact with the wider community as much as possible, and what better way to do that other than engaging through patches. Also, GSoC/Git requires you to write weekly blog posts detailing your work, what's holding you back, etc. 
So it's good if you use this time to set up your blog, if you don't have one already. > > Week 1-2: (May 26 - June 8) > - Rebase Eric Ju's v11 on top of current `master`. > - Work on style fixes: comments, `#define` formatting, line length. > - Fix the wrong error message in the overflow check. > - Add missing check `count > MAX_ALLOWED_OBJ_LIMIT` after `split_cmdline()`. > - Invert if/else in `get_remote_info()`. These four points are specifics of how you're going to tackle the 'Style Issues' problem you mentioned above. I don't think there's any benefit in reiterating them here. A single 'Fix the style and code issues.' or something similar would be better. > - Send first patch. > > Week 3-4: (June 9 - June 22) > - Implement `allow_list` in `expand_atom()` using `is_atom()` in remote-mode. > - Initialize `data->type` to `OBJ_BAD` and add null check at `type_name()`. > - Implement empty string return for unsupported placeholders. > - Tests for supported placeholders, unsupported, mix, and the intermix > case `info` + `remote-object-info` with the same format string. > - Work with feedback from the first patch. Again, specifics of the implementation plan don't need reiteration. > > Week 5-6: (June 23 - July 6): > - Continue with review feedback. > - Goal 1 should be polished or close to the final form. > - Prepare the midterm report. > > Midterm evaluation (July 7 - 11) as specified on GSoC timeline docs > - Goal 1 submitted and keep work with feedback. You could probably dedicate this time to start working on Goal 2. Addressing feedback is something that occurs spontaneously and doesn't need dedicated slots in your timeline. > Week 7-8: (July 14 - July 27) > - Begin Goal 2. > - Extend server side v2 protocol to serve `%(objecttype)`, following > `%(objectsize)` structure. > - Test server side. > > Week 9-10: (July 28 - August 10) > - Add `%(objecttype)` to the `allow_list` from Goal 1. > - Extend client side to ask for `%(objecttype)` from remote on `object-info`. 
> - Parse server answer and fill `expand_data` with the actual type. > - End to end tests and documentation. > - Default format becomes `%(objectname) %(objecttype) %(objectsize)`. > - Send patch series. > > Week 11-12: (August 11 - August 24) > - Work with Goal 2 feedback from the patches. > - Polish everything, all tests pass, good test coverage, no > style/comment mistakes. > - Final documentation review. > - Prepare for final evaluation. > > Final evaluation (August 18-24) as specified on GSoC timeline docs > > ### Additional objectives > > If there is enough time, or for future work after the project. I've > some ideas on how this could evolve: > #### More placeholders support > I've checked that Eric's v11 patch only supports `%(objectsize)` on > server side, but on the client side there are other placeholders that > can be added too. with the `allow_list` and having Goal 2 implemented > adding more placeholders becomes trivial. > > - `%(objectsize:disk)`: Returns the size on the disk (compressed or as > a delta) instead of returning the uncompressed size that > `%(objectsize)` does. To do this, the server would need to send what's > the actual size on disk data. > > - `%(deltabase)`: Returns the delta base object OID. non delta objects > return zero OID as it does on local. > > #### Returning missing blobs from a tree ordered > In a partial clone, someone might want to know what blobs are missing > inside a concrete tree and their size before fetching them. > The idea is to build on top of `remote-object-info`: > Given a tree hash, return the missing blobs (inside that tree) ordered by size. > > Thanks for reading my proposal and considering my application. 
> I'm very excited about this opportunity,
> Pablo
>
> [1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
> "Eric Ju's v11 patch"
>
> [2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio
> Hamano feedback"
>
> [3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
> "Jeff King feedback"
>
> [4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
> "options for strstr() by Jeff King"
>
> [5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
> "Jeff King follow-up"
>
> [6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
> "data->type not being cleared bug"
>
> [7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
> "object-info protocol docs"

Overall, great work on the proposal so far! Other than a few stylistic
mishaps, the proposal looks pretty strong already.

You should upload your proposal on the GSoC website and add the link to
it here. The proposal can be then updated later as many times as you like.

Regards,
Chandra.

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 5:58 ` Chandra Pratap
@ 2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Pablo @ 2026-03-14 18:31 UTC (permalink / raw)
To: Chandra Pratap
Cc: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana

Hi Chandra, thanks a lot for the feedback! :)

> You should upload your proposal on the GSoC website and add the link to it here.
> The proposal can be then updated later as many times as you like.

GSoC proposals open on March 16th. For now I'll send my v2 here, and as
soon as I can I'll switch to the GSoC website and send the link to the
thread.

To avoid having you reread everything, this is what I've changed since v1:

- Moved the context explanation from The Problem to Synopsis, and
  Availability to below About Me and Contact.
- Split Pre-GSoC patches into status (for code patches) and
  description to improve readability.
- Added a code review and the proposal thread to the Pre-GSoC section.
- Added new lines where noted and fixed capitalization.
- Correctly credited Jeff King for the allow_list idea and added a new
  [8] for Calvin Wan's work.
- Community bonding now includes continuing patches and setting up a blog.
- Removed most of the duplicated iteration on the Timeline from The
  Problem (it feels a bit empty now, though).

Here is my v2 with the requested changes:

## Synopsis

Git's partial clone allows cloning repositories without downloading
all objects (blobs, trees, ...). These objects are fetched on demand
from the remote when needed. However, when a user needs metadata about
these remote objects (size, type, hash, ...), Git has no efficient way
of getting it without downloading the full object content.

The server side support for the `object-info` protocol was implemented
by Calvin Wan in 2021 [8]. Eric Ju built the client-side
`remote-object-info` for `cat-file --batch-command`.
This project finishes Eric Ju's work on `remote-object-info` for `git
cat-file --batch-command` [1], resolves the pending feedback from
Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
`%(objecttype)`.

Expected project size: 350 hours (Medium)

## About Me and Contact

Name: Pablo Sabater Jiménez (he/him)

Age: 19

Education: Currently in my second year of Computer Science at the
University of Murcia, Spain

Location: Murcia, Spain (CET, UTC+1)

Languages: C (solid), shell (bash) (good)

Tools: Git (proficient)

I've checked that I'm eligible for GSoC 2026.

Email: pabloosabaterr@gmail•com
GitHub: https://github.com/pabloosabaterr

## Availability

My classes end the first week of May. From then until September I
won't have any classes, which leaves me free to fully focus on the
project. I can dedicate 8+ hours each day, and a guaranteed 40 hours a
week.

## Relevant Projects

- 16-bit CPU emulator. Good example of C programming.

cpu: https://github.com/pabloosabaterr/CPU16

- Compiler. Good example of working on bigger projects.

compiler: https://github.com/pabloosabaterr/Orn

## Pre-GSoC Work

### Introduction

**[GSoC] Introduction Pablo Sabater**

https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com

**Description**: A mailing list thread where I introduced myself to
the Git community.

### Microproject

**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**

https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/

**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.

**Description**: Replaces `test -f` with the helper `test_path_is_file`,
which makes debugging failing tests easier with better reporting.
This was suggested as a microproject.

### Other contributions

**[GSoC PATCH v2] test-lib: print escape sequence names**

https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/

**Status**: Will merge to `next`.
**Description**: When a failed expected/actual check was printed, escape
sequences were shown as their octal codes. This patch prints the actual
escape sequence name instead, adds tests, and updates the expected
output.

**[GSoC PATCH] t9200: handle missing CVS with skip_all**

https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/

**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.

**Description**: Wraps CVS setup in a skip_all for clearer failure
reporting and moves Git initialization into its own
test_expect_success.

**Re: [PATCH] gc: add git maintenance list command**

https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/

**Description**: Code review of a submitted patch.

**[GSoC] Proposal: Complete and extend remote-object-info for git cat-file**

https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/

**Description**: Proposal draft thread.

**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**

https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/

**Description**: While testing Eric's v11 I found and reported a new
bug. When `remote-object-info` is preceded by a local query,
`data->type` isn't cleared, causing it to return the wrong type.

I have also studied the documentation provided and Eric Ju's work from
v0 to v11, including all the feedback he got up to March 2025 from
Junio Hamano and Jeff King, taking notes about what's left to be done
and what else I can contribute to the already proposed project. That's
how I've identified everything that I will address in the Problem,
Solution and Timeline sections.

I built Eric Ju's v11 and tested the bugs reported against his patch [5].
I've confirmed the segfault and the `die()`, and found a new one:
- When a local `info` runs before `remote-object-info` sharing the
  same format string, `data->type` isn't being cleared.
  For a blob queried remotely after a local commit, `data->type` for
  the blob becomes 'commit' with no error. I reported it on the mailing
  list [6].

I tried rebasing Eric Ju's v11 onto `master` and got conflicts on 4 out
of the 8 commits:
- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
  - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
  - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
  - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
  - `object-file.c`, `object-store-ll.h` (deleted).

I'm active on the mailing list, learning the Git workflow and learning
from the feedback I've received from the maintainer (Junio) on my
patches.

Following the project guidelines, I haven't done anything on the
project that could step on other candidates' work before being
accepted. Instead I'm focusing on understanding the project and its
needs, and on independent patches that make the Git project more
familiar and understandable to me.

## The Problem

Eric Ju's work remains unmerged after v11 because of these issues:

- The format validation uses `strstr()`, which only checks for
  `%(objectsize)`. This causes two different errors:
  - For atoms that `expand_atom()` recognizes but the remote doesn't
    support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1,
    but `data->type` contains only garbage when it is accessed, causing
    a segfault, as Jeff King noted [3].
  - For atoms unknown to `expand_atom()`, it returns 0, so
    `expand_format()` calls `strbuf_expand_bad_format()`, which calls
    `die()`, as Jeff King found [3].

  Both cases block the command, including local `info` queries if the
  same format string is shared. Unsupported remote placeholders should
  return an empty string, matching how `for-each-ref` returns empty for
  known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].

- When local and remote queries are mixed, `data->type` is not being
  cleared between commands.
  `remote-object-info` then returns stale type data left over from a
  previous local query [6].

- Style and code issues raised by Junio Hamano [2] and Jeff King [3]
  [5] are still unaddressed:
  - comment style.
  - `#define` formatting.
  - line length.
  - misleading error messages.
  - missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
  - if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either client or server side.

## The Solution

There are two main goals:

### Goal 1: Rebase and finish Eric's work

Starting from where Eric Ju left off, I will rebase his series on top of
the current `master` branch and address the remaining feedback:
- Fix style in comments, `#define` formatting and line length.
- Fix the misleading error message in the overflow check.
- Add the missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert the if/else in `get_remote_info()` to keep the small block first
  (the error one), as Junio suggested.

#### Replace `strstr()` format validation with allow_list in `expand_atom()`

`strstr()` isn't enough to fully validate the placeholders. It only
searches for `%(objectsize)`, and unsupported placeholders cause
segfaults. Jeff King noted [4] that the fix was to refactor the
validation with an allow_list in `expand_atom()` or `expand_format()`.
The best option is to place the validation in `expand_atom()`. Why
`expand_atom()`?
- There are two failure cases: the first happens inside `expand_atom()`
  before it returns (the segfault), and the second is the `die()` that
  follows when `expand_atom()` returns 0. Placing the `allow_list` at
  the top of `expand_atom()` prevents both. In remote mode, an atom that
  is not on the list appends nothing to `sb` and returns 1, so accessing
  `data->type` can no longer segfault and `expand_format()` never
  reaches `die()`.
  As extra safety, initializing `data->type` to `OBJ_BAD` and checking
  for a `NULL` return from `type_name()` ensures that, even without the
  `allow_list`, uninitialized data doesn't cause a segfault.
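
To make the control flow described above concrete, here is a minimal, self-contained sketch. It is not the actual patch and does not use Git's headers: `struct sketch_data`, `remote_allow_list`, `sketch_type_name()` and `sketch_expand_atom()` are hypothetical stand-ins for `expand_data`, the `allow_list`, `type_name()` and `expand_atom()`, and the output goes to a plain buffer instead of a `strbuf`. The enum values mirror Git's `enum object_type` numbering as an assumption of the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed to mirror a subset of Git's enum object_type. */
enum object_type {
	OBJ_BAD = -1,
	OBJ_NONE = 0,
	OBJ_COMMIT = 1,
	OBJ_TREE = 2,
	OBJ_BLOB = 3,
	OBJ_TAG = 4
};

/* Hypothetical, trimmed-down stand-in for cat-file's expand_data. */
struct sketch_data {
	int is_remote;         /* non-zero while handling remote-object-info */
	enum object_type type; /* initialized to OBJ_BAD, never left as garbage */
	unsigned long size;
	const char *oid_hex;
};

/* Atoms the remote can answer in Goal 1 (hypothetical table). */
static const char *const remote_allow_list[] = { "objectname", "objectsize" };

static int atom_is(const char *atom, size_t len, const char *name)
{
	return strlen(name) == len && !memcmp(atom, name, len);
}

static int remote_allowed(const char *atom, size_t len)
{
	size_t i;
	size_t n = sizeof(remote_allow_list) / sizeof(remote_allow_list[0]);

	for (i = 0; i < n; i++)
		if (atom_is(atom, len, remote_allow_list[i]))
			return 1;
	return 0;
}

/* Stand-in for type_name(): NULL for anything that is not a real type. */
static const char *sketch_type_name(enum object_type type)
{
	switch (type) {
	case OBJ_COMMIT: return "commit";
	case OBJ_TREE: return "tree";
	case OBJ_BLOB: return "blob";
	case OBJ_TAG: return "tag";
	default: return NULL;
	}
}

/*
 * Returns 1 when the atom was handled (possibly expanding to ""), and 0
 * for an unknown atom, which the caller would treat as a fatal error.
 */
static int sketch_expand_atom(char *out, size_t outsz, const char *atom,
			      size_t len, const struct sketch_data *data)
{
	assert(outsz > 0);
	out[0] = '\0';

	/* Allow list first: nothing below runs for unsupported remote atoms. */
	if (data->is_remote && !remote_allowed(atom, len))
		return 1;

	if (atom_is(atom, len, "objectname")) {
		snprintf(out, outsz, "%s", data->oid_hex);
	} else if (atom_is(atom, len, "objectsize")) {
		snprintf(out, outsz, "%lu", data->size);
	} else if (atom_is(atom, len, "objecttype")) {
		/* Extra safety: OBJ_BAD maps to NULL, which prints as "". */
		const char *name = sketch_type_name(data->type);
		snprintf(out, outsz, "%s", name ? name : "");
	} else {
		return 0;
	}
	return 1;
}
```

In this sketch, a remote query for `objecttype` or for an atom that does not exist returns 1 with an empty buffer, so neither the segfault path nor the `die()` path is reached. The same unknown atom in local mode still returns 0, which keeps the existing local behaviour unchanged.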
In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the
allow_list. Goal 2 will bring `%(objecttype)` support.

### Goal 2: Adding `%(objecttype)`

Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, the v2
protocol needs to be extended on the server side to support the new
`%(objecttype)` placeholder:
- extend `object_info_advertise()` at `serve.c`
- add `.type` to the `requested_info` struct at `serve.c`
- support `type` in `cap_object_info()` at `protocol-caps.c`
- look for type at `send_info()` at `protocol-caps.c`

Following the object-info protocol docs [7] it should look like:
```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
       *PKT-LINE(obj-info LF)
```

`%(objecttype)` needs to be added to the `allow_list`. The client side
needs to learn to ask for `%(objecttype)` from the remote, parse what
has been received and fill `expand_data` with the actual type. This
makes it return the object type instead of the empty string returned
while it was unsupported.

The default format evolves to `%(objectname) %(objecttype) %(objectsize)`.
Test and document the new placeholder support and the server side
extension.

#### Backward Compatibility

There are four possible scenarios between client and server:

1. **The server doesn't know type (new client but old server)**:

After receiving the server capabilities, a client will only request
what the server advertises. The `allow_list` would handle this,
returning an empty string when the server doesn't support it.

2. **The server knows type but the client doesn't (new server but old client)**:

As `gitprotocol-v2.adoc` states, "Clients must ignore all unknown
keys", so the client will ignore type and request only the known
capabilities.

3. **Both know type (new client and new server)**:

The server advertises type, the client requests it and gets the type
data.

4.
**Both know type but protocol middleware doesn't (new client, new
server but old middleware)**:

If a server advertises type but the client doesn't receive it, the
client won't ask for anything unadvertised. If a client asks for type
but the server doesn't receive the request, the server will only return
the known capabilities.

**Performance considerations**

Getting an object's type only requires looking at its header. To get
the size, `oid_object_info()` at `object-file.c` is already being
called, and it returns the object type in the same call. Sending the
type string costs, in the worst case, 6 bytes for the "commit" string.

## Timeline

I've designed this timeline with some slack, so the actual work may
take less time than what's stated here.

May 1-24: Community Bonding
- Keep working on my ongoing patches and new ones.
- Meet with my assigned mentor to get feedback on my proposal, agree on
  how I will report my progress beyond the submitted code and possible
  blog posts, and get tips and tricks for working better on Git.
- Confirm with the mentor that the `allow_list` approach is still the
  best option.
- Draft the commit structure.
- Set up a blog to keep track of how GSoC at Git is going.

Week 1-2: (May 26 - June 8)
- Start Goal 1 fixes.
- Fix style and code issues.

Week 3-4: (June 9 - June 22)
- Start with Goal 1 implementations (allow_list approach).

Week 5-6: (June 23 - July 6):
- Goal 1 should be polished or close to the final form.
- Send patch series for Goal 1.
- Start Goal 2.
- Prepare the midterm report.

**Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs
- Goal 1 submitted.

Week 7-8: (July 14 - July 27)
- Start with server side v2 protocol extension (`%(objecttype)`).

Week 9-10: (July 28 - August 10)
- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Client side extension.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.
Week 11-12: (August 11 - August 24)
- Goal 2 should be close to done.
- Polish everything: all tests pass, good test coverage, no
  style/comment issues.
- Final documentation review.
- Prepare for final evaluation.

**Final evaluation** (August 18-24) as specified on GSoC timeline docs

### Additional objectives

If there is enough time, or as future work after the project, I have
some ideas on how this could evolve:

#### More placeholder support

I've checked that Eric's v11 patch only supports `%(objectsize)` on the
server side, but on the client side there are other placeholders that
can be added too. With the `allow_list` and Goal 2 implemented,
adding more placeholders becomes trivial.

- `%(objectsize:disk)`: Returns the size on disk (compressed or as
  a delta) instead of the uncompressed size that `%(objectsize)`
  returns. To do this, the server would need to send the actual
  on-disk size.

- `%(deltabase)`: Returns the delta base object OID. Non-delta objects
  return the zero OID, as they do locally.

#### Returning missing blobs from a tree, ordered by size

In a partial clone, someone might want to know which blobs are missing
inside a particular tree, and their size, before fetching them.
The idea is to build on top of `remote-object-info`:
Given a tree hash, return the missing blobs (inside that tree) ordered by size.

Thanks for reading my proposal and considering my application.
I'm very excited about this opportunity,
Pablo

[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/
"Eric Ju's v11 patch"

[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/
"Junio Hamano feedback"

[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/
"Jeff King feedback"

[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/
"options for strstr() by Jeff King"

[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/
"Jeff King follow-up"

[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
"data->type not being cleared bug"

[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info
"object-info protocol docs"

[8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t
"Calvin Wan's patch series"

---

Again, thanks a lot for the feedback.

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 18:31 ` Pablo
@ 2026-03-15 9:20 ` Chandra Pratap
2026-03-16 11:21 ` Christian Couder
2026-03-16 21:38 ` Karthik Nayak
2 siblings, 0 replies; 10+ messages in thread
From: Chandra Pratap @ 2026-03-15 9:20 UTC (permalink / raw)
To: Pablo
Cc: git, christian.couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana

On Sun, 15 Mar 2026 at 00:01, Pablo <pabloosabaterr@gmail•com> wrote:
>
> Hi Chandra, thanks a lot for the feedback! :)
>
> > You should upload your proposal on the GSoC website and add the link to it here.
> > The proposal can be then updated later as many times as you like.
>
> GSoC proposals opens March 16th, for now I'll send my v2 here and as
> soon as I can I'll swap to GSoC website and send the link to the
> thread.

I don't think you need to do this, just make sure you include the link
when you send your revised proposals in the future.

> To avoid having you reread everything again this is what I've done from v1:
>
> Moved context explanation from The Problem to Synopsis and
> Availability below About Me and Contact.
> Split Pre-GSoC patches into status (for code patches) and
> description to improve readability.
> Added a code review and proposal thread to the Pre-GSoC section.
> Added new lines where noted and fixed capitalization.
> Correctly credited Jeff King for the allow_list idea and added new
> [8] for Calvin Wan's work.
> Community bonding now includes continuing patches and setting up a blog.

Quickly skimmed over the new proposal and it definitely looks better
now. Great job!

> Removed most of the duplicated iteration on the Timeline from The
> Problem. (feels a bit empty now tho).

This is fine because you've already discussed the relevant details in
earlier sections.
You could think of fleshing it out with new information, but duplicating
details just for the sake of a 'fuller' proposal waters down the impact
of the rest of your work. There isn't a word count requirement after all :)

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
@ 2026-03-16 11:21 ` Christian Couder
2026-03-16 21:38 ` Karthik Nayak
2 siblings, 0 replies; 10+ messages in thread
From: Christian Couder @ 2026-03-16 11:21 UTC (permalink / raw)
To: Pablo
Cc: Chandra Pratap, git, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana

Hi Pablo,

On Sat, Mar 14, 2026 at 7:31 PM Pablo <pabloosabaterr@gmail•com> wrote:

> #### Backward Compatibility
>
> There are four possible scenarios to happen between client and server:
>
> 1. **The server doesn't know type (new client but old server)**:
>
> After receiving the server capabilities, a client will only request
> what the server advertises. The `allow_list` would handle this,
> returning an empty string when the server doesn't support it.

This is not very clear and maybe answering the following questions
could help clarify:

1) What is returning an empty string? Is it the `allow_list`, the
client, the server or something else?
2) And what is actually reported to the user (an error, a warning, nothing)?
3) Also is it what is implemented in Eric's v11, or what you suggest
implementing?

> 2. **The server knows type but the client doesn't (new server but old client)**:
>
> Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown
> keys", it will ignore type, and request only the known capabilities.

Questions 2) and 3) above might be relevant here too.

> 3. **Both know type (new client and new server)**:
>
> Server advertises type, client requests it and gets the type data.
>
> 4. **Both know type but protocol middleware doesn't (new client, new
> server but old middleware)**:
>
> If a server advertises type but client doesn't receive type, a
> client won't ask for anything unadvertised, if a client asks for type
> but the server doesn't receive it, it will only return the known
> capabilities.

Questions 2) and 3) above might be relevant here too.

[...]

> Thanks for reading my proposal and considering my application. I'm
> very excited about this opportunity,

Thanks for your proposal.

Best.

* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file 2026-03-14 18:31 ` Pablo 2026-03-15 9:20 ` Chandra Pratap 2026-03-16 11:21 ` Christian Couder @ 2026-03-16 21:38 ` Karthik Nayak 2026-03-18 10:45 ` Pablo 2 siblings, 1 reply; 10+ messages in thread From: Karthik Nayak @ 2026-03-16 21:38 UTC (permalink / raw) To: Pablo, Chandra Pratap Cc: git, christian.couder, jltobler, Ayush Chandekar, Siddharth Asthana [-- Attachment #1: Type: text/plain, Size: 17746 bytes --] Pablo <pabloosabaterr@gmail•com> writes: > Hi Chandra, thanks a lot for the feedback! :) > >> You should upload your proposal on the GSoC website and add the link to it here. >> The proposal can be then updated later as many times as you like. > > GSoC proposals opens March 16th, for now I'll send my v2 here and as > soon as I can I'll swap to GSoC website and send the link to the > thread. > > To avoid having you reread everything again this is what I've done from v1: > > Moved context explanation from The Problem to Synopsis and > Availability below About Me and Contact. > Split Pre-GSoC patches into status (for code patches) and > description to improve readability. > Added a code review and proposal thread to the Pre-GSoC section. > Added new lines where noted and fixed capitalization. > Correctly credited Jeff King for the allow_list idea and added new > [8] for Calvin Wan's work. > Community bonding now includes continuing patches and setting up a blog. > Removed most of the duplicated iteration on the Timeline from The > Problem. (feels a bit empty now tho). > Perhaps a diff would be a good addition for next time? :) > I paste here my v2 with the requested changes: > > ## Synopsis > > Git's partial clone allows cloning repositories without downloading > all objects (blobs, trees, ...). These objects are fetched on demand > from the remote when needed. 
However, when a user needs metadata about > these remote objects (size, type, hash, ...), Git has no efficient way > of doing this without downloading all the object content. > > The server side support for `object-info` protocol was implemented by > Calvin Wan in 2021 [8]. Eric Ju built the client-side > `remote-object-info` for `cat-file --batch-command`. > > This project finishes Eric Ju's work on `remote-object-info` for `git > cat-file --batch-command` [1], resolves the pending feedback from > Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for > `%(objecttype)`. > Nice to see that you've linked in the relevant resources. > Expected project size: 350 hours (Medium) > > ## About Me and Contact > > Name: Pablo Sabater Jiménez (he/him) > > Age: 19 > > Education: Currently on my second Computer Science year at University > of Murcia, Spain > > Location: Murcia, Spain (CET, UTC+1) > > Languages: C (solid), shell(bash) (good) > > Tools: git(proficient) > > I've checked that I'm eligible for GSoC 2026. > > Email: pabloosabaterr@gmail•com > GitHub: https://github.com/pabloosabaterr > > ## Availability > > My classes end the first week of May. From then until September I > won't have any classes which leaves me free to fully focus on the > project. I can dedicate 8+ hours each day, and for sure 40 hours a > week. > > ## Relevant Projects > > - 16 bit CPU emulator. Good example of C programming. > > cpu: https://github.com/pabloosabaterr/CPU16 > > - Compiler. Good example of working on bigger projects. > > compiler: https://github.com/pabloosabaterr/Orn > > ## Pre-GSoC Work > > ### Introduction > > **[GSoC] Introduction Pablo Sabater** > > https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com > > **Description**: A mailing list thread where I introduced myself to > the git community. 
> > ### Microproject > > **[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers** > > https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/ > > **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. > > **Description**: Replaces `test -f` with helper `test_path_is_file`, > which makes debugging failing tests easier with better reporting. > As suggested as microproject. > > ### Other contributions > > **[GSoC PATCH v2] test-lib: print escape sequence names** > > https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/ > > **Status**: Will merge to `next`. > > **Description**: In failed expected/actual checks printing, the escape > sequences were shown as their octal code. This patch fixes that to > print the actual escape sequence name, adds tests, and updates the > expected output. > > **[GSoC PATCH] t9200: handle missing CVS with skip_all** > > https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/ > > **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. > > **Description**: wraps CVS setup in a skip_all for clearer failure > reporting and moves Git initialization into its own > test_expect_success. > > **Re: [PATCH] gc: add git maintenance list command** > > https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/ > > **Description**: code review for a patch sent. > > **[GSoC] Proposal: Complete and extend remote-object-info for git cat-file** > > https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/ > > **Description**: Proposal draft thread. > > **[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command** > > https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ > > **Description**: While testing Eric's v11 I've found and reported a > new bug. On `remote-object-info` when it's preceded by a local query, > `data->type` isn't being cleared. 
Causing it to return the wrong type. > Nice to see that you're proactive and already testing out the branch. > I have also studied the documentation provided and Eric Ju's work from > v0 to v11 including all the feedback he got up to March 2025, the > feedback he got from Junio Hamano and Jeff King, taking notes about > what's left to be done and what else I can contribute to the already > proposed project. That's how I've identified everything that I will > address on the Problem, Solution and Timeline sections. > > I built Eric Ju's v11 and tested the bugs reported to his patch [5], > I've confirmed the segfault and the `die()`, and found a new one: > - When a local `info` runs before `remote-object-info` sharing the > same format string, `data->type` isn't being cleared. A blob queried > remotely after a local commit, `data->type` for blob becomes 'commit' > with no error. I reported it on the mailing list [6]. > > I attempted to test rebasing Eric Ju's v11 to master and got conflicts > on 4 out of the 8 commits: > - `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh". > - `t/t1006-cat-file.sh` > - `d918f720d8` fetch-pack: refactor packet writing. > - `fetch-pack.c` > - `2daf9ed803` transport: add client support for object-info. > - `Makefile` > - `c3ba4afaf6` cat-file: add remote-object-info to batch-command. > - `object-file.c`, `object-store-ll.h` (deleted). It's been a while, so this is expected. I guess the first week[s] would mostly be getting this series up-to date. > > I'm being active on the mailing list and learning the Git flow of work > and from the feedback I've received from the maintainers (Junio) from > my patches. > > Following the project guidelines, I haven't done anything on the > project that could step on other candidates' work before being > accepted, and instead I'm focusing on understanding the project and > its needs, and independent patches that will make the Git project more > familiar and understandable to me. 
I know this is the silent expectation, but nice to see it listed out. > > ## The Problem > > Eric Ju's work remains unmerged after v11 because of these issues: > > - The format validation uses `strstr()` which only checks for > `%(objectsize)`. This causes two different errors: > - Atoms that `expand_atom()` recognizes but the remote doesn't > (`objecttype`,`deltabase`, ...), `expand_atom()` returns 1, but when > accessing `data->type` it only contains garbage, causing segfault, as > Jeff King noted [3]. > - Unknown atoms by `expand_atom()`, returns 0, calling > `strbuf_expand_bad_format` on `expand_format()`, which calls `die()`, > as Jeff King found [3]. > Both cases block the command, including local `info` queries if the > same format string is shared. Unsupported remote placeholders should > return an empty string, matching how `for-each-ref` returns empty for > known, but inapplicable atoms like `%(tagger)` on non-tags [4] [5]. > > - When local and remote queries are mixed, `data->type` is not being > cleared between commands. `remote-object-info` returns the wrong type > data from a previous local query [6]. > > - Style and code issues marked by Junio Hamano [2] and Jeff King [3] > [5] are still undone. > - comment style. > - `#define` formatting. > - line length. > - misleading error messages. > - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline().` > - if/else invert at `get_remote_info()`. > - `%(objecttype)` is not yet supported on either client or server side. > Again, well done on the research. It is always nice to see the requirements being listed out clearly which makes the objective clearer. > ## The Solution > > There are two main goals: > > ### Goal 1: Rebase and finish Eric's work > > Starting from where Eric Ju left off, I will rebase it on top of the > current `master` branch and address the feedback left to do: > - Fix style in comments, `#define` formatting and line length. 
> - Fix misleading error message in the overflow check. > - Add missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`. > - Invert if/else on `get_remote_info()` to keep the small block first > (the error one) as Junio suggested. > > #### Replace `strstr()` format validation with allow_list in `expand_atom()` > > `strstr()` isn't enough to fully validate the placeholders, it only > searches for `%(objectsize)` and unsupported placeholders cause > segfaults. Jeff King noted [4] that the fix was to refactor the > validation with an allow_list in `expand_atom()` or `expand_format()`. > The best option is to place the validation at `expand_atom()`, but why > `expand_atom()` ? > - There are two cases, first, inside `expand_atom()` before returning > (segfault) and second, calls `die()` when `expand_atom()` returns 0. > Placing the `allow_list` at the top of `expand_atom()` prevents both > errors, on remote mode, append nothing to `sb` and return 1, accessing > `data->type` won't cause segfault and prevents `expand_format()` from > reaching `die()`. > As extra safety, initializing `data->type` to `OBJ_BAD` and check > for `NULL` from `type_name()` makes it that even without `allow_list`, > uninitialized data doesn't cause a segfault. > At Goal 1, only `%(objectname)` and `%(objectsize)` will be in the > allow_list. Goal 2 will bring `%(objecttype)` support. 
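
A rough, standalone sketch of the allow_list idea quoted above, under
stated assumptions: this is not Git's `expand_atom()`, and
`struct remote_info`, `remote_allow_list` and `expand_atom_remote()` are
hypothetical names used only for illustration. A fixed buffer stands in
for the `strbuf` that the real code appends to. It models the behaviour
described for Goal 1: in remote mode, any atom outside the allow list
expands to nothing and still returns 1, so the caller would not reach
`die()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parts of expand_data used here. */
struct remote_info {
	const char *oid;  /* object name, already known to the client */
	const char *size; /* size as text, as received from the server */
};

/* Atoms the remote path may expand during Goal 1 (assumed list). */
static const char *remote_allow_list[] = { "objectname", "objectsize" };

static int allowed_remotely(const char *atom)
{
	size_t nr = sizeof(remote_allow_list) / sizeof(remote_allow_list[0]);

	for (size_t i = 0; i < nr; i++)
		if (!strcmp(atom, remote_allow_list[i]))
			return 1;
	return 0;
}

/*
 * Model of the remote-mode check placed at the top of expand_atom().
 * Always returns 1: atoms outside the allow list leave "out" empty,
 * allow-listed atoms copy their value into "out".
 */
static int expand_atom_remote(char *out, size_t outsz, const char *atom,
			      const struct remote_info *info)
{
	const char *val;

	assert(out && outsz > 0 && atom && info);
	out[0] = '\0';

	if (!allowed_remotely(atom))
		return 1; /* unsupported remotely: append nothing */

	val = !strcmp(atom, "objectname") ? info->oid : info->size;
	strncpy(out, val, outsz - 1);
	out[outsz - 1] = '\0';
	return 1;
}
```

With `{ "abc123", "42" }` as the input struct, `objectsize` expands to
`42`, while `objecttype` leaves the buffer empty and still returns 1.
Adding `"objecttype"` to the list (plus a matching field) would be the
Goal 2 step in this model.
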
> > ### Goal 2: Adding `%(objecttype)` > > Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, v2 > protocol needs to be extended on the server side to support the new > `%(objecttype)` placeholder: > - extend `object_info_advertise()` at `serve.c` > - add .type to `requested_info` struct at `serve.c` > - support `type` in `cap_object_info()` at `protocol-caps.c` > - look for type at `send_info()` at `protocol-caps.c` > > Following object-info protocol docs [7] it should look like: > ``` > attrs = "size" SP "type" > obj-type = "blob" | "tree" | "commit" | "tag" > obj-info = obj-id SP obj-size SP obj-type > info = PKT-LINE(attrs LF) > *PKT-LINE(obj-info LF) > ``` > > `%(objecttype)` needs to be added to the `allow_list`. Client side > needs to learn to ask for `%(objecttype)` from remote, parse what has > been received and fill `expand_data` with the actual type. This makes > it return the object type instead of the empty string returned while > it was unsupported. > > Default format evolves to `%(objectname) %(objecttype) %(objectsize)`. > Test and document new placeholder support and server side extension. > > #### Backward Compatibility > > There are four possible scenarios to happen between client and server: > > 1. **The server doesn't know type (new client but old server)**: > > After receiving the server capabilities, a client will only request > what the server advertises. The `allow_list` would handle this, > returning an empty string when the server doesn't support it. > > 2. **The server knows type but the client doesn't (new server but old client)**: > > Following `gitprotocol-v2.adoc`, "Clients must ignore all unknown > keys", it will ignore type, and request only the known capabilities. > > 3. **Both know type (new client and new server)**: > > Server advertises type, client requests it and gets the type data. > > 4. 
**Both know type but protocol middleware doesn't (new client, new > server but old middleware)**: > > If a server advertises type but client doesn't receive type, a > client won't ask for anything unadvertised, if a client asks for type > but the server doesn't receive it, it will only return the known > capabilities. > > **performance considerations** > > To get an object type, we have to look only at the header, to get the > size `oid_object_info()` at `object-file.c` is being called which > already returns the object type in the same call. Sending the string > with the type will only be, worst case scenario 6 bytes for the > "commit" string. > > ## Timeline > > I've designed this to work with enough time so final work can be > shorter than what's said here > > May 1-24: Community Bonding > - Keep working on my ongoing patches and new ones. > - Talk and meet with mentor that I'm assigned with, to get feedback > about my proposal, how I will report my progress apart from the code > submitted and possible blogs, and tips and tricks to work better at > Git. > - Confirm with mentor that the `allow_list` approach is still the best option. > - Draft commits structure. > - Setup a blog to keep track about how GSoC at Git is going. > > Week 1-2: (May 26 - June 8) > - Start Goal 1 fixes. > - Fix style and code issues. > > Week 3-4: (June 9 - June 22) > - Start with Goal 1 implementations (allow_list approach). > > Week 5-6: (June 23 - July 6): > - Goal 1 should be polished or close to the final form. > - Send patch series for Goal 1. > - Start Goal 2. > - Prepare the midterm report. > > **Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs > - Goal 1 submitted. > > Week 7-8: (July 14 - July 27) > - Start with server side v2 protocol extension (`%(objecttype)`). > > Week 9-10: (July 28 - August 10) > - Add `%(objecttype)` to the `allow_list` from Goal 1. > - Client side extension. > - End to end tests and documentation. 
> - Default format becomes `%(objectname) %(objecttype) %(objectsize)`. > - Send patch series. > > Week 11-12: (August 11 - August 24) > - Goal 2 should be close to be done. > - Polish everything, all tests pass, good test coverage, no > style/comment issues. > - Final documentation review. > - Prepare for final evaluation. > > **Final evaluation** (August 18-24) as specified on GSoC timeline docs > > ### Additional objectives > > If there is enough time, or for future work after the project. I've > some ideas on how this could evolve: > > #### More placeholders support > > I've checked that Eric's v11 patch only supports `%(objectsize)` on > server side, but on the client side there are other placeholders that > can be added too. With the `allow_list` and having Goal 2 implemented, > adding more placeholders becomes trivial. > > - `%(objectsize:disk)`: Returns the size on the disk (compressed or as > a delta) instead of returning the uncompressed size that > `%(objectsize)` does. To do this, the server would need to send what's > the actual size on disk data. > > - `%(deltabase)`: Returns the delta base object OID. non delta objects > return zero OID as it does on local. > > #### Returning missing blobs from a tree ordered > > In a partial clone, someone might want to know what blobs are missing > inside a concrete tree and their size before fetching them. > The idea is to build on top of `remote-object-info`: > Given a tree hash, return the missing blobs (inside that tree) ordered by size. > You might want to look 'git-backfill(1)', I recall there was some thoughts on extending that command to do something similar. But I don't remember on the top of my head. > Thanks for reading my proposal and considering my application. 
I'm > very excited about this opportunity, > Pablo > > [1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ > "Eric Ju's v11 patch" > > [2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio > Hamano feedback" > > [3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ > "Jeff King feedback" > > [4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ > "options for strstr() by Jeff King" > > [5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ > "Jeff King follow-up" > > [6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ > "data->type not being cleared bug" > > [7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info > "object-info protocol docs" > > [8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t > "Calvin Wan's patch series" > > --- > > Again, thanks a lot for the feedback. Regards, Karthik [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 690 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-16 21:38 ` Karthik Nayak
@ 2026-03-18 10:45 ` Pablo
0 siblings, 0 replies; 10+ messages in thread
From: Pablo @ 2026-03-18 10:45 UTC (permalink / raw)
To: Karthik Nayak
Cc: Chandra Pratap, git, christian.couder, jltobler, Ayush Chandekar,
Siddharth Asthana

Karthik Nayak (<karthik.188@gmail•com>) writes:

> Perhaps a diff would be a good addition for next time? :)

Yes, I'll add a diff from now on.

> It's been a while, so this is expected. I guess the first week[s] would
> mostly be getting this series up-to date.

Yes, it's mentioned in The Solution section, but I'll make it clearer
by adding it explicitly to the Timeline as the first thing to do.

> You might want to look 'git-backfill(1)', I recall there was some
> thoughts on extending that command to do something similar. But I don't
> remember on the top of my head.

Thanks, I didn't know about that. From what I've found, the
'git-backfill' extension that Stolee is working on [1] is similar, but
(correct me if I'm wrong) 'git-backfill' fetches the branch/path. This
idea would only bring the metadata asked for in a format string, e.g.
"%(objectname) %(objectsize) %(objecttype)", building on what has been
done in Goal 1 and Goal 2. I'll add a clarification about this to the
proposal.

This would work well alongside the 'git-backfill' extension: query the
metadata from a branch first, then fetch it with 'git-backfill'.

Thanks for the feedback and compliments,
Pablo

[1]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/
"Stolee 'git-backfill' extension"

* [GSoC v3] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-13 10:17 [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file Pablo
2026-03-14 5:58 ` Chandra Pratap
@ 2026-03-16 16:05 ` Pablo Sabater
2026-03-18 11:42 ` [GSoC v4] " Pablo
1 sibling, 1 reply; 10+ messages in thread
From: Pablo Sabater @ 2026-03-16 16:05 UTC (permalink / raw)
To: git
Cc: christian.couder, karthik.188, jltobler, ayu.chandekar,
siddharthasthana31, chandrapratap3519

Thanks for the feedback on v2, Christian and Chandra.

Changes from v2:

> 1) What is returning an empty string? Is it the `allow_list`, the
> client, the server or something else?
> 2) And what is actually reported to the user (an error, a warning, nothing)?
> 3) Also is it what is implemented in Eric's v11, or what you suggest
> implementing?

- Backward Compatibility expanded to answer Christian's questions from
  the v2 feedback.
- Performance Considerations now uses #### instead of bold ****.
- Moved draft proposal on Pre-GSoC to its own subsection.
- Added --graph-max RFC patch to Other Contributions.
- Fixed capitalization of subsections.

## Synopsis

Git's partial clone allows cloning repositories without downloading
all objects (blobs, trees, ...). These objects are fetched on demand
from the remote when needed. However, when a user needs metadata about
these remote objects (size, type, hash, ...), Git has no efficient way
of getting it without downloading all the object content.

Server-side support for the `object-info` protocol was implemented by
Calvin Wan in 2021 [8]. Eric Ju built the client-side
`remote-object-info` for `cat-file --batch-command`.

This project finishes Eric Ju's work on `remote-object-info` for `git
cat-file --batch-command` [1], resolves the pending feedback from
Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for
`%(objecttype)`.

Expected project size: 350 hours (Medium)

## About Me and Contact

Name: Pablo Sabater Jiménez (he/him)

Age: 19

Education: Currently in my second year of Computer Science at the
University of Murcia, Spain

Location: Murcia, Spain (CET, UTC+1)

Languages: C (solid), shell (bash) (good)

Tools: git (proficient)

I've checked that I'm eligible for GSoC 2026.

Email: pabloosabaterr@gmail•com
GitHub: https://github.com/pabloosabaterr

## Availability

My classes end the first week of May. From then until September I
won't have any classes, which leaves me free to focus fully on the
project. I can dedicate 8+ hours each day, and at least 40 hours a
week.

## Relevant Projects

- 16 bit CPU emulator. Good example of C programming.

cpu: https://github.com/pabloosabaterr/CPU16

- Compiler. Good example of working on bigger projects.

compiler: https://github.com/pabloosabaterr/Orn

## Pre-GSoC Work

### Introduction

**[GSoC] Introduction Pablo Sabater**

https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com

**Description**: A mailing list thread where I introduced myself to
the git community.

### Microproject

**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers**

https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/

**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.

**Description**: Replaces `test -f` with the helper
`test_path_is_file`, which makes debugging failing tests easier thanks
to better reporting. Done as the suggested microproject.

### Draft Proposal

**[GSoC] Proposal: Complete and extend remote-object-info for git cat-file**

https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/

**Description**: Proposal draft thread.

### Other Contributions

**[GSoC PATCH v2] test-lib: print escape sequence names**

https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/

**Status**: Will merge to `next`.

**Description**: When a failed expected/actual check was printed,
escape sequences were shown as their octal codes. This patch prints
the actual escape sequence names instead, adds tests, and updates the
expected output.

**[GSoC PATCH] t9200: handle missing CVS with skip_all**

https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/

**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`.

**Description**: Wraps CVS setup in a skip_all for clearer failure
reporting and moves Git initialization into its own
test_expect_success.

**Re: [PATCH] gc: add git maintenance list command**

https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/

**Description**: Code review of a submitted patch.

**[GSoC RFC PATCH] graph: add --graph-max option to limit displayed columns**

https://lore.kernel.org/git/20260316133426.117684-1-pabloosabaterr@gmail.com/

**Status**: RFC, waiting for feedback.

**Description**: Adds a `--graph-max` option to `git log --graph` to
cap the number of columns that will be displayed. Helps readability
for projects with many branches.

**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**

https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/

**Description**: While testing Eric's v11 I found and reported a new
bug: when `remote-object-info` is preceded by a local query,
`data->type` isn't cleared, so the wrong type is returned.

I have also studied the provided documentation and Eric Ju's work from
v0 to v11, including all the feedback he got up to March 2025 from
Junio Hamano and Jeff King, taking notes about what's left to be done
and what else I can contribute to the already proposed project. That's
how I've identified everything that I will address in the Problem,
Solution and Timeline sections.

I built Eric Ju's v11 and tested the bugs reported against his patch
[5]. I've confirmed the segfault and the `die()`, and found a new one:
- When a local `info` runs before `remote-object-info` and both share
  the same format string, `data->type` isn't cleared. If a blob is
  queried remotely after a local commit, `data->type` for the blob
  becomes 'commit' with no error. I reported it on the mailing list [6].

I attempted a test rebase of Eric Ju's v11 onto master and got
conflicts on 4 out of the 8 commits:
- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
  - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
  - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
  - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
  - `object-file.c`, `object-store-ll.h` (deleted).

I'm staying active on the mailing list, learning the Git workflow and
learning from the feedback I've received from the maintainer (Junio)
on my patches.

Following the project guidelines, I haven't done anything on the
project that could step on other candidates' work before being
accepted. Instead I'm focusing on understanding the project and its
needs, and on independent patches that make the Git project more
familiar and understandable to me.

## The Problem

Eric Ju's work remains unmerged after v11 because of these issues:

- The format validation uses `strstr()`, which only checks for
  `%(objectsize)`. This causes two different errors:
  - For atoms that `expand_atom()` recognizes but the remote doesn't
    support (`objecttype`, `deltabase`, ...), `expand_atom()` returns
    1, but `data->type` contains only garbage when it is accessed,
    causing a segfault, as Jeff King noted [3].
  - For atoms unknown to `expand_atom()`, it returns 0, so
    `expand_format()` calls `strbuf_expand_bad_format()`, which calls
    `die()`, as Jeff King found [3].
  Both cases block the command, including local `info` queries if the
  same format string is shared.
  Unsupported remote placeholders should return an empty string, matching how `for-each-ref` returns an empty string for known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
- When local and remote queries are mixed, `data->type` is not cleared between commands, so `remote-object-info` returns stale type data from a previous local query [6].
- Style and code issues raised by Junio Hamano [2] and Jeff King [3] [5] are still unaddressed:
    - comment style.
    - `#define` formatting.
    - line length.
    - misleading error messages.
    - missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
    - if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either the client or the server side.

## The Solution

There are two main goals:

### Goal 1: Rebase and finish Eric's work

Starting from where Eric Ju left off, I will rebase his series on top of the current `master` branch and address the remaining feedback:

- Fix style in comments, `#define` formatting and line length.
- Fix the misleading error message in the overflow check.
- Add the missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert the if/else in `get_remote_info()` to keep the small block (the error one) first, as Junio suggested.

#### Replace `strstr()` format validation with allow_list in `expand_atom()`

`strstr()` isn't enough to fully validate the placeholders: it only searches for `%(objectsize)`, and unsupported placeholders cause segfaults. Jeff King noted [4] that the fix is to refactor the validation into an allow_list in `expand_atom()` or `expand_format()`.

The best option is to place the validation in `expand_atom()`. Why `expand_atom()`?

- There are two failure cases: the first happens inside `expand_atom()` before it returns (the segfault), and the second is the `die()` that `expand_format()` reaches when `expand_atom()` returns 0.
  Placing the `allow_list` at the top of `expand_atom()` prevents both errors: in remote mode, an atom that is not in the list appends nothing to `sb` and returns 1, so accessing `data->type` can no longer cause a segfault and `expand_format()` never reaches `die()`.

As extra safety, initializing `data->type` to `OBJ_BAD` and checking for `NULL` from `type_name()` ensures that, even without the `allow_list`, uninitialized data doesn't cause a segfault.

In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the allow_list. Goal 2 will bring `%(objecttype)` support.

### Goal 2: Adding `%(objecttype)`

Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, the v2 protocol needs to be extended on the server side to support the new `%(objecttype)` placeholder:

- extend `object_info_advertise()` at `serve.c`
- add a `type` field to the `requested_info` struct at `protocol-caps.c`
- support `type` in `cap_object_info()` at `protocol-caps.c`
- look up the type in `send_info()` at `protocol-caps.c`

Following the object-info protocol docs [7], it should look like:

```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
       *PKT-LINE(obj-info LF)
```

`%(objecttype)` needs to be added to the `allow_list`. The client side needs to learn to ask the remote for `%(objecttype)`, parse what it receives and fill `expand_data` with the actual type. This makes it return the object type instead of the empty string returned while it was unsupported.

The default format evolves to `%(objectname) %(objecttype) %(objectsize)`. The new placeholder support and the server side extension will be tested and documented.

#### Backward Compatibility

There are four possible scenarios between client and server:

1. **The server doesn't know type (new client but old server)**: After receiving the server capabilities, the client doesn't see `type` being advertised. When the user's format string has `%(objecttype)`, `expand_atom()` checks the `allow_list` and finds that `type` was not fetched.
   It then appends an empty string to the output buffer and returns 1. The user sees an empty field where `type` should be, with no errors or warnings. In Eric Ju's v11 this would crash, as described in The Problem section; the `allow_list` from Goal 1 is what fixes it, following `for-each-ref` behaviour for known but inapplicable atoms, as Jeff King suggested [4] [5].

2. **The server knows type but the client doesn't (new server but old client)**: The server advertises `type`, but the client doesn't know it. Following `gitprotocol-v2.adoc` ("Clients must ignore all unknown keys"), it silently ignores `type` and only asks for the attribute it knows (`size`). The server returns only what was requested, so the user sees the output for `size` but not for `type`. This needs no new code; the v2 protocol already behaves like this.

3. **Both know type (new client and new server)**: The server advertises `type`, the client requests `type` and receives the type data. `expand_atom()` finds `type` in the `allow_list` and fills `data->type`, so the user sees the object type in the output. This is Goal 2.

4. **Both know type but protocol middleware doesn't (new client, new server but old middleware)**: This becomes case 1 or 2 depending on which side the middleware affects. If the middleware removes `type` from the server's advertised capabilities, the client never sees it and treats the server as if it were an old server; it becomes case 1 (empty string). If the middleware removes `type` from the client request, the server only sees `size` being requested and only returns size data; it becomes case 2.

#### Performance Considerations

Getting an object's type only requires looking at its header. To get the size, `oid_object_info()` at `object-file.c` is already being called, and it returns the object type in the same call. Sending the type string costs, in the worst case, 6 bytes for the "commit" string.
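As a recap of the Backward Compatibility cases, the client-side rule can be sketched as a small standalone C program fragment. This is only an illustration under stated assumptions, not Git's code: `struct remote_info`, `is_known_atom()`, `expand_remote_atom()` and the fixed-size output buffer are hypothetical names, while the real `expand_atom()` works on `struct strbuf` and `struct expand_data`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the data a remote query would fill in. */
struct remote_info {
	const char *oid;     /* hex object name, always known */
	unsigned long size;  /* valid only if have_size is set */
	int have_size;       /* server returned "size" */
	const char *type;    /* valid only if have_type is set */
	int have_type;       /* server returned "type" */
};

/* Illustrative list of atoms that are valid locally. */
static const char *known_atoms[] = {
	"objectname", "objecttype", "objectsize",
	"objectsize:disk", "deltabase", "rest",
};

static int is_known_atom(const char *atom)
{
	size_t i;

	for (i = 0; i < sizeof(known_atoms) / sizeof(known_atoms[0]); i++)
		if (!strcmp(atom, known_atoms[i]))
			return 1;
	return 0;
}

/*
 * Expand one atom in remote mode into `out`.
 * Returns 0 for an atom that is unknown even locally (the caller
 * would report a bad format), 1 otherwise. A known atom whose data
 * was not fetched expands to the empty string.
 */
static int expand_remote_atom(char *out, size_t outlen, const char *atom,
			      const struct remote_info *info)
{
	out[0] = '\0';
	if (!is_known_atom(atom))
		return 0;
	if (!strcmp(atom, "objectname"))
		snprintf(out, outlen, "%s", info->oid);
	else if (!strcmp(atom, "objectsize") && info->have_size)
		snprintf(out, outlen, "%lu", info->size);
	else if (!strcmp(atom, "objecttype") && info->have_type)
		snprintf(out, outlen, "%s", info->type);
	/* Anything else is known but unavailable remotely: leave "". */
	return 1;
}
```

With an old server (`have_type` unset), asking for `objecttype` returns 1 and leaves the buffer empty, which is case 1; once the server sends the type, the same call yields the type name, which is case 3.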
## Timeline

I've planned this with some slack, so the actual work may take less time than stated here.

May 1-24: Community Bonding

- Keep working on my ongoing patches and new ones.
- Talk and meet with my assigned mentor to get feedback about my proposal, agree on how I will report my progress apart from the submitted code (possibly blog posts), and get tips and tricks to work better on Git.
- Confirm with my mentor that the `allow_list` approach is still the best option.
- Draft the commit structure.
- Set up a blog to keep track of how GSoC at Git is going.

Week 1-2: (May 26 - June 8)

- Start Goal 1 fixes.
- Fix style and code issues.

Week 3-4: (June 9 - June 22)

- Start with Goal 1 implementations (allow_list approach).

Week 5-6: (June 23 - July 6)

- Goal 1 should be polished or close to its final form.
- Send patch series for Goal 1.
- Start Goal 2.
- Prepare the midterm report.

**Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs

- Goal 1 submitted.

Week 7-8: (July 14 - July 27)

- Start with server side v2 protocol extension (`%(objecttype)`).

Week 9-10: (July 28 - August 10)

- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Client side extension.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.

Week 11-12: (August 11 - August 24)

- Goal 2 should be close to done.
- Polish everything: all tests pass, good test coverage, no style/comment issues.
- Final documentation review.
- Prepare for final evaluation.

**Final evaluation** (August 18-24) as specified on GSoC timeline docs

### Additional objectives

If there is enough time, or as future work after the project, I have some ideas on how this could evolve:

#### More placeholders support

I've checked that Eric's v11 patch only supports `%(objectsize)` on the server side, but on the client side there are other placeholders that can be added too.
With the `allow_list` and Goal 2 implemented, adding more placeholders becomes straightforward.

- `%(objectsize:disk)`: Returns the size on disk (compressed or as a delta) instead of the uncompressed size that `%(objectsize)` returns. For this, the server would need to send the actual on-disk size.
- `%(deltabase)`: Returns the delta base object OID. Non-delta objects return the zero OID, as they do locally.

#### Returning missing blobs from a tree ordered

In a partial clone, someone might want to know which blobs are missing inside a given tree, and their sizes, before fetching them.
The idea is to build on top of `remote-object-info`:
Given a tree hash, return the missing blobs (inside that tree) ordered by size.

Thanks for reading my proposal and considering my application. I'm very excited about this opportunity,

Pablo

[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ "Eric Ju's v11 patch"
[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio Hamano feedback"
[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ "Jeff King feedback"
[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ "options for strstr() by Jeff King"
[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ "Jeff King follow-up"
[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ "data->type not being cleared bug"
[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info "object-info protocol docs"
[8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t "Calvin Wan's patch series"
* Re: [GSoC v4] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-16 16:05 ` [GSoC v3] " Pablo Sabater
@ 2026-03-18 11:42 ` Pablo
2026-04-12 14:41 ` Pablo
0 siblings, 1 reply; 10+ messages in thread
From: Pablo @ 2026-03-18 11:42 UTC (permalink / raw)
To: git
Cc: Christian Couder, karthik nayak, jltobler, Ayush Chandekar,
Siddharth Asthana, Chandra Pratap

I've realised that I sent v3 in reply to v1 instead of v2, here and on some of the patches. I'm very sorry about that; I'll do it correctly from now on.

This v4 addresses improvements over v3 and Karthik's feedback on v2.

Changes from v3 (detailed diff below the proposal):

- Patch status updated
- Timeline explicitly shows the rebase in week 1
- Extra objective "return missing blobs" updated to be clearer
- Fixed format for PDF creation

GSoC doesn't let me share the submitted PDF, so I can't share a link. I'm sending this as markdown because plain text is preferred here, but the actual PDF that will be delivered can be generated with:

    pandoc <file> -f markdown+autolink_bare_uris -o proposal.pdf -V geometry:"margin=2cm" -V colorlinks=true -V urlcolor=blue --toc --number-sections

# Synopsis

Git's partial clone allows cloning repositories without downloading all objects (blobs, trees, ...). These objects are fetched on demand from the remote when needed. However, when a user needs metadata about these remote objects (size, type, hash, ...), Git has no efficient way of getting it without downloading the objects' full content.

The server side support for the `object-info` protocol was implemented by Calvin Wan in 2021 [8]. Eric Ju built the client-side `remote-object-info` for `cat-file --batch-command`.

This project finishes Eric Ju's work on `remote-object-info` for `git cat-file --batch-command` [1], resolves the pending feedback from Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for `%(objecttype)`.
Expected project size: 350 hours (Medium) # About Me and Contact Name: Pablo Sabater Jiménez (he/him) Age: 19 Education: Currently on my second Computer Science year at University of Murcia, Spain Location: Murcia, Spain (CET, UTC+1) Languages: C (solid), shell(bash) (good) Tools: git(proficient) I've checked that I'm eligible for GSoC 2026. Email: pabloosabaterr@gmail•com GitHub: https://github.com/pabloosabaterr # Availability My classes end the first week of May. From then until September I won't have any classes which leaves me free to fully focus on the project. I can dedicate 8+ hours each day, and for sure 40 hours a week. # Relevant Projects - 16 bit CPU emulator. Good example of C programming. cpu: https://github.com/pabloosabaterr/CPU16 - Compiler. Good example of working on bigger projects. compiler: https://github.com/pabloosabaterr/Orn # Pre-GSoC Work ## Introduction [[GSoC] Introduction Pablo Sabater](https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com) **Description**: A mailing list thread where I introduced myself to the git community. ## Microproject [[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers](https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/) **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. Will merge to `master`. **Description**: Replaces `test -f` with helper `test_path_is_file`, which makes debugging failing tests easier with better reporting. As suggested as microproject. ## Draft Proposal [[GSoC] Proposal: Complete and extend remote-object-info for git cat-file](https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/) **Description**: Proposal draft thread. ## Other Contributions [[GSoC PATCH v2] test-lib: print escape sequence names](https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/) **Status**: Will merge to `next`. 
**Description**: In failed expected/actual checks printing, the escape sequences were shown as their octal code. This patch fixes that to print the actual escape sequence name, adds tests, and updates the expected output. [[GSoC PATCH] t9200: handle missing CVS with skip_all](https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/) **Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. Will merge to `master`. **Description**: Wraps CVS setup in a skip_all for clearer failure reporting and moves Git initialization into its own test_expect_success. [Re: [PATCH] gc: add git maintenance list command](https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/) **Description**: Code review for a patch sent. [[GSoC RFC PATCH] graph: add --graph-max option to limit displayed columns](https://lore.kernel.org/git/20260316133426.117684-1-pabloosabaterr@gmail.com/) **Status**: RFC, waiting for feedback. **Description**: Adds `--graph-max` option to `git log --graph` to cap the number of columns that will be displayed. Helps readability for projects with many branches. [[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command](https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/) **Description**: While testing Eric's v11 I've found and reported a new bug. On `remote-object-info` when it's preceded by a local query, `data->type` isn't being cleared. Causing it to return the wrong type. I have also studied the documentation provided and Eric Ju's work from v0 to v11 including all the feedback he got up to March 2025, the feedback he got from Junio Hamano and Jeff King, taking notes about what's left to be done and what else I can contribute to the already proposed project. That's how I've identified everything that I will address on the Problem, Solution and Timeline sections. 
I built Eric Ju's v11 and tested the bugs reported against his patch [5]. I confirmed the segfault and the `die()`, and found a new one:

- When a local `info` runs before `remote-object-info` and both share the same format string, `data->type` isn't cleared. For a blob queried remotely after a local commit, `data->type` is reported as 'commit' with no error. I reported it on the mailing list [6].

I tried rebasing Eric Ju's v11 onto `master` and got conflicts in 4 of the 8 commits:

- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
    - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
    - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
    - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
    - `object-file.c`, `object-store-ll.h` (deleted).

I'm active on the mailing list, learning Git's workflow, and learning from the feedback the maintainers (Junio) have given on my patches.

Following the project guidelines, I haven't done anything on the project that could step on other candidates' work before being accepted. Instead I'm focusing on understanding the project and its needs, and on independent patches that make the Git project more familiar and understandable to me.

# The Problem

Eric Ju's work remains unmerged after v11 because of these issues:

- The format validation uses `strstr()`, which only checks for `%(objectsize)`. This causes two different errors:
    - For atoms that `expand_atom()` recognizes but the remote doesn't support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1, but `data->type` contains garbage when it is accessed, causing a segfault, as Jeff King noted [3].
    - For atoms unknown to `expand_atom()`, it returns 0, so `expand_format()` calls `strbuf_expand_bad_format()`, which calls `die()`, as Jeff King found [3].

  Both cases block the command, including local `info` queries if the same format string is shared.
  Unsupported remote placeholders should return an empty string, matching how `for-each-ref` returns an empty string for known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].
- When local and remote queries are mixed, `data->type` is not cleared between commands, so `remote-object-info` returns stale type data from a previous local query [6].
- Style and code issues raised by Junio Hamano [2] and Jeff King [3] [5] are still unaddressed:
    - comment style.
    - `#define` formatting.
    - line length.
    - misleading error messages.
    - missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
    - if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either the client or the server side.

# The Solution

There are two main goals:

## Goal 1: Rebase and finish Eric's work

Starting from where Eric Ju left off, I will rebase his series on top of the current `master` branch and address the remaining feedback:

- Fix style in comments, `#define` formatting and line length.
- Fix the misleading error message in the overflow check.
- Add the missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert the if/else in `get_remote_info()` to keep the small block (the error one) first, as Junio suggested.

#### Replace `strstr()` format validation with allow_list in `expand_atom()`

`strstr()` isn't enough to fully validate the placeholders: it only searches for `%(objectsize)`, and unsupported placeholders cause segfaults. Jeff King noted [4] that the fix is to refactor the validation into an allow_list in `expand_atom()` or `expand_format()`.

The best option is to place the validation in `expand_atom()`. Why `expand_atom()`?

- There are two failure cases: the first happens inside `expand_atom()` before it returns (the segfault), and the second is the `die()` that `expand_format()` reaches when `expand_atom()` returns 0.
  Placing the `allow_list` at the top of `expand_atom()` prevents both errors: in remote mode, an atom that is not in the list appends nothing to `sb` and returns 1, so accessing `data->type` can no longer cause a segfault and `expand_format()` never reaches `die()`.

As extra safety, initializing `data->type` to `OBJ_BAD` and checking for `NULL` from `type_name()` ensures that, even without the `allow_list`, uninitialized data doesn't cause a segfault.

In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the allow_list. Goal 2 will bring `%(objecttype)` support.

## Goal 2: Adding `%(objecttype)`

Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, the v2 protocol needs to be extended on the server side to support the new `%(objecttype)` placeholder:

- extend `object_info_advertise()` at `serve.c`
- add a `type` field to the `requested_info` struct at `protocol-caps.c`
- support `type` in `cap_object_info()` at `protocol-caps.c`
- look up the type in `send_info()` at `protocol-caps.c`

Following the object-info protocol docs [7], it should look like:

```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
       *PKT-LINE(obj-info LF)
```

`%(objecttype)` needs to be added to the `allow_list`. The client side needs to learn to ask the remote for `%(objecttype)`, parse what it receives and fill `expand_data` with the actual type. This makes it return the object type instead of the empty string returned while it was unsupported.

The default format evolves to `%(objectname) %(objecttype) %(objectsize)`. The new placeholder support and the server side extension will be tested and documented.

## Backward Compatibility

There are four possible scenarios between client and server:

1. **The server doesn't know type (new client but old server)**: After receiving the server capabilities, the client doesn't see `type` being advertised. When the user's format string has `%(objecttype)`, `expand_atom()` checks the `allow_list` and finds that `type` was not fetched.
   It then appends an empty string to the output buffer and returns 1. The user sees an empty field where `type` should be, with no errors or warnings. In Eric Ju's v11 this would crash, as described in The Problem section; the `allow_list` from Goal 1 is what fixes it, following `for-each-ref` behaviour for known but inapplicable atoms, as Jeff King suggested [4] [5].

2. **The server knows type but the client doesn't (new server but old client)**: The server advertises `type`, but the client doesn't know it. Following `gitprotocol-v2.adoc` ("Clients must ignore all unknown keys"), it silently ignores `type` and only asks for the attribute it knows (`size`). The server returns only what was requested, so the user sees the output for `size` but not for `type`. This needs no new code; the v2 protocol already behaves like this.

3. **Both know type (new client and new server)**: The server advertises `type`, the client requests `type` and receives the type data. `expand_atom()` finds `type` in the `allow_list` and fills `data->type`, so the user sees the object type in the output. This is Goal 2.

4. **Both know type but protocol middleware doesn't (new client, new server but old middleware)**: This becomes case 1 or 2 depending on which side the middleware affects. If the middleware removes `type` from the server's advertised capabilities, the client never sees it and treats the server as if it were an old server; it becomes case 1 (empty string). If the middleware removes `type` from the client request, the server only sees `size` being requested and only returns size data; it becomes case 2.

## Performance Considerations

Getting an object's type only requires looking at its header. To get the size, `oid_object_info()` at `object-file.c` is already being called, and it returns the object type in the same call. Sending the type string costs, in the worst case, 6 bytes for the "commit" string.
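To make the wire format proposed in Goal 2 and the size estimate above more concrete, here is a minimal sketch of how a client could parse one response line of the form `obj-id SP obj-size [SP obj-type]`. It is an assumption-laden illustration rather than Git's implementation: `struct obj_info`, `valid_type()` and `parse_obj_info()` are hypothetical names, and the line is assumed to be the pkt-line payload with the trailing LF already stripped.

```c
#include <stdlib.h>
#include <string.h>

#define OID_HEX_MAX 64 /* SHA-256 hex length; SHA-1 uses 40 */

/* Hypothetical container for one parsed obj-info line. */
struct obj_info {
	char oid[OID_HEX_MAX + 1];
	unsigned long size;
	char type[8]; /* "blob", "tree", "commit", "tag", or "" if absent */
};

static int valid_type(const char *s)
{
	return !strcmp(s, "blob") || !strcmp(s, "tree") ||
	       !strcmp(s, "commit") || !strcmp(s, "tag");
}

/*
 * Parse "obj-id SP obj-size [SP obj-type]".
 * Returns 0 on success, -1 on malformed input. A line without a
 * type (old server) is accepted and leaves out->type empty.
 */
static int parse_obj_info(const char *line, struct obj_info *out)
{
	const char *sp = strchr(line, ' ');
	char *end;
	size_t oid_len;

	if (!sp)
		return -1;
	oid_len = (size_t)(sp - line);
	if (!oid_len || oid_len > OID_HEX_MAX)
		return -1;
	memcpy(out->oid, line, oid_len);
	out->oid[oid_len] = '\0';

	/* strtoul() would accept signs and spaces, so check first. */
	if (sp[1] < '0' || sp[1] > '9')
		return -1;
	out->size = strtoul(sp + 1, &end, 10);

	out->type[0] = '\0';
	if (*end == '\0')
		return 0; /* size only: server did not send a type */
	if (*end != ' ' || !valid_type(end + 1))
		return -1;
	/* Safe: the longest valid value is "commit" (6 bytes + NUL). */
	strcpy(out->type, end + 1);
	return 0;
}
```

Accepting a line without the type field is what lets the same parser handle an old server, and validating the type against the four known names keeps the fixed 8-byte buffer sufficient.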
# Timeline

I've planned this with some slack, so the actual work may take less time than stated here.

May 1-24: Community Bonding

- Keep working on my ongoing patches and new ones.
- Talk and meet with my assigned mentor to get feedback about my proposal, agree on how I will report my progress apart from the submitted code (possibly blog posts), and get tips and tricks to work better on Git.
- Confirm with my mentor that the `allow_list` approach is still the best option.
- Draft the commit structure.
- Set up a blog to keep track of how GSoC at Git is going.

Week 1-2: (May 26 - June 8)

- First of all, rebase Eric Ju's v11.
- Start Goal 1 fixes.
- Fix style and code issues.

Week 3-4: (June 9 - June 22)

- Start with Goal 1 implementations (allow_list approach).

Week 5-6: (June 23 - July 6)

- Goal 1 should be polished or close to its final form.
- Send patch series for Goal 1.
- Start Goal 2.
- Prepare the midterm report.

**Midterm evaluation** (July 7 - 11) as specified on GSoC timeline docs

- Goal 1 submitted.

Week 7-8: (July 14 - July 27)

- Start with server side v2 protocol extension (`%(objecttype)`).

Week 9-10: (July 28 - August 10)

- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Client side extension.
- End to end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send patch series.

Week 11-12: (August 11 - August 24)

- Goal 2 should be close to done.
- Polish everything: all tests pass, good test coverage, no style/comment issues.
- Final documentation review.
- Prepare for final evaluation.

**Final evaluation** (August 18-24) as specified on GSoC timeline docs

## Additional objectives

If there is enough time, or as future work after the project, I have some ideas on how this could evolve:

## More placeholders support

I've checked that Eric's v11 patch only supports `%(objectsize)` on the server side, but on the client side there are other placeholders that can be added too.
With the `allow_list` and Goal 2 implemented, adding more placeholders becomes straightforward.

- `%(objectsize:disk)`: Returns the size on disk (compressed or as a delta) instead of the uncompressed size that `%(objectsize)` returns. For this, the server would need to send the actual on-disk size.
- `%(deltabase)`: Returns the delta base object OID. Non-delta objects return the zero OID, as they do locally.

## Returning missing blobs from a tree ordered

In a partial clone, someone might want to know which blobs are missing inside a given tree, and order them, before fetching them.
The idea is to build on top of `remote-object-info` and what's been built in Goal 1 and Goal 2:
Given a tree hash, return the missing blobs (inside that tree) ordered by an orderable atom (size, name, type, ...).

This looks similar to Stolee's work on `git-backfill` [9]. The key difference is that `git-backfill` fetches the missing objects for a path/object, while this would only query the metadata of the missing blobs, without fetching them, and return it ordered by a given atom.

Thanks for reading my proposal and considering my application.
I'm very excited about this opportunity, Pablo \[1\]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ "Eric Ju's v11 patch" \[2\]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio Hamano feedback" \[3\]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ "Jeff King feedback" \[4\]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ "options for strstr() by Jeff King" \[5\]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ "Jeff King follow-up" \[6\]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ "data->type not being cleared bug" \[7\]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info "object-info protocol docs" \[8\]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t "Calvin Wan's patch series" \[9\]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/ "git-backfill extension from Stolee" [1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ "Eric Ju's v11 patch" [2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio Hamano feedback" [3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ "Jeff King feedback" [4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ "options for strstr() by Jeff King" [5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ "Jeff King follow-up" [6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ "data->type not being cleared bug" [7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info "object-info protocol docs" [8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t "Calvin Wan's patch series" [9]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/ "git-backfill 
extension from Stolee" --- diff --git a/v3.md b/v4.md index 60c86de..c5b8bc6 100755 --- a/v3prop.md +++ b/proposal-pdfFormat.md @@ -1 +1 @@ -## Synopsis +# Synopsis @@ -11 +11 @@ Expected project size: 350 hours (Medium) -## About Me and Contact +# About Me and Contact @@ -31 +31 @@ GitHub: https://github.com/pabloosabaterr -## Availability +# Availability @@ -35 +35 @@ My classes end the first week of May. From then until September I won't have any -## Relevant Projects +# Relevant Projects @@ -45 +45 @@ My classes end the first week of May. From then until September I won't have any -## Pre-GSoC Work +# Pre-GSoC Work @@ -47 +47 @@ My classes end the first week of May. From then until September I won't have any -### Introduction +## Introduction @@ -49,3 +49 @@ My classes end the first week of May. From then until September I won't have any -**[GSoC] Introduction Pablo Sabater** - -https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com +[[GSoC] Introduction Pablo Sabater](https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com) @@ -55,3 +53 @@ https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@ -### Microproject - -**[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers** +## Microproject @@ -59 +55 @@ https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@ -https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/ +[[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers](https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/) @@ -61 +57 @@ https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/ -**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. +**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. Will merge to `master`. @@ -66,3 +62 @@ As suggested as microproject. 
-### Draft Proposal - -**[GSoC] Proposal: Complete and extend remote-object-info for git cat-file** +## Draft Proposal @@ -70 +64 @@ As suggested as microproject. -https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/ +[[GSoC] Proposal: Complete and extend remote-object-info for git cat-file](https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/) @@ -74 +68 @@ https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@ -### Other Contributions +## Other Contributions @@ -76,3 +70 @@ https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@ -**[GSoC PATCH v2] test-lib: print escape sequence names** - -https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/ +[[GSoC PATCH v2] test-lib: print escape sequence names](https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/) @@ -84,3 +76 @@ https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/ -**[GSoC PATCH] t9200: handle missing CVS with skip_all** - -https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/ +[[GSoC PATCH] t9200: handle missing CVS with skip_all](https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/) @@ -88 +78 @@ https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/ -**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. +**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. Will merge to `master`. 
@@ -92,3 +82 @@ https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/
-**Re: [PATCH] gc: add git maintenance list command**
-
-https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/
+[Re: [PATCH] gc: add git maintenance list command](https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/)
@@ -98,3 +86 @@ https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/
-**[GSoC RFC PATCH] graph: add --graph-max option to limit displayed columns**
-
-https://lore.kernel.org/git/20260316133426.117684-1-pabloosabaterr@gmail.com/
+[[GSoC RFC PATCH] graph: add --graph-max option to limit displayed columns](https://lore.kernel.org/git/20260316133426.117684-1-pabloosabaterr@gmail.com/)
@@ -106,3 +92 @@ https://lore.kernel.org/git/20260316133426.117684-1-pabloosabaterr@gmail.com/
-**[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command**
-
-https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/
+[[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command](https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/)
@@ -114,0 +99 @@ I built Eric Ju's v11 and tested the bugs reported to his patch [5], I've confir
+
@@ -117,0 +103 @@ I attempted to test rebasing Eric Ju's v11 to master and got conflicts on 4 out
+
@@ -131 +117 @@ Following the project guidelines, I haven't done anything on the project that co
-## The Problem
+# The Problem
@@ -142,0 +129 @@ Eric Ju's work remains unmerged after v11 because of these issues:
+
@@ -151 +138 @@ Eric Ju's work remains unmerged after v11 because of these issues:
-## The Solution
+# The Solution
@@ -155 +142 @@ There are two main goals:
-### Goal 1: Rebase and finish Eric's work
+## Goal 1: Rebase and finish Eric's work
@@ -157,0 +145 @@ Starting from where Eric Ju left off, I will rebase it on top of the current `ma
+
@@ -165,0 +154 @@ Starting from where Eric Ju left off, I will rebase it on top of the current `ma
+
@@ -171 +160 @@ Starting from where Eric Ju left off, I will rebase it on top of the current `ma
-### Goal 2: Adding `%(objecttype)`
+## Goal 2: Adding `%(objecttype)`
@@ -173,0 +163 @@ Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, v2 protocol needs
+
@@ -192 +182 @@ Default format evolves to `%(objectname) %(objecttype) %(objectsize)`. Test and
-#### Backward Compatibility
+## Backward Compatibility
@@ -212 +202 @@ There are four possible scenarios to happen between client and server:
-#### Performance Considerations
+## Performance Considerations
@@ -216 +206 @@ To get an object type, we have to look only at the header, to get the size `oid_
-## Timeline
+# Timeline
@@ -220,0 +211 @@ May 1-24: Community Bonding
+
@@ -227,0 +219,2 @@ Week 1-2: (May 26 - June 8)
+
+- First of all will be rebasing Eric Ju's v11.
@@ -231,0 +225 @@ Week 3-4: (June 9 - June 22)
+
@@ -234,0 +229 @@ Week 5-6: (June 23 - July 6):
+
@@ -240,0 +236 @@ Week 5-6: (June 23 - July 6):
+
@@ -243,0 +240 @@ Week 7-8: (July 14 - July 27)
+
@@ -246,0 +244 @@ Week 9-10: (July 28 - August 10)
+
@@ -253,0 +252 @@ Week 11-12: (August 11 - August 24)
+
@@ -261 +260 @@ Week 11-12: (August 11 - August 24)
-### Additional objectives
+## Additional objectives
@@ -265 +264 @@ If there is enough time, or for future work after the project. I've some ideas o
-#### More placeholders support
+## More placeholders support
@@ -273 +272 @@ I've checked that Eric's v11 patch only supports `%(objectsize)` on server side,
-#### Returning missing blobs from a tree ordered
+## Returning missing blobs from a tree ordered
@@ -275,3 +274,5 @@ I've checked that Eric's v11 patch only supports `%(objectsize)` on server side,
-In a partial clone, someone might want to know what blobs are missing inside a concrete tree and their size before fetching them.
-The idea is to build on top of `remote-object-info`:
-Given a tree hash, return the missing blobs (inside that tree) ordered by size.
+In a partial clone, someone might want to know what blobs are missing inside a concrete tree and order them before fetching them.
+The idea is to build on top of `remote-object-info` and what's been built in Goal 1 and Goal 2:
+Given a tree hash, return the missing blobs (inside that tree) ordered by an orderable atom (size, name, type, ...).
+
+This looks similar to Stolee's work on `git-backfill` [9], the key difference is that `git-backfill` fetches the missing objects from a path/object, while this would only query the metadata of the missing blobs without fetching them and ordered by a given atom.
@@ -279,0 +281 @@ Thanks for reading my proposal and considering my application. I'm very excited
+
@@ -281,0 +284,18 @@ Pablo
+\[1\]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ "Eric Ju's v11 patch"
+
+\[2\]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio Hamano feedback"
+
+\[3\]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ "Jeff King feedback"
+
+\[4\]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ "options for strstr() by Jeff King"
+
+\[5\]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ "Jeff King follow-up"
+
+\[6\]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ "data->type not being cleared bug"
+
+\[7\]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info "object-info protocol docs"
+
+\[8\]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t "Calvin Wan's patch series"
+
+\[9\]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/ "git-backfill extension from Stolee"
+
@@ -296,0 +317,2 @@ Pablo
+
+[9]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/ "git-backfill extension from Stolee"
\ No newline at end of file
* Re: [GSoC v4] Proposal: Complete and extend the remote-object-info command for git cat-file
2026-03-18 11:42 ` [GSoC v4] " Pablo
@ 2026-04-12 14:41 ` Pablo
0 siblings, 0 replies; 10+ messages in thread
From: Pablo @ 2026-04-12 14:41 UTC (permalink / raw)
To: git
Cc: Christian Couder, karthik nayak, jltobler, Ayush Chandekar, Siddharth Asthana, Chandra Pratap

Hi,

This version of the proposal matches the final one I submitted. The proposal itself has not changed; the update mostly covers the pre-GSoC work. A diff with the exact changes follows the proposal.

You can generate the PDF that was submitted with:

    pandoc <file> -f markdown+autolink_bare_uris -o proposal.pdf -V geometry:"margin=2cm" -V colorlinks=true -V urlcolor=blue --toc --number-sections

Thanks,
Pablo.

# Synopsis

Git's partial clone allows cloning a repository without downloading all of its objects (blobs, trees, ...). The missing objects are fetched on demand from the remote when needed. However, when a user needs metadata about these remote objects (size, type, hash, ...), Git has no efficient way to get it without downloading the full object content.

The server-side support for the `object-info` protocol was implemented by Calvin Wan in 2021 [8]. Eric Ju built the client-side `remote-object-info` for `cat-file --batch-command`.

This project finishes Eric Ju's work on `remote-object-info` for `git cat-file --batch-command` [1], resolves the pending feedback from Junio Hamano [2] and Jeff King [3] [4] [5], and extends support for `%(objecttype)`.

Expected project size: 350 hours (Medium)

# About Me and Contact

Name: Pablo Sabater Jiménez (he/him)
Age: 19
Education: Currently in my second year of Computer Science at the University of Murcia, Spain
Location: Murcia, Spain (CEST, UTC+2)
Languages: C (solid), shell (bash) (good)
Tools: git (proficient)

I've checked that I'm eligible for GSoC 2026.
Email: pabloosabaterr@gmail•com
GitHub: https://github.com/pabloosabaterr

# Availability

My classes end in the first week of May. From then until September I have no classes, which leaves me free to focus fully on the project. I can dedicate 8+ hours each day, and at least 40 hours a week.

# Relevant Projects

- 16-bit CPU emulator. Good example of C programming.
  cpu: https://github.com/pabloosabaterr/CPU16
- Compiler. Good example of working on bigger projects.
  compiler: https://github.com/pabloosabaterr/Orn

# Pre-GSoC Work

## Introduction

[[GSoC] Introduction Pablo Sabater](https://lore.kernel.org/git/CAN5EUNR0KJ4VeuOF_bVupaTuGKGaeTKa0SMRAUoBPo5wWi8YGA@mail.gmail.com)

**Description**: A mailing list thread where I introduced myself to the Git community.

## Microproject

[[GSoC PATCH v4] t9200: replace test -f/-d with modern path helpers](https://lore.kernel.org/git/20260312173305.15112-1-pabloosabaterr@gmail.com/)

**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. Will merge to `master`.

**Description**: Replaces `test -f` with the helper `test_path_is_file`, whose better reporting makes failing tests easier to debug. Suggested as a microproject.

## Draft Proposal

[[GSoC] Proposal: Complete and extend remote-object-info for git cat-file](https://lore.kernel.org/git/CAN5EUNQKv-LCkbY+5scn6pk6fL8kpmjNR=66rjeY=NqKbqRkhA@mail.gmail.com/)

**Description**: Proposal draft thread.

## Other Contributions

[[GSoC PATCH v2] test-lib: print escape sequence names](https://lore.kernel.org/git/20260311031442.11942-1-pabloosabaterr@gmail.com/)

**Status**: Merged to `next` on 2026-03-13 at `f545ea5a9c`.

**Description**: When a failed expected/actual comparison was printed, escape sequences were shown as their octal codes. This patch prints the escape sequence name instead, adds tests, and updates the expected output.
[[GSoC PATCH] t9200: handle missing CVS with skip_all](https://lore.kernel.org/git/20260311194002.190195-1-pabloosabaterr@gmail.com/)

**Status**: Merged to `next` on 2026-03-12 at `8500bdf172`. Will merge to `master`.

**Description**: Wraps the CVS setup in a `skip_all` for clearer failure reporting and moves Git initialization into its own `test_expect_success`.

[[GSoC PATCH v6 0/3] graph: add --graph-lane-limit option](https://lore.kernel.org/git/20260328001113.1275291-1-pabloosabaterr@gmail.com/)

**Status**: WIP.

**Description**: Adds a `--graph-lane-limit` option to `--graph` that limits the number of horizontal lanes shown. Helps readability for projects with many branches.

[[GSoC PATCH 0/3] receive-pack: fix HEAD check for updateInstead](https://lore.kernel.org/git/20260330111822.165188-1-pabloosabaterr@gmail.com/)

**Status**: Pending review.

**Description**: Fixes the `updateInstead` HEAD check, which only looked at the bare repository context instead of the worktree HEAD and therefore rejected pushes even when the worktree was clean.

## Code Reviews

[Re: [PATCH] gc: add git maintenance list command](https://lore.kernel.org/git/20260313115932.15259-1-pabloosabaterr@gmail.com/)

**Description**: Code review of a patch about simplifying duplicated code.

[Re: [PATCH] t2107: modernize path existence check](https://lore.kernel.org/all/CAN5EUNTNqC6+FPjKafoFfgaEzWdpXEV0QNwumF8CaxBEUOmA6Q@mail.gmail.com/)

[Re: [GSoC][PATCH] t2000: modernize path checks to use helper functions](https://lore.kernel.org/all/CAN5EUNTSO7KvtO02c-EHJTK95rmcZKRBtKsn8kjNid1qupWZ0w@mail.gmail.com/)

[Re: [PATCH] t5315: use test_path_is_file for loose-object check](https://lore.kernel.org/all/CAN5EUNR2mqpCMG0oPsDnzgZr-2yyL+S0A7p_MM62F7d4MjBuSA@mail.gmail.com/)

**Description**: Reviews of newcomers' microproject patches.
[[GSoC] Re: [PATCH v11 8/8] cat-file: add remote-object-info to batch-command](https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/)

**Description**: While testing Eric's v11 I found and reported a new bug: when `remote-object-info` is preceded by a local query, `data->type` is not cleared, so the wrong type is returned.

I have also studied the documentation and Eric Ju's work from v0 to v11, including all the feedback he received up to March 2025 from Junio Hamano and Jeff King, taking notes on what is left to do and what else I can contribute to the proposed project. That is how I identified everything I address in the Problem, Solution and Timeline sections.

I built Eric Ju's v11 and tested the bugs reported against his patch [5]. I confirmed the segfault and the `die()`, and found a new bug:

- When a local `info` runs before `remote-object-info` and both share the same format string, `data->type` is not cleared. If a blob is queried remotely after a local commit, `data->type` for the blob becomes 'commit' with no error. I reported it on the mailing list [6].

I tried rebasing Eric Ju's v11 onto `master` and got conflicts in 4 of the 8 commits:

- `d04cf85ece` t1006: split test utility functions into new "lib-cat-file.sh".
  - `t/t1006-cat-file.sh`
- `d918f720d8` fetch-pack: refactor packet writing.
  - `fetch-pack.c`
- `2daf9ed803` transport: add client support for object-info.
  - `Makefile`
- `c3ba4afaf6` cat-file: add remote-object-info to batch-command.
  - `object-file.c`, `object-store-ll.h` (deleted).

I'm active on the mailing list, learning Git's workflow, learning from the feedback the maintainers have given on my patches, and reviewing others' patches.
Following the project guidelines, I haven't done anything on the project that could step on other candidates' work before being accepted. Instead I'm focusing on understanding the project and its needs, and on independent patches that make the Git project more familiar and understandable to me.

# The Problem

Eric Ju's work remains unmerged after v11 because of these issues:

- The format validation uses `strstr()`, which only checks for `%(objectsize)`. This causes two different errors:
  - For atoms that `expand_atom()` recognizes but the remote doesn't support (`objecttype`, `deltabase`, ...), `expand_atom()` returns 1, but `data->type` only contains garbage when it is accessed, causing a segfault, as Jeff King noted [3].
  - For atoms unknown to `expand_atom()`, it returns 0, so `expand_format()` calls `strbuf_expand_bad_format()`, which calls `die()`, as Jeff King found [3].

  Both cases block the command, including local `info` queries if the same format string is shared. Unsupported remote placeholders should return an empty string, matching how `for-each-ref` returns empty for known but inapplicable atoms like `%(tagger)` on non-tags [4] [5].

- When local and remote queries are mixed, `data->type` is not cleared between commands, so `remote-object-info` returns the type left over from a previous local query [6].
- The style and code issues raised by Junio Hamano [2] and Jeff King [3] [5] are still unaddressed:
  - comment style.
  - `#define` formatting.
  - line length.
  - misleading error messages.
  - missing `count > MAX_ALLOWED_OBJ_LIMIT` check at `split_cmdline()`.
  - if/else inversion in `get_remote_info()`.
- `%(objecttype)` is not yet supported on either the client or the server side.

# The Solution

There are two main goals:

## Goal 1: Rebase and finish Eric's work

Starting from where Eric Ju left off, I will rebase his series on top of the current `master` branch and address the remaining feedback:

- Fix style in comments, `#define` formatting and line length.
- Fix the misleading error message in the overflow check.
- Add the missing `count > MAX_ALLOWED_OBJ_LIMIT` check after `split_cmdline()`.
- Invert the if/else in `get_remote_info()` to keep the small block (the error one) first, as Junio suggested.

#### Replace `strstr()` format validation with allow_list in `expand_atom()`

`strstr()` isn't enough to validate the placeholders: it only searches for `%(objectsize)`, and unsupported placeholders cause segfaults. Jeff King noted [4] that the fix is to refactor the validation with an allow_list in `expand_atom()` or `expand_format()`. The best option is to place the validation in `expand_atom()`. Why `expand_atom()`?

- There are two failure cases: the first happens inside `expand_atom()` before it returns (the segfault), and the second is the `die()` reached when `expand_atom()` returns 0. Placing the `allow_list` check at the top of `expand_atom()` prevents both: in remote mode, an atom outside the list appends nothing to `sb` and returns 1, so `data->type` is never accessed (no segfault) and `expand_format()` never reaches `die()`.

As extra safety, initializing `data->type` to `OBJ_BAD` and checking for `NULL` from `type_name()` ensures that, even without the `allow_list`, uninitialized data doesn't cause a segfault.

In Goal 1, only `%(objectname)` and `%(objectsize)` will be in the allow_list. Goal 2 will bring `%(objecttype)` support.
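The allow_list idea above can be illustrated with a minimal sketch. This is not Git's code: the `sk_*` names, the struct layout and the enum values are hypothetical stand-ins for `expand_atom()`, `expand_data`, `OBJ_BAD` and `type_name()`, and a fixed output buffer replaces the `strbuf`. It only shows the intended control flow: check the allow_list first in remote mode, expand unsupported atoms to an empty string, and keep the type field in a safe "bad" state between commands.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for Git's object type enum; values are illustrative. */
enum sk_type { SK_BAD = -1, SK_COMMIT = 1, SK_TREE, SK_BLOB, SK_TAG };

/* Hypothetical, reduced version of the per-object state cat-file keeps. */
struct sk_data {
	const char *oid_hex;
	unsigned long size;
	enum sk_type type;
	int is_remote; /* set while serving a remote-object-info query */
};

/* Atoms the remote can answer in Goal 1; Goal 2 would add "objecttype". */
static const char *const sk_allow_list[] = { "objectname", "objectsize" };

static int sk_atom_is(const char *atom, size_t len, const char *name)
{
	return strlen(name) == len && !memcmp(atom, name, len);
}

static int sk_remote_allowed(const char *atom, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(sk_allow_list) / sizeof(sk_allow_list[0]); i++)
		if (sk_atom_is(atom, len, sk_allow_list[i]))
			return 1;
	return 0;
}

/* Returns NULL for SK_BAD or any unknown value instead of reading garbage. */
static const char *sk_type_name(enum sk_type type)
{
	switch (type) {
	case SK_COMMIT: return "commit";
	case SK_TREE: return "tree";
	case SK_BLOB: return "blob";
	case SK_TAG: return "tag";
	default: return NULL;
	}
}

/* Reset per-object state so nothing leaks from the previous command. */
static void sk_reset(struct sk_data *d, int is_remote)
{
	d->oid_hex = "";
	d->size = 0;
	d->type = SK_BAD;
	d->is_remote = is_remote;
}

/*
 * Expand one atom into out (outlen must be at least 1).
 * Returns 1 if the atom was handled, 0 if it is unknown, which is
 * the point where the real caller would die().
 */
static int sk_expand_atom(char *out, size_t outlen, const char *atom,
			  size_t len, const struct sk_data *d)
{
	const char *name;

	out[0] = '\0';

	/* Allow_list first: unsupported remote atoms expand to "". */
	if (d->is_remote && !sk_remote_allowed(atom, len))
		return 1;

	if (sk_atom_is(atom, len, "objectname")) {
		snprintf(out, outlen, "%s", d->oid_hex);
	} else if (sk_atom_is(atom, len, "objectsize")) {
		snprintf(out, outlen, "%lu", d->size);
	} else if (sk_atom_is(atom, len, "objecttype")) {
		name = sk_type_name(d->type);
		if (name) /* NULL check: a bad type prints nothing */
			snprintf(out, outlen, "%s", name);
	} else {
		return 0;
	}
	return 1;
}
```

In this sketch a remote query for `objecttype` returns 1 with an empty expansion, while an unknown atom in local mode still returns 0 so the caller can report it. Calling `sk_reset()` before each command models the fix for the stale `data->type` bug.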
## Goal 2: Adding `%(objecttype)`

Following what Calvin Wan did in 2021 [8] for `%(objectsize)`, the v2 protocol needs to be extended on the server side to support the new `%(objecttype)` placeholder:

- extend `object_info_advertise()` in `serve.c`
- add `.type` to the `requested_info` struct in `serve.c`
- support `type` in `cap_object_info()` in `protocol-caps.c`
- look up the type in `send_info()` in `protocol-caps.c`

Following the object-info protocol docs [7], it should look like:

```
attrs = "size" SP "type"
obj-type = "blob" | "tree" | "commit" | "tag"
obj-info = obj-id SP obj-size SP obj-type
info = PKT-LINE(attrs LF)
       *PKT-LINE(obj-info LF)
```

`%(objecttype)` needs to be added to the `allow_list`. The client side needs to learn to request `%(objecttype)` from the remote, parse the response, and fill `expand_data` with the actual type. It then returns the object type instead of the empty string returned while the atom was unsupported.

The default format evolves to `%(objectname) %(objecttype) %(objectsize)`. Test and document the new placeholder support and the server-side extension.

## Backward Compatibility

There are four possible scenarios between client and server:

1. **The server doesn't know type (new client but old server)**: After receiving the server capabilities, the client doesn't see `type` being advertised. When the user's format string has `%(objecttype)`, `expand_atom()` checks the `allow_list` and finds that type was not fetched. It appends an empty string to the output buffer and returns 1. The user sees an empty field where `type` should be, with no errors or warnings. In Eric Ju's v11 this would crash, as described in The Problem section; the `allow_list` from Goal 1 is what fixes it, following the `for-each-ref` behaviour for known but inapplicable atoms, as Jeff King suggested [4] [5].
2. **The server knows type but the client doesn't (new server but old client)**: The server advertises `type`, but the client doesn't know `type` and, following `gitprotocol-v2.adoc` ("Clients must ignore all unknown keys"), silently ignores it and only asks for what it knows (`size`). The server returns only what was requested, so the user sees the output for `size` but not for `type`. This doesn't need any new code; the v2 protocol already behaves like this.
3. **Both know type (new client and new server)**: The server advertises `type`, and the client requests `type` and receives the type data. `expand_atom()` finds `type` in the `allow_list`, fills `data->type`, and the user sees the object type in the output. This is Goal 2.
4. **Both know type but protocol middleware doesn't (new client, new server but old middleware)**: This becomes case 1 or 2 depending on which side is affected by the middleware. If the middleware removes `type` from the server's advertised capabilities, the client never sees it and treats the server as an old server; it becomes case 1 (empty string). If the middleware removes `type` from the client request, the server only sees `size` being requested and only returns size data; it becomes case 2.

## Performance Considerations

To get an object's type, only the header has to be read. To get the size, `oid_object_info()` in `object-file.c` is already being called, and it returns the object type in the same call. Sending the type string costs, in the worst case, 6 bytes for the "commit" string.

# Timeline

I've planned this with generous time estimates, so the actual work may finish sooner than stated here.

May 1-24: Community Bonding

- Keep working on my ongoing patches and new ones.
- Meet with my assigned mentor to get feedback on my proposal, agree on how I will report my progress beyond the submitted code and possible blog posts, and get tips for working better on Git.
- Confirm with the mentor that the `allow_list` approach is still the best option.
- Draft the commit structure.
- Set up a blog to keep track of how GSoC at Git is going.

Week 1-2: (May 26 - June 8)

- First, rebase Eric Ju's v11.
- Start Goal 1 fixes.
- Fix style and code issues.

Week 3-4: (June 9 - June 22)

- Start the Goal 1 implementation (allow_list approach).

Week 5-6: (June 23 - July 6)

- Goal 1 should be polished or close to its final form.
- Send the patch series for Goal 1.
- Start Goal 2.
- Prepare the midterm report.

**Midterm evaluation** (July 7 - 11), as specified in the GSoC timeline docs

- Goal 1 submitted.

Week 7-8: (July 14 - July 27)

- Start the server-side v2 protocol extension (`%(objecttype)`).

Week 9-10: (July 28 - August 10)

- Add `%(objecttype)` to the `allow_list` from Goal 1.
- Client-side extension.
- End-to-end tests and documentation.
- Default format becomes `%(objectname) %(objecttype) %(objectsize)`.
- Send the patch series.

Week 11-12: (August 11 - August 24)

- Goal 2 should be close to done.
- Polish everything: all tests pass, good test coverage, no style/comment issues.
- Final documentation review.
- Prepare for the final evaluation.

**Final evaluation** (August 18-24), as specified in the GSoC timeline docs

## Additional objectives

If there is enough time, or as future work after the project, I have some ideas on how this could evolve:

## More placeholders support

I've checked that Eric's v11 patch only supports `%(objectsize)` on the server side, but there are other placeholders on the client side that could be added too. With the `allow_list` and Goal 2 implemented, adding more placeholders becomes trivial.

- `%(objectsize:disk)`: Returns the size on disk (compressed or as a delta) instead of the uncompressed size that `%(objectsize)` returns. To do this, the server would need to send the actual on-disk size.
- `%(deltabase)`: Returns the delta base object OID. Non-delta objects return the zero OID, as they do locally.

## Returning missing blobs from a tree ordered

In a partial clone, someone might want to know which blobs are missing inside a specific tree and order them before fetching them.
The idea is to build on top of `remote-object-info` and what has been built in Goal 1 and Goal 2:
given a tree hash, return the missing blobs (inside that tree) ordered by an orderable atom (size, name, type, ...).

This looks similar to Stolee's work on `git-backfill` [9]. The key difference is that `git-backfill` fetches the missing objects from a path/object, while this would only query the metadata of the missing blobs, without fetching them, and order the result by a given atom.

Thanks for reading my proposal and considering my application. I'm very excited about this opportunity,

Pablo

\[1\]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ "Eric Ju's v11 patch"

\[2\]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio Hamano feedback"

\[3\]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ "Jeff King feedback"

\[4\]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ "options for strstr() by Jeff King"

\[5\]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ "Jeff King follow-up"

\[6\]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ "data->type not being cleared bug"

\[7\]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info "object-info protocol docs"

\[8\]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t "Calvin Wan's patch series"

\[9\]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/ "git-backfill extension from Stolee"

[1]: https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/ "Eric Ju's v11 patch"

[2]: https://lore.kernel.org/git/xmqqo6yr3wc4.fsf@gitster.g/ "Junio Hamano feedback"
[3]: https://lore.kernel.org/git/20250224234720.GC729825@coredump.intra.peff.net/ "Jeff King feedback"

[4]: https://lore.kernel.org/git/20250313060250.GH94015@coredump.intra.peff.net/ "options for strstr() by Jeff King"

[5]: https://lore.kernel.org/git/20250324033922.GB690093@coredump.intra.peff.net/ "Jeff King follow-up"

[6]: https://lore.kernel.org/git/20260312214154.89120-1-pabloosabaterr@gmail.com/ "data->type not being cleared bug"

[7]: https://github.com/git/git/blob/master/Documentation/gitprotocol-v2.adoc#object-info "object-info protocol docs"

[8]: https://lore.kernel.org/git/20220728230210.2952731-1-calvinwan@google.com/#t "Calvin Wan's patch series"

[9]: https://lore.kernel.org/git/pull.2070.git.1773707361.gitgitgadget@gmail.com/ "git-backfill extension from Stolee"

diff --git a/v4.md b/v5.md
index c5b8bc6..20b05b1 100755
--- a/v4.md
+++ b/v5.md
@@
-Location: Murcia, Spain (CET, UTC+1)
+Location: Murcia, Spain (CET, UTC+2)
@@
-**Status**: Will merge to `next`.
+**Status**: merged to `next` on 2026-03-13 at f545ea5a9c.
@@
+[[GSoC PATCH v6 0/3] graph: add --graph-lane-limit option](https://lore.kernel.org/git/20260328001113.1275291-1-pabloosabaterr@gmail.com/)
+
+**Status**: WIP.
+
+**Description**: Adds `--graph-lane-limit` option to `--graph` to limit the number of horizontal lanes that will be shown. Helps readability for projects with many branches.
+
+[[GSoC PATCH 0/3] receive-pack: fix HEAD check for updateInstead](https://lore.kernel.org/git/20260330111822.165188-1-pabloosabaterr@gmail.com/)
+
+**Status**: pending review.
+
+**Description**: Fix updateInstead HEAD check that only looked for the bare repo context instead of the worktree HEAD, which rejected the pushes even with the wt clean.
+
+## Code Reviews
+
@@
-**Description**: Code review for a patch sent.
+**Description**: Code review for a patch sent about simplifying a duplicated code.
+
+[Re: [PATCH] t2107: modernize path existence check](https://lore.kernel.org/all/CAN5EUNTNqC6+FPjKafoFfgaEzWdpXEV0QNwumF8CaxBEUOmA6Q@mail.gmail.com/)
@@
-[[GSoC RFC PATCH] graph: add --graph-max option to limit displayed columns](https://lore.kernel.org/git/20260316133426.117684-1-pabloosabaterr@gmail.com/)
+[Re: [GSoC][PATCH] t2000: modernize path checks to use helper functions](https://lore.kernel.org/all/CAN5EUNTSO7KvtO02c-EHJTK95rmcZKRBtKsn8kjNid1qupWZ0w@mail.gmail.com/)
@@
-**Status**: RFC, waiting for feedback.
+[Re: [PATCH] t5315: use test_path_is_file for loose-object check](https://lore.kernel.org/all/CAN5EUNR2mqpCMG0oPsDnzgZr-2yyL+S0A7p_MM62F7d4MjBuSA@mail.gmail.com/)
@@
-**Description**: Adds `--graph-max` option to `git log --graph` to cap the number of columns that will be displayed. Helps readability for projects with many branches.
+**Description**: Reviews to newcomers on their microprojects patches.
@@
-I'm being active on the mailing list and learning the Git flow of work and from the feedback I've received from the maintainers (Junio) from my patches.
+I'm being active on the mailing list, learning the Git flow of work and learning from the feedback I've received from the maintainers on my patches and reviewing others.
Thread overview: 10+ messages
2026-03-13 10:17 [GSoC] Proposal: Complete and extend the remote-object-info command for git cat-file Pablo
2026-03-14 5:58 ` Chandra Pratap
2026-03-14 18:31 ` Pablo
2026-03-15 9:20 ` Chandra Pratap
2026-03-16 11:21 ` Christian Couder
2026-03-16 21:38 ` Karthik Nayak
2026-03-18 10:45 ` Pablo
2026-03-16 16:05 ` [GSoC v3] " Pablo Sabater
2026-03-18 11:42 ` [GSoC v4] " Pablo
2026-04-12 14:41 ` Pablo