Failed to read data_stream from a Diff object in python REPL #642

johnlinp opened this issue Jul 13, 2017 · 6 comments
@johnlinp

Version: 2.1.5
Python Version: 3.6.0
Reproducing steps:

I have the following python script:

import git

repo = git.Repo('/tmp/gittest')

commit1 = repo.commit('master')
commit2 = repo.commit('master^')

diffs = commit1.diff(commit2)
diff = diffs[0]

diff.b_blob.data_stream
diff.b_blob.data_stream.read()

If I save it into okay.py and execute python okay.py, everything's fine.

However, if I copy the script and paste it into the Python REPL, an exception occurs:

root@jacky:source# python
Python 3.6.0 (default, Jan 16 2017, 12:12:55)
[GCC 6.3.1 20170109] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import git
>>>
>>> repo = git.Repo('/tmp/gittest')
>>>
>>> commit1 = repo.commit('master')
>>> commit2 = repo.commit('master^')
>>>
>>> diffs = commit1.diff(commit2)
>>> diff = diffs[0]
>>>
>>> diff.b_blob.data_stream
(b'\x9cY\xe2K\x83\x93\x17\x9a]q-\xe4\xf9\x90\x17\x8d\xf5sM\x99', b'blob', 6, <git.cmd.Git.CatFileContentStream object at 0x7faf444c3e48>)
>>> diff.b_blob.data_stream.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/site-packages/git/objects/base.py", line 112, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/usr/lib/python3.6/site-packages/git/db.py", line 42, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(sha))
  File "/usr/lib/python3.6/site-packages/git/cmd.py", line 957, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "/usr/lib/python3.6/site-packages/git/cmd.py", line 929, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/usr/lib/python3.6/site-packages/git/cmd.py", line 893, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'first' could not be resolved, git returned: b'first'

Why the inconsistency?

@Byron
Member

Byron commented Sep 28, 2017

Thanks for the report, I was able to reproduce the issue.

@SpoonMeiser

I have the same issue. It looks like the same persistent cat-file command is used twice, but the first time only the first line of its output (the header) is read. On the second call, stdout still contains the unread output from the first invocation, and that is read instead of the summary line it tries to parse.
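To make that explanation concrete, here is a toy model of the failure. It does not use GitPython's code: `FakeBatchProcess`, `parse_header`, `open_stream`, and the id `abc123` are all made up for illustration. The only things taken from the real world are the `git cat-file --batch` output layout (a `<sha> <type> <size>` header line, then the content, then a newline) and the wording of the error message from the traceback above.

```python
import io


class FakeBatchProcess:
    """Toy stand-in for one long-lived `git cat-file --batch` process.

    Every request appends "<sha> <type> <size>\n<content>\n" to a single
    shared output pipe, and all readers consume from that same pipe.
    """

    def __init__(self, objects):
        self._objects = objects      # maps sha (bytes) -> content (bytes)
        self._pipe = io.BytesIO()    # shared stdout of the fake process
        self._read_pos = 0           # how far the pipe has been consumed

    def request(self, sha):
        content = self._objects[sha]
        self._pipe.seek(0, io.SEEK_END)
        self._pipe.write(b"%s blob %d\n" % (sha, len(content)) + content + b"\n")

    def readline(self):
        self._pipe.seek(self._read_pos)
        line = self._pipe.readline()
        self._read_pos = self._pipe.tell()
        return line

    def read(self, n):
        self._pipe.seek(self._read_pos)
        data = self._pipe.read(n)
        self._read_pos = self._pipe.tell()
        return data


def parse_header(line):
    """Parse a header line, failing the same way the traceback does."""
    tokens = line.split()
    if len(tokens) != 3:
        bad = tokens[0] if tokens else b""
        raise ValueError("SHA %s could not be resolved, git returned: %r"
                         % (bad, line.strip()))
    return tokens[0], tokens[1], int(tokens[2])


def open_stream(proc, sha):
    """Ask for an object and read only its header; the body stays in the pipe."""
    proc.request(sha)
    return parse_header(proc.readline())


SHA = b"abc123"  # made-up id; the blob content is the 6 bytes b"first\n"

# Out of order: the first body is never read, so the second header parse
# picks up the leftover content line instead of a header.
proc = FakeBatchProcess({SHA: b"first\n"})
open_stream(proc, SHA)
try:
    open_stream(proc, SHA)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# In order: drain the body (plus the trailing newline) before asking again.
proc = FakeBatchProcess({SHA: b"first\n"})
_, _, size = open_stream(proc, SHA)
body = proc.read(size)
proc.read(1)  # newline that terminates the record
second_header = open_stream(proc, SHA)

print(error_message)
print(body, second_header)
```

In this model the second, out-of-order request fails with the same message as in the report, while draining the first body before the next request lets the header parse succeed.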

@SpoonMeiser

The data_stream property contains the note:

:note: returned streams must be read in order

Which makes for a really awkward interface because it temporally couples the calling code, but at least it gives a hint at how this can be worked around. Maybe the easiest initial fix would be to just make this clear in the documentation somewhere.
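A minimal sketch of that workaround, assuming the note means what it says: fetch each stream and read it to the end before touching `data_stream` again, and never leave an unread stream lying around (in the REPL, the `_` variable can keep one alive). The helper names `read_blob_bytes` and `read_diff_sides` are hypothetical, not part of GitPython; they rely only on the `data_stream` property and its `read()` method, which the original script already uses.

```python
def read_blob_bytes(blob):
    """Return the full content of `blob` as bytes.

    `blob` is assumed to be a GitPython blob, or anything else exposing a
    `data_stream` attribute whose value has a `read()` method. The stream is
    fetched and drained in one step so no half-read stream is left behind.
    """
    stream = blob.data_stream
    return stream.read()


def read_diff_sides(diff):
    """Read both sides of a diff strictly one after the other.

    Either side can be None (added or deleted files), in which case None
    is returned for that side.
    """
    a_data = read_blob_bytes(diff.a_blob) if diff.a_blob is not None else None
    b_data = read_blob_bytes(diff.b_blob) if diff.b_blob is not None else None
    return a_data, b_data
```

In the script from the report, this would replace the two `diff.b_blob.data_stream` lines with a single `read_blob_bytes(diff.b_blob)` call.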

We've only run into this after upgrading GitPython, so there is an old (maybe really old) version that didn't have this issue.

@Byron
Member

Byron commented Apr 23, 2020

Yes, I absolutely agree. It's an implementation detail of the underlying git object database which leaks into the API, and it's a trap that will leave everyone puzzled as to why it happens.

Even though I am responsible for this awkwardness and thus should know, it wasn't obvious to me either.

Another workaround might be to use the GitDB type when instantiating the git repository, as it is a pure-python implementation that accesses data directly. It's slower, and definitely not suited for server processes due to file handles not being released automatically.

Repo('.', odbt=git.db.GitDB)

ghost pushed a commit to connectedcompany/coco-agent that referenced this issue Dec 13, 2020
@Symbolk

Symbolk commented Jul 19, 2021

Hi, I am using the latest GitPython 3.1.18 and git version 2.30.0 to mine merge scenarios from a repo and still found this error.

Here is the code: https://github.com/Symbolk/MergeScenarioMiner/blob/4af16a6bf893301be27a352c24409b3c5612bae0/main.py#L167

I tried the GitDB workaround, but it did not solve the problem. Instead, it reported:

c1c45b46a36e9725f9741cce25732c69536be075
Ready to process repo: realm-java at branch: master
Traceback (most recent call last):
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/main.py", line 290, in <module>
    git_service.collect_from_commits(['00c9dd117b4b3279c4f48238948005994c90a491'])
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/main.py", line 216, in collect_from_commits
    conflict_file_paths, num_conflicts_per_file = self.collect_merge_scenrios(merge_commit, unmerged_blobs,
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/main.py", line 166, in collect_merge_scenrios
    base_content = blob.data_stream.read()
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/venv/lib/python3.9/site-packages/git/objects/base.py", line 131, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/venv/lib/python3.9/site-packages/gitdb/db/base.py", line 208, in stream
    return self._db_query(sha).stream(sha)
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/venv/lib/python3.9/site-packages/gitdb/db/base.py", line 192, in _db_query
    raise BadObject(sha)
gitdb.exc.BadObject: BadObject: b'8517ee7f4378fe0f54945b3e4973766ff65e455d'

Do you think it is a problem of Git or GitPython?

@Byron
Member

Byron commented Jul 19, 2021

It's probably a GitPython issue: by now, the GitDB object database implementation is unlikely to still be complete. It might therefore not see objects that are there, and, independently of that, it definitely won't see objects that have been created since the repository was opened.

The only correct implementation is the default one as it uses git itself, but it will require the caller to be careful about object references. Depending on what should be accomplished, maybe using libgit2 for python will be a better choice.
