git-fastcdc


- Name: git-fastcdc
- Version: 0.4.0
- Summary: FastCDC for large git files
- Upload time: 2024-04-10 10:43:55
- Author: Jean-Louis Fuchs
- Requires Python: <4.0,>=3.9
- License: AGPL-3.0-or-later
- Requirements: none recorded

# git-fastcdc

Split certain files using content-defined chunking (CDC) for faster
deduplication. The use case is similar to git-lfs, but the blobs stay in the
repository. git-fastcdc mitigates some of the speed penalties that come with
this. For most use cases you are probably better off with git-lfs. If your focus
is on archival and deduplication, git-fastcdc might be right for you.
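
To make "content-defined chunking" concrete, here is a minimal sketch of a gear-hash chunker. It is a simplified illustration, not FastCDC itself and not git-fastcdc's implementation: the names `chunk`, `GEAR`, `MIN_SIZE`, `MAX_SIZE`, and `MASK`, as well as all parameter values, are assumptions chosen for the example. The point is that cut points depend on the bytes near the cut, so inserting data at the front of a file leaves most later chunks byte-identical, and identical chunks deduplicate.

```python
import random

# Hypothetical parameters for illustration; not git-fastcdc's real settings.
MIN_SIZE = 64          # never cut before this many bytes
MAX_SIZE = 8192        # always cut once a chunk reaches this many bytes
MASK = (1 << 10) - 1   # cut when the low 10 hash bits are zero (~1 KiB average)

# Gear table: one pseudo-random 64-bit value per byte value, fixed by a seed.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk(data: bytes):
    """Yield chunks of data whose boundaries depend on the content itself."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling gear hash: older bytes shift out of the 64-bit value.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and h & MASK == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


if __name__ == "__main__":
    original = bytes(random.Random(1).getrandbits(8) for _ in range(100_000))
    edited = b"0123456789" + original  # insert ten bytes at the front
    before = set(chunk(original))
    after = set(chunk(edited))
    print(f"{len(before & after)} of {len(before)} chunks are unchanged")
```

With fixed-size chunks the same ten-byte insertion would shift every boundary and change every chunk; with content-defined boundaries the chunker typically resynchronises within the first chunk or two.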

## Enable

```bash
git fastcdc install
```

## Config

Edit `.gitattributes`:

```
*.wav binary filter=git_fastcdc
/.gitattributes text -binary -filter
/.gitignore text -binary -filter
```

By default git-fastcdc runs in-memory. Switch to on-disk:

```bash
git config --local fastcdc.ondisk true
```

If you have a pure git-fastcdc repository, you probably want to disable
delta-compression to benefit from the speedups that fastcdc provides.

```bash
git fastcdc delta disable
```

This sets `core.bigFileThreshold` to `200k`, which isn't an exact science. It
means most of the history and metadata is delta-compressed, while most of the
CDC blobs aren't.
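
As a rough sketch of what that threshold means: git documents `core.bigFileThreshold` as the size above which files are stored deflated without attempting delta compression, and size suffixes such as `k` are powers of 1024. The helper names `parse_git_size` and `skips_delta` and the example sizes below are hypothetical, chosen only to illustrate both sides of the cutoff; they are not part of git or git-fastcdc.

```python
def parse_git_size(value: str) -> int:
    """Convert a git size value such as '200k' to bytes (k, m, g are powers of 1024)."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def skips_delta(blob_size: int, threshold: str = "200k") -> bool:
    """True if a blob of this size is larger than the given bigFileThreshold."""
    return blob_size > parse_git_size(threshold)


if __name__ == "__main__":
    # Hypothetical object sizes, not measurements from a real repository.
    for name, size in [("small metadata object", 2_000), ("large chunk", 1_000_000)]:
        verdict = "stored without delta" if skips_delta(size) else "delta candidate"
        print(f"{name} ({size} bytes): {verdict}")
```

Under these assumptions, `200k` is 204800 bytes, so small commit, tree, and metadata objects stay delta candidates while any blob above that size is written as-is.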

## Results

For my repository, an 800 GB music collection:

- Without git-fastcdc, delta-compression took over 5 hours (actually it took all
  night)
- With git-fastcdc, delta-compression takes about 2 minutes
- With git-fastcdc, the repository got slightly smaller: about 1%

So the repack is much faster, with delta-compression that is just as effective.

Methodology: I took one state of my repository from 2 years ago and one state
from today. A lot of metadata has changed between those two states, because I am
constantly fixing it using beaTunes. In both tests I created two commits
and ran `git repack -a -d -f` at the end.

## How

git-fastcdc splits files in the filter when you add them. The resulting chunks
go into the `git-fastcdc` branch. You need to push this branch to remotes too!

In the working copy you still see the actual data in the files (`*.wav` in the
example above), but the blobs git stores for these files are just lists of
chunks.
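
The idea of a blob that is "just a list of chunks" can be sketched as a pair of functions that mirror what a git clean/smudge filter does. Everything here is an assumption made for illustration: the manifest format (one SHA-1 hex digest per line), the in-memory `store` dictionary standing in for the `git-fastcdc` branch, the fixed `chunk_size` splitting, and the function names `clean` and `smudge`. The real tool uses content-defined boundaries and its own storage layout, which this README does not specify.

```python
import hashlib


def clean(data: bytes, store: dict, chunk_size: int = 4) -> bytes:
    """Store each chunk under its SHA-1 and return a manifest listing the hashes."""
    hashes = []
    for pos in range(0, len(data), chunk_size):
        piece = data[pos:pos + chunk_size]
        digest = hashlib.sha1(piece).hexdigest()
        store[digest] = piece  # identical chunks share one entry
        hashes.append(digest)
    return "\n".join(hashes).encode()


def smudge(manifest: bytes, store: dict) -> bytes:
    """Rebuild the original file content from a manifest."""
    return b"".join(store[h] for h in manifest.decode().split())


if __name__ == "__main__":
    store = {}
    manifest = clean(b"AAAABBBBAAAACCCC", store)
    assert smudge(manifest, store) == b"AAAABBBBAAAACCCC"
    print(len(manifest.split()), "manifest entries,", len(store), "stored chunks")
```

The example file has four chunks but only three distinct ones, so the manifest has four entries while the store holds three chunks. That gap between entries and stored chunks is the deduplication.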

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "git-fastcdc",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Jean-Louis Fuchs",
    "author_email": "jean-louis.fuchs@adfinis.com",
    "download_url": "https://files.pythonhosted.org/packages/84/76/52b94e137a6e63304fe59497aff1ddbc96dd434efadb3e4ff8ddd6e12425/git_fastcdc-0.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0-or-later",
    "summary": "FastCDC for large git files",
    "version": "0.4.0",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "aa084cbe4d0ca61b96734ab440abe54d5c8fb0e15ae2de3de82c000f96556c41",
                "md5": "83f46cf414e689d6298bbd3dd5cf0b5a",
                "sha256": "f50fb5a5c27f54632830260641a506e04f87b6e0a8864780318b544e868c46d3"
            },
            "downloads": -1,
            "filename": "git_fastcdc-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "83f46cf414e689d6298bbd3dd5cf0b5a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 18368,
            "upload_time": "2024-04-10T10:43:53",
            "upload_time_iso_8601": "2024-04-10T10:43:53.479346Z",
            "url": "https://files.pythonhosted.org/packages/aa/08/4cbe4d0ca61b96734ab440abe54d5c8fb0e15ae2de3de82c000f96556c41/git_fastcdc-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "847652b94e137a6e63304fe59497aff1ddbc96dd434efadb3e4ff8ddd6e12425",
                "md5": "5c20018d35ebc4822773fb324671baa5",
                "sha256": "ea7da4c9369fbb95bee65f7423064744ba34ae9fb951f7d1fa32ea6a830476e7"
            },
            "downloads": -1,
            "filename": "git_fastcdc-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5c20018d35ebc4822773fb324671baa5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 17757,
            "upload_time": "2024-04-10T10:43:55",
            "upload_time_iso_8601": "2024-04-10T10:43:55.610038Z",
            "url": "https://files.pythonhosted.org/packages/84/76/52b94e137a6e63304fe59497aff1ddbc96dd434efadb3e4ff8ddd6e12425/git_fastcdc-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-10 10:43:55",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "git-fastcdc"
}
        