# sarfile
Like tarfile, but streamable.
## What is this?
This repository implements a "streaming archive" (SAR) file format for collecting multiple files into one. It is similar to the TAR format, but it stores the metadata for every file in the archive in a contiguous block at the beginning of the file. This solves a few problems:
1. Much faster startup times for large archives (we read the entire header into memory in one go)
2. Much friendlier to remote file systems (one network request rather than many), in combination with `smart_open`
3. Fast random access
The file size is the same as an uncompressed TAR file.
The downside is that once a SAR file has been written, it can't be modified. Future versions of the format may support this, but for now the recommended flow is to first generate a TAR file, then convert it using the built-in `sarpack` command line tool or the `sarfile.pack_tar` Python API.
Also, the file format only exists in this repository, although it is very simple to implement (see the `_header.py` documentation and the `sarfile` object for how items are loaded).
## Getting Started
Install the package using Pip:
```bash
pip install sarfile
```
Next, simply import the module:
```python
import sarfile
```
You can convert a tarfile to a sarfile using the Python API:
```python
sarfile.pack_tar(out="myfile.sar", tar="myfile.tar")
```
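For example, you can build the TAR archive with the standard library's `tarfile` module and then convert it. The sketch below assumes a local `data/` directory to archive; only `pack_tar` comes from this package.
```python
import tarfile

import sarfile

# Build an ordinary TAR archive first (here from a hypothetical "data/" directory).
with tarfile.open("myfile.tar", "w") as tar:
    tar.add("data", arcname="data")

# Then convert it: pack_tar reads the TAR file and writes the equivalent SAR file.
sarfile.pack_tar(out="myfile.sar", tar="myfile.tar")
```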
Alternatively, you can use the built-in command line tool:
```bash
sarpack myfile.sar myfile.tar
```
Finally, the file can be used in your Python script:
```python
f = sarfile.open("myfile.sar")
print(f.names)
with f["myfile.txt"] as myfile:
    print(myfile.read())
```
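Since each member behaves like a readable file object, you can iterate over the whole archive using just the `names` list and item access shown above. This is a minimal sketch; the file and member names are hypothetical.
```python
import sarfile

f = sarfile.open("myfile.sar")

# Walk every member of the archive and report its size.
for name in f.names:
    with f[name] as member:
        data = member.read()
        print(f"{name}: {len(data)} bytes")
```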
If you have installed `smart_open`, you can also read directly from S3 (or another remote file system) by passing a remote path:
```python
f = sarfile.open("s3://my-bucket/myfile.sar")
print(f.names)
with f["myfile.txt"] as myfile:
    print(myfile.read())
```
The above code is much faster than reading a TAR file from S3, because the entire header is read into memory in a single network request, rather than one request per file in the archive. Subsequent accesses download only the byte range of the file being read.
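Because access is random, you can pull a single member out of a large remote archive without downloading the rest. A minimal sketch, assuming a hypothetical archive and member name on S3:
```python
import sarfile

# Opening the archive downloads only the header.
f = sarfile.open("s3://my-bucket/big-archive.sar")

# Reading one member then fetches just that member's bytes,
# not the whole archive.
if "images/00042.png" in f.names:
    with f["images/00042.png"] as member:
        payload = member.read()
        print(len(payload))
```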
## Requirements
This package is tested against Python 3.10. Although not required, it is a good idea to install `smart_open` to support reading from S3 or other remote file systems, and `tqdm` to show a progress bar when packing large files.
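One way to install the optional dependencies (the `s3` extra of `smart_open` pulls in the S3 client libraries):
```bash
pip install "smart_open[s3]" tqdm
```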