circadian-scp-upload


| Field | Value |
| --- | --- |
| Name | circadian-scp-upload |
| Version | 0.3.2 |
| Home page | https://github.com/dostuffthatmatters/circadian-scp-upload |
| Summary | Resumable, interruptible, SCP upload client for any files or directories generated day by day |
| Upload time | 2023-08-26 13:15:44 |
| Author | Moritz Makowski |
| Requires Python | >=3.10,<4.0 |
| License | AGPL-3.0-only |
| Keywords | python, library, utilities, scp, ssh, synchronization, upload, files, directories, checksum, daily, data, time-series |

# 📮 Circadian SCP Upload

**Resumable, interruptible, SCP upload client for any files or directories generated day by day.**

[![GitHub Workflow Status (with event)](https://img.shields.io/github/actions/workflow/status/dostuffthatmatters/circadian-scp-upload/test.yaml?label=tests%20on%20main%20branch)](https://github.com/dostuffthatmatters/circadian-scp-upload/actions/workflows/test.yaml)
[![GitHub](https://img.shields.io/github/license/dostuffthatmatters/circadian-scp-upload?color=f1f5f9)](https://github.com/dostuffthatmatters/circadian-scp-upload/blob/main/LICENSE.md)
[![PyPI - Version](https://img.shields.io/github/v/tag/dostuffthatmatters/circadian-scp-upload?label=version&color=f1f5f9)](https://pypi.org/project/circadian-scp-upload)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/circadian_scp_upload?label=supported%20Python%20versions&color=f1f5f9)](https://pypi.org/project/circadian-scp-upload)

## Use Case

You have a directory on your local machine in which new data is generated every day. The directory looks like this:

```
📁 data-directory-1
├── 📁 20190101
│   ├── 📄 file1.txt
│   ├── 📄 file2.txt
│   └── 📄 file3.txt
└── 📁 20190102
    ├── 📄 file1.txt
    ├── 📄 file2.txt
    └── 📄 file3.txt
```

Or like this:

```
📁 data-directory-2
├── 📄 20190101.txt
├── 📄 20190102-a.txt
├── 📄 20190102-b.txt
└── 📄 20190103.txt
```

You want to upload that data to a server, but only after the day of creation. Additionally, you want to mark the directories as "in progress" on the remote server so that subsequent processing steps will not touch unfinished days of data while uploading.

This tool uses [SCP](https://en.wikipedia.org/wiki/Secure_copy_protocol) via the Python library [paramiko](https://github.com/paramiko/paramiko) to do that. It will write files named `.do-not-touch` in the local and remote directories during the upload process and delete them afterward.
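
A downstream processing step on the server can use the marker to skip days that are still being uploaded. The following is a minimal sketch, not part of this library. It assumes that the marker file sits inside each day's directory, and the function name `finished_day_directories` is hypothetical:

```python
from pathlib import Path

# Marker file name as described above
MARKER_NAME = ".do-not-touch"


def finished_day_directories(root: Path) -> list[Path]:
    """Return the subdirectories of `root` that carry no in-progress marker.

    Assumption: the marker is placed inside the day's directory itself.
    """
    return sorted(
        child
        for child in root.iterdir()
        if child.is_dir() and not (child / MARKER_NAME).exists()
    )
```

A processing job would then iterate over `finished_day_directories(Path("/path/to/remote/data-directory-1"))` and ignore everything else.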

The code snippet in the Usage section defines a specific directory/file naming scheme (for example, `%Y%m%d-(\.txt|\.csv)`). The client uses this scheme to tell _when_ a specific file or directory was generated. It will only upload files when at least one hour of the following day has passed.
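
To illustrate the idea, here is a rough sketch of how a naming scheme with `%Y`/`%y`/`%m`/`%d` placeholders could be turned into a regular expression, and how the "one hour into the following day" rule could be checked. This is not the library's actual implementation. The helper names are hypothetical, and the sketch assumes that each placeholder appears at most once and that the pattern contains a year, a month, and a day:

```python
import datetime
import re
from typing import Optional

# Placeholder -> named regex group (assumed mapping for this sketch)
SPECIFIERS = {
    "%Y": r"(?P<Y>\d{4})",
    "%y": r"(?P<y>\d{2})",
    "%m": r"(?P<m>\d{2})",
    "%d": r"(?P<d>\d{2})",
}


def compile_dated_pattern(pattern: str) -> re.Pattern[str]:
    """Replace the date placeholders with named groups and compile the result."""
    for specifier, group in SPECIFIERS.items():
        pattern = pattern.replace(specifier, group)
    return re.compile(pattern)


def extract_date(name: str, regex: re.Pattern[str]) -> Optional[datetime.date]:
    """Return the date encoded in `name`, or None if it does not match."""
    match = regex.match(name)
    if match is None:
        return None
    groups = match.groupdict()
    # Two-digit years are assumed to lie in the 2000s
    year = int(groups["Y"]) if "Y" in groups else 2000 + int(groups["y"])
    try:
        return datetime.date(year, int(groups["m"]), int(groups["d"]))
    except ValueError:
        # Digits matched, but they do not form a valid calendar date
        return None


def is_due_for_upload(date: datetime.date, now: datetime.datetime) -> bool:
    """True once at least one hour of the day after `date` has passed."""
    earliest = datetime.datetime.combine(
        date + datetime.timedelta(days=1), datetime.time(hour=1)
    )
    return now >= earliest
```

With these helpers, `extract_date("20190102-a.txt", compile_dated_pattern(r"^.*%Y%m%d.*$"))` yields the date 2019-01-02, and `is_due_for_upload` only returns `True` for that date from 2019-01-03 01:00 onward.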

**Can't I use `rsync` or a similar CLI tool for that?**

Yes, of course. However, the actual copying logic for individual files or directories takes up just 130 lines of code in this repository. The rest of this library is dedicated to being a plug-and-play solution for any Python codebase: logging, regex filters, being interruptible, in-progress markers, and so on.

One should be able to `pip install`/`poetry add`/... and call a well-documented and typed upload client class instead of manually connecting each codebase to rsync and doing all the pattern and scheduling logic repeatedly.

**How do you make sure that the upload works correctly?**

First, the whole codebase has type hints and is strictly checked with [Mypy](https://github.com/python/mypy) - even the snippet in the usage section below is type checked with Mypy.

Secondly, the date patterning is tested extensively, and the upload process of the files and directories is tested with an actual remote server by generating a bunch of sample files and directories and uploading them to that server. One can check out the output of the test runs in the [GitHub Actions](https://github.com/dostuffthatmatters/circadian-scp-upload/actions/workflows/test.yaml) of this repository - in the "Run pytests" step.

Thirdly, after the upload, the checksums of the local and the remote directories/files are compared to ensure the upload was successful. Only if those checksums match will the client delete the local files. File removal has to be explicitly enabled or disabled with the `remove_files_after_upload` argument.
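
The checksum idea can be sketched as follows. This README does not specify which checksum algorithm the library uses or how it runs the computation on the remote side, so the SHA-256 digest and the function name `directory_checksum` below are assumptions made purely for illustration:

```python
import hashlib
from pathlib import Path


def directory_checksum(root: Path) -> str:
    """SHA-256 over the relative paths and contents of all files below `root`.

    Files are visited in sorted order so that two directories with the same
    content always produce the same digest.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

If the same function is evaluated on the source and the destination directory, equal digests indicate that file names and contents match, and only then would it be safe to remove the source.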

<br/>

## Usage

Install into any Python `^3.10` project:

```bash
pip install circadian_scp_upload
# or
poetry add circadian_scp_upload
```

Configure and use the upload client:

```python
import circadian_scp_upload

# Use the callbacks to customize the upload process
# and integrate it into your own codebase. All callbacks
# are optional and the callback object does not need to be
# passed to the upload client. The lambda functions below
# are the default values.

upload_client_callbacks = circadian_scp_upload.UploadClientCallbacks(
    # which directories to consider in the upload process; only supports
    # %Y/%y/%m/%d - does not support parentheses in the string
    dated_directory_regex=r"^" + "%Y%m%d" + r"$",

    # which files to consider in the upload process; only supports
    # %Y/%y/%m/%d - does not support parentheses in the string
    dated_file_regex=r"^.*" + "%Y%m%d" + r".*$",

    # use your own logger instead of print statements
    log_info=lambda message: print(f"INFO - {message}"),
    log_error=lambda message: print(f"ERROR - {message}"),

    # callback that is called periodically during the upload
    # process to check if the upload should be aborted
    should_abort_upload=lambda: False,
)

# teardown happens automatically when leaving the "with"-block
with circadian_scp_upload.RemoteConnection(
    "1.2.3.4", "someusername", "somepassword"
) as remote_connection:

    # upload a directory full of directories "YYYYMMDD/"
    circadian_scp_upload.DailyTransferClient(
        remote_connection=remote_connection,
        src_path="/path/to/local/data-directory-1",
        dst_path="/path/to/remote/data-directory-1",
        remove_files_after_upload=True,
        variant="directories",
        callbacks=upload_client_callbacks,
    ).run()

    # upload a directory full of files "YYYYMMDD.txt"
    circadian_scp_upload.DailyTransferClient(
        remote_connection=remote_connection,
        src_path="/path/to/local/data-directory-2",
        dst_path="/path/to/remote/data-directory-2",
        remove_files_after_upload=True,
        variant="files",
        callbacks=upload_client_callbacks,
    ).run()
```
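
The callbacks are the main integration point with an existing codebase. As a hedged sketch, the functions below forward messages to the standard `logging` module and let another thread or a signal handler request an abort through a `threading.Event`. The logger name `"upload"` and the variable `abort_requested` are illustrative choices, not part of the library:

```python
import logging
import threading

logger = logging.getLogger("upload")

# Set this event from a signal handler or another thread to stop the upload
abort_requested = threading.Event()


def log_info(message: str) -> None:
    logger.info(message)


def log_error(message: str) -> None:
    logger.error(message)


def should_abort_upload() -> bool:
    # Polled by the client periodically during the upload
    return abort_requested.is_set()


# These functions would be passed to the callbacks object from the snippet above:
#
# upload_client_callbacks = circadian_scp_upload.UploadClientCallbacks(
#     log_info=log_info,
#     log_error=log_error,
#     should_abort_upload=should_abort_upload,
# )
```

Calling `abort_requested.set()` makes the next `should_abort_upload()` check return `True`, which asks the client to stop.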

The client writes informational output to wherever the log callbacks direct it. Progress is only logged in steps of 10%:

```log
INFO - 2005-06-20: found 1 paths for this date: ['/tmp/circadian_scp_upload_test_1693053096_3.10.12/20050620']
INFO - 2005-06-20: starting to upload directory local directory '/tmp/circadian_scp_upload_test_1693053096_3.10.12/20050620' to remote directory '/tmp/circadian_scp_upload_test_1693053096_3.10.12/20050620'
INFO - 2005-06-20: found 5 files in src directory
INFO - 2005-06-20: 5 files missing in dst
INFO - 2005-06-20: created remote directory
INFO - 2005-06-20:   0 % (1/5) uploaded
INFO - 2005-06-20:  20 % (2/5) uploaded
INFO - 2005-06-20:  40 % (3/5) uploaded
INFO - 2005-06-20:  60 % (4/5) uploaded
INFO - 2005-06-20:  80 % (5/5) uploaded
INFO - 2005-06-20: 100 % (5/5) uploaded (finished)
INFO - 2005-06-20: checksums match
INFO - 2005-06-20: finished removing source
INFO - 2005-06-20: done (successful)
INFO - 2023-08-23: found 1 paths for this date: ['/tmp/circadian_scp_upload_test_1693053096_3.10.12/20230823']
INFO - 2023-08-23: starting to upload directory local directory '/tmp/circadian_scp_upload_test_1693053096_3.10.12/20230823' to remote directory '/tmp/circadian_scp_upload_test_1693053096_3.10.12/20230823'
INFO - 2023-08-23: found 5 files in src directory
INFO - 2023-08-23: 5 files missing in dst
INFO - 2023-08-23: created remote directory
INFO - 2023-08-23:   0 % (1/5) uploaded
INFO - 2023-08-23:  20 % (2/5) uploaded
INFO - 2023-08-23:  40 % (3/5) uploaded
INFO - 2023-08-23:  60 % (4/5) uploaded
INFO - 2023-08-23:  80 % (5/5) uploaded
INFO - 2023-08-23: 100 % (5/5) uploaded (finished)
INFO - 2023-08-23: checksums match
INFO - 2023-08-23: finished removing source
INFO - 2023-08-23: done (successful)
```

            
