datasalad


| Field | Value |
|---|---|
| Name | datasalad |
| Version | 0.3.0 |
| Summary | utilities for working with data in the vicinity of Git and git-annex |
| Upload time | 2024-09-21 10:00:48 |
| Requires Python | >=3.8 |
| Keywords | datalad, git, git-annex, iterator, subprocess |
| Home page | None |
| Maintainer | None |
| Docs URL | None |
| Author | None |
| License | None |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# DataSALad

[![GitHub release](https://img.shields.io/github/release/datalad/datasalad.svg)](https://GitHub.com/datalad/datasalad/releases/)
[![PyPI version fury.io](https://badge.fury.io/py/datasalad.svg)](https://pypi.python.org/pypi/datasalad/)
[![Build status](https://ci.appveyor.com/api/projects/status/wtksrottgt82h2ra/branch/main?svg=true)](https://ci.appveyor.com/project/mih/datasalad/branch/main)
[![codecov](https://codecov.io/gh/datalad/datasalad/branch/main/graph/badge.svg?token=VSO592NATM)](https://codecov.io/gh/datalad/datasalad)
[![Documentation Status](https://readthedocs.org/projects/datasalad/badge/?version=latest)](https://datasalad.readthedocs.io/latest/?badge=latest)

This is a pure-Python library with a collection of utilities for working with
data in the vicinity of Git and git-annex.  While this is a foundational
library from and for the [DataLad project](https://datalad.org), its
implementations are standalone, and are meant to be equally well usable outside
the DataLad system.

A focus of this library is efficient communication with subprocesses, such as
Git or git-annex commands, which read and produce data in some format.
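
To illustrate the general idea of treating subprocess output as an iterable of byte chunks, here is a minimal sketch that uses only the Python standard library. It is not datasalad's implementation; the function name `iter_stdout_chunks` and its `chunk_size` parameter are hypothetical and exist only for this example.

```python
import subprocess
import sys
from typing import Iterator, List


def iter_stdout_chunks(cmd: List[str], chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield the stdout of `cmd` as byte chunks of at most `chunk_size` bytes.

    Hypothetical helper for illustration only, not part of datasalad.
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            # read1() returns whatever is available, up to chunk_size bytes,
            # and an empty bytes object at end-of-file
            chunk = proc.stdout.read1(chunk_size)
            if not chunk:
                break
            yield chunk
    # leaving the `with` block waits for the process to exit
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


# chunk boundaries are arbitrary, so the consumer reassembles the stream
output = b''.join(iter_stdout_chunks([sys.executable, '-c', 'print("hello")']))
```

Because chunk boundaries do not align with record boundaries, a consumer usually needs a further step that splits the stream into meaningful items.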

Here is a demo of what can be accomplished with this library. The following
code queries a remote git-annex repository via a `git annex find` command
running over an SSH connection in batch-mode. The output in JSON-lines format
is then itemized and decoded to native Python data types. Both inputs and
outputs are iterables with meaningful items, even though at a lower level
information is transmitted as an arbitrarily chunked byte stream.

```py
>>> from more_itertools import intersperse
>>> from pprint import pprint
>>> from datasalad.runners import iter_subproc
>>> from datasalad.itertools import (
...     itemize,
...     load_json,
... )

>>> # a bunch of photos we are interested in
>>> interesting = [
...     b'DIY/IMG_20200504_205821.jpg',
...     b'DIY/IMG_20200505_082136.jpg',
... ]

>>> # run `git-annex find` on a remote server in a repository
>>> # that has these photos in the worktree.
>>> with iter_subproc(
...     ['ssh', 'photos@pididdy.local',
...      'git -C "collections" annex find --json --batch'],
...     # the remote process is fed the file names,
...     # and a newline after each one to make git-annex write
...     # a report in JSON-lines format
...     inputs=intersperse(b'\n', interesting),
... ) as remote_annex:
...     # we loop over the output of the remote process.
...     # this is originally a byte stream downloaded in arbitrary
...     # chunks, so we itemize at any newline separator.
...     # each item is then decoded from JSON-lines format to
...     # native datatypes
...     for rec in load_json(itemize(remote_annex, sep=b'\n')):
...         # for this demo we just pretty-print it
...         pprint(rec)
{'backend': 'SHA256E',
 'bytesize': '3357612',
 'error-messages': [],
 'file': 'DIY/IMG_20200504_205821.jpg',
 'hashdirlower': '853/12f/',
 'hashdirmixed': '65/qp/',
 'humansize': '3.36 MB',
 'key': 'SHA256E-s3357612--700a52971714c2707c2de975f6015ca14d1a4cdbbf01e43d73951c45cd58c176.jpg',
 'keyname': '700a52971714c2707c2de975f6015ca14d1a4cdbbf01e43d73951c45cd58c176.jpg',
 'mtime': 'unknown'}
{'backend': 'SHA256E',
 'bytesize': '3284291',
 ...
```
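
The itemize-then-decode step can be tried without a remote server. The following is a stdlib-only sketch of the concept under stated assumptions: `itemize_bytes` and `load_json_lines` are hypothetical stand-ins written for this example, not the `itemize` and `load_json` functions shipped with datasalad, and the input chunks are made up.

```python
import json
from typing import Iterable, Iterator


def itemize_bytes(chunks: Iterable[bytes], sep: bytes = b'\n') -> Iterator[bytes]:
    """Yield separator-delimited items from arbitrarily chunked bytes."""
    buffer = b''
    for chunk in chunks:
        buffer += chunk
        # every piece before the last separator is a complete item;
        # the remainder stays in the buffer until more data arrives
        *complete, buffer = buffer.split(sep)
        yield from complete
    if buffer:
        # trailing data that was not terminated by a separator
        yield buffer


def load_json_lines(lines: Iterable[bytes]) -> Iterator[object]:
    """Decode each non-empty line as one JSON document."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# made-up input: chunk boundaries deliberately fall in the middle of records
chunks = [
    b'{"file": "a.jpg", "byte',
    b'size": "10"}\n{"file"',
    b': "b.jpg", "bytesize": "20"}\n',
]
records = list(load_json_lines(itemize_bytes(chunks)))
```

Here `records` holds two dictionaries, one per JSON line, regardless of how the bytes were split into chunks.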

## Developing with datasalad

API stability is important, as are adequate semantic versioning and informative
changelogs.

### Public vs internal API

Anything that can be imported directly from any of the sub-packages in
`datasalad` is considered to be part of the public API. Changes to this API
determine the versioning, and development aims to keep this API as stable as
possible. This includes signatures and return value behavior.

As an example: `from datasalad.runners import iter_git_subproc` imports a
part of the public API, but `from datasalad.runners.git import
iter_git_subproc` does not.

### Use of the internal API

Developers can obviously use parts of the non-public API. However, this should
only be done with the understanding that these components may change from one
release to another, with no guarantee of transition periods, deprecation
warnings, etc.

Developers are advised to never reuse any components with names starting with
`_` (underscore). Their use should be limited to their individual subpackage.
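
The two conventions above can be expressed as a small check on an import statement. This is a sketch only: `is_public_import` is a hypothetical helper that is not part of datasalad, and it inspects the strings alone, so it does not verify that the name actually exists in the module.

```python
def is_public_import(module: str, name: str, package: str = 'datasalad') -> bool:
    """Return True if `from <module> import <name>` follows the public-API rule.

    Hypothetical helper for illustration: public means the module is exactly
    `<package>.<subpackage>` and neither part is underscore-prefixed.
    """
    parts = module.split('.')
    return (
        len(parts) == 2
        and parts[0] == package
        and not parts[1].startswith('_')
        and not name.startswith('_')
    )


# imported directly from a sub-package: public
public = is_public_import('datasalad.runners', 'iter_git_subproc')
# imported from a module nested inside a sub-package: internal
internal = is_public_import('datasalad.runners.git', 'iter_git_subproc')
```

A check like this could be run over a project's own imports to flag reliance on internals before upgrading.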

## Contributing

Contributions to this library are welcome! Please see the [contributing
guidelines](CONTRIBUTING.md) for details on scope and style of potential
contributions.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "datasalad",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Michael Hanke <michael.hanke@gmail.com>",
    "keywords": "datalad, git, git-annex, iterator, subprocess",
    "author": null,
    "author_email": "Michael Hanke <michael.hanke@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/6a/7b/e0dd1cb0c21a103038982cd4723efae92cb1f7a0bf25e81ef962f4e86cab/datasalad-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "utilities for working with data in the vicinity of Git and git-annex",
    "version": "0.3.0",
    "project_urls": {
        "Changelog": "https://github.com/datalad/datasalad/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/datalad/datasalad#readme",
        "Homepage": "https://github.com/datalad/datasalad",
        "Issues": "https://github.com/datalad/datasalad/issues",
        "Source": "https://github.com/datalad/datasalad"
    },
    "split_keywords": [
        "datalad",
        " git",
        " git-annex",
        " iterator",
        " subprocess"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "17394dbabf88279a19ddfb5ad7173e71bd0c2e70e4a8f525691f3848b7e56a8d",
                "md5": "209d043585274cf50b1ff9c05fe08903",
                "sha256": "bb2d929bbcaeb137fff45731afa83b5aeb782de4f93d2c3f4ea685f6d5813fc2"
            },
            "downloads": -1,
            "filename": "datasalad-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "209d043585274cf50b1ff9c05fe08903",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 35474,
            "upload_time": "2024-09-21T10:00:50",
            "upload_time_iso_8601": "2024-09-21T10:00:50.071110Z",
            "url": "https://files.pythonhosted.org/packages/17/39/4dbabf88279a19ddfb5ad7173e71bd0c2e70e4a8f525691f3848b7e56a8d/datasalad-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6a7be0dd1cb0c21a103038982cd4723efae92cb1f7a0bf25e81ef962f4e86cab",
                "md5": "4eec245f24aa50c065be5ffc724319ed",
                "sha256": "11a527ddf63efef1627a6764a0c6da9530a61b02062b37a52dcfa44883b5f09d"
            },
            "downloads": -1,
            "filename": "datasalad-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4eec245f24aa50c065be5ffc724319ed",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 37722,
            "upload_time": "2024-09-21T10:00:48",
            "upload_time_iso_8601": "2024-09-21T10:00:48.306573Z",
            "url": "https://files.pythonhosted.org/packages/6a/7b/e0dd1cb0c21a103038982cd4723efae92cb1f7a0bf25e81ef962f4e86cab/datasalad-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-21 10:00:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "datalad",
    "github_project": "datasalad",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "appveyor": true,
    "lcname": "datasalad"
}
        