:Name: gamdam
:Version: 0.5.0
:Summary: Git-Annex Mass Downloader and Metadata-er
:Author: John Thorvald Wodder II
:Uploaded: 2023-12-12 21:37:00
:Requires Python: >=3.8
:Keywords: anyio, async, download, git-annex

.. image:: https://www.repostatus.org/badges/latest/unsupported.svg
    :target: https://www.repostatus.org/#unsupported
    :alt: Project Status: Unsupported – The project has reached a stable,
          usable state but the author(s) have ceased all work on it. A new
          maintainer may be desired.

.. image:: https://github.com/jwodder/gamdam/actions/workflows/test.yml/badge.svg
    :target: https://github.com/jwodder/gamdam/actions/workflows/test.yml
    :alt: CI Status

.. image:: https://codecov.io/gh/jwodder/gamdam/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/jwodder/gamdam

.. image:: https://img.shields.io/pypi/pyversions/gamdam.svg
    :target: https://pypi.org/project/gamdam/

.. image:: https://img.shields.io/github/license/jwodder/gamdam.svg
    :target: https://opensource.org/licenses/MIT
    :alt: MIT License

`GitHub <https://github.com/jwodder/gamdam>`_
| `PyPI <https://pypi.org/project/gamdam/>`_
| `Issues <https://github.com/jwodder/gamdam/issues>`_
| `Changelog <https://github.com/jwodder/gamdam/blob/master/CHANGELOG.md>`_

``gamdam`` is the Git-Annex Mass Downloader and Metadata-er.  It takes a stream
of JSON Lines describing what to download and what metadata each file has,
downloads them in parallel to a git-annex_ repository, attaches the metadata
using git-annex's metadata facilities, and commits the results.

This program was written as an experiment/proof-of-concept for a larger program
and is no longer maintained.  However, the author has also produced a Rust
translation of this program at <https://github.com/jwodder/gamdam-rust> which
is currently being maintained.

.. _git-annex: https://git-annex.branchable.com


Installation
============
``gamdam`` requires Python 3.8 or higher.  Just use `pip
<https://pip.pypa.io>`_ for Python 3 (You have pip, right?) to install
``gamdam`` and its dependencies::

    python3 -m pip install gamdam

``gamdam`` also requires ``git-annex`` v10.20220222 or higher to be installed
separately in order to run.


Usage
=====

::

    gamdam [<options>] [<input-file>]

``gamdam`` reads a series of JSON entries from a file (or from standard input
if no file is specified) following the `input format`_ described below.  It
feeds the URLs and output paths to ``git-annex addurl``, and once each file has
finished downloading, it attaches any listed metadata and extra URLs using
``git-annex metadata`` and ``git-annex registerurl``, respectively.

Note that metadata and extra URLs can only be attached to files tracked by
git-annex; if, for example, you have configured git-annex not to track text
files, then any downloaded text files will not have metadata or alternative
URLs registered.
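
As a rough illustration of this workflow, the following Python sketch pipes
JSON Lines to ``gamdam`` on standard input.  The helper names
(``build_command`` and ``run_gamdam``) are hypothetical and exist only for this
example; the ``-C`` and ``-J`` options are the ones documented under Options_.

.. code:: python

    import json
    import subprocess
    from typing import Dict, List, Optional


    def build_command(repo: str, jobs: Optional[int] = None) -> List[str]:
        """Assemble a gamdam command line using the -C and -J options."""
        cmd = ["gamdam", "-C", repo]
        if jobs is not None:
            cmd += ["-J", str(jobs)]
        return cmd


    def run_gamdam(records: List[Dict], repo: str, jobs: Optional[int] = None) -> int:
        """Feed one JSON object per line to gamdam and return its exit status."""
        payload = "".join(json.dumps(rec) + "\n" for rec in records)
        # No input file is named, so gamdam reads the records from standard input.
        proc = subprocess.run(build_command(repo, jobs), input=payload, text=True)
        return proc.returncode

Calling ``run_gamdam(records, "repo", jobs=4)`` requires both ``gamdam`` and
``git-annex`` to be installed and on the ``PATH``.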

Options
-------

--addurl-opts OPTIONS           Extra options to pass to the ``git-annex
                                addurl`` command.  Note that multiple options
                                and arguments need to be quoted as a single
                                string, which must also use proper shell
                                quoting internally; e.g.,
                                ``--addurl-opts="--user-agent 'gamdam via
                                git-annex'"``.

-C DIR, --chdir DIR             The directory in which to download files;
                                defaults to the current directory.  If the
                                directory does not exist, it will be created.
                                If the directory does not belong to a Git or
                                git-annex repository, it will be initialized as
                                one.

-F FILE, --failures FILE        If any files fail to download, write their
                                input records back out to ``FILE``.

-J INT, --jobs INT              Number of parallel jobs for ``git-annex
                                addurl`` to use; by default, the process is
                                instructed to use one job per CPU core.

-l LEVEL, --log-level LEVEL     Set the log level to the given value.  Possible
                                values are "``CRITICAL``", "``ERROR``",
                                "``WARNING``", "``INFO``", "``DEBUG``" (all
                                case-insensitive) and their Python integer
                                equivalents.  [default: ``INFO``]

-m TEXT, --message TEXT         The commit message to use when saving.  This
                                may contain a ``{downloaded}`` placeholder
                                which will be replaced with the number of files
                                successfully downloaded.

--no-save-on-fail               Don't commit the downloaded files if any files
                                failed to download.

--save, --no-save               Whether to commit the downloaded files once
                                they've all been downloaded.  [default:
                                ``--save``]
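
The quoting rule for ``--addurl-opts`` can be illustrated with Python's
``shlex`` module.  This is only a sketch of shell-style splitting; that
``gamdam`` splits the string in exactly this way is an assumption, since the
text above says only that the string must use proper shell quoting internally.

.. code:: python

    import shlex

    # The value that reaches gamdam after the outer shell has processed
    #   --addurl-opts="--user-agent 'gamdam via git-annex'"
    addurl_opts = "--user-agent 'gamdam via git-annex'"

    # Shell-style splitting keeps the quoted phrase together as one argument.
    print(shlex.split(addurl_opts))
    # prints ['--user-agent', 'gamdam via git-annex']

Without the inner quotes, the same string would split into four separate
arguments.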


Input Format
------------

Input is a series of JSON objects, one per line (a.k.a. "JSON Lines").  Each
object has the following fields:

``url``
    *(required)* A URL to download.

``path``
    *(required)* A relative path where the contents of the URL should be saved.
    If an entry with a given path is encountered while another entry with the
    same path is being downloaded, the later entry is discarded, and a warning
    is emitted.

    If a file already exists at a given path, ``git-annex`` will try to
    register the URL as an additional location for the file, failing if the
    resource at the URL is not the same size as the extant file.

``metadata``
    A collection of metadata in the form used by ``git-annex metadata``, i.e.,
    a ``dict`` mapping key names to lists of string values.

``extra_urls``
    A list of alternative URLs for the resource, to be attached to the
    downloaded file with ``git-annex registerurl``.

If a given input line is invalid, it is discarded, and an error message is
emitted.
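
A minimal sketch of producing such input with Python's standard library is
shown here.  The helper names (``make_record`` and ``to_json_lines``), the
URLs, and the metadata values are placeholders chosen for this example.

.. code:: python

    import json
    from typing import Dict, Iterable, List, Optional


    def make_record(
        url: str,
        path: str,
        metadata: Optional[Dict[str, List[str]]] = None,
        extra_urls: Optional[List[str]] = None,
    ) -> Dict:
        """Build one input object; the optional fields are omitted when unset."""
        record: Dict = {"url": url, "path": path}
        if metadata:
            record["metadata"] = metadata
        if extra_urls:
            record["extra_urls"] = extra_urls
        return record


    def to_json_lines(records: Iterable[Dict]) -> str:
        """Serialize the records as JSON Lines: one compact object per line."""
        return "".join(json.dumps(rec) + "\n" for rec in records)


    records = [
        make_record(
            "https://example.com/data/a.csv",
            "data/a.csv",
            # Metadata values are always lists of strings.
            metadata={"source": ["example"], "year": ["2023"]},
            extra_urls=["https://mirror.example.org/a.csv"],
        ),
        make_record("https://example.com/data/b.csv", "data/b.csv"),
    ]
    print(to_json_lines(records), end="")

The resulting text can be saved to a file and passed to ``gamdam`` as the
input file, or piped to it on standard input.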


Library Usage
=============

``gamdam`` can also be used as a Python library.  It exports the following:

.. code:: python

    async def download(
        repo: pathlib.Path,
        objects: AsyncIterator[Downloadable],
        jobs: Optional[int] = None,
        addurl_opts: Optional[List[str]] = None,
        subscriber: Optional[anyio.abc.ObjectSendStream[DownloadResult]] = None,
    ) -> Report

Download the items yielded by the async iterator ``objects`` to the directory
``repo`` (which must be part of a git-annex repository) and set their metadata.
``jobs`` is the number of parallel jobs for the ``git-annex addurl`` process to
use; a value of ``None`` means to use one job per CPU core.  ``addurl_opts``
contains any additional arguments to append to the ``git-annex addurl``
command.

If ``subscriber`` is supplied, it will be sent a ``DownloadResult`` (see below)
for each completed download, both successful and failed.  This can be used to
implement custom post-processing of downloads.
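
A hedged sketch of calling ``download()`` follows.  It assumes that
``Downloadable`` and ``download`` can be imported from the top-level
``gamdam`` package, as "exports" suggests; ``RECORDS`` and ``run_download``
are hypothetical names, and the URLs are placeholders.

.. code:: python

    from pathlib import Path
    from typing import Dict, List, Tuple

    # Placeholder records using the same fields as the command-line input format.
    RECORDS: List[Dict] = [
        {
            "url": "https://example.com/data/a.csv",
            "path": "data/a.csv",
            "metadata": {"source": ["example"]},
        },
        {"url": "https://example.com/data/b.csv", "path": "data/b.csv"},
    ]


    async def run_download(repo: Path, records: List[Dict]) -> Tuple[int, int]:
        """Download the records into repo and return (downloaded, failed)."""
        # Imported here so that this module loads even without gamdam installed.
        # Assumption: both names are importable from the top-level package.
        from gamdam import Downloadable, download

        async def objects():
            # download() expects an async iterator of Downloadable instances.
            for rec in records:
                yield Downloadable(**rec)

        report = await download(repo, objects(), jobs=2)
        return report.downloaded, report.failed

With ``anyio`` installed, this could be driven by
``anyio.run(run_download, Path("repo"), RECORDS)``, provided that ``repo`` is
already part of a git-annex repository.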

.. code:: python

    class Downloadable(pydantic.BaseModel):
        path: pathlib.Path
        url: pydantic.AnyHttpUrl
        metadata: Optional[Dict[str, List[str]]] = None
        extra_urls: Optional[List[pydantic.AnyHttpUrl]] = None

``Downloadable`` is a pydantic_ model used to represent files to download; see
`Input Format`_ above for the meanings of the fields.

.. code:: python

    class DownloadResult(pydantic.BaseModel):
        downloadable: Downloadable
        success: bool
        key: Optional[str] = None
        error_messages: Optional[List[str]] = None

``DownloadResult`` is a pydantic_ model used to represent a completed download.
It contains the original ``Downloadable``, a flag to indicate download success,
the downloaded file's git-annex key (only set if the download was successful
and the file is tracked by git-annex) and any error messages from the addurl
process (only set if the download failed).
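
The field semantics above suggest one way to sort results during
post-processing, sketched here.  The ``summarize`` function is hypothetical
and duck-typed; the ``SimpleNamespace`` objects are stand-ins for real
``DownloadResult`` instances, and the key string is a made-up placeholder.

.. code:: python

    from types import SimpleNamespace
    from typing import Any, Dict, Iterable, List


    def summarize(results: Iterable[Any]) -> Dict[str, List[str]]:
        """Group result paths by outcome, using only the documented fields."""
        summary: Dict[str, List[str]] = {"ok": [], "untracked": [], "failed": []}
        for res in results:
            path = str(res.downloadable.path)
            if not res.success:
                summary["failed"].append(path)
            elif res.key is None:
                # Successful, but no key: the file is not tracked by git-annex.
                summary["untracked"].append(path)
            else:
                summary["ok"].append(path)
        return summary


    # Stand-in objects that mimic the attributes of DownloadResult.
    sample = [
        SimpleNamespace(
            downloadable=SimpleNamespace(path="a.dat"),
            success=True,
            key="PLACEHOLDER-KEY",
            error_messages=None,
        ),
        SimpleNamespace(
            downloadable=SimpleNamespace(path="notes.txt"),
            success=True,
            key=None,
            error_messages=None,
        ),
        SimpleNamespace(
            downloadable=SimpleNamespace(path="b.dat"),
            success=False,
            key=None,
            error_messages=["placeholder error text"],
        ),
    ]
    print(summarize(sample))
    # prints {'ok': ['a.dat'], 'untracked': ['notes.txt'], 'failed': ['b.dat']}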

.. code:: python

    @dataclass
    class Report:
        downloaded: int
        failed: int

``Report`` is used as the return value of ``download()``; it contains the
number of files successfully downloaded and the number of failed downloads.

.. _pydantic: https://pydantic-docs.helpmanual.io

            
