:Name: apibackuper
:Version: 1.0.8
:Home page: https://github.com/datacoon/apibackuper/
:Summary: apibackuper: a command-line tool and python library for API backuping
:Upload time: 2022-12-02 08:12:47
:Author: Ivan Begtin
:License: MIT
:Keywords: api json jsonl csv bson cli dataset
:Requirements: pymongo lxml click urllib3 requests xmltodict

==============================================================
apibackuper -- a command-line tool to archive/backup API calls
==============================================================


apibackuper is a command-line tool to archive/backup API calls.
Its goal is to download all the data behind a REST API and archive it to local storage.
The tool is designed to make backing up API data as simple as possible.


.. contents::

.. section-numbering::


History
=======
This tool was developed to optimize backup/archival procedures for Russian government information from the E-Budget portal budget.gov.ru and
some other government IT systems. Examples of tool usage can be found in the "examples" directory.

Main features
=============


* Supports any iterative GET/POST API
* Estimates the time required to back up an API
* Stores data inside a ZIP container
* Supports export of backup data as a JSON Lines file
* Documentation
* Test coverage



Installation
============

Linux
-----

Most Linux distributions provide a package that can be installed using the
system package manager, for example:

.. code-block:: bash

    # Debian, Ubuntu, etc.
    $ apt install apibackuper

.. code-block:: bash

    # Fedora
    $ dnf install apibackuper

.. code-block:: bash

    # CentOS, RHEL, ...
    $ yum install apibackuper

.. code-block:: bash

    # Arch Linux
    $ pacman -S apibackuper


Windows, etc.
-------------

A universal installation method (that works on Windows, Mac OS X, Linux, …,
and always provides the latest version) is to use pip:


.. code-block:: bash

    # Make sure we have an up-to-date version of pip and setuptools:
    $ pip install --upgrade pip setuptools

    $ pip install --upgrade apibackuper


(If ``pip`` installation fails for some reason, you can try
``easy_install apibackuper`` as a fallback.)


Python version
--------------

Python version 3.6 or greater is required.


Quickstart
==========

This example backs up the list of Russian certificate authorities.
The list is published at e-trust.gosuslugi.ru and is available via an undocumented API.

.. code-block:: bash

    $ apibackuper create etrust
    $ cd etrust

Edit apibackuper.cfg as:

.. code-block:: ini

    [settings]
    initialized = True
    name = etrust

    [project]
    description = E-Trust UC list
    url = https://e-trust.gosuslugi.ru/app/scc/portal/api/v1/portal/ca/list
    http_mode = POST
    work_modes = full,incremental,update
    iterate_by = page

    [params]
    page_size_param = recordsOnPage
    page_size_limit = 100
    page_number_param = page

    [data]
    total_number_key = total
    data_key = data
    item_key = РеестровыйНомер
    change_key = СтатусАккредитации.ДействуетС

    [storage]
    storage_type = zip

Add a file named params.json containing the parameters sent with POST requests:

.. code-block:: json

    {"page":1,"orderBy":"id","ascending":false,"recordsOnPage":100,"searchString":null,"cities":null,"software":null,"cryptToolClasses":null,"statuses":null}
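
To illustrate how the ``[params]`` section relates to ``params.json``, the following Python sketch builds one request body per page by overriding the page number and page size fields. It is an assumption about the general technique, not apibackuper's actual implementation. The helper name ``page_payloads`` is hypothetical, and the total of 502 records is the figure reported by the ``estimate`` command in this quickstart.

.. code-block:: python

    import json
    import math

    # Contents of params.json from the quickstart.
    BASE_PARAMS = json.loads(
        '{"page":1,"orderBy":"id","ascending":false,"recordsOnPage":100,'
        '"searchString":null,"cities":null,"software":null,'
        '"cryptToolClasses":null,"statuses":null}'
    )


    def page_payloads(base, total, page_size, page_number_param,
                      page_size_param, start_page=1):
        """Yield one POST body per page (hypothetical helper)."""
        pages = math.ceil(total / page_size)
        for page in range(start_page, start_page + pages):
            payload = dict(base)  # copy, so the base parameters stay untouched
            payload[page_number_param] = page
            payload[page_size_param] = page_size
            yield payload


    payloads = list(page_payloads(BASE_PARAMS, 502, 100, "page", "recordsOnPage"))
    print(len(payloads), payloads[-1]["page"])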

Run the "estimate" command to see how long data collection will take and how much space it needs:

.. code-block:: bash

    $ apibackuper estimate full

Output:

.. code-block:: text

    Total records: 502
    Records per request: 100
    Total requests: 6
    Average record size 32277.96 bytes
    Estimated size (json lines) 16.20 MB
    Avg request time, seconds 66.9260
    Estimated all requests time, seconds 402.8947
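
The request count and size in this output can be reproduced with simple arithmetic. The sketch below is an inference from the sample figures, not the tool's code: it assumes the request count is the record total divided by the page size, rounded up, and that "MB" means 1,000,000 bytes. The function name ``estimate`` is hypothetical, and the time estimate is left out.

.. code-block:: python

    import math


    def estimate(total_records, page_size, avg_record_bytes):
        """Return (request count, estimated size in MB) for a paged API."""
        requests_needed = math.ceil(total_records / page_size)
        # Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
        size_mb = total_records * avg_record_bytes / 1_000_000
        return requests_needed, size_mb


    requests_needed, size_mb = estimate(502, 100, 32277.96)
    print(requests_needed, f"{size_mb:.2f}")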

Run the "run" command to collect the data. The result is stored in "storage.zip":

.. code-block:: bash

    $ apibackuper run full

Export the data from storage and save it as a JSON Lines file called "etrust.jsonl":

.. code-block:: bash

    $ apibackuper export jsonl etrust.jsonl
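
A JSON Lines file holds one JSON document per line, so it can be read with the Python standard library alone. The sketch below is a minimal reader. The helper name ``read_jsonl`` and the two sample records are made up for the demo; with a real export, the path would point at a file such as ``etrust.jsonl``.

.. code-block:: python

    import json
    import os
    import tempfile


    def read_jsonl(path):
        """Yield one parsed record per non-empty line of a JSON Lines file."""
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


    # Self-contained demo: write a two-record sample instead of a real export.
    sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write('{"id": 1}\n{"id": 2}\n')

    records = list(read_jsonl(sample_path))
    print(len(records))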


Config options
==============

Example config file:

.. code-block:: ini

    [settings]
    initialized = True
    name = <name>
    splitter = .

    [project]
    description = <description>
    url = <url>
    http_mode = <GET or POST>
    work_modes = <combination of full,incremental,update>
    iterate_by = <page or skip>

    [params]
    page_size_param = <page size param>
    page_size_limit = <page size limit>
    page_number_param = <page number>
    count_skip_param = <key to iterate in skip mode>


    [data]
    total_number_key = <total number key>
    data_key = <data key>
    item_key = <item key>
    change_key = <change key>

    [follow]
    follow_mode = <type of follow mode>
    follow_pattern = <url prefix to follow links>
    follow_data_key = <follow data item key>
    follow_param = <follow param>
    follow_item_key = <follow item key>

    [files]
    fetch_mode = <file fetch mode>
    root_url = <file root url>
    keys = <keys with file data>
    storage_mode = <file storage mode>


    [storage]
    storage_type = zip
    compression = True


settings
--------
* name - short name of the project
* splitter - field-path separator. Needed for the rare cases when '.' is part of a field name, for example the '@odata.count' field in OData requests
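
The role of the splitter can be illustrated with a small nested-key lookup. This is a sketch of the general idea, not the tool's implementation: the helper ``get_by_path`` is hypothetical, the record values are made-up sample data, and ``/`` is just one possible alternative splitter.

.. code-block:: python

    def get_by_path(record, path, splitter="."):
        """Walk nested dicts following ``path`` split on ``splitter``."""
        value = record
        for part in path.split(splitter):
            value = value[part]
        return value


    # Made-up sample records for illustration only.
    nested = {"СтатусАккредитации": {"ДействуетС": "2020-01-01"}}
    odata = {"@odata.count": 42, "value": []}

    # With the default splitter the dotted path descends into the nested dict.
    print(get_by_path(nested, "СтатусАккредитации.ДействуетС"))
    # With another splitter, '@odata.count' is treated as a single key.
    print(get_by_path(odata, "@odata.count", splitter="/"))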

project
-------
* description - text that explains what this project is for
* url - API endpoint URL
* http_mode - HTTP method: GET or POST
* work_modes - types of operation: full - archive everything, incremental - add new records only, update - collect changed data only
* iterate_by - how records are iterated: 'page' (default) goes page by page, 'skip' is used if a skip value is provided

params
------

* page_size_param - parameter with the page size
* page_size_limit - limit on the number of records the API returns per request
* page_number_param - parameter with the page number
* count_skip_param - parameter for the 'skip' type of iteration

data
----
* total_number_key - key in the data with the total number of records
* data_key - key in the data with the list of records
* item_key - key in the data with the unique identifier of a record. Can be a comma-separated group of keys
* change_key - key in the data that indicates that a record has changed. Can be a comma-separated group of keys
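
As a rough illustration of how a unique item key and a change key could support the incremental and update work modes, the sketch below compares freshly fetched records with values remembered from an earlier run. It is an assumption about the approach, not the tool's code. The function ``classify`` and the sample records with their ``id`` and ``updated`` fields are hypothetical.

.. code-block:: python

    def classify(stored, fetched, item_key, change_key):
        """Split fetched records into new and changed ones.

        ``stored`` maps item_key values to the change_key value seen last time.
        """
        new, changed = [], []
        for record in fetched:
            ident = record[item_key]
            if ident not in stored:
                new.append(record)       # candidate for 'incremental' mode
            elif stored[ident] != record[change_key]:
                changed.append(record)   # candidate for 'update' mode
        return new, changed


    # Made-up sample data for illustration only.
    stored = {"A1": "2021-01-01", "B2": "2021-02-01"}
    fetched = [
        {"id": "A1", "updated": "2021-01-01"},  # unchanged
        {"id": "B2", "updated": "2021-03-15"},  # changed
        {"id": "C3", "updated": "2021-03-20"},  # new
    ]
    new, changed = classify(stored, fetched, "id", "updated")
    print([r["id"] for r in new], [r["id"] for r in changed])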

follow
------
* follow_mode - mode used to follow objects. Can be 'url' or 'item'. If the mode is 'url', follow_pattern is not used
* follow_pattern - URL pattern / URL prefix for followed objects. Only for mode 'item'
* follow_data_key - if the object or objects are inside an array, the key of that array
* follow_param - parameter used in 'item' mode
* follow_item_key - item key


files
-----
* fetch_mode - file fetch mode. Could be 'prefix' or 'id'. Prefix
* root_url - root URL / prefix for files
* keys - list of keys with URLs or file IDs to search for files to save
* storage_mode - how files are stored in storage/files.zip. The default is 'filepath': files are stored the same way as they are presented in the URL

storage
-------
* storage_type - type of local storage. 'zip', a local ZIP file, is the default
* compression - if True, a compressed ZIP file is used: less disk space, more CPU time spent processing data
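
The space versus CPU trade-off of the compression option can be seen with Python's standard ``zipfile`` module. The sketch below only demonstrates stored versus deflated ZIP entries. Whether apibackuper uses these exact settings is an assumption, and the entry name ``page_1.json`` and the helper ``zip_size`` are hypothetical.

.. code-block:: python

    import io
    import zipfile


    def zip_size(payload, compressed):
        """Return the size in bytes of an in-memory ZIP holding ``payload``."""
        method = zipfile.ZIP_DEFLATED if compressed else zipfile.ZIP_STORED
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", compression=method) as archive:
            archive.writestr("page_1.json", payload)  # hypothetical entry name
        return len(buffer.getvalue())


    # Highly repetitive sample payload, so deflate has something to compress.
    payload = b'{"data": "' + b"x" * 10000 + b'"}'
    stored_size = zip_size(payload, compressed=False)
    deflated_size = zip_size(payload, compressed=True)
    print(deflated_size < stored_size)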

Usage
=====

Synopsis:

.. code-block:: bash

    $ apibackuper [flags] [command] inputfile


See also ``apibackuper --help``.


Examples
--------

Create project "budgettofk":

.. code-block:: bash

    $ apibackuper create budgettofk


Estimate the execution time for the 'budgettofk' project. The command should be run in the project directory, or the project directory should be passed via the -p parameter:

.. code-block:: bash

    $ apibackuper estimate full -p budgettofk

Output:

.. code-block:: text

    Total records: 12282
    Records per request: 500
    Total requests: 25
    Average record size 1293.60 bytes
    Estimated size (json lines) 15.89 MB
    Avg request time, seconds 1.8015
    Estimated all requests time, seconds 46.0536


Run the project. The command should be run in the project directory, or the project directory should be passed via the -p parameter:

.. code-block:: bash

    $ apibackuper run full

Export data from the project. The command should be run in the project directory, or the project directory should be passed via the -p parameter:

.. code-block:: bash

    $ apibackuper export jsonl hhemployers.jsonl -p hhemployers


Follow each object in the downloaded data and make a request for each one:

.. code-block:: bash

    $ apibackuper follow continue

Download all files associated with API objects:

.. code-block:: bash

    $ apibackuper getfiles



Advanced
========

TBD


.. :changelog:

History
=======

1.0.7 (2021-11-04)
------------------
* Fixed "continue" mode. It now works not only for the "follow" command but also for the "run" command. Use "apibackuper run continue" if a run was stopped by an error or user input.

1.0.6 (2021-11-01)
------------------
* Added "default_delay", "retry_delay" and "retry_count" to manage error handling
* On HTTP status 500 or 503, the latest request is retried until it returns HTTP status 200 or retry_count is exhausted

1.0.5 (2021-05-31)
------------------
* Minor fixes

1.0.4 (2021-05-31)
------------------
* Added "start_page" for cases where the first page is not 1 (it is sometimes 0)
* Added support for data returned as a JSON array rather than a JSON dict, with no data_key provided
* Added initial code to implement Frictionless Data packaging

1.0.3 (2020-10-28)
------------------
* Added several new options
* Added aria2 support for downloading files


1.0.2 (2020-09-20)
------------------
* Using permanent storage dir "storage" instead of temporary "temp" dir
* Added logic to make requests for additional info on retrieved objects, command "follow"
* Added logic to retrieve files linked to retrieved objects, command "getfiles"

1.0.1 (2020-08-14)
------------------
* First public release on PyPI and updated GitHub code





            
