ckanext-xloader


:Name: ckanext-xloader
:Version: 0.12.2
:Home page: https://github.com/ckan/ckanext-xloader
:Summary: Express Loader - quickly load data into CKAN DataStore
:Upload time: 2022-11-30 14:52:08
:Author: David Read
:License: AGPL
:Keywords: ckan extension datastore
:Requirements: ckantoolkit, requests, six, tabulator, Unidecode, python-dateutil

.. You should enable this project on travis-ci.org and coveralls.io to make
   these badges work. The necessary Travis and Coverage config files have been
   generated for you.

.. image:: https://travis-ci.org/ckan/ckanext-xloader.svg?branch=master
    :target: https://travis-ci.org/ckan/ckanext-xloader

.. image:: https://img.shields.io/pypi/v/ckanext-xloader.svg
    :target: https://pypi.org/project/ckanext-xloader/
    :alt: Latest Version

.. image:: https://img.shields.io/pypi/pyversions/ckanext-xloader.svg
    :target: https://pypi.org/project/ckanext-xloader/
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/status/ckanext-xloader.svg
    :target: https://pypi.org/project/ckanext-xloader/
    :alt: Development Status

.. image:: https://img.shields.io/pypi/l/ckanext-xloader.svg
    :target: https://pypi.org/project/ckanext-xloader/
    :alt: License

=========================
XLoader - ckanext-xloader
=========================

Loads CSV (and similar) data into CKAN's DataStore. Designed as a replacement
for DataPusher because it offers ten times the speed and more robustness
(hence the name, derived from "Express Loader").

**OpenGov Inc.** has sponsored this development, with the aim of benefitting
open data infrastructure worldwide.

-------------------------------
Key differences from DataPusher
-------------------------------

Speed of loading
----------------

DataPusher - parses CSV rows, converts to detected column types, converts the
data to a JSON string, calls datastore_create for each batch of rows, which
reformats the data into an INSERT statement string, which is passed to
PostgreSQL.

XLoader - pipes the CSV file directly into PostgreSQL using COPY.

In `tests <https://github.com/ckan/ckanext-xloader/issues/25>`_, XLoader
is over ten times faster than DataPusher.
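
To make the contrast concrete, the sketch below builds both kinds of SQL
statement as plain strings. It is an illustration only, not the actual code of
XLoader or DataPusher: the helper functions, the table name ``my_resource``
and the column names are all hypothetical.

.. code-block:: python

    def quote_ident(name):
        """Double-quote an SQL identifier, doubling any embedded quotes."""
        return '"' + name.replace('"', '""') + '"'


    def copy_statement(table, columns):
        """Build the single COPY statement that streams a whole CSV file."""
        cols = ", ".join(quote_ident(c) for c in columns)
        return "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
            quote_ident(table), cols)


    def insert_statements(table, columns, rows, batch_size=2):
        """Build one parameterised INSERT per batch of already-parsed rows."""
        cols = ", ".join(quote_ident(c) for c in columns)
        one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
        statements = []
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            sql = "INSERT INTO {} ({}) VALUES {}".format(
                quote_ident(table), cols, ", ".join([one_row] * len(batch)))
            # Flatten the batch into the parameter list for this statement.
            params = [value for row in batch for value in row]
            statements.append((sql, params))
        return statements


    rows = [("1", "a"), ("2", "b"), ("3", "c"), ("4", "d"), ("5", "e")]
    copy_sql = copy_statement("my_resource", ["id", "label"])
    inserts = insert_statements("my_resource", ["id", "label"], rows)

With COPY there is one statement regardless of file size, whereas the number
of INSERT statements (and the Python-side work to build them) grows with the
number of rows.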

Robustness
----------

DataPusher - one cause of failure was when casting cells to a guessed type. The
type of a column was decided by looking at the values of only the first few
rows. So if a column is mainly numeric or dates, but a string (like "N/A")
comes later on, then this will cause the load to error at that point, leaving
it half-loaded into DataStore.

XLoader - loads all the cells as text, before allowing the admin to
convert columns to the types they want (using the Data Dictionary feature). In
future it could do automatic detection and conversion.
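
The failure mode can be reproduced with a few lines of Python. This is a
simplified sketch, assuming a guesser that only distinguishes numeric from
text and samples the first three values; the function names are hypothetical
and neither tool is implemented this way line for line.

.. code-block:: python

    def guess_type(values, sample_size=3):
        """Guess 'numeric' if every sampled value parses as a float."""
        for value in values[:sample_size]:
            try:
                float(value)
            except ValueError:
                return "text"
        return "numeric"


    def load_with_guessing(values, sample_size=3):
        """Cast each value to the guessed type, stopping at the first error."""
        col_type = guess_type(values, sample_size)
        loaded = []
        for value in values:
            if col_type == "numeric":
                try:
                    loaded.append(float(value))
                except ValueError:
                    # The load aborts here; earlier rows stay loaded.
                    return col_type, loaded, "failed on {!r}".format(value)
            else:
                loaded.append(value)
        return col_type, loaded, None


    def load_as_text(values):
        """Load every cell unchanged, so no cast can fail."""
        return list(values)


    column = ["1.5", "2", "3.25", "N/A", "4"]
    guessed = load_with_guessing(column)
    as_text = load_as_text(column)

Here ``guessed`` ends with an error after three rows, while ``as_text``
contains all five values.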

Simpler queueing tech
---------------------

DataPusher - job queue is done by ckan-service-provider which is bespoke,
complicated and stores jobs in its own database (sqlite by default).

XLoader - job queue is done by RQ, which is simpler, is backed by Redis, allows
access to the CKAN model and is CKAN's default queue technology (since CKAN
2.7). You can also debug jobs easily using pdb. Job results are stored in
SQLite by default; for production, simply specify CKAN's database in the
config and they are held there instead.

(The other obvious candidate is Celery, but we don't need its heavyweight
architecture and its jobs are not debuggable with pdb.)

Separate web server
-------------------

DataPusher - has the complication that the queue jobs are done by a separate
(Flask) web app, apart from CKAN. This was the design because the job requires
intensive processing to convert every line of the data into JSON. However it
means more complicated code as info needs to be passed between the services in
http requests, more for the user to set-up and manage - another app config,
another apache config, separate log files.

XLoader - the job runs in a worker process, in the same app as CKAN, so
can access the CKAN config, db and logging directly and avoids many HTTP calls.
This simplification makes sense because the xloader job doesn't need to do much
processing - mainly it is streaming the CSV file from disk into PostgreSQL.

Caveat - column types
---------------------

Note: With XLoader, all columns are stored in DataStore's database as 'text'
type (whereas DataPusher did some rudimentary type guessing - see 'Robustness'
above). However once a resource is xloaded, an admin can use the resource's
Data Dictionary tab (CKAN 2.7 onwards) to change these types to numeric or
timestamp and re-load the file. When migrating from DataPusher to XLoader you
can preserve the types of existing resources by using the ``migrate_types``
command.

There is scope to add functionality for automatically guessing column type -
offers to contribute this are welcomed.


------------
Requirements
------------

Compatibility with core CKAN versions:

=============== =============
CKAN version    Compatibility
=============== =============
2.3             no longer tested and you must install ckanext-rq
2.4             no longer tested and you must install ckanext-rq
2.5             no longer tested and you must install ckanext-rq
2.6             no longer tested and you must install ckanext-rq
2.7             yes
2.8             yes
2.9             yes (both Python2 and Python3)
=============== =============

------------
Installation
------------

To install XLoader:

1. Activate your CKAN virtual environment, for example::

     . /usr/lib/ckan/default/bin/activate

2. Install the ckanext-xloader Python package into your virtual environment::

     pip install ckanext-xloader

3. Install dependencies::

     pip install -r https://raw.githubusercontent.com/ckan/ckanext-xloader/master/requirements.txt
     pip install -U requests[security]

4. If you are using a CKAN version earlier than 2.8 you need to define the
   ``populate_full_text_trigger`` in your database
   ::

     sudo -u postgres psql datastore_default -f full_text_function.sql

   If successful it will print
   ::

     CREATE FUNCTION
     ALTER FUNCTION

   NB this assumes you used the defaults for the database name and username.
   If in doubt, check your config's ``ckan.datastore.write_url``. If you don't have
   database name ``datastore_default`` and username ``ckan_default`` then adjust
   the psql option and ``full_text_function.sql`` before running this.

5. Add ``xloader`` to the ``ckan.plugins`` setting in your CKAN
   config file (by default the config file is located at
   ``/etc/ckan/default/production.ini``).

   You should also remove ``datapusher`` if it is in the list, to avoid them
   both trying to load resources into the DataStore.

   Ensure ``datastore`` is also listed, to enable CKAN DataStore.

6. Starting with CKAN 2.10 you will need to set an API token to be able to
   execute jobs against the server::

     ckanext.xloader.api_token = <your-CKAN-generated-API-Token>

7. If it is a production server, you'll want to store jobs info in a more
   robust database than the default sqlite file. It can happily use the main
   CKAN postgres db by adding this line to the config, but with the same value
   as you have for ``sqlalchemy.url``::

     ckanext.xloader.jobs_db.uri = postgresql://ckan_default:pass@localhost/ckan_default

   (This step can be skipped when just developing or testing.)

8. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu::

     sudo service apache2 reload

9. Run the worker. First test it on the command-line. If you have CKAN version 2.9 or above::

     ckan -c /etc/ckan/default/ckan.ini jobs worker

   otherwise::

     paster --plugin=ckan jobs -c /etc/ckan/default/ckan.ini worker

   or if you have CKAN version 2.6.x or less (and are therefore using ckanext-rq)::

     paster --plugin=ckanext-rq jobs -c /etc/ckan/default/ckan.ini worker

   Test that it loads a CSV correctly by submitting a `CSV in the web interface <http://docs.ckan.org/projects/datapusher/en/latest/using.html#ckan-2-2-and-above>`_
   or by running one of these in another shell::

     [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name>
     [pre-2.9] paster --plugin=ckanext-xloader xloader submit <dataset-name> -c /etc/ckan/default/ckan.ini

   Clearly, running the worker on the command-line is only for testing - for
   production services see:

       http://docs.ckan.org/en/ckan-2.7.0/maintaining/background-tasks.html#using-supervisor

   If you have CKAN version 2.6.x or less then you'll need to download
   `supervisor-ckan-worker.conf <https://raw.githubusercontent.com/ckan/ckan/master/ckan/config/supervisor-ckan-worker.conf>`_ and adjust the ``command`` to reference
   ckanext-rq.


---------------
Config settings
---------------

Configuration:

::

    # The connection string for the jobs database used by XLoader. The
    # default of an sqlite file is fine for development. For production use a
    # Postgresql database.
    ckanext.xloader.jobs_db.uri = sqlite:////tmp/xloader_jobs.db

    # The formats that are accepted. If the value of the resource.format is
    # anything else then it won't be 'xloadered' to DataStore (and will therefore
    # only be available to users in the form of the original download/link).
    # Case insensitive.
    # (optional, defaults are listed in plugin.py - DEFAULT_FORMATS).
    ckanext.xloader.formats = csv application/csv xls application/vnd.ms-excel

    # The maximum size of files to load into DataStore. In bytes. Default is 1 GB.
    ckanext.xloader.max_content_length = 1000000000

    # By default, xloader will first try to add tabular data to the DataStore
    # with a direct PostgreSQL COPY. This is relatively fast, but does not
    # guess column types. If this fails, xloader falls back to a method more
    # like DataPusher's behaviour. This has the advantage that the column types
    # are guessed. However it is more error prone and far slower.
    # To always skip the direct PostgreSQL COPY and use type guessing, set
    # this option to True.
    ckanext.xloader.use_type_guessing = False

    # Deprecated: use ckanext.xloader.use_type_guessing instead.
    ckanext.xloader.just_load_with_messytables = False

    # Whether ambiguous dates should be parsed day first. Defaults to False.
    # If set to True, dates like '01.02.2022' will be parsed as day = 01,
    # month = 02.
    # NB: isoformat dates like '2022-01-02' will be parsed as YYYY-MM-DD, and
    # this option will not override that.
    # See https://dateutil.readthedocs.io/en/stable/parser.html#dateutil.parser.parse
    # for more details.
    ckanext.xloader.parse_dates_dayfirst = False

    # Whether ambiguous dates should be parsed year first. Defaults to False.
    # If set to True, dates like '01.02.03' will be parsed as year = 2001,
    # month = 02, day = 03. See https://dateutil.readthedocs.io/en/stable/parser.html#dateutil.parser.parse
    # for more details.
    ckanext.xloader.parse_dates_yearfirst = False

    # The maximum time for the loading of a resource before it is aborted.
    # Give an amount in seconds. Default is 60 minutes
    ckanext.xloader.job_timeout = 3600

    # Ignore the file hash when submitting to the DataStore, if set to True
    # resources are always submitted (if their format matches), if set to
    # False (default), resources are only submitted if their hash has changed.
    ckanext.xloader.ignore_hash = False

    # When loading a file that is bigger than `max_content_length`, xloader can
    # still try and load some of the file, which is useful to display a
    # preview. Set this option to the desired number of lines/rows that it
    # loads in this case.
    # If the file-type is supported (CSV, TSV) an excerpt with the number of
    # `max_excerpt_lines` lines will be submitted while the `max_content_length`
    # is not exceeded.
    # If set to 0 (default) files that exceed the `max_content_length` will
    # not be loaded into the datastore.
    ckanext.xloader.max_excerpt_lines = 100

    # Requests verifies SSL certificates for HTTPS requests. Set this to
    # False only during local development or testing. Defaults to True.
    ckanext.xloader.ssl_verify = True

    # Uses a specific API token for the xloader_submit action instead of the
    # apikey of the site_user
    ckanext.xloader.api_token = ckan-provided-api-token
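
The interaction between ``max_content_length`` and ``max_excerpt_lines`` can
be sketched as follows. This is one reading of the comments above, written as
a standalone function with hypothetical names; it is not XLoader's own code,
and the real behaviour may differ in detail.

.. code-block:: python

    import io


    def select_lines(stream, total_size, max_content_length, max_excerpt_lines):
        """Return the lines that would be loaded under the two limits."""
        if total_size <= max_content_length:
            # Small enough: load the whole file.
            return stream.readlines()
        if max_excerpt_lines == 0:
            # Too big and excerpts are disabled: load nothing.
            return []
        excerpt = []
        used = 0
        for line in stream:
            # Stop at the line limit, or before exceeding the size limit.
            if len(excerpt) >= max_excerpt_lines:
                break
            if used + len(line) > max_content_length:
                break
            excerpt.append(line)
            used += len(line)
        return excerpt


    data = b"a,b\n1,2\n3,4\n5,6\n"  # 16 bytes, 4 lines
    whole = select_lines(io.BytesIO(data), len(data), 100, 0)
    nothing = select_lines(io.BytesIO(data), len(data), 10, 0)
    excerpt = select_lines(io.BytesIO(data), len(data), 10, 3)

In this example ``excerpt`` holds only two lines, because adding the third
would exceed the 10-byte size limit before the 3-line limit is reached.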


------------------------
Developer installation
------------------------

To install XLoader for development, activate your CKAN virtualenv and run the
following in the parent directory of your local ckan repo::

    git clone https://github.com/ckan/ckanext-xloader.git
    cd ckanext-xloader
    python setup.py develop
    pip install -r requirements.txt
    pip install -r dev-requirements.txt


-------------------------
Upgrading from DataPusher
-------------------------

To upgrade from DataPusher to XLoader:

1. Install XLoader as above, including running the xloader worker.

2. (Optional) For existing datasets that have been datapushed to DataStore,
   freeze the column types (in the data dictionaries), so that XLoader doesn't
   change them back to string on the next xload::

       ckan -c /etc/ckan/default/ckan.ini migrate_types

3. If you've not already, change the enabled plugin in your config - on the
   ``ckan.plugins`` line replace ``datapusher`` with ``xloader``.

4. (Optional) If you wish, you can disable direct loading and continue to use
   only tabulator. For more about this, see the docs on the config option
   ``ckanext.xloader.use_type_guessing``.

5. Stop the datapusher worker::

       sudo a2dissite datapusher

6. Restart CKAN::

       sudo service apache2 reload
       sudo service nginx reload
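
The plugin swap in step 3 amounts to a small transformation of the
``ckan.plugins`` value. A minimal sketch, with a hypothetical function name
and example plugin lists:

.. code-block:: python

    def replace_datapusher(plugins_value):
        """Swap datapusher for xloader, keeping order and avoiding duplicates."""
        result = []
        for name in plugins_value.split():
            if name == "datapusher":
                name = "xloader"
            if name not in result:
                result.append(name)
        return " ".join(result)


    before = "stats datastore datapusher text_view"
    after = replace_datapusher(before)

Note that this sketch only replaces an existing ``datapusher`` entry; if
``datapusher`` was never enabled, ``xloader`` still has to be added by hand.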

----------------------
Command-line interface
----------------------

You can submit single or multiple resources to be xloaded using the
command-line interface.

e.g. ::

    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name>
    [pre-2.9] paster --plugin=ckanext-xloader xloader submit <dataset-name> -c /etc/ckan/default/ckan.ini

For debugging you can try xloading it synchronously (which does the load
directly, rather than asking the worker to do it) with the ``-s`` option::

    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name> -s
    [pre-2.9] paster --plugin=ckanext-xloader xloader submit <dataset-name> -s -c /etc/ckan/default/ckan.ini

See the status of jobs::

    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader status
    [pre-2.9] paster --plugin=ckanext-xloader xloader status -c /etc/ckan/default/development.ini

Submit all datasets' resources to the DataStore::

    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit all
    [pre-2.9] paster --plugin=ckanext-xloader xloader submit all -c /etc/ckan/default/ckan.ini

Re-submit all the resources already in the DataStore (this ignores any resources
that have not been stored in DataStore, e.g. because they are not tabular)::

    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit all-existing
    [pre-2.9] paster --plugin=ckanext-xloader xloader submit all-existing -c /etc/ckan/default/ckan.ini

**Full list of XLoader CLI commands**::

    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader --help
    [pre-2.9] paster --plugin=ckanext-xloader xloader --help
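
If you script these commands, the two invocation styles can be wrapped in one
helper. This is a sketch under the assumption that the CKAN version is known
to the caller as a tuple; the function name and its parameters are
hypothetical, and it only builds the argument list without running anything.

.. code-block:: python

    def xloader_command(args, config="/etc/ckan/default/ckan.ini",
                        ckan_version=(2, 9)):
        """Build the argument list for an xloader CLI call."""
        if ckan_version >= (2, 9):
            # CKAN 2.9+: the config option comes before the subcommand.
            return ["ckan", "-c", config, "xloader"] + list(args)
        # Earlier versions use paster, with the config option at the end.
        return (["paster", "--plugin=ckanext-xloader", "xloader"]
                + list(args) + ["-c", config])


    new_style = xloader_command(["submit", "all"])
    old_style = xloader_command(["submit", "all"], ckan_version=(2, 8))

The resulting list can be passed to ``subprocess.run`` from the standard
library if you want to execute it.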

Jobs and workers
----------------

Main docs for managing jobs: https://docs.ckan.org/en/latest/maintaining/background-tasks.html#managing-background-jobs

Main docs for running and managing workers: https://docs.ckan.org/en/latest/maintaining/background-tasks.html#running-background-jobs

Useful commands:

Clear (delete) all outstanding jobs::

    [CKAN 2.9, Python 3] ckan -c /etc/ckan/default/ckan.ini jobs clear [QUEUES]
    [CKAN <2.9, Python 2] paster --plugin=ckanext-xloader xloader jobs clear [QUEUES] -c /etc/ckan/default/development.ini

If having trouble with the worker process, restarting it can help::

    sudo supervisorctl restart ckan-worker:*

---------------
Troubleshooting
---------------

**KeyError: "Action 'datastore_search' not found"**

You need to enable the ``datastore`` plugin in your CKAN config. See the
'Installation' section above to do this, then restart the worker.

**ProgrammingError: (ProgrammingError) relation "_table_metadata" does not
exist**

Your DataStore permissions have not been set up - see:
https://docs.ckan.org/en/latest/maintaining/datastore.html#set-permissions

**When editing a package, all its existing resources get re-loaded by xloader**

This behavior was documented in
`Issue 75 <https://github.com/ckan/ckanext-xloader/issues/75>`_ and is related
to a bug in CKAN that is fixed in versions 2.6.9, 2.7.7, 2.8.4
and 2.9.0+.
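
To check whether a given CKAN release includes that fix, the version list
above can be encoded directly. A minimal sketch with a hypothetical function
name, assuming plain ``X.Y.Z`` version strings (no alpha or beta suffixes):

.. code-block:: python

    # First patch release containing the fix, per minor series.
    FIXED_IN = {(2, 6): 9, (2, 7): 7, (2, 8): 4}


    def has_issue_75_fix(version):
        """Return True if this CKAN version includes the fix."""
        major, minor, patch = (int(part) for part in version.split(".")[:3])
        if (major, minor) >= (2, 9):
            return True
        first_fixed = FIXED_IN.get((major, minor))
        return first_fixed is not None and patch >= first_fixed


    results = [has_issue_75_fix(v) for v in ("2.8.3", "2.8.4", "2.9.0")]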

-----------------
Running the Tests
-----------------

The first time, your test datastore database needs the trigger applied::

    sudo -u postgres psql datastore_test -f full_text_function.sql

To run the tests, do::

    nosetests --nologcapture --with-pylons=test.ini

To run the tests and produce a coverage report, first make sure you have
coverage installed in your virtualenv (``pip install coverage``) then run::

    nosetests --nologcapture --with-pylons=test.ini --with-coverage --cover-package=ckanext.xloader --cover-inclusive --cover-erase --cover-tests

----------------------------------
Releasing a New Version of XLoader
----------------------------------

XLoader is available on PyPI as https://pypi.org/project/ckanext-xloader.

To publish a new version to PyPI follow these steps:

1. Update the version number in the ``setup.py`` file.
   See `PEP 440 <http://legacy.python.org/dev/peps/pep-0440/#public-version-identifiers>`_
   for how to choose version numbers.

2. Update the CHANGELOG.

3. Make sure you have the latest version of necessary packages::

       pip install --upgrade setuptools wheel twine

4. Create source and binary distributions of the new version::

       python setup.py sdist bdist_wheel && twine check dist/*

   Fix any errors you get.

5. Upload the source and binary distributions to PyPI::

       twine upload dist/*

6. Commit any outstanding changes::

       git commit -a
       git push

7. Tag the new release of the project on GitHub with the version number from
   the ``setup.py`` file. For example if the version number in ``setup.py`` is
   0.0.1 then do::

       git tag 0.0.1
       git push --tags
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ckan/ckanext-xloader",
    "name": "ckanext-xloader",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "CKAN extension datastore",
    "author": "David Read",
    "author_email": "david.read@hackneyworkshop.com",
    "download_url": "https://files.pythonhosted.org/packages/a4/a8/90d9d58a3d6e8411fa9543830ce21129de9904d57c114029b6521025caef/ckanext-xloader-0.12.2.tar.gz",
    "platform": null,
    "description": ".. You should enable this project on travis-ci.org and coveralls.io to make\n   these badges work. The necessary Travis and Coverage config files have been\n   generated for you.\n\n.. image:: https://travis-ci.org/ckan/ckanext-xloader.svg?branch=master\n    :target: https://travis-ci.org/ckan/ckanext-xloader\n\n.. image:: https://img.shields.io/pypi/v/ckanext-xloader.svg\n    :target: https://pypi.org/project/ckanext-xloader/\n    :alt: Latest Version\n\n.. image:: https://img.shields.io/pypi/pyversions/ckanext-xloader.svg\n    :target: https://pypi.org/project/ckanext-xloader/\n    :alt: Supported Python versions\n\n.. image:: https://img.shields.io/pypi/status/ckanext-xloader.svg\n    :target: https://pypi.org/project/ckanext-xloader/\n    :alt: Development Status\n\n.. image:: https://img.shields.io/pypi/l/ckanext-xloader.svg\n    :target: https://pypi.org/project/ckanext-xloader/\n    :alt: License\n\n=========================\nXLoader - ckanext-xloader\n=========================\n\nLoads CSV (and similar) data into CKAN's DataStore. 
Designed as a replacement\nfor DataPusher because it offers ten times the speed and more robustness\n(hence the name, derived from \"Express Loader\")\n\n**OpenGov Inc.** has sponsored this development, with the aim of benefitting\nopen data infrastructure worldwide.\n\n-------------------------------\nKey differences from DataPusher\n-------------------------------\n\nSpeed of loading\n----------------\n\nDataPusher - parses CSV rows, converts to detected column types, converts the\ndata to a JSON string, calls datastore_create for each batch of rows, which\nreformats the data into an INSERT statement string, which is passed to\nPostgreSQL.\n\nXLoader - pipes the CSV file directly into PostgreSQL using COPY.\n\nIn `tests <https://github.com/ckan/ckanext-xloader/issues/25>`_, XLoader\nis over ten times faster than DataPusher.\n\nRobustness\n----------\n\nDataPusher - one cause of failure was when casting cells to a guessed type. The\ntype of a column was decided by looking at the values of only the first few\nrows. So if a column is mainly numeric or dates, but a string (like \"N/A\")\ncomes later on, then this will cause the load to error at that point, leaving\nit half-loaded into DataStore.\n\nXLoader - loads all the cells as text, before allowing the admin to\nconvert columns to the types they want (using the Data Dictionary feature). In\nfuture it could do automatic detection and conversion.\n\nSimpler queueing tech\n---------------------\n\nDataPusher - job queue is done by ckan-service-provider which is bespoke,\ncomplicated and stores jobs in its own database (sqlite by default).\n\nXLoader - job queue is done by RQ, which is simpler, is backed by Redis, allows\naccess to the CKAN model and is CKAN's default queue technology (since CKAN\n2.7). You can also debug jobs easily using pdb. 
Job results are stored in\nSqlite by default, and for production simply specify CKAN's database in the\nconfig and it's held there - easy.\n\n(The other obvious candidate is Celery, but we don't need its heavyweight\narchitecture and its jobs are not debuggable with pdb.)\n\nSeparate web server\n-------------------\n\nDataPusher - has the complication that the queue jobs are done by a separate\n(Flask) web app, apart from CKAN. This was the design because the job requires\nintensive processing to convert every line of the data into JSON. However it\nmeans more complicated code as info needs to be passed between the services in\nhttp requests, more for the user to set-up and manage - another app config,\nanother apache config, separate log files.\n\nXLoader - the job runs in a worker process, in the same app as CKAN, so\ncan access the CKAN config, db and logging directly and avoids many HTTP calls.\nThis simplification makes sense because the xloader job doesn't need to do much\nprocessing - mainly it is streaming the CSV file from disk into PostgreSQL.\n\nCaveat - column types\n---------------------\n\nNote: With XLoader, all columns are stored in DataStore's database as 'text'\ntype (whereas DataPusher did some rudimentary type guessing - see 'Robustness'\nabove). However once a resource is xloaded, an admin can use the resource's\nData Dictionary tab (CKAN 2.7 onwards) to change these types to numeric or\ndatestamp and re-load the file. 
When migrating from DataPusher to XLoader you\ncan preserve the types of existing resources by using the ``migrate_types``\ncommand.\n\nThere is scope to add functionality for automatically guessing column type -\noffers to contribute this are welcomed.\n\n\n------------\nRequirements\n------------\n\nCompatibility with core CKAN versions:\n\n=============== =============\nCKAN version    Compatibility\n=============== =============\n2.3             no longer tested and you must install ckanext-rq\n2.4             no longer tested and you must install ckanext-rq\n2.5             no longer tested and you must install ckanext-rq\n2.6             no longer tested and you must install ckanext-rq\n2.7             yes\n2.8             yes\n2.9             yes (both Python2 and Python3)\n=============== =============\n\n------------\nInstallation\n------------\n\nTo install XLoader:\n\n1. Activate your CKAN virtual environment, for example::\n\n     . /usr/lib/ckan/default/bin/activate\n\n2. Install the ckanext-xloader Python package into your virtual environment::\n\n     pip install ckanext-xloader\n\n3. Install dependencies::\n\n     pip install -r https://raw.githubusercontent.com/ckan/ckanext-xloader/master/requirements.txt\n     pip install -U requests[security]\n\n4. If you are using CKAN version before 2.8.x you need to define the\n   ``populate_full_text_trigger`` in your database\n   ::\n\n     sudo -u postgres psql datastore_default -f full_text_function.sql\n\n   If successful it will print\n   ::\n\n     CREATE FUNCTION\n     ALTER FUNCTION\n\n   NB this assumes you used the defaults for the database name and username.\n   If in doubt, check your config's ``ckan.datastore.write_url``. If you don't have\n   database name ``datastore_default`` and username ``ckan_default`` then adjust\n   the psql option and ``full_text_function.sql`` before running this.\n\n5. 
Add ``xloader`` to the ``ckan.plugins`` setting in your CKAN\n   config file (by default the config file is located at\n   ``/etc/ckan/default/production.ini``).\n\n   You should also remove ``datapusher`` if it is in the list, to avoid them\n   both trying to load resources into the DataStore.\n\n   Ensure ``datastore`` is also listed, to enable CKAN DataStore.\n\n6. Starting CKAN 2.10 you will need to set an API Token to be able to\n   execute jobs against the server::\n\n     ckanext.xloader.api_token = <your-CKAN-generated-API-Token>\n\n7. If it is a production server, you'll want to store jobs info in a more\n   robust database than the default sqlite file. It can happily use the main\n   CKAN postgres db by adding this line to the config, but with the same value\n   as you have for ``sqlalchemy.url``::\n\n     ckanext.xloader.jobs_db.uri = postgresql://ckan_default:pass@localhost/ckan_default\n\n   (This step can be skipped when just developing or testing.)\n\n8. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu::\n\n     sudo service apache2 reload\n\n9. Run the worker. First test it on the command-line. 
If you have CKAN version 2.9 or above::\n   \n    ckan -c /etc/ckan/default/ckan.ini jobs worker\n    \n   otherwise::\n\n     paster --plugin=ckan jobs -c /etc/ckan/default/ckan.ini worker\n\n   or if you have CKAN version 2.6.x or less (and are therefore using ckanext-rq)::\n\n     paster --plugin=ckanext-rq jobs -c /etc/ckan/default/ckan.ini worker\n\n   Test it will load a CSV ok by submitting a `CSV in the web interface <http://docs.ckan.org/projects/datapusher/en/latest/using.html#ckan-2-2-and-above>`_\n   or in another shell::\n\n     paster --plugin=ckanext-xloader xloader submit <dataset-name> -c /etc/ckan/default/ckan.ini\n\n   Clearly, running the worker on the command-line is only for testing - for\n   production services see:\n\n       http://docs.ckan.org/en/ckan-2.7.0/maintaining/background-tasks.html#using-supervisor\n\n   If you have CKAN version 2.6.x or less then you'll need to download\n   `supervisor-ckan-worker.conf <https://raw.githubusercontent.com/ckan/ckan/master/ckan/config/supervisor-ckan-worker.conf>`_ and adjust the ``command`` to reference\n   ckanext-rq.\n\n\n---------------\nConfig settings\n---------------\n\nConfiguration:\n\n::\n\n    # The connection string for the jobs database used by XLoader. The\n    # default of an sqlite file is fine for development. For production use a\n    # Postgresql database.\n    ckanext.xloader.jobs_db.uri = sqlite:////tmp/xloader_jobs.db\n\n    # The formats that are accepted. If the value of the resource.format is\n    # anything else then it won't be 'xloadered' to DataStore (and will therefore\n    # only be available to users in the form of the original download/link).\n    # Case insensitive.\n    # (optional, defaults are listed in plugin.py - DEFAULT_FORMATS).\n    ckanext.xloader.formats = csv application/csv xls application/vnd.ms-excel\n\n    # The maximum size of files to load into DataStore. In bytes. 
Default is 1 GB.\n    ckanext.xloader.max_content_length = 1000000000\n\n    # By default, xloader will first try to add tabular data to the DataStore\n    # with a direct PostgreSQL COPY. This is relatively fast, but does not\n    # guess column types. If this fails, xloader falls back to a method more\n    # like DataPusher's behaviour. This has the advantage that the column types\n    # are guessed. However it is more error prone and far slower.\n    # To always skip the direct PostgreSQL COPY and use type guessing, set\n    # this option to True.\n    ckanext.xloader.use_type_guessing = False\n\n    # Deprecated: use ckanext.xloader.use_type_guessing instead.\n    ckanext.xloader.just_load_with_messytables = False\n\n    # Whether ambiguous dates should be parsed day first. Defaults to False.\n    # If set to True, dates like '01.02.2022' will be parsed as day = 01,\n    # month = 02.\n    # NB: isoformat dates like '2022-01-02' will be parsed as YYYY-MM-DD, and\n    # this option will not override that.\n    # See https://dateutil.readthedocs.io/en/stable/parser.html#dateutil.parser.parse\n    # for more details.\n    ckanext.xloader.parse_dates_dayfirst = False\n\n    # Whether ambiguous dates should be parsed year first. Defaults to False.\n    # If set to True, dates like '01.02.03' will be parsed as year = 2001,\n    # month = 02, day = 03. See https://dateutil.readthedocs.io/en/stable/parser.html#dateutil.parser.parse\n    # for more details.\n    ckanext.xloader.parse_dates_yearfirst = False\n\n    # The maximum time for the loading of a resource before it is aborted.\n    # Give an amount in seconds. 
Default is 60 minutes\n    ckanext.xloader.job_timeout = 3600\n\n    # Ignore the file hash when submitting to the DataStore, if set to True\n    # resources are always submitted (if their format matches), if set to\n    # False (default), resources are only submitted if their hash has changed.\n    ckanext.xloader.ignore_hash = False\n\n    # When loading a file that is bigger than `max_content_length`, xloader can\n    # still try and load some of the file, which is useful to display a\n    # preview. Set this option to the desired number of lines/rows that it\n    # loads in this case.\n    # If the file-type is supported (CSV, TSV) an excerpt with the number of\n    # `max_excerpt_lines` lines will be submitted while the `max_content_length`\n    # is not exceeded.\n    # If set to 0 (default) files that exceed the `max_content_length` will\n    # not be loaded into the datastore.\n    ckanext.xloader.max_excerpt_lines = 100\n\n    # Requests verifies SSL certificates for HTTPS requests. Setting verify to\n    # False should only be enabled during local development or testing. Default\n    # to True.\n    ckanext.xloader.ssl_verify = True\n\n    # Uses a specific API token for the xloader_submit action instead of the\n    # apikey of the site_user\n    ckanext.xloader.api_token = ckan-provided-api-token\n\n\n------------------------\nDeveloper installation\n------------------------\n\nTo install XLoader for development, activate your CKAN virtualenv and\nin the directory up from your local ckan repo::\n\n    git clone https://github.com/ckan/ckanext-xloader.git\n    cd ckanext-xloader\n    python setup.py develop\n    pip install -r requirements.txt\n    pip install -r dev-requirements.txt\n\n\n-------------------------\nUpgrading from DataPusher\n-------------------------\n\nTo upgrade from DataPusher to XLoader:\n\n1. Install XLoader as above, including running the xloader worker.\n\n2. 
(Optional) For existing datasets that have been datapushed to datastore, freeze the column types (in the data dictionaries), so that XLoader doesn't change them back to string on next xload::\n\n       ckan -c /etc/ckan/default/ckan.ini migrate_types\n\n3. If you've not already, change the enabled plugin in your config - on the\n   ``ckan.plugins`` line replace ``datapusher`` with ``xloader``.\n\n4. (Optional) If you wish, you can disable the direct loading and continue to\n   just use tabulator - for more about this see the docs on config option:\n   ``ckanext.xloader.use_type_guessing``\n\n5. Stop the datapusher worker::\n\n       sudo a2dissite datapusher\n\n6. Restart CKAN::\n\n       sudo service apache2 reload\n       sudo service nginx reload\n\n----------------------\nCommand-line interface\n----------------------\n\nYou can submit single or multiple resources to be xloaded using the\ncommand-line interface.\n\ne.g. ::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name>\n    [pre-2.9] paster --plugin=ckanext-xloader xloader submit <dataset-name> -c /etc/ckan/default/ckan.ini\n\nFor debugging you can try xloading it synchronously (which does the load\ndirectly, rather than asking the worker to do it) with the ``-s`` option::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name> -s\n    [pre-2.9] paster --plugin=ckanext-xloader xloader submit <dataset-name> -s -c /etc/ckan/default/ckan.ini\n\nSee the status of jobs::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader status\n    [pre-2.9] paster --plugin=ckanext-xloader xloader status -c /etc/ckan/default/development.ini\n\nSubmit all datasets' resources to the DataStore::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit all\n    [pre-2.9] paster --plugin=ckanext-xloader xloader submit all -c /etc/ckan/default/ckan.ini\n\nRe-submit all the resources already in the DataStore (Ignores any resources\nthat have not been stored in DataStore e.g. 
because they are not tabular)::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader submit all-existing\n    [pre-2.9] paster --plugin=ckanext-xloader xloader submit all-existing -c /etc/ckan/default/ckan.ini\n\n**Full list of XLoader CLI commands**::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini xloader --help\n    [pre-2.9] paster --plugin=ckanext-xloader xloader --help\n\nJobs and workers\n----------------\n\nMain docs for managing jobs: <https://docs.ckan.org/en/latest/maintaining/background-tasks.html#managing-background-jobs>\n\nMain docs for running and managing workers: <https://docs.ckan.org/en/latest/maintaining/background-tasks.html#running-background-jobs>\n\nUseful commands:\n\nClear (delete) all outstanding jobs::\n\n    [2.9] ckan -c /etc/ckan/default/ckan.ini jobs clear [QUEUES]\n    [pre-2.9] paster --plugin=ckanext-xloader xloader jobs clear [QUEUES] -c /etc/ckan/default/development.ini\n\nIf you are having trouble with the worker process, restarting it can help::\n\n    sudo supervisorctl restart ckan-worker:*\n\n---------------\nTroubleshooting\n---------------\n\n**KeyError: \"Action 'datastore_search' not found\"**\n\nYou need to enable the ``datastore`` plugin in your CKAN config. 
See the\n'Installation' section above to do this, then restart the worker.\n\n**ProgrammingError: (ProgrammingError) relation \"_table_metadata\" does not\nexist**\n\nYour DataStore permissions have not been set up - see:\n<https://docs.ckan.org/en/latest/maintaining/datastore.html#set-permissions>\n\n**When editing a package, all its existing resources get re-loaded by xloader**\n\nThis behavior was documented in\n`Issue 75 <https://github.com/ckan/ckanext-xloader/issues/75>`_ and is related\nto a bug in CKAN that is fixed in versions 2.6.9, 2.7.7, 2.8.4\nand 2.9.0+.\n\n-----------------\nRunning the Tests\n-----------------\n\nThe first time, your test datastore database needs the trigger applied::\n\n    sudo -u postgres psql datastore_test -f full_text_function.sql\n\nTo run the tests, do::\n\n    nosetests --nologcapture --with-pylons=test.ini\n\nTo run the tests and produce a coverage report, first make sure you have\ncoverage installed in your virtualenv (``pip install coverage``), then run::\n\n    nosetests --nologcapture --with-pylons=test.ini --with-coverage --cover-package=ckanext.xloader --cover-inclusive --cover-erase --cover-tests\n\n----------------------------------\nReleasing a New Version of XLoader\n----------------------------------\n\nXLoader is available on PyPI at https://pypi.org/project/ckanext-xloader.\n\nTo publish a new version to PyPI, follow these steps:\n\n1. Update the version number in the ``setup.py`` file.\n   See `PEP 440 <http://legacy.python.org/dev/peps/pep-0440/#public-version-identifiers>`_\n   for how to choose version numbers.\n\n2. Update the CHANGELOG.\n\n3. Make sure you have the latest version of necessary packages::\n\n       pip install --upgrade setuptools wheel twine\n\n4. Create source and binary distributions of the new version::\n\n       python setup.py sdist bdist_wheel && twine check dist/*\n\n   Fix any errors you get.\n\n5. Upload the source and binary distributions to PyPI::\n\n       twine upload dist/*\n\n6. 
Commit any outstanding changes::\n\n       git commit -a\n       git push\n\n7. Tag the new release of the project on GitHub with the version number from\n   the ``setup.py`` file. For example, if the version number in ``setup.py`` is\n   0.0.1, then do::\n\n       git tag 0.0.1\n       git push --tags",
    "bugtrack_url": null,
    "license": "AGPL",
    "summary": "Express Loader - quickly load data into CKAN DataStore",
    "version": "0.12.2",
    "split_keywords": [
        "ckan",
        "extension",
        "datastore"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "65c5fd5bd3c5a403257fb0c50f9bf6f6",
                "sha256": "c7f347b6bd038c7b054f7e91a54159a6438e506cb7a284852f2566ade98b8f4b"
            },
            "downloads": -1,
            "filename": "ckanext-xloader-0.12.2.tar.gz",
            "has_sig": false,
            "md5_digest": "65c5fd5bd3c5a403257fb0c50f9bf6f6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 76999,
            "upload_time": "2022-11-30T14:52:08",
            "upload_time_iso_8601": "2022-11-30T14:52:08.907929Z",
            "url": "https://files.pythonhosted.org/packages/a4/a8/90d9d58a3d6e8411fa9543830ce21129de9904d57c114029b6521025caef/ckanext-xloader-0.12.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-11-30 14:52:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "ckan",
    "github_project": "ckanext-xloader",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [
        {
            "name": "ckantoolkit",
            "specs": []
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.11.1"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    ">=",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "tabulator",
            "specs": [
                [
                    "==",
                    "1.53.5"
                ]
            ]
        },
        {
            "name": "Unidecode",
            "specs": [
                [
                    "==",
                    "1.0.22"
                ]
            ]
        },
        {
            "name": "python-dateutil",
            "specs": [
                [
                    ">=",
                    "2.8.2"
                ]
            ]
        }
    ],
    "lcname": "ckanext-xloader"
}
        