tarentula

- Name: tarentula
- Version: 4.4.0
- Upload time: 2024-04-17 10:29:28
- Author: ICIJ
- Requires Python: <4.0,>=3.8
# Datashare Tarentula [![CircleCI](https://circleci.com/gh/ICIJ/datashare-tarentula.svg?style=svg)](https://circleci.com/gh/ICIJ/datashare-tarentula)

Cli toolbelt for [Datashare](https://datashare.icij.org).

```
     /      \
  \  \  ,,  /  /
   '-.`\()/`.-'
  .--_'(  )'_--.
 / /` /`""`\ `\ \
  |  |  ><  |  |
  \  \      /  /
      '.__.'

Usage: tarentula [OPTIONS] COMMAND [ARGS]...

Options:
  --syslog-address      TEXT    localhost   Syslog address
  --syslog-port         INTEGER 514         Syslog port
  --syslog-facility     TEXT    local7      Syslog facility
  --stdout-loglevel     TEXT    ERROR       Change the default log level for stdout error handler
  --help                                    Show this message and exit
  --version                                 Show the installed version of Tarentula

Commands:
  aggregate
  count
  clean-tags-by-query
  download
  export-by-query
  list-metadata
  tagging
  tagging-by-query
```

---
<!-- TOC depthFrom:2 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->

- [Installation](#installation)
- [Usage](#usage)
  - [Cookbook 👩‍🍳](#cookbook-)
  - [Count](#count)
  - [Clean Tags by Query](#clean-tags-by-query)
  - [Download](#download)
  - [Export by Query](#export-by-query)
  - [Tagging](#tagging)
    - [CSV formats](#csv-formats)
  - [Tagging by Query](#tagging-by-query)
  - [Aggregate](#aggregate)
  - [Following your changes](#following-your-changes)
- [Configuration File](#configuration-file)
- [Testing](#testing)
- [Releasing](#releasing)
  - [1. Create a new release](#1-create-a-new-release)
  - [2. Upload distributions on pypi](#2-upload-distributions-on-pypi)
  - [3. Build and publish the Docker image](#3-build-and-publish-the-docker-image)
  - [4. Push your changes on Github](#4-push-your-changes-on-github)

<!-- /TOC -->
---

## Installation

You can install Datashare Tarentula with your favorite package manager:

```
pip3 install --user tarentula
```

Or alternatively with Docker:

```
docker run icij/datashare-tarentula
```

## Usage

Datashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focused on bulk actions, it provides both a CLI interface and a Python API.

### Cookbook 👩‍🍳

To learn more about how to use Datashare Tarentula with a list of examples, please refer to <a href="./COOKBOOK.md">the Cookbook</a>.

### Count

A command to just count the number of files matching a query.

```
Usage: tarentula count [OPTIONS]

Options:
  --datashare-url           TEXT        Datashare URL
  --datashare-project       TEXT        Datashare project
  --elasticsearch-url       TEXT        You can additionally pass the Elasticsearch
                                          URL in order to use scrolling capabilities of
                                          Elasticsearch (useful when dealing with a
                                          lot of results)
  --query                   TEXT        The query string to filter documents
  --cookies                 TEXT        Key/value pairs to add a cookie to each
                                          request to the API. You can separate them
                                          with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --type [Document|NamedEntity]         Type of indexed documents to download
  --help                                Show this message and exit
```
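The `--cookies` value is a semicolon-separated list of `key=value` pairs. A minimal sketch of how such a string can be parsed into a dict before being attached to API requests (`parse_cookies` is a hypothetical helper, not part of Tarentula's API):

```python
def parse_cookies(raw: str) -> dict:
    """Parse a 'key1=val1;key2=val2' string into a cookie dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        cookies[key] = value
    return cookies

print(parse_cookies("session=abc123;theme=dark;"))  # {'session': 'abc123', 'theme': 'dark'}
```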

### Clean Tags by Query

A command that uses Elasticsearch `update-by-query` feature to batch untag documents directly in the index.

```
Usage: tarentula clean-tags-by-query [OPTIONS]

Options:
  --datashare-project       TEXT        Datashare project
  --elasticsearch-url       TEXT        Elasticsearch URL which is used to perform
                                          update by query
  --cookies                 TEXT        Key/value pairs to add a cookie to each
                                          request to the API. You can separate them
                                          with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --wait-for-completion / --no-wait-for-completion
                                        Create an Elasticsearch task to perform the
                                          update asynchronously
  --query                   TEXT        Give a JSON query to filter documents that
                                          will have their tags cleaned. It can be
                                          a file with @path/to/file. Defaults to all.
  --help                                Show this message and exit
```
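Several commands accept a JSON query either inline or as `@path/to/file`. A sketch of how that convention can be resolved, using a hypothetical `load_query` helper (not Tarentula's actual implementation):

```python
import json
from pathlib import Path

def load_query(value: str) -> dict:
    """Return the query as a dict; an '@path' value loads it from a JSON file."""
    if value.startswith("@"):
        return json.loads(Path(value[1:]).read_text())
    return json.loads(value)

# An inline JSON query and a file-based one resolve to the same structure.
print(load_query('{"query": {"match_all": {}}}'))
```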

### Download

A command to download all files matching a query.

```
Usage: tarentula download [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)

  --query TEXT                    The query string to filter documents
  --destination-directory TEXT    Directory where documents will be downloaded
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pairs to add a cookie to each
                                  request to the API. You can separate them
                                  with semicolons: key1=val1;key2=val2;...

  --path-format TEXT              Downloaded document path template
  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the downloaded document from the index

  -f, --from INTEGER              Passed to the search, it will bypass the
                                  first n documents
  -l, --limit INTEGER             Limit the total results to return
  --sort-by TEXT                  Field to use to sort results
  --order-by [asc|desc]           Order to use to sort results
  --once / --not-once             Download file only once
  --traceback / --no-traceback    Display a traceback in case of error
  --progressbar / --no-progressbar
                                  Display a progressbar
  --raw-file / --no-raw-file      Download raw file from Datashare
  --type [Document|NamedEntity]   Type of indexed documents to download
  --help                          Show this message and exit.
```
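The `-f/--from` and `-l/--limit` options skip the first n results and cap the total returned. A sketch of that windowing over any result iterator (an illustration of the semantics, assuming client-side slicing rather than Tarentula's actual mechanism):

```python
from itertools import islice

def window(results, from_=0, limit=None):
    """Skip the first `from_` results and yield at most `limit` of the rest."""
    stop = None if limit is None else from_ + limit
    return islice(results, from_, stop)

docs = [f"doc-{i}" for i in range(10)]
print(list(window(docs, from_=2, limit=3)))  # ['doc-2', 'doc-3', 'doc-4']
```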


### Export by Query

A command to export all files matching a query.

```
Usage: tarentula export-by-query [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)

  --query TEXT                    The query string to filter documents
  --output-file TEXT              Path to the CSV file
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pairs to add a cookie to each
                                  request to the API. You can separate them
                                  with semicolons: key1=val1;key2=val2;...

  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the export

  --sort-by TEXT                  Field to use to sort results
  --order-by [asc|desc]           Order to use to sort results
  --traceback / --no-traceback    Display a traceback in case of error
  --progressbar / --no-progressbar
                                  Display a progressbar
  --type [Document|NamedEntity]   Type of indexed documents to download
  -f, --from INTEGER              Passed to the search, it will bypass the
                                  first n documents
  -l, --limit INTEGER             Limit the total results to return
  --size INTEGER                  Size of the scroll request that powers the
                                  operation.

  --query-field / --no-query-field
                                  Add the query to the export CSV
  --help                          Show this message and exit.
```
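The export writes one row per matching document; with `--query-field` the query itself is added as an extra column. A minimal sketch of such a CSV export (the column names here are illustrative, not Tarentula's actual output schema):

```python
import csv
import io

def export_csv(docs, query, query_field=False):
    """Write documents to CSV, optionally appending the query as a column."""
    fieldnames = ["documentId", "contentType"]
    if query_field:
        fieldnames.append("query")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for doc in docs:
        row = {k: doc.get(k, "") for k in ("documentId", "contentType")}
        if query_field:
            row["query"] = query
        writer.writerow(row)
    return out.getvalue()

docs = [{"documentId": "l7VnZZEzg2fr960NWWEG", "contentType": "message/rfc822"}]
print(export_csv(docs, "*", query_field=True))
```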


### Tagging

A command to batch tag documents with a CSV file.

```
Usage: tarentula tagging [OPTIONS] CSV_PATH

Options:
  --datashare-url       TEXT        http://localhost:8080   Datashare URL
  --datashare-project   TEXT        local-datashare         Datashare project
  --throttle            INTEGER     0                       Request throttling (in ms)
  --cookies             TEXT        _Empty string_          Key/value pairs to add a cookie to each request to the API. You can separate them with semicolons: key1=val1;key2=val2;...
  --apikey              TEXT        None                    Datashare authentication apikey
  --traceback / --no-traceback                              Display a traceback in case of error
  --progressbar / --no-progressbar                          Display a progressbar
  --help                                                    Show this message and exit
```

#### CSV formats

Tagging with a `documentId` and `routing`:

```csv
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
```

Tagging with a `documentUrl`:

```csv
tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
```
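Rows like the ones above can be read with the standard `csv` module. A sketch of grouping tags per document (illustrative only; it assumes a missing `routing` defaults to the document id, as in the second row of the first example):

```python
import csv
import io

CSV_CONTENT = """\
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
"""

def tags_by_document(csv_text):
    """Group tags per (documentId, routing), defaulting routing to the id."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        routing = row.get("routing") or row["documentId"]
        grouped.setdefault((row["documentId"], routing), []).append(row["tag"])
    return grouped

print(tags_by_document(CSV_CONTENT))
```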

### Tagging by Query

A command that uses Elasticsearch `update-by-query` feature to batch tag documents directly in the index.

To see an example of an input file, refer to [this JSON](tests/fixtures/tags-by-content-type.json).

```
Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH

Options:
  --datashare-project       TEXT        Datashare project
  --elasticsearch-url       TEXT        Elasticsearch URL which is used to perform
                                          update by query
  --throttle                INTEGER     Request throttling (in ms)
  --cookies                 TEXT        Key/value pairs to add a cookie to each
                                          request to the API. You can separate them
                                          with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --progressbar / --no-progressbar      Display a progressbar
  --wait-for-completion / --no-wait-for-completion
                                        Create an Elasticsearch task to perform the
                                          update asynchronously
  --help                                Show this message and exit
```
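Under the hood, tagging by query relies on Elasticsearch's `update-by-query` API. A sketch of the kind of request body involved, using a Painless script to append missing tags (this illustrates the Elasticsearch API in general, not Tarentula's exact payload):

```python
def tagging_body(tags, query):
    """Build an update-by-query body that appends missing tags to documents."""
    return {
        "script": {
            "lang": "painless",
            "source": (
                "if (ctx._source.tags == null) { ctx._source.tags = params.tags } "
                "else { for (def t : params.tags) { "
                "if (!ctx._source.tags.contains(t)) { ctx._source.tags.add(t) } } }"
            ),
            "params": {"tags": tags},
        },
        "query": query,
    }

body = tagging_body(["Atracidae"], {"match": {"contentType": "message/rfc822"}})
print(body["script"]["params"])  # {'tags': ['Atracidae']}
```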


### List Metadata

You can list the metadata fields from the index mapping, optionally counting the number of occurrences of each field in the index with the `--count` parameter. Counting the fields is disabled by default.

It includes a `--filter_by` parameter to restrict the retrieved metadata properties to a specific set of documents. For instance, it can be used to get just email-related properties with: `--filter_by "contentType=message/rfc822"`

```
$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]

Options:
  --datashare-project TEXT       Datashare project
  --elasticsearch-url TEXT       You can additionally pass the Elasticsearch
                                 URL in order to use scrolling capabilities of
                                 Elasticsearch (useful when dealing with a lot
                                 of results)
  --type [Document|NamedEntity]  Type of indexed documents to get metadata
  --filter_by TEXT               Filter documents by comma-separated pairs of
                                 field names and values separated by =.
                                 Example: "contentType=message/rfc822,content
                                 Type=message/rfc822"
  --count / --no-count           Count or not the number of docs for each
                                 property found

  --help                         Show this message and exit.

```
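The `--filter_by` value is a comma-separated list of `field=value` pairs. A sketch of parsing it into (field, value) tuples (`parse_filter_by` is a hypothetical helper, not part of Tarentula):

```python
def parse_filter_by(raw: str):
    """Split 'field1=val1,field2=val2' into (field, value) tuples."""
    pairs = []
    for item in raw.split(","):
        field, _, value = item.partition("=")
        pairs.append((field.strip(), value.strip()))
    return pairs

print(parse_filter_by("contentType=message/rfc822"))
```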

### Aggregate

You can run aggregations on the data; the Elasticsearch aggregations API is partially exposed through this command.
The possibilities are:

- count: groups documents by the distinct values of a given field and counts the docs in each group.
- nunique: returns the number of unique values of a given field.
- date_histogram: returns counts of monthly or yearly grouped values for a given date field.
- sum: returns the sum of values of a number-type field.
- min: returns the minimum value of a number-type field.
- max: returns the maximum value of a number-type field.
- avg: returns the average of values of a number-type field.
- stats: returns a set of statistics for a given number-type field.
- string_stats: returns a set of string statistics for a given string-type field.



```
$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --cookies TEXT                  Key/value pairs to add a cookie to each
                                  request to the API. You can separate them
                                  with semicolons: key1=val1;key2=val2;...
  --traceback / --no-traceback    Display a traceback in case of error
  --type [Document|NamedEntity]   Type of indexed documents to download
  --group_by TEXT                 Field to use to aggregate results
  --operation_field TEXT          Field to run the operation on
  --run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
                                  Operation to run
  --calendar_interval [year|month]
                                  Calendar interval for date histogram
                                  aggregation
  --help                          Show this message and exit.
```
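For instance, a `date_histogram` run with `--calendar_interval month` corresponds to an Elasticsearch aggregation body like the one sketched below (this illustrates the Elasticsearch aggregations API; the field name and bucket name are illustrative, not Tarentula's exact request):

```python
def date_histogram_agg(field, interval="month"):
    """Build a calendar-interval date histogram aggregation body."""
    return {
        "aggs": {
            "docs_over_time": {
                "date_histogram": {
                    "field": field,
                    "calendar_interval": interval,
                }
            }
        },
        "size": 0,  # only the buckets are needed, not the hits themselves
    }

print(date_histogram_agg("extractionDate", "year"))
```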

### Following your changes

When running Elasticsearch changes on big datasets, they can take a very long time. Since we found ourselves curling Elasticsearch to check whether a task was still running, we added a small utility to follow the changes. It draws a live graph of a given Elasticsearch indicator with a specified filter.

It uses [matplotlib](https://matplotlib.org/) and python3-tk.

If you see the following message:

```
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
```

Then you have to install [tkinter](https://docs.python.org/3/library/tkinter.html), i.e. python3-tk for Debian/Ubuntu.

The command has the options below:

```
$ graph_es --help
Usage: graph_es [OPTIONS]

Options:
  --query               TEXT        Give a JSON query to filter documents. It can be
                                      a file with @path/to/file. Defaults to all.
  --index               TEXT        Elasticsearch index (default local-datashare)
  --refresh-interval    INTEGER     Graph refresh interval in seconds (default 5s)
  --field               TEXT        Field value to display over time (default "hits.total")
  --elasticsearch-url   TEXT        Elasticsearch URL which is used to perform
                                      update by query (default http://elasticsearch:9200)
```

## Configuration File

Tarentula supports several sources for configuring its behavior, including INI files and command-line options.

The configuration file is searched for in the following order (the first file found is used; all others are ignored):

  * `TARENTULA_CONFIG` (environment variable if set)
  * `tarentula.ini` (in the current directory)
  * `~/.tarentula.ini` (in the home directory)
  * `/etc/tarentula/tarentula.ini`

It should follow the format below (all values are optional):

```
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare

[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
```
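The INI file above can be read with Python's standard `configparser`; a sketch of how such values resolve (note that keys in `[DEFAULT]` are inherited by every other section):

```python
import configparser

CONFIG = """\
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080

[logger]
syslog_port = 514
stdout_loglevel = INFO
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
print(parser["logger"]["stdout_loglevel"])  # INFO
print(parser["logger"]["datashare_url"])    # http://here:8080, inherited from [DEFAULT]
```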

## Testing

To test this tool, you must have Datashare and Elasticsearch running on your development machine.

After you have [installed Datashare](https://datashare.icij.org/), just run it with a test project/user:

```
datashare -p test-datashare -u test
```

In a separate terminal, install the development dependencies:

```
make install
```

Finally, run the tests:

```
make test
```


## Releasing

The releasing process uses [bumpversion](https://pypi.org/project/bumpversion/) to manage versions of this package, [pypi](https://pypi.org/project/tarentula/) to publish the Python package and [Docker Hub](https://hub.docker.com/) for the Docker image.

### 1. Create a new release

```
make [patch|minor|major]
```

### 2. Upload distributions on pypi

_To be able to do this, you will need to be a maintainer of the [pypi](https://pypi.org/project/tarentula/) project._

```
make distribute
```

### 3. Build and publish the Docker image

To build and upload a new image to the [Docker repository](https://hub.docker.com/repository/docker/icij/datashare-tarentula):

_To be able to do this, you will need to be part of the ICIJ organization on Docker Hub._

```
make docker-publish
```

**Note**: Datashare Tarentula is a multi-platform build. You might need to set up your environment for
multi-platform builds using the `make docker-setup-multiarch` command. Read more
in [the Docker documentation](https://docs.docker.com/build/building/multi-platform/).

### 4. Push your changes on Github

Push the release and its tag:

```
git push origin master --tags
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "tarentula",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "ICIJ",
    "author_email": "engineering@icij.org",
    "download_url": "https://files.pythonhosted.org/packages/d0/b9/d66b4b4bb239d4ca0d8cca43a52130143bf601d0904d51e42da02af4e5f0/tarentula-4.4.0.tar.gz",
    "platform": null,
    "description": "# Datashare Tarentula [![CircleCI](https://circleci.com/gh/ICIJ/datashare-tarentula.svg?style=svg)](https://circleci.com/gh/ICIJ/datashare-tarentula)\n\nCli toolbelt for [Datashare](https://datashare.icij.org).\n\n```\n     /      \\\n  \\  \\  ,,  /  /\n   '-.`\\()/`.-'\n  .--_'(  )'_--.\n / /` /`\"\"`\\ `\\ \\\n  |  |  ><  |  |\n  \\  \\      /  /\n      '.__.'\n\nUsage: tarentula [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --syslog-address      TEXT    localhost   Syslog address\n  --syslog-port         INTEGER 514         Syslog port\n  --syslog-facility     TEXT    local7      Syslog facility\n  --stdout-loglevel     TEXT    ERROR       Change the default log level for stdout error handler\n  --help                                    Show this message and exit\n  --version                                 Show the installed version of Tarentula\n\nCommands:\n  aggregate\n  count\n  clean-tags-by-query\n  download\n  export-by-query\n  list-metadata\n  tagging\n  tagging-by-query\n```\n\n---\n<!-- TOC depthFrom:2 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->\n\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Cookbook \ud83d\udc69\u200d\ud83c\udf73](#cookbook-)\n  - [Count](#count)\n  - [Clean Tags by Query](#clean-tags-by-query)\n  - [Download](#download)\n  - [Export by Query](#export-by-query)\n  - [Tagging](#tagging)\n    - [CSV formats](#csv-formats)\n  - [Tagging by Query](#tagging-by-query)\n  - [Aggregate](#aggregate)\n  - [Following your changes](#following-your-changes)\n- [Configuration File](#configuration-file)\n- [Testing](#testing)\n- [Releasing](#releasing)\n  - [1. Create a new release](#1-create-a-new-release)\n  - [2. Upload distributions on pypi](#2-upload-distributions-on-pypi)\n  - [3. Build and publish the Docker image](#3-build-and-publish-the-docker-image)\n  - [4. 
Push your changes on Github](#4-push-your-changes-on-github)\n\n<!-- /TOC -->\n---\n\n## Installation\n\nYou can insatll Datashare Tarentula with your favorite package manager:\n\n```\npip3 install --user tarentula\n```\n\nOr alternativly with Docker:\n\n```\ndocker run icij/datashare-tarentula\n```\n\n## Usage\n\nDatashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focus on bulk actions, it provides you with both a cli interface and a python API.\n\n### Cookbook \ud83d\udc69\u200d\ud83c\udf73\n\nTo learn more about how to use Datashare Tarentula with a list of examples, please refer to <a href=\"./COOKBOOK.md\">the Cookbook</a>.\n\n### Count\n\nA command to just count the number of files matching a query.\n\n```\nUsage: tarentula count [OPTIONS]\n\nOptions:\n  --datashare-url           TEXT        Datashare URL\n  --datashare-project       TEXT        Datashare project\n  --elasticsearch-url       TEXT        You can additionally pass the Elasticsearch\n                                          URL in order to use scrollingcapabilities of\n                                          Elasticsearch (useful when dealing with a\n                                          lot of results)\n  --query                   TEXT        The query string to filter documents\n  --cookies                 TEXT        Key/value pair to add a cookie to each\n                                          request to the API. 
You can\n                                          separatesemicolons: key1=val1;key2=val2;...\n  --apikey                  TEXT        Datashare authentication apikey\n                                          in the downloaded document from the index\n  --traceback / --no-traceback          Display a traceback in case of error\n  --type [Document|NamedEntity]         Type of indexed documents to download\n  --help                                Show this message and exit\n```\n\n### Clean Tags by Query\n\nA command that uses Elasticsearch `update-by-query` feature to batch untag documents directly in the index.\n\n```\nUsage: tarentula clean-tags-by-query [OPTIONS]\n\nOptions:\n  --datashare-project       TEXT        Datashare project\n  --elasticsearch-url       TEXT        Elasticsearch URL which is used to perform\n                                          update by query\n  --cookies                 TEXT        Key/value pair to add a cookie to each\n                                          request to the API. You can\n                                          separatesemicolons: key1=val1;key2=val2;...\n  --apikey                  TEXT        Datashare authentication apikey\n  --traceback / --no-traceback          Display a traceback in case of error\n  --wait-for-completion / --no-wait-for-completion\n                                        Create a Elasticsearch task to perform the\n                                          updateasynchronously\n  --query                   TEXT        Give a JSON query to filter documents that\n                                          will have their tags cleaned. It can be\n                                          afile with @path/to/file. 
Default to all.\n  --help                                Show this message and exit\n```\n\n### Download\n\nA command to download all files matching a query.\n\n```\nUsage: tarentula download [OPTIONS]\n\nOptions:\n  --apikey TEXT                   Datashare authentication apikey\n  --datashare-url TEXT            Datashare URL\n  --datashare-project TEXT        Datashare project\n  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch\n                                  URL in order to use scrollingcapabilities of\n                                  Elasticsearch (useful when dealing with a\n                                  lot of results)\n\n  --query TEXT                    The query string to filter documents\n  --destination-directory TEXT    Directory documents will be downloaded\n  --throttle INTEGER              Request throttling (in ms)\n  --cookies TEXT                  Key/value pair to add a cookie to each\n                                  request to the API. 
You can\n                                  separatesemicolons: key1=val1;key2=val2;...\n\n  --path-format TEXT              Downloaded document path template\n  --scroll TEXT                   Scroll duration\n  --source TEXT                   A comma-separated list of field to include\n                                  in the downloaded document from the index\n\n  -f, --from INTEGER              Passed to the search it will bypass the\n                                  first n documents\n  -l, --limit INTEGER             Limit the total results to return\n  --sort-by TEXT                  Field to use to sort results\n  --order-by [asc|desc]           Order to use to sort results\n  --once / --not-once             Download file only once\n  --traceback / --no-traceback    Display a traceback in case of error\n  --progressbar / --no-progressbar\n                                  Display a progressbar\n  --raw-file / --no-raw-file      Download raw file from Datashare\n  --type [Document|NamedEntity]   Type of indexed documents to download\n  --help                          Show this message and exit.\n```\n\n\n### Export by Query\n\nA command to export all files matching a query.\n\n```\nUsage: tarentula export-by-query [OPTIONS]\n\nOptions:\n  --apikey TEXT                   Datashare authentication apikey\n  --datashare-url TEXT            Datashare URL\n  --datashare-project TEXT        Datashare project\n  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch\n                                  URL in order to use scrollingcapabilities of\n                                  Elasticsearch (useful when dealing with a\n                                  lot of results)\n\n  --query TEXT                    The query string to filter documents\n  --output-file TEXT              Path to the CSV file\n  --throttle INTEGER              Request throttling (in ms)\n  --cookies TEXT                  Key/value pair to add a cookie to each\n              
                    request to the API. You can\n                                  separatesemicolons: key1=val1;key2=val2;...\n\n  --scroll TEXT                   Scroll duration\n  --source TEXT                   A comma-separated list of field to include\n                                  in the export\n\n  --sort-by TEXT                  Field to use to sort results\n  --order-by [asc|desc]           Order to use to sort results\n  --traceback / --no-traceback    Display a traceback in case of error\n  --progressbar / --no-progressbar\n                                  Display a progressbar\n  --type [Document|NamedEntity]   Type of indexed documents to download\n  -f, --from INTEGER              Passed to the search it will bypass the\n                                  first n documents\n  -l, --limit INTEGER             Limit the total results to return\n  --size INTEGER                  Size of the scroll request that powers the\n                                  operation.\n\n  --query-field / --no-query-field\n                                  Add the query to the export CSV\n  --help                          Show this message and exit.\n```\n\n\n### Tagging\n\nA command to batch tag documents with a CSV file.\n\n```\nUsage: tarentula tagging [OPTIONS] CSV_PATH\n\nOptions:\n  --datashare-url       TEXT        http://localhost:8080   Datashare URL\n  --datashare-project   TEXT        local-datashare         Datashare project\n  --throttle            INTEGER     0                       Request throttling (in ms)\n  --cookies             TEXT        _Empty string_          Key/value pair to add a cookie to each request to the API. 
You can separate semicolons: key1=val1;key2=val2;...\n  --apikey              TEXT        None                    Datashare authentication apikey\n  --traceback / --no-traceback                              Display a traceback in case of error\n  --progressbar / --no-progressbar                          Display a progressbar\n  --help                                                    Show this message and exit\n```\n\n#### CSV formats\n\nTagging with a `documentId` and `routing`:\n\n```csv\ntag,documentId,routing\nActinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG\nAntrodiaetidae,DWLOskax28jPQ2CjFrCo\nAtracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN\nAtypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi\nBarychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi\n```\n\nTagging with a `documentUrl`:\n\n```csv\ntag,documentUrl\nMecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi\nMicrostigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0\nMigidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG\nNemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN\nParatropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM\nPorrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp\nTheraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu\n```\n\n### Tagging by Query\n\nA command that uses Elasticsearch `update-by-query` feature to batch tag documents directly in the index.\n\nTo see an example of input file, refer to [this JSON](tests/fixtures/tags-by-content-type.json).\n\n```\nUsage: tarentula tagging-by-query [OPTIONS] JSON_PATH\n\nOptions:\n  --datashare-project       TEXT        Datashare project\n  --elasticsearch-url       TEXT        Elasticsearch URL which is used to perform\n            
                                          update by query
  --throttle                INTEGER     Request throttling (in ms)
  --cookies                 TEXT        Key/value pair to add a cookie to each
                                          request to the API. You can separate
                                          pairs with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --progressbar / --no-progressbar      Display a progressbar
  --wait-for-completion / --no-wait-for-completion
                                        Create an Elasticsearch task to perform the
                                          update asynchronously
  --help                                Show this message and exit
```


### List Metadata

You can list the metadata from the mapping, optionally counting the number of occurrences of each field in the index with the `--count` parameter. Counting the fields is disabled by default.

It includes a `--filter_by` parameter to restrict metadata retrieval to specific sets of documents.
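To make the `--filter_by` value format concrete, here is a minimal sketch (not Tarentula's actual implementation; the function name and exact query shape are assumptions) of how comma-separated `field=value` pairs could map onto an Elasticsearch bool/term filter:

```python
def filter_by_to_query(filter_by: str) -> dict:
    """Translate 'field=value,field=value' pairs into an Elasticsearch
    bool filter. Illustrative sketch only; Tarentula's real code may differ."""
    terms = []
    for pair in filter_by.split(","):
        # partition keeps values containing '=' intact (only the first one splits)
        field, _, value = pair.partition("=")
        terms.append({"term": {field: value}})
    return {"query": {"bool": {"filter": terms}}}

query = filter_by_to_query("contentType=message/rfc822")
```

In practice you would pass the same string to `--filter_by` and let the command build the query for you.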
For instance, it can be used to get just email-related properties with: `--filter_by "contentType=message/rfc822"`

```
$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]

Options:
  --datashare-project TEXT       Datashare project
  --elasticsearch-url TEXT       You can additionally pass the Elasticsearch
                                 URL in order to use scrolling capabilities of
                                 Elasticsearch (useful when dealing with a lot
                                 of results)
  --type [Document|NamedEntity]  Type of indexed documents to get metadata
  --filter_by TEXT               Filter documents by pairs of field names and
                                 values separated by "=", concatenated with
                                 commas. Example:
                                 "contentType=message/rfc822,contentType=message/rfc822"
  --count / --no-count           Count (or not) the number of docs for each
                                 property found

  --help                         Show this message and exit.

```

### Aggregate

You can run aggregations on the data; this command exposes part of the Elasticsearch aggregations API.
The possibilities are:

- count: groups documents by the distinct values of a given field and counts the docs in each group.
- nunique: returns the number of unique values of a given field.
- date_histogram: returns counts of values grouped monthly or yearly for a given date field.
- sum: returns the sum of the values of a numeric field.
- min: returns the minimum value of a numeric field.
- max: returns the maximum value of a numeric field.
- avg: returns the average of the values of a numeric field.
- stats: returns a set of statistics for a given numeric field.
- string_stats: returns a set of string statistics for a given string field.

```
$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can separate pairs
                                  with semicolons: key1=val1;key2=val2;...
  --traceback / --no-traceback    Display a traceback in case of error
  --type [Document|NamedEntity]   Type of indexed documents to download
  --group_by TEXT                 Field to use to aggregate results
  --operation_field TEXT          Field to run the operation on
  --run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
                                  Operation to run
  --calendar_interval [year|month]
                                  Calendar interval for date histogram
                                  aggregation
  --help                          Show this message and exit.
```

### Following your changes

Running Elasticsearch changes on big datasets can take a very long time. As we were curling Elasticsearch to see if a task was still running, we added a small utility to follow the changes.
It draws a live graph of a given Elasticsearch indicator with a specified filter.

It uses [matplotlib](https://matplotlib.org/) and python3-tk.

If you see the following message:

```
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
```

then you have to install [tkinter](https://docs.python.org/3/library/tkinter.html), i.e. python3-tk on Debian/Ubuntu.

The command has the options below:

```
$ graph_es --help
Usage: graph_es [OPTIONS]

Options:
  --query               TEXT        Give a JSON query to filter documents. It can be
                                      a file with @path/to/file. Defaults to all.
  --index               TEXT        Elasticsearch index (default local-datashare)
  --refresh-interval    INTEGER     Graph refresh interval in seconds (default 5s)
  --field               TEXT        Field value to display over time (default "hits.total")
  --elasticsearch-url   TEXT        Elasticsearch URL which is used to perform
                                      update by query (default http://elasticsearch:9200)
```

## Configuration File

Tarentula supports several sources for configuring its behavior, including INI files and command-line options.

The configuration file is searched for in the following order (the first file found is used; all others are ignored):

  * `TARENTULA_CONFIG` (environment variable, if set)
  * `tarentula.ini` (in the current directory)
  * `~/.tarentula.ini` (in the home directory)
  * `/etc/tarentula/tarentula.ini`

It should follow the format below (all values below are optional):

```
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare

[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
```

## Testing

To test this tool, you must have Datashare and Elasticsearch
running on your development machine.

After you have [installed Datashare](https://datashare.icij.org/), just run it with a test project/user:

```
datashare -p test-datashare -u test
```

In a separate terminal, install the development dependencies:

```
make install
```

Finally, run the tests:

```
make test
```


## Releasing

The releasing process uses [bumpversion](https://pypi.org/project/bumpversion/) to manage versions of this package, [pypi](https://pypi.org/project/tarentula/) to publish the Python package and [Docker Hub](https://hub.docker.com/) for the Docker image.

### 1. Create a new release

```
make [patch|minor|major]
```

### 2. Upload distributions on pypi

_To be able to do this, you will need to be a maintainer of the [pypi](https://pypi.org/project/tarentula/) project._

```
make distribute
```

### 3. Build and publish the Docker image

To build and upload a new image to the [Docker repository](https://hub.docker.com/repository/docker/icij/datashare-tarentula):

_To be able to do this, you will need to be part of the ICIJ organization on Docker Hub._

```
make docker-publish
```

**Note**: Datashare Tarentula is a multi-platform build. You might need to set up your environment for multi-platform builds using the `make docker-setup-multiarch` command. Read more in [the Docker documentation](https://docs.docker.com/build/building/multi-platform/).

### 4. Push your changes on GitHub

Push the release and tag:

```
git push origin master --tags
```
        