| Name | tarentula |
| Version | 4.4.0 |
| Summary | None |
| Home page | None |
| Author | ICIJ |
| Maintainer | None |
| License | None |
| Keywords | None |
| Requires Python | <4.0,>=3.8 |
| Upload time | 2024-04-17 10:29:28 |
| Requirements | None recorded |
# Datashare Tarentula [](https://circleci.com/gh/ICIJ/datashare-tarentula)
Cli toolbelt for [Datashare](https://datashare.icij.org).
```
/ \
\ \ ,, / /
'-.`\()/`.-'
.--_'( )'_--.
/ /` /`""`\ `\ \
| | >< | |
\ \ / /
'.__.'
Usage: tarentula [OPTIONS] COMMAND [ARGS]...
Options:
--syslog-address TEXT localhost Syslog address
--syslog-port INTEGER 514 Syslog port
--syslog-facility TEXT local7 Syslog facility
--stdout-loglevel TEXT ERROR Change the default log level for stdout error handler
--help Show this message and exit
--version Show the installed version of Tarentula
Commands:
aggregate
count
clean-tags-by-query
download
export-by-query
list-metadata
tagging
tagging-by-query
```
---
<!-- TOC depthFrom:2 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->
- [Installation](#installation)
- [Usage](#usage)
  - [Cookbook 👩‍🍳](#cookbook-)
  - [Count](#count)
  - [Clean Tags by Query](#clean-tags-by-query)
  - [Download](#download)
  - [Export by Query](#export-by-query)
  - [Tagging](#tagging)
    - [CSV formats](#csv-formats)
  - [Tagging by Query](#tagging-by-query)
  - [List Metadata](#list-metadata)
  - [Aggregate](#aggregate)
  - [Following your changes](#following-your-changes)
- [Configuration File](#configuration-file)
- [Testing](#testing)
- [Releasing](#releasing)
  - [1. Create a new release](#1-create-a-new-release)
  - [2. Upload distributions on pypi](#2-upload-distributions-on-pypi)
  - [3. Build and publish the Docker image](#3-build-and-publish-the-docker-image)
  - [4. Push your changes on Github](#4-push-your-changes-on-github)
<!-- /TOC -->
---
## Installation
You can install Datashare Tarentula with your favorite package manager:
```
pip3 install --user tarentula
```
Or alternatively, with Docker:
```
docker run icij/datashare-tarentula
```
## Usage
Datashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focused on bulk actions, it provides both a CLI interface and a Python API.
### Cookbook 👩‍🍳
To learn more about how to use Datashare Tarentula with a list of examples, please refer to <a href="./COOKBOOK.md">the Cookbook</a>.
### Count
A command to just count the number of files matching a query.
```
Usage: tarentula count [OPTIONS]
Options:
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use the scrolling capabilities
of Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit
```
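Several commands take `--cookies` as semicolon-separated `key=value` pairs. As a sketch of what that format encodes (the helper below is illustrative, not Tarentula's actual parsing code):

```python
def parse_cookies(raw: str) -> dict:
    """Split a "key1=val1;key2=val2" string into a cookie dict."""
    cookies = {}
    for pair in raw.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        cookies[key.strip()] = value.strip()
    return cookies
```

For example, `parse_cookies("session=abc123;lang=en")` yields `{"session": "abc123", "lang": "en"}`.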
### Clean Tags by Query
A command that uses Elasticsearch `update-by-query` feature to batch untag documents directly in the index.
```
Usage: tarentula clean-tags-by-query [OPTIONS]
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--wait-for-completion / --no-wait-for-completion
Create an Elasticsearch task to perform the
update asynchronously
--query TEXT Give a JSON query to filter documents that
will have their tags cleaned. It can be
a file with @path/to/file. Default to all.
--help Show this message and exit
```
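The `@path/to/file` convention mentioned for `--query` can be sketched as follows (an illustrative helper under that assumption, not Tarentula's actual code):

```python
import json
import tempfile

def load_query(value: str) -> dict:
    """Return a JSON query, reading it from a file when the value starts with '@'."""
    if value.startswith("@"):
        with open(value[1:]) as f:
            return json.load(f)
    return json.loads(value)

# Inline JSON query:
inline = load_query('{"match_all": {}}')

# The same query passed through the @file convention:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"match_all": {}}, tmp)
from_file = load_query("@" + tmp.name)
```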
### Download
A command to download all files matching a query.
```
Usage: tarentula download [OPTIONS]
Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use the scrolling capabilities
of Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--destination-directory TEXT Directory documents will be downloaded to
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--path-format TEXT Downloaded document path template
--scroll TEXT Scroll duration
--source TEXT A comma-separated list of fields to include
in the downloaded document from the index
-f, --from INTEGER Passed to the search, it will skip the
first n documents
-l, --limit INTEGER Limit the total results to return
--sort-by TEXT Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--once / --not-once Download file only once
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--raw-file / --no-raw-file Download raw file from Datashare
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit.
```
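`--path-format` is a template for the path of each downloaded document. As a sketch of how such a template could be expanded with `str.format` (the placeholder names here are hypothetical, not Tarentula's documented ones):

```python
def render_path(template: str, document: dict) -> str:
    """Fill a path template with fields taken from the document metadata."""
    return template.format(**document)

# Hypothetical placeholders, for illustration only:
path = render_path("{contentType}/{id}.bin",
                   {"id": "l7VnZZEzg2fr960NWWEG", "contentType": "message"})
```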
### Export by Query
A command to export all files matching a query.
```
Usage: tarentula export-by-query [OPTIONS]
Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use the scrolling capabilities
of Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--output-file TEXT Path to the CSV file
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--scroll TEXT Scroll duration
--source TEXT A comma-separated list of fields to include
in the export
--sort-by TEXT Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--type [Document|NamedEntity] Type of indexed documents to download
-f, --from INTEGER Passed to the search, it will skip the
first n documents
-l, --limit INTEGER Limit the total results to return
--size INTEGER Size of the scroll request that powers the
operation.
--query-field / --no-query-field
Add the query to the export CSV
--help Show this message and exit.
```
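The `--query-field` toggle adds the query itself as an extra CSV column. A minimal sketch of that behavior with Python's `csv` module (illustrative, not Tarentula's implementation):

```python
import csv
import io

def export_rows(rows, query, include_query=False):
    """Write result rows to CSV, optionally appending the query as a column."""
    out = io.StringIO()
    fieldnames = list(rows[0]) + (["query"] if include_query else [])
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "query": query} if include_query else row)
    return out.getvalue()

exported = export_rows([{"_id": "l7VnZZEzg2fr960NWWEG"}], "tags:Atracidae",
                       include_query=True)
```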
### Tagging
A command to batch tag documents with a CSV file.
```
Usage: tarentula tagging [OPTIONS] CSV_PATH
Options:
--datashare-url TEXT http://localhost:8080 Datashare URL
--datashare-project TEXT local-datashare Datashare project
--throttle INTEGER 0 Request throttling (in ms)
--cookies TEXT _Empty string_ Key/value pair to add a cookie to each request to the API. You can separate them with semicolons: key1=val1;key2=val2;...
--apikey TEXT None Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--help Show this message and exit
```
#### CSV formats
Tagging with a `documentId` and `routing`:
```csv
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
```
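Such a file is easy to inspect or generate with Python's `csv` module. Note that `routing` may be omitted, as in the second row above; falling back to the `documentId` in that case is an assumption made for this sketch, not documented Tarentula behavior:

```python
import csv
import io

csv_content = """tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
"""

rows = list(csv.DictReader(io.StringIO(csv_content)))
for row in rows:
    # Assumed fallback when the routing column is omitted:
    routing = row["routing"] or row["documentId"]
    print(row["tag"], row["documentId"], routing)
```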
Tagging with a `documentUrl`:
```csv
tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
```
### Tagging by Query
A command that uses Elasticsearch `update-by-query` feature to batch tag documents directly in the index.
To see an example of an input file, refer to [this JSON](tests/fixtures/tags-by-content-type.json).
```
Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--wait-for-completion / --no-wait-for-completion
Create an Elasticsearch task to perform the
update asynchronously
--help Show this message and exit
```
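For reference, Elasticsearch's `update-by-query` API (which this command relies on) takes a body combining a script and a query. The payload below is a rough illustration of that shape; the painless script and field names are assumptions, not Tarentula's actual request:

```python
def tag_by_query_payload(tags, query):
    """Illustrative update-by-query body that appends tags to matching documents."""
    return {
        "script": {
            "lang": "painless",
            # Hypothetical script; Tarentula's real one may differ.
            "source": "ctx._source.tags.addAll(params.tags)",
            "params": {"tags": tags},
        },
        "query": query,
    }

payload = tag_by_query_payload(["Atracidae"],
                               {"match": {"contentType": "message/rfc822"}})
```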
### List Metadata
You can list the metadata fields from the index mapping, optionally counting the number of occurrences of each field with the `--count` parameter. Counting is disabled by default.
The `--filter_by` parameter narrows the listing to the metadata properties of a specific set of documents. For instance, to get only email-related properties: `--filter_by "contentType=message/rfc822"`
```
$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use the scrolling capabilities
of Elasticsearch (useful when dealing with a lot
of results)
--type [Document|NamedEntity] Type of indexed documents to get metadata
--filter_by TEXT Filter documents by comma-separated pairs of
field names and values separated by =.
Example: "contentType=message/rfc822"
--count / --no-count Count or not the number of docs for each
property found
--help Show this message and exit.
```
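The `--filter_by` value is thus a comma-separated list of `field=value` pairs. A sketch of how such a value decomposes (an illustrative helper, not Tarentula's code):

```python
def parse_filter_by(raw: str) -> list:
    """Split "field1=value1,field2=value2" into (field, value) tuples."""
    pairs = []
    for chunk in raw.split(","):
        field, _, value = chunk.partition("=")
        pairs.append((field.strip(), value.strip()))
    return pairs

pairs = parse_filter_by("contentType=message/rfc822")
```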
### Aggregate
You can run aggregations on the data; this command partially exposes the Elasticsearch aggregations API.
The available operations are:
- count: groups documents by the distinct values of a given field and counts the documents in each group.
- nunique: returns the number of unique values of a given field.
- date_histogram: returns counts grouped monthly or yearly for a given date field.
- sum: returns the sum of the values of a numeric field.
- min: returns the minimum value of a numeric field.
- max: returns the maximum value of a numeric field.
- avg: returns the average value of a numeric field.
- stats: returns a set of statistics for a given numeric field.
- string_stats: returns a set of string statistics for a given string field.
```
$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]
Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use the scrolling capabilities
of Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to download
--group_by TEXT Field to use to aggregate results
--operation_field TEXT Field to run the operation on
--run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
Operation to run
--calendar_interval [year|month]
Calendar interval for date histogram
aggregation
--help Show this message and exit.
```
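Under the hood, `--run date_histogram` with `--calendar_interval` maps onto Elasticsearch's `date_histogram` aggregation. A sketch of the search body that API expects (the aggregation name and the field are placeholders chosen for this example):

```python
def date_histogram_body(field: str, calendar_interval: str = "month") -> dict:
    """Build an Elasticsearch search body bucketing documents by date."""
    return {
        "size": 0,  # return buckets only, no hits
        "aggs": {
            "by_date": {
                "date_histogram": {
                    "field": field,
                    "calendar_interval": calendar_interval,
                }
            }
        },
    }

body = date_histogram_body("extractionDate", "year")
```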
### Following your changes
When running Elasticsearch changes on big datasets, it could take a very long time. As we were curling ES to see if the task was still running well, we added a small utility to follow the changes. It makes a live graph of a provided ES indicator with a specified filter.
It uses [mathplotlib](https://matplotlib.org/) and python3-tk.
If you see the following message :
```
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
```
Then you have to install [tkinter](https://docs.python.org/3/library/tkinter.html), i.e. python3-tk on Debian/Ubuntu.
The command has the options below:
```
$ graph_es --help
Usage: graph_es [OPTIONS]
Options:
--query TEXT Give a JSON query to filter documents. It can be
a file with @path/to/file. Default to all.
--index TEXT Elasticsearch index (default local-datashare)
--refresh-interval INTEGER Graph refresh interval in seconds (default 5s)
--field TEXT Field value to display over time (default "hits.total")
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query (default http://elasticsearch:9200)
```
## Configuration File
Tarentula supports several sources for configuring its behavior, including INI files and command-line options.
The configuration file is searched for in the following order (the first file found is used; all others are ignored):
* `TARENTULA_CONFIG` (environment variable if set)
* `tarentula.ini` (in the current directory)
* `~/.tarentula.ini` (in the home directory)
* `/etc/tarentula/tarentula.ini`
It should follow the format below (all values are optional):
```
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare
[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
```
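If Tarentula reads these files with Python's `configparser` (an assumption of this sketch), values placed in `[DEFAULT]` are automatically visible from every other section, which is why global settings such as `apikey` and `datashare_url` sit there:

```python
import configparser

sample = """
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080

[logger]
syslog_port = 514
stdout_loglevel = INFO
"""

config = configparser.ConfigParser()
config.read_string(sample)

# [DEFAULT] values are inherited by the [logger] section:
print(config["logger"]["stdout_loglevel"])
print(config["logger"]["datashare_url"])
```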
## Testing
To test this tool, you must have Datashare and Elasticsearch running on your development machine.
After you [installed Datashare](https://datashare.icij.org/), just run it with a test project/user:
```
datashare -p test-datashare -u test
```
In a separate terminal, install the development dependencies:
```
make install
```
Finally, run the tests:
```
make test
```
## Releasing
The releasing process uses [bumpversion](https://pypi.org/project/bumpversion/) to manage versions of this package, [pypi](https://pypi.org/project/tarentula/) to publish the Python package and [Docker Hub](https://hub.docker.com/) for the Docker image.
### 1. Create a new release
```
make [patch|minor|major]
```
### 2. Upload distributions on pypi
_To be able to do this, you will need to be a maintainer of the [pypi](https://pypi.org/project/tarentula/) project._
```
make distribute
```
### 3. Build and publish the Docker image
To build and upload a new image to the [docker repository](https://hub.docker.com/repository/docker/icij/datashare-tarentula):
_To be able to do this, you will need to be part of the ICIJ organization on Docker Hub._
```
make docker-publish
```
**Note**: Datashare Tarentula is a multi-platform build. You might need to set up your environment for
multi-platform builds using the `make docker-setup-multiarch` command. Read more
[in the Docker documentation](https://docs.docker.com/build/building/multi-platform/).
### 4. Push your changes on Github
Push the release commit and its tag:
```
git push origin master --tags
```