# EIS1600 Tools
* [File Preparation](#file-preparation)
* [Processing Workflow](#processing-workflow)
* [Installation](#installation)
  * [Common Error Messages](#common-error-messages)
* [Set Up](#set-up-virtual-environment-and-install-the-eis1600-pkg-there)
* [Working Directory Structure](#structure-of-the-working-directory)
* [Usage](#usage)
  * [convert_mARkdown_to_EIS1600TMP](#convert-markdown-to-eis1600-files)
  * [ids_insert_or_update](#eis1600TMP-to-eis1600)
  * [check_formatting](#check-eis1600-formatting)
  * [reannotation](#reannotation)
  * [q_tags_to_bio](#get-training-data-from-q-annotations)
  * [miu_random_revision](#miu-revision)
## File Preparation
1. Convert from mARkdown to EIS1600TMP with `convert_mARkdown_to_EIS1600TMP`.
2. Check the `.EIS1600TMP` file and correct the tagged structure.
3. Mark the file as ready in the Google Spreadsheet (this includes the file in our processing pipeline).
4. Optional: Run `ids_insert_or_update` on the checked `.EIS1600TMP` (or run `incorporate_newly_prepared_files_in_corpus`, which will add IDs for all files listed as ready or double-checked).
If you need to change the tagged structure in an `.EIS1600` file, make those changes with _Simple Markdown_.
Run `ids_insert_or_update` to convert the changes in _Simple Markdown_ to _EIS1600 mARkdown_.
Check the format of the EIS1600 file with `check_formatting <path/to/file>`.
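After editing, a typical sequence might look like this (placeholder path; use tab completion to get the real one):
```shell
$ ids_insert_or_update <OpenITI_EIS1600_Texts/data/path/to/file>.EIS1600
$ check_formatting <OpenITI_EIS1600_Texts/data/path/to/file>.EIS1600
```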
## Processing Workflow
1. Run `incorporate_newly_prepared_files_in_corpus`. This script downloads the Google Sheet and processes all ready and double-checked files:
1. Ready files are converted from EIS1600TMP to EIS1600 and IDs are added;
2. Formatting of ready files (now EIS1600 files) and double-checked files is checked;
3. IDs are updated if necessary.
Files are now finalized and ready to be processed by the pipeline.
2. Run `analyse_all_on_cluster`. This script analyses all files prepared by the previous step:
1. Each file is disassembled into MIUs;
2. Analysis routine is run for each MIU;
3. Results are returned as a JSON for each file, containing the annotated text, the populated YAML, and the analysis results (as a DataFrame).
The JSON files are ready to be imported into our database.
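Run from the `EIS1600` parent directory, the full workflow is just these two commands:
```shell
$ incorporate_newly_prepared_files_in_corpus
$ analyse_all_on_cluster
```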
## Installation
You can either do the complete local setup and have everything installed on your machine, or use the Docker image, which can execute all commands from the EIS1600-pkg.
### Docker Installation
Install Docker Desktop: [https://docs.docker.com/desktop/install/mac-install/](https://docs.docker.com/desktop/install/mac-install/)
It should install Docker Engine as well, which can be used through the command line interface (CLI).
To run a script from the EIS1600-pkg with Docker, pass the command to Docker through the CLI:
```shell
$ docker run <--gpus all> -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg <EIS1600-pkg-command and its params>
```
Explanation:
* `docker run` starts the image; `-it` propagates CLI input to the image.
* `--gpus all` (optional) runs Docker with GPU support.
* `-v` mounts a directory from your system into the Docker image: `</path/to/EIS1600>` on your machine is mapped to `/EIS1600` inside the image. Make sure to replace `</path/to/EIS1600>` with the absolute path to your `EIS1600` parent directory! The part in front of the colon is the source on your machine; after the colon comes the destination inside the Docker image (this one is fixed).
* `eis1600-pkg` is the name of the repository on Docker Hub from which the image is downloaded.
* Last comes the command from the package you want to execute, including all parameters required by that command.
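For example, assuming your `EIS1600` directory lives at `/home/user/EIS1600` (a hypothetical path), checking the formatting of the whole corpus would look like:
```shell
$ docker run -it -v "/home/user/EIS1600:/EIS1600" eis1600-pkg check_formatting
```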
E.g., to run `q_tags_to_bio` for toponym descriptions through Docker:
```shell
$ docker run -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg q_tags_to_bio Topo_Data/MIUs/ TOPONYM_DESCRIPTION_DETECTION/toponym_description_training_data TOPD
```
To run the annotation pipeline:
```shell
$ docker run --gpus all -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg analyse_all_on_cluster
```
You may need to add `-D` as a parameter to `analyse_all_on_cluster`, because parallel processing does not work with GPUs.
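With that flag, the call would look like:
```shell
$ docker run --gpus all -it -v "</path/to/EIS1600>:/EIS1600" eis1600-pkg analyse_all_on_cluster -D
```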
### Local Setup
After creating and activating the `eis1600_env` (see [Set Up](#set-up-virtual-environment-and-install-the-eis1600-pkg-there)), use:
```shell
$ pip install eis1600
```
In case you have an older version installed, use:
```shell
$ pip install --upgrade eis1600
```
The package comes with different install options. To install camel-tools, use the command below.
Also check the camel-tools installation instructions, because at the moment they require additional packages: [https://camel-tools.readthedocs.io/en/latest/getting_started.html#installation](https://camel-tools.readthedocs.io/en/latest/getting_started.html#installation)
```shell
$ pip install 'eis1600[NER]'
```
If you want to run the annotation pipeline, you also need to download camel-tools data:
```shell
$ camel_data -i disambig-mle-calima-msa-r13
```
To run the annotation pipeline with GPU, use this command:
```shell
$ pip install 'eis1600[EIS]'
```
**Note**. You can use `pip freeze` to check the versions of all installed packages, including `eis1600`.
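For example, to show just the installed `eis1600` version:
```shell
$ pip freeze | grep eis1600
```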
### Common Error Messages
You need to download all the models ONE BY ONE from Google Drive.
If you try to download the whole folder, something breaks and you get this error:
```shell
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory EIS1600_Pretrained_Models/camelbert-ca-finetuned
```
It is better to sync `EIS1600_Pretrained_Models` from our Nextcloud.
If you want to install `eis1600-pkg` from source, you have to add the data modules for `gazetteers` and `helper` manually.
You can find the modules in our Nextcloud.
## Set Up Virtual Environment and Install the EIS1600 PKG there
To avoid interfering with other Python installations, we recommend installing the package in a virtual environment.
To create a new virtual environment with python, run:
```shell
$ python3 -m venv eis1600_env
```
**NB:** when creating your new virtual environment, you must use Python 3.7 or 3.8, as these are the versions required by CAMeL-Tools.
After creating the environment, activate it with:
```shell
$ source eis1600_env/bin/activate
```
The environment is now activated and the eis1600 package can be installed into that environment with pip:
```shell
$ pip install eis1600
```
This command installs all dependencies as well, so you should see many other libraries being installed. If you do not, you probably used the wrong version of Python when creating your virtual environment.
You can now use the commands listed in this README.
To use the environment, you have to activate it in **every session**:
```shell
$ source eis1600_env/bin/activate
```
After successful activation, your prompt is prefixed with `(eis1600_env)`.
You will probably want to create an alias for the source command by adding the following line to your *alias* file:
```shell
alias eis="source eis1600_env/bin/activate"
```
Alias files:
- on Linux: `.bash_aliases`;
- on Mac: `.zshrc` if you use `zsh` (the default in recent versions of macOS).
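For example, with `zsh` on a Mac (adjust the path if your environment lives elsewhere):
```shell
$ echo 'alias eis="source eis1600_env/bin/activate"' >> ~/.zshrc
$ source ~/.zshrc
$ eis
```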
## Structure of the working directory
The working directory is always the main `EIS1600` directory which is a parent to all the different repositories.
The `EIS1600` directory has the following structure:
```
|
|---| eis1600_env
|---| EIS1600_JSONs
|---| EIS1600_Pretrained_Models (for annotation, sync from Nextcloud)
|---| gazetteers
|---| Master_Chronicle
|---| OpenITI_EIS1600_Texts
|---| Training_Data
```
Path variables are in the module `eis1600/helper/repo`.
## Usage
__All commands must be run from the [parent directory](#structure-of-the-working-directory) `EIS1600`!__
See also [Processing Workflow](#processing-workflow).
* Use `-D` flag to get detailed debug messages in the console.
### Annotation Pipeline
* Use `-P` flag to run annotation of MIUs in parallel; parallel processing will eat up __ALL__ resources!
* Use `-D` flag to get detailed debug messages in the console.
```shell
$ analyse_all_on_cluster
```
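For example, to run the pipeline in parallel with debug messages:
```shell
$ analyse_all_on_cluster -P -D
```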
### Convert mARkdown to EIS1600 files
Converts a mARkdown file to EIS1600TMP (without inserting UIDs).
The `.EIS1600TMP` file will be created next to the `.mARkdown` file (you can input `.inProcess` or `.completed` files as well).
This command can be run from anywhere within the text repo - use auto-completion (`tab`) to get the correct path to the file.
Alternatively, open the command line in the folder containing the file to be converted.
```shell
$ convert_mARkdown_to_EIS1600TMP <uri>.mARkdown
```
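For example, with a hypothetical OpenITI-style URI:
```shell
$ convert_mARkdown_to_EIS1600TMP 0748Dhahabi.TarikhIslam.Shamela0035100-ara1.mARkdown  # hypothetical URI
```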
#### Batch processing of mARkdown files
Run from the [parent directory](#structure-of-the-working-directory) `EIS1600`.
Use the `-e` option to convert all files from the EIS1600 repo.
```shell
$ convert_mARkdown_to_EIS1600 -e <EIS1600_repo>
```
### EIS1600TMP to EIS1600
EIS1600TMP files do not contain IDs yet; to insert IDs, run `ids_insert_or_update` on the `.EIS1600TMP` file.
Use auto-completion (`tab`) to get the correct path to the file.
```shell
$ ids_insert_or_update <OpenITI_EIS1600_Texts/data/path/to/file>.EIS1600TMP
```
This routine also updates IDs if you run it on an `.EIS1600` file.
Updating IDs means inserting missing UIDs and updating SubIDs.
```shell
$ ids_insert_or_update <OpenITI_EIS1600_Texts/data/path/to/file>.EIS1600
```
#### Batch processing
See also [Processing Workflow](#processing-workflow).
Use `incorporate_newly_prepared_files_in_corpus` to add IDs to all ready files from the Google Sheet.
```shell
$ incorporate_newly_prepared_files_in_corpus
```
### Check EIS1600 Formatting
Check if the formatting (structural tagging) is correct:
```shell
$ check_formatting <OpenITI_EIS1600_Texts/data/path/to/file>.EIS1600
```
#### Batch processing
Check the formatting of all `.EIS1600` files:
```shell
$ check_formatting
```
This will create a log file with all issues found at `OpenITI_EIS1600_Texts/mal_formatted_texts.log`.
It will print a list of files marked as 'ready' for which no `.EIS1600TMP` file was found, as well as a list of files marked as 'double-checked' for which no `.EIS1600` file was found.
Check whether the author and book URIs still match the folders and the file in the `OpenITI_EIS1600_Texts` directory.
### Reannotation
This script can be run on a folder containing files exported from the online editor. Those files are MIUs but are missing directionality tags and paragraph tags (they use new lines to indicate paragraphs).
Use these flags to activate the respective models for annotation:
* `-NER`
* `-O` [_Onomastics_]
* `-P` [_Persons and STFCOX_]
* `-T` [_Toponyms and BDKOPRX_]
__THIS WILL OVERWRITE THE ORIGINAL FILES IN THE FOLDER!__
```shell
$ reannotation -NER -O -P -T <path/to/folder>
```
### Get training data from Q annotations
This script can be used to transform Q-tags from EIS1600-mARkdown to BIO-labels.
The script will operate on a directory of MIUs and write a JSON file with annotated MIUs in BIO training format.
Parameters are:
1. Path to the directory containing annotated MIUs;
2. Filename or path inside the RESEARCH_DATA repo for the JSON output file;
3. BIO_main_class, optional, defaults to 'Q'; try to use something more meaningful and distinguishable.
```shell
$ q_tags_to_bio <path/to/MIUs/> <q_training_data> <bio_main_class>
```
For toponym definitions/descriptions:
```shell
$ q_tags_to_bio Topo_Data/MIUs/ TOPONYM_DESCRIPTION_DETECTION/toponym_description_training_data TOPD
```
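In standard BIO labeling, the first token of a tagged span is labeled `B-<main_class>`, continuation tokens `I-<main_class>`, and all other tokens `O`. Schematically, with main class `TOPD` (the tokens are purely illustrative):
```
wa-hiya    O
madinatun  B-TOPD
kabiratun  I-TOPD
```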
### MIU revision
Run the following command from the root of the MIU repo to revise automatically annotated files:
```shell
$ miu_random_revisions
```
When first run, the file *file_picker.yml* is added to the root of the MIU repository.
Make sure to specify your operating system, set your initials, and set the path or command for Kate in this YAML file.
```yaml
system: ... # options: mac, lin, win;
reviewer: eis1600researcher # change this to your name;
path_to_kate: kate # add absolute path to Kate on your machine; or a working alias (kate should already work)
```
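A filled-in example for Linux (with hypothetical reviewer initials) might look like:
```yaml
system: lin
reviewer: XY
path_to_kate: kate
```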
Optionally, you can specify a path from which to open files - e.g., if you only want to open training data, set:
```yaml
miu_main_path: ./training_data/
```
When revising files, remember to change
```yaml
reviewed : NOT REVIEWED
```
to
```yaml
reviewed : REVIEWED
```
### TSV dump
Run the following command from the root of the MIU repo to create TSV files with the corpus dump:
```shell
$ tsv_dump
```
This command will create two files:
1. `eis1600-structure.tsv` contains all structural data from the eis1600 corpus;
2. `eis1600-content.tsv` contains all content data from the eis1600 corpus.
   - By default, this file is split into 4 parts (`eis1600-content_part0001.tsv`, etc.), so that the files do not get too large. The output can be split into a different number of files using the argument `--parts`, e.g. `$ tsv_dump --parts 0` will create only one file, without any parts.
   - By default, all entities are added to the TSV output. The list of entities is: SECTIONS, TOKENS, TAGS_LISTS, NER_LABELS, LEMMAS, POS_TAGS, ROOTS, TOPONYM_LABELS, NER_TAGS, DATE_TAGS, MONTH_TAGS, ONOM_TAGS, ONOMASTIC_TAGS. A different selection of output entities can be made with the argument `--label_list`, e.g. `$ tsv_dump --label_list NER_LABELS NER_TAGS` will output only the information included in those entities.

> For example, to extract all TOPONYM_LABELS from the whole eis1600 data and output them to a single file, use: `$ tsv_dump --label_list TOPONYM_LABELS`
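The two arguments can also be combined; for instance, a two-part dump of only tokens and lemmas (a hypothetical selection):
```shell
$ tsv_dump --parts 2 --label_list TOKENS LEMMAS
```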