pytakes

Name	pytakes
Version	2.0.0
Summary	Simple entity extraction module from a lexicon.
upload_time	2023-07-24 22:55:03
requires_python	>=3.8
keywords	nlp information extraction

# pytakes

Simple entity extraction module released under the MIT license.

## Overview

This module will look for a pre-defined set of terms in a corpus of text, and use a variation of the negex/context
algorithm to determine whether these terms express negation, historical, or various other qualifiers. (The set of
negation terms is also configurable.)

## Requirements ##

* Python 3.8+
* See requirements.txt (`pip install -r requirements.txt`)
    * Various requirements-_.txt files are provided depending on your needs:
        * dev: for running tests, general development
        * db: for connecting to database using pyodbc
        * psql: connecting to postgres database
        * sas: if data is stored in SAS

## Prerequisites ##

1. Generate a word list of terms/concepts ('concept dictionary')
    * in pyTAKES, a 'concept' is a set of terms with more or less the same meaning (e.g., ckd, chronic kidney disease)
    * at a minimum, this should be a CSV file with three columns:
        * id - unique int for each line
        * cui - string label for a 'concept' ('concept unique identifier')
        * text - text to look for
    * a dictionary builder script is also provided to help generate variations of terms (documented below)
2. A corpus with an id (for tracking; this will be in the output) and a text field (for processing and extracting concepts)
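
As a minimal sketch, the three-column concept dictionary described above could be written with Python's standard `csv` module. The file name `concepts.csv`, the helper `write_concepts`, and the example CUI labels are placeholders chosen for illustration, not values that pytakes requires.

```python
import csv

# Hypothetical example rows: (id, cui, text).
# Terms that share a cui belong to the same concept.
rows = [
    (1, 'CKD', 'ckd'),
    (2, 'CKD', 'chronic kidney disease'),
    (3, 'DM', 'diabetes mellitus'),
]


def write_concepts(path, rows):
    """Write a minimal concept dictionary with the columns id, cui, text."""
    with open(path, 'w', newline='', encoding='utf8') as fh:
        writer = csv.writer(fh)
        writer.writerow(['id', 'cui', 'text'])
        writer.writerows(rows)


write_concepts('concepts.csv', rows)
```

Each `id` is unique per line, while the `cui` repeats for every term that should be grouped under one concept.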

## Doco ##

### Basics ###

* The entry point is `python example/run.py config.py`.
    * You can see an example `config.py` at `example/simple/example.config.py`
    * `pytakes` module must be on your PYTHONPATH, so `set/export PYTHONPATH=src` prior to running

### Install ###

1. Clone from git repo: `git clone ...pytakes.git`
2. `cd pytakes`
3. (optional) build virtualenv
    * `PYTHON_INSTALL/Scripts/virtualenv .venv`
    * `pip install virtualenv` if not yet available
4. Install prerequisites: `pip install -r requirements.txt`
5. Run tests (`pytest tests`)

### Use ###

You will need to have an input `concepts.csv` file with at least three columns (`id`, `cui`, `text`). There are several
examples in the `pytakes/tests/data` directory.

#### Negation Table ####

This table implements a modified version of Chapman's ConText (see,
e.g., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6566&rep=rep1&type=pdf,
and https://code.google.com/archive/p/negex/).

This table is loosely based on the csv file
here: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/negex/lexical_kb.csv

Negation works by searching for the following word categories and then using a set of rules to determine which words
should be 'qualified' in different ways. Within the recommended `jsonl` output, you'll find this included under
the `qualifiers` key. Consider the following example:

```json
{
  ...
  "qualifiers": {
    "certainty": 0,
    "hypothetical": true,
    "historical": false,
    "other_subject": false,
    "terms": [
      "not",
      "afraid"
    ]
  },
  ...
}
```

All values begin as `False` except `certainty`, which starts at 4, meaning `definite`/`affirmative`. Here, `certainty` is
0, meaning `negated`. This was triggered by the term 'not', which appears in the list of `terms`. `hypothetical` has been
marked as `True` due to the term 'afraid' (often in a context like 'afraid that he will catch a cold').

**Certainty:**

| Number | Interpretation |
|--------|----------------|
| 0      | negated        |
| 1      | improbable     |
| 2      | possible       |
| 3      | probable       |
| 4      | definite       |
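
When post-processing the `jsonl` output, the numeric `certainty` can be mapped back to these labels. The following is only a sketch: the helper `describe_qualifiers` is hypothetical, and the sample record contains just the `qualifiers` key from the example above (a real output line has additional keys that are elided there).

```python
import json

# Certainty scale taken from the table above.
CERTAINTY = {0: 'negated', 1: 'improbable', 2: 'possible', 3: 'probable', 4: 'definite'}


def describe_qualifiers(line):
    """Summarize the `qualifiers` of one jsonl output line in readable form."""
    qualifiers = json.loads(line)['qualifiers']
    return {
        'certainty': CERTAINTY[qualifiers['certainty']],
        # names of the boolean qualifiers that were switched on
        'flags': [key for key in ('hypothetical', 'historical', 'other_subject')
                  if qualifiers[key]],
        'terms': qualifiers['terms'],
    }


# Stripped-down record holding only the `qualifiers` key.
record = ('{"qualifiers": {"certainty": 0, "hypothetical": true, "historical": false, '
          '"other_subject": false, "terms": ["not", "afraid"]}}')
print(describe_qualifiers(record))
```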

**Columns:**

1. negex: negation (or related) term; capitalization and punctuation will be normalized (i.e., removed) so just include
   letters; I don't think regexes work
2. type: four letter abbreviation for negation role with brackets (these will vary based on your text and what you want
   to extract)
    * `[IMPR]`: improbable words (e.g., 'low probability')
        * sets `certainty` to 1
    * `[NEGN]`: negation words (e.g., 'denies')
        * sets `certainty` to 0
    * `[PSEU]`: pseudonegation (e.g., 'not only')
    * `[INDI]`: indication (e.g., 'rule out')
    * `[HIST]`: historical (e.g., 'previous')
        * sets `historical` to `True`
    * `[CONJ]`: conjunction - interferes with negation scope (e.g., 'though', 'except')
    * `[PROB]`: probable (e.g., 'appears')
        * sets `certainty` to 3
    * `[POSS]`: possible (e.g., 'possible')
        * sets `certainty` to 2
    * `[HYPO]`: hypothetical (e.g., 'might')
        * sets `hypothetical` to `True`
    * `[OTHR]`: other subject - refers to someone other than the subject (e.g., 'mother')
        * sets `other_subject` to `True`
    * `[SUBJ]`: subject - when reference of OTHR is still referring to the subject (e.g., 'by patient mother')
    * `[PREN]`: prenegation <- not sure if this is supposed to be used
    * `[AFFM]`: affirmed (e.g., 'obvious', 'positive for')
    * `[FUTP]`: future possibility (e.g., 'risk for')
3. direction
    * 0: directionality doesn't make sense (e.g., CONJ)
    * 1: term applies negation, etc. **backward** in the sentence (e.g., 'not seen')
    * 2: term applies negation, etc. **forward** in the sentence (e.g., 'dont see')
    * 3: term applies negation, etc. **forward and/or backward** in the sentence (e.g., 'likely')
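
The documented effects of the `type` column can be summarized in a small sketch. This is not pytakes' implementation: it ignores direction and scope entirely, the function names are hypothetical, and recording a term in `terms` only when it changes a value is an assumption made for this illustration.

```python
# Effects documented in the list above. Types not listed here (e.g., PSEU, CONJ)
# have no documented effect on the output values and are ignored in this sketch.
CERTAINTY_BY_TYPE = {'NEGN': 0, 'IMPR': 1, 'POSS': 2, 'PROB': 3}
FLAG_BY_TYPE = {'HIST': 'historical', 'HYPO': 'hypothetical', 'OTHR': 'other_subject'}


def new_qualifiers():
    """Starting values: everything False, certainty 4 (definite)."""
    return {'certainty': 4, 'hypothetical': False, 'historical': False,
            'other_subject': False, 'terms': []}


def apply_term(qualifiers, term, negex_type):
    """Apply one negation-table entry (e.g., 'not', '[NEGN]') to the qualifiers."""
    kind = negex_type.strip('[]')
    if kind in CERTAINTY_BY_TYPE:
        qualifiers['certainty'] = CERTAINTY_BY_TYPE[kind]
        qualifiers['terms'].append(term)
    elif kind in FLAG_BY_TYPE:
        qualifiers[FLAG_BY_TYPE[kind]] = True
        qualifiers['terms'].append(term)
    return qualifiers


qualifiers = new_qualifiers()
apply_term(qualifiers, 'not', '[NEGN]')
apply_term(qualifiers, 'afraid', '[HYPO]')
print(qualifiers)
```

Applying 'not' as `[NEGN]` and 'afraid' as `[HYPO]` reproduces the `qualifiers` values shown in the earlier JSON example.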

#### Concept/Term Table ####

This is the table containing the terms you want to search for (i.e., the entities you want extracted). I have added a
script to autogenerate these based on some basic configuration files.

| Column         | Type   | Description                                                                                                                                                       |
|----------------|--------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ID             | int    | identity column; unique integer for each row                                                                                                                      |
| CUI            | string | category identifier; can be used to "group" different terms together                                                                                              |
| Text           | string | term                                                                                                                                                              |
| RegexVariation | int    | amount of variation: 0=none; 3=very; 1=default; -1=don't even allow suffixes, exact matches only; see #Rules#parameters below; I suggest you just use "0" or "-1" |
| WordOrder      | int    | how accurate must the given word order be; 2=exactly; 1=fword constraint; 0=no word order                                                                         |
| MaxIntervening | int    | how many intervening words to allow when locating words; 'how many intervening words do I allow?'                                                                 |
| MaxWords       | int    | how many words to look ahead to find the next word; 'how far do I look ahead after each term?'                                                                    |

MaxIntervening and MaxWords should not be used together.

To autogenerate this table format, use the `pytakes-build-dictionary` script installed into the Python Scripts
directory. For an example, run this with the `--create-sample` option (optionally specify the output with
the `--path C:\somewhere` option). For additional specifications, see the "Dictionary Builder" section below.

#### Document Table ####

This is the table containing the text you are interested in searching.

The text itself must currently be labeled 'note_text'. The option to specify this is currently not implemented. Sorry.

The document table must also include a unique id for each note_text (just make an autoincrementing primary key). Specify
this and any other meta information you want to pass along under the `--meta-labels` option (**ensure that the unique
doc_id is specified first**).

#### Example Config File ####

I prefer to specify the configuration file as a Python file (`config.py`). YAML and JSON are also accepted. See an
example below (copied from `example.config.py`). Please note that the `print(config)` at the end is required.

    config = {
        'corpus': {  # how to get the text data
            'directories': [  # specify path to .txt files
                r'PATH'
            ],
            'connections': [  # specify other connection types
                {
                    'name': 'TABLENAME',
                    'name_col': 'TEXT ID COLUMN',
                    'text_col': 'TEXT COLUMN',
                    # specify either driver/server/database OR connection_string
                    # connection string examples here: https://docs.sqlalchemy.org/en/13/core/engines.html
                    'connection_string': 'SQLALCHEMY-LIKE CONNECTION_STRING',
                    # db args: driver/server/database
                    'driver': 'DRIVER',  # available listed in pytakes/iolib/sqlai.py, or use connection string
                    'server': 'SERVER',
                    'database': 'DATABASE',
                }
            ]
        },
        'keywords': [  # path to keyword files, usually stored as CSV
            {
                'path': r'PATH',
                'regex_variation': 0  # set to -1 if you don't want any expansion
            }
        ],
        'negation': {  # select either version or path (not both)
            'path': r'PATH TO NEGATION CSV FILE',
            'version': 1,  # int (version number), built-in/default
            'skip': False,  # bool: if skip is True: don't do negation
        },
        'output': {
            'path': r'PATH TO OUTPUT DIRECTORY',
            'outfile': 'NAME.out.jsonl',  # name of output file (or given default name)
            'hostname': ''
        },
    }
    
    print(config)
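
Since JSON configuration files are also accepted, a trimmed-down configuration could be generated as follows. This is a sketch under the assumption that a JSON config uses the same keys as the Python example above; all values are placeholders, and `config.json` is just an illustrative file name.

```python
import json

# Placeholder values throughout; only keys that appear in the example above are used.
config = {
    'corpus': {'directories': ['PATH']},
    'keywords': [{'path': 'PATH', 'regex_variation': 0}],
    'negation': {'version': 1, 'skip': False},  # built-in negation, so no 'path'
    'output': {'path': 'PATH TO OUTPUT DIRECTORY', 'outfile': 'NAME.out.jsonl'},
}

with open('config.json', 'w', encoding='utf8') as fh:
    json.dump(config, fh, indent=4)
```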

## Dictionary Builder ##

For a simple example, run (you will first need to install this package, run `python setup.py install` in the base
directory):

    pytakes-build-dictionary --create-sample --path OUTPUT_PATH

### COMMAND LINE ARGUMENTS ###

| Short | Long        | Description                                                                                  |
|-------|-------------|----------------------------------------------------------------------------------------------|
| -p    | --path      | Specifies parent directory of folders; program will prompt if unable to locate the directory |
| -o    | --output    | Specify output CSV file; if ".csv" is not included, it will be added                         |
| -t    | --table     | Specify output table in specified database (See below)                                       |
| -v    | --verbosity | Specify amount of log output to show: 3-most verbose; 0-least verbose                        |
|       | --driver    | If -t is specified, driver where table should be created. Defaults to SQL Server             |
|       | --server    | If -t is specified, server where table should be created.                                    |
|       | --database  | If -t is specified, database where table should be created.                                  |

### OUTPUT COLUMNS ###

Not all of these output columns are required (most don't do anything). This was originally designed for building a
dictionary using cTAKES.

| Column         | Type          | Description                                                                               |
|----------------|---------------|-------------------------------------------------------------------------------------------|
| ID             | int           | identity column; unique integer for each row                                              |
| CUI            | varchar(8)    | category identifier; can be used to "group" different terms together                      |
| Fword          | varchar(80)   | first word of term                                                                        |
| Text           | varchar(8000) | term                                                                                      |
| TextLength     | int           | length of term (all characters including spaces)                                          |
| RegexVariation | int           | amount of variation: 0=none; 3=very; 1=default; see #Rules#parameters below               |
| WordOrder      | int           | how accurate must the given word order be; 2=exactly; 1=fword constraint; 0=no word order |
| Valence        | int           | this should just be "1"; program is not designed to work with this correctly              |

### RULES ###

Rules are the text entries in the cTAKES-like dictionary; however, they can include "categories" in addition to just
text. A category is any string of text surrounded by "[" and "]". The intervening text string is the name of a
"category". The category must have a definition, and each item (synonym) in the definition will be used in the rule.

For example, if a rule is `[smart_person] is smart` and the category `smart_person` is defined by the terms "Albert
Einstein", "Old McDonald", and "Brain", then the resulting output will be

    Albert Einstein is smart
    Old McDonald is smart
    Brain is smart

The rule file consists of a set of rules (as above), and each rule must be on its own line.

    [smart_person] is smart
    smart [smart_person]
    [smart_person] not dumb
    [not_so_smart] not smart 

The rule file must be named "rules" or "rules.some-extension" (e.g., "rules.txt").
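
The expansion of categories into concrete terms can be illustrated with a short sketch. This is not the dictionary builder's own code: `expand_rule` and the in-memory `definitions` mapping are hypothetical names, and parameters and comments are not handled here.

```python
import itertools
import re

CATEGORY = re.compile(r'\[([^\]]+)\]')  # matches "[name]" and captures "name"


def expand_rule(rule, definitions):
    """Return every term produced by substituting category members into a rule."""
    # re.split with a capturing group alternates literal text and category names:
    # '[a] is smart' -> ['', 'a', ' is smart']
    parts = CATEGORY.split(rule.strip())
    options = [definitions[part] if i % 2 else [part] for i, part in enumerate(parts)]
    return [''.join(combo) for combo in itertools.product(*options)]


definitions = {
    'smart_person': ['Albert Einstein', 'Old McDonald', 'Brain'],
    'not_so_smart': ['Humpty Dumpty'],
}

for term in expand_rule('[smart_person] is smart', definitions):
    print(term)
```

A rule with several categories yields one term for every combination of their members.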

#### Parameters ####

Rules may also maintain configuration parameters. The configurations are indexed in the following order (bold indicates
the default parameter):

| RegexVariation | Description                                               |
|----------------|-----------------------------------------------------------|
| 0              | **(Default)** no variation in regular expression coverage |
| 1              | minimal variation in regular expression coverage          |
| 2              | moderate variation in regular expression coverage         |
| 3              | high flexibility in regular expression coverage           |

| WordOrder | Description                           |
|-----------|---------------------------------------|
| 0         | free word order                       |
| 1         | **(Default)** enforce first word rule |
| 2         | require precise word order            |

| Valence | Description                |
|---------|----------------------------|
| 1       | **(Default)** Always use 1 |

These are designated by the double percent ('%%') and follow the rule.

    [category]%%REGEX_VARIATION,WORD_ORDER,VALENCE

For example:

    [smart_person] is smart%%1,2   # minimal regex variation; requires precise word order
    [smart_person]%%,2             # precise word order; the first parameter is left blank, so its default is used
    [smart_person] not dumb        # defaults for both parameters are used
    [not_so_smart] not smart%%2    # moderate regex variation; default second parameter

Definitions/categories (see below) can also be assigned parameters in exactly the same way. When the parameters
collide/disagree (e.g., the rule asks for free word order, but the definition asks that the first word rule be
enforced), the more conservative will be selected.
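
To make the `%%` syntax concrete, here is a sketch of how a rule line could be split into its text and parameters. `parse_rule` is a hypothetical helper, not part of pytakes; the defaults are taken from the tables above, and conflict resolution between rules and definitions is deliberately left out.

```python
# Defaults from the tables above: regex_variation, word_order, valence.
DEFAULTS = (0, 1, 1)


def parse_rule(line):
    """Split a rule line into (text, regex_variation, word_order, valence)."""
    line = line.split('#', 1)[0].strip()  # drop any trailing comment
    text, _, params = line.partition('%%')
    values = [p.strip() for p in params.split(',')] if params else []
    values += [''] * (len(DEFAULTS) - len(values))  # pad omitted trailing parameters
    # a blank position falls back to the default for that position
    resolved = tuple(int(v) if v else d for v, d in zip(values, DEFAULTS))
    return (text.strip(),) + resolved


print(parse_rule('[smart_person] is smart%%1,2   # minimal regex variation'))
print(parse_rule('[smart_person]%%,2'))
```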

#### DEFINITIONS ####

Definition files (also called "category" files) provide a set of words to replace the name of a category in a particular
rule. A definition file must either be within a "cat" directory or have the extension ".cat". The program will
choose one or the other; which one is undefined. There are several ways to write the definition for a particular
category. Examples:

In the definition file `smart_person.cat`, each row will be considered a definition for the `smart_person` category (a
`not_so_smart.cat` file should also be included).

Alternatively, in a single definition file `definitions.cat`:

    [smart_person]                       
    Albert Einstein                       
    Old McDonald                       
    Brain                       
    [not_so_smart]                       
    Humpty Dumpty  

#### Comments. ####

All lines beginning with "#" are ignored, and all characters occurring after a '#' are ignored as comments.

    # last updated by me, yesterday evening                       
    [smart_person]                       
    Albert Einstein       # comment here                       
    Old McDonald                       
    Brain                       
    [not_so_smart]                       
    # shouldn't there be others?!?                       
    Humpty Dumpty  

#### CUIs. ####

CUIs are usually assigned uniquely for each rule, rather than for a category. A CUI can be included for a given
definition of a category by assigning it with the syntax `C1025459==Albert Einstein`.
Or, in the entire definition file:

    [smart_person]                       
    C1025459==Albert Einstein                       
    C4495545==Old McDonald                       
    Brain                       
    [not_so_smart]                       
    Humpty Dumpty   

In the above example, Humpty Dumpty and Brain are both assigned a default CUI.

#### Word Variant Notation. ####

Definitions may also be written on a single line, separated by the double pipe (i.e., '||'). If more than three or four
definitions are listed on a single line, the definitions file becomes somewhat unreadable. Thus, it is best practice to
only include word variants on a single line.

    [smart_person]                       
    C1025459==Albert Einstein||Einstein                       
    C4495545==Old McDonald||Ol' McDonald||Ole McDonald||Jeff                       
    Brain                       
    [not_so_smart]                       
    Humpty Dumpty||Humpty-Dumpty # this is a common use for word-variant notation   
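
The definition-file conventions described so far (category headers, `#` comments, `CUI==term`, and `||` variants) can be combined in one small parsing sketch. `parse_definitions` is a hypothetical function written for illustration, not the builder's actual parser; it represents a missing CUI as `None` and does not handle `%%` parameters.

```python
def parse_definitions(lines):
    """Map each category name to a list of (cui, [variants]) entries."""
    categories, current = {}, None
    for raw in lines:
        line = raw.split('#', 1)[0].strip()  # strip comments and padding
        if not line:
            continue
        if line.startswith('[') and line.endswith(']'):
            current = categories.setdefault(line[1:-1], [])
            continue
        if current is None:
            raise ValueError(f'definition before any [category] header: {line!r}')
        cui, sep, terms = line.partition('==')
        if not sep:  # no explicit CUI on this line
            cui, terms = None, line
        current.append((cui, [t.strip() for t in terms.split('||')]))
    return categories


sample = """
[smart_person]
C1025459==Albert Einstein||Einstein
Brain
[not_so_smart]
# shouldn't there be others?!?
Humpty Dumpty||Humpty-Dumpty # word-variant notation
""".splitlines()

print(parse_definitions(sample))
```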

#### Parameters ####

For definitions, see the parameters section under Rules.

Example:

    [smart_person]                       
    C1025459==Albert Einstein%%1,2   # all rules involving this definition will have minimal regex variation; requires precise word order
    C4495545==Old McDonald                       
    Brain                       
    [not_so_smart]                       
    Humpty Dumpty              # will use the defaults

**Conflict Resolution.**

Regardless of how the conflict occurs, the more conservative of the rule and all relevant definitions will be chosen.
NB: This process will never choose values that have been left as default (unless the default is specifically requested).

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "pytakes",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "nlp,information extraction",
    "author": "",
    "author_email": "dcronkite <dcronkite+pypi@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/5a/86/c5560d9a9c800f089d0150167a4bd85e83c3a56bd799aaa9e6865cc3f351/pytakes-2.0.0.tar.gz",
    "platform": null,
    "description": "# pytakes\n\nSimple entity extraction module released under the MIT license.\n\n## Overview\n\nThis module will look for a pre-defined set of terms in a corpus of text, and use a variation of the negex/context\nalgorithm to determine whether these terms express negation, historical, or various other qualifiers. (The set of\nnegation terms is also configurable.)\n\n## Requirements ##\n\n* Python 3.8+\n* See requirements.txt (`pip install -r requirements.txt`)\n    * Various requirements-_.txt files are provided depending on your needs:\n        * dev: for running tests, general development\n        * db: for connecting to database using pyodbc\n        * psql: connecting to postgres database\n        * sas: if data is stored in SAS\n\n## Prerequisites ##\n\n1. Generate a word list of terms/concepts ('concept dictionary')\n    * in pyTAKES, a 'concept' is a set of terms with more or less the same meaning (e.g., ckd, chronic kidney disease)\n    * the minimal should be a CSV file with three columns:\n        * id - unique int for each line\n        * cui - string label for a 'concept' ('concept unique identifier')\n        * text - text to look for\n    * dictionary builder script is also provided which help generate variations of terms(documented below)\n2. A corpus with an id (for tracking, this will be in output) and text field (for processing, extracting concepts)\n\n## Doco ##\n\n### Basics ###\n\n* The entry point is `python example/run.py config.py`.\n    * You can see an example `config.py` at `example/simple/example.config.py`\n    * `pytakes` module must be on your PYTHONPATH, so `set/export PYTHONPATH=src` prior to running\n\n### Install ###\n\n1. Clone from git repo: `git clone ...pytakes.git`\n2. `cd pytakes`\n3. (optional) build virtualenv\n    * `PYTHON_INSTALL/Scripts/virtualenv .venv`\n    * `pip install virtualenv` if not yet available\n4. Pip install prerequisites `pip install -r requirements.txt`\n5. 
Run tests (`pytest tests`)\n\n### Use ###\n\nYou will need to have an input `concepts.csv` file with at least three columns (`id`, `cui`, `text`). There are several\nexamples in the `pytakes/tests/data` directory.\n\n#### Negation Table ####\n\nThis table implements a modified version of Chapman's ConText (see,\ne.g., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6566&rep=rep1&type=pdf,\nand https://code.google.com/archive/p/negex/).\n\nThis table is loosely based on the csv file\nhere: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/negex/lexical_kb.csv\n\nNegation works by searching for the following word categories and then using a set of rules to determine which words\nshould be 'qualified' in different ways. Within the recommended `jsonl` output, you'll find this included under\nthe `qualifiers` key. Consider the following example:\n\n```json\n{\n  ...\n  \"qualifiers\": {\n    \"certainty\": 0,\n    \"hypothetical\": true,\n    \"historical\": false,\n    \"other_subject\": false,\n    \"terms\": [\n      \"not\",\n      \"afraid\"\n    ]\n  },\n  ...\n}\n```\n\nAll values begin as `False` except `certainty` which starts as 4, meaning `definite`/`affirmative`. Here, `certainty` is\n0, meaning `negated`. This was done by the term 'not' which appears in the list of `terms`. `hypothetical` has been\nmarked as `True` due to the term `afraid` (often in a context like 'afraid that he will catch a cold').\n\n**Certainty:**\n\n| Number | Interpretation |\n|--------|----------------|\n| 0      | negated        |\n| 1      | improbable     |\n| 2      | possible       |\n| 3      | probable       |\n| 4      | definite       |\n\n**Columns:**\n\n1. negex: negation (or related) term; capitalization and punctuation will be normalized (i.e., removed) so just include\n   letters; I don't think regexes work\n2. 
type: four letter abbreviation for negation role with brackets (these will vary based on your text and what you want\n   to extract)\n    * `[IMPR]`: improbable words (e.g., 'low probability')\n        * sets `certainty` to 1\n    * `[NEGN]`: negation words (e.g., 'denies')\n        * sets `certainty` to 0\n    * `[PSEU]`: pseudonegation (e.g., 'not only')\n    * `[INDI]`: indication (e.g., 'rule out')\n    * `[HIST]`: historical (e.g., 'previous')\n        * sets `historical` to `True`\n    * `[CONJ]`: conjunction - interferes with negation scope (e.g., 'though', 'except')\n    * `[PROB]`: probable (e.g., 'appears')\n        * sets `certainty` to 3\n    * `[POSS]`: possible (e.g., 'possible')\n        * sets `certainty` to 2\n    * `[HYPO]`: hypothetical (e.g., 'might')\n        * sets `hypothetical` to `True`\n    * `[OTHR]`: other subject - refers to someone other than the subject (e.g., 'mother')\n        * sets `other_subject` to `True`\n    * `[SUBJ]`: subject - when reference of OTHR is still referring to the subject (e.g., 'by patient mother')\n    * `[PREN]`: prenegation <- not sure if this is supposed to be used\n    * `[AFFM]`: affirmed (e.g., 'obvious', 'positive for')\n    * `[FUTP]`: future possibility (e.g., 'risk for')\n3. direction\n    * 0: directionality doesn't make sense (e.g., CONJ)\n    * 1: term applies negation, etc. **backward** in the sentence (e.g., 'not seen')\n    * 2: term applies negation, etc. **forward** in the sentence (e.g., 'dont see')\n    * 3: term applies negation, etc. **forward and/or backward** in the sentence (e.g., 'likely')\n\n#### Concept/Term Table ####\n\nThis is the table containing the terms you want to search for (i.e., the entities you want extracted). 
I have added a\nscript to autogenerate these based on some basic configuration files.\n\n| Column\t         | Type\t                                                                    | Description                                                                                                                                                       |       \n|-----------------|--------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| ID\t             | int\t                                                                     | identity column; unique integer for each row                                                                                                                      |\n| CUI\t string\t    | category identifier; can be used to \"group\" different     terms together |\n| Text\t           | string\t                                                                  | term                                                                                                                                                              |\n| RegexVariation\t | int\t                                                                     | amount of variation: 0=none; 3=very; 1=default; -1=don't even allow suffixes, exact matches only; see #Rules#parameters below; I suggest you just use \"0\" or \"-1\" |\n| WordOrder\t      | int\t                                                                     | how accurate must the given word order be; 2=exactly; 1=fword constraint; 0=no word order                                                                         |\n| MaxIntervening  | int                                                                      | how many intervening words to allow when locating words; 'how many intervening words do I allow?'                                      
                           |\n| MaxWords\t       | int                                                                      | how many words to look ahead to find the next word; is \u2018how far do I look ahead after each term?\u2019                                                                 | \n\nMaxIntervening and MaxWords should not be used together.\n\nTo autogenerate this table format, use the `pytakes-build-dictionary` script installed into the Python Scripts\ndirectory. For an example, run this with the `--create-sample` option (optionally specify the output with\nthe `--path C:\\somewhere` option. For additional specifications, see the \"Dictionary Builder\" section below.\n\n#### Document Table ####\n\nThis is the table containing the text you are in interesting in searching in.\n\nThe text itself must currently be labeled 'note_text'. The option to specify this is currently not implemented. Sorry.\n\nThe document table must also include a unique id for each note_text (just make an autoincrementing primary key). Specify\nthis and any other meta information you want to pass along under `--meta-labels` option (**ensure that the unique doc_id\nis specified first**).\n\n#### Example Config File ####\n\nI prefer to specify the configuration file as a Python file (`config.py`). Yaml and json are also accepted. See an\nexample below (copied from `example.config.py`). 
Please note that the `print(config`) at the end is required.\n\n    config = {\n        'corpus': {  # how to get the text data\n            'directories': [  # specify path to .txt files\n                r'PATH'\n            ],\n            'connections': [  # specify other connection types\n                {\n                    'name': 'TABLENAME',\n                    'name_col': 'TEXT ID COLUMN',\n                    'text_col': 'TEXT COLUMN',\n                    # specify either driver/server/database OR connection_string\n                    # connection string examples here: https://docs.sqlalchemy.org/en/13/core/engines.html\n                    'connection_string': 'SQLALCHEMY-LIKE CONNECTION_STRING',\n                    # db args: driver/server/database\n                    'driver': 'DRIVER',  # available listed in pytakes/iolib/sqlai.py, or use connection string\n                    'server': 'SERVER',\n                    'database': 'DATABASE',\n                }\n            ]\n        },\n        'keywords': [  # path to keyword files, usually stored as CSV\n            {\n                'path': r'PATH',\n                'regex_variation': 0  # set to -1 if you don't want any expansion\n            }\n        ],\n        'negation': {  # select either version or path (not both)\n            'path': r'PATH TO NEGATION CSV FILE',\n            'version': 1,  # int (version number), built-in/default\n            'skip': False,  # bool: if skip is True: don't do negation\n        },\n        'output': {\n            'path': r'PATH TO OUTPUT DIRECTORY',\n            'outfile': 'NAME.out.jsonl',  # name of output file (or given default name)\n            'hostname': ''\n        },\n    }\n    \n    print(config)\n\n## Dictionary Builder ###\n\nFor a simple example, run (you will first need to install this package, run `python setup.py install` in the base\ndirectory):\n\n    pytakes-build-dictionary --create-sample --path OUTPUT_PATH\n\n### COMMAND 
LINE ARGUMENTS ###\n\n| Short | Long        | Description                                                                                  |\n|-------|-------------|----------------------------------------------------------------------------------------------|\n| -p    | --path      | Specifies parent directory of folders; program will prompt if unable to locate the directory |\n| -o    | --output    | Specify output CSV file; if \".csv\" is not included, it will be added                         |\n| -t    | --table     | Specify output table in specified database (See below)                                       |\n| -v    | --verbosity | Specify amount of log output to show: 3-most verbose; 0-least verbose                        |\n|       | --driver    | If -t is specified, driver where table should be created. Defaults to SQL Server             |\n|       | --server    | If -t is specified, server where table should be created.                                    |\n|       | --database  | If -t is specified, database where table should be created.                                  |\n\n### OUTPUT COLUMNS ####\n\nNot all of these output columns are required (most don't do anything). 
This was originally designed for building a\ndictionary using cTAKES.\n\n| Column         | Type          | Description                                                                               |\n|----------------|---------------|-------------------------------------------------------------------------------------------|\n| ID             | int           | identity column; unique integer for each row                                              |\n| CUI            | varchar(8)    | category identifier; can be used to \"group\" different terms together                      |\n| Fword          | varchar(80)   | first word of term                                                                        |\n| Text           | varchar(8000) | term                                                                                      |\n| TextLength     | int           | length of term (all characters including spaces)                                          |\n| RegexVariation | int           | amount of variation: 0=none; 3=very; 1=default; see #Rules#parameters below               |\n| WordOrder      | int           | how accurate must the given word order be; 2=exactly; 1=fword constraint; 0=no word order |\n| Valence        | int           | this should just be \"1\"; program is not designed to work with this correctly              |\n\n### RULES ###\n\nRules are the text entries in the cTAKES-like dictionary; however, they can include \"categories\" in addition to just\ntext. A category is any string of text surrounded by \"[\" and \"]\". The intervening text string is the name of a\n\"category\". 
The category must have a definition, and each item (synonym) in the definition will be used in the rule.\n\nFor example, if a rule is `[smart_person] is smart` and the category `smart_person` is defined by the terms \"Albert\nEinstein\", \"Old McDonald\", and \"Brain\", then the resulting output will be\n\n    Albert Einstein is smart\n    Old McDonald is smart\n    Brain is smart\n\nThe rule file consists of a set of rules (as above), and each rule must be on its own line.\n\n    [smart_person] is smart\n    smart [smart_person]\n    [smart_person] not dumb\n    [not_so_smart] not smart \n\nThe rule file must be named \"rules\" or \"rules.some-extension\" (e.g., \"rules.txt\").\n\n#### Parameters ####\n\nRules may also carry configuration parameters. The parameters are given in the following order (bold indicates\nthe default value):\n\n| RegexVariation | Description                                               |\n|----------------|-----------------------------------------------------------|\n| 0              | **(Default)** no variation in regular expression coverage |\n| 1              | minimal variation in regular expression coverage          |\n| 2              | moderate variation in regular expression coverage         |\n| 3              | high flexibility in regular expression coverage           |\n\n| WordOrder | Description                           |\n|-----------|---------------------------------------|\n| 0         | free word order                       |\n| 1         | **(Default)** enforce first word rule |\n| 2         | require precise word order            |\n\n| Valence | Description                |\n|---------|----------------------------|\n| 1       | **(Default)** Always use 1 |\n\nThese are designated by the double percent ('%%') and follow the rule.\n\n    [category]%%REGEX_VARIATION,WORD_ORDER,VALENCE\n\nFor example:\n\n    [smart_person] is smart%%1,2   # minimal regex variation; requires precise word order \n    
[smart_person]%%,2        # first parameter is left blank, so its default is used; requires precise word order\n    [smart_person] not dumb           # defaults are used for both parameters\n    [not_so_smart] not smart%%2    # default second parameter\n\nDefinitions/categories (see below) can also be assigned parameters in exactly the same way. When the parameters\ncollide/disagree (e.g., the rule asks for free word order, but the definition asks that the first word rule be\nenforced), the more conservative will be selected.\n\n#### DEFINITIONS ####\n\nThe definition files (also called \"category\" files) provide a set of words to replace the name of a category in a particular\nrule. The definition file must either be within a \"cat\" directory, or must have the extension \".cat\". The program will\nchoose one or the other; which one is undefined. There are several ways to write the definition for a particular\ncategory. Examples:\n\nIn the definition file smart_person.cat, each row will be considered a definition for the smart_person category. Also a\nnot_so_smart.cat should be included.\nIn the definition file definitions.cat:\n\n    [smart_person]                       \n    Albert Einstein                       \n    Old McDonald                       \n    Brain                       \n    [not_so_smart]                       \n    Humpty Dumpty  \n\n#### Comments. ####\n\nAll lines beginning with \"#\" are ignored, and all characters occurring after a '#' are ignored as comments.\n\n    # last updated by me, yesterday evening                       \n    [smart_person]                       \n    Albert Einstein       # comment here                       \n    Old McDonald                       \n    Brain                       \n    [not_so_smart]                       \n    # shouldn't there be others?!?                       \n    Humpty Dumpty  \n\n#### CUIs. ####\n\nCUIs are usually assigned uniquely for each rule, rather than for a category. 
A CUI can be included for a given\ndefinition of a category by assigning it with the syntax `C1025459==Albert Einstein`.\nOr, in the entire definition file:\n\n    [smart_person]                       \n    C1025459==Albert Einstein                       \n    C4495545==Old McDonald                       \n    Brain                       \n    [not_so_smart]                       \n    Humpty Dumpty   \n\nIn the above example, Humpty Dumpty and Brain are both assigned a default CUI.\n\n#### Word Variant Notation. ####\n\nDefinitions may also be written on a single line, separated by the double pipe (i.e., '||'). If more than three or four\ndefinitions are listed on a single line, the definitions file becomes somewhat unreadable. Thus, it is best practice to\nonly include word variants on a single line.\n\n    [smart_person]                       \n    C1025459==Albert Einstein||Einstein                       \n    C4495545==Old McDonald||Ol' McDonald||Ole McDonald||Jeff                       \n    Brain                       \n    [not_so_smart]                       \n    Humpty Dumpty||Humpty-Dumpty # this is a common use for word-variant notation   \n\n#### Parameters ####\n\nFor definitions, see the parameters section under Rules.\n\nExample:\n\n    [smart_person]                       \n    C1025459==Albert Einstein%%1,2   # all rules involving this definition will have minimal regex variation; requires precise word order                       \n    C4495545==Old McDonald                       \n    Brain                       \n    [not_so_smart]                       \n    Humpty Dumpty              # will use the defaults\n\n**Conflict Resolution.**\n\nRegardless of how the conflict occurs, the more conservative of the rule and all relevant definitions will be chosen.\nNB: This process will never choose values that have been left as default (unless the default is specifically requested).\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Simple entity extraction module from a lexicon.",
    "version": "2.0.0",
    "project_urls": {
        "Home": "https://github.com/dcronkite/pytakes"
    },
    "split_keywords": [
        "nlp",
        "information extraction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "46314879810641362a171c163c60e78c0617895d958f57f9b84bfeeec4f00667",
                "md5": "6b19a557b49a32d537e06f5147b9ac59",
                "sha256": "5ecc98e8a68434350927b3e679a8b5b6b7d5d5257744645c3774412f7738b57c"
            },
            "downloads": -1,
            "filename": "pytakes-2.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6b19a557b49a32d537e06f5147b9ac59",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 1655836,
            "upload_time": "2023-07-24T22:55:00",
            "upload_time_iso_8601": "2023-07-24T22:55:00.459999Z",
            "url": "https://files.pythonhosted.org/packages/46/31/4879810641362a171c163c60e78c0617895d958f57f9b84bfeeec4f00667/pytakes-2.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5a86c5560d9a9c800f089d0150167a4bd85e83c3a56bd799aaa9e6865cc3f351",
                "md5": "0c6ec15bdd66e857c0a5f87b496f7344",
                "sha256": "1152fee84b29ed1151078ed0d1c076dde775c62a0c8df607328dc660a02c8365"
            },
            "downloads": -1,
            "filename": "pytakes-2.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "0c6ec15bdd66e857c0a5f87b496f7344",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 1726492,
            "upload_time": "2023-07-24T22:55:03",
            "upload_time_iso_8601": "2023-07-24T22:55:03.000645Z",
            "url": "https://files.pythonhosted.org/packages/5a/86/c5560d9a9c800f089d0150167a4bd85e83c3a56bd799aaa9e6865cc3f351/pytakes-2.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-24 22:55:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dcronkite",
    "github_project": "pytakes",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "pytakes"
}