- Name: clevertable
- Version: 3.1.0
- Summary: Low effort conversion of tabular data into numerical values.
- Upload time: 2023-06-14 06:17:52
- Author: Tom Mohr
- Requires Python: >=3.9
- License: MIT License, Copyright (c) 2023 Tom Mohr
- Keywords: parser, converter, numerical

# CleverTable
![Pytest](https://github.com/tom-mohr/clevertable/actions/workflows/pytest.yml/badge.svg)

Consistent, intelligent transformation of text-based tabular data into numerical data.<br>
Minimal configuration required.

Installation:

```bash
pip install clevertable
```

Example:

```python
from clevertable import *

profile = ConversionProfile({
    # optionally specify converters for specific columns:
    "Country": OneHot(),
    "Diagnosis": Binary(positive="cancer", negative="benign"),
    "Hospitalized": None,  # ignore column
}, pre_processing=None)

df = profile.fit_transform("datasets/survey.xlsx")  # transformed pandas.DataFrame
```

# Why this Library?

- CleverTable makes it really easy to convert text-based tabular data
  (optionally mixed with numbers) into numerical data, e.g. a medical survey
  into a Pandas DataFrame or a NumPy array.
- If something is obvious, you should not need to specify it.
  CleverTable will try to make choices for you if you don't make them.
- You stay in control: All choices made by CleverTable can be modified and overridden.

This is how CleverTable works: (see below for a full [tutorial](#tutorial))

1. You create a new `profile = ConversionProfile()`.
   Here, you can optionally specify certain converters.
2. You call `profile.fit(data)` on a sample data set, which creates a fixed conversion profile.
    - CleverTable chooses the best converter for each column if you don't specify it.
    - The converter (chosen by you or by CleverTable) adapts its internal state to fit the data.
3. You call `profile.transform(data)` on the actual data set (which may be the same as for `fit()`),
   which converts the data according to the fixed profile.

Here are some examples of what you can do with CleverTable:

- Chain multiple converters to achieve complex conversions:
  ```python
  profile["Column 7"] = [
      Split(),
      ForEach(Strip()),
      Flatten(),
      Infer()  # Infer() -> CleverTable will choose what to put here
  ]
  ```
- Use the `Infer()` converter where you want CleverTable to figure out the best solution (see above).
- Use concise shorthand notation based on Python syntax:
  ```python
  profile["Column 1"] = [  # Python lists create pipelines
    str.lower,             # functions /
    lambda s: s.strip(),   # lambda expressions are allowed
  ]
  profile["Column 2"] = {"Hello": 1, "Bye": 2}
  profile["Column 3"] = Float(), 1  # tries conversion to float, defaults to 1 on error
  ```
- Incremental configuration: If a column already has a correct converter, you can further process the column
  by adding another converter.
  This implicitly creates a pipeline.
  ```python
  profile["Column 5"] += OneHot()
  ```
- After `fit()`, you can access the inferred state of the converters.
  ```python
  my_weather_conv = profile["Weather"]            # e.g. OneHot()
  my_weather_categories = my_weather_conv.values  # e.g. ["sunny", "cloudy", "rainy"]
  ```
- Send multiple columns into one converter:
  ```python
  profile["Column 1", "Column 2"] = max
  ```
- Send nested columns into one converter:
  ```python
  profile[("Column A", "Column B"), "Column C"] = [Parallel(max, floor), min]  # min(max(A, B), floor(C))
  ```

# Tutorial

Suppose you want to convert the following table of survey results into a 2D NumPy array of numbers:

| Country | Age | Diagnosis | Hospitalized | Education level | Symptoms        |
|---------|-----|-----------|--------------|-----------------|-----------------|
| China   | 32  | benign    | no           | University      | cough, fever    |
| France  | 45  | cancer    | yes          | PhD             | fever           |
| Italy   | 19  | benign    | yes          | High School     | cough           |
| Germany | 56  | cancer    | yes          | High School     | fever and cough |
| Nigeria | 23  | benign    | no           | University      | cough           |
| India   | 34  | benign    | yes          | University      | cough, fever    |
| ...     | ... | ...       | ...          | ...             | ...             |

For example, you might want to convert the `Country` column into a column of integers,
with every integer representing a different country.<br>
However:

- You don't really care which number represents which country.
- But you want to make sure that the same country always gets the same number,
  even if you add more data to the table later.
- You also want to know which integer was chosen for which country.

That's what CleverTable is for:

- First, you call `fit()` on a sample data set, which creates a fixed conversion profile.
- Then, you call `transform()` on the actual data set, and it converts the data according
  to the fixed profile.

Moreover, CleverTable does many things automatically:

- It chooses the best converter if you don't specify it.
- And then, the converter also adapts its internal state to fit the data.

Let's see how that works:

```python
from clevertable import *

table = "datasets/survey.xlsx"  # filename or pandas.DataFrame

profile = ConversionProfile()
profile.fit(table)  # chooses best converters and creates a fixed conversion profile
```

`print(profile)` will show the inferred conversion profile:

```python
{
    "Country": Enumerate('china', 'france', 'germany', ...),  # lots of countries
    "Age": Float(),
    "Diagnosis": Binary(),
    "Hospitalized": Binary(),
    "Education level": OneHot('high school', 'phd', 'university'),
    "Symptoms": ListAndOr(),
}
```

We can access the individual converters and their properties by indexing the profile with the column name:

```python
country_converter = profile["Country"]  # Enumerate('china', 'france', 'germany', ...)

# see which integer corresponds to which country:
countries_list = country_converter.values  # ('china', 'france', 'germany', ...)
```

You can now use this profile to convert data:

```python
# transform the whole table:
df = profile.transform(table)  # pandas.DataFrame
arr = df.to_numpy()  # 2D numpy array

# transform a single data point:
data_point = {"Country": "Germany"}
transformed = profile.transform_single(data_point)  # {'Country': 2}
```

The nice thing is that, after conversion, you can use the fixed profile
to find out where the numerical values originated:

```python
# find out which country corresponds to the number 2:
country_id = 2
country = profile["Country"].values[country_id]  # 'germany'
```

You may have noticed that all the strings appear in lowercase.
That is because the `ConversionProfile` pre-processes all strings to lowercase by default.
You can disable this behavior by passing `pre_processing=None` to the constructor
or setting this property after construction:

```python
profile.pre_processing = None  # disable pre-processing
profile.pre_processing = str.lower  # default behavior
profile.pre_processing = lambda s: s.strip().lower()
```

It's okay to provide a pre-processing function that doesn't work for some entries
(e.g. `str.lower` will fail for non-string entries),
because CleverTable will catch errors and ignore them during pre-processing.
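
The error-tolerant behaviour described above can be pictured with a small plain-Python sketch. This is not CleverTable's implementation; the helper name `safe_pre_process` is hypothetical and only illustrates the idea of keeping an entry unchanged when the pre-processing function fails on it.

```python
def safe_pre_process(func, value):
    """Apply func to value; keep the value unchanged if func raises."""
    if func is None:  # corresponds to pre_processing=None
        return value
    try:
        return func(value)
    except Exception:
        # e.g. str.lower raises TypeError for a non-string entry
        return value


print(safe_pre_process(str.lower, "Germany"))  # germany
print(safe_pre_process(str.lower, 42))         # 42
```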

You may also have noticed that the `Education level` column was converted with `OneHot()`,
whereas the `Country` column was converted with `Enumerate()`, even though both contain arbitrary words.
That's because CleverTable detected that the `Country` column has too many different values
for a `OneHot()` converter, so it chose the `Enumerate()` converter instead.

But you can always override this behavior by explicitly setting the conversion method
before calling `fit()`:

```python
from clevertable import *

table = "datasets/survey.xlsx"

profile = ConversionProfile()

# explicitly specify some converters:
profile["Country"] = OneHot()
profile["Diagnosis"] = Binary(positive="cancer", negative="benign")

profile.fit(table)
```

In this example, we also made sure that the "Diagnosis" column
uses the correct positive and negative values.

You can also achieve the same by passing a dictionary to the constructor:

```python
from clevertable import *

table = "datasets/survey.xlsx"

profile = ConversionProfile({
    "Country": OneHot(),
    "Diagnosis": Binary(positive="cancer", negative="benign"),
}).fit(table)  # fit() returns self
```

Two final notes:

- You can ignore columns by setting their converter to `None` (which is shorthand for the `Ignore()` converter).
- You can use `fit_transform()` to perform `fit()` and `transform()` with the same data in one call.

This leaves us with this very concise code:

```python
from clevertable import *

df = ConversionProfile({
    "Country": OneHot(),
    "Diagnosis": Binary(positive="cancer", negative="benign"),
    "Hospitalized": None,
}, pre_processing=None).fit_transform("datasets/survey.xlsx")
```

This produces the following transformed table:

| Country=China | Country=France | ... | Country=Zimbabwe | Age | Diagnosis | Education level=High School | Education level=PhD | Education level=University | Symptoms=cough | Symptoms=fever |
|---------------|----------------|-----|------------------|-----|-----------|-----------------------------|---------------------|----------------------------|----------------|----------------|
| 1             | 0              | ... | 0                | 32  | 0         | 0                           | 0                   | 1                          | 1              | 1              |
| 0             | 1              | ... | 0                | 45  | 1         | 0                           | 1                   | 0                          | 0              | 1              |
| 0             | 0              | ... | 0                | 19  | 0         | 1                           | 0                   | 0                          | 1              | 0              |
| 0             | 0              | ... | 0                | 56  | 1         | 1                           | 0                   | 0                          | 1              | 1              |
| 0             | 0              | ... | 0                | 23  | 0         | 0                           | 0                   | 1                          | 1              | 0              |
| 0             | 0              | ... | 0                | 34  | 0         | 0                           | 0                   | 1                          | 1              | 1              |

# CLI

`pip install clevertable` also makes the command `clevertable` available
in the command line.
It can convert files with tabular data.
Execute `clevertable --help` to see what arguments can be passed to the tool:

```text
usage: clevertable [-h] [-i IGNORE [IGNORE ...]] src out

Consistent and intelligent conversion of tabular data into numerical values.

positional arguments:
  src                   Path to input file.
  out                   Path to output file.

optional arguments:
  -h, --help            show this help message and exit
  -i IGNORE [IGNORE ...], --ignore IGNORE [IGNORE ...]
                        Column names to ignore.
```

# How to Contribute

Basic workflow of contribution:

- Fork the repository
- Create a new branch
- Make your changes
- Create a pull request
- Wait for the pull request to be accepted or rejected
- If accepted, you can delete your branch
- If rejected, make the requested changes and push them to your branch
- Repeat until pull request is accepted

What to contribute:

- New converters (classes that inherit from `Converter`)
- Improvements to converter inference (logic in `Infer()` converter)
- Improvements to default preprocessing
- Make more features available through the CLI
- New tests
- New documentation, tutorials, examples
- New ideas, suggestions, bug reports → create an issue or contact me directly

# Documentation

There are only two classes that you need to know:

- `ConversionProfile`: A collection of converters.
- `Converter`: Transforms columns of data into columns of data.

## Converters

Here's a quick overview of all converters:

| Converters                            | Description                                                                         | Shorthand | Example Usage                                                   |
|---------------------------------------|-------------------------------------------------------------------------------------|-----------|-----------------------------------------------------------------|
| Basic:                                |                                                                                     |           |                                                                 |
| [`Float()`](#float)                   | Convert numbers into floats.                                                        |           |                                                                 |
| [`Enumerate()`](#enumerate)           |                                                                                     |           |                                                                 |
| [`OneHot()`](#onehot)                 |                                                                                     |           |                                                                 |
| [`Binary()`](#binary)                 | Convert to 0 and 1. Detects common "positive" and "negative" terms in strings.      |           |                                                                 |
| [`List()`](#list)                     |                                                                                     |           |                                                                 |
| [`ListAndOr()`](#listandor)           |                                                                                     |           |                                                                 |
| [`Map()`](#map)                       |                                                                                     | dict      | {<br>&nbsp;&nbsp;"foo": 1,<br>&nbsp;&nbsp;"bar": -2,<br>}       |
| [`Const()`](#const)                   | Return a constant value.                                                            | *any*     | 42<br>"foo"                                                     |
| Text Processing:                      |                                                                                     |           |                                                                 |
| [`Strip()`](#strip)                   |                                                                                     |           |                                                                 |
| [`Split()`](#split)                   |                                                                                     |           |                                                                 |
| Combining Converters:                 |                                                                                     |           |                                                                 |
| [`Pipeline()`](#pipeline)             | Apply multiple converters in sequence.                                              | list      | [<br>&nbsp;&nbsp;Split(),<br>&nbsp;&nbsp;ForEach(Strip()),<br>] |
| [`Try()`](#try)                       | Try multiple converters and return the first one that succeeds.                     | tuple     | (Float(), Binary())                                             |
| [`ForEach()`](#foreach)               | Apply the same converter to all items.                                              |           |                                                                 |
| [`Parallel()`](#parallel)             | Apply different converters to the respective items.                                 |           |                                                                 |
| Special:                              |                                                                                     |           |                                                                 |
| [`Id()`](#id)                         |                                                                                     |           |                                                                 |
| [`Ignore()`](#ignore)                 | Drop the column.                                                                    | None      | None                                                            |
| [`Infer()`](#infer)                   |                                                                                     |           |                                                                 |
| [`Label()`](#label)                   |                                                                                     |           |                                                                 |
| Dimensionality:                       |                                                                                     |           |                                                                 |
| [`Flatten()`](#flatten)               | Flatten a tuple of tuples into a single tuple. This is often needed after `ForEach()` or `Parallel()`. |           |                                                                 |
| [`Transpose()`](#transpose)           |                                                                                     |           |                                                                 |
| Arbitrary Functions:                  |                                                                                     |           |                                                                 |
| [`Function()`](#function)             | Apply a user-defined function to the data.                                          | callable  | lambda x: x**2                                                  |
| [`StrictFunction()`](#strictfunction) | Apply a user-defined function to the data. Less flexible than `Function()`.         |           |                                                                 |

---

### Float

Converts a column of numbers into a column of floats.
If invalid values are encountered (`NaN`, `inf`, `None`, etc.),
a warning is printed and the value is replaced with `np.nan`.
This can be circumvented by passing a value to the `default` argument:

```python
"Temperature": Float(default=37.0)
```

You can also specify `"mean"`, `"median"`, or `"mode"` as the default value.
This will choose the default value based on the data in the specified column:

```python
"Temperature": Float(default="mean")
```

| Temperature | ⇒ | Temperature |
|-------------|---|-------------|
| 37.5        |   | 37.5        |
| 40.0        |   | 40.0        |
| 38.5        |   | 38.5        |
|             |   | 38.75       |
| 39.0        |   | 39.0        |

Results in:

```python
"Temperature": Float(default=38.75)
```
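
To make the `"mean"` default concrete, here is a minimal plain-Python sketch that reproduces the numbers in the table above. It does not use CleverTable; the helpers `to_float_or_none` and `fill_with_mean` are hypothetical names, and treating non-finite values as invalid is an assumption based on the description of `Float()`.

```python
import math
from statistics import mean


def to_float_or_none(value):
    """Return value as a finite float, or None if it is missing or invalid."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def fill_with_mean(column):
    """Replace invalid entries with the mean of the valid entries."""
    parsed = [to_float_or_none(v) for v in column]
    default = mean(x for x in parsed if x is not None)
    return [default if x is None else x for x in parsed], default


values, default = fill_with_mean([37.5, 40.0, 38.5, None, 39.0])
print(default)  # 38.75
print(values)   # [37.5, 40.0, 38.5, 38.75, 39.0]
```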

---

### Enumerate

This is the extension of the [`Binary()`](#binary) conversion method
to columns with more than two possible values.
The values are converted into integers starting at 0,
resulting in a single column of integers.

The possible values can be passed to the constructor:

```python
"Country": Enumerate("france", "germany", "italy")
```

| Country | ⇒ | Country |
|---------|---|---------|
| france  |   | 0       |
| italy   |   | 2       |
| germany |   | 1       |

Their index in the argument list is used as the numerical value.
If no values are specified, the values found in the provided
data are sorted in lexically ascending order.
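
The idea behind this converter can be sketched in a few lines of plain Python. The class `SimpleEnumerate` below is a hypothetical stand-in, not the library's `Enumerate` class; it only shows how fitting once and transforming later keeps the value-to-integer mapping stable.

```python
class SimpleEnumerate:
    """Toy enumerator: maps each known value to its index."""

    def __init__(self, *values):
        self.values = tuple(values)

    def fit(self, column):
        # If no values were given, use the sorted distinct values of the data.
        if not self.values:
            self.values = tuple(sorted(set(column)))
        return self

    def transform(self, column):
        index = {value: i for i, value in enumerate(self.values)}
        return [index[value] for value in column]


conv = SimpleEnumerate().fit(["france", "italy", "germany", "france"])
print(conv.values)                                     # ('france', 'germany', 'italy')
print(conv.transform(["france", "italy", "germany"]))  # [0, 2, 1]
print(conv.values[2])                                  # italy
```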

---

### OneHot

Use this converter if each entry contains exactly one of multiple possible values.
Each possible value becomes its own binary column.
The possible values can be passed to the constructor (the `values` argument):

```python
"Education Level": OneHot("primary", "secondary", "tertiary")
```

| Education Level | ⇒ | Education Level=primary | Education Level=secondary | Education Level=tertiary |
|-----------------|---|-------------------------|---------------------------|--------------------------|
| primary         |   | 1                       | 0                         | 0                        |
| secondary       |   | 0                       | 1                         | 0                        |
| tertiary        |   | 0                       | 0                         | 1                        |

If no values are specified, the possible values are inferred from the data.
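
As an illustration of the encoding and of the `column=value` labels shown in the table, here is a small plain-Python sketch. The function `one_hot` is hypothetical, and sorting the inferred values is an assumption made for reproducibility, not a statement about the library.

```python
def one_hot(name, column, values=None):
    """Return (labels, rows) for a one-hot encoding of column."""
    if values is None:
        values = sorted(set(column))  # assumed: infer values from the data
    labels = [f"{name}={value}" for value in values]
    rows = [[int(entry == value) for value in values] for entry in column]
    return labels, rows


labels, rows = one_hot("Education Level", ["primary", "secondary", "tertiary"])
print(labels)
print(rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```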

---

### Binary

Similar to [`Enumerate()`](#enumerate), but with just two possible values,
and with some extra intelligence for this purpose.
For example, it can detect words commonly used for positive and negative values:

- Positive: `yes`, `true`, `positive`, `1`, `female`
- Negative: `no`, `false`, `negative`, `0`, `male`, `none`

Example:

```python
"Hospitalized": Binary()
```

| Hospitalized | ⇒ | Hospitalized |
|--------------|---|--------------|
| no           |   | 0            |
| yes          |   | 1            |
| false        |   | 0            |
| true         |   | 1            |
| none         |   | 0            |

Results in:

```python
"Hospitalized": Binary(positive={"yes", "true"},
                       negative={"no", "false", "none"})
```

You can explicitly specify the values of the `positive` class and the `negative` class via the constructor:

```python
"Hospitalized": Binary(positive="yes", negative="no")
```

| Hospitalized | ⇒ | Hospitalized |
|--------------|---|--------------|
| yes          |   | 1            |
| no           |   | 0            |
| no           |   | 0            |
| yes          |   | 1            |

If only one argument is specified (either `positive` or `negative`),
all other values present in the data are treated as instances of the other class:

```python
"Time served": Binary(negative="none")
```

| Time served | ⇒ | Time served |
|-------------|---|-------------|
| none        |   | 0           |
| 1 year      |   | 1           |
| 4 years     |   | 1           |
| none        |   | 0           |

It's also possible to specify more than one value for the ``positive`` and ``negative`` classes.
Example:

```python
"Hospitalized": Binary(positive={"yes", "true"}, negative={"no", "false"})
```

| Hospitalized | ⇒ | Hospitalized |
|--------------|---|--------------|
| yes          |   | 1            |
| no           |   | 0            |
| false        |   | 0            |
| true         |   | 1            |

If no positive or negative values are specified, a set of strings commonly used
to indicate positive / negative values is tested against the available data.
For instance, in the example above, the specified arguments would have been
inferred automatically as positive and negative.

If this approach is not successful, the lexically smallest value is chosen as the `negative` argument and
the `positive` argument is left empty, causing all other values to be treated as positive:

```python
"Fruits": Binary()
```

| Fruits | ⇒ | Fruits |
|--------|---|--------|
| banana |   | 1      |
| apple  |   | 0      |
| kiwi   |   | 1      |
| apple  |   | 0      |

Results in:

```python
"Fruits": Binary(negative="apple")
```
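
The inference rules for `Binary()` can be summarised in a rough plain-Python sketch. It is not the library's code: the function names are hypothetical, and the notion of "successful" detection (every value in the column is a known term) is an assumption.

```python
POSITIVE_TERMS = {"yes", "true", "positive", "1", "female"}
NEGATIVE_TERMS = {"no", "false", "negative", "0", "male", "none"}


def infer_binary(column):
    """Return (positive, negative); an empty set means "everything else"."""
    present = set(column)
    if present <= POSITIVE_TERMS | NEGATIVE_TERMS:  # assumed success criterion
        return present & POSITIVE_TERMS, present & NEGATIVE_TERMS
    # Fallback: the lexically smallest value becomes the negative class.
    return set(), {min(present)}


def to_binary(column, positive, negative):
    """Encode column as 0/1 using the given classes."""
    if positive:
        return [int(value in positive) for value in column]
    return [int(value not in negative) for value in column]


fruits = ["banana", "apple", "kiwi", "apple"]
positive, negative = infer_binary(fruits)
print(negative)                                # {'apple'}
print(to_binary(fruits, positive, negative))   # [1, 0, 1, 0]
```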

---

### List

Converts lists of values into multiple binary columns.

```python
"Symptoms": List()
```

| Symptoms               | ⇒ | Symptoms=cough | Symptoms=fever | Symptoms=headache |
|------------------------|---|----------------|----------------|-------------------|
| fever, cough, headache |   | 1              | 1              | 1                 |
| headache, cough        |   | 1              | 0              | 1                 |

The default delimiter is a comma.<br>
You can specify a custom delimiter via the `delimiter` argument:

```python
"Symptoms": List(delimiter=";")
"Symptoms": List(delimiter=[",", ";"])  # also accepts lists
```

The passed strings are interpreted as regular expressions.
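
The following plain-Python sketch shows one way such a list column could be split with regular-expression delimiters and turned into binary columns. The function `multi_hot` is hypothetical, and stripping whitespace around items is an assumption.

```python
import re


def multi_hot(column, delimiter=","):
    """Split each entry at the delimiter(s) and return (values, rows)."""
    delimiters = [delimiter] if isinstance(delimiter, str) else list(delimiter)
    # Delimiters are treated as regular expressions and combined with "|".
    pattern = re.compile("|".join(f"(?:{d})" for d in delimiters))
    split_rows = [
        {item.strip() for item in pattern.split(entry) if item.strip()}
        for entry in column
    ]
    values = sorted(set().union(*split_rows))
    rows = [[int(value in items) for value in values] for items in split_rows]
    return values, rows


values, rows = multi_hot(["fever, cough, headache", "headache, cough"])
print(values)  # ['cough', 'fever', 'headache']
print(rows)    # [[1, 1, 1], [1, 0, 1]]
```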

### ListAndOr

```python
"Symptoms": ListAndOr()
```

| Symptoms                  | ⇒ | Symptoms=cough | Symptoms=fever | Symptoms=headache |
|---------------------------|---|----------------|----------------|-------------------|
| fever, cough and headache |   | 1              | 1              | 1                 |
| headache or cough         |   | 1              | 0              | 1                 |

The default delimiters are comma, "and" and "or".<br>
The passed strings are interpreted as regular expressions.
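
A possible delimiter pattern for this behaviour is sketched below in plain Python. The regular expression is an assumption chosen for illustration (a comma, or the whole words "and" / "or"); the library's actual default pattern may differ.

```python
import re

# Assumed pattern: split at a comma or at the standalone words "and" / "or".
AND_OR_PATTERN = re.compile(r"\s*(?:,|\b(?:and|or)\b)\s*")


def split_and_or(entry):
    """Split an entry into its items, dropping empty fragments."""
    parts = (part.strip() for part in AND_OR_PATTERN.split(entry))
    return [part for part in parts if part]


print(split_and_or("fever, cough and headache"))  # ['fever', 'cough', 'headache']
print(split_and_or("headache or cough"))          # ['headache', 'cough']
```

The word boundaries (`\b`) keep words such as "sandy" or "orange" intact, because "and" and "or" are only matched as whole words.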

### Map

### Strip

### Split

### Pipeline

### Try

```python
Try(converter1, converter2, ...)
```

Returns the value of the first converter that does not raise an exception,
or the original value if all converters raise an exception.
`Try()` applies the converters one after another, but only ever returns the output of a single converter.

```python
"Product": Try(Float(), Infer())  # will infer the converter for the samples that cannot be converted to floats
```

| Product | ⇒ | Product |
|---------|---|---------|
| Kiwi    |   | 48      |
| Apple   |   | 0       |
| 712356  |   | 712356  |
| 261382  |   | 261382  |
| Banana  |   | 1       |
| Kiwi    |   | 48      |
| ...     |   | ...     |

This would result in the following profile after `fit()`:

```python
"Product": Try(Float(), Enumerate("Apple", "Banana", ...))
```
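
The fallback semantics can be mimicked with ordinary callables. The sketch below is plain Python and not the library's `Try` class; `try_converters` and the `known` lookup table are hypothetical.

```python
def try_converters(value, *converters):
    """Return the output of the first converter that does not raise."""
    for convert in converters:
        try:
            return convert(value)
        except Exception:
            continue  # this converter failed, try the next one
    return value  # all converters failed: keep the original value


known = {"apple": 0, "banana": 1}  # stands in for a fitted enumeration
print(try_converters("712356", float, known.__getitem__))  # 712356.0
print(try_converters("banana", float, known.__getitem__))  # 1
print(try_converters("kiwi", float, known.__getitem__))    # kiwi
```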

### ForEach

Apply the same converter to all items.

### Parallel

```python
Parallel(converter1, converter2, ...)
```

Apply different converters to the respective items.
Usually used in a Pipeline after other converters that create outputs with multiple items (e.g. `Split()`).
Also, you usually want to use `Flatten()` after this, as each individual converter will return a tuple of items,
even if it only contains one item.
Example:

```python
"Latitude;Longitude": [
    Split(";"),  # must always result in two items, because Parallel() has 2 converters
    Parallel(Ignore(), Float()),  # ignore latitude, convert longitude to float -> [[], [longitude]]
    Flatten(),  # -> [longitude]
]
```

| Latitude;Longitude  | ⇒ | Longitude |
|---------------------|---|-----------|
| 52.520008;13.404954 |   | 13.404954 |
| 48.137154;11.576124 |   | 11.576124 |
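
The data flow of this pipeline can be traced with a plain-Python sketch. The helpers `parallel`, `flatten`, `ignore`, and `to_float` are hypothetical stand-ins for the converters; they only illustrate why each converter returns a tuple and why flattening is needed afterwards.

```python
def parallel(items, *converters):
    """Apply the i-th converter to the i-th item."""
    if len(items) != len(converters):
        raise ValueError("number of items must match number of converters")
    return tuple(convert(item) for convert, item in zip(converters, items))


def flatten(nested):
    """Flatten a tuple of tuples into a single tuple."""
    return tuple(x for inner in nested for x in inner)


def ignore(item):
    return ()  # drops the item


def to_float(item):
    return (float(item),)  # 1-element tuple


items = "52.520008;13.404954".split(";")
nested = parallel(items, ignore, to_float)
print(nested)           # ((), (13.404954,))
print(flatten(nested))  # (13.404954,)
```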

### Const

### Id

Identity.
Keeps the input unchanged.

### Ignore

Drops the column.

```python
"registration_timestamp": None
```

This converter is also chosen automatically if no appropriate conversion method could be found.

### Infer

Tries to infer the best conversion method for the column from the data seen during `fit()`.
After `fit()`, this converter will be replaced with the inferred converter in the profile.

This is the default converter for columns where no converter is specified.
This converter can however also be used anywhere else explicitly.
Examples:

```python
"col1": [
    str.upper,
    Infer()
],
"col2": Try(Float(), Infer()),  # will infer the converter for the samples that cannot be converted to floats
```

### Label

### Flatten

### Transpose

Transposes nested tuples, provided that all nested tuples are of equal length.

For example, look at this elegant implementation of the [`List()`](#list) converter:

```python
"Symptoms": [
    Split(r"\s*,\s*"),  # split at comma
    ForEach(OneHot()),
    Transpose(),
    ForEach(max),
    Flatten()
]
```

`Transpose()` allows us to apply `max` to each column of the one-hot encodings
across all tuple elements.
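
The transposition step can be illustrated in plain Python with `zip`. The sketch assumes three possible symptoms in the order cough, fever, headache; the function `transpose` is a hypothetical stand-in for the converter.

```python
def transpose(nested):
    """Transpose a tuple of equally long tuples."""
    return tuple(zip(*nested))


# One-hot encodings of the two items in "fever, cough",
# assuming the value order (cough, fever, headache):
one_hot_per_item = ((0, 1, 0), (1, 0, 0))

columns = transpose(one_hot_per_item)
print(columns)    # ((0, 1), (1, 0), (0, 0))

multi_hot = tuple(max(column) for column in columns)
print(multi_hot)  # (1, 1, 0)
```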

### Function

```python
Function(transform, labels=None)
```

Shorthand: Instead of `Function(transform, None)`, just write `transform`, where `transform` is some callable.

Creates a custom converter from a custom ``transform()`` function
(and optionally, a custom ``labels()`` function).
This is a handy way to create a converter that doesn't need ``fit()``.

Unlike ``StrictFunction()``, this class can handle functions that don't accept or return tuples,
which often allows for more concise code.

This is achieved during ``fit()`` as follows:

1. If all incoming items are 1-element tuples, it sets a flag ``UNPACK_OUTPUT`` to always
   unpack the element before passing it to the wrapped function during ``transform()``.
2. If during ``fit()`` the wrapped function doesn't return tuples,
   it tries to turn that output into a tuple:
    - If the output is always a non-string iterable, it will simply set a
      flag ``CONVERT_ITERABLE`` to always convert the iterable output into a tuple during ``transform()``.
    - Otherwise, it sets a flag ``WRAP_OUTPUT`` to always wrap the output in a
      1-element tuple during ``transform()``.
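
The unpacking and wrapping rules can be approximated by the plain-Python sketch below. It decides per value rather than via flags fixed during ``fit()``, so it is a simplification, and the helper names `unpack_single` and `as_tuple` are hypothetical.

```python
def unpack_single(item):
    """Unpack a 1-element tuple; leave everything else unchanged."""
    if isinstance(item, tuple) and len(item) == 1:
        return item[0]
    return item


def as_tuple(output):
    """Turn a function's output into a tuple."""
    if isinstance(output, tuple):
        return output
    if isinstance(output, str):
        return (output,)      # strings are wrapped, not iterated
    try:
        return tuple(output)  # non-string iterable: convert
    except TypeError:
        return (output,)      # scalar: wrap in a 1-element tuple


print(as_tuple(str.upper(unpack_single(("alice",)))))  # ('ALICE',)
print(as_tuple([1, 2]))                                # (1, 2)
print(as_tuple(5))                                     # (5,)
```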

A similar logic is applied to the labels.
If a custom labels function is given, the following procedure is followed during ``labels()``:

1. If the incoming labels are a 1-element tuple, the single label is unpacked before
   it is passed to the custom labels function.
2. If the custom labels function returns something other than a tuple,
   this class tries to convert it into a tuple:
    - If the output is a non-string iterable, it is converted into a tuple.
    - Otherwise, the output is wrapped in a 1-element tuple.

If no custom labels function is given, the output labels are generated based on the
output cardinality inferred during ``fit()`` and according to the following logic:

- If the number of incoming labels is identical to the output cardinality,
  the labels will be returned unchanged.
- If there are multiple incoming labels but a single output label,
  the output label is formed by joining the incoming labels with ``, ``.
- If there is a single incoming label but multiple output labels,
  the output labels are formed by adding suffixes ``_0``, ``_1``, etc.
  to the single input label.

|                    | **1 Output Label** | **M Output Labels**          |
|--------------------|--------------------|------------------------------|
| **1 Input Label**  | identical          | suffixes ``_0``, ``_1``, ... |
| **N Input Labels** | join with ``, ``   | identical if M=N, else ERROR |

Functions that return output of varying cardinality during ``fit()`` are a special case.
In this case, a single label is returned.
If only one input label is given, that single label is returned.
If multiple input labels are given, they are joined with ``_``.
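
The default labelling rules from the table above can be written down as a short plain-Python function. This is a sketch of the documented rules only (it ignores the varying-cardinality special case), and `default_labels` is a hypothetical name.

```python
def default_labels(input_labels, output_cardinality):
    """Derive output labels from input labels and the number of outputs."""
    n, m = len(input_labels), output_cardinality
    if n == m:
        return tuple(input_labels)                 # identical
    if m == 1:
        return (", ".join(input_labels),)          # N inputs, 1 output
    if n == 1:
        label = input_labels[0]
        return tuple(f"{label}_{i}" for i in range(m))  # 1 input, M outputs
    raise ValueError(f"cannot map {n} input labels to {m} output labels")


print(default_labels(("Name",), 2))    # ('Name_0', 'Name_1')
print(default_labels(("A", "B"), 1))   # ('A, B',)
print(default_labels(("A", "B"), 2))   # ('A', 'B')
```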

The following example turns a text column into two columns containing the ASCII codes of the first and last letters.

```python
"Name": lambda x: (ord(x[0]), ord(x[-1]))
```

| Name  | ⇒ | Name_0 | Name_1 |
|-------|---|--------|--------|
| Alice |   | 97     | 101    |
| Bob   |   | 98     | 98     |

(Remember that by default, all text entries are converted to
lowercase before further processing.)

As you can see, the number of columns is inferred directly from the return value of the conversion function.
If the function returns a tuple, the resulting column names are indexed.
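
To check the numbers in the table by hand, the same lambda can be applied in plain Python without the library. The lowercasing and the `Name_0`, `Name_1` labels are reproduced manually here, as a sketch of what the text describes:

```python
first_and_last = lambda x: (ord(x[0]), ord(x[-1]))

names = ["Alice", "Bob"]

# Mimic the default pre-processing (lowercasing) before the function is applied.
rows = [first_and_last(name.lower()) for name in names]

# The function returns 2 items for 1 input label, so the labels get index suffixes.
labels = tuple(f"Name_{i}" for i in range(len(rows[0])))

print(labels)
for row in rows:
    print(row)
```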

You can also set the labels explicitly with a lambda function
that takes the input column name as an argument and returns output column names:

```python
"Name": Function(lambda x: (ord(x[0]), ord(x[-1])),
                 labels=lambda s: (f"ord(first letter of {s})", f"ord(last letter of {s})")),
```

| Name  | ⇒ | ord(first letter of Name) | ord(last letter of Name) |
|-------|---|---------------------------|--------------------------|
| Alice |   | 97                        | 101                      |
| Bob   |   | 98                        | 98                       |

However, remember that you can always simply use [`Label()`](#label) to rename the columns after the conversion,
if you don't need the output column names to depend on the input column names.

```python
"Name": [lambda x: (ord(x[0]), ord(x[-1])),
         Labels("ord(first letter)", "ord(last letter)")],
```

### StrictFunction

```python
StrictFunction(transform, labels=None)
```

Works mostly like [`Function()`](#function), but simpler:
``transform`` and ``labels`` must both accept and return tuples.
Instead of something like this:

```python
"Name": str.lower,
```

you have to write this:

```python
"Name": StrictFunction(lambda x: (str.lower(x[0]),))  # notice the comma, which makes it a 1-element tuple
```

That is, the function still receives 1-element tuples as tuples (they are not unpacked),
even if all input elements during `fit()` are 1-element tuples.
Also, you must now explicitly return a tuple,
even if it is just a 1-element tuple, as otherwise an error will be raised.
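
The strict contract can be pictured with a small stand-in written in plain Python. This is a sketch only: `call_strict` is a hypothetical helper, and `TypeError` is an assumption, since the text above does not say which exception the library raises.

```python
def call_strict(transform, items):
    """Hypothetical stand-in for the strict contract: tuple in, tuple out."""
    out = transform(items)  # items is passed as a tuple, never unpacked
    if not isinstance(out, tuple):
        raise TypeError(f"expected a tuple, got {type(out).__name__}")
    return out


# Returns a 1-element tuple, so the contract is satisfied.
ok = call_strict(lambda x: (str.lower(x[0]),), ("Alice",))
print(ok)

# Returns a bare string, so the contract is violated.
try:
    call_strict(lambda x: str.lower(x[0]), ("Alice",))
except TypeError as err:
    print("rejected:", err)
```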

See [`Function()`](#function) for a convenient extension of this converter.

---

## Understanding Multi-Column Converters

A converter returns two things:

- `transform()`: the items of the transformed data
- `labels()`: a label for each item

Both return values are tuples.

For top-level converters, this then creates the corresponding number of columns.
This includes the case of

- a 1-element tuple `(item,)`, which is the case for most converters.
- an empty tuple `()`, in which case the result is ignored.
  (In fact, this is exactly how `Ignore()` is implemented.)

This means, however, that for top-level converters, `labels()` and `transform()`
must return the same number of items.
That is because `labels()` is used to create the output column names.
If `transform()` returns a different number of items, that will raise an error for top-level converters.
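
The constraint can be illustrated with a plain-Python sketch of how labels and transformed rows could be assembled into columns. The helper `build_columns` is hypothetical and `ValueError` is an assumed error type; the library's internals may differ.

```python
def build_columns(labels, rows):
    """Hypothetical sketch: one output column per label, one row per input."""
    for row in rows:
        if len(row) != len(labels):
            raise ValueError(
                f"got {len(row)} items for {len(labels)} labels: {row!r}"
            )
    return {label: [row[i] for row in rows] for i, label in enumerate(labels)}


two_columns = build_columns(("Name_0", "Name_1"), [(97, 101), (98, 98)])
no_columns = build_columns((), [(), ()])  # empty tuples: the column is dropped

print(two_columns)
print(no_columns)
```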

However, for nested converters, `labels()` and `transform()` can return different numbers of items.
For example `Split.labels()` always returns only one item,
because the number of items returned by `Split.transform()` varies from input to input.
Therefore, `Split()` can't be used as a top-level converter
and has to be used inside a `Pipeline` or a similar construct,
so that other converters can ensure that the final output is of constant size.
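
The idea of a variable-length step followed by a step that restores a constant size can be sketched without the library. Both helper names and the fixed `vocabulary` below are assumptions made for illustration:

```python
import re


def split_items(text):
    """Variable-length step: the number of items depends on the input."""
    return tuple(re.split(r"\s*,\s*", text))


def to_fixed_size(items, vocabulary):
    """Constant-size step: one 0/1 flag per known value."""
    return tuple(int(value in items) for value in vocabulary)


vocabulary = ("cough", "fever", "headache")
inputs = ["fever, cough, headache", "headache, cough"]

split_rows = [split_items(text) for text in inputs]
fixed_rows = [to_fixed_size(items, vocabulary) for items in split_rows]

print([len(items) for items in split_rows])  # lengths differ per row
print(fixed_rows)                            # every row has len(vocabulary) items
```

Only the second step produces rows that can be mapped onto a fixed set of column names, which is why the splitting step cannot stand alone at the top level.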

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "clevertable",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "parser,converter,numerical",
    "author": "Tom Mohr",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/fd/32/f090e03a76789a479ff1b68cf478eea65ecc1a4d09862851012fc163964d/clevertable-3.1.0.tar.gz",
    "platform": null,
    "description": "# CleverTable\r\n![Pytest](https://github.com/tom-mohr/clevertable/actions/workflows/pytest.yml/badge.svg)\r\n\r\nConsistent, intelligent transformation of text-based tabular data into numerical data.<br>\r\nMinimal configuration required.\r\n\r\nInstallation:\r\n\r\n```bash\r\npip install clevertable\r\n```\r\n\r\nExample:\r\n\r\n```python\r\nfrom clevertable import *\r\n\r\nprofile = ConversionProfile({\r\n    # optionally specify converters for specific columns:\r\n    \"Country\": OneHot(),\r\n    \"Diagnosis\": Binary(positive=\"cancer\", negative=\"benign\"),\r\n    \"Hospitalized\": None,  # ignore column\r\n}, pre_processing=None)\r\n\r\ndf = profile.fit_transform(\"datasets/survey.xlsx\")  # transformed pandas.DataFrame\r\n```\r\n\r\n# Why this Library?\r\n\r\n- CleverTable makes it really easy to convert text-based tabular data\r\n  (optionally mixed with numbers) into numerical data, e.g. a medical survey\r\n  into a Pandas DataFrame or a NumPy array.\r\n- If something is obvious, you should not need to specify it.\r\n  CleverTable will try to make choices for you if you don't make them.\r\n- You stay in control: All choices made by CleverTable can be modified and overridden.\r\n\r\nThis is how CleverTable works: (see below for a full [tutorial](#tutorial))\r\n\r\n1. You create a new `profile = ConversionProfile()`.\r\n   Here, you can optionally specify certain converters.\r\n2. You call `profile.fit(data)` on a sample data set, which creates a fixed conversion profile.\r\n    - CleverTable chooses the best converter for each column if you don't specify it.\r\n    - The converter (chosen by you or by CleverTable) adapts its internal state to fit the data.\r\n3. 
You call `profile.transform(data)` on the actual data set (which may be the same as for `fit()`),\r\n   which converts the data according to the fixed profile.\r\n\r\nHere are some examples on what you can do with CleverTable:\r\n\r\n- Chain multiple converters to achieve complex conversions:\r\n  ```python\r\n  profile[\"Column 7\"] = [\r\n      Split(),\r\n      ForEach(Strip()),\r\n      Flatten(),\r\n      Infer()  # Infer() -> CleverTable will choose what to put here\r\n  ]\r\n  ```\r\n- Use the `Infer()` converter where you want CleverTable to figure out the best solution (see above).\r\n- Concise shorthand writings with Python syntax:\r\n  ```python\r\n  profile[\"Column 1\"] = [  # Python lists create pipelines\r\n    str.lower,             # functions /\r\n    lambda s: s.strip(),   # lambda expressions are allowed\r\n  ]\r\n  profile[\"Column 2\"] = {\"Hello\": 1, \"Bye\": 2}\r\n  profile[\"Column 3\"] = Float(), 1  # tries conversion to float, defaults to 1 on error\r\n  ```\r\n- Incremental configuration: If a column already has a correct converter, you can further process the column\r\n  by adding another converter.\r\n  This implicitly creates a pipeline.\r\n  ```python\r\n  profile[\"Column 5\"] += OneHot()\r\n  ```\r\n- After `fit()`, you can access the inferred state of the converters.\r\n  ```python\r\n  my_weather_conv = profile[\"Weather\"]            # e.g. OneHot()\r\n  my_weather_categories = my_weather_conv.values  # e.g. 
[\"sunny\", \"cloudy\", \"rainy\"]\r\n  ```\r\n- Send multiple columns into one converter:\r\n  ```python\r\n  profile[\"Column 1\", \"Column 2\"] = max\r\n  ```\r\n- Send nested columns into one converter:\r\n  ```python\r\n  profile[(\"Column A\", \"Column B\"), \"Column C\"] = [Parallel(max, floor), min]  # min(max(A, B), floor(C))\r\n  ```\r\n\r\n# Tutorial\r\n\r\nSuppose you want to convert the following table of survey results in a 2D numpy array of numbers:\r\n\r\n| Country | Age | Diagnosis | Hospitalized | Education level | Symptoms        |\r\n|---------|-----|-----------|--------------|-----------------|-----------------|\r\n| China   | 32  | benign    | no           | University      | cough, fever    |\r\n| France  | 45  | cancer    | yes          | PhD             | fever           |\r\n| Italy   | 19  | benign    | yes          | High School     | cough           |\r\n| Germany | 56  | cancer    | yes          | High School     | fever and cough |\r\n| Nigeria | 23  | benign    | no           | University      | cough           |\r\n| India   | 34  | benign    | yes          | University      | cough, fever    |\r\n| ...     | ... | ...       | ...          | ...             | ...             
|\r\n\r\nFor example, you might want to convert the `Country` column into a column of integers,\r\nwith every integer representing a different country.<br>\r\nHowever:\r\n\r\n- You don't really care which number represents which country.\r\n- But you want to make sure that the same country always gets the same number,\r\n  even if you add more data to the table later.\r\n- You also want to know which integer was chosen for which country.\r\n\r\nThat's what CleverTable is for:\r\n\r\n- First, you call `fit()` on a sample data set, which creates a fixed conversion profile.\r\n- Then, you call `transform()` on the actual data set, and it converts the data according\r\n  to the fixed profile.\r\n\r\nMoreover, CleverTable does many things automatically:\r\n\r\n- It chooses the best converter if you don't specify it.\r\n- And then, the converter also adapts its internal state to fit the data.\r\n\r\nLet's see how that works:\r\n\r\n```python\r\nfrom clevertable import *\r\n\r\ntable = \"datasets/survey.xlsx\"  # filename or pandas.DataFrame\r\n\r\nprofile = ConversionProfile()\r\nprofile.fit(table)  # chooses best converters and creates a fixed conversion profile\r\n```\r\n\r\n`print(profile)` will show the inferred conversion profile:\r\n\r\n```python\r\n{\r\n    \"Country\": Enumerate('china', 'france', 'germany', ...),  # lots of countries\r\n    \"Age\": Float(),\r\n    \"Diagnosis\": Binary(),\r\n    \"Hospitalized\": Binary(),\r\n    \"Education level\": OneHot('high school', 'phd', 'university'),\r\n    \"Symptoms\": ListAndOr(),\r\n}\r\n```\r\n\r\nWe can access the individual converters and their properties by indexing the profile with the column name:\r\n\r\n```python\r\ncountry_converter = profile[\"Country\"]  # Enumerate('china', 'france', 'germany', ...)\r\n\r\n# see which integer corresponds to which country:\r\ncountries_list = country_converter.values  # ('china', 'france', 'germany', ...)\r\n```\r\n\r\nYou can now use this profile to convert 
data:\r\n\r\n```python\r\n# transform the whole table:\r\ndf = profile.transform(table)  # pandas.DataFrame\r\narr = df.to_numpy()  # 2D numpy array\r\n\r\n# transform a single data point:\r\ndata_point = {\"Country\": \"Germany\"}\r\ntransformed = profile.transform_single(data_point)  # {'Country': 2}\r\n```\r\n\r\nThe nice thing is that you can now use the fixed profile\r\nto find out after conversion where the numerical values originated from:\r\n\r\n```python\r\n# find out which country corresponds to the number 2:\r\ncountry_id = 2\r\ncountry = profile[\"Country\"].values[country_id]  # 'germany'\r\n```\r\n\r\nYou may have noticed that all the strings appear in lowercase.\r\nThat is because the `ConversionProfile` pre-processes all strings to lowercase by default.\r\nYou can disable this behavior by passing `pre_processing=None` to the constructor\r\nor setting this property after construction:\r\n\r\n```python\r\nprofile.pre_processing = None  # disable pre-processing\r\nprofile.pre_processing = str.lower  # default behavior\r\nprofile.pre_processing = lambda s: s.strip().lower()\r\n```\r\n\r\nIt's okay to provide a pre-processing function that doesn't work for some entries\r\n(e.g. 
`str.lower` will fail for non-string entries),\r\nbecause CleverTable will catch errors and ignore them during pre-processing.\r\n\r\nYou may also have noticed that the `Education level` column was converted to `OneHot()`,\r\neven though it contains arbitrary words, just like the `Country` column.\r\nThat's because CleverTable detected that there are too many different values\r\nin the `Country` column for a `OneHot()` converter, so it chose the `Enumerate()` converter.\r\n\r\nBut you can always override this behavior by explicitly setting the conversion method\r\nbefore calling `fit()`:\r\n\r\n```python\r\nfrom clevertable import *\r\n\r\ntable = \"datasets/survey.xlsx\"\r\n\r\nprofile = ConversionProfile()\r\n\r\n# explicitly specify some converters:\r\nprofile[\"Country\"] = OneHot()\r\nprofile[\"Diagnosis\"] = Binary(positive=\"cancer\", negative=\"benign\")\r\n\r\nprofile.fit(table)\r\n```\r\n\r\nIn this example, we also made sure that the \"Diagnosis\" column\r\nis choosing the correct positive and negative values.\r\n\r\nYou can also achieve the same by passing a dictionary to the constructor:\r\n\r\n```python\r\nfrom clevertable import *\r\n\r\ntable = \"datasets/survey.xlsx\"\r\n\r\nprofile = ConversionProfile({\r\n    \"Country\": OneHot(),\r\n    \"Diagnosis\": Binary(positive=\"cancer\", negative=\"benign\"),\r\n}).fit(table)  # fit() returns self\r\n```\r\n\r\nTwo final notes:\r\n\r\n- You can ignore columns by setting their converter to `None` (which is shorthand for the `Ignore()` converter).\r\n- You can use `fit_transform()` to perform `fit()` and `transform()` with the same data in one call.\r\n\r\nThis leaves us with this very concise code:\r\n\r\n```python\r\nfrom clevertable import *\r\n\r\ndf = ConversionProfile({\r\n    \"Country\": OneHot(),\r\n    \"Diagnosis\": Binary(positive=\"cancer\", negative=\"benign\"),\r\n    \"Hospitalized\": None,\r\n}, pre_processing=None).fit_transform(\"datasets/survey.xlsx\")\r\n```\r\n\r\nWhich produces the 
following transformed table:\r\n\r\n| Country=China | Country=France | ... | Country=Zimbabwe | Age | Diagnosis | Education level=High School | Education level=PhD | Education level=University | Symptoms=cough | Symptoms=fever |\r\n|---------------|----------------|-----|------------------|-----|-----------|-----------------------------|---------------------|----------------------------|----------------|----------------|\r\n| 1             | 0              | ... | 0                | 32  | 0         | 0                           | 0                   | 1                          | 1              | 1              |\r\n| 0             | 1              | ... | 0                | 45  | 1         | 0                           | 1                   | 0                          | 0              | 1              |\r\n| 0             | 0              | ... | 0                | 19  | 0         | 1                           | 0                   | 0                          | 1              | 0              |\r\n| 0             | 0              | ... | 0                | 56  | 1         | 1                           | 0                   | 0                          | 1              | 1              |\r\n| 0             | 0              | ... | 0                | 23  | 0         | 0                           | 0                   | 1                          | 1              | 0              |\r\n| 0             | 0              | ... 
| 0                | 34  | 0         | 0                           | 0                   | 1                          | 1              | 1              |\r\n\r\n# CLI\r\n\r\n`pip install clevertable` also makes the command `clevertable` available\r\nin the command line.\r\nIt can convert files with tabular data.\r\nExecute `clevertable --help` to see what arguments can be passed to the tool:\r\n\r\n```text\r\nusage: clevertable [-h] [-i IGNORE [IGNORE ...]] src out\r\n\r\nConsistent and intelligent conversion of tabular data into numerical values.\r\n\r\npositional arguments:\r\n  src                   Path to input file.\r\n  out                   Path to output file.\r\n\r\noptional arguments:\r\n  -h, --help            show this help message and exit\r\n  -i IGNORE [IGNORE ...], --ignore IGNORE [IGNORE ...]\r\n                        Column names to ignore.\r\n```\r\n\r\n# How to Contribute\r\n\r\nBasic workflow of contribution:\r\n\r\n- Fork the repository\r\n- Create a new branch\r\n- Make your changes\r\n- Create a pull request\r\n- Wait for the pull request to be accepted or rejected\r\n- If accepted, you can delete your branch\r\n- If rejected, make the requested changes and push them to your branch\r\n- Repeat until pull request is accepted\r\n\r\nWhat to contribute:\r\n\r\n- New converters (classes that inherit from `Converter`)\r\n- Improvements to converter inference (logic in `Infer()` converter)\r\n- Improvements to default preprocessing\r\n- Make more features available through the CLI\r\n- New tests\r\n- New documentation, tutorials, examples\r\n- New ideas, suggestions, bug reports \u2192 create an issue or contact me directly\r\n\r\n# Documentation\r\n\r\nThere are only two classes that:\r\n\r\n- `ConversionProfile`: A collection of converters.\r\n- `Converter`: Transforms columns of data into columns of data.\r\n\r\n## Converters\r\n\r\nHere's a quick overview of all converters:\r\n\r\n| Converters                            | Description         
                                                                | Shorthand | Example Usage                                                   |\r\n|---------------------------------------|-------------------------------------------------------------------------------------|-----------|-----------------------------------------------------------------|\r\n| Basic:                                |                                                                                     |           |                                                                 |\r\n| [`Float()`](#float)                   | Convert numbers into floats.                                                        |           |                                                                 |\r\n| [`Enumerate()`](#enumerate)           |                                                                                     |           |                                                                 |\r\n| [`OneHot()`](#onehot)                 |                                                                                     |           |                                                                 |\r\n| [`Binary()`](#binary)                 | Convert to 0 and 1. Detects common \"positive\" and \"negative\" terms in strings.      
|           |                                                                 |\r\n| [`List()`](#list)                     |                                                                                     |           |                                                                 |\r\n| [`ListAndOr()`](#listandor)           |                                                                                     |           |                                                                 |\r\n| [`Map()`](#map)                       |                                                                                     | dict      | {<br>&nbsp;&nbsp;\"foo\": 1,<br>&nbsp;&nbsp;\"bar\": -2,<br>}       |\r\n| [`Const()`](#const)                   | Return a constant value.                                                            | *any*     | 42<br>\"foo\"                                                     |\r\n| Text Processing:                      |                                                                                     |           |                                                                 |\r\n| [`Strip()`](#strip)                   |                                                                                     |           |                                                                 |\r\n| [`Split()`](#split)                   |                                                                                     |           |                                                                 |\r\n| Combining Converters:                 |                                                                                     |           |                                                                 |\r\n| [`Pipeline()`](#pipeline)             | Apply multiple converters in sequence.                                              
| list      | [<br>&nbsp;&nbsp;Split(),<br>&nbsp;&nbsp;ForEach(Strip()),<br>] |\r\n| [`Try()`](#try)                       | Try multiple converters and return the first one that succeeds.                     | tuple     | (Float(), Binary())                                             |\r\n| [`ForEach()`](#foreach)               | Apply the same converter to all items.                                              |           |                                                                 |\r\n| [`Parallel()`](#parallel)             | Apply different converters to the respective items.                                 |           |                                                                 |\r\n| Special:                              |                                                                                     |           |                                                                 |\r\n| [`Id()`](#id)                         |                                                                                     |           |                                                                 |\r\n| [`Ignore()`](#ignore)                 | Drop the column.                                                                    
| None      | None                                                            |\r\n| [`Infer()`](#infer)                   |                                                                                     |           |                                                                 |\r\n| [`Label()`](#label)                   |                                                                                     |           |                                                                 |\r\n| Dimensionality:                       |                                                                                     |           |                                                                 |\r\n| [`Flatten()`](#flatten)               | Flatten a tuple of tuples into a single tuple. This is often needed after `ForEach()` or `Parallel()`. |           |                                                                 |\r\n| [`Transpose()`](#transpose)           |                                                                                     |           |                                                                 |\r\n| Arbitrary Functions:                  |                                                                                     |           |                                                                 |\r\n| [`Function()`](#function)             | Apply a user-defined function to the data.                                          | callable  | lambda x: x**2                                                  |\r\n| [`StrictFunction()`](#strictfunction) | Apply a user-defined function to the data. Less flexible than `Function()`.         
|           |                                                                 |\r\n\r\n---\r\n\r\n### Float\r\n\r\nConverts a column of numbers into a column of numbers.\r\nIf invalid values are encountered (`NaN`, `inf`, `None`, etc.),\r\na warning is printed and the value is replaced with `np.nan`.\r\nThis can be circumvented by passing a value to the `default` argument:\r\n\r\n```python\r\n\"Temperature\": Float(default=37.0)\r\n```\r\n\r\nYou can also specify `\"mean\"`, `\"median\"`, or `\"mode\"` as the default value.\r\nThis will choose the default value based on the data in the specified column:\r\n\r\n```python\r\n\"Temperature\": Float(default=\"mean\")\r\n```\r\n\r\n| Temperature | \u21d2 | Temperature |\r\n|-------------|---|-------------|\r\n| 37.5        |   | 37.5        |\r\n| 40.0        |   | 40.0        |\r\n| 38.5        |   | 38.5        |\r\n|             |   | 38.75       |\r\n| 39.0        |   | 39.0        |\r\n\r\nResults in:\r\n\r\n```python\r\n\"Temperature\": Float(default=38.75)\r\n```\r\n\r\n---\r\n\r\n### Enumerate\r\n\r\nThis is the extension of the [`Binary()`](#binary) conversion method\r\nto columns with more than two possible values.\r\nThe values are converted into integers starting at 0,\r\nresulting in a single column of integers.\r\n\r\nThe possible values can be passed to the constructor:\r\n\r\n```python\r\n\"Country\": Enumerate(\"france\", \"germany\", \"italy\")\r\n```\r\n\r\n| Country | \u21d2 | Country |\r\n|---------|---|---------|\r\n| france  |   | 0       |\r\n| italy   |   | 2       |\r\n| germany |   | 1       |\r\n\r\nTheir index in the argument list is used as the numerical value.\r\nIf no values are specified, the values found in the provided\r\ndata are sorted in lexically ascending order.\r\n\r\n---\r\n\r\n### OneHot\r\n\r\nIf each entry contains one of multiple possible values.\r\nThe possible values can be specified via the `values` argument:\r\n\r\n```python\r\n\"Education Level\": OneHot(\"primary\", 
\"secondary\", \"tertiary\")\r\n```\r\n\r\n| Education Level | \u21d2 | Education Level=primary | Education Level=secondary | Education Level=tertiary |\r\n|-----------------|---|-------------------------|---------------------------|--------------------------|\r\n| primary         |   | 1                       | 0                         | 0                        |\r\n| secondary       |   | 0                       | 1                         | 0                        |\r\n| tertiary        |   | 0                       | 0                         | 1                        |\r\n\r\nIf no values are specified, the possible values are inferred from the data.\r\n\r\n---\r\n\r\n### Binary\r\n\r\nSimilar to [`Enumerate()`](#enumerate), but with just two possible values,\r\nand with some extra intelligence for this purpose.\r\nFor example, it can detect words commonly used for positive and negative values:\r\n\r\n- Positive: `yes`, `true`, `positive`, `1`, `female`\r\n- Negative: `no`, `false`, `negative`, `0`, `male`, `none`\r\n\r\nExample:\r\n\r\n```python\r\n\"Hospitalized\": Binary()\r\n```\r\n\r\n| Hospitalized | \u21d2 | Hospitalized |\r\n|--------------|---|--------------|\r\n| no           |   | 0            |\r\n| yes          |   | 1            |\r\n| false        |   | 0            |\r\n| true         |   | 1            |\r\n| none         |   | 0            |\r\n\r\nResults in:\r\n\r\n```python\r\n\"Hospitalized\": Binary(positive={\"yes\", \"true\"},\r\n                       negative={\"no\", \"false\", \"none\"})\r\n```\r\n\r\nYou can explicitly specify the values of the `positive` class and the `negative` class via the constructor:\r\n\r\n```python\r\n\"Hospitalized\": Binary(positive=\"yes\", negative=\"no\")\r\n```\r\n\r\n| Hospitalized | \u21d2 | Hospitalized |\r\n|--------------|---|--------------|\r\n| yes          |   | 1            |\r\n| no           |   | 0            |\r\n| no           |   | 0            |\r\n| yes          |   | 1           
 |\r\n\r\nIf only one argument is specified (either `positive` or `negative`),\r\nall other values present in the data are treated as instances of the other class:\r\n\r\n```python\r\n\"Time served\": Binary(negative=\"none\")\r\n```\r\n\r\n| Time served | \u21d2 | Time served |\r\n|-------------|---|-------------|\r\n| none        |   | 0           |\r\n| 1 year      |   | 1           |\r\n| 4 years     |   | 1           |\r\n| none        |   | 0           |\r\n\r\nIt's also possible to specify more than one value for the ``positive`` and ``negative`` classes.\r\nExample:\r\n\r\n```python\r\n\"Hospitalized\": Binary(positive={\"yes\", \"true\"}, negative={\"no\", \"false\"})\r\n```\r\n\r\n| Hospitalized | \u21d2 | Hospitalized |\r\n|--------------|---|--------------|\r\n| yes          |   | 1            |\r\n| no           |   | 0            |\r\n| false        |   | 0            |\r\n| true         |   | 1            |\r\n\r\nIf no positive or negative values are specified, a set of strings commonly used\r\nto indicate positive / negative values is tested against the available data.\r\nFor instance, in the example above, the specified arguments would have been\r\ninferred automatically as positive and negative.\r\n\r\nIf this approach is not successful, the lexically smallest value is chosen as the `negative` argument and\r\nthe `positive` argument is left empty, causing all other values to be treated as positive:\r\n\r\n```python\r\n\"Fruits\": Binary()\r\n```\r\n\r\n| Fruits | \u21d2 | Fruits |\r\n|--------|---|--------|\r\n| banana |   | 1      |\r\n| apple  |   | 0      |\r\n| kiwi   |   | 1      |\r\n| apple  |   | 0      |\r\n\r\nResults in:\r\n\r\n```python\r\n\"Fruits\": Binary(negative=\"apple\")\r\n```\r\n\r\n---\r\n\r\n### List\r\n\r\nConverts lists of values into multiple binary columns.\r\n\r\n```python\r\n\"Symptoms\": List()\r\n```\r\n\r\n| Symptoms               | \u21d2 | Symptoms=cough | Symptoms=fever | Symptoms=headache 
|\r\n|------------------------|---|----------------|----------------|-------------------|\r\n| fever, cough, headache |   | 1              | 1              | 1                 |\r\n| headache, cough        |   | 1              | 0              | 1                 |\r\n\r\nThe default delimiter is a comma.<br>\r\nYou can specify a custom delimiter via the `delimiter` argument:\r\n\r\n```python\r\n\"Symptoms\": List(delimiter=\";\")\r\n\"Symptoms\": List(delimiter=[\",\", \";\"])  # also accepts lists\r\n```\r\n\r\nThe passed strings are interpreted as regular expressions.\r\n\r\n### ListAndOr\r\n\r\n```python\r\n\"Symptoms\": ListAndOr()\r\n```\r\n\r\n| Symptoms                  | \u21d2 | Symptoms=cough | Symptoms=fever | Symptoms=headache |\r\n|---------------------------|---|----------------|----------------|-------------------|\r\n| fever, cough and headache |   | 1              | 1              | 1                 |\r\n| headache or cough         |   | 1              | 0              | 1                 |\r\n\r\nThe default delimiters are comma, \"and\" and \"or\".<br>\r\nThe passed strings are interpreted as regular expressions.\r\n\r\n### Map\r\n\r\n### Strip\r\n\r\n### Split\r\n\r\n### Pipeline\r\n\r\n### Try\r\n\r\n```python\r\nTry(converter1, converter2, ...)\r\n```\r\n\r\nReturns value of the first converter that does not raise an exception,\r\nor the original value if all converters raise an exception.\r\n`Try()` always only applies one converter and returns its output (if it didn't fail).\r\n\r\n```python\r\n\"Product\": Try(Float(), Infer())  # will infer the converter for the samples that cannot be converted to floats\r\n```\r\n\r\n| Product | \u21d2 | Product |\r\n|---------|---|---------|\r\n| Kiwi    |   | 48      |\r\n| Apple   |   | 0       |\r\n| 712356  |   | 712356  |\r\n| 261382  |   | 261382  |\r\n| Banana  |   | 1       |\r\n| Kiwi    |   | 48      |\r\n| ...     |   | ...     
|\r\n\r\nThis would result in the following profile after `fit()`:\r\n\r\n```python\r\n\"Product\": Try(Float(), Enumerate(\"Apple\", \"Banana\", ...))\r\n```\r\n\r\n### ForEach\r\n\r\nApply the same converter to all items.\r\n\r\n### Parallel\r\n\r\n```python\r\nParallel(converter1, converter2, ...)\r\n```\r\n\r\nApply different converters to the respective items.\r\nUsually used in a Pipeline after other converters that create outputs with multiple items (e.g. `Split()`).\r\nAlso, you usually want to use `Flatten()` after this, as each individual converter will return a tuple of items,\r\neven if it only contains one item.\r\nExample:\r\n\r\n```python\r\n\"Latitude;Longitude\": [\r\n    Split(\";\"),  # must always result in two items, because Parallel() has 2 converters\r\n    Parallel(Ignore(), Float()),  # ignore latitude, convert longitude to float -> [[], [longitude]]\r\n    Flatten(),  # -> [longitude]\r\n]\r\n```\r\n\r\n| Latitude;Longitude  | \u21d2 | Longitude |\r\n|---------------------|---|-----------|\r\n| 52.520008;13.404954 |   | 13.404954 |\r\n| 48.137154;11.576124 |   | 11.576124 |\r\n\r\n### Const\r\n\r\n### Id\r\n\r\nIdentity.\r\nKeeps the input unchanged.\r\n\r\n### Ignore\r\n\r\nDrops the column.\r\n\r\n```python\r\n\"registration_timestamp\": None\r\n```\r\n\r\nThis is chosen if no appropriate conversion method could be found.\r\n\r\n### Infer\r\n\r\nTries to infer the conversion method from the column name.\r\nAfter `fit()`, this converter will be replaced with the inferred converter in the profile.\r\n\r\nThis is the default converter for columns where no converter is specified.\r\nThis converter can however also be used anywhere else explicitly.\r\nExamples:\r\n\r\n```python\r\n\"col1\": [\r\n    str.upper,\r\n    Infer()\r\n],\r\n\"col2\": Try(Float(), Infer()),  # will infer the converter for the samples that cannot be converted to floats\r\n```\r\n\r\n### Label\r\n\r\n### Flatten\r\n\r\n### Transpose\r\n\r\nCan transpose nested tuples, 
given that the nested tuples are of equal length.\r\n\r\nFor example, look at this elegant implementation of the [`List()`](#list) converter:\r\n\r\n```python\r\n\"Symptoms\": [\r\n    Split(r\"\\s*,\\s*\"),  # split at comma\r\n    ForEach(OneHot()),\r\n    Transpose(),\r\n    ForEach(max),\r\n    Flatten()\r\n]\r\n```\r\n\r\n`Transpose()` allows us to apply `max` to each column of the one-hot encodings\r\nacross all tuple elements.\r\n\r\n### Function\r\n\r\n```python\r\nFunction(transform, labels=None)\r\n```\r\n\r\nShorthand: Instead of `Function(transform, None)`, just write `transform`, where `transform` is some callable.\r\n\r\nCreates a custom converter from a custom ``transform()`` function\r\n(and optionally, a custom ``labels()`` function).\r\nThis is a handy way to create a converter that doesn't need ``fit()``.\r\n\r\nUnlike ``StrictFunction()``, this class can handle functions that don't accept or return tuples,\r\nwhich often allows for more concise code.\r\n\r\nThis is achieved during ``fit()`` as follows:\r\n\r\n1. If all incoming items are 1-element tuples, it sets a flag ``UNPACK_OUTPUT`` to always\r\n   unpack the element before passing it to the wrapped function during ``transform()``.\r\n2. If during ``fit()`` the wrapped function doesn't return tuples,\r\n   it tries to turn that output into a tuple:\r\n    - If the output is always a non-string iterable, it will simply set a\r\n      flag ``CONVERT_ITERABLE`` to always convert the iterable output into a tuple during ``transform()``.\r\n    - Otherwise, it sets a flag ``WRAP_OUTPUT`` to always wrap the output in a\r\n      1-element tuple during ``transform()``.\r\n\r\nSimilar logic is applied to the labels.\r\nIf a custom labels function is given, the following procedure is followed during ``labels()``:\r\n\r\n1. If the incoming labels are a 1-element tuple, the single label is unpacked before\r\n   it is passed to the custom labels function.\r\n2. 
If the custom labels function returns something other than a tuple,\r\n   this class tries to convert it into a tuple:\r\n    - If the output is a non-string iterable, it is converted into a tuple.\r\n    - Otherwise, the output is wrapped in a 1-element tuple.\r\n\r\nIf no custom labels function is given, the output labels are generated based on the\r\noutput cardinality inferred during ``fit()`` and according to the following logic:\r\n\r\n- If the number of incoming labels is identical to the output cardinality,\r\n  the labels will be returned unchanged.\r\n- If there are multiple incoming labels but a single output label,\r\n  the output label is formed by joining the incoming labels with ``, ``.\r\n- If there is a single incoming label but multiple output labels,\r\n  the output labels are formed by adding suffixes ``_0``, ``_1``, etc.\r\n  to the single input label.\r\n\r\n|                    | **1 Output Label** | **M Output Labels**          |\r\n|--------------------|--------------------|------------------------------|\r\n| **1 Input Label**  | identical          | suffixes ``_0``, ``_1``, ... 
|\r\n| **N Input Labels** | join with ``, ``   | identical if M=N, else ERROR |\r\n\r\nA special case are functions returning output of varying cardinality during ``fit()``.\r\nIn this case, a single label is returned.\r\nIf only one input label is given, that single label is returned.\r\nIf multiple input labels are given, they are joined with ``_``.\r\n\r\nThe following example turns a text column into two columns containing the ascii code of the first and last letter.\r\n\r\n```python\r\n\"Name\": lambda x: (ord(x[0]), ord(x[-1]))\r\n```\r\n\r\n| Name  | \u21d2 | Name_0 | Name_1 |\r\n|-------|---|--------|--------|\r\n| Alice |   | 97     | 101    |\r\n| Bob   |   | 98     | 98     |\r\n\r\n(Remember that by default, all text entries are converted to\r\nlowercase before further processing.)\r\n\r\nAs you can see, the number of columns is inferred directly from the return value of the conversion function.\r\nIf the function returns a tuple, the resulting column names are indexed.\r\n\r\nYou can also set the labels explicitly with a lambda function\r\nthat takes the input column name as an argument and returns output column names:\r\n\r\n```python\r\n\"Name\": Function(lambda x: (ord(x[0]), ord(x[-1])),\r\n                 labels=lambda s: (f\"ord(first letter of {s})\", f\"ord(last letter of {s})\")),\r\n```\r\n\r\n| Name  | \u21d2 | ord(first letter of Name) | ord(last letter of Name) |\r\n|-------|---|---------------------------|--------------------------|\r\n| Alice |   | 97                        | 101                      |\r\n| Bob   |   | 98                        | 98                       |\r\n\r\nHowever, remember that you can always simply use [`Label()`](#label) to rename the columns after the conversion,\r\nif you don't need the output column names to depend on the input column names.\r\n\r\n```python\r\n\"Name\": [lambda x: (ord(x[0]), ord(x[-1])),\r\n         Labels(\"ord(first letter)\", \"ord(last letter)\")],\r\n```\r\n\r\n### 
StrictFunction\r\n\r\n```python\r\nStrictFunction(transform, labels=None)\r\n```\r\n\r\nWorks mostly like [`Function()`](#function), but simpler:\r\n``transform`` and ``labels`` must both accept and return tuples.\r\nInstead of something like this:\r\n\r\n```python\r\n\"Name\": str.lower,\r\n```\r\n\r\nyou have to write this:\r\n\r\n```python\r\n\"Name\": StrictFunction(lambda x: (str.lower(x[0]),))  # notice the comma, which makes it a 1-element tuple\r\n```\r\n\r\nThat is, you will still receive 1-element tuples as tuples to the function,\r\neven if all input elements during `fit()` are 1-element tuples.\r\nAlso, you must now explicitly return a tuple,\r\neven if it is just a 1-element tuple, as otherwise an error will be raised.\r\n\r\nSee [`Function()`](#function) for a convenient extension of this converter.\r\n\r\n---\r\n\r\n## Understanding Multi-Column Converters\r\n\r\nA converter returns two things:\r\n\r\n- `transform()`: the items of the transformed data\r\n- `labels()`: a label for each item\r\n\r\nBoth return values are tuples.\r\n\r\nFor top-level converters, this then creates the corresponding amount of columns.\r\nThis includes the case of\r\n\r\n- a 1-element tuple `(item,)`, which is the case for most converters.\r\n- an empty tuple `()`, in which the result is ignored.\r\n  (In fact, this is exactly how `Ignore()` is implemented.)\r\n\r\nThis means, however, that for top-level converters, `labels()` and `transform()`\r\nmust return the same number of items.\r\nThat is because `labels()` is used to create the output column names.\r\nIf `transform()` returns a different number of items, that will raise an error for top-level converters.\r\n\r\nHowever, for nested converters, `labels()` and `transform()` can return different numbers of items.\r\nFor example `Split.labels()` always returns only one item,\r\nbecause the number of items returned by `Split.transform()` varies from input to input.\r\nTherefore, `Split()` can't be used as a top-level 
converter\r\nand has to be used inside a `Pipeline` or similar devices,\r\nso that other converters can ensure that the final output is of constant size.\r\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2023 Tom Mohr  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Low effort conversion of tabular data into numerical values.",
    "version": "3.1.0",
    "project_urls": {
        "Homepage": "https://github.com/tom-mohr/clevertable"
    },
    "split_keywords": [
        "parser",
        "converter",
        "numerical"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3e96235f47e2f9c1cac2b68620b0bd5491e69deaf0ff027d163c9464385474b2",
                "md5": "dad1dae9405af9feb67926a41f7a5acf",
                "sha256": "949816d00f969a4515ebf03176dec7271d6bbcd3053713040220c179e45c3bc7"
            },
            "downloads": -1,
            "filename": "clevertable-3.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dad1dae9405af9feb67926a41f7a5acf",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 37649,
            "upload_time": "2023-06-14T06:17:50",
            "upload_time_iso_8601": "2023-06-14T06:17:50.746317Z",
            "url": "https://files.pythonhosted.org/packages/3e/96/235f47e2f9c1cac2b68620b0bd5491e69deaf0ff027d163c9464385474b2/clevertable-3.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fd32f090e03a76789a479ff1b68cf478eea65ecc1a4d09862851012fc163964d",
                "md5": "510724ad5b78143765feec8a7834a286",
                "sha256": "3dd5f7804095b21720ec722dfae88b022617953eed14df691c422497a3780eec"
            },
            "downloads": -1,
            "filename": "clevertable-3.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "510724ad5b78143765feec8a7834a286",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 45743,
            "upload_time": "2023-06-14T06:17:52",
            "upload_time_iso_8601": "2023-06-14T06:17:52.753594Z",
            "url": "https://files.pythonhosted.org/packages/fd/32/f090e03a76789a479ff1b68cf478eea65ecc1a4d09862851012fc163964d/clevertable-3.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-14 06:17:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "tom-mohr",
    "github_project": "clevertable",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "clevertable"
}
        