extract-transform


- Name: extract-transform
- Version: 1.0.10
- Home page: https://github.com/frederikvanhevel/extract-transform
- Summary: A Python library for efficient encoding, decoding, and transformation of complex data, ideal for machine learning workflows.
- Upload time: 2023-09-06 19:12:11
- Author: Frederik Vanhevel
- Requires Python: >=3.7,<4.0
- License: MIT
- Keywords: etl, data, transformation, extraction, processing, machine, learning, pipeline

# 🧬 Extract Transform
[![PyPI version](https://badge.fury.io/py/extract-transform.svg)](https://badge.fury.io/py/extract-transform) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

`Extract Transform` is a Python library offering robust tools for encoding, decoding, and transforming complex data structures. Designed for efficient nested data manipulation and custom data type handling, it's an indispensable tool for preparing and structuring data, especially in machine learning workflows.

# Installation

Install the library via pip:

```shell
pip install extract-transform
```

or poetry:

```shell
poetry add extract-transform
```

# Usage

`Extract Transform` provides a set of extractors, each tailored to a specific type or transformation. To use the library, select the relevant extractor (or combine several) and call its `extract` method on your data.

## Basic example

```python
from extract_transform import Record, Transform

# define the extractor
extractor = Record({"key": Transform(lambda s: s.upper())})

# use the extractor on your data
result = extractor.extract({"key": "value"}) # Output: {"key": "VALUE"}
```

## Advanced Examples

For more intricate use-cases and advanced examples, please refer to the following:

- [Twitter API Example](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/twitter_api.py)
- [Machine learning preprocessing](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/machine_learning_preprocessing.py)
- [OpenWeather API Example](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/openweather_api.py)
- [Experian API Example](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/experian_api.py)


## Available Extractors

- [Basic types](#basic-types)
- [Complex types](#complex-types)
- [Data manipulation](#data-manipulation)
- [Dates and times](#dates-and-times)
- [Encoding and categorical](#encoding-and-categorical)
- [Custom extractors](#custom-extractors)

### Basic types

- [Boolean](#boolean)
- [Decimal](#decimal)
- [Float](#float)
- [Integer](#integer)
- [String](#string)
- [Hexadecimal](#hexadecimal)
- [Raw](#raw)

#### Boolean

Converts data to a boolean using provided truthy and falsy values.

**Input:** "true", "false", 1, 0  
**Output:** `True` or `False`.

```python
extractor = Boolean()
extractor.extract("true") # Output: True

extractor = Boolean(truthy_values=["yes"], falsy_values=["no"])
extractor.extract("yes") # Output: True
```
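
The truthy/falsy lookup described above can be approximated in plain Python. This is a minimal sketch rather than the library's implementation: the helper name `to_bool`, its default value sets, the case-insensitive string matching, and the choice to raise `ValueError` on unrecognized input are all assumptions made for illustration.

```python
def to_bool(value, truthy_values=("true", 1), falsy_values=("false", 0)):
    """Hypothetical helper: map a value to True/False via explicit lookup sets."""
    # Lowercase strings so "TRUE" and "true" match (an assumption, not library behavior).
    key = value.lower() if isinstance(value, str) else value
    if key in truthy_values:
        return True
    if key in falsy_values:
        return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")


print(to_bool("true"))
print(to_bool("yes", truthy_values=["yes"], falsy_values=["no"]))
```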

---

#### Decimal

Converts data to a decimal with specified precision and scale.

**Input:** "123.456", 123.456, "123", 123, etc.  
**Output:** Rounded decimal.Decimal value.

```python
extractor = Decimal()
extractor.extract("123.456") # Output: decimal.Decimal
```
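
For readers unfamiliar with rounding to a fixed scale, the standard-library `decimal` module can illustrate the idea. This is a sketch under stated assumptions: the helper `to_decimal`, its `scale` parameter, the default of two fractional digits, and the half-up rounding mode are hypothetical and may differ from what the `Decimal` extractor actually does.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal(value, scale=2):
    """Hypothetical helper: convert to Decimal and round to `scale` fractional digits."""
    # Go through str() so a float such as 123.456 is not expanded
    # into its full binary approximation.
    quantum = Decimal(10) ** -scale  # scale=2 gives Decimal("0.01")
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


print(to_decimal("123.456"))
print(to_decimal(123))
```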
---

#### Float

Converts data to a float.

**Input:** "123.456", 123.456, "123", 123, etc.  
**Output:** Float value.

```python
extractor = Float()
extractor.extract("123.456") # Output: 123.456
```



---

#### Integer

Converts data to its integer representation.

**Input:** "12345", 12345, "1a3f" (hexadecimal), etc.  
**Output:** Integer value.

```python
extractor = Integer()
extractor.extract("12345") # Output: 12345
```

---

#### String

Converts data to its string representation.

**Input:** 12345, True, [1, 2, 3], etc.  
**Output:** String representation, e.g., "12345".

```python
extractor = String()
extractor.extract(550) # Output: "550"
```

---

#### Hexadecimal

Converts a hexadecimal string to an integer.

**Input:** "1a3f", "fa3c", etc.  
**Output:** Integer representation of the hexadecimal.

```python
extractor = Hexadecimal()
extractor.extract("1a3f") # Output: 6719
```

---

#### Raw

Returns data as-is without processing.

**Input:** 12345, "Hello", {"key": "value"}, etc.  
**Output:** Input data without any alterations.

```python
extractor = Raw()
extractor.extract("12345") # Output: "12345"
```

### Complex types

- [Array](#array)
- [Record](#record)
- [NumericWithCodes](#numericwithcodes)

#### Array

Processes input into a list, transforming each item based on a provided extractor.

**Input:** A list or a single item. E.g., ["Alice", "Bob"] or "Alice".  
**Output:** List with items processed according to the extractor, e.g., ["Alice", "Bob"].

```python
extractor = Array(Integer())
extractor.extract(["20", "30"]) # Output: [20, 30]
```

---

#### Record

Processes a dictionary by transforming values based on provided field mappings.

**Input:** Dictionary with fields like {"name": "Alice", "age": 30}.  
**Output:** New dictionary with mapped fields, e.g., {"name": "Alice", "age": 30}.

```python
extractor = Record({
    ("identity", "id"): Integer(),
    "amount": Integer()
})

extractor.extract({
    "identity": 5,
    "amount": "25"
}) # Output: {"id": 5, "amount": 25}
```
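
The tuple key in the example above renames a field while converting it. The following plain-Python sketch shows one way such a mapping could be interpreted; `extract_record` is a hypothetical helper, plain callables stand in for extractors, and raising `KeyError` on a missing field is an assumption rather than documented library behavior.

```python
def extract_record(fields, data):
    """Hypothetical helper: apply per-field converters, renaming tuple keys."""
    result = {}
    for key, convert in fields.items():
        # A tuple key is read as (source_name, target_name);
        # a plain key keeps its name in the output.
        source, target = key if isinstance(key, tuple) else (key, key)
        result[target] = convert(data[source])
    return result


fields = {("identity", "id"): int, "amount": int}
print(extract_record(fields, {"identity": 5, "amount": "25"}))
```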
---

#### NumericWithCodes

Extracts numeric values and categorizes them based on boundaries. If the value lies within the boundaries, it is returned as-is under `value`; otherwise, it is returned as a string under `categorical`.

**Input:** Numeric representations like 5, 5.0, "5.0", or Decimal('5.0').  
**Output:** Dictionary with 'value' and 'categorical' keys. E.g., {"value": 5, "categorical": None} or {"value": None, "categorical": "15"}.

```python
extractor = NumericWithCodes(
    Integer(),
    min_val=1,
    max_val=100
)

extractor.extract(55) # Output: {"value": 55, "categorical": None}
extractor.extract(9999) # Output: {"value": None, "categorical": "9999"}
```
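
The value/code split can be pictured with a few lines of plain Python. This is only a sketch: `numeric_with_codes` is a hypothetical stand-in, and treating both boundaries as inclusive is an assumption that the text above does not confirm.

```python
def numeric_with_codes(value, min_val, max_val):
    """Hypothetical helper: split a number into an in-range value or an out-of-range code."""
    if min_val <= value <= max_val:  # inclusive bounds are an assumption
        return {"value": value, "categorical": None}
    # Out-of-range numbers are treated as codes and kept as strings.
    return {"value": None, "categorical": str(value)}


print(numeric_with_codes(55, 1, 100))
print(numeric_with_codes(9999, 1, 100))
```

This pattern is useful for fields where sentinel numbers such as 9999 carry a meaning like "unknown" and should not be mixed with real measurements.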


### Data manipulation

- [Compose](#compose)
- [Count](#count)
- [DefaultValue](#defaultvalue)
- [DictMap](#dictmap)
- [Exists](#exists)
- [Filter](#filter)
- [Flatten](#flatten)
- [MapValue](#mapvalue)
- [Pivot](#pivot)
- [SelectListItem](#selectlistitem)
- [Select](#select)
- [SortDictList](#sortdictlist)
- [Split](#split)
- [Transform](#transform)
- [Unpivot](#unpivot)
- [When](#when)


#### Compose

Chains multiple extractors, passing the output of one as the input to the next.

**Input:** Data compatible with the first extractor, e.g., if the first expects a string, provide a string.  
**Output:** Data processed by all extractors. The nature depends on the sequence, e.g., if the last returns an integer, the output will be an integer.

```python
extractor = Compose(Boolean(), Integer())
extractor.extract("true") # Output: 1
```
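
Chaining is ordinary left-to-right function composition. The sketch below reproduces the idea with `functools.reduce`; `compose` and `to_flag` are hypothetical names, and plain callables stand in for the library's extractors.

```python
from functools import reduce


def compose(*steps):
    """Hypothetical helper: return a function that applies `steps` left to right."""
    def run(data):
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, data)
    return run


# First turn the string into a bool, then the bool into an int.
to_flag = compose(lambda s: s == "true", int)
print(to_flag("true"))
```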

---

#### Count

Counts the items in a list. If a predicate is given, only items satisfying it are counted.

**Input:** A list of items.  
**Output:** Integer representing the count of items satisfying the predicate.

```python
extractor = Count()
extractor.extract([1, 2, 3, 4]) # Output: 4

extractor = Count(lambda x: x > 2)
extractor.extract([1, 2, 3, 4]) # Output: 2
```

---

#### DefaultValue

Returns the input if it's not None; otherwise, a default value.

**Input:** Any data type or None.  
**Output:** Input data if it's not None; otherwise, the default value.

```python
extractor = DefaultValue(1000)
extractor.extract(None) # Output: 1000
extractor.extract(550) # Output: 550
```

---

#### DictMap

Processes each dictionary value through a specified extractor, returning the processed dictionary.

**Input:** Dictionary with arbitrary keys and values, e.g., {"name": "Alice", "age": "30"}.  
**Output:** Dictionary with processed values, e.g., with an integer extractor: {"name": "Alice", "age": 30}.

```python
extractor = DictMap(Integer())
extractor.extract({"a": "10", "b": "20"}) # Output: {"a": 10, "b": 20}
```

---

#### Exists

Checks if a specified key exists in the given data.

**Input:** Data that supports the "in" operation, typically dictionaries or lists. E.g., `{"name": "Alice", "age": 30}` or `["Alice", "Bob", "Charlie"]`.  
**Output:** Boolean indicating the key's existence. E.g., for key "name" and dictionary input: `True`.

```python
extractor = Exists("Alice")
extractor.extract(["Alice", "Bob", "Charlie"]) # Output: True
```
---

#### Filter

Filters items in a list based on a predicate.

**Input:** A list of items.  
**Output:** A list of items that satisfy the predicate.

```python
extractor = Filter(lambda x: x > 2)
extractor.extract([1, 2, 3, 4]) # Output: [3, 4]
```

---

#### Flatten

Flattens a nested dictionary into a single-level dictionary with compound keys.

**Input:** A possibly nested dictionary. E.g., `{"a": {"b": 1, "c": {"d": 2}}}`.  
**Output:** A single-level dictionary. E.g., `{"a.b": 1, "a.c.d": 2}`.

```python
extractor = Flatten()
extractor.extract({"a": {"b": 1, "c": {"d": 2}}}) # Output: {"a.b": 1, "a.c.d": 2}
```
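
Flattening with compound keys is a short recursive walk. The following is a minimal sketch of that technique, not the library's code; the function name `flatten` and its `parent_key` and `sep` parameters are assumptions chosen for illustration.

```python
def flatten(data, parent_key="", sep="."):
    """Hypothetical helper: flatten nested dicts into one level with joined keys."""
    items = {}
    for key, value in data.items():
        compound = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            items.update(flatten(value, compound, sep))
        else:
            items[compound] = value
    return items


print(flatten({"a": {"b": 1, "c": {"d": 2}}}))
```

Note that in this sketch an empty nested dictionary contributes no keys at all.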

---

#### MapValue

Maps input values to a representation based on a provided mapping.

**Input:** A value that might exist in the mapping. E.g., `1` or `"apple"`.  
**Output:** Mapped value or the default. E.g., for mapping `{1: "TypeA", "apple": "fruit"}`:
- Input: `1` -> Output: `"TypeA"`
- Input: `"orange"` -> Output: `"UnknownType"` (if default is "UnknownType").

```python
extractor = MapValue({1: "TypeA", 2: "TypeB", 3: "TypeC"}, default="UnknownType")
extractor.extract(1) # Output: "TypeA"
extractor.extract(5) # Output: "UnknownType"
```

---

#### Pivot

Groups data items by a specified key and applies a result extractor to each group.

**Input:** A list of dictionaries.  
**Output:** Dictionary with keys as distinct values from the input list's 'key' field, and values as the result of the `result_extractor` applied to items with the same key.

```python
extractor = Pivot("group", Raw(), exclude_key=True)
data = [
    {"group": "A", "value": "10"},
    {"group": "B", "value": "20"},
    {"group": "A", "value": "30"},
]
extractor.extract(data) # Output: {"A": [{"value": "10"}, {"value": "30"}], "B": [{"value": "20"}]}
```
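
The grouping step can be sketched with `collections.defaultdict`. This is an illustration under assumptions: `pivot` is a hypothetical function, it skips the result extractor entirely, and its `exclude_key` flag only mimics the parameter name used in the example above.

```python
from collections import defaultdict


def pivot(rows, key, exclude_key=False):
    """Hypothetical helper: group a list of dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        # Optionally drop the grouping field from each grouped item.
        item = {k: v for k, v in row.items() if not (exclude_key and k == key)}
        groups[row[key]].append(item)
    return dict(groups)


rows = [
    {"group": "A", "value": "10"},
    {"group": "B", "value": "20"},
    {"group": "A", "value": "30"},
]
print(pivot(rows, "group", exclude_key=True))
```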

---

#### SelectListItem

Retrieves an item from a list by its position or by a criterion.

**Input:** A list, like [1, 2, 3, 4].  
**Output:** Item based on position or criteria, e.g., for position 2: 3.

```python
extractor = SelectListItem()
extractor.extract([1, 2, 3]) # Output: 1

extractor = SelectListItem(position=1)
extractor.extract([10, 20, 30]) # Output: 20

extractor = SelectListItem(criteria=lambda x: x > 15)
extractor.extract([10, 20, 30]) # Output: 20
```

---

#### Select

Extracts a value from a dictionary by a key and optionally processes it.

**Input:** A dictionary, like {"name": "Alice", "age": 30}.  
**Output:** Value based on the key and optional extractor, e.g., for key "name": "Alice".

```python
extractor = Select(key="age")
extractor.extract({"name": "John", "age": 25}) # Output: 25
```

---

#### SortDictList

Sorts dictionaries in a list by a specified key.

**Input:** List of dictionaries with consistent keys, like [{"name": "Bob", "age": 30}, {"name": "Alice", "age": 25}].  
**Output:** Sorted list by `sort_key`, e.g., for `sort_key="age"`: [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}].

```python
data = [
    {"name": "Alice", "age": 28},
    {"name": "Bob", "age": 22},
    {"name": "Charlie", "age": 24},
]

extractor = SortDictList(sort_key="age")

extractor.extract(data)

# Output: [
#     {"name": "Bob", "age": 22},
#     {"name": "Charlie", "age": 24},
#     {"name": "Alice", "age": 28},
# ]
```

---

#### Split

Splits a string by a separator and applies an extractor to the substrings.

**Input:** A string with the separator, like "Apple,Banana,Cherry".  
**Output:** List of extracted values from substrings, e.g., for `sep=","`: ["Apple", "Banana", "Cherry"].


```python
extractor = Split(sep=",")
extractor.extract("apple,banana,grape") # Output: ["apple", "banana", "grape"]
```

#### Transform

Applies a transformation function to the data and optionally processes it with another extractor.

**Input:** Compatible data with the transformation function.  
**Output:** If no extractor is provided, it's the transformed data. If an extractor is given, it's the result of the extractor on the transformed data.  
**Example:** For a function that converts strings to uppercase and an extractor that reverses the string, input 'apple' gives 'ELPPA'.


```python
extractor = Transform(lambda x: x * 2, Raw())
extractor.extract(2) # Output: 4
```

---

#### Unpivot

Converts a dictionary into a list of dictionaries with specific key-value pairs.

**Input:** Dictionary with values as lists.  
**Output:** A list of dictionaries. Each dictionary has two keys: 'category' for the original dictionary's key and 'value' for the extracted value from the original list.

```python
data = {
    "Fruits": ["Apple", "Banana"],
    "Vegetables": ["Carrot", "Broccoli"]
}

extractor = Unpivot(key="category", result_extractor=String())
extractor.extract(data)

# Output: [
#     {"category": "Fruits", "value": "Apple"},
#     {"category": "Fruits", "value": "Banana"},
#     {"category": "Vegetables", "value": "Carrot"},
#     {"category": "Vegetables", "value": "Broccoli"},
# ]
```
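
Unpivoting is the inverse of grouping: each (key, list item) pair becomes its own row. A minimal plain-Python sketch follows; `unpivot` and its `value_key` parameter are hypothetical, and no result extractor is applied to the values.

```python
def unpivot(data, key="category", value_key="value"):
    """Hypothetical helper: turn {group: [items]} into a list of {key, value} rows."""
    return [
        {key: group, value_key: item}
        for group, items in data.items()
        for item in items
    ]


data = {"Fruits": ["Apple", "Banana"], "Vegetables": ["Carrot", "Broccoli"]}
for row in unpivot(data):
    print(row)
```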

---

#### When

Applies an extractor only when a given condition holds.

**Input:** Any data that the condition function can evaluate.  
**Output:** If the condition is true, the extracted data is returned. Otherwise, it returns None.

```python
extractor = When(lambda x: x == "yes", Raw())
extractor.extract("yes") # Output: "yes"
extractor.extract("no") # Output: None
```

### Dates and times

- [Date](#date)
- [DateTime](#datetime)
- [DatetimeUnix](#datetimeunix)
- [RelativeDate](#relativedate)
- [RelativeDatetime](#relativedatetime)

#### Date

Extracts a date from its string representation using the specified format.

**Input:** A string that matches the date format (e.g., "2023-05-01" for default "%Y-%m-%d" format).  
**Output:** A date object matching the input (e.g., date(2023, 5, 1)).

```python
extractor = Date()
extractor.extract("2023-05-01") # Output: datetime.date
```

---

#### DateTime

Parses a date-time string into a datetime object. Can adjust for timezones.

**Input:** A date-time string (e.g., "2023-04-30T17:00:00" for the default format).  
**Output:** The parsed datetime object, optionally adjusted to the provided timezone. If there's a parsing error, the output is None.

```python
extractor = DateTime()
extractor.extract("2023-04-30T17:00:00") # Output: datetime.datetime
```

---

#### DatetimeUnix

Turns a UNIX timestamp into a datetime object.

**Input:** A numeric representation of a UNIX timestamp in seconds (e.g., 1619856000).  
**Output:** The corresponding datetime object (e.g., 2021-05-01 08:00:00 UTC for the timestamp above).

```python
extractor = DatetimeUnix()
extractor.extract(1609459200) # Output: datetime.datetime
```
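
The same conversion can be done with the standard library, which makes it easy to check what a given timestamp means. In this sketch `from_unix` is a hypothetical helper, and returning a timezone-aware UTC value is an assumption; the extractor may return a naive or local datetime instead.

```python
from datetime import datetime, timezone


def from_unix(seconds):
    """Hypothetical helper: interpret a UNIX timestamp (in seconds) as UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(from_unix(1609459200).isoformat())  # 2021-01-01T00:00:00+00:00
```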

---

#### RelativeDate

Determines the days difference between an input date string and a reference date from the context.

**Input:** A date string (e.g., "2023-04-30" for the default format).  
**Output:** A float showing the days difference between the input date and the reference date from the context. If the dates are invalid or not provided, the output is None.

```python
extractor = RelativeDate()
extractor.extract("2023-09-05") # Output: 5
```
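
How the reference date reaches the extractor through the context is not shown here, so the sketch below passes it explicitly. It only illustrates the arithmetic: `days_between` is a hypothetical function, and the sign convention (input date minus reference date) is an assumption.

```python
from datetime import date, datetime


def days_between(value, reference, fmt="%Y-%m-%d"):
    """Hypothetical helper: days from `reference` to the date parsed from `value`."""
    try:
        parsed = datetime.strptime(value, fmt).date()
    except (TypeError, ValueError):
        # Mirror the documented behavior of returning None for invalid input.
        return None
    return float((parsed - reference).days)


print(days_between("2023-09-05", date(2023, 8, 31)))  # 5.0
print(days_between("not a date", date(2023, 8, 31)))  # None
```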

---

#### RelativeDatetime

Calculates the seconds difference between an input datetime string and a reference datetime from the context.

**Input:** A datetime string (e.g., "2023-04-30T15:30:00" for the default format).  
**Output:** A float representing the seconds difference between the input and the reference datetime. If the datetimes are invalid or not provided, the output is None.

```python
extractor = RelativeDatetime()
extractor.extract("2023-09-05T15:30:00") # Output: 4.45
```


### Encoding and categorical

- [Categorical](#categorical)
- [Ordinal](#ordinal)
- [OneHot](#onehot)
- [MultiHot](#multihot)

#### Categorical

Validates input data against a predefined set of categories.

**Input:** A string denoting a category (e.g., "category_A").  
**Output:** The input string, if it's in the set of `valid_categories`. If not, a warning is raised, but the input string is still returned without change.

```python
extractor = Categorical({"apple", "banana", "cherry"})
extractor.extract("apple") # Output: "apple"
extractor.extract("pineapple") # Output: "pineapple" (a warning is issued)

extractor = Categorical({"apple", "banana", "cherry"}, raise_on_warning=True)
extractor.extract("pineapple") # Exception

```

---

#### Ordinal

Transforms categorical data into its ordinal representation based on a defined order or explicit mapping.

**Input:** A string representing a category (e.g., "medium").  
**Output:** An integer that denotes the ordinal position of the input category.  
- If the `ordered_categories` is a list like ["low", "medium", "high"] and the input is "medium", the output is 1.  
- If the `ordered_categories` is a dictionary like {"low": 0, "medium": 5, "high": 10} and the input is "medium", the output is 5.

```python
extractor = Ordinal(["low", "medium", "high"])
extractor.extract("medium") # Output: 1
```
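
The two mapping styles described above (an ordered list versus an explicit dictionary) can be sketched in plain Python. `ordinal` is a hypothetical function, and its behavior for unknown categories (raising `KeyError` or `ValueError`) is an assumption rather than documented library behavior.

```python
def ordinal(category, ordered_categories):
    """Hypothetical helper: map a category to its ordinal value."""
    if isinstance(ordered_categories, dict):
        # Explicit mapping: use the value stored for the category.
        return ordered_categories[category]
    # Ordered list: the position in the list is the ordinal value.
    return list(ordered_categories).index(category)


print(ordinal("medium", ["low", "medium", "high"]))
print(ordinal("medium", {"low": 0, "medium": 5, "high": 10}))
```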

---

#### OneHot

One-hot encodes the input data according to predefined categories.

**Input:** A string that represents a category (e.g., "category_A").  
**Output:** A dictionary where each key is a category from the `categories` list and the corresponding value is either 1 (if the input matches the category) or 0 (if it doesn't).  
For instance, if `categories` = ["category_A", "category_B", "category_C"] and input is "category_A", the output is: {"category_A": 1, "category_B": 0, "category_C": 0}.

```python
extractor = OneHot(["cat", "dog", "bird"])
extractor.extract("cat") # Output: {"cat": 1, "dog": 0, "bird": 0}
```

---

#### MultiHot

Encodes a list of categories into a multi-hot representation based on a list of predefined categories.

**Input:** A list of strings, where each string represents a category (e.g., ["category_A", "category_B"]).  
**Output:** A list of integers (0 or 1) that indicates the presence or absence of each category in the `categories` list.  
Example: If `categories` = ["category_A", "category_B", "category_C"] and the input list is ["category_A", "category_B"], the output is: [1, 1, 0].

```python
extractor = MultiHot(["cat", "dog", "bird"])
extractor.extract(["cat", "bird"]) # Output: [1, 0, 1]
```
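
Both encodings reduce to a membership test per category. The sketch below shows them side by side in plain Python; `one_hot` and `multi_hot` are hypothetical functions that follow the output shapes described above (a dictionary for one-hot, a list for multi-hot), and silently ignoring unknown categories is an assumption.

```python
def one_hot(value, categories):
    """Hypothetical helper: dict with 1 for the matching category, 0 elsewhere."""
    return {category: int(category == value) for category in categories}


def multi_hot(values, categories):
    """Hypothetical helper: list of 0/1 flags, one per category, in category order."""
    present = set(values)
    return [int(category in present) for category in categories]


categories = ["cat", "dog", "bird"]
print(one_hot("cat", categories))
print(multi_hot(["cat", "bird"], categories))
```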

## Custom Extractors

In scenarios where the provided built-in extractors aren't adequate for specific data transformation needs, you can create custom extractors by subclassing the `Extractor` class and implementing the `extract` method.

### Example: Temperature Extractor

Let's consider a situation where you have temperature data in Kelvin and want to extract it in Celsius and Fahrenheit formats. 

Here's how you can create a custom `Temperature` extractor:

```python
from typing import Any

from extract_transform import Extractor


class Temperature(Extractor):
    """
    Extracts the Kelvin temperature and calculates the temperatures in Celsius and Fahrenheit.
    """

    def extract(self, data: Any):
        kelvin = data["temp"]
        celsius = kelvin - 273.15
        fahrenheit = kelvin * 9 / 5 - 459.67

        return {"celsius": celsius, "fahrenheit": fahrenheit}


Temperature().extract({"temp": 310}) # Output (rounded): {"celsius": 36.85, "fahrenheit": 98.33}
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/frederikvanhevel/extract-transform",
    "name": "extract-transform",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7,<4.0",
    "maintainer_email": "",
    "keywords": "ETL,data,transformation,extraction,processing,machine,learning,pipeline",
    "author": "Frederik Vanhevel",
    "author_email": "frederik.vanhevel@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/4d/36/6991d7ca5262695d8ddaf20f532afb7f4163e15dc25e52f71072ec01a6db/extract_transform-1.0.10.tar.gz",
    "platform": null,
    "description": "# \ud83e\uddec Extract Transform\n[![PyPI version](https://badge.fury.io/py/extract-transform.svg)](https://badge.fury.io/py/extract-transform) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n`Extract Transform` is a Python library offering robust tools for encoding, decoding, and transforming complex data structures. Designed for efficient nested data manipulation and custom data type handling, it's an indispensable tool for preparing and structuring data, especially in machine learning workflows.\n\n# Installation\n\nInstall the library via pip:\n\n```shell\npip install extract-transform\n```\n\nor poetry:\n\n```shell\npoetry add extract-transform\n```\n\n# Usage\n\n`Extract Transform` comprises a variety of extractors, each tailored to handle specific types or transformations. To utilize the library's capabilities, you would typically select the relevant extractor and apply it to your data.\n\n## Basic example\n\n```python\nfrom extract_transform import Record, Transform\n\n# define the extractor\nextractor = Record({\"key\": Transform(lambda s: s.upper())})\n\n# use the extractor on your data\nresult = extractor.extract({\"key\": \"value\"}) # Output: {\"key\": \"VALUE\"}\n```\n\n## Advanced Examples\n\nFor more intricate use-cases and advanced examples, please refer to the following:\n\n- [Twitter API Example](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/twitter_api.py)\n- [Machine learning preprocessing](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/machine_learning_preprocessing.py)\n- [OpenWeather API Example](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/openweather_api.py)\n- [Experian API Example](https://github.com/frederikvanhevel/extract-transform/blob/master/examples/experian_api.py)\n\n\n## Available Extractors\n\n- [Basic types](#basic-types)\n- [Complex types](#complex-types)\n- 
[Data manipulation](#data-manipulation)\n- [Custom extractors](#custom-extractors)\n\n### Basic types\n\n- [Boolean](#boolean)\n- [Decimal](#decimal)\n- [Float](#float)\n- [Integer](#integer)\n- [String](#string)\n- [Hexadecimal](#hexadecimal)\n- [Raw](#raw)\n\n#### Boolean\n\nConverts data to a boolean using provided truthy and falsy values.\n\n**Input:** \"true\", \"false\", 1, 0\n**Output:** `True` or `False`.\n\n```python\nextractor = Boolean()\nextractor.extract(\"true\") # Output: True\n\nextractor = Boolean(truthy_values=[\"yes\"], falsy_values=[\"no\"])\nextractor.extract(\"yes\") # Output: True\n```\n\n---\n\n#### Decimal\n\nConverts data to a decimal with specified precision and scale.\n\n**Input:** \"123.456\", 123.456, \"123\", 123, etc.  \n**Output:** Rounded decimal.Decimal value.\n\n```python\nextractor = Decimal()\nextractor.extract(\"123.456\") # Output: decimal.Decimal\n```\n---\n\n#### Float\n\nConverts data to a float.\n\n**Input:** \"123.456\", 123.456, \"123\", 123, etc.  \n**Output:** Float value.\n\n```python\nextractor = Float()\nextractor.extract(\"123.456\") # Output: 123.456\n```\n\n\n\n---\n\n#### Integer\n\nConverts data to its integer representation.\n\n**Input:** \"12345\", 12345, \"1a3f\" (hexadecimal), etc.  \n**Output:** Integer value.\n\n```python\nextractor = Integer()\nextractor.extract(\"12345\") # Output: 12345\n```\n\n---\n\n#### String\n\nConverts data to its string representation.\n\n**Input:** 12345, True, [1, 2, 3], etc.  \n**Output:** String representation, e.g., \"12345\".\n\n```python\nextractor = String()\nextractor.extract(550) # Output: \"550\"\n```\n\n---\n\n#### Hexadecimal\n\nConverts a hexadecimal string to an integer.\n\n**Input:** \"1a3f\", \"fa3c\", etc.  
\n**Output:** Integer representation of the hexadecimal.\n\n```python\nextractor = Hexadecimal()\nextractor.extract(\"1a3f\") # Output: 6719\n```\n\n---\n\n#### Raw\n\nReturns data as-is without processing.\n\n**Input:** 12345, \"Hello\", {\"key\": \"value\"}, etc.  \n**Output:** Input data without any alterations.\n\n```python\nextractor = Raw()\nextractor.extract(\"12345\") # Output: \"12345\"\n```\n\n### Complex types\n\n- [Array](#array)\n- [Record](#record)\n- [NumericWithCodes](#numericwithcodes)\n\n#### Array\n\nProcesses input into a list, transforming each item based on a provided extractor.\n\n**Input:** A list or a single item. E.g., [\"Alice\", \"Bob\"] or \"Alice\".  \n**Output:** List with items processed according to the extractor, e.g., [\"Alice\", \"Bob\"].\n\n```python\nextractor = Array(Integer())\nextractor.extract([\"20\", \"30\"]) # Output: [20, 30]\n```\n\n---\n\n#### Record\n\nProcesses a dictionary by transforming values based on provided field mappings.\n\n**Input:** Dictionary with fields like {\"name\": \"Alice\", \"age\": 30}.  \n**Output:** New dictionary with mapped fields, e.g., {\"name\": \"Alice\", \"age\": 30}.\n\n```python\nextractor = Record({\n    (\"identity\", \"id\"): Integer(),\n    \"amount\": Integer()\n})\n\nextractor.extract({\n    \"identity\": 5,\n    \"amount\": \"25\"\n}) # Output: {\"id\": 5, \"amount\": 25}\n```\n---\n\n#### NumericWithCodes\n\nExtracts numeric values and categorizes them based on boundaries. If the value lies within boundaries, it's returned as-is; otherwise, returned as a string.\n\n**Input:** Numeric representations like 5, 5.0, \"5.0\", or Decimal('5.0').  \n**Output:** Dictionary with 'value' and 'categorical' keys. 
E.g., {\"value\": 5, \"categorical\": None} or {\"value\": None, \"categorical\": \"15\"}.\n\n```python\nextractor = NumericWithCodes(\n    Integer(),\n    min_val=1,\n    max_val=100\n)\n\nextractor.extract(55) # Output: {\"value\": 55, \"categorical\": None}\nextractor.extract(9999) # Output: {\"value\": None, \"categorical\": \"9999\"}\n```\n\n\n### Data manipulation\n\n- [Compose](#compose)\n- [Count](#count)\n- [DefaultValue](#defaultvalue)\n- [DictMap](#dictmap)\n- [Exists](#exists)\n- [Filter](#filter)\n- [Flatten](#flatten)\n- [MapValue](#mapvalue)\n- [Pivot](#pivot)\n- [SelectListItem](#selectlistitem)\n- [Select](#select)\n- [SortDictList](#sortdictlist)\n- [Split](#split)\n- [Transform](#transform)\n- [Unpivot](#unpivot)\n- [When](#when)\n\n\n#### Compose\n\nChains multiple extractors, passing the output of one as the input to the next.\n\n**Input:** Data compatible with the first extractor, e.g., if the first expects a string, provide a string.  \n**Output:** Data processed by all extractors. The nature depends on the sequence, e.g., if the last returns an integer, the output will be an integer.\n\n```python\nextractor = Compose(Boolean(), Integer())\nextractor.extract(\"true\") # Output: 1\n```\n\n---\n\n#### Count\n\nCounts the items in a list based on a given predicate.\n\n**Input:** A list of items.  \n**Output:** Integer representing the count of items satisfying the predicate.\n\n```python\nextractor = Count()\nextractor.extract([1, 2, 3, 4]) # Output: 4\n\nextractor = Count(lambda x: x > 2)\nextractor.extract([1, 2, 3, 4]) # Output: 2\n```\n\n---\n\n#### DefaultValue\n\nReturns the input if it's not None; otherwise, a default value.\n\n**Input:** Any data type or None.  
\n**Output:** Input data if it's not None; otherwise, the default value.\n\n```python\nextractor = DefaultValue(1000)\nextractor.extract(None) # Output: 1000\nextractor.extract(550) # Output: 550\n```\n\n---\n\n#### DictMap\n\nProcesses each dictionary value through a specified extractor, returning the processed dictionary.\n\n**Input:** Dictionary with arbitrary keys and values, e.g., {\"name\": \"Alice\", \"age\": \"30\"}.  \n**Output:** Dictionary with processed values, e.g., with an integer extractor: {\"name\": \"Alice\", \"age\": 30}.\n\n```python\nextractor = DictMap(Integer())\nextractor.extract({\"a\": \"10\", \"b\": \"20\"}) # Output: {\"a\": 10, \"b\": 20}\n```\n\n---\n\n#### Exists\n\nChecks if a specified key exists in the given data.\n\n**Input:** Data that supports the \"in\" operation, typically dictionaries or lists. E.g., `{\"name\": \"Alice\", \"age\": 30}` or `[\"Alice\", \"Bob\", \"Charlie\"]`.  \n**Output:** Boolean indicating the key's existence. E.g., for key \"name\" and dictionary input: `True`.\n\n```python\nextractor = Exists(\"Alice\")\nextractor.extract([\"Alice\", \"Bob\", \"Charlie\"]) # Output: True\n```\n---\n\n#### Filter\n\nFilters items in a list based on a predicate.\n\n**Input:** A list of items.  \n**Output:** A list of items that satisfy the predicate.\n\n```python\nextractor = Filter(lambda x: x > 2)\nextractor.extract([1, 2, 3, 4]) # Output: [3, 4]\n```\n\n---\n\n#### Flatten\n\nFlattens a nested dictionary into a single-level dictionary with compound keys.\n\n**Input:** A possibly nested dictionary. E.g., `{\"a\": {\"b\": 1, \"c\": {\"d\": 2}}}`.  \n**Output:** A single-level dictionary. E.g., `{\"a.b\": 1, \"a.c.d\": 2}`.\n\n```python\nextractor = Flatten()\nextractor.extract({\"a\": {\"b\": 1, \"c\": {\"d\": 2}}}) # Output: {\"a.b\": 1, \"a.c.d\": 2}\n```\n\n---\n\n#### MapValue\n\nMaps input values to a representation based on a provided mapping.\n\n**Input:** A value that might exist in the mapping. 
E.g., `1` or `\"apple\"`.  \n**Output:** Mapped value or the default. E.g., for mapping `{1: \"TypeA\", \"apple\": \"fruit\"}`:\n- Input: `1` -> Output: `\"TypeA\"`\n- Input: `\"orange\"` -> Output: `\"UnknownType\"` (if default is \"UnknownType\").\n\n```python\nextractor = MapValue({1: \"TypeA\", 2: \"TypeB\", 3: \"TypeC\"}, default=\"UknownType\")\nextractor.extract(1) # Output: \"TypeA\"\nextractor.extract(5) # Output: \"UknownType\"\n```\n\n---\n\n#### Pivot\n\nGroups data items by a specified key and applies a result extractor to each group.\n\n**Input:** A list of dictionaries.  \n**Output:** Dictionary with keys as distinct values from the input list's 'key' field, and values as the result of the `result_extractor` applied to items with the same key.\n\n```python\nextractor = Pivot(\"group\", Raw(), exclude_key=True)\ndata = [\n    {\"group\": \"A\", \"value\": \"10\"},\n    {\"group\": \"B\", \"value\": \"20\"},\n    {\"group\": \"A\", \"value\": \"30\"},\n]\nextractor.extract(data) # Output: {\"A\": [{\"value\": \"10\"}, {\"value\": \"30\"}], \"B\": [{\"value\": \"20\"}]\n```\n\n---\n\n#### SelectListItem\n\nRetrieves an item from a list by its position or a criteria.\n\n**Input:** A list, like [1, 2, 3, 4].  \n**Output:** Item based on position or criteria, e.g., for position 2: 3.\n\n```python\nextractor = SelectListItem()\nextractor.extract([1, 2, 3]) # Output: 1\n\nextractor = SelectListItem(position=1)\nextractor.extract([10, 20, 30]) # Output: 20\n\nextractor = SelectListItem(criteria=lambda x: x > 15)\nextractor.extract([10, 20, 30]) # Output: 20\n```\n\n---\n\n#### Select\n\nExtracts a value from a dictionary by a key and optionally processes it.\n\n**Input:** A dictionary, like {\"name\": \"Alice\", \"age\": 30}.  
\n**Output:** Value based on the key and optional extractor, e.g., for key \"name\": \"Alice\".\n\n```python\nextractor = Select(key=\"age\")\nextractor.extract({\"name\": \"John\", \"age\": 25}) # Output: 25\n```\n\n---\n\n#### SortDictList\n\nSorts dictionaries in a list by a specified key.\n\n**Input:** List of dictionaries with consistent keys, like [{\"name\": \"Bob\", \"age\": 30}, {\"name\": \"Alice\", \"age\": 25}].  \n**Output:** Sorted list by `sort_key`, e.g., for `sort_key=\"age\"`: [{\"name\": \"Alice\", \"age\": 25}, {\"name\": \"Bob\", \"age\": 30}].\n\n```python\ndata = [\n    {\"name\": \"Alice\", \"age\": 28},\n    {\"name\": \"Bob\", \"age\": 22},\n    {\"name\": \"Charlie\", \"age\": 24},\n]\n\nextractor = SortDictList(sort_key=\"age\")\n\nextractor.extract(data)\n\n# Output: [\n#     {\"name\": \"Bob\", \"age\": 22},\n#     {\"name\": \"Charlie\", \"age\": 24},\n#     {\"name\": \"Alice\", \"age\": 28},\n# ]\n```\n\n---\n\n#### Split\n\nSplits a string by a separator and applies an extractor to the substrings.\n\n**Input:** A string with the separator, like \"Apple,Banana,Cherry\".  \n**Output:** List of extracted values from substrings, e.g., for `sep=\",\"`: [\"Apple\", \"Banana\", \"Cherry\"].\n\n\n```python\nextractor = Split(sep=\",\")\nextractor.extract(\"apple,banana,grape\") # Output: [\"apple\", \"banana\", \"grape\"]\n```\n\n#### Transform\n\nApplies a transformation function to the data and optionally processes it with another extractor.\n\n**Input:** Compatible data with the transformation function.  \n**Output:** If no extractor is provided, it's the transformed data. If an extractor is given, it's the result of the extractor on the transformed data.  
\n**Example:** For a function that uppercases strings and an extractor that reverses the string, input 'apple' gives 'ELPPA'.\n\n```python\nextractor = Transform(lambda x: x * 2, Raw())\nextractor.extract(2) # Output: 4\n```\n\n---\n\n#### Unpivot\n\nConverts a dictionary into a list of dictionaries with specific key-value pairs.\n\n**Input:** Dictionary with values as lists.  \n**Output:** A list of dictionaries. Each dictionary has two keys: 'category' for the original dictionary's key and 'value' for the extracted value from the original list.\n\n```python\ndata = {\n    \"Fruits\": [\"Apple\", \"Banana\"],\n    \"Vegetables\": [\"Carrot\", \"Broccoli\"]\n}\n\nextractor = Unpivot(key=\"category\", result_extractor=String())\nextractor.extract(data)\n\n# Output: [\n#     {\"category\": \"Fruits\", \"value\": \"Apple\"},\n#     {\"category\": \"Fruits\", \"value\": \"Banana\"},\n#     {\"category\": \"Vegetables\", \"value\": \"Carrot\"},\n#     {\"category\": \"Vegetables\", \"value\": \"Broccoli\"},\n# ]\n```\n\n---\n\n#### When\n\nApplies an extractor only when a given condition holds.\n\n**Input:** Any data that the condition function can evaluate.  \n**Output:** If the condition is true, the extracted data is returned. Otherwise, it returns None.\n\n```python\nextractor = When(lambda x: x == \"yes\", Raw())\nextractor.extract(\"yes\") # Output: \"yes\"\nextractor.extract(\"no\") # Output: None\n```\n\n### Dates and times\n\n- [Date](#date)\n- [DateTime](#datetime)\n- [DatetimeUnix](#datetimeunix)\n- [RelativeDate](#relativedate)\n- [RelativeDatetime](#relativedatetime)\n\n#### Date\n\nExtracts a date from its string representation using the specified format.\n\n**Input:** A string that matches the date format (e.g., \"2023-05-01\" for the default \"%Y-%m-%d\" format).  
\n**Output:** A date object matching the input (e.g., date(2023, 5, 1)).\n\n```python\nextractor = Date()\nextractor.extract(\"2023-05-01\") # Output: a datetime.date object\n```\n\n---\n\n#### DateTime\n\nParses a date-time string into a datetime object. Can adjust for timezones.\n\n**Input:** A date-time string (e.g., \"2023-04-30T17:00:00\" for the default format).  \n**Output:** The parsed datetime object, optionally adjusted to the provided timezone. If there's a parsing error, the output is None.\n\n```python\nextractor = DateTime()\nextractor.extract(\"2023-04-30T17:00:00\") # Output: a datetime.datetime object\n```\n\n---\n\n#### DatetimeUnix\n\nTurns a UNIX timestamp into a datetime object.\n\n**Input:** A numeric representation of a UNIX timestamp in seconds (e.g., 1619856000).  \n**Output:** The corresponding datetime object (e.g., 2021-05-01 08:00:00 when the timestamp is interpreted as UTC).\n\n```python\nextractor = DatetimeUnix()\nextractor.extract(1609459200) # Output: a datetime.datetime object\n```\n\n---\n\n#### RelativeDate\n\nDetermines the difference in days between an input date string and a reference date from the context.\n\n**Input:** A date string (e.g., \"2023-04-30\" for the default format).  \n**Output:** A float giving the difference in days between the input date and the reference date from the context. If the dates are invalid or not provided, the output is None.\n\n```python\nextractor = RelativeDate()\nextractor.extract(\"2023-09-05\") # Output: e.g. 5, depending on the reference date in the context\n```\n\n---\n\n#### RelativeDatetime\n\nCalculates the difference in seconds between an input datetime string and a reference datetime from the context.\n\n**Input:** A datetime string (e.g., \"2023-04-30T15:30:00\" for the default format).  \n**Output:** A float representing the difference in seconds between the input and the reference datetime. 
If the datetimes are invalid or not provided, the output is None.\n\n```python\nextractor = RelativeDatetime()\nextractor.extract(\"2023-09-05T15:30:00\") # Output: e.g. 4.45, depending on the reference datetime in the context\n```\n\n### Encoding and categorical\n\n- [Categorical](#categorical)\n- [Ordinal](#ordinal)\n- [OneHot](#onehot)\n- [MultiHot](#multihot)\n\n#### Categorical\n\nValidates input data against a predefined set of categories.\n\n**Input:** A string denoting a category (e.g., \"category_A\").  \n**Output:** The input string, if it's in the set of `valid_categories`. If not, a warning is issued, but the input string is still returned without change.\n\n```python\nextractor = Categorical({\"apple\", \"banana\", \"cherry\"})\nextractor.extract(\"apple\") # Output: \"apple\"\nextractor.extract(\"pineapple\") # Output: \"pineapple\" (with a warning)\n\nextractor = Categorical({\"apple\", \"banana\", \"cherry\"}, raise_on_warning=True)\nextractor.extract(\"pineapple\") # Raises an exception\n```\n\n---\n\n#### Ordinal\n\nTransforms categorical data into its ordinal representation based on a defined order or explicit mapping.\n\n**Input:** A string representing a category (e.g., \"medium\").  \n**Output:** An integer that denotes the ordinal position of the input category.  \n- If the `ordered_categories` is a list like [\"low\", \"medium\", \"high\"] and the input is \"medium\", the output is 1.  \n- If the `ordered_categories` is a dictionary like {\"low\": 0, \"medium\": 5, \"high\": 10} and the input is \"medium\", the output is 5.\n\n```python\nextractor = Ordinal([\"low\", \"medium\", \"high\"])\nextractor.extract(\"medium\") # Output: 1\n```\n\n---\n\n#### OneHot\n\nOne-hot encodes the input data according to predefined categories.\n\n**Input:** A string that represents a category (e.g., \"category_A\").  \n**Output:** A dictionary where each key is a category from the `categories` list and the corresponding value is either 1 (if the input matches the category) or 0 (if it doesn't).  
\nFor instance, if `categories` = [\"category_A\", \"category_B\", \"category_C\"] and input is \"category_A\", the output is: {\"category_A\": 1, \"category_B\": 0, \"category_C\": 0}.\n\n```python\nextractor = OneHot([\"cat\", \"dog\", \"bird\"])\nextractor.extract(\"cat\") # Output: {\"cat\": 1, \"dog\": 0, \"bird\": 0}\n```\n\n---\n\n#### MultiHot\n\nEncodes a list of categories into a multi-hot representation based on a list of predefined categories.\n\n**Input:** A list of strings, where each string represents a category (e.g., [\"category_A\", \"category_B\"]).  \n**Output:** A list of integers (0 or 1) that indicates the presence or absence of each category in the `categories` list.  \nExample: If `categories` = [\"category_A\", \"category_B\", \"category_C\"] and the input list is [\"category_A\", \"category_B\"], the output is: [1, 1, 0].\n\n```python\nextractor = MultiHot([\"cat\", \"dog\", \"bird\"])\nextractor.extract([\"cat\", \"bird\"]) # Output: [1, 0, 1]\n```\n\n## Custom Extractors\n\nWhen the built-in extractors don't cover a specific data transformation, you can create a custom extractor by subclassing the `Extractor` class and implementing the `extract` method.\n\n### Example: Temperature Extractor\n\nSuppose you have temperature data in Kelvin and want to extract it in Celsius and Fahrenheit.\n\nHere's how you can create a custom `Temperature` extractor:\n\n```python\nfrom typing import Any\n\nfrom extract_transform import Extractor\n\nclass Temperature(Extractor):\n    \"\"\"\n    Extracts the Kelvin temperature and calculates the temperatures in Celsius and Fahrenheit.\n    \"\"\"\n\n    def extract(self, data: Any):\n        kelvin = data[\"temp\"]\n        celsius = kelvin - 273.15\n        fahrenheit = kelvin * 9 / 5 - 459.67\n\n        return {\"celsius\": celsius, \"fahrenheit\": fahrenheit}\n\n\nTemperature().extract({\"temp\": 310}) # Output (approximately): {\"celsius\": 36.85, \"fahrenheit\": 98.33}\n```",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python library for efficient encoding, decoding, and transformation of complex data, ideal for machine learning workflows.",
    "version": "1.0.10",
    "project_urls": {
        "Documentation": "https://github.com/frederikvanhevel/extract-transform",
        "Homepage": "https://github.com/frederikvanhevel/extract-transform",
        "Repository": "https://github.com/frederikvanhevel/extract-transform"
    },
    "split_keywords": [
        "etl",
        "data",
        "transformation",
        "extraction",
        "processing",
        "machine",
        "learning",
        "pipeline"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "321e3cb0f6a321f506b7e1a40e55447f2c81623233e07d3867574118c0806ce7",
                "md5": "0d9d88531aedc5e908c873edc0729d73",
                "sha256": "d036b164aeac80dccb066940684b8fd6c6b5b02e1a83f35f3f2cdc4b7a0f7ffc"
            },
            "downloads": -1,
            "filename": "extract_transform-1.0.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0d9d88531aedc5e908c873edc0729d73",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7,<4.0",
            "size": 54301,
            "upload_time": "2023-09-06T19:12:09",
            "upload_time_iso_8601": "2023-09-06T19:12:09.369775Z",
            "url": "https://files.pythonhosted.org/packages/32/1e/3cb0f6a321f506b7e1a40e55447f2c81623233e07d3867574118c0806ce7/extract_transform-1.0.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4d366991d7ca5262695d8ddaf20f532afb7f4163e15dc25e52f71072ec01a6db",
                "md5": "c4f9e2088322f1479b397a742ead13cd",
                "sha256": "281c1ec6641792bf0e5407e480c808223a0ae49e731fb8b01c5478888c7978c3"
            },
            "downloads": -1,
            "filename": "extract_transform-1.0.10.tar.gz",
            "has_sig": false,
            "md5_digest": "c4f9e2088322f1479b397a742ead13cd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7,<4.0",
            "size": 30423,
            "upload_time": "2023-09-06T19:12:11",
            "upload_time_iso_8601": "2023-09-06T19:12:11.098511Z",
            "url": "https://files.pythonhosted.org/packages/4d/36/6991d7ca5262695d8ddaf20f532afb7f4163e15dc25e52f71072ec01a6db/extract_transform-1.0.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-06 19:12:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "frederikvanhevel",
    "github_project": "extract-transform",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "extract-transform"
}
        