# Convector
## Introduction
**Convector** unifies conversational datasets into a consistent format. It reads JSONL, Parquet, Zstandard (Zst), JSON.GZ, CSV, and TXT files and converts them to JSONL. Data can optionally be filtered during the transformation. Two output formats are available: the default format and `chat_completion`, which follows OpenAI's chat format.
## Installation
**Convector** can be installed either via PyPI or directly from GitHub.
- **Using PyPI**: Run `pip install convector` in the terminal.
- **Using GitHub**: Clone and install using the following commands:
```bash
git clone https://github.com/teilomillet/convector
cd convector
pip install .
```
## Usage
**Convector** provides a command-line interface for easy data processing with various customization options.
- **Basic Command**: `convector process <file_path> [OPTIONS]`
- **Options**:
  - `-p, --profile`: Predefined profile from the config (default is 'default').
  - `-c, --conversation`: Enable processing of conversational exchanges.
  - `--instruction`: Key for instructions or system messages.
  - `-i, --input-key`: Key for user inputs.
  - `-o, --output-key`: Key for bot responses.
  - `-s, --schema`: Schema of the output data.
  - `--filter`: Filter conditions in "field,operator,value" format.
  - `-l, --limit`: Maximum number of lines to process.
  - `--bytes`: Maximum number of bytes to process.
  - `-f, --file-out`: File for the transformed data.
  - `-d, --dir-out`: Directory for output files.
  - `-v, --verbose`: Enable detailed logs.
- **Example Commands**:
---------------------------------------
- Process each file in a folder:
```bash
convector process /path/to/data/
```
---------------------------------------
- Process the file `data.jsonl` as conversational data (`-c`), keep only the records with an `id` under 10500, and save the output to `/output_dir/output.jsonl`:
```bash
convector process /data.jsonl -c --filter "id<10500" -f output.jsonl -d /output_dir/
```
---------------------------------------
- Process the file `data.parquet` and output the data in the `chat_completion` format, keeping the `id` and `user_id` fields in each row. The output is saved as `data_tr.jsonl` in the default output location (`convector/silo`):
```bash
convector process /data.parquet --filter "id;user_id" --schema chat_completion
```
---------------------------------------
- Register a profile named `sampler`, process `333` lines of the file `data.json`, and save the output in the `chat_completion` format to a file named `sampler.jsonl`:
```bash
convector process /data.json -p sampler -l 333 -s chat_completion -f sampler.jsonl
```
---------------------------------------
- Process all the files in the folder `/data`, reusing the options previously saved in the profile `sampler` (see above):
```bash
convector process /data/ -p sampler
```
---------------------------------------
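The options list describes `--filter` conditions as "field,operator,value". As a rough illustration of what such a condition could mean, the sketch below parses that form and applies it to a few records. It is not Convector's implementation: the operator table, the function names `parse_filter` and `matches`, and the numeric-comparison rule are all assumptions made for the example.

```python
import operator

# Hypothetical operator table; the operators Convector accepts may differ.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def parse_filter(condition):
    """Split a "field,operator,value" string into its three parts."""
    field, op, value = (part.strip() for part in condition.split(",", 2))
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    return field, OPERATORS[op], value


def matches(record, condition):
    """Return True if the record satisfies the condition."""
    field, compare, raw_value = parse_filter(condition)
    if field not in record:
        return False
    actual = record[field]
    if isinstance(actual, (int, float)) and not isinstance(actual, bool):
        # Numeric field: compare as numbers (assumed behaviour).
        return compare(actual, float(raw_value))
    # Anything else is compared as text.
    return compare(str(actual), raw_value)


records = [{"id": 10499, "input": "hi"}, {"id": 10500, "input": "hello"}]
kept = [r for r in records if matches(r, "id,<,10500")]
```

Here `kept` holds only the record whose `id` is below 10500, which mirrors the intent of the second example command.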
## Advanced Features
- **Conversational Data Handling**: **Convector** processes nested conversational data. With the `--conversation` flag, it identifies and handles complex conversation structures, generating a `conversation_id` automatically when one is needed.
- **Customization**: Users can choose which data fields are retained during processing with the `--filter` option. By default, **Convector** keeps `instruction`, `input`, and `output`. Additional fields can be included as required.
- **Folder Handling**: **Convector** can walk through a folder and process the data files inside it. If no `--file-out` is specified, the output file is named after the input file with `_tr` appended.
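The default naming rule can be pictured with a short sketch. It assumes the output name is the input name without its extension, followed by `_tr.jsonl`, as in the `data.parquet` to `data_tr.jsonl` example above. The helper `default_output_name` and the suffix list are hypothetical and only illustrate the idea; Convector's own logic may differ.

```python
from pathlib import Path

# Extensions taken from the formats listed in the introduction (assumed list).
KNOWN_SUFFIXES = (".json.gz", ".jsonl", ".json", ".parquet", ".zst", ".csv", ".txt")


def default_output_name(input_path):
    """Derive "<name>_tr.jsonl" from an input file path."""
    name = Path(input_path).name
    for suffix in KNOWN_SUFFIXES:
        if name.lower().endswith(suffix):
            # Drop the recognised extension, keeping the base name.
            name = name[: -len(suffix)]
            break
    return f"{name}_tr.jsonl"


example = default_output_name("/data.parquet")
```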
## Configuration and Customization
- **Profile Customization**: Users can define custom profiles for different data processing tasks in `config.yaml`. A profile is saved and updated automatically whenever it is used with new options.
- **Schema Application**: **Convector** allows for the application of custom schemas to tailor the output according to specific requirements.
- Default Schema:
```json
{"instruction":"","input":"","output":"","source":""}
```
- Chat_completion Schema:
```json
{
  "messages": [
    {"role": "system", "content": ""},
    {"role": "user", "content": ""},
    {"role": "assistant", "content": ""}
  ],
  "source": ""
}
```
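To make the relationship between the two schemas concrete, here is a minimal sketch that maps a record in the default schema to the `chat_completion` layout. The function name `to_chat_completion` is hypothetical, and the mapping (`instruction` to the system message, `input` to the user message, `output` to the assistant message) is an assumption based on the option descriptions, not a copy of Convector's code.

```python
import json


def to_chat_completion(record):
    """Convert a default-schema record to the chat_completion layout."""
    messages = [
        {"role": "system", "content": record.get("instruction", "")},
        {"role": "user", "content": record.get("input", "")},
        {"role": "assistant", "content": record.get("output", "")},
    ]
    return {"messages": messages, "source": record.get("source", "")}


row = {"instruction": "Be brief.", "input": "Hi", "output": "Hello!", "source": "demo"}
# One JSON object per line, as in a JSONL file.
line = json.dumps(to_chat_completion(row))
```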