The **llm-dataset-converter** library allows conversion between
various dataset formats for large language models (LLMs).
Filters can be supplied as well, e.g., for cleaning up the data.

Dataset formats:

- pairs: alpaca (r/w), csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), xtuner (r/w)
- pretrain: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
- translation: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
- classification: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w)

Compression formats:

- bzip
- gzip
- xz
- zstd

Examples:

Simple conversion with logging info::

  llm-convert \
    from-alpaca \
      -l INFO \
      --input ./alpaca_data_cleaned.json \
    to-csv-pr \
      -l INFO \
      --output alpaca_data_cleaned.csv

Automatic decompression/compression (based on file extension)::

  llm-convert \
    from-alpaca \
      --input ./alpaca_data_cleaned.json.xz \
    to-csv-pr \
      --output alpaca_data_cleaned.csv.gz

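The same mechanism covers the other listed compression formats as well; a minimal sketch,
assuming zstd output uses the common `.zst` extension (file names are illustrative)::

  llm-convert \
    from-alpaca \
      --input ./alpaca_data_cleaned.json \
    to-csv-pr \
      --output alpaca_data_cleaned.csv.zst
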
Filtering::

  llm-convert \
    -l INFO \
    from-alpaca \
      -l INFO \
      --input alpaca_data_cleaned.json \
    keyword \
      -l INFO \
      --keyword function \
      --location any \
      --action keep \
    to-alpaca \
      -l INFO \
      --output alpaca_data_cleaned-filtered.json

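Filters and compression can be combined freely; a minimal sketch that reuses only the plugins and
options shown above, reading xz-compressed input and writing gzip-compressed output (file names
are illustrative)::

  llm-convert \
    from-alpaca \
      --input ./alpaca_data_cleaned.json.xz \
    keyword \
      --keyword function \
      --location any \
      --action keep \
    to-alpaca \
      --output alpaca_data_cleaned-filtered.json.gz
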
Examples can be found here:
https://github.com/waikato-llm/llm-dataset-converter-examples

Changelog
=========

0.2.8 (2025-07-15)
------------------
- requiring seppl>=0.2.20 now for improved help requests in `llm-convert` tool
0.2.7 (2025-07-11)
------------------
- added `set-placeholder` filter for dynamically setting (temporary) placeholders at runtime
- using `wai_logging` instead of `wai.logging` now
- added `remove-strings` filter that just removes sub-strings
- added `strip-strings` filter for stripping whitespaces from start/end of strings
- requiring `seppl>=0.2.17` now to avoid deprecated use of pkg_resources
0.2.6 (2025-03-14)
------------------
- switched to underscores in project name
- requiring seppl>=0.2.13 now
- added support for aliases
- added `discard-by-name` filter, which uses the `file` field in the meta-data for its matching
- added placeholder support
- method `text_utils.empty_str_if_none` now handles bool/int/float as well
- CSV/TSV writers now have an `--encoding` option to use a specific encoding other than the default, e.g., UTF-8
0.2.5 (2024-12-20)
------------------
- added `setuptools` as dependency
0.2.4 (2024-07-05)
------------------
- requiring seppl>=0.2.6 now
- readers use default globs now, allowing the user to simply supply directories as input
- renamed `split` filter to `split-records` to avoid name clash with meta-data key `split` as parameter
0.2.3 (2024-05-06)
------------------
- requiring seppl>=0.2.4 now
0.2.2 (2024-05-03)
------------------
- requiring seppl>=0.2.3 now
0.2.1 (2024-05-02)
------------------
- filters `split` and `tee` now support `ClassificationData` as well
- added `metadata-from-name` filter to extract meta-data from the current input file name
- added `inspect` filter that allows inspecting data interactively as it passes through the pipeline
- added `empty_str_if_none` helper method to `ldc.text_utils` to ensure no None/null values are output with writers
- upgraded seppl to 0.2.2 and switched to using `seppl.ClassListerRegistry`
0.2.0 (2024-02-27)
------------------
- added support for XTuner conversation JSON format: `from-xtuner` and `to-xtuner`
- added filter `update-pair-data` to allow tweaking or rearranging of the data
- introduced `ldc.api` module to separate out abstract superclasses and avoid circular imports
- readers now set the 'file' meta-data value
- added `file-filter` filter for explicitly allowing/discarding records that stem from certain files (entry in meta-data: 'file')
- added `record-files` filter for recording the files that the records are based on (entry in meta-data: 'file')
- filter `pretrain-sentences-to-pairs` can now omit filling the `instruction` when using 0 as prompt step
- requiring seppl>=0.1.2 now
- added global option `-U, --unescape_unicode` to `llm-convert` tool to allow conversion of escaped unicode characters
- the `llm-append` tool now supports appending to json, jsonlines and CSV files in addition to plain-text files (the default)
0.1.1 (2024-02-15)
------------------
- added `classification` domain
- added `from-jsonlines-cl` reader and `to-jsonlines-cl` writer for classification data in JSON lines format
- added filter `pretrain-sentences-to-classification` to turn pretrain data into classification data (with a predefined label)
- added filter `classification-label-map` that can generate a label string/int map
- the `to-llama2-format` filter now has the `--skip_tokens` option to leave out the [INST] [/INST] tokens
- added `from-parquet-cl` reader and `to-parquet-cl` writer for classification data in the Parquet file format
- added `from-csv-cl`/`from-tsv-cl` readers and `to-csv-cl`/`to-tsv-cl` writers for classification data in CSV/TSV file format
0.1.0 (2024-02-05)
------------------
- fixed output format of `to-llama2-format` filter
- `llama2-to-pairs` filter has more robust parsing now
- upgraded seppl to 0.1.0
- switched to seppl classes: Splitter, MetaDataHandler, Reader, Writer, StreamWriter, BatchWriter
0.0.5 (2024-01-24)
------------------
- added flag `-b/--force_batch` to the `llm-convert` tool which forces all data to be read from the
  reader before filtering it and then passing it to the writer; useful for batch filters
- added the `randomize-records` batch filter
- added the `--encoding ENC` option to file readers
- auto-determined encoding is now being logged (`INFO` level)
- the `LDC_ENCODING_MAX_CHECK_LENGTH` environment variable allows overriding the default
  number of bytes used for determining the file encoding in auto-detect mode
- default max number of bytes inspected for determining the file encoding is now 10KB
- method `locate_files` in `base_io` no longer includes directories when expanding globs
- added tool `llm-file-encoding` for determining file encodings of text files
- added method `replace_extension` to the `base_io` module for changing a file's extension
  (removes any supported compression suffix first)
- stream writers (.jsonl/.txt) now work with `--force_batch` mode; the output file name
  gets automatically generated from the input file name when just using a directory for
  the output
0.0.4 (2023-12-19)
------------------
- `pairs-to-llama2` filter now has an optional `--prefix` parameter to use with the instruction
- added the `pretrain-sentences-to-pairs` filter for generating artificial prompt/response datasets from pretrain data
- requires seppl>=0.0.11 now
- the `LDC_MODULES_EXCL` environment variable is now used for specifying modules to be excluded from the registration
  process (e.g., used when generating help screens for derived libraries that shouldn't output the
  base plugins as well)
- `llm-registry` and `llm-help` now allow specifying excluded modules via `-e/--excluded_modules` option
- `to-alpaca` writer now has the `-a/--ensure_ascii` flag to enforce ASCII compatibility in the output
- added global option `-u/--update_interval` to the `convert` tool to customize how often the progress
  (number of records processed) is output in the console (default: 1000)
- `text-length` filter now handles None values, i.e., ignores them
- locations (i.e., input/instructions/output/etc.) can now be specified multiple times
- the `llm-help` tool can now generate index files for all the plugins; in case of markdown
  it will link to the other markdown files
0.0.3 (2023-11-10)
------------------
- added the `record-window` filter
- added the `llm-registry` tool for querying the registry from the command-line
- added the `replace_patterns` method to `ldc.text_utils` module
- added the `replace-patterns` filter
- added `-p/--pretty-print` flag to `to-alpaca` writer
- added `pairs-to-llama2` and `llama2-to-pairs` filters
  (since llama2 has the instruction as part of the string, it is treated as pretrain data)
- added `to-llama2-format` filter for pretrain records (no [INST]...[/INST] block)
- now requiring seppl>=0.0.8 in order to raise Exceptions when encountering unknown arguments
0.0.2 (2023-10-31)
------------------
- added `text-stats` filter
- stream writers now also accept an iterable of data records, to improve throughput
- `text_utils.apply_max_length` now uses simple whitespace splitting instead of
  searching for the nearest word boundary to break a line, which results in a massive
  speed improvement
- fix: `text_utils.remove_patterns` no longer multiplies the generated lines when using
  more than one pattern
- added `remove-patterns` filter
- pretrain and translation text writers now buffer records by default (`-b`, `--buffer_size`)
  in order to improve throughput
- jsonlines writers for pair, pretrain and translation data are now stream writers
0.0.1 (2023-10-26)
------------------
- initial release