# `split-folders` [![Build Status](https://img.shields.io/github/workflow/status/jfilter/split-folders/Test)](https://github.com/jfilter/split-folders/actions/workflows/test.yml) [![PyPI](https://img.shields.io/pypi/v/split-folders.svg)](https://pypi.org/project/split-folders/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/split-folders.svg)](https://pypi.org/project/split-folders/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/split-folders)](https://pypistats.org/packages/split-folders)
Split folders with files (e.g. images) into **train**, **validation** and **test** (dataset) folders.
The input folder should have the following format:
```
input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...
```
The output then looks like this:
```
output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
```
This should get you started to do some serious deep learning on your data. [Read here](https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set) why it's a good idea to split your data into three different sets.
- Split files into a training set and a validation set (and optionally a test set).
- Works with any file type.
- The files get shuffled.
- A [seed](https://docs.python.org/3/library/random.html#random.seed) makes splits reproducible.
- Allows randomized [oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) for imbalanced datasets.
- Optionally group files by prefix.
- (Should) work on all operating systems.
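The shuffling and seeding listed above can be pictured with a short sketch. This is not the package's actual implementation; `split_by_ratio` is a hypothetical helper that only illustrates how a fixed seed makes a shuffled split reproducible, assuming the cut points are obtained by truncating `ratio * count`.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Hypothetical helper: shuffle deterministically, then cut at the ratio boundaries."""
    # Sort first so the result does not depend on the order the files were listed in.
    files = sorted(files)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(files)

    n = len(files)
    train_end = int(ratio[0] * n)
    if len(ratio) == 2:
        # Train/val only: everything after the training cut goes to validation.
        return files[:train_end], files[train_end:], []

    val_end = train_end + int(ratio[1] * n)
    # Whatever remains after the validation cut becomes the test set.
    return files[:train_end], files[train_end:val_end], files[val_end:]


train, val, test = split_by_ratio([f"img{i}.jpg" for i in range(10)])
print(len(train), len(val), len(test))
```

Calling the helper twice with the same seed yields the same three lists, which is the property the seed option is meant to provide.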
## Install
This package is pure Python and has no external dependencies.
```bash
pip install split-folders
```
Optionally, you may install [tqdm](https://github.com/tqdm/tqdm) to get a progress bar when moving files.
```bash
pip install split-folders[full]
```
## Usage
You can use `split-folders` as a Python module or as a Command Line Interface (CLI).
If your dataset is balanced (each class has the same number of samples), choose `ratio`; otherwise, choose `fixed`.
NB: oversampling is turned off by default.
Oversampling is only applied to the _train_ folder since having duplicates in _val_ or _test_ would be considered cheating.
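To make the idea of randomized oversampling concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the library's code: `oversample_train` is a hypothetical function, every class is assumed to have at least one training file, and smaller classes are padded with random duplicates until they match the largest class.

```python
import random


def oversample_train(train_by_class, seed=1337):
    """Hypothetical helper: pad each class with random duplicates up to the largest class size."""
    rng = random.Random(seed)
    target = max(len(files) for files in train_by_class.values())

    balanced = {}
    for cls, files in train_by_class.items():
        # Draw with replacement from the class's own files to fill the gap.
        extra = [rng.choice(files) for _ in range(target - len(files))]
        balanced[cls] = list(files) + extra
    return balanced


train = {"cats": ["c1.jpg", "c2.jpg", "c3.jpg", "c4.jpg"], "dogs": ["d1.jpg"]}
print({cls: len(files) for cls, files in oversample_train(train).items()})
```

Only the training files are passed in here; the validation and test sets stay untouched, so no duplicated sample can leak into evaluation.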
### Module
```python
import splitfolders

# Split with a ratio.
# To split into only a training and a validation set, pass a 2-tuple to `ratio`, e.g. `(.8, .2)`.
splitfolders.ratio("input_folder", output="output",
    seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values

# Split val/test with a fixed number of items, e.g. `(100, 100)`, for each set.
# To split into only a training and a validation set, pass a single number to `fixed`, e.g. `10`.
# Set 3 values, e.g. `(300, 100, 100)`, to limit the number of training values.
splitfolders.fixed("input_folder", output="output",
    seed=1337, fixed=(100, 100), oversample=False, group_prefix=None, move=False) # default values
```
Occasionally, a single sample comprises more than one file (e.g. a picture (.png) plus its annotation (.txt)).
`splitfolders` lets you split files into equally-sized groups based on their prefix.
Set `group_prefix` to the length of the group (e.g. `2`).
In that case, _all_ files must be part of a group.
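One way to picture prefix grouping is the sketch below. It is not taken from the package: `group_by_stem` is a hypothetical helper, and it assumes that "prefix" means the file name without its extension and that every group must contain exactly `group_size` files.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(filenames, group_size=2):
    """Hypothetical helper: bundle files that share a stem so they are split together."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        # "img1.png" and "img1.txt" share the stem "img1".
        groups[Path(name).stem].append(name)

    # Every file has to belong to a complete group of the requested size.
    incomplete = {stem: members for stem, members in groups.items() if len(members) != group_size}
    if incomplete:
        raise ValueError(f"groups with the wrong size: {incomplete}")

    return [tuple(members) for members in groups.values()]


print(group_by_stem(["img1.png", "img1.txt", "img2.png", "img2.txt"]))
```

Each returned tuple would then be treated as one item when shuffling and splitting, so a picture and its annotation always land in the same set.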
Set `move=True` if you want to move the files instead of copying.
### CLI
```
Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
Options:
    --output        path to the output folder. defaults to `output`. Get created if non-existent.
    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1 --` or for train/val `.8 .2 --`.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.
                    Set 3 values, e.g. `300 100 100`, to limit the number of training values.
    --seed          set seed value for shuffling the items. defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix
    --move          move the files instead of copying
Example:
    splitfolders --ratio .8 .1 .1 -- folder_with_images
```
Because of some [Python quirks](https://github.com/jfilter/split-folders/issues/19) you have to append ` --` after the values passed to `--ratio`.
Instead of the command `splitfolders` you can also use `split_folders` or `split-folders`.
## Development
Install and use [poetry](https://python-poetry.org/).
## Contributing
If you have a **question**, found a **bug** or want to propose a new **feature**, have a look at the [issues page](https://github.com/jfilter/split-folders/issues).
**Pull requests** are especially welcome when they fix bugs or improve the code quality.
## License
MIT