DatasetRising

Name	DatasetRising
Version	1.0.4
home_page
Summary	Toolchain for creating and training Stable Diffusion models with custom datasets
upload_time	2023-12-12 00:36:16
maintainer
docs_url	None
author
requires_python	>=3.8
license	Copyright 2023 Mr. Stallion, Esq. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	training crawler machine-learning imageboard booru danbooru ml dataset dataset-generation gelbooru e621 imagebooru finetuning mlops huggingface mlops-workflow stable-diffusion huggingface-users diffusers sdxl
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Dataset Rising

> A toolchain for creating and training Stable Diffusion 1.x, Stable Diffusion 2.x, and Stable Diffusion XL models
> with custom datasets.

With this toolchain, you can:
* **Crawl and download** metadata and images from 'booru' style image boards
* Combine **multiple sources of images** (including your own custom sources)
* **Build datasets** based on your personal preferences and filters
* **Train Stable Diffusion models** with your datasets
* Convert models into [Stable Diffusion WebUI](https://github.com/AUTOMATIC1111/stable-diffusion-webui/tree/master) compatible models
* Use only the parts you need – the toolchain uses modular design, YAML configuration files, and JSONL data exchange formats
* Work with confidence that the end-to-end tooling has been tested with Nvidia's RTX30x0, RTX40x0, A100, and H100 GPUs

## Requirements
* Python `>=3.8`
* Docker `>=22.0.0`

## Tested With
* MacOS 13 (M1)
* Ubuntu 22 (x86_64)


## Full Example
Below is a summary of each step in the dataset generation process. For a full production-quality example, see [e621-rising-configs](https://github.com/hearmeneigh/e621-rising-configs) (NSFW).

### 0. Installation
```bash
# install DatasetRising
pip3 install DatasetRising

# start MongoDB database; use `dr-db-down` to stop
dr-db-up
```

### 1. Download Metadata (Posts, Tags, ...)
Dataset Rising has a crawler (`dr-crawl`) that downloads metadata (posts and tags) from booru-style image boards.

You must select a unique user agent string for your crawler (`--agent AGENT_STRING`). This string will
be passed to the image board with every HTTP request. If you don't pick a user agent that uniquely identifies you,
the image boards will likely block your requests. For example:

> `--agent 'my-imageboard-crawler/1.0 (user @my-username-on-the-imageboard)'`
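To illustrate what the user agent string does on the wire, the sketch below builds (but does not send) an HTTP request with Python's standard library. The URL and the agent value are placeholders, and this is not a description of how `dr-crawl` is implemented internally.

```python
from urllib.request import Request

# Placeholder identity for illustration only; substitute your own.
AGENT = "my-imageboard-crawler/1.0 (user @my-username-on-the-imageboard)"


def build_request(url: str, agent: str) -> Request:
    """Create an HTTP request object that carries the given User-Agent header."""
    return Request(url, headers={"User-Agent": agent})


# example.com is a placeholder host; no network traffic happens here.
request = build_request("https://example.com/posts.json", AGENT)

# urllib normalizes header names to the form "User-agent".
print(request.get_header("User-agent"))
```

The image board sees this header on every request, which is why a string that identifies you personally makes it easier for the operators to contact you instead of blocking you.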

The crawler will automatically manage rate limits and retries. If you want to automatically resume a previous (failed)
crawl, use `--recover`.

```bash
## download tag metadata to /tmp/e926.net-tags.jsonl
dr-crawl --output /tmp/e926.net-tags.jsonl --type tags --source e926 --recover --agent '<AGENT_STRING>'

## download posts metadata to /tmp/e926.net-posts.jsonl
dr-crawl --output /tmp/e926.net-posts.jsonl --type posts --source e926 --recover --agent '<AGENT_STRING>'
```
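Before importing, it can be useful to check how many records a crawl produced. The following is a minimal sketch that assumes only that the crawler output is JSONL (one JSON object per line); it makes no assumptions about the fields inside each record. The helper names `read_jsonl` and `count_records` are hypothetical and not part of Dataset Rising.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_records(path) -> int:
    """Count the records in a JSONL file without loading it all into memory."""
    return sum(1 for _ in read_jsonl(path))


if __name__ == "__main__":
    posts = Path("/tmp/e926.net-posts.jsonl")
    if posts.exists():
        print(f"{count_records(posts)} post records downloaded")
```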

### 2. Import Metadata

> This section requires a running MongoDB database, which you can start with the `dr-db-up` command.

Once you have enough post and tag metadata, it's time to import the data into a database.

Dataset Rising uses MongoDB as a store for the post and tag metadata. Use `dr-import` to
import the metadata downloaded in the previous step into MongoDB.

If you want to adjust how the tag metadata is treated during the import,
review files in [`<dataset-rising>/examples/tag_normalizer`](./examples/tag_normalizer) and set the optional
parameters `--prefilter FILE`, `--rewrites FILE`, `--aspect-ratios FILE`, `--category-weights FILE`, and
`--symbols FILE` accordingly.

```bash
dr-import --tags /tmp/e926.net-tags.jsonl --posts /tmp/e926.net-posts.jsonl --source e926
```

### 3. Preview Selectors
> This section requires a running MongoDB database, which you can start with the `dr-db-up` command.

After the metadata has been imported into a database, you can use selector files to select
a subset of the posts in a dataset.

Your goal is **not** to include **all** images, but to produce
a set of **high quality** samples. The selectors are the mechanism for that.

Each selector contains a **positive** and a **negative** list of tags. A post is included
by the selector if it contains at least one tag from the **positive** list and none of the
tags in the **negative** list.
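The inclusion rule can be expressed in a few lines of Python. This is only a sketch of the rule as described above, not the actual selector implementation; the function name `selector_matches` and the example tags are hypothetical, and the real selector YAML files may support more than two flat tag lists.

```python
from typing import Iterable, Set


def selector_matches(post_tags: Iterable[str], positive: Set[str], negative: Set[str]) -> bool:
    """Return True if the post has at least one positive tag and no negative tags."""
    tags = set(post_tags)
    has_positive = bool(tags & positive)   # at least one tag from the positive list
    has_negative = bool(tags & negative)   # any tag from the negative list
    return has_positive and not has_negative


# Hypothetical tag lists for illustration.
positive = {"cat", "horse"}
negative = {"low_quality"}

print(selector_matches(["cat", "chess"], positive, negative))        # True
print(selector_matches(["cat", "low_quality"], positive, negative))  # False
print(selector_matches(["chess"], positive, negative))               # False
```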

Note that a great dataset will contain positive **and** negative examples. If you
train your model with positive samples only, it will not be able to use negative
prompts well. That's why the examples below include four different types of selectors.

Dataset Rising has example selectors available in [`<dataset-rising>/examples/select`](examples/select).

To make sure your selectors are producing the kind of samples you want, use the `dr-preview`
script:

```bash
# generate an HTML preview of how the selector will perform (note: --aggregate is required):
dr-preview --selector ./examples/select/tier-1/tier-1.yaml --output /tmp/preview/tier-1 --limit 1000 --aggregate

# generate an HTML preview of how each sub-selector will perform:
dr-preview --selector ./examples/select/tier-1/helpers/artists.yaml --output /tmp/preview/tier-1-artists
```

### 4. Select Images For a Dataset
> This section requires a running MongoDB database, which you can start with the `dr-db-up` command.

When you're confident that the selectors are producing the right kind of samples, it's time to select the posts for
building a dataset. Use `dr-select` to select posts from the database and store them in a JSONL file. 

```bash
cd <dataset-rising>/database

dr-select --selector ./examples/select/tier-1/tier-1.yaml --output /tmp/tier-1.jsonl
dr-select --selector ./examples/select/tier-2/tier-2.yaml --output /tmp/tier-2.jsonl
```

### 5. Build a Dataset
After selecting the posts for the dataset, use `dr-join` to combine the selections and 
`dr-build` to download the images and build the actual dataset.

By default, the build script prunes all tags that have fewer than 100 samples. To adjust this limit, use `--min-posts-per-tag LIMIT`.

The build script will also prune all images that have fewer than 10 tags. To adjust this limit, use `--min-tags-per-post LIMIT`.
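The two pruning rules can be sketched as follows. This is an illustration under stated assumptions, not the build script's code: it assumes each post is a dict with a `tags` list, and that rare tags are removed first and posts are then counted by their remaining tags. The actual order of operations in `dr-build` may differ.

```python
from collections import Counter
from typing import Dict, List


def prune(posts: List[Dict], min_posts_per_tag: int = 100, min_tags_per_post: int = 10) -> List[Dict]:
    """Drop rare tags, then drop posts left with too few tags."""
    # Count how many posts carry each tag (a tag counts once per post).
    tag_counts = Counter(tag for post in posts for tag in set(post["tags"]))

    kept = []
    for post in posts:
        # Keep only tags that appear in enough posts.
        tags = [tag for tag in post["tags"] if tag_counts[tag] >= min_posts_per_tag]
        # Keep only posts that still have enough tags.
        if len(tags) >= min_tags_per_post:
            kept.append({**post, "tags": tags})
    return kept


# Tiny hypothetical dataset with lowered limits so the effect is visible.
posts = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["a", "c"]},
    {"id": 3, "tags": ["a"]},
]
print(prune(posts, min_posts_per_tag=2, min_tags_per_post=1))
```

In this toy run, tags `b` and `c` appear in one post each and are pruned, so every post keeps only the tag `a`.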

Adding a percentage at the end of a `--source` tells the build script to pick that many samples of the total dataset from the given source, e.g. `--source ./my.jsonl:50%`.

```bash
dr-join \
  --samples '/tmp/tier-1.jsonl:80%' \
  --samples '/tmp/tier-2.jsonl:20%' \
  --output '/tmp/joined.jsonl'

dr-build \
  --source '/tmp/joined.jsonl' \
  --output '/tmp/my-dataset' \
  --upload-to-hf 'username/dataset-name' \
  --upload-to-s3 's3://some-bucket/some/path'
```
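The percentages can be thought of as each source's share of the final dataset. The sketch below shows one plausible reading of that arithmetic: find the largest total for which every source can supply its share, then split it. The function `plan_mix` is hypothetical, and `dr-join` may resolve the shares differently (for example, by rounding or sampling in another way).

```python
from typing import Dict


def plan_mix(available: Dict[str, int], percent: Dict[str, int]) -> Dict[str, int]:
    """Compute per-source sample counts for the largest dataset the shares allow.

    available: number of samples each source can supply
    percent:   each source's share of the final dataset, as a whole percentage
    """
    # The largest total such that no source is asked for more than it has.
    total = min(available[name] * 100 // share for name, share in percent.items() if share > 0)
    return {name: total * share // 100 for name, share in percent.items()}


# Hypothetical sizes: tier-1 has 8000 selected posts, tier-2 has 1000.
counts = plan_mix(
    available={"tier-1": 8000, "tier-2": 1000},
    percent={"tier-1": 80, "tier-2": 20},
)
print(counts)  # {'tier-1': 4000, 'tier-2': 1000}
```

Under this reading, the smaller source limits the total: tier-2 can cover 20% of at most 5000 samples, so only 4000 of the 8000 tier-1 samples are used.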

### 6. Train a Model
The dataset built by the `dr-build` script is ready to be used for training as is. Dataset Rising uses
[Huggingface Accelerate](https://huggingface.co/docs/accelerate/index) to train Stable Diffusion models.

To train a model, you will need to pick a base model to start from. The base model (`--pretrained-model-name-or-path`) can be any
[Diffusers](https://huggingface.co/docs/diffusers/index) compatible model, such as:

* [hearmeneigh/e621-rising-v3](https://huggingface.co/hearmeneigh/e621-rising-v3) (NSFW)
* [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
* [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)

Note that your training results will improve significantly if you set `--resolution`
to match the resolution the base model was trained with.

> This example does not scale to multiple GPUs. See the [Advanced Topics](#advanced-topics) section for multi-GPU training.

```bash
dr-train \
  --pretrained-model-name-or-path 'stabilityai/stable-diffusion-xl-base-1.0' \
  --dataset-name 'username/dataset-name' \
  --output '/tmp/dataset-rising-v3-model' \
  --resolution 1024 \
  --maintain-aspect-ratio \
  --reshuffle-tags \
  --tag-separator ' ' \
  --random-flip \
  --train-batch-size 32 \
  --learning-rate 4e-6 \
  --use-ema \
  --max-grad-norm 1 \
  --checkpointing-steps 1000 \
  --lr-scheduler constant \
  --lr-warmup-steps 0
```

### 7. Generate Samples
After training, you can use the `dr-generate` script to verify that the model is working as expected.

```bash
dr-generate \
  --model '/tmp/dataset-rising-v3-model' \
  --output '/tmp/samples' \
  --prompt 'cat playing chess with a horse' \
  --samples 100
```

### 8. Use the Model with Stable Diffusion WebUI
In order to use the model with [Stable Diffusion WebUI](https://github.com/AUTOMATIC1111/stable-diffusion-webui), it has to be converted to the `safetensors` format.

```bash
# Stable Diffusion XL models:
dr-convert-sdxl \
  --model_path '/tmp/dataset-rising-v3-model' \
  --checkpoint_path '/tmp/dataset-rising-v3-model.safetensors' \
  --use_safetensors

# Other Stable Diffusion models:
dr-convert-sd \
  --model_path '/tmp/dataset-rising-v3-model' \
  --checkpoint_path '/tmp/dataset-rising-v3-model.safetensors' \
  --use_safetensors
  
# Copy the model to the WebUI models directory:
cp '/tmp/dataset-rising-v3-model.safetensors' '<webui-root>/models/Stable-diffusion'

# Copy the model configuration file to WebUI models directory:
cp '/tmp/dataset-rising-v3-model.yaml' '<webui-root>/models/Stable-diffusion'
```

## Uninstall
The only part of Dataset Rising that requires uninstallation is the MongoDB database. You can uninstall the database
with the following commands:

```bash
# Shut down MongoDB instance
dr-db-down

# Remove MongoDB container and its data -- warning! data loss will occur
dr-db-uninstall
```

## Advanced Topics

### Resetting the Database
To reset the database, run the following commands.

> **Warning: You will lose all data in the database.**

```bash
dr-db-uninstall && dr-db-up && dr-db-create
```

### Importing Posts from Multiple Sources
The `dr-append` script allows you to import posts from additional sources.

Use `dr-import` to import the first source and define the tag namespace, then use `dr-append` to import additional sources.

```bash
# main sources and tags
dr-import ...

# additional sources
dr-append --input /tmp/gelbooru-posts.jsonl --source gelbooru
```

### Multi-GPU Training
Multi-GPU training can be carried out with the [Huggingface Accelerate](https://huggingface.co/docs/accelerate/package_reference/cli) library.

Before training, run `accelerate config` to set up your multi-GPU environment.

```bash
cd <dataset-rising>/train

# set up environment
accelerate config

# run training
accelerate launch \
  --multi_gpu \
  --mixed_precision=${PRECISION} \
  dr_train.py \
    --pretrained-model-name-or-path 'stabilityai/stable-diffusion-xl-base-1.0' \
    --dataset-name 'username/dataset-name' \
    --resolution 1024 \
    --maintain-aspect-ratio \
    --reshuffle-tags \
    --tag-separator ' ' \
    --random-flip \
    --train-batch-size 32 \
    --learning-rate 4e-6 \
    --use-ema \
    --max-grad-norm 1 \
    --checkpointing-steps 1000 \
    --lr-scheduler constant \
    --lr-warmup-steps 0
```

## Setting Up a Training Machine
* Install `dataset-rising`
* Install [Huggingface CLI](https://huggingface.co/docs/huggingface_hub/installation)
* Install [Accelerate CLI](https://huggingface.co/docs/accelerate/basic_tutorials/install)
* Configure Huggingface CLI (`huggingface-cli login`)
* Configure Accelerate CLI (`accelerate config`)

### Optional
* Install [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
* Install [xFormers](https://github.com/facebookresearch/xformers)
* Configure AWS CLI (`aws configure`)

### Troubleshooting

#### NCCL Errors
Some configurations will require `NCCL_P2P_DISABLE=1` and/or `NCCL_IB_DISABLE=1` environment variables to be set.

```bash
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

dr-train ...
```

### Cache Directories
Use `HF_DATASETS_CACHE` and `HF_MODULES_CACHE` to control where Huggingface stores its cache files.

```bash
export HF_DATASETS_CACHE=/workspace/cache/huggingface/datasets
export HF_MODULES_CACHE=/workspace/cache/huggingface/modules

dr-train ...
```
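If you drive training from a Python script rather than a shell, the same variables can be set through `os.environ`. The sketch below is a minimal example under the assumption that it runs before any Huggingface library is imported, since those libraries read the variables when they load. The helper name `configure_hf_cache` is hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def configure_hf_cache(root) -> Dict[str, Path]:
    """Create cache directories under `root` and point the Huggingface variables at them."""
    paths = {
        "HF_DATASETS_CACHE": Path(root) / "datasets",
        "HF_MODULES_CACHE": Path(root) / "modules",
    }
    for name, path in paths.items():
        path.mkdir(parents=True, exist_ok=True)  # make sure the directory exists
        os.environ[name] = str(path)             # visible to libraries imported later
    return paths


if __name__ == "__main__":
    # Call this before importing `datasets` or other Huggingface libraries.
    configure_hf_cache(Path.home() / ".cache" / "dataset-rising-example")
```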


## Developers
### Setting Up
Creates a virtual environment, installs packages, and sets up a MongoDB database on Docker. 

```bash
cd <dataset-rising>
./up.sh
```

### Shutting Down
Stops the MongoDB database container. The database can be restarted by running `./up.sh` again.

```bash
cd <dataset-rising>
./down.sh
```


### Uninstall
Warning: This step **removes** the MongoDB database container and all data stored on it.

```bash
cd <dataset-rising>
./uninstall.sh
```

### Deployments
```bash
python3 -m pip install --upgrade build twine
python3 -m build 
python3 -m twine upload dist/*
```

### Architecture
```mermaid
flowchart TD
    CRAWL[Crawl/Download posts, tags, and tag aliases] -- JSONL --> IMPORT
    IMPORT[Import posts, tags, and tag aliases] --> STORE
    APPEND[Append additional posts] --> STORE
    STORE[Database] --> PREVIEW
    STORE --> SELECT1
    STORE --> SELECT2
    STORE --> SELECT3
    PREVIEW[Preview selectors] --> HTML(HTML)
    SELECT1[Select samples] -- JSONL --> JOIN
    SELECT2[Select samples] -- JSONL --> JOIN
    SELECT3[Select samples] -- JSONL --> JOIN
    JOIN[Join and prune samples] -- JSONL --> BUILD 
    BUILD[Build dataset] -- HF Dataset/Parquet --> TRAIN
    TRAIN[Train model] --> MODEL[Model]
```

## Links
* [SDXL training notes](https://rentry.org/59xed3#sdxl)

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "DatasetRising",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "training,crawler,machine-learning,imageboard,booru,danbooru,ml,dataset,dataset-generation,gelbooru,e621,imagebooru,finetuning,mlops,huggingface,mlops-workflow,stable-diffusion,huggingface-users,diffusers,sdxl",
    "author": "",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/28/cf/ce4c6cb823db02fd3a03f7d696dcd63e63eb56f8fa54ad74b063375a40ca/DatasetRising-1.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Copyright 2023 Mr. Stallion, Esq.  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \u201cSoftware\u201d), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \u201cAS IS\u201d, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Toolchain for creating and training Stable Diffusion models with custom datasets",
    "version": "1.0.4",
    "project_urls": {
        "Bug Reports": "https://github.com/hearmeneigh/dataset-rising/issues",
        "Homepage": "https://github.com/hearmeneigh/dataset-rising",
        "Source": "https://github.com/hearmeneigh/dataset-rising"
    },
    "split_keywords": [
        "training",
        "crawler",
        "machine-learning",
        "imageboard",
        "booru",
        "danbooru",
        "ml",
        "dataset",
        "dataset-generation",
        "gelbooru",
        "e621",
        "imagebooru",
        "finetuning",
        "mlops",
        "huggingface",
        "mlops-workflow",
        "stable-diffusion",
        "huggingface-users",
        "diffusers",
        "sdxl"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ebd3dac3d05f5bdcd5f191319b1ddecf88f615c8b1affb50cd856cfbcc08c2ca",
                "md5": "f62795a94c34bb5a76c08451ab24dc48",
                "sha256": "0cf7063f75b6ce7ed269b45af6c9b2e61b2db453fe3974c4b128867131ec647a"
            },
            "downloads": -1,
            "filename": "DatasetRising-1.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f62795a94c34bb5a76c08451ab24dc48",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 177050,
            "upload_time": "2023-12-12T00:36:14",
            "upload_time_iso_8601": "2023-12-12T00:36:14.161809Z",
            "url": "https://files.pythonhosted.org/packages/eb/d3/dac3d05f5bdcd5f191319b1ddecf88f615c8b1affb50cd856cfbcc08c2ca/DatasetRising-1.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "28cfce4c6cb823db02fd3a03f7d696dcd63e63eb56f8fa54ad74b063375a40ca",
                "md5": "dbf2864f9a633d44cf5b7a6d1ff66e70",
                "sha256": "17d7d2f58cfc42a7b361c36d67141fd6ad9426b3510d8a71685834b63f000ecb"
            },
            "downloads": -1,
            "filename": "DatasetRising-1.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "dbf2864f9a633d44cf5b7a6d1ff66e70",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 154275,
            "upload_time": "2023-12-12T00:36:16",
            "upload_time_iso_8601": "2023-12-12T00:36:16.651470Z",
            "url": "https://files.pythonhosted.org/packages/28/cf/ce4c6cb823db02fd3a03f7d696dcd63e63eb56f8fa54ad74b063375a40ca/DatasetRising-1.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-12 00:36:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hearmeneigh",
    "github_project": "dataset-rising",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "datasetrising"
}