# cc2dataset
[![pypi](https://img.shields.io/pypi/v/cc2dataset.svg)](https://pypi.python.org/pypi/cc2dataset)
[![Try it on gitpod](https://img.shields.io/badge/try-on%20gitpod-brightgreen.svg)](https://gitpod.io/#https://github.com/rom1504/cc2dataset)
Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
Common Crawl has [5M wat files](https://commoncrawl.org/the-data/get-started/). They provide links of the web.
This simple tool allows you to process one wat file in about 50s and get document links along with their alt text.
It also runs deduplication against url+text in order to save on output space and speed up the process.
This makes it possible to do the first step of building a dataset like [laion5B](https://laion.ai/blog/laion-5b/) in about 70k CPU core hours (`5*10^6*50/(3600)`).
That is about `$2.8k` on AWS EC2 (at $0.04 per core hour).
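The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `estimate_cost` is made up for this example, and the three constants are the approximate figures quoted above, not measured values.

```python
# Approximate figures quoted in this README.
WAT_FILE_COUNT = 5 * 10**6   # about 5M wat files in Common Crawl
SECONDS_PER_FILE = 50        # about 50 s of one core per file
USD_PER_CORE_HOUR = 0.04     # assumed EC2 price per core hour


def estimate_cost(file_count=WAT_FILE_COUNT,
                  seconds_per_file=SECONDS_PER_FILE,
                  usd_per_core_hour=USD_PER_CORE_HOUR):
    """Return (core_hours, usd) for processing file_count files."""
    core_hours = file_count * seconds_per_file / 3600
    return core_hours, core_hours * usd_per_core_hour


core_hours, cost = estimate_cost()
print(f"{core_hours:,.0f} core hours, about ${cost:,.0f}")
# prints: 69,444 core hours, about $2,778
```

Changing `file_count` gives the same estimate for a subset of Common Crawl.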
## Intended usage
This tool produces a collection of (link, caption) pairs. It is meant as stage 1 of creating a dataset. It performs deduplication and as little filtering as possible (does the link look like a URL, and is the caption non-empty).
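To make the two checks and the deduplication concrete, here is a minimal pure-Python sketch. It is an illustration only and not the tool's actual code (the tool runs on Spark); the helper names `looks_like_url` and `stage1` and the exact URL test are assumptions made for this example.

```python
from urllib.parse import urlparse


def looks_like_url(link):
    """Rough check: an http(s) scheme and a non-empty host."""
    try:
        parsed = urlparse(link)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def stage1(pairs):
    """Keep valid (url, text) pairs, dropping exact url+text duplicates."""
    seen = set()
    kept = []
    for url, text in pairs:
        if not looks_like_url(url):
            continue  # does not look like a URL
        if not text or not text.strip():
            continue  # empty caption
        key = (url, text)
        if key in seen:
            continue  # duplicate of an earlier pair
        seen.add(key)
        kept.append(key)
    return kept
```

Deduplicating on the pair rather than on the URL alone keeps the same document when it appears with different captions.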
This produces a large quantity of raw data that can then be further filtered by appropriate techniques.
An example of stage 2 can be to estimate the similarity between (link, text) with a model such as CLIP. This may reduce the quantity of data by a factor of up to 100x depending on the chosen threshold.
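As a rough sketch of such a stage 2, the snippet below keeps only the pairs whose similarity score reaches a threshold and reports the resulting reduction factor. It assumes the scores were already computed by a separate model; the `similarity` field, the function names, and the example threshold are hypothetical and not part of this tool.

```python
def filter_by_similarity(rows, threshold):
    """Keep rows whose precomputed similarity is at least threshold."""
    return [row for row in rows if row["similarity"] >= threshold]


def reduction_factor(rows, kept):
    """How many times smaller the filtered set is than the input."""
    return len(rows) / len(kept) if kept else float("inf")


# Hypothetical stage 1 rows with scores attached by an external model.
rows = [
    {"url": "https://a.example/1.jpg", "text": "a red car", "similarity": 0.35},
    {"url": "https://a.example/2.jpg", "text": "click here", "similarity": 0.05},
    {"url": "https://a.example/3.jpg", "text": "logo", "similarity": 0.12},
    {"url": "https://a.example/4.jpg", "text": "a dog on grass", "similarity": 0.31},
]
kept = filter_by_similarity(rows, threshold=0.3)
print(len(kept), reduction_factor(rows, kept))
```

Raising the threshold trades dataset size for closer agreement between link and text.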
## What hardware to pick?
CC is big and located in S3 us-east-1, so in terms of network it makes a lot of sense to use machines located in the same region.
`cpu128-dy-c6i-32xlarge` instances are advised. Spark stores the non-duplicated first stage on local disk, so the disks should be NVMe drives for speed during deduplication. At this first stage, one wat file takes about 20MB, so the total space over all workers must exceed 20MB times the wat count. For the whole of CC, that means 100TB, which fits in 150 instances with a 1TB NVMe drive each. 150 instances of 128 cores is 19200 cores, so the whole processing takes about 2h. Fewer instances with bigger drives can work too. If temporary disk space is an issue, the processing can also be done in multiple pieces by specifying `--multipart`.
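The disk sizing above is simple arithmetic and can be checked with a short sketch. The function name `plan_disk` is invented for this example, the 20MB per wat file figure is the approximation from this section, and decimal units (1TB = 10^6 MB) are assumed.

```python
import math

MB_PER_WAT = 20  # approximate first-stage size of one wat file


def plan_disk(wat_count, disk_tb_per_instance, mb_per_wat=MB_PER_WAT):
    """Return (total_tb, min_instances) needed to hold the first stage."""
    total_tb = wat_count * mb_per_wat / 1e6  # decimal units: 1TB = 1e6 MB
    min_instances = math.ceil(total_tb / disk_tb_per_instance)
    return total_tb, min_instances


total_tb, min_instances = plan_disk(wat_count=5_000_000, disk_tb_per_instance=1.0)
print(total_tb, min_instances)  # prints: 100.0 100

# The 150 instance cluster from the text, with 128 cores each.
total_cores = 150 * 128
print(total_cores)  # prints: 19200
```

By this estimate 100 instances with 1TB each is the bare minimum, so the 150 instances suggested above leave about 50% of headroom.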
## Document type
This tool supports extracting several document types from CC:
* image/text: about 300B after dedup
* image/text even with empty text: estimated 1T
* audio/text: about 2B after dedup
* text doc: about 10B after dedup
* video/text: about 2B after dedup
They can be selected with, for example, `--document_type audio`.
You may experiment with more document kinds by running `python examples/single_warc_example.py` and exploring the resulting `output.parquet`.
## Install
`pip install cc2dataset`
## Python examples
Check out these examples:
* [run_on_spark.py](examples/run_on_spark.py) shows how to bring your own Spark session
If you have a slurm cluster, refer to https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 to start a spark cluster there.
## API
This module exposes a single function `cc2dataset` which takes the same arguments as the command line tool:
* **output_path** the output path, should probably start with s3://. The output will be written to this path suffixed by the date (*required*)
* **wat_index_count** the number of wat index files to read, can be None for all. (*default 1*)
* **wat_count** the number of wat files to read, can be None for all, will randomly subsample if present. (*default 100*)
* **master** the spark master url. (*default local*)
* **num_cores** the number of cores of each spark executor. (*default 128*)
* **mem_gb** the memory of each spark executor. (*default 256*)
* **multipart** runs the processing in the specified number of parts and merges them at the end (*default None*)
* **shuffle** randomly shuffle the output right before saving (*default True*)
* **resume** the specific path of the output to resume (*default None*)
* **spark_builder** a function that creates a Spark session, None will default to the built-in methods (*default None*)
* **document_type** the kind of document to extract (*default image*)
* **source_cc_protocol** get common crawl from http or s3 (*default s3*)
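A call from Python might look like the sketch below. The argument names come from the list above, but the import path `from cc2dataset import cc2dataset`, the bucket name, and the Spark settings in `build_spark` are assumptions made for illustration, so adapt them to your setup.

```python
def build_spark():
    """Hypothetical spark_builder: creates and returns a Spark session."""
    from pyspark.sql import SparkSession  # imported lazily, needs pyspark

    return (
        SparkSession.builder
        .appName("cc2dataset-example")        # name chosen for this example
        .master("local[4]")                   # small local run, 4 threads
        .config("spark.driver.memory", "8g")  # illustrative value
        .getOrCreate()
    )


# Arguments documented in the list above; values are placeholders.
job_args = dict(
    output_path="s3://my-bucket/cc2dataset-output",  # placeholder bucket
    wat_index_count=1,
    wat_count=100,
    document_type="audio",
    source_cc_protocol="s3",
    shuffle=True,
    spark_builder=build_spark,
)


def run(args=job_args):
    """Launch the job; requires cc2dataset, Spark and S3 access."""
    from cc2dataset import cc2dataset  # import path is an assumption

    cc2dataset(**args)


# run()  # uncomment to actually start the job
```

Keeping the arguments in a dictionary makes it easy to log the exact configuration before launching a long job.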
## For development
Either locally, or in [gitpod](https://gitpod.io/#https://github.com/rom1504/cc2dataset) (do `export PIP_USER=false` there)
Set up a virtualenv:
```
python3 -m venv .env
source .env/bin/activate
pip install -e .
```
To run the tests, first install the test requirements:
```
pip install -r requirements-test.txt
```
Then run:
```
make lint
make test
```
You can use `make black` to reformat the code.
Use `python -m pytest -x -s -v tests -k "dummy"` to run a specific test.
## Thanks
* [Vaishaal](https://github.com/Vaishaal) for providing the initial CC parsing code with efficient libraries
* [rvencu](https://github.com/rvencu) for optimizing the CC [parsing code](https://github.com/rvencu/crawlingathome-gpu-hcloud) for laion5B, on which the idea of this package is based
Raw data
{
"_id": null,
"home_page": "https://github.com/rom1504/cc2dataset",
"name": "cc2dataset",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "machine learning",
"author": "Romain Beaumont",
"author_email": "romain.rom1@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/1c/34/6d3150135577dd3650811d4eb72b57142a3f413ff968ea378bd3734e2cf2/cc2dataset-1.5.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Easily convert common crawl to image caption set using pyspark",
"version": "1.5.0",
"project_urls": {
"Homepage": "https://github.com/rom1504/cc2dataset"
},
"split_keywords": [
"machine",
"learning"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "17b7edae6e5bb33371b4b324c5d15aa35229002b2d10d755aa07013deea160a1",
"md5": "daf7cca8dee1d839a755ba4a5860f676",
"sha256": "e903a02b39f0bb98d320d966b80dd4abfc8646e385488d811abd6bd7e9619ef5"
},
"downloads": -1,
"filename": "cc2dataset-1.5.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "daf7cca8dee1d839a755ba4a5860f676",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 12305,
"upload_time": "2023-06-25T22:54:59",
"upload_time_iso_8601": "2023-06-25T22:54:59.407453Z",
"url": "https://files.pythonhosted.org/packages/17/b7/edae6e5bb33371b4b324c5d15aa35229002b2d10d755aa07013deea160a1/cc2dataset-1.5.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1c346d3150135577dd3650811d4eb72b57142a3f413ff968ea378bd3734e2cf2",
"md5": "2af8d852037ba4b31e8ee649b9698934",
"sha256": "9677a85d2e5d2aefe1ef76ef9b01074c4ca316aae827cb12700b023ff45a2252"
},
"downloads": -1,
"filename": "cc2dataset-1.5.0.tar.gz",
"has_sig": false,
"md5_digest": "2af8d852037ba4b31e8ee649b9698934",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 11842,
"upload_time": "2023-06-25T22:55:00",
"upload_time_iso_8601": "2023-06-25T22:55:00.997378Z",
"url": "https://files.pythonhosted.org/packages/1c/34/6d3150135577dd3650811d4eb72b57142a3f413ff968ea378bd3734e2cf2/cc2dataset-1.5.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-25 22:55:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rom1504",
"github_project": "cc2dataset",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pyspark",
"specs": []
},
{
"name": "pysimdjson",
"specs": []
},
{
"name": "fsspec",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "loguru",
"specs": []
},
{
"name": "pyarrow",
"specs": []
},
{
"name": "fastwarc",
"specs": []
},
{
"name": "s3fs",
"specs": []
},
{
"name": "fire",
"specs": []
},
{
"name": "requests",
"specs": []
}
],
"lcname": "cc2dataset"
}