# SpeakLeash
SpeakLeash agnostic dataset for Polish
## Basic Usage
If you just want to see the details of the available datasets:
```
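import os

from speakleash import Speakleash

# Directory where datasets are replicated: "datasets" next to this script.
base_dir = os.path.dirname(os.path.abspath(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

# Print summary statistics for every dataset.
# Iterating over d.data is not needed here; the statistics are dataset properties.
for d in sl.datasets:
    # Rough size estimate: character count divided by 1024 * 1024.
    size_mb = round(d.characters / 1024 / 1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))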
```
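
To compare datasets by size, the per-dataset numbers can be collected into a small sorted table. The sketch below is not part of the library: `summarize` is a hypothetical helper, and it assumes only that each item exposes the `name`, `characters` and `documents` properties used above. The `SimpleNamespace` objects stand in for `sl.datasets` so the snippet runs offline; their numbers are made up.

```python
from types import SimpleNamespace


def summarize(datasets):
    """Return (name, size_mb, documents) rows, largest dataset first.

    size_mb follows the convention used above: characters / 1024 / 1024.
    """
    rows = [
        (d.name, round(d.characters / 1024 / 1024), d.documents)
        for d in datasets
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Stand-ins for sl.datasets (made-up numbers, not real manifest values).
fake_datasets = [
    SimpleNamespace(name="small_set", characters=5 * 1024 * 1024, documents=100),
    SimpleNamespace(name="big_set", characters=50 * 1024 * 1024, documents=2000),
]

for name, size_mb, documents in summarize(fake_datasets):
    print("{0:<12} {1:>6} MB {2:>8} documents".format(name, size_mb, documents))
```

With the real library, `summarize(sl.datasets)` should behave the same way, provided the properties work as in the example above.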
You can read individual properties (e.g. ***characters***, ***documents***), or display the entire manifest:
```
sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)
```
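
A manifest printed with plain `print` can be hard to read. As a minimal sketch, assuming the manifest is a plain dictionary (its structure is not documented here, and the keys below are invented for illustration), `json.dumps` gives an indented view and keeps Polish characters readable:

```python
import json

# Hypothetical manifest contents; the real keys and values may differ.
manifest = {"name": "example", "characters": 1234, "documents": 5}

# ensure_ascii=False keeps non-ASCII characters (ą, ę, ł, ...) as they are.
pretty = json.dumps(manifest, indent=2, ensure_ascii=False, sort_keys=True)
print(pretty)
```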
If you choose one of them (***.get(name of dataset)***), you will get a lot of text data ;-)
```
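import os

from speakleash import Speakleash

# Directory where datasets are replicated: "datasets" next to this script.
base_dir = os.path.dirname(os.path.abspath(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

# Each item in .data is the text of one document.
wiki = sl.get("plwiki").data
for doc in wiki:
    # Show only the first 40 characters of each document.
    print(doc[:40])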
```
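
Because a dataset can hold a very large number of documents, it is often enough to look at the first few. The sketch below uses `itertools.islice` for that. It assumes only that `.data` can be iterated and yields strings, as the loop above suggests; `preview` and `fake_data` are hypothetical names, and the generator stands in for `sl.get("plwiki").data` so the snippet runs without downloading anything.

```python
from itertools import islice


def preview(docs, limit=3, width=40):
    """Return the first `width` characters of the first `limit` documents."""
    return [doc[:width] for doc in islice(docs, limit)]


def fake_data():
    # Stand-in for sl.get("plwiki").data: yields text documents one by one.
    for i in range(1000):
        yield "Document number {0} with some Polish text".format(i)


for line in preview(fake_data()):
    print(line)
```

`islice` stops pulling items after `limit` documents, so the rest of the stream is never read.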
If you also need metadata, use the ***ext_data*** property:
```
# Assumes sl was created as in the examples above.
ds = sl.get("plwiki").ext_data
for doc in ds:
    # Each item is a (text, metadata) pair.
    print(doc)
    txt, meta = doc
    print(meta.get("title"))
    print(txt)
```
Popular metadata fields:
* title
* length
* sentences
* words
* verbs
* nouns
* symbols
* punctuations
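
These fields can be aggregated across a dataset. The sketch below sums selected fields over `(text, meta)` pairs like those yielded by ***ext_data***. It assumes the fields hold numbers and that some documents may lack a field; neither is confirmed here. `total_counts` and `fake_ext_data` are hypothetical, and the sample values are made up.

```python
def total_counts(ext_docs, keys=("words", "sentences")):
    """Sum selected numeric metadata fields over (text, meta) pairs."""
    totals = {key: 0 for key in keys}
    for _txt, meta in ext_docs:
        for key in keys:
            # A missing (or None) field counts as zero.
            totals[key] += meta.get(key) or 0
    return totals


# Stand-in for sl.get("plwiki").ext_data (made-up values).
fake_ext_data = [
    ("Ala ma kota.", {"title": "A", "words": 3, "sentences": 1}),
    ("Kot ma Alę. Pies śpi.", {"title": "B", "words": 5, "sentences": 2}),
    ("Brak metadanych", {"title": "C"}),
]

print(total_counts(fake_ext_data))
```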
## Raw data
{
"_id": null,
"home_page": "https://github.com/speakleash/speakleash",
"name": "speakleash",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "SpeakLeash Team",
"author_email": "team@speakleash.org",
"download_url": "https://files.pythonhosted.org/packages/47/70/eb4cb2cb5fb0b8ab108ea552999c0770824a46c93101ffe86981be28b696/speakleash-0.0.11.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "SpeakLeash agnostic dataset for Polish",
"version": "0.0.11",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "373ba7fe4bc3cc1dd6f7333dfed1c52735870e6a04a13662222c3852a1349406",
"md5": "a82162345039f73c2a7ca32c5860bf3a",
"sha256": "1dcde410801754bdb0ffac382d10cab224fba87cb85ebc5c325115104fcb0099"
},
"downloads": -1,
"filename": "speakleash-0.0.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a82162345039f73c2a7ca32c5860bf3a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 3963,
"upload_time": "2023-01-25T10:33:24",
"upload_time_iso_8601": "2023-01-25T10:33:24.002603Z",
"url": "https://files.pythonhosted.org/packages/37/3b/a7fe4bc3cc1dd6f7333dfed1c52735870e6a04a13662222c3852a1349406/speakleash-0.0.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4770eb4cb2cb5fb0b8ab108ea552999c0770824a46c93101ffe86981be28b696",
"md5": "51635abd0da05eae85ef9c07710ecb30",
"sha256": "00265d0ad7cf1471dfc842158782654b9650ef330cbbfe97ca9e799a71e7dfb2"
},
"downloads": -1,
"filename": "speakleash-0.0.11.tar.gz",
"has_sig": false,
"md5_digest": "51635abd0da05eae85ef9c07710ecb30",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 3651,
"upload_time": "2023-01-25T10:34:50",
"upload_time_iso_8601": "2023-01-25T10:34:50.171691Z",
"url": "https://files.pythonhosted.org/packages/47/70/eb4cb2cb5fb0b8ab108ea552999c0770824a46c93101ffe86981be28b696/speakleash-0.0.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-01-25 10:34:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "speakleash",
"github_project": "speakleash",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "requests",
"specs": []
},
{
"name": "tqdm",
"specs": []
},
{
"name": "lm_dataformat",
"specs": []
}
],
"lcname": "speakleash"
}