speakleash


* Name: speakleash
* Version: 0.0.11
* Home page: https://github.com/speakleash/speakleash
* Summary: SpeakLeash agnostic dataset for Polish
* Upload time: 2023-01-25 10:34:50
* Author: SpeakLeash Team
* Requires Python: >=3.6
* Requirements: requests, tqdm, lm_dataformat

# SpeakLeash

SpeakLeash agnostic dataset for Polish

## Basic Usage

If you just want to see the details of the available datasets:

```
import os

from speakleash import Speakleash

# Local directory passed to Speakleash as the replication target
base_dir = os.path.dirname(os.path.abspath(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

# One summary line per dataset, built from its counters only
for d in sl.datasets:
    size_mb = round(d.characters / 1024 / 1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))

```
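
The size figure above is the character count divided by 1024 twice, so it approximates megabytes only if one character is counted as one byte. A minimal sketch of that arithmetic follows; `summarize` is a hypothetical helper (not part of the speakleash package) and the counters are made up for illustration:

```python
def summarize(name, characters, documents):
    """Format one summary line from a dataset's name and counters.

    Hypothetical helper for illustration; not part of the speakleash package.
    """
    # characters / 1024 / 1024 treats one character as one byte, so the
    # result is only an approximate size in MB.
    size_mb = round(characters / 1024 / 1024)
    return "Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(
        name, size_mb, characters, documents
    )


# Made-up counters: 50 * 1024 * 1024 characters in 1000 documents.
line = summarize("example", 52_428_800, 1000)
print(line)
```

Polish diacritics take two bytes each in UTF-8, so the size on disk is likely to be somewhat larger than this estimate.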

You can read individual properties (e.g. ***characters***, ***documents***), or display the entire manifest:
```
sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)

```

If you choose one of them (***.get(name of dataset)***), you will get a lot of text data ;-)
```
from speakleash import Speakleash
import os

base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

wiki = sl.get("plwiki").data
for doc in wiki:
    print(doc[:40])

```
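
Because a dataset can hold a very large number of documents, it is often enough to look at the first few. A minimal sketch using `itertools.islice` follows. The generator `fake_docs` and the helper `preview` are hypothetical stand-ins, and the sketch assumes the ***data*** property can be iterated like any Python iterable of strings:

```python
from itertools import islice


def fake_docs():
    """Stand-in for a dataset's `data` property: yields document texts."""
    for i in range(1_000_000):
        yield "Document number {0} with some longer body text".format(i)


def preview(docs, limit=3, width=40):
    """Return the first `width` characters of the first `limit` documents."""
    # islice stops after `limit` items, so the rest is never generated.
    return [doc[:width] for doc in islice(docs, limit)]


samples = preview(fake_docs())
for s in samples:
    print(s)
```

With a real dataset, `preview(sl.get("plwiki").data)` would be the equivalent call under the same assumption.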

If you also need metadata, use the ***ext_data*** property:
```
# sl is the Speakleash instance created in the previous example
ds = sl.get("plwiki").ext_data
for doc in ds:
    # Each item is a (text, metadata) pair
    txt, meta = doc
    print(meta.get("title"))
    print(txt)

```
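
Since each item from ***ext_data*** unpacks into a `(txt, meta)` pair, the metadata can be used to filter documents while iterating. A minimal sketch with hand-written stand-in pairs follows. The helper name `long_documents` and the `min_words` threshold are illustrative, and the `words` key is assumed to hold an integer count:

```python
def long_documents(pairs, min_words=5):
    """Yield (title, txt) for documents whose metadata reports enough words."""
    for txt, meta in pairs:
        # meta.get() returns the default when a key is missing,
        # so documents without a "words" entry are skipped.
        if meta.get("words", 0) >= min_words:
            yield meta.get("title"), txt


# Hand-written stand-ins for the (txt, meta) pairs produced by ext_data.
sample_pairs = [
    ("Krótki tekst.", {"title": "A", "words": 2}),
    ("To jest nieco dłuższy przykładowy dokument.", {"title": "B", "words": 6}),
    ("Brak metadanych o liczbie słów.", {"title": "C"}),
]

selected = list(long_documents(sample_pairs))
print(selected)
```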

Commonly used metadata fields:

* title
* length
* sentences
* words
* verbs
* nouns
* symbols
* punctuations
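
Fields like these make it straightforward to compute corpus-level totals. A minimal sketch that sums a few of them over stand-in `(txt, meta)` pairs follows. The helper `total_counts` is hypothetical, and the sketch assumes the values are integer counts, which the list above does not state explicitly:

```python
from collections import Counter


def total_counts(pairs, keys=("words", "sentences", "nouns", "verbs")):
    """Sum the selected numeric metadata fields over all documents."""
    totals = Counter()
    for _txt, meta in pairs:
        for key in keys:
            # Missing fields count as zero.
            totals[key] += meta.get(key, 0)
    return dict(totals)


# Hand-written stand-ins; real values would come from ext_data.
sample_pairs = [
    ("first document", {"words": 10, "sentences": 2, "nouns": 4, "verbs": 2}),
    ("second document", {"words": 5, "sentences": 1, "nouns": 2}),
]

totals = total_counts(sample_pairs)
print(totals)
```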




            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/speakleash/speakleash",
    "name": "speakleash",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "SpeakLeash Team",
    "author_email": "team@speakleash.org",
    "download_url": "https://files.pythonhosted.org/packages/47/70/eb4cb2cb5fb0b8ab108ea552999c0770824a46c93101ffe86981be28b696/speakleash-0.0.11.tar.gz",
    "platform": null,
    "description": "# SpeakLeash\n\nSpeakLeash agnostic dataset for Polish\n\n## Basic Usage\n\nIf you just want to see the details of the datasets\n\n```\nfrom speakleash import Speakleash\nimport os\n\nbase_dir = os.path.join(os.path.dirname(__file__))\nreplicate_to = os.path.join(base_dir, \"datasets\")\n\nsl = Speakleash(replicate_to)\n\nfor d in sl.datasets:\n    print(d.name)\n    for doc in d.data:\n        size_mb = round(d.characters/1024/1024)\n        print(\"Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}\".format(d.name, size_mb, d.characters, d.documents))\n\n```\n\nYou can use individual properties (e.g.:***characters***, ***documents***), but you can display the entire manifest\n```\nsl = Speakleash(replicate_to)\nprint(sl.get(\"plwiki\").manifest)\n\n```\n\nIf you chose one of them (***.get(name of dataset)***) then you will get a lot of text data ;-)\n```\nfrom speakleash import Speakleash\nimport os\n\nbase_dir = os.path.join(os.path.dirname(__file__))\nreplicate_to = os.path.join(base_dir, \"datasets\")\n\nsl = Speakleash(replicate_to)\n\nwiki = sl.get(\"plwiki\").data\nfor doc in wiki:\n    print(doc[:40])\n\n```\n\nIf you also need meta data then use the ***ext_data*** property\n```\n\nds = sl.get(\"plwiki\").ext_data\nfor doc in ds:\n    print(doc)\n    txt, meta = doc\n    print(meta.get(\"title\"))\n    print(txt)\n\n\n```\n\nPopular meta data:\n\n* title\n* length\n* sentences\n* words\n* verbs\n* nouns\n* symbols\n* punctuations\n\n\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "SpeakLeash agnostic dataset for Polish",
    "version": "0.0.11",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "373ba7fe4bc3cc1dd6f7333dfed1c52735870e6a04a13662222c3852a1349406",
                "md5": "a82162345039f73c2a7ca32c5860bf3a",
                "sha256": "1dcde410801754bdb0ffac382d10cab224fba87cb85ebc5c325115104fcb0099"
            },
            "downloads": -1,
            "filename": "speakleash-0.0.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a82162345039f73c2a7ca32c5860bf3a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 3963,
            "upload_time": "2023-01-25T10:33:24",
            "upload_time_iso_8601": "2023-01-25T10:33:24.002603Z",
            "url": "https://files.pythonhosted.org/packages/37/3b/a7fe4bc3cc1dd6f7333dfed1c52735870e6a04a13662222c3852a1349406/speakleash-0.0.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4770eb4cb2cb5fb0b8ab108ea552999c0770824a46c93101ffe86981be28b696",
                "md5": "51635abd0da05eae85ef9c07710ecb30",
                "sha256": "00265d0ad7cf1471dfc842158782654b9650ef330cbbfe97ca9e799a71e7dfb2"
            },
            "downloads": -1,
            "filename": "speakleash-0.0.11.tar.gz",
            "has_sig": false,
            "md5_digest": "51635abd0da05eae85ef9c07710ecb30",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 3651,
            "upload_time": "2023-01-25T10:34:50",
            "upload_time_iso_8601": "2023-01-25T10:34:50.171691Z",
            "url": "https://files.pythonhosted.org/packages/47/70/eb4cb2cb5fb0b8ab108ea552999c0770824a46c93101ffe86981be28b696/speakleash-0.0.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-25 10:34:50",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "speakleash",
    "github_project": "speakleash",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "requests",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "lm_dataformat",
            "specs": []
        }
    ],
    "lcname": "speakleash"
}
        