            ![GitHub Pages](https://github.com/yedivanseven/slangmod/actions/workflows/publish-documentation.yml/badge.svg)
![PyPI](https://github.com/yedivanseven/slangmod/actions/workflows/publish-package.yml/badge.svg)

# slangmod
_**small language model**_

Ever wondered how large language models (LLMs) like ChatGPT, Claude,
Llama, DeepSeek, _etc._, actually work, like, _really_ work? I did. And I
figured there is only one way to find out: Make one yourself. From scratch.

Of course, I wasn't expecting to beat the big players at their own game,
but I wanted to know what you can do on consumer hardware (meaning a
state-of-the-art gaming PC with a single graphics card supported by
[PyTorch](https://pytorch.org/)). So, naturally, it was going to be a *small*
language model. These hardware limitations are reflected in software
design choices. Specifically, `slangmod` does *not* employ any type of
parallelization that would keep multiple GPUs busy at the same time, and *all*
training data are loaded into CPU RAM at once, to be drip-fed to the model
on the GPU from there (1 billion tokens stored as 64-bit integers take up
about 7.5 GB).

Having said that, `slangmod` provides everything you need to
- preprocess and clean your text corpus;
- choose and train one of the HuggingFace
  [tokenizers](https://huggingface.co/docs/tokenizers/index);
- specify a Transformer model including the type of positional encodings
  and the feedforward block;
- train your model with a choice of optimizers and learning-rate schedulers,
  employing early-stopping if you like;
- monitor convergence and experiment with hyperparameters;
- explore text-generation algorithms like top-k, top-p, or beam search;
- and, finally, chat with your model.
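
Strung together, a full run through these steps might look roughly like the
sketch below. Apart from `clean` (which appears again further down in this
README) and the `-h` flag, the subcommand names are assumptions used purely for
illustration; run `slangmod -h` to see the actual commands and their options.
```shell
# Hypothetical end-to-end workflow; subcommands other than `clean`
# are illustrative -- consult `slangmod -h` for the real CLI.
slangmod clean      # preprocess and clean the raw text corpus
slangmod tokenize   # train a HuggingFace tokenizer on the cleaned corpus
slangmod train      # train the Transformer model itself
slangmod chat       # talk to the freshly trained model
```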

To do all these things, `slangmod` provides a command-line interface (CLI)
with fine-grained configuration options on the one hand, and the raw building
blocks it is made of on the other. Because both build on the foundational
functionality provided by the [swak](https://github.com/yedivanseven/swak) package, any other workflow
can be coded up quickly.


## Installation
### Python package
* Create a new virtual environment running at least `python 3.12`.
* The easiest way of installing `slangmod` is from the Python Package Index
[PyPI](https://pypi.org/project/slangmod/), where it is hosted. Simply type
  ```shell
  pip install slangmod
  ```
  or treat it like any other python package in your dependency management.
* While it is, in principle, possible to run `slangmod` on the CPU, this is
  only intended for debugging purposes. To get any results in finite time, you 
  also need a decent graphics card, and you must have a working installation
  of [PyTorch](https://pytorch.org/) to make good use of it. Because there is
  no way of knowing which version of CUDA (or ROCm) you have installed on your
  machine and how you installed it, [PyTorch](https://pytorch.org/) is not an
  explicit dependency of `slangmod`. You will have to install it yourself,
  _e.g._, following [these instructions](https://pytorch.org/get-started/locally/)
  (a `pip`-based example follows this list).
  If you are using `pipenv` for dependency management, you can also have a
  look at the [Pipfile](https://github.com/yedivanseven/slangmod/blob/main/Pipfile) in the root of the `slangmod` [repository](https://github.com/yedivanseven/slangmod)
  and tailor it to your needs. Personally, I go
  ```shell
  pipenv sync --categories=cpu
  ```
  for a CPU-only installation of PyTorch (for debugging only) and
  ```shell
  pipenv sync --categories=cuda
  ```
  if I want GPU support.
* Finally, with the virtual environment you just created active, open a console
  and type
  ```shell
  slangmod -h
  ```
  to check that everything works.
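
If you manage dependencies with plain `pip` instead of `pipenv`, a GPU-enabled
PyTorch build can usually be installed from the official PyTorch wheel index.
The exact index URL depends on your CUDA version; the one below is only an
example, so double-check it against the
[official instructions](https://pytorch.org/get-started/locally/).
```shell
# Example only: install a CUDA 12.4 build of PyTorch from the official wheel index.
# Replace cu124 with the tag matching your local CUDA installation (or use cpu).
pip install torch --index-url https://download.pytorch.org/whl/cu124
```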


### Docker image
A docker image with GPU-enabled [PyTorch](https://pytorch.org/) and all other
dependencies inside is available on the [Docker Hub](https://hub.docker.com/r/yedivanseven/slangmod).
```shell
docker pull yedivanseven/slangmod
```
To use it, you must have a host machine that
* has an NVIDIA GPU,
* has the drivers for it installed, and
* exposes it via the [container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/).
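
Whether the container toolkit is wired up correctly can be checked, for
example, by running `nvidia-smi` inside a throwaway CUDA container. The image
tag below is only an example; any CUDA base image compatible with your driver
will do.
```shell
# Sanity check: if the container toolkit works, this prints the host's
# GPU table from inside the container.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```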

Change into a *working directory*, i.e., one where `slangmod` will read its
config file *slangmod.toml* from and where it will save outputs to, and mount
this directory to the path `/workdir` inside the container when you run it.
```shell
docker run --rm -v ./:/workdir yedivanseven/slangmod
```
This will invoke `slangmod -h`.

If you also want to clean your raw text with the help of `slangmod`, you will
have to mount the folder containing those dirty files as well when you start
a docker container.
```shell
docker run --rm -v ./:/workdir -v /path/to/raw/docs:/raw yedivanseven/slangmod clean ...
```

For all other command-line options, and to find out more about the config TOML
file, refer to the documentation linked below.


## Documentation
The documentation for both the CLI and the API of `slangmod` is hosted
on [GitHub Pages](https://yedivanseven.github.io/slangmod/).

            
