

# slangmod
_**small language model**_
Ever wondered how large language models (LLMs) like ChatGPT, Claude,
Llama, DeepSeek, _etc._, actually work, like, _really_ work? I did. And I
figured there is only one way to find out: Make one yourself. From scratch.
Of course, I wasn't expecting to beat the big players at their own game,
but I wanted to know what you can do on consumer hardware (meaning a
state-of-the-art gaming PC with a single graphics card supported by
[PyTorch](https://pytorch.org/)). So, naturally, it was going to be a *small*
language model. These hardware limitations are reflected in software
design choices. Specifically, `slangmod` does *not* employ any type of
parallelization that would keep multiple GPUs busy at the same time, and *all*
training data are loaded into CPU RAM at once, to be drip-fed to the model
on the GPU from there (one billion tokens stored as 64-bit integers take up
about 7.5 GB).
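As a quick sanity check on that number, one billion tokens at 8 bytes each
come to roughly 7.45 GiB:
```shell
python -c "print(f'{10**9 * 8 / 2**30:.2f} GiB')"  # one billion int64 tokens
```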
Having said that, `slangmod` provides everything you need to
- preprocess and clean your text corpus;
- choose and train one of the HuggingFace
[tokenizers](https://huggingface.co/docs/tokenizers/index);
- specify a Transformer model including the type of positional encodings
and the feedforward block;
- train your model with a choice of optimizers and learning-rate schedulers,
employing early stopping if you like;
- monitor convergence and experiment with hyperparameters;
- explore text-generation algorithms like top-k, top-p, or beam search;
- and, finally, chat with your model.
To do all these things, `slangmod` provides a command-line interface (CLI)
with fine-grained configuration options on the one hand and, on the other,
the raw building blocks it is made of. Leveraging the foundational
functionality provided by the [swak](https://github.com/yedivanseven/swak)
package, any other workflow can thus be quickly coded up.
## Installation
### Python package
* Create a new virtual environment running at least `python 3.12`.
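  For example, using Python's built-in `venv` module (the directory name
  `.venv` is just a convention, not something `slangmod` requires):
  ```shell
  python3.12 -m venv .venv
  source .venv/bin/activate
  ```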
* The easiest way of installing `slangmod` is from the Python Package Index
[PyPI](https://pypi.org/project/slangmod/), where it is hosted. Simply type
```shell
pip install slangmod
```
  or treat it like any other Python package in your dependency management.
* While it is, in principle, possible to run `slangmod` on the CPU, this is
only intended for debugging purposes. To get any results in finite time, you
also need a decent graphics card, and you must have a working installation
of [PyTorch](https://pytorch.org/) to make good use of it. Because there is
  no way of knowing which version of CUDA (or ROCm) you have installed on your
machine and how you installed it, [PyTorch](https://pytorch.org/) is not an
explicit dependency of `slangmod`. You will have to install it yourself,
_e.g._, following [these instructions](https://pytorch.org/get-started/locally/).
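  For example, at the time of writing, a CUDA-enabled build can typically be
  installed with something along these lines (the exact wheel index and CUDA
  tag depend on your system, so treat this as a sketch and check the linked
  instructions):
  ```shell
  pip install torch --index-url https://download.pytorch.org/whl/cu121
  ```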
If you are using `pipenv` for dependency management, you can also have a
look at the [Pipfile](https://github.com/yedivanseven/slangmod/blob/main/Pipfile) in the root of the `slangmod` [repository](https://github.com/yedivanseven/slangmod)
  and tailor it to your needs. Personally, I run
```shell
pipenv sync --categories=cpu
```
for a CPU-only installation of PyTorch (for debugging only) and
```shell
pipenv sync --categories=cuda
```
if I want GPU support.
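  Either way, `pipenv` expects the Pipfile in your current directory, for
  example from a clone of the repository:
  ```shell
  git clone https://github.com/yedivanseven/slangmod.git
  cd slangmod
  pipenv sync --categories=cuda
  ```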
* Finally, with the virtual environment you just created active, open a console
and type
```shell
slangmod -h
```
to check that everything works.
### Docker image
A Docker image with GPU-enabled [PyTorch](https://pytorch.org/) and all other
dependencies inside is available on the [Docker Hub](https://hub.docker.com/r/yedivanseven/slangmod).
```shell
docker pull yedivanseven/slangmod
```
To use it, you must have a host machine that
* has an NVIDIA GPU,
* has the drivers for it installed, and
* exposes it via the [container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/).
Change into a *working directory*, i.e., one where `slangmod` will read its
config file *slangmod.toml* from and where it will save outputs to, and mount
this directory to the path `/workdir` inside the container when you run it.
```shell
docker run --rm -v ./:/workdir yedivanseven/slangmod
```
This will invoke `slangmod -h`.
In the event that you want to clean your raw text with the help of
`slangmod`, you will also have to mount the folder with those dirty files
when you start a docker container.
```shell
docker run --rm -v ./:/workdir -v /path/to/raw/docs:/raw yedivanseven/slangmod clean ...
```
For all other command-line options, and to find out about the *slangmod.toml*
config file, refer to the documentation linked below.
## Documentation
The documentation for both the CLI and the API of `slangmod` is hosted
on [GitHub Pages](https://yedivanseven.github.io/slangmod/).