[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# HeptaPod Non-Linear Transformer
The HeptaPod Non-Linear Transformer is a novel deep learning architecture inspired by the linguistic capabilities of the Heptapods from the movie "Arrival". Rather than producing text one token at a time, it aims to generate text non-linearly, in all directions simultaneously, rethinking how sequence generation is approached.
# Install
`pip3 install --upgrade nonlinear-transformer`
## Usage
```python
import torch
from heptapod.model import NonLinearTransformer
# A 10x10 grid of token ids drawn from a 100-token vocabulary
x = torch.randint(0, 100, (10, 10))

model = NonLinearTransformer(
    vocab_size=100, embed_size=128, matrix_dim=10, heads=8, window_size=3, iterations=2
)

out = model(x)
print(out.shape)
```
## Training
- The training script is still being ironed out; contributions are welcome.
## Table of Contents
- [Introduction](#introduction)
- [Architecture Overview](#architecture-overview)
  - [2D Rotary Embeddings](#2d-rotary-embeddings)
  - [Local 2D Attention](#local-2d-attention)
  - [Non-Linear Transformer Block](#non-linear-transformer-block)
- [Implementation](#implementation)
- [Usage](#usage)
- [License](#license)
## Introduction
Traditional transformers generate sequences linearly, token by token. The HeptaPod Non-Linear Transformer, however, works with 2D matrices of tokens, where each token is influenced by its neighbors in all directions. This architecture is designed to generate text resembling the Heptapods' logograms, which convey meaning non-linearly.
## Architecture Overview
The main components of the HeptaPod Non-Linear Transformer are:
### 2D Rotary Embeddings
Positional information is crucial for transformers. Unlike 1D embeddings used in traditional transformers, the HeptaPod transformer uses 2D rotary embeddings. These embeddings capture both row-wise and column-wise positional information, ensuring every token understands its position in the 2D matrix.
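Below is a minimal sketch of one way to realize 2D rotary embeddings, assuming the embedding dimension is split in half, with one half rotated by the row index and the other by the column index. The function names (`rotary_2d`, `apply_rotary`) are illustrative and not part of the package's API.

```python
import torch

def rotary_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # positions: (n,) integer positions; returns (n, dim/2) rotation angles
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rotary(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # x: (..., dim) with even dim; rotate consecutive feature pairs by the given angles
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def rotary_2d(x: torch.Tensor) -> torch.Tensor:
    # x: (rows, cols, dim), dim assumed divisible by 4.
    # First half of the features encodes row position, second half column position.
    rows, cols, dim = x.shape
    half = dim // 2
    row_ang = rotary_angles(torch.arange(rows), half)          # (rows, half/2)
    col_ang = rotary_angles(torch.arange(cols), half)          # (cols, half/2)
    x_row = apply_rotary(x[..., :half], row_ang[:, None, :])   # broadcast over columns
    x_col = apply_rotary(x[..., half:], col_ang[None, :, :])   # broadcast over rows
    return torch.cat((x_row, x_col), dim=-1)

x = torch.randn(10, 10, 128)
print(rotary_2d(x).shape)  # torch.Size([10, 10, 128])
```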
### Local 2D Attention
Instead of attending to all tokens in the sequence, the Local 2D Attention mechanism focuses on a localized window around each token. Each token attends only to its immediate neighbors, defined by a specified window size. This localized attention ensures that each token gathers context from its surroundings, making the generation process truly non-linear.
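The following is a minimal, single-head sketch of local 2D attention, assuming a square window and zero-padding at the borders; it illustrates the gather-and-attend pattern rather than the package's actual implementation.

```python
import torch
import torch.nn.functional as F

def local_2d_attention(x: torch.Tensor, window_size: int = 3) -> torch.Tensor:
    # x: (rows, cols, dim). Each position attends to its window_size x window_size
    # neighborhood; for simplicity, padded border neighbors are ordinary zero keys.
    rows, cols, dim = x.shape
    pad = window_size // 2
    q = x.reshape(rows * cols, 1, dim)                                   # one query per position
    # Gather each position's neighborhood: (1, dim*ws*ws, rows*cols)
    patches = F.unfold(x.permute(2, 0, 1).unsqueeze(0), kernel_size=window_size, padding=pad)
    kv = patches.squeeze(0).T.reshape(rows * cols, dim, window_size * window_size).transpose(1, 2)
    attn = torch.softmax(q @ kv.transpose(1, 2) / dim ** 0.5, dim=-1)    # (rows*cols, 1, ws*ws)
    out = attn @ kv                                                      # (rows*cols, 1, dim)
    return out.reshape(rows, cols, dim)

x = torch.randn(10, 10, 128)
print(local_2d_attention(x, window_size=3).shape)  # torch.Size([10, 10, 128])
```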
### Non-Linear Transformer Block
This is the core of the architecture. Each block consists of:
1. Layer normalization
2. Local 2D attention mechanism
3. A feed-forward neural network
These blocks can be stacked to deepen the architecture, allowing the model to learn more complex patterns and relationships in the data.
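As a rough illustration of that block structure, the sketch below uses pre-norm residual connections (an assumption) and substitutes `nn.MultiheadAttention` over the flattened grid for the local 2D attention described above.

```python
import torch
import torch.nn as nn

class NonLinearBlock(nn.Module):
    """Pre-norm residual block: LayerNorm -> attention -> LayerNorm -> feed-forward.
    nn.MultiheadAttention over the flattened grid stands in for local 2D attention."""

    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (rows, cols, dim) -> flatten the grid into a sequence for attention
        rows, cols, dim = x.shape
        seq = x.reshape(1, rows * cols, dim)
        h = self.norm1(seq)
        seq = seq + self.attn(h, h, h, need_weights=False)[0]
        seq = seq + self.ff(self.norm2(seq))
        return seq.reshape(rows, cols, dim)

block = NonLinearBlock(dim=128)
print(block(torch.randn(10, 10, 128)).shape)  # torch.Size([10, 10, 128])
```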
## Implementation
The implementation is done in PyTorch, one of the leading deep learning libraries. The design ensures modularity, allowing easy customization and experimentation.
Key features:
1. Modular design: Each component, like the Local 2D Attention mechanism, is implemented as a separate module, allowing for easy modifications and replacements.
2. Extensibility: The architecture is designed to be easily extensible. You can stack multiple Non-Linear Transformer Blocks to increase the model's depth.
Remember to adjust hyperparameters such as `embed_size`, `matrix_dim`, `window_size`, and `iterations` to suit your dataset and requirements.
# Deep Dive
## Architecture Details
### Token Representation in 2D
The representation of tokens in a 2D matrix is the foundation of the HeptaPod Non-Linear Transformer. Unlike traditional transformers that operate on 1D sequences, this architecture treats its input as a 2D grid, which inherently lets it capture relationships in multiple dimensions, both row-wise and column-wise.
### Hierarchical Processing
One potential advancement to this model is the introduction of hierarchical processing. After processing the entire matrix at a given resolution, the model could further abstract the matrix into larger "chunks" or "blocks", treating each chunk as a super-token. This hierarchical processing can help in capturing broader context, much like pooling layers in CNNs.
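A minimal sketch of that chunking step, assuming 2x2 average pooling over the token grid to form super-tokens (the helper name is hypothetical, not part of the package):

```python
import torch
import torch.nn.functional as F

def to_super_tokens(x: torch.Tensor, chunk: int = 2) -> torch.Tensor:
    # x: (rows, cols, dim) -> (rows/chunk, cols/chunk, dim),
    # averaging each chunk x chunk block into one super-token
    pooled = F.avg_pool2d(x.permute(2, 0, 1).unsqueeze(0), kernel_size=chunk)
    return pooled.squeeze(0).permute(1, 2, 0)

x = torch.randn(10, 10, 128)
print(to_super_tokens(x).shape)  # torch.Size([5, 5, 128])
```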
### Local vs. Global Attention
While the primary focus is on local attention, there could be merit in periodically applying global attention to capture long-range dependencies. A hybrid approach, where certain layers (or certain heads within layers) employ global attention, could offer a balance between local context and global understanding.
### Conditional Masking
Considering the non-linear nature of the text, it might be beneficial to apply conditional masks during training. Rather than always attending to the same local window, the model could be trained to decide where to look based on the token's content, allowing dynamic context windows.
## Potential Methods for Improvement
### Adaptive Window Sizes
While a fixed window size offers simplicity, an adaptive window mechanism that adjusts the size based on the token's context can capture varying degrees of local information.
### Multi-Scale Representation
Just as multi-scale feature maps are beneficial in image processing tasks, using multi-scale token representations could offer richer context. This involves processing the input matrix at different resolutions and integrating the results.
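One possible (hypothetical) realization: pool the token grid at several scales, upsample each coarse view back to the original resolution, and combine them. The scale factors below are chosen to divide the 10x10 example grid evenly.

```python
import torch
import torch.nn.functional as F

def multi_scale(x: torch.Tensor, scales=(1, 2, 5)) -> torch.Tensor:
    # x: (rows, cols, dim). Pool the grid at each scale, upsample back, and average.
    rows, cols, dim = x.shape
    grid = x.permute(2, 0, 1).unsqueeze(0)                       # (1, dim, rows, cols)
    outs = []
    for s in scales:
        pooled = F.avg_pool2d(grid, kernel_size=s)               # coarser view of the matrix
        outs.append(F.interpolate(pooled, size=(rows, cols), mode="nearest"))
    return torch.stack(outs).mean(0).squeeze(0).permute(1, 2, 0)

x = torch.randn(10, 10, 128)
print(multi_scale(x).shape)  # torch.Size([10, 10, 128])
```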
### Cross-Attention Between Hierarchies
If hierarchical processing is employed, introducing cross-attention mechanisms between different hierarchies can ensure better information flow.
### Sparse Attention Mechanisms
To efficiently capture long-range dependencies without the computational cost of global attention, sparse attention mechanisms like the ones proposed in models like the Longformer could be integrated.
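As an illustration of such a pattern, the sketch below builds a Longformer-style boolean attention mask over the flattened grid, combining a local band with a few globally attending positions. The function and its parameters are hypothetical.

```python
import torch

def sparse_band_mask(n: int, band: int = 4, n_global: int = 2) -> torch.Tensor:
    # Boolean (n, n) mask where True means "may attend". Each token attends within a
    # local band of width 2*band+1; the first n_global tokens attend and are attended
    # to globally, Longformer-style.
    i = torch.arange(n)
    mask = (i[:, None] - i[None, :]).abs() <= band
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

# Fraction of attended pairs stays far below 1 as the grid grows.
print(sparse_band_mask(100).float().mean())
```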
## Further Work
### Integration with Vision Models
Given the 2D nature of the input, there's potential synergy with vision models. Combining the HeptaPod Non-Linear Transformer with architectures like Vision Transformers (ViTs) could yield models that excel in tasks involving both text and images.
### Transfer Learning & Pre-training
Exploring pre-training strategies on large corpora can make the HeptaPod Non-Linear Transformer more versatile. Fine-tuning on specific tasks post pre-training can lead to better performance, leveraging knowledge from vast amounts of data.
### Feedback Loops
Introducing feedback loops where the output is recursively fed back as input can help in refining the generated matrix, potentially leading to more coherent outputs.
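A minimal sketch of such a loop, under the assumption (not confirmed by the package) that the model maps a matrix of token ids to per-position vocabulary logits:

```python
import torch

def refine(model, x: torch.Tensor, passes: int = 3) -> torch.Tensor:
    # Assumption for illustration: `model` maps a (rows, cols) matrix of token ids
    # to logits of shape (rows, cols, vocab_size).
    for _ in range(passes):
        with torch.no_grad():
            logits = model(x)
        x = logits.argmax(dim=-1)  # feed the decoded matrix back in as the next input
    return x

# e.g. with the README's example model: refined = refine(model, torch.randint(0, 100, (10, 10)))
```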
### Custom Loss Functions
Given the non-linear generation process, custom loss functions that reward coherent formation in multiple directions can be beneficial. This would be in addition to the traditional token prediction losses.
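One hypothetical example of such a loss: standard cross-entropy plus a penalty that encourages horizontally and vertically adjacent predicted distributions to agree. The smoothness term and its weighting are illustrative, not the repository's loss.

```python
import torch
import torch.nn.functional as F

def nonlinear_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # logits: (rows, cols, vocab), targets: (rows, cols) token ids
    rows, cols, vocab = logits.shape
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))  # standard token loss
    probs = logits.softmax(dim=-1)
    # Penalize disagreement between horizontally and vertically adjacent predictions.
    smooth = ((probs[:, 1:] - probs[:, :-1]) ** 2).mean() + \
             ((probs[1:, :] - probs[:-1, :]) ** 2).mean()
    return ce + alpha * smooth

logits = torch.randn(10, 10, 100)
targets = torch.randint(0, 100, (10, 10))
print(nonlinear_loss(logits, targets))
```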
### Token Merging Strategies
Post generation, there's potential in exploring strategies that merge or group tokens in the 2D matrix to form super-tokens, condensing information and making it more interpretable.
## Architectural Conclusion
The HeptaPod Non-Linear Transformer represents a paradigm shift in sequence generation. While the foundation is promising, the architecture offers numerous avenues for exploration, innovation, and improvement. As with any novel approach, iterative research, experimentation, and collaboration with the broader research community will be pivotal in realizing its full potential.
## License
This project is licensed under the MIT License. This ensures that the HeptaPod Non-Linear Transformer is free for all to use, modify, and distribute. We believe in open-source and encourage innovations and improvements to the concept.
# Todo
- [ ] Implement the 2D non-linear training script and train the model
- [ ] Benchmark the model and improve its non-linear structures