phi-torch

Name: phi-torch
Version: 0.0.4
Home page: https://github.com/kyegomez/phi-1
Summary: Phi - Pytorch
Upload time: 2023-09-18 01:54:26
Author: Kye Gomez
Requires Python: >=3.6,<4.0
License: MIT
Keywords: artificial intelligence, attention mechanism, transformers
Requirements: none recorded
[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)

Since Phi is ready to train, Agora is actively seeking cloud providers or grant providers to train this all-new model and release it open source. If you would like to learn more, please email me at `kye@apac.ai`.

# Phi: Ultra-Fast and Ultra-Intelligent SOTA Language Model πŸš€πŸŒŒ

[Textbooks Are All You Need](https://arxiv.org/abs/2306.11644)

Phi is a state-of-the-art language model that pushes the boundaries of natural language understanding and generation. Designed for high performance and efficiency, Phi is built upon advanced techniques that make it a strong contender against the likes of OpenAI's GPT-4 and PaLM.



# Usage
Get started:

1. Install the package from PyPI (this project is published as `phi-torch`):

```bash
pip install phi-torch
```

Alternatively, clone the repository at https://github.com/kyegomez/phi-1 and install its requirements.

# Training

First, configure Hugging Face Accelerate (enable DeepSpeed ZeRO Stage 3 when prompted):

`accelerate config`

Then launch distributed training:

`accelerate launch train_distributed_accelerate.py`



## Dataset building

You can preprocess a different dataset in a way similar to the C4 dataset used during training by running the `build_dataset.py` script. It pre-tokenizes the data, chunks it into blocks of a specified sequence length, and uploads the result to the Hugging Face Hub. For example:

```bash
python3 Phi/build_dataset.py --seed 42 --seq_len 8192 --hf_account "HUGGINGFACE APIKEY" --tokenizer "EleutherAI/gpt-neox-20b" --dataset_name "EleutherAI/the_pile_deduplicated"
```
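
For reference, here is a minimal sketch of this kind of preprocessing using the `datasets` and `transformers` libraries. The actual `build_dataset.py` may be organized differently, and the target repository id below is a placeholder.

```python
# Illustrative sketch only; build_dataset.py in this repository may differ.
from datasets import load_dataset
from transformers import AutoTokenizer

seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
dataset = load_dataset("EleutherAI/the_pile_deduplicated", split="train")

def tokenize(batch):
    # Pre-tokenize the raw text column.
    return tokenizer(batch["text"])

def chunk(batch):
    # Concatenate all token ids, then slice them into fixed-length blocks.
    flat = [tok for seq in batch["input_ids"] for tok in seq]
    n = (len(flat) // seq_len) * seq_len
    return {"input_ids": [flat[i : i + seq_len] for i in range(0, n, seq_len)]}

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
chunked = tokenized.map(chunk, batched=True, remove_columns=tokenized.column_names)
chunked.push_to_hub("your-hf-account/the-pile-deduplicated-tokenized")  # placeholder repo id
```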



# Inference

```bash
python3 inference.py "My dog is very cute" --seq_len 256 --temperature 0.8 --filter_thres 0.9 --model "phi"
```

Note: the model has not yet been submitted to the PyTorch Hub, so loading it from there is not available yet.



## Model Architecture πŸ§ πŸ”§

```python
model = TransformerWrapper(
    num_tokens=64007,                 # vocabulary size
    max_seq_len=8192,                 # maximum context length
    use_abs_pos_emb=False,            # no absolute positional embeddings (ALiBi/xPos are used instead)
    tokenizer=tokenizer,              # a tokenizer instance must be created beforehand
    embedding_provider=AndromedaEmbedding(),
    attn_layers=Decoder(
        dim=128,                      # model width (2048 at full scale)
        depth=8,                      # number of layers (16 at full scale)
        dim_head=128,
        heads=8,
        alibi_pos_bias=True,          # ALiBi positional bias
        alibi_num_heads=4,
        rotary_xpos=True,             # rotary (xPos) position encodings
        attn_flash=True,              # FlashAttention
        deepnorm=True,                # DeepNorm residual scaling
        shift_tokens=1,
        attn_one_kv_head=True,        # multi-query attention (single KV head)
        qk_norm=True,
        attn_qk_norm=True,
        attn_qk_norm_dim_scale=True,  # set this in addition to `attn_qk_norm=True`
    )
)
```
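
For a quick sanity check, here is a minimal, hedged usage sketch. It assumes this wrapper behaves like the x-transformers API it resembles, mapping a batch of token ids to next-token logits; the exact call signature in this repository may differ.

```python
# Assumption: model(token_ids) returns logits of shape (batch, seq_len, num_tokens).
import torch

token_ids = torch.randint(0, 64007, (1, 1024))  # random ids just to exercise the forward pass
logits = model(token_ids)                        # expected shape: (1, 1024, 64007)
```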

## Roadmap πŸ—ΊοΈπŸ“

1. **Training phase**: Train Phi on a large-scale dataset to achieve SOTA performance in various natural language processing tasks.

2. **World-class inference infrastructure**: Establish a robust and efficient infrastructure that leverages techniques such as:

   - Model quantization: Reduce memory and computational requirements without significant loss in performance.
   - Distillation: Train smaller, faster models that retain the knowledge of the larger model.
   - Optimized serving frameworks: Deploy Phi using efficient serving frameworks, such as NVIDIA Triton or TensorFlow Serving, for rapid inference.

3. **Continuous improvement**: Continuously fine-tune Phi on diverse data sources and adapt it to new tasks and domains.

4. **Community-driven development**: Encourage open-source contributions, including pre-processing improvements, advanced training techniques, and novel use cases.

## Why Phi? πŸŒ πŸ’‘

Phi can potentially be fine-tuned with a sequence length of 100k+ tokens.
Phi leverages advanced techniques to optimize its performance and efficiency, including ALiBi positional bias, rotary position encodings (xPos), FlashAttention, and deep normalization (DeepNorm). Let's explore the benefits of these techniques and provide some usage examples.

### ALiBi Positional Bias

ALiBi (Attention with Linear Biases) adds a fixed, distance-proportional bias to attention scores instead of learned positional embeddings. This lets the model capture relative positions between tokens and extrapolate to sequences longer than those seen during training.

Usage example:

```python
attn_layers = Decoder(
    ...
    alibi_pos_bias=True,
    alibi_num_heads=4,
    ...
)
```

### Rotary Position Encodings (xPos)

Rotary position encodings rotate the query and key vectors as a function of position, encoding relative positions directly in the attention computation and removing the need for absolute positional embeddings. The xPos variant adds a decay term that improves length extrapolation.

Usage example:

```python
attn_layers = Decoder(
    ...
    rotary_xpos=True,
    ...
)
```

### Flash Attention

FlashAttention computes exact attention more efficiently by tiling the computation and avoiding materializing the full attention matrix in GPU memory. The reduced memory traffic accelerates both training and inference without changing the attention output.

Usage example:

```python
attn_layers = Decoder(
    ...
    attn_flash=True,
    ...
)
```

### Deep Normalization (deepnorm)

DeepNorm scales residual connections and adjusts initialization in deep transformers, stabilizing training and allowing much deeper networks to converge reliably.

Usage example:

```python
attn_layers = Decoder(
    ...
    deepnorm=True,
    ...
)
```

# Phi Principles
- **Efficiency**: Phi incorporates cutting-edge optimization techniques, such as FlashAttention, rotary position encodings, and deep normalization, resulting in efficient training and inference.

- **Flexibility**: The modular design of Phi allows for easy adaptation to various tasks and domains, making it a versatile choice for a wide range of applications.

- **Scalability**: Phi's architecture is designed to scale with the ever-growing computational resources and data sizes, ensuring its continuous relevance in the NLP landscape.

- **Community-driven**: As an open-source project, Phi thrives on contributions from the community, fostering an environment of collaboration, innovation, and continuous improvement.

Join us on this exciting journey to create a powerful, efficient, and intelligent language model that will revolutionize the NLP landscape! πŸš€πŸŒŸ

## Todo

* [Pretrain on Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)

* [Finetune on OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)

* [Create synthetic datasets with The Distiller](https://github.com/Agora-X/The-Distiller)

# Implementing the Phi-1 Model

This guide is meant to assist you in implementing the Phi-1 model, a decoder-only transformer [VSP+ 17] that uses the FlashAttention implementation of multi-head attention (MHA) [DFE+ 22].

## 1. Architecture

1. **Phi-1 model**: Implement an architecture with the following specifications:
   - 24 layers
   - Hidden dimension of 2048
   - MLP-inner dimension of 8192
   - 32 attention heads of dimension 64 each
2. **Phi-1-small model**: Implement an architecture with the following specifications:
   - 20 layers
   - Hidden dimension of 1024
   - MLP-inner dimension of 4096
   - 16 attention heads of dimension 64 each
3. For both architectures, include rotary position embedding [SLP+ 21] with a rotary dimension of 32.
4. Tokenize your data using the same tokenizer as codegen-350M-mono [NPH+ 22].
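
For orientation only, here is a hedged sketch of how the Phi-1 spec above might map onto an x-transformers-style API like the one used earlier in this README. The import path, `ff_mult`, `rotary_emb_dim`, and dropout argument names are assumptions, not this repository's actual configuration.

```python
# Hypothetical sketch of the Phi-1 spec using x-transformers-style arguments;
# the repository's actual training code may configure the model differently.
from transformers import AutoTokenizer
from x_transformers import TransformerWrapper, Decoder  # assumed import path

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

phi_1 = TransformerWrapper(
    num_tokens=len(tokenizer),   # vocabulary size of the codegen-350M-mono tokenizer
    max_seq_len=2048,            # pretraining sequence length (Section 2 below)
    use_abs_pos_emb=False,       # rotary embeddings are used instead of absolute ones
    attn_layers=Decoder(
        dim=2048,                # hidden dimension
        depth=24,                # number of layers
        heads=32,                # attention heads
        dim_head=64,             # per-head dimension
        ff_mult=4,               # 2048 * 4 = 8192 MLP-inner dimension
        rotary_pos_emb=True,     # rotary position embedding
        rotary_emb_dim=32,       # rotary dimension of 32
        attn_flash=True,         # FlashAttention MHA
        attn_dropout=0.1,        # attention dropout (Section 2 below)
        ff_dropout=0.1,          # residual dropout (Section 2 below)
    ),
)
```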

## 2. Pretraining

1. Concatenate your dataset into a one-dimensional array, using the "⟨∣endoftext∣⟩" token to separate files.
2. Train your model with a next-token prediction loss on sequences of length 2048 sliced from your dataset array.
3. Use the AdamW optimizer and a linear-warmup-linear-decay learning rate schedule (a sketch follows this list).
4. Use attention and residual dropout of 0.1.
5. Execute your training on 8 NVIDIA A100 GPUs using DeepSpeed.
6. Use the following specifications for training:
   - Effective batch size: 1024
   - Maximum learning rate: 1e-3
   - Warmup over 750 steps
   - Weight decay: 0.1
7. Run your training for a total of 36,000 steps, using the checkpoint at 24,000 steps as your Phi-1-base.
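
For illustration, the optimizer and learning-rate schedule described above could be set up in plain PyTorch roughly as follows. This is a sketch, not the repository's training loop; it assumes `model` is the Phi-1 module built earlier.

```python
# Sketch: AdamW with a linear-warmup-linear-decay schedule for the pretraining run above.
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

max_lr = 1e-3
warmup_steps = 750
total_steps = 36_000

optimizer = AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)

def linear_warmup_linear_decay(step):
    # Ramp linearly from 0 to max_lr over warmup_steps, then decay linearly back to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_linear_decay)
```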

## 3. Finetuning

1. Finetune your Phi-1-base model on your respective finetuning dataset.
2. Follow the same setup as pretraining, but with different hyperparameters:
   - Effective batch size: 256
   - Maximum learning rate: 1e-4
   - Warmup over 50 steps
   - Weight decay: 0.01
3. Run your training for a total of 6,000 steps and pick the best checkpoint (saved every 1000 steps).
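
The same training setup can be reused for finetuning by swapping in the hyperparameters from this section, collected here as an illustrative config (names are placeholders):

```python
# Finetuning hyperparameters from the list above, grouped for reuse in the same training loop.
finetune_config = dict(
    effective_batch_size=256,
    max_lr=1e-4,
    warmup_steps=50,
    weight_decay=0.01,
    total_steps=6_000,
    save_every_steps=1_000,  # pick the best of the saved checkpoints
)
```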

            
