fastsafetensors is an efficient safetensors model loader.
We introduced three major features to optimize model loading performance:
1. Batched, lazy tensor instantiation
2. GPU offloading for sharding, type conversions, and device pointer alignment
3. GPU Direct Storage (GDS) support for loading files from storage directly into GPU memory
A major design difference from the original safetensors file loader is that fastsafetensors does *not* use `mmap`.
The original loader instantiates tensors on demand from mmap'ed files,
but unfortunately that approach cannot fully utilize high-throughput I/O devices such as NVMe SSDs.
Instead, fastsafetensors asynchronously transfers files in parallel to saturate storage throughput,
and then lazily instantiates tensors in GPU device memory with DLPack.
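As a rough illustration of the idea only (this is a minimal sketch, not the fastsafetensors implementation; it assumes a single file and standard PyTorch), chunked reads can be issued from multiple threads into one pinned host buffer and then copied to the GPU in a single transfer:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import torch

def parallel_read_to_gpu(path: str, device: torch.device, chunk: int = 16 << 20) -> torch.Tensor:
    """Read `path` with parallel chunked reads into pinned memory, then copy to `device`."""
    size = os.path.getsize(path)
    pin = torch.cuda.is_available()
    host = torch.empty(size, dtype=torch.uint8, pin_memory=pin)
    buf = memoryview(host.numpy())

    def read_chunk(offset: int) -> None:
        # Each worker opens its own unbuffered handle and fills its slice of the buffer.
        with open(path, "rb", buffering=0) as f:
            f.seek(offset)
            end = min(offset + chunk, size)
            f.readinto(buf[offset:end])

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(read_chunk, range(0, size, chunk)))
    return host.to(device, non_blocking=True)
```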
Another design change is to offload sharding and other tensor manipulations to GPUs.
The original loader has user programs slice tensors for sharding before copying them to device memory, which incurs high CPU usage for host memory accesses.
Instead, we introduce special APIs that run sharding with `torch.distributed` collective operations such as `broadcast` and `scatter` (a conceptual sketch follows this paragraph).
The offloading also applies to other tensor manipulations such as type conversions.
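For intuition only, here is a minimal sketch of sharding a tensor along one dimension with a `torch.distributed` scatter, so that slicing happens on the rank holding the full tensor rather than on every CPU; the function name and arguments are illustrative and are not the library's API:

```python
import torch
import torch.distributed as dist

def scatter_shard(full, shard_shape, dtype, dim, src_rank, device):
    """Scatter equal slices of `full` (present only on src_rank) along `dim`."""
    rank = dist.get_rank()
    world = dist.get_world_size()
    out = torch.empty(shard_shape, dtype=dtype, device=device)
    scatter_list = None
    if rank == src_rank:
        # Slice once on the source device; assumes the dimension divides evenly by world size.
        scatter_list = [t.contiguous() for t in torch.chunk(full, world, dim=dim)]
    dist.scatter(out, scatter_list, src=src_rank)
    return out
```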
These two designs extend naturally to device-to-device data transfers with GPU Direct Storage (GDS).
GDS minimizes copy overheads from NVMe SSDs to GPU memory by bypassing the host CPU and memory.
See [doc/overview.md](doc/overview.md) for more details.
## Dependencies
We currently test fastsafetensors only with Python 3.11, PyTorch 2.1, and CUDA 12.
Note: other versions of PyTorch may require changes to the build environment for libtorch, since its ABI appears to change slightly between releases.
## Install from PyPI
```bash
pip install fastsafetensors
```
## Local installation
Prerequisites: Install torch, CUDA, and NUMA headers
```bash
make install
```
## Package build
Prerequisites: Install Docker (libtorch 2.1, CUDA, and NUMA dependencies are pulled automatically)
```bash
make dist
```
## Unit tests
After installing fastsafetensors with `pip` or `make install`, run
```bash
make unittest
```
## Basic API usage
`SafeTensorsFileLoader` is the primary entry point of the fastsafetensors library. To use it, pass either `SingleGroup()` for simple inference or a `ProcessGroup` (from `torch.distributed`) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support; specify these with the `device` and `nogds` arguments, respectively. Note that if GDS is not available, the loader will fail to open files when `nogds=False`. For more information on enabling GDS, please refer to the NVIDIA documentation.
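For orientation, here is a minimal sketch of constructing the loader for each case; both variants mirror the full examples further below, and `nogds=True` is used so the sketch does not require a GDS installation:

```python
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Simple (single-process) inference: use SingleGroup().
loader = SafeTensorsFileLoader(SingleGroup(), device, nogds=True, debug_log=False)

# Tensor-parallel inference: pass a torch.distributed process group instead, e.g.
# dist.group.WORLD after dist.init_process_group(...); see the parallel example below.
```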
After creating a `SafeTensorsFileLoader` instance, first map target files to ranks with the `.add_filenames()` method. Then call `.copy_files_to_device()` to trigger the actual file copies into aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor with the `.get_tensor()` method. You can also obtain sharded tensors with `.get_sharded()`, which internally runs collective operations in `torch.distributed`.
Important: To release the GPU memory allocated for tensors, you must explicitly call the `.close()` method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is your responsibility to ensure that all tensors are released before calling `.close()`, which then safely frees the underlying GPU memory. A minimal sketch of this release order follows.
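The sketch below reuses a `loader` constructed as in the examples; the key point is that tensors returned by the loader are views into loader-owned GPU memory, so keep a `clone()` only if a value must outlive the loader:

```python
fb = loader.copy_files_to_device()
weight = fb.get_tensor(tensor_name="a0")   # a view into loader-owned GPU memory
kept = weight.clone()                      # independent copy, safe to use after close()
del weight                                 # drop all views into the shared fragments first
fb.close()                                 # now the aggregated GPU memory can be freed
loader.close()
```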
## Example: single run
examples/run_single.py:
```python
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
loader = SafeTensorsFileLoader(SingleGroup(), device, nogds=True, debug_log=True)
loader.add_filenames({0: ["a.safetensors", "b.safetensors"]}) # {rank: files}
fb = loader.copy_files_to_device()
tensor_a0 = fb.get_tensor(tensor_name="a0")
print(f"a0: {tensor_a0}")
fb.close()
loader.close()
```
```bash
cd examples
python run_single.py
```
Example output:
```
add_filenames 1: path=a.safetensors
[DEBUG] raw_device_pointer: raw_alloc: 0x7acf000, length=256, elapsed=3 us
[DEBUG] nogds_file_reader.submit_read: cudaHostAlloc, size=1048576, elapsed=10 us
[DEBUG] nogds_file_reader.submit_read #3, thread_id=1
[DEBUG] nogds_file_reader._thread: read (mmap=0), fd=4, offset=104, count=256, c=256, copy=13 us, cuda_copy=0 us
wait_io: tensor=a0
a0: tensor([[ 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 1.],
[ 2., 2., 2., 2., 2., 2., 2., 2.],
[ 3., 3., 3., 3., 3., 3., 3., 3.],
[ 4., 4., 4., 4., 4., 4., 4., 4.],
[ 5., 5., 5., 5., 5., 5., 5., 5.],
[ 6., 6., 6., 6., 6., 6., 6., 6.],
[ 7., 7., 7., 7., 7., 7., 7., 7.],
[ 8., 8., 8., 8., 8., 8., 8., 8.],
[ 9., 9., 9., 9., 9., 9., 9., 9.],
[10., 10., 10., 10., 10., 10., 10., 10.],
[11., 11., 11., 11., 11., 11., 11., 11.],
[12., 12., 12., 12., 12., 12., 12., 12.],
[13., 13., 13., 13., 13., 13., 13., 13.],
[14., 14., 14., 14., 14., 14., 14., 14.],
[15., 15., 15., 15., 15., 15., 15., 15.]], dtype=torch.float16)
[DEBUG] ~nogds_file_reader: elapsed=28 us
[DEBUG] ~raw_device_pointer: torch_raw_delete: 0x7acf000, elapsed=0 us
```
## Example: parallel run
examples/run_parallel.py:
```python
import torch
import torch.distributed as dist
from fastsafetensors import SafeTensorsFileLoader
dist.init_process_group(backend="gloo")
dist.barrier()
pg = dist.group.WORLD
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
loader = SafeTensorsFileLoader(pg, device, nogds=True, debug_log=True)
loader.add_filenames({0: ["a.safetensors"], 1:["b.safetensors"]}) # {rank: files}
fb = loader.copy_files_to_device()
tensor_a0 = fb.get_tensor(tensor_name="a0") # broadcast
tensor_b0_sharded = fb.get_sharded(tensor_name="b0", dim=1) # partition and scatter
print(f"RANK {pg.rank()}: tensor_a0={tensor_a0}")
print(f"RANK {pg.rank()}: tensor_b0_sharded={tensor_b0_sharded}")
fb.close()
loader.close()
```
You can test the script with `torchrun`:
```bash
cd examples
torchrun --nnodes=2 --master_addr=0.0.0.0 --master_port=1234 --node_rank=0 run_parallel.py &
PIDS+=($!)
torchrun --nnodes=2 --master_addr=0.0.0.0 --master_port=1234 --node_rank=1 run_parallel.py &
PIDS+=($!)
wait "${PIDS[@]}"
```
Example output:
```
add_filenames 1: path=a.safetensors
[DEBUG] raw_device_pointer: raw_alloc: 0x6ba1000, length=256, elapsed=2 us
[DEBUG] nogds_file_reader.submit_read: cudaHostAlloc, size=1048576, elapsed=10 us
[DEBUG] nogds_file_reader.submit_read #3, thread_id=1
[DEBUG] nogds_file_reader._thread: read (mmap=0), fd=15, offset=104, count=256, c=256, copy=15 us, cuda_copy=0 us
wait_io: tensor=a0
shuffle: broadcast, tensor_name=a0, shape=torch.Size([16, 8]), self.rank=0, pg.rank()=0, has_tensor=True
add_filenames 2: path=b.safetensors
[DEBUG] raw_device_pointer: raw_alloc: 0x7cbb000, length=256, elapsed=2 us
[DEBUG] nogds_file_reader.submit_read: cudaHostAlloc, size=1048576, elapsed=12 us
[DEBUG] nogds_file_reader.submit_read #3, thread_id=1
[DEBUG] nogds_file_reader._thread: read (mmap=0), fd=15, offset=104, count=256, c=256, copy=15 us, cuda_copy=0 us
wait_io: tensor=b0
shuffle: broadcast, tensor_name=a0, shape=torch.Size([16, 8]), self.rank=0, pg.rank()=1, has_tensor=False
_get_tensor: free_dev_ptrs, lidx=0, src=a.safetensorsshuffle: use cache, tensor_name=a0
[DEBUG] ~raw_device_pointer: torch_raw_delete: 0x6ba1000, elapsed=0 us
shuffle: use cache, tensor_name=a0
_get_tensor: free_dev_ptrs, lidx=0, src=a.safetensors
shuffle: scatter, tensor_name=b0, shape=torch.Size([16, 8])->torch.Size([16, 4]), self.rank=1, pg.rank()=0, rank_slices=[(slice(None, None, None), slice(0, 4, 1)), (slice(None, None, None), slice(4, 8, 1))], len(scatter_list)=0
shuffle: scatter, tensor_name=b0, shape=torch.Size([16, 8])->torch.Size([16, 4]), self.rank=1, pg.rank()=1, rank_slices=[(slice(None, None, None), slice(0, 4, 1)), (slice(None, None, None), slice(4, 8, 1))], len(scatter_list)=2
_get_tensor: free_dev_ptrs, lidx=0, src=b.safetensors
[DEBUG] ~raw_device_pointer: torch_raw_delete: 0x7cbb000, elapsed=0 us
RANK 0: tensor_a0=tensor([[ 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 1.],
[ 2., 2., 2., 2., 2., 2., 2., 2.],
[ 3., 3., 3., 3., 3., 3., 3., 3.],
[ 4., 4., 4., 4., 4., 4., 4., 4.],
[ 5., 5., 5., 5., 5., 5., 5., 5.],
[ 6., 6., 6., 6., 6., 6., 6., 6.],
[ 7., 7., 7., 7., 7., 7., 7., 7.],
[ 8., 8., 8., 8., 8., 8., 8., 8.],
[ 9., 9., 9., 9., 9., 9., 9., 9.],
[10., 10., 10., 10., 10., 10., 10., 10.],
[11., 11., 11., 11., 11., 11., 11., 11.],
[12., 12., 12., 12., 12., 12., 12., 12.],
[13., 13., 13., 13., 13., 13., 13., 13.],
[14., 14., 14., 14., 14., 14., 14., 14.],
[15., 15., 15., 15., 15., 15., 15., 15.]], dtype=torch.float16)RANK 1: tensor_a0=tensor([[ 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 1., 1., 1., 1., 1., 1., 1.],
[ 2., 2., 2., 2., 2., 2., 2., 2.],
[ 3., 3., 3., 3., 3., 3., 3., 3.],
[ 4., 4., 4., 4., 4., 4., 4., 4.],
[ 5., 5., 5., 5., 5., 5., 5., 5.],
[ 6., 6., 6., 6., 6., 6., 6., 6.],
[ 7., 7., 7., 7., 7., 7., 7., 7.],
[ 8., 8., 8., 8., 8., 8., 8., 8.],
[ 9., 9., 9., 9., 9., 9., 9., 9.],
[10., 10., 10., 10., 10., 10., 10., 10.],
[11., 11., 11., 11., 11., 11., 11., 11.],
[12., 12., 12., 12., 12., 12., 12., 12.],
[13., 13., 13., 13., 13., 13., 13., 13.],
[14., 14., 14., 14., 14., 14., 14., 14.],
[15., 15., 15., 15., 15., 15., 15., 15.]], dtype=torch.float16)
RANK 1: tensor_b0_sharded=tensor([[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[ 4., 4., 4., 4.],
[ 5., 5., 5., 5.],
[ 6., 6., 6., 6.],
[ 7., 7., 7., 7.],
[ 8., 8., 8., 8.],
[ 9., 9., 9., 9.],
[10., 10., 10., 10.],
[11., 11., 11., 11.],
[12., 12., 12., 12.],
[13., 13., 13., 13.],
[14., 14., 14., 14.],
[15., 15., 15., 15.]], dtype=torch.float16)RANK 0: tensor_b0_sharded=tensor([[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[ 4., 4., 4., 4.],
[ 5., 5., 5., 5.],
[ 6., 6., 6., 6.],
[ 7., 7., 7., 7.],
[ 8., 8., 8., 8.],
[ 9., 9., 9., 9.],
[10., 10., 10., 10.],
[11., 11., 11., 11.],
[12., 12., 12., 12.],
[13., 13., 13., 13.],
[14., 14., 14., 14.],
[15., 15., 15., 15.]], dtype=torch.float16)
```
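The two backgrounded `torchrun` commands above emulate two nodes on one machine. If you only need two ranks on a single node, an invocation like the following should also work (this is a generic `torchrun` pattern, not taken from the project's examples):

```bash
cd examples
torchrun --nnodes=1 --nproc_per_node=2 run_parallel.py
```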