s3-datakit

Name: s3-datakit
Version: 0.3.9
Summary: A Python toolkit to simplify common operations between S3 and Pandas.
Author email: Carlos Coelho <coelho.carlosw@gmail.com>
Upload time: 2025-08-10 19:17:56
Requires Python: >=3.8
License: MIT License
# S3 DataKit 🧰

A Python toolkit to simplify common operations between Amazon S3 and Pandas DataFrames.

## Key Features

* **List** files in an S3 bucket.
* **Upload** local files to S3.
* **Download** files from S3 directly to a local path or a Pandas DataFrame.
* Supports **CSV** and **Stata (.dta)** when reading into DataFrames.

## Installation
```bash
pip install s3-datakit
```
or
```bash
uv add s3-datakit
```

## Credential Configuration
This package uses `boto3` to interact with AWS. `boto3` will automatically search for credentials in the following order:

1.  Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, etc.).
2.  The AWS CLI credentials file (`~/.aws/credentials`).
3.  IAM roles (if running on an EC2 instance or ECS container).

For local development, the easiest method is to use a `.env` file.

**1. Install `python-dotenv` in your project (not as a library dependency):**
```bash
pip install python-dotenv
```

**2. Create a `.env` file in your project's root:**
```
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
AWS_DEFAULT_REGION=your-region # e.g., us-east-1
```

**3. Load the variables in your script *before* using `s3datakit`:**
```python
from dotenv import load_dotenv
import s3datakit as s3dk

# Load environment variables from .env
load_dotenv()

# Now you can use the package's functions
s3dk.list_s3_files(bucket="my-bucket")
```

## Usage

### List Files

```python
import s3datakit as s3dk

file_list = s3dk.list_s3_files(bucket="my-data-bucket")
if file_list:
    print(file_list)
```

### Upload a File

You can specify the full destination path in S3. If `s3_path` is not provided, the original filename from `local_path` is used as the S3 object key.
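As a rough sketch of this documented default (not the package's actual implementation; `default_s3_key` is a hypothetical helper name), the key is presumably just the local file's basename:

```python
import os

def default_s3_key(local_path: str) -> str:
    # Assumed behavior: when s3_path is omitted, the object key
    # is the basename of the local file, with directories stripped.
    return os.path.basename(local_path)

print(default_s3_key("reports/report.csv"))  # report.csv
```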

```python
import s3datakit as s3dk

# Upload with a specific S3 path
s3dk.upload_s3_file(
    local_path="reports/report.csv",
    bucket="my-data-bucket",
    s3_path="final-reports/report_2025.csv"
)

# Upload using the local filename as the S3 key
# This will upload 'reports/report.csv' to 's3://my-data-bucket/report.csv'
s3dk.upload_s3_file(
    local_path="reports/report.csv",
    bucket="my-data-bucket"
)
```

### Download a File

The `download_s3_file` function is versatile: you can download a file to a local path or load it directly into a Pandas DataFrame.

It accepts the following parameters:

* `bucket` (str, required): The name of the S3 bucket where the file is located.
* `s3_path` (str, required): The full path (key) of the file within the bucket.
* `local_path` (str, optional): The local path where the file will be saved. If not provided, the file is saved in a `data/` directory in your current working directory, using its original S3 filename.
* `to_df` (bool, optional, default `False`): If `True`, the function attempts to read the downloaded file into a Pandas DataFrame. This is useful for `.csv` and Stata `.dta` files.
* `replace` (bool, optional, default `False`): If `True`, an existing local file is overwritten. By default, the download is skipped if the file is already present, saving time and bandwidth.
* `low_memory` (bool, optional, default `True`): When reading a CSV into a DataFrame (`to_df=True`), this is passed to `pandas.read_csv` to process the file in chunks, which can reduce memory usage for large files.
* `sep` (str, optional, default `","`): The separator or delimiter to use when reading a CSV file into a DataFrame. For example, use `'\t'` for tab-separated files.
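To illustrate how `local_path` and `replace` interact, here is a minimal stdlib sketch of the documented defaults (not the package's actual code; `resolve_local_path` and `should_download` are hypothetical helper names):

```python
from pathlib import Path
from typing import Optional

def resolve_local_path(s3_path: str, local_path: Optional[str] = None) -> Path:
    # Assumed default: save under ./data/ using the S3 object's filename.
    if local_path is not None:
        return Path(local_path)
    return Path("data") / Path(s3_path).name

def should_download(target: Path, replace: bool = False) -> bool:
    # Skip the transfer when the file already exists, unless replace=True.
    return replace or not target.exists()

target = resolve_local_path("final-reports/report_2025.csv")
print(target)  # data/report_2025.csv (backslashes on Windows)
```

The skip-if-exists default means repeated runs of a script only pay for the first download; pass `replace=True` when the object in S3 may have changed.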

**Option 1: Download to a local path**

By default, if `local_path` is not provided, files are saved to a `data/` directory in the current working directory.

```python
import s3datakit as s3dk

# Download to a specific path
local_file = s3dk.download_s3_file(
    bucket="my-data-bucket",
    s3_path="final-reports/report_2025.csv",
    local_path="downloads/report.csv"
)
print(f"File downloaded to: {local_file}")

# Download to the default 'data/' directory, overwriting if it exists
s3dk.download_s3_file(
    bucket="my-data-bucket",
    s3_path="final-reports/report_2025.csv",
    replace=True
)
```

**Option 2: Download directly to a Pandas DataFrame**
```python
import s3datakit as s3dk

df = s3dk.download_s3_file(
    bucket="my-data-bucket",
    s3_path="stata-data/survey.dta",
    to_df=True
)
print(df.head())
```
