stdflow

Name: stdflow
Version: 0.0.73 (PyPI)
Home page: https://github.com/CyprienRicque/stdflow
Summary: Data flow tool that transforms your notebooks and Python files into pipeline steps by standardizing data input/output (for data science projects)
Upload time: 2023-10-12 03:56:56
Author: Cyprien Ricque
Requires Python: >=3.7
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python "data science" data flow pipeline
Requirements: none recorded
            # stdflow

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Data flow tool that transforms your notebooks and Python files into
pipeline steps by standardizing data input/output (for data science
projects).

Create clean data flow pipelines simply by replacing your `pd.read_csv()`
and `df.to_csv()` calls with `sf.load()` and `sf.save()`.

[Documentation](https://cyprienricque.github.io/stdflow/)

## Install

``` sh
pip install stdflow
```

## How to use

### Pipelines

``` python
from stdflow import StepRunner
from stdflow.pipeline import Pipeline

# Pipeline with 2 steps

dm = "../demo_project/notebooks/"

ingestion_ppl = Pipeline([
    StepRunner(dm + "01_ingestion/countries.ipynb"), 
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
])

# === OR ===
ingestion_ppl = Pipeline(
    StepRunner(dm + "01_ingestion/countries.ipynb"), 
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
)

# === OR ===
ingestion_ppl = Pipeline()

ingestion_ppl.add_step(StepRunner(dm + "01_ingestion/countries.ipynb"))
# OR
ingestion_ppl.add_step(dm + "01_ingestion/world_happiness.ipynb")


ingestion_ppl
```


    ================================
                PIPELINE            
    ================================

    STEP 1
        path: ../demo_project/notebooks/01_ingestion/countries.ipynb
        vars: {}

    STEP 2
        path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
        vars: {}

    ================================

Run the pipeline

``` python
ingestion_ppl.run(verbose=True, kernel=":any_available")
```

    =================================================================================
        01.                ../demo_project/notebooks/01_ingestion/countries.ipynb
    =================================================================================
    Variables: {}
    using kernel:  python3
        Path: ../demo_project/notebooks/01_ingestion/countries.ipynb
        Duration: 0 days 00:00:00.603051
        Env: {}
    Notebook executed successfully.


    =================================================================================
        02.          ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
    =================================================================================
    Variables: {}
    using kernel:  python3
        Path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
        Duration: 0 days 00:00:00.644909
        Env: {}
    Notebook executed successfully.
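The `Pipeline`/`StepRunner` pattern above can be illustrated with a minimal, self-contained sketch. This is not stdflow's actual implementation; it only mirrors the construction styles shown (a list of steps, varargs, or `add_step` with a runner or a raw path), and stands in notebook execution with a simple string record.

``` python
# Minimal sketch of the step-runner pattern -- NOT stdflow's internals.

class StepRunner:
    def __init__(self, path, variables=None):
        self.path = path
        self.variables = variables or {}

    def run(self):
        # stdflow would execute the notebook here; this sketch just records it
        return f"ran {self.path} with vars {self.variables}"


class Pipeline:
    def __init__(self, *steps):
        # accept Pipeline([a, b]) as well as Pipeline(a, b), as shown above
        if len(steps) == 1 and isinstance(steps[0], list):
            steps = steps[0]
        self.steps = list(steps)

    def add_step(self, step):
        # accept a raw path, like add_step in the example
        if isinstance(step, str):
            step = StepRunner(step)
        self.steps.append(step)

    def run(self):
        # execute steps strictly in order
        return [s.run() for s in self.steps]


ppl = Pipeline()
ppl.add_step("01_ingestion/countries.ipynb")
results = ppl.run()
print(results)
```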

## Load and save data

### Option 1: Specify All Parameters

``` python
import stdflow as sf
import pandas as pd


# load data from ../demo_project/data/countries/step_created/v_202309212245/countries.csv
df = sf.load(
   root="../demo_project/data/",
   attrs=['countries'],
   step='created',
   version=':last',  # loads last version in alphanumeric order
   file_name='countries.csv',
   method=pd.read_csv,  # or method='csv'
   verbose=False,
)

# export data to ../demo_project/data/countries/step_loaded/v_2023-03/countries.csv
sf.save(
   df,
   root="../demo_project/data/",
   attrs='countries/',
   step='loaded',
   version='%Y-03',  # creates v_2023-03
   file_name='countries.csv',
   method=pd.DataFrame.to_csv,  # or method='csv'  or any function that takes the object to export as first input
)
```

    attrs=countries/::step_name=loaded::version=2023-03::file_name=countries.csv

Each time you perform a save, a metadata.json file is created in the
folder. This keeps track of how your data was created and other
information.
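The `version=':last'` behaviour (pick the latest version in alphanumeric order) can be sketched in plain Python. This is illustrative only, not stdflow's internal code; it assumes version folders are named with the `v_` prefix described later in this document.

``` python
# Sketch of ':last' version selection by alphanumeric order
# (illustrative only, not stdflow's internal code).

def last_version(folders):
    """Return the alphanumerically greatest v_* folder, or None."""
    versions = [f for f in folders if f.startswith("v_")]
    return max(versions) if versions else None


folders = ["v_202309212245", "v_202310101716", "metadata.json"]
print(last_version(folders))  # v_202310101716
```

Note that alphanumeric order matches chronological order only because the timestamp format is zero-padded and ordered from most to least significant unit.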

### Option 2: Use default variables

``` python
import stdflow as sf
sf.reset()  # used when multiple steps are done with the same Step object (not recommended). see below

# use package level default values
sf.root = "../demo_project/data/"
sf.attrs = 'countries'  # if needed use attrs_in and attrs_out
sf.step_in = 'loaded'
sf.step_out = 'formatted'

df = sf.load()
# ! root / attrs / step : used from default values set above
# ! version : the last version was automatically used. default: ":last"
# ! file_name : the file, alone in the folder, was automatically found
# ! method : was automatically used from the file extension

sf.save(df)
# ! root / attrs / step : used from default values set above
# ! version: used default %Y%m%d%H%M format
# ! file_name: used from the input (because only one file)
# ! method : inferred from file name
```

    attrs=countries::step_name=formatted::version=202310101716::file_name=countries.csv
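Version strings like `202310101716` above come from strftime-style codes: the default is `%Y%m%d%H%M`, and a pattern like `%Y-03` mixes a code with literal text. Both can be reproduced with the standard library:

``` python
from datetime import datetime

# fixed time so the example is reproducible
now = datetime(2023, 10, 10, 17, 16)

print(now.strftime("%Y%m%d%H%M"))  # 202310101716  (default version format)
print(now.strftime("%Y-03"))       # 2023-03       (literal text passes through)
```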

Note that everything we did at the package level can also be done with the
`Step` class. When you have multiple steps in a notebook, you can create
one `Step` object per step; `stdflow` (`sf`) at the package level is a
singleton instance of `Step`.

``` python
from stdflow import Step

step = Step(
    root="../demo_project/data/",
    attrs='countries',
    step_in='formatted',
    step_out='pre_processed'
)
# or set after
step.root = "../demo_project/data/"
# ...

df = step.load(version=':last', file_name=":auto", verbose=True)

step.save(df, verbose=True)
```

    INFO:stdflow.step:Loading data from ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv
    INFO:stdflow.step:Data loaded from ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv
    INFO:stdflow.step:Saving data to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
    INFO:stdflow.step:Data saved to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
    INFO:stdflow.step:Saving metadata to ../demo_project/data/countries/step_pre_processed/v_202310101716/

    attrs=countries::step_name=pre_processed::version=202310101716::file_name=countries.csv


## Do not

- Save to the same directory from different steps: this erases the
  metadata from the previous step.

## Data visualization

``` python
import stdflow as sf

step.save(df, verbose=True, export_viz_tool=True)
```

    INFO:stdflow.step:Saving data to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
    INFO:stdflow.step:Data saved to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
    INFO:stdflow.step:Saving metadata to ../demo_project/data/countries/step_pre_processed/v_202310101716/
    INFO:stdflow.step:Exporting viz tool to ../demo_project/data/countries/step_pre_processed/v_202310101716/

    attrs=countries::step_name=pre_processed::version=202310101716::file_name=countries.csv

This command exports a folder `metadata_viz` in the same folder as the
data you exported. The metadata to display is saved in the metadata.json
file.

To display it, get both the file and the folder onto your local machine
(download them if you are working on a server).

Then open the HTML file from your file explorer; it should open in your
browser and let you upload the `metadata.json` file.


## Data Organization

### Format

The data folder organization is systematic and is used by the load and
save functions. It follows this format:
`root_data_folder/attrs_1/attrs_2/…/attrs_n/step_name/version/file_name`

where:

- `root_data_folder`: the path to the root of your data folder; not
  exported in the metadata
- `attrs`: information to classify your dataset (e.g. country, client, …)
- `step_name`: the name of the step; always prefixed with `step_`
- `version`: the version of the step; always prefixed with `v_`
- `file_name`: the name of the file; can be anything
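The layout above can be assembled with `pathlib`. A minimal sketch (illustrative; `build_path` is a hypothetical helper, not part of stdflow's API):

``` python
from pathlib import Path

def build_path(root, attrs, step_name, version, file_name):
    """Assemble root/attrs_1/.../attrs_n/step_<name>/v_<version>/file_name."""
    return Path(root, *attrs, f"step_{step_name}", f"v_{version}", file_name)


p = build_path("../demo_project/data", ["countries"], "loaded",
               "202309212245", "countries.csv")
print(p.as_posix())
# ../demo_project/data/countries/step_loaded/v_202309212245/countries.csv
```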

Each folder is the output of a step. It contains a `metadata.json` file
with information about all files in the folder and how they were
generated. It can also contain an HTML page (if you set
`html_export=True` in `save()`) that lets you visualize the pipeline and
your metadata.

## Best Practices

- Do not use `sf.reset` in your final code.
- In one step, export to only one path (apart from the version): one
  combination of attrs and step_name per step.
- Do not create sub-directories within the export (the version folder is
  the last depth). If you need a similar operation for different
  datasets, create pipelines.

            
