scilight


- Name: scilight
- Version: 0.6.3
- Home page: https://github.com/samuell/scilight
- Summary: Workflow library in pure python, for executing shell commands saving data to the file system without re-executing already executed tasks.
- Upload time: 2023-10-02 15:44:55
- Author: Samuel Lampa
- License: MIT
- Keywords: workflows, workflow, pipeline, task
- Requirements: No requirements were recorded.

# SciLight - Simple task execution in Python scripts

[![CircleCI](https://circleci.com/gh/samuell/scilight.svg?style=shield)](https://app.circleci.com/pipelines/github/samuell/scilight)
[![PyPI](https://img.shields.io/pypi/v/scilight.svg?style=flat)](https://pypi.org/project/scilight)

A super-simple library for performing stepwise batch tasks (implemented as shell
commands or Python functions) that save their results to files, such that outputs
from already finished tasks are not needlessly re-computed. See the
[example below](#example).

SciLight does not (currently) have a scheduler or central worker pool or
anything like that. Instead you simply execute your tasks manually in a
procedural way. This way task executions can easily be mixed with other
procedural python code.
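
The idea of skipping already finished tasks in plain procedural code can be sketched as below. This is only an illustration using the standard library: the helper `run_once` and the file names are hypothetical and are not part of SciLight's API, and SciLight's own check may differ in its details.

```python
import os
import tempfile


def run_once(output_path, produce):
    """Call produce(output_path) only if output_path does not exist yet.

    Hypothetical helper for illustration; not part of SciLight's API.
    Returns True if the task ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    produce(output_path)
    return True


def write_greeting(path):
    with open(path, 'w') as outfile:
        outfile.write('hello\n')


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, 'greeting.txt')

first_run = run_once(target, write_greeting)   # file is missing, so the task runs
second_run = run_once(target, write_greeting)  # file exists, so the task is skipped
print(first_run, second_run)
```

Because the check is just an ordinary function call, it can sit between any other lines of a script.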

SciLight can work as an alternative to full-blown workflow frameworks like Luigi
or Airflow for cases when you just have a single python script, where you want
to do a few batch steps before starting your interactive analysis, such as
downloading datasets, unpacking them, preprocessing et cetera.

SciLight is small (not much more than 100 lines of code), and has no external
dependencies, meaning that you can even copy the implementation into your own
code repos if you want to ensure maximum future reproducibility.

## What does 'SciLight' mean?

SciLight is named as a lighter-weight counterpart to
[SciPipe](https://scipipe.org), a workflow library written in Go by the same
author. If you need true multicore performance and a compiled language, you
might want to have a look at SciPipe instead.

## Prerequisites

- SciLight is so far only tested on Unix-like environments (it is runnable on Windows
  via Windows Subsystem for Linux (WSL)).

## Installation

Install from the Python Package Index using pip:

```
pip install scilight
```

## Usage

SciLight works by specifying either a shell command, or a python function to
be executed, as the first argument to `scilight.shell()` or `scilight.func()` respectively.

In shell commands, you need to replace input and output file paths with
placeholders of the form `[i:inputname]` and `[o:outputname:outputpath]`
respectively. Additionally, you can provide parameters (as string values) using
the `[p:paramname]` syntax. You also need to provide dicts that specify
the input paths, output paths and parameter values as appropriate, by passing
them to the optional `inputs`, `outputs` and `params` parameters of
`scilight.shell()` and `scilight.func()`. See the [example](#example) below for
concrete usage.

Inputs should always be provided via the `inputs` parameter, while output paths
are easiest to provide inline in the command, in the respective placeholder.
Note that you can re-use input placeholder values to produce the output path.
So, for example, if you want to name your output the same as the input, but
with an extra `.txt` extension, you can specify it like this in the command:
`somecommand > [o:myoutput:[i:myinput].txt]`.
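
To make the placeholder syntax more concrete, here is a minimal sketch of how such a command string could be expanded. The function `expand_command` is a hypothetical illustration written with the standard library only; it is not SciLight's implementation, and it ignores the path modifiers described in the following sections.

```python
import re


def expand_command(command, inputs=None, params=None):
    """Expand [i:name], [p:name] and [o:name:path] placeholders.

    Hypothetical illustration, not SciLight's implementation.
    Returns the expanded command and a dict of output names to paths.
    """
    inputs = inputs or {}
    params = params or {}
    outputs = {}

    # Resolve inputs and parameters first, so that their values can be
    # re-used inside output placeholders.
    command = re.sub(r'\[i:(\w+)\]', lambda m: inputs[m.group(1)], command)
    command = re.sub(r'\[p:(\w+)\]', lambda m: params[m.group(1)], command)

    def replace_output(match):
        name, path = match.group(1), match.group(2)
        outputs[name] = path
        return path

    # At this point output placeholders no longer contain nested brackets.
    command = re.sub(r'\[o:(\w+):([^\[\]]+)\]', replace_output, command)
    return command, outputs


command, outputs = expand_command(
    'head -n [p:count] [i:myinput] > [o:myoutput:[i:myinput].txt]',
    inputs={'myinput': 'data.csv'},
    params={'count': '10'},
)
print(command)  # head -n 10 data.csv > data.csv.txt
print(outputs)  # {'myoutput': 'data.csv.txt'}
```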

### Removing file extensions

If you have an existing extension in the input that you want to remove, you can
do it by adding `|%.actual-extension-here` in the input placeholder. So, if you
have an input `myinput` with the path `myfile.txt.gz`, you can reuse just the
`myfile.txt` part by writing `[i:myinput|%.gz]`, to remove the `.gz` part.
Putting that inside an output placeholder, you could for example do: `zcat
[i:archivefile] > [o:unpacked:[i:archivefile|%.gz]]`, in order to name the
unpacked file the same as the archive, but without the `.gz` extension.
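
The effect of the `|%.gz` modifier can be mimicked in plain Python as follows. The helper `strip_extension` is hypothetical and only shows the intended result; the `%` notation resembles suffix removal in shell parameter expansion (`${var%.gz}`), but that SciLight follows the shell semantics exactly is an assumption.

```python
def strip_extension(path, extension):
    """Remove extension from the end of path, if present.

    Hypothetical helper for illustration; not part of SciLight's API.
    """
    if extension and path.endswith(extension):
        return path[:-len(extension)]
    return path


print(strip_extension('myfile.txt.gz', '.gz'))  # myfile.txt
print(strip_extension('myfile.txt', '.gz'))     # myfile.txt (nothing to remove)
```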

### Removing parent directories from paths

Often the input path contains a long folder path that you
don't want to carry over when re-using the input filename. To strip the
parent directory structure from the path, you can add the `|basename` modifier inside any
path placeholder. So, if you have an input `myinput` with the path
`some/directory/structure/file.txt.gz`, you can reuse just the `file.txt.gz` part
by writing `[i:myinput|basename]`, to remove the `some/directory/structure`
part. Modifiers can be combined, so for example, given that you have an archive
file in another directory named `some/directory/structure`, you could do
the following to extract the archive, removing the `.gz` file extension and
putting the extracted file in a new directory named `other-directory`:
`zcat [i:archivefile] > other-directory/[o:unpacked:[i:archivefile|basename|%.gz]]`
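
As a rough sketch of what chaining modifiers amounts to, the snippet below applies `basename` and `%.gz` from left to right to an input path. The function `apply_modifiers` and the way the placeholder is split on `|` are assumptions made for illustration, not SciLight's actual code.

```python
import os


def apply_modifiers(path, modifiers):
    """Apply 'basename' and '%suffix' modifiers from left to right.

    Hypothetical illustration; not SciLight's implementation.
    """
    for modifier in modifiers:
        if modifier == 'basename':
            path = os.path.basename(path)
        elif modifier.startswith('%'):
            suffix = modifier[1:]
            if suffix and path.endswith(suffix):
                path = path[:-len(suffix)]
        else:
            raise ValueError(f'Unknown modifier: {modifier}')
    return path


# The part between "[i:" and "]" in the placeholder above
placeholder = 'archivefile|basename|%.gz'
name, *modifiers = placeholder.split('|')
inputs = {'archivefile': 'some/directory/structure/file.txt.gz'}

unpacked_name = apply_modifiers(inputs[name], modifiers)
print(unpacked_name)  # file.txt
```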

### Search and replace in paths

Sometimes you want to clean up paths in more creative ways, for
example by replacing spaces with `_` to avoid problems on Unix-like systems.
This can be done inside path placeholders by using the format
`[i:myinput|s/SEARCH/REPLACE/]`, so replacing spaces with underscores would be
`[i:myinput|s/ /_/]`.
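
The result of such a substitution can be illustrated with the sketch below. It assumes that `SEARCH` is treated as literal text; whether SciLight interprets it as a regular expression is not stated here, so take `apply_substitution` as a hypothetical helper only.

```python
def apply_substitution(path, modifier):
    """Apply an 's/SEARCH/REPLACE/' modifier, treating SEARCH as literal text.

    Hypothetical illustration; not SciLight's implementation.
    """
    parts = modifier.split('/')
    if len(parts) != 4 or parts[0] != 's' or parts[3] != '':
        raise ValueError(f'Not a valid substitution: {modifier}')
    search, replace = parts[1], parts[2]
    if not search:
        raise ValueError('SEARCH must not be empty')
    return path.replace(search, replace)


print(apply_substitution('my raw data.txt', 's/ /_/'))  # my_raw_data.txt
```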

See the example below for how to use some of this in practice!

## Example

Below is a small example that downloads a gzipped text file (in the so-called
FASTA format), un-gzips it, counts the number of A, T, G and C characters, and
calculates the fraction of G and C in relation to all A, T, G and C characters
(the so-called GC fraction measure for DNA).

The first two tasks are done by executing shell commands, and the third one
using a Python function.

```python
import scilight as sl

# ------------------------------------------------------------------------
# Download a gzipped fasta file and save it as chrmt.fa.gz
# ------------------------------------------------------------------------
url = 'ftp://ftp.ensembl.org/pub/release-100/fasta/'+ \
      'homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz'
download_task = sl.shell(f'wget -O [o:gz:chrmt.fa.gz] {url}')

# ------------------------------------------------------------------------
# Un-GZip the file, into a file named chrmt.fa
# ------------------------------------------------------------------------
ungzip_task = sl.shell('zcat [i:gz] > [o:fa:[i:gz|%.gz]]',
        inputs={'gz': download_task.outputs['gz']})

# ------------------------------------------------------------------------
# Count the fraction of G+C, vs G+C+A+T
# ------------------------------------------------------------------------
# A function for counting the GC fraction in DNA
def count_gcfrac_func(task):
    gc_count = 0
    at_count = 0

    with open(task.inputs['fa']) as infile:
        for line in infile:
            if line[0] == '>':
                continue
            for char in line:
                if char in ['A', 'T']:
                    at_count += 1
                elif char in ['G', 'C']:
                    gc_count += 1

    gc_fraction = gc_count/(gc_count+at_count)

    with open(task.outputs['gcfrac'], 'w') as outfile:
        outfile.write(f'{gc_fraction}\n')

# Execute the function
count_task = sl.func(count_gcfrac_func,
        inputs={'fa': ungzip_task.outputs['fa']},
        outputs={'gcfrac': 'gcfrac.txt'})
```

Add this code to a file named `gcfrac.py` and run it with:

```bash
python gcfrac.py
```
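
If you want to check the counting logic without downloading anything, it can be pulled out into a pure function that works on any iterable of lines. The variant below is a sketch that differs from the example above in two ways, both of which are assumptions on our part: it also counts lower-case bases, and it raises an error instead of dividing by zero when no bases are found.

```python
def gc_fraction(lines):
    """Return the fraction of G and C among all A, T, G and C characters."""
    gc_count = 0
    at_count = 0
    for line in lines:
        # Skip FASTA header lines
        if line.startswith('>'):
            continue
        for char in line.upper():
            if char in 'GC':
                gc_count += 1
            elif char in 'AT':
                at_count += 1
    total = gc_count + at_count
    if total == 0:
        raise ValueError('No A, T, G or C characters found')
    return gc_count / total


fasta_lines = ['>example sequence\n', 'GGCC\n', 'AATTgc\n']
print(gc_fraction(fasta_lines))  # 0.6
```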

            
