joinem


- Name: joinem
- Version: 0.9.2
- Summary: CLI for fast, flexible concatenation of tabular data using Polars.
- Upload time: 2025-01-04 14:26:28
- Requires Python: >=3.8
- License: MIT license
- Keywords: polars, data processing, csv, parquet, data science
- Requirements: black, build, bump2version, click, isort, mypy-extensions, packaging, pathspec, pip-tools, platformdirs, polars, pyproject-hooks, ruff, tomli, tqdm, wheel
- Travis CI: none
- Coveralls test coverage: none

[![PyPi](https://img.shields.io/pypi/v/joinem.svg?)](https://pypi.python.org/pypi/joinem)
[![CI](https://github.com/mmore500/joinem/actions/workflows/ci.yaml/badge.svg)](https://github.com/mmore500/joinem/actions)
[![GitHub stars](https://img.shields.io/github/stars/mmore500/joinem.svg?style=round-square&logo=github&label=Stars&logoColor=white)](https://github.com/mmore500/joinem)
[![DOI](https://zenodo.org/badge/760045369.svg)](https://zenodo.org/doi/10.5281/zenodo.10701182)

**_joinem_** provides a CLI for fast, flexible concatenation of tabular data using [polars](https://pola.rs/).

- Free software: MIT license
- Repository: <https://github.com/mmore500/joinem>
- Documentation: <https://github.com/mmore500/joinem/blob/master/README.md>

## Install

`python3 -m pip install joinem`

## Features

- Lazily streams I/O to expeditiously handle numerous large files.
- Supports CSV and parquet input files.
    - Due to current polars limitations, JSON and feather files are not supported.
    - Input formats may be mixed.
- Supports output to CSV, JSON, parquet, and feather file types.
- Allows mismatched columns and/or empty data files with `--how diagonal` and `--how diagonal_relaxed`.
- Provides a progress bar with `--progress`.
- Adds programmatically generated columns to output.

## Example Usage

Pass input filenames via stdin, one filename per line.
```
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
```
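
The same invocation can be driven from Python. The following is a minimal sketch, assuming joinem is installed in the current interpreter's environment; the helper names `filename_payload` and `run_joinem` are hypothetical and not part of joinem.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable, List


def filename_payload(paths: Iterable[Path]) -> str:
    """Format paths the way joinem reads them on stdin: one filename per line."""
    return "".join(f"{path}\n" for path in paths)


def run_joinem(paths: Iterable[Path], output_file: str, *extra_args: str) -> None:
    """Pipe the filename list into `python3 -m joinem <output_file>`.

    Hypothetical helper: it only wraps the documented command line and
    raises CalledProcessError if joinem exits with a nonzero status.
    """
    command: List[str] = [sys.executable, "-m", "joinem", output_file, *extra_args]
    subprocess.run(command, input=filename_payload(paths), text=True, check=True)
```

For instance, `run_joinem(sorted(Path("path/to").glob("*.csv")), "out.parquet", "--progress")` would correspond to piping a sorted file listing into `python3 -m joinem out.parquet --progress`.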

Output file type is inferred from the extension of the output file name.
Supported output types are feather, JSON, parquet, and csv.
```
find . -name '*.parquet' | python3 -m joinem out.json
```

Use `--progress` to show a progress bar.
```
ls -1 path/{*.csv,*.pqt} | python3 -m joinem out.csv --progress
```

If file columns may mismatch, use `--how diagonal`.
```
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
```
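
To illustrate what diagonal concatenation means, here is a plain-Python sketch of the semantics: rows are stacked, the output takes the union of all columns, and cells missing from a given input become null. This is not joinem's or Polars' implementation; the function `diagonal_concat` is hypothetical, and ordering columns by first appearance is an assumption of the sketch.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def diagonal_concat(tables: List[List[Row]]) -> List[Row]:
    """Stack tables row-wise, taking the union of their columns."""
    # Collect column names in order of first appearance (sketch assumption).
    columns: List[str] = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit every row with the full column set; absent cells become None.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


first = [{"a": 1, "b": 2}]
second = [{"b": 3, "c": 4}]
for row in diagonal_concat([first, second]):
    print(row)
# {'a': 1, 'b': 2, 'c': None}
# {'a': None, 'b': 3, 'c': 4}
```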

If some files may be empty, use `--how diagonal_relaxed`.

Run via Singularity/Apptainer.
```
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
```

Add a literal-value column to the output.
```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
```

Alias an existing column in the output.
```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
```

Apply a regex to each source datafile's path to create a new column in the output.
```
ls -1 path/to/*.csv | python3 -m joinem out.csv \
  --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
```
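
To check what the pattern above extracts before running a large job, it can be tried with Python's standard `re` module. This is a sketch for experimentation only: Polars uses Rust regex syntax, where the first capture group is written `${1}`, while Python writes it as `\1`. The helper `filename_stem` is hypothetical.

```python
import re

# Same pattern as the joinem example; only the replacement syntax differs
# (Python writes the first capture group as \1, Polars as ${1}).
STEM_PATTERN = re.compile(r".*?([^/]*)\.csv")


def filename_stem(filepath: str) -> str:
    """Replace the first match with capture group 1, like Polars' str.replace."""
    return STEM_PATTERN.sub(r"\1", filepath, count=1)


print(filename_stem("path/to/trial_01.csv"))  # trial_01
```

Paths that do not match the pattern, such as `data.parquet`, are returned unchanged.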

Read data from stdin and write data to stdout.
```
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin --output-filetype csv --input-filetype csv
```

Advanced usage.
Write to parquet via stdout using `pv` to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.
```
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt --with-column 'pl.col("myValue").cast(pl.Categorical)' --write-kwarg 'compression="lz4"' | pv > concat.pqt
```

## API

```
usage: __main__.py [-h] [--version] [--progress] [--stdin] [--eager-read]
                   [--eager-write] [--with-column WITH_COLUMNS]
                   [--string-cache]
                   [--how {vertical,horizontal,diagonal,diagonal_relaxed}]
                   [--input-filetype INPUT_FILETYPE]
                   [--output-filetype OUTPUT_FILETYPE]
                   [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
                   output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --drop DROP           Columns to drop.
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars DataFrame.filter.
                        Example: 'pl.col("thing") == 0'
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with access
                        to each datafile's filepath as `filepath` and polars
                        as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,horizontal,diagonal,diagonal_relaxed}
                        How to concatenate frames. See
                        <https://docs.pola.rs/py-
                        polars/html/reference/api/polars.concat.html> for more
                        information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example: csv,
                        parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' |
python3 -m joinem out.csv
```
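
The `key=value` convention used by `--read-kwarg` and `--write-kwarg` can be sketched as follows. This is an illustration of the documented behavior (values are evaluated as Python expressions), not joinem's actual parsing code; `parse_kwarg` and `parse_kwargs` are hypothetical names.

```python
from typing import Any, Dict, Iterable, Tuple


def parse_kwarg(spec: str) -> Tuple[str, Any]:
    """Split 'key=value' on the first '=' and evaluate the value."""
    key, separator, expression = spec.partition("=")
    key = key.strip()
    if not separator or not key:
        raise ValueError(f"expected 'key=value', got {spec!r}")
    # eval runs arbitrary Python, so only pass arguments you trust.
    return key, eval(expression)


def parse_kwargs(specs: Iterable[str]) -> Dict[str, Any]:
    """Combine repeated flags into one keyword-argument dictionary."""
    return dict(parse_kwarg(spec) for spec in specs)


print(parse_kwargs(['compression="lz4"', "infer_schema_length=None"]))
# {'compression': 'lz4', 'infer_schema_length': None}
```

Because the value is a Python expression, string values need their own quotes inside the shell quoting, as in `--write-kwarg 'compression="lz4"'`.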

## Citing

If *joinem* contributes to a scholarly work, please cite it as

> Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

```bibtex
@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}
```

And don't forget to leave a [star on GitHub](https://github.com/mmore500/joinem/stargazers)!

            
