**_joinem_** provides a CLI for fast, flexible concatenation of tabular data using [polars](https://pola.rs/).
- Free software: MIT license
- Repository: <https://github.com/mmore500/joinem>
- Documentation: <https://github.com/mmore500/joinem/blob/master/README.md>
## Install
`python3 -m pip install joinem`
## Features
- Lazily streams I/O to expeditiously handle numerous large files.
- Supports CSV and parquet input files.
  - Due to current polars limitations, JSON and feather files are not supported.
  - Input formats may be mixed.
- Supports output to CSV, JSON, parquet, and feather file types.
- Allows mismatched columns and/or empty data files with `--how diagonal` and `--how diagonal_relaxed`.
- Provides a progress bar with `--progress`.
- Adds programmatically generated columns to output with `--with-column`.
## Example Usage
Pass input filenames via stdin, one filename per line.
```
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
```
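
The same pipeline can also be driven from Python. The following is a minimal sketch, assuming joinem is installed for the running interpreter; the helper names `collect_inputs`, `to_stdin_payload`, and `run_joinem` are hypothetical and not part of joinem. Only the `python3 -m joinem OUTPUT` invocation with one filename per line on stdin is taken from the example above.

```python
import subprocess
import sys
from pathlib import Path


def collect_inputs(directory, patterns=("*.parquet", "*.csv")):
    """Return matching file paths directly under `directory`, sorted."""
    root = Path(directory)
    found = []
    for pattern in patterns:
        found.extend(str(path) for path in root.glob(pattern))
    return sorted(found)


def to_stdin_payload(paths):
    """Format paths the way joinem reads them: one filename per line."""
    return "".join(f"{path}\n" for path in paths)


def run_joinem(paths, output_file):
    """Invoke the joinem CLI (assumption: joinem is installed)."""
    return subprocess.run(
        [sys.executable, "-m", "joinem", output_file],
        input=to_stdin_payload(paths),
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```

For example, `run_joinem(collect_inputs("path/to"), "out.parquet")` would mirror the shell pipeline above.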
Output file type is inferred from the extension of the output file name.
Supported output types are feather, JSON, parquet, and CSV.
```
find . -name '*.parquet' | python3 -m joinem out.json
```
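
As an illustration of extension-based inference, here is a small sketch. It is not joinem's actual implementation: the function name `infer_filetype` is hypothetical, and lowercasing the extension is an assumption.

```python
from pathlib import Path


def infer_filetype(filename):
    """Return the lowercased extension without its dot, or None if absent.

    Hypothetical helper for illustration only; joinem's own inference
    logic may differ.
    """
    suffix = Path(filename).suffix
    return suffix[1:].lower() if suffix else None
```

A path such as `/dev/stdout` has no extension to infer from, which is why the stdout examples in this README pass `--output-filetype` explicitly.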
Use `--progress` to show a progress bar.
```
ls -1 path/{*.csv,*.pqt} | python3 -m joinem out.csv --progress
```
If file columns may mismatch, use `--how diagonal`.
```
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
```
If some files may be empty, use `--how diagonal_relaxed`.
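
As a rough illustration of what the diagonal strategies do: polars' `diagonal` concatenation takes the union of the input columns and fills missing values with nulls, and `diagonal_relaxed` additionally coerces mismatched column types to a common supertype. The sketch below models only the column-union behavior, using plain lists of dicts rather than polars; `diagonal_concat` is a hypothetical name, and dtype coercion is not modeled.

```python
def diagonal_concat(tables):
    """Concatenate tables (lists of row dicts), taking the union of columns.

    Columns missing from a row are filled with None, mirroring how the
    "diagonal" strategy fills missing columns with nulls. This is an
    illustrative model, not polars' implementation.
    """
    columns = []
    for rows in tables:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    return [
        {name: row.get(name) for name in columns}
        for rows in tables
        for row in rows
    ]
```

For instance, concatenating a table with columns `x, y` and one with columns `y, z` yields rows with all three columns, with `None` wherever a source table lacked the column.
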
To run via Singularity/Apptainer:
```
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
```
Add a literal-value column to the output.
```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
```
Alias an existing column in the output.
```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
```
Apply a regex to each source data file's path to create a new column in the output.
```
ls -1 path/to/*.csv | python3 -m joinem out.csv \
--with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
```
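
To see what this pattern extracts, the sketch below applies the same regex with Python's standard `re` module. This is only an approximation for illustration: polars evaluates the expression with its own regex engine, which uses `${1}` for capture groups where Python's `re` uses `\1`. The function name `filename_stem` is hypothetical.

```python
import re

# Same pattern as in the --with-column example above.
STEM_PATTERN = re.compile(r".*?([^/]*)\.csv")


def filename_stem(filepath):
    """Replace the first match with capture group 1.

    Paths that do not contain ".csv" are returned unchanged.
    """
    return STEM_PATTERN.sub(r"\1", filepath, count=1)
```

With this pattern, `path/to/foo.csv` becomes `foo`: the lazy `.*?` consumes the directory prefix and the capture group keeps the final path component up to `.csv`.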
Read data from stdin and write data to stdout.
```
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin --output-filetype csv --input-filetype csv
```
Advanced usage.
Write to parquet via stdout using `pv` to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.
```
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
  --with-column 'pl.col("myValue").cast(pl.Categorical)' \
  --write-kwarg 'compression="lz4"' | pv > concat.pqt
```
## API
```
usage: __main__.py [-h] [--version] [--progress] [--stdin] [--eager-read]
                   [--eager-write] [--with-column WITH_COLUMNS]
                   [--string-cache]
                   [--how {vertical,horizontal,diagonal,diagonal_relaxed}]
                   [--input-filetype INPUT_FILETYPE]
                   [--output-filetype OUTPUT_FILETYPE]
                   [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
                   output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --drop DROP           Columns to drop.
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars
                        DataFrame.filter. Example: 'pl.col("thing") == 0'
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with access
                        to each datafile's filepath as `filepath` and polars
                        as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required
                        datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,horizontal,diagonal,diagonal_relaxed}
                        How to concatenate frames. See
                        <https://docs.pola.rs/py-polars/html/reference/api/polars.concat.html>
                        for more information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example: csv,
                        parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' |
python3 -m joinem out.csv
```
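
To make the `'key=value'` convention for `--read-kwarg` and `--write-kwarg` concrete, here is a sketch of how such arguments could be parsed. It is not joinem's actual implementation: `parse_kwarg` and `parse_kwargs` are hypothetical names, and the sketch uses `ast.literal_eval`, which accepts only Python literals, whereas the help text says values are evaluated as Python expressions.

```python
import ast


def parse_kwarg(argument):
    """Split 'key=value' on the first '=' and evaluate the value.

    Assumption: values are restricted to Python literals here; joinem
    itself may accept arbitrary expressions.
    """
    key, separator, value = argument.partition("=")
    if not separator or not key.strip():
        raise ValueError(f"expected 'key=value', got {argument!r}")
    return key.strip(), ast.literal_eval(value.strip())


def parse_kwargs(arguments):
    """Collect the values of a repeated flag into one dict of kwargs."""
    return dict(parse_kwarg(argument) for argument in arguments)
```

Under these assumptions, `'compression="lz4"'` yields the keyword `compression` with the string value `lz4`, and `'infer_schema_length=None'` yields `infer_schema_length` with the value `None`. The inner quotes matter: without them, `lz4` would be treated as a name rather than a string.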
## Citing
If *joinem* contributes to a scholarly work, please cite it as
> Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182
```bibtex
@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}
```
And don't forget to leave a [star on GitHub](https://github.com/mmore500/joinem/stargazers)!
## Raw data
{
"_id": null,
"home_page": null,
"name": "joinem",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "polars, data processing, CSV, parquet, data science",
"author": null,
"author_email": "Matthew Andres Moreno <m.more500@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/20/cc/55f897468cf779ebca5e000d0e3c6c83f87d550aed8ae7b89fe170fe5e6d/joinem-0.9.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT license",
"summary": "CLI for fast, flexbile concatenation of tabular data using Polars.",
"version": "0.9.2",
"project_urls": {
"documentation": "https://github.com/mmore500/joinem",
"homepage": "https://github.com/mmore500/joinem",
"repository": "https://github.com/mmore500/joinem",
"tracker": "https://github.com/mmore500/joinem/issues"
},
"split_keywords": [
"polars",
" data processing",
" csv",
" parquet",
" data science"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6c2a9622d546fac7ea81bd126e586f966985f50db1e925c7aa7abea5a5d778bb",
"md5": "3389977fe4ad9c7dad911254696b82d8",
"sha256": "a9438a3a61f9d6f597472e6a1d025f516a19ea922289e75411a4578fc27ca82b"
},
"downloads": -1,
"filename": "joinem-0.9.2-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "3389977fe4ad9c7dad911254696b82d8",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.8",
"size": 8610,
"upload_time": "2025-01-04T14:26:25",
"upload_time_iso_8601": "2025-01-04T14:26:25.700425Z",
"url": "https://files.pythonhosted.org/packages/6c/2a/9622d546fac7ea81bd126e586f966985f50db1e925c7aa7abea5a5d778bb/joinem-0.9.2-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "20cc55f897468cf779ebca5e000d0e3c6c83f87d550aed8ae7b89fe170fe5e6d",
"md5": "b682db76f1cb6c991ed7f281ffc36f73",
"sha256": "32dffedf89e58d4a5f1d4d5f5bec52a49891cab876122707d642f45b7409ea53"
},
"downloads": -1,
"filename": "joinem-0.9.2.tar.gz",
"has_sig": false,
"md5_digest": "b682db76f1cb6c991ed7f281ffc36f73",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 8332,
"upload_time": "2025-01-04T14:26:28",
"upload_time_iso_8601": "2025-01-04T14:26:28.158718Z",
"url": "https://files.pythonhosted.org/packages/20/cc/55f897468cf779ebca5e000d0e3c6c83f87d550aed8ae7b89fe170fe5e6d/joinem-0.9.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-04 14:26:28",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mmore500",
"github_project": "joinem",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "black",
"specs": [
[
"==",
"22.10.0"
]
]
},
{
"name": "build",
"specs": [
[
"==",
"1.0.3"
]
]
},
{
"name": "bump2version",
"specs": [
[
"==",
"1.0.1"
]
]
},
{
"name": "click",
"specs": [
[
"==",
"8.1.7"
]
]
},
{
"name": "isort",
"specs": [
[
"==",
"5.12.0"
]
]
},
{
"name": "mypy-extensions",
"specs": [
[
"==",
"1.0.0"
]
]
},
{
"name": "packaging",
"specs": [
[
"==",
"23.2"
]
]
},
{
"name": "pathspec",
"specs": [
[
"==",
"0.12.1"
]
]
},
{
"name": "pip-tools",
"specs": [
[
"==",
"7.3.0"
]
]
},
{
"name": "platformdirs",
"specs": [
[
"==",
"4.2.0"
]
]
},
{
"name": "polars",
"specs": [
[
"==",
"0.20.10"
]
]
},
{
"name": "pyproject-hooks",
"specs": [
[
"==",
"1.0.0"
]
]
},
{
"name": "ruff",
"specs": [
[
"==",
"0.1.11"
]
]
},
{
"name": "tomli",
"specs": [
[
"==",
"2.0.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
"==",
"4.66.2"
]
]
},
{
"name": "wheel",
"specs": [
[
"==",
"0.42.0"
]
]
}
],
"lcname": "joinem"
}