csvuniondiff


Namecsvuniondiff JSON
Version 0.0.0.dev1 PyPI version JSON
download
home_pagehttps://github.com/BrianWeiHaoMa/csvuniondiff
SummaryA package for comparing CSV-like files through union and difference operations.
upload_time2024-07-13 02:12:14
maintainerNone
docs_urlNone
authorBrian Wei Hao Ma
requires_python>=3.10
licenseMIT
keywords csv diff union compare files command-line tool data analysis pandas dataframes comparison merge difference columns rows data processing data integration file comparison open source csv files xlsx files json files xml files html files cli tool data matching data alignment data cleaning double-check data validation validation confirm error detection sql
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # CSVUnionDiff
CSVUnionDiff is an open-source library for comparing CSV-like files through union and difference operations. 

- [CSVUnionDiff](#csvuniondiff)
  - [Features](#features)
  - [Installation and usage](#installation-and-usage)
  - [Examples](#examples)
    - [Command-line](#command-line)
    - [Programming](#programming)
  - [Possible use cases](#possible-use-cases)
    - [Command-line](#command-line-1)
    - [Programming](#programming-1)
  - [Match rows algorithm explanation](#match-rows-algorithm-explanation)

## Features
- A convenient command-line tool for quickly comparing files.
- A robust python package for comparing files in a programmatic way.
- A union operation to get the common rows between files.
- A diff operation to get unique rows between files.
- A match rows option which forces comparisons to be carried in a [specific useful way](#match-rows-explanation).
- Incorporates pandas to allow for various input and output types (csv, xlsx, json, xml, or html) and integration with dataframes.

## Installation and usage
To install through command-line, use 
```
python -m pip install csvuniondiff
```
To view available options for the command-line tool, use
```
csvuniondiff -h
```
To use the package in python, do
```
from csvuniondiff import ...
```
where ```...``` can be replaced with whatever is available from the package.

## Examples
### Command-line
Currently supported command line options are:
```
options:
  -h, --help            show this help message and exit
  --version             print the version of this package
  --diff DIFF DIFF      use the diff command, takes 2 files as arguments
  --union UNION UNION   use the union command, takes 2 files as arguments
  -a, --align-columns   aligns common columns on the left sorted
  -c [USE_COLUMNS ...], --use-columns [USE_COLUMNS ...]
                        only use these columns for comparison
  --ignore-columns [IGNORE_COLUMNS ...]
                        do not use these columns for comparison
  -f [FILL_NULL], --fill-null [FILL_NULL]
                        fills null option value so that they can be compared, default is 'NULL'
  -d, --drop-null       drop rows with nulls
  -D, --drop-duplicates
                        drop duplicate rows
  -i INPUT_DIR, --input-dir INPUT_DIR
                        use this directory path as the base for the path to the files
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        save outputs from the script to this directory
  -m, --match-rows      use the match rows algorithm for comparison
  -k [KEEP_COLUMNS ...], --keep-columns [KEEP_COLUMNS ...]
                        only keep these columns in the final result
  -C, --use-common-columns
                        use the maximal set of common columns for comparison
  --dont-add-timestamp  don't add a timestamp directory when outputting files
  --disable-printing    disable printing to stdout
  --print-prepared      print the prepared df before comparison
  --save-file-extension SAVE_FILE_EXTENSION
                        the extension for output files (csv, xlsx, json, xml, or html)
  -r, --row-counts      use the counts of each unique row in the final result instead
```

1. 
    **test2.csv**:
    |       | column7 | column1 | column2 | column3  | column4  | column5  | column6  |
    |-------|---------|---------|---------|----------|----------|----------|----------|
    | 0     | value7  | value1  | value2  | value3   | value4   | value5   | value6   |
    | 1     | value14 | value8  | value9  | value10  | value11  | value12  | value13  |
    | 2     | value21 | value15 | value16 | value17  | value18  | value19  | value20  |

    **test4.csv**:
    |       | column4 | column3 | column2 | column1 |
    |-------|---------|---------|---------|---------|
    | 0     |         | value3  | value2  | value1  |
    | 1     | value9  | value8  |         | value6  |
    | 2     | value14 |         |         | value11 |

    **Input**
    ```
    csvuniondiff
    --input-dir csvuniondiff/tests/test-data/random/ 
    --union test2.csv test4.csv 
    --match-rows            # use the match rows algorithm
    --fill-null value4      # fills nulls with 'value4'
    --align-columns         # align common columns sorted on the left
    --use-common-columns    # compare using all common columns
    ```

    **Output**
    ```
    Timestamp: 2024-07-10 14:27:45.402980

    Input directory: csvuniondiff/tests/test-data/random/

    union(
        args
        ----
        left_input: ['test2.csv']
        right_input: ['test4.csv']
        data_save_file_extensions: ['csv']

        options
        -------
        align_columns: True
        fill_null: value4
        match_rows: True
        enable_printing: True
        add_save_timestamp: True
        use_common_columns: True
    )

    Intersecting rows from test2.csv (1, 7):
    column1 column2 column3 column4 column7 column5 column6
    0  value1  value2  value3  value4  value7  value5  value6

    Intersecting rows from test4.csv (1, 4):
    column1 column2 column3 column4
    0  value1  value2  value3  value4
    ```
2. 
    Look [here](#match-rows-explanation) for input files.
   
    **Input**
    ```
    csvuniondiff
    --input-dir csvuniondiff/tests/test-data/diff/testset-1/ 
    --diff csv1.csv csv2.csv
    ```

    **Output**
    ```
    Timestamp: 2024-07-10 12:00:06.554911

    Input directory: csvuniondiff/tests/test-data/diff/testset-1/

    diff(
        args
        ----
        left_input: ['csv1.csv']
        right_input: ['csv2.csv']
        data_save_file_extensions: ['csv']

        options
        -------
        enable_printing: True
        add_save_timestamp: True
    )

    Only in csv1.csv (5, 3):
                Name  Age                      Email
    3  Michael Wilson   32  michaelwilson@example.com
    4  Michael Wilson   32  michaelwilson@example.com
    5    Bob Thompson   35    bobthompson@example.com
    6     Emily Davis   27     emilydavis@example.com
    7  Michael Wilson   32  michaelwilson@example.com

    Only in csv2.csv (3, 3):
                    Name  Age                       Email
    6       John Smith__1   35       johnsmith@example.com
    7  Michael Johnson__1   32  michaeljohnson@example.com
    8      Emily Davis__1   27      emilydavis@example.com
    ```

### Programming
1. 
    **test1.csv**
    |       | Name             | Age | Email                    |
    |-------|------------------|-----|--------------------------|
    | 0     | John Doe         | 25  | johndoe@example.com      |
    | 1     | Jane Smith       | 30  | janesmith@example.com    |
    | 2     | Mark Johnson     | 40  | markjohnson@example.com  |
    | 3     | Emily Davis      | 35  | emilydavis@example.com   |
    | 4     | Michael Brown    | 28  | michaelbrown@example.com |
    | 5     | Sarah Wilson     | 32  | sarahwilson@example.com  |
    | 6     | David Thompson   | 45  | davidthompson@example.com|
    | 7     | Jessica Martinez | 27  | jessicamartinez@example.com|
    | 8     | Christopher Lee  | 33  | christopherlee@example.com|
    | 9     | Laura Taylor     | 29  | laurataylor@example.com  |
    
    **test2.csv**
    |       | Name             | Email                     | Age |
    |-------|------------------|---------------------------|-----|
    | 0     | John Doe         | johndoe25@example.com     | 25  |
    | 1     | Jane Smith       | janesmith30@example.com   | 30  |
    | 2     | Mark Johnson     | markjohnson40@example.com | 40  |
    | 3     | Emily Davis      | emilydavis35@example.com  | 35  |
    | 4     | Jessica Martinez | jessicamartinez27@example.com | 27 |
    | 5     | Christopher Lee  | christopherlee33@example.com | 33 |
    | 6     | Laura Taylor     | laurataylor29@example.com | 29  |
    | 7     | Brian Harris     | brianharris33@example.com | 33  |

    **Input**
    ```python
    import pandas as pd
    from csvuniondiff.csvuniondiff import (
        CsvUnionDiff,
        ParallelInputArgs,
        CommandOptions,
    )

    obj = CsvUnionDiff(
        "./csvuniondiff/tests/test-data/diff/testset-2/",
        None,
    )

    def left_df_trans(df: pd.DataFrame) -> pd.DataFrame:
        def email_trans(row):
            arr = row["Email"].split("@")
            return arr[0] + str(row["Age"]) + "@" + arr[1]
        
        df["Email"] = df.apply(email_trans, axis=1)

        df = df[["Name", "Email", "Age"]]
        return df

    left_dfs, right_dfs = obj.diff(
        args=ParallelInputArgs(
            ["test1.csv"], 
            ["test2.csv"], 
            left_trans_funcs=[left_df_trans],
            right_trans_funcs=[lambda x: x],
            return_transformed_rows=False, # selects the rows from original table
        ),
        options=CommandOptions(
            match_rows=True,
            enable_printing=True
        ),
    )

    left_df = left_dfs[0]
    right_df = right_dfs[0] # use the results somewhere
    ```

    **Output**
    ```
    Timestamp: 2024-07-11 11:37:38.748144

    Input directory: ./csvuniondiff/tests/test-data/diff/testset-2/

    diff(
        args
        ----
        left_input: ['test1.csv']
        right_input: ['test2.csv']
        left_trans_funcs: [<function left_df_trans at 0x000001E40ACBA340>]
        right_trans_funcs: [<function <lambda> at 0x000001E4259B3E20>]

        options
        -------
        match_rows: True
        enable_printing: True
    )

    Only in test1.csv (3, 3):
                Name  Age                      Email
    4   Michael Brown   28   michaelbrown@example.com
    5    Sarah Wilson   32    sarahwilson@example.com
    6  David Thompson   45  davidthompson@example.com

    Only in test2.csv (1, 3):
            Name                      Email  Age
    7  Brian Harris  brianharris33@example.com   33
    ```
2. 
    **Input**
    ```python
    from csvuniondiff.csvuniondiff import (
        CsvUnionDiff,
        ParallelInputArgs,
        CommandOptions,
    )

    obj = CsvUnionDiff(
        input_dir="./csvuniondiff/tests/test-data/diff/testset-1/",
        output_dir=None,

    )

    left_dfs, right_dfs = obj.diff(
        args=ParallelInputArgs(
            left_input=["test1.csv"],
            right_input=["test2.csv"],
        ),
        options=CommandOptions(
            enable_printing=True,
            add_save_timestamp=True,
        )
    )

    only_in_test1, only_in_test2 = left_dfs[0], right_dfs[0] # use dataframe results somewhere
    ```

    **Output**
    ```
    Timestamp: 2024-07-10 13:23:35.239955

    Input directory: ./csvuniondiff/tests/test-data/diff/testset-1/

    diff(
        args
        ----
        left_input: ['test1.csv']
        right_input: ['test2.csv']

        options
        -------
        match_rows: True
        enable_printing: True
        add_save_timestamp: True
    )

    Only in test1.csv (5, 3):
                Name  Age                      Email
    3  Michael Wilson   32  michaelwilson@example.com
    4  Michael Wilson   32  michaelwilson@example.com
    5    Bob Thompson   35    bobthompson@example.com
    6     Emily Davis   27     emilydavis@example.com
    7  Michael Wilson   32  michaelwilson@example.com

    Only in test2.csv (5, 3):
                    Name  Age                       Email
    1            John Doe   25         johndoe@example.com
    3          Jane Smith   30       janesmith@example.com
    6       John Smith__1   35       johnsmith@example.com
    7  Michael Johnson__1   32  michaeljohnson@example.com
    8      Emily Davis__1   27      emilydavis@example.com
    ```

## Possible use cases
### Command-line
1.  
    A personal usecase of mine is to **cross-check SQL results with an expected CSV/Excel file** (perhaps one that was created manually). 
    I would use my SQL management tool to generate the CSV file from my query, then call ```csvuniondiff --diff my.csv expected.csv --match-rows``` 
    to see the differences and the magnitude of the differences. I could also call ```csvuniondiff --union my.csv expected.csv --match-rows``` 
    to see what rows my SQL query is getting right.
2.  
    You want to compare 2 CSV files but some aspect covered by this tool makes it impossible to (for example NULL values, unaligned columns, or you
    want to only compare a subset of columns etc.) and you want to do it fast.

### Programming
1.  
    I had a case where I needed to check for the existence of rows with certain values in specific columns across many Excel files. 
    I can make a dataframe with the columns and values that I am looking for:
    |    | Name            | Age | Email                     |
    |----|-----------------|-----|---------------------------|
    | 0  | John Doe        | 25  | johndoe@example.com       |

    I can put all of the Excel files in a directory and then run the union command with the above CSV against 
    the target CSV's in the directory.
2.  
    The files are slightly different but could be transformed to be compared.
3.  
    You don't want to personally code out difference and union operations with match rows and stdout output.

## Match rows algorithm explanation
To explain the match rows option, let's consider the following CSV tables:

<table>
<tr><th>csv1</th><th>csv2</th></tr>
<tr><td>

|    | Name            | Age | Email                     |
|----|-----------------|-----|---------------------------|
| 0  | John Doe        | 25  | johndoe@example.com       |
| 1  | Jane Smith      | 30  | janesmith@example.com     |
| 2  | Alice Johnson   | 28  | alicejohnson@example.com  |
| 3  | Michael Wilson  | 32  | michaelwilson@example.com |
| 4  | Michael Wilson  | 32  | michaelwilson@example.com |
| 5  | Bob Thompson    | 35  | bobthompson@example.com   |
| 6  | Emily Davis     | 27  | emilydavis@example.com    |
| 7  | Michael Wilson  | 32  | michaelwilson@example.com |
| 8  | Sarah Brown     | 29  | sarahbrown@example.com    |
</td><td>

|     | Name              | Age | Email                   |
|-----|-------------------|-----|-------------------------|
| 0   | John Doe          | 25  | johndoe@example.com     |
| 1   | John Doe          | 25  | johndoe@example.com     |
| 2   | Jane Smith        | 30  | janesmith@example.com   |
| 3   | Jane Smith        | 30  | janesmith@example.com   |
| 4   | Alice Johnson     | 28  | alicejohnson@example.com|
| 5   | Sarah Brown       | 29  | sarahbrown@example.com  |
| 6   | John Smith__1     | 35  | johnsmith@example.com   |
| 7   | Michael Johnson__1| 32  | michaeljohnson@example.com|
| 8   | Emily Davis__1    | 27  | emilydavis@example.com  |
</td></tr> 
</table>

When matching rows in the diff operation
1.  The first John Doe in both files is matched, so the second John Doe in **csv2** is only in **csv2**.
2.  The first Jane Smith in both files is matched, so the second Jane Smith in **csv2** is only in **csv2**.
3.  Both files have exactly 1 Alice Johnson and Sarah Brown so they are both matched and neither are only in **csv1** or **csv2**.
4.  The remaining rows are all unique between the two files so they are only in **csv1** or **csv2**, respectively.

Therefore, with the match rows option, the results of the diff operation will be:

<table>
<tr><th>only in csv1</th><th>only in csv2</th></tr>
<tr><td>

|     | Name            | Age | Email                     |
|-----|-----------------|-----|---------------------------|
| 3   | Michael Wilson  | 32  | michaelwilson@example.com |
| 4   | Michael Wilson  | 32  | michaelwilson@example.com |
| 5   | Bob Thompson    | 35  | bobthompson@example.com   |
| 6   | Emily Davis     | 27  | emilydavis@example.com    |
| 7   | Michael Wilson  | 32  | michaelwilson@example.com |
</td><td>

|     | Name              | Age | Email                      |
|-----|-------------------|-----|----------------------------|
| 1   | John Doe          | 25  | johndoe@example.com        |
| 3   | Jane Smith        | 30  | janesmith@example.com      |
| 6   | John Smith__1     | 35  | johnsmith@example.com      |
| 7   | Michael Johnson__1| 32  | michaeljohnson@example.com |
| 8   | Emily Davis__1    | 27  | emilydavis@example.com     |
</td></tr>
</table>

Using the union operation with match rows instead with **csv1** and **csv2**, we get:

<table>
<tr><th>intersecting from csv1</th><th>intersecting from csv2</th></tr>
<tr><td>

|       | Name           | Age | Email                     |
|-------|----------------|-----|---------------------------|
| 0     | John Doe       | 25  | johndoe@example.com       |
| 1     | Jane Smith     | 30  | janesmith@example.com     |
| 2     | Alice Johnson  | 28  | alicejohnson@example.com  |
| 8     | Sarah Brown    | 29  | sarahbrown@example.com    |
</td><td>

|       | Name           | Age | Email                     |
|-------|----------------|-----|---------------------------|
| 0     | John Doe       | 25  | johndoe@example.com       |
| 2     | Jane Smith     | 30  | janesmith@example.com     |
| 4     | Alice Johnson  | 28  | alicejohnson@example.com  |
| 5     | Sarah Brown    | 29  | sarahbrown@example.com    |
</td></tr>
</table>

An example of where this might be useful is when you are cross-checking using the diff operation
and you want to know the magnitude of the rows that you've missed and the extra rows that you have
in your CSV when compared against an expected CSV. In this case, you can use
```csvuniondiff --diff my.csv expected.csv --match-rows``` to get the rows with duplicates or
 ```csvuniondiff --diff my.csv expected.csv --match-rows --row-counts``` to 
get the count of each unique row in the result.

## Contribution
Feel free to open an issue if something isn't working properly or you think that another feature would be
worth it to add. 

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/BrianWeiHaoMa/csvuniondiff",
    "name": "csvuniondiff",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "csv, diff, union, compare, files, command-line, tool, data, analysis, pandas, dataframes, comparison, merge, difference, columns, rows, data processing, data integration, file comparison, open source, csv files, xlsx files, json files, xml files, html files, cli tool, data matching, data alignment, data cleaning, double-check, data validation, validation, confirm, error detection, sql",
    "author": "Brian Wei Hao Ma",
    "author_email": "brianmaytc@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/a8/81/a25bc1e669fa9f999b8dd1b4a2dccb8412a13a13964c833fe13ca72915e9/csvuniondiff-0.0.0.dev1.tar.gz",
    "platform": null,
    "description": "# CSVUnionDiff\r\nCSVUnionDiff is an open-source library for comparing CSV-like files through union and difference operations. \r\n\r\n- [CSVUnionDiff](#csvuniondiff)\r\n  - [Features](#features)\r\n  - [Installation and usage](#installation-and-usage)\r\n  - [Examples](#examples)\r\n    - [Command-line](#command-line)\r\n    - [Programming](#programming)\r\n  - [Possible use cases](#possible-use-cases)\r\n    - [Command-line](#command-line-1)\r\n    - [Programming](#programming-1)\r\n  - [Match rows algorithm explanation](#match-rows-algorithm-explanation)\r\n\r\n## Features\r\n- A convenient command-line tool for quickly comparing files.\r\n- A robust python package for comparing files in a programmatic way.\r\n- A union operation to get the common rows between files.\r\n- A diff operation to get unique rows between files.\r\n- A match rows option which forces comparisons to be carried in a [specific useful way](#match-rows-explanation).\r\n- Incorporates pandas to allow for various input and output types (csv, xlsx, json, xml, or html) and integration with dataframes.\r\n\r\n## Installation and usage\r\nTo install through command-line, use \r\n```\r\npython -m pip install csvuniondiff\r\n```\r\nTo view available options for the command-line tool, use\r\n```\r\ncsvuniondiff -h\r\n```\r\nTo use the package in python, do\r\n```\r\nfrom csvuniondiff import ...\r\n```\r\nwhere ```...``` can be replaced with whatever is available from the package.\r\n\r\n## Examples\r\n### Command-line\r\nCurrently supported command line options are:\r\n```\r\noptions:\r\n  -h, --help            show this help message and exit\r\n  --version             print the version of this package\r\n  --diff DIFF DIFF      use the diff command, takes 2 files as arguments\r\n  --union UNION UNION   use the union command, takes 2 files as arguments\r\n  -a, --align-columns   aligns common columns on the left sorted\r\n  -c [USE_COLUMNS ...], --use-columns [USE_COLUMNS ...]\r\n                        only use these columns for comparison\r\n  --ignore-columns [IGNORE_COLUMNS ...]\r\n                        do not use these columns for comparison\r\n  -f [FILL_NULL], --fill-null [FILL_NULL]\r\n                        fills null option value so that they can be compared, default is 'NULL'\r\n  -d, --drop-null       drop rows with nulls\r\n  -D, --drop-duplicates\r\n                        drop duplicate rows\r\n  -i INPUT_DIR, --input-dir INPUT_DIR\r\n                        use this directory path as the base for the path to the files\r\n  -o OUTPUT_DIR, --output-dir OUTPUT_DIR\r\n                        save outputs from the script to this directory\r\n  -m, --match-rows      use the match rows algorithm for comparison\r\n  -k [KEEP_COLUMNS ...], --keep-columns [KEEP_COLUMNS ...]\r\n                        only keep these columns in the final result\r\n  -C, --use-common-columns\r\n                        use the maximal set of common columns for comparison\r\n  --dont-add-timestamp  don't add a timestamp directory when outputting files\r\n  --disable-printing    disable printing to stdout\r\n  --print-prepared      print the prepared df before comparison\r\n  --save-file-extension SAVE_FILE_EXTENSION\r\n                        the extension for output files (csv, xlsx, json, xml, or html)\r\n  -r, --row-counts      use the counts of each unique row in the final result instead\r\n```\r\n\r\n1. \r\n    **test2.csv**:\r\n    |       | column7 | column1 | column2 | column3  | column4  | column5  | column6  |\r\n    |-------|---------|---------|---------|----------|----------|----------|----------|\r\n    | 0     | value7  | value1  | value2  | value3   | value4   | value5   | value6   |\r\n    | 1     | value14 | value8  | value9  | value10  | value11  | value12  | value13  |\r\n    | 2     | value21 | value15 | value16 | value17  | value18  | value19  | value20  |\r\n\r\n    **test4.csv**:\r\n    |       | column4 | column3 | column2 | column1 |\r\n    |-------|---------|---------|---------|---------|\r\n    | 0     |         | value3  | value2  | value1  |\r\n    | 1     | value9  | value8  |         | value6  |\r\n    | 2     | value14 |         |         | value11 |\r\n\r\n    **Input**\r\n    ```\r\n    csvuniondiff\r\n    --input-dir csvuniondiff/tests/test-data/random/ \r\n    --union test2.csv test4.csv \r\n    --match-rows            # use the match rows algorithm\r\n    --fill-null value4      # fills nulls with 'value4'\r\n    --align-columns         # align common columns sorted on the left\r\n    --use-common-columns    # compare using all common columns\r\n    ```\r\n\r\n    **Output**\r\n    ```\r\n    Timestamp: 2024-07-10 14:27:45.402980\r\n\r\n    Input directory: csvuniondiff/tests/test-data/random/\r\n\r\n    union(\r\n        args\r\n        ----\r\n        left_input: ['test2.csv']\r\n        right_input: ['test4.csv']\r\n        data_save_file_extensions: ['csv']\r\n\r\n        options\r\n        -------\r\n        align_columns: True\r\n        fill_null: value4\r\n        match_rows: True\r\n        enable_printing: True\r\n        add_save_timestamp: True\r\n        use_common_columns: True\r\n    )\r\n\r\n    Intersecting rows from test2.csv (1, 7):\r\n    column1 column2 column3 column4 column7 column5 column6\r\n    0  value1  value2  value3  value4  value7  value5  value6\r\n\r\n    Intersecting rows from test4.csv (1, 4):\r\n    column1 column2 column3 column4\r\n    0  value1  value2  value3  value4\r\n    ```\r\n2. \r\n    Look [here](#match-rows-explanation) for input files.\r\n   \r\n    **Input**\r\n    ```\r\n    csvuniondiff\r\n    --input-dir csvuniondiff/tests/test-data/diff/testset-1/ \r\n    --diff csv1.csv csv2.csv\r\n    ```\r\n\r\n    **Output**\r\n    ```\r\n    Timestamp: 2024-07-10 12:00:06.554911\r\n\r\n    Input directory: csvuniondiff/tests/test-data/diff/testset-1/\r\n\r\n    diff(\r\n        args\r\n        ----\r\n        left_input: ['csv1.csv']\r\n        right_input: ['csv2.csv']\r\n        data_save_file_extensions: ['csv']\r\n\r\n        options\r\n        -------\r\n        enable_printing: True\r\n        add_save_timestamp: True\r\n    )\r\n\r\n    Only in csv1.csv (5, 3):\r\n                Name  Age                      Email\r\n    3  Michael Wilson   32  michaelwilson@example.com\r\n    4  Michael Wilson   32  michaelwilson@example.com\r\n    5    Bob Thompson   35    bobthompson@example.com\r\n    6     Emily Davis   27     emilydavis@example.com\r\n    7  Michael Wilson   32  michaelwilson@example.com\r\n\r\n    Only in csv2.csv (3, 3):\r\n                    Name  Age                       Email\r\n    6       John Smith__1   35       johnsmith@example.com\r\n    7  Michael Johnson__1   32  michaeljohnson@example.com\r\n    8      Emily Davis__1   27      emilydavis@example.com\r\n    ```\r\n\r\n### Programming\r\n1. \r\n    **test1.csv**\r\n    |       | Name             | Age | Email                    |\r\n    |-------|------------------|-----|--------------------------|\r\n    | 0     | John Doe         | 25  | johndoe@example.com      |\r\n    | 1     | Jane Smith       | 30  | janesmith@example.com    |\r\n    | 2     | Mark Johnson     | 40  | markjohnson@example.com  |\r\n    | 3     | Emily Davis      | 35  | emilydavis@example.com   |\r\n    | 4     | Michael Brown    | 28  | michaelbrown@example.com |\r\n    | 5     | Sarah Wilson     | 32  | sarahwilson@example.com  |\r\n    | 6     | David Thompson   | 45  | davidthompson@example.com|\r\n    | 7     | Jessica Martinez | 27  | jessicamartinez@example.com|\r\n    | 8     | Christopher Lee  | 33  | christopherlee@example.com|\r\n    | 9     | Laura Taylor     | 29  | laurataylor@example.com  |\r\n    \r\n    **test2.csv**\r\n    |       | Name             | Email                     | Age |\r\n    |-------|------------------|---------------------------|-----|\r\n    | 0     | John Doe         | johndoe25@example.com     | 25  |\r\n    | 1     | Jane Smith       | janesmith30@example.com   | 30  |\r\n    | 2     | Mark Johnson     | markjohnson40@example.com | 40  |\r\n    | 3     | Emily Davis      | emilydavis35@example.com  | 35  |\r\n    | 4     | Jessica Martinez | jessicamartinez27@example.com | 27 |\r\n    | 5     | Christopher Lee  | christopherlee33@example.com | 33 |\r\n    | 6     | Laura Taylor     | laurataylor29@example.com | 29  |\r\n    | 7     | Brian Harris     | brianharris33@example.com | 33  |\r\n\r\n    **Input**\r\n    ```python\r\n    import pandas as pd\r\n    from csvuniondiff.csvuniondiff import (\r\n        CsvUnionDiff,\r\n        ParallelInputArgs,\r\n        CommandOptions,\r\n    )\r\n\r\n    obj = CsvUnionDiff(\r\n        \"./csvuniondiff/tests/test-data/diff/testset-2/\",\r\n        None,\r\n    )\r\n\r\n    def left_df_trans(df: pd.DataFrame) -> pd.DataFrame:\r\n        def email_trans(row):\r\n            arr = row[\"Email\"].split(\"@\")\r\n            return arr[0] + str(row[\"Age\"]) + \"@\" + arr[1]\r\n        \r\n        df[\"Email\"] = df.apply(email_trans, axis=1)\r\n\r\n        df = df[[\"Name\", \"Email\", \"Age\"]]\r\n        return df\r\n\r\n    left_dfs, right_dfs = obj.diff(\r\n        args=ParallelInputArgs(\r\n            [\"test1.csv\"], \r\n            [\"test2.csv\"], \r\n            left_trans_funcs=[left_df_trans],\r\n            right_trans_funcs=[lambda x: x],\r\n            return_transformed_rows=False, # selects the rows from original table\r\n        ),\r\n        options=CommandOptions(\r\n            match_rows=True,\r\n            enable_printing=True\r\n        ),\r\n    )\r\n\r\n    left_df = left_dfs[0]\r\n    right_df = right_dfs[0] # use the results somewhere\r\n    ```\r\n\r\n    **Output**\r\n    ```\r\n    Timestamp: 2024-07-11 11:37:38.748144\r\n\r\n    Input directory: ./csvuniondiff/tests/test-data/diff/testset-2/\r\n\r\n    diff(\r\n        args\r\n        ----\r\n        left_input: ['test1.csv']\r\n        right_input: ['test2.csv']\r\n        left_trans_funcs: [<function left_df_trans at 0x000001E40ACBA340>]\r\n        right_trans_funcs: [<function <lambda> at 0x000001E4259B3E20>]\r\n\r\n        options\r\n        -------\r\n        match_rows: True\r\n        enable_printing: True\r\n    )\r\n\r\n    Only in test1.csv (3, 3):\r\n                Name  Age                      Email\r\n    4   Michael Brown   28   michaelbrown@example.com\r\n    5    Sarah Wilson   32    sarahwilson@example.com\r\n    6  David Thompson   45  davidthompson@example.com\r\n\r\n    Only in test2.csv (1, 3):\r\n            Name                      Email  Age\r\n    7  Brian Harris  brianharris33@example.com   33\r\n    ```\r\n2. \r\n    **Input**\r\n    ```python\r\n    from csvuniondiff.csvuniondiff import (\r\n        CsvUnionDiff,\r\n        ParallelInputArgs,\r\n        CommandOptions,\r\n    )\r\n\r\n    obj = CsvUnionDiff(\r\n        input_dir=\"./csvuniondiff/tests/test-data/diff/testset-1/\",\r\n        output_dir=None,\r\n\r\n    )\r\n\r\n    left_dfs, right_dfs = obj.diff(\r\n        args=ParallelInputArgs(\r\n            left_input=[\"test1.csv\"],\r\n            right_input=[\"test2.csv\"],\r\n        ),\r\n        options=CommandOptions(\r\n            enable_printing=True,\r\n            add_save_timestamp=True,\r\n        )\r\n    )\r\n\r\n    only_in_test1, only_in_test2 = left_dfs[0], right_dfs[0] # use dataframe results somewhere\r\n    ```\r\n\r\n    **Output**\r\n    ```\r\n    Timestamp: 2024-07-10 13:23:35.239955\r\n\r\n    Input directory: ./csvuniondiff/tests/test-data/diff/testset-1/\r\n\r\n    diff(\r\n        args\r\n        ----\r\n        left_input: ['test1.csv']\r\n        right_input: ['test2.csv']\r\n\r\n        options\r\n        -------\r\n        match_rows: True\r\n        enable_printing: True\r\n        add_save_timestamp: True\r\n    )\r\n\r\n    Only in test1.csv (5, 3):\r\n                Name  Age                      Email\r\n    3  Michael Wilson   32  michaelwilson@example.com\r\n    4  Michael Wilson   32  michaelwilson@example.com\r\n    5    Bob Thompson   35    bobthompson@example.com\r\n    6     Emily Davis   27     emilydavis@example.com\r\n    7  Michael Wilson   32  michaelwilson@example.com\r\n\r\n    Only in test2.csv (5, 3):\r\n                    Name  Age                       Email\r\n    1            John Doe   25         johndoe@example.com\r\n    3          Jane Smith   30       janesmith@example.com\r\n    6       John Smith__1   35       johnsmith@example.com\r\n    7  Michael Johnson__1   32  michaeljohnson@example.com\r\n    8      Emily Davis__1   27      emilydavis@example.com\r\n    ```\r\n\r\n## Possible use cases\r\n### Command-line\r\n1.  \r\n    A personal usecase of mine is to **cross-check SQL results with an expected CSV/Excel file** (perhaps one that was created manually). \r\n    I would use my SQL management tool to generate the CSV file from my query, then call ```csvuniondiff --diff my.csv expected.csv --match-rows``` \r\n    to see the differences and the magnitude of the differences. I could also call ```csvuniondiff --union my.csv expected.csv --match-rows``` \r\n    to see what rows my SQL query is getting right.\r\n2.  \r\n    You want to compare 2 CSV files but some aspect covered by this tool makes it impossible to (for example NULL values, unaligned columns, or you\r\n    want to only compare a subset of columns etc.) and you want to do it fast.\r\n\r\n### Programming\r\n1.  \r\n    I had a case where I needed to check for the existence of rows with certain values in specific columns across many Excel files. \r\n    I can make a dataframe with the columns and values that I am looking for:\r\n    |    | Name            | Age | Email                     |\r\n    |----|-----------------|-----|---------------------------|\r\n    | 0  | John Doe        | 25  | johndoe@example.com       |\r\n\r\n    I can put all of the Excel files in a directory and then run the union command with the above CSV against \r\n    the target CSV's in the directory.\r\n2.  \r\n    The files are slightly different but could be transformed to be compared.\r\n3.  \r\n    You don't want to personally code out difference and union operations with match rows and stdout output.\r\n\r\n## Match rows algorithm explanation\r\nTo explain the match rows option, let's consider the following CSV tables:\r\n\r\n<table>\r\n<tr><th>csv1</th><th>csv2</th></tr>\r\n<tr><td>\r\n\r\n|    | Name            | Age | Email                     |\r\n|----|-----------------|-----|---------------------------|\r\n| 0  | John Doe        | 25  | johndoe@example.com       |\r\n| 1  | Jane Smith      | 30  | janesmith@example.com     |\r\n| 2  | Alice Johnson   | 28  | alicejohnson@example.com  |\r\n| 3  | Michael Wilson  | 32  | michaelwilson@example.com |\r\n| 4  | Michael Wilson  | 32  | michaelwilson@example.com |\r\n| 5  | Bob Thompson    | 35  | bobthompson@example.com   |\r\n| 6  | Emily Davis     | 27  | emilydavis@example.com    |\r\n| 7  | Michael Wilson  | 32  | michaelwilson@example.com |\r\n| 8  | Sarah Brown     | 29  | sarahbrown@example.com    |\r\n</td><td>\r\n\r\n|     | Name              | Age | Email                   |\r\n|-----|-------------------|-----|-------------------------|\r\n| 0   | John Doe          | 25  | johndoe@example.com     |\r\n| 1   | John Doe          | 25  | johndoe@example.com     |\r\n| 2   | Jane Smith        | 30  | janesmith@example.com   |\r\n| 3   | Jane Smith        | 30  | janesmith@example.com   |\r\n| 4   | Alice Johnson     | 28  | alicejohnson@example.com|\r\n| 5   | Sarah Brown       | 29  | sarahbrown@example.com  |\r\n| 6   | John Smith__1     | 35  | johnsmith@example.com   |\r\n| 7   | Michael Johnson__1| 32  | michaeljohnson@example.com|\r\n| 8   | Emily Davis__1    | 27  | emilydavis@example.com  |\r\n</td></tr> \r\n</table>\r\n\r\nWhen matching rows in the diff operation\r\n1.  The first John Doe in both files is matched, so the second John Doe in **csv2** is only in **csv2**.\r\n2.  The first Jane Smith in both files is matched, so the second Jane Smith in **csv2** is only in **csv2**.\r\n3.  Both files have exactly 1 Alice Johnson and Sarah Brown so they are both matched and neither are only in **csv1** or **csv2**.\r\n4.  The remaining rows are all unique between the two files so they are only in **csv1** or **csv2**, respectively.\r\n\r\nTherefore, with the match rows option, the results of the diff operation will be:\r\n\r\n<table>\r\n<tr><th>only in csv1</th><th>only in csv2</th></tr>\r\n<tr><td>\r\n\r\n|     | Name            | Age | Email                     |\r\n|-----|-----------------|-----|---------------------------|\r\n| 3   | Michael Wilson  | 32  | michaelwilson@example.com |\r\n| 4   | Michael Wilson  | 32  | michaelwilson@example.com |\r\n| 5   | Bob Thompson    | 35  | bobthompson@example.com   |\r\n| 6   | Emily Davis     | 27  | emilydavis@example.com    |\r\n| 7   | Michael Wilson  | 32  | michaelwilson@example.com |\r\n</td><td>\r\n\r\n|     | Name              | Age | Email                      |\r\n|-----|-------------------|-----|----------------------------|\r\n| 1   | John Doe          | 25  | johndoe@example.com        |\r\n| 3   | Jane Smith        | 30  | janesmith@example.com      |\r\n| 6   | John Smith__1     | 35  | johnsmith@example.com      |\r\n| 7   | Michael Johnson__1| 32  | michaeljohnson@example.com |\r\n| 8   | Emily Davis__1    | 27  | emilydavis@example.com     |\r\n</td></tr>\r\n</table>\r\n\r\nUsing the union operation with match rows instead with **csv1** and **csv2**, we get:\r\n\r\n<table>\r\n<tr><th>intersecting from csv1</th><th>intersecting from csv2</th></tr>\r\n<tr><td>\r\n\r\n|       | Name           | Age | Email                     |\r\n|-------|----------------|-----|---------------------------|\r\n| 0     | John Doe       | 25  | johndoe@example.com       |\r\n| 1     | Jane Smith     | 30  | janesmith@example.com     |\r\n| 2     | Alice Johnson  | 28  | alicejohnson@example.com  |\r\n| 8     | Sarah Brown    | 29  | sarahbrown@example.com    |\r\n</td><td>\r\n\r\n|       | Name           | Age | Email                     |\r\n|-------|----------------|-----|---------------------------|\r\n| 0     | John Doe       | 25  | johndoe@example.com       |\r\n| 2     | Jane Smith     | 30  | janesmith@example.com     |\r\n| 4     | Alice Johnson  | 28  | alicejohnson@example.com  |\r\n| 5     | Sarah Brown    | 29  | sarahbrown@example.com    |\r\n</td></tr>\r\n</table>\r\n\r\nAn example of where this might be useful is when you are cross-checking using the diff operation\r\nand you want to know the magnitude of the rows that you've missed and the extra rows that you have\r\nin your CSV when compared against an expected CSV. In this case, you can use\r\n```csvuniondiff --diff my.csv expected.csv --match-rows``` to get the rows with duplicates or\r\n ```csvuniondiff --diff my.csv expected.csv --match-rows --row-counts``` to \r\nget the count of each unique row in the result.\r\n\r\n## Contribution\r\nFeel free to open an issue if something isn't working properly or you think that another feature would be\r\nworth it to add. \r\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A package for comparing CSV-like files through union and difference operations.",
    "version": "0.0.0.dev1",
    "project_urls": {
        "Homepage": "https://github.com/BrianWeiHaoMa/csvuniondiff"
    },
    "split_keywords": [
        "csv",
        " diff",
        " union",
        " compare",
        " files",
        " command-line",
        " tool",
        " data",
        " analysis",
        " pandas",
        " dataframes",
        " comparison",
        " merge",
        " difference",
        " columns",
        " rows",
        " data processing",
        " data integration",
        " file comparison",
        " open source",
        " csv files",
        " xlsx files",
        " json files",
        " xml files",
        " html files",
        " cli tool",
        " data matching",
        " data alignment",
        " data cleaning",
        " double-check",
        " data validation",
        " validation",
        " confirm",
        " error detection",
        " sql"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "273b4a8455349f7047d732c1fa35633d3d7bd103c5cf885e6a715d613dc5e423",
                "md5": "1465ed0e9e07355a6c70854a6163d4b8",
                "sha256": "cd1e4967969041397931bd6669860973383078b9ff19c910b9e33d3047270408"
            },
            "downloads": -1,
            "filename": "csvuniondiff-0.0.0.dev1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1465ed0e9e07355a6c70854a6163d4b8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 19877,
            "upload_time": "2024-07-13T02:12:12",
            "upload_time_iso_8601": "2024-07-13T02:12:12.352082Z",
            "url": "https://files.pythonhosted.org/packages/27/3b/4a8455349f7047d732c1fa35633d3d7bd103c5cf885e6a715d613dc5e423/csvuniondiff-0.0.0.dev1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a881a25bc1e669fa9f999b8dd1b4a2dccb8412a13a13964c833fe13ca72915e9",
                "md5": "b782b0bbbbfe4d728c8bd7a52b273b6c",
                "sha256": "a1c07d80fa80265f64d68292ff3fcc64bb69a462a69a1dcef0bb15d33d1687b3"
            },
            "downloads": -1,
            "filename": "csvuniondiff-0.0.0.dev1.tar.gz",
            "has_sig": false,
            "md5_digest": "b782b0bbbbfe4d728c8bd7a52b273b6c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 23395,
            "upload_time": "2024-07-13T02:12:14",
            "upload_time_iso_8601": "2024-07-13T02:12:14.177775Z",
            "url": "https://files.pythonhosted.org/packages/a8/81/a25bc1e669fa9f999b8dd1b4a2dccb8412a13a13964c833fe13ca72915e9/csvuniondiff-0.0.0.dev1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-13 02:12:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "BrianWeiHaoMa",
    "github_project": "csvuniondiff",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "csvuniondiff"
}
        
Elapsed time: 4.62629s