TabularDataInvestigation


-   Name: TabularDataInvestigation
-   Version: 0.0.8
-   home_page: https://github.com/Tanvir223/TabularDataInvestigation
-   Summary: This package provide a fast tabular data investigation and it will eligible for ML model building and also helps to developers in their projects when needed
-   upload_time: 2023-07-07 09:21:00
-   docs_url: None
-   author: Tanvir Islam
-   license: MIT
-   keywords: pypi, tabulardatainvestigation, tabulardata, data-manupulation, data-preprocessing, data cleaning, machine learning, artificial intelligence, industry data, data science
-   requirements: No requirements were recorded.
-   Travis-CI: No Travis.
-   coveralls test coverage: No coveralls.
            <div class="cell markdown" id="lJNrv9lwHw7t">

## This package provides fast tabular data investigation that helps prepare data for ML model building and helps developers in their projects when needed. Most of the functions return a dataframe or JSON as output.

</div>

<div class="cell code" id="HRSU2lZ5Ht9v">

``` shell
pip install TabularDataInvestigation
```

</div>

<div class="cell code" id="VzC1_y2hTSlb">

``` python
import pandas as pd  # the examples below build their dataframes with pd.DataFrame
from TabularDataInvestigation import tdi
```

</div>

<div class="cell markdown" id="5o29-lUSKmIy">

# tdi.find_index_for_null_values(df, return_type='dataframe')

**Parameters (Input):**

-   df: pandas Dataframe
-   return_type (optional): 'dataframe' or 'json' (Default = 'dataframe')

</div>

<div class="cell markdown" id="ZxDbq_gbf3fM">

**Output : DataFrame**

</div>

<div class="cell markdown" id="vYvrNlAvIDQd">

Sometimes we need to drop or fill null values cell by cell, using a
different method for each. Some missing values are meaningful and some
are unnecessary. Using this function, we can get the indexes of all the
missing cells and use them in our project.

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:143}"
id="TGS5k7JWIqz8" outputId="3663fd25-23cb-4558-a175-5ede1691661d">

``` python
df = pd.DataFrame({'A': [1, None, 3], 'B': ['!', 5, '?'], 'C': ['a', 'b', None]})
df
```

<div class="output execute_result" execution_count="19">

         A  B     C
    0  1.0  !     a
    1  NaN  5     b
    2  3.0  ?  None

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:81}"
id="HLVms92Zn4qW" outputId="669872b8-bce7-456b-afae-37dca420cc49">

``` python
tdi.find_index_for_null_values(df)
```

<div class="output execute_result" execution_count="20">

         A    C
    0  [1]  [2]

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:35}"
id="8_NmdR_UKRWy" outputId="b650a6f9-2f2b-49f8-a3e0-e575a1d74df0">

``` python
tdi.find_index_for_null_values(df, return_type='json')
```

<div class="output execute_result" execution_count="21">

``` json
{"type":"string"}
```

</div>

</div>

<div class="cell markdown" id="d9uRAKLzKYDy">

Here `return_type` is optional ('dataframe' or 'json') and defaults to
'dataframe'. From the output we can see that column "A" has a null value
at index 1 and column "C" has a null value at index 2.

</div>
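
<div class="cell markdown">

As a rough illustration of how such per-column index lists can be used,
the sketch below relies on plain pandas only. The helper
`null_index_map` is a hypothetical name introduced here and is not part
of `tdi`; it simply mimics the shape of the output shown above, and the
choice of fill and drop strategies is an assumption made for the
example.

</div>

<div class="cell code">

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': ['!', 5, '?'], 'C': ['a', 'b', None]})


def null_index_map(frame):
    """Map each column that has missing values to the row labels where they occur.

    Hypothetical helper for illustration; not the tdi implementation.
    """
    mask = frame.isna()
    return {
        col: frame.index[mask[col].to_numpy()].tolist()
        for col in frame.columns
        if mask[col].any()
    }


null_positions = null_index_map(df)

# Handle each column differently: fill the numeric gap in "A" with the
# column mean, and drop the rows where "C" is missing.
cleaned = df.copy()
cleaned.loc[null_positions['A'], 'A'] = cleaned['A'].mean()
cleaned = cleaned.drop(index=null_positions['C'])

print(null_positions)
print(cleaned)
```

</div>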

<div class="cell markdown" id="OSAs6sHgLVf3">

# tdi.check_error_data_types(df, return_type='dataframe')

**Parameters (Input):**

-   df: pandas Dataframe
-   return_type (optional): 'dataframe' or 'json' (Default = 'dataframe')

</div>

<div class="cell markdown" id="jTm7bE7mgP1k">

**Output : DataFrame**

</div>

<div class="cell markdown" id="OpKOTMPmLl5w">

Data type issues often cause unexpected behavior in a dataframe.
Sometimes a column looks numeric, but on inspection it turns out to be an
object or string column because of a few bad values in the data. This
function finds the cause in the dataframe.

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:143}"
id="WQAv7nOpLktf" outputId="599eb053-4f53-4857-a2a0-52c1c68fc8d3">

``` python
df = pd.DataFrame({'A': [1, 'a', 3], 'B': [1, 5, 2], 'C': [1.3, 3.9,'2,0']})
df
```

<div class="output execute_result" execution_count="25">

       A  B    C
    0  1  1  1.3
    1  a  5  3.9
    2  3  2  2,0

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:112}"
id="S8gaMAASbZc4" outputId="0325075a-93e6-4347-bbfc-823e74021772">

``` python
tdi.check_error_data_types(df)
```

<div class="output execute_result" execution_count="26">

      columns error_data error_index
    0       A        [a]         [1]
    1       C      [2,0]         [2]

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:35}"
id="RcCmVNZAMgkH" outputId="4c6315d7-6fd3-4a87-9ef2-4116f47b4401">

``` python
tdi.check_error_data_types(df, return_type='json')
```

<div class="output execute_result" execution_count="27">

``` json
{"type":"string"}
```

</div>

</div>

<div class="cell markdown" id="NSjujMp0Mqv1">

Here `return_type` is optional ('dataframe' or 'json') and defaults to
'dataframe'. The output above shows that column `"A"` has the error value
`"a"` at index `1`, and column `"C"` has the error value `"2,0"` at index `2`.

</div>
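
<div class="cell markdown">

One common way to locate such values with plain pandas is to coerce each
non-numeric column with `pd.to_numeric(..., errors='coerce')` and report
the cells that fail to convert. The sketch below takes that approach.
`non_numeric_cells` is a hypothetical helper, not the `tdi`
implementation, and it would also report every cell of a genuinely
textual column, so it is only meant for columns expected to be numeric.

</div>

<div class="cell code">

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 'a', 3], 'B': [1, 5, 2], 'C': [1.3, 3.9, '2,0']})


def non_numeric_cells(frame):
    """Report values that prevent a column from being parsed as numeric.

    Hypothetical helper for illustration; not the tdi implementation.
    """
    rows = []
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # already numeric, nothing to report
        converted = pd.to_numeric(frame[col], errors='coerce')
        # A cell is "bad" if it was not null before but became NaN after coercion.
        bad = converted.isna() & frame[col].notna()
        if bad.any():
            rows.append({
                'columns': col,
                'error_data': frame.loc[bad, col].tolist(),
                'error_index': frame.index[bad.to_numpy()].tolist(),
            })
    return pd.DataFrame(rows, columns=['columns', 'error_data', 'error_index'])


report = non_numeric_cells(df)
print(report)
```

</div>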

<div class="cell markdown" id="IwI-sfT4MsLG">

# tdi.check_num_of_min_category(df, return_type='dataframe')

**Parameters (Input):**

-   df: pandas Dataframe
-   minimum_threshold (optional): the minimum count of a category
    (Default = 3)
-   return_type (optional): the format of the output, 'dataframe' or
    'json' (Default = 'dataframe')

</div>

<div class="cell markdown" id="nQZU9yJAhqOg">

**Output : DataFrame**

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:175}"
id="eM1XHOiOM1Fd" outputId="4b9d38f4-fe36-4b69-e47c-971529b56d69">

``` python
df = pd.DataFrame({'A': ['b', 'a', 'b','a'], 'B': ['x', 'x', 'y','x'], 'C': ['p', 'p', 'q','q']})
df
```

<div class="output execute_result" execution_count="28">

       A  B  C
    0  b  x  p
    1  a  x  p
    2  b  y  q
    3  a  x  q

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:81}"
id="EwgrDhQ4C0fj" outputId="7adb6b4a-e039-46f5-9273-93e4299faf54">

``` python
tdi.check_num_of_min_category(df, minimum_threshold=1)
```

<div class="output execute_result" execution_count="29">

      columns category index
    0       B      [y]   [2]

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:35}"
id="zNg1D-hZNwhF" outputId="c4f3bef9-1796-499d-927c-ab572ab08c51">

``` python
tdi.check_num_of_min_category(df, minimum_threshold=1, return_type='json')
```

<div class="output execute_result" execution_count="30">

``` json
{"type":"string"}
```

</div>

</div>

<div class="cell markdown" id="FpWRMlBINw0M">

Here `return_type` is optional ('dataframe' or 'json') and defaults to
'dataframe'. The output above shows that column "B" contains a rare
category: "y" appears only once, at index 2, which does not exceed the
minimum threshold of 1 that we set.

</div>
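
<div class="cell markdown">

The same kind of check can be sketched with `value_counts` in plain
pandas. `rare_categories` is a hypothetical helper, not the `tdi`
implementation. The comparison `count <= minimum_threshold` is an
assumption inferred from the example output above, where a category with
a count of 1 is reported for a threshold of 1.

</div>

<div class="cell code">

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                   'B': ['x', 'x', 'y', 'x'],
                   'C': ['p', 'p', 'q', 'q']})


def rare_categories(frame, minimum_threshold=3):
    """List categories whose count does not exceed minimum_threshold.

    Hypothetical helper for illustration; the <= comparison is an assumption.
    """
    rows = []
    for col in frame.columns:
        counts = frame[col].value_counts()
        rare = counts[counts <= minimum_threshold].index.tolist()
        if rare:
            hits = frame[col].isin(rare)
            rows.append({
                'columns': col,
                'category': rare,
                'index': frame.index[hits.to_numpy()].tolist(),
            })
    return pd.DataFrame(rows, columns=['columns', 'category', 'index'])


rare = rare_categories(df, minimum_threshold=1)
print(rare)
```

</div>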

<div class="cell markdown" id="gXIXp6XBN7NM">

# tdi.check_col_with_one_category(df, return_type='dataframe')

**Parameters (Input):**

-   df: pandas Dataframe
-   return_type (optional): the format of the output, 'dataframe' or
    'json' (Default = 'dataframe')

</div>

<div class="cell markdown" id="5yE89DndjHNl">

**Output : DataFrame**

</div>

<div class="cell markdown" id="YO1-JmApOFDz">

Sometimes a categorical column has no variation, meaning every value in
the column is the same. This function finds those column(s).

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:175}"
id="GOO1rwMHN6c0" outputId="288a9482-0ac7-437b-f248-67f5021c8753">

``` python
df = pd.DataFrame({'A': ['b', 'a', 'b','a'], 'B': ['x', 'x', 'x','x'], 'C': ['p', 'p', 'q','q']})
df
```

<div class="output execute_result" execution_count="31">

       A  B  C
    0  b  x  p
    1  a  x  p
    2  b  x  q
    3  a  x  q

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:81}"
id="R_dXOEClEBKg" outputId="c10cbf14-cb6c-49fc-82a6-e37fdd3c4787">

``` python
tdi.check_col_with_one_category(df)
```

<div class="output execute_result" execution_count="32">

      columns category_name
    0       B           [x]

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:35}"
id="m2Fz5iqlO9nM" outputId="1cbbf8e0-c640-47bc-ae60-1a16ef866c60">

``` python
tdi.check_col_with_one_category(df,return_type='json')
```

<div class="output execute_result" execution_count="33">

``` json
{"type":"string"}
```

</div>

</div>

<div class="cell markdown" id="uR3g6QQIPFRh">

Here `return_type` is optional ('dataframe' or 'json') and defaults to
'dataframe'. The output above shows that column "B" has only one
category, whose value is "x".

</div>
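
<div class="cell markdown">

A constant column carries no information for a model, so it is usually
dropped. The sketch below shows one way to detect and drop such columns
with plain pandas. `single_value_columns` is a hypothetical helper, not
the `tdi` implementation, and ignoring nulls when counting distinct
values is an assumption of this sketch.

</div>

<div class="cell code">

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                   'B': ['x', 'x', 'x', 'x'],
                   'C': ['p', 'p', 'q', 'q']})


def single_value_columns(frame):
    """Find columns whose non-null values are all identical.

    Hypothetical helper for illustration; not the tdi implementation.
    """
    rows = []
    for col in frame.columns:
        values = list(frame[col].dropna().unique())
        if len(values) == 1:
            rows.append({'columns': col, 'category_name': values})
    return pd.DataFrame(rows, columns=['columns', 'category_name'])


constant = single_value_columns(df)

# Drop the constant columns before model building.
reduced = df.drop(columns=constant['columns'].tolist())

print(constant)
print(reduced.columns.tolist())
```

</div>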

<div class="cell markdown" id="zal9tBcxPH5B">

# tdi.find_special_char_index(df, return_type='dataframe')

**Parameters (Input):**

-   df: pandas Dataframe
-   return_type (optional): the format of the output, 'dataframe' or
    'json' (Default = 'dataframe')

</div>

<div class="cell markdown" id="IzRE_G_3jwN1">

**Output : DataFrame**

</div>

<div class="cell markdown" id="BujDuVwXPRrg">

This function finds the indexes of cells in the dataframe that contain
double spaces or special characters.

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:143}"
id="WzJTJwX3O9vR" outputId="3f85f568-b4ad-4a2c-a3d2-8fa68a9099e2">

``` python
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1.2, 2.6, '3,2']})
df
```

<div class="output execute_result" execution_count="34">

       A  B    C
    0  1  !  1.2
    1  2  5  2.6
    2  3  ?  3,2

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:143}"
id="OhApZcUnENOd" outputId="934cef2f-027e-44b0-f5e8-574664df530a">

``` python
tdi.find_special_char_index(df)
```

<div class="output execute_result" execution_count="35">

      columns has_special_char_at
    0       A                  []
    1       B              [0, 2]
    2       C                 [2]

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:53}"
id="BpYx5OUJP6TR" outputId="a4ba4445-0265-4826-a059-b7103b62b763">

``` python
tdi.find_special_char_index(df, return_type='json')
```

<div class="output execute_result" execution_count="36">

``` json
{"type":"string"}
```

</div>

</div>

<div class="cell markdown" id="CBfT0D9sP6ho">

Here `return_type` is optional ('dataframe' or 'json') and defaults to
'dataframe'. The output dataframe above shows that column "B" has special
characters at indexes \[0,2\] and column "C" has a special character at
index \[2\].

</div>
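
<div class="cell markdown">

A regular expression is one way to sketch this kind of scan. The pattern
below is an assumption made for illustration: it treats two or more
consecutive spaces, or any character other than letters, digits,
underscores, whitespace and the decimal point, as "special". The rules
that `tdi` actually applies may differ, and `special_char_index` is a
hypothetical helper.

</div>

<div class="cell code">

```python
import re

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1.2, 2.6, '3,2']})

# Assumed definition of "special": a run of 2+ spaces, or any character that
# is not a word character, whitespace, or a decimal point.
SPECIAL = re.compile(r" {2,}|[^\w\s.]")


def special_char_index(frame):
    """List, per column, the row labels whose text matches SPECIAL.

    Hypothetical helper for illustration; not the tdi implementation.
    """
    rows = []
    for col in frame.columns:
        hits = [label for label, value in frame[col].items()
                if SPECIAL.search(str(value))]
        rows.append({'columns': col, 'has_special_char_at': hits})
    return pd.DataFrame(rows, columns=['columns', 'has_special_char_at'])


special = special_char_index(df)
print(special)
```

</div>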

<div class="cell markdown" id="78DYtKBjQFDA">

# tdi.duplicate_columns(df)

**Parameters (Input):**

-   df: pandas Dataframe

</div>

<div class="cell markdown" id="qMMBY5Kuk73j">

**Output : List**

</div>

<div class="cell markdown" id="rKojcLDDQXC4">

This function returns a list of the names of columns that contain the
same values. It also handles the case where the column names differ but
the data is the same.

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:143}"
id="h3KeKIY6Q1Hn" outputId="e6dd6b1d-eade-4120-ce55-ae9a5bc92896">

``` python
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1, 2, 3]})
df
```

<div class="output execute_result" execution_count="37">

       A  B  C
    0  1  !  1
    1  2  5  2
    2  3  ?  3

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;}"
id="oLt8h-ksEQCf" outputId="93387b3e-d8d1-4df3-8071-4884039a9575">

``` python
l = tdi.duplicate_columns(df)
l
```

<div class="output execute_result" execution_count="38">

    ['A', 'C']

</div>

</div>

<div class="cell markdown" id="ii508Ja0Q87v">

Here, columns 'A' and 'C' contain the same data.

</div>
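
<div class="cell markdown">

For comparison, a pairwise check with `Series.equals` gives the same
kind of list in plain pandas. `duplicated_column_names` is a hypothetical
helper, not the `tdi` implementation. Note that `Series.equals` also
requires matching dtypes, which is an assumption of this sketch about
what counts as "the same data".

</div>

<div class="cell code">

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['!', 5, '?'], 'C': [1, 2, 3]})


def duplicated_column_names(frame):
    """Return the names of columns whose values match another column exactly.

    Hypothetical helper for illustration; not the tdi implementation.
    """
    names = []
    cols = list(frame.columns)
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            # Series.equals compares values and index, not the column name.
            if frame[left].equals(frame[right]):
                for name in (left, right):
                    if name not in names:
                        names.append(name)
    return names


dupes = duplicated_column_names(df)

# With a single group of duplicates, as here, keep the first column
# and drop the rest.
deduplicated = df.drop(columns=dupes[1:])

print(dupes)
print(deduplicated.columns.tolist())
```

</div>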

<div class="cell markdown" id="_tH4YAE-SIKl">

# tdi.correlated_columns(df, return_type='dataframe')

**Parameters (Input):**

-   df: pandas Dataframe
-   return_type (optional): the format of the output, 'dataframe' or
    'json' (Default = 'dataframe')

</div>

<div class="cell markdown" id="HjHMEZB6lNoC">

**Output : DataFrame**

</div>

<div class="cell markdown" id="9QYFkYRQSQb-">

This function returns a dataframe or JSON listing the columns that are
different but whose data is more than 90% correlated.

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:143}"
id="qTs0jpchFPHD" outputId="e49b0c80-9db3-478d-d2f0-301cb27b9c75">

``` python
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 5, 'b'], 'C': [1, 2, 3]})
df
```

<div class="output execute_result" execution_count="40">

       A  B  C
    0  1  a  1
    1  2  5  2
    2  3  b  3

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:167}"
id="zwpXXkFPFAFH" outputId="34fe3c23-24ed-41ea-b0b6-888c1449cd73">

``` python
tdi.correlated_columns(df, return_type='dataframe')
```

<div class="output stream stderr">

    /usr/local/lib/python3.10/dist-packages/TabularDataInvestigation/tdi.py:116: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
      correalated_df = df.corr()

</div>

<div class="output execute_result" execution_count="41">

      columns correlated_columns correlation
    0       A                [C]       [1.0]
    1       C                [A]       [1.0]

</div>

</div>

<div class="cell code"
colab="{&quot;base_uri&quot;:&quot;https://localhost:8080/&quot;,&quot;height&quot;:107}"
id="1_5JXsFgFIdg" outputId="cf45f62e-cd6e-4fed-9803-d2a30b4f4f95">

``` python
tdi.correlated_columns(df, return_type='json')
```

<div class="output stream stderr">

    /usr/local/lib/python3.10/dist-packages/TabularDataInvestigation/tdi.py:116: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
      correalated_df = df.corr()

</div>

<div class="output execute_result" execution_count="42">

``` json
{"type":"string"}
```

</div>

</div>

<div class="cell markdown" id="AMpvo2RYS92c">

Here `return_type` is optional ('dataframe' or 'json') and defaults to
'dataframe'. The output above shows that column A is correlated with
column C, along with the correlation value.

</div>
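
<div class="cell markdown">

The idea can be sketched in plain pandas with `DataFrame.corr`. Passing
`numeric_only=True` explicitly (available in pandas 1.5 and later)
restricts the calculation to numeric columns and avoids the
`FutureWarning` shown above. `highly_correlated` is a hypothetical
helper, not the `tdi` implementation, and comparing the absolute
correlation against 0.9 is an assumption based on the "more than 90%"
description.

</div>

<div class="cell code">

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 5, 'b'], 'C': [1, 2, 3]})


def highly_correlated(frame, threshold=0.9):
    """List, per numeric column, the other columns correlated above threshold.

    Hypothetical helper for illustration; not the tdi implementation.
    """
    corr = frame.corr(numeric_only=True)  # non-numeric columns such as "B" are skipped
    rows = []
    for col in corr.columns:
        others = corr[col].drop(labels=col)  # ignore self-correlation
        strong = others[others.abs() > threshold]
        if not strong.empty:
            rows.append({
                'columns': col,
                'correlated_columns': strong.index.tolist(),
                'correlation': strong.round(6).tolist(),
            })
    return pd.DataFrame(rows, columns=['columns', 'correlated_columns', 'correlation'])


pairs = highly_correlated(df)
print(pairs)
```

</div>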


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Tanvir223/TabularDataInvestigation",
    "name": "TabularDataInvestigation",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "pypi,TabularDataInvestigation,TabularData,Data-Manupulation,Data-Preprocessing,Data Cleaning,Machine Learning,Artificial Intelligence,Industry Data,Data Science",
    "author": "Tanvir Islam",
    "author_email": "islamtanvir659@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/65/90/b06f649b379a796023c933c2b7bdc64d1b04db945aec17e1c8677bbb481a/TabularDataInvestigation-0.0.8.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "This package provide a fast tabular data investigation and it will eligible for ML model building and also helps to developers in their projects when needed",
    "version": "0.0.8",
    "project_urls": {
        "Bug Tracker": "https://github.com/Tanvir223/TabularDataInvestigation/issues",
        "Download": "https://github.com/Tanvir223/TabularDataInvestigation/archive/refs/tags/0.0.8.tar.gz",
        "Homepage": "https://github.com/Tanvir223/TabularDataInvestigation"
    },
    "split_keywords": [
        "pypi",
        "tabulardatainvestigation",
        "tabulardata",
        "data-manupulation",
        "data-preprocessing",
        "data cleaning",
        "machine learning",
        "artificial intelligence",
        "industry data",
        "data science"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6590b06f649b379a796023c933c2b7bdc64d1b04db945aec17e1c8677bbb481a",
                "md5": "b7c015eb54ed14721108cf800bfdf0eb",
                "sha256": "6cdaaa6daa283ffd028f587eb9a6800028b11b7112ea01d81e2ed9f23544768c"
            },
            "downloads": -1,
            "filename": "TabularDataInvestigation-0.0.8.tar.gz",
            "has_sig": false,
            "md5_digest": "b7c015eb54ed14721108cf800bfdf0eb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14462,
            "upload_time": "2023-07-07T09:21:00",
            "upload_time_iso_8601": "2023-07-07T09:21:00.820115Z",
            "url": "https://files.pythonhosted.org/packages/65/90/b06f649b379a796023c933c2b7bdc64d1b04db945aec17e1c8677bbb481a/TabularDataInvestigation-0.0.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-07 09:21:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Tanvir223",
    "github_project": "TabularDataInvestigation",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "tabulardatainvestigation"
}
        