**FindAnywhere** is a tool designed for data analysts and developers facing the challenge of extracting meaningful
information from poorly structured or malformed CSV files.
This tool simplifies the process of filtering and analyzing data by allowing users to prefilter large datasets without
needing to correct their format first, focusing efforts on smaller, more relevant subsets.
## Example
Suppose we have a malformed CSV file where some parts of the address blend into the email column. Furthermore,
the CSV file has some escaping issues that generate extra columns.
````csv
username,address,email,notes
alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,
bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local
charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,
````
We want to search the CSV file for email addresses and some town information, but can't rely on the data
being present where it should be. The information we seek is provided as a JSON file:
````json
[
{"id": "alice", "email": "alice.ashcroft@here.local", "town": "Ashville"},
{"id": "charlie", "email": "charlie.st.claire@here.local"}
]
````
After running *findanywhere* on the datasets, we get the following results in JSON Lines format. The result file holds
the records that might be relevant for further analysis, so the original CSV file does not have to be fixed first.
This helps especially when problems are hard to find in larger data sets.
````json lines
{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
````
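The result lines can be inspected with a few lines of standard-library Python. The sketch below is only an illustration: `summarize` is a hypothetical helper, and the observation that `score` equals the mean of the per-field similarities is an assumption drawn from the two example lines above, not a documented guarantee.

```python
import json
from statistics import fmean

# The two result lines, copied from the example output above.
RESULT_LINES = """\
{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}, "value": "alice.ashcroft@here.local", "similarity": 1.0}, "id": {"position": {"line": 0, "column": "username"}, "value": "alice.ashcroft", "similarity": 0.8714285714285714}, "town": {"position": {"line": 0, "column": "address"}, "value": "5th Avenue Ashville", "similarity": 1.0}}, "score": 0.9571428571428572}
{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}, "value": "charlie.st.claire@here.local", "similarity": 1.0}, "id": {"position": {"line": 2, "column": "username"}, "value": "charlie.st.claire", "similarity": 0.8823529411764706}}, "score": 0.9411764705882353}
"""


def summarize(line):
    """Return (id, {field: (line, column)}, mean similarity, reported score)."""
    record = json.loads(line)
    matches = record["best_matches"]
    # Where each searched field was actually found in the CSV.
    locations = {
        field: (match["position"]["line"], match["position"]["column"])
        for field, match in matches.items()
    }
    # Assumption: the score is the average of the per-field similarities.
    mean_similarity = fmean(match["similarity"] for match in matches.values())
    return record["of"], locations, mean_similarity, record["score"]


for line in RESULT_LINES.splitlines():
    of, locations, mean_similarity, score = summarize(line)
    print(of, locations, round(mean_similarity, 6), round(score, 6))
```

For `charlie`, the location map shows the email was found in the `address` column, which is exactly the kind of misplaced value the search is meant to surface.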
Alternatively, a direct search can be issued with the following command:
````shell
findanywhere_search search_data.json input.csv \
--source tabular --threshold constant \
--threshold-constant 0.8 --similarity jaro_winkler
````
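The `jaro_winkler` similarity named in the command is a standard string measure. The following is a minimal sketch of it in plain Python, not FindAnywhere's own implementation. The `token_best_fit` helper is hypothetical: it illustrates how a token-wise comparison could produce the perfect `town` match in the example, and is only an assumption about what `token_best_fit_similarity` does.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def token_best_fit(needle, haystack):
    """Hypothetical: best Jaro-Winkler score over whitespace-separated tokens."""
    return max((jaro_winkler(needle, token) for token in haystack.split()),
               default=0.0)


print(round(jaro_winkler("alice", "alice.ashcroft"), 6))
print(round(jaro_winkler("charlie", "charlie.st.claire"), 6))
print(token_best_fit("Ashville", "5th Avenue Ashville"))
```

With the usual prefix scale of 0.1, this sketch gives about 0.871429 for `alice` against `alice.ashcroft` and about 0.882353 for `charlie` against `charlie.st.claire`, which agrees with the `id` similarities in the example output. That suggests, but does not confirm, that the tool computes the measure the same way.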
## Usage
Start by creating a schema to define the parameters for searching through your data:
````shell
findanywhere_schema tabular string_based_evaluation \
--threshold constant \
--out schema.yml
````
Edit the **schema.yml** file as needed, using the documentation to guide the configuration of options and methods.
````yaml
deduction:
  config: {}
  name: average
evaluation:
  config:
    aggregate: max
    similarity: token_best_fit_similarity
    similarity_parameter: {}
  name: string_based_evaluation
source:
  config:
    encoding: utf-8
    errors: surrogateescape
  name: tabular
threshold:
  config:
    constant: 0.9
  name: constant
````
Run the tool against your datasets using the defined schema:
````shell
findanywhere schema.yml search_data.json garbage.csv --out result.json_line
````
Results are stored in `result.json_line`. For additional commands and options, use the `--help` flag.
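Once the result file exists, the matched lines can be pulled out of the original CSV to obtain the smaller subset for further work. The sketch below is an illustration under stated assumptions: `matched_line_numbers` and `extract_subset` are hypothetical helpers, each record is assumed to occupy one physical line, and `line` is assumed to be a 0-based index counted from the first row after the header, as the example output suggests.

```python
import json


def matched_line_numbers(result_lines):
    """Collect every CSV line index referenced by any best match."""
    numbers = set()
    for raw in result_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        for match in record["best_matches"].values():
            numbers.add(match["position"]["line"])
    return numbers


def extract_subset(csv_lines, result_lines):
    """Return the header plus the raw CSV rows that produced a match.

    Rows are kept as raw text, so malformed rows are copied unchanged.
    """
    header, *rows = csv_lines
    wanted = matched_line_numbers(result_lines)
    return [header] + [row for index, row in enumerate(rows) if index in wanted]


# Example data from this README; result lines are trimmed to the fields used here.
CSV_LINES = [
    "username,address,email,notes",
    "alice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,",
    "bob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local",
    "charlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,",
]
RESULT_LINES = [
    '{"of": "alice", "best_matches": {"email": {"position": {"line": 0, "column": "email"}}}}',
    '{"of": "charlie", "best_matches": {"email": {"position": {"line": 2, "column": "address"}}}}',
]

for row in extract_subset(CSV_LINES, RESULT_LINES):
    print(row)
```

For real files, the two lists could be read with `open("garbage.csv", encoding="utf-8", errors="surrogateescape")` and `open("result.json_line", encoding="utf-8")`, calling `.read().splitlines()` on each.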
## Installation
Install **FindAnywhere** using pip:
````shell
pip install findanywhere
````
## Key Features
- **Robust Malformed File Handling:** Efficiently processes CSV files with irregular column structures or misplaced data entries.
- **Fuzzy Matching Capabilities:** Matches data points by string similarity (for example `jaro_winkler` or `token_best_fit_similarity`), accommodating various types of data discrepancies.
- **Parallel Processing Support:** Leverages multiple processes to enhance performance on large datasets.
Raw data
{
"_id": null,
"home_page": "https://gitlab.com/patrick.daniel.gress/findanywhere",
"name": "findanywhere",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": "search, fuzzy_search, preprocessing",
"author": "voidpointercast",
"author_email": "voidpointercast@justmail.de",
"download_url": null,
"platform": null,
"description": "**FindAnywhere** is a tool designed for data analysts and developers facing the challenge of extracting meaningful \ninformation from poorly structured or malformed CSV files. \nThis tool simplifies the process of filtering and analyzing data by allowing users to prefilter large datasets without \nneeding to correct their format first, focusing efforts on smaller, more relevant subsets.\n\n\n## Example\n\nSuppose we have a malformed CSV file where some parts of the address blend into the email column. Furthermore,\nthe csv file has some escaping issues, generating extra columns. \n\n````csv\nusername,address,email,notes\nalice.ashcroft,5th Avenue Ashville,alice.ashcroft@here.local,\nbob.bones,Alice Ashcroft Memorial Lane,Ashville Cyan County, California,bob.bones@here.local\ncharlie.st.claire,charlie.st.claire@here.local,1st street Cleveland,\n````\n\nWe want to search the csv file for email addresses and some town information, but can't rely on the data\nbeing present where it should be. The information we seek is provided as json file:\n\n````json\n[\n {\"id\": \"alice\", \"email\": \"alice.ashcroft@here.local\", \"town\": \"Ashville\"},\n {\"id\": \"charlie\", \"email\": \"charlie.st.claire@here.local\"}\n]\n````\n\nAfter running *findanywhere* on the datasets we get the following results in the json lines format. 
The result file holds\ndata sets that might be relevant to analyze further, without having to fix the original csv file, especially when\nproblems might be hard to find in larger data sets.\n\n````json lines\n{\"of\": \"alice\", \"best_matches\": {\"email\": {\"position\": {\"line\": 0, \"column\": \"email\"}, \"value\": \"alice.ashcroft@here.local\", \"similarity\": 1.0}, \"id\": {\"position\": {\"line\": 0, \"column\": \"username\"}, \"value\": \"alice.ashcroft\", \"similarity\": 0.8714285714285714}, \"town\": {\"position\": {\"line\": 0, \"column\": \"address\"}, \"value\": \"5th Avenue Ashville\", \"similarity\": 1.0}}, \"score\": 0.9571428571428572}\n{\"of\": \"charlie\", \"best_matches\": {\"email\": {\"position\": {\"line\": 2, \"column\": \"address\"}, \"value\": \"charlie.st.claire@here.local\", \"similarity\": 1.0}, \"id\": {\"position\": {\"line\": 2, \"column\": \"username\"}, \"value\": \"charlie.st.claire\", \"similarity\": 0.8823529411764706}}, \"score\": 0.9411764705882353}\n````\n\nAlternatively, a direct search can be issued by using the command\n\n````shell\nfindanywhere_search search_data.json input.csv \\\n--source tabular --threshold constant\n--threshold-constant 0.8 --similarity jaro_winkler\n````\n\n\n## Usage\n\nStart by creating a schema to define the parameters for searching through your data:\n\n\n````shell\nfindanywhere_schema tabular string_based_evaluation \\\n--threshold constant \\\n--out schema.yml\n````\n\nEdit the **schema.yml** file as needed, using the documentation to guide the configuration of options and methods.\n````yaml\ndeduction:\n config: {}\n name: average\nevaluation:\n config:\n aggregate: max\n similarity: token_best_fit_similarity\n similarity_parameter: {}\n name: string_based_evaluation\nsource:\n config:\n encoding: utf-8\n errors: surrogateescape\n name: tabular\nthreshold:\n config:\n constant: 0.9\n name: constant\n````\n\nRun the tool against your datasets using the defined 
schema:\n\n````shell\nfindanywhere schema.yml search_data.json garbage.csv --out result.json_line\n````\n\nResults will be stored in result.json_line. For additional commands and options, use the --help flag.\n\n## Installation\n\nInstall **WhereIsIt** easily using pip:\n\n````shell\npip install findanywhere\n````\n\n\n## Key Features\n\n- **Robust Malformed File Handling:** Efficiently processes CSV files with irregular column structures or misplaced data entries.\n- **Fuzzy Matching Capabilities:** Utilizes advanced algorithms to match data points based on similarity, accommodating various types of data discrepancies.\n- **Parallel Processing Support:** Leverages multiple processes to enhance performance on large datasets.\n",
"bugtrack_url": null,
"license": "BSD License (BSD)",
"summary": "Tool for searching data in possible malformed input data as preprocessing step for further analysis.",
"version": "1.6.3",
"project_urls": {
"Documentation": "https://findanywhere.readthedocs.io/en/latest",
"Homepage": "https://gitlab.com/patrick.daniel.gress/findanywhere",
"Repository": "https://gitlab.com/patrick.daniel.gress/findanywhere"
},
"split_keywords": [
"search",
" fuzzy_search",
" preprocessing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "211c6326ef571b3082cdf3c0ed1905701d6f7ba67cb8367f7eec9e5752807cd3",
"md5": "37c01046a81dbfd5bab215255eb76e58",
"sha256": "9119c1b3a6ae3ed84cbe55633634c452f56086a4229a00c33e44839daedce9bd"
},
"downloads": -1,
"filename": "findanywhere-1.6.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "37c01046a81dbfd5bab215255eb76e58",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 45487,
"upload_time": "2024-10-23T07:36:35",
"upload_time_iso_8601": "2024-10-23T07:36:35.163290Z",
"url": "https://files.pythonhosted.org/packages/21/1c/6326ef571b3082cdf3c0ed1905701d6f7ba67cb8367f7eec9e5752807cd3/findanywhere-1.6.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-23 07:36:35",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "patrick.daniel.gress",
"gitlab_project": "findanywhere",
"lcname": "findanywhere"
}