# Regex-learner
This project provides a tool/library implementing an automated regular expression building mechanism.
This project takes inspiration from the paper by Ilyas et al. [1]:
[Ilyas, Andrew, M. F. da Trindade, Joana, Castro Fernandez, Raul and Madden, Samuel. 2018. "Extracting Syntactical Patterns from Databases."](https://hdl.handle.net/1721.1/137774)
This repository contains code and examples for learning regular expressions from columns of data.
This is a basic readme. It will be completed as the prototype grows.
# Installation
The project can be installed via pip:
```bash
pip install regex-learner
```
# Examples of usage
The following example learns a date pattern from 100 randomly sampled dates in the format DD-MM-YYYY.
```python
# Requires the faker package in addition to regex-learner (pip install faker)
from xsystem import XTructure
from faker import Faker

fake = Faker()
x = XTructure()  # Create basic XTructure class

for _ in range(100):
    d = fake.date(pattern=r"%d-%m-%Y")  # Create example of data - date in the format DD-MM-YYYY
    x.learn_new_word(d)  # Add example to XSystem and learn new features

# The dates are random, so the learned pattern varies between runs, e.g.
# ([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])
print(str(x))
```
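Once a pattern has been learned, it can be used like any other regular expression. The sketch below is illustrative only: it takes the sample pattern from the comment in the example above (a real run will most likely print a different one) and checks candidate strings against it with Python's standard `re` module. The helper name `looks_like_learned_date` is hypothetical and not part of this package.

```python
import re

# Sample pattern copied from the comment in the example above; it is only an
# illustration, since the learned pattern depends on the random input dates.
LEARNED = r"([0312][0-9])(-)([01][891652073])(-)([21][09][078912][0-9])"
PATTERN = re.compile(LEARNED)


def looks_like_learned_date(value: str) -> bool:
    """Return True if the whole string matches the learned pattern."""
    return PATTERN.fullmatch(value) is not None


print(looks_like_learned_date("25-12-1999"))  # True
print(looks_like_learned_date("1999-12-25"))  # False: wrong field order
print(looks_like_learned_date("39-19-2099"))  # True: right shape, but not a real date
```

The last call shows that this pattern describes the syntax of the examples rather than their meaning: it accepts any string with the same character structure, including impossible dates.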
Similarly, the tool can be used directly from the command line via the `regex-learner` CLI, which is installed with the package.
The tool has several options, as described by the help message:
```
> regex-learner -h
usage: regex-learner [-h] [-i INPUT] [-o OUTPUT] [--max-branch MAX_BRANCH] [--alpha ALPHA] [--branch-threshold BRANCH_THRESHOLD]

A simple tool to learn human readable a regular expression from examples

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Path to the input source, defaults to stdin
  -o OUTPUT, --output OUTPUT
                        Path to the output file, defaults to stdout
  --max-branch MAX_BRANCH
                        Maximum number of branches allowed, defaults to 8
  --alpha ALPHA         Weight for fitting tuples, defaults to 1/5
  --branch-threshold BRANCH_THRESHOLD
                        Branching threshold, defaults to 0.85, relative to the fitting score alpha
```
Assuming a data file containing the examples to learn from is called `EXAMPLE_FILE`, and assuming one is interested in a very simple regular expression, the tool can be used as follows:
```bash
cat EXAMPLE_FILE | regex-learner --max-branch 2
```
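To try the command above without real data, an input file can be generated first. The following is a minimal sketch that writes 100 random DD-MM-YYYY dates to `EXAMPLE_FILE` using only the standard library. It assumes the tool expects one example per line, which the help message does not state explicitly. The function names `random_dates` and `write_examples` are hypothetical helpers, not part of this package.

```python
import random
from datetime import date, timedelta
from pathlib import Path


def random_dates(n, seed=0):
    """Return n pseudo-random dates formatted as DD-MM-YYYY."""
    rng = random.Random(seed)  # fixed seed keeps the file reproducible
    start = date(1970, 1, 1)
    span_days = (date(2023, 12, 31) - start).days
    return [
        (start + timedelta(days=rng.randint(0, span_days))).strftime("%d-%m-%Y")
        for _ in range(n)
    ]


def write_examples(path, examples):
    """Write the examples to path, one per line (assumed input format)."""
    Path(path).write_text("\n".join(examples) + "\n", encoding="utf-8")


if __name__ == "__main__":
    write_examples("EXAMPLE_FILE", random_dates(100))
```

The resulting file can then be piped into `regex-learner` as shown above, or passed with the `-i` option listed in the help message.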
## Note
Note that this project is not based on the actual implementation of the paper as presented in [2].
## References
1. Ilyas, Andrew, et al. "Extracting syntactical patterns from databases." 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
2. https://github.com/mitdbg/XSystem
Raw data
{
"_id": null,
"home_page": "https://github.com/IBM/regex-learner",
"name": "regex-learner",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "",
"author": "Stefano Braghin, Liubov Nedoshivina",
"author_email": "\"Liubov Nedoshivia\" <liubov.nedoshivina@ibm.com>",
"download_url": "https://files.pythonhosted.org/packages/d1/4f/a0f85e09fdfa431080d97949a2e71e26e2a01a08b48eb684d06726adfce3/regex-learner-0.0.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "The project provides a tool/library implementing an automated regular expression building mechanism.",
"version": "0.0.4",
"project_urls": {
"Homepage": "https://github.com/IBM/regex-learner"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "015e78c0f07a08f285f3fdadf55288ea30966c61006970a64ffabedf695abc60",
"md5": "6ba07de961480c7c6ccb9cb7d8068736",
"sha256": "2a46f7983421a73faf2d80de1a57a1100b34126b043a8a029d516e9451fe86e0"
},
"downloads": -1,
"filename": "regex_learner-0.0.4-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "6ba07de961480c7c6ccb9cb7d8068736",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.8",
"size": 10932,
"upload_time": "2023-08-28T15:18:24",
"upload_time_iso_8601": "2023-08-28T15:18:24.051976Z",
"url": "https://files.pythonhosted.org/packages/01/5e/78c0f07a08f285f3fdadf55288ea30966c61006970a64ffabedf695abc60/regex_learner-0.0.4-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d14fa0f85e09fdfa431080d97949a2e71e26e2a01a08b48eb684d06726adfce3",
"md5": "d3b376a32ec88ed598e17ec872a963dc",
"sha256": "f92d9d918616bcf360f64aecb0384c39701886c514f4d548735aa31497a4bee8"
},
"downloads": -1,
"filename": "regex-learner-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "d3b376a32ec88ed598e17ec872a963dc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 10417,
"upload_time": "2023-08-28T15:18:24",
"upload_time_iso_8601": "2023-08-28T15:18:24.969913Z",
"url": "https://files.pythonhosted.org/packages/d1/4f/a0f85e09fdfa431080d97949a2e71e26e2a01a08b48eb684d06726adfce3/regex-learner-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-28 15:18:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "IBM",
"github_project": "regex-learner",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "regex-learner"
}