urlgenie

Name	urlgenie
Version	1.2.0
home_page	None
Summary	Python package to make URL extraction, generalization, validation, and filtration easy.
upload_time	2024-06-09 09:51:24
maintainer	Ahmed Khatib
docs_url	None
author	Ahmed Khatib
requires_python	>=3.7
license	MIT
keywords	url-parsing data-cleaning data-curation generalization data-cleansing data-processing data-sanitization url-generalization
VCS
bugtrack_url
requirements	tldextract
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

<p align = "center">
<img src = "https://raw.githubusercontent.com/bluestero/urlgenie/master/images/mascot.png" alt = "urlgenie" />
</p>
<div align = "center" style = "margin-top: 0;">
<h1>🧞 URL Genie 🧞</h1>
</div>
<h3 align="center">
  URL extraction, generalization, validation, and filtration made easy.
</h3>

## 🚀 About URL Genie
URL Genie is a Python package, built on research covering more than 2 million URLs, that handles URLs flexibly for data-driven projects.

## 💡 How it works
It takes a URL as input, validates it against a URL regex, identifies each component of the URL, and processes them according to the flags you set.
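
To make the "validate, then identify components" step concrete, here is a minimal sketch using a regex with named groups. The pattern `URL_PATTERN` and the helper `split_components` are hypothetical and much simpler than anything a real library would need; they are not URL Genie's actual regex or API.

```python
import re

# Hypothetical, simplified URL pattern (an assumption for illustration only).
URL_PATTERN = re.compile(
    r"^(?:(?P<scheme>https?)://)?"              # optional scheme
    r"(?P<host>[A-Za-z0-9.-]+\.[A-Za-z]{2,})"   # host containing a dot and a TLD
    r"(?P<path>/[^?#\s]*)?"                     # optional path
    r"(?:\?(?P<query>[^#\s]*))?"                # optional query
    r"(?:#(?P<fragment>\S*))?$"                 # optional fragment
)

def split_components(url):
    """Return a dict of URL components, or None if the input does not match."""
    match = URL_PATTERN.match(url.strip())
    return match.groupdict() if match else None

print(split_components("test.something.com/hello?somequery=True#someFragment"))
print(split_components("badbadwebsite?!"))
```

Once the components are separated like this, each one can be kept, dropped, or rewritten independently, which is the idea behind the flags.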

## ✨ Features
- Handles both encoded and decoded URLs.
- Handles comma-separated URLs using recursion.
- Filters valid URLs, invalid URLs, and bad socials with ease.
- Extracts emails and socials from a given text and validates them.
- Validates URLs using regex and, optionally, over 1400 TLDs (off by default).
- Reduces duplicates by minimizing the general and social URL patterns.
- Detects domain mismatches by extracting the domain and TLD from a URL and matching them against the email domain.
- Uses researched and refined social regexes to recognize different social URL patterns and generalize them to a standardized format.
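
As an illustration of the comma-separated handling mentioned in the list above, the following sketch splits the input on the first comma and recurses on each half. The names `generalize_many` and `add_scheme` are hypothetical, and the list return type is an assumption; the README does not say how URL Genie returns multiple results.

```python
def generalize_many(value, generalize_one):
    """Split comma-separated input and process each piece recursively."""
    if "," in value:
        head, tail = value.split(",", 1)
        return generalize_many(head, generalize_one) + generalize_many(tail, generalize_one)
    cleaned = value.strip()
    # Empty pieces (for example after a trailing comma) are skipped.
    return [generalize_one(cleaned)] if cleaned else []

def add_scheme(url):
    """Stand-in for a real generalizer: only adds a missing scheme."""
    return url if "://" in url else "https://" + url

print(generalize_many("a.com, b.org/path ,", add_scheme))
```

Note that this naive split would break URLs that legitimately contain a comma, so a real implementation needs more care.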

## ⚙️ Installation
First things first, you need to install URL Genie by running the following command in your terminal:

```shell
python -m pip install urlgenie
```

That's it! Now you can use URL Genie in your code.

## ♨️ Usage
### Importing and creating the object
First, import the package and create a `UrlGenie` object to access its features.

```python
from urlgenie import UrlGenie
from pprint import pprint

genie = UrlGenie()
```

### Generalizing your first URL
Let's pass in a sample URL and get it generalized.

```python
url = "test.something.com/hello?somequery=True#someFragment"
gen = genie.generalize(url)
print(gen)
```

This prints `https://test.something.com/hello`.

It detects that the scheme is missing and adds it. By default, it also removes the query (the part starting with `?`) and the fragment (the part starting with `#`).
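
The default behavior described here can be approximated with the standard library alone. This is only a sketch of the idea, not URL Genie's implementation: the function `simple_generalize` and its parameters `keep_query` and `keep_fragment` are hypothetical names, and the real flags are documented in flags.md.

```python
from urllib.parse import urlsplit, urlunsplit

def simple_generalize(url, keep_query=False, keep_fragment=False):
    """Add a missing scheme and, by default, drop the query and fragment."""
    url = url.strip()
    if "://" not in url:
        # Assumption: default to https when no scheme is given.
        url = "https://" + url
    parts = urlsplit(url)
    query = parts.query if keep_query else ""
    fragment = parts.fragment if keep_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))

print(simple_generalize("test.something.com/hello?somequery=True#someFragment"))
print(simple_generalize("test.something.com/hello?somequery=True", keep_query=True))
```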

### Different flags for generalization
As explained previously, URL Genie breaks down the URL, identifies its components, and lets you rebuild the URL to suit your needs.

You control this with flags (boolean parameters), which are explained in [flags.md](https://github.com/bluestero/urlgenie/blob/main/flags.md).

## ❗ For Nerds
Below are the different use cases where URL Genie might come in handy.

### URL Extraction
Pass a string of text and URL Genie will return a dict containing the emails and socials it finds.

```python
text = """
This is a good email: sample@gmail.com and this is a bad email: sample@image.png.
Another would be an email with a custom domain: sample@example.com.
Sample facebook facebook.com/sample1, lets try with fb domain: fb.com/sample2.
Lets add a bad facebook: fb.com/profile.php?
Lets add 2 twitter formats: x.com/sample and twitter.com/sample with same handles.
How about a linkedin pub? linkedin.com/pub/aravind-p-r/24/324/185?_l=en_US.
Let's also add its in url: linkedin.com/in/aravind-p-r-18532424/"""
result_dict = genie.extract_from_text(text)
pprint(result_dict)
```

This would return:

```shell
{'email': {'sample@example.com', 'sample@gmail.com'},
 'facebook': {'fb.com/sample2.', 'facebook.com/sample1', 'fb.com/profile.php'},
 'instagram': set(),
 'linkedin': {'linkedin.com/in/aravind-p-r-18532424', 'linkedin.com/pub/aravind-p-r/24/324/185'},
 'phone': set(),
 'twitter': {'x.com/sample', 'twitter.com/sample'}}
```

### Extract Validation
As you can see, its strict regexes prevented the bad email (sample@image.png) from being extracted.

However, it did extract fb.com/profile.php, which is not a URL we want because it does not lead to any person, organization, or page.

The twitter entries are also duplicates: they share the same handle and are not in a standardized format.

To fix this, we can validate the extracted result to remove invalid data and standardize the valid entries.

```python
result_dict = genie.extract_from_text(text)
validated_dict = genie.validate_result_dict(result_dict)
pprint(validated_dict)
```

This would return:

```shell
{'email': {'sample@example.com', 'sample@gmail.com'},
 'facebook': {'https://www.facebook.com/sample1', 'https://www.facebook.com/sample2.'},
 'instagram': set(),
 'linkedin': {'https://www.linkedin.com/in/aravind-p-r-18532424'},
 'phone': set(),
 'twitter': {'https://twitter.com/sample'}}
```

With this, we have removed the duplicates, dropped invalid URLs such as fb.com/profile.php, and generalized URLs such as the LinkedIn PUB link to its IN form.
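
The deduplication idea can be sketched as mapping every recognized variant to one canonical form and collecting the results in a set. The patterns `TWITTER` and `FACEBOOK`, the `RESERVED` set, and the helper names below are all hypothetical and far less thorough than URL Genie's researched regexes.

```python
import re

# Hypothetical patterns, for illustration only.
TWITTER = re.compile(
    r"^(?:https?://)?(?:www\.)?(?:twitter|x)\.com/(?P<handle>[A-Za-z0-9_]{1,15})/?$"
)
FACEBOOK = re.compile(
    r"^(?:https?://)?(?:www\.)?(?:facebook|fb)\.com/(?P<handle>[A-Za-z0-9.]+)/?$"
)
# Path words that do not identify a person, organization, or page (assumed list).
RESERVED = {"intent", "profile.php", "share", "home"}

def canonical_social(url):
    """Return the standardized URL, or None if the URL is not a usable social."""
    rules = ((TWITTER, "https://twitter.com/{}"), (FACEBOOK, "https://www.facebook.com/{}"))
    for pattern, template in rules:
        match = pattern.match(url.strip())
        if match:
            handle = match.group("handle")
            return None if handle.lower() in RESERVED else template.format(handle)
    return None

def dedupe(urls):
    """Canonicalize each URL and let the set collapse the duplicates."""
    return {c for c in map(canonical_social, urls) if c}

print(dedupe({"x.com/sample", "twitter.com/sample"}))
print(canonical_social("fb.com/profile.php"))
```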

### Email Domain Validation
When you scrape websites for contact info, you may collect many emails, and not all of them will be related to the organization.

To filter out the unrelated ones, pass the organization's URL to the validation step.

```python
result_dict = genie.extract_from_text(text)
validated_dict = genie.validate_result_dict(result_dict, url = "https://www.example.com/ContactUs")
pprint(validated_dict)
```

This would return:

```shell
{'email': {'sample@example.com'},
 'facebook': {'https://www.facebook.com/sample1', 'https://www.facebook.com/sample2.'},
 'instagram': set(),
 'linkedin': {'https://www.linkedin.com/in/aravind-p-r-18532424'},
 'phone': set(),
 'twitter': {'https://twitter.com/sample'}}
```

Now sample@gmail.com has been removed, because its domain does not match the organization's URL we provided.

This is helpful when building scrapers or when processing and cleaning data.
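
The domain comparison behind this filter can be sketched as follows. The helpers `registered_domain` and `filter_emails` are hypothetical, and keeping only the last two host labels is a deliberately naive stand-in: it is wrong for multi-label suffixes such as `co.uk`, which a suffix-aware library like tldextract (listed in the package requirements) is designed to handle.

```python
from urllib.parse import urlsplit

def registered_domain(host):
    """Naive stand-in: keep the last two labels of a host name."""
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])

def filter_emails(emails, url):
    """Keep only the emails whose domain matches the domain of the given URL."""
    site = registered_domain(urlsplit(url).hostname or "")
    return {e for e in emails if registered_domain(e.rsplit("@", 1)[-1]) == site}

emails = {"sample@example.com", "sample@gmail.com"}
print(filter_emails(emails, "https://www.example.com/ContactUs"))
```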

### Social Filtration
When you have bulk data to deal with, you can separate valid URLs, invalid URLs, and invalid socials.

For this example, we use data stored in a CSV file.

**test.csv**

```csv
url
badbadwebsite?!
fb.com/people/hello
twitter.com/intent
random.haz/somePath
https://x.com/intent/follow?original_referer=&region=follow_link&screen_name=elonmusk&tw_p=followbutton&variant=2&mx=2
anotherbadwebsite???
```

**test.py**

```python
import pandas as pd
from pprint import pprint
from urlgenie import UrlGenie

#-Reading the CSV-#
df = pd.read_csv("test.csv", encoding = "utf-8")

#-Creating UrlGenie object with custom texts for Bad Url and Socials, and TLD validation-#
genie = UrlGenie(bad_url = "Bad Url", bad_social = "Bad Social", proper_tlds = True)

#-Applying the generalize function and creating a new column-#
df["gen"] = df["url"].apply(genie.generalize)

#-Printing the updated dataframe-#
pprint(df)
```

This prints:

```
                                                 url                             gen
0                                    badbadwebsite?!                         Bad Url
1                                fb.com/people/hello  https://www.facebook.com/hello
2                                 twitter.com/intent                      Bad Social
3                                random.haz/somePath                         Bad Url
4  https://x.com/intent/follow?original_referer=&...    https://twitter.com/elonmusk
5                               anotherbadwebsite???                         Bad Url
```

As you can see, we got generalized URLs for the valid entries, and Bad Url or Bad Social for the invalid ones.

random.haz was deemed invalid because of the proper_tlds flag, which checked the TLD 'haz' against over 1400 TLDs.

As for twitter.com/intent, intent is not a valid twitter page, so it is a valid URL but an invalid social.
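
The difference between a bad URL and a bad social can be sketched as two separate checks: first the TLD, then the social path. Everything below is an assumption for illustration: `KNOWN_TLDS` is a tiny sample rather than the 1400+ list, `BAD_TWITTER_PATHS` is a guessed set, and `classify` is not part of URL Genie. The sketch also makes no attempt to recover a handle from `intent/follow?screen_name=...` as the table above shows the library doing.

```python
# Tiny hypothetical sample; a real check would use the full TLD list.
KNOWN_TLDS = {"com", "org", "net", "io"}
# Twitter paths that are not profiles (assumed set).
BAD_TWITTER_PATHS = {"intent", "home", "share"}

def classify(url, bad_url="Bad Url", bad_social="Bad Social"):
    """Return bad_url, bad_social, or 'ok' for a single input string."""
    host, _, path = url.split("://")[-1].partition("/")
    labels = host.lower().split(".")
    # Step 1: the host needs at least two labels and a known TLD.
    if len(labels) < 2 or labels[-1] not in KNOWN_TLDS:
        return bad_url
    # Step 2: a valid URL can still be a useless social link.
    domain = ".".join(labels[-2:])
    if domain in {"twitter.com", "x.com"} and path.strip("/").lower() in BAD_TWITTER_PATHS:
        return bad_social
    return "ok"

for sample in ["badbadwebsite?!", "random.haz/somePath", "twitter.com/intent", "fb.com/people/hello"]:
    print(sample, "->", classify(sample))
```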

## 📖 Resources
- [Sample Sheet](https://docs.google.com/spreadsheets/d/12QHwZxiDv80ksFngQK10hkOmPQLRpI0s6dPfe6mRuxk/edit?usp=sharing)
- [Social Research Doc](https://docs.google.com/document/d/12Z025x5m9xBlEahkiRI0wLE0zJNhPtSTCllk_GIqReQ/edit?usp=sharing)

## ⭐ Love It? [Star It!](https://github.com/bluestero/urlgenie)
Just a simple click but would help me out ;)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "urlgenie",
    "maintainer": "Ahmed Khatib",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "ahmedkhatib99@gmail.com",
    "keywords": "url-parsing, data-cleaning, data-curation, generalization, data-cleansing, data-processing, data-sanitization, url-generalization",
    "author": "Ahmed Khatib",
    "author_email": "ahmedkhatib99@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/97/ae/82fdeb6ad1d0acb539237eaa43b4be737ec09d88509f0458e0ffab803f7c/urlgenie-1.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python package to make URL extraction, generalization, validation, and filtration easy.",
    "version": "1.2.0",
    "project_urls": {
        "Documentation": "https://github.com/bluestero/urlgenie/blob/main/README.md",
        "Source": "https://github.com/bluestero/urlgenie",
        "Tracker": "https://github.com/bluestero/urlgenie/issues"
    },
    "split_keywords": [
        "url-parsing",
        " data-cleaning",
        " data-curation",
        " generalization",
        " data-cleansing",
        " data-processing",
        " data-sanitization",
        " url-generalization"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "97ae82fdeb6ad1d0acb539237eaa43b4be737ec09d88509f0458e0ffab803f7c",
                "md5": "0d98f4ea6bbd6c4786ec416894af9b96",
                "sha256": "02c2bc28518cd15fe72bb3cc08f32c2b23c4138580405a3246512f0c6b311fef"
            },
            "downloads": -1,
            "filename": "urlgenie-1.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "0d98f4ea6bbd6c4786ec416894af9b96",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 21522,
            "upload_time": "2024-06-09T09:51:24",
            "upload_time_iso_8601": "2024-06-09T09:51:24.175247Z",
            "url": "https://files.pythonhosted.org/packages/97/ae/82fdeb6ad1d0acb539237eaa43b4be737ec09d88509f0458e0ffab803f7c/urlgenie-1.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-09 09:51:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bluestero",
    "github_project": "urlgenie",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "tldextract",
            "specs": [
                [
                    "==",
                    "5.1.2"
                ]
            ]
        }
    ],
    "lcname": "urlgenie"
}
