# Similarius
Similarius is a Python library to compare web page and evaluate the level of similarity.
The tool can be used as a stand-alone tool or to feed other systems.
# Requirements
- Python 3.8+
- [Requests](https://github.com/psf/requests)
- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [Beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
- [nltk](https://github.com/nltk/nltk)
# Installation
## Source install
**Similarius** can be install with poetry. If you don't have poetry installed, you can do the following `curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python`.
~~~bash
$ poetry install
$ poetry shell
$ similarius -h
~~~
## pip installation
~~~bash
$ pip3 install similarius
~~~
# Usage
~~~bash
dacru@dacru:~/git/Similarius/similarius$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]
optional arguments:
-h, --help show this help message and exit
-o ORIGINAL, --original ORIGINAL
Website to compare
-w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
Website to compare
~~~
# Usage example
~~~bash
dacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
~~~
# Used as a library
~~~python
import argparse
from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio
parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()
# Original
original = get_website(args.original)
if not original:
print("[-] The original website is unreachable...")
exit(1)
original_text, original_ressource = extract_text_ressource(original.text)
for website in args.website:
print(f"\n********** {args.original} <-> {website} **********")
# Compare
compare = get_website(website)
if not compare:
print(f"[-] {website} is unreachable...")
continue
compare_text, compare_ressource = extract_text_ressource(compare.text)
# Calculate
sim = str(sk_similarity(compare_text, original_text))
print(f"\nSimilarity: {sim}")
ressource_diff = ressource_difference(original_ressource, compare_ressource)
print(f"Ressource Difference: {ressource_diff}")
ratio_compare = ratio(ressource_diff, sim)
print(f"Ratio: {ratio_compare}")
~~~
Raw data
{
"_id": null,
"home_page": "https://github.com/ail-project/Similarius",
"name": "similarius",
"maintainer": "Alexandre Dulaunoy",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "a@foo.be",
"keywords": "web similarity,web comparaison",
"author": "David Cruciani",
"author_email": "david.cruciani@securitymadein.lu",
"download_url": "https://files.pythonhosted.org/packages/a3/70/75a950e7006f4da0d364e68311b9bc5d800ce4707ec86463ba0c074a1e2c/similarius-0.0.1.tar.gz",
"platform": null,
"description": "# Similarius\n\nSimilarius is a Python library to compare web page and evaluate the level of similarity.\n\nThe tool can be used as a stand-alone tool or to feed other systems.\n\n\n\n# Requirements\n\n- Python 3.8+\n- [Requests](https://github.com/psf/requests)\n- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)\n- [Beautifulsoup4](https://pypi.org/project/beautifulsoup4/)\n- [nltk](https://github.com/nltk/nltk)\n\n\n\n# Installation\n\n## Source install\n\n**Similarius** can be install with poetry. If you don't have poetry installed, you can do the following `curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python`.\n\n~~~bash\n$ poetry install\n$ poetry shell\n$ similarius -h\n~~~\n\n## pip installation\n\n~~~bash\n$ pip3 install similarius\n~~~\n\n\n\n# Usage\n\n~~~bash\ndacru@dacru:~/git/Similarius/similarius$ similarius --help\nusage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]\n\noptional arguments:\n -h, --help show this help message and exit\n -o ORIGINAL, --original ORIGINAL\n Website to compare\n -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]\n Website to compare\n~~~\n\n\n\n# Usage example\n\n~~~bash\ndacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu\n~~~\n\n\n\n# Used as a library\n\n~~~python\nimport argparse\nfrom similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio\n\nparser = argparse.ArgumentParser()\nparser.add_argument(\"-w\", \"--website\", nargs=\"+\", help=\"Website to compare\")\nparser.add_argument(\"-o\", \"--original\", help=\"Website to compare\")\nargs = parser.parse_args()\n\n# Original\noriginal = get_website(args.original)\n\nif not original:\n print(\"[-] The original website is unreachable...\")\n exit(1)\n\noriginal_text, original_ressource = extract_text_ressource(original.text)\n\nfor website in args.website:\n print(f\"\\n********** {args.original} <-> {website} **********\")\n\n # Compare\n compare = get_website(website)\n\n if not compare:\n print(f\"[-] {website} is unreachable...\")\n continue\n\n compare_text, compare_ressource = extract_text_ressource(compare.text)\n\n # Calculate\n sim = str(sk_similarity(compare_text, original_text))\n print(f\"\\nSimilarity: {sim}\")\n\n ressource_diff = ressource_difference(original_ressource, compare_ressource)\n print(f\"Ressource Difference: {ressource_diff}\")\n\n ratio_compare = ratio(ressource_diff, sim)\n print(f\"Ratio: {ratio_compare}\")\n~~~\n\n",
"bugtrack_url": null,
"license": "BSD-2-Clause",
"summary": "Compare web page and evaluate the level of similarity.",
"version": "0.0.1",
"split_keywords": [
"web similarity",
"web comparaison"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fcf221cb53f70c27481be0593e2e3015b0f1676628e7ec63299a7d9b48f22b2b",
"md5": "47c9e62f1fad01062ebd8fa5ff31025c",
"sha256": "6b49fe0ccc766d574d9034420c262a700290deb6bf51324c3d4ba5e496b550a7"
},
"downloads": -1,
"filename": "similarius-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "47c9e62f1fad01062ebd8fa5ff31025c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 5764,
"upload_time": "2023-01-16T15:02:37",
"upload_time_iso_8601": "2023-01-16T15:02:37.123485Z",
"url": "https://files.pythonhosted.org/packages/fc/f2/21cb53f70c27481be0593e2e3015b0f1676628e7ec63299a7d9b48f22b2b/similarius-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a37075a950e7006f4da0d364e68311b9bc5d800ce4707ec86463ba0c074a1e2c",
"md5": "2037f6543a14defe471d1ad0dca5af5e",
"sha256": "398826bfa359518d318a2f004cf16cb137a4671959f8f63cdb1d4279a8d2ea77"
},
"downloads": -1,
"filename": "similarius-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "2037f6543a14defe471d1ad0dca5af5e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 5182,
"upload_time": "2023-01-16T15:02:39",
"upload_time_iso_8601": "2023-01-16T15:02:39.214557Z",
"url": "https://files.pythonhosted.org/packages/a3/70/75a950e7006f4da0d364e68311b9bc5d800ce4707ec86463ba0c074a1e2c/similarius-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-01-16 15:02:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "ail-project",
"github_project": "Similarius",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "similarius"
}