similarius


- Name: similarius
- Version: 0.0.1
- Home page: https://github.com/ail-project/Similarius
- Summary: Compare web page and evaluate the level of similarity.
- Upload time: 2023-01-16 15:02:39
- Maintainer: Alexandre Dulaunoy
- Author: David Cruciani
- Requires Python: >=3.8,<4.0
- License: BSD-2-Clause
- Keywords: web similarity, web comparaison

# Similarius

Similarius is a Python library to compare web pages and evaluate their level of similarity.

The tool can be used as a stand-alone tool or to feed other systems.



# Requirements

- Python 3.8+
- [Requests](https://github.com/psf/requests)
- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [Beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
- [nltk](https://github.com/nltk/nltk)



# Installation

## Source install

**Similarius** can be installed from source with Poetry. If you don't have Poetry installed, you can install it with `curl -sSL https://install.python-poetry.org | python3 -`.

~~~bash
$ poetry install
$ poetry shell
$ similarius -h
~~~

## pip installation

~~~bash
$ pip3 install similarius
~~~
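
After installing, it can be handy to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Similarius.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if similarius is not installed.
print(installed_version("similarius"))
```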



# Usage

~~~bash
$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare
~~~



# Usage example

~~~bash
$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
~~~
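
To see how the command line above is interpreted, the following sketch rebuilds an equivalent `argparse` parser and feeds it the same arguments. The helper name `build_parser` and the help strings are assumptions for illustration; only the `-o`/`--original` and `-w`/`--website` options come from the help output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the help output."""
    parser = argparse.ArgumentParser(prog="similarius")
    # A single reference site.
    parser.add_argument("-o", "--original", help="Original website used as the reference")
    # nargs="+" collects one or more sites into a list.
    parser.add_argument("-w", "--website", nargs="+", help="Website(s) to compare")
    return parser


args = build_parser().parse_args(
    ["-o", "circl.lu", "-w", "europa.eu", "circl.eu", "circl.lu"]
)
print(args.original)
print(args.website)
```

Each site listed after `-w` is compared against the single site given with `-o`, so comparing `circl.lu` with itself serves as a sanity check.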



# Used as a library

~~~python
import argparse
import sys

from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website(s) to compare against the original")
parser.add_argument("-o", "--original", help="Original website used as the reference")
args = parser.parse_args()

# Fetch the original (reference) website
original = get_website(args.original)

if not original:
    print("[-] The original website is unreachable...")
    sys.exit(1)

original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)

    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
~~~
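
This README does not describe how the similarity and resource scores are computed. As a rough illustration of the kind of measures such a comparison can rely on, here is a standard-library sketch. It is not Similarius's implementation: the function names `cosine_text_similarity` and `resource_overlap` are hypothetical, and the choice of term-frequency cosine similarity and Jaccard overlap is an assumption.

```python
import math
import re
from collections import Counter


def cosine_text_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    # Crude tokenization: lowercase, then split on non-word characters.
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # At least one text has no tokens.
    return dot / (norm_a * norm_b)


def resource_overlap(resources_a: set, resources_b: set) -> float:
    """Jaccard overlap between two sets of resource URLs (1.0 means identical)."""
    union = resources_a | resources_b
    if not union:
        return 1.0  # Two empty sets are treated as identical.
    return len(resources_a & resources_b) / len(union)


print(cosine_text_similarity("incident response team", "incident response centre"))
print(resource_overlap({"a.js", "b.css"}, {"a.js"}))
```

A score close to 1.0 on both measures would suggest that the two pages share most of their text and most of their linked resources.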


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/ail-project/Similarius",
    "name": "similarius",
    "maintainer": "Alexandre Dulaunoy",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "a@foo.be",
    "keywords": "web similarity,web comparaison",
    "author": "David Cruciani",
    "author_email": "david.cruciani@securitymadein.lu",
    "download_url": "https://files.pythonhosted.org/packages/a3/70/75a950e7006f4da0d364e68311b9bc5d800ce4707ec86463ba0c074a1e2c/similarius-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-2-Clause",
    "summary": "Compare web page and evaluate the level of similarity.",
    "version": "0.0.1",
    "split_keywords": [
        "web similarity",
        "web comparaison"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fcf221cb53f70c27481be0593e2e3015b0f1676628e7ec63299a7d9b48f22b2b",
                "md5": "47c9e62f1fad01062ebd8fa5ff31025c",
                "sha256": "6b49fe0ccc766d574d9034420c262a700290deb6bf51324c3d4ba5e496b550a7"
            },
            "downloads": -1,
            "filename": "similarius-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "47c9e62f1fad01062ebd8fa5ff31025c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 5764,
            "upload_time": "2023-01-16T15:02:37",
            "upload_time_iso_8601": "2023-01-16T15:02:37.123485Z",
            "url": "https://files.pythonhosted.org/packages/fc/f2/21cb53f70c27481be0593e2e3015b0f1676628e7ec63299a7d9b48f22b2b/similarius-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a37075a950e7006f4da0d364e68311b9bc5d800ce4707ec86463ba0c074a1e2c",
                "md5": "2037f6543a14defe471d1ad0dca5af5e",
                "sha256": "398826bfa359518d318a2f004cf16cb137a4671959f8f63cdb1d4279a8d2ea77"
            },
            "downloads": -1,
            "filename": "similarius-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "2037f6543a14defe471d1ad0dca5af5e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 5182,
            "upload_time": "2023-01-16T15:02:39",
            "upload_time_iso_8601": "2023-01-16T15:02:39.214557Z",
            "url": "https://files.pythonhosted.org/packages/a3/70/75a950e7006f4da0d364e68311b9bc5d800ce4707ec86463ba0c074a1e2c/similarius-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-16 15:02:39",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "ail-project",
    "github_project": "Similarius",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "similarius"
}
        