kraken-extract-from-html


- Name: kraken-extract-from-html
- Version: 0.0.21
- Home page: https://github.com/tactik8/kraken_extract_from_html2
- Summary: Kraken Extract From HTML
- Upload time: 2023-12-03 19:08:41
- Author: Tactik8
- License: MIT
- Requirements: none recorded
# Extract from HTML


## What it does
Extracts the following from HTML:
- urls
- emails
- images
- tables
- structured data (schema.org)
- text
- title
- feeds
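To make a few of the items above concrete, here is a minimal sketch of this kind of extraction using only the Python standard library. It is not the package's implementation: the class `SimpleExtractor`, the function `extract`, and the email pattern are illustrative assumptions, and only urls, images, emails, and the title are covered.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple email pattern; an assumption for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SimpleExtractor(HTMLParser):
    """Collects links, images, emails, and the title from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls, self.images, self.emails = [], [], []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            href = attrs["href"]
            if href.startswith("mailto:"):
                self.emails.append(href[len("mailto:"):])
            else:
                # Resolve relative links against the page URL.
                self.urls.append(urljoin(self.base_url, href))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        # Also pick up email addresses that appear in visible text.
        self.emails.extend(EMAIL_RE.findall(data))


def extract(base_url, html):
    """Return a dict with the title, absolute urls, images, and unique emails."""
    parser = SimpleExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "urls": parser.urls,
        "images": parser.images,
        "emails": sorted(set(parser.emails)),
    }
```

Passing the page URL alongside the HTML lets relative links and image sources be resolved to absolute URLs, which is presumably why the library call also takes both arguments.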


## How to use

### Using the API

#### Send a URL (GET)
Send the page URL as the query parameter `url`. The service retrieves the content and returns the extracted data.
If `contentUrl` is also provided, the content is fetched from `contentUrl`, but `url` is still used as the page URL in the extracted attributes.
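The GET request can be assembled as in the sketch below. The README does not say where the API is hosted, so `API_BASE` is a placeholder, and the helper name `build_get_url` is hypothetical; only the `url` and `contentUrl` parameter names come from the text above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the address where the API is deployed.
API_BASE = "https://extract.example.com/"


def build_get_url(url, content_url=None, api_base=API_BASE):
    """Build the request URL with 'url' and, optionally, 'contentUrl'."""
    params = {"url": url}
    if content_url is not None:
        params["contentUrl"] = content_url
    # urlencode percent-encodes the values so they are safe in a query string.
    return api_base + "?" + urlencode(params)
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`.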


#### Send a WebContent object (POST)
The content is extracted either from the `text` field or from the page retrieved from the URL in `archivedAt`.

```
{
    "@type": "webContent",
    "url": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "archivedAt": [
        "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
    ],
    "about": {
        "@type": "webPage",
        "url": "https://www.petro-canada.ca/en/business/rack-prices"
    }
}

```
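A request carrying such an object could be prepared as follows. This is a sketch under stated assumptions: the endpoint `API_BASE` is a placeholder, the helper `build_post_request` is hypothetical, and sending the body as JSON with a `Content-Type: application/json` header is assumed rather than documented. The field layout mirrors the example above, where `url` and `archivedAt` both point at the archived copy.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: replace with the address where the API is deployed.
API_BASE = "https://extract.example.com/"


def build_post_request(page_url, archived_at=None, text=None, api_base=API_BASE):
    """Prepare (but do not send) a POST request with a webContent object."""
    if archived_at is None and text is None:
        raise ValueError("provide 'text' or 'archived_at'")
    content = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if text is not None:
        content["text"] = text
    if archived_at is not None:
        # Mirrors the README example: both fields hold the archived copy.
        content["url"] = [archived_at]
        content["archivedAt"] = [archived_at]
    body = json.dumps(content).encode("utf-8")
    return Request(
        api_base,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )
```

The returned `Request` can be sent with `urllib.request.urlopen(request)` once the real endpoint is known.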

### Using the library
Given the URL of the page and its HTML content, `get` returns a list of records with the extractions.

`from kraken_extract_from_html import kraken_extract_from_html as k`

`records = k.get(url, html)`
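The sketch below wraps the call in a small helper and summarizes the result. The helper names `extract_records` and `count_by_type` are hypothetical, and the README does not describe the shape of the returned records: the code assumes each record is a dict that may carry a schema.org-style `@type` key.

```python
from collections import Counter


def extract_records(url, html, extractor=None):
    """Call the library's get(url, html), or a stand-in callable if one is given."""
    if extractor is None:
        # Imported lazily so this sketch loads even when the package is not installed.
        from kraken_extract_from_html import kraken_extract_from_html as k
        extractor = k.get
    return extractor(url, html)


def count_by_type(records):
    """Count records per '@type'; the record shape is an assumption, not documented."""
    return Counter(
        r.get("@type", "unknown") for r in records if isinstance(r, dict)
    )
```

With the package installed, `count_by_type(extract_records(url, html))` gives a quick overview of what was found on a page.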

            
