# Extract from HTML
## What it does
Extracts the following from HTML:
- URLs
- emails
- images
- tables
- structured data (schema.org)
- text
- title
- feeds
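
To make the list above concrete, here is a minimal sketch of a few of these extractions (title, URLs, images, emails) using only Python's standard library. It is an illustration, not the package's implementation: `SimpleExtractor`, `simple_extract`, the email pattern, and the shape of the returned dictionary are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple email pattern (illustrative, not exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SimpleExtractor(HTMLParser):
    """Collects link targets, image sources, and the title of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self.images = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def simple_extract(url, html):
    """Hypothetical helper: returns one dict of extractions for a page."""
    parser = SimpleExtractor(url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "urls": parser.urls,
        "images": parser.images,
        "emails": sorted(set(EMAIL_RE.findall(html))),
    }


sample = (
    "<html><head><title>Demo</title></head><body>"
    '<a href="/about">About</a> <img src="logo.png"> '
    "Contact: info@example.com</body></html>"
)
result = simple_extract("https://example.com/", sample)
print(result)
```

Tables, schema.org data, and feeds need more involved parsing and are left out of this sketch.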
## How to use
### Using the API
#### Send a URL (GET)
Send the page URL as the query parameter `url`.
The service retrieves the content and returns the extracted data.
If `contentUrl` is also provided, the content is fetched from `contentUrl`, but `url` is still used as the page URL in the extracted attributes.
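
As a rough sketch of the GET form, the request URL could be assembled as follows. The base address `https://api.example.com/extract` is a placeholder, since this README does not state where the API is hosted, and `build_get_url` is a hypothetical helper; only the parameter names `url` and `contentUrl` come from the description above.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the README does not give the API's address.
API_BASE = "https://api.example.com/extract"


def build_get_url(url, content_url=None, api_base=API_BASE):
    """Build the GET request URL with properly encoded query parameters."""
    params = {"url": url}
    if content_url is not None:
        # Optional: fetch content from here, but report results against `url`.
        params["contentUrl"] = content_url
    return f"{api_base}?{urlencode(params)}"


request_url = build_get_url(
    "https://www.petro-canada.ca/en/business/rack-prices",
    content_url="https://example.com/archived-copy.html",
)
print(request_url)
```

Encoding the parameters with `urlencode` matters here because the values are themselves URLs and contain `:` and `/` characters.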
#### Send a WebContent object (POST)
The content is extracted from the `text` field, or retrieved from the URL in `archivedAt`.
```json
{
  "@type": "webContent",
  "url": [
    "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
  ],
  "archivedAt": [
    "https://storage.googleapis.com/kraken-cdn/641fcdaa9664421b3ac4db2b6b494397bf0dc8d65a559e9c2238de77d09e740e.html"
  ],
  "about": {
    "@type": "webPage",
    "url": "https://www.petro-canada.ca/en/business/rack-prices"
  }
}
```
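
A WebContent object like the one above could be built and wrapped in a POST request roughly as follows. This is a sketch under assumptions: the endpoint address is a placeholder, the helpers `build_web_content` and `build_post_request` are hypothetical, and sending the body as JSON with a `Content-Type: application/json` header is assumed rather than documented here. The request is only constructed, not sent.

```python
import json
import urllib.request

# Placeholder endpoint: the README does not give the API's address.
API_BASE = "https://api.example.com/extract"


def build_web_content(page_url, archived_at=None, text=None):
    """Build a WebContent record mirroring the example payload."""
    record = {
        "@type": "webContent",
        "about": {"@type": "webPage", "url": page_url},
    }
    if archived_at is not None:
        # The example payload repeats the archive location in both fields.
        record["url"] = [archived_at]
        record["archivedAt"] = [archived_at]
    if text is not None:
        # Inline HTML content, as an alternative to an archived copy.
        record["text"] = text
    return record


def build_post_request(record, api_base=API_BASE):
    """Wrap the record in a POST request (assumed to accept JSON)."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        api_base,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


record = build_web_content(
    "https://www.petro-canada.ca/en/business/rack-prices",
    text="<html><body>Rack prices</body></html>",
)
request = build_post_request(record)
# urllib.request.urlopen(request) would send it; omitted in this sketch.
print(request.get_method(), request.full_url)
```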
### Using the library
Given the URL of the page and its HTML content, returns a list of records with the extractions.

`from kraken_extract_from_html import kraken_extract_from_html as k`

`records = k.get(url, html)`
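
A slightly fuller usage sketch, assuming the package is installed. Only the import and the `k.get(url, html)` call come from this README; the wrapper `extract_records`, the sample HTML, and the fallback when the package is missing are illustrative. The structure of each record is not documented here, so the example only counts them.

```python
def extract_records(url, html):
    """Return the records from k.get, or None if the package is missing."""
    try:
        from kraken_extract_from_html import kraken_extract_from_html as k
    except ImportError:
        return None
    return k.get(url, html)


# Illustrative input; in practice the HTML would come from a fetched page.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><a href='/a'>A</a></body></html>"
)
records = extract_records("https://example.com/", html)

if records is None:
    print("kraken-extract-from-html is not installed")
else:
    print(f"{len(records)} records extracted")
```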
Raw data
{
"_id": null,
"home_page": "https://github.com/tactik8/kraken_extract_from_html2",
"name": "kraken-extract-from-html",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Tactik8",
"author_email": "info@tactik8.com",
"download_url": "https://files.pythonhosted.org/packages/78/57/e5f4e5de4610879acf79f66e27b0599584b32f197517514c2fe6d70d5303/kraken-extract-from-html-0.0.21.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Kraken Extract From HTML",
"version": "0.0.21",
"project_urls": {
"Homepage": "https://github.com/tactik8/kraken_extract_from_html2"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6ca49801c38761c97da184d3ff40e92873e4a6402d16d052d71f54d65bc21f2e",
"md5": "4cc3a8c01d41d701d299c6d8de8a5ee8",
"sha256": "fec384a162812b09a17c3451aee396124d80e43dee31d6c9c3548c540965b4dd"
},
"downloads": -1,
"filename": "kraken_extract_from_html-0.0.21-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4cc3a8c01d41d701d299c6d8de8a5ee8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 10750,
"upload_time": "2023-12-03T19:08:40",
"upload_time_iso_8601": "2023-12-03T19:08:40.120945Z",
"url": "https://files.pythonhosted.org/packages/6c/a4/9801c38761c97da184d3ff40e92873e4a6402d16d052d71f54d65bc21f2e/kraken_extract_from_html-0.0.21-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7857e5f4e5de4610879acf79f66e27b0599584b32f197517514c2fe6d70d5303",
"md5": "155d2f085b82925de07e292ce27f62a7",
"sha256": "a7385d9afcfed3343346a51634bf1a00d15bd9dc5d692fac04ab582261ebbf53"
},
"downloads": -1,
"filename": "kraken-extract-from-html-0.0.21.tar.gz",
"has_sig": false,
"md5_digest": "155d2f085b82925de07e292ce27f62a7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 7145,
"upload_time": "2023-12-03T19:08:41",
"upload_time_iso_8601": "2023-12-03T19:08:41.657622Z",
"url": "https://files.pythonhosted.org/packages/78/57/e5f4e5de4610879acf79f66e27b0599584b32f197517514c2fe6d70d5303/kraken-extract-from-html-0.0.21.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-12-03 19:08:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tactik8",
"github_project": "kraken_extract_from_html2",
"github_not_found": true,
"lcname": "kraken-extract-from-html"
}