# urltotext
A lightweight library that takes a URL and extracts the readable text from the page.
Accepting any and all PRs!
## Installation
```bash
pip install urltotext
```
## Pre-requisites
1. `urltotext` uses `selenium`, and driver support is currently limited to Chrome. Make sure chromedriver is installed and properly configured. See this [guide](https://www.swtestacademy.com/install-chrome-driver-on-mac/) for installation instructions on macOS.
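
As a quick sanity check before using the library, the sketch below looks for a chromedriver executable on your `PATH` using only the Python standard library. It assumes the binary is named `chromedriver` and is expected on `PATH`; the helper name `find_chromedriver` is made up for this example and is not part of `urltotext`.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH; install it before using urltotext.")
    else:
        print(f"chromedriver found at {path}")
```

If the check reports that nothing was found, chromedriver may still be installed somewhere outside `PATH`, so treat the result as a hint rather than a definitive answer.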
## Usage
1. Import and initialize `ContentFinder`
```python
from urltotext import ContentFinder
cf = ContentFinder()
```
2. Scrape a URL
```python
# Scrape a URL
cf.scrape_url(url="your_url_here")
# Print the extracted article
cf.print_article(url="your_url_here")
# Every URL passed in is stored on the class instance.
# Call flush_data() to free that memory.
cf.flush_data()
```
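
To process several URLs in one go, the calls above can be wrapped in a small helper. This is only a sketch: `scrape_many` is a hypothetical name, not part of `urltotext`, and it relies solely on the three methods shown above (`scrape_url`, `print_article`, `flush_data`) with the keyword arguments used there. The `finally` block makes sure stored data is flushed even if one of the URLs raises an error.

```python
def scrape_many(finder, urls):
    """Scrape and print each URL in order, then free the stored data.

    `finder` is any object exposing scrape_url(url=...),
    print_article(url=...) and flush_data(), such as a ContentFinder.
    """
    try:
        for url in urls:
            finder.scrape_url(url=url)
            finder.print_article(url=url)
    finally:
        # Always release what the instance has stored, even on failure.
        finder.flush_data()
```

With the library installed, this would be called as `scrape_many(ContentFinder(), ["your_url_here", "another_url_here"])`.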
Enjoy!
Raw data
{
"_id": null,
"home_page": "https://github.com/ChinmayShrivastava/url2text",
"name": "urltotext",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Chinmay Shrivastava",
"author_email": "cshrivastava99@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/a3/04/923ca3bbd26f493555bafe7ee6227d86c962ef4f8ef04c60de02f5cc029f/urltotext-0.3.0.tar.gz",
"platform": null,
"description": "# urltotext\n A light weight library that takes in a url and extracts any readable text in it.\n\n Accepting any and all PRs!\n\n## Installation\n\n```bash\npip install urltotext\n```\n\n## Pre-requisites\n\n1. `urltotext` uses `selenium` with the driver scope currently limited to `chrome` only. Please ensure that chromedriver is properly configured. Use this [link](https://www.swtestacademy.com/install-chrome-driver-on-mac/) for installation instructions.\n\n## Usage\n\n1. Import and initialize ContentFinder\n\n```python\nfrom urltotext import ContentFinder\ncf = ContentFinder()\n```\n\n2. Scrape a url\n\n```python\n# scrape a url\ncs.scrape_url(url=\"your_url_here\")\n\n# print the article\ncs.print_article(url=\"your_url_here\")\n\n# all urls passed will be stored in the class instance.\n# use the flush_data method to free memory\ncs.flush_data()\n```\n\nEnjoy!\n",
"bugtrack_url": null,
"license": "GPLv3",
"summary": "A light weight library that takes in a url and extracts any readable text in it.",
"version": "0.3.0",
"project_urls": {
"Homepage": "https://github.com/ChinmayShrivastava/url2text"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9eb5a9e7a8540124e27c1a8206a2eeecc63933c355c40b34dea536c4a6ae5a1d",
"md5": "470dcc38a66a23b3514f49c13b9b9952",
"sha256": "598b47b8e71a4ac07618aec5af09f6916340b174dfb57c5d26247c45fbe9765c"
},
"downloads": -1,
"filename": "urltotext-0.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "470dcc38a66a23b3514f49c13b9b9952",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 15827,
"upload_time": "2024-03-20T04:23:58",
"upload_time_iso_8601": "2024-03-20T04:23:58.232626Z",
"url": "https://files.pythonhosted.org/packages/9e/b5/a9e7a8540124e27c1a8206a2eeecc63933c355c40b34dea536c4a6ae5a1d/urltotext-0.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a304923ca3bbd26f493555bafe7ee6227d86c962ef4f8ef04c60de02f5cc029f",
"md5": "85f3b6eb42a498230abba92fc0fa6ada",
"sha256": "86a9204af6c38c734a4eb0ee34882477d13f9635eceae6dfc716aff68a638b7c"
},
"downloads": -1,
"filename": "urltotext-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "85f3b6eb42a498230abba92fc0fa6ada",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 15628,
"upload_time": "2024-03-20T04:24:00",
"upload_time_iso_8601": "2024-03-20T04:24:00.142962Z",
"url": "https://files.pythonhosted.org/packages/a3/04/923ca3bbd26f493555bafe7ee6227d86c962ef4f8ef04c60de02f5cc029f/urltotext-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-20 04:24:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ChinmayShrivastava",
"github_project": "url2text",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "attrs",
"specs": [
[
"==",
"23.2.0"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
"==",
"4.12.3"
]
]
},
{
"name": "bs4",
"specs": [
[
"==",
"0.0.2"
]
]
},
{
"name": "certifi",
"specs": [
[
"==",
"2024.2.2"
]
]
},
{
"name": "charset-normalizer",
"specs": [
[
"==",
"3.3.2"
]
]
},
{
"name": "h11",
"specs": [
[
"==",
"0.14.0"
]
]
},
{
"name": "idna",
"specs": [
[
"==",
"3.6"
]
]
},
{
"name": "langdetect",
"specs": [
[
"==",
"1.0.9"
]
]
},
{
"name": "outcome",
"specs": [
[
"==",
"1.3.0.post0"
]
]
},
{
"name": "PySocks",
"specs": [
[
"==",
"1.7.1"
]
]
},
{
"name": "requests",
"specs": [
[
"==",
"2.31.0"
]
]
},
{
"name": "selenium",
"specs": [
[
"==",
"4.18.1"
]
]
},
{
"name": "six",
"specs": [
[
"==",
"1.16.0"
]
]
},
{
"name": "sniffio",
"specs": [
[
"==",
"1.3.1"
]
]
},
{
"name": "sortedcontainers",
"specs": [
[
"==",
"2.4.0"
]
]
},
{
"name": "soupsieve",
"specs": [
[
"==",
"2.5"
]
]
},
{
"name": "trio",
"specs": [
[
"==",
"0.25.0"
]
]
},
{
"name": "trio-websocket",
"specs": [
[
"==",
"0.11.1"
]
]
},
{
"name": "typing_extensions",
"specs": [
[
"==",
"4.10.0"
]
]
},
{
"name": "urllib3",
"specs": [
[
"==",
"2.2.1"
]
]
},
{
"name": "wsproto",
"specs": [
[
"==",
"1.2.0"
]
]
}
],
"lcname": "urltotext"
}