SpiderNet


| Field | Value |
| --- | --- |
| Name | SpiderNet |
| Version | 1.3 |
| home_page | https://github.com/query-lang/SpiderWeb |
| Summary | A python package to simplify web scraping. Built using Regex and Curl |
| upload_time | 2024-07-30 19:31:19 |
| maintainer | None |
| docs_url | None |
| author | Vishal |
| requires_python | None |
| license | MIT |
| keywords | conversion |
| VCS | GitHub |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

<div align="center">
<img src="https://i.imgur.com/Xvzus2m.png" height=20% width=20%>
</div>
<br>
<div align="center">
<p>A simple and lightweight library for scraping the web</p>
</div>
<br>
<p>Built on curl and regular expressions in Python, SpiderNet offers functionality similar to the BeautifulSoup and requests combination. For the package to work, you need to have <a href="https://help.ubidots.com/en/articles/2165289-learn-how-to-install-run-curl-on-windows-macosx-linux">curl</a> installed on your system.</p>
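
Since the package needs curl on the system, it can help to confirm that a `curl` executable is on the `PATH` before using it. A minimal sketch using only the standard library; the helper name `curl_available` is hypothetical and not part of SpiderNet:

```python
import shutil


def curl_available(executable: str = "curl") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if curl_available():
        print("curl found at", shutil.which("curl"))
    else:
        print("curl not found; install it before using SpiderNet")
```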

### Install the latest version from PyPI or the <a href="https://github.com/query-lang/SpiderWeb/releases/tag/SpiderWeb">releases page</a>
```shell
pip install SpiderNet
```
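
To confirm which version ended up installed, the standard library's `importlib.metadata` can be queried. A small sketch; the helper `installed_version` is hypothetical, and it assumes the distribution name is `SpiderNet`, as in the `pip` command above:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "SpiderNet") -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version())
```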
- Features
  - [x] Scrape tags from websites
  - [x] Scrape the text within the tags
  - [x] Obtain `href` attributes from `<a>` (anchor) tags
  - [x] Obtain `src` attributes from `<img>` (image) tags
  - [x] New <a href="https://github.com/query-lang/SpiderWeb/tree/main/examples/DataTypes">DataTypes</a> that integrate with the package's parameters and values, for an easier workflow

### The main class is `GenSpider`

```python
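from SpiderNet import GenSpider, Str

# Placeholder URL: replace it with the website you want to scrape.
# The URL is wrapped in the package's Str type, as in the full example below.
web = GenSpider(Str("https://example.com"))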
```
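
The README does not show how `GenSpider` retrieves a page. As a rough illustration of what a curl-backed fetch could look like (not the package's actual implementation), the sketch below runs `curl` through `subprocess`. The function names and the chosen flags (`-s` for silent output, `-L` to follow redirects) are assumptions:

```python
import subprocess
from typing import List


def build_curl_command(url: str) -> List[str]:
    """Build the argument list for a quiet curl call that follows redirects."""
    return ["curl", "-s", "-L", url]


def fetch_markup(url: str, timeout: float = 30.0) -> str:
    """Fetch a page with curl and return its body as text."""
    result = subprocess.run(
        build_curl_command(url),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # keep going if the page contains invalid bytes
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero curl exit code
    )
    return result.stdout


# Example (requires curl and network access):
# print(fetch_markup("https://example.com")[:200])
```

Passing the arguments as a list rather than a single shell string avoids shell quoting problems with unusual URLs.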
### The methods are 
<ul>
  <li><code>website_text</code>: returns the markup text of the website.</li>
  <li><code>find_all_html_tags</code>: finds all HTML tags of the type passed as the parameter. If the tags are nested, pass the <code>text</code> keyword while looping over the outer results to restrict the search to the markup of the current item.</li>
  <li><code>extract_text_from_html</code>: extracts the text from a single tag, such as the current item of a loop over results.</li>
  <li><code>find_all_tags_by_classname</code>: finds all HTML tags of the given type that have the given class; both are passed as parameters. If the tags are nested, the <code>text</code> keyword restricts the search in the same way as for <code>find_all_html_tags</code>.</li>
  <li><code>get_href_from_a_tags</code>: returns a list of the <code>href</code> attributes of all anchor tags. The optional <code>text</code> parameter targets a particular piece of markup; by default, the whole page is searched.</li>
  <li><code>get_src_from_img_tags</code>: returns a list of the <code>src</code> attributes of all <code>img</code> tags. The optional <code>text</code> parameter targets a particular piece of markup; by default, the whole page is searched.</li>
</ul>
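
To make the `text` scoping described above concrete, here is a regex-based sketch of the same ideas on a static HTML string. It does not use SpiderNet and is not its implementation: the function names only mirror the method names, `get_attribute` and the sample `HTML` are hypothetical, and the patterns are simplified (they do not handle nested tags of the same name or unquoted attributes).

```python
import re
from typing import List

HTML = """
<ul class="list-story">
  <li><a href="/issue-2/">Issue 2</a></li>
  <li><a href="/issue-1/">Issue 1</a></li>
</ul>
<ul class="other"><li><a href="/about/">About</a></li></ul>
<img src="/logo.png" alt="logo">
"""


def find_all_html_tags(tag: str, text: str) -> List[str]:
    """Return every <tag>...</tag> element found in text (non-greedy match)."""
    pattern = rf"<{tag}\b[^>]*>.*?</{tag}>"
    return re.findall(pattern, text, flags=re.DOTALL | re.IGNORECASE)


def find_all_tags_by_classname(tag: str, classname: str, text: str) -> List[str]:
    """Return <tag> elements whose class attribute contains classname."""
    found = []
    for element in find_all_html_tags(tag, text):
        opening = element[: element.index(">") + 1]
        match = re.search(r'class\s*=\s*["\']([^"\']*)["\']', opening)
        if match and classname in match.group(1).split():
            found.append(element)
    return found


def extract_text_from_html(element: str) -> str:
    """Strip all tags from an element and return the remaining text."""
    return re.sub(r"<[^>]+>", "", element).strip()


def get_attribute(tag: str, attribute: str, text: str) -> List[str]:
    """Return the quoted attribute values of every opening <tag ...>."""
    pattern = rf'<{tag}\b[^>]*?\b{attribute}\s*=\s*["\']([^"\']*)["\']'
    return re.findall(pattern, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    for block in find_all_tags_by_classname("ul", "list-story", HTML):
        # Passing the block as text limits the search to this element only.
        for anchor in find_all_html_tags("a", block):
            print(extract_text_from_html(anchor))
        print(get_attribute("a", "href", block))
    print(get_attribute("img", "src", HTML))
```

Searching the whole page for `href` values would also return `/about/`; searching inside the `list-story` block returns only the two issue links, which is the effect the `text` parameter is described as having.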
<br>

### Example: extracting comic book chapters and their href attributes from <a href="https://readallcomics.com/">readallcomics</a> using the new DataTypes

```python
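from SpiderNet import HashMap, ForEach, GenSpider, Str

# Wrap the category URL in the package's Str type and load the page.
string = Str("https://readallcomics.com/category/chakra-the-invincible/")
web = GenSpider(string)

# Find every <ul> element with the class "list-story".
x = web.find_all_tags_by_classname('ul', 'list-story')
arr = HashMap()
for d in x:
    # Restrict both searches to the current element through the text keyword.
    w = web.find_all_html_tags('a', text=d)
    link_content = web.get_href_from_a_tags(text=d)
    for y in range(len(w)):
        text_content = web.extract_text_from_html(w[y])
        # Map the chapter title to its link.
        arr.add(text_content, link_content[y])

# Print every "title => link" pair.
ForEach(arr).unit()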
```

### The output of the code will be as follows 
```shell
Chakra The Invincible 010 (2016) => https://readallcomics.com/chakra-the-invincible-010-2016/
Chakra The Invincible 009 (2016) => https://readallcomics.com/chakra-the-invincible-009-2016/
Chakra The Invincible 008 (2016) => https://readallcomics.com/chakra-the-invincible-008-2016/
Chakra The Invincible 007 (2016) => https://readallcomics.com/chakra-the-invincible-007-2016/
Chakra The Invincible 006 (2015) => https://readallcomics.com/chakra-the-invincible-006-2015/
Chakra The Invincible 005 (2015) => https://readallcomics.com/chakra-the-invincible-005-2015/
Chakra The Invincible 004 (2015) => https://readallcomics.com/chakra-the-invincible-004-2015/
Chakra The Invincible 003 (2015) => https://readallcomics.com/chakra-the-invincible-003-2015/
Chakra The Invincible 002 (2015) => https://readallcomics.com/chakra-the-invincible-002-2015/
Chakra The Invincible 001 (2015) => https://readallcomics.com/chakra-the-invincible-001-2015/
```
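
`HashMap` and `ForEach(...).unit()` are SpiderNet's own DataTypes, documented only through the linked examples. For readers who prefer built-in types, the same pairing of titles and links can be sketched with a plain `dict`. The sample data is copied from the output above, and the `=>` formatting only imitates it:

```python
titles = [
    "Chakra The Invincible 010 (2016)",
    "Chakra The Invincible 009 (2016)",
]
links = [
    "https://readallcomics.com/chakra-the-invincible-010-2016/",
    "https://readallcomics.com/chakra-the-invincible-009-2016/",
]

# zip pairs each title with the link at the same position;
# a dict keeps the pairs in insertion order.
chapters = dict(zip(titles, links))

lines = [f"{title} => {link}" for title, link in chapters.items()]
print("\n".join(lines))
```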
<br>

<p>For more examples, see:</p>
<ul>
  <li><a href="https://github.com/query-lang/SpiderWeb/tree/main/examples/web">Web Scraping code examples</a></li>
  <li><a href="https://github.com/query-lang/SpiderWeb/tree/main/examples/DataTypes">Data Types code examples</a></li>
  
</ul>

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/query-lang/SpiderWeb",
    "name": "SpiderNet",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "conversion",
    "author": "Vishal",
    "author_email": "vishalvenkat2604@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/dd/42/4a35467e7fe30ccb7822cd5ede7826e8447c662ce849b47ff8928ea3c3e1/spidernet-1.3.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n<img src=\"https://i.imgur.com/Xvzus2m.png\" height=20% width=20%>\n</div>\n<br>\n<div align=\"center\">\n<p>A simple and lightweight library for scraping the web</p>\n</div>\n<br>\n<p>Built on Curl and Regex in python , SpiderNet offers similar functionality to the (BeautifulSoup and requests) alternative . For the package to work , you need to have <a href=\"https://help.ubidots.com/en/articles/2165289-learn-how-to-install-run-curl-on-windows-macosx-linux\">curl</a> installed in your system . </p>\n\n### Install the latest version from Pypi or the <a href=\"https://github.com/query-lang/SpiderWeb/releases/tag/SpiderWeb\">releases page</a> \n```shell\npip install SpiderNet\n```\n- Features \n  - [x] Scrape tags from websites \n  - [x] Scrape the text within the tags\n  - [x] Obtain href attributes for the <a> tag (anchor tag)\n  - [x] Obtain src attribute for the <img> tag (image tag)\n  - [x] The package contains new <a href=\"https://github.com/query-lang/SpiderWeb/tree/main/examples/DataTypes\">Datatypes</a> made for easier workflow which integrate with the parameters and values of the package.  \n\n### The main class is ```GenSpider``` . \n\n```python\nfrom SpiderNet import GenSpider\nweb=GenSpider(<website>)\n```\n### The methods are \n<ol>\n  <ul>\n    <li><code>website_text</code></li>\n    This method returns the markup text of the website . <br>\n    <li><code>find_all_html_tags</code></li>\n    This method finds all html tags passed in the parameter. If the tags are nested then \n    upon looping them you can add the 'text' keyword in the function to target the initial looped text . <br>\n    <li><code>extract_text_from_html</code></li>\n    This method extracts text from the looped instance of the tag! <br>\n    <li><code>find_all_tags_by_classname</code></li>\n    This method finds all html tags passed in the parameter with the given class only , also passed in the parameter. 
If the tags are nested then \n        upon looping them you can add the 'text' keyword in the function to target the initial looped text. <br>\n    <li><code>get_href_from_a_tags</code></li>\n    Returns a list of all href attributes of anchor tag . Optional text parameter if you want to target a particualr text piece. Default is extracting href from the entire page.<br>\n    <li><code>get_src_from_img_tags</code></li>\n    Returns a list of all src attributes of img tag . Optional text parameter if you want to target a particualr text piece. Default is extracting src from the entire page.<br>\n\n  </ul>\n</ol>\n<br>\n\n### Example code of extracting Comic Book Chapters from <a href=\"https://readallcomics.com/\">readallcomics</a> , using the new DataTypes , and their respective href attributes \n\n```python\nfrom SpiderNet import HashMap , ForEach , GenSpider , Str\n\n\nstring=Str(\"https://readallcomics.com/category/chakra-the-invincible/\")\nweb=GenSpider(string)\nx=web.find_all_tags_by_classname('ul','list-story')\narr=HashMap()\nfor d in x:\n  \n    w=web.find_all_html_tags('a',text=d)\n    num=1\n    link_content=web.get_href_from_a_tags(text=d)\n    for y in range(len(w)):\n        text_content = web.extract_text_from_html(w[y])\n        \n        arr.add(text_content,link_content[y])\n        num+=1\n\nForEach(arr).unit()\n```\n\n### The output of the code will be as follows \n```shell\nChakra The Invincible 010 (2016) => https://readallcomics.com/chakra-the-invincible-010-2016/\nChakra The Invincible 009 (2016) => https://readallcomics.com/chakra-the-invincible-009-2016/\nChakra The Invincible 008 (2016) => https://readallcomics.com/chakra-the-invincible-008-2016/\nChakra The Invincible 007 (2016) => https://readallcomics.com/chakra-the-invincible-007-2016/\nChakra The Invincible 006 (2015) => https://readallcomics.com/chakra-the-invincible-006-2015/\nChakra The Invincible 005 (2015) => https://readallcomics.com/chakra-the-invincible-005-2015/\nChakra The 
Invincible 004 (2015) => https://readallcomics.com/chakra-the-invincible-004-2015/\nChakra The Invincible 003 (2015) => https://readallcomics.com/chakra-the-invincible-003-2015/\nChakra The Invincible 002 (2015) => https://readallcomics.com/chakra-the-invincible-002-2015/\nChakra The Invincible 001 (2015) => https://readallcomics.com/chakra-the-invincible-001-2015/\n```\n<br>\n\n<p>For more examples look at : </p>\n<ul>\n  <li><a href=\"https://github.com/query-lang/SpiderWeb/tree/main/examples/web\">Web Scraping code examples</a></li>\n  <li><a href=\"https://github.com/query-lang/SpiderWeb/tree/main/examples/DataTypes\">Data Types code examples</a></li>\n  \n</ul>\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A python package to simplify web scraping . Built using REgex and Curl",
    "version": "1.3",
    "project_urls": {
        "Homepage": "https://github.com/query-lang/SpiderWeb"
    },
    "split_keywords": [
        "conversion"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "285b41cb729254a7cf905d4995af3e0fa462222e7eafa3fe8d1a14bc5c8925f0",
                "md5": "5eb951a1a67166fc1ddec371c6a5d514",
                "sha256": "f02ff83df917e6d25432d3b8de2e5bffbd6161a78fdb90cb9fd3acded34f06b3"
            },
            "downloads": -1,
            "filename": "SpiderNet-1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5eb951a1a67166fc1ddec371c6a5d514",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5806,
            "upload_time": "2024-07-30T19:31:17",
            "upload_time_iso_8601": "2024-07-30T19:31:17.788533Z",
            "url": "https://files.pythonhosted.org/packages/28/5b/41cb729254a7cf905d4995af3e0fa462222e7eafa3fe8d1a14bc5c8925f0/SpiderNet-1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dd424a35467e7fe30ccb7822cd5ede7826e8447c662ce849b47ff8928ea3c3e1",
                "md5": "95a07ca580337f0a9042f97c86eeed15",
                "sha256": "de9a9b420bfefecdc40b4ac724069d7b6ac58785b82d3d98611dbef295175fec"
            },
            "downloads": -1,
            "filename": "spidernet-1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "95a07ca580337f0a9042f97c86eeed15",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5142,
            "upload_time": "2024-07-30T19:31:19",
            "upload_time_iso_8601": "2024-07-30T19:31:19.325231Z",
            "url": "https://files.pythonhosted.org/packages/dd/42/4a35467e7fe30ccb7822cd5ede7826e8447c662ce849b47ff8928ea3c3e1/spidernet-1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-30 19:31:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "query-lang",
    "github_project": "SpiderWeb",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "spidernet"
}
        