hodorlive


- Name: hodorlive
- Version: 1.2.17
- Summary: xpath/css based scraper with pagination
- Upload time: 2025-11-04 06:42:10
- Author email: Compile Inc <dev@compile.com>
- Homepage: https://github.com/CompileInc/hodor
- Requires Python: >=3.11
- License: MIT
- Keywords: cssselect, hodor, lxml, scraping

# Hodor [![PyPI](https://img.shields.io/pypi/v/hodorlive.svg?maxAge=2592000?style=plastic)](https://pypi.python.org/pypi/hodorlive/)

A simple HTML scraper that extracts data with XPath or CSS selectors and can follow pagination links.

## Install

```pip install hodorlive```

## Usage

### As a Python package

***WARNING: By default, this package does not verify SSL certificates. Set ```ssl_verify``` to ```True``` (see the [arguments](#arguments)) to enable verification.***

#### Sample code
```python
from hodor import Hodor
from dateutil.parser import parse


def date_convert(data):
    return parse(data)

url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

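# Crawl at most 5 pages, following the link matched by `_paginate_by`.
h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)

# Extracted values, grouped as defined in `_groups`.
h.data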
```
#### Sample output
```python
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC',
           'old_symbol': 'AA'},
          {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC$',
           'old_symbol': 'AA$'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN8',
           'old_symbol': 'AHUSDN2018'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN9',
           'old_symbol': 'AHUSDN2019'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ6',
           'old_symbol': 'AHUSDQ2016'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ7',
           'old_symbol': 'AHUSDQ2017'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ8',
           'old_symbol': 'AHUSDQ2018'}]}
```

#### Arguments

- ```ua``` (User-Agent string)
- ```proxies``` (proxy settings; check requesocks)
- ```auth```
- ```crawl_delay``` (delay in seconds between paginated requests - default: 3)
- ```pagination_max_limit``` (maximum number of pages to crawl - default: 100)
- ```ssl_verify``` (whether to verify SSL certificates - default: False)
- ```robots``` (if set, respects robots.txt - default: True)
- ```reppy_capacity``` (LRU capacity of the robots.txt cache - default: 100)
- ```trim_values``` (if set, trims leading and trailing whitespace from output values - default: True)
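
As a minimal sketch of how these arguments might be passed, the snippet below collects them as keyword arguments with certificate verification turned on. The helper names `build_options` and `make_scraper` are hypothetical, and it is an assumption that every listed argument is accepted as a keyword argument in the same way `pagination_max_limit` is in the sample code.

```python
def build_options(**overrides):
    """Return keyword arguments for Hodor, starting from safer defaults."""
    options = {
        'ssl_verify': True,           # the package default is False
        'robots': True,               # respect robots.txt (package default)
        'crawl_delay': 3,             # seconds between paginated requests
        'pagination_max_limit': 100,  # maximum number of pages to crawl
        'trim_values': True,          # strip leading/trailing whitespace
    }
    options.update(overrides)
    return options


def make_scraper(url, config, **overrides):
    """Create a Hodor instance; hodor is imported lazily (hypothetical helper)."""
    from hodor import Hodor
    return Hodor(url=url, config=config, **build_options(**overrides))
```

For example, `make_scraper(url, CONFIG, pagination_max_limit=5)` would mirror the sample code above while also enabling SSL verification.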


#### Config parameters
- By default, any key in the config is a rule to parse.
    - Each rule is either an ```xpath``` or a ```css``` selector.
    - Each rule extracts ```many``` values by default unless ```many``` is explicitly set to ```False```.
    - Each rule can ```transform``` its result with a function, if one is provided.
- Extra parameters include grouping (```_groups```) and pagination (```_paginate_by```). The pagination parameter also follows the rule format.
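
To make the grouping idea concrete, here is a rough sketch of how per-rule value lists could be combined into records like those in the sample output. This is not the library's implementation; `group_records` is a hypothetical helper, and treating `'__all__'` as "every rule key" is an assumption based on the sample output.

```python
def group_records(fields, groups):
    """Zip parallel per-rule value lists into a list of records per group."""
    grouped = {}
    for group_name, keys in groups.items():
        if keys == '__all__':
            # Assumption: '__all__' selects every rule in the config.
            keys = sorted(fields)
        columns = [fields[key] for key in keys]
        # zip() stops at the shortest column, so ragged lists are truncated.
        grouped[group_name] = [dict(zip(keys, row)) for row in zip(*columns)]
    return grouped


fields = {
    'old_symbol': ['AA', 'AA$'],
    'new_symbol': ['ARNC', 'ARNC$'],
}
records = group_records(fields, {'ticker_changes': ['old_symbol', 'new_symbol']})
```

Here `records['ticker_changes']` holds one dictionary per table row, each pairing an old symbol with its new symbol.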



## Building & Publishing

### Prerequisites

- Install [uv](https://docs.astral.sh/uv/getting-started/installation/).
- Review the [uvx execution model](https://docs.astral.sh/uv/concepts/tools/#execution-vs-installation) for running tools without global installs.
- Consult the [Hatch documentation](https://hatch.pypa.io/latest/).

### Build workflow

Run the release helper to build and publish wheels and source archives via Hatch:

```bash
./upload.sh
```

The script shells out to `uvx hatch build` followed by `uvx hatch publish` so that Hatch is executed in an ephemeral environment.
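
The contents of `upload.sh` are not shown here. As an illustration only, the same two steps could be expressed in Python roughly as follows; `RELEASE_STEPS` and `run_release` are hypothetical names, and the `runner` parameter exists only so the sequence can be exercised without invoking Hatch.

```python
import subprocess

# The two commands named above, in the order they are run.
RELEASE_STEPS = [
    ['uvx', 'hatch', 'build'],
    ['uvx', 'hatch', 'publish'],
]


def run_release(steps=RELEASE_STEPS, runner=subprocess.run):
    """Run each step in order; a failing step stops the release."""
    for command in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # on a non-zero exit code, so publish never runs after a failed build.
        runner(command, check=True)
```

Calling `run_release()` with the default runner would build and then publish, so it should only be run with valid credentials in place.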

### Publishing requirements

Configure credentials in `~/.pypirc` as described in the [PyPI configuration specification](https://packaging.python.org/en/latest/specifications/pypirc/).

Example configuration:

```ini
[distutils]
index-servers =
  pypi
  testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = <pypi-token>

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = <testpypi-token>
```

Replace token placeholders with secrets from the team password manager and avoid committing the file to version control.
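
Before publishing, it can help to confirm that the placeholders were actually replaced. The following is a small sketch using the standard library `configparser`; the function name `find_pypirc_problems` and the rule that a password starting with `<` is still a placeholder are assumptions made for this example.

```python
import configparser


def find_pypirc_problems(text):
    """Return a list of human-readable problems found in .pypirc content."""
    # interpolation=None so that '%' characters in tokens are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []
    servers = parser.get('distutils', 'index-servers', fallback='').split()
    for server in servers:
        if not parser.has_section(server):
            problems.append(f'missing section [{server}]')
            continue
        if parser.get(server, 'username', fallback='') != '__token__':
            problems.append(f'[{server}] username is not __token__')
        password = parser.get(server, 'password', fallback='')
        if not password or password.startswith('<'):
            problems.append(f'[{server}] password is empty or still a placeholder')
    return problems
```

Passing the contents of `~/.pypirc` to this function returns an empty list when every listed index server has a token username and a non-placeholder password.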

            
