Name | ssc_codegen |
Version | 0.7.1 |
home_page | None |
Summary | Python-dsl code converter to html parser for web scraping |
upload_time | 2025-02-22 12:11:14 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.10 |
license | None |
keywords | |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Selector Schema codegen
## Introduction
ssc-gen is a Python-based DSL for writing HTML parsers in a dataclass style and converting them to a target language.
The project addresses the following problems:
- reducing boilerplate code
- generating types (type annotations) and documentation
- simplifying code maintenance
- portability to other languages
## Supported converters
Currently supported converters:
| Language | Library (html parser backend) | XPath Support | CSS Support | Generated types | Code formatter |
|---------------|--------------------------------------------------------------|---------------|-------------|------------------------------------------|----------------|
| Python (3.8+) | bs4 | N | Y | TypedDict*, list, dict | ruff |
| ... | parsel | Y | Y | ... | - |
| ... | selectolax (modest) | N | Y | ... | - |
| ... | scrapy (possibly use parsel - pass Response.selector object) | Y | Y | ... | - |
| Dart (3) | universal_html | N | Y | record, List, Map | dart format |
| js (ES6) | pure (firefox/chrome) | Y | Y | Array, Map** | - |
| go (1.10+) | goquery | N | Y | struct(json anchors include), array, map | gofmt |
- *This annotation type was deliberately chosen as a compromise: Python has many serialization options (`dataclass, namedtuple, attrs, pydantic`).
  - TypedDict behaves like a built-in dict but adds IDE and linter hint support, and it is easy to write an adapter to whatever structure you need.
- **js has no built-in serialization methods.
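The adapter idea from the TypedDict note can be sketched with the standard library alone. The names `HelloWorldDict`, `HelloWorldModel` and `to_model`, as well as the sample values, are hypothetical and are not part of ssc_codegen; the sketch only assumes that a parsed result is a plain dict with `title` and `a_hrefs` keys.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class HelloWorldDict(TypedDict):
    """Hypothetical shape of a parsed result: a plain dict with typed keys."""

    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class HelloWorldModel:
    """Hypothetical structure used by the rest of an application."""

    title: str
    a_hrefs: List[str]


def to_model(data: HelloWorldDict) -> HelloWorldModel:
    # A TypedDict is an ordinary dict at runtime, so ** unpacking works.
    return HelloWorldModel(**data)


# Made-up sample data, for illustration only.
result: HelloWorldDict = {"title": "Hello", "a_hrefs": ["/about", "/contact"]}
model = to_model(result)
```

The same pattern should carry over to `attrs`, `pydantic` or `namedtuple` targets, since they also accept keyword arguments.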
### Limitations
For maximum portability of a configuration across target languages:
- Use CSS selectors: they are guaranteed to be convertible to XPath.
- Unlike JavaScript in browsers, most HTML parsing libraries implement only the [CSS3 selectors standard](https://www.w3.org/TR/selectors-3/):
  - basic selectors: (`tag`, `.class`, `#id`)
  - combined: (`div p`, `ul > li`, `h2 +p`\[1])
  - attribute: (`a[href]`, `input[type='text']`)\[2]
  - pseudo classes: (`:nth-child(n)`, `:first-child`, `:last-child`)\[3]
  - **more complex, dynamic selectors are often not supported**: (`:has()`, `:nth-of-type()`, `:where()`, `:is()`)

1. Several libraries do not support the `+` combinator (e.g. [selectolax(modest)](https://github.com/rushter/selectolax), [dart.universal_html](https://pub.dev/packages/universal_html)).
2. Web scraping libraries often do not support attribute operators such as `*=`, `~=`, `|=`, `^=` and `$=`.
3. Several libraries do not support pseudo classes (e.g. the standard [dart.html](https://dart.dev/libraries/dart-html) library lacks this feature).
   This project will not implement converters for libraries with such drawbacks.
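The portability notes above can be turned into a quick pre-check for selectors. The following is a minimal sketch under stated assumptions: `portability_warnings` is a hypothetical helper, not part of ssc_codegen, and it relies on rough regular expressions rather than a real CSS grammar, so quoted attribute values containing `+` or `=` may produce false positives.

```python
import re

# Pseudo-class arguments such as "(2n+1)"; removed before looking for "+".
_PARENS = re.compile(r"\([^()]*\)")
# Attribute operators other than plain "=": *=, ~=, |=, ^=, $=
_ATTR_OPERATOR = re.compile(r"[*~|^$]=")
# Pseudo-classes listed above as often unsupported.
_COMPLEX_PSEUDO = re.compile(r":(?:has|nth-of-type|where|is)\(")


def portability_warnings(selector: str) -> list[str]:
    """Return human-readable warnings for constructs that may not be portable."""
    warnings = []
    # Drop parenthesized arguments so ":nth-child(2n+1)" is not read as a combinator.
    if "+" in _PARENS.sub("", selector):
        warnings.append("adjacent sibling combinator '+'")
    if _ATTR_OPERATOR.search(selector):
        warnings.append("attribute operator other than '='")
    if _COMPLEX_PSEUDO.search(selector):
        warnings.append("complex pseudo-class")
    return warnings
```

An empty list means that none of the listed constructs were detected; it does not prove that every backend accepts the selector.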
## Getting started
ssc_gen requires Python 3.10 or higher.
### Install
pip:
```shell
pip install ssc_codegen
```
uv:
```shell
uv pip install ssc_codegen
```
as a CLI converter tool:
| package manager | command |
|-----------------|-------------------------------|
| pipx | `pipx install ssc_codegen` |
| uv | `uv tool install ssc_codegen` |
## Example
### Create a file `schema.py` with:
```python
from ssc_codegen import ItemSchema, D


class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
```
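To make clear what this schema extracts, here is a hand-written sketch of the same two fields using only Python's standard `html.parser`. It is not the code ssc-gen generates; `HelloWorldSketch` and `parse_hello_world` are hypothetical names, and the behavior (text of the first `<title>`, `href` of every `<a>`, `None` when the attribute is missing) is an assumption modeled on the generated JavaScript shown later in this README.

```python
from html.parser import HTMLParser


class HelloWorldSketch(HTMLParser):
    """Collects the text of the first <title> and the href of every <a>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._title_done = False
        self.title_parts = []
        self.a_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; a missing href yields None.
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_hello_world(html: str) -> dict:
    parser = HelloWorldSketch()
    parser.feed(html)
    parser.close()
    return {"title": "".join(parser.title_parts), "a_hrefs": parser.a_hrefs}
```

Writing this by hand for every page is the boilerplate that the two-line schema is meant to replace.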
### try it in cli
>[!note]
> These tools are intended for testing purposes, not for web scraping
### from file
>[!warning]
> DO NOT PASS CONFIGS FROM UNKNOWN SOURCES:
>
> PYTHON CODE FROM CONFIGS IS COMPILED AT RUNTIME WITHOUT SECURITY CHECKS!
Download any HTML file and pass it as an argument:
```shell
ssc-gen parse-from-file index.html -t schema.py:HelloWorld
```
Short description of the options:
- `-t --target` - the schema config file and the class from which the parser starts
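If you do not have an HTML file at hand, one way to obtain it is a few lines of standard-library Python. This is only a sketch: `save_html` is a hypothetical helper, not part of ssc_codegen, and it writes the raw response bytes without any charset handling.

```python
from pathlib import Path
from urllib.request import urlopen


def save_html(url: str, path: str = "index.html") -> Path:
    """Fetch `url` and write the raw response bytes to `path`."""
    with urlopen(url, timeout=30) as response:
        body = response.read()
    target = Path(path)
    target.write_bytes(body)
    return target


# Example call (performs a network request):
# save_html("https://example.com", "index.html")
```

The saved `index.html` can then be passed to `ssc-gen parse-from-file` as shown above.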

### from url
```shell
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld
```

### from Chromium browser (CDP protocol)
```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
```
>[!note]
> If the script cannot find the Chrome executable, provide its path manually:
```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
```
### Convert to code
Convert the schema to code for use in your projects:
>[!note]
> This example uses js, because the output can be tested quickly in the browser developer console
```shell
ssc-gen js schema.py -o .
```
Code output looks like this (code formatted by IDE):
```javascript
// autogenerated by ssc-gen DO NOT_EDIT
/**
 *
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }
 */
class HelloWorld {
    constructor(doc) {
        if (typeof doc === 'string') {
            this._doc = new DOMParser().parseFromString(doc, 'text/html');
        } else if (doc instanceof Document || doc instanceof Element) {
            this._doc = doc;
        } else {
            throw new Error("Invalid input: Expected a Document, Element, or string");
        }
    }

    _parseTitle(value) {
        let value1 = value.querySelector('title');
        return typeof value1.textContent === "undefined" ? value1.documentElement.textContent : value1.textContent;
    }

    _parseAHrefs(value) {
        let value1 = Array.from(value.querySelectorAll('a'));
        return value1.map(e => e.getAttribute('href'));
    }

    parse() {
        return {title: this._parseTitle(this._doc), a_hrefs: this._parseAHrefs(this._doc)};
    }
}
```
### copy the code output and paste it into the developer console:
Print the output:
```javascript
alert(JSON.stringify((new HelloWorld(document).parse())))
```

You can use any HTML source:
- read it from an HTML file
- get it from an HTTP request
- get it from a browser (playwright, selenium, chrome-cdp)
- paste the generated code into the developer console (js)
- or call curl in a shell and parse stdin
## See also
- [Brief](docs/brief.md) on CSS selectors and regular expressions
- [Tutorial](docs/tutorial.md) on how to use ssc-gen
- [Reference](docs/reference.md) for the high-level API
Raw data
{
"_id": null,
"home_page": null,
"name": "ssc_codegen",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/dd/8d/a84f6f7337cea0d9c2d422ff1d231537c18bfc87160749778fdba6979d75/ssc_codegen-0.7.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Python-dsl code converter to html parser for web scraping ",
"version": "0.7.1",
"project_urls": {
"Documentation": "https://github.com/vypivshiy/selector_schema_codegen#readme",
"Examples": "https://github.com/vypivshiy/selector_schema_codegen/examples",
"Issues": "https://github.com/vypivshiy/selector_schema_codegen/issues",
"Source": "https://github.com/vypivshiy/selector_schema_codegen"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "14bce6146bcb0728e3cc5a006a6541a501215682039cec6a54d0a42604f51096",
"md5": "0352b6852411df4bfa471a34f192f345",
"sha256": "3d5f9694a5d670e1d288c604c6919d4fbf9b98c63b11218162d290c5098dad4d"
},
"downloads": -1,
"filename": "ssc_codegen-0.7.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0352b6852411df4bfa471a34f192f345",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 76815,
"upload_time": "2025-02-22T12:11:12",
"upload_time_iso_8601": "2025-02-22T12:11:12.799345Z",
"url": "https://files.pythonhosted.org/packages/14/bc/e6146bcb0728e3cc5a006a6541a501215682039cec6a54d0a42604f51096/ssc_codegen-0.7.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "dd8da84f6f7337cea0d9c2d422ff1d231537c18bfc87160749778fdba6979d75",
"md5": "883bbfdb5d3b59c0bd231354ae22cff4",
"sha256": "e8d4939072fa563c13d67e0c2e38ab875e07e03751dd5a64d70fe9eeace41542"
},
"downloads": -1,
"filename": "ssc_codegen-0.7.1.tar.gz",
"has_sig": false,
"md5_digest": "883bbfdb5d3b59c0bd231354ae22cff4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 56313,
"upload_time": "2025-02-22T12:11:14",
"upload_time_iso_8601": "2025-02-22T12:11:14.971356Z",
"url": "https://files.pythonhosted.org/packages/dd/8d/a84f6f7337cea0d9c2d422ff1d231537c18bfc87160749778fdba6979d75/ssc_codegen-0.7.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-22 12:11:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "vypivshiy",
"github_project": "selector_schema_codegen#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "ssc_codegen"
}