| Field | Value |
| --- | --- |
| Name | ssc_codegen |
| Version | 0.10.1 |
| home_page | None |
| Summary | Python-dsl code converter to html parser for web scraping |
| upload_time | 2025-09-16 05:19:35 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.10 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# Selector Schema codegen
## Introduction
ssc-gen is a Python-based DSL for describing parsers of HTML documents; a schema written in it is translated into a standalone parsing module.
### For a better experience using this library, you should know:
- CSS selectors (CSS3 at minimum) and XPath
- regular expressions (PCRE)
### Problems this project solves:
- designed for parsers of SSR (server-side rendered) HTML pages, **NOT FOR REST-API, GRAPHQL ENDPOINTS**
- reduces boilerplate code
- generates independent modules that can be reused outside the project.
- generates docstring documentation and the signature of the parser output.
- for a better IDE experience, generates typedefs and type annotations (if the target programming language supports them).
- supports annotation and parsing of JSON-like strings from a document
- provides an AST API for developing new code generation converters
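
To illustrate what "parsing JSON-like strings from a document" means in practice, here is a minimal hand-written sketch using only the Python standard library. It does not use the ssc-gen DSL or its generated code; the `extract_ld_json` helper, the regex, and the sample markup are assumptions made for this example.

```python
import json
import re

# Sample server-rendered page with structured data embedded in a <script> tag.
HTML = """
<html><head><title>Demo</title>
<script type="application/ld+json">{"name": "Widget", "price": 9.5, "tags": ["a", "b"]}</script>
</head><body></body></html>
"""

# Hypothetical helper pattern (not part of ssc-gen): capture the script body.
# re.DOTALL lets `.` match newlines in case the JSON spans several lines.
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_ld_json(html: str) -> dict:
    """Find the first JSON-LD script block and decode it into a dict."""
    match = LD_JSON_RE.search(html)
    if match is None:
        raise ValueError("no JSON-LD script found")
    return json.loads(match.group(1))


data = extract_ld_json(HTML)
print(data["name"], data["price"], data["tags"])
```

A schema-driven generator automates this select, extract, and decode sequence and additionally emits a type annotation for the decoded structure.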
## Supported converters
Currently supported converters:
| Language | HTML parser lib + dependencies | XPath | CSS3 | CSS4 | Generated annotations, types, structs | formatter dependency |
| ---------------------------------------- | -------------------------------------- | ----- | ---- | ---- | ------------------------------------- | -------------------- |
| Python (3.8+) | bs4, lxml | N | Y | Y | TypedDict`1`, list, dict | ruff |
| ... | parsel | Y | Y | N | ... | ... |
| ... | selectolax (lexbor) | N | Y | N | ... | ... |
| ... | lxml | Y | Y | N | ... | ... |
| js (ES6)`2` | pure (firefox/chrome extension/nodejs) | Y | Y | Y | Array, Map`3` | prettier |
| go (1.10+) **(UNSTABLE)** | goquery, gjson (`4`) | N | Y | N | struct(+json anchors), array, map | gofmt |
| lua (5.2+), luajit(2+) **(UNSTABLE)**`5` | lua-htmlparser, lrexlib(opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |
- **CSS3** means support for the following selectors:
  - basic: (`tag`, `.class`, `#id`, `tag1,tag2`)
  - combined: (`div p`, `ul > li`, `h2 +p`, `title ~head`)
  - attribute: (`a[href]`, `input[type='text']`, `a[href*='...']`, ...)
  - CSS3 pseudo classes: (`:nth-child(n)`, `:first-child`, `:last-child`)
- **CSS4** means support for the following selectors:
  - `:nth-of-type()`, `:where()`, `:is()`, `:not()` etc
- `1` this annotation type was deliberately chosen as a compromise:
  Python has many ways of serialization: `namedtuple, dataclass, attrs, pydantic, msgspec, etc`
  - TypedDict is like a built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
- `2` the ES8 standard is required if you need the PCRE `re.S | re.DOTALL` flag
- `3` the js output does not include built-in serialization methods; the standard Array and Map types are used. Refer to the generated signature documentation!
- `4` golang has not been tested much, there may be issues
- **formatter dependency** - optional dependency used to prettify the generated code and fix its code style
- `5` lua
  - Experimental research PoC; performance and stability are not guaranteed
  - Priority is given to generating pure lua without C-lib dependencies, using [mva/htmlparser](https://luarocks.org/modules/mva/htmlparser) and [dhkolf/dkjson](https://luarocks.org/modules/dhkolf/dkjson)
  - Translates unsupported CSS3 selectors into equivalent function calls:
    - for example, `div +p` is equivalent to `CssExt.combine_plus(root:select("div"), "p")`
  - Translates PCRE regex to [string pattern matching](https://www.lua.org/manual/5.4/manual.html#6.4.1) (with restrictions); for more information, see [lua_re_compat.py](ssc_codegen/converters/templates/lua_re_compat.py)
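
Footnote `1` says a TypedDict result can be adapted to whatever structure an application needs. The sketch below shows one way that could look, assuming a result with `title` and `a_hrefs` keys like the `HelloWorld` example later in this README. The names `T_HelloWorld`, `Page`, and `to_page` are hypothetical and chosen for this illustration; they are not names the generator is known to emit.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class T_HelloWorld(TypedDict):
    """Assumed shape of a parser result (illustrative name)."""

    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class Page:
    """Application-side structure the result is adapted into."""

    title: str
    links: List[str]


def to_page(raw: T_HelloWorld) -> Page:
    # A TypedDict is a plain dict at runtime, so ordinary key access works,
    # while type checkers and IDEs still know which keys exist.
    return Page(title=raw["title"], links=list(raw["a_hrefs"]))


raw: T_HelloWorld = {"title": "Example page", "a_hrefs": ["/about", "/contact"]}
page = to_page(raw)
print(page)
```

The same adapter pattern applies to `attrs`, `pydantic`, or `msgspec` models: only `to_page` changes, the parser output stays a dict.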
### Limitations
For maximum portability of the configuration to the target language:
- If possible, use CSS selectors: they are guaranteed to be converted to XPath
- Unlike JavaScript, most HTML parser libs implement the [CSS3 selectors standard](https://www.w3.org/TR/selectors-3/), and they may not implement it fully!
  Check the HTML parser lib documentation about CSS selectors before writing code. Examples:
  1. Several libs do not support the `+` combinator (e.g. [selectolax(modest)](https://github.com/rushter/selectolax), [dart.universal_html](https://pub.dev/packages/universal_html))
  2. For research purposes, the lua_htmlparser converter includes a translator for unsupported CSS3 query syntax
  3. HTML parser libs may not support the attribute selectors `*=`, `~=`, `|=`, `^=`, `$=`
  4. Several libs do not support pseudo-classes (e.g. the standard [dart.html](https://dart.dev/libraries/dart-html) lib lacks this feature).
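
When a parser lib lacks the `+` combinator, the selector can be emulated by walking sibling lists, which is the same idea as the function-call translation described for lua. The following is a minimal sketch in Python, assuming well-formed markup so that the standard-library `xml.etree.ElementTree` can stand in for an HTML parser; `adjacent_siblings` is a hypothetical helper, not an ssc-gen function.

```python
import xml.etree.ElementTree as ET

# Well-formed markup so the stdlib XML parser can stand in for an HTML parser.
DOC = """
<body>
  <div>first</div>
  <p>matched: directly after a div</p>
  <p>skipped: previous sibling is a p</p>
  <section>
    <div>second</div>
    <span>breaks adjacency</span>
    <p>skipped: previous sibling is a span</p>
  </section>
</body>
"""


def adjacent_siblings(root, left_tag, right_tag):
    """Emulate the CSS `left + right` combinator by walking child lists."""
    matches = []
    for parent in root.iter():  # every element, including the root itself
        children = list(parent)  # element children only, in document order
        # Pair each child with the element that immediately follows it.
        for prev, cur in zip(children, children[1:]):
            if prev.tag == left_tag and cur.tag == right_tag:
                matches.append(cur)
    return matches


root = ET.fromstring(DOC.strip())
found = adjacent_siblings(root, "div", "p")
print([p.text for p in found])  # ['matched: directly after a div']
```

Only the element that immediately follows a `div` is returned, which matches the semantics of `div + p`; text nodes between elements are ignored, as they are in CSS.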
## Getting started
ssc-gen requires Python 3.10 or higher.
### Install
pip:
```shell
pip install ssc_codegen
```
uv:
```shell
uv pip install ssc_codegen
```
## Example
### Create a file `schema.py` with:
```python
from ssc_codegen import ItemSchema, D


class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
```
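
To make the intent of the schema concrete, the sketch below extracts the same two fields by hand with the standard-library `html.parser`: the text of the first `<title>` and the `href` of every `<a>`. It is an illustration of what the schema describes, not the code ssc-gen generates; `HelloWorldParser`, `parse`, and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class HelloWorldParser(HTMLParser):
    """Hand-written stand-in: first <title> text and every <a href>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._title_done = False
        self.title = ""
        self.a_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "a":
            # dict.get returns None when the tag has no href attribute.
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True  # only the first <title> is used

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html):
    """Return a dict with the same keys as the HelloWorld schema fields."""
    parser = HelloWorldParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "a_hrefs": parser.a_hrefs}


SAMPLE = (
    "<html><head><title>Hello</title></head>"
    '<body><a href="/a">A</a><a href="/b">B</a></body></html>'
)
result = parse(SAMPLE)
print(result)  # {'title': 'Hello', 'a_hrefs': ['/a', '/b']}
```

The two-line schema replaces roughly thirty lines of event-handling code here, which is the boilerplate reduction the project aims for.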
### try it in cli
> [!note]
> these tools were developed for testing purposes, not for web-scraping tasks
### eval from file
Download any HTML file and pass it as an argument:
```shell
ssc-gen parse-from-file index.html -t schema.py:HelloWorld
```
Short option descriptions:
- `-t --target` - schema config file and the class from which to start the parser

### send GET request to url and parse response
```shell
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld
```

### send request via Chromium browser (CDP protocol)
```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
```
> [!note]
> if the script cannot find the chrome executable, provide it manually:
```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
```
### Convert to code
Convert to code for use in projects:
> [!note]
> js is used for this example because the output can be quickly tested in the developer console
```shell
ssc-gen js schema.py -o .
```
Code output looks like this:
```javascript
// autogenerated by ssc-gen DO NOT_EDIT
/***
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }*/
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }

  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }

  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }

  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}
```
### Copy code output and paste to developer console:
Print output:
```javascript
alert(JSON.stringify(new HelloWorld(document).parse()));
```

You can use any html source:
- parse from html files
- parse from http responses
- parse from browsers: playwright, selenium, chrome-cdp, etc.
- call curl in shell and parse STDIN
- use in STDIN pipelines with third-party tools like [projectdiscovery/httpx](https://github.com/projectdiscovery/httpx)
## See also
- [Brief](docs/briefing.md) about css selectors and regular expressions.
- [Quickstart](docs/quickstart.md) about css selectors and regular expressions.
- [Tutorial](docs/tutorial.md) basic usage of ssc-gen
- [AST reference](docs/ast_reference.md) about generation code from AST
Raw data
{
"_id": null,
"home_page": null,
"name": "ssc_codegen",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/b2/c4/801fbec102ea6972f477bba47261310086bfbfd8d4b5067f5b83414ca81c/ssc_codegen-0.10.1.tar.gz",
"platform": null,
"description": "# Selector Schema codegen\n\n## Introduction\n\nssc-gen - a python-based DSL to describe parsers for html documents, which is translated into a standalone parsing module\n\n### For a better experience using this library, you should know:\n\n- HTML CSS selectors (CSS3 standard min), Xpath\n- regular expressions (PCRE)\n\n### Project solving next problems:\n\n- designed for SSR (server-side-render) html pages parsers, **NOT FOR REST-API, GRAPHQL ENDPOINTS**\n- decrease boilerplate code\n- generates independent modules from the project that can be reused.\n- generates docstring documentation and the signature of the parser output.\n- for a better IDE experience, generates a typedefs, type annotations (if the target programming language supports it).\n- support annotation and parsing of JSON-like strings from a document\n- AST API codegen for developing a converter for parsing\n\n## Support converters\n\nCurrent support converters\n\n| Language | HTML parser lib + dependencies | XPath | CSS3 | CSS4 | Generated annotations, types, structs | formatter dependency |\n| ---------------------------------------- | -------------------------------------- | ----- | ---- | ---- | ------------------------------------- | -------------------- |\n| Python (3.8+) | bs4, lxml | N | Y | Y | TypedDict`1`, list, dict | ruff |\n| ... | parsel | Y | Y | N | ... | ... |\n| ... | selectolax (lexbor) | N | Y | N | ... | ... |\n| ... | lxml | Y | Y | N | ... | ... 
|\n| js (ES6)`2` | pure (firefox/chrome extension/nodejs) | Y | Y | Y | Array, Map`3` | prettier |\n| go (1.10+) **(UNSTABLE)** | goquery, gjson (`4`) | N | Y | N | struct(+json anchors), array, map | gofmt |\n| lua (5.2+), luajit(2+) **(UNSTABLE)**`5` | lua-htmlparser, lrexlib(opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |\n\n- **CSS3** means support next selectors:\n - basic: (`tag`, `.class`, `#id`, `tag1,tag2`)\n - combined: (`div p`, `ul > li`, `h2 +p`, `title ~head`)\n - attribute: (`a[href]`, `input[type='text']`, `a[href*='...']`, ...)\n - CSS3 pseudo classes: (`:nth-child(n)`, `:first-child`, `:last-child`)\n- **CSS4** means support next selectors:\n - `:nth-of-type()`, `:where()`, `:is()`, `:not()` etc\n- `1`this annotation type was deliberately chosen as a compromise reasons:\n Python has many ways of serialization: `namedtuple, dataclass, attrs, pydantic, msgspec, etc`\n - TypedDict is like a build-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.\n- `2`ES8 standart required if needed use PCRE `re.S | re.DOTALL` flag\n- `3`js exclude build-in serialization methods, used standard Array and Map types. Focus on the singanutur documentation!\n- `4`golang has not been tested much, there may be issues\n- **formatter dependency** - optional dependency for prettify and fix codestyle\n\n- `5`lua\n - Experimental Research PoC, performance and stability are not guaranteed\n - Priority on generation to pure lua without C-libs dependencies. 
using [mva/htmlparser](https://luarocks.org/modules/mva/htmlparser) and [dhkolf/dkjson](https://luarocks.org/modules/dhkolf/dkjson)\n - Translates unsupported CSS3 selectors into the equivalent in the form of function calls:\n - for example, `div +p` is equivalent to `CssExt.combine_plus(root:select(\"div\"), \"p\")`\n - Translates PCRE regex to [string pattern matching](https://www.lua.org/manual/5.4/manual.html#6.4.1) (with restrictions) for more information in [lua_re_compat.py ](ssc_codegen/converters/templates/lua_re_compat.py)\n\n### Limitations\n\nFor maximum portability of the configuration to the target language:\n\n- If possible, use CSS selectors: they are guaranteed to be converted to XPATH\n- Unlike javascript, most html parse libs implement [CSS3 selectors standard](https://www.w3.org/TR/selectors-3/). They may not fully implement the functionality!\n Check the html parser lib documentation aboud CSS selectors before implement code. Examples:\n 1. Several libs not support `+` operations (eg: [selectolax(modest)](https://github.com/rushter/selectolax), [dart.universal_html](https://pub.dev/packages/universal_html))\n 2. For research purpose, lua_htmlparser include converter for unsupported CSS3 query syntax\n2. HTML parser libs maybe not supports attribute selectors: `*=`, `~=`, `|=`, `^=`, `$=`\n3. 
Several libs not support pseudo classes (eg: standard [dart.html](https://dart.dev/libraries/dart-html) lib miss this feature).\n\n## Getting started\n\nssc_gen required python 3.10 version or higher\n\n### Install\n\npip:\n\n```shell\npip install ssc_codegen\n```\n\nuv:\n\n```shell\nuv pip install ssc_codegen\n```\n\n## Example\n\n### Create a file `schema.py` with:\n\n```python\nfrom ssc_codegen import ItemSchema, D\n\nclass HelloWorld(ItemSchema):\n title = D().css('title').text()\n a_hrefs = D().css_all('a').attr('href')\n```\n\n### try it in cli\n\n> [!note]\n> this tools developed for testing purposes, not for web-scraping tasks\n\n### eval from file\n\nDownload any html file and pass as argument:\n\n```shell\nssc-gen parse-from-file index.html -t schema.py:HelloWorld\n```\n\nShort options descriptions:\n\n- `-t --target` - config schema file and class from where to start the parser\n\n\n\n### send GET request to url and parse response\n\n```shell\nssc-gen parse-from-url https://example.com -t schema.py:HelloWorld\n```\n\n\n\n### send request via Chromium browser (CDP protocol)\n\n```shell\nssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld\n```\n\n> [!note]\n> if script cannot found chrome executable - provide it manually:\n\n```shell\nssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium\n```\n\n### Convert to code\n\nConvert to code for use in projects:\n\n> [!note]\n> for example, used js: it can be fast test in developer console\n\n```shell\nssc-gen js schema.py -o .\n```\n\nCode output looks like this:\n\n```javascript\n// autogenerated by ssc-gen DO NOT_EDIT\n/***\n *\n * {\n * \"title\": \"String\",\n * \"a_hrefs\": \"Array<String>\"\n * }*/\nclass HelloWorld {\n constructor(doc) {\n if (typeof doc === \"string\") {\n this._doc = new DOMParser().parseFromString(doc, \"text/html\");\n } else if (doc instanceof Document || doc instanceof Element) {\n this._doc = doc;\n } else {\n throw new 
Error(\"Invalid input: Expected a Document, Element, or string\");\n }\n }\n\n _parseTitle(v) {\n let v0 = v.querySelector(\"title\");\n return typeof v0.textContent === \"undefined\"\n ? v0.documentElement.textContent\n : v0.textContent;\n }\n\n _parseAHrefs(v) {\n let v0 = Array.from(v.querySelectorAll(\"a\"));\n return v0.map((e) => e.getAttribute(\"href\"));\n }\n\n parse() {\n return {\n title: this._parseTitle(this._doc),\n a_hrefs: this._parseAHrefs(this._doc),\n };\n }\n}\n```\n\n### Copy code output and past to developer console:\n\nPrint output:\n\n```javascript\nalert(JSON.stringify(new HelloWorld(document).parse()));\n```\n\n\n\nYou can use any html source:\n\n- parse from html files\n- parse from http responses\n- parse from browsers: playwright, selenium, chrome-cdp, etc.\n- call curl in shell and parse STDIN\n- use in STDIN pipelines with third-party tools like [projectdiscovery/httpx](https://github.com/projectdiscovery/httpx)\n\n## See also\n\n- [Brief](docs/briefing.md) about css selectors and regular expressions.\n- [Quickstart](docs/quickstart.md) about css selectors and regular expressions.\n- [Tutorial](docs/tutorial.md) basic usage ssc-gen\n- [AST reference](docs/ast_reference.md) about generation code from AST\n",
"bugtrack_url": null,
"license": null,
"summary": "Python-dsl code converter to html parser for web scraping ",
"version": "0.10.1",
"project_urls": {
"Documentation": "https://github.com/vypivshiy/selector_schema_codegen#readme",
"Examples": "https://github.com/vypivshiy/selector_schema_codegen/examples",
"Issues": "https://github.com/vypivshiy/selector_schema_codegen/issues",
"Source": "https://github.com/vypivshiy/selector_schema_codegen"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ec3dae014000d3207ea776544cc219d6fb03a276fbf9ed4c3b65b3cb227a4390",
"md5": "493ddc0fbfd9a6929fa0c5baee656644",
"sha256": "84936548a7ba01e83d36e66fc44f227617967553666e2456928192f04fd8dd8e"
},
"downloads": -1,
"filename": "ssc_codegen-0.10.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "493ddc0fbfd9a6929fa0c5baee656644",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 125793,
"upload_time": "2025-09-16T05:19:33",
"upload_time_iso_8601": "2025-09-16T05:19:33.811457Z",
"url": "https://files.pythonhosted.org/packages/ec/3d/ae014000d3207ea776544cc219d6fb03a276fbf9ed4c3b65b3cb227a4390/ssc_codegen-0.10.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b2c4801fbec102ea6972f477bba47261310086bfbfd8d4b5067f5b83414ca81c",
"md5": "a289cbeefa24a97d800b2348ccbdddbd",
"sha256": "82252f17330c7b6449b408ce09be8d2e6384611bb5250ee881b982bb355e8e0a"
},
"downloads": -1,
"filename": "ssc_codegen-0.10.1.tar.gz",
"has_sig": false,
"md5_digest": "a289cbeefa24a97d800b2348ccbdddbd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 102516,
"upload_time": "2025-09-16T05:19:35",
"upload_time_iso_8601": "2025-09-16T05:19:35.619481Z",
"url": "https://files.pythonhosted.org/packages/b2/c4/801fbec102ea6972f477bba47261310086bfbfd8d4b5067f5b83414ca81c/ssc_codegen-0.10.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-16 05:19:35",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "vypivshiy",
"github_project": "selector_schema_codegen#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "ssc_codegen"
}