| Field | Value |
|---|---|
| Name | blogsimi |
| Version | 0.0.6 |
| Summary | Semantic related-article indexer and generator for static blogs. |
| Author | Thomas Spielauer |
| Upload time | 2025-10-25 08:47:06 |
| Requires Python | >=3.9 |
| Keywords | semantic-search, blog, pgvector, recommendations |

# Blog similarity index

`blogsimi` builds semantically related article recommendations for static blogs.
It walks your rendered site, turns the content into embeddings, stores them in
PostgreSQL via [pgvector](https://github.com/pgvector/pgvector), and then exports
a recommendation JSON that you can drop straight into Jekyll or any static site
generator.  The design and rationale are described in [a blog post](https://www.tspi.at/2025/09/25/simi.html).

## Features

- Reads the rendered HTML, keeps only the configured content ids or classes (dropping the surrounding boilerplate), and
  chunks the content into embedding-friendly blocks.
- Supports Ollama (default) and OpenAI embedding providers. Switching providers is a config change, but it
  requires rebuilding the index.
- Persists embeddings, metadata, and recommendations in PostgreSQL and queries them with pgvector distance operators.
  Pages are only re-indexed when their content has changed.
- Exposes everything through a small CLI.
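
To illustrate the chunking step mentioned in the feature list, here is a minimal sketch of overlapping windows. It is not the package's implementation: `chunk_tokens` and `chunk_text` are hypothetical helpers, and whitespace-separated words stand in for tokens, which is an assumption (the actual tokenization is not described here). The default sizes are taken from the `chunk` section of the configuration.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 800,
                 overlap_tokens: int = 100) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens that overlap by overlap_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    step = max_tokens - overlap_tokens  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def chunk_text(text: str, max_tokens: int = 800,
               overlap_tokens: int = 100) -> List[str]:
    # Assumption: whitespace-separated words approximate tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, max_tokens, overlap_tokens)]
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding some text twice.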

## Installation

The project can be installed from [PyPI](https://pypi.org/project/blogsimi/):

```bash
pip install blogsimi                # from PyPI
# or
pip install .                       # from a local checkout
```

## Configuration

Configuration lives in `~/.config/blogsimilarity.cfg` by default (overridable
with `--config`).  The file is JSON and mirrors the defaults baked into the package:

```json
{
  "site_root": "_site",
  "data_out": "_data/related.json",
  "exclude_globs": ["tags/**", "drafts/**", "private/**", "admin/**"],
  "content_ids": ["content"],
  "neighbors": {
    "ksample": 16,
    "k": 8,
    "temperature": 0.7,
    "pin_top": true,
    "seed": null,
    "seealso": 4
  },
  "chunk": {
    "max_tokens": 800,
    "overlap_tokens": 100
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "ollama_url": "http://127.0.0.1:11434/api/embeddings",
    "openai_api_base": "https://api.openai.com/v1/embeddings",
    "openai_api_key_env": "OPENAI_API_KEY"
  },
  "db": {
    "host": "127.0.0.1",
    "port": 5432,
    "user": "blog",
    "password": "blog",
    "dbname": "blog"
  },
  "strip_image_hosts": null
}
```
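
Since the file mirrors the built-in defaults, one plausible way to handle a partial user file is to merge it over those defaults section by section. The sketch below shows that idea only. `DEFAULTS`, `deep_merge`, and `load_config` are hypothetical names, `DEFAULTS` is an excerpt of the settings shown above, and whether `blogsimi` itself merges nested sections or expects a complete file is an assumption.

```python
import json
from copy import deepcopy
from pathlib import Path
from typing import Any, Dict

# Excerpt of the defaults shown above; the real package defines more keys.
DEFAULTS: Dict[str, Any] = {
    "site_root": "_site",
    "chunk": {"max_tokens": 800, "overlap_tokens": 100},
    "embedding": {"provider": "ollama", "model": "nomic-embed-text"},
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def load_config(path: Path) -> Dict[str, Any]:
    """Load a JSON config file and fall back to the defaults for missing keys."""
    if not path.exists():
        return deepcopy(DEFAULTS)
    with path.open(encoding="utf-8") as fh:
        return deep_merge(DEFAULTS, json.load(fh))
```

With this approach, a user file containing only `{"chunk": {"max_tokens": 400}}` would keep `overlap_tokens` at 100 and leave every other section untouched.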

Set `OPENAI_API_KEY` (or whichever environment variable you name in `openai_api_key_env`) when using the
OpenAI provider.  Your PostgreSQL instance must have the `pgvector` extension installed, and a superuser
has to enable the extension manually in the target database. The statements below also create the role and
database used by the default configuration:

```SQL
-- as a PostgreSQL superuser
CREATE ROLE blog LOGIN PASSWORD 'blog';
CREATE DATABASE blog OWNER blog;
\c blog
CREATE EXTENSION IF NOT EXISTS vector;  -- requires superuser or appropriate privileges
```

## CLI Usage

All functionality is exposed via the `blogsimi` command:

- `blogsimi initdb` – create the required tables and infer the embedding dimension from your provider.
  Note that the `vector` extension must already be enabled.
- `blogsimi resetdb` – drop and recreate the tables (useful when switching embedding dimensions).
- `blogsimi index [--page PATH]` – walk the rendered site (defaults to `site_root`), compute
  embeddings where content changed, and persist them.
- `blogsimi genrel [--out PATH]` – produce the recommendation JSON ready for your
  static site.
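
The `index` command only recomputes embeddings where content changed. A common way to detect such changes is to compare a digest of the extracted text against the digest stored from the previous run. The sketch below illustrates that idea under stated assumptions: `content_digest` and `needs_reindex` are hypothetical helpers, SHA-256 is an arbitrary choice, and an in-memory dict stands in for the state that `blogsimi` keeps in PostgreSQL.

```python
import hashlib
from typing import Dict


def content_digest(text: str) -> str:
    """Return a stable fingerprint of the extracted page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(path: str, text: str, known: Dict[str, str]) -> bool:
    """Return True (and record the new digest) when a page is new or has changed."""
    digest = content_digest(text)
    if known.get(path) == digest:
        return False  # unchanged since the last run, skip the embedding call
    known[path] = digest
    return True
```

Hashing the extracted content rather than the whole HTML file means that template-only changes (a new footer, for example) would not trigger a re-embedding.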

A typical run after rendering your blog might look like:

```bash
blogsimi index --page _site
blogsimi genrel --out _data/related.json
```
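
If the two commands above are driven from a build script, a small wrapper can make sure `genrel` does not run after a failed `index`. This is only a sketch: `build_commands` and `run_pipeline` are hypothetical helper names, and the only CLI invocations used are the two shown above.

```python
import subprocess
from typing import List, Sequence


def build_commands(site_root: str = "_site",
                   out: str = "_data/related.json") -> List[List[str]]:
    """Return the index and genrel invocations in the order they should run."""
    return [
        ["blogsimi", "index", "--page", site_root],
        ["blogsimi", "genrel", "--out", out],
    ]


def run_pipeline(commands: Sequence[Sequence[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later commands are never started after a failure.
        subprocess.run(list(cmd), check=True)
```

Calling `run_pipeline(build_commands())` after the site has been rendered runs both steps with the default paths.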

## Development

The package is available on [PyPI](https://pypi.org/project/blogsimi/) and can be installed with:

```
pip install blogsimi
```

The repository uses a `src/` layout.  For local development, install in editable mode:

```bash
pip install -e .
PYTHONPATH=src python -m blogsimi.cli --help
```

## License

This project is released under the MIT License.

            
