housaku

Name	housaku
Version	0.7.11
home_page	None
Summary	A personal search engine built on top of SQLite's FTS5.
upload_time	2024-12-02 10:26:58
maintainer	None
docs_url	None
author	None
requires_python	>=3.13
license	MIT
keywords	bm25 cli fts rss rss parsing search search engine sqlite tui web web crawling

# Housaku (豊作 「ほうさく」)

Housaku is a personal search engine built on top of SQLite's FTS5 that lets you query your documents, books, PDFs, favorite feeds and more all in one place.

![Screenshot of the TUI](./.github/screenshot_tui.png)

> Housaku is in early development, so you can expect some incompatible changes and other minor issues when updating. Once version `v1.0.0` is reached, my goal is to focus on stability and avoiding breaking changes as much as possible.

## Features

- Support for multiple file formats like `.txt`, `.md`, `.csv`, `.pdf`, `.epub`, `.docx`, `.xlsx` and `.pptx`.
- Support for parsing and indexing RSS/Atom feeds.
- Parallel file processing.
- Concurrent feed processing.
- Web UI.
- Modern TUI with support for theming.
- Easy-to-use CLI.
- Relevant results powered by the BM25 algorithm.
- Support for incremental updates.

> Support for file formats like `.odt` is coming, as well as the possibility of indexing posts from Bluesky and Mastodon feeds.

## Stack

- [SQLite's FTS5 extension](https://sqlite.org/fts5.html).
- [SQLite](https://www.sqlite.org/index.html).
- [Starlette](https://www.starlette.io).
- [aiohttp](https://docs.aiohttp.org/en/stable/index.html).
- [click](https://click.palletsprojects.com/en/stable/).
- [feedparser](https://feedparser.readthedocs.io/en/latest/).
- [pydantic](https://docs.pydantic.dev/latest/).
- [pymupdf](https://pymupdf.readthedocs.io/en/latest/).
- [rich](https://rich.readthedocs.io/en/stable/introduction.html).
- [textual](https://www.textualize.io).

## Motivation

The first reason I decided to start working on Housaku was to learn more about the basics of full-text search and how search engines operate under the hood. In fact, if you look at the commit history, you can see that initially all the parsing, tokenization and TF-IDF calculations were handled "manually" before I switched to SQLite's FTS5 for performance reasons.

The second and final reason was the large volume of documents I was managing. I have ~5,000 notes in Obsidian, formatted in Markdown, a couple of hundred books in my Calibre library, mainly in `.epub`, a significant number of PDFs, and PowerPoint presentations from my computer science degree at UNED. I also have a vast collection of RSS feeds that I have been subscribed to for a long time. So, I wanted an efficient and easy way to search through all of these documents without having to worry about where each of them was located or what format it was in.

## Installation

The recommended way of installing Housaku is by using [uv](https://github.com/astral-sh/uv):

```bash
uv tool install --python 3.13 housaku
```

Now you can run:

```bash
housaku --help
```

To upgrade, use:

```bash
uv tool upgrade housaku

# Or

uv tool upgrade housaku --reinstall
```

### Using `pipx`

To install Housaku using `pipx`, simply run:

```bash
pipx install housaku
```

> Just remember that the minimum required version of Python is 3.13 (`>=3.13`).

### Via `pip`

You can also install Housaku using `pip`, but the exact command will depend on how your environment is set up. In most cases, the command should look something like this:

```bash
python3 -m pip install housaku
```

### Configuration

Before you start using Housaku, the first step is to edit the `config.toml` file located at `$XDG_CONFIG_HOME/housaku/config.toml`. This file is generated automatically the first time you run `housaku` and will look something like this:

```toml
# Welcome! This is the configuration file for Housaku.

# Available themes include:
# - "dracula"
# - "textual-dark"
# - "textual-light"
# - "nord"
# - "gruvbox"
# - "catppuccin-mocha"
# - "textual-ansi"
# - "tokyo-night"
# - "monokai"
# - "flexoki"
# - "catppuccin-latte"
# - "solarized-light"

theme = "dracula"

[files]
# Directories to include for indexing.
# Example: include = ["/home/<user>/documents/notes"]
include = []

# Patterns to exclude from the indexing
# Example: exclude = ["*.tmp", "backup", "*.png"]
exclude = []

[feeds]
# List of RSS/Atom feeds to index
# Example: urls = ["https://example.com/feed", "https://anotherexample.com/rss"]
urls = []
```

> The folder that holds the configuration file as well as the SQLite database is determined by the `get_app_dir` utility. You can read more about it [here](https://click.palletsprojects.com/en/stable/api/#click.get_app_dir).

An easy way to open your `config.toml` file is to run the following command:

```bash
housaku config
```

## Usage

### Help

The best way to see which commands are available is to run `housaku` with the `--help` flag.

```bash
housaku --help
```

You can also learn more about what a specific command does by running:

```bash
housaku [command] --help

# For example:

housaku index --help
```

### Config

The `config` command simply opens the `config.toml` file in your default editor.

```bash
housaku config
```

### Index

After you have configured the list of directories containing the documents you want to index, as well as the list of feeds from which you want to fetch the posts, you can run:

```bash
housaku index
```

#### Filtering content

To index only your files, use the following command:

```bash
housaku index --include files
```

To index only your feeds:

```bash
housaku index --include feeds
```

> You can specify both options to index files and feeds together, but this is equivalent to simply running the `index` command without any options.

#### Parallelism

You can also change the number of threads being used when indexing your files and documents:

```bash
housaku index -t 8
```

> My recommendation is to stick with the default number of threads.

At the moment, files are indexed in parallel using multi-threading, which makes the process faster but also introduces some complications. For example, cancelling the indexing halfway with `ctrl+c` will cause some threads to exit while others continue running in the background and then fail.

### Search

#### The `search` command

The simplest way to start searching your documents and posts is by using the `search` command:

```bash
housaku search --query "Django AND Postgres"
```

You can also limit the number of results by using the `--limit` option which, by default, is set to 10:

```bash
housaku search --query "Django AND Postgres" --limit 20
```

If you don't specify a query using the `--query/-q` option, you will be prompted to enter one.

> You can learn more about the query syntax [here](https://sqlite.org/fts5.html#full_text_query_syntax).
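
If you want to experiment with the query syntax outside of Housaku, the sketch below builds a tiny FTS5 table in memory and runs the same kind of query, ordered by BM25 score. The table name `docs`, its columns and the sample rows are assumptions made for this example and say nothing about Housaku's actual schema. It also assumes that your Python's `sqlite3` module was built with FTS5 enabled.

```python
import sqlite3

# In-memory database with a hypothetical FTS5 table (requires FTS5 support).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Deploy notes", "Running Django with Postgres in production"),
        ("ORM basics", "Django models and querysets"),
        ("Backups", "Postgres dump and restore"),
    ],
)


def search(query: str, limit: int = 10) -> list[str]:
    """Return matching titles, best BM25 score first."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]


print(search("Django AND Postgres"))  # ['Deploy notes']
print(search("Django OR Postgres", limit=2))
```

`bm25()` returns lower values for better matches, which is why the results are sorted in ascending order.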

#### Using the TUI

My favorite and recommended way to search is by using the TUI. To start it, just run:

```bash
housaku tui
```

> To exit the TUI, press `ctrl + q`. To open a search result, press `Enter` while the result is highlighted.

#### Using the Web UI

Housaku also has a very simple Web UI that you can access by running:

```bash
housaku web
```

![Screenshot of the Web](./.github/screenshot_web.png)

> The default port is `4242`.

This search method has some limitations. For example, you can't open results that link to your local documents.

### `vacuum` and `purge`

The `vacuum` command is used to optimize the SQLite database by reclaiming unused space and improving performance. To run the vacuum command, simply execute:

```bash
housaku vacuum
```

The `purge` command is used to completely clear all data from the database. This command is useful when you want to reset the database to its initial state.

```bash
housaku purge
```

> Be careful when using either of these commands, since they directly affect the data you hold in your database.
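
As a rough illustration of what "reclaiming unused space" means in SQLite, the sketch below fills a temporary database, deletes every row and then runs `VACUUM`. It is only a demonstration of SQLite's behaviour under default settings; the `documents` table is hypothetical and the code is not taken from Housaku's `vacuum` or `purge` commands.

```python
import os
import sqlite3
import tempfile

# Throwaway database file with a hypothetical table.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [("x" * 1000,) for _ in range(2000)],
)
conn.commit()
size_full = os.path.getsize(db_path)

# Deleting rows frees pages inside the file, but the file itself keeps its size.
conn.execute("DELETE FROM documents")
conn.commit()
size_after_delete = os.path.getsize(db_path)

# VACUUM rebuilds the database file and returns the free pages to the OS.
conn.execute("VACUUM")
size_after_vacuum = os.path.getsize(db_path)
conn.close()

print(size_full, size_after_delete, size_after_vacuum)
```

With SQLite's default settings (no auto-vacuum), the file only shrinks after the `VACUUM` step.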

## Contributing

Contributions are welcome! If you have any suggestions, feel free to open an issue.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "housaku",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.13",
    "maintainer_email": null,
    "keywords": "bm25, cli, fts, rss, rss parsing, search, search engine, sqlite, tui, web, web crawling",
    "author": null,
    "author_email": "dnlzrgz <contact@dnlzrgz.com>",
    "download_url": "https://files.pythonhosted.org/packages/d7/c0/f67c333951d0d2557b7c1540a22cb58b958807b110af5c8c61308ed4834b/housaku-0.7.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A personal search engine built on top of SQLite's FTS5.",
    "version": "0.7.11",
    "project_urls": {
        "homepage": "https://dnlzrgz.com/projects/housaku/",
        "issues": "https://github.com/dnlzrgz/housaku/issues",
        "releases": "https://github.com/dnlzrgz/housaku/releases",
        "source": "https://github.com/dnlzrgz/housaku"
    },
    "split_keywords": [
        "bm25",
        " cli",
        " fts",
        " rss",
        " rss parsing",
        " search",
        " search engine",
        " sqlite",
        " tui",
        " web",
        " web crawling"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "136568bdfbf60c6babd1fe68f69e326b9ab36db0d9685034f9f956c21b195917",
                "md5": "26e0a395d88a5cc183ac763ce2d8a5dc",
                "sha256": "1df45cbcddd02dec5b05c1964b2aa490b73540960bd658c35d8cbb44d21d7a43"
            },
            "downloads": -1,
            "filename": "housaku-0.7.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "26e0a395d88a5cc183ac763ce2d8a5dc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.13",
            "size": 38596,
            "upload_time": "2024-12-02T10:26:55",
            "upload_time_iso_8601": "2024-12-02T10:26:55.983804Z",
            "url": "https://files.pythonhosted.org/packages/13/65/68bdfbf60c6babd1fe68f69e326b9ab36db0d9685034f9f956c21b195917/housaku-0.7.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d7c0f67c333951d0d2557b7c1540a22cb58b958807b110af5c8c61308ed4834b",
                "md5": "f591dff9e92a050418240ca1b1ce7eb2",
                "sha256": "5c3488c8aa5f486855d5f9e62e891f0f83e4ba0b18d9df0704082e169c8a0b29"
            },
            "downloads": -1,
            "filename": "housaku-0.7.11.tar.gz",
            "has_sig": false,
            "md5_digest": "f591dff9e92a050418240ca1b1ce7eb2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.13",
            "size": 1515421,
            "upload_time": "2024-12-02T10:26:58",
            "upload_time_iso_8601": "2024-12-02T10:26:58.142762Z",
            "url": "https://files.pythonhosted.org/packages/d7/c0/f67c333951d0d2557b7c1540a22cb58b958807b110af5c8c61308ed4834b/housaku-0.7.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-02 10:26:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dnlzrgz",
    "github_project": "housaku",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "housaku"
}
