filecabinet


Name: filecabinet
Version: 2.1.0
Home page: https://vonshednob.cc/filecabinet
Summary: A local, offline document archive
Upload time: 2023-06-23 11:39:32
Author: R
Requires Python: >=3.7
License: Copyright 2023 Robert Labudda All Rights Reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistribution of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistribution in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Robert Labudda or the names of contributors may be used to endorse or promote products derived from this software without specific prior written permission. This software is provided "AS IS," without a warranty of any kind. ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE HEREBY EXCLUDED. ROBERT LABUDDA ("RL") AND ITS LICENSORS SHALL NOT BE LIABLE FOR ANY DAMAGES SUFFERED BY LICENSEE AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THIS SOFTWARE OR ITS DERIVATIVES. IN NO EVENT WILL RL OR ITS LICENSORS BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS SOFTWARE, EVEN IF RL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
# filecabinet

filecabinet is a minimal document management system for your computer. It has
metadata per document and supports fulltext search in various document types.


# Installing

The easiest way to install is to use `pip`:

```bash
    pip install filecabinet
```

Alternatively you can get the source code at
[codeberg](https://codeberg.org/vonshednob/filecabinet):

```bash
    git clone https://codeberg.org/vonshednob/filecabinet
    pip install ./filecabinet
```


## Requirements

`filecabinet` **requires** the [xapian python bindings](https://xapian.org/docs/bindings/python/),
which cannot be installed through `pip`!

The other required dependencies are installed automatically:

 * [metaindex](https://codeberg.org/vonshednob/metaindex)
 * [Pillow](https://pypi.org/project/Pillow/)
 * [PyPDF](https://pypi.org/project/pypdf/)
 * [PyYAML](https://pypi.org/project/PyYAML/)

Even though optional, I strongly recommend installing [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
to enable fulltext search in scanned documents.


# Quick start

To initialize your file cabinet, run `filecabinet init` and provide a new
path where you would like to store your documents:

```bash
    filecabinet init ~/Documents/cabinet
```

Now you can either copy files into `~/Documents/cabinet/inbox` and
run

```bash
    filecabinet pickup
```

to process them, or add files manually via

```bash
    filecabinet add ~/some_scanned_document.jpg
```

To get a basic overview of documents, you can use the Shell.


# Workflow / Use cases

Here’s the usual workflow with filecabinet:

 1. Put some documents (PDFs, scanned documents, etc.) into the `inbox`
    folder of your cabinet
 2. Run `filecabinet pickup`
 3. List all new documents with `filecabinet list new`

Other use cases are:

 * **Search** for a specific document with `filecabinet find "searchterm" "other search term"`
 * **Edit** the metadata of a document through the shell `filecabinet shell` (see next section)


# Shell

There’s a basic shell that allows you to inspect indexed documents, edit
their metadata (by means of an external text editor), or view the
documents.

To open the shell, run

```bash
    filecabinet shell
```

Try `help` inside the shell to see what your options are.


## Metadata editing

If you want to use a specific text editor to modify metadata, consider
updating your configuration file’s `Shell` section and add a
`document_editor`, like this:

```ini
    [Shell]
    editor = subl -w
```

In this example we set up Sublime Text as the external editor. Note that the
`-w` option is necessary to make filecabinet wait until you’re done editing
the file before returning to the shell.  
Visual Studio Code uses the `-w` or `--wait` flag to accomplish the same
behaviour.


# Searching

Searching for **tags** is case-insensitive and uses the `tag:` prefix.
For example, if you're looking for a document that's tagged with *banana*, you
can search for it with `tag:banana`.

Searching **new** documents is accomplished by searching for `tag:new`.
If you only want to find documents that are not new, you can also
search for `-tag:new`. Unless specified, a search will ignore whether or not a
document is new.

You can search for any **metadata** value, like *title*, *author*, or *language*,
by searching with the metadata name and a colon like `title:gravity`.

Everything else that does not match the special search terms will be used in
the **fulltext** search.

If you want to search for terms with whitespaces, you can use quotes:
`title:"brain surgery"`.

**Example:**

Looking for a document whose title contains "brain", whose author is "Gumby", and
whose date was set to some time before August 2015: `title:brain author:gumby date:2015-08-01`

Looking for a newly added document with the title "The Larch": `title:larch tag:new`
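
If you assemble search strings in a script, the quoting rule above can be sketched in a few lines of Python. This is only an illustration: `build_query` is a hypothetical helper and not part of filecabinet, and it assumes that fulltext terms are passed through unchanged while metadata values containing whitespace are wrapped in quotes.

```python
def build_query(*fulltext, **metadata):
    """Build a search string from fulltext terms and metadata terms.

    Hypothetical helper, not part of filecabinet.
    """
    def quote(value):
        value = str(value)
        # Values containing whitespace are wrapped in quotes
        if any(ch.isspace() for ch in value):
            return f'"{value}"'
        return value

    # Metadata terms take the form name:value
    parts = [f"{name}:{quote(value)}" for name, value in metadata.items()]
    # Everything else is left as-is for the fulltext search
    parts.extend(str(term) for term in fulltext)
    return " ".join(parts)


print(build_query(title="brain surgery"))  # title:"brain surgery"
print(build_query("larch", tag="new"))     # tag:new larch
```

The resulting string can then be handed to `filecabinet find` as a single argument.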


# Grouping of pages

Sometimes you will have a scanned document in the form of multiple pages, each
page a `.jpg` file, like `page1.jpg`, `page2.jpg`, `page3.jpg`.

Of course all these pages form the same document.

To tell filecabinet that these files all belong to the same document, you
can put them in a folder inside the inbox before running `pickup`:

 * `inbox/doc/page1.jpg`
 * `inbox/doc/page2.jpg`
 * `inbox/doc/page3.jpg`

This will tell filecabinet that they all belong to the same
document.

This is also where you can hint at the language of the document
for OCR (see *Language hinting* in the next section) by calling the folder,
for example, `doc-nl` to indicate that all pages are written in Dutch.
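
Collecting the pages into such a folder can be scripted. The following is a minimal sketch using only the Python standard library; `group_pages` and its parameters are hypothetical names chosen for this example, and the function does nothing more than move files into a subfolder of the inbox.

```python
import shutil
from pathlib import Path


def group_pages(pages, inbox, name, language=None):
    """Move page files into one inbox subfolder so they form one document.

    Hypothetical helper, not part of filecabinet.
    """
    # An optional language hint becomes a "-<language>" folder suffix
    folder_name = f"{name}-{language}" if language else name
    folder = Path(inbox) / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    for page in pages:
        shutil.move(str(page), str(folder / Path(page).name))
    return folder


# Example call (paths are placeholders):
# group_pages(["page1.jpg", "page2.jpg", "page3.jpg"],
#             Path.home() / "Documents" / "cabinet" / "inbox",
#             "doc", language="nl")
```

After that, running `filecabinet pickup` processes the folder as described above.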


# OCR

filecabinet can use Tesseract OCR to do character recognition on pictures and
scanned PDFs, so you can search the text of images.

In order for that to work, you have to install Tesseract and some language
packages, depending on the languages of the documents you wish to scan.

If you don't have Tesseract OCR installed, filecabinet will still work, but
it will be much less useful.


## Language hinting

You can tell filecabinet which language a document is written in while it is
still in the inbox by adding the language as a suffix to the file name: a
hyphen followed by the language code (ISO 639).

A few examples will help. Consider these files:

 * `page-1.jpg`
 * `contract.png`

Suppose your default language is set to English (`default-language = eng` in
the configuration file); `page-1.jpg` is in English but `contract.png` is
in German.

OCR will likely have difficulties with letters like `öäü` in `contract.png`
unless you tell it what language the document is in:

 * `contract-ger.png`

`ger` is one of the ISO 639 language codes for German (others are `de` and
`deu`; see [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) for the long listing).

With this `-ger` suffix, filecabinet will use the correct language package
(if you have it installed) and the OCR will yield much better results.
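
Renaming files by hand gets tedious for larger batches. As a rough sketch, the suffix can be computed with `pathlib`; `with_language_hint` is a hypothetical helper name and only builds the new file name, it does not rename anything on disk.

```python
from pathlib import Path


def with_language_hint(path, language):
    """Return the path with '-<language>' inserted before the file extension.

    Hypothetical helper, not part of filecabinet.
    """
    path = Path(path)
    return path.with_name(f"{path.stem}-{language}{path.suffix}")


print(with_language_hint("contract.png", "ger"))  # contract-ger.png
```

To apply the hint, pass the result to `Path.rename` before running `filecabinet pickup`.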


# Rule based tagging

By using metaindex, filecabinet inherits its rule-based tagging. This allows
you to automatically add metadata tags to documents based on their text
(which might have come from OCR).

Rules are defined in text files and you have to point filecabinet to the
rule files that you want it to use. To do that, add a section `[Rules]` to
your configuration file (usually at
`~/.config/filecabinet/filecabinet.conf`) and list your rule files like
this:

```ini
    [Rules]
    base = ~/.config/filecabinet/basic_rules.txt
    companies = ~/Document/company_rules.txt
```

The names (before the `=`) are somewhat free-form descriptors.
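
A misspelled path in the `[Rules]` section is easy to overlook. The sketch below lists the configured rule files and reports whether each one exists. It assumes the configuration file can be read by Python's `configparser`, which this document does not confirm; `rule_files` is a hypothetical helper and not part of filecabinet.

```python
import configparser
import os.path
from pathlib import Path


def rule_files(config_text):
    """Map each name in the [Rules] section to its expanded file path.

    Hypothetical helper; assumes the text is configparser-compatible.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("Rules"):
        return {}
    # Expand "~" so the paths can be checked on disk
    return {name: Path(os.path.expanduser(value))
            for name, value in parser["Rules"].items()}


EXAMPLE = """
[Rules]
base = ~/.config/filecabinet/basic_rules.txt
companies = ~/Document/company_rules.txt
"""

for name, path in rule_files(EXAMPLE).items():
    status = "found" if path.is_file() else "missing"
    print(f"{name}: {path} ({status})")
```

Note that `configparser` lowercases option names by default, so the names on the left of the `=` come back in lower case.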

To understand how to write these rule files, please have a look at the
[metaindex documentation](https://codeberg.org/vonshednob/metaindex/src/branch/main/doc/source/indexers.rst#rule-based-indexer).

To test your rules on documents, you can use the `filecabinet test-rules`
command. It will run all indexers on a file and show you what tags have
been found by your rules.

When using `test-rules` the tested document will not be added to your
cabinet.


# Cabinet Directory Structure

Assuming a cabinet is set up at `~/cabinet`, the directory structure is:

```
    ~/cabinet
     │
     ├── inbox
     │
     ├── metaindex.conf
     │
     ├── metaindex.log
     │
     └── documents
          │
          └── <partial document id>
               │
               └── <full document id>
                    │
                    ├── <document id>.yaml
                    │
                    ├── <document id>.<suffix>
                    │
                    └── <document id>.txt
```

 * `inbox` is processed (and emptied) when `filecabinet pickup` is run
 * `documents` contains the documents
 * `<document id>.yaml` contains the metadata
 * `<document id>.<suffix>` is the original document (usually a PDF)
 * `<document id>.txt` is the extracted full text, if it could be extracted
 * `metaindex.conf` is the configuration file for filecabinet's metaindexserver
 * `metaindex.log` is the log file of filecabinet's metaindexserver
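
Because the layout is plain files and folders, a cabinet can be inspected without filecabinet itself. The following sketch walks the `documents` folder as laid out above and lists each document id with its metadata file and, if present, its extracted text. `list_documents` is a hypothetical helper; it relies only on the directory structure shown here and does not parse the YAML.

```python
from pathlib import Path


def list_documents(cabinet):
    """Yield (document id, metadata path, fulltext path or None) per document.

    Hypothetical helper, not part of filecabinet.
    """
    documents = Path(cabinet) / "documents"
    # Layout: documents/<partial document id>/<full document id>/<document id>.yaml
    for metadata in sorted(documents.glob("*/*/*.yaml")):
        fulltext = metadata.with_suffix(".txt")
        yield metadata.stem, metadata, fulltext if fulltext.is_file() else None


# Example call (the cabinet path is a placeholder):
# for doc_id, metadata, fulltext in list_documents(Path.home() / "cabinet"):
#     print(doc_id, "has fulltext" if fulltext else "no fulltext")
```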


# Configuration

filecabinet itself as well as each individual cabinet can be configured
through the user’s configuration file (usually in `~/.config/filecabinet/filecabinet.conf`).

See `example.conf` for all configuration options!


# Usage from Python

To use `filecabinet` from Python, you can use this boilerplate:

```python
    from filecabinet import Manager


    manager = Manager()
    manager.launch_server()

    session = manager.new_session()
```

`session` will be an instance of `Session` which, together with `manager`,
allows manipulation of metadata and querying of documents.


            
