tabeltekstilo

Name	tabeltekstilo
Version	1.1.0
Summary	multi-purpose tool for manipulating text in tabular data format
upload_time	2024-09-21 18:18:10
author	hugues de keyzer
requires_python	<4.0,>=3.9
license	AGPL-3.0-or-later
keywords	alphabetical index indexer pandas spreadsheet tabular text xml

            <!--
SPDX-FileCopyrightText: 2023 hugues de keyzer

SPDX-License-Identifier: AGPL-3.0-or-later
-->

# tabeltekstilo

tabeltekstilo is a multi-purpose tool for manipulating text in tabular data format.

## introduction

text in tabular data format is text formatted as a table (usually stored as a spreadsheet file), where each row of the table contains one word of the text.
one column contains the actual word as it appears in the text, while other columns may contain more information about the word, such as the page and line number where it appears, a cleaned-up form (with uniform casing and no punctuation), its lemma, or its grammatical category.
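
as a rough illustration of this format, the python sketch below turns a short text into one row per word, with a cleaned-up form next to the original word.
the helper `text_to_rows` is hypothetical and not part of tabeltekstilo, and its whitespace-based splitting is an assumption: real texts (with elisions like “l’est”, for example) need more careful splitting, and columns such as the lemma usually have to be filled in separately.

```python
import string


def text_to_rows(text, page=1, line=1):
    """Build one row per word (hypothetical helper, not part of tabeltekstilo)."""
    rows = []
    for word in text.split():
        # cleaned-up form: uniform casing, no surrounding ascii punctuation
        form = word.strip(string.punctuation).lower()
        rows.append({"page": page, "line": line, "word": word, "form": form})
    return rows


rows = text_to_rows("La suno brilas hodiaŭ.", page=42, line=1)
for row in rows:
    print(row)
```

each dictionary corresponds to one row of the table, and each key to one column title.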

from text in that format, tabeltekstilo can generate:

*   an alphabetical dictionary by aggregating columns
*   an alphabetical index (like the ones that appear at the end of books)
*   an xml file

see [the examples section](#examples) below for concrete examples.

## features

general features:

*   alphabetical sorting using the unicode collation algorithm
*   right-to-left text support

dictionary features:

*   dictionary generation
*   multiple column aggregation with custom join string

index features:

*   multi-level index generation
*   support for multiple values in parent columns (for agglutinated forms)
*   multiple reference support (for example: page, line)
*   grouping of identical references with count
*   total count of form occurrences at each parent level
*   filtering with regular expressions

xml features:

*   nesting of forms under multiple levels of parent columns
*   custom attributes on form elements
*   custom root element
*   optional header with custom copyright and licensing information

## usage

tabeltekstilo takes a subcommand, an input filename and an output filename as arguments, as well as some options.
input and output files should be in opendocument (.ods) or office open xml (.xlsx) format.

### dictionary

the minimal usage is:

```
tabeltekstilo dictionary --form-col form --agg-col agg input.ods output.ods
```
where `form` is the title of the column (in `input.ods`) that contains the form that will appear in the dictionary and `agg` is the title of the column (in `input.ods`) that contains the values to aggregate next to the form.

to display a full description of the usage syntax:

```
tabeltekstilo dictionary --help
```

### index

the minimal usage is:

```
tabeltekstilo index --ref-col ref --form-col form input.ods output.ods
```
where `ref` is the title of the column (in `input.ods`) that contains the reference to use in the index (the page number, for example) and `form` is the title of the column (in `input.ods`) that contains the form that will appear in the index.

to display a full description of the usage syntax:

```
tabeltekstilo index --help
```

### xml

the minimal usage is:

```
tabeltekstilo xml input.ods output.xml
```
where `input.ods` is the input file and `output.xml` the output file to generate.

to display a full description of the usage syntax:

```
tabeltekstilo xml --help
```

## examples

### dictionary

let’s take the following example text:

> le reste des avions vola vers l’est. nous avions du retard. c’est ce qu’il reste des vers à propos des vers.

it must first be converted to this format as `input.ods`:

| word    | form   | lemma      | type         |
| ------- | ------ | ---------- | ------------ |
| le      | le     | le (la)    | det_art      |
| reste   | reste  | reste      | noun         |
| des     | des    | de+le (la) | prep+det_art |
| avions  | avions | avion      | noun         |
| vola    | vola   | voler      | verb         |
| vers    | vers   | vers       | prep         |
| l’      | l’     | le (la)    | det_art      |
| est.    | est    | est        | noun         |
| nous    | nous   | nous       | pro_per      |
| avions  | avions | avoir      | verb         |
| du      | du     | de+le (la) | prep+det_art |
| retard. | retard | retard     | noun         |
| c’      | c’     | ce         | pro_dem      |
| est     | est    | être       | verb         |
| ce      | ce     | ce         | pro_dem      |
| qu’     | qu’    | que        | conjs        |
| il      | il     | il         | pro_per      |
| reste   | reste  | rester     | verb         |
| des     | des    | de+le (la) | prep+det_art |
| vers    | vers   | vers       | noun         |
| à       | à      | à          | prep         |
| propos  | propos | propos     | noun         |
| des     | des    | de+le (la) | prep+det_art |
| vers.   | vers   | ver        | noun         |

now, let’s generate the dictionary by calling:

```
tabeltekstilo dictionary --form-col form --agg-col lemma --agg-col type input.ods output.ods
```

this will generate the following table as `output.ods`:

|    | form   | lemma           | type             |
| -- | ------ | --------------- | ---------------- |
| 0  | à      | à               | prep             |
| 1  | avions | avion; avoir    | noun; verb       |
| 2  | c’     | ce              | pro_dem          |
| 3  | ce     | ce              | pro_dem          |
| 4  | des    | de+le (la)      | prep+det_art     |
| 5  | du     | de+le (la)      | prep+det_art     |
| 6  | est    | est; être       | noun; verb       |
| 7  | il     | il              | pro_per          |
| 8  | l’     | le (la)         | det_art          |
| 9  | le     | le (la)         | det_art          |
| 10 | nous   | nous            | pro_per          |
| 11 | propos | propos          | noun             |
| 12 | qu’    | que             | conjs            |
| 13 | reste  | reste; rester   | noun; verb       |
| 14 | retard | retard          | noun             |
| 15 | vers   | ver; vers; vers | noun; noun; prep |
| 16 | vola   | voler           | verb             |
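
the aggregation shown in this table can be approximated with the python sketch below.
it is not tabeltekstilo’s implementation: the function `build_dictionary` is hypothetical, and the rules it applies (distinct combinations of aggregated values are kept once, sorted, then joined column by column with `"; "`) are only inferred from the example output above.
it also sorts by code point rather than with the unicode collation algorithm, so a form like “à” would not come first here.

```python
def build_dictionary(rows, form_col, agg_cols, join_str="; "):
    """Aggregate column values per form (hypothetical sketch)."""
    # collect the distinct combinations of aggregated values seen for each form
    combos = {}
    for row in rows:
        values = tuple(row[col] for col in agg_cols)
        combos.setdefault(row[form_col], set()).add(values)
    result = []
    # assumption: plain code point order, not the unicode collation algorithm
    for form in sorted(combos):
        ordered = sorted(combos[form])
        entry = {form_col: form}
        for i, col in enumerate(agg_cols):
            entry[col] = join_str.join(combo[i] for combo in ordered)
        result.append(entry)
    return result


rows = [
    {"form": "vers", "lemma": "vers", "type": "prep"},
    {"form": "avions", "lemma": "avion", "type": "noun"},
    {"form": "avions", "lemma": "avoir", "type": "verb"},
    {"form": "des", "lemma": "de+le (la)", "type": "prep+det_art"},
    {"form": "vers", "lemma": "vers", "type": "noun"},
    {"form": "des", "lemma": "de+le (la)", "type": "prep+det_art"},
    {"form": "vers", "lemma": "ver", "type": "noun"},
]
dictionary = build_dictionary(rows, "form", ["lemma", "type"])
for entry in dictionary:
    print(entry)
```

with these rows, the entry for “vers” gets the lemma `ver; vers; vers` and the type `noun; noun; prep`, as in the table.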

### index

let’s take the following example text, and say that it appears on lines 1 and 2 of page 42:

> la suno brilas hodiaŭ. hieraŭ estis malvarme, sed hodiaŭ estas varme.<br>
> ni estas bonŝancaj!

it must first be converted to this format as `input.ods`:

| page | line | word       | form      | lemma      |
| ---- | ---- | ---------- | --------- | ---------- |
| 42   | 1    | la         | la        | la         |
| 42   | 1    | suno       | suno      | suno       |
| 42   | 1    | brilas     | brilas    | brili      |
| 42   | 1    | hodiaŭ.    | hodiaŭ    | hodiaŭ     |
| 42   | 1    | hieraŭ     | hieraŭ    | hieraŭ     |
| 42   | 1    | estis      | estis     | esti       |
| 42   | 1    | malvarme,  | malvarme  | varma      |
| 42   | 1    | sed        | sed       | sed        |
| 42   | 1    | hodiaŭ     | hodiaŭ    | hodiaŭ     |
| 42   | 1    | estas      | estas     | esti       |
| 42   | 1    | varme.     | varme     | varma      |
| 42   | 2    | ni         | ni        | ni         |
| 42   | 2    | estas      | estas     | esti       |
| 42   | 2    | bonŝancaj! | bonŝancaj | bona+ŝanco |

now, let’s generate the index by calling:

```
tabeltekstilo index --ref-col page --ref-col line --parent-col lemma --form-col form --split-char + input.ods output.ods
```

this will generate the following table as `output.ods`:

|    | lemma_count | lemma  | form_count | form      | refs         |
| -- | ----------- | ------ | ---------- | --------- | ------------ |
| 0  | 1           | bona   | 1          | bonŝancaj | 42, 2        |
| 1  | 1           | brili  | 1          | brilas    | 42, 1        |
| 2  | 3           | esti   | 2          | estas     | 42, 1; 42, 2 |
| 3  |             |        | 1          | estis     | 42, 1        |
| 4  | 1           | hieraŭ | 1          | hieraŭ    | 42, 1        |
| 5  | 2           | hodiaŭ | 2          | hodiaŭ    | 42, 1 (2)    |
| 6  | 1           | la     | 1          | la        | 42, 1        |
| 7  | 1           | ni     | 1          | ni        | 42, 2        |
| 8  | 1           | ŝanco  | 1          | bonŝancaj | 42, 2        |
| 9  | 1           | sed    | 1          | sed       | 42, 1        |
| 10 | 1           | suno   | 1          | suno      | 42, 1        |
| 11 | 2           | varma  | 1          | malvarme  | 42, 1        |
| 12 |             |        | 1          | varme     | 42, 1        |

note that “bonŝancaj” appears twice in the index, once under the lemma “bona” and once under the lemma “ŝanco”.
this is because the lemma column contained two values, separated by the defined split character.

note that the word “hodiaŭ” appears twice on the same line.
this is why its reference has “(2)” appended to it.
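
the two behaviors described above (splitting parent values and grouping identical references with a count) can be sketched in python as follows.
this is an illustration under assumptions, not tabeltekstilo’s code: the function `build_index` is hypothetical, it returns nested dictionaries instead of a table, and it leaves out the `lemma_count` and `form_count` columns.

```python
from collections import Counter


def build_index(rows, ref_cols, parent_col, form_col, split_char="+"):
    """Group references by parent and form (hypothetical sketch)."""
    counts = Counter()
    for row in rows:
        ref = tuple(row[col] for col in ref_cols)
        # a parent value containing the split character yields one entry per part
        for parent in row[parent_col].split(split_char):
            counts[(parent, row[form_col], ref)] += 1
    index = {}
    for (parent, form, ref), n in sorted(counts.items()):
        label = ", ".join(str(value) for value in ref)
        if n > 1:
            # identical references are grouped, with their count in parentheses
            label += f" ({n})"
        index.setdefault(parent, {}).setdefault(form, []).append(label)
    return {
        parent: {form: "; ".join(refs) for form, refs in forms.items()}
        for parent, forms in index.items()
    }


rows = [
    {"page": 42, "line": 1, "form": "hodiaŭ", "lemma": "hodiaŭ"},
    {"page": 42, "line": 1, "form": "estas", "lemma": "esti"},
    {"page": 42, "line": 1, "form": "hodiaŭ", "lemma": "hodiaŭ"},
    {"page": 42, "line": 2, "form": "estas", "lemma": "esti"},
    {"page": 42, "line": 2, "form": "bonŝancaj", "lemma": "bona+ŝanco"},
]
index = build_index(rows, ["page", "line"], "lemma", "form")
for parent, forms in index.items():
    print(parent, forms)
```

here “bonŝancaj” ends up under both “bona” and “ŝanco”, and “hodiaŭ” gets the single reference `42, 1 (2)`.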

#### filtering

the tabeltekstilo index function allows filtering rows based on column values using regular expressions.

for example, using the same input file as in the previous example, let’s say that only noun lemmas should appear.
in this case, they all end with “o”, so this command can be used:

```
tabeltekstilo index --ref-col page --ref-col line --parent-col lemma --form-col form --split-char + --filter "lemma:.*o" input.ods output.ods
```

in this example, the argument is quoted to prevent the `*` character from being interpreted by the shell.
whether quoting is needed depends on the shell used.

this will generate the following table:

|   | lemma_count | lemma | form_count | form      | refs  |
| - | ----------- | ----- | ---------- | --------- | ----- |
| 0 | 1           | ŝanco | 1          | bonŝancaj | 42, 2 |
| 1 | 1           | suno  | 1          | suno      | 42, 1 |

note that “bonŝancaj” appears only once in this case, because the lemma “bona” was filtered out.

multiple filter arguments may be used.
the format of the filter expressions is `col:regex`, where `col` is a column name and `regex` is a regular expression matching the value (after splitting).
any column of the input table can be used, even those not used by the index.

by default, filtering is inclusive, which means that at least one expression must match for the row to be included.
this behavior can be reversed with `--filter-exclude`.
in this case, any row matching an expression is excluded; only the rows not matching any of the expressions are included.

for example, still using the same input file, let’s say that forms with fewer than 4 letters should be excluded.
this command can be used:

```
tabeltekstilo index --ref-col page --ref-col line --parent-col lemma --form-col form --split-char + --filter "form:.{1,3}" --filter-exclude input.ods output.ods
```

this will generate the following table:

|   | lemma_count | lemma  | form_count | form      | refs         |
| - | ----------- | ------ | ---------- | --------- | ------------ |
| 0 | 1           | bona   | 1          | bonŝancaj | 42, 2        |
| 1 | 1           | brili  | 1          | brilas    | 42, 1        |
| 2 | 3           | esti   | 2          | estas     | 42, 1; 42, 2 |
| 3 |             |        | 1          | estis     | 42, 1        |
| 4 | 1           | hieraŭ | 1          | hieraŭ    | 42, 1        |
| 5 | 2           | hodiaŭ | 2          | hodiaŭ    | 42, 1 (2)    |
| 6 | 1           | ŝanco  | 1          | bonŝancaj | 42, 2        |
| 7 | 1           | suno   | 1          | suno      | 42, 1        |
| 8 | 2           | varma  | 1          | malvarme  | 42, 1        |
| 9 |             |        | 1          | varme     | 42, 1        |

tabeltekstilo uses python’s regular expressions.
see [the documentation of python’s `re` module](https://docs.python.org/3/library/re.html) for their syntax.
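
the inclusive and exclusive filtering described in this section can be sketched with python’s `re` module.
the function `keep_row` is hypothetical, and two points are assumptions inferred from the example tables rather than taken from tabeltekstilo’s code: the expression must match the whole value (`re.fullmatch`), since `lemma:.*o` keeps “suno” but not “bona” or “hodiaŭ”, and the rows are already split (one row per lemma) when the filter is applied.

```python
import re


def keep_row(row, filters, exclude=False):
    """Decide whether a row passes the filters (hypothetical sketch).

    filters is a non-empty list of "col:regex" strings.
    """
    matched = False
    for expression in filters:
        # the column name ends at the first colon; the rest is the regex
        col, regex = expression.split(":", 1)
        # assumption: the regex must match the whole value
        if re.fullmatch(regex, str(row[col])):
            matched = True
            break
    # inclusive: keep rows matching any expression; exclusive: keep the others
    return not matched if exclude else matched


rows = [
    {"form": "la", "lemma": "la"},
    {"form": "suno", "lemma": "suno"},
    {"form": "hodiaŭ", "lemma": "hodiaŭ"},
    {"form": "bonŝancaj", "lemma": "bona"},
    {"form": "bonŝancaj", "lemma": "ŝanco"},
]
nouns = [row["lemma"] for row in rows if keep_row(row, ["lemma:.*o"])]
long_forms = [
    row["form"] for row in rows if keep_row(row, ["form:.{1,3}"], exclude=True)
]
print(nouns)
print(long_forms)
```

with `exclude=True`, the same expressions select the rows to drop instead of the rows to keep.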

### xml

let’s take the table from the index command example, but change it slightly by renaming the first two columns and putting the word column last:

| page p | line l | form      | lemma      | word       |
| ------ | ------ | --------- | ---------- | ---------- |
| 42     | 1      | la        | la         | la         |
| 42     | 1      | suno      | suno       | suno       |
| 42     | 1      | brilas    | brili      | brilas     |
| 42     | 1      | hodiaŭ    | hodiaŭ     | hodiaŭ.    |
| 42     | 1      | hieraŭ    | hieraŭ     | hieraŭ     |
| 42     | 1      | estis     | esti       | estis      |
| 42     | 1      | malvarme  | varma      | malvarme,  |
| 42     | 1      | sed       | sed        | sed        |
| 42     | 1      | hodiaŭ    | hodiaŭ     | hodiaŭ     |
| 42     | 1      | estas     | esti       | estas      |
| 42     | 1      | varme     | varma      | varme.     |
| 42     | 2      | ni        | ni         | ni         |
| 42     | 2      | estas     | esti       | estas      |
| 42     | 2      | bonŝancaj | bona+ŝanco | bonŝancaj! |

now, let’s generate the xml file by calling:

```
tabeltekstilo xml input.ods output.xml
```

this will generate the following file as `output.xml`:

```xml
<?xml version="1.0" encoding="utf-8"?>
<document>
  <page p="42">
    <line l="1">
      <word form="la" lemma="la">la</word>
      <word form="suno" lemma="suno">suno</word>
      <word form="brilas" lemma="brili">brilas</word>
      <word form="hodiaŭ" lemma="hodiaŭ">hodiaŭ.</word>
      <word form="hieraŭ" lemma="hieraŭ">hieraŭ</word>
      <word form="estis" lemma="esti">estis</word>
      <word form="malvarme" lemma="varma">malvarme,</word>
      <word form="sed" lemma="sed">sed</word>
      <word form="hodiaŭ" lemma="hodiaŭ">hodiaŭ</word>
      <word form="estas" lemma="esti">estas</word>
      <word form="varme" lemma="varma">varme.</word>
    </line>
    <line l="2">
      <word form="ni" lemma="ni">ni</word>
      <word form="estas" lemma="esti">estas</word>
      <word form="bonŝancaj" lemma="bona+ŝanco">bonŝancaj!</word>
    </line>
  </page>
</document>
```

parent columns are identified by a space character in their title, which separates the element name from the attribute name (for example, the column `page p` produces `<page p="42">`).

the last column is used as the deepest element, with all non-parent columns before it used as attributes.
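
these rules can be sketched with python’s standard `xml.etree.ElementTree` module.
the function `rows_to_xml` is hypothetical and only mirrors the behavior visible in the example above; the xml declaration, the optional header and the custom root element option are left out, and the root name is simply passed as a parameter.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(columns, rows, root_name="document"):
    """Nest rows under their parent columns (hypothetical sketch)."""
    # columns whose title contains a space are parents: "element attribute"
    parent_cols = [col for col in columns if " " in col]
    leaf_col = columns[-1]
    attr_cols = [col for col in columns[:-1] if " " not in col]
    root = ET.Element(root_name)
    open_elems = []  # (value, element) currently open at each parent level
    for row in rows:
        container = root
        for depth, col in enumerate(parent_cols):
            tag, attr = col.split(" ", 1)
            value = str(row[col])
            if depth >= len(open_elems) or open_elems[depth][0] != value:
                # a new value closes this level and every deeper one
                del open_elems[depth:]
                elem = ET.SubElement(container, tag, {attr: value})
                open_elems.append((value, elem))
            container = open_elems[depth][1]
        # the last column is the deepest element; the others become attributes
        attrs = {col: str(row[col]) for col in attr_cols}
        leaf = ET.SubElement(container, leaf_col, attrs)
        leaf.text = str(row[leaf_col])
    return root


columns = ["page p", "line l", "form", "lemma", "word"]
rows = [
    {"page p": 42, "line l": 1, "form": "varme", "lemma": "varma", "word": "varme."},
    {"page p": 42, "line l": 2, "form": "ni", "lemma": "ni", "word": "ni"},
    {"page p": 42, "line l": 2, "form": "estas", "lemma": "esti", "word": "estas"},
]
root = rows_to_xml(columns, rows)
ET.indent(root)  # pretty-print; available from python 3.9
print(ET.tostring(root, encoding="unicode"))
```

a new `line` element is opened whenever the value of the `line l` column changes, and a new `page` element would also start a new `line` inside it.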

## credits

This development was funded by Bastien Kindt for the GREgORI Project.<br>
<https://uclouvain.be/fr/instituts-recherche/incal/ciol/gregori-project.html><br>
<https://www.v2.gregoriproject.com/><br>
with financial support from<br>
INCAL - Institut des civilisations, arts et lettres<br>
<https://uclouvain.be/fr/instituts-recherche/incal><br>
CIOL - Centre d'études orientales - Institut orientaliste de Louvain<br>
<https://uclouvain.be/fr/instituts-recherche/incal/ciol>

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "tabeltekstilo",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "alphabetical, index, indexer, pandas, spreadsheet, tabular, text, xml",
    "author": "hugues de keyzer",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/1f/aa/1306c731406534882ff82ec225928fe09f4a6c7a08d0e82285fac0018493/tabeltekstilo-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0-or-later",
    "summary": "multi-purpose tool for manipulating text in tabular data format",
    "version": "1.1.0",
    "project_urls": null,
    "split_keywords": [
        "alphabetical",
        " index",
        " indexer",
        " pandas",
        " spreadsheet",
        " tabular",
        " text",
        " xml"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2dcdee208a5b725a92f96eba00d6496b0321bae96df6e54ec6a566daddc032ee",
                "md5": "aadb7aad650eb9cb5c5746a2e7c0334a",
                "sha256": "4d8d5e47aa330b61b49fc1e1863e2229b79de11b5865d4ad48ef018ced2ae726"
            },
            "downloads": -1,
            "filename": "tabeltekstilo-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "aadb7aad650eb9cb5c5746a2e7c0334a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 24873,
            "upload_time": "2024-09-21T18:18:08",
            "upload_time_iso_8601": "2024-09-21T18:18:08.268910Z",
            "url": "https://files.pythonhosted.org/packages/2d/cd/ee208a5b725a92f96eba00d6496b0321bae96df6e54ec6a566daddc032ee/tabeltekstilo-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1faa1306c731406534882ff82ec225928fe09f4a6c7a08d0e82285fac0018493",
                "md5": "c8d01efa058d3758efdc579d768f1bf6",
                "sha256": "1d58ca25a77e5f390777c576a55efd82802d6ca67c258ad6d3d531c45a84a13a"
            },
            "downloads": -1,
            "filename": "tabeltekstilo-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c8d01efa058d3758efdc579d768f1bf6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 29037,
            "upload_time": "2024-09-21T18:18:10",
            "upload_time_iso_8601": "2024-09-21T18:18:10.053041Z",
            "url": "https://files.pythonhosted.org/packages/1f/aa/1306c731406534882ff82ec225928fe09f4a6c7a08d0e82285fac0018493/tabeltekstilo-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-21 18:18:10",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "tabeltekstilo"
}