| Field | Value |
|---|---|
| Name | makeitsample |
| Version | 1.0.0 |
| Summary | A Toolkit for Generating Typological Language Samples Based on the Diversity Value |
| Upload time | 2025-10-07 08:27:01 |
| Requires Python | >=3.7 |
| Keywords | linguistics, typology, language diversity, language sampling, diversity value, language families, phylogenetics, sampling |
| Requirements | No requirements were recorded. |

<div align="center">

# makeitsample

<!-- [![Paper](http://img.shields.io/badge/paper-ACL--anthology-B31B1B.svg)](https://aclanthology.org/2021.sigtyp-1.2/)
[![Conference](https://img.shields.io/badge/conference-NAACL--2021-blue.svg)](https://2021.naacl.org/)-->
[![License: MIT][mit-shield]][mit]
![Python](https://img.shields.io/badge/python-3.7%20|%20higher%20version-orange.svg)

[mit]: https://opensource.org/license/mit
[mit-shield]: https://img.shields.io/badge/License-MIT-yellow.svg

</div>

**makeitsample** is a Python library for generating typological language samples
using the *diversity value (DV)* metric (Rijkhoff et al., 1993; Rijkhoff and
Bakker, 1998; Bakker, 2010).

It provides tools to build language family trees from CSV data, compute
diversity values for each node, and select a representative set of languages
that reflect genealogical and typological diversity.

---

## 📚 What It Does

makeitsample is designed to support researchers and linguists in the creation of
typologically diverse language samples. It consists of two main modules:

- `language_family_tree.py` — defines tree structures and computes diversity
values (DV).
- `forest.py` — manages a forest of language families and handles sampling logic
across multiple trees.

---

## 🚀 Features

- Build hierarchical language family trees from CSV input.
- Handle both nested genealogies and isolated languages.
- Calculate diversity values at each node in a tree.
- Select representative languages based on weighted sampling.
- Minimal dependencies and easy integration into other projects.

---

## 📦 Installation

```bash
pip install makeitsample
```

## 🛠️ Usage

### Prepare the input files

makeitsample requires a set of input files in CSV format, representing language
families.
Each CSV file should contain the following columns:
- `id`: the ID of the node
- `name`: the name of the node
- `parent_id`: the ID of the parent node
- `type`: the type of the node; the only allowed values are `family`, `group`, and `language`

You can also add any other columns to the CSV files.
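
As a rough illustration of the layout described above, the following sketch writes a small input file using only the Python standard library. The family, its node names and IDs, and the file name `examplean.csv` are all made up, and leaving `parent_id` empty for the root node is an assumption: check the project repository for how the root is expected to be encoded.

```python
import csv
import tempfile
from pathlib import Path

# Required columns, as listed in this README.
FIELDS = ["id", "name", "parent_id", "type"]
ALLOWED_TYPES = {"family", "group", "language"}

# Hypothetical toy family. The empty parent_id on the root row is an
# assumption, not something this README specifies.
ROWS = [
    {"id": "1", "name": "Examplean", "parent_id": "", "type": "family"},
    {"id": "2", "name": "North Examplean", "parent_id": "1", "type": "group"},
    {"id": "3", "name": "Language A", "parent_id": "2", "type": "language"},
    {"id": "4", "name": "Language B", "parent_id": "2", "type": "language"},
    {"id": "5", "name": "Language C", "parent_id": "1", "type": "language"},
]


def write_family_csv(path, rows):
    """Write rows to a CSV file after checking the `type` values."""
    for row in rows:
        if row["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unexpected node type: {row['type']!r}")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Write into a temporary directory so the example leaves no stray files behind.
csv_path = Path(tempfile.mkdtemp()) / "examplean.csv"
write_family_csv(csv_path, ROWS)
print(csv_path.read_text(encoding="utf-8"))
```

Extra columns can be added by extending `FIELDS` and the row dictionaries.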

### As a library

#### Create a language family tree from CSV data

```python
import makeitsample.language_family_tree as lft

# Create a language family tree from CSV data
family = lft.LanguageFamilyTree("path/to/csv/file.csv")

# Print the tree structure
print(family)
```
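
To see how the `id` and `parent_id` columns encode a hierarchy, here is a standalone sketch that rebuilds and prints a tree from rows in the documented layout. It does not use makeitsample and is not how `LanguageFamilyTree` is implemented; the rows are invented, the empty `parent_id` for the root is an assumption, and the output of `print(family)` may look different.

```python
from collections import defaultdict

# Hypothetical rows in (id, name, parent_id, type) order.
ROWS = [
    ("1", "Examplean", "", "family"),
    ("2", "North Examplean", "1", "group"),
    ("3", "Language A", "2", "language"),
    ("4", "Language B", "2", "language"),
    ("5", "Language C", "1", "language"),
]


def build_children(rows):
    """Map each parent id to the list of its child nodes."""
    children = defaultdict(list)
    for node_id, name, parent_id, node_type in rows:
        children[parent_id].append((node_id, name, node_type))
    return children


def render(children, parent_id="", depth=0):
    """Return the tree as indented lines, depth-first from the root."""
    lines = []
    for node_id, name, node_type in children.get(parent_id, []):
        lines.append("  " * depth + f"{name} ({node_type})")
        lines.extend(render(children, node_id, depth + 1))
    return lines


tree_lines = render(build_children(ROWS))
print("\n".join(tree_lines))
```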

#### Calculate diversity values for the language family trees

```python
from makeitsample.forest import Forest

# Create a forest of language families
language_families = Forest(dir="path/to/directory/with/csv/files")

# Update the trees with diversity values
language_families.dv()

# Export the updated trees to CSV
language_families.export_forest(dir="path/to/output/dir", format="csv")
```

#### Sample languages from the language family trees

```python
from makeitsample.forest import Forest

# Create a forest of language families
language_families = Forest("path/to/directory/with/csv/files")

# Sample languages from the forest
language_families.make_sample(n=100)

# Export the sampled languages to CSV
language_families.export_sample(dir="path/to/output/dir", format="csv")

# Export the sampled languages to JSON
language_families.export_sample(dir="path/to/output/dir", format="json")
```
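
For intuition about how a fixed sample size might be divided among families, the sketch below splits `n` slots in proportion to a weight per family, using largest-remainder rounding so the counts add up to `n`. This is only an illustration of proportional allocation: the function `allocate`, the family names, and the weights are hypothetical, and `make_sample` may work differently.

```python
def allocate(weights, n):
    """Split n slots across keys in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to n.
    """
    total = sum(weights.values())
    quotas = {key: n * weight / total for key, weight in weights.items()}
    counts = {key: int(quota) for key, quota in quotas.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining slots to the keys with the largest fractional parts.
    by_remainder = sorted(
        quotas, key=lambda key: quotas[key] - counts[key], reverse=True
    )
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


# Made-up weights standing in for per-family diversity values.
weights = {"Family A": 6.0, "Family B": 3.0, "Isolate C": 1.0}
print(allocate(weights, 7))
```

With these weights, the isolate still receives one slot out of seven because its fractional share is the largest remainder.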

### As a command-line tool

#### Sample languages from the language family trees

```bash
makeitsample [-h] [-n N] [-i INPUT] [-o OUTPUT] [-f {csv,json}] [-s SAMPLENAME] [-r RANDOM_SEED]
```

#### Arguments
- `-h`, `--help`: Show this help message and exit.
- `-n N`, `--number N`: Number of languages to sample.
- `-i INPUT`, `--input INPUT`: Path to the input directory containing CSV files.
- `-o OUTPUT`, `--output OUTPUT`: Path to the output directory for sampled languages.
- `-f {csv,json}`, `--format {csv,json}`: Output format for sampled languages (default: csv).
- `-s SAMPLENAME`, `--sample_name SAMPLENAME`: Name of the sample (default: sample).
- `-r RANDOM_SEED`, `--random_seed RANDOM_SEED`: Random seed for reproducibility.

#### Example usage

```bash
makeitsample -n 100 -i data -o out -f csv -s test_sample
```
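
For scripted runs, the same invocation can be assembled from Python. The sketch below only uses the flags listed above, including `-r` to fix the random seed; `build_command` and `run_sample` are hypothetical helpers, not part of makeitsample, and `run_sample` assumes the `makeitsample` executable is on your `PATH`.

```python
import shutil
import subprocess


def build_command(n, input_dir, output_dir, fmt="csv", sample_name="sample", seed=None):
    """Return the makeitsample command line as a list of arguments."""
    if fmt not in ("csv", "json"):
        raise ValueError("fmt must be 'csv' or 'json'")
    command = [
        "makeitsample",
        "-n", str(n),
        "-i", input_dir,
        "-o", output_dir,
        "-f", fmt,
        "-s", sample_name,
    ]
    if seed is not None:
        # A fixed seed makes repeated runs comparable.
        command += ["-r", str(seed)]
    return command


def run_sample(command):
    """Run the command, failing early if the tool is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    subprocess.run(command, check=True)


command = build_command(100, "data", "out", fmt="json", sample_name="test_sample", seed=42)
print(" ".join(command))
# To actually run the tool: run_sample(command)
```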

## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## 📄 Citation

If you use this library in your research, please cite the following paper:

<!--```bibtex
@inproceedings{makeitsample2025,
  title = {Samplify: a Tool for Generating Typological Language Samples Based on the Diversity Value},
  author = {Brigada Villa, Luca},
  year = {2025},
  url = {https://makeitsample.unipv.it},
  version = {1.0}
}
```-->

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "makeitsample",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "linguistics, typology, language diversity, language sampling, diversity value, language families, phylogenetics, sampling",
    "author": null,
    "author_email": "Luca Brigada Villa <luca.brigadavilla@unipv.it>",
    "download_url": "https://files.pythonhosted.org/packages/bd/db/4223af7928dadc0e9cb9ddc5ab06bdc8292fa61bb2695c2729e891409e26/makeitsample-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A Toolkit for Generating Typological Language Samples Based on the Diversity Value",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/unipv-larl/makeitsample",
        "Issues": "https://github.com/unipv-larl/makeitsample/issues"
    },
    "split_keywords": [
        "linguistics",
        " typology",
        " language diversity",
        " language sampling",
        " diversity value",
        " language families",
        " phylogenetics",
        " sampling"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2316565ecb904abe9f8ecafa3bec02e271b88b6210ae1a1f3fd3362ef0d94dd0",
                "md5": "85a1049295d77650ae2ee5a856aca129",
                "sha256": "334bb58167cff382a9d198b8fd7c9c6fcc7696524703d7b09e4f6ae9aa761384"
            },
            "downloads": -1,
            "filename": "makeitsample-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "85a1049295d77650ae2ee5a856aca129",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 12610,
            "upload_time": "2025-10-07T08:27:00",
            "upload_time_iso_8601": "2025-10-07T08:27:00.338067Z",
            "url": "https://files.pythonhosted.org/packages/23/16/565ecb904abe9f8ecafa3bec02e271b88b6210ae1a1f3fd3362ef0d94dd0/makeitsample-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bddb4223af7928dadc0e9cb9ddc5ab06bdc8292fa61bb2695c2729e891409e26",
                "md5": "ffb785fbcd292a289a7ad40e0c595a41",
                "sha256": "eebdfd8ccc032a274c6784bed3cc00ad865427bab7579683435e1dc9a456a1b2"
            },
            "downloads": -1,
            "filename": "makeitsample-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ffb785fbcd292a289a7ad40e0c595a41",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 13259,
            "upload_time": "2025-10-07T08:27:01",
            "upload_time_iso_8601": "2025-10-07T08:27:01.866309Z",
            "url": "https://files.pythonhosted.org/packages/bd/db/4223af7928dadc0e9cb9ddc5ab06bdc8292fa61bb2695c2729e891409e26/makeitsample-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-07 08:27:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "unipv-larl",
    "github_project": "makeitsample",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "makeitsample"
}
        