<div align="center">

# makeitsample
[![License: MIT][mit-shield]][mit]

[mit]: https://opensource.org/license/mit
[mit-shield]: https://img.shields.io/badge/License-MIT-yellow.svg
</div>

**makeitsample** is a Python library for generating typological language samples
using the *diversity value (DV)* metric (Rijkhoff et al., 1993; Rijkhoff and
Bakker, 1998; Bakker, 2010).
It provides tools to build language family trees from CSV data, compute
diversity values for each node, and select a representative set of languages
that reflect genealogical and typological diversity.

---
## 📚 What It Does
makeitsample is designed to support researchers and linguists in the creation of
typologically diverse language samples. It consists of two main modules:
- `language_family_tree.py` — defines tree structures and computes diversity
values (DV).
- `forest.py` — manages a forest of language families and handles sampling logic
across multiple trees.
---
## 🚀 Features
- Build hierarchical language family trees from CSV input.
- Handle both nested genealogies and isolated languages.
- Calculate diversity values at each node in a tree.
- Select representative languages based on weighted sampling.
- Minimal dependencies and easy integration into other projects.
---
## 📦 Installation
```bash
pip install makeitsample
```
## 🛠️ Usage
### Prepare the input files
makeitsample requires a set of input files (representing language families) in CSV
format.
The CSV files should contain the following columns:
- `id`: the id of the node
- `name`: the name of the node
- `parent_id`: the id of the parent node
- `type`: the type of the node (the only allowed values are "family", "group", or "language")

Any additional columns may also be included in the CSV files.
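
As an illustration, the following sketch writes a small input file with the four required columns using Python's standard `csv` module. The ids, the helper name `write_family_csv`, and the convention of leaving `parent_id` empty for the root node are assumptions made for this example; the format description above does not say how a root node should be marked.

```python
import csv

# Illustrative rows only: the ids and the empty parent_id for the root
# node are assumptions, not conventions confirmed by makeitsample.
ROWS = [
    {"id": "ie", "name": "Indo-European", "parent_id": "", "type": "family"},
    {"id": "germ", "name": "Germanic", "parent_id": "ie", "type": "group"},
    {"id": "eng", "name": "English", "parent_id": "germ", "type": "language"},
    {"id": "deu", "name": "German", "parent_id": "germ", "type": "language"},
]


def write_family_csv(path, rows):
    """Write rows to a CSV file containing the four required columns."""
    fieldnames = ["id", "name", "parent_id", "type"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_family_csv("indo_european.csv", ROWS)` would produce a file with a header row followed by one row per node.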
### As a library
#### Create a language family tree from CSV data
```python
import makeitsample.language_family_tree as lft
# Create a language family tree from CSV data
family = lft.LanguageFamilyTree("path/to/csv/file.csv")
# Print the tree structure
print(family)
```
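
Before loading a file, it can help to check that it follows the input format described under "Prepare the input files". The helper below is a hypothetical pre-check that uses only the standard library; `check_family_csv` is not part of makeitsample, and the library may perform its own validation. It also assumes that an empty `parent_id` marks a root node.

```python
import csv

REQUIRED_COLUMNS = {"id", "name", "parent_id", "type"}
ALLOWED_TYPES = {"family", "group", "language"}


def check_family_csv(path):
    """Return a list of human-readable problems found in the CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        rows = list(reader)

    problems = []
    known_ids = {row["id"] for row in rows}
    for number, row in enumerate(rows, start=1):
        if row["type"] not in ALLOWED_TYPES:
            problems.append(f"row {number}: unknown type {row['type']!r}")
        # Assumption: an empty parent_id marks a root node.
        if row["parent_id"] and row["parent_id"] not in known_ids:
            problems.append(f"row {number}: unknown parent_id {row['parent_id']!r}")
    return problems
```

An empty list means no problems were found by these checks; it does not guarantee that the library will accept the file.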
#### Calculate diversity values for the language family trees
```python
from makeitsample.forest import Forest
# Create a forest of language families
language_families = Forest(dir="path/to/directory/with/csv/files")
# Update the trees with diversity values
language_families.dv()
# Export the updated trees to CSV
language_families.export_forest(dir="path/to/output/dir", format="csv")
```
#### Sample languages from the language family trees
```python
from makeitsample.forest import Forest
# Create a forest of language families
language_families = Forest("path/to/directory/with/csv/files")
# Sample languages from the forest
language_families.make_sample(n=100)
# Export the sampled languages to CSV
language_families.export_sample(dir="path/to/output/dir", format="csv")
# Export the sampled languages to JSON
language_families.export_sample(dir="path/to/output/dir", format="json")
```
### As a command-line tool
#### Sample languages from the language family trees
```bash
makeitsample [-h] [-n N] [-i INPUT] [-o OUTPUT] [-f {csv,json}] [-s SAMPLENAME] [-r RANDOM_SEED]
```
#### Arguments
- `-h`, `--help`: Show this help message and exit.
- `-n N`, `--number N`: Number of languages to sample.
- `-i INPUT`, `--input INPUT`: Path to the input directory containing CSV files.
- `-o OUTPUT`, `--output OUTPUT`: Path to the output directory for sampled languages.
- `-f {csv,json}`, `--format {csv,json}`: Output format for sampled languages (default: csv).
- `-s SAMPLENAME`, `--sample_name SAMPLENAME`: Name of the sample (default: sample).
- `-r RANDOM_SEED`, `--random_seed RANDOM_SEED`: Random seed for reproducibility.
#### Example usage
```bash
makeitsample -n 100 -i data -o out -f csv -s test_sample
```
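
To run the command-line tool from a script, for example to repeat a sampling run with a fixed seed, the command can be assembled and launched with Python's standard `subprocess` module. This is a minimal sketch that uses only the flags listed above; the helper names `build_command` and `run_sample` are hypothetical, and the sketch assumes the `makeitsample` executable is installed and on the `PATH`.

```python
import subprocess


def build_command(n, input_dir, output_dir, fmt="csv", sample_name="sample", seed=None):
    """Assemble a makeitsample command line from the documented flags."""
    command = [
        "makeitsample",
        "-n", str(n),
        "-i", str(input_dir),
        "-o", str(output_dir),
        "-f", fmt,
        "-s", sample_name,
    ]
    # Only pass -r when a seed is given, so unseeded runs stay unseeded.
    if seed is not None:
        command += ["-r", str(seed)]
    return command


def run_sample(**options):
    """Run makeitsample and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(**options), check=True)
```

For instance, `run_sample(n=100, input_dir="data", output_dir="out", seed=42)` corresponds to `makeitsample -n 100 -i data -o out -f csv -s sample -r 42`.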
## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## 📄 Citation
If you use this library in your research, please cite the following paper:
<!--```bibtex
@inproceedings{makeitsample2025,
title = {Samplify: a Tool for Generating Typological Language Samples Based on the Diversity Value},
author = {Brigada Villa, Luca},
year = {2025},
url = {https://makeitsample.unipv.it},
version = {1.0}
}
```-->
Raw data
{
"_id": null,
"home_page": null,
"name": "makeitsample",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "linguistics, typology, language diversity, language sampling, diversity value, language families, phylogenetics, sampling",
"author": null,
"author_email": "Luca Brigada Villa <luca.brigadavilla@unipv.it>",
"download_url": "https://files.pythonhosted.org/packages/bd/db/4223af7928dadc0e9cb9ddc5ab06bdc8292fa61bb2695c2729e891409e26/makeitsample-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A Toolkit for Generating Typological Language Samples Based on the Diversity Value",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/unipv-larl/makeitsample",
"Issues": "https://github.com/unipv-larl/makeitsample/issues"
},
"split_keywords": [
"linguistics",
" typology",
" language diversity",
" language sampling",
" diversity value",
" language families",
" phylogenetics",
" sampling"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2316565ecb904abe9f8ecafa3bec02e271b88b6210ae1a1f3fd3362ef0d94dd0",
"md5": "85a1049295d77650ae2ee5a856aca129",
"sha256": "334bb58167cff382a9d198b8fd7c9c6fcc7696524703d7b09e4f6ae9aa761384"
},
"downloads": -1,
"filename": "makeitsample-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "85a1049295d77650ae2ee5a856aca129",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 12610,
"upload_time": "2025-10-07T08:27:00",
"upload_time_iso_8601": "2025-10-07T08:27:00.338067Z",
"url": "https://files.pythonhosted.org/packages/23/16/565ecb904abe9f8ecafa3bec02e271b88b6210ae1a1f3fd3362ef0d94dd0/makeitsample-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "bddb4223af7928dadc0e9cb9ddc5ab06bdc8292fa61bb2695c2729e891409e26",
"md5": "ffb785fbcd292a289a7ad40e0c595a41",
"sha256": "eebdfd8ccc032a274c6784bed3cc00ad865427bab7579683435e1dc9a456a1b2"
},
"downloads": -1,
"filename": "makeitsample-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "ffb785fbcd292a289a7ad40e0c595a41",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 13259,
"upload_time": "2025-10-07T08:27:01",
"upload_time_iso_8601": "2025-10-07T08:27:01.866309Z",
"url": "https://files.pythonhosted.org/packages/bd/db/4223af7928dadc0e9cb9ddc5ab06bdc8292fa61bb2695c2729e891409e26/makeitsample-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-07 08:27:01",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "unipv-larl",
"github_project": "makeitsample",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "makeitsample"
}