====================================
pelican_nlp
====================================

.. |logo| image:: https://raw.githubusercontent.com/ypauli/pelican_nlp/main/docs/images/pelican_logo.png
   :alt: pelican_nlp Logo
   :width: 200px

+------------+-------------------------------------------------------------------+
| |logo|     | pelican_nlp stands for "Preprocessing and Extraction of Linguistic|
|            | Information for Computational Analysis - Natural Language         |
|            | Processing". This package enables the creation of standardized and|
|            | reproducible language processing pipelines, extracting linguistic |
|            | features from various tasks like discourse, fluency, and image    |
|            | descriptions.                                                     |
+------------+-------------------------------------------------------------------+

.. image:: https://img.shields.io/pypi/v/pelican_nlp.svg
   :target: https://pypi.org/project/pelican_nlp/
   :alt: PyPI version

.. image:: https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg
   :target: https://github.com/ypauli/pelican_nlp/blob/main/LICENSE
   :alt: License CC BY-NC 4.0

.. image:: https://img.shields.io/pypi/pyversions/pelican_nlp.svg
   :target: https://pypi.org/project/pelican_nlp/
   :alt: Supported Python Versions

.. image:: https://img.shields.io/badge/Contributions-Welcome-brightgreen.svg
   :target: https://github.com/ypauli/pelican_nlp/blob/main/CONTRIBUTING.md
   :alt: Contributions Welcome

Installation
============

Create a conda environment:

.. code-block:: bash

   conda create --name pelican-nlp --channel defaults python=3.10

Activate the environment:

.. code-block:: bash

   conda activate pelican-nlp

Install the package with pip:

.. code-block:: bash

   pip install pelican-nlp

For the latest development version, install directly from the repository with pip's ``git+`` syntax (a release-page URL is not pip-installable):

.. code-block:: bash

   pip install git+https://github.com/ypauli/pelican_nlp.git@v0.1.2-alpha

Usage
=====

To run ``pelican_nlp``, you need a ``configuration.yml`` file in your main project directory. This file defines the settings and parameters used for your project.

Sample configuration files are available in the `sample_configuration_files <https://github.com/ypauli/pelican_nlp/tree/main/sample_configuration_files>`_ folder of the repository:

1. Adapt a sample configuration to your needs.
2. Save your personalized ``configuration.yml`` in the root of your project directory.
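
As a rough illustration, a minimal configuration might look like the sketch below. Every key name in it is a hypothetical placeholder rather than pelican_nlp's actual schema, so always start from one of the sample configuration files:

.. code-block:: yaml

   # Hypothetical sketch only: these keys are illustrative placeholders,
   # not pelican_nlp's real configuration schema.
   task: fluency          # which task's files to process
   language: en           # language of the transcripts
   metrics:               # which linguistic features to extract
     - semantic-embeddings
     - semantic-similarity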

Running pelican_nlp
-------------------

You can run ``pelican_nlp`` from the command line or from a Python script.

**From the command line**:

Navigate to your project directory (it must contain your ``participants/`` folder and ``configuration.yml``), then run:

.. code-block:: bash

   conda activate pelican-nlp
   pelican-run

To optimize performance, close other programs and limit GPU usage during language processing.

Data Format Requirements: LPDS
------------------------------

For reliable operation, your data must follow the *Language Processing Data Structure (LPDS)*, inspired by brain-imaging data structures such as BIDS.

Main Concepts (Quick Guide)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Project Root**: Contains a ``participants/`` folder plus optional files like ``participants.tsv``, ``dataset_description.json``, and ``README``.
- **Participants**: Each participant has a folder named ``part-<ID>`` (e.g., ``part-01``).
- **Sessions (Optional)**: For longitudinal studies, use ``ses-<ID>`` subfolders inside each participant folder.
- **Tasks/Contexts**: Each session (or the participant folder itself in non-longitudinal studies) contains subfolders for specific tasks (e.g., ``interview``, ``fluency``, ``image-description``).
- **Data Files**: Named with structured metadata, e.g.
  ``part-01_ses-01_task-fluency_cat-semantic_acq-baseline_transcript.txt``

Filename Structure
~~~~~~~~~~~~~~~~~~

Filenames follow this format::

    part-<id>[_ses-<id>]_task-<label>[_<key>-<value>...][_suffix].<extension>

- **Required Entities**: ``part``, ``task``
- **Optional Entity Examples**: ``ses``, ``cat``, ``acq``, ``proc``, ``metric``, ``model``, ``run``, ``group``, ``param``
- **Suffix Examples**: ``transcript``, ``audio``, ``embeddings``, ``logits``, ``annotations``
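
The naming convention above can be parsed mechanically. The helper below is an illustrative sketch of that convention, not code from pelican_nlp itself:

.. code-block:: python

   # Illustrative LPDS filename parser; not part of pelican_nlp itself.
   def parse_lpds_filename(filename):
       """Split an LPDS filename into (entities, suffix, extension)."""
       stem, _, extension = filename.rpartition(".")
       entities = {}
       suffix = None
       for token in stem.split("_"):
           key, sep, value = token.partition("-")
           if sep:                 # "key-value" entity such as part-01
               entities[key] = value
           else:                   # trailing bare token is the suffix
               suffix = token
       if "part" not in entities or "task" not in entities:
           raise ValueError("LPDS filenames require 'part' and 'task' entities")
       return entities, suffix, extension

For the example filename above, this returns the entities ``part=01``, ``ses=01``, ``task=fluency``, ``cat=semantic``, ``acq=baseline``, with suffix ``transcript`` and extension ``txt``.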

Example Project Structure
~~~~~~~~~~~~~~~~~~~~~~~~~

::

    my_project/
    ├── participants/
    │   ├── part-01/
    │   │   └── ses-01/
    │   │       └── interview/
    │   │           └── part-01_ses-01_task-interview_transcript.txt
    │   └── part-02/
    │       └── fluency/
    │           └── part-02_task-fluency_audio.wav
    ├── configuration.yml
    ├── dataset_description.json
    ├── participants.tsv
    └── README.md
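
Conversely, a path in such a tree can be constructed from its entities. The following sketch is an illustrative helper for the layout described above, not a pelican_nlp function:

.. code-block:: python

   from pathlib import Path

   # Illustrative helper; not a pelican_nlp API.
   def lpds_path(root, part, task, ses=None, suffix="transcript", ext="txt"):
       """Build the LPDS folder path and filename for one data file."""
       folder = Path(root) / "participants" / f"part-{part}"
       name = [f"part-{part}"]
       if ses is not None:
           folder = folder / f"ses-{ses}"
           name.append(f"ses-{ses}")
       name.append(f"task-{task}")
       name.append(suffix)
       return folder / task / ("_".join(name) + f".{ext}")

Called as ``lpds_path("my_project", "01", "interview", ses="01")``, this reproduces the interview transcript path shown in the tree above.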

Features
========

- **Feature 1: Cleaning text files**

  - Handles whitespace, timestamps, punctuation, special characters, and case sensitivity.

- **Feature 2: Linguistic feature extraction**

  - Extracts semantic embeddings, logits, distance from optimality, and semantic similarity.
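
The cleaning steps listed above can be sketched roughly as follows. This is a generic illustration, not pelican_nlp's actual implementation, and the regular expressions are simplified assumptions:

.. code-block:: python

   import re
   import string

   # Generic illustration of transcript cleaning; not pelican_nlp's implementation.
   def clean_transcript(text, lowercase=True):
       # Drop timestamps such as "00:01" or "[00:01:23]" (simplified pattern).
       text = re.sub(r"\[?\d{1,2}:\d{2}(?::\d{2})?\]?", " ", text)
       # Strip ASCII punctuation and special characters.
       text = text.translate(str.maketrans("", "", string.punctuation))
       # Collapse runs of whitespace.
       text = re.sub(r"\s+", " ", text).strip()
       # Normalize case if requested.
       return text.lower() if lowercase else text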

Examples
========

You can find example setups in the `examples <https://github.com/ypauli/pelican_nlp/tree/main/examples>`_ folder of the GitHub repository.

Contributing
============

Contributions are welcome! Please check out the `contributing guide <https://github.com/ypauli/pelican_nlp/blob/main/CONTRIBUTING.md>`_.

License
=======

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. See the `LICENSE <https://github.com/ypauli/pelican_nlp/blob/main/LICENSE>`_ file for details.