# Data-Pipeline
## Motivation
Hate speech detection is difficult because abusive language takes different forms across tasks and languages. No universal model exists: existing solutions each target a single phenomenon, such as racial discrimination or abusive language. With the rise of foundation models, there is a growing need for a unified dataset that integrates the various hate speech datasets and supports a comprehensive solution. The lack of multilingual data, especially for low-resource languages, further complicates model development. A flexible, scalable data processing pipeline addresses these challenges by streamlining dataset integration and supporting future model advances in hate speech detection across languages and tasks.
## Dataset-to-SQLite Pipeline
The dataset-to-SQLite pipeline is composed of modular components, each responsible for a distinct phase of the data management workflow: configuration, validation, data insertion, and querying. This design keeps the pipeline flexible, maintainable, and easy to extend.
`config` **Module**
The `config` module simplifies the process of importing data files (e.g., CSV, TSV) that may not match the target database schema. A configuration file is used to map source file columns to the correct database tables, ensuring smooth integration. This module is built on a base class with an inheritance structure, allowing easy adaptation for future schema changes without breaking compatibility with the validator.
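The column-mapping idea can be illustrated with a minimal sketch. It is not the package's actual API: the class names `BaseConfig` and `TsvConfig`, the `column_map` layout, and the table and column names are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class BaseConfig:
    """Hypothetical base config: maps source columns to (table, column) targets."""

    file_name: str
    delimiter: str = ","
    # source column -> (target table, target column); names are illustrative
    column_map: dict = field(default_factory=dict)

    def required_columns(self):
        """Columns a validator would expect to find in the source file."""
        return set(self.column_map)

    def remap(self, row):
        """Group one source row's values by target table."""
        out = {}
        for source_col, (table, target_col) in self.column_map.items():
            out.setdefault(table, {})[target_col] = row[source_col]
        return out


@dataclass
class TsvConfig(BaseConfig):
    """Subclass that only changes the delimiter; the mapping logic is inherited."""

    delimiter: str = "\t"


config = TsvConfig(
    file_name="tweets.tsv",  # hypothetical file name
    column_map={"tweet": ("text", "content"), "class": ("label", "name")},
)
sample = "tweet\tclass\nhello world\tneutral\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter=config.delimiter))
remapped = [config.remap(row) for row in rows]
```

Because the subclass overrides only a default value, a validator written against `required_columns()` would keep working when new config variants are added.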
`loader` **Module**
The `loader` module is responsible for validating, formatting, and loading datasets into the database. It operates in a structured, phase-based manner:
- **Validator**: Ensures the integrity of the incoming datasets by checking that all required files and columns (as specified in the configuration file) are present. This prevents incomplete or corrupted data from entering the pipeline.
- **Formatter**: Breaks down validated datasets into multiple dataframes, formatting them to match the target database schema. This step improves clarity and efficiency in the loading process.
- **Loader**: Manages the data insertion process, handling both single and multi-file datasets. It ensures data integrity by controlling commit and rollback operations on a per-dataset basis.
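The three phases above can be sketched with the standard-library `sqlite3` module. This is a simplified illustration under stated assumptions, not the package's implementation: the function names, the single `text_label` table, and the `REQUIRED_COLUMNS` set are hypothetical, and plain dictionaries stand in for dataframes.

```python
import sqlite3

REQUIRED_COLUMNS = {"text", "label"}  # hypothetical; would come from the config file


def validate(rows, required=REQUIRED_COLUMNS):
    """Raise ValueError if the dataset is empty or a row lacks a required column."""
    if not rows:
        raise ValueError("dataset is empty")
    for index, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValueError(f"row {index} is missing columns: {sorted(missing)}")


def format_rows(dataset_name, rows):
    """Shape validated rows into tuples matching the (assumed) target table."""
    return [(dataset_name, row["text"], row["label"]) for row in rows]


def load_dataset(conn, dataset_name, rows):
    """Insert one dataset atomically: commit on success, roll back on any error."""
    try:
        validate(rows)
        records = format_rows(dataset_name, rows)
        conn.executemany(
            "INSERT INTO text_label (dataset, text, label) VALUES (?, ?, ?)",
            records,
        )
        conn.commit()
        return True
    except (ValueError, sqlite3.Error):
        conn.rollback()  # discard any rows already inserted for this dataset
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE text_label ("
    "dataset TEXT NOT NULL, text TEXT NOT NULL, label TEXT NOT NULL)"
)
good = [{"text": "a", "label": "x"}, {"text": "b", "label": "y"}]
missing_column = [{"text": "c"}]
null_label = [{"text": "d", "label": "x"}, {"text": "e", "label": None}]

results = [
    load_dataset(conn, "good", good),
    load_dataset(conn, "missing", missing_column),
    load_dataset(conn, "null", null_label),
]
row_count = conn.execute("SELECT COUNT(*) FROM text_label").fetchone()[0]
```

The third dataset fails on its second row, after the first row has already been inserted; the rollback removes that partial insert, so only the two rows of the valid dataset remain.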
`database` **Module**
The `database` module manages schema setup and data querying to ensure smooth integration and retrieval:
- **Setup**: Creates all database tables in the correct order, maintaining foreign key constraints. It also offers a reset function to clear tables when needed, simplifying schema management.
- **Querying**: Provides two main interfaces: one for displaying dataset-text-label information (with optional source language details) and another for executing queries from external SQL files. Both include a `show_lines` parameter for previewing rows and support exporting query results to CSV or TSV files.
`utils` **Module**
The `utils` module includes a set of helpful tools for data analysis and selection during the dataset preparation phase:
- **Distribute Tool**: Analyzes the distribution of one column relative to another (for example, labels per language), helping users spot balanced or imbalanced subsets when selecting datasets.
- **Fuzzysearch Tool**: Allows approximate matching within the dataset, helping locate relevant data, such as label definitions or metadata, without requiring exact queries.
- **Sampling Tool**: Provides three pre-configured sampling strategies to ensure balanced and representative data subsets for experimental setups.
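The kind of work these tools do can be approximated with the standard library alone. The sketch below is illustrative only: `distribution`, `fuzzy_search`, and `balanced_sample` are hypothetical names, `difflib` is a stand-in for whatever matching the package uses, and the single balanced strategy shown is not claimed to be one of the package's three.

```python
import difflib
import random
from collections import Counter, defaultdict


def distribution(rows, column, by):
    """Count the values of `column` within each value of `by`."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[by]][row[column]] += 1
    return {key: dict(counts) for key, counts in table.items()}


def fuzzy_search(query, candidates, cutoff=0.6, limit=3):
    """Return up to `limit` candidates that approximately match `query`."""
    return difflib.get_close_matches(query, candidates, n=limit, cutoff=cutoff)


def balanced_sample(rows, by, per_group, seed=0):
    """Draw at most `per_group` rows from each group defined by `by`."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


rows = [
    {"language": "en", "label": "hate"},
    {"language": "en", "label": "neutral"},
    {"language": "en", "label": "neutral"},
    {"language": "de", "label": "hate"},
]
label_by_language = distribution(rows, "label", "language")
matches = fuzzy_search("offensiv", ["offensive", "neutral", "hateful"])
sample = balanced_sample(rows, by="label", per_group=1)
```

In this toy data the distribution shows that German has no neutral examples, which is the sort of imbalance a per-label sample would then need to account for.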
# License
This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/Master-Project-Hate-Speech/Data-Pipeline/blob/main/LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/Master-Project-Hate-Speech/Data-Pipeline",
"name": "STITCHED",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "NLP hate speech pipeline",
"author": "UZH STITCHED",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/8c/e3/60350805f4c68d074d81c27f9988807e2df9929ae8962a1bb664a8cef2e7/STITCHED-0.1.0.tar.gz",
"platform": null,
"description": "# Data-Pipeline\r\n\r\n## Motivation\r\nHate speech detection faces challenges due to the diverse manifestations of abusive language across different tasks and languages. There is no universal model, as existing solutions target specific phenomena like racial discrimination or abusive language individually. With the rise of foundation models, there is a growing need for a unified dataset that integrates various hate speech datasets to support a comprehensive solution. Additionally, the lack of multilingual data, especially for low-resource languages, further complicates model development. A flexible, scalable data processing pipeline is essential to address these challenges, streamline dataset integration, and support future model advancements in hate speech detection across languages and tasks.\r\n\r\n## Dataset-to-SQLite Pipeline\r\nThe dataset-to-SQLite pipeline is composed of modular components, each responsible for a distinct phase of the data management workflow. This design ensures flexibility, maintainability, and ease of extension across stages like configuration, data insertion, validation, and querying.\r\n\r\n`config` **Module**\r\n\r\nThe `config` module simplifies the process of importing data files (e.g., CSV, TSV) that may not match the target database schema. A configuration file is used to map source file columns to the correct database tables, ensuring smooth integration. This module is built on a base class with an inheritance structure, allowing easy adaptation for future schema changes without breaking compatibility with the validator.\r\n\r\n\r\n`loader` **Module**\r\n\r\nThe `loader` module is responsible for validating, formatting, and loading datasets into the database. It operates in a structured, phase-based manner:\r\n\r\n- **Validator**: Ensures the integrity of the incoming datasets by checking that all required files and columns (as specified in the configuration file) are present. 
This prevents incomplete or corrupted data from entering the pipeline.\r\n- **Formatter**: Breaks down validated datasets into multiple dataframes, formatting them to match the target database schema. This step improves clarity and efficiency in the loading process.\r\n- **Loader**: Manages the data insertion process, handling both single and multi-file datasets. It ensures data integrity by controlling commit and rollback operations on a per-dataset basis.\r\n\r\n\r\n`database` **Module**\r\n\r\nThe `database` module manages schema setup and data querying to ensure smooth integration and retrieval:\r\n\r\n- **Setup**: Creates all database tables in the correct order, maintaining foreign key constraints. It also offers a reset function to clear tables when needed, simplifying schema management.\r\n- **Querying**: Provides two main interfaces\u9225\u6510ne for displaying dataset-text-label information (with optional source language details) and another for executing queries from external SQL files. Both include a `show_lines` parameter for previewing rows and support exporting query results to CSV or TSV files.\r\n\r\n`utils` **Module**\r\n\r\nThe `utils` module includes a set of helpful tools for data analysis and selection during the dataset preparation phase:\r\n\r\n- **Distribute Tool**: Analyzes the distribution of one column relative to another, helping users identify balanced or imbalanced data points, useful for dataset selection.\r\n- **Fuzzysearch Tool**: Allows approximate matching within the dataset, helping locate relevant data, such as label definitions or metadata, without requiring exact queries.\r\n- **Sampling Tool**: Provides three pre-configured sampling strategies to ensure balanced and representative data subsets for experimental setups.\r\n\r\n# License\r\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/Master-Project-Hate-Speech/Data-Pipeline/blob/main/LICENSE) file for details.\r\n",
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": null,
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/Master-Project-Hate-Speech/Data-Pipeline",
"Source": "https://github.com/Master-Project-Hate-Speech/Data-Pipeline"
},
"split_keywords": [
"nlp",
"hate",
"speech",
"pipeline"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2f07936715b497694e2c14aa0806c8ec5f81a6f84bcb7997f65d0304845f3d81",
"md5": "f6c4300ba941b00dd857f30fd2b5ae36",
"sha256": "045e68e88d17cd122f614959e36fff4b0d36619c74ea68a609a19db73fd885cc"
},
"downloads": -1,
"filename": "STITCHED-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f6c4300ba941b00dd857f30fd2b5ae36",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 20874,
"upload_time": "2024-09-28T22:03:25",
"upload_time_iso_8601": "2024-09-28T22:03:25.186063Z",
"url": "https://files.pythonhosted.org/packages/2f/07/936715b497694e2c14aa0806c8ec5f81a6f84bcb7997f65d0304845f3d81/STITCHED-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8ce360350805f4c68d074d81c27f9988807e2df9929ae8962a1bb664a8cef2e7",
"md5": "21e92654fe75f5c0f800ba5d7f794c16",
"sha256": "c310f2f28784f8512c8f65bd6228bc2bd29e5ea8ee2e73fe321cd4f7ee759b1f"
},
"downloads": -1,
"filename": "STITCHED-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "21e92654fe75f5c0f800ba5d7f794c16",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 22126,
"upload_time": "2024-09-28T22:03:26",
"upload_time_iso_8601": "2024-09-28T22:03:26.303650Z",
"url": "https://files.pythonhosted.org/packages/8c/e3/60350805f4c68d074d81c27f9988807e2df9929ae8962a1bb664a8cef2e7/STITCHED-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-28 22:03:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Master-Project-Hate-Speech",
"github_project": "Data-Pipeline",
"github_not_found": true,
"lcname": "stitched"
}