# spacy-column-classifier
A Python package that classifies DataFrame columns into Named Entity (NER) or Literal types using spaCy's language models. The library processes columns in batches, which keeps it efficient on large datasets.
## Features
- **Classification of Columns**: Classifies each column of a DataFrame as Named Entity (NER) or Literal (LIT) types, including LOCATION, ORGANIZATION, PERSON, NUMBER, DATE, and more.
- **Batch Processing**: Uses spaCy’s `nlp.pipe()` to process cells from multiple columns across multiple tables in batches rather than one call per cell, improving performance for large datasets.
- **Customizable**: Supports both transformer-based models (for high accuracy) and smaller models (for speed).
- **Handles Multiple DataFrames**: Allows you to classify columns across multiple DataFrames in one go.
- **Conflict Resolution**: Handles cases where multiple class types are detected for a single column and resolves conflicts based on customizable thresholds.
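
The batching idea can be illustrated without spaCy itself. The sketch below flattens sampled cells from several tables into one stream of `(text, context)` pairs, which is the shape spaCy's `nlp.pipe(..., as_tuples=True)` accepts. The helper name `iter_cells` and the plain dict-of-lists tables are assumptions made for illustration; they are not part of this package's API and may not match how it batches internally.

```python
from itertools import islice


def iter_cells(tables, sample_size=3):
    """Yield (text, (table_index, column_name)) for up to sample_size cells per column.

    Hypothetical helper: each table is a plain dict mapping column name -> list of values.
    """
    for table_index, table in enumerate(tables):
        for column_name, values in table.items():
            for value in islice(values, sample_size):
                yield str(value), (table_index, column_name)


tables = [
    {"director": ["Christopher Nolan", "The Wachowskis"], "release year": [2010, 1999]},
    {"company": ["Google", "Microsoft", "Apple"]},
]
stream = list(iter_cells(tables, sample_size=2))
print(stream[0])

# With spaCy and a model installed, the same stream could be consumed in one pass:
#   for doc, (table_index, column_name) in nlp.pipe(stream, as_tuples=True):
#       ...
```

Keeping the table index and column name alongside each text means the entity labels can be grouped back by column after the single batched pass.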
## Installation
You can install the package via pip:
```bash
pip install column-classifier
```
Make sure you have installed one of the compatible spaCy models:
For accuracy (slower but more precise):
```bash
python -m spacy download en_core_web_trf
```
For speed (faster but less accurate):
```bash
python -m spacy download en_core_web_sm
```
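
Because spaCy models downloaded this way are installed as regular Python packages, a quick check can confirm that at least one of them is available before the classifier is created. This is a minimal sketch; the helper name `first_installed_model` is hypothetical and not provided by this package.

```python
import importlib.util


def first_installed_model(candidates=("en_core_web_trf", "en_core_web_sm")):
    """Return the first candidate package that is importable, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


model_name = first_installed_model()
if model_name is None:
    print("No spaCy model found; run one of the download commands above.")
else:
    print(f"Found spaCy model package: {model_name}")
```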
## Quick Start
Here’s how you can use spacy-column-classifier in your project with hardcoded example data:
```python
import pandas as pd
from column_classifier import ColumnClassifier

# Hardcoded sample data
data1 = {
    'title': ['Inception', 'The Matrix', 'Interstellar'],
    'director': ['Christopher Nolan', 'The Wachowskis', 'Christopher Nolan'],
    'release year': [2010, 1999, 2014],
    'domestic distributor': ['Warner Bros.', 'Warner Bros.', 'Paramount'],
    'length in min': [148, 136, 169],
    'worldwide gross': [829895144, 466364845, 677471339]
}

data2 = {
    'company': ['Google', 'Microsoft', 'Apple'],
    'location': ['California', 'Washington', 'California'],
    'founded': [1998, 1975, 1976],
    'CEO': ['Sundar Pichai', 'Satya Nadella', 'Tim Cook'],
    'employees': [139995, 163000, 147000],
    'revenue': [182527, 168088, 274515]
}

# Create DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# List of DataFrames to classify
dataframes = [df1, df2]

# Create an instance of ColumnClassifier
classifier = ColumnClassifier(model_type='accurate')  # 'accurate' for transformer model

# Classify multiple DataFrames
results = classifier.classify_multiple_tables(dataframes)

# Display the results: one dict per table, keyed by table name
for table_result in results:
    for table_name, classification in table_result.items():
        print(f"Results for {table_name}:")
        for col, types in classification.items():
            print(f"  Column '{col}': Classified as {types['classification']}")
        print()
```
## API Reference

### ColumnClassifier

The main class used to classify DataFrame columns.

Parameters:

- `model_type`: Choose between `'accurate'` (transformer-based) or `'fast'` (small model).
- `sample_size`: Number of samples to analyze per column.
- `classification_threshold`: Minimum threshold for confident classification.
- `close_prob_threshold`: Threshold for resolving conflicts between close probabilities.
- `word_threshold`: If the average word count in a column exceeds this, the column is classified as a DESCRIPTION.
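
To make the roles of the thresholds concrete, here is a rough sketch of how such rules could interact. It is not the package's actual implementation: the function names, the default values, the fallback to `"OTHER"`, and the way ties are reported are all assumptions chosen for illustration.

```python
def average_word_count(values):
    """Mean number of whitespace-separated words per cell."""
    counts = [len(str(value).split()) for value in values]
    return sum(counts) / len(counts) if counts else 0.0


def is_description(values, word_threshold=10):
    """Treat a column as free text when cells are long on average (assumed default)."""
    return average_word_count(values) > word_threshold


def pick_label(probabilities, classification_threshold=0.5, close_prob_threshold=0.05):
    """Pick a label from a {label: probability} dict using a hypothetical rule.

    - If the best probability is below classification_threshold, fall back to "OTHER".
    - Labels within close_prob_threshold of the best one are returned as close rivals.
    """
    if not probabilities:
        return "OTHER", []
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    best_label, best_prob = ranked[0]
    if best_prob < classification_threshold:
        return "OTHER", []
    rivals = [label for label, prob in ranked[1:] if best_prob - prob <= close_prob_threshold]
    return best_label, rivals


print(pick_label({"NUMBER": 1.0, "DATE": 1.0}))
print(pick_label({"PERSON": 0.3}))
print(is_description(["Christopher Nolan", "The Wachowskis"]))
```

Returning the close rivals alongside the winner lets the caller decide how to break a tie, for example by preferring a literal type over an entity type.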
Methods:

- `classify_multiple_tables(tables: list) -> list`: Classifies all columns across multiple DataFrames. Returns a list of dictionaries containing the classification results.
- `classify_column(column_data: pd.Series) -> dict`: Classifies a single column and returns a dictionary of classifications and probabilities.
## Example Output
After classifying your DataFrames, the output will be structured like this:
```json
[
  {
    "table_1": {
      "title": {
        "classification": "OTHER",
        "probabilities": {
          "OTHER": 1.0
        }
      },
      "director": {
        "classification": "PERSON",
        "probabilities": {
          "PERSON": 1.0
        }
      },
      "release year": {
        "classification": "NUMBER",
        "probabilities": {
          "NUMBER": 1.0,
          "DATE": 1.0
        }
      },
      "domestic distributor": {
        "classification": "ORGANIZATION",
        "probabilities": {
          "ORGANIZATION": 1.0
        }
      },
      "length in min": {
        "classification": "NUMBER",
        "probabilities": {
          "NUMBER": 1.0
        }
      },
      "worldwide gross": {
        "classification": "NUMBER",
        "probabilities": {
          "NUMBER": 1.0
        }
      }
    }
  },
  {
    "table_2": {
      "company": {
        "classification": "ORGANIZATION",
        "probabilities": {
          "ORGANIZATION": 1.0
        }
      },
      "location": {
        "classification": "LOCATION",
        "probabilities": {
          "LOCATION": 1.0
        }
      },
      "founded": {
        "classification": "NUMBER",
        "probabilities": {
          "NUMBER": 1.0,
          "DATE": 1.0
        }
      },
      "CEO": {
        "classification": "PERSON",
        "probabilities": {
          "PERSON": 1.0
        }
      },
      "employees": {
        "classification": "NUMBER",
        "probabilities": {
          "NUMBER": 1.0
        }
      },
      "revenue": {
        "classification": "NUMBER",
        "probabilities": {
          "NUMBER": 1.0
        }
      }
    }
  }
]
```
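Each column gets a single winning classification. The `probabilities` field reports a score for every class type detected in the column. As the `release year` and `founded` columns show, the scores are reported per type and do not necessarily sum to 1.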
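
Since the result is a list of nested dictionaries, it is often handier to flatten it into rows before inspecting or exporting it. The following sketch assumes only the structure shown in the example output; the `summarize` helper is hypothetical and not part of the package.

```python
def summarize(results):
    """Flatten results into (table, column, classification, candidate_types) rows."""
    rows = []
    for table_result in results:
        for table_name, columns in table_result.items():
            for column_name, info in columns.items():
                candidates = sorted(info["probabilities"])
                rows.append((table_name, column_name, info["classification"], candidates))
    return rows


# A small excerpt in the same shape as the example output
sample = [
    {
        "table_1": {
            "director": {"classification": "PERSON", "probabilities": {"PERSON": 1.0}},
            "release year": {
                "classification": "NUMBER",
                "probabilities": {"NUMBER": 1.0, "DATE": 1.0},
            },
        }
    }
]

for table, column, label, candidates in summarize(sample):
    flag = " (multiple candidate types)" if len(candidates) > 1 else ""
    print(f"{table}.{column}: {label}{flag}")
```

Flagging columns with more than one candidate type makes it easy to review the cases where conflict resolution had to choose between types.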
## License

This project is licensed under the Apache License 2.0.
Raw data
{
"_id": null,
"home_page": "https://github.com/roby-avo/spacy-column-classifier",
"name": "column-classifier",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": null,
"author": "Roberto",
"author_email": "roberto.avogadro@sintef.no",
"download_url": "https://files.pythonhosted.org/packages/24/79/5f05b9bf84903546ccfc82ffce054497abda0336ccad80fccee4cd7605f4/column_classifier-0.1.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "A column classifier using spaCy for entity recognition.",
"version": "0.1.7",
"project_urls": {
"Homepage": "https://github.com/roby-avo/spacy-column-classifier"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "12a28ffde48ea90205c60563fcda485bce482d6a2f1b77513cbafe2701b77b65",
"md5": "5ba6a0bccf97802a91a4af13f0fc577a",
"sha256": "1ec5b909521327141ae3770f7f1155b3be8c7b365f6bdce6901c4a6713397eb5"
},
"downloads": -1,
"filename": "column_classifier-0.1.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5ba6a0bccf97802a91a4af13f0fc577a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 11310,
"upload_time": "2025-02-20T12:26:04",
"upload_time_iso_8601": "2025-02-20T12:26:04.647385Z",
"url": "https://files.pythonhosted.org/packages/12/a2/8ffde48ea90205c60563fcda485bce482d6a2f1b77513cbafe2701b77b65/column_classifier-0.1.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "24795f05b9bf84903546ccfc82ffce054497abda0336ccad80fccee4cd7605f4",
"md5": "0c882d282d1a921845888d23e2b90207",
"sha256": "5e2861294b2547e11de2cd2eb4bb1014d04a74509f272e39d1910d5e3ae15e1f"
},
"downloads": -1,
"filename": "column_classifier-0.1.7.tar.gz",
"has_sig": false,
"md5_digest": "0c882d282d1a921845888d23e2b90207",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 10391,
"upload_time": "2025-02-20T12:26:06",
"upload_time_iso_8601": "2025-02-20T12:26:06.654886Z",
"url": "https://files.pythonhosted.org/packages/24/79/5f05b9bf84903546ccfc82ffce054497abda0336ccad80fccee4cd7605f4/column_classifier-0.1.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-20 12:26:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "roby-avo",
"github_project": "spacy-column-classifier",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "spacy",
"specs": [
[
">=",
"3.0"
]
]
}
],
"lcname": "column-classifier"
}