Name | codcat |
Version | 0.2.0 |
home_page | |
Summary | Code snippets language classification tool |
upload_time | 2023-06-05 11:30:15 |
maintainer | |
docs_url | None |
author | |
requires_python | >=3.9 |
license | MIT |
keywords | code, snippets, classification, language, nlp, ml |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Codcat
Natural Language Processing (NLP) is a rapidly growing field that aims to help machines understand and interpret human language. With the increasing use of code hosting platforms such as GitHub, there is a growing need to accurately categorize program code by programming language. This is particularly important in large repositories that mix several programming languages, because it lets developers navigate the code and search for specific snippets more easily.
The goal of this NLP project is to develop a model that can automatically categorize program code by programming language. The model will be trained on a large dataset of code snippets from various programming languages, and will use NLP techniques to extract features and patterns from the code.
The project involves several steps: data collection, pre-processing, feature extraction, model selection, and evaluation. The dataset will be sourced from public code hosts and Q&A sites, including GitHub, GitLab, and Stack Overflow. The collected data will then be pre-processed to remove irrelevant information and to standardize the format of the code snippets, using techniques such as tokenization and stop-word removal.
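As a rough illustration of the pre-processing step, here is a minimal sketch of tokenization and stop-token removal using only the standard library. The token pattern, the `STOP_TOKENS` set, and the `tokenize` helper are assumptions made for this example; they are not part of the `codcat` API and do not describe how the package pre-processes its data.

```python
import re

# Hypothetical token pattern: identifiers, numbers, then any single
# non-space character (operators, brackets, '#', and so on).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")

# Hypothetical stop list: punctuation that shows up in nearly every language
# and so carries little signal. A real list would be tuned on the dataset.
STOP_TOKENS = frozenset({"(", ")", ",", ";"})


def tokenize(snippet: str) -> list[str]:
    """Split a code snippet into tokens and drop the stop tokens."""
    return [tok for tok in TOKEN_RE.findall(snippet) if tok not in STOP_TOKENS]


print(tokenize("def foo(bar): return bar"))
# ['def', 'foo', 'bar', ':', 'return', 'bar']
print(tokenize("#include <stdio.h>"))
# ['#', 'include', '<', 'stdio', '.', 'h', '>']
```

Which tokens count as "stop words" for source code is a design choice: tokens such as `#` or `{` can be strong language signals, so an overly aggressive stop list may hurt classification.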
Once the data is pre-processed, features will be extracted from the code snippets using NLP techniques such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and embeddings to capture the semantics of the code. These features will then be used to train and evaluate several machine learning models, including Naive Bayes, Random Forest, CNN, RNN, and Transformer models.
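To make the simplest of these combinations concrete, the sketch below pairs Bag of Words features with a multinomial Naive Bayes classifier, written from scratch with the standard library. The `bag_of_words` helper, the `TinyNaiveBayes` class, and the four training snippets are hypothetical and exist only for this example; the models shipped with `codcat` may work quite differently.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(snippet: str) -> Counter:
    """Count whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(snippet.split())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, snippets, labels):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)       # label -> number of snippets
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(bag_of_words(snippet))
        self.vocab = {tok for c in self.token_counts.values() for tok in c}
        return self

    def predict(self, snippet: str) -> str:
        total_docs = sum(self.label_counts.values())
        features = bag_of_words(snippet)
        scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Start from the log prior of the label.
            score = math.log(self.label_counts[label] / total_docs)
            for tok, n in features.items():
                if tok not in self.vocab:
                    continue  # ignore tokens never seen during training
                # Add the smoothed log likelihood, once per occurrence.
                score += n * math.log(
                    (counts[tok] + 1) / (total_tokens + len(self.vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
snippets = [
    "def add(a, b): return a + b",
    "import os",
    "#include <stdio.h>",
    "int main(void) { return 0; }",
]
labels = ["python", "python", "c", "c"]

model = TinyNaiveBayes().fit(snippets, labels)
print(model.predict("def foo(bar): return bar"))  # python
print(model.predict("#include <stdio.h>"))        # c
```

With only four training snippets the classifier is driven by a handful of tokens such as `def` and `#include`; TF-IDF weighting or learned embeddings would replace `bag_of_words` in a fuller pipeline.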
The final model will be evaluated on a test set of code snippets to assess its accuracy and generalizability.
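Accuracy on a held-out test set can be computed as the fraction of snippets whose predicted language matches the true label. The sketch below assumes predictions and labels are plain lists of strings; the `accuracy` helper and the sample data are hypothetical and are not measured results for any `codcat` model.

```python
def accuracy(predicted, expected):
    """Fraction of snippets whose predicted language matches the label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical labels and predictions for a four-snippet test set.
expected = ["python", "c", "java", "python"]
predicted = ["python", "c", "python", "python"]
print(accuracy(predicted, expected))  # 0.75
```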
## Prerequisites
To run Codcat locally, you need Python 3.9 or later.
## Installation
Install `codcat` with pip:
```bash
pip install codcat
```
or with your favorite package manager.
## Example
### Input
```python
from codcat.downloader import load
model = load('base-tiny')
print(model.predict(['def foo(bar): return bar', '#include <stdio.h>']))
```
### Output
```python
['python' 'c']
```
## Authors
- Templin Konstantin <1qnbhd@gmail.com>
Raw data
{
"_id": null,
"home_page": "",
"name": "codcat",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Konstantin Templin <1qnbhd@gmail.com>",
"keywords": "code,snippets,classification,language,NLP,ML",
"author": "",
"author_email": "Konstantin Templin <1qnbhd@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/ff/0e/012a57e3fa782483a3901dfe9c6fdce61f1bb43589561660a5ee1b04eda1/codcat-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Code snippets language classification tool",
"version": "0.2.0",
"project_urls": {
"Repository": "https://gitlab.com/codcat/codcat"
},
"split_keywords": [
"code",
"snippets",
"classification",
"language",
"nlp",
"ml"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "31f0fbf26b92b046b9c20113843fad5eda500a79d328d1234336725f34fff1f8",
"md5": "4f9545eeb8b78b67ddae64cf3a8144f7",
"sha256": "39b26ddd776fa48fc1927eed2e146c9c4b1728ca64fe246e817620d206d1707c"
},
"downloads": -1,
"filename": "codcat-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4f9545eeb8b78b67ddae64cf3a8144f7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 13430,
"upload_time": "2023-06-05T11:30:13",
"upload_time_iso_8601": "2023-06-05T11:30:13.699460Z",
"url": "https://files.pythonhosted.org/packages/31/f0/fbf26b92b046b9c20113843fad5eda500a79d328d1234336725f34fff1f8/codcat-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ff0e012a57e3fa782483a3901dfe9c6fdce61f1bb43589561660a5ee1b04eda1",
"md5": "e4fe7f3cbbdd3ff175b7c9c77edc433f",
"sha256": "caa446a064c6ca4a3eef0d3bdc044fcbd1e7e30b49470d03795a62f2b3a5fc5a"
},
"downloads": -1,
"filename": "codcat-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "e4fe7f3cbbdd3ff175b7c9c77edc433f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 2617535,
"upload_time": "2023-06-05T11:30:15",
"upload_time_iso_8601": "2023-06-05T11:30:15.433418Z",
"url": "https://files.pythonhosted.org/packages/ff/0e/012a57e3fa782483a3901dfe9c6fdce61f1bb43589561660a5ee1b04eda1/codcat-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-05 11:30:15",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "codcat",
"gitlab_project": "codcat",
"lcname": "codcat"
}