## TubeLearns: YouTube Video Transcript Extractor
TubeLearns is a Python package for extracting and cleaning YouTube video transcripts as a preprocessing step for machine learning. It streamlines collecting transcript text from YouTube videos for use in natural language processing, sentiment analysis, speech recognition, and similar tasks.
## Features
- Extracts video transcripts from YouTube videos.
- Saves cleaned transcripts into separate text files.
- Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
- Streamlines the dataset collection process for machine learning applications.
- Tokenizes text and removes punctuation for preprocessing and cleaning (new feature).
## Installation
You can install TubeLearns using pip:
```bash
pip install tubelearns
```
## Usage
### Playlist Grabbing
```python
from tubelearns import Acquisition
# Initialize the Acquisition class
model = Acquisition()
# Grab transcripts from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.PlaylistGrab(playlist_url, name="raw_data")
```
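The playlist URL above is a placeholder. As a minimal sketch, assuming playlist URLs carry the playlist ID in a `list` query parameter (the usual YouTube form), the helper below checks a URL before it is passed to `PlaylistGrab`. The function name `playlist_id` and the example ID are hypothetical and not part of TubeLearns; the sketch uses only the standard library.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def playlist_id(url: str) -> Optional[str]:
    """Return the value of the `list` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None


# Hypothetical example ID, used only for illustration.
print(playlist_id("https://www.youtube.com/playlist?list=PL_EXAMPLE_ID"))
# A URL without a `list` parameter yields None.
print(playlist_id("https://www.youtube.com/your_playlist_url"))
```

A `None` result suggests the URL is not a playlist link and is worth checking before starting a long batch run.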
### Extract Video Links from Playlist
```python
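from tubelearns import Acquisition

# Initialize the Acquisition class (skip if `model` already exists)
model = Acquisition()
# Extract video links from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.Play2Text(playlist_url)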
```
### Tokenization and Cleaning
```python
from tubelearns.tokenizers import Tokenization, Cleaning
# Initialize the Tokenization and Cleaning classes
tokenizer = Tokenization()
cleaner = Cleaning()
# Tokenize the text, then remove punctuation from the token list
text_data = "Your input text here."
tokenized_data = tokenizer.TokenizeRaw(text_data)
cleaned_data = tokenizer.PunctList(tokenized_data)
```
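To make the two steps concrete, here is a rough standard-library sketch of what tokenization followed by punctuation removal can look like. It is an illustration only, not TubeLearns's implementation (the acknowledgments suggest the package relies on spaCy); the function names `tokenize` and `drop_punctuation` are hypothetical.

```python
import re
import string
from typing import List

# ASCII punctuation only; other Unicode punctuation marks are not covered here.
PUNCTUATION = set(string.punctuation)


def tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def drop_punctuation(tokens: List[str]) -> List[str]:
    """Keep only the tokens that are not a single ASCII punctuation mark."""
    return [token for token in tokens if token not in PUNCTUATION]


tokens = tokenize("Hello, world! This is a test.")
print(tokens)                    # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
print(drop_punctuation(tokens))  # ['Hello', 'world', 'This', 'is', 'a', 'test']
```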
Refer to the [TubeLearns documentation](https://github.com/KabilPreethamK/tubelearns/blob/main/Documentation.md) for detailed usage instructions and examples.
## Contributing
If you'd like to contribute to TubeLearns or report issues, please check out the [GitHub repository](https://github.com/KabilPreethamK/tubelearns).
## License
This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
## Acknowledgments
- [YouTube Transcript API](https://github.com/jdepoix/youtube-transcript-api)
- [PyTube](https://github.com/pytube/pytube)
- [spaCy](https://spacy.io/)
- [num2words](https://github.com/savoirfairelinux/num2words)
---
Enjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to [get in touch](mailto:tubelearnsofficial@gmail.com).
Raw data
{
"_id": null,
"home_page": "",
"name": "tubelearns",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "python,video,transcript,raw data,cleaning,machine learning,pre-processing",
"author": "KabilPreethamK",
"author_email": "<kabilpreethamk@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/a6/55/71245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07/tubelearns-2.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Python script for extracting, cleaning, and tokenizing YouTube video transcripts for Pre-Processing in machine learning.",
"version": "2.1.0",
"project_urls": null,
"split_keywords": [
"python",
"video",
"transcript",
"raw data",
"cleaning",
"machine learning",
"pre-processing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0ad1f39b8562c2512b70add81e712f60df10e6ff26230893a655148da707f9a4",
"md5": "a450abbfa4cc302f2727cc77bd5f8383",
"sha256": "8bc045bce373fbaf0492d06672e02a1f4a79783b5c4485601a27772f37902d47"
},
"downloads": -1,
"filename": "tubelearns-2.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a450abbfa4cc302f2727cc77bd5f8383",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 6834,
"upload_time": "2024-03-10T11:03:38",
"upload_time_iso_8601": "2024-03-10T11:03:38.806612Z",
"url": "https://files.pythonhosted.org/packages/0a/d1/f39b8562c2512b70add81e712f60df10e6ff26230893a655148da707f9a4/tubelearns-2.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a65571245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07",
"md5": "b321541d5dd3693c2d234b85905e0cdf",
"sha256": "d884ec2870005914107d6755df783723cd328bfbf48a3b85753b234d222e4a57"
},
"downloads": -1,
"filename": "tubelearns-2.1.0.tar.gz",
"has_sig": false,
"md5_digest": "b321541d5dd3693c2d234b85905e0cdf",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 7420,
"upload_time": "2024-03-10T11:03:40",
"upload_time_iso_8601": "2024-03-10T11:03:40.581403Z",
"url": "https://files.pythonhosted.org/packages/a6/55/71245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07/tubelearns-2.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-10 11:03:40",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "tubelearns"
}