tubelearns


- Name: tubelearns
- Version: 2.1.0
- Summary: Python script for extracting, cleaning, and tokenizing YouTube video transcripts for Pre-Processing in machine learning.
- Upload time: 2024-03-10 11:03:40
- Author: KabilPreethamK
- Keywords: python, video, transcript, raw data, cleaning, machine learning, pre-processing
- Requirements: No requirements were recorded.
            

## TubeLearns: YouTube Video Transcript Extractor

TubeLearns is a Python package for extracting and cleaning YouTube video transcripts so they can be used as preprocessed input for machine learning. It automates the collection of text data from YouTube videos for natural language processing tasks such as sentiment analysis, and for related work such as speech recognition.

## Features

- Extracts video transcripts from YouTube videos.
- Saves cleaned transcripts into separate text files.
- Supports individual video URLs, batch processing from a list of URLs, and entire playlists.
- Streamlines the dataset collection process for machine learning applications.
- New feature: tokenization and punctuation removal for text preprocessing and cleaning.
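
As a rough illustration of the "separate text files" feature listed above, the sketch below writes each transcript to its own `.txt` file. It is not TubeLearns code: the `save_transcripts` helper, the file naming scheme, and the input mapping are assumptions made for this example.

```python
import re
from pathlib import Path


def save_transcripts(transcripts, out_dir):
    """Write each transcript to its own UTF-8 text file.

    `transcripts` maps a video identifier to its cleaned text.
    Hypothetical helper, not part of the TubeLearns API.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for video_id, text in transcripts.items():
        # Replace characters that are unsafe in file names with underscores.
        safe_name = re.sub(r"[^A-Za-z0-9_-]", "_", video_id)
        target = out_path / f"{safe_name}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Keeping one file per video makes it easy to drop or re-fetch a single transcript without touching the rest of the dataset.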

## Installation

You can install TubeLearns using pip:

```bash
pip install tubelearns
```
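
To confirm which version ended up installed, one option is the standard library's `importlib.metadata` (Python 3.8+). This is a minimal sketch; the `installed_version` helper is a name chosen for this example and is not something TubeLearns provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if tubelearns is installed, otherwise None.
print(installed_version("tubelearns"))
```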

## Usage

### Playlist Grabbing

```python
from tubelearns import Acquisition

# Initialize the Acquisition class
model = Acquisition()

# Grab transcripts from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.PlaylistGrab(playlist_url, name="raw_data")
```
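
The URL in the example above is a placeholder. YouTube playlist links commonly carry the playlist ID in a `list` query parameter, so a quick sanity check before calling `PlaylistGrab` could look like the sketch below. The `playlist_id` helper is hypothetical and uses only the standard library; it is not part of TubeLearns.

```python
from urllib.parse import parse_qs, urlparse


def playlist_id(url):
    """Return the value of the `list` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None


# A URL without a `list` parameter is probably not a playlist link.
if playlist_id("https://www.youtube.com/your_playlist_url") is None:
    print("No playlist ID found in the URL")
```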

### Extract Video Links from Playlist

```python
from tubelearns import Acquisition

model = Acquisition()
# Extract video links from a YouTube playlist
playlist_url = 'https://www.youtube.com/your_playlist_url'
model.Play2Text(playlist_url)
```
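
Once video links have been collected, it can help to reduce them to unique video IDs before further processing, since the same video may appear under different URL forms. The sketch below assumes links in the common `watch?v=` and `youtu.be/` shapes; the `video_id` and `unique_video_ids` helpers are hypothetical and say nothing about the format `Play2Text` actually returns.

```python
from urllib.parse import parse_qs, urlparse


def video_id(url):
    """Extract the video ID from a watch?v= or youtu.be link, else None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() in ("youtu.be", "www.youtu.be"):
        # Short links keep the ID in the path.
        return parsed.path.lstrip("/") or None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


def unique_video_ids(urls):
    """Return video IDs in first-seen order, skipping duplicates and misses."""
    seen = []
    for url in urls:
        vid = video_id(url)
        if vid and vid not in seen:
            seen.append(vid)
    return seen
```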

### Tokenization and Cleaning

```python
from tubelearns.tokenizers import Tokenization, Cleaning

# Initialize the Tokenization and Cleaning classes
tokenizer = Tokenization()
cleaner = Cleaning()

# Tokenize the text, then pass the tokens to PunctList for punctuation cleaning
text_data = "Your input text here."
tokenized_data = tokenizer.TokenizeRaw(text_data)
cleaned_data = tokenizer.PunctList(tokenized_data)
```
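
For readers unfamiliar with the two steps, here is a minimal standard-library sketch of tokenization followed by punctuation removal. It does not reproduce what `TokenizeRaw` or `PunctList` do internally; the regular expression and the `tokenize` and `remove_punctuation` helpers are assumptions chosen only to illustrate the idea.

```python
import re
import string


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_punctuation(tokens):
    """Drop tokens made up only of ASCII punctuation characters."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


tokens = tokenize("Hello, world! It's 2024.")
print(tokens)
print(remove_punctuation(tokens))
```

Splitting punctuation into its own tokens first keeps the removal step a simple filter, at the cost of breaking contractions such as "It's" into separate pieces.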

Refer to the [TubeLearns documentation](https://github.com/KabilPreethamK/tubelearns/blob/main/Documentation.md) for detailed usage instructions and examples.

## Contributing

If you'd like to contribute to TubeLearns or report issues, please check out the [GitHub repository](https://github.com/KabilPreethamK/tubelearns).

## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.

## Acknowledgments

- [YouTube Transcript API](https://github.com/jdepoix/youtube-transcript-api)
- [PyTube](https://github.com/pytube/pytube)
- [spaCy](https://spacy.io/)
- [num2words](https://github.com/savoirfairelinux/num2words)

---


Enjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to [get in touch](mailto:tubelearnsofficial@gmail.com).




            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "tubelearns",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,video,transcript,raw data,cleaning,machine learning,pre-processing",
    "author": "KabilPreethamK",
    "author_email": "<kabilpreethamk@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a6/55/71245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07/tubelearns-2.1.0.tar.gz",
    "platform": null,
    "description": "\n\n## TubeLearns: YouTube Video Transcript Extractor\n\nTubeLearns is a Python script designed for extracting and cleaning YouTube video transcripts for preprocessing in machine learning. This versatile tool streamlines the process of acquiring high-quality text data from YouTube videos, making it ideal for various natural language processing tasks, sentiment analysis, speech recognition, and more.\n\n## Features\n\n- Extracts video transcripts from YouTube videos.\n- Saves cleaned transcripts into separate text files.\n- Supports individual video URLs, batch processing from a list of URLs, and entire playlists.\n- Streamlines the dataset collection process for machine learning applications.\n- New Feature: Tokenization and Punctuation Removal for text preprocessing and cleaning.\n\n## Installation\n\nYou can install TubeLearns using pip:\n\n```bash\npip install tubelearns\n```\n\n## Usage\n\n### Playlist Grabbing\n\n```python\nfrom tubelearns import Acquisition\n\n# Initialize the Acquisition class\nmodel = Acquisition()\n\n# Grab transcripts from a YouTube playlist\nplaylist_url = 'https://www.youtube.com/your_playlist_url'\nmodel.PlaylistGrab(playlist_url, name=\"raw_data\")\n```\n\n### Extract Video Links from Playlist\n\n```python\n# Extract video links from a YouTube playlist\nplaylist_url = 'https://www.youtube.com/your_playlist_url'\nmodel.Play2Text(playlist_url)\n```\n\n### Tokenization and Cleaning\n\n```python\nfrom tubelearns.tokenizers import Tokenization, Cleaning\n\n# Initialize the Tokenization class\ntokenizer = Tokenization()\ncleaner = Cleaning()\n\n# Tokenize text data\ntext_data = \"Your input text here.\"\ntokenized_data = tokenizer.TokenizeRaw(text_data)\ncleaned_data = tokenizer.PunctList(tokenized_data)\n```\n\nRefer to the [TubeLearns documentation](https://github.com/KabilPreethamK/tubelearns/blob/main/Documentation.md) for detailed usage instructions and examples.\n\n## Contributing\n\nIf you'd like to contribute to 
TubeLearns or report issues, please check out the [GitHub repository](https://github.com/KabilPreethamK/tubelearns).\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.\n\n## Acknowledgments\n\n- [YouTube Transcript API](https://github.com/jdepoix/youtube-transcript-api)\n- [PyTube](https://github.com/pytube/pytube)\n- [spaCy](https://spacy.io/)\n- [num2words](https://github.com/savoirfairelinux/num2words)\n\n---\n\n\nEnjoy using TubeLearns! If you have any questions or encounter issues, please don't hesitate to [get in touch](mailto:tubelearnsofficial@gmail.com).\n\n\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Python script for extracting, cleaning, and tokenizing YouTube video transcripts for Pre-Processing in machine learning.",
    "version": "2.1.0",
    "project_urls": null,
    "split_keywords": [
        "python",
        "video",
        "transcript",
        "raw data",
        "cleaning",
        "machine learning",
        "pre-processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0ad1f39b8562c2512b70add81e712f60df10e6ff26230893a655148da707f9a4",
                "md5": "a450abbfa4cc302f2727cc77bd5f8383",
                "sha256": "8bc045bce373fbaf0492d06672e02a1f4a79783b5c4485601a27772f37902d47"
            },
            "downloads": -1,
            "filename": "tubelearns-2.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a450abbfa4cc302f2727cc77bd5f8383",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 6834,
            "upload_time": "2024-03-10T11:03:38",
            "upload_time_iso_8601": "2024-03-10T11:03:38.806612Z",
            "url": "https://files.pythonhosted.org/packages/0a/d1/f39b8562c2512b70add81e712f60df10e6ff26230893a655148da707f9a4/tubelearns-2.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a65571245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07",
                "md5": "b321541d5dd3693c2d234b85905e0cdf",
                "sha256": "d884ec2870005914107d6755df783723cd328bfbf48a3b85753b234d222e4a57"
            },
            "downloads": -1,
            "filename": "tubelearns-2.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "b321541d5dd3693c2d234b85905e0cdf",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7420,
            "upload_time": "2024-03-10T11:03:40",
            "upload_time_iso_8601": "2024-03-10T11:03:40.581403Z",
            "url": "https://files.pythonhosted.org/packages/a6/55/71245520ff10df487a46ced63e806720f70fe97f724f76ffe94469618b07/tubelearns-2.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-10 11:03:40",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "tubelearns"
}
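
The `digests` entries in the record above can be used to check a downloaded file. The following is a minimal sketch using `hashlib` from the standard library; the `sha256_of_file` and `matches_digest` helpers and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA-256 digest equals the expected value."""
    return sha256_of_file(path) == expected_hex.lower()


# Example usage, assuming the sdist was saved next to this script:
# matches_digest(
#     "tubelearns-2.1.0.tar.gz",
#     "d884ec2870005914107d6755df783723cd328bfbf48a3b85753b234d222e4a57",
# )
```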
        