| Name | huggify-data |
| Version | 0.4.4 |
| Summary | This is a helper library to push data to HuggingFace. |
| Author | Yiqiao Yin |
| Home page | None |
| Maintainer | None |
| Docs URL | None |
| Requires Python | <3.13,>=3.9 |
| License | None |
| Keywords | None |
| Upload time | 2024-07-24 04:39:51 |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |
# huggify-data
## Introduction
**huggify-data** 📦 is a Python library 🐍 designed to simplify the process of scraping `.pdf` documents, generating question-answer pairs using `openai`, conversing with the document, and uploading datasets 📊 to the Hugging Face Hub 🤗. The library lets you verify ✅, process 🔄, and push 🚀 a pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets. The current version also lets you fine-tune the Llama2 model on your own proprietary data. As the name suggests, **huggify-data** aims to make data handling feel as reassuring and pleasant as a hug.
[Watch the demo video](https://www.youtube.com/watch?v=XLExhyangWw)
## Repo
You can access the repo here: [✨ Huggify Data ✨](https://github.com/yiqiao-yin/huggify-data)
## Installation
To use **huggify-data**, install the package from PyPI with pip:
```sh
pip install huggify-data
```
## Notebooks
We have made tutorial notebooks available to guide you through the process step-by-step:
- **Step 1**: Scrape any `.pdf` file and generate question-answer pairs. [Link](https://github.com/yiqiao-yin/WYNAssociates/blob/main/docs/ref-deeplearning/ex_%20-%20huggify%20data%20-%20part%201%20-%20scrape%20and%20generate%20qa.ipynb)
- **Step 2**: Fine-tune the Llama2 model on customized data. [Link](https://github.com/yiqiao-yin/WYNAssociates/blob/main/docs/ref-deeplearning/ex_%20-%20huggify%20data%20-%20part%202%20-%20fine%20tune%20llama2%20over%20custom%20data.ipynb)
- **Step 3**: Perform inference on customized data. [Link](https://github.com/yiqiao-yin/WYNAssociates/blob/main/docs/ref-deeplearning/ex_%20-%20huggify%20data%20-%20part%203%20-%20inference%20using%20fine%20tuned%20llama2.ipynb)
## Examples
Here's a complete example illustrating how to use **huggify-data** to scrape a PDF and turn it into question-answer pairs. The following block of code scrapes the content and converts it into a pandas DataFrame, which you can then save locally as a `.csv`:
```python
from huggify_data.scrape_modules import *
# Example usage:
pdf_path = "path_of_pdf.pdf"
openai_api_key = "<sk-API_KEY_HERE>"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)
```
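The generator above returns a `pandas` DataFrame rather than writing a file, so persisting the pairs takes one extra line. A minimal sketch (the `qa_pairs.csv` file name and the toy rows are illustrative only):

```python
import pandas as pd

# A stand-in for the DataFrame returned by convert_to_dataframe():
# one row per pair, with "questions" and "answers" columns.
df = pd.DataFrame({
    "questions": ["What is huggify-data?"],
    "answers": ["A helper library to push data to HuggingFace."],
})

# Save the pairs locally so later steps can reuse them.
df.to_csv("qa_pairs.csv", index=False)

# Reload to confirm the round trip preserved the columns.
reloaded = pd.read_csv("qa_pairs.csv")
print(list(reloaded.columns))  # ['questions', 'answers']
```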
Once you have a `pd.DataFrame` (or a `.csv` loaded into one) from the previous chunk of code, you can run the following to iteratively generate a set of `.md` files from it.
```python
from huggify_data.bot_modules import ChatBot
from huggify_data.generate_md_modules import *

# This will generate a list of .md files in the current working
# directory, so make sure you are in the desired directory first.
bot = ChatBot(api_key=openai_api_key)
markdown_generator = MarkdownGenerator(bot, df)
markdown_generator.generate_markdown()
```
After the `.md` files are generated, you can navigate to [this Hugging Face Space](https://huggingface.co/spaces/eagle0504/llama-openai-demo/blob/main/app.py) to build a chatbot that reads the `.md` files through a **Retrieval-Augmented Generation (RAG)** pipeline built with **llama_index**.
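If you want to sanity-check the generated files before wiring them into a Space, a small loader is enough. This is only an illustrative sketch of what a llama_index reader would do with the directory; the file names are assumptions:

```python
import tempfile
from pathlib import Path

def collect_markdown(directory):
    """Read every .md file under `directory` into a {filename: text} map."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(directory).glob("*.md"))}

# Quick self-check against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "qa_000.md").write_text("**Q:** What is YSA?", encoding="utf-8")
    Path(tmp, "qa_001.md").write_text("**Q:** Where is it located?", encoding="utf-8")
    docs = collect_markdown(tmp)

print(sorted(docs))  # ['qa_000.md', 'qa_001.md']
```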
Once you have created a data frame of question-answer pairs, you can have a conversation with your data:
```python
from huggify_data.bot_modules import *
current_prompt = "<question_about_the_document>"
chatbot = ChatBot(api_key=openai_api_key)
response = chatbot.run_rag(openai_api_key, current_prompt, df, top_n=2)
print(response)
```
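The `top_n` argument presumably controls how many rows of the DataFrame are retrieved as context for the answer. As a rough illustration of that retrieval step (a toy word-overlap score; the real `run_rag` internals are not documented here and likely use embeddings):

```python
def top_n_rows(prompt, rows, top_n=2):
    """Rank candidate text rows by naive word overlap with the prompt.

    Toy stand-in for the retrieval step inside run_rag, shown only to
    illustrate what the top_n parameter controls.
    """
    prompt_words = set(prompt.lower().split())
    scored = [(len(prompt_words & set(row.lower().split())), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:top_n]]

rows = ["the cat sat", "dogs bark loudly", "a cat and a dog"]
print(top_n_rows("cat dog", rows, top_n=2))  # ['a cat and a dog', 'the cat sat']
```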
Moreover, you can push it to the cloud. Here's a complete example illustrating how to use the **huggify-data** library to push data (assuming an existing `.csv` file with columns `questions` and `answers`) to Hugging Face Hub:
```python
from huggify_data.push_modules import DataFrameUploader
import pandas as pd

# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
uploader.process_data()
uploader.push_to_hub()
```
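Since the upload assumes `questions` and `answers` columns, it can be worth failing fast before any network call. A hedged sketch (the exact schema check inside `DataFrameUploader` is an assumption, not documented API):

```python
import pandas as pd

REQUIRED_COLUMNS = {"questions", "answers"}  # columns the push example assumes

def validate_qa_frame(df):
    """Raise early, before any upload, if the frame lacks the expected columns."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"DataFrame is missing columns: {sorted(missing)}")

good = pd.DataFrame({"questions": ["What is YSA?"], "answers": ["A youth shelter."]})
validate_qa_frame(good)  # no exception for a well-formed frame
```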
Here's a complete example illustrating how to use the **huggify-data** library to fine-tune a Llama2 model (assuming your dataset is already available on the Hugging Face Hub, e.g. pushed in the previous step):
```python
from huggify_data.train_modules import *
# Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf" # Recommended base model
dataset_name = "eagle0504/sample_toy_data_v9" # Desired name, e.g., <hf_user_id>/<desired_name>
new_model = "youthless-homeless-shelter-web-scrape-dataset-v4" # Desired name
from google.colab import userdata  # `userdata` is specific to Google Colab; elsewhere, load the token from your own secret store
huggingface_token = userdata.get('HF_TOKEN')
# Initiate
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()
training_args = trainer.configure_training_arguments(num_train_epochs=1)
# Train
trainer.train_model(training_args, peft_config)
# Inference
some_model, some_tokenizer = trainer.load_model_and_tokenizer(
base_model_path="NousResearch/Llama-2-7b-chat-hf",
    new_model_path="ysa-test-july-4-v3",  # replace with the name/path of your own fine-tuned model
)
prompt = "hi, tell me a joke"
response = trainer.generate_response(
some_model,
some_tokenizer,
prompt,
max_len=200)
print(response)
```
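The inference example below prompts the model with a `### Human: ... ### Assistant:` template, which suggests the fine-tuning data is serialized the same way. A tiny helper for that format (this mirrors, but is not guaranteed to match byte-for-byte, whatever `LlamaTrainer` does internally):

```python
def to_llama_prompt(question, answer):
    """Join one question-answer pair into the single-string
    '### Human / ### Assistant' format used by the inference example."""
    return f"### Human: {question} ### Assistant: {answer}"

print(to_llama_prompt("What is YSA?", "A youth shelter."))
# ### Human: What is YSA? ### Assistant: A youth shelter.
```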
To perform inference, please follow the example below:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="eagle0504/youthless-homeless-shelter-web-scrape-dataset-v4") # Same name as above
response = pipe("### Human: What is YSA? ### Assistant: ")
print(response[0]["generated_text"])
print(response[0]["generated_text"].split("### ")[-1])
```
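The final `split("### ")[-1]` keeps only the last turn of the generation. If you also want to drop the `Assistant:` prefix, a small helper does it (the marker strings are taken from the example above; adjust them if your template differs):

```python
def extract_assistant_reply(generated_text):
    """Return just the assistant's words from a generation that echoes
    the '### Human: ... ### Assistant: ...' prompt template."""
    last_turn = generated_text.split("### ")[-1]
    prefix = "Assistant: "
    return last_turn[len(prefix):] if last_turn.startswith(prefix) else last_turn

sample = "### Human: What is YSA? ### Assistant: A youth homeless shelter."
print(extract_assistant_reply(sample))  # A youth homeless shelter.
```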
## License
This project is licensed under the MIT License. See the [LICENSE](https://github.com/yiqiao-yin/huggify-data/blob/main/LICENSE) file for more details.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.
## Contact
For any questions or support, please contact [eagle0504@gmail.com](mailto:eagle0504@gmail.com).
## About Me
Hello there! I'm excited to share a bit about myself and my projects. Check out these links for more information:
- 🏠 **Personal Site**: [✨ y-yin.io ✨](https://www.y-yin.io/)
- 🎓 **Education Site**: [📚 Future Minds 📚](https://www.future-minds.io/)
Feel free to explore and connect with me! 😊