gutenbergpy

-   Name: gutenbergpy
-   Version: 0.3.5
-   Home page: https://github.com/raduangelescu/gutenbergpy
-   Summary: Library to create and interrogate a local cache for Project Gutenberg
-   Upload time: 2023-03-27 07:30:10
-   Author: Radu Angelescu
-   Requires Python: >=3.6
-   License: LICENSE.txt

GutenbergPy
===========

![image](https://github.com/raduangelescu/gutenbergpy/blob/master/dblogos.png?raw=true)

This package makes filtering and getting information from [Project Gutenberg](http://www.gutenberg.org) easier from Python.

Its target audience is machine learning practitioners who need data for their projects, but anybody may use it freely.

The package:

-   Generates a local cache of all Project Gutenberg metadata that you can query to get book ids. The local cache may be SQLite (default) or MongoDB (for which you need the `pymongo` package installed)
-   Downloads and cleans raw text from gutenberg books

The package has been tested with Python 3.6 on both Windows and Linux. It is a faster, smaller and less third-party-intensive alternative to <https://github.com/c-w/Gutenberg>

About development: <http://www.raduangelescu.com/gutenbergpy.html>

Installation
============

```
pip install gutenbergpy
```

or just install it from source (it's all just python code):

```
git clone https://github.com/raduangelescu/gutenbergpy
cd gutenbergpy
python setup.py install
```

Usage
=====

Downloading a text
------------------
```
import gutenbergpy.textget

```
After importing the module, we can download a text from Project Gutenberg.

```python
def usage_example():
    # This gets a book by its gutenberg id number
    raw_book = gutenbergpy.textget.get_text_by_id(2701) # with headers
    clean_book = gutenbergpy.textget.strip_headers(raw_book) # without headers
    return clean_book, raw_book
```
The code above works just as well without the function declaration; the function is only there for illustration.

```python
cleaned_book, raw_book = usage_example()

# Cleaned Book
print(f'Example phrase from the cleaned book: {" ".join(str(cleaned_book[3000:3050]).split(" "))}')
# Raw Book
print(f'Example phrase from the raw book: {" ".join(str(raw_book[3000:3050]).split(" "))}')

```
The output of the code above is:
```
b'rgris.\n\nCHAPTER 93. The Castaway.\n\nCHAPTER 94. A S'
b'\n\n\n\nMOBY-DICK;\n\nor, THE WHALE.\n\nBy Herman Melville\n\n\n\nCONTENTS\n\nETYMOLOGY.\n\nEXTRACTS (Supplied by a Sub-Sub-Librarian).\n\nCHAPTER 1. Loomings.\n\nCHAPTER 2. The Carpet-Bag.\n\nCHAPTER 3. The Spouter-Inn.\n\nCHAPTER 4. The Counterpane.\n\nCHAPTER 5. Breakfast.\n\nCHAPTER 6. The Street.\n\nCHAPTER 7. The Chapel.\n\nCHAPTER 8. The Pulpit.\n\nCHAPTER 9. The Sermon.\n\nCHAPTER 10. A Bosom Friend.\n\nCHAPTER 11. Nightgown.\n\nCHAPTER 12. Biographical.\n\nCHAPTER 13. Wheelbarrow.\n\nCHAPTER 14. Nantucket.\n\nCHAPTER 15. Chowder.\n\nCHAPTER 16. The Ship.\n\nCHAPTER 17. The Ramadan.\n\nCHAPTER 18. His Mark.\n\nCHAPTER 19. The Prophet.\n\nCHAPTER 20. All Astir.\n\nCHAPTER 21. Going Aboard.\n\nCHAPTER 22. Merry Christmas.\n\nCHAPTER 23. The Lee Shore.\n\nCHAPTER 24. The Advocate.\n\nCHAPTER 25. Postscript.\n\nCHAPTER 26. Knights and Squires.\n\nCHAPTER 27. Knights and Squires.\n\nCHAPTER 28. Ahab.\n\nCHAPTER 29. Enter Ahab; to Him, Stubb.\n\nCHAPTER 30. The Pipe.\n\nCHAPTER 31. Queen Mab.\n\nCHAPTER 32. Cetology.\n\nCHAPTER 33. The Specksnyder.\n\nCHAPTER 34. Th'
```
Both excerpts are raw `bytes` that still contain newline escapes and encoded punctuation, so they need further cleaning before use in NLP and similar tasks.

The Raw book:
```output
b'b\xe2\x80\x99s Supper.\r\n\r\nCHAPTER 65. The Whale as a Dish.\r'
```
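As a minimal sketch of that extra cleaning step, the snippet below decodes the bytes and normalises line endings. The helper name `bytes_to_text` is hypothetical (it is not part of gutenbergpy), and UTF-8 is an assumption based on the `\xe2\x80\x99` sequence visible in the raw excerpt above.

```python
import re


def bytes_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode downloaded bytes and tidy up the whitespace (illustrative helper)."""
    # Assumption: the book is UTF-8; undecodable bytes are replaced, not raised.
    text = raw.decode(encoding, errors="replace")
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Sample bytes copied from the raw-book excerpt shown above.
sample = b"b\xe2\x80\x99s Supper.\r\n\r\nCHAPTER 65. The Whale as a Dish.\r"
cleaned_sample = bytes_to_text(sample)
print(cleaned_sample)
```

The same helper could presumably be applied to the value returned by `strip_headers`, since that is also printed as `bytes` above.
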
Query the cache
---------------

To do this you first need to create the cache. This is a one-time step per OS installation, until you decide to rebuild it.

```
# GutenbergCacheTypes is assumed to live in the same module as GutenbergCache
from gutenbergpy.gutenbergcache import GutenbergCache, GutenbergCacheTypes
#for sqlite
GutenbergCache.create()
#for mongodb
GutenbergCache.create(type=GutenbergCacheTypes.CACHE_TYPE_MONGODB)
```

For debugging or finer control, `create` accepts these boolean options:

> -   *refresh* deletes the old cache
> -   *download* downloads the RDF file from Project Gutenberg
> -   *unpack* unpacks it
> -   *parse* parses it in memory
> -   *cache* writes the cache

```
GutenbergCache.create(refresh=True, download=True, unpack=True, parse=True, cache=True, deleteTemp=True)
```

For even finer control you may set the following `GutenbergCacheSettings`:
-   *CacheFilename*
-   *CacheUnpackDir*
-   *CacheArchiveName*
-   *ProgressBarMaxLength*
-   *CacheRDFDownloadLink*
-   *TextFilesCacheFolder*
-   *MongoDBCacheServer*

```
GutenbergCacheSettings.set( CacheFilename="", CacheUnpackDir="",
    CacheArchiveName="", ProgressBarMaxLength="", CacheRDFDownloadLink="", TextFilesCacheFolder="", MongoDBCacheServer="")
```

After calling `create`, go grab a coffee: it takes about 5 minutes, depending on your internet speed and computer power. (On an i7 with a gigabit connection and an SSD it finishes in about 1 minute.)
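
Because creation is slow, it can help to build the cache only when it is missing. The sketch below is one way to do that; `ensure_cache` is a hypothetical helper, not a gutenbergpy function. It assumes the SQLite cache is a single file whose path you know (for example, whatever you configured as `CacheFilename`). The demo uses a stand-in `create` so it runs offline.

```python
import os
import tempfile
from typing import Callable


def ensure_cache(cache_path: str, create: Callable[[], None]) -> bool:
    """Call `create` only if `cache_path` does not exist yet.

    Returns True when the cache was (re)built, False when it was already there.
    """
    if os.path.exists(cache_path):
        return False
    create()
    return True


# Offline demo: a stand-in for GutenbergCache.create that just writes a file.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_cache.db")  # hypothetical file name
    calls = []

    def fake_create() -> None:
        calls.append("create")
        open(demo_path, "w").close()

    first = ensure_cache(demo_path, fake_create)   # file missing, so it builds
    second = ensure_cache(demo_path, fake_create)  # file present, so it skips

print(first, second, len(calls))
```

With the library installed, the second argument would be `GutenbergCache.create` and the first would be the real cache file path, which this README does not state and so is left to the reader.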

Get the cache
```
#for mongodb
cache = GutenbergCache.get_cache(GutenbergCacheTypes.CACHE_TYPE_MONGODB)
#for sqlite
cache  = GutenbergCache.get_cache()
```
Now you can run queries.

Use the `query` function to get the unique Gutenberg ids of matching books.

Standard query fields:
-   languages
-   authors
-   types
-   titles
-   subjects
-   publishers
-   bookshelves
-   downloadtype
```
print(cache.query(downloadtype=['application/plain','text/plain','text/html; charset=utf-8']))
```
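The ids returned by a query can then be fed to the download functions shown earlier. The following is a rough sketch of that loop, assuming `cache.query` returns an iterable of integer ids. `fetch_clean_texts` is a hypothetical helper; it takes the download and cleaning functions as arguments so the example can run offline with stand-ins.

```python
from typing import Callable, Dict, Iterable


def fetch_clean_texts(
    book_ids: Iterable[int],
    fetch: Callable[[int], bytes],
    clean: Callable[[bytes], bytes],
    limit: int = 3,
) -> Dict[int, bytes]:
    """Download and clean at most `limit` books, skipping ids that fail."""
    texts: Dict[int, bytes] = {}
    for book_id in book_ids:
        if len(texts) >= limit:
            break
        try:
            texts[book_id] = clean(fetch(book_id))
        except Exception as error:  # sketch only: real code may want narrower handling
            print(f"skipping {book_id}: {error}")
    return texts


# Stand-ins for gutenbergpy.textget.get_text_by_id and strip_headers.
def fake_fetch(book_id: int) -> bytes:
    if book_id == 2:
        raise IOError("not available")
    return b"HEADER\n" + f"text of book {book_id}".encode()


def fake_clean(raw: bytes) -> bytes:
    # Drop everything up to and including the first newline.
    return raw.split(b"\n", 1)[1]


result = fetch_clean_texts([1, 2, 3, 4, 5], fake_fetch, fake_clean, limit=3)
print(sorted(result))
```

With the real library, the call would presumably look like `fetch_clean_texts(cache.query(...), gutenbergpy.textget.get_text_by_id, gutenbergpy.textget.strip_headers)`. The `limit` keeps a broad query from triggering thousands of downloads.
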
Or run a native query against the underlying database:
```
#sqlite
cache.native_query("SELECT * FROM books")
#mongodb
cache.native_query({'type': 'Text'})
```
For custom SQLite queries, take a look at the SQLite database schema:

![image](https://github.com/raduangelescu/gutenbergpy/blob/master/sqlitecheme.png?raw=true)
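
To give a feel for the kind of SQL string a native query might carry, here is a small self-contained sketch using Python's built-in `sqlite3` module on an in-memory database. Only the table name `books` comes from the example above; the columns `book_id` and `downloads` are hypothetical stand-ins, so check the schema image for the real names before adapting it.

```python
import sqlite3

# In-memory stand-in database; the column names are illustrative only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, book_id INTEGER, downloads INTEGER)"
)
connection.executemany(
    "INSERT INTO books (book_id, downloads) VALUES (?, ?)",
    [(2701, 900), (1342, 1500), (84, 1200)],
)

# Pick the two most downloaded books above a threshold.
rows = connection.execute(
    "SELECT book_id FROM books WHERE downloads > ? ORDER BY downloads DESC LIMIT 2",
    (1000,),
).fetchall()
top_ids = [row[0] for row in rows]
print(top_ids)

connection.close()
```

Against the real cache, the SQL text would be passed to `cache.native_query` instead. Whether that function accepts bound parameters is not stated here, so the sketch makes no claim about it.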

For MongoDB queries, everything lives in the books collection. Each book has the following fields:

> -   book(publisher, rights, language, book\_shelf, gutenberg\_book\_id, date\_issued, num\_downloads, titles, subjects, authors, files, type)

            
