lowmemorywordcount


Name: lowmemorywordcount
Version: 0.10
Home page: https://github.com/hansalemaos/lowmemorywordcount
Summary: Fast count of the occurrences of words in a text file or a given string - low memory consumption
Upload time: 2023-07-20 21:30:53
Author: Johannes Fischer
License: MIT
Keywords: word, count, nlp
Requirements: No requirements were recorded.
            
# Fast count of the occurrences of words in a text file or a given string - low memory consumption

## pip install lowmemorywordcount 

#### Tested against Windows 10 / Python 3.10 / Anaconda 

The count_words function counts word occurrences in files and in strings, and exposes options that control what is counted as a word.

### Customization: 

The function takes several optional parameters: the encoding, the error handling, the chunk size for file reading, whether hyphens are kept as part of words, whether purely numeric words are counted, the file mode, case-insensitive counting, and minimum and maximum word lengths. Together these let users tailor the counting to their specific requirements.

### Efficiency: 

The function reads the input file in chunks instead of loading it whole, so it can handle large text files without running into memory-related issues.
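The chunked approach described here can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation: the helper name `count_words_chunked` and the `\w+` pattern are assumptions made for the example. The one subtle point is that a chunk may end in the middle of a word, so the trailing partial word is held back and joined with the next chunk.

```python
import re
from collections import defaultdict

# Hypothetical helper for illustration; not part of lowmemorywordcount.
WORD_RE = re.compile(r"\w+")
TRAILING_WORD_RE = re.compile(r"\w+\Z")


def count_words_chunked(path, chunk_size=8192, encoding="utf-8"):
    counts = defaultdict(int)
    leftover = ""  # partial word carried over from the previous chunk
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        while True:
            # In text mode, read() returns up to chunk_size characters.
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = leftover + chunk
            # If the text ends in a word, it may continue in the next
            # chunk, so keep it back instead of counting it now.
            m = TRAILING_WORD_RE.search(text)
            if m:
                leftover = m.group()
                text = text[: m.start()]
            else:
                leftover = ""
            for word in WORD_RE.findall(text):
                counts[word.lower()] += 1
    # Whatever is left after the last chunk is a complete word.
    if leftover:
        counts[leftover.lower()] += 1
    return counts
```

Only one chunk, the carried-over fragment, and the dictionary of counts are held in memory at a time, so memory use in this sketch depends on the chunk size and the number of distinct words rather than on the file size.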

### Unicode Support: 

The function uses the regex library, which has strong support for Unicode characters. It can therefore handle words from many languages and scripts, which makes it suitable for multilingual text.
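As a small illustration of Unicode-aware word matching, the sketch below uses Python's built-in `re` module, whose `\w` class is also Unicode-aware for `str` patterns. The pattern is an assumption chosen for the example and is not necessarily the one the package uses.

```python
import re

# Illustrative pattern, not taken from lowmemorywordcount.
# [^\W\d_] means: a word character that is neither a digit nor an
# underscore, which leaves letters from any script.
LETTERS_RE = re.compile(r"[^\W\d_]+")

text = "Hello мир 世界 123"
words = LETTERS_RE.findall(text)
print(words)  # ['Hello', 'мир', '世界']
```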

### Word Frequency Counting: 

The function stores word frequencies in a defaultdict. Users can look up a count directly by using the word as a key, and no count ever needs to be initialized manually.
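A short standard-library sketch shows how a `defaultdict(int)` behaves as a counter. The sample sentence is made up for the example. One detail worth knowing when reading results: indexing a missing key returns 0 but also inserts that key, whereas `.get()` does not.

```python
from collections import defaultdict

counts = defaultdict(int)
for word in "the cat saw the other cat".split():
    counts[word] += 1  # missing keys start at 0, no initialization needed

print(counts["cat"])          # 2
print(counts["dog"])          # 0, and 'dog' is now stored as a key
print(counts.get("bird", 0))  # 0, without storing 'bird'
```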

### Flexibility: 

The function accepts either a file path or a string as input, so the text can come from a file on disk or from a dynamically generated string.
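How the package tells a path from a plain string is not documented here. One plausible way to support both kinds of input, shown purely as an assumption, is to check whether the argument names an existing file. The helper `read_text` is hypothetical and reads the whole file only to keep the example short.

```python
import os


def read_text(file_or_string, encoding="utf-8"):
    """Hypothetical helper: return the file's contents if the argument is
    the path of an existing file, otherwise return the argument unchanged."""
    if isinstance(file_or_string, str) and os.path.isfile(file_or_string):
        with open(file_or_string, "r", encoding=encoding, errors="ignore") as f:
            return f.read()
    return file_or_string
```

With a check like this, a string that happens to match an existing file name is treated as a path, which is the usual trade-off of accepting both input kinds through one parameter.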


### Parameters:

- `file_or_string` (str | bytes): The path to the text file or the input string.
- `encoding` (str, optional): The encoding to use for reading the file (default is "utf-8").
- `errors` (str, optional): How to handle encoding errors while reading the file (default is "ignore").
- `chunk_size` (int, optional): The size of the data chunk to read from the file (default is 8192 bytes).
- `words_with_hyphen` (bool, optional): Set to True to include hyphens as part of words (default is True).
- `include_numbers` (bool, optional): Set to True to count purely numeric words such as "111". Mixed strings such as "70s" are always included (default is False).
- `mode` (str, optional): The mode used to open the file (default is "r"). The examples use "rb" together with bytes input.
- `ignore_case` (bool, optional): Set to True to ignore case when counting words (default is True).
- `min_len` (int | None, optional): The minimum length of words to include (default is None, which means no minimum).
- `max_len` (int | None, optional): The maximum length of words to include (default is None, which means no maximum).

### Returns:

defaultdict: A defaultdict with words as keys and their occurrences as values.

### Example:

```python
from lowmemorywordcount import count_words

# Count words in a text file
di = count_words(
    file_or_string=r"F:\textfile.txt",
    encoding="utf-8",
    errors="ignore",
    chunk_size=8192,
    words_with_hyphen=False,
    include_numbers=False,
    mode="r",
    ignore_case=True,
    min_len=None,
    max_len=None,
)

# Count words in a bytes string (mode="rb"); the keys of the result are bytes
di = count_words(
    file_or_string=b"This is a sample text. It contains some words, including words like 'apple' and 'orange'.",
    encoding="utf-8",
    words_with_hyphen=False,
    include_numbers=False,
    ignore_case=True,
    min_len=3,
    max_len=10,
    mode="rb",
)
# Result ("is", "a" and "it" are dropped by min_len=3):
# defaultdict(int,
#             {b'this': 1,
#              b'sample': 1,
#              b'text': 1,
#              b'contains': 1,
#              b'some': 1,
#              b'words': 2,
#              b'including': 1,
#              b'like': 1,
#              b'apple': 1,
#              b'and': 1,
#              b'orange': 1})

# Count words in a str (mode="r"); the keys of the result are str
di = count_words(
    file_or_string="This is a sample text. It contains some words, including words like 'apple' and 'orange'.",
    encoding="utf-8",
    words_with_hyphen=False,
    include_numbers=False,
    ignore_case=True,
    min_len=3,
    max_len=10,
    mode="r",
)
# Result:
# defaultdict(int,
#             {'this': 1,
#              'sample': 1,
#              'text': 1,
#              'contains': 1,
#              'some': 1,
#              'words': 2,
#              'including': 1,
#              'like': 1,
#              'apple': 1,
#              'and': 1,
#              'orange': 1})
```
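Because the return value is an ordinary defaultdict, it works with the usual standard-library tools. The sketch below does not call the package: `di` and `di_bytes` are stand-ins built from the sample output shown in the example, and the variable names are chosen for illustration.

```python
from collections import Counter, defaultdict

# Stand-in for the str result shown in the example output.
di = defaultdict(int, {
    "this": 1, "sample": 1, "text": 1, "contains": 1, "some": 1,
    "words": 2, "including": 1, "like": 1, "apple": 1, "and": 1,
    "orange": 1,
})

# Most frequent words first.
top = Counter(di).most_common(3)
print(top[0])  # ('words', 2)

# Total number of counted word occurrences.
total = sum(di.values())
print(total)  # 12

# A bytes-keyed result (as returned for bytes input) can be converted
# to str keys by decoding each key.
di_bytes = {b"words": 2, b"apple": 1}
decoded = {key.decode("utf-8"): value for key, value in di_bytes.items()}
print(decoded)  # {'words': 2, 'apple': 1}
```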

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hansalemaos/lowmemorywordcount",
    "name": "lowmemorywordcount",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "word,count,nlp",
    "author": "Johannes Fischer",
    "author_email": "aulasparticularesdealemaosp@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/55/10/0370226d937e5ea53f65c9812151117ccf296758515d430389fe512e7147/lowmemorywordcount-0.10.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Fast count of the occurrences of words in a text file or a given string - low memory consumption",
    "version": "0.10",
    "project_urls": {
        "Homepage": "https://github.com/hansalemaos/lowmemorywordcount"
    },
    "split_keywords": [
        "word",
        "count",
        "nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fc381e452257f874b47b7102ab093800cab60af82dacd099c414bd8c3120116d",
                "md5": "bc4698e736386fd710d95052cd2dc82f",
                "sha256": "5ca8d9f87765044cecb8481a7a867b75eb565dfbdada6a5ae87fa9310ba16144"
            },
            "downloads": -1,
            "filename": "lowmemorywordcount-0.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bc4698e736386fd710d95052cd2dc82f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 11680,
            "upload_time": "2023-07-20T21:30:49",
            "upload_time_iso_8601": "2023-07-20T21:30:49.539065Z",
            "url": "https://files.pythonhosted.org/packages/fc/38/1e452257f874b47b7102ab093800cab60af82dacd099c414bd8c3120116d/lowmemorywordcount-0.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "55100370226d937e5ea53f65c9812151117ccf296758515d430389fe512e7147",
                "md5": "8d6aa0dcda3dfc348c3ab18b9f6be3b5",
                "sha256": "0886d812ffd2b4b15c59638e8a91aa96787ae60607d9dbc355c40f02cc45ba97"
            },
            "downloads": -1,
            "filename": "lowmemorywordcount-0.10.tar.gz",
            "has_sig": false,
            "md5_digest": "8d6aa0dcda3dfc348c3ab18b9f6be3b5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 10069,
            "upload_time": "2023-07-20T21:30:53",
            "upload_time_iso_8601": "2023-07-20T21:30:53.288538Z",
            "url": "https://files.pythonhosted.org/packages/55/10/0370226d937e5ea53f65c9812151117ccf296758515d430389fe512e7147/lowmemorywordcount-0.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-20 21:30:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hansalemaos",
    "github_project": "lowmemorywordcount",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "lowmemorywordcount"
}
        