# binary2strings - Python module to extract strings from binary blobs
Python module to extract ASCII, UTF-8, and wide strings from binary data, with support for Unicode characters. It is a fast wrapper around compiled C++ code, designed to extract strings from binary content such as compiled executables.
Supported string formats:
* UTF-8 (8-bit Unicode variable-length characters)
* Wide-character strings (UCS-2 Unicode fixed 16-bit characters)
International language string extraction is supported for both UTF-8 and wide-character strings. For example, Chinese simplified, Japanese, and Korean strings will be extracted.
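To make the two formats concrete, the sketch below shows how the same text is laid out in bytes under each encoding, using only the Python standard library. It assumes wide strings are stored as little-endian 16-bit units (as is typical on Windows); the variable names are illustrative and not part of this module.

```python
# Sketch: byte layout of the same text in UTF-8 and in a wide encoding.
# Assumption: "wide" means little-endian 16-bit units ("utf-16-le" in Python),
# which matches UCS-2 for characters in the Basic Multilingual Plane.
ascii_text = "abcd"
cjk_text = "世界您好"

utf8_ascii = ascii_text.encode("utf-8")      # 1 byte per ASCII character
wide_ascii = ascii_text.encode("utf-16-le")  # 2 bytes per character
utf8_cjk = cjk_text.encode("utf-8")          # 3 bytes per character for these characters
wide_cjk = cjk_text.encode("utf-16-le")      # 2 bytes per character

print(utf8_ascii, wide_ascii)
print(len(utf8_cjk), len(wide_cjk))
```

This is why wide ASCII text shows up in a binary as characters interleaved with `\x00` bytes.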
It can optionally use a machine learning model to filter out junk strings.
## Installation
Recommended installation method:
```
pip install binary2strings
```
Alternatively, download the repo and run:
```
python setup.py install
```
## Documentation
API:
```python
import binary2strings as b2s
# Returns a list of (string, encoding, span, is_interesting) tuples.
results = b2s.extract_all_strings(buffer, min_chars=4, only_interesting=False)
```
Parameters:
* **buffer:** A `bytes` object to extract strings from. All strings within this buffer will be extracted.
* **min_chars:** (default 4) Minimum number of characters in a valid extracted string. A minimum of 4 is recommended to reduce noise.
* **only_interesting:** (default False) Whether to return only interesting strings. Interesting strings are non-gibberish strings, identified by a lightweight machine learning model. This filters out the vast majority of junk strings, with a low risk of filtering out strings you care about.
Returns a list of tuples, ordered by their position in the binary:
* **string:** The extracted string, as a standard Python string. All strings are converted to UTF-8 here.
* **encoding:** "UTF8" | "WIDE_STRING". This is the encoding of the original string within the binary buffer.
* **span:** (start, end) tuple describing byte indices of where the string starts and ends within the buffer.
* **is_interesting:** Boolean describing whether the string is likely interesting. An interesting string is defined as non-gibberish. A machine learning model is used to compute this flag.
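Because each result is a plain tuple, the list can be post-processed with ordinary Python. The sketch below filters results by encoding or by the `is_interesting` flag. `select_results` is a hypothetical helper written for this illustration, not part of the module, and the sample tuples are copied from the first example output in this README rather than produced by a live call.

```python
from typing import List, Optional, Tuple

# Shape of one result: (string, encoding, span, is_interesting)
Result = Tuple[str, str, Tuple[int, int], bool]


def select_results(
    results: List[Result],
    encoding: Optional[str] = None,
    interesting_only: bool = False,
) -> List[Result]:
    """Hypothetical helper: keep results matching the given encoding and flag."""
    selected = []
    for string, enc, span, is_interesting in results:
        if encoding is not None and enc != encoding:
            continue  # wrong encoding, skip
        if interesting_only and not is_interesting:
            continue  # flagged as gibberish, skip
        selected.append((string, enc, span, is_interesting))
    return selected


# Sample tuples in the documented shape. In real use these would come from
# b2s.extract_all_strings(buffer).
sample = [
    ("hello world", "UTF8", (0, 10), True),
    ("abcd", "WIDE_STRING", (13, 19), False),
]

wide_only = select_results(sample, encoding="WIDE_STRING")
interesting = select_results(sample, interesting_only=True)
print(wide_only)
print(interesting)
```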
## Example usages
Example usage:
```python
import binary2strings as b2s
data = b"hello world\x00\x00a\x00b\x00c\x00d\x00\x00"
result = b2s.extract_all_strings(data, min_chars=4)
print(result)
# [
# ('hello world', 'UTF8', (0, 10), True),
# ('abcd', 'WIDE_STRING', (13, 19), False)
# ]
```
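In the output above, the spans look like they use an inclusive end index: `hello world` is 11 bytes long and is reported as `(0, 10)`. That is inferred from this example only and is not stated in the documentation, so treat it as an assumption. Under that assumption, the raw bytes behind a result can be recovered with a small helper (`bytes_at_span` is a hypothetical name, not part of the module):

```python
def bytes_at_span(buffer: bytes, span: tuple) -> bytes:
    """Return the raw bytes covered by a span.

    Assumption: the end index is inclusive, as the example output suggests.
    """
    start, end = span
    return buffer[start:end + 1]


data = b"hello world\x00\x00a\x00b\x00c\x00d\x00\x00"

utf8_bytes = bytes_at_span(data, (0, 10))
wide_bytes = bytes_at_span(data, (13, 19))
print(utf8_bytes)
print(wide_bytes)
```

Note that under this reading the wide-string span ends on the first byte of the last character, so the slice has an odd length and is not directly decodable as 16-bit text.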
International languages are also supported, for example:
```python
import binary2strings as b2s
# "hello world" in Chinese simplified
string = "\x00世界您好\x00"
data = bytes(string, 'utf-8')
result = b2s.extract_all_strings(data, min_chars=4)
print(result)
# [
# ('世界您好', 'UTF8', (1, 12), False)
# ]
```
Example extracting all strings from a binary file:
```python
import binary2strings as b2s
with open("C:\\Windows\\System32\\cmd.exe", "rb") as f:
    data = f.read()
    # Each result is a (string, encoding, span, is_interesting) tuple.
    for (string, encoding, span, is_interesting) in b2s.extract_all_strings(data):
        print(f"{encoding}:{is_interesting}:{string}")
```
Example extracting only interesting strings from a binary file:
```python
import binary2strings as b2s
with open("C:\\Windows\\System32\\cmd.exe", "rb") as f:
    data = f.read()
    # only_interesting=True drops strings the model flags as gibberish.
    for (string, encoding, span, is_interesting) in b2s.extract_all_strings(data, only_interesting=True):
        print(f"{encoding}:{is_interesting}:{string}")
```
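For large binaries it can be handy to summarize the results instead of printing every string. A minimal sketch, assuming results in the tuple shape documented above; `count_by_encoding` is a hypothetical helper, and the sample list stands in for the output of `b2s.extract_all_strings(data)`:

```python
from collections import Counter


def count_by_encoding(results):
    """Hypothetical helper: count extracted strings per encoding label."""
    return Counter(encoding for _string, encoding, _span, _flag in results)


# Stand-in for real output, in the documented tuple shape.
sample = [
    ("hello world", "UTF8", (0, 10), True),
    ("abcd", "WIDE_STRING", (13, 19), False),
]

counts = count_by_encoding(sample)
for encoding, total in counts.items():
    print(f"{encoding}: {total}")
```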
Raw data
{
"_id": null,
"home_page": "https://github.com/glmcdona/binary2strings",
"name": "binary2strings",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "",
"author": "Geoff McDonald",
"author_email": "glmcdona@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/3e/27/6b4f5883936eba87d4e9c7177b6c413d71749ab691da43bf475c992df93a/binary2strings-0.1.13.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast string extraction from binary buffers.",
"version": "0.1.13",
"project_urls": {
"Homepage": "https://github.com/glmcdona/binary2strings"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "45de180dc8de1be742b065f42714e0c16062b15e53588addb1452679bfd5fcc9",
"md5": "e0923feed37253328bb0bd98ce92e9c8",
"sha256": "02be02f5964726d4a001fb1a23c7feb02d71bfe9f4dbc15f899ef445a1904115"
},
"downloads": -1,
"filename": "binary2strings-0.1.13-cp310-cp310-win_amd64.whl",
"has_sig": false,
"md5_digest": "e0923feed37253328bb0bd98ce92e9c8",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.7",
"size": 160822,
"upload_time": "2023-07-20T03:37:30",
"upload_time_iso_8601": "2023-07-20T03:37:30.281676Z",
"url": "https://files.pythonhosted.org/packages/45/de/180dc8de1be742b065f42714e0c16062b15e53588addb1452679bfd5fcc9/binary2strings-0.1.13-cp310-cp310-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3e276b4f5883936eba87d4e9c7177b6c413d71749ab691da43bf475c992df93a",
"md5": "24960aaf7733e6180b4e4790c9afdcd8",
"sha256": "c6395fc97c4d908b36e08f5a558a79d371a843a8b308e21a0e2b489591877620"
},
"downloads": -1,
"filename": "binary2strings-0.1.13.tar.gz",
"has_sig": false,
"md5_digest": "24960aaf7733e6180b4e4790c9afdcd8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 59217,
"upload_time": "2023-07-20T03:37:32",
"upload_time_iso_8601": "2023-07-20T03:37:32.184808Z",
"url": "https://files.pythonhosted.org/packages/3e/27/6b4f5883936eba87d4e9c7177b6c413d71749ab691da43bf475c992df93a/binary2strings-0.1.13.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-20 03:37:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "glmcdona",
"github_project": "binary2strings",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"appveyor": true,
"lcname": "binary2strings"
}