mft2df


- Name: mft2df
- Version: 0.13
- Home page: https://github.com/hansalemaos/mft2df
- Summary: Lists the files on a drive insanely fast (43 seconds for 1,800,000 files - 600 GB) by converting the $MFT to a pandas DataFrame
- Upload time: 2023-06-28 15:33:37
- Author: Johannes Fischer
- License: MIT
- Keywords: mft, windows, pandas
- Requirements: getfilenuitkapython, pandas
            
# Lists the files on a drive insanely fast (43 seconds for 1,800,000 files - 600 GB) by converting the $MFT to a pandas DataFrame 

## pip install mft2df

### Tested against Windows 10 / Python 3.10 / Anaconda 


The `list_files_from_drive` function retrieves a list of all files on a specified drive.
It is particularly useful for tasks such as file system analysis, data exploration, or building file management utilities.

### Advantages of list_files_from_drive:

- Retrieves file information from the specified drive and returns it as a structured pandas DataFrame, allowing for easy data manipulation and analysis.
- Reads the Master File Table (MFT) with two external utilities: mft.exe (https://github.com/makitos666/MFT_Fast_Transcoder) copies the MFT, and mft_dump.exe (https://github.com/omerbenamram/mft) parses the copy to extract file metadata.
- Uses subprocess calls to execute external commands in a hidden window, providing a seamless user experience.
- Parses the output of the MFT dump into a DataFrame using pandas, enabling efficient data handling and processing.
- Performs data type conversions and date parsing for specific columns, ensuring data consistency and usability.
- Filters out rows with missing FullPath values to ensure the integrity of the data.
- Prepends the drive letter to the FullPath column to create a complete file path.
- Cleans up the temporary MFT dump file after processing.
- Keeps memory usage down by explicitly deleting variables, running garbage collection, and using the low-memory option of the pandas read_csv function.
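
The post-processing steps listed above can be sketched with plain pandas. This is only an illustration under stated assumptions, not the package's actual code: the CSV sample, the reduced column set, the `postprocess_mft_csv` helper name, and the `:\` separator used when prepending the drive letter are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical, heavily shortened stand-in for the CSV produced by an MFT parser.
SAMPLE_CSV = r"""EntryId,FileSize,IsADirectory,IsDeleted,StandardInfoCreated,FullPath
5,0,true,false,2020-03-04T10:38:59.012552Z,Windows
6,211,false,false,2020-03-04T10:39:00.779040Z,Windows\explorer.exe
7,0,false,true,2020-03-04T10:39:00.794664Z,
"""


def postprocess_mft_csv(csv_text, drive="c", convert_dates=True):
    """Illustrative helper (not part of mft2df): clean a parsed MFT dump."""
    # low_memory=True lets pandas process the file in chunks internally.
    df = pd.read_csv(io.StringIO(csv_text), low_memory=True)
    # Drop entries that have no path at all.
    df = df.dropna(subset=["FullPath"]).reset_index(drop=True)
    # Prepend the drive letter (the separator is an assumption).
    df["FullPath"] = drive + ":\\" + df["FullPath"]
    if convert_dates:
        # Optional and slower: turn ISO timestamps into datetime values.
        df["StandardInfoCreated"] = pd.to_datetime(df["StandardInfoCreated"])
    return df


result = postprocess_mft_csv(SAMPLE_CSV)
print(result[["FullPath", "IsADirectory"]])
```

Skipping the date conversion, as `convert_dates=False` does in the real function, leaves the timestamp columns as strings, which is cheaper when the dates are not needed.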


```python
# Args:
#     drive (str): The drive letter to retrieve the files from. Default is "c".
#     convert_dates (bool): Whether to use pd.to_datetime to convert "FileNameLastModified", "FileNameLastAccess",
#                           "FileNameCreated", "StandardInfoLastModified", "StandardInfoLastAccess", "StandardInfoCreated"
#                           (parsing takes about 2x longer, and the resulting DataFrame is about 30% bigger).
# Returns:
#     pd.DataFrame: A DataFrame containing the list of files retrieved from the drive.
# Raises:
#     None

# Important: you need admin rights!
from time import perf_counter

from mft2df import list_files_from_drive

start = perf_counter()
df = list_files_from_drive(drive="c")
print(f"Time needed: {perf_counter() - start} for {len(df)} files")
print(df[200060:200066].to_string())

# Time needed: 43.62916430000041 for 1842450 files

#        Signature  EntryId  Sequence  BaseEntryId  BaseEntrySequence  HardLinkCount      Flags  UsedEntrySize  TotalEntrySize  FileSize  IsADirectory  IsDeleted  HasAlternateDataStreams StandardInfoFlags     StandardInfoLastModified       StandardInfoLastAccess          StandardInfoCreated           FileNameFlags         FileNameLastModified           FileNameLastAccess              FileNameCreated                                                                                                                     FullPath
# 200060      FILE   202514         1            0                  0              2  ALLOCATED            672            1024       211         False      False                    False           (empty)  2020-03-04T10:38:59.012552Z  2020-03-04T10:38:59.012552Z  2020-03-04T10:39:00.779040Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.012552Z  2020-03-04T10:38:59.012552Z  2020-03-04T10:38:59.012552Z  c:\Windows\WinSxS\Manifests\amd64_bthmtpenum.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_710d1caf8aa9bb19.manifest
# 200061      FILE   202515         1            0                  0              2  ALLOCATED            664            1024       208         False      False                    False           (empty)  2020-03-04T10:38:59.022586Z  2020-03-04T10:38:59.022586Z  2020-03-04T10:39:00.779040Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.022586Z  2020-03-04T10:38:59.022586Z  2020-03-04T10:38:59.022586Z       c:\Windows\WinSxS\Manifests\amd64_c_wpd.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_a4c4bcf7ec41f07e.manifest
# 200062      FILE   202516         1            0                  0              2  ALLOCATED            672            1024       207         False      False                    False           (empty)  2020-03-04T10:38:59.032170Z  2020-03-04T10:38:59.032170Z  2020-03-04T10:39:00.779040Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.032170Z  2020-03-04T10:38:59.032170Z  2020-03-04T10:38:59.022586Z     c:\Windows\WinSxS\Manifests\amd64_wpdcomp.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_78d37c0df7225559.manifest
# 200063      FILE   202517         1            0                  0              2  ALLOCATED            664            1024       207         False      False                    False           (empty)  2020-03-04T10:38:59.032699Z  2020-03-04T10:38:59.032699Z  2020-03-04T10:39:00.794664Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.032699Z  2020-03-04T10:38:59.032699Z  2020-03-04T10:38:59.032699Z       c:\Windows\WinSxS\Manifests\amd64_wpdfs.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_a09f098927b0c6b9.manifest
# 200064      FILE   202518         1            0                  0              2  ALLOCATED            664            1024       208         False      False                    False           (empty)  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:39:00.794664Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.032699Z      c:\Windows\WinSxS\Manifests\amd64_wpdmtp.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_13d74fb245acf719.manifest
# 200065      FILE   202519         1            0                  0              2  ALLOCATED            672            1024       211         False      False                    False           (empty)  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:39:00.794664Z  FILE_ATTRIBUTE_ARCHIVE  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z  2020-03-04T10:38:59.042535Z    c:\Windows\WinSxS\Manifests\amd64_wpdmtphw.inf-languagepack_31bf3856ad364e35_10.0.18362.1_de-de_52e461d8f91111b2.manifest

```
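
Because the result is an ordinary DataFrame, the usual pandas tools apply to it. The following sketch shows one possible analysis, the total file size per extension. It runs on a tiny hand-made DataFrame that only mimics four of the columns shown above, so the paths and sizes are invented for illustration; on a real machine `df` would come from `list_files_from_drive`.

```python
import pandas as pd

# Hand-made stand-in for the DataFrame returned by list_files_from_drive
# (only four of its columns; all values are made up).
df = pd.DataFrame(
    {
        "FullPath": [
            r"c:\data\a.py",
            r"c:\data\b.py",
            r"c:\data\movie.mkv",
            r"c:\data",
            r"c:\tmp\old.log",
        ],
        "FileSize": [120, 80, 5000, 0, 40],
        "IsADirectory": [False, False, False, True, False],
        "IsDeleted": [False, False, False, False, True],
    }
)

# Keep regular files that still exist.
files = df.loc[~df.IsDeleted & ~df.IsADirectory].copy()

# Take the text after the last dot of the file name as the extension.
files["Extension"] = files.FullPath.str.extract(
    r"(\.[^.\\]+)$", expand=False
).str.lower()

# Sum the sizes per extension, largest first.
size_by_extension = (
    files.groupby("Extension")["FileSize"].sum().sort_values(ascending=False)
)
print(size_by_extension)
```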

## Examples

### Finds all Python files on your HDD that contain the string "ctypes" in less than 2 minutes

```python
import pandas as pd
from PrettyColorPrinter import add_printer # pip install PrettyColorPrinter
add_printer(1)
from mft2df import list_files_from_drive
from time import perf_counter

start = perf_counter()
df = list_files_from_drive(drive="c", convert_dates=False)
print(f"Time needed: {perf_counter() - start} " f"for {len(df)} files")


def get_content(file):
    try:
        with open(file, mode="r", encoding="utf-8") as f:
            data = f.read()

    except Exception:
        data = pd.NA
    return data


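# Keep only existing (non-deleted) .py files, not directories.
dffi = df.loc[
    (df.FullPath.str.endswith(".py")) & (~df.IsDeleted) & (~df.IsADirectory)
].copy()
# Read each file; unreadable files yield pd.NA and are dropped afterwards.
dffi["FileContent"] = dffi.FullPath.apply(get_content)
dffi = dffi.loc[dffi["FileContent"].notna()]
# Plain substring match (regex=False) for "ctypes".
ctypesfiles = dffi.loc[dffi.FileContent.str.contains("ctypes", regex=False)]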
```
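
The example above keeps the full text of every file in the DataFrame. If only the matching paths are needed, a variation that stores just a boolean per file may use less memory. The sketch below is an assumption-labeled alternative, not part of mft2df: `file_contains` is a hypothetical helper, and the temporary files stand in for the paths a real `FullPath` column would hold.

```python
import tempfile
from pathlib import Path

import pandas as pd


def file_contains(path, needle, encoding="utf-8"):
    """Hypothetical helper: True if any line of the file contains `needle`."""
    try:
        with open(path, mode="r", encoding=encoding) as f:
            # Scan line by line, so the whole file is never held in memory.
            return any(needle in line for line in f)
    except (OSError, UnicodeDecodeError):
        # Missing, locked, or non-UTF-8 files count as "no match".
        return False


with tempfile.TemporaryDirectory() as tmp:
    uses_ctypes = Path(tmp) / "uses_ctypes.py"
    plain = Path(tmp) / "plain.py"
    uses_ctypes.write_text("import ctypes\n", encoding="utf-8")
    plain.write_text("print('hello')\n", encoding="utf-8")

    # Stand-in for the filtered DataFrame of .py files.
    candidates = pd.DataFrame(
        {"FullPath": [str(uses_ctypes), str(plain), str(Path(tmp) / "missing.py")]}
    )
    mask = candidates.FullPath.apply(lambda p: file_contains(p, "ctypes"))
    matches = candidates.loc[mask]
    match_names = [Path(p).name for p in matches.FullPath]

print(match_names)
```

The trade-off is that the file contents are not available afterwards, so any further search means reading the files again.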

            
