demystify-digipres


* Name: demystify-digipres
* Version: 2.0.0
* Summary: engine for the analysis of DROID and Siegfried file format reports
* Upload time: 2024-05-05 15:44:01
* Author: Ross Spencer
* Requires Python: >=3.8
# Demystify

Static analysis and reporting for file-format reports generated by digital
preservation tools, DROID and Siegfried.

Working example __Siegfried__: [Siegfried Govdocs Select Results...][SF-1]
<br/>
Working example __DROID__: [DROID Govdocs Select Results...][DROID-1]

## Introduction

Utility for the analysis of [DROID CSV][DROID-CSV] and [Siegfried][SF-2]
file-format reports. The tool has three purposes:

1. break the export into its components and store them within a set of tables
in a SQLite database for performance and consistent access;
2. provide additional information about a collection's profile where useful;
3. and query the SQLite database, outputting results in a visually pleasant
report for further analysis by digital preservation specialists and archivists.

For departments on-boarding archivists or building digital capability, the
report contains descriptions, written by archivists, for each of the
statistics output.

![archivist descriptions in demystify](https://github.com/exponential-decay/demystify/blob/main/documentation/archivist-descriptions.png?raw=true)

### Analysis of file format reports

This Code4Lib article, published early in 2022, describes some of the important
information that appears, in aggregate, in file-format reports. It also
describes the challenges of accessing that information consistently.

* [Fractal in detail: What information is in a file format identification
report?][CODE4LIB-1]

### 2020/2021 refactor

This utility was first written in 2013. The code was pretty bad, but worked.
It wrapped a lot of technical debt into a small package.

The 2020/2021 refactor tries to do three things:

1. Fix minor issues.
2. Make the code compatible with Python 3 and, temporarily and for one last
time, with Python 2.
3. Add unit tests.

Adding unit tests is the key to your contributions and to greater flexibility
with refactoring. Once a release candidate of this work is available, there is
more freedom to think about next steps, including exposing queries more
generically so that more folk can work with sqlitefid, and finding more generic
API-like abstractions in general so the utility is less like a monolith and
more like a configurable static analysis engine, analogous to something you
might work with in Python or Golang.

## More information

See the following blogs for more information:

* [2014-06-03] [On the creation of this tool][OPF-1]
* [2015-08-25] [Creating a digital preservation rogues gallery][OPF-2]
* [2016-05-23] [Consistent and repeatable digital preservation reporting][OPF-3]
* [2016-05-24] [A multi-lingual lingua-franca and exploring ID methods][OPF-4]

COPTR Link: [DROID_Siegfried_Sqlite_Analysis_Engine][COPTR-1]

## Components

There are three components to the tool.

### sqlitefid

Adds identification data to an SQLite database that forms the basis of the
entire analysis. There are five tables.

* DBMD - Database Metadata
* FILEDATA - File manifest and filesystem metadata
* IDDATA - Identification metadata
* IDRESULTS - FILEDATA/IDDATA junction table
* NSDATA - Namespace metadata, also secondary key (NS_ID) in IDDATA table

sqlitefid also augments DROID or Siegfried export data with additional columns:

* URI_SCHEME: Separates the URI_SCHEME from the DROID URI column. This is to
enable the identification of container objects found in the export
specifically, and the distinction of files stored in container objects from
standalone files.
* DIR_NAME: Returns the base directory name from the file path to enable
analysis of directory names, e.g. the number of directories in the collection.
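
As a rough illustration of the two derived columns, the sketch below splits a
scheme from a URI and takes the directory portion of a path using only the
standard library. The function names and sample rows are hypothetical and are
not taken from sqlitefid; how sqlitefid actually computes these columns may
differ.

```python
import posixpath
from urllib.parse import urlparse


def uri_scheme(uri: str) -> str:
    """Return the leading scheme of a URI, e.g. 'file' or 'zip'."""
    return urlparse(uri).scheme


def dir_name(file_path: str) -> str:
    """Return the directory portion of a forward-slash separated path."""
    return posixpath.dirname(file_path)


# Hypothetical (URI, FILE_PATH) pairs in the style of a DROID export.
rows = [
    ("file:/govdocs/docs/report.doc", "/govdocs/docs/report.doc"),
    (
        "zip:file:/govdocs/archive.zip!/inner/data.csv",
        "/govdocs/archive.zip!/inner/data.csv",
    ),
]

# Each augmented row holds (URI_SCHEME, DIR_NAME).
augmented = [(uri_scheme(uri), dir_name(path)) for uri, path in rows]

# A scheme other than 'file' hints that the object lives inside a container.
in_container = [scheme != "file" for scheme, _ in augmented]

# Counting distinct directory names gives the number of directories.
directory_count = len({directory for _, directory in augmented})
```

With a scheme column in place, files inside containers can be separated from
standalone files with a simple comparison rather than repeated string parsing.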

### demystify

Outputs an analysis from extensive querying of the SQLite database created by
sqlitefid.
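
To give a feel for the kind of query involved, here is a minimal sketch using
Python's built-in `sqlite3` module and an in-memory database. The table names
come from the list above, but the column names (`FILE_ID`, `ID_ID`, `NAME`,
`FORMAT_NAME`) and the sample rows are assumptions made for illustration, as is
the idea that IDRESULTS links rows in FILEDATA to rows in IDDATA. The real
schema is defined by sqlitefid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, heavily simplified schema and data.
conn.executescript(
    """
    CREATE TABLE FILEDATA (FILE_ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE IDDATA (ID_ID INTEGER PRIMARY KEY, FORMAT_NAME TEXT);
    CREATE TABLE IDRESULTS (FILE_ID INTEGER, ID_ID INTEGER);
    INSERT INTO FILEDATA VALUES (1, 'a.pdf'), (2, 'b.pdf'), (3, 'c.txt');
    INSERT INTO IDDATA VALUES (1, 'PDF 1.4'), (2, 'Plain Text');
    INSERT INTO IDRESULTS VALUES (1, 1), (2, 1), (3, 2);
    """
)


def format_frequency(connection):
    """Count files per format by joining through the junction table."""
    query = """
        SELECT IDDATA.FORMAT_NAME, COUNT(FILEDATA.FILE_ID) AS TOTAL
        FROM IDRESULTS
        JOIN FILEDATA ON FILEDATA.FILE_ID = IDRESULTS.FILE_ID
        JOIN IDDATA ON IDDATA.ID_ID = IDRESULTS.ID_ID
        GROUP BY IDDATA.FORMAT_NAME
        ORDER BY TOTAL DESC, IDDATA.FORMAT_NAME ASC
    """
    return connection.execute(query).fetchall()


frequencies = format_frequency(conn)
```

The function returns a plain list of `(format name, count)` tuples, which a
report writer can then format however it likes.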

HTML is the default report output, with plain-text and file-listing outputs
also available.

It is a good idea to run the analysis and redirect (`>`) the result to a file,
e.g. `python demystify.py --export my_export.csv > my_analysis.htm`.

### Rogues Gallery (v0.2.0, v0.5.0+)

The following flags provide Rogue or Hero output:

* `--rogues`

Outputs a list of files returned by the identification tool that might require
more analysis e.g. non-IDs, multiple IDs, extension mismatches, zero-byte
objects and duplicate files.

* `--heroes`

Outputs a list of files considered to need less analysis.

The options can be configured by editing `denylist.cfg`. More information
can be found [here][OPF-2].
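
The rogue categories listed above can be pictured as a set of simple checks
over file records. The sketch below is not demystify's implementation: the
`FileRecord` class, its fields, and the placeholder identifier strings are all
hypothetical, and treating heroes as "everything that is not a rogue" is an
assumption made to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileRecord:
    """Hypothetical record for one file in an identification report."""

    path: str
    size: int
    checksum: str
    format_ids: List[str] = field(default_factory=list)
    extension_mismatch: bool = False


def find_rogues(records: List[FileRecord]) -> Dict[str, List[str]]:
    """Map each rogue file path to the reasons it needs more analysis."""
    checksum_counts = Counter(record.checksum for record in records)
    rogues = {}
    for record in records:
        reasons = []
        if not record.format_ids:
            reasons.append("no identification")
        if len(record.format_ids) > 1:
            reasons.append("multiple identifications")
        if record.extension_mismatch:
            reasons.append("extension mismatch")
        if record.size == 0:
            reasons.append("zero-byte object")
        if checksum_counts[record.checksum] > 1:
            reasons.append("duplicate file")
        if reasons:
            rogues[record.path] = reasons
    return rogues


# Placeholder data; identifiers and checksums are illustrative only.
records = [
    FileRecord("a.pdf", 10, "aa", ["id-1"]),
    FileRecord("b.bin", 0, "00"),
    FileRecord("c.doc", 5, "cc", ["id-2", "id-3"]),
    FileRecord("d.txt", 7, "dd", ["id-4"]),
    FileRecord("e.txt", 7, "dd", ["id-4"]),
]

rogues = find_rogues(records)
heroes = [record.path for record in records if record.path not in rogues]
```

Keeping the reasons alongside each path makes it easy to switch individual
checks on or off, which is roughly the role a configuration file plays.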

![Rogues Gallery Animation](https://github.com/exponential-decay/demystify/blob/main/documentation/rogues-gallery.gif?raw=true)

### pathlesstaken

A string analysis engine created to highlight when string values, e.g. file
paths, might need more care in a digital preservation environment, e.g. so that
we don't lose diacritics during transfer. It provides a checklist of items to
look at.

Includes:

* Class to handle analysis of non-recommended filenames from [Microsoft][MS-1].
* Copy of a library from Cooper Hewitt to enable writing of plain text
descriptions of [Unicode characters][UNICODE-1].
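
The following sketch shows the general idea for a single file name (not a full
path), using only the standard library. It is not pathlesstaken's code: the
function name and messages are hypothetical, the reserved-name set is a subset
of Microsoft's guidance, and `unicodedata.name` stands in for the Cooper Hewitt
library mentioned above.

```python
import unicodedata

# Characters that Microsoft's naming guidance reserves in file names.
RESERVED_CHARS = set('<>:"/\\|?*')

# A subset of the reserved device names from the same guidance.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{number}" for number in range(1, 10)}
    | {f"LPT{number}" for number in range(1, 10)}
)


def check_name(name: str) -> list:
    """Return a checklist of findings for one file name."""
    findings = []
    for char in name:
        if ord(char) > 127:
            # Describe the character in plain text, e.g. for diacritics.
            description = unicodedata.name(char, "UNKNOWN CHARACTER")
            findings.append(f"non-ASCII character: {description}")
        elif char in RESERVED_CHARS:
            findings.append(f"reserved character: {char!r}")
    # Reserved names apply regardless of case or extension.
    stem = name.split(".")[0].upper()
    if stem in RESERVED_NAMES:
        findings.append(f"reserved name: {stem}")
    if name.endswith((" ", ".")):
        findings.append("ends with a space or period")
    return findings


diacritic_findings = check_name("caf\u00e9.txt")
device_findings = check_name("aux.txt")
```

An empty list means nothing was flagged; anything else is a prompt for a person
to look more closely before transfer.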

## Architecture

The tool is designed to be easily modified to create your own output by using
the Analysis Results class as a further abstraction layer (API).

![Analysis Engine Architecture](https://github.com/exponential-decay/demystify/blob/main/documentation/analysis-engine-architecture.png?raw=true)

The recent refactor resulted in more generic Python data structures being
returned from queries and less (if not zero) formatted output. This means a
little more work has to be put into the presentation of results, but it is more
flexible for whatever you want to do.
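
A small sketch of what this separation can look like: one unformatted result
feeds two different presentations. The `format_counts` data and both function
names are hypothetical and do not reflect the actual Analysis Results class.

```python
from html import escape

# Hypothetical query result: plain (format name, file count) tuples.
format_counts = [("Acrobat PDF 1.4", 120), ("Plain Text File", 45)]


def as_text(rows):
    """Render the rows as one 'name: count' line each."""
    return "\n".join(f"{name}: {count}" for name, count in rows)


def as_html(rows):
    """Render the rows as a minimal HTML table, escaping the names."""
    cells = "".join(
        f"<tr><td>{escape(name)}</td><td>{count}</td></tr>"
        for name, count in rows
    )
    return f"<table>{cells}</table>"


text_report = as_text(format_counts)
html_report = as_html(format_counts)
```

Because the query result carries no formatting, adding another output (CSV,
JSON, and so on) only means writing another small rendering function.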

Tests are being implemented to promote the reliability of data returned.

## Design Decisions

There should be no dependencies associated with this tool. That being said,
you may need `lxml` for HTML output. An alternative may be found as the tool is
refactored.

If we can maintain a state of few repositories then it should promote use across
a wide number of institutions. This has been driven by my previous two working
environments, where installing Python was the first challenge, and pip and the
ability to get hold of code dependencies was another, especially on multiple
users' machines where we want this tool to be successful.

## Usage Notes

Summary/Aggregate Binary / Text / Filename identification statistics are output
with the following priority:

Namespace (e.g. ordered by PRONOM first [configurable])

1. Binary and Container Identifiers
2. XML Identifiers
3. Text Identifiers
4. Filename Identifiers
5. Extension Identifiers

We need to monitor how well this works. Namespace-specific statistics are also
output further down the report.
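
One possible reading of this priority is a two-level sort: namespace order
first, then identification method. The sketch below follows that reading, which
is an assumption, and the namespace list, method labels, and candidate tuples
are hypothetical values chosen for illustration.

```python
# Configurable namespace order; PRONOM first, as in the example above.
NAMESPACE_ORDER = ["pronom", "tika", "freedesktop.org"]

# Binary and container identifiers share the top rank.
METHOD_RANK = {
    "binary": 1,
    "container": 1,
    "xml": 2,
    "text": 3,
    "filename": 4,
    "extension": 5,
}


def sort_key(identification):
    """Rank an identification by namespace, then by method."""
    namespace, method, _ = identification
    if namespace in NAMESPACE_ORDER:
        namespace_rank = NAMESPACE_ORDER.index(namespace)
    else:
        # Unknown namespaces sort after all configured ones.
        namespace_rank = len(NAMESPACE_ORDER)
    return (namespace_rank, METHOD_RANK.get(method, 99))


# Hypothetical identifications: (namespace, method, identifier).
candidates = [
    ("tika", "binary", "application/pdf"),
    ("pronom", "extension", "fmt/18"),
    ("pronom", "binary", "fmt/18"),
]

ordered = sorted(candidates, key=sort_key)
preferred = ordered[0]
```

Changing `NAMESPACE_ORDER` is enough to prefer a different namespace without
touching the method ranking.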

## TODO, and how you can get involved

* Internationalizing archivist descriptions [here][TRANSL-1].
* Improved container listing/handling.
* Improved 'directory' listing and handling.
* Output formatting unit tests!

As you use the tool or find problems, please report them. If you find you are
missing summaries that might be useful to you, please let me know. The more the
utility is used, the more we can all benefit.

I have started a discussion topic for improvements: [here][discuss-1].

## Installation

Installation should be easy. Until the utility is packaged, you need to do the
following:

1. Find a directory you want to install demystify to.
1. Run `git clone https://github.com/exponential-decay/demystify`.
1. Navigate into the demystify repository, `cd demystify`.
1. Checkout the sub-modules (pathlesstaken, and sqlitefid):
`git submodule update --init --recursive`.
1. Install `lxml`: `python -m pip install -r requirements/production.txt`.
1. Run tests to make sure everything works: `tox -e py39`.

**NB.** `tox` is cool. If you're working on this code and want to format it
idiomatically, run `tox -e linting`. If there are errors, they will point to
where you may need to improve your code.

### Virtual environment

A virtual environment is recommended in some instances, e.g. you don't want
to pollute your Python environment with other developers' code. To do this,
for Linux you can do the following:

 1. Create a virtual environment: `python3 -m virtualenv venv-py3`.
 2. Activate the virtual environment: `source venv-py3/bin/activate`.

Then follow the installation instructions above.

## Releases

See the [Releases][REL-1] section on GitHub.

## License

Copyright (c) 2013 Ross Spencer

This software is provided 'as-is', without any express or implied warranty. In
no event will the authors be held liable for any damages arising from the use
of this software.

Permission is granted to anyone to use this software for any purpose, including
commercial applications, and to alter it and redistribute it freely, subject to
the following restrictions:

The origin of this software must not be misrepresented; you must not claim that
you wrote the original software. If you use this software in a product, an
acknowledgment in the product documentation would be appreciated but is not
required.

Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.

This notice may not be removed or altered from any source distribution.

[SF-1]: https://htmlpreview.github.io/?https://github.com/exponential-decay/droid-siegfried-sqlite-analysis-engine/blob/master/govdocs-selected-corpus-output/govdocs-select-sqlite-sf.htm
[DROID-1]: https://htmlpreview.github.io/?https://github.com/exponential-decay/droid-siegfried-sqlite-analysis-engine/blob/master/govdocs-selected-corpus-output/govdocs-select-sqlite-droid.htm
[DROID-CSV]: https://github.com/digital-preservation/droid
[SF-2]: https://github.com/richardlehane/siegfried
[OPF-1]: https://openpreservation.org/blog/2014/06/03/analysis-engine-droid-csv-export/
[OPF-2]: http://openpreservation.org/blog/2015/08/25/hero-or-villain-a-tool-to-create-a-digital-preservation-rogues-gallery/
[OPF-3]: http://openpreservation.org/blog/2016/05/23/whats-in-a-namespace-the-marriage-of-droid-and-siegfried-analysis/
[OPF-4]: http://openpreservation.org/blog/2016/05/24/while-were-on-the-subject-a-few-more-points-of-interest-about-the-siegfrieddroid-analysis-tool/
[REL-1]: https://github.com/exponential-decay/droid-sqlite-analysis/releases
[MS-1]: http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
[UNICODE-1]: https://github.com/cooperhewitt/py-cooperhewitt-unicode
[TRANSL-1]: https://docs.google.com/spreadsheets/d/1dVsRsXgD9V2GarNHHpf6Tzhrfx99_MXt3LjSSDrNLOY/edit?usp=sharing
[DISCUSS-1]: https://github.com/exponential-decay/demystify/discussions/68
[CODE4LIB-1]: https://journal.code4lib.org/articles/16351
[COPTR-1]: https://coptr.digipres.org/index.php/DROID_Siegfried_Sqlite_Analysis_Engine

            
