metacrafter


* Name: metacrafter
* Version: 0.0.4
* Home page: https://github.com/apicrafter/metacrafter/
* Summary: Metacrafter metadata classification tool
* Upload time: 2024-06-14 08:41:28
* Author: Ivan Begtin
* License: Apache
* Keywords: json, jsonl, csv, bson, cli, dataset, metadata

# Metacrafter



Metacrafter is a Python command-line tool and engine for labeling table fields and fields in data files.

It helps you find meaningful data in your tables and data files, including personally identifiable information (PII).





## Installation



Install the Python library with `pip install metacrafter`, or run `python setup.py install` from a source checkout.



## Features



Metacrafter is a rule-based tool that helps label the fields of database tables. It scans a table and finds person names, surnames, middle names, PII data, and basic identifiers such as UUID/GUID.

The rules are written as .yaml files and can be easily extended.



File formats supported:

* CSV

* JSON lines

* JSON (array of records)

* BSON

* Parquet

* XML



Databases supported:

* Any SQL database supported by [SQLAlchemy](https://www.sqlalchemy.org/) 

* NoSQL databases: 

  * MongoDB



Metacrafter key features:

* 111 labeling rules

* all label metadata is collected in the public [Metacrafter registry](https://github.com/apicrafter/metacrafter-registry) repository

* 312 date detection rules/patterns; dates are detected with [qddate](https://github.com/ivbeg/qddate), a "quick and dirty" date detection library

* extensible rule set based on PyParsing, exact text matching and validation functions

* support for any database supported by SQLAlchemy

* advanced context and language management: you can apply only the rules relevant to a particular kind of data or a chosen language

* built-in API server

* commercial support and additional rules available





## Command line examples



### File analysis examples



    # Scan CSV file

    $ metacrafter scan-file --format short somefile.csv



    # Scan CSV file with delimiter ';' and windows-1251 encoding

    $ metacrafter scan-file --format short --encoding windows-1251 --delimiter ';' somefile.csv



    # Scan a JSON lines file and write the results as a stats table to a file

    $ metacrafter scan-file --format stats -o somefile_result.json somefile.jsonl
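
When scans are scripted from Python, it can help to build the command line in one place. The following is a minimal sketch, not part of Metacrafter itself: `build_scan_file_command` is a hypothetical helper, and it uses only the flags that appear in the examples above (`--format`, `--encoding`, `--delimiter`, `-o`).

```python
from typing import List, Optional


def build_scan_file_command(
    path: str,
    fmt: str = "short",
    encoding: Optional[str] = None,
    delimiter: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `metacrafter scan-file` (hypothetical helper)."""
    cmd = ["metacrafter", "scan-file", "--format", fmt]
    if encoding:
        cmd += ["--encoding", encoding]
    if delimiter:
        cmd += ["--delimiter", delimiter]
    if output:
        cmd += ["-o", output]
    cmd.append(path)  # the file to scan comes last, as in the examples
    return cmd


cmd = build_scan_file_command(
    "somefile.csv", encoding="windows-1251", delimiter=";"
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, assuming the `metacrafter` executable is on the `PATH`. Passing a list rather than a shell string avoids quoting problems with delimiters such as `;`.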





Example result with the 'full' output format:

```
key               ftype    tags    matches                                                                datatype_url
----------------  -------  ------  ---------------------------------------------------------------------  ----------------------------------------------------------
Domain            str              fqdn 99.90                                                             https://registry.apicrafter.io/datatype/fqdn
Primary domain    str              fqdn 100.00                                                            https://registry.apicrafter.io/datatype/fqdn
Name              str              name 100.00                                                            https://registry.apicrafter.io/datatype/name
Domain type       str      dict
Organization      str
Status            str      dict
Region            str      dict    rusregion 22.95                                                        https://registry.apicrafter.io/datatype/rusregion
GovSystem         str      dict
HTTP Support      str      dict    boolean 100.00                                                         https://registry.apicrafter.io/datatype/boolean
HTTPS Support     str      dict    boolean 100.00                                                         https://registry.apicrafter.io/datatype/boolean
Statuscode        str      dict
Is archived       str      empty
Archives          str      empty
Archive priority  str      dict
Archive Strategy  str      dict
ASN               str              asn 93.77                                                              https://registry.apicrafter.io/datatype/asn
ASN Country code  str      dict    countrycode_alpha2 100.00,countrycode_alpha2 100.00,languagetag 99.56  https://registry.apicrafter.io/datatype/countrycode_alpha2
IPs               str              ipv4 96.28                                                             https://registry.apicrafter.io/datatype/ipv4
GovType           str      dict
```
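
The `matches` column packs one or more `label score` pairs into a single comma-separated cell. The sketch below shows one way to split such a cell when post-processing results in Python. The helper names are hypothetical, and treating the number as a percentage-like score is an assumption based only on the values shown in the table.

```python
from typing import List, Tuple


def parse_matches(cell: str) -> List[Tuple[str, float]]:
    """Split a 'matches' cell such as 'fqdn 99.90' into (label, score) pairs."""
    pairs = []
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue  # empty cell: the field had no matches
        label, score = item.rsplit(" ", 1)
        pairs.append((label, float(score)))
    return pairs


def strong_labels(cell: str, threshold: float = 90.0) -> List[str]:
    """Return labels whose score is at least `threshold` (assumed 0-100 scale)."""
    return [label for label, score in parse_matches(cell) if score >= threshold]


print(parse_matches("countrycode_alpha2 100.00,languagetag 99.56"))
print(strong_labels("rusregion 22.95"))
```

A threshold like this is useful for rows such as `Region`, where the only match has a low score and probably should not be treated as a reliable label.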





### Database analysis examples



    # Scan MongoDB database 'fns', save results as result.json and format output as 'full'

    $ metacrafter scan-mongodb --dbname fns -o result.json -f full



    # Scan Postgres database 'dbname', with schema 'public'.

    $ metacrafter scan-db --schema public --connstr postgresql+psycopg2://username:password@127.0.0.1:15432/dbname
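
The `--connstr` value follows the SQLAlchemy database URL layout, `dialect+driver://user:password@host:port/database`. As a quick sanity check before running a scan, the parts of such a string can be inspected with the standard library. This is a minimal sketch; `describe_connstr` is a hypothetical helper and does not use SQLAlchemy or Metacrafter.

```python
from urllib.parse import urlsplit


def describe_connstr(connstr: str) -> dict:
    """Break a SQLAlchemy-style connection string into its parts."""
    parts = urlsplit(connstr)
    # The scheme holds "dialect+driver"; the driver part is optional.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "username": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_connstr(
    "postgresql+psycopg2://username:password@127.0.0.1:15432/dbname"
)
print(info)
```

The password is deliberately left out of the returned dictionary so that it does not end up in logs.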







# Rules



All rules are described in YAML files. By default, rules are loaded from the 'rules' directory or from a list of directories given in a `.metacrafter` file, which is also in YAML format.
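
To check which rule files a set of directories would contribute, a small script can list them. This is only an illustrative sketch: `find_rule_files` is a hypothetical helper, not Metacrafter's loader, and searching subdirectories recursively for `*.yaml` is an assumption.

```python
from pathlib import Path
from typing import Iterable, List


def find_rule_files(directories: Iterable[str]) -> List[Path]:
    """Collect *.yaml files from each directory, in the order given."""
    found: List[Path] = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip missing directories instead of failing
        found.extend(sorted(root.rglob("*.yaml")))
    return found


for rule_file in find_rule_files(["rules"]):
    print(rule_file)
```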



Each rule applies either to **fields** (field names) or to **data** (field values).



Match engines are set with the **match** parameter in the rule description:

* text - scans text for an exact match against one of the listed text values. Values are delimited by a comma (',')

* ppr - scans text with a PyParsing expression. The rule is written as Python code using PyParsing objects such as Word(nums, exact=4)

* func - scans text using the given Python function. The function should accept one string parameter and return True or False



## How to write rules



### Function (func)



Example: a Russian administrative legal act/law matched by a custom function

```
  runpabyfunc:
    key: runpa
    name: Russian legal act / law
    maxlen: 500
    minlen: 3
    priority: 1
    match: func
    type: data
    rule: metacrafter.rules.ru.gov.is_ru_law
```
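
A `func` rule points at a Python function that takes one string and returns True or False. The sketch below shows the shape such a function could take. `is_uuid` is a hypothetical example written for this illustration; it is not the `is_ru_law` function referenced above, and it is not claimed to be part of Metacrafter.

```python
import uuid


def is_uuid(value: str) -> bool:
    """Return True if `value` parses as a UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except ValueError:
        # uuid.UUID raises ValueError for badly formed strings
        return False
    return True


print(is_uuid("123e4567-e89b-12d3-a456-426614174000"))
print(is_uuid("not a uuid"))
```

A rule would then reference the function by its dotted import path in the `rule` parameter, in the same way the example above references `metacrafter.rules.ru.gov.is_ru_law`.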



### Exact text match (text)



Example: middle name matched by exact field name

```
  midname:
    key: person_midname
    name: Person midname by known
    rule: midname,secondname,middlename,mid_name,middle_name
    type: field
    match: text
```
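
The idea behind a `text` rule can be sketched in a few lines of Python: split the comma-delimited `rule` value into a set and test membership. This illustrates the technique only and is not Metacrafter's implementation. `make_text_matcher` is a hypothetical name, and ignoring case and surrounding whitespace is an assumption.

```python
from typing import Callable


def make_text_matcher(rule: str) -> Callable[[str], bool]:
    """Build an exact-match predicate from a comma-delimited rule string."""
    values = {v.strip().lower() for v in rule.split(",") if v.strip()}

    def matches(name: str) -> bool:
        # Assumption: comparison ignores case and surrounding whitespace.
        return name.strip().lower() in values

    return matches


is_midname = make_text_matcher(
    "midname,secondname,middlename,mid_name,middle_name"
)
print(is_midname("middle_name"))
print(is_midname("surname"))
```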

### PyParsing rule (ppr)



Example: Russian cadastral number

```
  rukadastr:
    key: rukadastr
    name: Russian land territory cadastral identifier
    rule: Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=6, max=7) + Literal(':').suppress() + Word(nums, min=1, max=6)
    maxlen: 20
    minlen: 12
    priority: 1
    match: ppr
    type: data
```
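
To see what the pattern above accepts, here is an approximation that uses the standard `re` module instead of PyParsing, so it runs without extra dependencies. It is a swapped-in technique and not how Metacrafter evaluates `ppr` rules. The function name is hypothetical, and applying `minlen`/`maxlen` as a length check on the whole value is an assumption.

```python
import re

# Four digit groups separated by ':' with the same length bounds as the rule:
# 1-2 digits, 1-2 digits, 6-7 digits, 1-6 digits.
CADASTRAL_RE = re.compile(r"[0-9]{1,2}:[0-9]{1,2}:[0-9]{6,7}:[0-9]{1,6}")


def is_cadastral_number(value: str, minlen: int = 12, maxlen: int = 20) -> bool:
    """Check the length bounds first, then the full pattern."""
    if not minlen <= len(value) <= maxlen:
        return False
    return CADASTRAL_RE.fullmatch(value) is not None


print(is_cadastral_number("77:01:0001001:1234"))
print(is_cadastral_number("77:01:001:1"))
```

The length bounds agree with the pattern: the shortest value it can match has 12 characters and the longest has 20, which is what `minlen` and `maxlen` state in the rule.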



## Detailed stats





Rule types:

- field based rules 146

- data based rules 102



Context:

- common 47

- companies 15

- crypto 3

- datetime 29

- finances 5

- geo 58

- government 19

- identifiers 3

- industry 2

- internet 18

- medical 6

- objectids 3

- persons 19

- pii 16

- science 2

- software 1

- values 1

- vehicles 1



Language:

- common 100

- de 4

- en 24

- es 1

- fr 11

- ru 108



Date/time patterns (qddate): 312





## Commercial support



To request beta access to the commercial API, write to ibegtin@apicrafter.io or ivan@begtin.tech.

The commercial API supports 195 field and data rules and comes with dedicated support.


            
