pymongo-schema

Name	pymongo-schema JSON
Version	0.4.1 JSON
	download
home_page	https://github.com/pajachiet/pymongo-schema
Summary	A schema analyser for MongoDB written in Python
upload_time	2022-08-30 21:21:11
maintainer
docs_url	None
author	Pierre-Alain Jachiet
requires_python
license	GNU General Public License v3.0
keywords	mongo mongodb schema mongo-connector mongo-connector-postgresql
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI
coveralls test coverage

            # pymongo-schema
A schema analyser for MongoDB, written in Python. 

This tool allows you to **extract your application's schema, directly from your MongoDB data**. It comes with **powerful schema manipulation and export functionalities**.

It will be particularly useful when you inherit a data dump, and want to quickly learn how the data is structured. 

pymongo-schema allows to map your MongoDB data model to a relational (SQL) data model. This greatly helps to configure [mongo-connector-postgresql](https://github.com/Hopwork/mongo-connector-postgresql), a tool to synchronize data from MongoDB to a target PostgreSQL database.

It also helps you to **compare different versions of your data model**.

This tools is inspired by [variety](https://github.com/variety/variety), with the following enhancement

- extract the **hierarchical structure** of the schema 
- versatile output options : json, yaml, tsv, markdown or htlm
- **finer grained types**. ex: INTEGER, DOUBLE rather than NUMBER 
- **filtering** of the output schema, using a `namespace` as defined by [mongo-connector](https://github.com/mongodb-labs/mongo-connector/wiki/Configuration-Options#configure-namespaces)
- **mapping to a relational schema**
- **comparison** of successive schema

[![Build Status](https://travis-ci.org/pajachiet/pymongo-schema.svg?branch=master)](https://travis-ci.org/pajachiet/pymongo-schema)
[![Coverage Status](https://coveralls.io/repos/github/pajachiet/pymongo-schema/badge.svg?branch=master)](https://coveralls.io/github/pajachiet/pymongo-schema?branch=master)


# Install

You can install latest stable version PyPi :
```shell
pip install --upgrade pymongo-schema
```

Or directly from github : 
```shell
pip install --upgrade git+https://github.com/pajachiet/pymongo-schema
```
# Usage

## Command line

```shell
python -m pymongo_schema -h
usage: [-h] [--quiet] {extract,transform,tosql,compare} ...

commands:
  {extract,transform,tosql,compare}
    extract             Extract schema from a MongoDB instance
    transform           Transform a json schema to another format, potentially
                        filtering or changing columns outputs
    tosql               Create a mapping from mongo schema to relational
                        schema (json input and output)
    compare             Compare two schemas

optional arguments:
  -h, --help            show this help message and exit
  --quiet               Remove logging on standard output

Usage:
    python -m pymongo_schema extract -h
    usage:  [-h] [-f [FORMATS [FORMATS ...]]] [-o OUTPUT] [--port PORT] [--host HOST]
                 [-d [DATABASES [DATABASES ...]]] [-c [COLLECTIONS [COLLECTIONS ...]]]
                 [--columns COLUMNS [COLUMNS ...]] [--size SIZE] [--without-counts]
                 
    python -m pymongo_schema transform -h
    usage: [-h] [-f [FORMATS [FORMATS ...]]] [-o OUTPUT] [--category CATEGORY] [-n FILTER]
                [--columns COLUMNS [COLUMNS ...]] [--without-counts] [input]
                
    python -m pymongo_schema tosql -h
    usage: [-h] [-f [FORMATS [FORMATS ...]]] [--columns COLUMNS [COLUMNS ...]]
                [--without-counts] [-o OUTPUT] [input]

    python -m pymongo_schema compare -h
    usage: [-h] [-f [FORMATS [FORMATS ...]]] [-o OUTPUT] [input]
                [--columns COLUMNS [COLUMNS ...]] [--without-counts]
                [--detailed_diff] prev_schema [new_schema]

```

To display full usage, with options description, run:
```shell 
pymongo-schema <command> -h
```

## Python package

pymongo_schema modules can also be imported to be used directly inside python code :

```python
from pymongo_schema.compare import compare_schemas_bases
from pymongo_schema.export import transform_data_to_file
from pymongo_schema.extract import extract_pymongo_client_schema
from pymongo_schema.filter import filter_mongo_schema_namespaces
from pymongo_schema.tosql import mongo_schema_to_mapping
```

Fore more details, refer to modules and functions docstrings.

# Examples

First, lets populate a collection in test database from mongo shell


    db.users.insert({name: "Tom", bio: "A nice guy.", pets: ["monkey", "fish"], someWeirdLegacyKey: "I like Ike!"})
    db.users.insert({name: "Dick", bio: "I swordfight.", birthday: new Date("1974/03/14")})
    db.users.insert({name: "Harry", pets: "egret", birthday: new Date("1984/03/14"), location:{country:"France", city: "Lyon"}})
    db.users.insert({name: "Geneviève", bio: "Ça va?", location:{country:"France", city: "Nantes"}})
    db.users.insert({name: "MadJacques", location:{country:"France", city: "Paris"}})

## Bash api examples
### Easy examples

Extract the schema from this database, with a json format on standard output

    $ python -m pymongo_schema extract --database test
    === Start MongoDB schema analysis
    Extract schema of database test
    ...collection users
       scanned 5 documents out of 5 (100.00 %)
    --- MongoDB schema analysis took 0.00 s
    === Write output

    {"test": {
        "users": {
            "object": {"_id": {"prop_in_object": 1.0, "count": 5, "type": "oid", "types_count": {"oid": 5}},
                       "pets": {"array_types_count": {"string": 2}, "prop_in_object": 0.4, "count": 2, "array_type": "string", "type": "ARRAY", "types_count": {"string": 1, "ARRAY": 1}},
                       "birthday": {"prop_in_object": 0.4, "count": 2, "type": "date", "types_count": {"date": 2}},
                       "name": {"prop_in_object": 1.0, "count": 5, "type": "string", "types_count": {"string": 5}},
                       "bio": {"prop_in_object": 0.6, "count": 3, "type": "string", "types_count": {"string": 3}},
                       "someWeirdLegacyKey": {"prop_in_object": 0.2, "count": 1, "type": "string", "types_count": {"string": 1}},
                       "location": {"object": {"country": {"prop_in_object": 1.0, "count": 3, "type": "string", "types_count": {"string": 3}},
                                               "city": {"prop_in_object": 1.0, "count": 3, "type": "string", "types_count": {"string": 3}}},
                                    "types_count": {"OBJECT": 3}, "prop_in_object": 0.6, "type": "OBJECT", "count": 3}},
            "count": 5}}}

Extract the same schema in md format.

    $ python -m pymongo_schema extract --database test --format md
    === Start MongoDB schema analysis
    Extract schema of database test
    ...collection users
       scanned 5 documents out of 5 (100.00 %)
    --- MongoDB schema analysis took 0.00 s
    === Write output

    ### Database: test
    #### Collection: users 
    |Field_compact_name     |Field_name             |Count     |Percentage     |Types_count                           |
    |-----------------------|-----------------------|----------|---------------|--------------------------------------|
    |_id                    |_id                    |5         |100.0          |oid : 5                               |
    |name                   |name                   |5         |100.0          |string : 5                            |
    |bio                    |bio                    |3         |60.0           |string : 3                            |
    |location               |location               |3         |60.0           |OBJECT : 3                            |
    | . city                |city                   |3         |100.0          |string : 3                            |
    | . country             |country                |3         |100.0          |string : 3                            |
    |birthday               |birthday               |2         |40.0           |date : 2                              |
    |pets                   |pets                   |2         |40.0           |ARRAY(string : 2) : 1, string : 1     |
    |someWeirdLegacyKey     |someWeirdLegacyKey     |1         |20.0           |string : 1                            |

Map this schema to a relational mapping

    $ python -m pymongo_schema extract --database test | python -m pymongo_schema tosql
    === Start MongoDB schema analysis
    Extract schema of database test
    ...collection users
       scanned 5 documents out of 5 (100.00 %)
    --- MongoDB schema analysis took 0.00 s
    === Write output
    === Generate mapping from mongo to sql
    === Write output

    {"test":
     {"users":
          {"_id": {"type": "TEXT", "dest": "_id"},
           "pets": {"valueField": "pets", "fk": "id_users", "type": "_ARRAY_OF_SCALARS", "dest": "users__pets"},
           "location.city": {"type": "TEXT", "dest": "location__city"},
           "name": {"type": "TEXT", "dest": "name"},
           "someWeirdLegacyKey": {"type": "TEXT", "dest": "someWeirdLegacyKey"},
           "pk": "_id",
           "bio": {"type": "TEXT", "dest": "bio"},
           "birthday": {"type": "TIMESTAMP", "dest": "birthday"},
           "location.country": {"type": "TEXT", "dest": "location__country"}},
      "users__pets": {"id_users": {"type": "TEXT"},
                      "pets": {"type": "TEXT", "dest": "pets"},
                      "pk": "_id_postgres"}}}

### Other examples

**extract:** Extract the schema for collections `test_collection_1` and `test_collection_2` from `test_db` and write it into `mongo_schema.html` and `mongo_schema.json` files
```shell
    python -m pymongo_schema extract --databases test_db --collections test_collection_1 test_collection_2 --output mongo_schema --format html json
```
**extract:** Extract the schema for collection `test_collection_1` with only 1000 random rows scanned and write it into `mongo_schema.html` files
```shell
    python -m pymongo_schema extract --collections test_collection_1 --size 1000 --output mongo_schema --format html
```
**transform:** Filter extracted schema (`mongo_schema.json`) using `namespace.json` file and write output into `mongo_schema_filtered.html`, `mongo_schema_filtered.csv` and `mongo_schema_filtered.json` files
```shell
    python -m pymongo_schema transform mongo_schema.json --filter namespace.json --output mongo_schema_filtered --format html csv json
```
**tosql:** Create mapping file based on `mongo_schema_filtered.json`
```shell
    python -m pymongo_schema tosql mongo_schema_filtered.json --output mapping.json
```

## Python api examples

Extract the schemas of all collections and all databases in a MongoDB instance:

```python
import pymongo
from pymongo_schema.extract import extract_pymongo_client_schema

with pymongo.MongoClient() as client:
    schema = extract_pymongo_client_schema(client)
```
Arguments can be specified to extract only some databases and some collections. See code documentation for more details.

Filter extract schema with a `namespace`:
```python
import json
from pymongo_schema.filter import filter_mongo_schema_namespaces

# assuming a namespace is defined in a file named namespace.json
with open("namespace.json") as f:
    namespace = json.load(f)

schema_filtered = filter_mongo_schema_namespaces(schema, namespace)
```

Save filtered_schema (could be used for schema) to file in json and md formats in a `docs` directory:
```python
from pymongo_schema.export import transform_data_to_file

transform_data_to_file(schema_filtered, ['json', 'md'], output='docs/schema_filtered')
```

Compare filtered_schema (could be used for schema) to another (previous for example) schema:
```python
from pymongo_schema.compare import compare_schemas_bases

# assuming a namespace is defined in a file named namespace.json
with open("old_schema_filtered.json") as f:
    old_schema_filtered = json.load(f)

differences = compare_schemas_bases(old_schema_filtered, schema_filtered)
```

Save differences to file in json and md formats in a `docs` directory:
```python
transform_data_to_file(differences, ['json', 'md'], output='docs/diff', category='diff')
```

Transform filtered_schema to a relational mapping:
```python
from pymongo_schema.tosql import mongo_schema_to_mapping

mapping = mongo_schema_to_mapping(schema_filtered)
```

Save mapping to file in json and md formats in a `docs` directory:
```python
transform_data_to_file(mapping, ['json', 'md'], output='docs/mapping', category='mapping')
```

# Schema

We define 'schema' as a dictionary describing the structure of MongoDB component, being either a MongoDB instances, a database, a collection, an objects or a field. 
 
Schema are hierarchically nested, with the following structure :  



```python 
# mongo_schema : A MongoDB instance contains databases
{
    "database_name_1": {}, #database_schema,
    "database_name_2": # A database contains collections
    { 
        "collection_name_1": {}, # collection_schema,
        "collection_name_2": # A collection maintains a 'count' and contains 1 object
        { 
            "count" : int, 
            "object":  # object_schema : An object contains fields.            
             {
                "field_name_1" : {}, # field_schema, 
                "field_name_2": # A field maintains 'types_count_information
                                # An optional 'array_types_count' field maintains 'types_count' information for values encountered in arrays 
                                # An 'OBJECT' or 'ARRAY(OBJECT)' field recursively contains 1 'object'
                {
                    'count': int,
                    'prop_in_object': float,
                    'type': 'type_str', 
                    'types_count': {  # count for each encountered type  
                        'type_str' : 13,
                        'Null' : 3
                    }, 
                    'array_type': 'type_str',
                    'array_types_count': {  # (optional) count for each type encountered  in arrays
                        'type_str' : 7,
                        'Null' : 3
                    }, 
                    'object': {}, # (optional) object_schema 
                } 
            } 
        }
    }           
}
```
# Contributing - Limitations - TODO 
The code base should be easy to read and improve upon. Contributions are welcomed.

## Mixed types handling
pymongo-schema handles mixed types by looking for the lowest common parent type in the following tree.

<img src="https://raw.githubusercontent.com/pajachiet/pymongo-schema/master/type_tree.png" alt="type_tree" width=700/>

If a field contains both arrays and scalars, it is considered as an array. The 'array_type' is defined as the common parent type of scalars and array_types encountered in this field. 

TODO

- Improve mapping from Python type to name (TYPE_TO_STR dict)
    - see documentation: [bson-types](https://docs.mongodb.com/manual/reference/bson-types/), [spec](http://bsonspec.org/spec.html)

- Check a mongo scheme for compatibility to an sql mapping
- Handle incompatibilities

## Support Python 3 version

- fix encoding issues when exporting manually added non-ascii characters

## Diff between schemas

A way to compare the schema dictionaries and highlights the differences.


## Test if a mongo schema can be mapped tosql

- test for the presence of mongo types in the mapping 
- look for mixes of list and scalar, that are currently not supported by mongo-connector-postgresql
- look for the presence of an '_id'

=> It may be donne directly in mongo-connector-postgresql doc_manager


## Adding fields in json/yaml outputs

- for example to add comments


## Other option to sort text outputs

- It is currently based on counts and then alphabetically.



## Tackle bigger databases
This code has been only used on a relatively small sized Mongo database, on which it was faster than Variety. 

To tackle bigger databases, it certainly would be usefull to implement the following variety's features :

- Analyze subsets of documents, most recent documents, or documents to a maximum depth.

## Tests
The codebase is still under development. It should not be trusted blindly.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/pajachiet/pymongo-schema",
    "name": "pymongo-schema",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "mongo,mongodb,schema,mongo-connector,mongo-connector-postgresql",
    "author": "Pierre-Alain Jachiet",
    "author_email": "pajachiet@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/38/60/d3bbece51eb1ef9332e3abb66a50f5034cbea7a99bcfba18ac486ac4d6a8/pymongo-schema-0.4.1.tar.gz",
    "platform": null,
    "description": "# pymongo-schema\nA schema analyser for MongoDB, written in Python. \n\nThis tool allows you to **extract your application's schema, directly from your MongoDB data**. It comes with **powerful schema manipulation and export functionalities**.\n\nIt will be particularly useful when you inherit a data dump, and want to quickly learn how the data is structured. \n\npymongo-schema allows to map your MongoDB data model to a relational (SQL) data model. This greatly helps to configure [mongo-connector-postgresql](https://github.com/Hopwork/mongo-connector-postgresql), a tool to synchronize data from MongoDB to a target PostgreSQL database.\n\nIt also helps you to **compare different versions of your data model**.\n\nThis tools is inspired by [variety](https://github.com/variety/variety), with the following enhancement\n\n- extract the **hierarchical structure** of the schema \n- versatile output options : json, yaml, tsv, markdown or htlm\n- **finer grained types**. ex: INTEGER, DOUBLE rather than NUMBER \n- **filtering** of the output schema, using a `namespace` as defined by [mongo-connector](https://github.com/mongodb-labs/mongo-connector/wiki/Configuration-Options#configure-namespaces)\n- **mapping to a relational schema**\n- **comparison** of successive schema\n\n[![Build Status](https://travis-ci.org/pajachiet/pymongo-schema.svg?branch=master)](https://travis-ci.org/pajachiet/pymongo-schema)\n[![Coverage Status](https://coveralls.io/repos/github/pajachiet/pymongo-schema/badge.svg?branch=master)](https://coveralls.io/github/pajachiet/pymongo-schema?branch=master)\n\n\n# Install\n\nYou can install latest stable version PyPi :\n```shell\npip install --upgrade pymongo-schema\n```\n\nOr directly from github : \n```shell\npip install --upgrade git+https://github.com/pajachiet/pymongo-schema\n```\n# Usage\n\n## Command line\n\n```shell\npython -m pymongo_schema -h\nusage: [-h] [--quiet] {extract,transform,tosql,compare} ...\n\ncommands:\n  {extract,transform,tosql,compare}\n    extract             Extract schema from a MongoDB instance\n    transform           Transform a json schema to another format, potentially\n                        filtering or changing columns outputs\n    tosql               Create a mapping from mongo schema to relational\n                        schema (json input and output)\n    compare             Compare two schemas\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --quiet               Remove logging on standard output\n\nUsage:\n    python -m pymongo_schema extract -h\n    usage:  [-h] [-f [FORMATS [FORMATS ...]]] [-o OUTPUT] [--port PORT] [--host HOST]\n                 [-d [DATABASES [DATABASES ...]]] [-c [COLLECTIONS [COLLECTIONS ...]]]\n                 [--columns COLUMNS [COLUMNS ...]] [--size SIZE] [--without-counts]\n                 \n    python -m pymongo_schema transform -h\n    usage: [-h] [-f [FORMATS [FORMATS ...]]] [-o OUTPUT] [--category CATEGORY] [-n FILTER]\n                [--columns COLUMNS [COLUMNS ...]] [--without-counts] [input]\n                \n    python -m pymongo_schema tosql -h\n    usage: [-h] [-f [FORMATS [FORMATS ...]]] [--columns COLUMNS [COLUMNS ...]]\n                [--without-counts] [-o OUTPUT] [input]\n\n    python -m pymongo_schema compare -h\n    usage: [-h] [-f [FORMATS [FORMATS ...]]] [-o OUTPUT] [input]\n                [--columns COLUMNS [COLUMNS ...]] [--without-counts]\n                [--detailed_diff] prev_schema [new_schema]\n\n```\n\nTo display full usage, with options description, run:\n```shell \npymongo-schema <command> -h\n```\n\n## Python package\n\npymongo_schema modules can also be imported to be used directly inside python code :\n\n```python\nfrom pymongo_schema.compare import compare_schemas_bases\nfrom pymongo_schema.export import transform_data_to_file\nfrom pymongo_schema.extract import extract_pymongo_client_schema\nfrom pymongo_schema.filter import filter_mongo_schema_namespaces\nfrom pymongo_schema.tosql import mongo_schema_to_mapping\n```\n\nFore more details, refer to modules and functions docstrings.\n\n# Examples\n\nFirst, lets populate a collection in test database from mongo shell\n\n\n    db.users.insert({name: \"Tom\", bio: \"A nice guy.\", pets: [\"monkey\", \"fish\"], someWeirdLegacyKey: \"I like Ike!\"})\n    db.users.insert({name: \"Dick\", bio: \"I swordfight.\", birthday: new Date(\"1974/03/14\")})\n    db.users.insert({name: \"Harry\", pets: \"egret\", birthday: new Date(\"1984/03/14\"), location:{country:\"France\", city: \"Lyon\"}})\n    db.users.insert({name: \"Genevi\u00e8ve\", bio: \"\u00c7a va?\", location:{country:\"France\", city: \"Nantes\"}})\n    db.users.insert({name: \"MadJacques\", location:{country:\"France\", city: \"Paris\"}})\n\n## Bash api examples\n### Easy examples\n\nExtract the schema from this database, with a json format on standard output\n\n    $ python -m pymongo_schema extract --database test\n    === Start MongoDB schema analysis\n    Extract schema of database test\n    ...collection users\n       scanned 5 documents out of 5 (100.00 %)\n    --- MongoDB schema analysis took 0.00 s\n    === Write output\n\n    {\"test\": {\n        \"users\": {\n            \"object\": {\"_id\": {\"prop_in_object\": 1.0, \"count\": 5, \"type\": \"oid\", \"types_count\": {\"oid\": 5}},\n                       \"pets\": {\"array_types_count\": {\"string\": 2}, \"prop_in_object\": 0.4, \"count\": 2, \"array_type\": \"string\", \"type\": \"ARRAY\", \"types_count\": {\"string\": 1, \"ARRAY\": 1}},\n                       \"birthday\": {\"prop_in_object\": 0.4, \"count\": 2, \"type\": \"date\", \"types_count\": {\"date\": 2}},\n                       \"name\": {\"prop_in_object\": 1.0, \"count\": 5, \"type\": \"string\", \"types_count\": {\"string\": 5}},\n                       \"bio\": {\"prop_in_object\": 0.6, \"count\": 3, \"type\": \"string\", \"types_count\": {\"string\": 3}},\n                       \"someWeirdLegacyKey\": {\"prop_in_object\": 0.2, \"count\": 1, \"type\": \"string\", \"types_count\": {\"string\": 1}},\n                       \"location\": {\"object\": {\"country\": {\"prop_in_object\": 1.0, \"count\": 3, \"type\": \"string\", \"types_count\": {\"string\": 3}},\n                                               \"city\": {\"prop_in_object\": 1.0, \"count\": 3, \"type\": \"string\", \"types_count\": {\"string\": 3}}},\n                                    \"types_count\": {\"OBJECT\": 3}, \"prop_in_object\": 0.6, \"type\": \"OBJECT\", \"count\": 3}},\n            \"count\": 5}}}\n\nExtract the same schema in md format.\n\n    $ python -m pymongo_schema extract --database test --format md\n    === Start MongoDB schema analysis\n    Extract schema of database test\n    ...collection users\n       scanned 5 documents out of 5 (100.00 %)\n    --- MongoDB schema analysis took 0.00 s\n    === Write output\n\n    ### Database: test\n    #### Collection: users \n    |Field_compact_name     |Field_name             |Count     |Percentage     |Types_count                           |\n    |-----------------------|-----------------------|----------|---------------|--------------------------------------|\n    |_id                    |_id                    |5         |100.0          |oid : 5                               |\n    |name                   |name                   |5         |100.0          |string : 5                            |\n    |bio                    |bio                    |3         |60.0           |string : 3                            |\n    |location               |location               |3         |60.0           |OBJECT : 3                            |\n    | . city                |city                   |3         |100.0          |string : 3                            |\n    | . country             |country                |3         |100.0          |string : 3                            |\n    |birthday               |birthday               |2         |40.0           |date : 2                              |\n    |pets                   |pets                   |2         |40.0           |ARRAY(string : 2) : 1, string : 1     |\n    |someWeirdLegacyKey     |someWeirdLegacyKey     |1         |20.0           |string : 1                            |\n\nMap this schema to a relational mapping\n\n    $ python -m pymongo_schema extract --database test | python -m pymongo_schema tosql\n    === Start MongoDB schema analysis\n    Extract schema of database test\n    ...collection users\n       scanned 5 documents out of 5 (100.00 %)\n    --- MongoDB schema analysis took 0.00 s\n    === Write output\n    === Generate mapping from mongo to sql\n    === Write output\n\n    {\"test\":\n     {\"users\":\n          {\"_id\": {\"type\": \"TEXT\", \"dest\": \"_id\"},\n           \"pets\": {\"valueField\": \"pets\", \"fk\": \"id_users\", \"type\": \"_ARRAY_OF_SCALARS\", \"dest\": \"users__pets\"},\n           \"location.city\": {\"type\": \"TEXT\", \"dest\": \"location__city\"},\n           \"name\": {\"type\": \"TEXT\", \"dest\": \"name\"},\n           \"someWeirdLegacyKey\": {\"type\": \"TEXT\", \"dest\": \"someWeirdLegacyKey\"},\n           \"pk\": \"_id\",\n           \"bio\": {\"type\": \"TEXT\", \"dest\": \"bio\"},\n           \"birthday\": {\"type\": \"TIMESTAMP\", \"dest\": \"birthday\"},\n           \"location.country\": {\"type\": \"TEXT\", \"dest\": \"location__country\"}},\n      \"users__pets\": {\"id_users\": {\"type\": \"TEXT\"},\n                      \"pets\": {\"type\": \"TEXT\", \"dest\": \"pets\"},\n                      \"pk\": \"_id_postgres\"}}}\n\n### Other examples\n\n**extract:** Extract the schema for collections `test_collection_1` and `test_collection_2` from `test_db` and write it into `mongo_schema.html` and `mongo_schema.json` files\n```shell\n    python -m pymongo_schema extract --databases test_db --collections test_collection_1 test_collection_2 --output mongo_schema --format html json\n```\n**extract:** Extract the schema for collection `test_collection_1` with only 1000 random rows scanned and write it into `mongo_schema.html` files\n```shell\n    python -m pymongo_schema extract --collections test_collection_1 --size 1000 --output mongo_schema --format html\n```\n**transform:** Filter extracted schema (`mongo_schema.json`) using `namespace.json` file and write output into `mongo_schema_filtered.html`, `mongo_schema_filtered.csv` and `mongo_schema_filtered.json` files\n```shell\n    python -m pymongo_schema transform mongo_schema.json --filter namespace.json --output mongo_schema_filtered --format html csv json\n```\n**tosql:** Create mapping file based on `mongo_schema_filtered.json`\n```shell\n    python -m pymongo_schema tosql mongo_schema_filtered.json --output mapping.json\n```\n\n## Python api examples\n\nExtract the schemas of all collections and all databases in a MongoDB instance:\n\n```python\nimport pymongo\nfrom pymongo_schema.extract import extract_pymongo_client_schema\n\nwith pymongo.MongoClient() as client:\n    schema = extract_pymongo_client_schema(client)\n```\nArguments can be specified to extract only some databases and some collections. See code documentation for more details.\n\nFilter extract schema with a `namespace`:\n```python\nimport json\nfrom pymongo_schema.filter import filter_mongo_schema_namespaces\n\n# assuming a namespace is defined in a file named namespace.json\nwith open(\"namespace.json\") as f:\n    namespace = json.load(f)\n\nschema_filtered = filter_mongo_schema_namespaces(schema, namespace)\n```\n\nSave filtered_schema (could be used for schema) to file in json and md formats in a `docs` directory:\n```python\nfrom pymongo_schema.export import transform_data_to_file\n\ntransform_data_to_file(schema_filtered, ['json', 'md'], output='docs/schema_filtered')\n```\n\nCompare filtered_schema (could be used for schema) to another (previous for example) schema:\n```python\nfrom pymongo_schema.compare import compare_schemas_bases\n\n# assuming a namespace is defined in a file named namespace.json\nwith open(\"old_schema_filtered.json\") as f:\n    old_schema_filtered = json.load(f)\n\ndifferences = compare_schemas_bases(old_schema_filtered, schema_filtered)\n```\n\nSave differences to file in json and md formats in a `docs` directory:\n```python\ntransform_data_to_file(differences, ['json', 'md'], output='docs/diff', category='diff')\n```\n\nTransform filtered_schema to a relational mapping:\n```python\nfrom pymongo_schema.tosql import mongo_schema_to_mapping\n\nmapping = mongo_schema_to_mapping(schema_filtered)\n```\n\nSave mapping to file in json and md formats in a `docs` directory:\n```python\ntransform_data_to_file(mapping, ['json', 'md'], output='docs/mapping', category='mapping')\n```\n\n# Schema\n\nWe define 'schema' as a dictionary describing the structure of MongoDB component, being either a MongoDB instances, a database, a collection, an objects or a field. \n \nSchema are hierarchically nested, with the following structure :  \n\n\n\n```python \n# mongo_schema : A MongoDB instance contains databases\n{\n    \"database_name_1\": {}, #database_schema,\n    \"database_name_2\": # A database contains collections\n    { \n        \"collection_name_1\": {}, # collection_schema,\n        \"collection_name_2\": # A collection maintains a 'count' and contains 1 object\n        { \n            \"count\" : int, \n            \"object\":  # object_schema : An object contains fields.            \n             {\n                \"field_name_1\" : {}, # field_schema, \n                \"field_name_2\": # A field maintains 'types_count_information\n                                # An optional 'array_types_count' field maintains 'types_count' information for values encountered in arrays \n                                # An 'OBJECT' or 'ARRAY(OBJECT)' field recursively contains 1 'object'\n                {\n                    'count': int,\n                    'prop_in_object': float,\n                    'type': 'type_str', \n                    'types_count': {  # count for each encountered type  \n                        'type_str' : 13,\n                        'Null' : 3\n                    }, \n                    'array_type': 'type_str',\n                    'array_types_count': {  # (optional) count for each type encountered  in arrays\n                        'type_str' : 7,\n                        'Null' : 3\n                    }, \n                    'object': {}, # (optional) object_schema \n                } \n            } \n        }\n    }           \n}\n```\n# Contributing - Limitations - TODO \nThe code base should be easy to read and improve upon. Contributions are welcomed.\n\n## Mixed types handling\npymongo-schema handles mixed types by looking for the lowest common parent type in the following tree.\n\n<img src=\"https://raw.githubusercontent.com/pajachiet/pymongo-schema/master/type_tree.png\" alt=\"type_tree\" width=700/>\n\nIf a field contains both arrays and scalars, it is considered as an array. The 'array_type' is defined as the common parent type of scalars and array_types encountered in this field. \n\nTODO\n\n- Improve mapping from Python type to name (TYPE_TO_STR dict)\n    - see documentation: [bson-types](https://docs.mongodb.com/manual/reference/bson-types/), [spec](http://bsonspec.org/spec.html)\n\n- Check a mongo scheme for compatibility to an sql mapping\n- Handle incompatibilities\n\n## Support Python 3 version\n\n- fix encoding issues when exporting manually added non-ascii characters\n\n## Diff between schemas\n\nA way to compare the schema dictionaries and highlights the differences.\n\n\n## Test if a mongo schema can be mapped tosql\n\n- test for the presence of mongo types in the mapping \n- look for mixes of list and scalar, that are currently not supported by mongo-connector-postgresql\n- look for the presence of an '_id'\n\n=> It may be donne directly in mongo-connector-postgresql doc_manager\n\n\n## Adding fields in json/yaml outputs\n\n- for example to add comments\n\n\n## Other option to sort text outputs\n\n- It is currently based on counts and then alphabetically.\n\n\n\n## Tackle bigger databases\nThis code has been only used on a relatively small sized Mongo database, on which it was faster than Variety. \n\nTo tackle bigger databases, it certainly would be usefull to implement the following variety's features :\n\n- Analyze subsets of documents, most recent documents, or documents to a maximum depth.\n\n## Tests\nThe codebase is still under development. It should not be trusted blindly.\n",
    "bugtrack_url": null,
    "license": "GNU General Public License v3.0",
    "summary": "A schema analyser for MongoDB written in Python",
    "version": "0.4.1",
    "project_urls": {
        "Homepage": "https://github.com/pajachiet/pymongo-schema"
    },
    "split_keywords": [
        "mongo",
        "mongodb",
        "schema",
        "mongo-connector",
        "mongo-connector-postgresql"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bc006f164d9dfba20b9776eb0df1f84ee1d2d0c089648197d00e910c782ba117",
                "md5": "daa3fa1f6fb0d5b870d4a8743a96440f",
                "sha256": "fbce44d1aa6c665da4114fcf391aa8d5eea2407ec0253a88d874e1cb42fbe311"
            },
            "downloads": -1,
            "filename": "pymongo_schema-0.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "daa3fa1f6fb0d5b870d4a8743a96440f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 29622,
            "upload_time": "2022-08-30T21:21:08",
            "upload_time_iso_8601": "2022-08-30T21:21:08.708503Z",
            "url": "https://files.pythonhosted.org/packages/bc/00/6f164d9dfba20b9776eb0df1f84ee1d2d0c089648197d00e910c782ba117/pymongo_schema-0.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3860d3bbece51eb1ef9332e3abb66a50f5034cbea7a99bcfba18ac486ac4d6a8",
                "md5": "4aa2fe028ec6bd67c18f7c5170046003",
                "sha256": "25bb5b1fe5cbfb8d3e03b49c749e760fa741b1ccafdc90f841cfdfdfb75fa8c7"
            },
            "downloads": -1,
            "filename": "pymongo-schema-0.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "4aa2fe028ec6bd67c18f7c5170046003",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 30599,
            "upload_time": "2022-08-30T21:21:11",
            "upload_time_iso_8601": "2022-08-30T21:21:11.302183Z",
            "url": "https://files.pythonhosted.org/packages/38/60/d3bbece51eb1ef9332e3abb66a50f5034cbea7a99bcfba18ac486ac4d6a8/pymongo-schema-0.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-08-30 21:21:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pajachiet",
    "github_project": "pymongo-schema",
    "travis_ci": true,
    "coveralls": true,
    "github_actions": false,
    "requirements": [],
    "lcname": "pymongo-schema"
}

Pierre-Alain Jachiet