# Data Flow Visualization Tool
This tool generates visual representations of data flows based on Denodo metadata exported in a `.vql` file. It can produce both complete data flow diagrams and focused data flow diagrams for specified views and tables.
## Installation
### Dependencies
- Python 3.10 or higher
- UV package manager
### Setup
#### Linux/macOS
```sh
source setup.sh
```
#### Windows
For PowerShell:
```powershell
.\setup.ps1
```
For Command Prompt:
```cmd
setup.bat
```
To reset the environment:
Linux/macOS:
```sh
rm -rf .venv && source setup.sh
```
Windows (PowerShell):
```powershell
Remove-Item -Recurse -Force .venv; .\setup.ps1
```
Windows (Command Prompt):
```cmd
rmdir /s /q .venv && setup.bat
```
## Installation from PyPI
- Install from PyPI:
```sh
uv pip install data-flow-generator
```
- Install for current user:
```sh
uv pip install --user data-flow-generator
```
## Usage
1. **Install the Tool:**
You can install the tool globally using UV:
```sh
# Install globally with UV
uv pip install .
```
Or install for your user only:
```sh
# Install for current user
uv pip install --user .
```
2. **Run the Tool:**
After installation, you can run the tool from anywhere using:
```sh
dataflow
```
The tool provides multiple ways to select your SQL file:
- Drop/Upload File: Simply drag and drop any SQL file into the terminal
- Browse SQL Files: Shows a filtered list of SQL files in the directory
- Specify file path: Enter or paste a file path directly
- Search in directory: Search for SQL files by name
Supported SQL file extensions:
- `.sql` - Standard SQL files
- `.vql` - Denodo VQL files
- `.ddl` - Data Definition Language
- `.dml` - Data Manipulation Language
- `.hql` - Hive Query Language
- `.pls`, `.plsql` - PL/SQL files
- `.proc` - Stored Procedures
- `.psql` - PostgreSQL files
- `.tsql` - T-SQL files
- `.view` - View definitions
File Validation:
- Automatically checks for valid SQL extensions
- Validates file content for SQL keywords
- Provides option to proceed with non-SQL files after confirmation
The tool generates diagrams in the "generated-image" folder:
- Complete flow diagram: Shows all dependencies
- Focused flow diagram: Shows selected nodes and their relationships
Outputs include:
- PNG files for static visualization
- Interactive HTML files for dynamic exploration
3. **Database Selection:**
The tool analyzes the database names in your metadata and prompts you to select a main database.
Databases are listed in descending order of frequency, with the most common database first.
- The main database will be displayed without a prefix in the visualization
- Data Market objects are prefixed with "datamarket." (e.g., "datamarket.bv_datalake_d1300_currency_full")
- All other database objects are prefixed with "other."
This database differentiation is represented in the visualization legend with different colors.
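The prefixing rules above can be sketched as a small helper. The function name is hypothetical and this is only a sketch of the behavior described, not the tool's actual code:

```python
def display_name(qualified_name: str, main_db: str) -> str:
    """Apply the display rules: main database unprefixed, data_market
    shown as 'datamarket.', everything else as 'other.'.
    Illustrative sketch only."""
    if "." not in qualified_name:
        return qualified_name  # no database prefix present
    db, obj = qualified_name.split(".", 1)
    if db == main_db:
        return obj
    if db == "data_market":
        return f"datamarket.{obj}"
    return f"other.{obj}"
```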
## Development
For contributors, it's recommended to install the tool in editable mode. This allows you to modify the code and see changes immediately without reinstalling.
### Editable Installation
Activate your virtual environment and run:
```sh
uv pip install -e ".[dev]"
```
This installs the package in editable mode, making it easier to test changes during development.
### Tips for Local Development
- Ensure your virtual environment is activated before running any commands.
- Use `pytest` to continuously run tests as you make changes.
- Consider setting up linting and format checking to maintain code quality.
## Testing
The project uses pytest with separate unit and integration tests to ensure code quality and correct functionality.
### Running Tests
With your virtual environment activated:
```sh
# Run all tests (unit + integration)
pytest
# Run only unit tests
pytest tests/unit/
# Run only integration tests
pytest tests/integration/
# Run with coverage report
pytest --cov=src --cov-report=term
# Run specific test file
pytest tests/unit/test_graph_generation.py
```
### Test Structure
1. Unit Tests (`tests/unit/`):
- VQL parsing and graph generation
- Node type inference and database detection
- CLI core functionality
- Visualization output validation
- Error handling and edge cases
2. Integration Tests (`tests/integration/`):
- End-to-end workflow testing
- File generation and validation
- Complex graph scenarios
### Coverage Requirements
A minimum of 80% code coverage is maintained across all modules. The CI pipeline enforces this requirement and generates coverage badges automatically.
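A coverage floor like this is typically enforced through pytest-cov options, for example in `pyproject.toml` (illustrative; the project's actual configuration may differ):

```toml
[tool.pytest.ini_options]
addopts = "--cov=src --cov-report=term --cov-fail-under=80"
testpaths = ["tests"]
```

With `--cov-fail-under=80`, any test run whose total coverage drops below 80% exits with a failure, which is what lets CI enforce the requirement.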
## Script Overview
### Parsing `.vql` File
The script reads and parses the `.vql` file to extract metadata about views and tables. It uses regular expressions to find:
- **Views**: Matched by the pattern `CREATE OR REPLACE (INTERFACE) VIEW`.
- **Tables**: Matched by the pattern `CREATE OR REPLACE TABLE`.
Dependencies between views and tables are identified by scanning for the `FROM` and `JOIN` keywords within each view's or table's definition.
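The extraction described above can be sketched with a few regular expressions. This is an illustrative simplification, not the tool's actual implementation (`sketch_parse` and the pattern names are hypothetical; real VQL needs more robust handling of quoting and qualified names):

```python
import re

VIEW_RE = re.compile(r"CREATE\s+OR\s+REPLACE\s+(?:INTERFACE\s+)?VIEW\s+(\w+)", re.IGNORECASE)
TABLE_RE = re.compile(r"CREATE\s+OR\s+REPLACE\s+TABLE\s+(\w+)", re.IGNORECASE)
DEP_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def sketch_parse(vql: str):
    """Extract (node_types, edges) from a VQL string -- illustrative only."""
    node_types, edges = {}, []
    for stmt in vql.split(";"):
        match, kind = VIEW_RE.search(stmt), "view"
        if match is None:
            match, kind = TABLE_RE.search(stmt), "table"
        if match is None:
            continue  # statement defines neither a view nor a table
        name = match.group(1)
        node_types[name] = kind
        # each FROM/JOIN reference becomes an edge: dependency -> object
        edges.extend((dep, name) for dep in DEP_RE.findall(stmt))
    return node_types, edges
```

Running this on a single `CREATE OR REPLACE VIEW` statement yields one node of type `"view"` plus an edge from every `FROM`/`JOIN` source to that view.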
### Database Identification
The tool identifies database prefixes in object names (e.g., "data_market.table_name") and processes them as follows:
- Objects from the main database (selected by the user) are displayed without a prefix
- Objects from "data_market" are displayed with the prefix "datamarket."
- Objects from all other databases are displayed with the prefix "other."
This helps you quickly identify objects that come from different databases in your visualization.
### Handling Files Without Database Prefixes
If the tool does not detect any database prefixes in your VQL file, it will:
1. Use the filename as the default database name (with special characters removed)
2. Allow you to select this as the main database
3. Process all objects as if they belong to this main database
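Deriving the fallback name can look roughly like this (the helper name is hypothetical; shown only to illustrate the "filename with special characters removed" rule):

```python
import re
from pathlib import Path

def default_db_name(file_path: str) -> str:
    """Derive a fallback database name from the file name, stripping
    special characters. Illustrative sketch only."""
    stem = Path(file_path).stem  # file name without directory or extension
    return re.sub(r"[^A-Za-z0-9_]", "", stem)
```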
### Functions
1. **find_script_dependencies(vql_script)**:
- Finds all the table names a `vql_script` is dependent on using the `FROM` and `JOIN` keywords.
- Returns a list of names (dependencies in the given `vql_script`).
2. **parse_vql(file_path)**:
- Parses the `.vql` file to extract views, tables, and their dependencies.
- Returns a list of edges (dependencies), a dictionary of node types, and database counts.
3. **standardize_database_names(edges, node_types, main_db)**:
- Standardizes database names based on the selected main database.
- Formats node names to include database prefixes where appropriate.
4. **create_pyvis_figure(graph, node_types, focus_nodes=[], shake_towards_root=False)**:
- Creates interactive Pyvis figures for the data flow diagrams.
- Returns an interactive figure.
5. **draw_complete_data_flow(edges, node_types, save_path=None, file_name=None)**:
- Generates and displays a complete data flow diagram.
- Adjusts the figure size based on the number of nodes.
- Saves the figure as `complete_data_flow_pyvis_metadata_file_name.html`.
6. **draw_focused_data_flow(edges, node_types, focus_nodes, save_path=None, file_name=None, see_ancestors=True, see_descendants=True)**:
- Generates and displays a focused data flow diagram for the specified nodes.
- Includes the specified nodes, their ancestors, and descendants in the subgraph if enabled.
- Adjusts the figure size based on the number of nodes.
- Saves the figure as `focused_data_flow_pyvis_metadata_file_name.html`.
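The ancestor/descendant selection in the focused diagram can be illustrated with a plain breadth-first walk over the edge list (a self-contained sketch with hypothetical names; it does not depend on the tool's internals):

```python
from collections import deque

def focused_node_set(edges, focus, see_ancestors=True, see_descendants=True):
    """Collect the focus nodes plus their ancestors and/or descendants
    by walking a (source, target) edge list. Illustrative sketch only."""
    parents, children = {}, {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)
        parents.setdefault(dst, set()).add(src)

    def reachable(start, neighbors):
        seen, queue = set(), deque(start)
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    keep = set(focus)
    if see_ancestors:
        keep |= reachable(focus, parents)   # everything upstream
    if see_descendants:
        keep |= reachable(focus, children)  # everything downstream
    return keep
```

The subgraph drawn in the focused diagram is then the original graph restricted to this node set.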
### Main Script Execution
- Reads the `.vql` file selected via the CLI tool from the metadata folder.
- Parses the file to extract metadata (views, tables, and their dependencies).
- Prompts the user to select the main database from a list ordered by frequency.
- Standardizes node names based on database information.
- Generates and saves the complete data flow diagram if selected in the CLI tool.
- Generates and saves the focused data flow diagram if selected in the CLI tool.
## Project Structure
```
/data-flow-generator
|-- pyproject.toml # Project configuration and dependencies
|-- requirements.txt # Legacy requirements file
|-- uv.lock # UV lock file
|-- src/
| |-- __init__.py
| |-- dataflow.py # Command line interface
| |-- generate_data_flow.py
| |-- pyvis_mod.py
|-- tests/
| |-- generate_data_flow_test.py
| |-- test_database_functions.py
|-- metadata/ # VQL file directory
| |-- denodo_metadata1.vql
| |-- denodo_metadata2.vql
|-- generated-image/ # Output directory
|-- complete_data_flow_*.html
|-- focused_data_flow_*.html
```
## Troubleshooting
- **File Not Found Error**:
- Ensure the `.vql` file is in the metadata folder within the same directory as the script.
- **Overlapping Titles in Diagram**:
- Increase the `SCALING_CONSTANT` value in `generate_data_flow.py` to widen the figure.
- **Legend overlaps nodes in diagram**:
- Because diagram generation is not deterministic, rerun the generation until a satisfactory layout is achieved.
If you encounter any issues or need further assistance, feel free to ask us at Insight Factory, Emerging Business Technology!