segmentation-forests


- **Name**: segmentation-forests
- **Version**: 0.1.0
- **Summary**: Unsupervised segment discovery using divergence-based decision trees inspired by Random Forests
- **Upload time**: 2025-10-08 10:27:49
- **Requires Python**: >=3.12
- **Keywords**: machine-learning, unsupervised-learning, segment-discovery, data-mining, random-forests

# Segmentation Forests

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)

**Unsupervised segment discovery using divergence-based decision trees inspired by Random Forests.**

Automatically discover meaningful segments in your data where metric distributions significantly diverge from the background. Perfect for exploratory data analysis, anomaly detection, and customer segmentation without requiring labeled data.

---

## 🎯 The Problem

You have a dataset with many categorical features and a metric you care about. For example:

- **E-commerce**: Users across countries, devices, age groups → conversion rate
- **Digital advertising**: Impressions across demographics, platforms, times → CTR
- **Healthcare**: Patients across conditions, treatments, demographics → readmission rate
- **Finance**: Transactions across customer segments, times → fraud rate

**The question**: *Which specific combinations of features exhibit unusual behavior?*

With N features and many values per feature, the number of candidate feature combinations grows exponentially, so testing them exhaustively quickly becomes infeasible. **Segmentation Forests addresses this** with a greedy, tree-based search for segments where your metric's distribution differs significantly from that of the overall population.
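
To make the scale of the search space concrete, here is a small back-of-the-envelope sketch. It assumes a segment is a conjunction in which each feature is either left unconstrained or pinned to exactly one value; the helper name `count_candidate_segments` is hypothetical and not part of the package.

```python
from math import prod


def count_candidate_segments(values_per_feature: list[int]) -> int:
    """Count conjunctive segments where each feature is either unconstrained
    or pinned to one of its values (the all-unconstrained case is excluded)."""
    return prod(k + 1 for k in values_per_feature) - 1


# Four features with 5, 3, 2 and 4 distinct values.
small = count_candidate_segments([5, 3, 2, 4])

# Twenty features with 10 distinct values each.
large = count_candidate_segments([10] * 20)

print(small)            # 359
print(f"{large:.2e}")   # 6.73e+20
```

Even a modest table with twenty 10-valued features has far too many candidate segments to score one by one, which is why a greedy search is attractive.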

---

## ✨ Key Features

- 🌳 **Tree-based discovery**: Greedy algorithm efficiently navigates combinatorial feature space
- 🌲 **Forest ensemble**: Bootstrap aggregating for robust, reproducible patterns
- 📊 **Statistical rigor**: KS distance (continuous) & Jensen-Shannon divergence (discrete)
- 📈 **Beautiful visualizations**: Distribution comparisons and quality assessments
- 🔬 **Fully typed**: Complete type hints for excellent IDE support
- ⚡ **Fast & scalable**: Handles datasets with millions of rows

---

## 🚀 Quick Start

### Installation

```bash
pip install segmentation-forests
```

### Basic Example

```python
import pandas as pd
from segmentation_forests import SegmentationTree

# Your data: features + metric ("..." marks elided values; every column needs the same length)
data = pd.DataFrame({
    'country': ['US', 'UK', 'US', 'UK', ...],
    'device': ['Mobile', 'Desktop', 'Mobile', ...],
    'gender': ['F', 'M', 'F', ...],
    'impressions': [245, 103, 312, 98, ...]  # Your metric
})

# Discover segments with a single tree
tree = SegmentationTree(max_depth=3, min_samples_split=100)
tree.fit(data, metric_column='impressions')
segments = tree.get_segments(min_divergence=0.1)

# View results
for i, seg in enumerate(segments[:3], 1):
    print(f"{i}. {seg.get_condition_string()}")
    print(f"   Divergence: {seg.divergence:.3f} | Size: {seg.size:,}")
```

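**Illustrative output** (actual segments depend on your data; this sample comes from a dataset that also has a `time_of_day` column):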
```
1. gender == F AND device == Mobile AND country == UK
   Divergence: 0.948 | Size: 523

2. time_of_day == Evening AND country == US AND device == Desktop
   Divergence: 0.856 | Size: 412

3. country == DE AND time_of_day == Morning
   Divergence: 0.824 | Size: 289
```

---

## 🌲 Using the Forest (Recommended)

For more robust results, use the ensemble approach:

```python
from segmentation_forests import SegmentationForest

# Create forest with bootstrap sampling and random features
forest = SegmentationForest(
    n_trees=10,
    max_depth=3,
    max_features=2,  # Random feature selection
    min_samples_split=100,
    min_samples_leaf=50
)

# Fit and get segments found by multiple trees
forest.fit(data, metric_column='impressions')
robust_segments = forest.get_segments(min_support=3, min_divergence=0.1)

# View results
for seg in robust_segments:
    cond_str = " AND ".join([f"{c[0]} {c[1]} {c[2]}" for c in seg['conditions']])
    print(f"{cond_str}")
    print(f"  Support: {seg['support']}/10 trees ({seg['support_rate']*100:.0f}%)")
    print(f"  Avg Divergence: {seg['avg_divergence']:.3f}")
    print()
```
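
The two sources of randomness mentioned in the comment above (bootstrap sampling and random feature selection) can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the package's code: the helper names `bootstrap_rows` and `random_feature_subset` are hypothetical, and the integers in `rows` merely stand in for row indices.

```python
import random


def bootstrap_rows(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, max_features, rng):
    """Pick the features that a single split is allowed to consider."""
    return rng.sample(features, min(max_features, len(features)))


rng = random.Random(42)
features = ["country", "device", "gender", "time_of_day"]
rows = list(range(1000))  # stand-ins for row indices

for tree_id in range(3):
    sample = bootstrap_rows(rows, rng)
    subset = random_feature_subset(features, 2, rng)
    # Each bootstrap sample repeats some rows and omits others.
    print(tree_id, len(set(sample)), subset)
```

Because every tree sees a slightly different sample and a restricted set of features, a segment that keeps reappearing across trees is less likely to be an artifact of one particular draw.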

---

## 📊 Visualization

Plot a segment's metric distribution against the background:

```python
from segmentation_forests.visualization import plot_segment_comparison

# Compare segment distribution vs background
fig = plot_segment_comparison(
    data=data,
    segment_conditions=[('country', '==', 'UK'), ('device', '==', 'Mobile')],
    metric_column='impressions',
    title='UK Mobile Users vs Background'
)
fig.savefig('segment_comparison.png', dpi=150)
```

The resulting figure shows:
- **Left**: Overlapping histograms (background in blue, segment in coral)
- **Right**: Box plots comparing distributions
- **Clear separation**: Strong segments show minimal overlap

---

## 🧠 How It Works

### Algorithm Overview

1. **Compute Background Distribution**: Calculate the distribution of your metric across all data
2. **Greedy Tree Building**:
   - At each node, evaluate all feature-value splits
   - Choose the split that maximizes divergence from background
   - Recursively build left (matching condition) and right (not matching) subtrees
3. **Collect High-Divergence Leaves**: Return segments that diverge significantly
4. **Ensemble Aggregation** (Forest only): Vote across trees to find robust patterns
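
The split-selection step (step 2) can be sketched in plain Python. This is a simplified illustration under assumptions, not the package's implementation: rows are dictionaries, only `feature == value` splits are scored, only the matching side is compared with the background, and the names `ks_distance` and `best_split` as well as the toy rows are hypothetical.

```python
from bisect import bisect_right


def ks_distance(sample, background):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample), sorted(background)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def best_split(rows, features, metric, background, min_leaf=2):
    """Return (divergence, feature, value) for the strongest 'feature == value' split."""
    best = None
    for feature in features:
        for value in sorted({row[feature] for row in rows}):
            inside = [row[metric] for row in rows if row[feature] == value]
            outside_count = len(rows) - len(inside)
            # Skip splits that would leave either child too small.
            if len(inside) < min_leaf or outside_count < min_leaf:
                continue
            score = ks_distance(inside, background)
            if best is None or score > best[0]:
                best = (score, feature, value)
    return best


# Toy data: mobile rows have much higher impressions than the rest.
rows = [
    {"country": "UK", "device": "Mobile", "impressions": 300},
    {"country": "US", "device": "Mobile", "impressions": 310},
    {"country": "UK", "device": "Desktop", "impressions": 100},
    {"country": "UK", "device": "Desktop", "impressions": 102},
    {"country": "UK", "device": "Desktop", "impressions": 97},
    {"country": "US", "device": "Desktop", "impressions": 95},
    {"country": "US", "device": "Desktop", "impressions": 98},
    {"country": "US", "device": "Desktop", "impressions": 105},
]
background = [row["impressions"] for row in rows]

score, feature, value = best_split(rows, ["country", "device"], "impressions", background)
print(score, feature, value)  # 0.75 device Mobile
```

A full tree would apply `best_split` recursively to the matching and non-matching rows until `max_depth` or the minimum sample sizes stop it.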

### Divergence Measures

The algorithm automatically selects the appropriate measure:

| Metric Type | Measure | Range | Description |
|------------|---------|-------|-------------|
| **Continuous** | Kolmogorov-Smirnov | [0, 1] | Maximum absolute difference between the two empirical CDFs |
| **Discrete** | Jensen-Shannon | [0, 1] | Symmetrized, bounded variant of KL divergence |

**Decision threshold**: ≤20 unique values → discrete, >20 → continuous
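
As a rough sketch of the two measures and the selection rule, the following uses only the standard library. Several details are assumptions rather than facts about the package: the base-2 logarithm (which keeps the Jensen-Shannon value in [0, 1]), counting unique values on the background metric, and the function names `js_divergence`, `ks_distance` and `divergence`.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def js_divergence(sample, background):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    p_counts, q_counts = Counter(sample), Counter(background)
    total = 0.0
    for value in set(p_counts) | set(q_counts):
        p = p_counts[value] / len(sample)
        q = q_counts[value] / len(background)
        m = (p + q) / 2  # mixture distribution
        if p:
            total += 0.5 * p * log2(p / m)
        if q:
            total += 0.5 * q * log2(q / m)
    return total


def ks_distance(sample, background):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample), sorted(background)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def divergence(sample, background, max_discrete_values=20):
    """Use Jensen-Shannon for few distinct values, Kolmogorov-Smirnov otherwise."""
    if len(set(background)) <= max_discrete_values:
        return js_divergence(sample, background)
    return ks_distance(sample, background)


print(js_divergence([0, 0, 1, 1], [0, 0, 1, 1]))     # 0.0 (identical distributions)
print(js_divergence([0, 0], [1, 1]))                 # 1.0 (no overlap)
print(divergence(list(range(15)), list(range(30))))  # 0.5 (30 distinct values, so KS)
```

SciPy provides `scipy.stats.ks_2samp` and `scipy.spatial.distance.jensenshannon` (the latter returns the square root of the divergence); whether the package relies on them is not stated here.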

### Quality Guidelines

Interpret divergence scores:

- **≥ 0.5**: 🎯 **Excellent** - Strong, highly actionable pattern
- **0.3-0.5**: ✓ **Good** - Meaningful difference worth investigating
- **0.1-0.3**: ⚠️ **Weak** - Marginal effect, could be noise
- **< 0.1**: ❌ **Very weak** - Likely statistical noise
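
These bands translate directly into a small helper for labelling results. The function name `quality_label` is hypothetical, and treating each lower bound as inclusive is an assumption, since the ranges above share their endpoints.

```python
def quality_label(divergence: float) -> str:
    """Map a divergence score to the quality bands listed above."""
    if divergence >= 0.5:
        return "Excellent"
    if divergence >= 0.3:
        return "Good"
    if divergence >= 0.1:
        return "Weak"
    return "Very weak"


for score in (0.948, 0.3, 0.05):
    print(score, quality_label(score))
```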

---

## 📖 API Reference

### `SegmentationTree`

```python
SegmentationTree(
    max_depth: int = 5,
    min_samples_split: int = 50,
    min_samples_leaf: int = 20,
    divergence_threshold: float = 0.01,
    random_features: Optional[int] = None
)
```

**Parameters:**
- `max_depth`: Maximum tree depth (controls segment complexity)
- `min_samples_split`: Minimum samples required to split a node
- `min_samples_leaf`: Minimum samples required in each child
- `divergence_threshold`: Minimum divergence to keep a segment
- `random_features`: Number of random features per split (None = use all)

**Methods:**
- `fit(data: pd.DataFrame, metric_column: str) -> Self`: Fit tree to data
- `get_segments(min_divergence: float = 0.0) -> List[SegmentationNode]`: Get segments

---

### `SegmentationForest`

```python
SegmentationForest(
    n_trees: int = 10,
    max_depth: int = 5,
    min_samples_split: int = 50,
    min_samples_leaf: int = 20,
    divergence_threshold: float = 0.01,
    max_features: Optional[int] = None
)
```

**Parameters:**
- `n_trees`: Number of trees in the forest
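- `max_features`: Number of random features per split (the forest's counterpart of `random_features`)
- Other parameters same as `SegmentationTree`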

**Methods:**
- `fit(data: pd.DataFrame, metric_column: str) -> Self`: Fit forest
- `get_segments(min_support: int = 2, min_divergence: float = 0.0) -> List[Dict]`: Get robust segments

**Returns:** List of dicts with keys:
- `conditions`: List of (column, operator, value) tuples
- `support`: Number of trees that found this segment
- `avg_divergence`: Average divergence across trees
- `avg_size`: Average segment size
- `support_rate`: Fraction of trees (support / n_trees)
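
One plausible way to arrive at dictionaries of this shape is to group identical condition sets across trees and average their statistics. The sketch below is an assumption about the voting step, not the package's code: the function name `aggregate`, the input layout (one list of `(conditions, divergence, size)` tuples per tree) and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def aggregate(tree_segments, n_trees, min_support=2, min_divergence=0.0):
    """Group identical condition sets across trees and average their statistics."""
    grouped = defaultdict(list)
    for segments in tree_segments:
        seen = set()
        for conditions, divergence, size in segments:
            key = frozenset(conditions)  # condition order does not matter
            if key in seen:              # count each tree at most once per segment
                continue
            seen.add(key)
            grouped[key].append((divergence, size))

    results = []
    for key, hits in grouped.items():
        support = len(hits)
        avg_divergence = sum(d for d, _ in hits) / support
        if support >= min_support and avg_divergence >= min_divergence:
            results.append({
                "conditions": sorted(key),
                "support": support,
                "avg_divergence": avg_divergence,
                "avg_size": sum(s for _, s in hits) / support,
                "support_rate": support / n_trees,
            })
    return sorted(results, key=lambda r: (-r["support"], -r["avg_divergence"]))


uk = ("country", "==", "UK")
mobile = ("device", "==", "Mobile")
tree_segments = [
    [((uk, mobile), 0.9, 500)],
    [((mobile, uk), 0.8, 520), ((("gender", "==", "F"),), 0.2, 4000)],
    [((("country", "==", "DE"),), 0.4, 900)],
]
robust = aggregate(tree_segments, n_trees=3, min_support=2)
print(robust[0]["conditions"], robust[0]["support"])
```

Using a `frozenset` as the grouping key means the same segment is recognized even when two trees discover its conditions in a different order.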

---

### `SegmentationNode`

Represents a discovered segment.

**Attributes:**
- `conditions`: List of (column, operator, value) tuples
- `divergence`: Divergence score
- `size`: Number of data points
- `depth`: Depth in tree
- `data_indices`: Indices of data points in this segment

**Methods:**
- `get_condition_string() -> str`: Human-readable condition string
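
Since `conditions` is a list of `(column, operator, value)` tuples, it is easy to format or re-apply a segment outside the library. The sketch below is an illustration under assumptions: the helper names `condition_string` and `matches` are hypothetical, and the `!=` operator for the non-matching branch is a guess, as only `==` appears in the examples.

```python
import operator

# Operators assumed for illustration; only "==" appears in the examples above.
OPERATORS = {"==": operator.eq, "!=": operator.ne}


def condition_string(conditions):
    """Join (column, operator, value) tuples into a readable condition."""
    return " AND ".join(f"{col} {op} {val}" for col, op, val in conditions)


def matches(row, conditions):
    """True when a row (a dict) satisfies every condition."""
    return all(OPERATORS[op](row[col], val) for col, op, val in conditions)


conditions = [("country", "==", "UK"), ("device", "!=", "Desktop")]
rows = [
    {"country": "UK", "device": "Mobile"},
    {"country": "UK", "device": "Desktop"},
    {"country": "US", "device": "Mobile"},
]

print(condition_string(conditions))  # country == UK AND device != Desktop
print([matches(row, conditions) for row in rows])  # [True, False, False]
```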

---

## 🎨 Visualization Functions

### `plot_segment_comparison`

```python
plot_segment_comparison(
    data: pd.DataFrame,
    segment_conditions: List[Tuple],
    metric_column: str,
    title: Optional[str] = None,
    figsize: Tuple = (14, 5)
) -> plt.Figure
```

Creates side-by-side histogram and box plot comparison.

---

## 💡 Usage Tips

### Choosing Parameters

**For max_depth:**
- `max_depth=2`: Simple 2-condition segments (e.g., "Country=UK AND Device=Mobile")
- `max_depth=3` or `4`: **Recommended** - Balanced complexity
- `max_depth=5` or more: Complex segments, risk of overfitting

**For min_divergence:**
- Start with `0.1` to see all interesting patterns
- Increase to `0.3+` to focus only on strong effects
- To filter noise, prefer the forest's `min_support` over a very high `min_divergence`

**For forest:**
- `n_trees=10`: Good default
- `n_trees=20+`: More robust but slower
- `max_features=int(sqrt(n_features))`: Good for high-dimensional data (the parameter expects an integer)

### Interpreting Results

1. **Always visualize top segments** to verify they make sense
2. **Check segment size** - very small segments may be spurious
3. **Use forest support** - patterns in 5+/10 trees are highly reliable
4. **Domain validation** - do discovered segments align with business intuition?
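
Checks 2 and 3 can be automated on the list of dictionaries returned by the forest. The sketch below uses the documented keys (`support_rate`, `avg_size`, `avg_divergence`); the helper name `reliable_segments`, the size cutoff of 100 rows and the sample results are made up for illustration.

```python
def reliable_segments(segments, min_support_rate=0.5, min_avg_size=100):
    """Keep segments found by enough trees and covering enough rows,
    strongest average divergence first."""
    kept = [
        seg for seg in segments
        if seg["support_rate"] >= min_support_rate and seg["avg_size"] >= min_avg_size
    ]
    return sorted(kept, key=lambda seg: seg["avg_divergence"], reverse=True)


# Made-up results in the documented shape.
example = [
    {"conditions": [("country", "==", "UK"), ("device", "==", "Mobile")],
     "support": 7, "support_rate": 0.7, "avg_divergence": 0.9, "avg_size": 650.0},
    {"conditions": [("time_of_day", "==", "Night")],
     "support": 2, "support_rate": 0.2, "avg_divergence": 0.6, "avg_size": 40.0},
]

for seg in reliable_segments(example):
    print(seg["conditions"], seg["support"], round(seg["avg_divergence"], 3))
```

The surviving segments are the ones worth plotting and validating against domain knowledge first.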

---

## 🔬 Example: Advertising Dataset

```python
from segmentation_forests import SegmentationForest
from segmentation_forests.visualization import plot_segment_comparison
import pandas as pd
import numpy as np

# Create synthetic advertising data
np.random.seed(42)
n = 10000

data = pd.DataFrame({
    'country': np.random.choice(['US', 'UK', 'CA', 'DE', 'FR'], n),
    'device': np.random.choice(['Mobile', 'Desktop', 'Tablet'], n),
    'gender': np.random.choice(['M', 'F'], n),
    'time_of_day': np.random.choice(['Morning', 'Afternoon', 'Evening', 'Night'], n),
    'impressions': np.random.poisson(100, n)  # Base: ~100 impressions
})

# Add hidden pattern: UK females on mobile get 3x impressions
mask = (data['gender'] == 'F') & (data['country'] == 'UK') & (data['device'] == 'Mobile')
data.loc[mask, 'impressions'] = np.random.poisson(300, mask.sum())

# Discover the pattern
forest = SegmentationForest(n_trees=10, max_depth=3, max_features=2)
forest.fit(data, 'impressions')
segments = forest.get_segments(min_support=3, min_divergence=0.3)

# Result: Discovers the hidden pattern!
# Output: "gender == F AND country == UK AND device == Mobile"
# Divergence: 0.948, Support: 7/10 trees
```

See `examples/advertising_example.py` for the complete example.

---

## 🛠️ Development

### Setup

```bash
# Clone repository
git clone https://github.com/davidgeorgewilliams/segmentation-forests.git
cd segmentation-forests

# Install with dev dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests
pytest

# With coverage
pytest --cov=segmentation_forests --cov-report=html

# Run specific test
pytest tests/test_tree.py -v
```

### Code Quality

```bash
# Format code
black src/ tests/
isort src/ tests/

# Lint
ruff check src/ tests/

# Type check
mypy src/
```

---

## 🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and add tests
4. Ensure all tests pass and code is formatted
5. Submit a pull request

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 📚 Citation

If you use Segmentation Forests in your research or project, please cite:

```bibtex
@software{segmentation_forests,
  author = {Williams, David},
  title = {Segmentation Forests: Unsupervised Segment Discovery using Divergence-based Decision Trees},
  year = {2025},
  url = {https://github.com/davidgeorgewilliams/segmentation-forests}
}
```

---

## 🙏 Acknowledgments

- Algorithm inspired by Random Forests (Breiman, 2001)
- Divergence measures from information theory (Kullback-Leibler, Jensen-Shannon)
- Built with NumPy, pandas, SciPy, matplotlib, and seaborn

---

## 📞 Contact

**David Williams** - [david@davidgeorgewilliams.com](mailto:david@davidgeorgewilliams.com)

Project Link: [https://github.com/davidgeorgewilliams/segmentation-forests](https://github.com/davidgeorgewilliams/segmentation-forests)

---

**Happy Discovering! 🎯🌲**

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "segmentation-forests",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": "machine-learning, unsupervised-learning, segment-discovery, data-mining, random-forests",
    "author": null,
    "author_email": "David Williams <david@davidgeorgewilliams.com>",
    "download_url": "https://files.pythonhosted.org/packages/fe/53/57b88e89061d70701b69321c40b3a6a0f77a6e4a1917d430af125f7decb0/segmentation_forests-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Unsupervised segment discovery using divergence-based decision trees inspired by Random Forests",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/davidgeorgewilliams/segmentation-forests/issues",
        "Documentation": "https://github.com/davidgeorgewilliams/segmentation-forests/blob/main/README.md",
        "Homepage": "https://github.com/davidgeorgewilliams/segmentation-forests",
        "Repository": "https://github.com/davidgeorgewilliams/segmentation-forests"
    },
    "split_keywords": [
        "machine-learning",
        " unsupervised-learning",
        " segment-discovery",
        " data-mining",
        " random-forests"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "77f4d9a51443e6fbeb45d21b9cd35397742caddff78640c268491bbe727b5d52",
                "md5": "c11186c1abdd157fff250cbcdf4237e4",
                "sha256": "a4c29983caab4a2153a362c7ed209b67a9cdc34536ba49982224205e2f27089a"
            },
            "downloads": -1,
            "filename": "segmentation_forests-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c11186c1abdd157fff250cbcdf4237e4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 14902,
            "upload_time": "2025-10-08T10:27:48",
            "upload_time_iso_8601": "2025-10-08T10:27:48.360206Z",
            "url": "https://files.pythonhosted.org/packages/77/f4/d9a51443e6fbeb45d21b9cd35397742caddff78640c268491bbe727b5d52/segmentation_forests-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fe5357b88e89061d70701b69321c40b3a6a0f77a6e4a1917d430af125f7decb0",
                "md5": "a8b39edc1f35b61cbf44450ed9c06cb1",
                "sha256": "b267564400c65a56f257a61f230806b7cf4ac2bbdca6f364b0130244e2f5a31b"
            },
            "downloads": -1,
            "filename": "segmentation_forests-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a8b39edc1f35b61cbf44450ed9c06cb1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 18838,
            "upload_time": "2025-10-08T10:27:49",
            "upload_time_iso_8601": "2025-10-08T10:27:49.963022Z",
            "url": "https://files.pythonhosted.org/packages/fe/53/57b88e89061d70701b69321c40b3a6a0f77a6e4a1917d430af125f7decb0/segmentation_forests-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-08 10:27:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "davidgeorgewilliams",
    "github_project": "segmentation-forests",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "segmentation-forests"
}
        