improved-fmmds-core


Nameimproved-fmmds-core JSON
Version 1.1.1 PyPI version JSON
download
home_pageNone
SummaryCore implementation of the Improved FMMD-S algorithm for fair max-min diversification
upload_time2025-10-29 09:39:10
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseMIT License Copyright (c) 2024 Improved FMMD-S Core Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords fairness diversification optimization clustering sampling
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Improved FMMD-S 

This repository contains an efficient implementation of [the FMMD-S algorithm](https://doi.org/10.1137/1.9781611977653.ch11) for 'fair' diverse sampling with significant performance improvements over the original implementation. `Fair' here means that the sample is selected in a stratified fashion, ensuring that specified numbers of data points are selected from specified data groups.


## Problem
The algorithm returns an approximately optimal solution to the following problem. Given a set $V$ of data points, partitioned into disjoint groups $V_1, V_2, \ldots, V_C$, select $k$ data points $S\subseteq V$, $|S| = k$, so that the number of selected points from of each group adheres to given constraints, $$l_c \leq |S \cap  V_c| \leq h_c$$ and the _diversity_ $div(S)$ of the selected points is _maximized_.
$$div(S) = min_{u, v \in S: u\not \equiv  v}\ \mathrm{distance}(u, v) $$

## The FMMD-S Algorithm

The FMMD-S algorithm consists of the following main steps:

1. **Dataset-wide Selection**: Use the natural greedy algorithm for the problem to select $k$ points from the entire dataset and use them to establish a diversity threshold.
2. **Group-specific Selection**: Similarly, select greedily additional points from each group as long as the group-specific diversity remains above the established threshold.
3. **Coreset Graph**: Create a graph where nodes represent points selected in Steps (1-2) above, and edges represent pairwise distances below the diversity threshold.
4. **ILP Solution**: Solve the Maximum Independent Set problem using [the Gurobi Integer Linear Programming (ILP) optimizer](https://www.gurobi.com/solutions/gurobi-optimizer/) to find a problem solution.
5. **Threshold Relaxation**: If the solution returned fewer than $k$ items, decrease the diversity threshold and repeat from step 2.

For details, please refer to [the paper](https://doi.org/10.1137/1.9781611977653.ch11).

###  Key Features and Improvements

This implementation is more efficient than the one used in the paper, achieving **13-200x speedup** on our experiments, depending on problem size and constraints. The improvement is  due to the following implementation choices.

- **Incremental Graph Updates**: Avoids redundant distance computations by incrementally updating the coreset graph instead of rebuilding it in each iteration.
- **Working Data Persistence**: Reuses intermediate computations when relaxing diversity thresholds.
- **Parallel Distance Computations**: Uses Cython-compiled parallel distance updates using OpenMP.
- **Parallel Edge Creation**: Cython-compiled parallel coreset graph edge creation.
- **Memory-Efficient Operations**: Contiguous array operations and optimized data structures.


Moreover, the library implements the following functionality improvements.

- **Clean Interface**: Both functional and class-based APIs for easy integration.
- **Flexible Constraints**: Support for uniform balanced, proportional, and manual constraint definitions.
- **Comprehensive CLI**: Command-line interface with subcommand-based constraint definition.
- **Multiple Distance Metrics**: Support for L_2_ (Euclidean), L_1_ (Manhattan), and angular distance metrics.
- **Detailed Metadata**: Rich timing and performance metadata for analysis.

## Installation

### Prerequisites

- Python 3.8+
- Gurobi Optimizer (with academic license recommended for larger datasets)
- Cython (for parallel computations)

**Note for Mac users**: You need to have g++-13 installed via Homebrew and set the CXX environment variable before installing the library:

```bash
# Install g++-13 via Homebrew
brew install gcc@13

# Set CXX environment variable
export CXX=g++-13

# Then proceed with installation
pip install -e .
```

### Install from Source

```bash
git clone https://github.com/yourusername/improved-fmmds-core.git
cd improved-fmmds-core
pip install -e .
```

### Build Cython Extensions

The package includes Cython-compiled parallel utilities for significant performance improvements. If compilation fails, the package will fall back to pure Python implementations.

**Manual Building (if needed):**

If you encounter issues with automatic Cython compilation, you can build the extensions manually:

```bash
# Install build dependencies
pip install Cython numpy

# Build Cython extensions manually
python -c "
import numpy as np
from Cython.Build import cythonize
from setuptools import setup, Extension

ext_modules = [
    Extension(
        'improved_fmmds_core.parallel.cython_utils',
        sources=['improved_fmmds_core/parallel/cython_utils.pyx'],
        extra_compile_args=['-fopenmp', '-std=c++11', '-O3'],
        extra_link_args=['-fopenmp', '-std=c++11'],
        include_dirs=[np.get_include()]
    )
]

setup(
    ext_modules=cythonize(ext_modules, compiler_directives={'language_level': 3})
)
"
```
### Installation Troubleshooting

**Common Issues:**

1. **Gurobi License Issues**: Ensure you have a valid Gurobi license. Academic licenses are available for free.

2. **Cython Compilation Errors**: If Cython compilation fails:
   - Ensure you have a C++ compiler installed
   - On Mac: Install g++-13 via Homebrew and set `export CXX=g++-13`
   - On Linux: Install `build-essential` package
   - On Windows: Install Visual Studio Build Tools

3. **OpenMP Issues**: If you encounter OpenMP-related errors:
   - On Mac: `brew install libomp`
   - On Linux: `sudo apt-get install libomp-dev`
   - On Windows: OpenMP should be included with Visual Studio

### Development Installation

For development and testing:

```bash
git clone https://github.com/yourusername/improved-fmmds-core.git
cd improved-fmmds-core
pip install -e "."
```

## Quick Start

### Python API

#### Functional Interface

```python
import numpy as np
from improved_fmmds_core import solve_fmmd

# Generate sample data
np.random.seed(42)
features = np.random.random((1000, 50))  # 1000 items, 50 features
ids = np.arange(1000)
groups = np.random.randint(0, 5, 1000)  # 5 groups

# Define constraints (group_id: (lower_bound, upper_bound))
constraints = {
    0: (2, 5),  # Group 0: between 2 and 5 items
    1: (1, 3),  # Group 1: between 1 and 3 items
    2: (2, 4),  # Group 2: between 2 and 4 items
    3: (1, 2),  # Group 3: between 1 and 2 items
    4: (2, 6)   # Group 4: between 2 and 6 items
}

# Solve using functional interface
solution, diversity, metadata = solve_fmmd(
    features=features,
    ids=ids,
    groups=groups,
    k=10,
    constraints=constraints,
    parallel_dist_update=True,
    parallel_edge_creation=True,
    metric="l2",
    eps=0.1,
    time_limit=300
)

print(f"Solution: {solution}")
print(f"Diversity: {diversity:.6f}")
print(f"Runtime: {metadata['total_time']:.2f} seconds")
print(f"Iterations: {metadata['num_iterations']}")
```

#### Class-based Interface

```python
from improved_fmmds_core import FMMDSolver

# Create solver with comprehensive configuration
solver = FMMDSolver(
    k=10,
    constraints=constraints,
    eps=0.1,
    time_limit=300,
    parallel_dist_update=True,
    parallel_edge_creation=True,
    verbose=False,
    metric="l2"
)

# Solve and get detailed results
result = solver.solve(features, ids, groups)

print(f"Solution: {result.solution}")
print(f"Diversity: {result.diversity:.6f}")
print(f"Runtime: {result.runtime:.2f} seconds")
print(f"Metric: {result.metric}")
print(f"Metadata: {result.metadata}")
```

#### Using Constraint Generators

```python
from improved_fmmds_core import (
    ConstraintGenerator, 
    UniformBalancedConfig, 
    ProportionalConfig
)
import numpy as np

# Generate data
features = np.random.random((1000, 50))
ids = np.arange(1000)
groups = np.random.randint(0, 5, 1000)
unique_groups, group_counts = np.unique(groups, return_counts=True)

# Uniform balanced constraints
uniform_config = UniformBalancedConfig(minimum_total_samples=50)
constraints_obj, k = ConstraintGenerator.generate_constraints(
    uniform_config, unique_groups, group_counts
)
constraints = constraints_obj.constraints

# Proportional constraints
proportional_config = ProportionalConfig(k=100, alpha=0.2)
constraints_obj, k = ConstraintGenerator.generate_constraints(
    proportional_config, unique_groups, group_counts
)
constraints = constraints_obj.constraints
```

### Command Line Interface

The CLI uses subcommands for different constraint definitions:

```bash
# Uniform balanced sampling
fmmds-run --data dataset.npz uniform-sampling --minimum-samples 50

# Proportional sampling
fmmds-run --data dataset.npz proportional-sampling --k 50 --alpha 0.2

# Custom constraints from file
fmmds-run --data dataset.npz custom-constraints --constraints-file constraints.json --k 50

# With parallel processing and custom parameters
fmmds-run --data dataset.npz uniform-sampling --minimum-samples 100 \
    --parallel --eps 0.05 --time-limit 600 --metric l1 --output results.json
```

#### Data Format Requirements

The `--data` argument expects a `.npz` file with the following required keys:
- `features`: 2D numpy array (n × d) containing feature vectors
- `ids`: 1D numpy array (n) containing unique item identifiers  
- `groups`: 1D numpy array (n) containing group assignments

**Performance Recommendations:**
- Use contiguous arrays for optimal performance: `np.ascontiguousarray()`
- Ensure arrays are properly aligned and have consistent dtypes
- For large datasets, consider using float32 instead of float64 to reduce memory usage

**Example data preparation:**
```python
import numpy as np

# Prepare your data
features = np.ascontiguousarray(features, dtype=np.float32)
ids = np.ascontiguousarray(ids, dtype=np.int32)
groups = np.ascontiguousarray(groups, dtype=np.int32)

# Save to .npz file
np.savez('dataset.npz', features=features, ids=ids, groups=groups)
```

#### CLI Options

**Global Options:**
- `--data`: Path to .npz file containing features, ids, and groups arrays
- `--output`: Output file for results (default: results.json)
- `--eps`: Diversity relaxation factor (default: 0.1)
- `--time-limit`: Time limit for ILP solver in seconds (default: 300)
- `--metric`: Distance metric (l2, l1, angular) (default: l2)
- `--parallel`: Enable all parallel options
- `--parallel-dist-update`: Enable parallel distance updates
- `--parallel-edge-creation`: Enable parallel edge creation
- `--verbose`: Enable verbose ILP solver output
- `--quiet`: Suppress output except errors

**Subcommand-specific Options:**
- `uniform-sampling --minimum-samples`: Minimum total samples for uniform sampling
- `proportional-sampling --k`: Number of samples, `--alpha`: Tolerance parameter (default: 0.2)
- `custom-constraints --constraints-file`: Path to constraints JSON file, `--k`: Number of samples

## Constraint Definitions

The library provides three types of constraint definitions.

### Uniform Balanced Sampling
Ensures (roughly) equal number of points selected from each group, regardless of group size.

```python
from improved_fmmds_core import UniformBalancedConfig, ConstraintGenerator
import numpy as np

# Example with group counts
unique_groups = np.array([0, 1, 2, 3, 4])
group_counts = np.array([100, 50, 200, 25, 75])  # Group sizes

config = UniformBalancedConfig(minimum_total_samples=50)
constraints_obj, k = ConstraintGenerator.generate_constraints(config, unique_groups, group_counts)
constraints = constraints_obj.constraints

print(f"Generated k={k} with uniform constraints:")
for group_id, (lb, ub) in constraints.items():
    print(f"  Group {group_id}: {lb}-{ub} samples")
```

### Proportional Sampling
Maintains proportional representation based on group sizes with a tolerance parameter. These constraints preserves the original group proportions while allowing some flexibility.

```python
from improved_fmmds_core import ProportionalConfig

config = ProportionalConfig(k=100, alpha=0.2)  # 20% tolerance
constraints_obj, k = ConstraintGenerator.generate_constraints(config, unique_groups, group_counts)
constraints = constraints_obj.constraints

print(f"Generated k={k} with proportional constraints:")
for group_id, (lb, ub) in constraints.items():
    proportion = group_counts[group_id] / group_counts.sum()
    print(f"  Group {group_id}: {lb}-{ub} samples (proportion: {proportion:.2f})")
```

### Manual Constraints
Specify exact constraints for each group. This provides maximum control over the sampling approach.

```python
# Define constraints manually
constraints = {
    0: (5, 10),   # Group 0: 5-10 samples
    1: (2, 8),    # Group 1: 2-8 samples
    2: (3, 12),   # Group 2: 3-12 samples
    3: (1, 5),    # Group 3: 1-5 samples
    4: (4, 8)     # Group 4: 4-8 samples
}

# Validate feasibility
from improved_fmmds_core import Constraints, validate_k_feasibility
constraints_obj = Constraints(constraints)
k = 25  # Target total samples
validate_k_feasibility(k, constraints_obj)  # Raises error if not feasible
```

### Constraint Validation

The library includes comprehensive validation to ensure constraints are feasible:

```python
from improved_fmmds_core import Constraints

constraints_obj = Constraints(constraints)
print(f"Minimum possible samples: {constraints_obj.get_min_total()}")
print(f"Maximum possible samples: {constraints_obj.get_max_total()}")
print(f"Is k=25 feasible? {constraints_obj.is_k_feasible(25)}")
```


## Data Format

The algorithm expects data in the following format:

- **features**: 2D numpy array (n × d) where each row is a feature vector
- **ids**: 1D numpy array (n) containing unique identifiers for each item
- **groups**: 1D numpy array (n) specifying group assignments
- **constraints**: Dictionary mapping group IDs to (lower_bound, upper_bound) tuples

### Loading Data

```python
import numpy as np

# From separate files
features = np.load('features.npy')
ids = np.load('ids.npy')
groups = np.load('groups.npy')

# From single .npz file
data = np.load('dataset.npz')
features = data['features']
ids = data['ids']
groups = data['groups']
```

## Distance Metrics

The algorithm supports three distance metrics:

- **L2 (Euclidean)**: `metric="l2"` (default)
- **L1 (Manhattan)**: `metric="l1"`
- **Angular**: `metric="angular"`

## Parallel Processing

Enable parallel processing for better performance on large datasets:

```python
# Enable all parallel options
solver = FMMDSolver(
    k=10,
    constraints=constraints,
    parallel_dist_update=True,
    parallel_edge_creation=True
)

# Or via CLI
fmmds-run --data dataset.npz uniform-sampling --minimum-samples 50 --parallel
```

## Examples

The package includes comprehensive examples in the `examples/` directory:

- `basic_example.py`: Demonstrates all three constraint types with synthetic data
- Shows both functional and class-based interfaces
- Includes result saving and metadata analysis

Run the examples:

```bash
cd examples
python basic_example.py
```

## Requirements

### Core Dependencies

- **Python 3.8+**
- **NumPy >= 1.20.0**: Numerical computations
- **Gurobi Optimizer >= 9.0.0**: ILP solver (academic license recommended)
- **NetworkX >= 2.5**: Graph operations
- **tqdm >= 4.60.0**: Progress bars

### Build Dependencies

- **Cython >= 0.29.0**: For parallel computations
- **C++ Compiler**: For building Cython extensions
- **OpenMP**: For parallel processing (usually included with compiler)

### Optional Dependencies

- **Pandas**: For data manipulation (if needed for your workflow)
- **Matplotlib/Seaborn**: For visualization (if needed for analysis)

## License

MIT License

## Citation

If you use this implementation, please cite the original paper:

```bibtex
@inproceedings{wang-fmmd-2023,
  title={Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms},
  author={Wang, Y. and Mathioudakis, M. and Li, J. and Fabbri, F.},
  booktitle={Proceedings of the 2023 SIAM International Conference on Data Mining (SDM)},
  pages={91--99},
  year={2023},
  organization={Society for Industrial and Applied Mathematics}
}
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Support

For questions and support, please open an issue on GitHub.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "improved-fmmds-core",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "fairness, diversification, optimization, clustering, sampling",
    "author": null,
    "author_email": "Ananth Mahadevan <ananth.mahadevan@helsinki.fi>, Michael Mathioudakis <michael.mathioudakis@helsinki.fi>",
    "download_url": "https://files.pythonhosted.org/packages/1a/ed/dcef320118ff74f2afdb95da23e77eceb19bc3f52b54ae5044fb0dd1fc53/improved_fmmds_core-1.1.1.tar.gz",
    "platform": null,
    "description": "# Improved FMMD-S \n\nThis repository contains an efficient implementation of [the FMMD-S algorithm](https://doi.org/10.1137/1.9781611977653.ch11) for 'fair' diverse sampling with significant performance improvements over the original implementation. `Fair' here means that the sample is selected in a stratified fashion, ensuring that specified numbers of data points are selected from specified data groups.\n\n\n## Problem\nThe algorithm returns an approximately optimal solution to the following problem. Given a set $V$ of data points, partitioned into disjoint groups $V_1, V_2, \\ldots, V_C$, select $k$ data points $S\\subseteq V$, $|S| = k$, so that the number of selected points from of each group adheres to given constraints, $$l_c \\leq |S \\cap  V_c| \\leq h_c$$ and the _diversity_ $div(S)$ of the selected points is _maximized_.\n$$div(S) = min_{u, v \\in S: u\\not \\equiv  v}\\ \\mathrm{distance}(u, v) $$\n\n## The FMMD-S Algorithm\n\nThe FMMD-S algorithm consists of the following main steps:\n\n1. **Dataset-wide Selection**: Use the natural greedy algorithm for the problem to select $k$ points from the entire dataset and use them to establish a diversity threshold.\n2. **Group-specific Selection**: Similarly, select greedily additional points from each group as long as the group-specific diversity remains above the established threshold.\n3. **Coreset Graph**: Create a graph where nodes represent points selected in Steps (1-2) above, and edges represent pairwise distances below the diversity threshold.\n4. **ILP Solution**: Solve the Maximum Independent Set problem using [the Gurobi Integer Linear Programming (ILP) optimizer](https://www.gurobi.com/solutions/gurobi-optimizer/) to find a problem solution.\n5. **Threshold Relaxation**: If the solution returned fewer than $k$ items, decrease the diversity threshold and repeat from step 2.\n\nFor details, please refer to [the paper](https://doi.org/10.1137/1.9781611977653.ch11).\n\n###  Key Features and Improvements\n\nThis implementation is more efficient than the one used in the paper, achieving **13-200x speedup** on our experiments, depending on problem size and constraints. The improvement is  due to the following implementation choices.\n\n- **Incremental Graph Updates**: Avoids redundant distance computations by incrementally updating the coreset graph instead of rebuilding it in each iteration.\n- **Working Data Persistence**: Reuses intermediate computations when relaxing diversity thresholds.\n- **Parallel Distance Computations**: Uses Cython-compiled parallel distance updates using OpenMP.\n- **Parallel Edge Creation**: Cython-compiled parallel coreset graph edge creation.\n- **Memory-Efficient Operations**: Contiguous array operations and optimized data structures.\n\n\nMoreover, the library implements the following functionality improvements.\n\n- **Clean Interface**: Both functional and class-based APIs for easy integration.\n- **Flexible Constraints**: Support for uniform balanced, proportional, and manual constraint definitions.\n- **Comprehensive CLI**: Command-line interface with subcommand-based constraint definition.\n- **Multiple Distance Metrics**: Support for L_2_ (Euclidean), L_1_ (Manhattan), and angular distance metrics.\n- **Detailed Metadata**: Rich timing and performance metadata for analysis.\n\n## Installation\n\n### Prerequisites\n\n- Python 3.8+\n- Gurobi Optimizer (with academic license recommended for larger datasets)\n- Cython (for parallel computations)\n\n**Note for Mac users**: You need to have g++-13 installed via Homebrew and set the CXX environment variable before installing the library:\n\n```bash\n# Install g++-13 via Homebrew\nbrew install gcc@13\n\n# Set CXX environment variable\nexport CXX=g++-13\n\n# Then proceed with installation\npip install -e .\n```\n\n### Install from Source\n\n```bash\ngit clone https://github.com/yourusername/improved-fmmds-core.git\ncd improved-fmmds-core\npip install -e .\n```\n\n### Build Cython Extensions\n\nThe package includes Cython-compiled parallel utilities for significant performance improvements. If compilation fails, the package will fall back to pure Python implementations.\n\n**Manual Building (if needed):**\n\nIf you encounter issues with automatic Cython compilation, you can build the extensions manually:\n\n```bash\n# Install build dependencies\npip install Cython numpy\n\n# Build Cython extensions manually\npython -c \"\nimport numpy as np\nfrom Cython.Build import cythonize\nfrom setuptools import setup, Extension\n\next_modules = [\n    Extension(\n        'improved_fmmds_core.parallel.cython_utils',\n        sources=['improved_fmmds_core/parallel/cython_utils.pyx'],\n        extra_compile_args=['-fopenmp', '-std=c++11', '-O3'],\n        extra_link_args=['-fopenmp', '-std=c++11'],\n        include_dirs=[np.get_include()]\n    )\n]\n\nsetup(\n    ext_modules=cythonize(ext_modules, compiler_directives={'language_level': 3})\n)\n\"\n```\n### Installation Troubleshooting\n\n**Common Issues:**\n\n1. **Gurobi License Issues**: Ensure you have a valid Gurobi license. Academic licenses are available for free.\n\n2. **Cython Compilation Errors**: If Cython compilation fails:\n   - Ensure you have a C++ compiler installed\n   - On Mac: Install g++-13 via Homebrew and set `export CXX=g++-13`\n   - On Linux: Install `build-essential` package\n   - On Windows: Install Visual Studio Build Tools\n\n3. **OpenMP Issues**: If you encounter OpenMP-related errors:\n   - On Mac: `brew install libomp`\n   - On Linux: `sudo apt-get install libomp-dev`\n   - On Windows: OpenMP should be included with Visual Studio\n\n### Development Installation\n\nFor development and testing:\n\n```bash\ngit clone https://github.com/yourusername/improved-fmmds-core.git\ncd improved-fmmds-core\npip install -e \".\"\n```\n\n## Quick Start\n\n### Python API\n\n#### Functional Interface\n\n```python\nimport numpy as np\nfrom improved_fmmds_core import solve_fmmd\n\n# Generate sample data\nnp.random.seed(42)\nfeatures = np.random.random((1000, 50))  # 1000 items, 50 features\nids = np.arange(1000)\ngroups = np.random.randint(0, 5, 1000)  # 5 groups\n\n# Define constraints (group_id: (lower_bound, upper_bound))\nconstraints = {\n    0: (2, 5),  # Group 0: between 2 and 5 items\n    1: (1, 3),  # Group 1: between 1 and 3 items\n    2: (2, 4),  # Group 2: between 2 and 4 items\n    3: (1, 2),  # Group 3: between 1 and 2 items\n    4: (2, 6)   # Group 4: between 2 and 6 items\n}\n\n# Solve using functional interface\nsolution, diversity, metadata = solve_fmmd(\n    features=features,\n    ids=ids,\n    groups=groups,\n    k=10,\n    constraints=constraints,\n    parallel_dist_update=True,\n    parallel_edge_creation=True,\n    metric=\"l2\",\n    eps=0.1,\n    time_limit=300\n)\n\nprint(f\"Solution: {solution}\")\nprint(f\"Diversity: {diversity:.6f}\")\nprint(f\"Runtime: {metadata['total_time']:.2f} seconds\")\nprint(f\"Iterations: {metadata['num_iterations']}\")\n```\n\n#### Class-based Interface\n\n```python\nfrom improved_fmmds_core import FMMDSolver\n\n# Create solver with comprehensive configuration\nsolver = FMMDSolver(\n    k=10,\n    constraints=constraints,\n    eps=0.1,\n    time_limit=300,\n    parallel_dist_update=True,\n    parallel_edge_creation=True,\n    verbose=False,\n    metric=\"l2\"\n)\n\n# Solve and get detailed results\nresult = solver.solve(features, ids, groups)\n\nprint(f\"Solution: {result.solution}\")\nprint(f\"Diversity: {result.diversity:.6f}\")\nprint(f\"Runtime: {result.runtime:.2f} seconds\")\nprint(f\"Metric: {result.metric}\")\nprint(f\"Metadata: {result.metadata}\")\n```\n\n#### Using Constraint Generators\n\n```python\nfrom improved_fmmds_core import (\n    ConstraintGenerator, \n    UniformBalancedConfig, \n    ProportionalConfig\n)\nimport numpy as np\n\n# Generate data\nfeatures = np.random.random((1000, 50))\nids = np.arange(1000)\ngroups = np.random.randint(0, 5, 1000)\nunique_groups, group_counts = np.unique(groups, return_counts=True)\n\n# Uniform balanced constraints\nuniform_config = UniformBalancedConfig(minimum_total_samples=50)\nconstraints_obj, k = ConstraintGenerator.generate_constraints(\n    uniform_config, unique_groups, group_counts\n)\nconstraints = constraints_obj.constraints\n\n# Proportional constraints\nproportional_config = ProportionalConfig(k=100, alpha=0.2)\nconstraints_obj, k = ConstraintGenerator.generate_constraints(\n    proportional_config, unique_groups, group_counts\n)\nconstraints = constraints_obj.constraints\n```\n\n### Command Line Interface\n\nThe CLI uses subcommands for different constraint definitions:\n\n```bash\n# Uniform balanced sampling\nfmmds-run --data dataset.npz uniform-sampling --minimum-samples 50\n\n# Proportional sampling\nfmmds-run --data dataset.npz proportional-sampling --k 50 --alpha 0.2\n\n# Custom constraints from file\nfmmds-run --data dataset.npz custom-constraints --constraints-file constraints.json --k 50\n\n# With parallel processing and custom parameters\nfmmds-run --data dataset.npz uniform-sampling --minimum-samples 100 \\\n    --parallel --eps 0.05 --time-limit 600 --metric l1 --output results.json\n```\n\n#### Data Format Requirements\n\nThe `--data` argument expects a `.npz` file with the following required keys:\n- `features`: 2D numpy array (n \u00d7 d) containing feature vectors\n- `ids`: 1D numpy array (n) containing unique item identifiers  \n- `groups`: 1D numpy array (n) containing group assignments\n\n**Performance Recommendations:**\n- Use contiguous arrays for optimal performance: `np.ascontiguousarray()`\n- Ensure arrays are properly aligned and have consistent dtypes\n- For large datasets, consider using float32 instead of float64 to reduce memory usage\n\n**Example data preparation:**\n```python\nimport numpy as np\n\n# Prepare your data\nfeatures = np.ascontiguousarray(features, dtype=np.float32)\nids = np.ascontiguousarray(ids, dtype=np.int32)\ngroups = np.ascontiguousarray(groups, dtype=np.int32)\n\n# Save to .npz file\nnp.savez('dataset.npz', features=features, ids=ids, groups=groups)\n```\n\n#### CLI Options\n\n**Global Options:**\n- `--data`: Path to .npz file containing features, ids, and groups arrays\n- `--output`: Output file for results (default: results.json)\n- `--eps`: Diversity relaxation factor (default: 0.1)\n- `--time-limit`: Time limit for ILP solver in seconds (default: 300)\n- `--metric`: Distance metric (l2, l1, angular) (default: l2)\n- `--parallel`: Enable all parallel options\n- `--parallel-dist-update`: Enable parallel distance updates\n- `--parallel-edge-creation`: Enable parallel edge creation\n- `--verbose`: Enable verbose ILP solver output\n- `--quiet`: Suppress output except errors\n\n**Subcommand-specific Options:**\n- `uniform-sampling --minimum-samples`: Minimum total samples for uniform sampling\n- `proportional-sampling --k`: Number of samples, `--alpha`: Tolerance parameter (default: 0.2)\n- `custom-constraints --constraints-file`: Path to constraints JSON file, `--k`: Number of samples\n\n## Constraint Definitions\n\nThe library provides three types of constraint definitions.\n\n### Uniform Balanced Sampling\nEnsures (roughly) equal number of points selected from each group, regardless of group size.\n\n```python\nfrom improved_fmmds_core import UniformBalancedConfig, ConstraintGenerator\nimport numpy as np\n\n# Example with group counts\nunique_groups = np.array([0, 1, 2, 3, 4])\ngroup_counts = np.array([100, 50, 200, 25, 75])  # Group sizes\n\nconfig = UniformBalancedConfig(minimum_total_samples=50)\nconstraints_obj, k = ConstraintGenerator.generate_constraints(config, unique_groups, group_counts)\nconstraints = constraints_obj.constraints\n\nprint(f\"Generated k={k} with uniform constraints:\")\nfor group_id, (lb, ub) in constraints.items():\n    print(f\"  Group {group_id}: {lb}-{ub} samples\")\n```\n\n### Proportional Sampling\nMaintains proportional representation based on group sizes with a tolerance parameter. These constraints preserves the original group proportions while allowing some flexibility.\n\n```python\nfrom improved_fmmds_core import ProportionalConfig\n\nconfig = ProportionalConfig(k=100, alpha=0.2)  # 20% tolerance\nconstraints_obj, k = ConstraintGenerator.generate_constraints(config, unique_groups, group_counts)\nconstraints = constraints_obj.constraints\n\nprint(f\"Generated k={k} with proportional constraints:\")\nfor group_id, (lb, ub) in constraints.items():\n    proportion = group_counts[group_id] / group_counts.sum()\n    print(f\"  Group {group_id}: {lb}-{ub} samples (proportion: {proportion:.2f})\")\n```\n\n### Manual Constraints\nSpecify exact constraints for each group. This provides maximum control over the sampling approach.\n\n```python\n# Define constraints manually\nconstraints = {\n    0: (5, 10),   # Group 0: 5-10 samples\n    1: (2, 8),    # Group 1: 2-8 samples\n    2: (3, 12),   # Group 2: 3-12 samples\n    3: (1, 5),    # Group 3: 1-5 samples\n    4: (4, 8)     # Group 4: 4-8 samples\n}\n\n# Validate feasibility\nfrom improved_fmmds_core import Constraints, validate_k_feasibility\nconstraints_obj = Constraints(constraints)\nk = 25  # Target total samples\nvalidate_k_feasibility(k, constraints_obj)  # Raises error if not feasible\n```\n\n### Constraint Validation\n\nThe library includes comprehensive validation to ensure constraints are feasible:\n\n```python\nfrom improved_fmmds_core import Constraints\n\nconstraints_obj = Constraints(constraints)\nprint(f\"Minimum possible samples: {constraints_obj.get_min_total()}\")\nprint(f\"Maximum possible samples: {constraints_obj.get_max_total()}\")\nprint(f\"Is k=25 feasible? {constraints_obj.is_k_feasible(25)}\")\n```\n\n\n## Data Format\n\nThe algorithm expects data in the following format:\n\n- **features**: 2D numpy array (n \u00d7 d) where each row is a feature vector\n- **ids**: 1D numpy array (n) containing unique identifiers for each item\n- **groups**: 1D numpy array (n) specifying group assignments\n- **constraints**: Dictionary mapping group IDs to (lower_bound, upper_bound) tuples\n\n### Loading Data\n\n```python\nimport numpy as np\n\n# From separate files\nfeatures = np.load('features.npy')\nids = np.load('ids.npy')\ngroups = np.load('groups.npy')\n\n# From single .npz file\ndata = np.load('dataset.npz')\nfeatures = data['features']\nids = data['ids']\ngroups = data['groups']\n```\n\n## Distance Metrics\n\nThe algorithm supports three distance metrics:\n\n- **L2 (Euclidean)**: `metric=\"l2\"` (default)\n- **L1 (Manhattan)**: `metric=\"l1\"`\n- **Angular**: `metric=\"angular\"`\n\n## Parallel Processing\n\nEnable parallel processing for better performance on large datasets:\n\n```python\n# Enable all parallel options\nsolver = FMMDSolver(\n    k=10,\n    constraints=constraints,\n    parallel_dist_update=True,\n    parallel_edge_creation=True\n)\n\n# Or via CLI\nfmmds-run --data dataset.npz uniform-sampling --minimum-samples 50 --parallel\n```\n\n## Examples\n\nThe package includes comprehensive examples in the `examples/` directory:\n\n- `basic_example.py`: Demonstrates all three constraint types with synthetic data\n- Shows both functional and class-based interfaces\n- Includes result saving and metadata analysis\n\nRun the examples:\n\n```bash\ncd examples\npython basic_example.py\n```\n\n## Requirements\n\n### Core Dependencies\n\n- **Python 3.8+**\n- **NumPy >= 1.20.0**: Numerical computations\n- **Gurobi Optimizer >= 9.0.0**: ILP solver (academic license recommended)\n- **NetworkX >= 2.5**: Graph operations\n- **tqdm >= 4.60.0**: Progress bars\n\n### Build Dependencies\n\n- **Cython >= 0.29.0**: For parallel computations\n- **C++ Compiler**: For building Cython extensions\n- **OpenMP**: For parallel processing (usually included with compiler)\n\n### Optional Dependencies\n\n- **Pandas**: For data manipulation (if needed for your workflow)\n- **Matplotlib/Seaborn**: For visualization (if needed for analysis)\n\n## License\n\nMIT License\n\n## Citation\n\nIf you use this implementation, please cite the original paper:\n\n```bibtex\n@inproceedings{wang-fmmd-2023,\n  title={Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms},\n  author={Wang, Y. and Mathioudakis, M. and Li, J. and Fabbri, F.},\n  booktitle={Proceedings of the 2023 SIAM International Conference on Data Mining (SDM)},\n  pages={91--99},\n  year={2023},\n  organization={Society for Industrial and Applied Mathematics}\n}\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## Support\n\nFor questions and support, please open an issue on GitHub.\n",
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2024 Improved FMMD-S Core\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.\n        ",
    "summary": "Core implementation of the Improved FMMD-S algorithm for fair max-min diversification",
    "version": "1.1.1",
    "project_urls": null,
    "split_keywords": [
        "fairness",
        " diversification",
        " optimization",
        " clustering",
        " sampling"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7486bab8dc929bb8b8ce57cc6c3210d8eda1126b8a1f43210935470fd5cc17d9",
                "md5": "69c2ddee4484a791e101c0d28296d34e",
                "sha256": "ccc0036b6a6a4b3db57a34db1ea76675d25c771cee47e1b39dda75888a7c8e83"
            },
            "downloads": -1,
            "filename": "improved_fmmds_core-1.1.1-cp310-cp310-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "69c2ddee4484a791e101c0d28296d34e",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.8",
            "size": 304378,
            "upload_time": "2025-10-29T09:39:09",
            "upload_time_iso_8601": "2025-10-29T09:39:09.490877Z",
            "url": "https://files.pythonhosted.org/packages/74/86/bab8dc929bb8b8ce57cc6c3210d8eda1126b8a1f43210935470fd5cc17d9/improved_fmmds_core-1.1.1-cp310-cp310-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1aeddcef320118ff74f2afdb95da23e77eceb19bc3f52b54ae5044fb0dd1fc53",
                "md5": "1a6a38dbcd9b6869d64219e782fd83ce",
                "sha256": "19e92d9a92953f8b4336714cac2768116663cf10f5f916afcc7b631da539ef27"
            },
            "downloads": -1,
            "filename": "improved_fmmds_core-1.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "1a6a38dbcd9b6869d64219e782fd83ce",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 185869,
            "upload_time": "2025-10-29T09:39:10",
            "upload_time_iso_8601": "2025-10-29T09:39:10.716170Z",
            "url": "https://files.pythonhosted.org/packages/1a/ed/dcef320118ff74f2afdb95da23e77eceb19bc3f52b54ae5044fb0dd1fc53/improved_fmmds_core-1.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-29 09:39:10",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "improved-fmmds-core"
}
        
Elapsed time: 1.28904s