# Scaling Up Constraint-Based Diverse Sampling

A faster implementation of the FMMD (Fair Max-Min Diversification) algorithm

# Improved FMMD-S

An improved implementation of the FMMD-S algorithm from the paper:

Wang, Y., Mathioudakis, M., Li, J., & Fabbri, F. (2023). Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM) (pp. 91–99). Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611977653.ch11

## Table of Contents

## Algorithm Overview

The FMMD-S algorithm described in the paper has the following four steps:

- Obtain a greedy solution with
`k`

items using the Gonzales algorithm without regard for group constraints - For each group run the Gonzales algorithm to obtain a coreset and a diversity threshold
- Creates a coreset graph where edges signify the distance between items are lower than the diversity threshold
- Create the MIS problem using the coreset graph and solve using ILP solver
- If solution is infeasible decrease diversity threshold and go to step 2, else return solution

The FMMD-S algorithm is available in fmmd.algorithms is the `fmmd`

function

```
fmmd(
features: np.ndarray,
ids: np.ndarray,
groups: np.ndarray,
k: int,
constraints: Dict[int,Tuple[int, int]],
eps: float,
time_limit:int = 300,
verbose:bool = False,
parallel_dist_update: bool = False,
parallel_edge_creation: bool = False
) -> Tuple[set, float]:
"""The implementation of the Fair Max-Min Diversification algorithm.
First obtains a greedy solution ignoring fairness constrains. Then obtains a coreset by
getting greedy solutions in each group. Solves the MIS problem on the coreset.
Args:
features (np.ndarray): The feature vectors of the items
ids (np.ndarray): The ids of the items
groups (np.ndarray): The group (int) of the items
k (int): The minimum number of total samples required
constraints (List[Tuple[int,int]]): The list of lower and upper limits on number of samples for each group
eps (float): The fraction to relax the diversity to get a solution.
time_limit (int): The maximum number of seconds for Gurobi solver
verbose (bool, optional): Print many debug statements. Defaults to False.
parallel_dist_update (bool, optional): Whether to update distances in parallel. Defaults to False.
parallel_edge_creation (bool, optional): Whether to create coreset graph edges in parallel. Defaults to False.
Returns:
Tuple[set,float]: Returns the solution as set of item ids and the solution diversity
"""
```

## Setup

### Python

```
conda env create -n fmmd -f env.yaml
```

### Gurobi

The FMMD-S algorithm uses Gurobi to solve the MIS problem as explained in the paper.

The gurobi optimizer can be used without a license for small datasets, but for larger datasets it is recommended to have an academic licence.

### Cython

This library uses Cython to compile and parallelize several utility functions.

Run the following command to build the module:

```
python setup.py build_ext --inplace
```

After building, the module is used like this:

```
from fmmd.parallel_utils pdist
import numpy as np
X = np.random.random(1000,512)
d = pdist(X)
```

The `.pyx`

is found in ctython/parallel_utils.pyx and can be modified to add additional metrics and functionalities.

## Balanced Sampling

This library was developed specifically to perform balanced sampling.

Balanced sampling is when there is a need to uniformly sample from groups instead of the typical proportional
sampling. Specifically, we wanted to perform balanced sampling when the distribution of groups was very skewed. This means that there are some groups with very few items and some groups with a lot of items. Furthermore, if `min_group_size`

is the size of the smallest group and `k>min_group_size`

then we might be forced to take all items from smaller groups.

For example, assume we have the following group sizes:

group | count |
---|---|

1 | 2 |

2 | 5 |

3 | 10 |

4 | 200 |

Now, if we want to sample at least ten items uniformly `k>=10`

then, we would need to select all items from group 1 and 3 items from the remaining groups. This would result in `k=11`

items where we take a maximum of `3`

from each group. This is what the `find_num_samples_per_group`

method in the datasets/naive_balanced_sampling.py does. It returns `num_samples_per_group,total_number_of_samples`

for a given group distribution and minimum number of samples.

### Old FMMD-S implementation Bottlenecks

This sort of balanced sampling constraints causes several bottlenecks in the original FMMD-S implementation:

- When entire groups need to be selected, the diversity threshold is relaxed several times until the constraints is not
`under_capped`

- The coreset graph typically is much larger as several groups are added directly. This results in a long time to compute the edges which satisfy the diversity constraints
- When the ILP is infeasible steps 1 and 2 need to performed again

We fix these issues in the new implementations. Along with these fixes, a parallelization of several portions of the algorithm offer a general speedup compared the original implementation.

## Speedup over original Implementation

Running the benchmark on the Census dataset from the original paper.
For the full census dataset. `k`

is the number of samples and `C`

is the number of groups.

New and improved algorithm implementation

```
python census_single_solution.py -k=10 -C=2
```

Original algorithm implementation

```
python census_single_solution.py -k=10 -C=2 --old
```

k | C | Original (in seconds) | New (in seconds) | Relative Speedup |
---|---|---|---|---|

10 | 2 | 112.67 | 8.316 | 13.55x |

10 | 7 | 106.24 | 8.256 | 12.87x |

50 | 14 | 474.152 | 10.503 | 45.14x |

100 | 14 | 3670.781 | 18.092 | 202.90x |

Experiments on an Apple MacBook Pro with a M2 Pro chip and 16GB RAM.

## When to parallelize?

The `gonzales_algorithm`

method found in fmmd/algorithms.py has the argument `parallel_dist_update`

. This argument will parallelize the update to the distance array. The `gonzales_algoritm`

is used in both step 1 and step 2 (see Algorithm Overview) and can offer substantial speedups.

Therefore, to test the speed-up of one execution of the Gonzales algorithm, we executed the following for different number of items and feature dimensions:

```
rng = np.random.default_rng(seed=42)
features = rng.random(size=(n,d))
ids = np.arange(n)
gonzales_algorithm(set(),features,ids,k=10,parallel_dist_update=True) # with parallel
gonzales_algorithm(set(),features,ids,k=10,parallel_dist_update=False) # with parallel
```

The result for speed up is shown in the figure below:

All blue regions are where the parallel version is slower than the sequential version and all the red regions are where the parallel regions are faster than the sequential versions.

Similarly, the `get_coreset_graph`

method in in fmmd/algorithms.py has an argument `parallel_edges_creation`

. This option uses the `fmmd.parallel_utils.edges`

method to parallelize step 3 of the algorithm as described in Algorithm Overview. Again, the speed-up depends on the number of items in the coreset and the dimensionality of each item.

Therefore, to test the speed-up offered by the parallel version, we run the following two methods with random data of different coreset size and dimensions:

```
rng = np.random.default_rng(seed=42)
features = rng.random(size=(n,d))
parallel_utils.edges(features,0.01) # parallel edge creation
parallel_utils.edges_sequential(features,0.01) # sequential edge creation
```

The results are shown below:

All blue regions are where the parallel version is slower than the sequential version and all the red regions are where the parallel regions are faster than the sequential versions.The number of coreset items depends on the data features, the number of samples

`k`

and the constraints for the FMMD problem.

As discussed above, the parallel options are only useful when the number of dimensions increase.
Therefore, using the full `768`

dimensional embeddings for the ECCO dataset, we select `k=500`

balanced samples as follows:

```
python ecco_balanced_samples.py -k=500 --eps=0.5
```

The `fmmd`

method found in fmmd/algorithms.py has two options to utilize multiple cores:

`--parallel-dist-update`

: Update the solution distances in parallel for the Gonzales algorithm`--parallel-edge-creation`

: Find the coreset edges in parallel which are below the diversity threshold

For the ECCO dataset and balanced sampling we observe the following trends:

`--parallel-dist-update` | `--parallel-edge-creation` | Running Time | Relative Speedup |
---|---|---|---|

❌ | ❌ | 4444.224 | 4.55 |

✅ | ❌ | 1781.913 | 1.82 |

❌ | ✅ | 3625.860 | 3.71 |

✅ | ✅ | 976.961 | 1.00 |

The above ablation results indicate the majority of the speedup is gained from the parallel updates and that the speedup from parallel edge creation is relatively smaller potentially due to the number of coreset items.

## Full Experiment Results

For the ECCO dataset, there are 4.2 million items and 137 groups with a very skewed distribution. The requirement was at least 3000 balanced samples. With the group distribution resulted in the following:

- Maximum of
`26`

samples from each group `3063`

total samples

The items are `768`

dimensional embedding which take 3min to load the 22GB vectors.

The FFMD-S solution takes 9250.9 seconds (2.5hrs).

The FMMD-S algorithm has the following stages:

- Finding an initial greedy solution without thinking of buckets:
**1615.6 sec** - Finding a core-set of the data from buckets starting from the initial samples :
**1832.83 sec** - Computing pair-wise distances and selecting ones which are below a diversity threshold:
**5779.5 sec** - Creating a coreset graph with nodes from 2 and edges from 3:
**0.69 sec** - Solving an Integer Linear Program of a coreset from 4 to get final solution:
**17.27 sec**

All these times are on a system with 40 cores and 80GB RAM (of which only 25GB was actually used) The main bottleneck as seen is step 3. This is because we end up with 274,398 items in the coreset leading to the computation of nearly 37B pair-wise distances. After step 3 we only have 253 edges below the required threshold, so the rest of the algorithm is very quick.

## Limitations and Future Updates

There are some limitations compared to the original implementations. These will be addressed as the need arises.

- Only supports L2 distance metric
- No ability to generate multiple solutions in parallel to obtain the best overall diversity