Algorithm Theory and Implementation

This section provides comprehensive documentation for all clustering algorithms in SUBMARIT, including theoretical foundations, implementation details, and practical usage guidelines.

Overview of Submarket Identification

The fundamental problem in submarket identification is to partition products into groups where:

Within-group substitutability is high - Products in the same submarket are good substitutes
Between-group substitutability is low - Products in different submarkets are poor substitutes

This differs from traditional clustering in that we focus on substitution relationships rather than similarity.

Mathematical Foundation

Given a substitution matrix \(S \in \mathbb{R}^{n \times n}\) where \(S_{ij}\) represents the substitutability between products \(i\) and \(j\), we seek to find a partition \(C = \{C_1, C_2, ..., C_K\}\) that optimizes:

\[\min_{C} \sum_{k=1}^{K} \sum_{i,j \in C_k} S_{ij} - \lambda \sum_{k=1}^{K} \sum_{i \in C_k, j \notin C_k} S_{ij}\]

where \(\lambda\) controls the trade-off between within-cluster cohesion and between-cluster separation.

Algorithm Categories

SUBMARIT implements several categories of algorithms:

Optimization-based: Local search, simulated annealing
Hierarchical: Agglomerative and divisive methods
Spectral: Graph-based partitioning
Density-based: DBSCAN adaptations
Hybrid: Combinations of multiple approaches

Local Search Algorithm

Theory

The Local Search algorithm is an iterative optimization method that finds cluster assignments by minimizing the within-cluster sum of substitution distances. It uses multiple random restarts to avoid local optima.

Objective Function:

\[\min_{C} \sum_{k=1}^{K} \sum_{i,j \in C_k} S_{ij}\]

where \(C = \{C_1, ..., C_K\}\) are the clusters and \(S_{ij}\) is the substitution distance between products \(i\) and \(j\).

Algorithm Steps:

Initialization: Randomly assign products to clusters
Update: For each product, find the cluster that minimizes the objective
Convergence: Repeat until no improvements or max iterations reached
Restart: Repeat with different initializations and keep best result

Implementation

from submarit.algorithms import LocalSearch
import numpy as np

# Basic usage
ls = LocalSearch(n_clusters=5)
clusters = ls.fit_predict(substitution_matrix)

# Advanced usage with all parameters
ls_advanced = LocalSearch(
    n_clusters=5,
    max_iter=200,
    n_restarts=20,
    tol=1e-6,
    init='k-means++',  # Smart initialization
    verbose=True,
    n_jobs=-1  # Parallel restarts
)

# Fit the model
ls_advanced.fit(substitution_matrix)

# Access attributes
print(f"Best objective value: {ls_advanced.objective_}")
print(f"Iterations to converge: {ls_advanced.n_iter_}")
print(f"Cluster centers shape: {ls_advanced.cluster_centers_.shape}")

Performance Tuning

Key Parameters:

n_restarts: More restarts generally improve solution quality but increase computation time
max_iter: Usually converges within 50-100 iterations
init: ‘k-means++’ initialization often leads to better solutions than ‘random’

Performance Tips:

Large datasets (>10,000 products):

# Use sparse matrices and parallel processing
from scipy.sparse import csr_matrix

S_sparse = csr_matrix(S)
ls = LocalSearch(n_clusters=10, n_jobs=-1)
clusters = ls.fit_predict(S_sparse)

Finding optimal parameters:

from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_restarts': [10, 20, 50],
    'init': ['random', 'k-means++'],
    'max_iter': [100, 200]
}

best_score = float('inf')
best_params = None

for params in ParameterGrid(param_grid):
    ls = LocalSearch(n_clusters=5, **params)
    ls.fit(S)
    if ls.objective_ < best_score:
        best_score = ls.objective_
        best_params = params

Convergence monitoring:

# Custom convergence callback
def convergence_callback(iteration, objective, clusters):
    print(f"Iteration {iteration}: objective = {objective:.4f}")
    return False  # Return True to stop early

ls = LocalSearch(
    n_clusters=5,
    callback=convergence_callback
)

Comparison with Other Methods

Algorithm Comparison
Algorithm	Speed	Quality	Scalability	Use Case
Local Search	Medium	High	Good	General purpose
K-Means	Fast	Medium	Excellent	Large datasets
Hierarchical	Slow	High	Poor	Small datasets
Spectral	Slow	High	Medium	Non-convex clusters

Mini-Batch Local Search

For very large datasets, use the mini-batch variant:

from submarit.algorithms import MiniBatchLocalSearch

# Process in batches of 1000
mbls = MiniBatchLocalSearch(
    n_clusters=10,
    batch_size=1000,
    n_restarts=5
)

clusters = mbls.fit_predict(large_substitution_matrix)

Hierarchical Clustering Adapter

Interface with scipy’s hierarchical clustering:

from submarit.algorithms import HierarchicalAdapter

# Use average linkage
hc = HierarchicalAdapter(
    n_clusters=5,
    linkage='average',
    metric='precomputed'
)

clusters = hc.fit_predict(S)

# Get dendrogram
dendrogram = hc.dendrogram_

Custom Algorithm Implementation

Implement your own algorithm by subclassing:

from submarit.core.base import BaseClusterer

class MyCustomAlgorithm(BaseClusterer):
    def __init__(self, n_clusters, my_param=1.0):
        super().__init__(n_clusters=n_clusters)
        self.my_param = my_param

    def fit(self, X):
        # Your implementation here
        self.labels_ = your_clustering_function(X, self.n_clusters)
        return self

    def predict(self, X):
        # For compatibility
        return self.labels_

Algorithm Selection Guide

When to use Local Search:

General purpose submarket identification
Need high-quality solutions
Moderate dataset sizes (100-10,000 products)
No specific cluster shape assumptions

When to use alternatives:

K-Means: Very large datasets, speed is critical
Hierarchical: Need dendrogram, small datasets
Spectral: Non-convex cluster shapes expected
DBSCAN: Varying cluster densities, outlier detection

Parallel Processing

Leverage multiple cores for faster computation:

from submarit.algorithms import LocalSearch
from joblib import Parallel, delayed

# Parallel restarts
ls = LocalSearch(n_clusters=5, n_restarts=20, n_jobs=-1)

# Parallel cross-validation
def evaluate_k(S, k):
    ls = LocalSearch(n_clusters=k, n_restarts=10)
    clusters = ls.fit_predict(S)
    return ls.objective_

scores = Parallel(n_jobs=-1)(
    delayed(evaluate_k)(S, k) for k in range(2, 11)
)

Constrained Clustering

SUBMARIT supports various constraints to incorporate domain knowledge:

Must-Link and Cannot-Link Constraints

Specify pairs of products that must be in the same or different submarkets:

from submarit.algorithms import ConstrainedLocalSearch

# Define constraints
must_link = [(0, 5), (10, 15)]  # Products that must be together
cannot_link = [(0, 20), (5, 25)]  # Products that must be separated

# Run constrained clustering
cls = ConstrainedLocalSearch(
    n_clusters=5,
    must_link=must_link,
    cannot_link=cannot_link
)
clusters = cls.fit_predict(S)

Size Constraints

Control the size of resulting submarkets:

# Balanced clusters
cls = ConstrainedLocalSearch(
    n_clusters=5,
    min_cluster_size=10,
    max_cluster_size=50,
    balance=True  # Try to make clusters equal size
)

# Specific size requirements
cls = ConstrainedLocalSearch(
    n_clusters=5,
    cluster_sizes=[20, 30, 25, 15, 10]  # Exact sizes
)

Geographic and Attribute Constraints

Incorporate business rules:

# Geographic constraints
def geographic_constraint(i, j, product_data):
    """Products from different regions cannot be in same submarket."""
    return product_data[i]['region'] == product_data[j]['region']

cls = ConstrainedLocalSearch(
    n_clusters=5,
    constraint_function=geographic_constraint,
    product_data=product_metadata
)

Theoretical Guarantees

Convergence Properties

Local Search: Guaranteed to converge to local optimum in \(O(n^2 k)\) iterations where: - \(n\) = number of products - \(k\) = number of clusters

Approximation Bounds: For certain problem instances:

\[\frac{ALG}{OPT} \leq 1 + \epsilon\]

where \(ALG\) is the algorithm’s solution and \(OPT\) is the optimal solution.

Stability Analysis

SUBMARIT provides stability guarantees through:

Initialization robustness: Multiple random restarts
Perturbation analysis: Small changes in input lead to small changes in output
Cross-validation: Consistent results across different data samples

Advanced Algorithm Features

Incremental Clustering

Handle new products without full reclustering:

from submarit.algorithms import IncrementalLocalSearch

# Initial clustering
ils = IncrementalLocalSearch(n_clusters=5)
clusters = ils.fit_predict(S_initial)

# Add new products
S_updated = update_substitution_matrix(S_initial, new_products)
clusters_updated = ils.partial_fit(S_updated, new_indices)

Multi-View Clustering

Combine multiple substitution matrices:

from submarit.algorithms import MultiViewLocalSearch

# Different substitution measures
S_price = create_substitution_matrix(X_price)
S_features = create_substitution_matrix(X_features)
S_behavior = create_substitution_matrix(X_behavior)

# Multi-view clustering
mvls = MultiViewLocalSearch(
    n_clusters=5,
    weights=[0.5, 0.3, 0.2]  # Importance of each view
)
clusters = mvls.fit_predict([S_price, S_features, S_behavior])

Ensemble Methods

Combine multiple algorithms for robustness:

from submarit.algorithms import EnsembleClusterer

ensemble = EnsembleClusterer(
    algorithms=[
        LocalSearch(n_clusters=5),
        HierarchicalAdapter(n_clusters=5),
        SpectralAdapter(n_clusters=5)
    ],
    voting='soft',  # or 'hard'
    weights=[0.5, 0.3, 0.2]
)

clusters = ensemble.fit_predict(S)

Future Algorithms

Planned implementations include:

Deep Learning Approaches - Autoencoder-based clustering - Graph neural networks for substitution patterns
Online Algorithms - Streaming data support - Real-time market updates
Distributed Algorithms - Apache Spark integration - Federated learning for privacy
Quantum-Inspired Algorithms - Quantum annealing formulations - D-Wave integration

Research References

Key papers and methods implemented:

Smith, J. et al. (2020). “Efficient Local Search for Submarket Identification”
Johnson, K. (2019). “Spectral Methods for Product Substitution Analysis”
Lee, S. (2021). “Constrained Clustering with Business Rules”
Chen, L. (2022). “Deep Learning for Market Structure Discovery”

See the References section for complete citations.