API Reference

This is the complete API reference for SUBMARIT.

Core Module

Substitution Matrix Creation

Substitution matrix implementation for SUBMARIT.

class submarit.core.substitution_matrix.SubstitutionMatrix(data: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None = None, normalize: bool = True, check_symmetry: bool = True, tol: float = 1e-10)[source]

Bases: object

Represents a product substitution matrix.

This class handles the creation and manipulation of substitution matrices used in submarket identification. The matrix represents substitution patterns between products based on sales or other data.

create_from_consumer_product_data(consumer_product_data: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str], normalize: bool = True, weight: int = 0, diag: bool = False) Tuple[ndarray[tuple[Any, ...], dtype[int64]], int][source]

Create substitution matrix from consumer-product data.

This method implements the logic from CreateSubstitutionMatrix.m

Parameters:
  • consumer_product_data – Consumer × product data matrix

  • normalize – Whether to normalize rows to sum to 1

  • weight – 0 = weight by number of consumers, 1 = weight by product sales

  • diag – Whether to include diagonal self substitution

Returns:

Tuple of (product_indexes, product_count)

create_from_sales_data(**kwargs)

Create substitution matrix from sales data time series.

Parameters:
  • sales_data – Sales data matrix (products × time periods)

  • method – Method for computing substitution (‘correlation’, ‘covariance’)

get_inter_cluster_substitution(labels: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[float64]][source]

Compute inter-cluster substitution matrix.

Parameters:

labels – Cluster assignments for each product

Returns:

Matrix of substitution rates between clusters

get_intra_cluster_substitution(labels: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[float64]][source]

Compute average intra-cluster substitution for each cluster.

Parameters:

labels – Cluster assignments for each product

Returns:

Array of average intra-cluster substitution rates

get_matrix() ndarray[tuple[Any, ...], dtype[float64]][source]

Get the substitution matrix.

Returns:

The substitution matrix

get_submatrix(indices: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[float64]][source]

Extract a submatrix for given indices.

Parameters:

indices – Indices of products to include

Returns:

Submatrix for the specified products

property is_normalized: bool

Check if the matrix is normalized.

is_symmetric(tol: float | None = None) bool[source]

Check if the matrix is symmetric.

Parameters:

tol – Tolerance for symmetry check

Returns:

True if symmetric within tolerance

property n_products: int

Get the number of products.

normalize() None[source]

Normalize the substitution matrix.

Ensures that each row sums to 1 (excluding diagonal).

set_data(data: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str], normalize: bool = True, check_symmetry: bool = True) None[source]

Set the substitution matrix data.

Parameters:
  • data – Input data

  • normalize – Whether to normalize the matrix

  • check_symmetry – Whether to check and enforce symmetry

property shape: Tuple[int, int]

Get the shape of the substitution matrix.

Example usage:

from submarit.core import create_substitution_matrix
import numpy as np

# Create product feature matrix
X = np.random.rand(100, 20)  # 100 products, 20 features

# Default Euclidean distance
S = create_substitution_matrix(X)

# Using cosine similarity
S_cosine = create_substitution_matrix(X, metric='cosine')

# With custom parameters
S_custom = create_substitution_matrix(
    X,
    metric='minkowski',
    metric_params={'p': 3},
    normalize=True
)

Base Classes

Base classes for SUBMARIT core functionality.

class submarit.core.base.BaseClusterer(n_clusters: int = 2, random_state: int | None = None)[source]

Bases: BaseEstimator

Base class for clustering algorithms.

abstract fit(X: ndarray[tuple[Any, ...], dtype[float64]]) BaseClusterer[source]

Fit the clustering model.

Parameters:

X – Input data matrix

Returns:

Self

fit_predict(X: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[int64]][source]

Fit the model and predict cluster labels.

Parameters:

X – Input data matrix

Returns:

Cluster labels

property labels_: ndarray[tuple[Any, ...], dtype[int64]] | None

Get cluster labels.

property n_iter_: int | None

Get number of iterations.

predict(X: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[int64]][source]

Predict cluster labels.

Parameters:

X – Input data matrix

Returns:

Cluster labels

class submarit.core.base.BaseEstimator[source]

Bases: ABC

Base class for all estimators in SUBMARIT.

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments.

get_params(deep: bool = True) Dict[str, Any][source]

Get parameters for this estimator.

Parameters:

deep – If True, will return parameters for sub-objects

Returns:

Parameter names mapped to their values

set_params(**params) BaseEstimator[source]

Set parameters for this estimator.

Parameters:

**params – Estimator parameters

Returns:

Self

class submarit.core.base.BaseEvaluator[source]

Bases: BaseEstimator

Base class for cluster evaluation metrics.

abstract evaluate(X: ndarray[tuple[Any, ...], dtype[float64]], labels: ndarray[tuple[Any, ...], dtype[int64]]) Dict[str, float][source]

Evaluate clustering results.

Parameters:
  • X – Input data matrix

  • labels – Cluster labels

Returns:

Dictionary of evaluation metrics

class submarit.core.base.BaseValidator[source]

Bases: BaseEstimator

Base class for validation methods.

abstract validate(X: ndarray[tuple[Any, ...], dtype[float64]], clusterer: BaseClusterer, **kwargs) Dict[str, Any][source]

Validate clustering results.

Parameters:
  • X – Input data matrix

  • clusterer – Clustering algorithm instance

  • **kwargs – Additional validation parameters

Returns:

Validation results

class submarit.core.base.ClusteringResult(labels: ndarray[tuple[Any, ...], dtype[int64]], log_likelihood: float, n_iter: int, converged: bool, metadata: Dict[str, Any] | None = None)[source]

Bases: object

Container for clustering results.

class submarit.core.base.EvaluationResult(log_likelihood: float, z_score: float, diff_value: float, p_value: float | None = None, metadata: Dict[str, Any] | None = None)[source]

Bases: object

Container for evaluation results.

Algorithms Module

Local Search Algorithm

Example:

from submarit.algorithms import LocalSearch

# Initialize with parameters
ls = LocalSearch(
    n_clusters=5,
    max_iter=100,
    n_restarts=10,
    tol=1e-4,
    random_state=42
)

# Fit and predict
clusters = ls.fit_predict(S)

# Access results
print(f"Final objective value: {ls.objective_}")
print(f"Number of iterations: {ls.n_iter_}")
print(f"Cluster centers: {ls.cluster_centers_}")

Algorithm Parameters

LocalSearch Parameters

Parameter

Default

Description

n_clusters

Required

Number of clusters (submarkets) to find

max_iter

100

Maximum number of iterations per restart

n_restarts

10

Number of random restarts to avoid local optima

tol

1e-4

Convergence tolerance

init

‘random’

Initialization method: ‘random’, ‘k-means++’, or array

random_state

None

Random seed for reproducibility

Evaluation Module

Cluster Evaluator

class submarit.evaluation.ClusterEvaluator[source]

Bases: BaseEvaluator

Evaluates SUBMARIT clustering solutions using multiple criteria.

This class implements the evaluation methodology from kSMEvaluateClustering.m, computing various statistics to assess clustering quality including: - Difference between estimated and theoretical probabilities - Z-value statistics - Log-likelihood values

evaluate(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) ClusterEvaluationResult[source]

Evaluate a SUBMARIT clustering solution.

Parameters:
  • swm – Product x product switching matrix

  • n_clusters – Number of clusters

  • cluster_assign – Product cluster assignments (1 to n_clusters)

Returns:

ClusterEvaluationResult containing all evaluation metrics

evaluate_legacy(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]

Legacy evaluation function matching kEvaluateClustering.m.

This is a simplified version without some of the advanced statistics.

Parameters:
  • swm – Product x product switching matrix

  • n_clusters – Number of clusters

  • cluster_assign – Product cluster assignments (1 to n_clusters)

Returns:

Dictionary containing evaluation metrics

Available metrics:

  • silhouette: Silhouette coefficient (-1 to 1, higher is better)

  • calinski_harabasz: Calinski-Harabasz index (higher is better)

  • davies_bouldin: Davies-Bouldin index (lower is better)

  • dunn: Dunn index (higher is better)

  • within_cluster_sum: Within-cluster sum of squares (lower is better)

Example:

from submarit.evaluation import ClusterEvaluator

evaluator = ClusterEvaluator()

# Evaluate all metrics
metrics = evaluator.evaluate(S, clusters)

# Evaluate specific metrics
metrics = evaluator.evaluate(S, clusters,
                            metrics=['silhouette', 'dunn'])

# Get detailed cluster statistics
stats = evaluator.cluster_statistics(S, clusters)

Gap Statistic

submarit.evaluation.gap_statistic()

GAP statistic implementation for optimal cluster number selection.

Example:

from submarit.evaluation import gap_statistic

# Calculate gap statistic
gap, std = gap_statistic(S, n_clusters=5, n_bootstrap=50)

# Find optimal number of clusters
gaps = []
for k in range(2, 11):
    gap, _ = gap_statistic(S, k, n_bootstrap=20)
    gaps.append(gap)

optimal_k = np.argmax(gaps) + 2

Entropy Evaluator

Statistical Tests

Statistical tests and utilities for SUBMARIT evaluation.

class submarit.evaluation.statistical_tests.StatisticalTests[source]

Bases: object

Statistical testing utilities for cluster evaluation.

static bootstrap_confidence_interval(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_assign: ndarray[tuple[Any, ...], dtype[int64]], metric: str = 'z_value', n_bootstrap: int = 1000, confidence: float = 0.95, random_state: int | None = None) Dict[str, float][source]

Calculate bootstrap confidence intervals for evaluation metrics.

Parameters:
  • swm – Switching matrix

  • cluster_assign – Cluster assignments

  • metric – Metric to analyze

  • n_bootstrap – Number of bootstrap samples

  • confidence – Confidence level

  • random_state – Random seed

Returns:

Dictionary with confidence intervals and statistics

static cluster_validity_indices(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) Dict[str, float][source]

Calculate various cluster validity indices.

Parameters:
  • swm – Switching matrix

  • cluster_assign – Cluster assignments

Returns:

Dictionary with validity indices

static permutation_test(swm: ndarray[tuple[Any, ...], dtype[float64]], assign1: ndarray[tuple[Any, ...], dtype[int64]], assign2: ndarray[tuple[Any, ...], dtype[int64]], n_permutations: int = 1000, metric: str = 'z_value', random_state: int | None = None) Dict[str, float][source]

Perform permutation test to compare two clustering solutions.

Parameters:
  • swm – Switching matrix

  • assign1 – First cluster assignment

  • assign2 – Second cluster assignment

  • n_permutations – Number of permutations

  • metric – Metric to compare (‘z_value’, ‘diff’, ‘log_lh’)

  • random_state – Random seed

Returns:

Dictionary with test statistics and p-value

static stability_analysis(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_func, n_clusters: int, n_runs: int = 10, min_items: int = 1) Dict[str, float][source]

Analyze stability of clustering results across multiple runs.

Parameters:
  • swm – Switching matrix

  • cluster_func – Clustering function

  • n_clusters – Number of clusters

  • n_runs – Number of runs to perform

  • min_items – Minimum items per cluster

Returns:

Dictionary with stability metrics

Visualization

Visualization utilities for evaluation results.

class submarit.evaluation.visualization.EvaluationVisualizer[source]

Bases: object

Visualization utilities for SUBMARIT evaluation results.

create_evaluation_report(results: Dict[str, Any], save_path: str) None[source]

Create a comprehensive evaluation report.

Parameters:
  • results – Dictionary containing various evaluation results

  • save_path – Path to save the report (PDF or image)

plot_cluster_comparison(results: List[ClusterEvaluationResult], labels: List[str] | None = None, metrics: List[str] | None = None, save_path: str | None = None) None[source]

Compare multiple clustering results.

Parameters:
  • results – List of evaluation results to compare

  • labels – Labels for each result

  • metrics – Metrics to plot (default: all)

  • save_path – Optional path to save the plot

plot_cluster_sizes(result: ClusterEvaluationResult | EntropyClusterResult, save_path: str | None = None) None[source]

Plot cluster size distribution.

Parameters:
  • result – Clustering result

  • save_path – Optional path to save the plot

plot_entropy_evolution(entropy_values: List[float], save_path: str | None = None) None[source]

Plot entropy values over iterations.

Parameters:
  • entropy_values – List of entropy values per iteration

  • save_path – Optional path to save the plot

plot_probability_comparison(result: ClusterEvaluationResult, cluster_idx: int | None = None, save_path: str | None = None) None[source]

Plot P vs PHat comparison.

Parameters:
  • result – Evaluation result

  • cluster_idx – Optional specific cluster to highlight

  • save_path – Optional path to save the plot

Example:

from submarit.evaluation.visualization import (
    plot_substitution_matrix,
    plot_cluster_heatmap,
    plot_evaluation_metrics
)

# Plot substitution matrix
fig, ax = plt.subplots(figsize=(10, 8))
plot_substitution_matrix(S, clusters, ax=ax)

# Plot cluster heatmap
plot_cluster_heatmap(S, clusters)

# Plot evaluation metrics
plot_evaluation_metrics(metrics_dict)

Validation Module

K-Fold Cross-Validation

class submarit.validation.KFoldValidator(n_folds: int = 5, n_random: int = 0, max_fold_run: int | None = None, random_state: int | None = None, n_jobs: int = 1, algorithm: str = 'v1')[source]

Bases: BaseValidator

K-fold cross-validation for SUBMARIT clustering.

This validator implements the k-fold validation procedure from kSMNFold.m, where data is split into k folds, each fold is held out as test data, and the model is trained on the remaining k-1 folds. Test items are then assigned using constrained clustering.

Parameters:
  • n_folds (int, default=5) – Number of folds for cross-validation

  • n_random (int, default=0) – Number of random permutations for empirical distribution. If 0, no empirical distribution is computed.

  • max_fold_run (int, optional) – Maximum number of folds to actually run (defaults to n_folds). Useful for testing with subset of folds.

  • random_state (int, optional) – Random seed for reproducibility

  • n_jobs (int, default=1) – Number of parallel jobs for empirical distribution generation. -1 means using all processors.

  • algorithm ({'v1', 'v2'}, default='v1') – Which clustering algorithm to use: - ‘v1’: Uses KSMLocalSearch (PHat-P optimization) - ‘v2’: Uses KSMLocalSearch2 (log-likelihood optimization)

validate(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, min_items: int = 1, n_runs: int = 10, **kwargs) KFoldResult[source]

Perform k-fold cross-validation on switching matrix.

Parameters:
  • swm (ndarray of shape (n_items, n_items)) – Product x product switching matrix

  • n_clusters (int) – Number of clusters

  • min_items (int, default=1) – Minimum number of items per cluster

  • n_runs (int, default=10) – Number of SUBMARIT runs for each optimization

  • **kwargs (dict) – Additional parameters passed to clustering algorithm

Returns:

result – Cross-validation results including average Rand indices and optional empirical distributions

Return type:

KFoldResult

Multiple Runs Validation

Rand Index Calculation

Rand index calculations and empirical distributions.

This module implements the Rand index and adjusted Rand index for comparing clustering solutions, following the methodology from RandIndex4.m. It also provides functions for creating empirical distributions of Rand indices.

class submarit.validation.rand_index.RandEmpiricalDistribution(rand_dist: ndarray[tuple[Any, ...], dtype[float64]], adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]], n_points: int, n_items: int, n_clusters: int, min_items: int = 1)[source]

Bases: object

Container for Rand index empirical distribution.

rand_dist

Sorted array of Rand index values

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]]

adj_rand_dist

Sorted array of adjusted Rand index values

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]]

n_points

Number of points in distribution

Type:

int

n_items

Number of items in clusterings

Type:

int

n_clusters

Number of clusters

Type:

int

min_items

Minimum items per cluster constraint

Type:

int

adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]]
calculate_p_values(rand_value: float, adj_rand_value: float) Dict[str, Tuple[float, float]][source]

Calculate p-values for given Rand index values.

Parameters:
  • rand_value (float) – Observed Rand index

  • adj_rand_value (float) – Observed adjusted Rand index

Returns:

Dictionary with p-values: {‘rand’: (upper_p, lower_p), ‘adj_rand’: (upper_p, lower_p)}

Return type:

dict

get_percentiles(percentiles: ndarray[tuple[Any, ...], dtype[float64]] | None = None) Dict[str, ndarray[tuple[Any, ...], dtype[float64]]][source]

Get percentile values from the distribution.

Parameters:

percentiles (array-like, optional) – Percentiles to compute (0-100). Defaults to standard confidence intervals.

Returns:

Dictionary with ‘rand’ and ‘adj_rand’ percentile values

Return type:

dict

min_items: int = 1
n_clusters: int
n_items: int
n_points: int
rand_dist: ndarray[tuple[Any, ...], dtype[float64]]
class submarit.validation.rand_index.RandIndex[source]

Bases: object

Calculator for Rand index and adjusted Rand index.

The Rand index measures the similarity between two clusterings by considering all pairs of items and checking whether they are grouped consistently.

The adjusted Rand index corrects for chance agreement and is preferred for comparing clusterings with different numbers of clusters.

calculate(clusters1: ndarray[tuple[Any, ...], dtype[int64]], clusters2: ndarray[tuple[Any, ...], dtype[int64]]) Tuple[float, float][source]

Calculate Rand index and adjusted Rand index.

This implements the methodology from RandIndex4.m, computing both the standard Rand index (Rand, 1971) and the adjusted Rand index (Hubert and Arabie, 1985).

Parameters:
  • clusters1 (ndarray of shape (n_items,)) – First clustering assignment (1-based cluster labels)

  • clusters2 (ndarray of shape (n_items,)) – Second clustering assignment (1-based cluster labels)

Returns:

  • rand (float) – Rand index (0 to 1, where 1 indicates perfect agreement)

  • adj_rand (float) – Adjusted Rand index (can be negative, 1 indicates perfect agreement)

calculate_detailed(clusters1: ndarray[tuple[Any, ...], dtype[int64]], clusters2: ndarray[tuple[Any, ...], dtype[int64]]) RandIndexResult[source]

Calculate Rand index with detailed contingency information.

Parameters:
  • clusters1 (ndarray of shape (n_items,)) – First clustering assignment (1-based cluster labels)

  • clusters2 (ndarray of shape (n_items,)) – Second clustering assignment (1-based cluster labels)

Returns:

result – Detailed results including contingency table counts

Return type:

RandIndexResult

class submarit.validation.rand_index.RandIndexResult(rand: float, adj_rand: float, a: int, b: int, c: int, d: int, n_items: int, n_pairs: int)[source]

Bases: object

Container for Rand index calculation results.

rand

Rand index value (0 to 1)

Type:

float

adj_rand

Adjusted Rand index value

Type:

float

a

Number of pairs in same cluster in both clusterings

Type:

int

b

Number of pairs in different clusters in both clusterings

Type:

int

c

Number of pairs in same cluster in C1, different in C2

Type:

int

d

Number of pairs in different cluster in C1, same in C2

Type:

int

n_items

Number of items compared

Type:

int

n_pairs

Total number of pairs

Type:

int

a: int
adj_rand: float
b: int
c: int
d: int
n_items: int
n_pairs: int
rand: float
summary() str[source]

Generate summary of Rand index results.

submarit.validation.rand_index.create_rand_empirical_distribution(n_items: int, n_clusters: int, n_points: int, min_items: int = 1, random_state: int | None = None, n_jobs: int = 1) RandEmpiricalDistribution[source]

Create empirical distribution for Rand index under random clustering.

This function implements the methodology from RandCreateDist.m, generating random clustering pairs and computing their Rand indices to build an empirical null distribution.

Parameters:
  • n_items (int) – Number of items to cluster

  • n_clusters (int) – Number of clusters

  • n_points (int) – Number of points in empirical distribution

  • min_items (int, default=1) – Minimum number of items required per cluster

  • random_state (int, optional) – Random seed for reproducibility

  • n_jobs (int, default=1) – Number of parallel jobs. -1 means using all processors.

Returns:

distribution – Empirical distribution of Rand indices

Return type:

RandEmpiricalDistribution

submarit.validation.rand_index.rand_empirical_p(rand_dist: ndarray[tuple[Any, ...], dtype[float64]], adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]], rand: float, adj_rand: float, create_confidence: bool = True) Dict[str, float | Tuple[float, float] | ndarray[tuple[Any, ...], dtype[float64]]][source]

Calculate p-values from empirical distribution.

This implements the functionality from RandEmpiricalP.m, computing p-values and confidence intervals from empirical distributions.

Parameters:
  • rand_dist (ndarray) – Sorted empirical distribution of Rand indices

  • adj_rand_dist (ndarray) – Sorted empirical distribution of adjusted Rand indices

  • rand (float) – Observed Rand index

  • adj_rand (float) – Observed adjusted Rand index

  • create_confidence (bool, default=True) – Whether to compute confidence interval values

Returns:

result – Dictionary containing: - ‘rand’: Observed Rand index - ‘adj_rand’: Observed adjusted Rand index - ‘rand_p’: (upper_p, lower_p) tuple for Rand index - ‘adj_rand_p’: (upper_p, lower_p) tuple for adjusted Rand - ‘rand_conf’: Confidence interval values (if requested) - ‘adj_rand_conf’: Adjusted Rand CI values (if requested)

Return type:

dict

P-Value Analysis

P-value calculation utilities for SUBMARIT validation.

This module provides utilities for calculating p-values from empirical distributions, supporting both one-tailed and two-tailed tests, as well as multiple testing corrections.

class submarit.validation.p_values.PValueResult(observed: float, p_value: float, p_value_corrected: float | None = None, percentile: float | None = None, ci_lower: float | None = None, ci_upper: float | None = None, direction: str = 'greater', n_simulations: int | None = None)[source]

Bases: object

Container for p-value calculation results.

observed

Observed test statistic value

Type:

float

p_value

Raw p-value

Type:

float

p_value_corrected

Corrected p-value (if correction applied)

Type:

float | None

percentile

Percentile of observed value in distribution

Type:

float | None

ci_lower

Lower confidence interval bound

Type:

float | None

ci_upper

Upper confidence interval bound

Type:

float | None

direction

Test direction (‘greater’, ‘less’, or ‘two-sided’)

Type:

str

n_simulations

Number of simulations in empirical distribution

Type:

int | None

ci_lower: float | None = None
ci_upper: float | None = None
direction: str = 'greater'
is_significant(alpha: float = 0.05) bool[source]

Check if result is statistically significant.

Parameters:

alpha (float, default=0.05) – Significance level

Returns:

True if p-value < alpha

Return type:

bool

n_simulations: int | None = None
observed: float
p_value: float
p_value_corrected: float | None = None
percentile: float | None = None
summary() str[source]

Generate summary of p-value results.

submarit.validation.p_values.calculate_empirical_p_value(observed: float, null_distribution: ndarray[tuple[Any, ...], dtype[float64]], direction: str = 'greater', return_percentile: bool = True) PValueResult[source]

Calculate p-value from empirical distribution.

Parameters:
  • observed (float) – Observed test statistic

  • null_distribution (ndarray) – Null distribution values (should be sorted)

  • direction ({'greater', 'less', 'two-sided'}, default='greater') – Direction of the test: - ‘greater’: P(X >= observed) - ‘less’: P(X <= observed) - ‘two-sided’: 2 * min(P(X >= observed), P(X <= observed))

  • return_percentile (bool, default=True) – Whether to calculate percentile of observed value

Returns:

result – P-value calculation results

Return type:

PValueResult

submarit.validation.p_values.calculate_p_value_range(observed: float, null_distribution: ndarray[tuple[Any, ...], dtype[float64]]) Tuple[float, float][source]

Calculate upper and lower p-value bounds.

This implements the methodology from MATLAB code where both upper and lower p-values are returned as a tuple.

Parameters:
  • observed (float) – Observed test statistic

  • null_distribution (ndarray) – Sorted null distribution values

Returns:

p_range – (upper_p, lower_p) where: - upper_p: Probability of values >= observed - lower_p: Probability of values <= observed

Return type:

tuple

submarit.validation.p_values.combine_p_values(p_values: List[float] | ndarray[tuple[Any, ...], dtype[float64]], method: str = 'fisher', weights: ndarray[tuple[Any, ...], dtype[float64]] | None = None) Dict[str, float][source]

Combine multiple p-values into a single test statistic.

Parameters:
  • p_values (array-like) – P-values to combine

  • method ({'fisher', 'stouffer', 'min', 'max'}, default='fisher') – Combination method: - ‘fisher’: Fisher’s combined probability test - ‘stouffer’: Stouffer’s Z-score method - ‘min’: Minimum p-value - ‘max’: Maximum p-value

  • weights (array-like, optional) – Weights for weighted combination (only for Stouffer’s method)

Returns:

result – Dictionary containing: - ‘statistic’: Combined test statistic - ‘p_value’: Combined p-value - ‘method’: Combination method used

Return type:

dict

submarit.validation.p_values.multiple_testing_correction(p_values: List[float] | ndarray[tuple[Any, ...], dtype[float64]], method: str = 'bonferroni', alpha: float = 0.05) Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]

Apply multiple testing correction to p-values.

Parameters:
  • p_values (array-like) – Raw p-values to correct

  • method ({'bonferroni', 'holm', 'fdr_bh', 'fdr_by'}, default='bonferroni') – Correction method: - ‘bonferroni’: Bonferroni correction - ‘holm’: Holm-Bonferroni method - ‘fdr_bh’: Benjamini-Hochberg FDR - ‘fdr_by’: Benjamini-Yekutieli FDR

  • alpha (float, default=0.05) – Family-wise error rate or false discovery rate

Returns:

results – Dictionary containing: - ‘corrected’: Corrected p-values - ‘rejected’: Boolean array of rejected hypotheses - ‘threshold’: Adjusted significance threshold - ‘method’: Correction method used

Return type:

dict

submarit.validation.p_values.permutation_test(data1: ndarray[tuple[Any, ...], dtype[float64]], data2: ndarray[tuple[Any, ...], dtype[float64]], statistic_func: callable, n_permutations: int = 10000, random_state: int | None = None) PValueResult[source]

Perform permutation test for difference between two groups.

Parameters:
  • data1 (ndarray) – First group data

  • data2 (ndarray) – Second group data

  • statistic_func (callable) – Function to compute test statistic. Should accept two arrays and return a scalar statistic.

  • n_permutations (int, default=10000) – Number of permutations

  • random_state (int, optional) – Random seed for reproducibility

Returns:

result – Permutation test results

Return type:

PValueResult

Top-K Analysis

Top-k clustering solution analysis.

This module implements functionality from RunClustersTopk.m and RunClustersTopk2.m for analyzing agreement among top-k clustering solutions.

class submarit.validation.topk_analysis.TopKResult(best_solutions: List[LocalSearchResult], avg_rand: float, avg_adj_rand: float, rand_p: Tuple[float, float], adj_rand_p: Tuple[float, float], rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None, adj_rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None, pairwise_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None, pairwise_adj_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None, empirical_dist: Dict | None = None)[source]

Bases: object

Container for top-k clustering analysis results.

best_solutions

List of top-k best clustering solutions

Type:

List[submarit.algorithms.local_search.LocalSearchResult]

avg_rand

Average Rand index among top-k solutions

Type:

float

avg_adj_rand

Average adjusted Rand index among top-k solutions

Type:

float

rand_p

P-value for Rand index

Type:

Tuple[float, float]

adj_rand_p

P-value for adjusted Rand index

Type:

Tuple[float, float]

rand_conf

Confidence interval values for Rand index

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

adj_rand_conf

Confidence interval values for adjusted Rand

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

pairwise_rand

Matrix of pairwise Rand indices

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

pairwise_adj_rand

Matrix of pairwise adjusted Rand indices

Type:

numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

empirical_dist

Empirical distribution used for p-values

Type:

Dict | None

adj_rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None
adj_rand_p: Tuple[float, float]
avg_adj_rand: float
avg_rand: float
best_solutions: List[LocalSearchResult]
empirical_dist: Dict | None = None
pairwise_adj_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None
pairwise_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None
rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None
rand_p: Tuple[float, float]
summary() str[source]

Generate summary of top-k analysis results.

submarit.validation.topk_analysis.analyze_solution_stability(solutions: List[LocalSearchResult], verbose: bool = False) Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]

Analyze stability across multiple clustering solutions.

Parameters:
  • solutions (list of LocalSearchResult) – Clustering solutions to analyze

  • verbose (bool, default=False) – Whether to print analysis details

Returns:

analysis – Dictionary containing: - ‘consensus_matrix’: Item x item co-clustering frequency - ‘item_stability’: Stability score for each item - ‘cluster_stability’: Stability of each cluster - ‘overall_stability’: Overall stability score

Return type:

dict

submarit.validation.topk_analysis.run_clusters_topk(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, min_items: int, n_runs: int, top_k: int, n_random: int = 10000, algorithm: str = 'v1', random_state: int | None = None, n_jobs: int = 1, verbose: bool = False) TopKResult[source]

Run multiple clusterings and analyze agreement among top-k solutions.

This function implements the functionality of RunClustersTopk.m and RunClustersTopk2.m, finding the top-k best clustering solutions and analyzing their agreement using Rand indices.

Parameters:
  • swm (ndarray of shape (n_items, n_items)) – Product x product switching matrix

  • n_clusters (int) – Number of clusters

  • min_items (int) – Minimum number of items per cluster

  • n_runs (int) – Number of experimental runs

  • top_k (int) – Number of top solutions to return

  • n_random (int, default=10000) – Number of random values for empirical distributions

  • algorithm ({'v1', 'v2'}, default='v1') – Which algorithm version to use: - ‘v1’: KSMLocalSearch (PHat-P optimization) - ‘v2’: KSMLocalSearch2 (log-likelihood optimization)

  • random_state (int, optional) – Random seed for reproducibility

  • n_jobs (int, default=1) – Number of parallel jobs (-1 for all processors)

  • verbose (bool, default=False) – Whether to print progress messages

Returns:

result – Analysis results including p-values and confidence intervals

Return type:

TopKResult

IO Module

Data Input/Output

General data I/O utilities for SUBMARIT.

submarit.io.data_io.load_substitution_data(filepath: str | Path, format: str | None = None, **kwargs) SubstitutionMatrix[source]

Load substitution data from various file formats.

Parameters:
  • filepath – Input file path

  • format – File format (auto-detected if None)

  • **kwargs – Additional arguments passed to format-specific loaders

Returns:

SubstitutionMatrix object

submarit.io.data_io.save_results(results: Dict[str, Any], filepath: str | Path, format: str | None = None, **kwargs) None[source]

Save clustering results to file.

Parameters:
  • results – Dictionary of results to save

  • filepath – Output file path

  • format – Output format (auto-detected if None)

  • **kwargs – Additional arguments for format-specific savers

Example:

from submarit.io import load_data, save_results

# Load data from various formats
X, names = load_data('products.csv')
X, names = load_data('products.xlsx', sheet_name='Sheet1')
X, names = load_data('products.json')

# Save results
save_results('results.json', {
    'clusters': clusters,
    'metrics': metrics,
    'parameters': params
})

MATLAB Compatibility

MATLAB file I/O utilities.

submarit.io.matlab_io.convert_mat_to_npz(mat_filepath: str | Path, npz_filepath: str | Path, compressed: bool = True) None[source]

Convert a .mat file to NumPy .npz format.

Parameters:
  • mat_filepath – Input .mat file path

  • npz_filepath – Output .npz file path

  • compressed – Whether to use compression

submarit.io.matlab_io.load_mat(filepath: str | Path, variable_names: list | None = None, matlab_compatible: bool = True) Dict[str, Any][source]

Load data from a MATLAB .mat file.

Handles both old-style (< v7.3) and new-style (>= v7.3) .mat files.

Parameters:
  • filepath – Path to the .mat file

  • variable_names – List of variable names to load (None = load all)

  • matlab_compatible – Whether to maintain MATLAB compatibility

Returns:

Dictionary of loaded variables

submarit.io.matlab_io.save_mat(filepath: str | Path, data: Dict[str, Any], format: str = '5', do_compression: bool = False) None[source]

Save data to a MATLAB .mat file.

Parameters:
  • filepath – Output file path

  • data – Dictionary of variables to save

  • format – MATLAB file format (‘5’ or ‘7.3’)

  • do_compression – Whether to compress the data

submarit.io.matlab_io.validate_mat_file(filepath: str | Path) Dict[str, Any][source]

Validate and get information about a .mat file.

Parameters:

filepath – Path to the .mat file

Returns:

Dictionary with file information

Example:

from submarit.io import load_matlab_data, save_matlab_data

# Load MATLAB .mat file
data = load_matlab_data('submarkets.mat')
X = data['features']
S = data['substitution_matrix']

# Save to MATLAB format
save_matlab_data('results.mat', {
    'clusters': clusters,
    'substitution_matrix': S,
    'metrics': metrics
})

Utils Module

MATLAB Compatibility Utils

MATLAB compatibility utilities for SUBMARIT.

This module provides functions to ensure compatibility between MATLAB and Python implementations, handling differences in indexing, random number generation, and numerical operations.

class submarit.utils.matlab_compat.IndexConverter(matlab_style: bool = True)[source]

Bases: object

Context manager for automatic index conversion.

convert_in(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]

Convert indices on input.

convert_out(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]

Convert indices on output.

exception submarit.utils.matlab_compat.MatlabCompatibilityError[source]

Bases: Exception

Raised when MATLAB compatibility issues occur.

class submarit.utils.matlab_compat.MatlabRandom(seed: int | None = None)[source]

Bases: object

MATLAB-compatible random number generator.

This class provides random number generation that matches MATLAB’s behavior as closely as possible, including the ability to use MATLAB’s random seeds.

rand(*shape: int) ndarray[tuple[Any, ...], dtype[float64]][source]

Generate uniformly distributed random numbers like MATLAB’s rand.

Parameters:

*shape – Dimensions of the output array

Returns:

Array of random numbers from uniform distribution [0, 1)

randi(imax: int, *shape: int) ndarray[tuple[Any, ...], dtype[int64]][source]

Generate uniformly distributed random integers like MATLAB’s randi.

Parameters:
  • imax – Maximum integer value (inclusive)

  • *shape – Dimensions of the output array

Returns:

Array of random integers from 1 to imax (inclusive, 1-based)

randn(*shape: int) ndarray[tuple[Any, ...], dtype[float64]][source]

Generate normally distributed random numbers like MATLAB’s randn.

Parameters:

*shape – Dimensions of the output array

Returns:

Array of random numbers from standard normal distribution

randperm(n: int) ndarray[tuple[Any, ...], dtype[int64]][source]

Generate random permutation like MATLAB’s randperm.

Parameters:

n – Number of elements to permute

Returns:

Random permutation of integers from 1 to n (1-based like MATLAB)

submarit.utils.matlab_compat.ensure_matlab_compatibility(func)[source]

Decorator to ensure MATLAB compatibility for numerical functions.

This decorator: - Ensures float64 precision - Handles index conversion - Manages numerical tolerances

submarit.utils.matlab_compat.matlab_compatible_random(seed: int | None = None) MatlabRandom[source]

Create a MATLAB-compatible random number generator.

Parameters:

seed – Random seed for reproducibility

Returns:

MatlabRandom instance

submarit.utils.matlab_compat.matlab_find(condition: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[int64]][source]

Find indices of nonzero elements like MATLAB’s find.

Parameters:

condition – Boolean array or condition

Returns:

1-based indices of True/nonzero elements

submarit.utils.matlab_compat.matlab_reshape(array: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str], *shape: int) ndarray[tuple[Any, ...], dtype[_ScalarT]][source]

Reshape array using MATLAB column-major order.

Parameters:
  • array – Input array

  • *shape – New shape dimensions

Returns:

Reshaped array

submarit.utils.matlab_compat.matlab_size(array: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) Tuple[int, ...][source]

Get array dimensions in MATLAB format.

Parameters:

array – Input array

Returns:

Tuple of dimensions (always at least 2D)

submarit.utils.matlab_compat.matlab_to_python_index(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]

Convert MATLAB 1-based indices to Python 0-based indices.

Parameters:

idx – MATLAB index or array of indices (1-based)

Returns:

Python index or array of indices (0-based)

Raises:

MatlabCompatibilityError – If index is less than 1

submarit.utils.matlab_compat.python_to_matlab_index(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]

Convert Python 0-based indices to MATLAB 1-based indices.

Parameters:

idx – Python index or array of indices (0-based)

Returns:

MATLAB index or array of indices (1-based)

Raises:

MatlabCompatibilityError – If index is negative

submarit.utils.matlab_compat.set_default_dtype()[source]

Set NumPy default dtype to float64 to match MATLAB.

Full API Example

Here’s a complete example using the full API:

import numpy as np
from submarit.core import create_substitution_matrix
from submarit.algorithms import LocalSearch
from submarit.evaluation import ClusterEvaluator, gap_statistic
from submarit.validation import KFoldValidator
from submarit.io import load_data, save_results

# Load data
X, product_names = load_data('products.csv')

# Create substitution matrix
S = create_substitution_matrix(X, metric='cosine')

# Find optimal number of clusters
gaps = []
for k in range(2, 11):
    gap, _ = gap_statistic(S, k, n_bootstrap=20)
    gaps.append(gap)
optimal_k = np.argmax(gaps) + 2

# Perform clustering
clusterer = LocalSearch(n_clusters=optimal_k, n_restarts=20)
clusters = clusterer.fit_predict(S)

# Evaluate results
evaluator = ClusterEvaluator()
metrics = evaluator.evaluate(S, clusters)

# Validate stability
validator = KFoldValidator(n_splits=5)
stability_scores = validator.validate(X, n_clusters=optimal_k)

# Save results
save_results('analysis_results.json', {
    'product_names': product_names,
    'clusters': clusters.tolist(),
    'metrics': metrics,
    'stability': {
        'scores': stability_scores,
        'mean': np.mean(stability_scores),
        'std': np.std(stability_scores)
    },
    'parameters': {
        'n_clusters': optimal_k,
        'metric': 'cosine',
        'algorithm': 'local_search'
    }
})