API Reference

This is the complete API reference for SUBMARIT.

Core Module 

Substitution Matrix Creation 

Substitution matrix implementation for SUBMARIT.

Bases: object

Represents a product substitution matrix.

This class handles the creation and manipulation of substitution matrices used in submarket identification. The matrix represents substitution patterns between products based on sales or other data.

Create substitution matrix from consumer-product data.

This method implements the logic from CreateSubstitutionMatrix.m

Parameters:

consumer_product_data – Consumer × product data matrix
normalize – Whether to normalize rows to sum to 1
weight – 0 = weight by number of consumers, 1 = weight by product sales
diag – Whether to include diagonal self substitution

Returns:

Tuple of (product_indexes, product_count)

create_from_sales_data(**kwargs)

Create substitution matrix from sales data time series.

Parameters:

sales_data – Sales data matrix (products × time periods)
method – Method for computing substitution (‘correlation’, ‘covariance’)

Compute inter-cluster substitution matrix.

Parameters:: labels – Cluster assignments for each product
Returns:: Matrix of substitution rates between clusters

Compute average intra-cluster substitution for each cluster.

Parameters:: labels – Cluster assignments for each product
Returns:: Array of average intra-cluster substitution rates

get_matrix() → ndarray[tuple[Any, ...], dtype[float64]][source]

Get the substitution matrix.

Returns:: The substitution matrix

Extract a submatrix for given indices.

Parameters:: indices – Indices of products to include
Returns:: Submatrix for the specified products

property is_normalized: bool: Check if the matrix is normalized.

is_symmetric(tol: float | None = None) → bool[source]

Check if the matrix is symmetric.

Parameters:: tol – Tolerance for symmetry check
Returns:: True if symmetric within tolerance

property n_products: int: Get the number of products.

normalize() → None[source]

Normalize the substitution matrix.

Ensures that each row sums to 1 (excluding diagonal).

Set the substitution matrix data.

Parameters:

data – Input data
normalize – Whether to normalize the matrix
check_symmetry – Whether to check and enforce symmetry

property shape: Tuple[int, int]: Get the shape of the substitution matrix.

Example usage:

from submarit.core import create_substitution_matrix
import numpy as np

# Create product feature matrix
X = np.random.rand(100, 20)  # 100 products, 20 features

# Default Euclidean distance
S = create_substitution_matrix(X)

# Using cosine similarity
S_cosine = create_substitution_matrix(X, metric='cosine')

# With custom parameters
S_custom = create_substitution_matrix(
    X,
    metric='minkowski',
    metric_params={'p': 3},
    normalize=True
)

Base Classes 

Base classes for SUBMARIT core functionality.

class submarit.core.base.BaseClusterer(n_clusters: int = 2, random_state: int | None = None)[source]

Bases: BaseEstimator

Base class for clustering algorithms.

abstract fit(X: ndarray[tuple[Any, ...], dtype[float64]]) → BaseClusterer[source]

Fit the clustering model.

Parameters:: X – Input data matrix
Returns:: Self

fit_predict(X: ndarray[tuple[Any, ...], dtype[float64]]) → ndarray[tuple[Any, ...], dtype[int64]][source]

Fit the model and predict cluster labels.

Parameters:: X – Input data matrix
Returns:: Cluster labels

property labels_: ndarray[tuple[Any, ...], dtype[int64]] | None: Get cluster labels.

property n_iter_: int | None: Get number of iterations.

predict(X: ndarray[tuple[Any, ...], dtype[float64]]) → ndarray[tuple[Any, ...], dtype[int64]][source]

Predict cluster labels.

Parameters:: X – Input data matrix
Returns:: Cluster labels

class submarit.core.base.BaseEstimator[source]

Bases: ABC

Base class for all estimators in SUBMARIT.

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments.

get_params(deep: bool = True) → Dict[str, Any][source]

Get parameters for this estimator.

Parameters:: deep – If True, will return parameters for sub-objects
Returns:: Parameter names mapped to their values

set_params(**params) → BaseEstimator[source]

Set parameters for this estimator.

Parameters:: **params – Estimator parameters
Returns:: Self

class submarit.core.base.BaseEvaluator[source]

Bases: BaseEstimator

Base class for cluster evaluation metrics.

abstract evaluate(X: ndarray[tuple[Any, ...], dtype[float64]], labels: ndarray[tuple[Any, ...], dtype[int64]]) → Dict[str, float][source]

Evaluate clustering results.

Parameters:

X – Input data matrix
labels – Cluster labels

Returns:

Dictionary of evaluation metrics

class submarit.core.base.BaseValidator[source]

Bases: BaseEstimator

Base class for validation methods.

abstract validate(X: ndarray[tuple[Any, ...], dtype[float64]], clusterer: BaseClusterer, **kwargs) → Dict[str, Any][source]

Validate clustering results.

Parameters:

X – Input data matrix
clusterer – Clustering algorithm instance
**kwargs – Additional validation parameters

Returns:

Validation results

class submarit.core.base.ClusteringResult(labels: ndarray[tuple[Any, ...], dtype[int64]], log_likelihood: float, n_iter: int, converged: bool, metadata: Dict[str, Any] | None = None)[source]

Bases: object

Container for clustering results.

class submarit.core.base.EvaluationResult(log_likelihood: float, z_score: float, diff_value: float, p_value: float | None = None, metadata: Dict[str, Any] | None = None)[source]

Bases: object

Container for evaluation results.

Algorithms Module 

Local Search Algorithm 

Example:

from submarit.algorithms import LocalSearch

# Initialize with parameters
ls = LocalSearch(
    n_clusters=5,
    max_iter=100,
    n_restarts=10,
    tol=1e-4,
    random_state=42
)

# Fit and predict
clusters = ls.fit_predict(S)

# Access results
print(f"Final objective value: {ls.objective_}")
print(f"Number of iterations: {ls.n_iter_}")
print(f"Cluster centers: {ls.cluster_centers_}")

Algorithm Parameters 

LocalSearch Parameters
Parameter	Default	Description
n_clusters	Required	Number of clusters (submarkets) to find
max_iter	100	Maximum number of iterations per restart
n_restarts	10	Number of random restarts to avoid local optima
tol	1e-4	Convergence tolerance
init	‘random’	Initialization method: ‘random’, ‘k-means++’, or array
random_state	None	Random seed for reproducibility

Evaluation Module 

Cluster Evaluator 

class submarit.evaluation.ClusterEvaluator[source]

Bases: BaseEvaluator

Evaluates SUBMARIT clustering solutions using multiple criteria.

This class implements the evaluation methodology from kSMEvaluateClustering.m, computing various statistics to assess clustering quality including: - Difference between estimated and theoretical probabilities - Z-value statistics - Log-likelihood values

evaluate(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) → ClusterEvaluationResult[source]

Evaluate a SUBMARIT clustering solution.

Parameters:

swm – Product x product switching matrix
n_clusters – Number of clusters
cluster_assign – Product cluster assignments (1 to n_clusters)

Returns:

ClusterEvaluationResult containing all evaluation metrics

evaluate_legacy(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) → Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]

Legacy evaluation function matching kEvaluateClustering.m.

This is a simplified version without some of the advanced statistics.

Parameters:

swm – Product x product switching matrix
n_clusters – Number of clusters
cluster_assign – Product cluster assignments (1 to n_clusters)

Returns:

Dictionary containing evaluation metrics

Available metrics:

silhouette: Silhouette coefficient (-1 to 1, higher is better)
calinski_harabasz: Calinski-Harabasz index (higher is better)
davies_bouldin: Davies-Bouldin index (lower is better)
dunn: Dunn index (higher is better)
within_cluster_sum: Within-cluster sum of squares (lower is better)

Example:

from submarit.evaluation import ClusterEvaluator

evaluator = ClusterEvaluator()

# Evaluate all metrics
metrics = evaluator.evaluate(S, clusters)

# Evaluate specific metrics
metrics = evaluator.evaluate(S, clusters,
                            metrics=['silhouette', 'dunn'])

# Get detailed cluster statistics
stats = evaluator.cluster_statistics(S, clusters)

Gap Statistic 

submarit.evaluation.gap_statistic(): GAP statistic implementation for optimal cluster number selection.

Example:

from submarit.evaluation import gap_statistic

# Calculate gap statistic
gap, std = gap_statistic(S, n_clusters=5, n_bootstrap=50)

# Find optimal number of clusters
gaps = []
for k in range(2, 11):
    gap, _ = gap_statistic(S, k, n_bootstrap=20)
    gaps.append(gap)

optimal_k = np.argmax(gaps) + 2

Entropy Evaluator 

Statistical Tests 

Statistical tests and utilities for SUBMARIT evaluation.

class submarit.evaluation.statistical_tests.StatisticalTests[source]

Bases: object

Statistical testing utilities for cluster evaluation.

static bootstrap_confidence_interval(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_assign: ndarray[tuple[Any, ...], dtype[int64]], metric: str = 'z_value', n_bootstrap: int = 1000, confidence: float = 0.95, random_state: int | None = None) → Dict[str, float][source]

Calculate bootstrap confidence intervals for evaluation metrics.

Parameters:

swm – Switching matrix
cluster_assign – Cluster assignments
metric – Metric to analyze
n_bootstrap – Number of bootstrap samples
confidence – Confidence level
random_state – Random seed

Returns:

Dictionary with confidence intervals and statistics

static cluster_validity_indices(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) → Dict[str, float][source]

Calculate various cluster validity indices.

Parameters:

swm – Switching matrix
cluster_assign – Cluster assignments

Returns:

Dictionary with validity indices

static permutation_test(swm: ndarray[tuple[Any, ...], dtype[float64]], assign1: ndarray[tuple[Any, ...], dtype[int64]], assign2: ndarray[tuple[Any, ...], dtype[int64]], n_permutations: int = 1000, metric: str = 'z_value', random_state: int | None = None) → Dict[str, float][source]

Perform permutation test to compare two clustering solutions.

Parameters:

swm – Switching matrix
assign1 – First cluster assignment
assign2 – Second cluster assignment
n_permutations – Number of permutations
metric – Metric to compare (‘z_value’, ‘diff’, ‘log_lh’)
random_state – Random seed

Returns:

Dictionary with test statistics and p-value

static stability_analysis(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_func, n_clusters: int, n_runs: int = 10, min_items: int = 1) → Dict[str, float][source]

Analyze stability of clustering results across multiple runs.

Parameters:

swm – Switching matrix
cluster_func – Clustering function
n_clusters – Number of clusters
n_runs – Number of runs to perform
min_items – Minimum items per cluster

Returns:

Dictionary with stability metrics

Visualization 

Visualization utilities for evaluation results.

class submarit.evaluation.visualization.EvaluationVisualizer[source]

Bases: object

Visualization utilities for SUBMARIT evaluation results.

create_evaluation_report(results: Dict[str, Any], save_path: str) → None[source]

Create a comprehensive evaluation report.

Parameters:

results – Dictionary containing various evaluation results
save_path – Path to save the report (PDF or image)

plot_cluster_comparison(results: List[ClusterEvaluationResult], labels: List[str] | None = None, metrics: List[str] | None = None, save_path: str | None = None) → None[source]

Compare multiple clustering results.

Parameters:

results – List of evaluation results to compare
labels – Labels for each result
metrics – Metrics to plot (default: all)
save_path – Optional path to save the plot

plot_cluster_sizes(result: ClusterEvaluationResult | EntropyClusterResult, save_path: str | None = None) → None[source]

Plot cluster size distribution.

Parameters:

result – Clustering result
save_path – Optional path to save the plot

plot_entropy_evolution(entropy_values: List[float], save_path: str | None = None) → None[source]

Plot entropy values over iterations.

Parameters:

entropy_values – List of entropy values per iteration
save_path – Optional path to save the plot

plot_probability_comparison(result: ClusterEvaluationResult, cluster_idx: int | None = None, save_path: str | None = None) → None[source]

Plot P vs PHat comparison.

Parameters:

result – Evaluation result
cluster_idx – Optional specific cluster to highlight
save_path – Optional path to save the plot

Example:

from submarit.evaluation.visualization import (
    plot_substitution_matrix,
    plot_cluster_heatmap,
    plot_evaluation_metrics
)

# Plot substitution matrix
fig, ax = plt.subplots(figsize=(10, 8))
plot_substitution_matrix(S, clusters, ax=ax)

# Plot cluster heatmap
plot_cluster_heatmap(S, clusters)

# Plot evaluation metrics
plot_evaluation_metrics(metrics_dict)

Validation Module 

K-Fold Cross-Validation 

class submarit.validation.KFoldValidator(n_folds: int = 5, n_random: int = 0, max_fold_run: int | None = None, random_state: int | None = None, n_jobs: int = 1, algorithm: str = 'v1')[source]

Bases: BaseValidator

K-fold cross-validation for SUBMARIT clustering.

This validator implements the k-fold validation procedure from kSMNFold.m, where data is split into k folds, each fold is held out as test data, and the model is trained on the remaining k-1 folds. Test items are then assigned using constrained clustering.

Parameters:

n_folds (int, default=5) – Number of folds for cross-validation
n_random (int, default=0) – Number of random permutations for empirical distribution. If 0, no empirical distribution is computed.
max_fold_run (int, optional) – Maximum number of folds to actually run (defaults to n_folds). Useful for testing with subset of folds.
random_state (int, optional) – Random seed for reproducibility
n_jobs (int, default=1) – Number of parallel jobs for empirical distribution generation. -1 means using all processors.
algorithm ({'v1', 'v2'}, default='v1') – Which clustering algorithm to use: - ‘v1’: Uses KSMLocalSearch (PHat-P optimization) - ‘v2’: Uses KSMLocalSearch2 (log-likelihood optimization)

validate(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, min_items: int = 1, n_runs: int = 10, **kwargs) → KFoldResult[source]

Perform k-fold cross-validation on switching matrix.

Parameters:

swm (ndarray of shape (n_items, n_items)) – Product x product switching matrix
n_clusters (int) – Number of clusters
min_items (int, default=1) – Minimum number of items per cluster
n_runs (int, default=10) – Number of SUBMARIT runs for each optimization
**kwargs (dict) – Additional parameters passed to clustering algorithm

Returns:

result – Cross-validation results including average Rand indices and optional empirical distributions

Return type:

KFoldResult

Multiple Runs Validation 

Rand Index Calculation 

Rand index calculations and empirical distributions.

This module implements the Rand index and adjusted Rand index for comparing clustering solutions, following the methodology from RandIndex4.m. It also provides functions for creating empirical distributions of Rand indices.

class submarit.validation.rand_index.RandEmpiricalDistribution(rand_dist: ndarray[tuple[Any, ...], dtype[float64]], adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]], n_points: int, n_items: int, n_clusters: int, min_items: int = 1)[source]

Bases: object

Container for Rand index empirical distribution.

rand_dist

Sorted array of Rand index values

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]]

adj_rand_dist

Sorted array of adjusted Rand index values

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]]

n_points

Number of points in distribution

Type:: int

n_items

Number of items in clusterings

Type:: int

n_clusters

Number of clusters

Type:: int

min_items

Minimum items per cluster constraint

Type:: int

adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]]

calculate_p_values(rand_value: float, adj_rand_value: float) → Dict[str, Tuple[float, float]][source]

Calculate p-values for given Rand index values.

Parameters:

rand_value (float) – Observed Rand index
adj_rand_value (float) – Observed adjusted Rand index

Returns:

Dictionary with p-values: {‘rand’: (upper_p, lower_p), ‘adj_rand’: (upper_p, lower_p)}

Return type:

dict

get_percentiles(percentiles: ndarray[tuple[Any, ...], dtype[float64]] | None = None) → Dict[str, ndarray[tuple[Any, ...], dtype[float64]]][source]

Get percentile values from the distribution.

Parameters:: percentiles (array-like, optional) – Percentiles to compute (0-100). Defaults to standard confidence intervals.
Returns:: Dictionary with ‘rand’ and ‘adj_rand’ percentile values
Return type:: dict

min_items: int = 1

n_clusters: int

n_items: int

n_points: int

rand_dist: ndarray[tuple[Any, ...], dtype[float64]]

class submarit.validation.rand_index.RandIndex[source]

Bases: object

Calculator for Rand index and adjusted Rand index.

The Rand index measures the similarity between two clusterings by considering all pairs of items and checking whether they are grouped consistently.

The adjusted Rand index corrects for chance agreement and is preferred for comparing clusterings with different numbers of clusters.

calculate(clusters1: ndarray[tuple[Any, ...], dtype[int64]], clusters2: ndarray[tuple[Any, ...], dtype[int64]]) → Tuple[float, float][source]

Calculate Rand index and adjusted Rand index.

This implements the methodology from RandIndex4.m, computing both the standard Rand index (Rand, 1971) and the adjusted Rand index (Hubert and Arabie, 1985).

Parameters:

clusters1 (ndarray of shape (n_items,)) – First clustering assignment (1-based cluster labels)
clusters2 (ndarray of shape (n_items,)) – Second clustering assignment (1-based cluster labels)

Returns:

rand (float) – Rand index (0 to 1, where 1 indicates perfect agreement)
adj_rand (float) – Adjusted Rand index (can be negative, 1 indicates perfect agreement)

calculate_detailed(clusters1: ndarray[tuple[Any, ...], dtype[int64]], clusters2: ndarray[tuple[Any, ...], dtype[int64]]) → RandIndexResult[source]

Calculate Rand index with detailed contingency information.

Parameters:

clusters1 (ndarray of shape (n_items,)) – First clustering assignment (1-based cluster labels)
clusters2 (ndarray of shape (n_items,)) – Second clustering assignment (1-based cluster labels)

Returns:

result – Detailed results including contingency table counts

Return type:

RandIndexResult

class submarit.validation.rand_index.RandIndexResult(rand: float, adj_rand: float, a: int, b: int, c: int, d: int, n_items: int, n_pairs: int)[source]

Bases: object

Container for Rand index calculation results.

rand

Rand index value (0 to 1)

Type:: float

adj_rand

Adjusted Rand index value

Type:: float

a

Number of pairs in same cluster in both clusterings

Type:: int

b

Number of pairs in different clusters in both clusterings

Type:: int

c

Number of pairs in same cluster in C1, different in C2

Type:: int

d

Number of pairs in different cluster in C1, same in C2

Type:: int

n_items

Number of items compared

Type:: int

n_pairs

Total number of pairs

Type:: int

a: int

adj_rand: float

b: int

c: int

d: int

n_items: int

n_pairs: int

rand: float

summary() → str[source]: Generate summary of Rand index results.

submarit.validation.rand_index.create_rand_empirical_distribution(n_items: int, n_clusters: int, n_points: int, min_items: int = 1, random_state: int | None = None, n_jobs: int = 1) → RandEmpiricalDistribution[source]

Create empirical distribution for Rand index under random clustering.

This function implements the methodology from RandCreateDist.m, generating random clustering pairs and computing their Rand indices to build an empirical null distribution.

Parameters:

n_items (int) – Number of items to cluster
n_clusters (int) – Number of clusters
n_points (int) – Number of points in empirical distribution
min_items (int, default=1) – Minimum number of items required per cluster
random_state (int, optional) – Random seed for reproducibility
n_jobs (int, default=1) – Number of parallel jobs. -1 means using all processors.

Returns:

distribution – Empirical distribution of Rand indices

Return type:

RandEmpiricalDistribution

submarit.validation.rand_index.rand_empirical_p(rand_dist: ndarray[tuple[Any, ...], dtype[float64]], adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]], rand: float, adj_rand: float, create_confidence: bool = True) → Dict[str, float | Tuple[float, float] | ndarray[tuple[Any, ...], dtype[float64]]][source]

Calculate p-values from empirical distribution.

This implements the functionality from RandEmpiricalP.m, computing p-values and confidence intervals from empirical distributions.

Parameters:

rand_dist (ndarray) – Sorted empirical distribution of Rand indices
adj_rand_dist (ndarray) – Sorted empirical distribution of adjusted Rand indices
rand (float) – Observed Rand index
adj_rand (float) – Observed adjusted Rand index
create_confidence (bool, default=True) – Whether to compute confidence interval values

Returns:

result – Dictionary containing: - ‘rand’: Observed Rand index - ‘adj_rand’: Observed adjusted Rand index - ‘rand_p’: (upper_p, lower_p) tuple for Rand index - ‘adj_rand_p’: (upper_p, lower_p) tuple for adjusted Rand - ‘rand_conf’: Confidence interval values (if requested) - ‘adj_rand_conf’: Adjusted Rand CI values (if requested)

Return type:

dict

P-Value Analysis 

P-value calculation utilities for SUBMARIT validation.

This module provides utilities for calculating p-values from empirical distributions, supporting both one-tailed and two-tailed tests, as well as multiple testing corrections.

class submarit.validation.p_values.PValueResult(observed: float, p_value: float, p_value_corrected: float | None = None, percentile: float | None = None, ci_lower: float | None = None, ci_upper: float | None = None, direction: str = 'greater', n_simulations: int | None = None)[source]

Bases: object

Container for p-value calculation results.

observed

Observed test statistic value

Type:: float

p_value

Raw p-value

Type:: float

p_value_corrected

Corrected p-value (if correction applied)

Type:: float | None

percentile

Percentile of observed value in distribution

Type:: float | None

ci_lower

Lower confidence interval bound

Type:: float | None

ci_upper

Upper confidence interval bound

Type:: float | None

direction

Test direction (‘greater’, ‘less’, or ‘two-sided’)

Type:: str

n_simulations

Number of simulations in empirical distribution

Type:: int | None

ci_lower: float | None = None

ci_upper: float | None = None

direction: str = 'greater'

is_significant(alpha: float = 0.05) → bool[source]

Check if result is statistically significant.

Parameters:: alpha (float, default=0.05) – Significance level
Returns:: True if p-value < alpha
Return type:: bool

n_simulations: int | None = None

observed: float

p_value: float

p_value_corrected: float | None = None

percentile: float | None = None

summary() → str[source]: Generate summary of p-value results.

submarit.validation.p_values.calculate_empirical_p_value(observed: float, null_distribution: ndarray[tuple[Any, ...], dtype[float64]], direction: str = 'greater', return_percentile: bool = True) → PValueResult[source]

Calculate p-value from empirical distribution.

Parameters:

observed (float) – Observed test statistic
null_distribution (ndarray) – Null distribution values (should be sorted)
direction ({'greater', 'less', 'two-sided'}, default='greater') – Direction of the test: - ‘greater’: P(X >= observed) - ‘less’: P(X <= observed) - ‘two-sided’: 2 * min(P(X >= observed), P(X <= observed))
return_percentile (bool, default=True) – Whether to calculate percentile of observed value

Returns:

result – P-value calculation results

Return type:

PValueResult

submarit.validation.p_values.calculate_p_value_range(observed: float, null_distribution: ndarray[tuple[Any, ...], dtype[float64]]) → Tuple[float, float][source]

Calculate upper and lower p-value bounds.

This implements the methodology from MATLAB code where both upper and lower p-values are returned as a tuple.

Parameters:

observed (float) – Observed test statistic
null_distribution (ndarray) – Sorted null distribution values

Returns:

p_range – (upper_p, lower_p) where: - upper_p: Probability of values >= observed - lower_p: Probability of values <= observed

Return type:

tuple

submarit.validation.p_values.combine_p_values(p_values: List[float] | ndarray[tuple[Any, ...], dtype[float64]], method: str = 'fisher', weights: ndarray[tuple[Any, ...], dtype[float64]] | None = None) → Dict[str, float][source]

Combine multiple p-values into a single test statistic.

Parameters:

p_values (array-like) – P-values to combine
method ({'fisher', 'stouffer', 'min', 'max'}, default='fisher') – Combination method: - ‘fisher’: Fisher’s combined probability test - ‘stouffer’: Stouffer’s Z-score method - ‘min’: Minimum p-value - ‘max’: Maximum p-value
weights (array-like, optional) – Weights for weighted combination (only for Stouffer’s method)

Returns:

result – Dictionary containing: - ‘statistic’: Combined test statistic - ‘p_value’: Combined p-value - ‘method’: Combination method used

Return type:

dict

submarit.validation.p_values.multiple_testing_correction(p_values: List[float] | ndarray[tuple[Any, ...], dtype[float64]], method: str = 'bonferroni', alpha: float = 0.05) → Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]

Apply multiple testing correction to p-values.

Parameters:

p_values (array-like) – Raw p-values to correct
method ({'bonferroni', 'holm', 'fdr_bh', 'fdr_by'}, default='bonferroni') – Correction method: - ‘bonferroni’: Bonferroni correction - ‘holm’: Holm-Bonferroni method - ‘fdr_bh’: Benjamini-Hochberg FDR - ‘fdr_by’: Benjamini-Yekutieli FDR
alpha (float, default=0.05) – Family-wise error rate or false discovery rate

Returns:

results – Dictionary containing: - ‘corrected’: Corrected p-values - ‘rejected’: Boolean array of rejected hypotheses - ‘threshold’: Adjusted significance threshold - ‘method’: Correction method used

Return type:

dict

submarit.validation.p_values.permutation_test(data1: ndarray[tuple[Any, ...], dtype[float64]], data2: ndarray[tuple[Any, ...], dtype[float64]], statistic_func: callable, n_permutations: int = 10000, random_state: int | None = None) → PValueResult[source]

Perform permutation test for difference between two groups.

Parameters:

data1 (ndarray) – First group data
data2 (ndarray) – Second group data
statistic_func (callable) – Function to compute test statistic. Should accept two arrays and return a scalar statistic.
n_permutations (int, default=10000) – Number of permutations
random_state (int, optional) – Random seed for reproducibility

Returns:

result – Permutation test results

Return type:

PValueResult

Top-K Analysis 

Top-k clustering solution analysis.

This module implements functionality from RunClustersTopk.m and RunClustersTopk2.m for analyzing agreement among top-k clustering solutions.

class submarit.validation.topk_analysis.TopKResult(best_solutions: List[LocalSearchResult], avg_rand: float, avg_adj_rand: float, rand_p: Tuple[float, float], adj_rand_p: Tuple[float, float], rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None, adj_rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None, pairwise_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None, pairwise_adj_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None, empirical_dist: Dict | None = None)[source]

Bases: object

Container for top-k clustering analysis results.

best_solutions

List of top-k best clustering solutions

Type:: List[submarit.algorithms.local_search.LocalSearchResult]

avg_rand

Average Rand index among top-k solutions

Type:: float

avg_adj_rand

Average adjusted Rand index among top-k solutions

Type:: float

rand_p

P-value for Rand index

Type:: Tuple[float, float]

adj_rand_p

P-value for adjusted Rand index

Type:: Tuple[float, float]

rand_conf

Confidence interval values for Rand index

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

adj_rand_conf

Confidence interval values for adjusted Rand

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

pairwise_rand

Matrix of pairwise Rand indices

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

pairwise_adj_rand

Matrix of pairwise adjusted Rand indices

Type:: numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None

empirical_dist

Empirical distribution used for p-values

Type:: Dict | None

adj_rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None

adj_rand_p: Tuple[float, float]

avg_adj_rand: float

avg_rand: float

best_solutions: List[LocalSearchResult]

empirical_dist: Dict | None = None

pairwise_adj_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None

pairwise_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None

rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None

rand_p: Tuple[float, float]

summary() → str[source]: Generate summary of top-k analysis results.

submarit.validation.topk_analysis.analyze_solution_stability(solutions: List[LocalSearchResult], verbose: bool = False) → Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]

Analyze stability across multiple clustering solutions.

Parameters:

solutions (list of LocalSearchResult) – Clustering solutions to analyze
verbose (bool, default=False) – Whether to print analysis details

Returns:

analysis – Dictionary containing: - ‘consensus_matrix’: Item x item co-clustering frequency - ‘item_stability’: Stability score for each item - ‘cluster_stability’: Stability of each cluster - ‘overall_stability’: Overall stability score

Return type:

dict

submarit.validation.topk_analysis.run_clusters_topk(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, min_items: int, n_runs: int, top_k: int, n_random: int = 10000, algorithm: str = 'v1', random_state: int | None = None, n_jobs: int = 1, verbose: bool = False) → TopKResult[source]

Run multiple clusterings and analyze agreement among top-k solutions.

This function implements the functionality of RunClustersTopk.m and RunClustersTopk2.m, finding the top-k best clustering solutions and analyzing their agreement using Rand indices.

Parameters:

swm (ndarray of shape (n_items, n_items)) – Product x product switching matrix
n_clusters (int) – Number of clusters
min_items (int) – Minimum number of items per cluster
n_runs (int) – Number of experimental runs
top_k (int) – Number of top solutions to return
n_random (int, default=10000) – Number of random values for empirical distributions
algorithm ({'v1', 'v2'}, default='v1') – Which algorithm version to use: - ‘v1’: KSMLocalSearch (PHat-P optimization) - ‘v2’: KSMLocalSearch2 (log-likelihood optimization)
random_state (int, optional) – Random seed for reproducibility
n_jobs (int, default=1) – Number of parallel jobs (-1 for all processors)
verbose (bool, default=False) – Whether to print progress messages

Returns:

result – Analysis results including p-values and confidence intervals

Return type:

TopKResult

IO Module 

Data Input/Output 

General data I/O utilities for SUBMARIT.

submarit.io.data_io.load_substitution_data(filepath: str | Path, format: str | None = None, **kwargs) → SubstitutionMatrix[source]

Load substitution data from various file formats.

Parameters:

filepath – Input file path
format – File format (auto-detected if None)
**kwargs – Additional arguments passed to format-specific loaders

Returns:

SubstitutionMatrix object

submarit.io.data_io.save_results(results: Dict[str, Any], filepath: str | Path, format: str | None = None, **kwargs) → None[source]

Save clustering results to file.

Parameters:

results – Dictionary of results to save
filepath – Output file path
format – Output format (auto-detected if None)
**kwargs – Additional arguments for format-specific savers

Example:

from submarit.io import load_data, save_results

# Load data from various formats
X, names = load_data('products.csv')
X, names = load_data('products.xlsx', sheet_name='Sheet1')
X, names = load_data('products.json')

# Save results
save_results('results.json', {
    'clusters': clusters,
    'metrics': metrics,
    'parameters': params
})

MATLAB Compatibility 

MATLAB file I/O utilities.

submarit.io.matlab_io.convert_mat_to_npz(mat_filepath: str | Path, npz_filepath: str | Path, compressed: bool = True) → None[source]

Convert a .mat file to NumPy .npz format.

Parameters:

mat_filepath – Input .mat file path
npz_filepath – Output .npz file path
compressed – Whether to use compression

submarit.io.matlab_io.load_mat(filepath: str | Path, variable_names: list | None = None, matlab_compatible: bool = True) → Dict[str, Any][source]

Load data from a MATLAB .mat file.

Handles both old-style (< v7.3) and new-style (>= v7.3) .mat files.

Parameters:

filepath – Path to the .mat file
variable_names – List of variable names to load (None = load all)
matlab_compatible – Whether to maintain MATLAB compatibility

Returns:

Dictionary of loaded variables

submarit.io.matlab_io.save_mat(filepath: str | Path, data: Dict[str, Any], format: str = '5', do_compression: bool = False) → None[source]

Save data to a MATLAB .mat file.

Parameters:

filepath – Output file path
data – Dictionary of variables to save
format – MATLAB file format (‘5’ or ‘7.3’)
do_compression – Whether to compress the data

submarit.io.matlab_io.validate_mat_file(filepath: str | Path) → Dict[str, Any][source]

Validate and get information about a .mat file.

Parameters:: filepath – Path to the .mat file
Returns:: Dictionary with file information

Example:

from submarit.io import load_matlab_data, save_matlab_data

# Load MATLAB .mat file
data = load_matlab_data('submarkets.mat')
X = data['features']
S = data['substitution_matrix']

# Save to MATLAB format
save_matlab_data('results.mat', {
    'clusters': clusters,
    'substitution_matrix': S,
    'metrics': metrics
})

Utils Module 

MATLAB Compatibility Utils 

MATLAB compatibility utilities for SUBMARIT.

This module provides functions to ensure compatibility between MATLAB and Python implementations, handling differences in indexing, random number generation, and numerical operations.

class submarit.utils.matlab_compat.IndexConverter(matlab_style: bool = True)[source]

Bases: object

Context manager for automatic index conversion.

convert_in(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) → int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]: Convert indices on input.

convert_out(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) → int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]: Convert indices on output.

exception submarit.utils.matlab_compat.MatlabCompatibilityError[source]

Bases: Exception

Raised when MATLAB compatibility issues occur.

class submarit.utils.matlab_compat.MatlabRandom(seed: int | None = None)[source]

Bases: object

MATLAB-compatible random number generator.

This class provides random number generation that matches MATLAB’s behavior as closely as possible, including the ability to use MATLAB’s random seeds.

rand(*shape: int) → ndarray[tuple[Any, ...], dtype[float64]][source]

Generate uniformly distributed random numbers like MATLAB’s rand.

Parameters:: *shape – Dimensions of the output array
Returns:: Array of random numbers from uniform distribution [0, 1)

randi(imax: int, *shape: int) → ndarray[tuple[Any, ...], dtype[int64]][source]

Generate uniformly distributed random integers like MATLAB’s randi.

Parameters:

imax – Maximum integer value (inclusive)
*shape – Dimensions of the output array

Returns:

Array of random integers from 1 to imax (inclusive, 1-based)

randn(*shape: int) → ndarray[tuple[Any, ...], dtype[float64]][source]

Generate normally distributed random numbers like MATLAB’s randn.

Parameters:: *shape – Dimensions of the output array
Returns:: Array of random numbers from standard normal distribution

randperm(n: int) → ndarray[tuple[Any, ...], dtype[int64]][source]

Generate random permutation like MATLAB’s randperm.

Parameters:: n – Number of elements to permute
Returns:: Random permutation of integers from 1 to n (1-based like MATLAB)

submarit.utils.matlab_compat.ensure_matlab_compatibility(func)[source]

Decorator to ensure MATLAB compatibility for numerical functions.

This decorator: - Ensures float64 precision - Handles index conversion - Manages numerical tolerances

submarit.utils.matlab_compat.matlab_compatible_random(seed: int | None = None) → MatlabRandom[source]

Create a MATLAB-compatible random number generator.

Parameters:: seed – Random seed for reproducibility
Returns:: MatlabRandom instance

Find indices of nonzero elements like MATLAB’s find.

Parameters:: condition – Boolean array or condition
Returns:: 1-based indices of True/nonzero elements

Reshape array using MATLAB column-major order.

Parameters:

array – Input array
*shape – New shape dimensions

Returns:

Reshaped array

Get array dimensions in MATLAB format.

Parameters:: array – Input array
Returns:: Tuple of dimensions (always at least 2D)

Convert MATLAB 1-based indices to Python 0-based indices.

Parameters:: idx – MATLAB index or array of indices (1-based)
Returns:: Python index or array of indices (0-based)
Raises:: MatlabCompatibilityError – If index is less than 1

Convert Python 0-based indices to MATLAB 1-based indices.

Parameters:: idx – Python index or array of indices (0-based)
Returns:: MATLAB index or array of indices (1-based)
Raises:: MatlabCompatibilityError – If index is negative

submarit.utils.matlab_compat.set_default_dtype()[source]: Set NumPy default dtype to float64 to match MATLAB.

Full API Example 

Here’s a complete example using the full API:

import numpy as np
from submarit.core import create_substitution_matrix
from submarit.algorithms import LocalSearch
from submarit.evaluation import ClusterEvaluator, gap_statistic
from submarit.validation import KFoldValidator
from submarit.io import load_data, save_results

# Load data
X, product_names = load_data('products.csv')

# Create substitution matrix
S = create_substitution_matrix(X, metric='cosine')

# Find optimal number of clusters
gaps = []
for k in range(2, 11):
    gap, _ = gap_statistic(S, k, n_bootstrap=20)
    gaps.append(gap)
optimal_k = np.argmax(gaps) + 2

# Perform clustering
clusterer = LocalSearch(n_clusters=optimal_k, n_restarts=20)
clusters = clusterer.fit_predict(S)

# Evaluate results
evaluator = ClusterEvaluator()
metrics = evaluator.evaluate(S, clusters)

# Validate stability
validator = KFoldValidator(n_splits=5)
stability_scores = validator.validate(X, n_clusters=optimal_k)

# Save results
save_results('analysis_results.json', {
    'product_names': product_names,
    'clusters': clusters.tolist(),
    'metrics': metrics,
    'stability': {
        'scores': stability_scores,
        'mean': np.mean(stability_scores),
        'std': np.std(stability_scores)
    },
    'parameters': {
        'n_clusters': optimal_k,
        'metric': 'cosine',
        'algorithm': 'local_search'
    }
})