API Reference
This is the complete API reference for SUBMARIT.
Core Module
Substitution Matrix Creation
Substitution matrix implementation for SUBMARIT.
- class submarit.core.substitution_matrix.SubstitutionMatrix(data: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None = None, normalize: bool = True, check_symmetry: bool = True, tol: float = 1e-10)[source]
Bases:
objectRepresents a product substitution matrix.
This class handles the creation and manipulation of substitution matrices used in submarket identification. The matrix represents substitution patterns between products based on sales or other data.
- create_from_consumer_product_data(consumer_product_data: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str], normalize: bool = True, weight: int = 0, diag: bool = False) Tuple[ndarray[tuple[Any, ...], dtype[int64]], int][source]
Create substitution matrix from consumer-product data.
This method implements the logic from CreateSubstitutionMatrix.m
- Parameters:
consumer_product_data – Consumer × product data matrix
normalize – Whether to normalize rows to sum to 1
weight – 0 = weight by number of consumers, 1 = weight by product sales
diag – Whether to include diagonal self substitution
- Returns:
Tuple of (product_indexes, product_count)
- create_from_sales_data(**kwargs)
Create substitution matrix from sales data time series.
- Parameters:
sales_data – Sales data matrix (products × time periods)
method – Method for computing substitution (‘correlation’, ‘covariance’)
- get_inter_cluster_substitution(labels: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[float64]][source]
Compute inter-cluster substitution matrix.
- Parameters:
labels – Cluster assignments for each product
- Returns:
Matrix of substitution rates between clusters
- get_intra_cluster_substitution(labels: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[float64]][source]
Compute average intra-cluster substitution for each cluster.
- Parameters:
labels – Cluster assignments for each product
- Returns:
Array of average intra-cluster substitution rates
- get_matrix() ndarray[tuple[Any, ...], dtype[float64]][source]
Get the substitution matrix.
- Returns:
The substitution matrix
- get_submatrix(indices: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[float64]][source]
Extract a submatrix for given indices.
- Parameters:
indices – Indices of products to include
- Returns:
Submatrix for the specified products
- is_symmetric(tol: float | None = None) bool[source]
Check if the matrix is symmetric.
- Parameters:
tol – Tolerance for symmetry check
- Returns:
True if symmetric within tolerance
- normalize() None[source]
Normalize the substitution matrix.
Ensures that each row sums to 1 (excluding diagonal).
- set_data(data: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str], normalize: bool = True, check_symmetry: bool = True) None[source]
Set the substitution matrix data.
- Parameters:
data – Input data
normalize – Whether to normalize the matrix
check_symmetry – Whether to check and enforce symmetry
Example usage:
from submarit.core import create_substitution_matrix
import numpy as np
# Create product feature matrix
X = np.random.rand(100, 20) # 100 products, 20 features
# Default Euclidean distance
S = create_substitution_matrix(X)
# Using cosine similarity
S_cosine = create_substitution_matrix(X, metric='cosine')
# With custom parameters
S_custom = create_substitution_matrix(
X,
metric='minkowski',
metric_params={'p': 3},
normalize=True
)
Base Classes
Base classes for SUBMARIT core functionality.
- class submarit.core.base.BaseClusterer(n_clusters: int = 2, random_state: int | None = None)[source]
Bases:
BaseEstimatorBase class for clustering algorithms.
- abstract fit(X: ndarray[tuple[Any, ...], dtype[float64]]) BaseClusterer[source]
Fit the clustering model.
- Parameters:
X – Input data matrix
- Returns:
Self
- class submarit.core.base.BaseEstimator[source]
Bases:
ABCBase class for all estimators in SUBMARIT.
All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments.
- get_params(deep: bool = True) Dict[str, Any][source]
Get parameters for this estimator.
- Parameters:
deep – If True, will return parameters for sub-objects
- Returns:
Parameter names mapped to their values
- set_params(**params) BaseEstimator[source]
Set parameters for this estimator.
- Parameters:
**params – Estimator parameters
- Returns:
Self
- class submarit.core.base.BaseEvaluator[source]
Bases:
BaseEstimatorBase class for cluster evaluation metrics.
- class submarit.core.base.BaseValidator[source]
Bases:
BaseEstimatorBase class for validation methods.
Algorithms Module
Local Search Algorithm
Example:
from submarit.algorithms import LocalSearch
# Initialize with parameters
ls = LocalSearch(
n_clusters=5,
max_iter=100,
n_restarts=10,
tol=1e-4,
random_state=42
)
# Fit and predict
clusters = ls.fit_predict(S)
# Access results
print(f"Final objective value: {ls.objective_}")
print(f"Number of iterations: {ls.n_iter_}")
print(f"Cluster centers: {ls.cluster_centers_}")
Algorithm Parameters
Parameter |
Default |
Description |
|---|---|---|
n_clusters |
Required |
Number of clusters (submarkets) to find |
max_iter |
100 |
Maximum number of iterations per restart |
n_restarts |
10 |
Number of random restarts to avoid local optima |
tol |
1e-4 |
Convergence tolerance |
init |
‘random’ |
Initialization method: ‘random’, ‘k-means++’, or array |
random_state |
None |
Random seed for reproducibility |
Evaluation Module
Cluster Evaluator
- class submarit.evaluation.ClusterEvaluator[source]
Bases:
BaseEvaluatorEvaluates SUBMARIT clustering solutions using multiple criteria.
This class implements the evaluation methodology from kSMEvaluateClustering.m, computing various statistics to assess clustering quality including: - Difference between estimated and theoretical probabilities - Z-value statistics - Log-likelihood values
- evaluate(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) ClusterEvaluationResult[source]
Evaluate a SUBMARIT clustering solution.
- Parameters:
swm – Product x product switching matrix
n_clusters – Number of clusters
cluster_assign – Product cluster assignments (1 to n_clusters)
- Returns:
ClusterEvaluationResult containing all evaluation metrics
- evaluate_legacy(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]
Legacy evaluation function matching kEvaluateClustering.m.
This is a simplified version without some of the advanced statistics.
- Parameters:
swm – Product x product switching matrix
n_clusters – Number of clusters
cluster_assign – Product cluster assignments (1 to n_clusters)
- Returns:
Dictionary containing evaluation metrics
Available metrics:
silhouette: Silhouette coefficient (-1 to 1, higher is better)
calinski_harabasz: Calinski-Harabasz index (higher is better)
davies_bouldin: Davies-Bouldin index (lower is better)
dunn: Dunn index (higher is better)
within_cluster_sum: Within-cluster sum of squares (lower is better)
Example:
from submarit.evaluation import ClusterEvaluator
evaluator = ClusterEvaluator()
# Evaluate all metrics
metrics = evaluator.evaluate(S, clusters)
# Evaluate specific metrics
metrics = evaluator.evaluate(S, clusters,
metrics=['silhouette', 'dunn'])
# Get detailed cluster statistics
stats = evaluator.cluster_statistics(S, clusters)
Gap Statistic
- submarit.evaluation.gap_statistic()
GAP statistic implementation for optimal cluster number selection.
Example:
from submarit.evaluation import gap_statistic
# Calculate gap statistic
gap, std = gap_statistic(S, n_clusters=5, n_bootstrap=50)
# Find optimal number of clusters
gaps = []
for k in range(2, 11):
gap, _ = gap_statistic(S, k, n_bootstrap=20)
gaps.append(gap)
optimal_k = np.argmax(gaps) + 2
Entropy Evaluator
Statistical Tests
Statistical tests and utilities for SUBMARIT evaluation.
- class submarit.evaluation.statistical_tests.StatisticalTests[source]
Bases:
objectStatistical testing utilities for cluster evaluation.
- static bootstrap_confidence_interval(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_assign: ndarray[tuple[Any, ...], dtype[int64]], metric: str = 'z_value', n_bootstrap: int = 1000, confidence: float = 0.95, random_state: int | None = None) Dict[str, float][source]
Calculate bootstrap confidence intervals for evaluation metrics.
- Parameters:
swm – Switching matrix
cluster_assign – Cluster assignments
metric – Metric to analyze
n_bootstrap – Number of bootstrap samples
confidence – Confidence level
random_state – Random seed
- Returns:
Dictionary with confidence intervals and statistics
- static cluster_validity_indices(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_assign: ndarray[tuple[Any, ...], dtype[int64]]) Dict[str, float][source]
Calculate various cluster validity indices.
- Parameters:
swm – Switching matrix
cluster_assign – Cluster assignments
- Returns:
Dictionary with validity indices
- static permutation_test(swm: ndarray[tuple[Any, ...], dtype[float64]], assign1: ndarray[tuple[Any, ...], dtype[int64]], assign2: ndarray[tuple[Any, ...], dtype[int64]], n_permutations: int = 1000, metric: str = 'z_value', random_state: int | None = None) Dict[str, float][source]
Perform permutation test to compare two clustering solutions.
- Parameters:
swm – Switching matrix
assign1 – First cluster assignment
assign2 – Second cluster assignment
n_permutations – Number of permutations
metric – Metric to compare (‘z_value’, ‘diff’, ‘log_lh’)
random_state – Random seed
- Returns:
Dictionary with test statistics and p-value
- static stability_analysis(swm: ndarray[tuple[Any, ...], dtype[float64]], cluster_func, n_clusters: int, n_runs: int = 10, min_items: int = 1) Dict[str, float][source]
Analyze stability of clustering results across multiple runs.
- Parameters:
swm – Switching matrix
cluster_func – Clustering function
n_clusters – Number of clusters
n_runs – Number of runs to perform
min_items – Minimum items per cluster
- Returns:
Dictionary with stability metrics
Visualization
Visualization utilities for evaluation results.
- class submarit.evaluation.visualization.EvaluationVisualizer[source]
Bases:
objectVisualization utilities for SUBMARIT evaluation results.
- create_evaluation_report(results: Dict[str, Any], save_path: str) None[source]
Create a comprehensive evaluation report.
- Parameters:
results – Dictionary containing various evaluation results
save_path – Path to save the report (PDF or image)
- plot_cluster_comparison(results: List[ClusterEvaluationResult], labels: List[str] | None = None, metrics: List[str] | None = None, save_path: str | None = None) None[source]
Compare multiple clustering results.
- Parameters:
results – List of evaluation results to compare
labels – Labels for each result
metrics – Metrics to plot (default: all)
save_path – Optional path to save the plot
- plot_cluster_sizes(result: ClusterEvaluationResult | EntropyClusterResult, save_path: str | None = None) None[source]
Plot cluster size distribution.
- Parameters:
result – Clustering result
save_path – Optional path to save the plot
Example:
from submarit.evaluation.visualization import (
plot_substitution_matrix,
plot_cluster_heatmap,
plot_evaluation_metrics
)
# Plot substitution matrix
fig, ax = plt.subplots(figsize=(10, 8))
plot_substitution_matrix(S, clusters, ax=ax)
# Plot cluster heatmap
plot_cluster_heatmap(S, clusters)
# Plot evaluation metrics
plot_evaluation_metrics(metrics_dict)
Validation Module
K-Fold Cross-Validation
- class submarit.validation.KFoldValidator(n_folds: int = 5, n_random: int = 0, max_fold_run: int | None = None, random_state: int | None = None, n_jobs: int = 1, algorithm: str = 'v1')[source]
Bases:
BaseValidatorK-fold cross-validation for SUBMARIT clustering.
This validator implements the k-fold validation procedure from kSMNFold.m, where data is split into k folds, each fold is held out as test data, and the model is trained on the remaining k-1 folds. Test items are then assigned using constrained clustering.
- Parameters:
n_folds (int, default=5) – Number of folds for cross-validation
n_random (int, default=0) – Number of random permutations for empirical distribution. If 0, no empirical distribution is computed.
max_fold_run (int, optional) – Maximum number of folds to actually run (defaults to n_folds). Useful for testing with subset of folds.
random_state (int, optional) – Random seed for reproducibility
n_jobs (int, default=1) – Number of parallel jobs for empirical distribution generation. -1 means using all processors.
algorithm ({'v1', 'v2'}, default='v1') – Which clustering algorithm to use: - ‘v1’: Uses KSMLocalSearch (PHat-P optimization) - ‘v2’: Uses KSMLocalSearch2 (log-likelihood optimization)
- validate(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, min_items: int = 1, n_runs: int = 10, **kwargs) KFoldResult[source]
Perform k-fold cross-validation on switching matrix.
- Parameters:
swm (ndarray of shape (n_items, n_items)) – Product x product switching matrix
n_clusters (int) – Number of clusters
min_items (int, default=1) – Minimum number of items per cluster
n_runs (int, default=10) – Number of SUBMARIT runs for each optimization
**kwargs (dict) – Additional parameters passed to clustering algorithm
- Returns:
result – Cross-validation results including average Rand indices and optional empirical distributions
- Return type:
KFoldResult
Multiple Runs Validation
Rand Index Calculation
Rand index calculations and empirical distributions.
This module implements the Rand index and adjusted Rand index for comparing clustering solutions, following the methodology from RandIndex4.m. It also provides functions for creating empirical distributions of Rand indices.
- class submarit.validation.rand_index.RandEmpiricalDistribution(rand_dist: ndarray[tuple[Any, ...], dtype[float64]], adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]], n_points: int, n_items: int, n_clusters: int, min_items: int = 1)[source]
Bases:
objectContainer for Rand index empirical distribution.
- rand_dist
Sorted array of Rand index values
- Type:
numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]]
- adj_rand_dist
Sorted array of adjusted Rand index values
- Type:
numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]]
- calculate_p_values(rand_value: float, adj_rand_value: float) Dict[str, Tuple[float, float]][source]
Calculate p-values for given Rand index values.
- get_percentiles(percentiles: ndarray[tuple[Any, ...], dtype[float64]] | None = None) Dict[str, ndarray[tuple[Any, ...], dtype[float64]]][source]
Get percentile values from the distribution.
- Parameters:
percentiles (array-like, optional) – Percentiles to compute (0-100). Defaults to standard confidence intervals.
- Returns:
Dictionary with ‘rand’ and ‘adj_rand’ percentile values
- Return type:
- class submarit.validation.rand_index.RandIndex[source]
Bases:
objectCalculator for Rand index and adjusted Rand index.
The Rand index measures the similarity between two clusterings by considering all pairs of items and checking whether they are grouped consistently.
The adjusted Rand index corrects for chance agreement and is preferred for comparing clusterings with different numbers of clusters.
- calculate(clusters1: ndarray[tuple[Any, ...], dtype[int64]], clusters2: ndarray[tuple[Any, ...], dtype[int64]]) Tuple[float, float][source]
Calculate Rand index and adjusted Rand index.
This implements the methodology from RandIndex4.m, computing both the standard Rand index (Rand, 1971) and the adjusted Rand index (Hubert and Arabie, 1985).
- Parameters:
clusters1 (ndarray of shape (n_items,)) – First clustering assignment (1-based cluster labels)
clusters2 (ndarray of shape (n_items,)) – Second clustering assignment (1-based cluster labels)
- Returns:
rand (float) – Rand index (0 to 1, where 1 indicates perfect agreement)
adj_rand (float) – Adjusted Rand index (can be negative, 1 indicates perfect agreement)
- calculate_detailed(clusters1: ndarray[tuple[Any, ...], dtype[int64]], clusters2: ndarray[tuple[Any, ...], dtype[int64]]) RandIndexResult[source]
Calculate Rand index with detailed contingency information.
- Parameters:
clusters1 (ndarray of shape (n_items,)) – First clustering assignment (1-based cluster labels)
clusters2 (ndarray of shape (n_items,)) – Second clustering assignment (1-based cluster labels)
- Returns:
result – Detailed results including contingency table counts
- Return type:
- class submarit.validation.rand_index.RandIndexResult(rand: float, adj_rand: float, a: int, b: int, c: int, d: int, n_items: int, n_pairs: int)[source]
Bases:
objectContainer for Rand index calculation results.
- submarit.validation.rand_index.create_rand_empirical_distribution(n_items: int, n_clusters: int, n_points: int, min_items: int = 1, random_state: int | None = None, n_jobs: int = 1) RandEmpiricalDistribution[source]
Create empirical distribution for Rand index under random clustering.
This function implements the methodology from RandCreateDist.m, generating random clustering pairs and computing their Rand indices to build an empirical null distribution.
- Parameters:
n_items (int) – Number of items to cluster
n_clusters (int) – Number of clusters
n_points (int) – Number of points in empirical distribution
min_items (int, default=1) – Minimum number of items required per cluster
random_state (int, optional) – Random seed for reproducibility
n_jobs (int, default=1) – Number of parallel jobs. -1 means using all processors.
- Returns:
distribution – Empirical distribution of Rand indices
- Return type:
- submarit.validation.rand_index.rand_empirical_p(rand_dist: ndarray[tuple[Any, ...], dtype[float64]], adj_rand_dist: ndarray[tuple[Any, ...], dtype[float64]], rand: float, adj_rand: float, create_confidence: bool = True) Dict[str, float | Tuple[float, float] | ndarray[tuple[Any, ...], dtype[float64]]][source]
Calculate p-values from empirical distribution.
This implements the functionality from RandEmpiricalP.m, computing p-values and confidence intervals from empirical distributions.
- Parameters:
rand_dist (ndarray) – Sorted empirical distribution of Rand indices
adj_rand_dist (ndarray) – Sorted empirical distribution of adjusted Rand indices
rand (float) – Observed Rand index
adj_rand (float) – Observed adjusted Rand index
create_confidence (bool, default=True) – Whether to compute confidence interval values
- Returns:
result – Dictionary containing: - ‘rand’: Observed Rand index - ‘adj_rand’: Observed adjusted Rand index - ‘rand_p’: (upper_p, lower_p) tuple for Rand index - ‘adj_rand_p’: (upper_p, lower_p) tuple for adjusted Rand - ‘rand_conf’: Confidence interval values (if requested) - ‘adj_rand_conf’: Adjusted Rand CI values (if requested)
- Return type:
P-Value Analysis
P-value calculation utilities for SUBMARIT validation.
This module provides utilities for calculating p-values from empirical distributions, supporting both one-tailed and two-tailed tests, as well as multiple testing corrections.
- class submarit.validation.p_values.PValueResult(observed: float, p_value: float, p_value_corrected: float | None = None, percentile: float | None = None, ci_lower: float | None = None, ci_upper: float | None = None, direction: str = 'greater', n_simulations: int | None = None)[source]
Bases:
objectContainer for p-value calculation results.
- submarit.validation.p_values.calculate_empirical_p_value(observed: float, null_distribution: ndarray[tuple[Any, ...], dtype[float64]], direction: str = 'greater', return_percentile: bool = True) PValueResult[source]
Calculate p-value from empirical distribution.
- Parameters:
observed (float) – Observed test statistic
null_distribution (ndarray) – Null distribution values (should be sorted)
direction ({'greater', 'less', 'two-sided'}, default='greater') – Direction of the test: - ‘greater’: P(X >= observed) - ‘less’: P(X <= observed) - ‘two-sided’: 2 * min(P(X >= observed), P(X <= observed))
return_percentile (bool, default=True) – Whether to calculate percentile of observed value
- Returns:
result – P-value calculation results
- Return type:
- submarit.validation.p_values.calculate_p_value_range(observed: float, null_distribution: ndarray[tuple[Any, ...], dtype[float64]]) Tuple[float, float][source]
Calculate upper and lower p-value bounds.
This implements the methodology from MATLAB code where both upper and lower p-values are returned as a tuple.
- submarit.validation.p_values.combine_p_values(p_values: List[float] | ndarray[tuple[Any, ...], dtype[float64]], method: str = 'fisher', weights: ndarray[tuple[Any, ...], dtype[float64]] | None = None) Dict[str, float][source]
Combine multiple p-values into a single test statistic.
- Parameters:
p_values (array-like) – P-values to combine
method ({'fisher', 'stouffer', 'min', 'max'}, default='fisher') – Combination method: - ‘fisher’: Fisher’s combined probability test - ‘stouffer’: Stouffer’s Z-score method - ‘min’: Minimum p-value - ‘max’: Maximum p-value
weights (array-like, optional) – Weights for weighted combination (only for Stouffer’s method)
- Returns:
result – Dictionary containing: - ‘statistic’: Combined test statistic - ‘p_value’: Combined p-value - ‘method’: Combination method used
- Return type:
- submarit.validation.p_values.multiple_testing_correction(p_values: List[float] | ndarray[tuple[Any, ...], dtype[float64]], method: str = 'bonferroni', alpha: float = 0.05) Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]
Apply multiple testing correction to p-values.
- Parameters:
p_values (array-like) – Raw p-values to correct
method ({'bonferroni', 'holm', 'fdr_bh', 'fdr_by'}, default='bonferroni') – Correction method: - ‘bonferroni’: Bonferroni correction - ‘holm’: Holm-Bonferroni method - ‘fdr_bh’: Benjamini-Hochberg FDR - ‘fdr_by’: Benjamini-Yekutieli FDR
alpha (float, default=0.05) – Family-wise error rate or false discovery rate
- Returns:
results – Dictionary containing: - ‘corrected’: Corrected p-values - ‘rejected’: Boolean array of rejected hypotheses - ‘threshold’: Adjusted significance threshold - ‘method’: Correction method used
- Return type:
- submarit.validation.p_values.permutation_test(data1: ndarray[tuple[Any, ...], dtype[float64]], data2: ndarray[tuple[Any, ...], dtype[float64]], statistic_func: callable, n_permutations: int = 10000, random_state: int | None = None) PValueResult[source]
Perform permutation test for difference between two groups.
- Parameters:
data1 (ndarray) – First group data
data2 (ndarray) – Second group data
statistic_func (callable) – Function to compute test statistic. Should accept two arrays and return a scalar statistic.
n_permutations (int, default=10000) – Number of permutations
random_state (int, optional) – Random seed for reproducibility
- Returns:
result – Permutation test results
- Return type:
Top-K Analysis
Top-k clustering solution analysis.
This module implements functionality from RunClustersTopk.m and RunClustersTopk2.m for analyzing agreement among top-k clustering solutions.
- class submarit.validation.topk_analysis.TopKResult(best_solutions: List[LocalSearchResult], avg_rand: float, avg_adj_rand: float, rand_p: Tuple[float, float], adj_rand_p: Tuple[float, float], rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None, adj_rand_conf: ndarray[tuple[Any, ...], dtype[float64]] | None = None, pairwise_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None, pairwise_adj_rand: ndarray[tuple[Any, ...], dtype[float64]] | None = None, empirical_dist: Dict | None = None)[source]
Bases:
objectContainer for top-k clustering analysis results.
- best_solutions
List of top-k best clustering solutions
- Type:
List[submarit.algorithms.local_search.LocalSearchResult]
- rand_conf
Confidence interval values for Rand index
- Type:
numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None
- adj_rand_conf
Confidence interval values for adjusted Rand
- Type:
numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None
- pairwise_rand
Matrix of pairwise Rand indices
- Type:
numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None
- pairwise_adj_rand
Matrix of pairwise adjusted Rand indices
- Type:
numpy.ndarray[tuple[Any, …], numpy.dtype[numpy.float64]] | None
- empirical_dist
Empirical distribution used for p-values
- Type:
Dict | None
- submarit.validation.topk_analysis.analyze_solution_stability(solutions: List[LocalSearchResult], verbose: bool = False) Dict[str, float | ndarray[tuple[Any, ...], dtype[float64]]][source]
Analyze stability across multiple clustering solutions.
- Parameters:
- Returns:
analysis – Dictionary containing: - ‘consensus_matrix’: Item x item co-clustering frequency - ‘item_stability’: Stability score for each item - ‘cluster_stability’: Stability of each cluster - ‘overall_stability’: Overall stability score
- Return type:
- submarit.validation.topk_analysis.run_clusters_topk(swm: ndarray[tuple[Any, ...], dtype[float64]], n_clusters: int, min_items: int, n_runs: int, top_k: int, n_random: int = 10000, algorithm: str = 'v1', random_state: int | None = None, n_jobs: int = 1, verbose: bool = False) TopKResult[source]
Run multiple clusterings and analyze agreement among top-k solutions.
This function implements the functionality of RunClustersTopk.m and RunClustersTopk2.m, finding the top-k best clustering solutions and analyzing their agreement using Rand indices.
- Parameters:
swm (ndarray of shape (n_items, n_items)) – Product x product switching matrix
n_clusters (int) – Number of clusters
min_items (int) – Minimum number of items per cluster
n_runs (int) – Number of experimental runs
top_k (int) – Number of top solutions to return
n_random (int, default=10000) – Number of random values for empirical distributions
algorithm ({'v1', 'v2'}, default='v1') – Which algorithm version to use: - ‘v1’: KSMLocalSearch (PHat-P optimization) - ‘v2’: KSMLocalSearch2 (log-likelihood optimization)
random_state (int, optional) – Random seed for reproducibility
n_jobs (int, default=1) – Number of parallel jobs (-1 for all processors)
verbose (bool, default=False) – Whether to print progress messages
- Returns:
result – Analysis results including p-values and confidence intervals
- Return type:
IO Module
Data Input/Output
General data I/O utilities for SUBMARIT.
- submarit.io.data_io.load_substitution_data(filepath: str | Path, format: str | None = None, **kwargs) SubstitutionMatrix[source]
Load substitution data from various file formats.
- Parameters:
filepath – Input file path
format – File format (auto-detected if None)
**kwargs – Additional arguments passed to format-specific loaders
- Returns:
SubstitutionMatrix object
- submarit.io.data_io.save_results(results: Dict[str, Any], filepath: str | Path, format: str | None = None, **kwargs) None[source]
Save clustering results to file.
- Parameters:
results – Dictionary of results to save
filepath – Output file path
format – Output format (auto-detected if None)
**kwargs – Additional arguments for format-specific savers
Example:
from submarit.io import load_data, save_results
# Load data from various formats
X, names = load_data('products.csv')
X, names = load_data('products.xlsx', sheet_name='Sheet1')
X, names = load_data('products.json')
# Save results
save_results('results.json', {
'clusters': clusters,
'metrics': metrics,
'parameters': params
})
MATLAB Compatibility
MATLAB file I/O utilities.
- submarit.io.matlab_io.convert_mat_to_npz(mat_filepath: str | Path, npz_filepath: str | Path, compressed: bool = True) None[source]
Convert a .mat file to NumPy .npz format.
- Parameters:
mat_filepath – Input .mat file path
npz_filepath – Output .npz file path
compressed – Whether to use compression
- submarit.io.matlab_io.load_mat(filepath: str | Path, variable_names: list | None = None, matlab_compatible: bool = True) Dict[str, Any][source]
Load data from a MATLAB .mat file.
Handles both old-style (< v7.3) and new-style (>= v7.3) .mat files.
- Parameters:
filepath – Path to the .mat file
variable_names – List of variable names to load (None = load all)
matlab_compatible – Whether to maintain MATLAB compatibility
- Returns:
Dictionary of loaded variables
- submarit.io.matlab_io.save_mat(filepath: str | Path, data: Dict[str, Any], format: str = '5', do_compression: bool = False) None[source]
Save data to a MATLAB .mat file.
- Parameters:
filepath – Output file path
data – Dictionary of variables to save
format – MATLAB file format (‘5’ or ‘7.3’)
do_compression – Whether to compress the data
- submarit.io.matlab_io.validate_mat_file(filepath: str | Path) Dict[str, Any][source]
Validate and get information about a .mat file.
- Parameters:
filepath – Path to the .mat file
- Returns:
Dictionary with file information
Example:
from submarit.io import load_matlab_data, save_matlab_data
# Load MATLAB .mat file
data = load_matlab_data('submarkets.mat')
X = data['features']
S = data['substitution_matrix']
# Save to MATLAB format
save_matlab_data('results.mat', {
'clusters': clusters,
'substitution_matrix': S,
'metrics': metrics
})
Utils Module
MATLAB Compatibility Utils
MATLAB compatibility utilities for SUBMARIT.
This module provides functions to ensure compatibility between MATLAB and Python implementations, handling differences in indexing, random number generation, and numerical operations.
- class submarit.utils.matlab_compat.IndexConverter(matlab_style: bool = True)[source]
Bases:
objectContext manager for automatic index conversion.
- exception submarit.utils.matlab_compat.MatlabCompatibilityError[source]
Bases:
ExceptionRaised when MATLAB compatibility issues occur.
- class submarit.utils.matlab_compat.MatlabRandom(seed: int | None = None)[source]
Bases:
objectMATLAB-compatible random number generator.
This class provides random number generation that matches MATLAB’s behavior as closely as possible, including the ability to use MATLAB’s random seeds.
- rand(*shape: int) ndarray[tuple[Any, ...], dtype[float64]][source]
Generate uniformly distributed random numbers like MATLAB’s rand.
- Parameters:
*shape – Dimensions of the output array
- Returns:
Array of random numbers from uniform distribution [0, 1)
- randi(imax: int, *shape: int) ndarray[tuple[Any, ...], dtype[int64]][source]
Generate uniformly distributed random integers like MATLAB’s randi.
- Parameters:
imax – Maximum integer value (inclusive)
*shape – Dimensions of the output array
- Returns:
Array of random integers from 1 to imax (inclusive, 1-based)
- submarit.utils.matlab_compat.ensure_matlab_compatibility(func)[source]
Decorator to ensure MATLAB compatibility for numerical functions.
This decorator: - Ensures float64 precision - Handles index conversion - Manages numerical tolerances
- submarit.utils.matlab_compat.matlab_compatible_random(seed: int | None = None) MatlabRandom[source]
Create a MATLAB-compatible random number generator.
- Parameters:
seed – Random seed for reproducibility
- Returns:
MatlabRandom instance
- submarit.utils.matlab_compat.matlab_find(condition: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) ndarray[tuple[Any, ...], dtype[int64]][source]
Find indices of nonzero elements like MATLAB’s find.
- Parameters:
condition – Boolean array or condition
- Returns:
1-based indices of True/nonzero elements
- submarit.utils.matlab_compat.matlab_reshape(array: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str], *shape: int) ndarray[tuple[Any, ...], dtype[_ScalarT]][source]
Reshape array using MATLAB column-major order.
- Parameters:
array – Input array
*shape – New shape dimensions
- Returns:
Reshaped array
- submarit.utils.matlab_compat.matlab_size(array: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) Tuple[int, ...][source]
Get array dimensions in MATLAB format.
- Parameters:
array – Input array
- Returns:
Tuple of dimensions (always at least 2D)
- submarit.utils.matlab_compat.matlab_to_python_index(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]
Convert MATLAB 1-based indices to Python 0-based indices.
- Parameters:
idx – MATLAB index or array of indices (1-based)
- Returns:
Python index or array of indices (0-based)
- Raises:
MatlabCompatibilityError – If index is less than 1
- submarit.utils.matlab_compat.python_to_matlab_index(idx: int | _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) int | ndarray[tuple[Any, ...], dtype[_ScalarT]][source]
Convert Python 0-based indices to MATLAB 1-based indices.
- Parameters:
idx – Python index or array of indices (0-based)
- Returns:
MATLAB index or array of indices (1-based)
- Raises:
MatlabCompatibilityError – If index is negative
Full API Example
Here’s a complete example using the full API:
import numpy as np
from submarit.core import create_substitution_matrix
from submarit.algorithms import LocalSearch
from submarit.evaluation import ClusterEvaluator, gap_statistic
from submarit.validation import KFoldValidator
from submarit.io import load_data, save_results
# Load data
X, product_names = load_data('products.csv')
# Create substitution matrix
S = create_substitution_matrix(X, metric='cosine')
# Find optimal number of clusters
gaps = []
for k in range(2, 11):
gap, _ = gap_statistic(S, k, n_bootstrap=20)
gaps.append(gap)
optimal_k = np.argmax(gaps) + 2
# Perform clustering
clusterer = LocalSearch(n_clusters=optimal_k, n_restarts=20)
clusters = clusterer.fit_predict(S)
# Evaluate results
evaluator = ClusterEvaluator()
metrics = evaluator.evaluate(S, clusters)
# Validate stability
validator = KFoldValidator(n_splits=5)
stability_scores = validator.validate(X, n_clusters=optimal_k)
# Save results
save_results('analysis_results.json', {
'product_names': product_names,
'clusters': clusters.tolist(),
'metrics': metrics,
'stability': {
'scores': stability_scores,
'mean': np.mean(stability_scores),
'std': np.std(stability_scores)
},
'parameters': {
'n_clusters': optimal_k,
'metric': 'cosine',
'algorithm': 'local_search'
}
})