Frequently Asked Questions (FAQ)

General Questions

What is SUBMARIT?

SUBMARIT (SUBMARket Identification and Testing) is a Python package for identifying and analyzing submarkets based on product substitution patterns. It helps businesses and researchers understand market structure by clustering products based on how substitutable they are for each other.

How does SUBMARIT differ from standard clustering?

While standard clustering groups similar items, SUBMARIT specifically focuses on substitution relationships. Two products might be different in features but highly substitutable (e.g., butter and margarine), which SUBMARIT captures through specialized algorithms and evaluation metrics.

What data do I need?

You need a matrix where: - Each row represents a product - Each column represents a feature or attribute - Values can be numeric (prices, quantities) or encoded categorical data (brand, category)

Example data structure:

# Products x Features matrix
# Columns: [price, size, brand_encoded, category_encoded, ...]
X = np.array([
    [2.99, 16, 0, 1, ...],  # Product 1
    [3.49, 12, 1, 1, ...],  # Product 2
    ...
])

Installation Issues

ImportError: No module named ‘submarit’

Solution 1: Ensure you’re in the correct environment:

# Check if installed
pip list | grep submarit

# If using conda
conda list submarit

Solution 2: Reinstall:

pip uninstall submarit
pip install submarit

Cannot build wheel for numpy/scipy

Solution: Install pre-built wheels:

# Windows
pip install --only-binary :all: numpy scipy

# Or use conda
conda install numpy scipy

MATLAB Engine not found

Solution: Install MATLAB Engine API for Python:

cd "C:\Program Files\MATLAB\R2023b\extern\engines\python"  # Windows
cd "/Applications/MATLAB_R2023b.app/extern/engines/python"  # macOS
python setup.py install

Algorithm Questions

How do I choose the number of clusters?

Use multiple methods to find optimal k:

from submarit.evaluation import gap_statistic, elbow_method
from submarit.algorithms import LocalSearch

# Method 1: Gap statistic
gaps = []
for k in range(2, 11):
    gap, std = gap_statistic(S, k, n_bootstrap=50)
    gaps.append(gap)
optimal_k = np.argmax(gaps) + 2

# Method 2: Elbow method
scores = []
for k in range(2, 11):
    ls = LocalSearch(n_clusters=k)
    ls.fit(S)
    scores.append(ls.objective_)

# Plot and look for "elbow"
plt.plot(range(2, 11), scores)
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum')

Local Search doesn’t converge

Solution 1: Increase iterations and tolerance:

ls = LocalSearch(
    n_clusters=5,
    max_iter=500,      # Increase from default 100
    tol=1e-6,          # Decrease from default 1e-4
    n_restarts=20      # More random restarts
)

Solution 2: Check data scale:

# Normalize features
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
S = create_substitution_matrix(X_scaled)

Results are unstable between runs

Solution: Set random seed for reproducibility:

# Set global seed
np.random.seed(42)

# Or use random_state parameter
ls = LocalSearch(n_clusters=5, random_state=42)

# For complete reproducibility
import random
random.seed(42)
np.random.seed(42)

# If using parallel processing
import os
os.environ['PYTHONHASHSEED'] = '42'

Performance Issues

Out of memory with large datasets

Solution 1: Use sparse matrices:

from submarit.core import create_sparse_substitution_matrix

S_sparse = create_sparse_substitution_matrix(
    X,
    threshold=0.1,  # Keep only top 10% of values
    format='csr'
)

Solution 2: Process in chunks:

# Mini-batch processing
from submarit.algorithms import MiniBatchLocalSearch

mbls = MiniBatchLocalSearch(
    n_clusters=10,
    batch_size=1000
)

Solution 3: Use float32 instead of float64:

X = X.astype(np.float32)
S = create_substitution_matrix(X, dtype=np.float32)

Slow computation

Solution 1: Enable parallel processing:

ls = LocalSearch(n_clusters=5, n_jobs=-1)  # Use all cores

Solution 2: Use approximate methods:

from submarit.algorithms import ApproximateLocalSearch

als = ApproximateLocalSearch(
    n_clusters=5,
    approximation='sample',
    sample_size=5000
)

Solution 3: Profile to find bottlenecks:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Your code here
clusters = ls.fit_predict(S)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)

Numerical Differences

Results differ from MATLAB version

Common causes and solutions:

  1. Different random seeds:

    # MATLAB: rng(42)
    # Python:
    np.random.seed(42)
    
  2. Indexing differences (0 vs 1-based):

    # MATLAB: clusters (1-indexed)
    # Python: clusters - 1 (0-indexed)
    matlab_clusters = python_clusters + 1
    
  3. Numerical precision:

    # Use same tolerance
    ls = LocalSearch(n_clusters=5, tol=1e-6)
    
    # Compare with tolerance
    np.testing.assert_allclose(
        python_result,
        matlab_result,
        rtol=1e-5,
        atol=1e-8
    )
    
  4. Algorithm initialization:

    # Ensure same initialization
    init_clusters = load_matlab_initialization()
    ls = LocalSearch(n_clusters=5, init=init_clusters)
    

Small numerical differences in results

This is normal due to: - Floating-point arithmetic differences - Different BLAS/LAPACK implementations - Compiler optimizations

To minimize differences:

# Use higher precision
X = X.astype(np.float64)

# Disable fast math optimizations
os.environ['MKL_CBWR'] = 'COMPATIBLE'

# Use same linear algebra backend
import scipy.linalg
scipy.linalg.use_solver = 'gesv'

Visualization Questions

How to visualize high-dimensional clusters?

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Method 1: PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap='viridis')
plt.title('Clusters in PCA space')

# Method 2: t-SNE (better for visualization)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=clusters, cmap='viridis')
plt.title('Clusters in t-SNE space')

How to create publication-quality plots?

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-paper')
sns.set_palette("husl")

# High DPI for publications
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.size'] = 12

# Create figure
fig, ax = plt.subplots(figsize=(8, 6))

# Plot with proper labels
plot_substitution_matrix(S, clusters, ax=ax)
ax.set_xlabel('Product Index', fontsize=14)
ax.set_ylabel('Product Index', fontsize=14)
ax.set_title('Product Substitution Patterns', fontsize=16)

# Save
plt.tight_layout()
plt.savefig('submarkets.pdf', bbox_inches='tight')

Best Practices

Data Preprocessing

Always preprocess your data:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Or normalize to [0, 1]
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Handle missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_complete = imputer.fit_transform(X)

Feature Engineering

Create meaningful features:

# Interaction features
X_interactions = np.column_stack([
    X,
    X[:, 0] * X[:, 1],  # Price × Size
    X[:, 2] / X[:, 0],  # Brand premium
])

# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

Validation Strategy

Always validate your results:

from submarit.validation import cross_validate_clustering

# Multiple validation methods
validation_results = {
    'stability': stability_test(X, n_clusters=5),
    'bootstrap': bootstrap_validation(X, n_clusters=5),
    'noise': noise_injection_test(X, n_clusters=5),
    'holdout': holdout_validation(X, n_clusters=5)
}

# Report
for method, score in validation_results.items():
    print(f"{method}: {score:.3f}")

Getting Help

Where to get help?

  1. Documentation: Read the full documentation at https://submarit.readthedocs.io

  2. GitHub Issues: Report bugs or request features at https://github.com/m-marinucci/SUBMARIT/issues

  3. Stack Overflow: Ask questions with tag ‘submarit’

  4. Email: Contact maintainers at submarit@example.com

How to report a bug?

Include: 1. Minimal reproducible example 2. Full error traceback 3. Environment information:

import submarit
import sys
import numpy as np
import scipy

print(f"Python: {sys.version}")
print(f"SUBMARIT: {submarit.__version__}")
print(f"NumPy: {np.__version__}")
print(f"SciPy: {scipy.__version__}")

How to contribute?

  1. Fork the repository

  2. Create a feature branch

  3. Add tests for new functionality

  4. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.