Performance Tuning Guide

This guide helps you optimize SUBMARIT for speed and memory efficiency with large datasets.

Performance Considerations

Dataset Size Guidelines

Recommended Approaches by Dataset Size

Products            Memory Required   Recommended Approach               Expected Time
------------------  ----------------  ---------------------------------  -------------
< 1,000             < 1 GB            Standard dense matrix              < 1 minute
1,000 - 10,000      1-8 GB            Dense matrix with optimization     1-10 minutes
10,000 - 100,000    8-80 GB           Sparse matrix, mini-batch          10-60 minutes
> 100,000           > 80 GB           Distributed, approximate methods   > 1 hour
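
As a rough cross-check on the memory column, the storage for a dense n x n matrix is n squared times the element size. The sketch below computes that lower bound. `dense_matrix_gb` is an illustrative helper name, not part of SUBMARIT, and real peak usage will be higher once intermediate arrays and algorithm state are counted.

```python
import numpy as np

def dense_matrix_gb(n_products, dtype=np.float64):
    """Storage for a dense n x n matrix in GB (1 GB = 1e9 bytes)."""
    itemsize = np.dtype(dtype).itemsize
    return n_products ** 2 * itemsize / 1e9

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} products: "
          f"{dense_matrix_gb(n):8.3f} GB (float64), "
          f"{dense_matrix_gb(n, np.float32):8.3f} GB (float32)")
```

At 100,000 products the float64 matrix alone needs 80 GB, which is where the table switches to distributed and approximate methods. Storing float32 instead halves the figure.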

Memory Optimization

Sparse Matrices

For datasets with many zero or near-zero substitution values:

from scipy.sparse import csr_matrix
from submarit.core import create_sparse_substitution_matrix

# Create sparse substitution matrix
S_sparse = create_sparse_substitution_matrix(
    X,
    threshold=0.1,  # Keep only top 10% of connections
    metric='cosine',
    format='csr'  # Compressed sparse row format
)

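# A CSR matrix stores values, column indices, and row pointers; count all three
nbytes = S_sparse.data.nbytes + S_sparse.indices.nbytes + S_sparse.indptr.nbytes
print(f"Memory usage: {nbytes / 1e6:.2f} MB")
print(f"Sparsity: {1 - S_sparse.nnz / (S_sparse.shape[0]**2):.2%}")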
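
To see where the savings come from without relying on SUBMARIT's helper, the sketch below thresholds a dense similarity matrix with plain NumPy and SciPy. The cosine similarity, the 0.8 cutoff, and the names `sparsify` and `csr_nbytes` are assumptions made for illustration; `create_sparse_substitution_matrix` may apply a different rule.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsify(S, cutoff):
    """Zero out entries below `cutoff` and return the result in CSR format."""
    S_kept = np.where(S >= cutoff, S, 0.0)
    return csr_matrix(S_kept)

def csr_nbytes(M):
    """Total bytes held by a CSR matrix: values + column indices + row pointers."""
    return M.data.nbytes + M.indices.nbytes + M.indptr.nbytes

rng = np.random.default_rng(0)
X = rng.random((500, 20))

# Cosine similarity between rows (illustrative substitution measure)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
S = X_unit @ X_unit.T

S_sparse = sparsify(S, cutoff=0.8)
print(f"Dense:  {S.nbytes / 1e6:.2f} MB")
print(f"Sparse: {csr_nbytes(S_sparse) / 1e6:.2f} MB")
print(f"Sparsity: {1 - S_sparse.nnz / S.size:.2%}")
```

Because CSR keeps an index next to every stored value, it only pays off when a large share of entries is dropped. With float64 values and 32-bit indices each stored entry costs about 12 bytes against 8 bytes in the dense layout.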

Memory-Mapped Arrays

For datasets too large for memory:

import numpy as np
from submarit.core import create_mmap_substitution_matrix

# Create memory-mapped substitution matrix
S_mmap = create_mmap_substitution_matrix(
    X,
    output_file='substitution_matrix.dat',
    dtype=np.float32,  # Use float32 to save space
    chunks=1000  # Process in chunks
)
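
The same idea can be sketched with NumPy's own `np.memmap`, which backs an array with a file so that only the block being written sits in RAM. The dot-product similarity and the function name `similarity_to_memmap` are placeholders chosen for this example and are not SUBMARIT's implementation.

```python
import os
import tempfile
import numpy as np

def similarity_to_memmap(X, path, chunk_rows=1000):
    """Write the n x n dot-product similarity of X to disk, one row block at a time."""
    n = X.shape[0]
    S = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, n))
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Only this (chunk_rows x n) block is computed in RAM at a time
        S[start:stop] = X[start:stop] @ X.T
    S.flush()  # make sure everything is written to disk
    return S

rng = np.random.default_rng(0)
X = rng.random((2000, 16)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "substitution_matrix.dat")
S_mmap = similarity_to_memmap(X, path, chunk_rows=500)

# Reopen read-only later; the shape and dtype must be supplied again
S_ro = np.memmap(path, dtype=np.float32, mode="r", shape=(2000, 2000))
print(S_ro[0, :5])
```

The file holds raw values only, so the shape and dtype have to be recorded somewhere (for example in the file name or a small sidecar file) to reopen it correctly.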

Chunked Processing

Process large matrices in chunks:

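import numpy as np
from submarit.utils import chunked_substitution_matrix  # library helper; not used in the manual version below

def process_in_chunks(X, compute_chunk, chunk_size=5000):
    """Fill the n x n matrix tile by tile.

    compute_chunk(A, B) is supplied by the caller and must return the
    (len(A), len(B)) block of substitution values between rows of A and B.
    """
    n = len(X)
    # The result itself is still dense; chunking only bounds the temporaries
    S = np.zeros((n, n), dtype=np.float32)

    for i in range(0, n, chunk_size):
        chunk_i = X[i:i+chunk_size]
        for j in range(0, n, chunk_size):
            chunk_j = X[j:j+chunk_size]
            S[i:i+chunk_size, j:j+chunk_size] = compute_chunk(chunk_i, chunk_j)

    return S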
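
When the substitution measure is symmetric, roughly half of the tiles can be skipped by mirroring each off-diagonal tile. The sketch below assumes cosine similarity as the per-tile computation and assumes no row of X is all zeros; `cosine_block` and `chunked_symmetric` are illustrative names, not SUBMARIT functions.

```python
import numpy as np

def cosine_block(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

def chunked_symmetric(X, block_fn, chunk_size=5000):
    """Build a symmetric n x n matrix from tiles, computing only tiles with j >= i."""
    n = len(X)
    S = np.empty((n, n), dtype=np.float32)
    for i in range(0, n, chunk_size):
        for j in range(i, n, chunk_size):
            tile = block_fn(X[i:i + chunk_size], X[j:j + chunk_size])
            S[i:i + chunk_size, j:j + chunk_size] = tile
            if j != i:
                # Mirror the tile instead of recomputing it
                S[j:j + chunk_size, i:i + chunk_size] = tile.T
    return S

rng = np.random.default_rng(0)
X = rng.random((1200, 8))
S = chunked_symmetric(X, cosine_block, chunk_size=500)
print(S.shape, np.allclose(S, S.T))
```

The chunk size trades temporary memory against loop overhead: each tile needs about chunk_size squared values, so it should be chosen to fit comfortably in RAM alongside the output.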

Speed Optimization

Parallel Processing

Leverage multiple CPU cores:

from submarit.algorithms import LocalSearch
from joblib import Parallel, delayed
import multiprocessing

# Use all available cores
n_cores = multiprocessing.cpu_count()

# Parallel clustering with different random seeds
ls = LocalSearch(
    n_clusters=10,
    n_restarts=20,
    n_jobs=n_cores  # Parallel restarts
)

# Parallel substitution matrix computation
from submarit.core import parallel_substitution_matrix

S = parallel_substitution_matrix(X, n_jobs=n_cores, batch_size=100)
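
This guide does not show how `parallel_substitution_matrix` splits its work, so the following is only a stand-in that illustrates the `n_jobs` and `batch_size` idea with the standard library. It computes a dot-product similarity in row batches on a thread pool; the function name `parallel_similarity` and the choice of similarity are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_similarity(X, n_jobs=None, batch_size=100):
    """Compute X @ X.T in row batches spread over a thread pool."""
    n_jobs = n_jobs or os.cpu_count() or 1
    starts = range(0, X.shape[0], batch_size)

    def one_batch(start):
        # NumPy releases the GIL inside the matrix product, so threads can overlap
        return X[start:start + batch_size] @ X.T

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        blocks = list(pool.map(one_batch, starts))  # map preserves batch order
    return np.vstack(blocks)

rng = np.random.default_rng(0)
X = rng.random((1000, 32))
S = parallel_similarity(X, n_jobs=4, batch_size=100)
print(S.shape)
```

Many BLAS builds are already multithreaded, so adding workers on top can oversubscribe the CPU. It is worth timing a few `n_jobs` values rather than assuming that more is faster.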

Vectorization

Use NumPy’s vectorized operations:

import numpy as np

# Slow: Python loops
def slow_distance(X):
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.linalg.norm(X[i] - X[j])
    return D

# Fast: Vectorized
def fast_distance(X):
    # Broadcasting builds an (n, n, d) intermediate, so memory grows as n^2 * d
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return np.linalg.norm(diff, axis=2)

# Faster: Use scipy (no (n, n, d) intermediate)
from scipy.spatial.distance import cdist
D = cdist(X, X, metric='euclidean')
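
If SciPy is not available, the large intermediate array of the broadcasting version can be avoided in pure NumPy by expanding the squared distance as ||x||^2 + ||y||^2 - 2 x.y. The sketch below is one way to write that; `gram_distance` is an illustrative name, and the formulation can lose some precision for points that are nearly identical.

```python
import numpy as np

def gram_distance(X):
    """Pairwise Euclidean distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    sq_norms = np.einsum("ij,ij->i", X, X)      # ||x_i||^2 for each row
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)     # clip tiny negatives from round-off
    np.fill_diagonal(sq_dists, 0.0)             # self-distance is exactly zero
    return np.sqrt(sq_dists)

rng = np.random.default_rng(0)
X = rng.random((300, 5))
D = gram_distance(X)
print(D.shape)
```

The peak temporary here is a few n x n arrays instead of one n x n x d array, and most of the work is a single matrix product that BLAS handles efficiently.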

Numba JIT Compilation

Speed up custom functions:

from numba import njit

# compute_cost(S, i, clusters, k) is assumed to be an @njit-compiled helper
# defined elsewhere; nopython mode cannot call plain Python functions.

@njit
def fast_local_search_update(S, clusters, n_clusters):
    n = len(clusters)
    changed = True

    while changed:
        changed = False
        # Sequential on purpose: every move rewrites `clusters`, which later
        # cost evaluations read, so running this loop with prange would race.
        for i in range(n):
            best_cluster = clusters[i]
            best_cost = compute_cost(S, i, clusters, best_cluster)

            for k in range(n_clusters):
                if k != clusters[i]:
                    cost = compute_cost(S, i, clusters, k)
                    if cost < best_cost:
                        best_cost = cost
                        best_cluster = k

            if best_cluster != clusters[i]:
                clusters[i] = best_cluster
                changed = True

    return clusters

Algorithm-Specific Optimizations

Local Search Optimizations

from submarit.algorithms import OptimizedLocalSearch

# Use optimized implementation
ols = OptimizedLocalSearch(
    n_clusters=10,
    max_iter=100,
    tol=1e-4,
    early_stopping=True,  # Stop when improvement is minimal
    cache_distances=True,  # Cache frequently accessed distances
    use_triangle_inequality=True  # Skip unnecessary distance calculations
)

# Mini-batch version for large datasets
from submarit.algorithms import MiniBatchLocalSearch

mbls = MiniBatchLocalSearch(
    n_clusters=10,
    batch_size=1000,
    n_init=3,
    max_no_improvement=10
)

Approximate Methods

For very large datasets, use approximations:

from submarit.algorithms import ApproximateLocalSearch

# Use locality-sensitive hashing
als = ApproximateLocalSearch(
    n_clusters=10,
    approximation='lsh',
    n_hash_functions=10,
    accuracy=0.9  # 90% accuracy vs exact method
)

# Use random sampling
als_sample = ApproximateLocalSearch(
    n_clusters=10,
    approximation='sample',
    sample_size=10000,  # Work with subset
    n_iterations=5  # Refine with full data
)

Profiling and Benchmarking

Profile Your Code

import cProfile
import pstats
from submarit.algorithms import LocalSearch
from submarit.utils import Timer

# Simple timing
with Timer() as t:
    S = create_substitution_matrix(X)
print(f"Matrix creation took {t.elapsed:.2f} seconds")

# Detailed profiling
profiler = cProfile.Profile()
profiler.enable()

clusters = LocalSearch(n_clusters=5).fit_predict(S)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 time-consuming functions
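
For scripts that should not depend on `submarit.utils`, a timing context manager with the same `elapsed` attribute is easy to write with the standard library. The class below is a minimal stand-in and makes no claim about how SUBMARIT's `Timer` is implemented.

```python
import time

class Timer:
    """Context manager that records wall-clock time in `elapsed` (seconds)."""

    def __enter__(self):
        self._start = time.perf_counter()
        self.elapsed = 0.0
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions

with Timer() as t:
    total = sum(i * i for i in range(100_000))
print(f"Loop took {t.elapsed:.4f} seconds")
```

`time.perf_counter` is used because it is monotonic and has the highest available resolution, which matters for short measurements.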

Memory Profiling

from memory_profiler import profile

@profile
def memory_intensive_function(X):
    S = create_substitution_matrix(X)
    ls = LocalSearch(n_clusters=10)
    clusters = ls.fit_predict(S)
    return clusters

# Run with: python -m memory_profiler your_script.py
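
When installing `memory_profiler` is not an option, the standard library's `tracemalloc` gives a peak figure for a single call. The helper below is a sketch with an illustrative name (`peak_memory_mb`) and a stand-in workload. It counts allocations made through Python's allocators, which includes NumPy arrays in current NumPy releases but not memory that a C extension allocates on its own.

```python
import tracemalloc

def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak MB allocated while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6

def build_table(n):
    # Stand-in workload: an n x n list of floats
    return [[float(i * j) for j in range(n)] for i in range(n)]

table, peak_mb = peak_memory_mb(build_table, 300)
print(f"Peak traced memory: {peak_mb:.1f} MB")
```

Tracing slows execution noticeably, so this is for diagnosis rather than for timing runs.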

Benchmarking Suite

from submarit.benchmarks import run_benchmark

# Benchmark different configurations
results = run_benchmark(
    dataset_sizes=[100, 1000, 10000],
    n_clusters_list=[5, 10, 20],
    algorithms=['local_search', 'kmeans', 'hierarchical'],
    metrics=['time', 'memory', 'quality']
)

# Plot results
from submarit.benchmarks import plot_benchmark_results
plot_benchmark_results(results)

Best Practices

  1. Data Preprocessing

    # Remove near-constant features first. After standardization every
    # non-constant feature has unit variance, so the threshold would do nothing.
    from sklearn.feature_selection import VarianceThreshold
    selector = VarianceThreshold(threshold=0.01)
    X_reduced = selector.fit_transform(X)
    
    # Normalize the remaining features for faster convergence
    from sklearn.preprocessing import StandardScaler
    X_scaled = StandardScaler().fit_transform(X_reduced)
    
  2. Caching Results

    import joblib
    
    # Cache substitution matrix
    try:
        S = joblib.load('substitution_matrix.pkl')
    except FileNotFoundError:
        S = create_substitution_matrix(X)
        joblib.dump(S, 'substitution_matrix.pkl')
    
  3. Progressive Refinement

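    import numpy as np
    from submarit.algorithms import LocalSearch
    
    # Start with coarse clustering, then refine.
    # S is the square substitution matrix, as in the other examples.
    def progressive_clustering(S, final_k=50):
        # Stage 1: Coarse clustering
        coarse_k = 10
        coarse_clusters = LocalSearch(n_clusters=coarse_k).fit_predict(S)
    
        # Stage 2: Refine each coarse cluster
        sub_k = final_k // coarse_k
        final_clusters = np.zeros(len(S), dtype=int)
        offset = 0
    
        for i in range(coarse_k):
            mask = coarse_clusters == i
    
            if mask.sum() > sub_k:
                # Take matching rows and columns so the sub-matrix stays square
                S_subset = S[np.ix_(mask, mask)]
                sub_clusters = LocalSearch(n_clusters=sub_k).fit_predict(S_subset)
                final_clusters[mask] = sub_clusters + offset
                offset += sub_k
            else:
                # Too few products to split further; keep as a single cluster
                final_clusters[mask] = offset
                offset += 1
    
        # The total can be smaller than final_k when some coarse clusters are not split
        return final_clusters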
    

Hardware Considerations

CPU Optimization

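  • Use Intel MKL for optimized linear algebra: conda install mkl

  • Set the number of OpenMP threads: export OMP_NUM_THREADS=8 (this controls the thread count, not affinity; pinning threads to cores is a separate setting such as OMP_PROC_BIND)

  • Consider disabling hyperthreading for compute-intensive tasks, and benchmark before and after, since the effect depends on the workload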
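
The thread count can also be set from inside a Python script, provided it happens before NumPy or SciPy are first imported. The sketch below sets the OpenMP variable together with the MKL and OpenBLAS equivalents. The value 8 is only an example and should match the cores you actually want to use.

```python
import os

# Must run before NumPy/SciPy are first imported: the BLAS/OpenMP runtimes
# read these variables when they are loaded.
N_THREADS = "8"
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
for var in THREAD_VARS:
    os.environ[var] = N_THREADS

import numpy as np  # imported after the environment is configured

print({var: os.environ[var] for var in THREAD_VARS})
```

Setting all three covers the common BLAS backends without having to know which one a given NumPy build links against.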

GPU Acceleration

For extremely large datasets:

# Using CuPy for GPU arrays
import cupy as cp

def gpu_distance_matrix(X):
    X_gpu = cp.asarray(X)
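    # Compute pairwise distances on GPU.
    # Note: the broadcast below allocates an (n, n, d) array, so the largest
    # usable n is bounded by GPU memory.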
    diff = X_gpu[:, cp.newaxis, :] - X_gpu[cp.newaxis, :, :]
    distances = cp.linalg.norm(diff, axis=2)
    return cp.asnumpy(distances)  # Transfer back to CPU

Distributed Computing

For cluster computing:

from dask.distributed import Client
import dask.array as da

# Setup Dask client
client = Client('scheduler-address:8786')

# Convert to Dask array
X_dask = da.from_array(X, chunks=(1000, X.shape[1]))

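# Compute in parallel across cluster
# (compute_distributed_substitution_matrix is assumed to be defined elsewhere)
S_dask = compute_distributed_substitution_matrix(X_dask)
S = S_dask.compute()  # Trigger computation; the full result must fit in local memory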

Cloud Deployment

AWS Configuration

Deploy SUBMARIT on AWS for large-scale processing:

# Using AWS Batch
import boto3
from submarit.cloud import AWSBatchRunner

runner = AWSBatchRunner(
    job_definition='submarit-clustering',
    job_queue='high-memory-queue',
    vcpus=16,
    memory=64000  # 64GB
)

# Submit job
job_id = runner.submit(
    data_s3_path='s3://bucket/data.csv',
    n_clusters=20,
    algorithm='local_search'
)

# Monitor progress
status = runner.get_status(job_id)

Google Cloud Platform

from submarit.cloud import GCPDataprocRunner

runner = GCPDataprocRunner(
    cluster_name='submarit-cluster',
    num_workers=10,
    worker_machine_type='n1-highmem-8'
)

# Run distributed clustering
results = runner.run_clustering(
    gcs_path='gs://bucket/data.csv',
    n_clusters=50,
    max_iter=1000
)

Azure ML Pipeline

from azureml.core import Workspace, Experiment
from submarit.cloud import AzureMLRunner

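# Load the workspace (reads config.json from the working directory or a parent)
ws = Workspace.from_config()

# Configure compute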
runner = AzureMLRunner(
    workspace=ws,
    compute_target='gpu-cluster',
    environment='submarit-env'
)

# Run experiment
run = runner.submit_experiment(
    data_path='datastore://products/data.csv',
    config={
        'n_clusters': 30,
        'algorithm': 'gpu_local_search',
        'batch_size': 10000
    }
)

Edge Computing

For real-time submarket analysis at retail locations:

Lightweight Models

from submarit.edge import EdgeClusterer

# Create lightweight model
edge_model = EdgeClusterer(
    n_clusters=5,
    max_products=1000,
    memory_limit='512MB',
    cpu_limit=2
)

# Export for edge deployment
edge_model.export_onnx('edge_model.onnx')
edge_model.export_tflite('edge_model.tflite')

Incremental Updates

# Edge device code
import time
from submarit.edge import IncrementalEdgeClusterer

clusterer = IncrementalEdgeClusterer.load('edge_model.pkl')

# Process new products in real-time.
# get_new_products() and time_to_sync() are application-specific hooks.
while True:
    new_products = get_new_products()
    if new_products:
        clusterer.partial_update(new_products)

    # Periodic sync with cloud
    if time_to_sync():
        clusterer.sync_with_cloud()

    time.sleep(1.0)  # avoid a busy-wait loop; tune the polling interval

Performance Monitoring

Real-time Metrics

from submarit.monitoring import PerformanceMonitor

monitor = PerformanceMonitor(
    metrics=['cpu', 'memory', 'disk', 'network'],
    interval=1.0  # seconds
)

with monitor:
    clusters = LocalSearch(n_clusters=10).fit_predict(S)

# Get performance report
report = monitor.get_report()
print(f"Peak memory: {report['memory_peak_mb']:.2f} MB")
print(f"CPU time: {report['cpu_time']:.2f} seconds")
print(f"Wall time: {report['wall_time']:.2f} seconds")

Bottleneck Analysis

from submarit.profiling import bottleneck_analysis

# Automatic bottleneck detection
analysis = bottleneck_analysis(
    function=lambda: LocalSearch(n_clusters=5).fit_predict(S),
    data_size=len(S),
    iterations=10
)

print("Bottlenecks found:")
for bottleneck in analysis['bottlenecks']:
    print(f"- {bottleneck['function']}: {bottleneck['percent']:.1f}% of time")
    print(f"  Suggestion: {bottleneck['optimization_hint']}")

Advanced Optimization Techniques

JIT Compilation Strategies

from numba import jit, cuda
from submarit.optimization import auto_optimize

# Automatic optimization selection
@auto_optimize(target=['cpu', 'gpu'])
def optimized_distance_computation(X):
    # Framework automatically selects best implementation
    pass

# GPU-specific optimization
@cuda.jit
def gpu_local_search_kernel(S, clusters, n_clusters):
    # CUDA kernel for GPU execution
    idx = cuda.grid(1)
    if idx < len(clusters):
        # Parallel cluster assignment update
        pass

Memory Mapping Strategies

from submarit.optimization import SmartMemoryManager

# Intelligent memory management
manager = SmartMemoryManager(
    available_memory='16GB',
    swap_path='/fast_ssd/swap',
    compression='lz4'
)

# Automatically handles large matrices
with manager:
    S = create_substitution_matrix(very_large_X)
    clusters = LocalSearch(n_clusters=20).fit_predict(S)

Optimization Decision Tree

Use this decision tree to choose optimization strategies:

Dataset Size?
├── < 1,000 products
│   └── Use default settings
├── 1,000 - 10,000 products
│   ├── Memory < 8GB?
│   │   ├── Yes → Use sparse matrices
│   │   └── No → Use dense matrices with parallel processing
│   └── Time critical?
│       ├── Yes → Use approximate methods
│       └── No → Use exact methods with multiple restarts
└── > 10,000 products
    ├── Memory < 32GB?
    │   ├── Yes → Use mini-batch or distributed computing
    │   └── No → Use GPU acceleration if available
    └── Real-time requirements?
        ├── Yes → Use edge computing with incremental updates
        └── No → Use cloud computing with batch processing
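
The tree can also be written as a small function, which is handy for choosing a configuration from a script. The sketch below encodes the branches exactly as listed above; the function name and argument names are illustrative, and a dataset of exactly 1,000 or 10,000 products is treated as belonging to the middle band.

```python
def recommend_strategies(n_products, memory_gb, time_critical=False, real_time=False):
    """Return the recommendations from the decision tree as a list of strings."""
    if n_products < 1_000:
        return ["Use default settings"]

    if n_products <= 10_000:
        memory_advice = ("Use sparse matrices" if memory_gb < 8
                         else "Use dense matrices with parallel processing")
        time_advice = ("Use approximate methods" if time_critical
                       else "Use exact methods with multiple restarts")
        return [memory_advice, time_advice]

    memory_advice = ("Use mini-batch or distributed computing" if memory_gb < 32
                     else "Use GPU acceleration if available")
    latency_advice = ("Use edge computing with incremental updates" if real_time
                      else "Use cloud computing with batch processing")
    return [memory_advice, latency_advice]

for advice in recommend_strategies(50_000, memory_gb=16):
    print("-", advice)
```

Each size band returns two recommendations because the tree asks two independent questions per band, one about memory and one about time or latency.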

Performance Benchmarks

Latest benchmark results (v2.0):

Dataset Size   Algorithm          Time (seconds)   Memory (GB)   Hardware
------------   ----------------   --------------   -----------   ---------------
1,000          Local Search       0.5              0.1           4-core CPU
10,000         Local Search       45               8             8-core CPU
10,000         GPU Local Search   5                4             NVIDIA V100
100,000        Mini-batch LS      600              16            16-core CPU
100,000        Distributed LS     120              8/node        10-node cluster
1,000,000      Approximate LS     1800             32            32-core CPU
1,000,000      Cloud GPU          300              16            4x NVIDIA A100
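
To compare configurations at the same dataset size, the reported times can be turned into ratios. The sketch below does this for the sizes that have two entries in the benchmark results, with the numbers copied from those results. Since the paired rows differ in both algorithm and hardware, the ratios describe whole configurations and not the algorithms alone.

```python
# (dataset size, algorithm, seconds) copied from the benchmark results
benchmarks = [
    (10_000, "Local Search", 45),
    (10_000, "GPU Local Search", 5),
    (100_000, "Mini-batch LS", 600),
    (100_000, "Distributed LS", 120),
    (1_000_000, "Approximate LS", 1800),
    (1_000_000, "Cloud GPU", 300),
]

def speedups(rows):
    """Ratio of slowest to fastest time for each dataset size."""
    by_size = {}
    for size, name, seconds in rows:
        by_size.setdefault(size, []).append((seconds, name))
    result = {}
    for size, entries in by_size.items():
        slow, fast = max(entries), min(entries)
        result[size] = (slow[1], fast[1], slow[0] / fast[0])
    return result

for size, (slow, fast, ratio) in speedups(benchmarks).items():
    print(f"{size:>9,}: {fast} is {ratio:.0f}x faster than {slow}")
```

The ratios come out at 9x, 5x, and 6x for 10,000, 100,000, and 1,000,000 products respectively.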