Performance Tuning Guide
This guide helps you optimize SUBMARIT for speed and memory efficiency with large datasets.
Performance Considerations
Dataset Size Guidelines
| Products | Memory Required | Recommended Approach | Expected Time |
|---|---|---|---|
| < 1,000 | < 1 GB | Standard dense matrix | < 1 minute |
| 1,000 - 10,000 | 1-8 GB | Dense matrix with optimization | 1-10 minutes |
| 10,000 - 100,000 | 8-80 GB | Sparse matrix, mini-batch | 10-60 minutes |
| > 100,000 | > 80 GB | Distributed, approximate methods | > 1 hour |
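As a rough cross-check on the memory column, a dense n × n matrix of 8-byte floats needs n² × 8 bytes. The sketch below only covers the matrix itself, not working memory used during clustering; the helper name `dense_matrix_gb` is made up for illustration and is not part of SUBMARIT.

```python
def dense_matrix_gb(n_products, bytes_per_value=8):
    """Estimate the size of a dense n x n matrix in GB (1 GB = 1e9 bytes)."""
    return n_products ** 2 * bytes_per_value / 1e9

# Compare float64 (8 bytes) with float32 (4 bytes) at the table's size boundaries
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} products: float64 {dense_matrix_gb(n):.3f} GB, "
          f"float32 {dense_matrix_gb(n, 4):.3f} GB")
```

At 100,000 products the float64 estimate is 80 GB, which matches the upper bound in the table; switching to float32 halves it.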
Memory Optimization
Sparse Matrices
For datasets with many zero or near-zero substitution values:
from scipy.sparse import csr_matrix
from submarit.core import create_sparse_substitution_matrix
# Create sparse substitution matrix
S_sparse = create_sparse_substitution_matrix(
    X,
    threshold=0.1,  # Keep only top 10% of connections
    metric='cosine',
    format='csr'  # Compressed sparse row format
)
print(f"Memory usage: {S_sparse.data.nbytes / 1e6:.2f} MB")
print(f"Sparsity: {1 - S_sparse.nnz / (S_sparse.shape[0]**2):.2%}")
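To see what thresholding buys in practice, here is a minimal sketch that uses only NumPy and SciPy. It assumes "thresholding" means dropping entries below a cutoff, which may differ from how `create_sparse_substitution_matrix` interprets `threshold`; the helper `threshold_to_csr` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def threshold_to_csr(S_dense, threshold):
    """Zero out entries below `threshold` and store the rest in CSR format."""
    S_kept = np.where(S_dense >= threshold, S_dense, 0.0)
    return csr_matrix(S_kept)

rng = np.random.default_rng(0)
S_dense = rng.random((200, 200))
S_sparse = threshold_to_csr(S_dense, threshold=0.9)

# CSR stores three arrays: the values, their column indices, and the row pointers
csr_bytes = S_sparse.data.nbytes + S_sparse.indices.nbytes + S_sparse.indptr.nbytes
print(f"Dense: {S_dense.nbytes / 1e6:.2f} MB, CSR: {csr_bytes / 1e6:.2f} MB")
print(f"Sparsity: {1 - S_sparse.nnz / S_dense.size:.2%}")
```

Note that `data.nbytes` alone understates the footprint of a CSR matrix, since the index arrays also take space. Sparse storage only pays off when most entries are dropped.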
Memory-Mapped Arrays
For datasets too large to fit in memory:
import numpy as np
from submarit.core import create_mmap_substitution_matrix
# Create memory-mapped substitution matrix
S_mmap = create_mmap_substitution_matrix(
    X,
    output_file='substitution_matrix.dat',
    dtype=np.float32,  # Use float32 to save space
    chunks=1000  # Process in chunks
)
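The same idea can be tried with plain `numpy.memmap`. The sketch below is not how `create_mmap_substitution_matrix` works internally (that is not shown in this guide); it uses a dot-product matrix as a stand-in for the substitution values and a hypothetical helper name.

```python
import os
import tempfile
import numpy as np

def write_similarity_memmap(X, path, chunk_rows=1000):
    """Write the n x n dot-product matrix of X to disk, one row block at a time."""
    n = X.shape[0]
    S = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, n))
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Only this block of rows has to be computed in memory at once
        S[start:stop] = X[start:stop] @ X.T
    S.flush()  # Make sure all blocks are written to the file
    return S

rng = np.random.default_rng(0)
X = rng.random((500, 8)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "substitution_matrix.dat")
S_mmap = write_similarity_memmap(X, path, chunk_rows=128)

# Reopen read-only later; pages are loaded from disk only when accessed
S_view = np.memmap(path, dtype=np.float32, mode="r", shape=(500, 500))
print(S_view[0, :5])
```

The file stores raw values only, so the shape and dtype have to be supplied again when reopening it.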
Chunked Processing
Process large matrices in chunks:
import numpy as np
from submarit.utils import chunked_substitution_matrix
def process_in_chunks(X, chunk_size=5000):
    # compute_chunk(chunk_i, chunk_j) is not defined in this guide; supply a
    # function that returns the block of substitution values for two chunks.
    n = len(X)
    S = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n, chunk_size):
        for j in range(0, n, chunk_size):
            chunk_i = X[i:i+chunk_size]
            chunk_j = X[j:j+chunk_size]
            S[i:i+chunk_size, j:j+chunk_size] = compute_chunk(chunk_i, chunk_j)
    return S
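The loop above relies on `compute_chunk`, which this guide does not define. The following self-contained sketch plugs in a hypothetical block kernel (Euclidean distance between two row blocks) purely to show the pattern end to end; substitute whatever measure your substitution matrix actually uses.

```python
import numpy as np

def compute_chunk(chunk_i, chunk_j):
    """Hypothetical block kernel: Euclidean distances between two row blocks."""
    diff = chunk_i[:, np.newaxis, :] - chunk_j[np.newaxis, :, :]
    return np.linalg.norm(diff, axis=2)

def pairwise_in_chunks(X, chunk_size=5000):
    n = len(X)
    D = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n, chunk_size):
        for j in range(0, n, chunk_size):
            # Slices past the end are truncated, so edge blocks may be smaller
            D[i:i+chunk_size, j:j+chunk_size] = compute_chunk(
                X[i:i+chunk_size], X[j:j+chunk_size]
            )
    return D

rng = np.random.default_rng(0)
X = rng.random((250, 4))
D = pairwise_in_chunks(X, chunk_size=100)
print(D.shape)
```

Chunking bounds the temporary memory per block, but the output is still a full n × n array. If that is too large as well, write each block into a memory-mapped array instead.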
Speed Optimization
Parallel Processing
Leverage multiple CPU cores:
from submarit.algorithms import LocalSearch
from joblib import Parallel, delayed
import multiprocessing
# Use all available cores
n_cores = multiprocessing.cpu_count()
# Parallel clustering with different random seeds
ls = LocalSearch(
    n_clusters=10,
    n_restarts=20,
    n_jobs=n_cores  # Parallel restarts
)
# Parallel substitution matrix computation
from submarit.core import parallel_substitution_matrix
S = parallel_substitution_matrix(X, n_jobs=n_cores, batch_size=100)
Vectorization
Use NumPy’s vectorized operations:
# Slow: Python loops
def slow_distance(X):
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.linalg.norm(X[i] - X[j])
    return D
# Fast: Vectorized
def fast_distance(X):
    # Use broadcasting
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return np.linalg.norm(diff, axis=2)
# Faster: Use scipy
from scipy.spatial.distance import cdist
D = cdist(X, X, metric='euclidean')
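One caveat with the broadcasting version is that it allocates an intermediate array of shape (n, n, d), which can exceed memory long before the n × n result does. A common workaround, sketched below with a hypothetical function name, expands the squared distance as ‖a‖² + ‖b‖² − 2a·b so that only n × n arrays are created.

```python
import numpy as np

def gram_distance(X):
    """Euclidean distance matrix via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b."""
    sq_norms = np.einsum("ij,ij->i", X, X)  # Squared norm of each row
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * (X @ X.T)
    # Rounding can push tiny values slightly below zero; clip before the sqrt
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.sqrt(sq_dists)

rng = np.random.default_rng(0)
X = rng.random((300, 16))
D = gram_distance(X)
print(D.shape)
```

The trade-off is numerical: for nearly identical rows the subtraction loses precision, so tiny distances are less accurate than with the direct difference.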
Numba JIT Compilation
Speed up custom functions:
from numba import jit, prange
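@jit(nopython=True, parallel=True)
def fast_local_search_update(S, clusters, n_clusters):
    # compute_cost is not defined in this guide; it must itself be compiled
    # with Numba to be callable from nopython mode.
    n = len(clusters)
    changed = True
    while changed:
        changed = False
        # Parallel loop. Iterations read `clusters` while other iterations
        # write to it, so results may vary between runs; switch prange to
        # range if you need deterministic updates.
        for i in prange(n):
            best_cluster = clusters[i]
            best_cost = compute_cost(S, i, clusters, best_cluster)
            for k in range(n_clusters):
                if k != clusters[i]:
                    cost = compute_cost(S, i, clusters, k)
                    if cost < best_cost:
                        best_cost = cost
                        best_cluster = k
            if best_cluster != clusters[i]:
                clusters[i] = best_cluster
                changed = True
    return clusters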
Algorithm-Specific Optimizations
Local Search Optimizations
from submarit.algorithms import OptimizedLocalSearch
# Use optimized implementation
ols = OptimizedLocalSearch(
    n_clusters=10,
    max_iter=100,
    tol=1e-4,
    early_stopping=True,  # Stop when improvement is minimal
    cache_distances=True,  # Cache frequently accessed distances
    use_triangle_inequality=True  # Skip unnecessary distance calculations
)
# Mini-batch version for large datasets
from submarit.algorithms import MiniBatchLocalSearch
mbls = MiniBatchLocalSearch(
    n_clusters=10,
    batch_size=1000,
    n_init=3,
    max_no_improvement=10
)
Approximate Methods
For very large datasets, use approximations:
from submarit.algorithms import ApproximateLocalSearch
# Use locality-sensitive hashing
als = ApproximateLocalSearch(
    n_clusters=10,
    approximation='lsh',
    n_hash_functions=10,
    accuracy=0.9  # 90% accuracy vs exact method
)
# Use random sampling
als_sample = ApproximateLocalSearch(
    n_clusters=10,
    approximation='sample',
    sample_size=10000,  # Work with subset
    n_iterations=5  # Refine with full data
)
Profiling and Benchmarking
Profile Your Code
import cProfile
import pstats
from submarit.utils import Timer
# Simple timing
with Timer() as t:
    S = create_substitution_matrix(X)
print(f"Matrix creation took {t.elapsed:.2f} seconds")
# Detailed profiling
profiler = cProfile.Profile()
profiler.enable()
clusters = LocalSearch(n_clusters=5).fit_predict(S)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10) # Top 10 time-consuming functions
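If you want to try the timing and profiling pattern without SUBMARIT installed, the standard library is enough. In this sketch `SimpleTimer` is a hypothetical stand-in (not the `submarit.utils.Timer` implementation), and `busy_work` is a placeholder workload.

```python
import cProfile
import io
import pstats
import time

class SimpleTimer:
    """Hypothetical stand-in for a timing context manager."""
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # Do not swallow exceptions

def busy_work(n=200_000):
    """Placeholder workload standing in for matrix creation or clustering."""
    return sum(i * i for i in range(n))

with SimpleTimer() as t:
    busy_work()
print(f"busy_work took {t.elapsed:.4f} seconds")

# Profile the same call and capture the report as text instead of printing directly
profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Capturing the report in a string makes it easy to log or diff profiles between runs.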
Memory Profiling
from memory_profiler import profile
@profile
def memory_intensive_function(X):
    S = create_substitution_matrix(X)
    ls = LocalSearch(n_clusters=10)
    clusters = ls.fit_predict(S)
    return clusters
# Run with: python -m memory_profiler your_script.py
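When installing `memory_profiler` is not an option, the standard library's `tracemalloc` gives a coarser picture of current and peak usage. The sketch below measures a placeholder allocation; it tracks allocations made through Python's memory allocator, so memory allocated directly by some C extensions may not be fully counted.

```python
import tracemalloc

def build_square_table(n):
    """Allocate an n x n nested list to give the tracer something to measure."""
    return [[0.0] * n for _ in range(n)]

tracemalloc.start()
table = build_square_table(500)
current, peak = tracemalloc.get_traced_memory()  # Both values are in bytes
tracemalloc.stop()

print(f"Current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```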
Benchmarking Suite
from submarit.benchmarks import run_benchmark
# Benchmark different configurations
results = run_benchmark(
    dataset_sizes=[100, 1000, 10000],
    n_clusters_list=[5, 10, 20],
    algorithms=['local_search', 'kmeans', 'hierarchical'],
    metrics=['time', 'memory', 'quality']
)
# Plot results
from submarit.benchmarks import plot_benchmark_results
plot_benchmark_results(results)
Best Practices
Data Preprocessing
# Remove near-constant features first (after standardization every non-constant
# feature has unit variance, so a variance threshold would no longer filter anything)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
# Normalize the remaining features for faster convergence
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_reduced)
Caching Results
import joblib
# Cache substitution matrix
try:
    S = joblib.load('substitution_matrix.pkl')
except FileNotFoundError:
    S = create_substitution_matrix(X)
    joblib.dump(S, 'substitution_matrix.pkl')
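The snippet above uses a fixed file name, so it will return a stale matrix if `X` changes between runs. One way around that, sketched here with the standard library only, is to key the cache file on a hash of the input. The names `cached_call` and `expensive_matrix` are hypothetical.

```python
import hashlib
import os
import pickle
import tempfile

def cached_call(func, data, cache_dir):
    """Return (func(data), cache_hit), reusing a pickled result for identical data."""
    key = hashlib.sha256(pickle.dumps(data)).hexdigest()[:16]
    path = os.path.join(cache_dir, f"{func.__name__}_{key}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh), True  # Cache hit
    result = func(data)
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result, False  # Cache miss

def expensive_matrix(rows):
    """Hypothetical stand-in for a slow substitution-matrix computation."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

cache_dir = tempfile.mkdtemp()
rows = [[1.0, 2.0], [3.0, 4.0]]
first, hit_first = cached_call(expensive_matrix, rows, cache_dir)
second, hit_second = cached_call(expensive_matrix, rows, cache_dir)
changed, hit_changed = cached_call(expensive_matrix, [[1.0, 0.0], [0.0, 1.0]], cache_dir)
print(hit_first, hit_second, hit_changed)
```

Hashing a very large array has its own cost, so for big inputs it may be cheaper to key the cache on a dataset version or file modification time instead.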
Progressive Refinement
# Start with coarse clustering, then refine
def progressive_clustering(X, final_k=50):
    # Stage 1: Coarse clustering
    coarse_k = 10
    coarse_clusters = LocalSearch(n_clusters=coarse_k).fit_predict(X)
    # Stage 2: Refine each coarse cluster
    final_clusters = np.zeros(len(X), dtype=int)
    offset = 0
    for i in range(coarse_k):
        mask = coarse_clusters == i
        X_subset = X[mask]
        if len(X_subset) > final_k // coarse_k:
            sub_k = final_k // coarse_k
            sub_clusters = LocalSearch(n_clusters=sub_k).fit_predict(X_subset)
            final_clusters[mask] = sub_clusters + offset
            offset += sub_k
        else:
            final_clusters[mask] = offset
            offset += 1
    return final_clusters
Hardware Considerations
CPU Optimization
Use Intel MKL for optimized linear algebra:
conda install mkl
Set the number of OpenMP threads:
export OMP_NUM_THREADS=8
Disable hyperthreading for compute-intensive tasks.
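The thread count can also be set from inside a Python script, as long as it happens before the numerical libraries are imported. This is a minimal sketch with a hypothetical helper name; which of these variables actually takes effect depends on the BLAS backend your NumPy build uses.

```python
import os

def limit_blas_threads(n_threads):
    """Set common threading environment variables for this process."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n_threads)

# Call this before importing NumPy/SciPy so the libraries see the values at load time
limit_blas_threads(min(8, os.cpu_count() or 1))
print({k: v for k, v in os.environ.items() if k.endswith("_NUM_THREADS")})
```

Capping the count at the number of available cores avoids oversubscription when the script is also parallelized at a higher level, for example with `n_jobs`.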
GPU Acceleration
For extremely large datasets:
# Using CuPy for GPU arrays
import cupy as cp
def gpu_distance_matrix(X):
    X_gpu = cp.asarray(X)
    # Compute pairwise distances on GPU
    diff = X_gpu[:, cp.newaxis, :] - X_gpu[cp.newaxis, :, :]
    distances = cp.linalg.norm(diff, axis=2)
    return cp.asnumpy(distances)  # Transfer back to CPU
Distributed Computing
For cluster computing:
from dask.distributed import Client
import dask.array as da
# Setup Dask client
client = Client('scheduler-address:8786')
# Convert to Dask array
X_dask = da.from_array(X, chunks=(1000, X.shape[1]))
# Compute in parallel across cluster
S_dask = compute_distributed_substitution_matrix(X_dask)
S = S_dask.compute() # Trigger computation
Cloud Deployment
AWS Configuration
Deploy SUBMARIT on AWS for large-scale processing:
# Using AWS Batch
import boto3
from submarit.cloud import AWSBatchRunner
runner = AWSBatchRunner(
    job_definition='submarit-clustering',
    job_queue='high-memory-queue',
    vcpus=16,
    memory=64000  # 64GB
)
# Submit job
job_id = runner.submit(
    data_s3_path='s3://bucket/data.csv',
    n_clusters=20,
    algorithm='local_search'
)
# Monitor progress
status = runner.get_status(job_id)
Google Cloud Platform
from submarit.cloud import GCPDataprocRunner
runner = GCPDataprocRunner(
    cluster_name='submarit-cluster',
    num_workers=10,
    worker_machine_type='n1-highmem-8'
)
# Run distributed clustering
results = runner.run_clustering(
    gcs_path='gs://bucket/data.csv',
    n_clusters=50,
    max_iter=1000
)
Azure ML Pipeline
from azureml.core import Workspace, Experiment
from submarit.cloud import AzureMLRunner
# Load the workspace (assumes a config.json for your workspace is available)
ws = Workspace.from_config()
# Configure compute
runner = AzureMLRunner(
    workspace=ws,
    compute_target='gpu-cluster',
    environment='submarit-env'
)
# Run experiment
run = runner.submit_experiment(
    data_path='datastore://products/data.csv',
    config={
        'n_clusters': 30,
        'algorithm': 'gpu_local_search',
        'batch_size': 10000
    }
)
Edge Computing
For real-time submarket analysis at retail locations:
Lightweight Models
from submarit.edge import EdgeClusterer
# Create lightweight model
edge_model = EdgeClusterer(
    n_clusters=5,
    max_products=1000,
    memory_limit='512MB',
    cpu_limit=2
)
# Export for edge deployment
edge_model.export_onnx('edge_model.onnx')
edge_model.export_tflite('edge_model.tflite')
Incremental Updates
# Edge device code
from submarit.edge import IncrementalEdgeClusterer
clusterer = IncrementalEdgeClusterer.load('edge_model.pkl')
# Process new products in real-time
while True:
    new_products = get_new_products()
    if new_products:
        clusterer.partial_update(new_products)
    # Periodic sync with cloud
    if time_to_sync():
        clusterer.sync_with_cloud()
Performance Monitoring
Real-time Metrics
from submarit.monitoring import PerformanceMonitor
monitor = PerformanceMonitor(
    metrics=['cpu', 'memory', 'disk', 'network'],
    interval=1.0  # seconds
)
with monitor:
    clusters = LocalSearch(n_clusters=10).fit_predict(S)
# Get performance report
report = monitor.get_report()
print(f"Peak memory: {report['memory_peak_mb']:.2f} MB")
print(f"CPU time: {report['cpu_time']:.2f} seconds")
print(f"Wall time: {report['wall_time']:.2f} seconds")
Bottleneck Analysis
from submarit.profiling import bottleneck_analysis
# Automatic bottleneck detection
analysis = bottleneck_analysis(
    function=lambda: LocalSearch(5).fit_predict(S),
    data_size=len(S),
    iterations=10
)
print("Bottlenecks found:")
for bottleneck in analysis['bottlenecks']:
    print(f"- {bottleneck['function']}: {bottleneck['percent']:.1f}% of time")
    print(f"  Suggestion: {bottleneck['optimization_hint']}")
Advanced Optimization Techniques
JIT Compilation Strategies
from numba import jit, cuda
from submarit.optimization import auto_optimize
# Automatic optimization selection
@auto_optimize(target=['cpu', 'gpu'])
def optimized_distance_computation(X):
    # Framework automatically selects best implementation
    pass
# GPU-specific optimization
@cuda.jit
def gpu_local_search_kernel(S, clusters, n_clusters):
    # CUDA kernel for GPU execution
    idx = cuda.grid(1)
    if idx < len(clusters):
        # Parallel cluster assignment update
        pass
Memory Mapping Strategies
from submarit.optimization import SmartMemoryManager
# Intelligent memory management
manager = SmartMemoryManager(
    available_memory='16GB',
    swap_path='/fast_ssd/swap',
    compression='lz4'
)
# Automatically handles large matrices
with manager:
    S = create_substitution_matrix(very_large_X)
    clusters = LocalSearch(20).fit_predict(S)
Optimization Decision Tree
Use this decision tree to choose optimization strategies:
Dataset Size?
├── < 1,000 products
│   └── Use default settings
├── 1,000 - 10,000 products
│   ├── Memory < 8GB?
│   │   ├── Yes → Use sparse matrices
│   │   └── No → Use dense matrices with parallel processing
│   └── Time critical?
│       ├── Yes → Use approximate methods
│       └── No → Use exact methods with multiple restarts
└── > 10,000 products
    ├── Memory < 32GB?
    │   ├── Yes → Use mini-batch or distributed computing
    │   └── No → Use GPU acceleration if available
    └── Real-time requirements?
        ├── Yes → Use edge computing with incremental updates
        └── No → Use cloud computing with batch processing
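The tree can also be written as a small function, which is convenient when the choice has to be made programmatically. This is a sketch only: the function name and parameters are hypothetical, and treating exactly 10,000 products as part of the middle tier is an assumption, since the tree does not say which side the boundary falls on.

```python
def recommend_strategies(n_products, memory_gb, time_critical=False, real_time=False):
    """Return the decision tree's recommendations as a list of strings."""
    if n_products < 1_000:
        return ["Use default settings"]
    if n_products <= 10_000:  # Assumption: the boundary value belongs to this tier
        memory_advice = ("Use sparse matrices" if memory_gb < 8
                         else "Use dense matrices with parallel processing")
        time_advice = ("Use approximate methods" if time_critical
                       else "Use exact methods with multiple restarts")
        return [memory_advice, time_advice]
    memory_advice = ("Use mini-batch or distributed computing" if memory_gb < 32
                     else "Use GPU acceleration if available")
    deploy_advice = ("Use edge computing with incremental updates" if real_time
                     else "Use cloud computing with batch processing")
    return [memory_advice, deploy_advice]

for advice in recommend_strategies(50_000, memory_gb=16):
    print(advice)
```

Each size tier asks two independent questions, so the function returns one recommendation per question rather than a single answer.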
Performance Benchmarks
Latest benchmark results (v2.0):
| Dataset Size | Algorithm | Time (seconds) | Memory (GB) | Hardware |
|---|---|---|---|---|
| 1,000 | Local Search | 0.5 | 0.1 | 4-core CPU |
| 10,000 | Local Search | 45 | 8 | 8-core CPU |
| 10,000 | GPU Local Search | 5 | 4 | NVIDIA V100 |
| 100,000 | Mini-batch LS | 600 | 16 | 16-core CPU |
| 100,000 | Distributed LS | 120 | 8/node | 10-node cluster |
| 1,000,000 | Approximate LS | 1800 | 32 | 32-core CPU |
| 1,000,000 | Cloud GPU | 300 | 16 | 4x NVIDIA A100 |
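To read the table as relative speedups, divide the slower time by the faster one for each dataset size. The short calculation below uses only the figures reported above; keep in mind that the paired rows run on different hardware, so the ratios compare whole setups rather than algorithms alone.

```python
# (dataset size, baseline, baseline seconds, accelerated, accelerated seconds)
comparisons = [
    (10_000, "Local Search", 45, "GPU Local Search", 5),
    (100_000, "Mini-batch LS", 600, "Distributed LS", 120),
    (1_000_000, "Approximate LS", 1800, "Cloud GPU", 300),
]

speedups = {}
for size, base, base_s, fast, fast_s in comparisons:
    speedups[size] = base_s / fast_s
    print(f"{size:>9,} products: {fast} is {speedups[size]:.0f}x faster than {base}")
```

This gives 9x for the GPU at 10,000 products, 5x for the 10-node cluster at 100,000, and 6x for the multi-GPU cloud setup at 1,000,000.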