Spectral Processing Workflow
This guide walks through a complete spectral preprocessing workflow, from raw data to analysis-ready spectra. The order of operations matters because most steps do not commute, so the steps are presented in a sequence commonly used in chemometrics.
Typical Processing Order
Raw Spectra
│
├─ 1. Smoothing (noise reduction)
├─ 2. Scatter Correction (MSC/EMSC)
├─ 3. Baseline Correction (ALS, SNIP, etc.)
├─ 4. Normalization (SNV, min-max, area)
├─ 5. Derivatives (optional, SG or gap-segment)
└─ 6. Transform (optional, Kubelka-Munk, ATR)
│
└─ Analysis-ready spectra
Tip
Not every dataset requires all steps. Start simple and add steps as needed.
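To illustrate why the order of steps matters, here is a minimal NumPy sketch with two toy operations. The helpers `area_normalize` and `remove_offset` are hypothetical stand-ins written for this example and are not spectrakit functions.

```python
import numpy as np

# Synthetic spectrum: one Gaussian peak sitting on a constant offset.
x = np.linspace(0.0, 1.0, 201)
peak = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)
raw = peak + 2.0


def area_normalize(y):
    """Toy normalization: scale y so that its values sum to 1."""
    return y / y.sum()


def remove_offset(y):
    """Toy baseline correction: subtract the minimum value."""
    return y - y.min()


# Same two operations, applied in opposite orders.
baseline_then_norm = area_normalize(remove_offset(raw))
norm_then_baseline = remove_offset(area_normalize(raw))

max_difference = np.abs(baseline_then_norm - norm_then_baseline).max()
print(f"sum after baseline -> normalize: {baseline_then_norm.sum():.4f}")
print(f"sum after normalize -> baseline: {norm_then_baseline.sum():.4f}")
print(f"max pointwise difference:        {max_difference:.4f}")
```

Normalizing first divides by an area that still contains the offset, so the peak ends up scaled by the baseline rather than by its own area.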
Step 1: Smoothing
Smoothing reduces high-frequency noise while aiming to preserve spectral features. Overly aggressive smoothing broadens and flattens peaks.
from spectrakit import smooth_savgol, smooth_whittaker
# Savitzky-Golay: good general-purpose smoother
smoothed = smooth_savgol(spectra, window_length=11, polyorder=3)
# Whittaker: penalized least-squares, tunable via lambda
smoothed = smooth_whittaker(spectra, lam=1e4)
When to use:
- smooth_savgol: Well understood, preserves peak shapes. Start here.
- smooth_whittaker: Better for heavy noise. Increase lam for more smoothing.
Parameters to tune:
- window_length: Larger = smoother, but may broaden peaks. Must be odd.
- polyorder: Higher = preserves more features. Usually 2 or 3, and always less than window_length.
- lam: Whittaker smoothing penalty. Typical range: 1e2 (light) to 1e6 (heavy).
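As a rough illustration of what the lam penalty does, the sketch below implements a dense Whittaker smoother in plain NumPy. It assumes a second-order difference penalty and evenly spaced points; `whittaker_smooth` and `roughness` are names made up for this example, and spectrakit's own `smooth_whittaker` may differ in its details.

```python
import numpy as np


def whittaker_smooth(y, lam=1e4, diff_order=2):
    """Solve (I + lam * D^T D) z = y, with D a finite-difference matrix."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Each row of D computes one diff_order-th difference of z.
    D = np.diff(np.eye(n), n=diff_order, axis=0)
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)


def roughness(z):
    """Sum of squared second differences: the quantity lam penalizes."""
    return float(np.sum(np.diff(z, n=2) ** 2))


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
clean = np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)
noisy = clean + rng.normal(scale=0.05, size=x.size)

# Larger lam puts more weight on the penalty, so roughness drops.
roughness_by_lam = {
    lam: roughness(whittaker_smooth(noisy, lam=lam)) for lam in (1e2, 1e4, 1e6)
}
for lam, value in roughness_by_lam.items():
    print(f"lam={lam:.0e}  roughness={value:.3e}")
```

The dense solve is only practical for short spectra. Production implementations typically use sparse or banded solvers.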
Step 2: Scatter Correction
Multiplicative scatter effects are common in diffuse reflectance (NIR) spectra.
from spectrakit import scatter_msc, scatter_emsc
# MSC: standard correction
corrected = scatter_msc(spectra)
# EMSC: includes polynomial baseline terms
corrected = scatter_emsc(spectra, poly_order=2)
When to use:
- scatter_msc: Standard for NIR diffuse reflectance. Expects a 2D batch (N, W) unless a reference spectrum is passed explicitly.
- scatter_emsc: Better when scatter varies with wavelength (adds polynomial terms).
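The idea behind MSC can be sketched in a few lines of NumPy: regress each spectrum on a reference, then undo the fitted offset and gain. The function `msc` below is an illustrative stand-in, not spectrakit's `scatter_msc`, and it assumes the reference defaults to the batch mean.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction for a (N, W) batch or a single spectrum."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    if reference is None:
        ref = spectra.mean(axis=0)
    else:
        ref = np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, spectrum in enumerate(spectra):
        # Fit spectrum ~ intercept + slope * ref by ordinary least squares.
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected


# Three copies of one band shape with different offsets and gains.
x = np.linspace(0.0, 1.0, 100)
base = np.exp(-0.5 * ((x - 0.4) / 0.08) ** 2)
offsets = np.array([0.1, 0.5, -0.2])
gains = np.array([0.8, 1.0, 1.6])
spectra = offsets[:, None] + gains[:, None] * base

corrected = msc(spectra)                   # reference = batch mean
single = msc(spectra[1], reference=base)   # explicit reference for one spectrum
```

After correction, all three rows collapse onto the mean spectrum, since offset and gain were the only differences between them.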
Note
MSC and EMSC require a reference spectrum. For batch data (N, W), the mean
spectrum is used by default. For single spectra, pass reference explicitly.
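For a single spectrum with an explicit reference, a basic EMSC model can be sketched as a least-squares fit of the reference plus low-order polynomial terms. This is a simplified illustration under that assumption. The function `emsc` and its scaled axis `t` are choices made for this example and do not describe the internals of spectrakit's `scatter_emsc`.

```python
import numpy as np


def emsc(spectrum, reference, poly_order=2):
    """Basic EMSC: fit spectrum ~ polynomial(t) + b * reference, then invert."""
    spectrum = np.asarray(spectrum, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # A scaled axis in [-1, 1] keeps the polynomial columns well conditioned.
    t = np.linspace(-1.0, 1.0, spectrum.size)
    # Columns: 1, t, ..., t**poly_order, followed by the reference spectrum.
    poly = np.vander(t, poly_order + 1, increasing=True)
    design = np.column_stack([poly, reference])
    coeffs, *_ = np.linalg.lstsq(design, spectrum, rcond=None)
    baseline = poly @ coeffs[:-1]
    gain = coeffs[-1]
    return (spectrum - baseline) / gain


t = np.linspace(-1.0, 1.0, 150)
reference = np.exp(-0.5 * ((t - 0.2) / 0.1) ** 2)
# Distort the reference with a gain and a quadratic, wavelength-dependent baseline.
distorted = 0.3 + 0.2 * t - 0.1 * t**2 + 1.7 * reference

recovered = emsc(distorted, reference, poly_order=2)
```

Here the distortion lies exactly within the model, so `recovered` matches `reference`. Real spectra also carry chemical differences, which remain in the corrected result.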
Step 3: Baseline Correction
Remove broad baseline contributions from the spectrum.
from spectrakit import baseline_als, baseline_snip, baseline_polynomial
# ALS: asymmetric least squares (most popular)
corrected = baseline_als(spectra, lam=1e6, p=0.01)
# SNIP: peak clipping (good for spectra with many sharp peaks)
corrected = baseline_snip(spectra, num_iterations=40)
# Polynomial: iterative polynomial fit
corrected = baseline_polynomial(spectra, poly_order=3)
When to use:
- baseline_als: Most versatile. Higher lam = smoother baseline. Lower p = stronger asymmetry, so the baseline stays beneath the peaks.
- baseline_snip: Fast, good for Raman and XRF with sharp peaks.
- baseline_polynomial: Simple, works well for gentle baselines.
- baseline_rubberband: Convex hull approach, no parameters to tune.
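To show how lam and p interact, here is a minimal dense NumPy sketch of asymmetric least squares in the style of Eilers and Boelens. The name `als_baseline` and the fixed iteration count `n_iter` are assumptions for this example, and spectrakit's `baseline_als` may differ in its solver, defaults, and return value.

```python
import numpy as np


def als_baseline(y, lam=1e6, p=0.01, n_iter=10):
    """Estimate a smooth baseline that stays mostly below the signal."""
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)   # second-difference operator
    penalty = lam * D.T @ D
    w = np.ones(n)
    for _ in range(n_iter):
        # Weighted penalized least squares for the current weights.
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Points above the baseline (peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


x = np.linspace(0.0, 1.0, 200)
true_baseline = 1.0 + 0.5 * x
peak = np.exp(-0.5 * ((x - 0.5) / 0.03) ** 2)
y = true_baseline + peak

baseline = als_baseline(y, lam=1e6, p=0.01)
corrected = y - baseline
print(f"max baseline error: {np.abs(baseline - true_baseline).max():.4f}")
```

Because peak points receive weight p and the remaining points receive 1 - p, the fit is dominated by the regions without peaks.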
Step 4: Normalization
Scale spectra to a common range or standard for meaningful comparison.
from spectrakit import normalize_snv, normalize_minmax, normalize_area
# SNV: zero mean, unit variance per spectrum
normalized = normalize_snv(spectra)
# Min-max: scale to [0, 1]
normalized = normalize_minmax(spectra)
# Area: unit area under the curve
normalized = normalize_area(spectra)
When to use:
- normalize_snv: Removes multiplicative and additive scatter. Standard for NIR.
- normalize_minmax: Good for visualization and when a fixed [0, 1] range is needed. Absolute intensities are not preserved.
- normalize_area: Preserves relative peak intensities.
- normalize_vector: L2 normalization, useful before cosine similarity.
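The four normalizations listed above reduce to one-line row operations. The sketch below writes them out in NumPy under simple assumptions: population standard deviation for SNV, and a plain sum as the area (unit spacing). The function names are illustrative and are not the spectrakit API.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: zero mean, unit standard deviation per row."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def minmax(spectra):
    """Scale each row to the range [0, 1]."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    lo = spectra.min(axis=1, keepdims=True)
    hi = spectra.max(axis=1, keepdims=True)
    return (spectra - lo) / (hi - lo)


def area(spectra):
    """Scale each row so its values sum to 1 (unit point spacing assumed)."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    return spectra / spectra.sum(axis=1, keepdims=True)


def vector(spectra):
    """Scale each row to unit Euclidean (L2) norm."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    return spectra / np.linalg.norm(spectra, axis=1, keepdims=True)


spectra = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 10.0]])
# SNV is unchanged by a positive gain and an offset applied to a row.
same_after_scatter = np.allclose(snv(3.0 * spectra + 5.0), snv(spectra))
print(same_after_scatter)
```

The last check shows why SNV is described as removing additive and multiplicative effects: a per-spectrum offset and positive gain cancel out.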
Step 5: Derivatives (Optional)
Derivatives help resolve overlapping peaks. The first derivative removes a constant baseline, and the second derivative also removes a linear one.
from spectrakit import derivative_savgol, derivative_gap_segment
# SG first derivative
d1 = derivative_savgol(spectra, window_length=11, polyorder=3, deriv=1)
# SG second derivative (enhances peak resolution)
d2 = derivative_savgol(spectra, window_length=11, polyorder=3, deriv=2)
# Gap-segment derivative (Norris-Williams)
d1_gap = derivative_gap_segment(spectra, gap=5, segment=5, deriv=1)
Warning
Derivatives amplify noise. Always smooth first, or use a large enough
window_length in derivative_savgol.
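The trade-off in the warning can be made concrete with a small NumPy sketch of Savitzky-Golay derivative filters, built from a local polynomial fit. The helpers `savgol_coefficients` and `savgol_derivative` are written for this example (they return interior points only) and are not the spectrakit implementation.

```python
import math

import numpy as np


def savgol_coefficients(window_length, polyorder, deriv=0, delta=1.0):
    """Filter weights from a least-squares polynomial fit over one window."""
    if window_length % 2 == 0 or polyorder >= window_length:
        raise ValueError("window_length must be odd and greater than polyorder")
    half = window_length // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    A = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of pinv(A) yields the fitted coefficient of offsets**deriv.
    return math.factorial(deriv) * np.linalg.pinv(A)[deriv] / delta**deriv


def savgol_derivative(y, window_length=11, polyorder=3, deriv=1, delta=1.0):
    """Apply the filter to interior points (output is shorter than the input)."""
    c = savgol_coefficients(window_length, polyorder, deriv, delta)
    # np.convolve flips its kernel, so reverse c to apply it as a correlation.
    return np.convolve(np.asarray(y, dtype=float), c[::-1], mode="valid")


delta = 0.1
x = np.arange(50) * delta
y = 2.0 + 3.0 * x + 0.5 * x**2
d1 = savgol_derivative(y, 11, 3, deriv=1, delta=delta)   # expected 3 + x
d2 = savgol_derivative(y, 11, 3, deriv=2, delta=delta)   # expected 1
interior = x[5:-5]

# Noise gain: variance of the output for unit-variance white noise input.
noise_gain = {
    w: float(np.sum(savgol_coefficients(w, 3, deriv=1) ** 2)) for w in (5, 11, 21)
}
print(noise_gain)
```

The noise gain falls as the window grows, which is why a larger window_length tames the noise that differentiation would otherwise amplify.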
Step 6: Transforms (Optional)
Apply physics-based spectral transforms.
import numpy as np
from spectrakit import transform_kubelka_munk, transform_atr_correction
# Kubelka-Munk: convert reflectance to absorption-like units
km = transform_kubelka_munk(reflectance_spectra)
# ATR correction: compensate for depth of penetration
wavenumbers = np.linspace(400, 4000, 1000)
corrected = transform_atr_correction(spectra, wavenumbers)
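For reference, the physics behind both transforms can be written out directly. The sketch below uses the textbook Kubelka-Munk function F(R) = (1 - R)^2 / (2R) and the standard ATR penetration-depth formula. The function names and the default optical constants (a diamond-like crystal with n = 2.4, a sample with n = 1.5, a 45 degree angle) are assumptions for this example, and how spectrakit's `transform_atr_correction` applies the correction is not shown here.

```python
import numpy as np


def kubelka_munk(reflectance):
    """Kubelka-Munk function for reflectance values in (0, 1]."""
    r = np.asarray(reflectance, dtype=float)
    return (1.0 - r) ** 2 / (2.0 * r)


def penetration_depth_um(wavenumber_cm1, n_crystal=2.4, n_sample=1.5, angle_deg=45.0):
    """ATR penetration depth in micrometres at the given wavenumbers."""
    wavelength_um = 1e4 / np.asarray(wavenumber_cm1, dtype=float)
    theta = np.deg2rad(angle_deg)
    root = np.sqrt(np.sin(theta) ** 2 - (n_sample / n_crystal) ** 2)
    return wavelength_um / (2.0 * np.pi * n_crystal * root)


km = kubelka_munk(np.array([0.25, 0.5, 1.0]))
depth = penetration_depth_um(np.array([1000.0, 2000.0]))
print(km)
print(depth)
```

Penetration depth scales with wavelength, so bands at low wavenumbers appear relatively stronger in ATR spectra than in transmission, which is the effect an ATR correction compensates for.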
Putting It All Together
from spectrakit import (
baseline_als,
normalize_snv,
smooth_savgol,
)
from spectrakit.pipeline import Pipeline
# Define a reusable processing pipeline
pipe = Pipeline()
pipe.add(smooth_savgol, window_length=11, polyorder=3)
pipe.add(baseline_als, lam=1e6, p=0.01)
pipe.add(normalize_snv)
# Apply to new data
processed = pipe.transform(raw_spectra)
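Conceptually, a pipeline of this kind stores each function with its keyword arguments and applies them in order. The class below is a hypothetical minimal stand-in that mimics the add/transform pattern shown above. It is not spectrakit's `Pipeline`, and the two toy steps are invented for the example.

```python
from functools import partial

import numpy as np


class MiniPipeline:
    """Hypothetical sketch of an ordered chain of processing steps."""

    def __init__(self):
        self._steps = []

    def add(self, func, **kwargs):
        # Bind the keyword arguments now; the data is supplied later.
        self._steps.append(partial(func, **kwargs))
        return self

    def transform(self, data):
        for step in self._steps:
            data = step(data)
        return data


def subtract_offset(y, offset):
    """Toy step: remove a known constant offset."""
    return y - offset


def scale_to_max(y):
    """Toy step: divide by the maximum value."""
    return y / y.max()


pipe = MiniPipeline()
pipe.add(subtract_offset, offset=1.0)
pipe.add(scale_to_max)
result = pipe.transform(np.array([1.0, 3.0, 5.0]))
print(result)  # [0.  0.5 1. ]
```

Keeping the steps in a list makes the processing order explicit and lets the same sequence be reapplied to new data.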
Visualizing Each Step
import numpy as np
from spectrakit.plot import plot_comparison
# raw, smoothed, and corrected are assumed to hold the spectra from the steps above
wavenumbers = np.linspace(400, 4000, 1000)
# Compare raw vs. smoothed
plot_comparison(raw, smoothed, wavenumbers, labels=("Raw", "Smoothed"))
# Compare smoothed vs. baseline-corrected
plot_comparison(smoothed, corrected, wavenumbers, labels=("Smoothed", "Corrected"))
Next Steps
- Pipeline Guide — advanced pipeline features
- scikit-learn Integration — use in ML workflows
- API Reference — full documentation