Implementing Spatial Lag Models in Python
Implementing spatial lag models in Python requires estimating a spatial autoregressive parameter () that explicitly captures how neighboring observations directly influence the dependent variable. Unlike ordinary least squares (OLS), this specification accounts for spatial spillover effects, making it the standard choice for Spatial Regression Models where geographic proximity drives correlation. The production-ready approach relies on the spreg module within the PySAL ecosystem, which provides both maximum likelihood (ML) and generalized method of moments (GMM) estimators. You will need a GeoDataFrame, a properly constructed spatial weights matrix, and either spreg.ML_Lag (for small-to-medium datasets) or spreg.GM_Lag (for larger datasets or when endogeneity is a concern).
Environment Setup & Compatibility
Spatial lag estimation depends heavily on sparse linear algebra and eigenvalue decomposition. Ensure your environment meets these baseline requirements:
- Python: 3.9+ (3.10+ recommended for
libpysalperformance optimizations) - PySAL Ecosystem:
libpysal>=4.8.0,spreg>=1.5.0 - Data Handling:
geopandas>=0.14.0,pandas>=2.0.0,numpy>=1.24.0 - Backend:
scipy>=1.10.0(required for sparse matrix inversion and Jacobian computation)
Install dependencies via:
pip install geopandas libpysal spreg numpy scipy pandas
Compatibility Warning: spreg dropped legacy pysal v1.x syntax in v2.0. Always import directly from spreg, not pysal.spreg. Weight matrices must be row-standardized (transform='r') before passing to ML estimators to ensure remains bounded between and . GMM estimators can accept raw contiguity but converge faster with standardized inputs. Consult the official spreg documentation for estimator-specific parameter mappings.
Step-by-Step Implementation
The pipeline below demonstrates a complete spatial lag specification using modern PySAL patterns. It assumes a GeoDataFrame (gdf) containing a dependent variable y, independent variables X1 and X2, and a valid geometry column.
import geopandas as gpd
import numpy as np
import pandas as pd
import libpysal
from spreg import ML_Lag, GM_Lag
# 1. Load and prepare data
# gdf = gpd.read_file("your_spatial_data.gpkg")
model_vars = ["y", "X1", "X2"]
df = gdf[model_vars].dropna().copy()
# 2. Build spatial weights matrix
# Queen contiguity is standard for polygon data; k-NN or distance bands work for points
w = libpysal.weights.Queen.from_dataframe(df, use_index=True)
w.transform = "r" # Row-standardize (critical for ML_Lag interpretation)
# 3. Extract dependent and independent arrays
y = df["y"].values.reshape(-1, 1)
X = df[["X1", "X2"]].values
# 4. Fit Spatial Lag Model
# ML_Lag is efficient for N < 2000. GM_Lag scales better and handles endogenous weights.
try:
model = ML_Lag(
y, X, w,
name_y="y",
name_x=["X1", "X2"],
name_w="Queen_Weights"
)
estimator_type = "ML"
except Exception as e:
print(f"ML estimation failed ({e}). Falling back to GMM.")
model = GM_Lag(
y, X, w,
name_y="y",
name_x=["X1", "X2"],
name_w="Queen_Weights"
)
estimator_type = "GMM"
# 5. Extract key results
print(f"\n--- {estimator_type} Spatial Lag Results ---")
print(f"Spatial Autoregressive Coefficient (ρ): {model.rho[0,0]:.4f}")
print(f"R²: {model.r2:.4f}")
print(f"AIC: {model.aic:.2f}")
print(f"Log-Likelihood: {model.ll:{estimator_type=='ML' and '.4f' or '.2f'}}")
Diagnostics & Interpretation
A spatial lag model is only valid if spatial dependence is statistically confirmed. Before trusting , run Lagrange Multiplier (LM) diagnostics to verify that the lag specification outperforms a spatial error alternative.
# Access diagnostic statistics (available in both ML and GM outputs)
lm_lag = model.lm_lag
lm_lag_robust = model.lm_lag_robust
p_value = lm_lag[1]
print(f"LM-Lag Statistic: {lm_lag[0]:.4f} (p={p_value:.4f})")
print(f"Robust LM-Lag: {lm_lag_robust[0]:.4f} (p={lm_lag_robust[1]:.4f})")
Interpretation Guidelines:
- Significant LM-Lag (): Confirms spatial dependence in the dependent variable. Proceed with the spatial lag specification.
- Significant : Indicates direct spatial spillover. A positive means high values cluster with high neighbors; negative indicates dispersion.
- Marginal Effects: Unlike OLS, coefficients in spatial lag models represent direct effects only. Total impact requires computing the spatial multiplier: . Use
spreg’s built-inspatial_lag_effectsutilities or manually compute the Leontief inverse for policy simulations.
For deeper statistical validation, review the libpysal weights documentation to ensure your neighbor definition aligns with the underlying spatial process.
Production & Performance Guidelines
When deploying spatial regression pipelines in research or production environments, follow these optimization and validation rules:
- Estimator Selection:
ML_Laguses dense matrix operations and exact Jacobian computation. It becomes computationally prohibitive beyond . Switch toGM_Lagfor larger datasets; it uses instrumental variables and sparse algebra to scale linearly. - Row-Standardization Enforcement: Always apply
w.transform = 'r'before ML estimation. Unstandardized weights distort the eigenvalue spectrum, causing to exceed theoretical bounds and producing biased standard errors. - Island Handling: Isolated polygons (zero neighbors) break row-standardization. Resolve them by:
- Assigning nearest neighbors:
w = libpysal.weights.KNN.from_dataframe(df, k=1) - Using
w.remap_ids()to drop islands before modeling - Applying
w.transform = 'b'(binary) only if explicitly required by your theoretical framework
- Endogeneity Checks: If your spatial weights matrix is derived from the dependent variable (e.g., distance-decay based on outcome values), spatial lag becomes endogenous. Use
GM_Lagwith external instruments or switch to a spatial Durbin model. - Workflow Integration: Embed spatial diagnostics directly into your CI/CD or notebook pipelines. Automated checks for
model.lm_lag[1] < 0.05prevent silent OLS mis-specification. This pattern aligns with standardized Python Workflows for Spatial Modeling & Regression where reproducibility and diagnostic gating are enforced before model deployment.
By adhering to these specifications, you ensure that your spatial lag implementation remains statistically rigorous, computationally efficient, and production-ready across varying dataset scales.