Uncovering the factors that drive house and apartment prices in Indonesia — using XGBoost and SHAP analysis on ~48,000 active property listings.
Property prices are shaped by dozens of interacting factors — from geographic location and building size to neighborhood amenities. This project was born from a simple curiosity: which factors most strongly drive house and apartment prices across Indonesia?
Indonesia's property market is enormously diverse — prices can vary dramatically between cities, even between districts within the same city. Prospective buyers and investors often struggle to judge whether a listed price is actually fair.
By gathering real-world data from active property listing platforms, training an XGBoost regression model, and applying SHAP (SHapley Additive exPlanations) to interpret it, we can transparently surface which features matter most and by how much.
Data was gathered from multiple property listing platforms while fully adhering to each platform's terms and conditions. The collection process followed a structured four-step pipeline.
Collected listings for houses and apartments actively offered for sale across multiple platforms, respecting each site's terms of service throughout.
Removed duplicate entries, standardized categorical formats, and aligned numerical scales to ensure data consistency.
Extracted structured features from raw listing text, created binary indicator variables for amenities (CCTV, garden, toll access, etc.), applied categorical encoding, and applied a log-transformation to the target price variable.
The cleaned dataset contains ~48,000 rows and 34 features spanning physical dimensions, location, amenities, and legal specifications of each property.
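The cleaning and feature-engineering steps can be illustrated with a minimal sketch. The records, field names (`id`, `type`, `price_idr`, `description`), and keyword patterns below are hypothetical stand-ins rather than the project's actual schema; the sketch only shows the general shape of deduplication, format standardization, binary amenity flags, and the log-transformed price target.

```python
import math
import re

# Hypothetical raw listings; field names and values are illustrative only.
raw_listings = [
    {"id": "a1", "type": " House ", "price_idr": 1_500_000_000,
     "description": "Rumah dengan CCTV dan taman, dekat akses tol"},
    {"id": "a1", "type": "house", "price_idr": 1_500_000_000,
     "description": "Rumah dengan CCTV dan taman, dekat akses tol"},  # duplicate
    {"id": "b2", "type": "APARTMENT", "price_idr": 850_000_000,
     "description": "Apartemen studio, keamanan 24 jam"},
]

# Assumed keyword patterns for a few amenity flags (not the project's real list).
AMENITY_PATTERNS = {
    "has_cctv": re.compile(r"\bcctv\b", re.IGNORECASE),
    "has_garden": re.compile(r"\b(taman|garden)\b", re.IGNORECASE),
    "has_toll_access": re.compile(r"\b(tol|toll)\b", re.IGNORECASE),
}


def clean_and_featurize(listings):
    """Drop duplicate ids, normalize categories, add amenity flags and log price."""
    seen_ids = set()
    rows = []
    for item in listings:
        if item["id"] in seen_ids:
            continue  # keep only the first occurrence of each listing
        seen_ids.add(item["id"])
        row = {
            "id": item["id"],
            "type": item["type"].strip().lower(),     # standardize category text
            "log_price": math.log(item["price_idr"]),  # log-transformed target
        }
        for name, pattern in AMENITY_PATTERNS.items():
            # 1 if the amenity keyword appears in the listing text, else 0
            row[name] = int(bool(pattern.search(item["description"])))
        rows.append(row)
    return rows


features = clean_and_featurize(raw_listings)
for row in features:
    print(row)
```

Training on the log of the price compresses the long right tail of expensive listings; predictions are mapped back to currency units with `math.exp` when they need to be read as prices.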
Each property listing is represented by a combination of numerical, categorical, and binary amenity features.
XGBoost (eXtreme Gradient Boosting) was selected for its ability to handle mixed tabular data with both numerical and categorical features, its robustness to outliers, and its native compatibility with SHAP for model interpretability.
Evaluation was performed using K-Fold Cross Validation to obtain robust performance estimates. Each fold is scored only on listings held out from that fold's training data, so no listing is used for both fitting and testing within the same fold.
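As a rough illustration of how K-Fold splitting keeps training and test rows apart, the sketch below builds the index splits by hand. The function name `k_fold_indices`, the choice of `k=5`, and the fixed seed are assumptions made for the example; the number of folds actually used is not stated here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=23, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    # Within a fold, no row appears on both sides of the split.
    overlap = set(train_idx) & set(test_idx)
    print(fold_number, len(train_idx), len(test_idx), len(overlap))
```

Every row lands in exactly one test fold, so the averaged score reflects predictions on data the corresponding model never saw. Any preprocessing that learns from the data (for example target-based encodings) would also need to be fitted inside each training fold for that guarantee to hold.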
A MAPE (mean absolute percentage error) of 8.98% means the model's predictions deviate from the listed price by just under 9% on average, a highly competitive result given the extreme price variation across Indonesian cities and property types. An R² Score of 0.7206 indicates that 72.06% of price variability is explained by the 34 available features. The remaining 28% is likely attributable to factors not captured in the data, such as micro-location nuances, negotiation dynamics, or seller-specific conditions that are difficult to quantify from a listing alone.
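For reference, both metrics can be computed in a few lines. The numbers below are toy values, not project results, and the sketch assumes the metrics are evaluated on prices mapped back from the log scale; whether the reported figures were computed on the original or the log scale is not stated here.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy prices (arbitrary units) and hypothetical model outputs on the log scale.
actual_prices = [100.0, 200.0, 400.0, 800.0]
predicted_log_prices = [math.log(110.0), math.log(180.0), math.log(400.0), math.log(760.0)]

# Undo the log-transform before comparing against actual prices.
predicted_prices = [math.exp(v) for v in predicted_log_prices]

print("MAPE (%):", mape(actual_prices, predicted_prices))
print("R^2:", r2_score(actual_prices, predicted_prices))
```

MAPE weights every listing by its relative error, so a miss of 10% counts the same for a cheap apartment as for a luxury house, while R² is dominated by the absolute spread of prices. That difference is one reason the two metrics can tell somewhat different stories on a dataset with very wide price ranges.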
SHAP (SHapley Additive exPlanations) was applied to the trained XGBoost model to produce global feature impact insights. Each dot on the beeswarm plot represents one property — red indicates a high feature value, blue a low one, and horizontal position shows the direction and magnitude of its impact on the predicted price.
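To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. This is not the procedure the `shap` library uses for tree models, and `toy_model`, its coefficients, and the three feature names are hypothetical; the point is only to show that the per-feature contributions add up to the difference between a prediction and a baseline prediction.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical stand-in for a trained regressor (not the project's model)."""
    return (10.0
            + 3.0 * x["location_score"]
            + 2.0 * x["building_area"]
            + 1.5 * x["location_score"] * x["building_area"]  # interaction term
            - 4.0 * x["is_subsidized"])


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to a single baseline point."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance's value, the rest the baseline's.
        mixed = {f: (instance[f] if f in subset else baseline[f]) for f in names}
        return model(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for r in range(len(others) + 1):
            # Shapley weight for coalitions of size r that exclude feature f.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


instance = {"location_score": 2.0, "building_area": 1.0, "is_subsidized": 1.0}
baseline = {"location_score": 0.0, "building_area": 0.0, "is_subsidized": 0.0}

phi = shapley_values(toy_model, instance, baseline)
print(phi)
# Additivity: baseline prediction plus all contributions equals the prediction.
print(toy_model(baseline) + sum(phi.values()), toy_model(instance))
```

In this toy case the interaction between location and building area is split evenly between the two features, and the subsidy flag receives a negative contribution. A beeswarm plot stacks such per-listing contributions for every feature across the whole dataset.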
Location is the strongest driver: its SHAP values are spread over a wide range, which suggests the model has learned significant differences between locations, with some areas strongly raising the prediction and others strongly lowering it.
Subsidy status consistently pushes prices down — subsidized properties carry a strong negative SHAP signal, clearly delineating the affordable housing segment.
Larger building areas (shown in red) produce consistently positive SHAP values. The relationship is strong and monotonic across the entire dataset.
Bathroom count serves as a strong proxy for property quality and luxury tier — more bathrooms reliably translate to higher predicted prices.
Larger land plots increase property value, particularly relevant in the landed house segment where plot size is a core pricing component.
Higher electrical capacity is associated with higher predicted prices.
Global Feature Impact on Real Estate Price Prediction · SHAP Beeswarm Plot