Is Spokane a Fast-Growing City?
Abstract
Bottom Line Up Front: Spokane, Washington has had stagnant new-home growth for the last 10 years.
This study presents an exploratory data analytics project focused on forecasting residential new-build permit trends in Spokane over the next five years. Leveraging fifteen years of residential new build permit data obtained through a public records request from the City of Spokane, the analysis employs three statistical and machine learning models: Random Forest Regressor, Generalized Linear Model (GLM), and Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX). Data preprocessing included time series differencing, stationarity testing via the Augmented Dickey-Fuller test, and parameter tuning using auto-ARIMA. The project aims to determine whether predictive models can reliably estimate future permit counts to support urban planning and development investment decisions. Despite rigorous modeling and statistical testing, results failed to demonstrate predictive power across all three approaches. The Random Forest model showed no improvement over baseline predictions. SARIMAX failed to detect meaningful temporal patterns, and GLM yielded no statistically significant predictors. All null hypotheses were retained, indicating no forecast model was suitable for reliable forward-looking projections. Nevertheless, exploratory data analysis revealed an important trend: since 2015, the average number of permits has increased and stabilized around 350 annually, suggesting a shift in baseline activity. The year 2024 marked the highest on record, with approximately 450 permits submitted. The average residential structure size was found to be 2,400 square feet. This project offers a cleaned dataset, visualizations, a Jupyter Notebook with reproducible Python code, and a presentation outlining the methodology and findings. While predictive modeling did not yield significant results, the study contributes valuable insights into Spokane’s residential development patterns and offers a robust foundation for future research.
Research Questions:
1. Can a forecast model be developed from the data?
2. How many new residential building permits have been submitted?
3. What year was the highest on record?
4. What is the average size of a new building?
Null Hypothesis (H0):
The Random Forest model does not explain any more variance in the yearly residential permit counts than a naive or random model (e.g., constant mean prediction); Model performance is no better than chance.
Alternative Hypothesis (H1):
The Random Forest model explains a significant portion of the variance in yearly residential permit counts and performs better than a naive baseline; The model has meaningful predictive power.
Null Hypothesis (H2):
The SARIMAX model’s predictions are not significantly different from white noise or a simple time series baseline; No meaningful temporal structure or exogenous influence is captured.
Alternative Hypothesis (H3):
The SARIMAX model captures significant temporal dependencies and/or exogenous effects, providing meaningful forecasts of yearly residential permits; The time series model improves forecasting accuracy.
Null Hypothesis (H4):
The GLM coefficients are all statistically zero (no effect), and the model does not significantly predict yearly permit counts; The response variable is not significantly explained by the predictors.
Alternative Hypothesis (H5):
At least one of the GLM coefficients is significantly different from zero, indicating a meaningful relationship between the predictors and residential permit counts; The model provides a statistically valid explanation.
Context:
The contribution of this study to the field of Data Analytics and Spokane urban development is a big data exploration of residential new-build permits issued by the City of Spokane. It explores the use of three machine learning models: Random Forest Regressor, Generalized Linear Regression, and Seasonal Autoregressive Integrated Moving Average with eXogenous factors (SARIMAX), to predict the annual number of new-build permits for the next five years.
With this information a housing development firm or city planning committee can maximize the investment put into housing projects and have a baseline understanding of how the city works. A California State University article titled, Big Data for Comprehensive Analysis of Real Estate Markets, showcases a study using data to explore summary statistics using the identical variables of ‘Year’, ‘Months’, and ‘Square Footage’ (YongLin, 2022). Another article titled: Big Data in Construction: Current Applications and Future Opportunities, showcases how “Data mining is used to extract meaningful patterns in the data. It has been an integral part of all big data management systems. It employs the techniques used in pattern recognition, ML, and statistics for research in construction engineering” (Munawar, 2022).
Secondly, an article titled Integration of big data in the decision-making process in the real estate sector examines how data science for real estate planning is paramount in the modern age and a competitive edge (Yu Yu, 2021). The authors found that these variables are key factors in overall content engagement and brand awareness. A similar article, The role of random forest and Markov chain models in understanding metropolitan urban growth trajectory, investigates suburban sprawl with similarly engineered variables (frontiersin.org, 2024). Understanding these variables can help describe the relationship between the independent and dependent variables. A research paper titled Using Statsmodels' SARIMAX to Model Housing Data Pulled from Zillow uses the same variables of 'frequency' and 'date' to forecast housing (Martin, 2019).
Data:
The dataset of residential building permits containing the necessary variables was sourced from the City of Spokane via a Public Records Request. The dataset is the immutable declaration of the city. It contains 4,968 rows and 7 initial columns and is limited to 15 years of submitted permits, between 1/1/2010 and 3/2025. While the dataset has 7 columns available for exploration, one delimitation applies to this analysis: 'Record ID' will not be factored in. Additionally, the 'RECORD_OPEN_DATE' column will be parsed into three separate columns for individual exploration: 'Day', 'Month', and 'Year'. The 'Year' variable is the target variable for the five-year prediction attempt. The dataset is easy to work with because the columns contain whole-number values. 'RECORD_OPEN_DATE' is a 24-hour timestamp indexed on Pacific Standard Time.
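As a minimal sketch, the date-parsing step described above might look like the following in pandas. The column name RECORD_OPEN_DATE comes from the source dataset; the sample rows here are invented for illustration:

```python
import pandas as pd

# Hypothetical sample rows mimicking the permit dataset's date column
df = pd.DataFrame({
    "RECORD_OPEN_DATE": ["2010-01-04 08:15", "2015-06-22 13:40", "2024-03-02 09:05"],
    "SQUARE_FOOTAGE": [2100, 2400, 2700],
})

# Parse the 24-hour timestamp, then split it into Day, Month, and Year columns
df["RECORD_OPEN_DATE"] = pd.to_datetime(df["RECORD_OPEN_DATE"])
df["Day"] = df["RECORD_OPEN_DATE"].dt.day
df["Month"] = df["RECORD_OPEN_DATE"].dt.month
df["Year"] = df["RECORD_OPEN_DATE"].dt.year

print(df[["Day", "Month", "Year"]])
```

The `.dt` accessor keeps each component as a whole-number column, which matches the dataset's described structure.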
Data Gathering:
The data gathering began with submitting a Public Records Request to the City of Spokane for all residential building permits submitted to the city within the last 15 years, specifically requesting the square footage of the building, the address of the residence, the description of the residence type, and the date the permit was submitted. The city responded to the request and provided the dataset used in this analysis.
Data Analytics Tools and Techniques: A KDE plot was used to visualize the distribution and visually inspect for normality. A series of predictive models was then applied, with the data processed according to each model's assumptions. The three models were Random Forest Regression, Seasonal Autoregressive Integrated Moving Average with eXogenous factors (SARIMAX), and Generalized Linear Regression. Prior to the SARIMAX model, the data was transformed via first-order differencing using the .diff() function, then confirmed for stationarity via the Augmented Dickey-Fuller (ADF) test. The stationary data was then passed through pmdarima's auto_arima function, which yields the p, d, and q orders necessary for the SARIMAX model.
Overall, this is an exploratory, quantitative data analytics technique with descriptive statistics. The tools used are Jupyter Notebook running Python code, with the statsmodels API as a reliable open-source statistical library. Given the data size, a pandas DataFrame will be used, along with NumPy; Seaborn will be used for visualizations.
Justification of Tools/Techniques:
Python will be used for this analysis because its NumPy and pandas packages can manipulate large datasets (IBM, 2021). The tools and techniques are common industry practice and enjoy a consensus of trust.
The technique is justified by the integer variables, which are natural to plot against a timeline and may reveal different modes in the frequency distribution. Forecast testing is also appropriate because the data is based on human behavior, which is notoriously skewed. Because of the size of the dataset, pandas and NumPy will be called. Python was selected over SAS because Python has better visualizations (Pandey, 2022).
Project Outcomes: To find statistically significant differences, the proposed end state is a predictive statistical model that can forecast with reasonable accuracy (over 70%). Also delivered: a visualization of permit frequency against a 24-hour-scaled timeline indexed to Pacific Standard Time; a cleaned dataset with correctly labeled columns and rows, for replication; exploratory graphs of the previously stated groups, supporting what can reasonably be forecasted; and a copy of the Jupyter Notebook with the Python code, along with a video presentation accompanied by a PowerPoint. In the study cited above, predictive modeling was instrumental in supporting the alternative hypothesis against other categorical variables (Yu Yu, 2021).
The Analysis
The python code can be found here: code
Below is the output of the dataset's head; the head() function shows all the column names and the first 5 rows.
Below is a map of Spokane's new builds plotted with Google Maps using the addresses in the CSV file. The majority of the housing clusters are located on the outskirts of the city proper, toward the county. While there are scattered patches within the city center, most new builds have been focused on the outskirts. Keep in mind these points are plotted over the last 15 years.
Below is a frequency distribution of permits issued by day of the month, from 1 to 31. Notice that the 29th has the highest permit frequency, while the 31st has the lowest.
The graph below shows the number of permits by month. The distribution appears fairly even, if not uniform: the highest month for filed permits is June, possibly tied with March, while the lowest is December.
The graph below shows residential permits per year, starting at the lower left corner, with the year on the Y axis and frequency count on the X axis. Notice that after 2014 the baseline whips upward and stays consistently higher from 2015 onward. Last year, 2024, was the highest year on record for residential new-build permits. Since 2015 the distribution appears flat rather than growing or declining, with an average of around 350 permits submitted per year. Each voting precinct in Spokane County has on average 7,000 voters; assuming 2 voters per home, Spokane has grown by approximately two precincts' worth of voters over the last 15 years.
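The per-year counting behind this graph reduces to a groupby-style tally. A minimal sketch, using randomly generated stand-in years rather than the real permit records:

```python
import numpy as np
import pandas as pd

# Hypothetical one-row-per-permit records (synthetic years, for illustration)
rng = np.random.default_rng(0)
permits = pd.DataFrame({"Year": rng.choice(np.arange(2010, 2025), size=500)})

# Count permits per year, then average the post-2015 baseline
per_year = permits["Year"].value_counts().sort_index()
baseline = per_year.loc[2015:].mean()
print(per_year.tail())
print(f"Average permits/year since 2015: {baseline:.0f}")
```

With the real dataset, `per_year` is what gets plotted, and `baseline` is the ~350/year figure quoted above.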
Below is a density plot of building square footage; the average is around 2,400 square feet. The distribution appears right-skewed and non-normal, i.e., not perfectly Gaussian or bell-shaped.
Because the distributions are not normal, only methods that do not assume normality, such as the three stated for this analysis, are appropriate. A parametric approach like a basic linear regression is not appropriate given the distribution shapes of the variables.
Below is the frequency distribution of property descriptions for new builds; the leading category is single-family detached houses. A distant second is townhomes, accounting for approximately 400 permits over 15 years.
Below is the output of the Random Forest forecast. Its mean squared error (MSE) is 12,864, far too large a margin of error to be acceptable; as a result, the next 5 years are forecast flat, i.e., the model is unable to forecast.
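A hedged sketch of the Random Forest setup and the constant-mean benchmark from H0. The lag-1 feature and the synthetic counts are illustrative assumptions, not the project's exact feature engineering, so the printed errors will not match the 12,864 figure:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic yearly permit counts (illustrative stand-in for the real series)
rng = np.random.default_rng(7)
years = np.arange(2010, 2025)
counts = np.where(years >= 2015, 350, 200) + rng.integers(-40, 40, size=years.size)

# Lag-1 feature: predict this year's count from last year's
X = counts[:-1].reshape(-1, 1)
y = counts[1:].astype(float)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
mse = mean_squared_error(y, model.predict(X))

# H0 benchmark: a naive constant-mean prediction
baseline_mse = mean_squared_error(y, np.full(y.shape, y.mean()))
print(f"Random Forest MSE: {mse:.1f}  vs  baseline MSE: {baseline_mse:.1f}")
```

The hypothesis test in this project amounts to asking whether the model's held-out MSE beats the baseline's; here the comparison is in-sample only, to keep the sketch short.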
The same holds true for the SARIMAX model after first differencing to induce stationarity, reconfirming with the Augmented Dickey-Fuller test, and obtaining the (p, d, q) orders from auto_arima(). The resulting predictions all trace along the zero mark, meaning the data cannot be accurately forecasted.
Below is the output of the generalized linear regression model. Its R² score of 0.03 means the model explains only 3% of the variance, and with a mean squared error of 9,885.10, the model is not a good reflection of future events. None of the models presented so far generalizes well to unseen data.
Final Analysis
We retain the Null Hypothesis (H0): the Random Forest model does not explain any more variance in the yearly residential permit counts than a naive or random model (e.g., constant mean prediction); model performance is no better than chance.
We retain the Null Hypothesis (H2): the SARIMAX model's predictions are not significantly different from white noise or a simple time series baseline; no meaningful temporal structure or exogenous influence is captured.
We retain the Null Hypothesis (H4): the GLM coefficients are all statistically zero (no effect), and the model does not significantly predict yearly permit counts; the response variable is not significantly explained by the predictors.
The statistical tests above suggest we reject any claim of predictability. What is obvious is that the average number of new residential permits per year has risen by roughly 150 since 2014, settling into a consistent, flat distribution. Importantly, there is no upward or downward trend since then, just a stable baseline.
We answered the following Research Questions:
1. Can a forecast model be developed from the data?
No.
2. How many new residential building permits have been submitted?
There have been 4,968 permits in the last 15 years.
3. What year was the highest on record?
2024 is the highest year on record with approximately 450 permits filed.
4. What is the average size of a new building?
2,400 square feet is the average size of residence per permit.
Works Cited
1. Integration of big data in the decision-making process in the real estate sector. (n.d.). ResearchGate. https://www.researchgate.net/publication/351385494_Integration_of_big_data_in_the_decision-making_process_in_the_real_estate_sector
2. Big Data for Comprehensive Analysis of Real Estate Markets. (n.d.). CSUSB ScholarWorks. https://scholarworks.lib.csusb.edu/cgi/viewcontent.cgi?article=2750&context=etd
3. Campo, A. M. del. (2019, November 11). Using statsmodels' SARIMAX to model housing data pulled from Zillow. Medium. https://medium.com/analytics-vidhya/using-statsmodels-sarimax-to-model-housing-data-pulled-from-zillow-c0cce905aaed
4. IBM Industry Solutions. (n.d.). Python vs. R. IBM. https://www.ibm.com/cloud/blog/python-vs-r
5. Munawar, H. S., Ullah, F., Qayyum, S., & Shahzad, D. (2022, February 6). Big data in construction: Current applications and future opportunities. MDPI. https://www.mdpi.com/2504-2289/6/1/18
6. Pandey, Y. (2022, May 25). SAS vs. Python. LinkedIn. https://www.linkedin.com/pulse/sas-vs-python-yuvaraj-pandey/
Learn about more great insights from Data Mining Mike below:
Exploring Washington State’s Voter Turnout with Machine Learning
Who is Moving to Washington State: Exploring Immigration w/ Machine Learning
Exploring Washington State’s Electric Vehicle Population
How to Fight Against Political Corruption by Machine Learning the Lobbyist
How to See Pork in Congressional Bill with AI?