Commercial and residential buildings in San Francisco are subject to strict code enforcement and building regulations, whose purpose is to maximize public health and safety. In San Francisco, the Department of Building Inspection (DBI) is the regulatory building safety agency responsible for overseeing inspections of more than 200,000 buildings. Like most similar agencies, DBI does not have nearly enough capacity to conduct inspections in all of these buildings, so it needs to prioritize inspections based on various criteria. Depending on an agency's resources and analytical proficiency, the quality of this prioritization can vary a lot, and there is usually room for significant improvement.
With this in mind, and given that San Francisco nurtures an open data policy, we at Flow&Form wanted to explore the possibility of using Machine Learning to predict building violations. In other words, we wanted to explore the potential of Machine Learning to augment city decision-making processes by detecting which buildings are likely to fail and which are likely to pass their next inspection. On top of that, industries such as insurance could find information about existing and probable future non-compliant buildings valuable.
The dataset used in our ML approach is available for download here. It is updated regularly, and at the time of writing this post (May 3rd, 2020) it consisted of 387,096 cases of violation notices. The oldest record in the dataset was from 10/8/1993, and the newest records were from 4/1/2020.
Notices are documents created by building inspectors which contain information about the type of violation and a comment from the inspector. Violations belong to one of 10 broad categories: building, interior surfaces, fire, plumbing and electrical, security requirements, smoke detection, sanitation section, lead section, hco and other section. The inspectors' comments contain valuable information too, not only because they give insight into the location of the violation but also because they provide a greater level of detail about it. For example, comments for the building category may give clues whether the issue was old paint, damaged walls or doors, or mold, while comments for the sanitation category may indicate an insect or rodent infestation, trash issues, etc.
Entries in the dataset are the result of either a regular inspection or a building complaint inspection. These inspections may or may not have resulted in a notice of violation; the dataset, however, consists only of cases in which violations did happen. This means that our modelling approach needed to be adjusted to the fact that cases in which there were no violations were not recorded.
Violations are recorded on a block-lot (parcel) level, which is the established system of managing land parcels/buildings in the USA. The dataset is organized so that each row represents an occurrence of a specific type of violation, in a specific block-lot and on a specific day. Moreover, since a large number of parcels contain multiple or numerous condos, there is also a decent number of cases with multiple occurrences of the same type of violation on the same parcel on the exact same day. Because of this long format, before doing any kind of modelling we needed to standardize the data (convert it to a wide format) so that each row contains all the necessary information about all the violations recorded in a specific block-lot on a specific date. The bar chart below shows the relation between the long and the wide format in terms of dataset length, as well as the number of unique block-lot units that appear in the San Francisco notices of violation dataset. This number is much smaller than the actual number of parcels in SF, since many buildings have no recorded violations.
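The long-to-wide conversion can be sketched with pandas as follows. The column names (`block_lot`, `date`, `category`) and values here are hypothetical placeholders, not the dataset's actual labels; the mechanics of grouping on parcel and date and pivoting categories into count columns are the point.

```python
import pandas as pd

# Toy long-format data; column names are hypothetical, not the real labels.
long_df = pd.DataFrame({
    "block_lot": ["0001/001", "0001/001", "0002/003", "0002/003"],
    "date": pd.to_datetime(["2019-01-10", "2019-01-10",
                            "2019-03-02", "2019-03-02"]),
    "category": ["building", "plumbing and electrical",
                 "sanitation section", "sanitation section"],
})

# Wide format: one row per (block_lot, date), one count column per category.
wide_df = (
    long_df
    .groupby(["block_lot", "date", "category"])
    .size()                              # count occurrences of each type
    .unstack("category", fill_value=0)   # categories become columns
    .reset_index()
)

print(wide_df)
```

Note how the two sanitation violations recorded on the same parcel and day collapse into a single row with a count of 2, which is exactly the duplication the wide format absorbs.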
Defining the problem
Predicting the likelihood of a violation is not a problem unique to real estate. There is a fair amount of research and blogging on violation prediction out there, such as restaurant violations and red-light running violations. These and similar problems have an intrinsic time series component in the data, which requires a specific and more careful approach to predictive modelling. One must be very cautious when splitting the data for training and testing in order to keep the chronological order of the records intact.
Another important decision we needed to make before modelling was how to handle the high degree of time interval irregularity present in the data. Unlike traditional time series data (e.g. daily stock prices or daily sales), the time between recorded violations for a given parcel is completely irregular. After one violation is detected, the next one can be recorded within a couple of days, months, years, or never. This high variability is yet another characteristic of the dataset which we needed to consider carefully before choosing our modelling approach.
In our experiment, we decided to explore the viability of two different approaches: classification and regression. We formulated the target variable for the binary classification problem as whether another violation will happen in the 120 days following a recorded one, while the regression model would try to estimate the number of violations during the next violation-occurring inspection. The reason the target variables were designed so specifically is, again, that the dataset does not contain information about inspections which did not result in a notice of violation.
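Constructing both targets boils down to looking one violation-occurring inspection ahead within each parcel's history. A minimal sketch, assuming a wide-format frame with hypothetical columns `block_lot`, `date` and `n_violations` (total violations recorded on that parcel-date):

```python
import pandas as pd

# Toy wide-format data: one row per (parcel, date) with total violations.
df = pd.DataFrame({
    "block_lot": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2018-01-01", "2018-03-01", "2019-06-01",
                            "2018-05-01", "2018-06-15"]),
    "n_violations": [2, 1, 3, 4, 1],
}).sort_values(["block_lot", "date"])

g = df.groupby("block_lot")

# Regression target: violation count at the next violation-occurring inspection.
df["target_next_count"] = g["n_violations"].shift(-1)

# Classification target: does the next recorded violation fall within 120 days?
days_to_next = (g["date"].shift(-1) - df["date"]).dt.days
df["target_within_120d"] = (days_to_next <= 120).astype(int)

# The most recent case per parcel has no "next" violation; those rows have
# no label and would be scored rather than used for training.
print(df)
```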
Irregular time series component
Similarly to many other Data Science problems, there were numerous approaches we could have taken in pursuit of the best results. In reality, the best results are often never achieved: for most problems, the solution space is endless and there is nowhere near enough time to test all possible approaches.
With that in mind, besides the usual dilemmas when building a model (choosing the algorithm, modelling scope, dataset balancing, handling missing data, encoding, dataset splitting, etc.), one major decision we had to make before modelling was whether to standardize the dataset into regular time intervals or to proceed with modelling on data with irregular time intervals. Ideally we would test both approaches; however, we opted for the irregular time interval approach, while the regular one will be the topic of another post.
Our algorithm of choice was xgboost, an implementation of gradient boosted decision trees designed for speed and performance. Xgboost has numerous parameters to configure, and the optimal configuration (model) is very specific to the dataset and problem at hand. Because of this, we needed to perform a grid search in order to test a defined set of possible combinations. This, however, meant that we also needed to split our dataset into subsets for training, validation and testing, where the validation subset is used to select the optimal model and the testing subset is used to estimate how well the optimal model will perform on unseen data. All of this is done to prevent the bias that comes from using the same (test) subset for both model selection and model testing. Below is a depiction of how we performed the train, validation and test split.
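The selection loop over a fixed validation set (rather than cross-validation) can be sketched like this. The grid values and data here are hypothetical, and we use scikit-learn's `GradientBoostingClassifier` as a dependency-light stand-in; with xgboost's `XGBClassifier` the loop is identical.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Stand-in data; in the real pipeline these are the chronological splits.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

# A deliberately tiny, hypothetical grid -- the real search was larger.
grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1],
        "n_estimators": [100, 200]}

best_auc, best_params = -np.inf, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    model = GradientBoostingClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    # Pick the configuration with the best AUC on the validation set.
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_params = auc, params

# Retrain the winning configuration on train + validation before
# estimating performance on the held-out test set.
final_model = GradientBoostingClassifier(random_state=0, **best_params)
final_model.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
```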
In our approach, we decided to use the most recent cases of violations as data for prediction. Cases prior to the most recent ones were held out as the test set to be used for model evaluation. Cases prior to those were used as validation data, and finally, all other cases prior to the validation cases were used for training. A decent number of parcels had only one, two or three cases of violations historically, and these were removed from the appropriate subsets in a cascading fashion. The maximum number of violations recorded for a single parcel over the recorded history was 137.
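One way to express this per-parcel chronological split is to rank each parcel's cases backwards from its most recent one. A sketch with hypothetical column names; the cascading removal of short-history parcels falls out naturally, since a parcel with only two cases simply contributes nothing to the earlier subsets:

```python
import pandas as pd

# Toy data: parcel A has 5 recorded cases, B has 3, C has only 2.
df = pd.DataFrame({
    "block_lot": ["A"] * 5 + ["B"] * 3 + ["C"] * 2,
    "date": pd.to_datetime([
        "2015-01-01", "2016-02-01", "2017-03-01", "2018-04-01", "2019-05-01",
        "2016-06-01", "2017-07-01", "2018-08-01",
        "2018-09-01", "2019-10-01",
    ]),
})

# Rank cases within each parcel from most recent (0) backwards.
df = df.sort_values(["block_lot", "date"])
df["rank_from_end"] = (
    df.groupby("block_lot")["date"].rank(method="first", ascending=False) - 1
)

score = df[df["rank_from_end"] == 0]   # most recent case per parcel
test = df[df["rank_from_end"] == 1]    # next most recent: evaluation
val = df[df["rank_from_end"] == 2]     # next: model selection
train = df[df["rank_from_end"] >= 3]   # everything earlier: training

# Parcel "C" (2 cases) reaches score and test but drops out of val and train.
```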
The lengths of the validation and test subsets were fixed: 7,540 and 13,548 rows respectively. The training data varied in length because we incrementally increased it after finishing validation and testing. For instance, after the grid search was finished and the optimal parameters were determined, we built the model again with the validation data included. Likewise, after we evaluated the model on the test data, it was trained once more with the addition of the test data (and validation data). By doing this we used the maximum data available for building our final model.
It is fair to mention that this approach is not without shortcomings. While we ensured the relative chronological order on a parcel level and between the train, validation and test splits, the absolute chronological order did not remain intact. In other words, one parcel's record could be used for training at a time point which lies in the future relative to another parcel's validation, test or scoring data. Even though this is not ideal and would be completely unacceptable for a proper time series dataset, we concluded that it would represent only a small setback: it would inhibit the model from potentially connecting the dots for cases occurring in approximately similar periods. Our justification was that we were not dealing with a traditional time series dataset and were not using algorithms designed primarily for time series modelling. Our goal was to use Machine Learning to create a model that would learn a pattern of violations based on the time delta, frequency, quantity and violation types of previous occurrences.
In the default dataset we found only a small number of features useful for modelling. The features which immediately stood out as the most interesting were the date of the violation and the violation category. At first glance this may not seem like much information for Machine Learning, but that is far from the truth. Clever feature engineering can make the difference between a good model and an excellent one, and this is exactly the part of the research on which most of our coding time was spent.
Out of these two variables, and with the help of the groupby and count methods, we managed to create 230 features for our model. The new features can be broadly categorized into the groups shown in the table below. The reasoning was the following: after standardizing the dataset on a parcel-date level, we could easily create three types of features: days since recent, N violations recent and cum sum recent (Tier I features). By using groupby and shift, we calculated the number of days passed, the total number of violations and the cumulative sum of violations for the five most recent time points in the past for each parcel-date case. This served as a basis for calculating the Tier II features. Another important aspect was that most of the feature engineering described above was done both on a general level (all types of violations combined) and on the individual violation category level. For example, for a particular parcel-date case we engineered features that show both the total number of violations and the number of violations for each of the 10 different violation categories at the most recent points in the past.
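The Tier I features can be sketched with groupby and shift as follows. Column names are hypothetical, and for brevity the example computes only the single most recent lag on the overall counts; the real pipeline repeats this for the five most recent time points and per violation category.

```python
import pandas as pd

# One row per (parcel, date) with the total violation count, already sorted.
df = pd.DataFrame({
    "block_lot": ["A"] * 4,
    "date": pd.to_datetime(["2016-01-01", "2016-05-01",
                            "2017-02-01", "2018-03-01"]),
    "n_violations": [2, 1, 4, 3],
})

g = df.groupby("block_lot")

# Tier I features for the most recent point in each parcel's past:
# days elapsed since the previous recorded violation ...
df["days_since_recent_1"] = (df["date"] - g["date"].shift(1)).dt.days
# ... the violation count recorded at that previous point ...
df["n_violations_recent_1"] = g["n_violations"].shift(1)
# ... and the cumulative violations accumulated before the current date.
df["cum_sum_recent_1"] = g["n_violations"].cumsum() - df["n_violations"]

print(df)
```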
A publicly available dataset was used to explore the possibility of applying Machine Learning to predicting building violations. The dataset has a significant time series component which required a careful design of the Machine Learning pipeline. Out of the two main approaches, irregular time intervals and standardized time intervals, we opted for the former. We built a classification and a regression model, which try to predict, respectively, whether another violation will happen in the 120 days following a recorded one, and the number of violations during the next violation-occurring inspection. Our algorithm of choice was xgboost; we applied grid search on our validation set and estimated the performance on the test set. As classification metrics we used accuracy and AUC (area under the curve), and as a regression metric we used MAE (mean absolute error). Test accuracy and AUC were 0.81 and 0.78 respectively, while test MAE was 2.55.
Unlike AUC, accuracy is a very straightforward metric: it represents the percentage of correctly predicted cases in the test dataset. It can, however, be very unreliable on imbalanced datasets. AUC, on the other hand, serves as an excellent all-around metric for estimating the performance of a model. There are many blogs out there, such as this one here, that offer good explanations for a deeper understanding of the metric. An AUC of 0.5 is what a randomly guessing model would produce, while an AUC of 1 is the result of a perfect (imaginary) model. Deciding where a good AUC ends and an excellent AUC starts is not straightforward and is specific to the industry and problem difficulty. For instance, in psychological studies an AUC higher than 0.7 can in some cases be considered very good, an AUC of 0.8 for a model identifying profitable investments could be excellent, while an AUC of 0.95 when classifying handwritten digits could be nothing special at all. It really depends on the context, and it is often best to compare results with similar research or find a baseline to compare against.
Since our dataset included information only about inspections which resulted in violations, we had no way of knowing the baseline. In other words, we had no way of knowing the ratio of inspections which resulted in violations vs. inspections which didn't. Even though San Francisco's open data platform offers additional datasets, such as the one from the Department of Building Inspection Complaints, there was no way (to the best of our knowledge) to obtain all the required information about the results of routine and non-routine inspections. However, we did find research very similar to ours (link), in which the goal was to predict the outcome of inspections using the Chicago building inspections dataset. They reported that inspections resulted in notices of violations 57% of the time (baseline), and the accuracy on their test data was 74%, which represents a nearly 20% increase over the baseline. We cannot directly compare our results with theirs, because our way of formulating the target variable was different: their goal was to predict whether the next inspection will result in a notice of violation regardless of any time period, while we were predicting whether a notice of violation can occur in the next 120 days. Nevertheless, we think it is safe to say that our classification results are very satisfying and roughly on par with, or perhaps even slightly better than, other research results.
MAE is the average, over the test sample, of the absolute differences between predicted and actual observations. An MAE of 2.55 doesn't tell us much without a comparison with the average value of the actually observed violations over all parcel-date cases, which is 5.63. Another valid comparison is with the distribution of the observed number of violations (histogram below). For example, if the model predicted that the number of violations in the next violation-occurring inspection will be 6, the average error indicates the actual number could be somewhere between 6 − 2.55 and 6 + 2.55. Both comparisons indicate that the regression results are not as satisfying as the classification results.
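A common sanity check of this kind is to compare the model's MAE against the MAE of a naive baseline that always predicts the mean of the observed targets. A minimal sketch with made-up numbers (not the post's actual predictions):

```python
import numpy as np

# Hypothetical toy values, purely for illustration.
y_true = np.array([1, 2, 6, 9, 4, 12, 3, 8])   # observed violation counts
y_pred = np.array([2, 3, 5, 7, 5, 10, 2, 9])   # model predictions

# MAE of the model vs. MAE of always predicting the mean.
model_mae = np.mean(np.abs(y_true - y_pred))
baseline_mae = np.mean(np.abs(y_true - y_true.mean()))

print(model_mae, baseline_mae)  # a useful regressor should beat the baseline
```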
Of the two models we tested, one showed very promising results. However, we feel there is a lot of room for improvement in both. The dataset we had at hand contained information only about failed inspections. This represents an incomplete view of the patterns of inspections and thus constrains our modelling approach. If there is a way to acquire complete information, it would probably allow for model improvement. Moreover, our approach was based on modelling data with an irregular time series component. These were only our initial attempts, and there is no guarantee this approach was the optimal one. It would certainly be interesting to compare the current method with the standardized time interval method to see whether the model results improve. On top of that, San Francisco's open data platform offers additional interesting datasets such as street use permits, land use, evictions, building footprints, energy performance, etc. These are mostly coupled with specific time points and have some problems with missing data, but there could be a clever way of incorporating them in some form into the model. Another very important point is that our models used only the time series component, while the inclusion of the spatial component presented a challenge. It is problematic, and probably not wise at all, to do feature engineering with combinations of very specific time and space points (x-y coordinates). However, in the standardized time interval approach, if we aggregate parcel violations quarterly, it becomes much more feasible to calculate the number of violations in the vicinity in previous quarters than to search for specific date-location combinations.
In light of the above, we think that Machine Learning can prove to be a very useful tool for real estate and city policy planning. Despite the fact that data-driven decision making is nothing new in these areas, we feel that advanced analytics and predictive modelling are not being fully utilized. We believe that better predictive intelligence can improve processes, improve quality of life for citizens, provide better city departmental services and address city resource allocation issues.