Deep Air: the smart data approach to designing healthier cities

Article written by the MSc in Business Analytics students Davide Callegaro and Peter Bruins

We live in a society that is increasingly aware of the adverse effects of living in polluted air. As a result, pollution is becoming a critical issue when cities are designed or redesigned. Unfortunately, it is hard to assess the effect of individual choices because causality is difficult to confirm, and the fact that results depend on human behaviour complicates the problem further. Urban planners are looking into ways to nudge people into making better individual choices. However, they often lack the tools to evaluate what must be done to reduce pollution, especially those planners who cannot afford to use large supercomputers.

Humans have a considerable effect on pollution levels. The Covid-19 pandemic shows the extent to which our behaviour is correlated with air pollution in our cities. In Barcelona, NO2 levels dropped 64% in March 2020, to levels previously deemed unreachable. This matters because multiple studies have shown that NO2 pollution is associated with health problems such as diabetes, hypertension, strokes, chronic obstructive pulmonary disease, and asthma.

The pollution challenge we face in our cities is important, and the consequences of success or failure will be felt by everyone. To discover what we could do to help, we joined forces with 300.000 Km/s, a Barcelona urban planning think tank that works with smart data for cities. We aimed to use data intelligently to enable city architects to make more informed decisions when considering air pollution. Because data is so abundant, this raises essential questions from the get-go: what data is relevant? How do we make data smart?


Our approach

Our journey started at Esteve Almirall’s kitchen table as we discussed the various options for an impactful and exciting project that would conclude our master's degree in business analytics. We quickly decided on the subject of smart cities, an area in which Esteve had done some recent research. Esteve contacted Mar and Pablo, the co-founders of 300.000 Km/s. Together with Esteve, they helped and supported us throughout this eight-month journey and guided us with their expertise when we took a wrong turn. We would not have got this far without them.

We started our project with a dataset provided by 300.000 Km/s. This dataset contained summarised travel data about the movement of individuals in Spain, collected from the movement of cellular devices. Spain was divided into roughly 2500 regions, and all travel between these regions was recorded. Scholars have long shown that NO2 is strongly correlated with travel (most notably, travel in diesel cars). To reinforce our initial data, we added numerous environmental statistics for each region. These ranged from the number of people per age group living in these areas to average incomes.
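As a rough illustration of how region-to-region travel data can be turned into per-region features, the sketch below sums trips leaving and entering each region. The record layout, region names, and trip counts are all invented for this example and are not taken from the actual dataset.

```python
from collections import defaultdict

# Hypothetical trip records: (origin_region, destination_region, trips).
# Region ids and counts are invented for illustration only.
flows = [
    ("A", "B", 120),
    ("A", "C", 30),
    ("B", "A", 80),
    ("C", "A", 50),
    ("C", "B", 10),
]


def travel_totals(flows):
    """Return per-region outbound and inbound trip totals."""
    outbound = defaultdict(int)
    inbound = defaultdict(int)
    for origin, destination, trips in flows:
        outbound[origin] += trips
        inbound[destination] += trips
    return dict(outbound), dict(inbound)


outbound, inbound = travel_totals(flows)
print(outbound)  # {'A': 150, 'B': 80, 'C': 60}
print(inbound)   # {'B': 130, 'C': 30, 'A': 130}
```

Totals like these could then sit alongside the demographic statistics as one row of features per region.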


To accurately predict the NO2 levels in many areas of Spain, we needed to think about modelling techniques. Our model used a combination of standard and less common machine learning techniques. Over the course of the project we used correlation matrices, random forest regression trees, graph-based representations, and spatially lagged features. As we were struggling to use the data optimally, Andre, the data scientist from 300.000 Km/s, introduced us to the concept of spatial lag. A spatially lagged feature describes an area by the values of a variable in its neighbouring areas, so it makes the best possible use of the strength of the data we possess, namely the geographical information. By doing so, we could introduce 'spatiality' into our machine-learning vocabulary.
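To make the idea of spatial lag concrete, here is a minimal sketch that computes a lagged feature as the plain average of the values in neighbouring regions (so-called row-standardised weights). The regions, the neighbour lists, and the feature values are hypothetical, and the weighting scheme is an assumption; the project may have used a different one.

```python
# Hypothetical regions: each maps to the regions it shares a border with.
neighbours = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

# Hypothetical feature values per region (for example, trips per day).
feature = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}


def spatial_lag(values, neighbours):
    """Average the neighbours' values for each region (row-standardised weights)."""
    lag = {}
    for region, adjacent in neighbours.items():
        if not adjacent:
            # Assumption: a region with no neighbours gets a lag of zero.
            lag[region] = 0.0
            continue
        lag[region] = sum(values[n] for n in adjacent) / len(adjacent)
    return lag


lagged = spatial_lag(feature, neighbours)
for region, value in lagged.items():
    print(region, round(value, 2))
# A 25.0
# B 20.0
# C 23.33
# D 30.0
```

The lagged values can be added as an extra column next to the original feature, which lets a model that treats rows independently still see what is happening next door.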

As a result, we could extract vital information that is usually lost in traditional machine learning techniques, such as random forests or XGBoost, which do not take geography into account on their own. We looked at Moran's I coefficient to ensure we would only spatially lag features that actually carried spatial information. This coefficient is a measure of spatial autocorrelation, which, in simple terms, represents how well the value of a variable in one area can be predicted from the values of the same variable in geographically neighbouring areas.
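For readers who want to see how Moran's I is calculated, the sketch below implements the standard formula with binary neighbour weights (1 if two regions are adjacent, 0 otherwise). The four-region chain and its values are hypothetical, and the choice of binary weights is an assumption made for simplicity.

```python
def morans_i(values, neighbours):
    """Moran's I with binary contiguity weights (w_ij = 1 for neighbours)."""
    regions = list(values)
    n = len(regions)
    mean = sum(values.values()) / n
    deviation = {r: values[r] - mean for r in regions}
    # Sum of products of deviations over every ordered neighbour pair.
    cross_products = sum(
        deviation[i] * deviation[j] for i in regions for j in neighbours[i]
    )
    # Total weight: number of ordered neighbour pairs.
    total_weight = sum(len(neighbours[i]) for i in regions)
    variance_sum = sum(d * d for d in deviation.values())
    return (n / total_weight) * (cross_products / variance_sum)


# Hypothetical chain of four regions: r1 - r2 - r3 - r4.
chain = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"], "r4": ["r3"]}

# Values that change smoothly along the chain: similar neighbours.
gradient = {"r1": 1.0, "r2": 2.0, "r3": 3.0, "r4": 4.0}
# Values that alternate along the chain: dissimilar neighbours.
alternating = {"r1": 1.0, "r2": 4.0, "r3": 1.0, "r4": 4.0}

print(round(morans_i(gradient, chain), 2))     # 0.33
print(round(morans_i(alternating, chain), 2))  # -1.0
```

A positive score means neighbouring regions tend to have similar values, a negative score means they tend to differ, and a score near zero suggests no spatial pattern, in which case a spatially lagged version of the feature is unlikely to help.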

Our final deliverable was a model that used the best combination of 'normal' and 'spatially lagged' features to predict NO2 levels in Spain. We started our search for the best possible model with more than 30 features and ended with a model that uses eight features to predict NO2 throughout Spain. We relied on the Moran's I score and on multiple trials with different features to decide which features to spatially lag. We arrived at a model that is 88.8% accurate for predicting NO2 levels throughout Spain. We found that the percentage of space used for residential buildings and the number of homes with surfaces between 61 and 90 m² were the most potent predictors of NO2 levels. Other notable predictors were the number of homes with surfaces between 45 and 60 m² and the number of people aged between 0 and 25 per square kilometre. Thus, we could predict NO2 levels with precision using primarily residential information. This insight shows how city planning affects liveability.

Figure: Random forest model, accuracy 88.876% (std 1.376834%). Map of accuracy by area; darker is less accurate.

Final considerations

Multiple sectors can leverage the results of our model. The public sector can be a major beneficiary, as urban planning affects pollution. By adopting innovative strategies to reduce traffic between places, cities may have a greater impact at a lower cost than with the approaches currently used. The model can indicate what happens when specific traffic flows are adjusted within the whole structure. One example would be building offices in Sant Cugat to reduce the traffic flow to Barcelona and so improve Barcelona's air quality. This contrasts with current practice, in which politicians tend to introduce measures only in the places where pollution is already too high.

Nations can use these models to check whether their pollution policies work as intended. Our predictions can serve as a benchmark for areas where specific pollution-reducing measures have been taken, making it possible to review their success. This will reduce the time to market for successful ideas, as it will take less time to confirm the results. Furthermore, it will enable a more rapid rollout of new ideas, as poor ideas will be identified sooner. The result will be cost savings and better protection for the environment.

All written content is licensed under a Creative Commons Attribution 4.0 International license.