2017

Data Visualisation, Data Science

Como Smart City for Satisfied Citizens

The City of Como project is a collaboration with Fluxedo, an Italian startup working in partnership with the municipality of Como to model human dynamic flow in the city. The overall aim of the project is to integrate multiple and diverse data sources to build a picture of the way people live and move around the city. Using historical telecom and social media data along with other geolocated data, we have created an intuitive visualisation of several models of daily movements of different demographic groups throughout Como dependent on the day and time. We have also explored the motivations for movement throughout the city based on social media activity and content.

As a large proportion of the city's revenue comes from tourism, the city is interested in better understanding how to cater for tourists and use resources more effectively. While the municipality has conducted analyses of telecom data in the past, we have been able to provide a more intuitive visualisation of the data, which was too fine-grained to be effectively understood in both spatial and temporal dimensions.

Team

  • Mattia Gasparini
  • Beatrice Gobbo
  • Sophie Hilgard
  • Nikhila Ravi

The data is available to us at several granularities: at both daily and hourly levels, and at region and full-city levels. The data is also broken down by gender and age range, among other factors we did not consider (largely because the privacy protection resulted in a large amount of missing data). While inspecting the data, we realized that a significant amount of overcounting was occurring. That is, summing the hourly, region-level data gave significantly more people than the total number of people in the city for a given day. This is because an individual who connects to cell phone towers in two different regions within the same hour is counted in both. This presents an interesting modeling case: the totals carry additional information about possible movement patterns, but it is difficult to extract the number of people in any given region at any given time. We chose two models to deal with this problem: a linear-program-based model, which assigns people to specific physically possible routes through the city (the regions passed through within any given hour), and a spatial Gaussian process smoothing model, which should give a general feel for the popularity of a given area, abstracted from the absolute numerical counts.
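
The overcounting effect can be reproduced on a toy example. The sketch below is only an illustration under stated assumptions: the record layout (person, hour, region), the region names and the five sample connections are all hypothetical and are not taken from the telecom data, which is only available to us in aggregated form.

```python
from collections import defaultdict

# Hypothetical tower connections: (person_id, hour, region).
# The real data is aggregated; individual records are assumed here
# purely to show where the overcount comes from.
connections = [
    ("p1", 9, "A"),
    ("p1", 9, "B"),   # p1 moves from region A to region B within hour 9
    ("p2", 9, "A"),
    ("p3", 9, "C"),
    ("p3", 10, "C"),
]


def region_hour_counts(records):
    """Unique people seen in each (hour, region) cell."""
    seen = defaultdict(set)
    for person, hour, region in records:
        seen[(hour, region)].add(person)
    return {cell: len(people) for cell, people in seen.items()}


def hourly_totals_from_regions(records):
    """Hourly totals obtained by summing the per-region counts."""
    totals = defaultdict(int)
    for (hour, _region), count in region_hour_counts(records).items():
        totals[hour] += count
    return dict(totals)


def true_hourly_totals(records):
    """Unique people seen anywhere in the city in each hour."""
    seen = defaultdict(set)
    for person, hour, _region in records:
        seen[hour].add(person)
    return {hour: len(people) for hour, people in seen.items()}


summed = hourly_totals_from_regions(connections)
actual = true_hourly_totals(connections)
# Positive values are people counted more than once in that hour.
overcount = {hour: summed[hour] - actual[hour] for hour in summed}
```

In this toy case the person who crosses from A to B is counted in both regions, so the summed total for hour 9 exceeds the number of distinct people by one. That surplus is the signal about movement that the two models try to exploit.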

As well as understanding how people move through the city, we explored some of the reasons why. Initially we explored using geolocated social media posts from Twitter, Instagram and TripAdvisor as indicators of activity. Although the posts are too sparse to be a useful signal of activity level, they can be used to infer the types of activities associated with each region of the city. We started by developing a representation of two time periods (Tuesday afternoon and Saturday evening) in the TripAdvisor and Instagram data. We analyzed all of the check-ins on the TripAdvisor app, bucketed by region. The hashtags and comments of Instagram posts, also bucketed by region, were analysed using tf-idf, a method of representing the importance of words based on their occurrence in each post relative to an entire corpus. This has the effect of giving less weight to words that occur frequently across the corpus. Stop words were also removed. From this analysis we extracted a representative set of hashtags for each region and separately chose an Instagram picture that represented the activity in each region. These analyses resulted in a general view of the city from the social media point of view: the TripAdvisor data pointed out the general interests of tourists, showing how they move from the main monuments and attractions during the day to the restaurants in the evening. In a similar way, the Instagram data gave an insight into the most significant places in each of the telecom regions, highlighting the main attractions of the city for young people during the day and during the night. It is also clear that the pictures and activities change dramatically between seasons.
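
To make the tf-idf step concrete, here is a minimal sketch in plain Python. It rests on several assumptions: each region's hashtags are treated as one document, the stop-word list is a tiny placeholder, and the region names and hashtags are invented for illustration. The weighting used (relative term frequency times log(N / document frequency)) is one common variant and not necessarily the one used in the project.

```python
import math
from collections import Counter

# Placeholder stop-word list (assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "at", "in", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def tf_idf(docs):
    """Return {doc_name: {word: score}} for a dict of {doc_name: text}."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_terms(word_scores, k=3):
    """The k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(word_scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _score in ranked[:k]]


# Hypothetical hashtags per region, invented for this example.
region_posts = {
    "lakefront": "#como #lake #boat #sunset #lake",
    "old_town": "#como #duomo #aperitivo #pizza",
    "hillside": "#como #funicolare #view #brunate",
}

scores = tf_idf(region_posts)
lakefront_top = top_terms(scores["lakefront"], k=1)
```

A hashtag such as `#como` that appears in every region receives a score of zero, while `#lake`, which is repeated in only one region, ranks highest there. This is the behaviour that lets a representative set of hashtags be picked out for each region.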
