Applied Data Science | Finding Cultural and Entertainment Opportunities in North Carolina

In December 2019, I began a 9-course IBM Data Science Professional Certificate program through Coursera.org.

The classes included Python programming, SQL databases, machine learning, statistics, and more.

All learning culminated in an Applied Data Science Capstone course, which I worked on for much of April and early May, in whatever spare time I could find.

Here’s the assignment summary for this final project:

Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve.

The following post outlines my approach for using Python analytics, visualization, and Foursquare data to help find the ‘great cities’ of North Carolina.

Last but not least, I hope you enjoy it, even if you’re not from NC.


1. INTRODUCTION

1.1 BACKGROUND

North Carolina (NC) is one of the most populous states in the United States (#9 in the U.S.A.), with 10.49 million people in 2019. The population is also growing quickly, and demographers expect it to reach an estimated 12.3 million by 2035.

The growth also reflects real appeal. NC cities such as Raleigh, Durham, and Charlotte often appear on lists of the top places to live in the USA, thanks to a combination of culture, outdoor activities, job prospects, and cost of living.

1.2 PROBLEM

For people and families interested in NC as a place to live, this project will drill down into cultural/entertainment opportunities and establishments (restaurants, museums, parks, etc. – more details on this in section 3.1) that make every NC city its own unique place.

This analysis attempts to find NC cities that have a relatively high number of cultural/entertainment establishments but a lower cost of living and lower home prices. Such cities are strong candidates for great places to live, and the analysis may even surface a few undiscovered “gems”.

1.3 INTEREST

The key audience for this research is people who are considering a move to NC. Personally, I'm also interested in this analysis. Having lived in NC for almost ten years, I am looking forward to applying data science principles and location data to gain a better understanding of the key cities in the state.


2. DATA ACQUISITION AND CLEANING

2.1 DATA SOURCES

The main NC cities I analyzed include:

  • Raleigh, NC

  • Charlotte, NC

  • Durham, NC

  • Hickory, NC

  • Chapel Hill, NC

  • Wilmington, NC

  • Asheville, NC

  • Boone, NC

  • Greensboro, NC

  • Winston-Salem, NC

  • New Bern, NC

  • Fayetteville, NC

For each city, I used the following types of data (source in parentheses):

  • Latitude (Foursquare API)

  • Longitude (Foursquare API)

  • Venues (Foursquare API)

  • Population (Census)

  • Cost of living (salary.com)

  • Median home value (Zillow.com)

2.2 DATA CLEANING

Data for the 12 cities above were pulled from the listed sources and compiled into one table. I used Python and pandas to pull and clean city and venue data from the Foursquare API, then merged that table with a .csv file I created containing Population, Cost of Living, and Median Home Value data. All data were current as of 2020.

 
Data Table created using .csv / pandas.
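
The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names are assumptions, the coordinates are approximate, and the population, cost of living, and home value numbers are made-up placeholders.

```python
import io

import pandas as pd

# Stand-in for the table built from Foursquare results (approximate coordinates).
venues = pd.DataFrame({
    "City": ["Raleigh", "Durham", "Boone"],
    "Latitude": [35.78, 35.99, 36.22],
    "Longitude": [-78.64, -78.90, -81.67],
})

# Stand-in for the hand-built .csv file. All numbers are placeholders,
# not real figures from the Census, salary.com, or Zillow.
csv_text = """City,Population,CostOfLivingIndex,MedianHomeValue
Raleigh,100000,100.0,250000
Durham,100000,100.0,250000
Boone,100000,100.0,250000
"""
city_stats = pd.read_csv(io.StringIO(csv_text))

# Join on the shared city name; validate= raises an error if a city
# appears more than once on either side.
merged = venues.merge(city_stats, on="City", how="left", validate="one_to_one")
print(merged)
```

Using `how="left"` keeps every city from the Foursquare table even if the .csv is missing a row, so a gap shows up as `NaN` instead of silently dropping the city.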


 

I adjusted the Cost of Living data from salary.com slightly. Salary.com reports each city's cost of living as a percentage above or below the national average, which is not convenient for this analysis. Instead, I converted it into an index where the national average equals 100. For example, New York's cost of living is 83% higher than the national average, so its index is 183. Likewise, the cost of living in Odessa, Texas is 17.3% lower than the national average, so its index is 82.7.
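
As a small worked example, the index conversion is just 100 plus the signed percentage difference. The function name below is my own, chosen for illustration.

```python
def cost_of_living_index(percent_vs_national: float) -> float:
    """Convert a signed percent difference from the national average
    into an index where 100 means 'equal to the national average'."""
    return round(100.0 + percent_vs_national, 1)


print(cost_of_living_index(83.0))   # New York, 83% higher -> 183.0
print(cost_of_living_index(-17.3))  # Odessa, TX, 17.3% lower -> 82.7
```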

2.3 LIMITS OF DATA

One limitation of this analysis is built into the Foursquare API: it returns only the top 100 venue results for any specific location. In cities such as Charlotte, which have hundreds of venues, the results therefore cover only the 100 most popular ones.

These 100 most popular venues may not be a true representation of the cultural/entertainment venues for cities with hundreds of venues, but rather a glimpse into how many cultural/entertainment venues are in the top 100. While this is not perfect, this analysis will still serve as a proxy for understanding relatively higher levels of cultural/entertainment interests in these cities.


3. METHODOLOGY: EXPLORATORY DATA ANALYSIS

3.1 CALCULATION OF TARGET VARIABLE

To measure each city's cultural and entertainment opportunities, I first needed to pull venues for each of the cities from the Foursquare API, within a 3,200-meter (approximately 2-mile) radius.
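
For readers curious what such a request looks like, here is a hedged sketch that only builds the request URL and does not call the API. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `ll`, `radius`, `limit`, and `v` parameters; the helper name and the credential placeholders are hypothetical, and this post does not record the exact call I made.

```python
from urllib.parse import urlencode

# Assumption: the legacy v2 "explore" endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20200501", radius_m=3200, limit=100):
    """Build a venue-search URL for one city center (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the search center
        "radius": radius_m,      # search radius in meters (about 2 miles)
        "limit": limit,          # 100 is the cap discussed in section 2.3
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


url = build_explore_url(35.78, -78.64, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```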

Once the venues were pulled for each city, I grouped and counted them by venue type. From this table of venue types by city, I selected the following types as cultural/entertainment, based on personal experience:

  • Brewery, Bar, Wine Bar, Cocktail Bar, Beer Garden, Beer Bar, Pub

  • Tea Room, Lounge

  • Yoga Studio

  • Baseball Stadium

  • Coffee Shop, Café, Chocolate Shop, Dessert Shop

  • Record Shop, Music Venue

  • Performing Arts Venue, Art Gallery, Theatre, Theater, Antique Shop

  • Park, Trail, Historic Site

  • Science Museum, Art Museum

  • All types of Restaurants (except Fast Food)

    • Burger Joint, BBQ Joint, Irish Pub

    • Breakfast Spot

    • Taco Place
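
The grouping and selection described above can be sketched with pandas as follows. The venue rows are invented for illustration, the category set is only a subset of the list above, and treating any category ending in "Restaurant" as a restaurant is my assumption about how the category names look.

```python
import pandas as pd

# Invented venue rows; real rows would come from the Foursquare results.
venues = pd.DataFrame({
    "City": ["Raleigh", "Raleigh", "Raleigh", "Raleigh",
             "Hickory", "Hickory", "Hickory"],
    "Category": ["Brewery", "Brewery", "Fast Food Restaurant", "Park",
                 "Coffee Shop", "Italian Restaurant", "Gas Station"],
})

# A subset of the cultural/entertainment types listed above.
CULTURAL_CATEGORIES = {"Brewery", "Bar", "Coffee Shop", "Park",
                       "Art Museum", "Music Venue"}


def is_cultural(category: str) -> bool:
    """Explicitly listed types, plus all restaurants except fast food."""
    if category in CULTURAL_CATEGORIES:
        return True
    return category.endswith("Restaurant") and category != "Fast Food Restaurant"


# Count venues per (city, type), keep the cultural types, then total per city.
counts_by_type = venues.groupby(["City", "Category"]).size().reset_index(name="Count")
cultural = counts_by_type[counts_by_type["Category"].map(is_cultural)]
cultural_totals = cultural.groupby("City")["Count"].sum()
print(cultural_totals)
```

The per-city total at the end plays the role of the target variable used in the rest of section 3.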

3.2 RELATIONSHIP BETWEEN CULTURAL/ENTERTAINMENT VENUES AND MEDIAN HOME VALUES

With the target variable in place, I explored three independent variables to understand their relationship with the number of cultural/entertainment venues. The following scatterplot shows median home value against the number of cultural/entertainment venues.

 
Scatterplot: median home value vs. number of cultural/entertainment venues.
 

To better understand this plot, I trained a regression model using 70% of the data. The following regression line plot shows opportunities where the total number of cultural/entertainment venues is higher than expected, based on median home values.

 
Regression plot: median home value vs. number of cultural/entertainment venues.
 

While not a perfect model, with a Mean Squared Error (MSE) of 186.78, this starts to tell a story of where we might find more cultural/entertainment opportunities than expected in NC. Hickory, Durham, Raleigh, Charlotte, Winston-Salem, Greensboro, and Asheville all have more cultural/entertainment opportunities than we would expect based on their median home values.
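
To make the approach concrete, here is a minimal sketch of a 70% training split, a straight-line fit, the MSE, and the "above the line" check. It uses NumPy as a stand-in rather than whatever library the original notebook used, the 12 data points are made-up placeholders for unnamed cities, and scoring the MSE on the held-out 30% is an assumption.

```python
import numpy as np

# Made-up placeholder data for 12 unnamed cities (not the project's real values).
cities = [f"City {c}" for c in "ABCDEFGHIJKL"]
home_value_k = np.array([150, 180, 200, 220, 240, 260,
                         280, 300, 320, 350, 380, 400], dtype=float)
venue_count = np.array([30, 42, 35, 50, 44, 58,
                        52, 61, 66, 60, 72, 75], dtype=float)

# Shuffle reproducibly, then use 70% of the rows (8 of 12) for training.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(cities))
n_train = int(0.7 * len(cities))
train_idx, test_idx = order[:n_train], order[n_train:]

# Fit a straight line (degree-1 least squares) on the training rows only.
slope, intercept = np.polyfit(home_value_k[train_idx], venue_count[train_idx], deg=1)

# Mean squared error on the held-out rows.
predicted_test = slope * home_value_k[test_idx] + intercept
mse = float(np.mean((venue_count[test_idx] - predicted_test) ** 2))

# Cities above the fitted line have more venues than the model expects.
residuals = venue_count - (slope * home_value_k + intercept)
above_line = [city for city, r in zip(cities, residuals) if r > 0]

print(f"MSE on held-out cities: {mse:.2f}")
print("More venues than expected:", above_line)
```

With only 12 cities, the held-out set is just 4 points, so the MSE will move around quite a bit depending on which cities land in it.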


3.3 RELATIONSHIP BETWEEN CULTURAL/ENTERTAINMENT VENUES AND COST OF LIVING

The following scatterplot shows the relationship between cost of living and number of cultural/entertainment venues.

 
Scatterplot: cost of living vs. number of cultural/entertainment venues.
 

A regression model was created using 70% of this data as a training set, which results in the following regression plot:

 
Regression plot: cost of living vs. number of cultural/entertainment venues.
 

Cost of living fits the number of cultural/entertainment venues better than median home value does, as shown by the lower MSE of 93.86. A similar set of cities again shows more cultural/entertainment venues than expected: Hickory, Greensboro, Winston-Salem, Raleigh, Durham, Charlotte, and Asheville.


3.4 RELATIONSHIP BETWEEN CULTURAL/ENTERTAINMENT VENUES AND POPULATION SIZE

Lastly, population size was explored to understand the relative ‘density’ of cultural/entertainment venues. This is an effort to understand if higher numbers of people living in a city equate to more cultural/entertainment opportunities.

 
Scatterplot: population vs. number of cultural/entertainment venues.
 

I then created a regression model using 70% of the data to train, which resulted in the following regression line:

 
Regression plot: population vs. number of cultural/entertainment venues.
 

With an MSE of 226.27, the highest of the three models, population size is the least useful metric for understanding cultural/entertainment opportunities in a city. This is promising: one doesn’t have to live in a large city to have opportunities for culture and entertainment.
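
Since all three models predict the same target, their MSE values can be compared directly. The short sketch below ranks the three figures reported above and also prints the square root (RMSE), which reads as a typical error in number of venues.

```python
# MSE values as reported in sections 3.2 through 3.4.
mse_by_predictor = {
    "Median Home Value": 186.78,
    "Cost of Living": 93.86,
    "Population": 226.27,
}

# Lower MSE means a closer fit, so sort ascending.
ranked = sorted(mse_by_predictor.items(), key=lambda item: item[1])
best_predictor = ranked[0][0]

for name, mse in ranked:
    print(f"{name}: MSE = {mse:.2f}, RMSE = {mse ** 0.5:.1f} venues")
```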


4. CONCLUSIONS

The best fit between the three independent variables (Population, Median Home Value, Cost of Living) and the dependent variable (Cultural/Entertainment Venues) is found in Cost of Living. While not perfect, the Cost of Living model clearly shows opportunity in Hickory, Greensboro, Winston-Salem, Raleigh, Durham, Charlotte and Asheville.

 
Regression plot: cost of living vs. number of cultural/entertainment venues.
 


When this analysis is coupled with the results of the Median Home Value and Population models, the same cities appear near the ‘top’ of the list:

  • Raleigh

  • Durham

  • Winston-Salem

  • Greensboro

  • Hickory

  • Charlotte

  • Asheville

If a person or family has flexibility on where to live, desires to live in the state of NC and appreciates cultural/entertainment opportunities in their city, this short list of cities would be a great place to start exploring. Of course, the ultimate choice of where to live is much broader with many more inputs – dependent on jobs, extended family and a multitude of other factors.


5. FUTURE DIRECTIONS

The current analysis shows opportunities for finding cultural/entertainment establishments in certain NC cities. Future analysis should pull in more data (for example, more cities than the 12 used here) to make the models more accurate.

It would be interesting to also pull in NC suburbs and rural areas to see if that changes how these large cities relate to each other.

Comparing all major US cities (the top 100 or 200 cities across America) would also be a fascinating follow-up to this project. This kind of data analysis could be used as an input into the next ‘best places to live’ list.


6. REFERENCES

Carolina Demography. Link: https://www.ncdemography.org/. Accessed May 1, 2020.

Dragna, Madison. December 30, 2019. The 9 Coolest Cities in North Carolina. Link: https://www.tripstodiscover.com/coolest-cities-in-north-carolina. Accessed April 21, 2020.

Salary.com. Cost of Living in North Carolina. Link: https://www.salary.com/research/cost-of-living/nc. Accessed April 22, 2020.

Strohm, Mitch. Feb. 13, 2020. America’s best places to live in 2020. Link: https://www.bankrate.com/real-estate/best-places-to-live/us/. Accessed April 21, 2020.

Thorsby, Devon. July 5, 2019. The Best Places to Live in North Carolina. Link: https://realestate.usnews.com/real-estate/articles/best-places-to-live-in-north-carolina. Accessed April 21, 2020.

World Population Review. Top 500 Cities in North Carolina by Population. Link: https://worldpopulationreview.com/states/north-carolina-population/cities/. Accessed April 22, 2020.

Zillow. North Carolina Home Prices & Values. Link: https://www.zillow.com/nc/home-values/. Accessed April 22, 2020.


7. CLOSING THOUGHTS

Beginning this data science journey has been a ‘swim in the deep end’ kinda moment. It’s been technically challenging to learn Python, and theoretically challenging to refresh on advanced statistics. Overall, I’m looking forward to seeing what is next in this space.

And if you have a project that needs data science and analytics, I’d love to have the chance to work on it!