Housing Prices in Recessions

Hypothesis Testing

Hypothesis: University towns have their mean housing prices less affected by recessions.

Background

The following data files are available for this assignment:

From the Zillow research data site there is housing data for the United States. In particular the datafile for all homes at a city level, City_Zhvi_AllHomes.csv, has median home sale prices at a fine grained level. From the Wikipedia page on college towns is a list of university towns in the United States which has been copy and pasted into the file university_towns.txt. From Bureau of Economic Analysis, US Department of Commerce, the GDP over time of the United States in current dollars (use the chained value in 2009 dollars), in quarterly intervals, in the file gdplev.xls. For this assignment, only look at GDP data from the first quarter of 2000 onward.

Previewing the Data

To get started, I uploaded the required libraries. The independent samples t-test is used in this case.

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

Let’s view the list of university towns.

df = pd.read_fwf('university_towns.txt')
df.head()

Now, let’s check out the US Department of Commerce GDP data.

GDP = pd.read_excel('gdplev.xls', usecols=[4,6], skiprows=219, names = ['Quarter', 'GDP'])
GDP['GDP'] = pd.to_numeric(GDP['GDP'])

GDP.head()

Finally, let’s look at our Zillow housing data.

zillow = pd.read_csv("City_Zhvi_AllHomes.csv")
zillow.head()

Cleaning the Data

First, I had to map states to acronyms to merge the university towns with the zillow dataset. So, I used this dictionary.

states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}

I cleaned the university_towns data to remove any weird characters.

def get_list_of_university_towns(): 

    with open('university_towns.txt', "r") as reader:
        lines = reader.readlines()
    reader.close()
    
    df = pd.DataFrame(columns=('State', 'RegionName'))
    count = 0
    state = ""
    region = ""
    
    for line in lines:
        line = line.rstrip('n')
        line = line.strip()
        if 'edit' in line:
            state = line.split('[')[0]
        else:
            region = line.split(' (',1)[0]
            df.loc[count] = [state, region]
            count = count + 1
    
    df.reset_index(inplace=True,drop=True)
    return df
    
get_list_of_university_towns()

Defining Quarters

Using the GDP data and the definition of a quarter as described in the background, I returned the quarter values with year in an easily usable string format when I found the start of the recession.

def get_recession_start():
#    '''Returns the year and quarter of the recession start time as a 
#    string value in a format such as 2005q3'''
    
    GDP = pd.read_excel('gdplev.xls', usecols=[4,6], skiprows=219, names = ['Quarter', 'GDP'])
    GDP['GDP'] = pd.to_numeric(GDP['GDP'])


    Quarter = []

    for i in range(len(GDP)-2):
        if ((GDP.iloc[i][1] > GDP.iloc[i+1][1]) & (GDP.iloc[i+1][1] > GDP.iloc[i+2][1])):
            Quarter.append(GDP.iloc[i+1][0])
            break
    return Quarter[0]
get_recession_start()

‘2008q3’

Next, I found the end of the recession.

def get_recession_end():
    start = get_recession_start()
    GDP = pd.read_excel('gdplev.xls', usecols=[4,6], skiprows=219, names = ['Quarter', 'GDP'])
    GDP['GDP'] = pd.to_numeric(GDP['GDP'])
    start_index = GDP[GDP['Quarter'] == start].index.tolist()[0]
    for i in range(start_index, len(GDP)-2):
        if ((GDP.iloc[i][1] < GDP.iloc[i+1][1]) & (GDP.iloc[i+1][1] < GDP.iloc[i+2][1])):   
            return GDP.iloc[i+2][0]
get_recession_end()

‘2009q4’

Finally, I found the bottom of the recession.

def get_recession_bottom():
    
    start= get_recession_start()
    end = get_recession_end()
    GDP = pd.read_excel('gdplev.xls', usecols=[4,6], skiprows=219, names = ['Quarter', 'GDP'])
    GDP['GDP'] = pd.to_numeric(GDP['GDP'])
    
    recession = GDP[(GDP['Quarter'] <= end) & (GDP['Quarter'] >= start)]
    recession.reset_index(inplace=True)
    recession.drop(['index'], axis=1, inplace=True)
    min_index = recession['GDP'].idxmin()
    return recession.iloc[min_index][0]
    
get_recession_bottom()

‘2009q2’

To use the quarterly values I created when finding the recession start/bottom/end, I had to convert the housing data to quarters as defined at the beginning and take the mean of the 3 months in the quarter.

def convert_housing_data_to_quarters():
    
    # Retrieved the housing data.
    df= pd.read_csv("City_Zhvi_AllHomes.csv")
    df['State'].replace(states, inplace= True)
    df= df.set_index(["State","RegionName"])
    df = df.iloc[:,49:250]
   
    # Defined quarters and calculated the mean of each quarters' values.
    def quarters(col):
        if col.endswith(("01", "02", "03")):
            s = col[:4] + "q1"
        elif col.endswith(("04", "05", "06")):
            s = col[:4] + "q2"
        elif col.endswith(("07", "08", "09")):
            s = col[:4] + "q3"
        else:
            s = col[:4] + "q4"
        return s  
    housing = df.groupby(quarters, axis = 1).mean()
    housing = housing.sort_index()
    return housing
 
convert_housing_data_to_quarters()

Running the T-Test

I created a column “University Town” in the table for identifying college towns since the towns in this table are college towns. Then I left-merged the housing data with the aforementioned towns table to only account for towns with housing data. I then identified the non-university towns by filling the blanks in newly merged towns in the “University Town” column with a value of false.

def run_ttest():

    # Retrieved the university towns table and created a column indicating if the town is a university town.
    univ_towns_table = get_list_of_university_towns()
    univ_towns_table['University Town'] = True
    
    # Retrieved the housing data and the start and bottom of the recession.
    housing = convert_housing_data_to_quarters()
    start = '2008q2'
    bottom = get_recession_bottom()
    housing = housing.reset_index()
    
    # Merged the housing and university town data, only keeping rows that appeared in the housing data. All NaN values signified non-university towns, so I replaced with a value of false for the t-test.
    all_towns = housing.merge(univ_towns_table, how='left', on=['State', 'RegionName'])
    all_towns['University Town'] = all_towns['University Town'].replace({np.NaN: False})
    
    # Calculated the price difference to be used in our t-test to determine mean price ratio (aka reduced market loss).
    all_towns['Price Difference'] = all_towns[start]/all_towns[bottom]
    
    # Created the two datasets with university and non-university towns to test the hypothesis.
    univ_towns = all_towns[all_towns['University Town'] == True]
    non_univ_towns = all_towns[all_towns['University Town'] == False]
    
    # Ran the t-test.
    t,p = ttest_ind(univ_towns['Price Difference'], non_univ_towns['Price Difference'],nan_policy='omit')
    different = True if p < 0.01 else False
    better = 'university town' if univ_towns['Price Difference'].mean() < non_univ_towns['Price Difference'].mean() else "Non-University Town"
    
    
    return different, p, better
    
run_ttest()

‘(True, 0.0027240637047531249, ‘university town’)’

Results

With a p-value of 0.0027240637047531249 < .05, we fail to reject the hypothesis. Thus, university towns mean housing prices seem to be less affected by recessions than non-university towms.