Analysis of NYC Property Sales


NYC Property Sales Introduction

The aim of this project is to introduce you to practical statistics with Python in as concrete and consistent a way as possible. Using what you’ve learned, download the NYC Property Sales Dataset from Kaggle. This dataset is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period.

This dataset contains the location, address, type, sale price, and sale date of building units sold. A reference on the trickier fields:

  • BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5).
  • BLOCK; LOT: The combination of borough, block, and lot forms a unique key for a property in New York City, commonly called a BBL (see the sketch after this list).
  • BUILDING CLASS AT PRESENT and BUILDING CLASS AT TIME OF SALE: The type of building at various points in time.
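
For illustration, here is a minimal sketch of building such a BBL key (assuming a pandas dataframe named df with the column names as they appear in the Kaggle CSV; the 'BBL' column itself is a hypothetical addition, not part of the dataset):

# Hypothetical helper: combine borough, block, and lot into a single BBL string key
df['BBL'] = (df['BOROUGH'].astype(str) + '-'
             + df['BLOCK'].astype(str) + '-'
             + df['LOT'].astype(str))
print(df['BBL'].nunique(), 'distinct BBLs')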

Note that because this is a financial transaction dataset, there are some points that need to be kept in mind:

  • Many sales occur with a nonsensically small dollar amount: $0 most commonly. These sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement.
  • This dataset uses the financial definition of a building/building unit, for tax purposes. If a single entity owns the building in question, a sale covers the value of the entire building. If a building is owned piecemeal by its residents (a condominium), a sale refers to a single apartment (or group of apartments) owned by an individual.

Formulate a question and derive a statistical hypothesis test to answer it. You have to demonstrate that you’re able to make decisions from data in a scientific manner. Example questions include:

  • Is there a difference in units sold between properties built in 1900–2000 and those built in 2001 or later?
  • Is there a difference in units sold across building categories?
  • What can you discover about New York City real estate by looking at a year’s worth of raw transaction records? Can you spot trends in the market?

Please make sure that you have completed the prerequisite lessons for this course, namely Python and Practical Statistics, which are part of this program.

Note: You can take a look at the project rubric below:

Code Review  
CRITERIA SPECIFICATIONS
Mean Student implements the mean on specific columns/data using pandas, numpy, or scipy
Median Student implements the median on specific columns/data using pandas, numpy, or scipy
Mode Student implements the mode on specific columns/data using pandas, numpy, or scipy
Central Tendencies Implementing central tendencies throughout the dataset
Box Plot Implementing a box plot to visualize specific data
Z-Score Implementing the Z-score concept on specific data
Probability Distribution Student analyzes the distribution of the data and gains insight from the distribution
Intervals Implementing confidence or prediction intervals
Hypothesis Testing Makes 1 hypothesis and draws a conclusion from the data
Preprocessing Student preprocesses the dataset before applying the statistical treatment.
Does the code run without errors? The code runs without errors. All code is functional and formatted properly.
Readability  
CRITERIA SPECIFICATIONS
Well Documented All cells in the notebook are well documented, with markdown above each cell explaining the code
Analysis  
CRITERIA SPECIFICATIONS
Overall Analysis Gain an insight/conclusion from the overall plots that answers the hypothesis

Focus on “Graded-Function” sections.


Data Preparation

Load the libraries you need.

Get your NYC property data from here and load it into a dataframe in your notebook.

# Get your import statement here
import pandas as pd
import numpy as np

# Load your dataset here
df = pd.read_csv('nyc-rolling-sales.csv')

print ('Data read into a pandas dataframe!')
Data read into a pandas dataframe!

Let’s view the top 5 rows of the dataset using the head() function.

# Write your syntax here
df.head()
Unnamed: 0 BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 392 6 C2 153 AVENUE B ... 5 0 5 1633 6440 1900 2 C2 6625000 2017-07-19 00:00:00
1 5 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 26 C7 234 EAST 4TH STREET ... 28 3 31 4616 18690 1900 2 C7 - 2016-12-14 00:00:00
2 6 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 39 C7 197 EAST 3RD STREET ... 16 1 17 2212 7803 1900 2 C7 - 2016-12-09 00:00:00
3 7 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2B 402 21 C4 154 EAST 7TH STREET ... 10 0 10 2272 6794 1913 2 C4 3936272 2016-09-23 00:00:00
4 8 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 404 55 C2 301 EAST 10TH STREET ... 6 0 6 2369 4615 1900 2 C2 8000000 2016-11-17 00:00:00

5 rows × 22 columns

We can also view the bottom 5 rows of the dataset using the tail() function.

# Write your syntax here
df.tail()
Unnamed: 0 BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT EASE-MENT BUILDING CLASS AT PRESENT ADDRESS ... RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
84543 8409 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7349 34 B9 37 QUAIL LANE ... 2 0 2 2400 2575 1998 1 B9 450000 2016-11-28 00:00:00
84544 8410 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7349 78 B9 32 PHEASANT LANE ... 2 0 2 2498 2377 1998 1 B9 550000 2017-04-21 00:00:00
84545 8411 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7351 60 B2 49 PITNEY AVENUE ... 2 0 2 4000 1496 1925 1 B2 460000 2017-07-05 00:00:00
84546 8412 5 WOODROW 22 STORE BUILDINGS 4 7100 28 K6 2730 ARTHUR KILL ROAD ... 0 7 7 208033 64117 2001 4 K6 11693337 2016-12-21 00:00:00
84547 8413 5 WOODROW 35 INDOOR PUBLIC AND CULTURAL FACILITIES 4 7105 679 P9 155 CLAY PIT ROAD ... 0 1 1 10796 2400 2006 4 P9 69300 2016-10-27 00:00:00

5 rows × 22 columns

BOROUGH: A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5).

To view the dimensions of the dataframe, we use the .shape parameter. Expected result: (84548, 22)

# Write your syntax here
df.shape
(84548, 22)

According to this official page, an easement “is a right, such as a right of way, which allows an entity to make limited use of another’s real property. For example: MTA railroad tracks that run across a portion of another property”. The Unnamed: 0 column is not mentioned in the documentation and was likely just used as a row index. So, those two columns are removed for now.

# Drop 'Unnamed: 0' and 'EASE-MENT' features using .drop function
df.drop(['Unnamed: 0','EASE-MENT'], axis=1, inplace=True)
df.head()
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 392 6 C2 153 AVENUE B 10009 5 0 5 1633 6440 1900 2 C2 6625000 2017-07-19 00:00:00
1 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 26 C7 234 EAST 4TH STREET 10009 28 3 31 4616 18690 1900 2 C7 - 2016-12-14 00:00:00
2 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 39 C7 197 EAST 3RD STREET 10009 16 1 17 2212 7803 1900 2 C7 - 2016-12-09 00:00:00
3 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2B 402 21 C4 154 EAST 7TH STREET 10009 10 0 10 2272 6794 1913 2 C4 3936272 2016-09-23 00:00:00
4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 404 55 C2 301 EAST 10TH STREET 10009 6 0 6 2369 4615 1900 2 C2 8000000 2016-11-17 00:00:00

Let’s view the dtype of each feature in the dataframe using the .info() function.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84548 entries, 0 to 84547
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   BOROUGH                         84548 non-null  int64 
 1   NEIGHBORHOOD                    84548 non-null  object
 2   BUILDING CLASS CATEGORY         84548 non-null  object
 3   TAX CLASS AT PRESENT            84548 non-null  object
 4   BLOCK                           84548 non-null  int64 
 5   LOT                             84548 non-null  int64 
 6   BUILDING CLASS AT PRESENT       84548 non-null  object
 7   ADDRESS                         84548 non-null  object
 8   APARTMENT NUMBER                84548 non-null  object
 9   ZIP CODE                        84548 non-null  int64 
 10  RESIDENTIAL UNITS               84548 non-null  int64 
 11  COMMERCIAL UNITS                84548 non-null  int64 
 12  TOTAL UNITS                     84548 non-null  int64 
 13  LAND SQUARE FEET                84548 non-null  object
 14  GROSS SQUARE FEET               84548 non-null  object
 15  YEAR BUILT                      84548 non-null  int64 
 16  TAX CLASS AT TIME OF SALE       84548 non-null  int64 
 17  BUILDING CLASS AT TIME OF SALE  84548 non-null  object
 18  SALE PRICE                      84548 non-null  object
 19  SALE DATE                       84548 non-null  object
dtypes: int64(9), object(11)
memory usage: 12.9+ MB

It looks like empty records are not being treated as NA. We convert the columns to their appropriate data types so that missing entries become NaN.

#First, let's check which columns should be categorical
print('Column name')
for col in df.columns:
    if df[col].dtype=='object':
        print(col, df[col].nunique())
Column name
NEIGHBORHOOD 254
BUILDING CLASS CATEGORY 47
TAX CLASS AT PRESENT 11
BUILDING CLASS AT PRESENT 167
ADDRESS 67563
APARTMENT NUMBER 3989
LAND SQUARE FEET 6062
GROSS SQUARE FEET 5691
BUILDING CLASS AT TIME OF SALE 166
SALE PRICE 10008
SALE DATE 364
# LAND SQUARE FEET, GROSS SQUARE FEET, SALE PRICE should be numeric (BOROUGH is converted to numeric here as well).
# SALE DATE should be in datetime format.
# categorical: NEIGHBORHOOD, BUILDING CLASS CATEGORY, TAX CLASS AT PRESENT, BUILDING CLASS AT PRESENT,
# BUILDING CLASS AT TIME OF SALE, TAX CLASS AT TIME OF SALE

numer = ['LAND SQUARE FEET','GROSS SQUARE FEET', 'SALE PRICE', 'BOROUGH']
for col in numer: # coerce for missing values
    df[col] = pd.to_numeric(df[col], errors='coerce')

categ = ['NEIGHBORHOOD', 'BUILDING CLASS CATEGORY', 'TAX CLASS AT PRESENT', 'BUILDING CLASS AT PRESENT', 'BUILDING CLASS AT TIME OF SALE', 'TAX CLASS AT TIME OF SALE']
for col in categ:
    df[col] = df[col].astype('category')

df['SALE DATE'] = pd.to_datetime(df['SALE DATE'], errors='coerce')

Our dataset is ready for checking missing values.

missing = df.isnull().sum()/len(df)*100

print(pd.DataFrame([missing[missing>0],pd.Series(df.isnull().sum()[df.isnull().sum()>1000])], index=['percent missing','how many missing']))
                  LAND SQUARE FEET  GROSS SQUARE FEET   SALE PRICE
percent missing          31.049818          32.658372     17.22217
how many missing      26252.000000       27612.000000  14561.00000

Around 31% of LAND SQUARE FEET and 33% of GROSS SQUARE FEET values are missing. Furthermore, around 17% of SALE PRICE values are also missing.

Where only one of the two square-footage columns is missing, we can fill it in from the other, which helps reduce the number of missing values. Expected shapes:

(6, 20)

(1366, 20)

print(df[(df['LAND SQUARE FEET'].isnull()) & (df['GROSS SQUARE FEET'].notnull())].shape)
print(df[(df['LAND SQUARE FEET'].notnull()) & (df['GROSS SQUARE FEET'].isnull())].shape)
(6, 20)
(1366, 20)

There are 1372 rows that can be filled in with their approximate values.

# Fill LAND SQUARE FEET from GROSS SQUARE FEET where only the former is missing, and vice versa
df['LAND SQUARE FEET'] = df['LAND SQUARE FEET'].mask((df['LAND SQUARE FEET'].isnull()) & (df['GROSS SQUARE FEET'].notnull()), df['GROSS SQUARE FEET'])
df['GROSS SQUARE FEET'] = df['GROSS SQUARE FEET'].mask((df['LAND SQUARE FEET'].notnull()) & (df['GROSS SQUARE FEET'].isnull()), df['LAND SQUARE FEET'])
#  Check for duplicates before dropping them

print(sum(df.duplicated()))

df[df.duplicated(keep=False)].sort_values(['NEIGHBORHOOD', 'ADDRESS']).head(10)

# df.duplicated() marks every duplicate except its first occurrence; keep=False marks all members of a duplicate group

# df.duplicated(subset=[...]) lets us look for duplicates only in the specified columns
765
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
76286 5 ANNADALE 02 TWO FAMILY DWELLINGS 1 6350 7 B2 106 BENNETT PLACE 10312 2 0 2 8000.0 4208.0 1985 1 B2 NaN 2017-06-27
76287 5 ANNADALE 02 TWO FAMILY DWELLINGS 1 6350 7 B2 106 BENNETT PLACE 10312 2 0 2 8000.0 4208.0 1985 1 B2 NaN 2017-06-27
76322 5 ANNADALE 05 TAX CLASS 1 VACANT LAND 1B 6459 28 V0 N/A HYLAN BOULEVARD 0 0 0 0 6667.0 6667.0 0 1 V0 NaN 2017-05-11
76323 5 ANNADALE 05 TAX CLASS 1 VACANT LAND 1B 6459 28 V0 N/A HYLAN BOULEVARD 0 0 0 0 6667.0 6667.0 0 1 V0 NaN 2017-05-11
76383 5 ARDEN HEIGHTS 01 ONE FAMILY DWELLINGS 1 5741 93 A5 266 ILYSSA WAY 10312 1 0 1 500.0 1354.0 1996 1 A5 320000.0 2017-06-06
76384 5 ARDEN HEIGHTS 01 ONE FAMILY DWELLINGS 1 5741 93 A5 266 ILYSSA WAY 10312 1 0 1 500.0 1354.0 1996 1 A5 320000.0 2017-06-06
76643 5 ARROCHAR 02 TWO FAMILY DWELLINGS 1 3103 57 B2 129 MC CLEAN AVENUE 10305 2 0 2 5000.0 2733.0 1925 1 B2 NaN 2017-03-21
76644 5 ARROCHAR 02 TWO FAMILY DWELLINGS 1 3103 57 B2 129 MC CLEAN AVENUE 10305 2 0 2 5000.0 2733.0 1925 1 B2 NaN 2017-03-21
50126 4 ASTORIA 03 THREE FAMILY DWELLINGS 1 856 139 C0 22-18 27TH STREET 11105 3 0 3 2000.0 1400.0 1930 1 C0 NaN 2017-01-12
50127 4 ASTORIA 03 THREE FAMILY DWELLINGS 1 856 139 C0 22-18 27TH STREET 11105 3 0 3 2000.0 1400.0 1930 1 C0 NaN 2017-01-12

The dataframe has 765 duplicated rows (excluding the original rows).

df.drop_duplicates(inplace=True)

print(sum(df.duplicated()))
0
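
As a side note, duplicates can also be checked on a subset of columns. Here is a sketch (the choice of columns is an assumption, not part of the graded work) that flags repeated sales of the same borough/block/lot on the same date:

# keep=False marks every member of a duplicate group, not just the later occurrences
dup_bbl = df.duplicated(subset=['BOROUGH', 'BLOCK', 'LOT', 'SALE DATE'], keep=False)
print(dup_bbl.sum(), 'rows share a borough/block/lot and sale date')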

Exploratory Data Analysis

Now, let’s get simple descriptive statistics with the .describe() function for properties with zero COMMERCIAL UNITS.

df[df['COMMERCIAL UNITS']==0].describe().transpose()
count mean std min 25% 50% 75% max
BOROUGH 78777.0 3.004329 1.298594e+00 1.0 2.0 3.0 4.0 5.0
BLOCK 78777.0 4273.781015 3.589242e+03 1.0 1330.0 3340.0 6361.0 16322.0
LOT 78777.0 395.422420 6.716047e+02 1.0 23.0 52.0 1003.0 9106.0
ZIP CODE 78777.0 10722.737068 1.318494e+03 0.0 10304.0 11209.0 11357.0 11694.0
RESIDENTIAL UNITS 78777.0 1.691737 9.838994e+00 0.0 0.0 1.0 2.0 889.0
COMMERCIAL UNITS 78777.0 0.000000 0.000000e+00 0.0 0.0 0.0 0.0 0.0
TOTAL UNITS 78777.0 1.724133 9.835016e+00 0.0 1.0 1.0 2.0 889.0
LAND SQUARE FEET 52780.0 3140.139731 2.929999e+04 0.0 1600.0 2295.0 3300.0 4252327.0
GROSS SQUARE FEET 52780.0 2714.612069 2.791294e+04 0.0 975.0 1600.0 2388.0 4252327.0
YEAR BUILT 78777.0 1781.065451 5.510246e+02 0.0 1920.0 1940.0 1967.0 2017.0
SALE PRICE 65629.0 995296.912904 3.329268e+06 0.0 240000.0 529490.0 921956.0 345000000.0

Let us try to understand the columns. The table above shows descriptive statistics for the numeric columns.

  • There are ZIP CODE values of 0.
  • Can block/lot numbers really go up to 16322?
  • Most properties have only 1–2 units, while the maximum in this subset reaches 889 units; the latter likely means a company purchased an entire building and should be treated as an outlier.
  • Other columns also contain outliers that need further investigation.
  • The YEAR BUILT column contains values of 0.
  • Sale prices below $10,000 can mostly be treated as gifts or deed transfers rather than market sales (see the cleaning sketch after this list).
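
A minimal cleaning sketch based on these observations (the thresholds and column choices are assumptions, not requirements of the assignment):

# Keep only plausible market sales: a price above a token amount, a known build year,
# a non-zero zip code, and at least one unit; rows with missing SALE PRICE drop out as well
clean = df[
    (df['SALE PRICE'] > 10000) &
    (df['YEAR BUILT'] > 0) &
    (df['ZIP CODE'] > 0) &
    (df['TOTAL UNITS'] > 0)
]
print(clean.shape)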

Now, let’s get simple descriptive statistics with the .describe() function for properties with zero RESIDENTIAL UNITS.

# Write your function below
# Graded-Function Begin (~1 Lines)
df[df['RESIDENTIAL UNITS']==0].describe().transpose()
# Graded-Function End
count mean std min 25% 50% 75% max
BOROUGH 24546.0 2.542084e+00 1.334486e+00 1.0 1.0 3.0 4.00 5.000000e+00
BLOCK 24546.0 3.355267e+03 3.091222e+03 1.0 1158.0 1947.0 5390.75 1.631700e+04
LOT 24546.0 2.839434e+02 5.700453e+02 1.0 12.0 38.0 135.00 9.056000e+03
ZIP CODE 24546.0 1.032151e+04 2.135406e+03 0.0 10023.0 11004.0 11354.00 1.169400e+04
RESIDENTIAL UNITS 24546.0 0.000000e+00 0.000000e+00 0.0 0.0 0.0 0.00 0.000000e+00
COMMERCIAL UNITS 24546.0 4.593824e-01 1.582602e+01 0.0 0.0 0.0 0.00 2.261000e+03
TOTAL UNITS 24546.0 5.633504e-01 1.582595e+01 0.0 0.0 0.0 0.00 2.261000e+03
LAND SQUARE FEET 9503.0 7.416797e+03 8.032892e+04 0.0 0.0 0.0 3250.00 4.252327e+06
GROSS SQUARE FEET 9503.0 8.870466e+03 7.890877e+04 0.0 0.0 0.0 2500.00 4.252327e+06
YEAR BUILT 24546.0 1.675526e+03 6.790950e+02 0.0 1921.0 1950.0 1962.00 2.017000e+03
SALE PRICE 20855.0 1.632257e+06 1.969307e+07 0.0 182500.0 395000.0 850000.00 2.210000e+09

Write your findings below:

Use the .value_counts() function to count the number of sales in each BOROUGH. Expected output:

4 26548
3 23843
1 18102
5 8296
2 6994
Name: BOROUGH, dtype: int64

print('unique values ', df['BOROUGH'].unique())
# Note: df['BOROUGH'].value_counts() alone would give the same counts without the duplicated index
df1 = df['BOROUGH'].groupby(df['BOROUGH']).value_counts()
df1
unique values  [1 2 3 4 5]

BOROUGH  BOROUGH
1        1          18102
2        2           6994
3        3          23843
4        4          26548
5        5           8296
Name: BOROUGH, dtype: int64

From here, we can calculate the mean. Use the .mean() function to calculate it. (Note that the code below takes the mean of the per-borough sale-price counts, not the mean sale price itself; a sketch of the latter follows the output.)

# Write your function below
df3 = df['SALE PRICE'].groupby(df['BOROUGH']).value_counts()
# Graded-Function Begin (~1 Lines)
print(df3)
# Graded-Function End
print('mean ',df3.mean())
BOROUGH  SALE PRICE 
1        10.0           90
         1100000.0      89
         750000.0       82
         1300000.0      79
         1250000.0      77
                        ..
5        11700000.0      1
         11900000.0      1
         31500000.0      1
         67200000.0      1
         122000000.0     1
Name: SALE PRICE, Length: 14997, dtype: int64
mean  4.641394945655798
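
The value above is the mean of the counts, not of the prices themselves. A one-line sketch (output not shown) of the mean sale price per borough would be:

# Mean SALE PRICE per borough; NaN prices are skipped automatically
df.groupby('BOROUGH')['SALE PRICE'].mean()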

From here, we can calculate the median. Use the .median() function to calculate it. (As above, this is the median of the counts; per-borough statistics of SALE PRICE itself are sketched after the standard-deviation step.)

# Write your function below
print(df3.describe())
print('median ',df3.median())
# Graded-Function Begin (~1 Lines)

# Graded-Function End
count    14997.000000
mean         4.641395
std         69.275739
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max       8186.000000
Name: SALE PRICE, dtype: float64
median  1.0

From here, we can calculate the mode of these counts.

# Write your function below
# print('mode ',df.mode())
# df4 = df.set_index('BOROUGH')
df3.mode()
# Graded-Function Begin (~1 Lines)

# Graded-Function End
0    1
dtype: int64

The mode of these counts is 1, i.e. most (BOROUGH, SALE PRICE) combinations occur only once.

From here, we can calculate the range of these counts.

# Write your function below
df4 = df3.describe()
value_range = df4['max'] - df4['min']  # renamed to avoid shadowing the built-in range()
value_range
# Graded-Function Begin (~1 Lines)

# Graded-Function End
8185.0

From here, we can calculate the variance of these counts.

# Write your function below
df3.var()
# Graded-Function Begin (~1 Lines)

# Graded-Function End
4799.127995598708

From here, we can calculate the standard deviation (SD) of these counts.

# Write your function below
df3.std()
# Graded-Function Begin (~1 Lines)

# Graded-Function End
69.27573886721605
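
For the corresponding per-borough statistics of SALE PRICE itself, a sketch (output not shown; the set of aggregations is an assumption) could compute several measures at once:

# Median, standard deviation, variance, and range of SALE PRICE per borough
summary = df.groupby('BOROUGH')['SALE PRICE'].agg(['median', 'std', 'var', 'min', 'max'])
summary['range'] = summary['max'] - summary['min']
print(summary)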

Now we can analyze the probability distribution below.

# for inline plots in jupyter
%matplotlib inline
# import matplotlib
import matplotlib.pyplot as plt
# for latex equations
from IPython.display import Math, Latex
# for displaying images
from IPython.core.display import Image
# import seaborn
import seaborn as sns
# settings for seaborn plotting style
sns.set(color_codes=True)
# settings for seaborn plot sizes
sns.set(rc={'figure.figsize':(5,5)})
# Write your function below

# Graded-Function Begin
# Note: sns.distplot is deprecated in newer seaborn releases (sns.histplot / sns.displot replace it)
ax = sns.distplot(df1,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Sales count per borough', ylabel='Frequency')
# Graded-Function End
[Text(0, 0.5, 'Frequency'), Text(0.5, 0, 'Sales count per borough')]

(Figure: histogram/KDE of the per-borough sales counts)
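
A more informative look might target the distribution of SALE PRICE itself. Here is a sketch (the $10,000 floor and the log scale are assumptions, and sns.histplot requires seaborn 0.11 or later):

# Sale prices are heavily right-skewed, so plot them on a log10 scale
prices = df.loc[df['SALE PRICE'] > 10000, 'SALE PRICE']
ax = sns.histplot(np.log10(prices), bins=100, kde=True, color='skyblue')
ax.set(xlabel='log10(SALE PRICE)', ylabel='Count')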

Now we can analyze Confidence Intervals below.

df.head()
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 392 6 C2 153 AVENUE B 10009 5 0 5 1633.0 6440.0 1900 2 C2 6625000.0 2017-07-19
1 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 26 C7 234 EAST 4TH STREET 10009 28 3 31 4616.0 18690.0 1900 2 C7 NaN 2016-12-14
2 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 39 C7 197 EAST 3RD STREET 10009 16 1 17 2212.0 7803.0 1900 2 C7 NaN 2016-12-09
3 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2B 402 21 C4 154 EAST 7TH STREET 10009 10 0 10 2272.0 6794.0 1913 2 C4 3936272.0 2016-09-23
4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 404 55 C2 301 EAST 10TH STREET 10009 6 0 6 2369.0 4615.0 1900 2 C2 8000000.0 2016-11-17
# Write your function below
dx = df[["SALE PRICE", "BOROUGH"]].dropna()
# dx
pd.crosstab(dx['SALE PRICE'], dx['BOROUGH'])
# Graded-Function Begin

BOROUGH 1 2 3 4 5
SALE PRICE
0.000000e+00 0 1826 8186 0 0
1.000000e+00 30 22 37 28 9
2.000000e+00 2 0 0 1 0
3.000000e+00 0 1 0 1 0
5.000000e+00 0 0 1 0 0
... ... ... ... ... ...
5.650000e+08 1 0 0 0 0
6.200000e+08 1 0 0 0 0
6.520000e+08 1 0 0 0 0
1.040000e+09 1 0 0 0 0
2.210000e+09 1 0 0 0 0

10007 rows × 5 columns

# Proportion of $0 sales among Borough 2 (Bronx) sales priced at $0 or $1, taken from the crosstab above
p_fm = 1826/(1826+22)
p_fm
0.9880952380952381
n = 1826+22
n
1848
# Standard error of the proportion
br_2 = np.sqrt(p_fm * (1 - p_fm) / n)
br_2
0.002522950772404102
z_score = 1.96  # z value for a 95% confidence level
lcb = p_fm - z_score* br_2 #lower limit of the CI
ucb = p_fm + z_score* br_2 #upper limit of the CI
print('Confidence interval ',lcb, ucb)
Confidence interval  0.9831502545813261 0.9930402216091502
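
As an alternative, statsmodels can produce a confidence interval for a mean directly. This is only a sketch: the restriction to Manhattan market sales above $10,000 is an assumption, not part of the graded work.

import statsmodels.api as sm

# 95% z-based confidence interval for the mean SALE PRICE of Manhattan sales above $10,000
manhattan = df.loc[(df['BOROUGH'] == 1) & (df['SALE PRICE'] > 10000), 'SALE PRICE']
lower, upper = sm.stats.DescrStatsW(manhattan).zconfint_mean(alpha=0.05)
print('95% CI for the mean Manhattan sale price:', lower, upper)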

Make your hypothesis test below.

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df
BOROUGH NEIGHBORHOOD BUILDING CLASS CATEGORY TAX CLASS AT PRESENT BLOCK LOT BUILDING CLASS AT PRESENT ADDRESS APARTMENT NUMBER ZIP CODE RESIDENTIAL UNITS COMMERCIAL UNITS TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET YEAR BUILT TAX CLASS AT TIME OF SALE BUILDING CLASS AT TIME OF SALE SALE PRICE SALE DATE
0 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 392 6 C2 153 AVENUE B 10009 5 0 5 1633.0 6440.0 1900 2 C2 6625000.0 2017-07-19
1 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 26 C7 234 EAST 4TH STREET 10009 28 3 31 4616.0 18690.0 1900 2 C7 NaN 2016-12-14
2 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2 399 39 C7 197 EAST 3RD STREET 10009 16 1 17 2212.0 7803.0 1900 2 C7 NaN 2016-12-09
3 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2B 402 21 C4 154 EAST 7TH STREET 10009 10 0 10 2272.0 6794.0 1913 2 C4 3936272.0 2016-09-23
4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS 2A 404 55 C2 301 EAST 10TH STREET 10009 6 0 6 2369.0 4615.0 1900 2 C2 8000000.0 2016-11-17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84543 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7349 34 B9 37 QUAIL LANE 10309 2 0 2 2400.0 2575.0 1998 1 B9 450000.0 2016-11-28
84544 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7349 78 B9 32 PHEASANT LANE 10309 2 0 2 2498.0 2377.0 1998 1 B9 550000.0 2017-04-21
84545 5 WOODROW 02 TWO FAMILY DWELLINGS 1 7351 60 B2 49 PITNEY AVENUE 10309 2 0 2 4000.0 1496.0 1925 1 B2 460000.0 2017-07-05
84546 5 WOODROW 22 STORE BUILDINGS 4 7100 28 K6 2730 ARTHUR KILL ROAD 10309 0 7 7 208033.0 64117.0 2001 4 K6 11693337.0 2016-12-21
84547 5 WOODROW 35 INDOOR PUBLIC AND CULTURAL FACILITIES 4 7105 679 P9 155 CLAY PIT ROAD 10309 0 1 1 10796.0 2400.0 2006 4 P9 69300.0 2016-10-27

83783 rows × 20 columns

Null hypothesis: the mean SALE PRICE in Borough 1 (Manhattan) is equal to the mean SALE PRICE in Borough 2 (Bronx).
Alternative hypothesis: the mean SALE PRICE in Borough 1 differs from that in Borough 2 (two-sided test).

# Write your function below
b1 = df[df['BOROUGH'] == 1]
b2 = df[df['BOROUGH'] == 2]

# Graded-Function Begin
# Note: len() counts all rows, including those with missing SALE PRICE; mean() and std() skip NaN
n1 = len(b1)
mu1 = b1["SALE PRICE"].mean()
sd1 = b1["SALE PRICE"].std()

n2 = len(b2)
mu2 = b2["SALE PRICE"].mean()
sd2 = b2["SALE PRICE"].std()

print(n1, mu1, sd1)
print(n2, mu2, sd2)
# Graded-Function End
18102 3344641.9823292056 24140481.087110177
6994 594677.118387189 2793509.0485762
sm.stats.ztest(b1["SALE PRICE"].dropna(), b2["SALE PRICE"].dropna(),alternative='two-sided')
(9.495736658776908, 2.1865980196322893e-21)
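
To make the decision explicit, a short sketch (assuming a significance level of 0.05) compares the p-value against alpha:

# Two-sample z-test on mean SALE PRICE, Borough 1 vs Borough 2; returns (z statistic, p-value)
stat, pval = sm.stats.ztest(b1["SALE PRICE"].dropna(), b2["SALE PRICE"].dropna(), alternative='two-sided')
alpha = 0.05
print('z =', stat, 'p-value =', pval)
print('Reject the null hypothesis' if pval < alpha else 'Fail to reject the null hypothesis')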

Write your final conclusion below.
The p-value (about 2.2e-21) is far below 0.05, so we reject the null hypothesis: the mean sale price in Borough 1 (Manhattan) is significantly different from, and substantially higher than, that in Borough 2 (the Bronx).