Fitting Normal Distribution on House Price Data
What we have to prove?
House Price of each house can be modeled as Normal distribution with 95% confidence.
Importing Library
import pandas as pd import matplotlib.pyplot as plt import numpy as np import scipy.stats as st
Data
d=pd.read_excel("/content/drive/MyDrive/Site Data/House Data.xlsx") d
Output:-
plt.hist(d["House Price"],bins=10)
Output:-
Simulation
* Fitting a Normal distribution
From the histogram, the distribution could be modelled as Normal\((\mu,\sigma^2)\). The next step is to estimate \(\mu\) and \(\sigma^2\) from the given samples.
* Method of Moments
Suppose \(m_1\) and \(m_2\) are the first and second moments of the samples. The method of moments estimates are obtained by solving $$m_1=\mu ,$$ $$m_2=\sigma^2+\mu^2.$$ The solution results in $$\hat{\mu}_{MM}=m_1,\hat{\sigma}_{MM}=\sqrt{m_2-m_1^2}.$$ We now compute the values of \(m_1\) (sample mean) and \(m_2-m_1^2\) (sample variance) from the data. After that, we can compute the estimates.
x=np.array(d["House Price"]) m1=np.average(x) ss=np.var(x) # Computing sample variance. muMM = m1 sigmaMM = ss**0.5 print(muMM) print(round(sigmaMM,3))
Output:- μ : 205846.275 σ : 113100.833
# blue curve plt.hist(d["House Price"],bins=10,density=True) # np.linespace(min,max,smooth) domain= np.linspace(d["House Price"].min(),d["House Price"].max(),50) # Orange Line. plt.plot(domain,st.norm.pdf(domain,loc=muMM,scale=sigmaMM),label='Normal fit MM') plt.legend(loc='best') plt.show()
Output:-
* Approximate confidence intervals with Bootstrap
> Bootstrap
How do we find the bias and variance of the estimator? Theoretical derivations of the sampling distributions may be too cumbersome and difficult in most cases. Bootstrap is a Monte Carlo simulation method for computing metrics such as bias, variance and confidence intervals for estimators.
In the above simulation, we have found \(\hat{\mu}_{MM}=205846.275...\) and \(\hat{\sigma}_{MM}=113100.833...\). Using these values, we simulate \(n=1321\) iid samples from Normal\((205846.275...,113100.833...)\) and using the simulated samples, we compute new estimates of \(\mu\) and \(\sigma\) and call them \(\hat{\mu}_{MM}(1)\) and \(\hat{\sigma}_{MM}(1)\). Now, repeat the simulation \(N\) times to get estimates \(\hat{\mu}_{MM}(i)\) and \(\hat{\sigma}_{MM}(i)\), \(i=1,2,\ldots,N\).
> Confidence Intervals
Suppose a parameter \(\theta\) is estimated as \(\hat{\theta}\), and suppose the distribution of \(\hat{\theta}-\theta\) is known. Then, to obtain \((100(1-\alpha))\)% confidence intervals (typical values are \(\alpha=0.1\) for 90% confidence intervals and \(\alpha=0.05\) for 95% confidence intervals), we use the CDF of \(\hat{\theta}-\theta\) to obtain \(\delta_1\) and \(\delta_2\) such that $$P(\hat{\theta}-\theta\le\delta_1)=1-\frac{\alpha}{2},$$ $$P(\hat{\theta}-\theta\le\delta_2)=\frac{\alpha}{2}.$$ Actually, the inverse of the CDF of \(\hat{\theta}-\theta\) is used to find the above \(\delta_1\) and \(\delta_2\). From the above, we see that $$P(\hat{\theta}-\theta \le \delta_1)-P(\hat{\theta}-\theta \le \delta_2)= P(\delta_2< \hat{\theta}-\theta \le \delta_1)=1-\frac{\alpha}{2}-\frac{\alpha}{2}=1-\alpha.$$ The above is rewritten as $$P(\hat{\theta}-\delta_1\le\theta<\hat{\theta}-\delta_2)=1-\alpha,$$ and \([\hat{\theta}-\delta_1,\hat{\theta}-\delta_2]\) is interpreted as the \(100(1-\alpha)\)% confidence interval.
> Bootstrap confidence intervals
The CDF of \(\hat{\theta}-\theta\) might be difficult to determine in many cases, and the bootstrap method is used often to estimate \(\delta_1\) and \(\delta_2\) for \(\mu\).
We consider the list of numbers \(\{\hat{\mu}_{MM}(1)-205846.275...,\ldots,\hat{\mu}_{MM}(N)-205846.275...\}\) and pick the \(100(\alpha/2)\)-th percentile and \(100(1-\alpha/2)\)-th percentile. Similarly we can estimate \(\delta_1\) and \(\delta_2\) for \(\sigma\).
N = 1000 n = 1321 mu_hat = np.zeros(N) sigma_hat = np.zeros(N) for i in np.arange(N): xi = st.norm.rvs(muMM,scale=sigmaMM,size=n) m1i = np.average(xi); ssi = np.var(xi) # Calculating mu_hat & sigma_hat with bootstrap. mu_hat[i] = m1i; sigma_hat[i] = ssi**0.5
# del1 & del2 for μ. del1 = np.percentile(mu_hat - muMM, 95) del2 = np.percentile(mu_hat - muMM, 5) print([round(del1,3),round(del2,3)])
Output:- [5140.216, -5027.517]
The 95% confidence interval for \(\mu\) using the method of moments estimator works out to \([205846.275-5140.216, 205846.275-(-5027.517)] = [200706.059, 210873.792]\).
# del1 & del2 for σ. dels1 = np.percentile(sigma_hat - sigmaMM, 95) dels2 = np.percentile(sigma_hat - sigmaMM, 5) print([round(dels1,3),round(dels2,3)])
Output:- [3541.747, -3648.79]
The 95% confidence interval for \(\sigma\) using the method of moments estimator works out to \([113100.833-3541.747, 113100.833-(-3648.79)] = [109559.086, 116749.623]\).
Conclusion
Hence, We can claim that House Price of each house can be modeled as Normal distribution with 95% confidence that
\(\mu\) lies in \([200706.059, 210873.792]\) range and \(\sigma\) lies in \([109559.086, 116749.623]\) range.
\(Note:-\) I Use Python Programming Language for giving simulation.