Importing all libraries
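The original import cell is not shown here; a minimal sketch of the libraries a notebook like this typically relies on (the exact set is an assumption) might be:

```python
# Core stack assumed throughout this notebook: numpy/pandas for data
# handling, matplotlib/seaborn for the plots referenced below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```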

Data exploration

Loading our dataset and getting familiar with it

Checking the operation types in the op_type column:
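A minimal sketch of that check, assuming the dataset has already been loaded into a dataframe named df:

```python
# Frequency of each operation type in the dataset.
df["op_type"].value_counts()
```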

As you can see, there are also other values like "Buying", "Renting", "Change" and "Other". Before continuing, let's do the following (a sketch of the filtering appears after the list):

  1. Drop entries with operation types "Change" and "Other" as irrelevant to our goal of price prediction
  2. Drop entries with operation types "Buying" and "Renting" as they are represented by only a few samples
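A sketch of the filtering, reusing the df and op_type names from above:

```python
# Drop the operation types listed above; everything else is kept.
df = df[~df["op_type"].isin(["Change", "Other", "Buying", "Renting"])]
df["op_type"].value_counts()
```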

Let's explore the unique districts in the district column:

Let's look at the unique values of the other columns as well:

Floor values look fine.

Someone not coming from Eastern Europe might be confused by the house_seria values, but believe us, they are fine. Despite Riga being the city with the highest concentration of Art Nouveau architecture anywhere in the world, it also has many standardized apartment blocks constructed in the Soviet period, so 602, 119, 103., 467. and 104. are just the quirky names of construction projects. We will treat them as ordinary categorical values.

Now let's check lat and lon columns:

The latitude of Rīga, Latvia is 56.946285, and the longitude is 24.105078. While some of the values seem to be within the correct range, there are broken values that make the plot look terribly zoomed out. Let's check how many samples have wrong coordinates. The previous plot suggests that all broken values deviate substantially from the real Riga coordinates, so we can use a rough comparison to filter them out.
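One way to sketch this rough comparison (the exact bounding box below is an assumption, chosen generously around the Riga coordinates quoted above):

```python
# Flag coordinates far outside a rough bounding box around Riga.
# Rows with missing lat/lon are left alone; they are handled later.
in_riga = df["lat"].between(56.8, 57.1) & df["lon"].between(23.9, 24.4)
broken = df["lat"].notna() & df["lon"].notna() & ~in_riga
print(broken.sum(), "rows with broken coordinates")
```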

Not that many to worry about; let's just drop them and see how the plot looks without the broken values:

Much better! All items are now concentrated within a single area matching Riga's coordinates. Let's see them overlaid on an actual map of Riga:

Handling missing values

Let's define a helper function to get missing values for a dataframe
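The authors' helper is not shown; one common shape for such a function is:

```python
def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    count = frame.isna().sum()
    report = pd.DataFrame({
        "missing": count,
        "percent": (count / len(frame) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

missing_report(df)
```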

Missing geo coordinates

It can be seen that most missing values come from the geo coordinate columns, lon and lat. However, we do not resolve these missing values, because the two columns turn out to be of little use and are eventually deleted. For the record, they were properly handled in the first trials of this project before we reached that conclusion.

Missing districts

Let's take a look at the entries with missing district value:

One can find the missing district names by looking at rows with the same street:

Great! There are multiple properties listed at the same address, Ogļu 32. Let's impute the missing value:
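A sketch of that imputation; whether the street column stores the full address string "Ogļu 32" is an assumption based on the text:

```python
# Copy the district from the other listings at the same address into
# the row where it is missing.
same_address = df["street"] == "Ogļu 32"
known = df.loc[same_address & df["district"].notna(), "district"].iloc[0]
df.loc[same_address & df["district"].isna(), "district"] = known
```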

Let's try doing the same for Pupuku iela 9:

No luck this time: this is the only property on Pupuku street in our dataset. We could use an alternative approach and search for the nearest points within some range using the lat and lon values, but that would be overkill for a single row. Let's impute the district manually by finding Pupuku iela 9 on Google Maps:

Once again, let's review what else is missing:

Invalid or missing Rooms

Let's check the unique room values:

It turns out this column is categorical due to the presence of the value "Citi". This is bad: room count is by nature numerical and might be an important input for correct price prediction in our model. So what does "Citi" really mean for rooms? "Citi" translates from Latvian as "Other". In our context the word might describe special architectural solutions where the room count can't be clearly defined.

For the sake of data integrity, let's treat "Citi" the same way as a missing value:
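A sketch of that replacement; the column name rooms is an assumption:

```python
# "Citi" carries no usable room count, so mark it as missing.
df["rooms"] = df["rooms"].replace("Citi", np.nan)
print(df["rooms"].isna().sum(), "rows with a missing room count")
```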

So we have 15 rows to fix instead of 1. To do this correctly, we can take advantage of other samples with a similar area. Let's build a helper function to approximate the room count.

The idea of the next few cells is to approximate each missing room count from listings with a similar area:
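A sketch of one such helper, assuming an area column and a ±10% similarity window (both assumptions); each missing room count is replaced by the median room count of listings with a similar area:

```python
def approx_rooms(frame: pd.DataFrame, area: float, tol: float = 0.1) -> float:
    """Median room count among listings whose area is within +/- tol of area."""
    similar = frame[
        frame["rooms"].notna()
        & frame["area"].between(area * (1 - tol), area * (1 + tol))
    ]
    return similar["rooms"].median()

# Make the column numeric, then fill the gaps from similar listings.
# This assumes every missing row finds at least one similar listing.
df["rooms"] = pd.to_numeric(df["rooms"], errors="coerce")
mask = df["rooms"].isna()
df.loc[mask, "rooms"] = df.loc[mask, "area"].apply(lambda a: approx_rooms(df, a))
df["rooms"] = df["rooms"].round().astype(int)
```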

We are ready!

Great! The rooms column is now numeric and contains no missing values.

Final check:

Feature Engineering

When we started working on label encoding, we noticed that the district column produced a large number of features, so we wanted to reduce them to improve the performance of the Linear Regression. We therefore decided to do some feature engineering on the district column: we grouped all of its values into 3 categories.

Our 3-category idea is based on a study about housing in Riga called "Residential satisfaction and mobility behaviour among the young: insights from the post-Soviet city of Riga".

Link to the paper
Link to the figure that shows the categories

---

Please note that the previous engineering method for the district column is no longer used in our project; a more valuable method has been adopted instead, namely ordinal encoding, as can be seen in the following few code blocks.
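A minimal sketch of ordinal encoding for the district column; note that sklearn's OrdinalEncoder assigns codes in lexicographic order by default, which may differ from the ordering the authors actually chose:

```python
from sklearn.preprocessing import OrdinalEncoder

# Replace district names with integer codes.
encoder = OrdinalEncoder()
df["district"] = encoder.fit_transform(df[["district"]]).ravel()
```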

Data Visualization

Let's see the relationship between the price and the area in a graph:

OBSERVATION:

It seems like I should split the data into two datasets, one for sale and the other for rent, because understanding the data would be easier and more beneficial.


Data Preprocessing

Task 1: Encoding categorical data

Checking where categorical data are found

Splitting the encoded dataset into two datasets: for sale and for rent.

Task 2: Removing features

We have removed these columns from the data frame: ['op_type', 'street', 'lat', 'lon', 'district', 'total floors'], because they are of little or no value to our model.
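A sketch of the removal:

```python
# Drop the low-value columns listed above; errors="ignore" tolerates
# columns that were already removed in an earlier step.
drop_cols = ["op_type", "street", "lat", "lon", "district", "total floors"]
df = df.drop(columns=drop_cols, errors="ignore")
```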

Normalising Price Values

As we anticipated from the visualization, our data won't perform well in linear regression because they are skewed. Normalizing the dependent variable is needed to achieve better prediction outcomes, so the log is used.
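A sketch of the transform, assuming the target column is named price; np.log1p is used here for safety near zero, though plain np.log works when all prices are strictly positive:

```python
# Log-transform the skewed target so it is closer to normal.
df["price"] = np.log1p(df["price"])
```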

Train Test Split
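A typical split might look like this; the exact ratio and random seed used by the authors are not shown, so the values below are assumptions:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```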

Model building

Source: https://www.kaggle.com/sudhirnl7/linear-regression-tutorial?scriptVersionId=31415973&cellId=37

Step 1: add $x_0 = 1$ to the dataset
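A sketch of this step:

```python
# Prepend a column of ones so the intercept theta_0 is learned together
# with the other parameters.
X_train_b = np.c_[np.ones(len(X_train)), X_train]
X_test_b = np.c_[np.ones(len(X_test)), X_test]
```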

https://www.unite.ai/what-is-linear-regression/
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a

Step 2: build the model
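A sketch of the normal-equation fit and the sklearn cross-check described below, reusing the X_train_b matrix from step 1:

```python
from sklearn.linear_model import LinearRegression

# Normal equation: theta = (X^T X)^(-1) X^T y; pinv is preferred over a
# plain inverse for numerical stability.
theta = np.linalg.pinv(X_train_b.T @ X_train_b) @ X_train_b.T @ y_train

# Cross-check with sklearn; fit_intercept=False because the bias column
# x0 = 1 is already part of X_train_b.
lr = LinearRegression(fit_intercept=False).fit(X_train_b, y_train)
print(np.allclose(theta, lr.coef_, atol=1e-6))
```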

Step 3: the parameters of the linear regression model

The parameters obtained from both models are the same, so we have successfully built our model using the normal equation and verified it with sklearn's linear regression module. Let's move ahead; the next step is prediction and model evaluation.

Model Evaluation

We will predict values of the target variable by applying our model parameters to the test data set, then compare the predicted values with the actual values in the test set. We compute the Mean Squared Error using the formula $$\mathbf{ J(\theta) = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$$

$\mathbf{R^2}$ is a statistical measure of how close the data are to the fitted regression line. $\mathbf{R^2}$ is always between 0% and 100%. 0% indicates that the model explains none of the variability of the response data around its mean; 100% indicates that the model explains all of the variability of the response data around the mean.

$$\mathbf{R^2 = 1 - \frac{SSE}{SST}}$$

SSE = Sum of Squared Errors
SST = Total Sum of Squares
$$\mathbf{SSE = \sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$$ $$\mathbf{SST = \sum_{i=1}^{m}(y_i - \bar{y})^2}$$ Here $\mathbf{\hat{y}}$ is the predicted value and $\mathbf{\bar{y}}$ is the mean value of $\mathbf{y}$.
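A sketch of both metrics, computed directly from the formulas above:

```python
# Predict on the test set using the normal-equation parameters.
y_pred = X_test_b @ theta

mse = np.mean((y_pred - y_test) ** 2)        # J(theta)
sse = np.sum((y_pred - y_test) ** 2)         # Sum of Squared Errors
sst = np.sum((y_test - y_test.mean()) ** 2)  # Total Sum of Squares
r2 = 1 - sse / sst
print(f"MSE: {mse:.4f}  R^2: {r2:.4f}")
```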

Model Validation

In order to validate the model we need to check a few assumptions of the linear regression model. The common assumptions for a Linear Regression model are the following:

  1. Linear relationship: linear regression assumes the relationship between the dependent and independent variables is linear. This can be checked with a scatter plot of actual vs. predicted values.
  2. The residual error plot should be normally distributed.
  3. The mean of the residual errors should be 0, or as close to 0 as possible.
  4. Linear regression requires all variables to be multivariate normal. This assumption is best checked with a Q-Q plot.
  5. Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation: $\mathbf{VIF = \frac {1}{1-R^2}}$. If $1 < VIF < 5$ there is moderate correlation; $VIF > 5$ indicates a critical level of multicollinearity (a sketch of this check appears after this list).
  6. Homoscedasticity: the data are homoscedastic, meaning the residuals are equal across the regression line. We can check this with a residual vs. fitted value scatter plot; a heteroscedastic plot would exhibit a funnel-shaped pattern.
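A sketch of the VIF check from item 5, using statsmodels and assuming X_train is fully numeric at this point:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each feature column, largest first.
vif = pd.DataFrame({
    "feature": X_train.columns,
    "VIF": [
        variance_inflation_factor(X_train.values, i)
        for i in range(X_train.shape[1])
    ],
})
print(vif.sort_values("VIF", ascending=False))
```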

For our linear regression models, these assumptions hold as follows:

  1. In our model, the actual vs. predicted plot for the (for sale) dataset is linear, so the linearity assumption holds; however, it fails for (for rent), as the plot is not aligned linearly but rather scattered roughly along two non-parallel lines.
  2. The residual mean is zero and the residual error plot is normally distributed for the (for sale) dataset, while for (for rent) the residual error plot is left-skewed and the mean is greater than 0.
  3. The Q-Q plot for (for sale) shows that the data are slightly skewed but mostly normally distributed; the Q-Q plot for (for rent) shows that the data are heavily left-skewed.
  4. The plot exhibits homoscedasticity for the (for sale) dataset and heteroscedasticity for the (for rent) dataset; the error increases after a certain point for the latter.
  5. The variance inflation factor for (for sale) is a little above 5, which means a critical level of multicollinearity, while for (for rent) it is below 1, so there is no multicollinearity.