Wednesday, September 13, 2017

Machine Learning - Linear regression using python

Hello friends,

This is my first post on machine learning.

Let's talk about the problem first. The task is to predict the price of an automobile.
The dataset is available at the link below.
https://archive.ics.uci.edu/ml/datasets/Automobile

The dataset description is available at the URL below.
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names

Solving any machine learning problem takes many steps. Below are some of the steps in a typical model-building process.


Data Collection:
As part of data collection, we will get the data from the URL below.
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
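
A minimal sketch of this step, assuming pandas is used to read the file. The column names follow the dataset description (imports-85.names). The `load_autos` helper and the two inline sample rows are illustrative assumptions, included so the snippet runs without network access.

```python
import io

import pandas as pd

DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

# Column names as listed in imports-85.names (the data file has no header row).
COLUMNS = [
    "symboling", "normalized-losses", "make", "fuel-type", "aspiration",
    "num-of-doors", "body-style", "drive-wheels", "engine-location",
    "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
    "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
    "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg",
    "price",
]


def load_autos(source):
    """Hypothetical helper: read the header-less CSV from a URL, path or file object."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Two illustrative rows in the file's format, so the sketch runs offline.
SAMPLE = (
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495\n"
    "2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,"
    "54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950\n"
)

df = load_autos(io.StringIO(SAMPLE))
print(df.shape)
print(df[["make", "normalized-losses", "price"]])

# To read the real file instead: df = load_autos(DATA_URL)
```

Note that the "?" entries are still plain strings at this point; they are handled in the cleaning step.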

Data Preparation and Cleaning:
This step is one of the most important steps in building any ML model.
Some common steps are listed below, and there can be many more.
  1. Adding a header to the data: As the dataset has no header row, we will add the corresponding header to each column. If the header is already present, we can skip this step.
  2. Removing spaces: Some of the categorical data has whitespace, which can sometimes cause issues when comparing values down the line, so we will strip that whitespace.
  3. Impute values: An imputed value is a value substituted for an item whose actual value is not available. In our dataset you will find a lot of "?" entries, which are missing values. This step has two sub-steps, as follows.
  •    Replace missing-value markers: Every "?" in the dataset is replaced with NaN. The missing values are: 41 rows of normalized-losses, 2 rows of num-of-doors, 4 rows of bore, 4 rows of stroke, 2 rows of horsepower, 2 rows of peak-rpm and finally 4 rows of price.

  •    Replace NaN with values: Next we assign values to the missing cells in the corresponding columns. There are many ways to do this; a common one is to replace numeric data with the column mean and categorical data with the column mode. Note: Sometimes people choose to remove these rows instead and then build the model. In the end, the accuracy of the model decides which approach works better.
           

  4. Encoding: Encoding is the technique of replacing categorical data with quantitative (numeric) values. This is needed because most algorithms do not handle categorical data and only work with numeric data. This step also has several sub-steps.

  •    Find and Replace: Sometimes values in the dataset are stored as categories but are numeric in meaning. In our case "num-of-doors" and "num-of-cylinders" are stored as words, and we can replace them with the equivalent numbers.

  •    Label Encoding: Label encoding simply converts each value in a column to a number. For example, "make" has 22 manufacturers, so we assign the numbers 1-22 to these manufacturers, i.e.
    alfa-romero --> 1
    audi --> 2

  •    One-Hot Encoding: Label encoding has the advantage that it is straightforward, but it has the disadvantage that the numeric values can be "misinterpreted" by the algorithms, for instance as an ordering between manufacturers. A common alternative approach is called one-hot encoding. The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting a value improperly, but it does have the downside of adding more columns to the dataset. Pandas supports this feature through get_dummies. The function is named this way because it creates dummy/indicator variables (aka 1 or 0).
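
The whitespace and imputation steps above could look roughly like this in pandas. This is a sketch on a small made-up frame, not the code from the repository; the `clean` helper and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def clean(df, numeric_cols):
    """Hypothetical helper: strip whitespace, turn "?" into NaN, then impute."""
    df = df.copy()

    # Removing spaces: strip stray whitespace from text columns.
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            df[col] = df[col].str.strip()

    # Replace the "?" markers with NaN so pandas treats them as missing.
    df = df.replace("?", np.nan)

    # Replace NaN with values: mean for numeric columns, mode for categorical ones.
    for col in df.columns:
        if col in numeric_cols:
            df[col] = pd.to_numeric(df[col])
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df


# Made-up sample values, only to exercise the helper.
raw = pd.DataFrame({
    "make": [" audi", "bmw ", "audi", "audi"],
    "num-of-doors": ["four", "?", "two", "four"],
    "horsepower": ["102", "?", "110", "115"],
    "price": ["13950", "16430", "?", "17450"],
})

cleaned = clean(raw, numeric_cols=["horsepower", "price"])
print(cleaned)
```

Dropping the incomplete rows instead, as mentioned in the note above, would be `df.dropna()` after the "?" replacement.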

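The three encoding sub-steps can be sketched as follows. The word-to-number mapping and the tiny sample frame are assumptions for illustration. `pd.factorize` and `pd.get_dummies` are used here for label and one-hot encoding; other tools, such as the scikit-learn encoders, would work as well.

```python
import pandas as pd

# Made-up sample rows using category values that occur in the dataset.
df = pd.DataFrame({
    "make": ["alfa-romero", "audi", "audi", "bmw"],
    "num-of-doors": ["two", "four", "four", "two"],
    "fuel-type": ["gas", "gas", "diesel", "gas"],
})

# Find and replace: words that really mean numbers.
WORD_TO_NUMBER = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}
df["num-of-doors"] = df["num-of-doors"].map(WORD_TO_NUMBER)

# Label encoding: one integer per category, numbered from 1 in sorted order.
codes, categories = pd.factorize(df["make"], sort=True)
df["make-label"] = codes + 1

# One-hot encoding: one 0/1 column per category value.
one_hot = pd.get_dummies(df["fuel-type"], prefix="fuel-type", dtype=int)
df = pd.concat([df, one_hot], axis=1)

print(df)
```

With label encoding, "bmw" ends up numerically larger than "audi" even though the makes have no natural order. The one-hot columns avoid that, at the cost of one extra column per category.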


Data Visualization:
This step helps us understand the relationships between the various attributes. Some sample plots are shown below.

[Figures: plots of fuel-type, engine-location and wheel-base]

As we can see, after encoding, fuel-type has only two values. Another observation is that
very few automobiles have the engine located at the "rear". In the last figure, we can see
that the majority of vehicles have a wheel base between 95 and 110. Similarly, we can plot the other
attributes and visualize them in the same way.
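
Observations like these can also be checked numerically before plotting. The sketch below uses a small made-up frame (the values are not taken from the dataset) and computes the counts that a bar chart or histogram would display. The bin edges are an assumption chosen to match the 95-110 range mentioned above.

```python
import pandas as pd

# Made-up sample values, for illustration only.
df = pd.DataFrame({
    "engine-location": ["front", "front", "front", "rear", "front", "front"],
    "wheel-base": [88.6, 94.5, 99.8, 89.5, 105.8, 115.6],
})

# Counts per category: what a bar chart of engine-location would show.
location_counts = df["engine-location"].value_counts()

# Counts per interval: what a histogram of wheel-base would show.
bins = [85, 95, 110, 125]
wheel_base_counts = pd.cut(df["wheel-base"], bins=bins).value_counts().sort_index()

print(location_counts)
print(wheel_base_counts)

# With matplotlib installed, location_counts.plot(kind="bar") draws the chart.
```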

Feature Engineering:
This is one of the key aspects of making your model accurate. It also involves several analysis steps, as
follows.

  •    Feature-Target Correlation: Here we analyze the relationship between price (the target) and the other attribute columns by looking at the correlation between them. If we find columns that are only weakly correlated with price, we can drop them before building our model.
  •    Feature-Feature Correlation: Some features (columns) may be redundant. In a correlation plot such columns show up as highly correlated with each other, and we can eliminate one of each such pair so that our model does not overfit or underfit.
  •    Quadratic Features (combining two features): In this technique we combine two features into a single new column and use that column in our model.
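
A rough sketch of these three checks with pandas, on made-up numbers. The 0.9 threshold and the choice of a product as the combined feature are assumptions for illustration, not values taken from the original code.

```python
import pandas as pd

# Made-up sample values, for illustration only.
df = pd.DataFrame({
    "length": [168.8, 171.2, 176.6, 192.7, 199.6],
    "width": [64.1, 65.5, 66.2, 71.4, 69.6],
    "city-mpg": [21, 19, 24, 17, 16],
    "highway-mpg": [27, 26, 30, 22, 22],
    "price": [13495, 16500, 13950, 23875, 28176],
})

corr = df.corr()

# Feature-target correlation: absolute correlation of each feature with price,
# weakest first, so candidates for dropping appear at the top.
target_corr = corr["price"].drop("price").abs().sort_values()

# Feature-feature correlation: pairs of features above an assumed threshold.
THRESHOLD = 0.9
features = [c for c in df.columns if c != "price"]
redundant = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]

# Combined feature: the product of two columns becomes one new column.
df["length-x-width"] = df["length"] * df["width"]

print(target_corr)
print(redundant)
```

From each pair in `redundant`, one column could be dropped, typically the one with the weaker correlation to price.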

Model Building:
In this step we build the model, but before we do so we split the dataset into training and
test data. The model learns from the training data and then runs predictions against the
test data, for which we already know the expected output.
Comparing the predictions with the expected output tells us how accurate the model is.
A common split ratio is 70-30, but as the dataset is very small I have used 80-20 (80%
training data and 20% test data).

Finally we build the model and run the prediction.
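
A minimal sketch of the split, fit and predict steps, assuming scikit-learn is used. The data here is synthetic and merely stands in for the prepared feature columns and the price column, and the coefficients and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features (X) and price (y).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))
y = X @ np.array([1500.0, -300.0, 800.0]) + 5000.0 + rng.normal(0, 10, size=50)

# 80-20 split, as in the post; random_state only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out test data and compare with the known prices.
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)  # R^2 on the test set

print(len(X_train), len(X_test))
print(round(r2, 4))
```

`score` reports R^2. Other measures, such as mean absolute error on `predictions`, can be used in the same way to judge the accuracy.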

The complete source code for the above exercise is uploaded at the link below.

https://github.com/Hariomsingh2007/Harrycodehub/blob/master/Linear_Regression.py