Hello friends,
This is my first post on machine learning.
Let's talk about the problem first. The problem is to predict the price of an automobile.
The dataset is available at the link below.
https://archive.ics.uci.edu/ml/datasets/Automobile
The dataset description is available at the URL below.
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
Solving any machine learning problem takes a number of steps. Below are some of the steps in any model building process.
Data Collection:
As part of data collection, we will get the data from the URL below.
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
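A minimal sketch of this step, assuming pandas is used for loading. The helper name load_autos is my own and is not taken from the linked source code. The two sample lines are only there so the snippet can run offline; they follow the file's comma-separated, header-less format.

```python
import io

import pandas as pd

DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"


def load_autos(source=DATA_URL):
    """Read the raw file from a URL, a local path, or a file-like object.

    load_autos is a hypothetical helper name used for illustration.
    """
    # header=None: the file has no header row, so pandas labels the columns 0..25
    return pd.read_csv(source, header=None)


# Offline check with two sample lines in the file's format (illustrative only)
sample = io.StringIO(
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495\n"
    "1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,"
    "52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500\n"
)
raw = load_autos(sample)
print(raw.shape)  # (2, 26)
```

Calling load_autos() without an argument would read directly from the URL above, which needs network access.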
Data Preparation and Cleaning:
This step is one of the most important steps in building any ML model.
Some common steps are listed below; there can be many more.
- Adding Header to data: As the dataset has no header, we will add the corresponding header to each column. If a header is already present, we can skip this step.
- Removing Spaces: Some of the categorical data contains whitespace, which can cause issues when comparing values later on, so we will strip that whitespace.
- Impute values: An "imputed value" is a value assigned to an item for which the actual value is not available. In our dataset you will find a lot of "?" entries, which are missing values. This step has two sub-steps, as follows.
  - Replace missing-value markers: Every "?" in the dataset is replaced with NaN. We can observe that 41 rows of normalized-losses, 2 rows of num-of-doors, 4 rows of bore, 4 rows of stroke, 2 rows of horsepower, 2 rows of peak-rpm and finally 4 rows of price are missing.
  - Replace NaN with values: Now we assign values to these missing entries in the corresponding columns. There are many ways to do this; one common way I find is to replace numeric data with the column mean and categorical data with the column mode. Note: Sometimes people opt to remove these rows and then build the model. Again, the accuracy of the model decides which approach works better.
- Encoding: Encoding is the technique of replacing categorical data with quantitative (numeric) values. This is needed because most algorithms do not handle categorical data; they only work with numeric data. Again, this step has various sub-steps.
  - Find and Replace: Sometimes data in the dataset is categorical but actually represents a number. For example, in our case "num-of-doors" and "num-of-cylinders" are categorical, but we can replace their values with the equivalent numbers.
  - Label Encoding: Label encoding is simply converting each value in a column to a number. For example, "make" has 22 manufacturers, so we will assign the numbers 1-22 to these manufacturers, i.e. alfa-romero --> 1, audi --> 2.
  - One-Hot Encoding: Label encoding has the advantage that it is straightforward, but it has the disadvantage that the numeric values can be "misinterpreted" by the algorithms as having an order or magnitude. A common alternative approach is called one-hot encoding. The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly, but it does have the downside of adding more columns to the dataset. Pandas supports this feature through get_dummies. This function is named this way because it creates dummy/indicator variables (aka 1 or 0).
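The cleaning and encoding steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the four rows are made up (they are not real rows from the dataset), only five of the 26 columns are used to keep it short, and the alphabetical numbering for "make" is my reading of the "alfa-romero --> 1, audi --> 2" example. The linked source code may do these steps differently.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows in the dataset's style (NOT real data), with stray spaces and "?"
RAW = io.StringIO(
    "audi, four,gas,102,13950\n"
    "audi,?,gas,?,17450\n"
    "bmw,two ,gas,101,?\n"
    "bmw,four,diesel,121,20970\n"
)
NAMES = ["make", "num-of-doors", "fuel-type", "horsepower", "price"]
NUMERIC = ["horsepower", "price"]
CATEGORICAL = ["make", "num-of-doors", "fuel-type"]

# Adding header: the file has none, so supply the names while reading.
# dtype=str keeps everything as text until the "?" markers are handled.
df = pd.read_csv(RAW, header=None, names=NAMES, dtype=str)

# Removing spaces: strip leading/trailing whitespace in every column
df = df.apply(lambda col: col.str.strip())

# Impute, sub-step 1: turn "?" into NaN, then make numeric columns numeric
df = df.replace("?", np.nan)
df[NUMERIC] = df[NUMERIC].apply(pd.to_numeric)

# Impute, sub-step 2: mean for numeric columns, mode for categorical columns
for col in NUMERIC:
    df[col] = df[col].fillna(df[col].mean())
for col in CATEGORICAL:
    df[col] = df[col].fillna(df[col].mode()[0])

# Find and replace: words that really mean numbers
df["num-of-doors"] = df["num-of-doors"].map({"two": 2, "four": 4})

# Label encoding: number the manufacturers 1..N in alphabetical order
make_codes = {name: i for i, name in enumerate(sorted(df["make"].unique()), start=1)}
df["make-code"] = df["make"].map(make_codes)

# One-hot encoding: one indicator column per fuel type
df = pd.get_dummies(df, columns=["fuel-type"])

print(df)
```

Whether to use label encoding or one-hot encoding for a given column is a judgment call; the sketch applies one of each only to show both techniques.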
Data Visualization:
This step helps us understand the relationship between various attributes. Some sample plots are shown below.
One observation from the plots is that there are very few automobiles with the engine located at the "rear". In the last figure, we can see
that the majority of vehicles have a wheel base between 95 and 110. Similarly, we can plot the other
attributes and visualize them.
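The original figures are not reproduced here, so the following is only a sketch of how two plots of this kind could be drawn, assuming matplotlib is used together with pandas. The six rows are made up for illustration; in practice the cleaned dataset would be plotted instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only (NOT real data)
df = pd.DataFrame({
    "engine-location": ["front", "front", "front", "rear", "front", "front"],
    "wheel-base": [88.6, 94.5, 99.8, 89.5, 105.8, 101.2],
})

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: how many cars fall in each engine-location category
df["engine-location"].value_counts().plot(kind="bar", ax=ax_bar)
ax_bar.set_xlabel("engine-location")
ax_bar.set_ylabel("count")

# Histogram: distribution of wheel-base values
df["wheel-base"].plot(kind="hist", bins=5, ax=ax_hist)
ax_hist.set_xlabel("wheel-base")

fig.tight_layout()
```

To actually see the result, call fig.savefig("plots.png") (the file name is arbitrary), or drop the Agg line and call plt.show() in an interactive session.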
Feature Engineering:
This is one of the key aspects of making your model accurate. Again, it has various steps of analysis, as
follows.
- Feature-Target Correlation: As part of this, we analyze the relationship between price (the target) and the other attribute columns and look at the correlation between them. If we find columns that are only weakly correlated with the target, we can drop them before building our model.
- Feature-Feature Correlation: There may be features (columns) which are redundant. If we look at the correlation plot for such columns, they will be highly correlated with each other, and we can eliminate one of each such pair to reduce the risk of overfitting.
- Quadratic Features (binding of two features): In this technique we combine two features into one new column and use that column in our model.
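The three ideas above can be sketched with pandas as follows. The rows are made up for illustration, the 0.9 threshold is an assumed cut-off rather than a value from this post, and multiplying length by width is just one possible way to "bind" two features.

```python
import pandas as pd

# Made-up numeric rows for illustration only (NOT real data)
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181, 209],
    "city-mpg": [31, 24, 21, 19, 17, 16],
    "highway-mpg": [37, 30, 27, 26, 22, 22],
    "length": [165.3, 176.6, 168.8, 171.2, 192.7, 197.0],
    "width": [63.8, 66.2, 64.1, 65.5, 71.4, 70.9],
    "price": [7500, 13950, 13495, 16500, 30760, 36880],
})

TARGET = "price"
THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

corr = df.corr()

# Feature-target correlation: absolute correlation with price, weakest first
target_corr = corr[TARGET].drop(TARGET).abs().sort_values()

# Feature-feature correlation: pairs of features that look redundant
features = [c for c in df.columns if c != TARGET]
redundant_pairs = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]

# Quadratic feature: combine two columns into one new column
df["length-x-width"] = df["length"] * df["width"]

print(target_corr)
print(redundant_pairs)
```

Columns at the top of target_corr are candidates for dropping, and for each pair in redundant_pairs one of the two columns is a candidate for removal.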
Model Building:
In this step we build the model, but before we go ahead and build it, we split the
dataset into training and test data. This is because our model learns from the training dataset
and then runs its predictions against the test data, for which we already know the expected output.
In this way we can measure the accuracy of the model.
In general the split ratio is 70-30, but in my case, as the dataset is very small, I have taken 80-20 (80%
training data and 20% test data).
Finally, we build the model and run the prediction.
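To make the split-then-fit idea concrete, here is a small NumPy-only sketch. It is not the code from the linked repository: the helper names are my own, the data is synthetic, and the fit is plain ordinary least squares standing in for whatever linear regression implementation is actually used.

```python
import numpy as np


def split_train_test(X, y, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as test data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def fit_linear(X, y):
    """Ordinary least squares; the first coefficient is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted coefficients to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 means a perfect prediction."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data (NOT the automobile dataset): one feature, noisy linear target
rng = np.random.default_rng(42)
X = rng.uniform(60, 300, size=(50, 1))
y = 100.0 * X[:, 0] + 2000.0 + rng.normal(0, 500, size=50)

# 80-20 split, fit on the training part, score on the held-out part
X_train, X_test, y_train, y_test = split_train_test(X, y, test_size=0.2)
coef = fit_linear(X_train, y_train)
score = r2_score(y_test, predict(coef, X_test))
print(len(X_train), len(X_test), round(score, 3))
```

In practice, libraries such as scikit-learn provide train_test_split and LinearRegression for these two steps; the hand-written versions here are only meant to show what happens underneath.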
The whole source code for the above exercise is uploaded at the link below.
https://github.com/Hariomsingh2007/Harrycodehub/blob/master/Linear_Regression.py