{"id":22744,"date":"2021-02-13T11:53:06","date_gmt":"2021-02-13T02:53:06","guid":{"rendered":"https:\/\/www.waca.associates\/en\/?p=22744"},"modified":"2021-03-22T08:55:18","modified_gmt":"2021-03-21T23:55:18","slug":"kit-how-to-build-a-simple-prediction-model-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.waca.or.jp\/en\/uncategorized\/kit-how-to-build-a-simple-prediction-model-in-machine-learning\/","title":{"rendered":"[KIT] How to Build a simple prediction model in machine learning"},"content":{"rendered":"<p>I am an absolute beginner in machine learning, so I have tried to keep this article simple enough that people who are just starting out can follow what I am doing.<\/p><p>Because I have only just started, this project applies the concepts I have learned so far: regression models, data cleaning, and feature engineering.<\/p><h4 class=\"wp-block-heading\">+Important libraries that will be used<\/h4><pre class=\"wp-block-preformatted\">import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n%matplotlib inline\nimport seaborn as sns<\/pre><h4 class=\"wp-block-heading\" id=\"Load-Data-from-csv-file-to-dataFrame\">+Load data from a CSV file into a DataFrame<\/h4><pre class=\"wp-block-preformatted\">df=pd.read_csv(\"..\/input\/house-prices-advanced-regression-techniques\/train.csv\")<\/pre><p>Since pandas is imported as pd, calling pd.read_csv(filepath) reads the CSV file into a DataFrame.<\/p><h4 class=\"wp-block-heading\" id=\"Exploring-data\">+Exploring data<\/h4><p>By exploring the data, I mean analyzing the factors that affect the result. I also want to know which columns are more important and which columns I should remove from the DataFrame.
Data cleaning and outlier removal are also part of this process.<\/p><p>Now let&#8217;s list all the columns of the DataFrame:<\/p><pre class=\"wp-block-preformatted\">k=sorted(list(df.columns))\nk<\/pre><p>As we can see, the DataFrame has 80 columns of independent variables and 1 dependent variable (SalePrice). If we feed all of these columns into our model, it will probably not perform well, because some of the columns may have little influence on the price. Removing those columns should improve the model&#8217;s performance.<\/p><p>To identify which columns to use for training the model, I recommend looking at this notebook:&nbsp;<a href=\"https:\/\/www.kaggle.com\/pmarcelino\/comprehensive-data-exploration-with-python\">https:\/\/www.kaggle.com\/pmarcelino\/comprehensive-data-exploration-with-python<\/a><\/p><h4 class=\"wp-block-heading\" id=\"after-doing-this-i've-seen-some-solumns-that-interesting\">After doing this, I found some columns that look interesting<\/h4><h5 class=\"wp-block-heading\" id=\"1.BsmtQual:it-is-the-quality-of-the-basement\">1. BsmtQual: Quality of the basement (the dataset rates it by basement height)<\/h5><h5 class=\"wp-block-heading\" id=\"2.TotalBsmt:-Total-square-feet-of-basement-area\">2. TotalBsmtSF: Total square feet of basement area<\/h5><h5 class=\"wp-block-heading\" id=\"3.1stFlrSF:-First-Floor-square-feet\">3. 1stFlrSF: First floor square feet<\/h5><h5 class=\"wp-block-heading\" id=\"4.2ndFlrSF:-Second-floor-square-feet\">4. 2ndFlrSF: Second floor square feet<\/h5><h5 class=\"wp-block-heading\" id=\"5.GrLivArea:-Above-grade-(ground)-living-area-square-feet\">5. GrLivArea: Above grade (ground) living area square feet<\/h5><h5 class=\"wp-block-heading\" id=\"6.GarageCars:Size-of-garage-in-car-capacity\">6. GarageCars: Size of garage in car capacity<\/h5><h5 class=\"wp-block-heading\" id=\"7-.KitchenAbvGr:-Kitchens-above-grade\">7. KitchenAbvGr: Kitchens above grade<\/h5><h5 class=\"wp-block-heading\" 
id=\"8.TotRmsAbvGrd:-Total-rooms-above-grade-(does-not-include-bathrooms)\">8. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)<\/h5><h5 class=\"wp-block-heading\" id=\"9.MiscFeature-:-Miscellaneous-feature-not-covered-in-other-categories\">9. MiscFeature: Miscellaneous feature not covered in other categories<\/h5><pre class=\"wp-block-preformatted\">df1=df[[\"BsmtQual\",\"TotalBsmtSF\",\"1stFlrSF\",\"2ndFlrSF\",\"GrLivArea\",\"BedroomAbvGr\",\"KitchenAbvGr\",\"YearBuilt\",\"TotRmsAbvGrd\",\"GarageType\",\"GarageCars\",\"MiscFeature\",\"MiscVal\",\"SalePrice\"]]\ndf1<\/pre><p>I have now removed all the columns that I think are not important and kept only the columns that I think affect the price. This is our DataFrame:<\/p><figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"784\" height=\"316\" src=\"https:\/\/www.waca.associates\/en\/wp-content\/uploads\/2021\/02\/Screenshot-from-2021-02-13-05-30-00.png\" alt=\"\" class=\"wp-image-22747\" \/><\/figure><h4 class=\"wp-block-heading\" id=\"Data-Cleaning-and-Feature-Engineering\">Data Cleaning and Feature Engineering<\/h4><h4 class=\"wp-block-heading\" id=\"Data-Cleaning\">Data Cleaning<\/h4><ul class=\"wp-block-list\"><li>Data cleaning means preparing the data for our model by removing or modifying data that is incomplete or irrelevant to the model.<\/li><li>It helps increase the accuracy of our model.<\/li><li>Data cleaning usually takes up most of the modeling process, often quoted as almost 80% of the time.<\/li><\/ul><h4 class=\"wp-block-heading\" id=\"Feature-Engineering\">Feature Engineering<\/h4><ul class=\"wp-block-list\"><li>It is the way we apply our domain or business knowledge about the data to remove outliers or erroneous data. 
For example, if we know that a room in a house is at most 300 square feet but a row in the data shows more than that, we remove that row from the DataFrame.<\/li><li>It is also part of data cleaning.<\/li><\/ul><p>Now let&#8217;s see how many rows of data we have:<\/p><pre class=\"wp-block-preformatted\">df1.shape[0]<\/pre><p>Now let&#8217;s see how many rows have a null value in MiscFeature:<\/p><pre class=\"wp-block-preformatted\">df1[df1[\"MiscFeature\"].isna()].shape[0]<\/pre><p>As we can see, 1406 rows have no miscellaneous feature, so only a small number of rows have one. We can treat those few rows as outliers and remove them from our DataFrame, then drop the two miscellaneous columns, which should improve our model&#8217;s performance.<\/p><pre class=\"wp-block-preformatted\"># df2 is not defined in the original post; assumed here to be the rows without a MiscFeature (matches the 1406-row count below)\ndf2=df1[df1[\"MiscFeature\"].isna()]\ndf3=df2.drop([\"MiscFeature\",\"MiscVal\"],axis=\"columns\")\n<\/pre><h4 class=\"wp-block-heading\" id=\"Remove-outlier\">+Remove outliers<\/h4><pre class=\"wp-block-preformatted\">df3.SalePrice.describe()<\/pre><p>Output:<\/p><pre class=\"wp-block-preformatted\">count      1406.000000\nmean     182046.410384\nstd       80084.136570\nmin       34900.000000\n25%      130000.000000\n50%      164250.000000\n75%      215000.000000\nmax      755000.000000\nName: SalePrice, dtype: float64<\/pre><pre class=\"wp-block-preformatted\">df3[\"price_per_sqt\"]=df3.SalePrice\/df3.GrLivArea\n\ndf3.price_per_sqt.describe()<\/pre><p>Output:<\/p><pre class=\"wp-block-preformatted\">count    1406.000000\nmean      120.947425\nstd        31.538984\nmin        28.358738\n25%       100.332272\n50%       120.344258\n75%       139.045487\nmax       276.250881\nName: price_per_sqt, dtype: float64<\/pre><h4 class=\"wp-block-heading\" id=\"Visualization\">+Visualization<\/h4><ul class=\"wp-block-list\"><li>It is very important because, as humans, pictures help us understand things very quickly.<\/li><li>To me, data visualization is like memes.<\/li><li>&#8220;If we need to explain our visualization, that means our 
visualization is not good enough&#8221;<\/li><\/ul><h5 class=\"wp-block-heading\" id=\"To-make-clear-that-those-columns-have-a-strong-affect-on-the-saleprice-i-am-going-to-visualize-those-columns-with-sale-price\">To make clear that these columns have a strong effect on the sale price, I am going to plot them against the price<\/h5><pre class=\"wp-block-preformatted\">def year_price(df):\n    # Scatter plot of price per square foot against the year the house was built\n    plt.scatter(df.YearBuilt,df.price_per_sqt)\n    plt.xlabel(\"YearBuilt\")\n    plt.ylabel(\"Price_per_sqft\")\n\n# the call is not shown in the original post; df3 is assumed here\nyear_price(df3)\n<\/pre><p>It seems that the year built and the price per square foot have a positive relationship.<\/p><p>Now let&#8217;s look at the graph for the basement area:<\/p><pre class=\"wp-block-preformatted\"># set the figure size before plotting so that it applies to this figure\nplt.rcParams[\"figure.figsize\"] = (15, 10)\nplt.scatter(df3.TotalBsmtSF,df3.price_per_sqt)\nplt.xlabel(\"Total basement area\")\nplt.ylabel(\"Price_per_sqft\")<\/pre><figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"390\" height=\"262\" src=\"https:\/\/www.waca.associates\/en\/wp-content\/uploads\/2021\/02\/results___32_1.png\" alt=\"\" class=\"wp-image-22781\" \/><\/figure><p>As we can see in the graph, there are some outliers, so we need to remove them to improve our model&#8217;s performance.<\/p><pre class=\"wp-block-preformatted\">import numpy as np\ndef remove_out(df):\n    # Within each build year, keep only the rows whose price per square foot lies within one standard deviation of that year's mean\n    remove_data=pd.DataFrame()\n    for year,year_df in df.groupby(\"YearBuilt\"):\n        m=np.mean(year_df.price_per_sqt)\n        s=np.std(year_df.price_per_sqt)\n        out=year_df[(year_df.price_per_sqt&gt;(m-s)) &amp; (year_df.price_per_sqt&lt;(m+s))]\n        remove_data=pd.concat([remove_data,out],ignore_index=True)\n    return remove_data\ndf4=remove_out(df3)\ndf4\n<\/pre>","protected":false},"excerpt":{"rendered":"I am an absolute beginner in machine learning, so I have tried to keep this article simple enough that people who are just starting out can follow what I am doing. Because I have only just started, this project applies the concepts I have learned so far: regression models, data cleaning, and feature engineering. +Important libraries that will be used import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns +Load data from a CSV file into a DataFrame df=pd.read_csv(\"..\/input\/house-prices-advanced-regression-techniques\/train.csv\") Since pandas is imported as pd, calling pd.read_csv(filepath) reads the CSV file into a DataFrame. +Exploring [&hellip;]","protected":false},"author":718,"featured_media":27897,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-22744","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"jetpack_featured_media_url":"https:\/\/www.waca.or.jp\/en\/wp-content\/uploads\/2021\/02\/Screenshot-from-2021-03-22-06-54-47.png","_links":{"self":[{"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/posts\/22744"}],"collection":[{"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/users\/718"}],"replies":[{"embeddable":true,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/comments?post=22744"}],"version-history":[{"count":1,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp
\/v2\/posts\/22744\/revisions"}],"predecessor-version":[{"id":22782,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/posts\/22744\/revisions\/22782"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/media\/27897"}],"wp:attachment":[{"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/media?parent=22744"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/categories?post=22744"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.waca.or.jp\/en\/wp-json\/wp\/v2\/tags?post=22744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}