{"cells":[{"metadata":{},"cell_type":"markdown","source":["# Predicting Loan Risk using SparkML on IBM Cloud Pak for Data (ICP4D)"]},{"metadata":{},"cell_type":"markdown","source":["We'll use this notebook to create a machine learning model to predict credit risk. In this notebook we will build the prediction model using the SparkML library.\n","\n","This notebook walks you through these steps:\n","\n","- Load and Visualize data set.\n","- Build a predictive model with SparkML API\n","- Save the model in the ML repository"]},{"metadata":{},"cell_type":"markdown","source":["## 1.0 Install required packages\n","\n","There are a couple of Python packages we will use in this notebook. First we make sure the Watson Machine Learning client v3 is removed (its not installed by default) and then install/upgrade the v4 version of the client (this package is installed by default on CP4D).\n","\n","WML Client: https://wml-api-pyclient-dev-v4.mybluemix.net/#repository"]},{"metadata":{"collapsed":true},"cell_type":"markdown","source":["### 1.1 Package Installation"]},{"metadata":{},"cell_type":"code","source":["import warnings\n","warnings.filterwarnings('ignore')"],"execution_count":1,"outputs":[]},{"metadata":{"scrolled":false},"cell_type":"code","source":["!pip uninstall watson-machine-learning-client -y | tail -n 1\n","!pip install --user watson-machine-learning-client-v4==1.0.95 --upgrade | tail -n 1\n","!pip install --user pyspark==2.3.3 --upgrade | tail -n 1"],"execution_count":2,"outputs":[{"output_type":"stream","text":"\u001b[33mWARNING: Skipping watson-machine-learning-client as it is not installed.\u001b[0m\nRequirement already satisfied, skipping upgrade: docutils<0.16,>=0.10 in /opt/conda/envs/Python-3.6-WMLCE/lib/python3.6/site-packages (from ibm-cos-sdk-core==2.6.0->ibm-cos-sdk==2.6.0->watson-machine-learning-client-v4==1.0.95) (0.14)\nRequirement already satisfied, skipping upgrade: py4j==0.10.7 in /home/wsuser/.local/lib/python3.6/site-packages (from pyspark==2.3.3) (0.10.7)\n","name":"stdout"}]},{"metadata":{},"cell_type":"markdown","source":["#### Action: restart the kernel!"]},{"metadata":{},"cell_type":"markdown","source":["### 1.2 Package Imports"]},{"metadata":{},"cell_type":"code","source":["import pandas as pd\n","import numpy as np\n","import json\n","import os\n","\n","# Import the Project Library to read/write project assets\n","from project_lib import Project\n","project = Project.access()"],"execution_count":3,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":["## 2.0 Load and Clean data\n","\n","We'll load our data as a pandas data frame.\n","\n","**<< FOLLOW THE INSTRUCTIONS BELOW TO LOAD THE DATASET >>**\n","\n","* Highlight the cell below by clicking it.\n","* Click the `10/01` \"Find data\" icon in the upper right of the notebook.\n","* Add the locally uploaded file `german_credit_data.csv` by choosing the `Files` tab. Then choose the `german_credit_data.csv`. Click `Insert to code` and choose `Insert Pandas DataFrame`.\n","* The code to bring the data into the notebook environment and create a Pandas DataFrame will be added to the cell below.\n","* Run the cell\n"]},{"metadata":{},"cell_type":"code","source":["# Place cursor below and insert the Pandas DataFrame for the Credit Risk data\n","\n"],"execution_count":4,"outputs":[{"output_type":"execute_result","execution_count":4,"data":{"text/plain":" CUSTOMERID CHECKINGSTATUS LOANDURATION \\\n0 713a336c-a255-4e2d-9d57-90b3e99e2f06 0_to_200 31 \n1 140b363f-a3fe-4828-a33f-7284dfdb3969 less_0 18 \n2 43b7b51d-5eda-4860-b461-ebef3d3436f4 less_0 15 \n3 f40eaf08-e6d1-4765-ab20-c5f7faca1635 0_to_200 28 \n4 1728910a-d3ff-4799-ac50-203a3a58a3fb no_checking 28 \n\n CREDITHISTORY LOANPURPOSE LOANAMOUNT EXISTINGSAVINGS \\\n0 credits_paid_to_date other 1889 100_to_500 \n1 credits_paid_to_date car_new 462 less_100 \n2 prior_payments_delayed furniture 250 less_100 \n3 credits_paid_to_date retraining 3693 less_100 \n4 prior_payments_delayed education 6235 500_to_1000 \n\n EMPLOYMENTDURATION INSTALLMENTPERCENT SEX ... OWNSPROPERTY AGE \\\n0 less_1 3 female ... savings_insurance 32 \n1 1_to_4 2 female ... savings_insurance 37 \n2 1_to_4 2 male ... real_estate 28 \n3 greater_7 3 male ... savings_insurance 32 \n4 greater_7 3 male ... unknown 57 \n\n INSTALLMENTPLANS HOUSING EXISTINGCREDITSCOUNT JOB DEPENDENTS \\\n0 none own 1 skilled 1 \n1 stores own 2 skilled 1 \n2 none own 2 skilled 1 \n3 none own 1 skilled 1 \n4 none own 2 skilled 1 \n\n TELEPHONE FOREIGNWORKER RISK \n0 none yes No Risk \n1 none yes No Risk \n2 yes no No Risk \n3 none yes No Risk \n4 none yes Risk \n\n[5 rows x 22 columns]","text/html":"
\n | CUSTOMERID | \nCHECKINGSTATUS | \nLOANDURATION | \nCREDITHISTORY | \nLOANPURPOSE | \nLOANAMOUNT | \nEXISTINGSAVINGS | \nEMPLOYMENTDURATION | \nINSTALLMENTPERCENT | \nSEX | \n... | \nOWNSPROPERTY | \nAGE | \nINSTALLMENTPLANS | \nHOUSING | \nEXISTINGCREDITSCOUNT | \nJOB | \nDEPENDENTS | \nTELEPHONE | \nFOREIGNWORKER | \nRISK | \n
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n713a336c-a255-4e2d-9d57-90b3e99e2f06 | \n0_to_200 | \n31 | \ncredits_paid_to_date | \nother | \n1889 | \n100_to_500 | \nless_1 | \n3 | \nfemale | \n... | \nsavings_insurance | \n32 | \nnone | \nown | \n1 | \nskilled | \n1 | \nnone | \nyes | \nNo Risk | \n
1 | \n140b363f-a3fe-4828-a33f-7284dfdb3969 | \nless_0 | \n18 | \ncredits_paid_to_date | \ncar_new | \n462 | \nless_100 | \n1_to_4 | \n2 | \nfemale | \n... | \nsavings_insurance | \n37 | \nstores | \nown | \n2 | \nskilled | \n1 | \nnone | \nyes | \nNo Risk | \n
2 | \n43b7b51d-5eda-4860-b461-ebef3d3436f4 | \nless_0 | \n15 | \nprior_payments_delayed | \nfurniture | \n250 | \nless_100 | \n1_to_4 | \n2 | \nmale | \n... | \nreal_estate | \n28 | \nnone | \nown | \n2 | \nskilled | \n1 | \nyes | \nno | \nNo Risk | \n
3 | \nf40eaf08-e6d1-4765-ab20-c5f7faca1635 | \n0_to_200 | \n28 | \ncredits_paid_to_date | \nretraining | \n3693 | \nless_100 | \ngreater_7 | \n3 | \nmale | \n... | \nsavings_insurance | \n32 | \nnone | \nown | \n1 | \nskilled | \n1 | \nnone | \nyes | \nNo Risk | \n
4 | \n1728910a-d3ff-4799-ac50-203a3a58a3fb | \nno_checking | \n28 | \nprior_payments_delayed | \neducation | \n6235 | \n500_to_1000 | \ngreater_7 | \n3 | \nmale | \n... | \nunknown | \n57 | \nnone | \nown | \n2 | \nskilled | \n1 | \nnone | \nyes | \nRisk | \n
5 rows × 22 columns
\n\n | CHECKINGSTATUS | \nLOANDURATION | \nCREDITHISTORY | \nLOANPURPOSE | \nLOANAMOUNT | \nEXISTINGSAVINGS | \nEMPLOYMENTDURATION | \nINSTALLMENTPERCENT | \nSEX | \nOTHERSONLOAN | \n... | \nOWNSPROPERTY | \nAGE | \nINSTALLMENTPLANS | \nHOUSING | \nEXISTINGCREDITSCOUNT | \nJOB | \nDEPENDENTS | \nTELEPHONE | \nFOREIGNWORKER | \nRISK | \n
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n0_to_200 | \n31 | \ncredits_paid_to_date | \nother | \n1889 | \n100_to_500 | \nless_1 | \n3 | \nfemale | \nnone | \n... | \nsavings_insurance | \n32 | \nnone | \nown | \n1 | \nskilled | \n1 | \nnone | \nyes | \nNo Risk | \n
1 | \nless_0 | \n18 | \ncredits_paid_to_date | \ncar_new | \n462 | \nless_100 | \n1_to_4 | \n2 | \nfemale | \nnone | \n... | \nsavings_insurance | \n37 | \nstores | \nown | \n2 | \nskilled | \n1 | \nnone | \nyes | \nNo Risk | \n
2 | \nless_0 | \n15 | \nprior_payments_delayed | \nfurniture | \n250 | \nless_100 | \n1_to_4 | \n2 | \nmale | \nnone | \n... | \nreal_estate | \n28 | \nnone | \nown | \n2 | \nskilled | \n1 | \nyes | \nno | \nNo Risk | \n
3 | \n0_to_200 | \n28 | \ncredits_paid_to_date | \nretraining | \n3693 | \nless_100 | \ngreater_7 | \n3 | \nmale | \nnone | \n... | \nsavings_insurance | \n32 | \nnone | \nown | \n1 | \nskilled | \n1 | \nnone | \nyes | \nNo Risk | \n
4 | \nno_checking | \n28 | \nprior_payments_delayed | \neducation | \n6235 | \n500_to_1000 | \ngreater_7 | \n3 | \nmale | \nnone | \n... | \nunknown | \n57 | \nnone | \nown | \n2 | \nskilled | \n1 | \nnone | \nyes | \nRisk | \n
5 rows × 21 columns
\n\n | CHECKINGSTATUS | \nLOANDURATION | \nCREDITHISTORY | \nLOANPURPOSE | \nLOANAMOUNT | \nEXISTINGSAVINGS | \nEMPLOYMENTDURATION | \nINSTALLMENTPERCENT | \nSEX | \nOTHERSONLOAN | \n... | \nOWNSPROPERTY | \nAGE | \nINSTALLMENTPLANS | \nHOUSING | \nEXISTINGCREDITSCOUNT | \nJOB | \nDEPENDENTS | \nTELEPHONE | \nFOREIGNWORKER | \nRISK | \n
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | \n5000 | \n5000.000000 | \n5000 | \n5000 | \n5000.000000 | \n5000 | \n5000 | \n5000.000000 | \n5000 | \n5000 | \n... | \n5000 | \n5000.000000 | \n5000 | \n5000 | \n5000.000000 | \n5000 | \n5000.000000 | \n5000 | \n5000 | \n5000 | \n
unique | \n4 | \nNaN | \n5 | \n11 | \nNaN | \n5 | \n5 | \nNaN | \n2 | \n3 | \n... | \n4 | \nNaN | \n3 | \n3 | \nNaN | \n4 | \nNaN | \n2 | \n2 | \n2 | \n
top | \nno_checking | \nNaN | \nprior_payments_delayed | \ncar_new | \nNaN | \nless_100 | \n1_to_4 | \nNaN | \nmale | \nnone | \n... | \nsavings_insurance | \nNaN | \nnone | \nown | \nNaN | \nskilled | \nNaN | \nnone | \nyes | \nNo Risk | \n
freq | \n1993 | \nNaN | \n1686 | \n945 | \nNaN | \n1856 | \n1470 | \nNaN | \n3104 | \n4173 | \n... | \n1660 | \nNaN | \n3517 | \n3195 | \nNaN | \n3400 | \nNaN | \n2941 | \n4877 | \n3330 | \n
mean | \nNaN | \n21.393000 | \nNaN | \nNaN | \n3480.145000 | \nNaN | \nNaN | \n2.982400 | \nNaN | \nNaN | \n... | \nNaN | \n35.932400 | \nNaN | \nNaN | \n1.465800 | \nNaN | \n1.164600 | \nNaN | \nNaN | \nNaN | \n
std | \nNaN | \n11.162843 | \nNaN | \nNaN | \n2488.232783 | \nNaN | \nNaN | \n1.127096 | \nNaN | \nNaN | \n... | \nNaN | \n10.648536 | \nNaN | \nNaN | \n0.565415 | \nNaN | \n0.370856 | \nNaN | \nNaN | \nNaN | \n
min | \nNaN | \n4.000000 | \nNaN | \nNaN | \n250.000000 | \nNaN | \nNaN | \n1.000000 | \nNaN | \nNaN | \n... | \nNaN | \n19.000000 | \nNaN | \nNaN | \n1.000000 | \nNaN | \n1.000000 | \nNaN | \nNaN | \nNaN | \n
25% | \nNaN | \n13.000000 | \nNaN | \nNaN | \n1326.750000 | \nNaN | \nNaN | \n2.000000 | \nNaN | \nNaN | \n... | \nNaN | \n28.000000 | \nNaN | \nNaN | \n1.000000 | \nNaN | \n1.000000 | \nNaN | \nNaN | \nNaN | \n
50% | \nNaN | \n21.000000 | \nNaN | \nNaN | \n3238.500000 | \nNaN | \nNaN | \n3.000000 | \nNaN | \nNaN | \n... | \nNaN | \n36.000000 | \nNaN | \nNaN | \n1.000000 | \nNaN | \n1.000000 | \nNaN | \nNaN | \nNaN | \n
75% | \nNaN | \n29.000000 | \nNaN | \nNaN | \n5355.000000 | \nNaN | \nNaN | \n4.000000 | \nNaN | \nNaN | \n... | \nNaN | \n44.000000 | \nNaN | \nNaN | \n2.000000 | \nNaN | \n1.000000 | \nNaN | \nNaN | \nNaN | \n
max | \nNaN | \n64.000000 | \nNaN | \nNaN | \n11676.000000 | \nNaN | \nNaN | \n6.000000 | \nNaN | \nNaN | \n... | \nNaN | \n74.000000 | \nNaN | \nNaN | \n4.000000 | \nNaN | \n2.000000 | \nNaN | \nNaN | \nNaN | \n
11 rows × 21 columns
\n