In [110]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Original Data Source : https://data.usaid.gov/HIV-AIDS/Supply-Chain-Shipment-Pricing-Data/a3rc-nmf6

In [111]:
# Reading the Data

pd.set_option('display.max_columns', 40)
data=pd.read_csv(r"C:\Users\Shashank Sundi\Desktop\I neuron\assignments\SCM SHIPMENT PRICING\SCMS_Delivery_History_Dataset.csv")
data.head()
Out[111]:
ID Project Code PQ # PO / SO # ASN/DN # Country Managed By Fulfill Via Vendor INCO Term Shipment Mode PQ First Sent to Client Date PO Sent to Vendor Date Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Product Group Sub Classification Vendor Item Description Molecule/Test Type Brand Dosage Dosage Form Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price Manufacturing Site First Line Designation Weight (Kilograms) Freight Cost (USD) Line Item Insurance (USD)
0 1 100-CI-T01 Pre-PQ Process SCMS-4 ASN-8 Côte d'Ivoire PMO - US Direct Drop EXW Air Pre-PQ Process Date Not Captured 2-Jun-06 2-Jun-06 2-Jun-06 HRDT HIV test RANBAXY Fine Chemicals LTD. HIV, Reveal G3 Rapid HIV-1 Antibody Test, 30 T... HIV, Reveal G3 Rapid HIV-1 Antibody Test Reveal NaN Test kit 30 19 551.0 29.00 0.97 Ranbaxy Fine Chemicals LTD Yes 13 780.34 NaN
1 3 108-VN-T01 Pre-PQ Process SCMS-13 ASN-85 Vietnam PMO - US Direct Drop EXW Air Pre-PQ Process Date Not Captured 14-Nov-06 14-Nov-06 14-Nov-06 ARV Pediatric Aurobindo Pharma Limited Nevirapine 10mg/ml, oral suspension, Bottle, 2... Nevirapine Generic 10mg/ml Oral suspension 240 1000 6200.0 6.20 0.03 Aurobindo Unit III, India Yes 358 4521.5 NaN
2 4 100-CI-T01 Pre-PQ Process SCMS-20 ASN-14 Côte d'Ivoire PMO - US Direct Drop FCA Air Pre-PQ Process Date Not Captured 27-Aug-06 27-Aug-06 27-Aug-06 HRDT HIV test Abbott GmbH & Co. KG HIV 1/2, Determine Complete HIV Kit, 100 Tests HIV 1/2, Determine Complete HIV Kit Determine NaN Test kit 100 500 40000.0 80.00 0.80 ABBVIE GmbH & Co.KG Wiesbaden Yes 171 1653.78 NaN
3 15 108-VN-T01 Pre-PQ Process SCMS-78 ASN-50 Vietnam PMO - US Direct Drop EXW Air Pre-PQ Process Date Not Captured 1-Sep-06 1-Sep-06 1-Sep-06 ARV Adult SUN PHARMACEUTICAL INDUSTRIES LTD (RANBAXY LAB... Lamivudine 150mg, tablets, 60 Tabs Lamivudine Generic 150mg Tablet 60 31920 127360.8 3.99 0.07 Ranbaxy, Paonta Shahib, India Yes 1855 16007.06 NaN
4 16 108-VN-T01 Pre-PQ Process SCMS-81 ASN-55 Vietnam PMO - US Direct Drop EXW Air Pre-PQ Process Date Not Captured 11-Aug-06 11-Aug-06 11-Aug-06 ARV Adult Aurobindo Pharma Limited Stavudine 30mg, capsules, 60 Caps Stavudine Generic 30mg Capsule 60 38000 121600.0 3.20 0.05 Aurobindo Unit III, India Yes 7590 45450.08 NaN

Data Description

1. ID - identity number
2. Project Code - Project code for the shipment
3. PQ # - Price quote (PQ) number
4. PO / SO # - Price Order / Sales Order --- PO is a buyer's request to a seller (supplier) to order the goods/services
5. ASN/DN # - Shipment number: Advanced Shipment Note (ASN) for Direct Drop deliveries, or Delivery Note (DN) for from RDC deliveries
6. Country - Destination country , where shipment is to be delivered
7. Managed By - SCMS managing office: either the Program Management Office (PMO) in the U.S. or the relevant SCMS field office
8. Fulfill Via - Method through which the shipment was fulfilled: via Direct Drop from vendor or from stock available in the RDCs( retail distribution centers )
9. Vendor INCO Term - The vendor INCO term (also known as International Commercial Terms) for Direct Drop deliveries---a set of 11 internationally recognized rules which define the responsibilities of sellers and buyers. Incoterms specifies who is responsible for paying for and managing the shipment, insurance, documentation, customs clearance, and other logistical activities.
10. Shipment Mode - Method by which commodities are shipped
11. PQ First Sent to Client Date - Date the PQ is first sent to the client
12. PO Sent to Vendor Date - Date the PO is first sent to the vendor
13. Scheduled Delivery Date - Date on which delivery to client was recorded in SCMS information systems
14. Product Group - Product group for item, i.e. ARV, HRDT
15. Sub Classification - Identifies relevant product sub classifications, such as whether ARVs are pediatric or adult, whether a malaria product is an artemisinin-based combination therapy (ACT), etc.
16. Vendor - Vendor name
17. Item Description - Product name and formulation from Partnership for Supply Chain Management (PFSCM) Item Master
18. Molecule/Test Type - Active drug(s) or test kit type
19. Brand - Generic or branded name for the item
20. Dosage - Item dosage and unit
21. Dosage Form - Dosage form for the item (tablet, oral solution, injection, etc.).
22. Unit of Measure (Per Pack) - Pack quantity (pills or test kits) used to compute unit price
23. Line Item Quantity - Total quantity (packs) of commodity per line item
24. Line Item Value - Total value of commodity per line item
25. Pack Price - Cost per pack (i.e. month s supply of ARVs, pack of 60 test kits)
26. Unit Price - Cost per pill (for drugs) or per test (for test kits)
27. Manufacturing Site - Identifies manufacturing site for the line item for direct drop and from RDC deliveries
28. First Line Designation - Designates if the line in question shows the aggregated freight costs and weight associated with all items on the ASN DN
29. Weight (Kilograms) - Weight for all lines on an ASN DN
30. Freight Cost (USD) - Freight charges associated with all lines on the respective ASN DN
31. Line Item Insurance (USD) - Line item cost of insurance, created by applying an annual flat rate ( ) to commodity cost
In [112]:
# pack price = unit measure per pack * unit price
# Line item value =  no. of items in line * price of a pack

# Target column -----> Shipment Price( total value of transporting the shipment to destination)=freight cost + line item value + Line item insurance
# So we'll create a new feature for that later on 
In [113]:
# Checking and Validating Data Types

pd.DataFrame(data=data.dtypes.values,index=data.dtypes.index).T

# we can see that columns - Scheduled Delivery Date , 	Delivered to Client Date,	Delivery Recorded Date ,PQ First Sent to Client Date,PO Sent to Vendor Date are dates , but are identified as 'object' type
# We can also see that columns - Weight (Kilograms)	, Freight Cost (USD) are actually numeric values , but are classified as 'object' type
# We need to change the data types of above identified columns
Out[113]:
ID Project Code PQ # PO / SO # ASN/DN # Country Managed By Fulfill Via Vendor INCO Term Shipment Mode PQ First Sent to Client Date PO Sent to Vendor Date Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Product Group Sub Classification Vendor Item Description Molecule/Test Type Brand Dosage Dosage Form Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price Manufacturing Site First Line Designation Weight (Kilograms) Freight Cost (USD) Line Item Insurance (USD)
0 int64 object object object object object object object object object object object object object object object object object object object object object object int64 int64 float64 float64 float64 object object object object float64
In [114]:
for i in ['Weight (Kilograms)', 'Freight Cost (USD)','PQ First Sent to Client Date','PO Sent to Vendor Date']:
    print(data[i].value_counts().head(),"\n\n")
Weight Captured Separately    1507
2                               29
6                               26
1                               23
5                               20
Name: Weight (Kilograms), dtype: int64 


Freight Included in Commodity Cost    1442
Invoiced Separately                    239
9736.1                                  36
6147.18                                 27
See DN-304 (ID#:10589)                  16
Name: Freight Cost (USD), dtype: int64 


Pre-PQ Process       2476
Date Not Captured     205
9/11/14               205
7/11/13               173
4/30/14               123
Name: PQ First Sent to Client Date, dtype: int64 


N/A - From RDC       5404
Date Not Captured     328
8/27/14                80
3/19/10                78
8/29/14                76
Name: PO Sent to Vendor Date, dtype: int64 


In [115]:
# in above cell we observe that some of the weights were weighed separately and hence we don't have info about those weights , so we replace them with nan
# moreover , in freight cost column , the cost was either not read from the ASN/DN , or Invoiced Separately etc.  ; hence it says to read from the tag itself manually ,
#  which is not feasible for every product , so we replace it with nan

# To capture the importance of the fact that it was measured separately ,Weight Captured Separately , freight included in commo..etc. , we'll create a new feature , else we might lose 
# some important info

# Creating new feature features to capture importance of special cases where measurements could either not be read , or were recorded separately

# For Freight Cost
data['Freight_cost_special']=0
for i in range(data.shape[0]):

    if data['Freight Cost (USD)'].loc[i]  in ["Freight Included in Commodity Cost",'Invoiced Separately']:
        data['Freight_cost_special'].loc[i]=data['Freight Cost (USD)'].loc[i]

    elif data['Freight Cost (USD)'].loc[i].split(" ")[0]=="See":
        data['Freight_cost_special'].loc[i]="See ASN/DN Tag"

    else :
        data['Freight_cost_special'].loc[i]="Normal Measurement"


# For Shipment weight
data['Weight_special']=0
for i in range(data.shape[0]):

    if data['Weight (Kilograms)'].loc[i]  in ["Weight Captured Separately"]:
        data['Weight_special'].loc[i]=data['Weight (Kilograms)'].loc[i]

    elif data['Weight (Kilograms)'].loc[i].split(" ")[0]=="See":
        data['Weight_special'].loc[i]="See ASN/DN Tag"

    else :
        data['Weight_special'].loc[i]="Normal Measurement"

# For PQ First Sent to Client Date	

data['PQ_date_sent']=0
for i in range(data.shape[0]):

    if data['PQ First Sent to Client Date'].loc[i]  in ['Pre-PQ Process','Date Not Captured']:
        data['PQ_date_sent'].loc[i]=data['PQ First Sent to Client Date'].loc[i]
    else :
        data['PQ_date_sent'].loc[i]="Date Captured"

# For PO Sent to Vendor Date

data['PO_date_sent']=0
for i in range(data.shape[0]):

    if data['PO Sent to Vendor Date'].loc[i]  in ['N/A - From RDC' ,'Date Not Captured']:
        data['PO_date_sent'].loc[i]=data['PO Sent to Vendor Date'].loc[i]

    else :
        data['PO_date_sent'].loc[i]="Date Captured"
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
In [116]:
# Converting data types to Datetime

# For PQ First Sent to Client Date	& For PO Sent to Vendor Date --- approximately 30-50% data is missing and the dates are not sequential , so we can't 
# impute it in any way . So , after we have captured the importance of missing dates in previous cell . We will drop the two columns----if we had the all dates , then we could have found another feature

for i in ['Scheduled Delivery Date', 'Delivered to Client Date','Delivery Recorded Date']:
    data[i]=pd.to_datetime(data[i], infer_datetime_format=True)

# Converting data types to Numeric

for i in ['Weight (Kilograms)', 'Freight Cost (USD)']:
    data[i]=pd.to_numeric(data[i],errors='coerce')
In [117]:
# Rearranging data
# and excluding the columns - 'ID', 'Project Code' ,'PQ' , 'PO/SO' , 'ASD/DN' , 'PQ First Sent to Client Date', 'For PO Sent to Vendor Date'

data=data[['Country','Vendor','Manufacturing Site', 'Brand', 'Item Description','Product Group', 'Sub Classification',
       'Molecule/Test Type','Dosage Form','Dosage','Managed By', 'Vendor INCO Term','Fulfill Via', 'Shipment Mode',
       'Scheduled Delivery Date', 'Delivered to Client Date','Delivery Recorded Date',  'Unit of Measure (Per Pack)',
        'Line Item Quantity','Line Item Value', 'Pack Price', 'Unit Price','First Line Designation', 'Weight (Kilograms)',
        'Weight_special', 'Freight Cost (USD)', 'Freight_cost_special','Line Item Insurance (USD)']]
data.head()
Out[117]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price First Line Designation Weight (Kilograms) Weight_special Freight Cost (USD) Freight_cost_special Line Item Insurance (USD)
0 Côte d'Ivoire RANBAXY Fine Chemicals LTD. Ranbaxy Fine Chemicals LTD Reveal HIV, Reveal G3 Rapid HIV-1 Antibody Test, 30 T... HRDT HIV test HIV, Reveal G3 Rapid HIV-1 Antibody Test Test kit NaN PMO - US EXW Direct Drop Air 2006-06-02 2006-06-02 2006-06-02 30 19 551.0 29.00 0.97 Yes 13.0 Normal Measurement 780.34 Normal Measurement NaN
1 Vietnam Aurobindo Pharma Limited Aurobindo Unit III, India Generic Nevirapine 10mg/ml, oral suspension, Bottle, 2... ARV Pediatric Nevirapine Oral suspension 10mg/ml PMO - US EXW Direct Drop Air 2006-11-14 2006-11-14 2006-11-14 240 1000 6200.0 6.20 0.03 Yes 358.0 Normal Measurement 4521.50 Normal Measurement NaN
2 Côte d'Ivoire Abbott GmbH & Co. KG ABBVIE GmbH & Co.KG Wiesbaden Determine HIV 1/2, Determine Complete HIV Kit, 100 Tests HRDT HIV test HIV 1/2, Determine Complete HIV Kit Test kit NaN PMO - US FCA Direct Drop Air 2006-08-27 2006-08-27 2006-08-27 100 500 40000.0 80.00 0.80 Yes 171.0 Normal Measurement 1653.78 Normal Measurement NaN
3 Vietnam SUN PHARMACEUTICAL INDUSTRIES LTD (RANBAXY LAB... Ranbaxy, Paonta Shahib, India Generic Lamivudine 150mg, tablets, 60 Tabs ARV Adult Lamivudine Tablet 150mg PMO - US EXW Direct Drop Air 2006-09-01 2006-09-01 2006-09-01 60 31920 127360.8 3.99 0.07 Yes 1855.0 Normal Measurement 16007.06 Normal Measurement NaN
4 Vietnam Aurobindo Pharma Limited Aurobindo Unit III, India Generic Stavudine 30mg, capsules, 60 Caps ARV Adult Stavudine Capsule 30mg PMO - US EXW Direct Drop Air 2006-08-11 2006-08-11 2006-08-11 60 38000 121600.0 3.20 0.05 Yes 7590.0 Normal Measurement 45450.08 Normal Measurement NaN
In [118]:
# Now , all dtypes are correct
pd.DataFrame(data=data.dtypes.values,index=data.dtypes.index).T
Out[118]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price First Line Designation Weight (Kilograms) Weight_special Freight Cost (USD) Freight_cost_special Line Item Insurance (USD)
0 object object object object object object object object object object object object object object datetime64[ns] datetime64[ns] datetime64[ns] int64 int64 float64 float64 float64 object float64 object float64 object float64
In [119]:
# getting index of object type column names
categ_index=data.dtypes[data.dtypes=='object'].index

# getting index of numeric type columns
floats=data.dtypes[data.dtypes=='float64'].index
ints=data.dtypes[data.dtypes=='int64'].index

num_index=floats.append(ints)

# getting index of datetime columns
date_index=data.dtypes[data.dtypes=='datetime64[ns]'].index
In [120]:
# Statistical properties of numerical columns

data[num_index].describe()
Out[120]:
Line Item Value Pack Price Unit Price Weight (Kilograms) Freight Cost (USD) Line Item Insurance (USD) Unit of Measure (Per Pack) Line Item Quantity
count 1.032400e+04 10324.000000 10324.000000 6372.000000 6198.000000 10037.000000 10324.000000 10324.000000
mean 1.576506e+05 21.910241 0.611701 3424.441306 11103.234819 240.117626 77.990895 18332.534870
std 3.452921e+05 45.609223 3.275808 13526.968270 15813.026692 500.190568 76.579764 40035.302961
min 0.000000e+00 0.000000 0.000000 0.000000 0.750000 0.000000 1.000000 1.000000
25% 4.314593e+03 4.120000 0.080000 206.750000 2131.120000 6.510000 30.000000 408.000000
50% 3.047147e+04 9.300000 0.160000 1047.000000 5869.655000 47.040000 60.000000 3000.000000
75% 1.664471e+05 23.592500 0.470000 3334.000000 14406.570000 252.400000 90.000000 17039.750000
max 5.951990e+06 1345.640000 238.650000 857354.000000 289653.200000 7708.440000 1000.000000 619999.000000
In [121]:
# Statistical properties of categorical columns

data[categ_index].describe()
Out[121]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode First Line Designation Weight_special Freight_cost_special
count 10324 10324 10324 10324 10324 10324 10324 10324 10324 8588 10324 10324 10324 9964 10324 10324 10324
unique 43 73 88 48 184 5 6 86 17 54 4 8 2 4 2 3 4
top South Africa SCMS from RDC Aurobindo Unit III, India Generic Efavirenz 600mg, tablets, 30 Tabs ARV Adult Efavirenz Tablet 300mg PMO - US N/A - From RDC From RDC Air Yes Normal Measurement Normal Measurement
freq 1406 5404 3172 7285 755 8550 6595 1125 3532 990 10265 5404 5404 6113 7030 6372 6198

EDA

In [123]:
# Checking Distribution of data in columns

fig,ax=plt.subplots(4,2,figsize=(30,15))

i,j=0,0
for col in num_index:
    sns.kdeplot(data[col],ax=ax[i,j],shade=True,linewidth=5,)
    j+=1
    if j==2:
        j=0
        i+=1
    
In [124]:
# Checking Normality of distribution using Normal Probability Plot

from scipy import stats

fig,ax=plt.subplots(4,2,figsize=(30,15))
plt.subplots_adjust( 
                    wspace=0.7, 
                    hspace=1.1)

a,b=0,0
for col in num_index:
    stats.probplot(data[col],plot=ax[a,b])
    ax[a,b].set_title(col)
    b+=1
    if b==2:
        b=0
        a+=1

# It is observed that most of the features are Right skewed (might have outliers in the right tail)
In [53]:
# Visualizing Correlation b/w features ,using Scatter Plots

sns.pairplot(data[num_index])
Out[53]:
<seaborn.axisgrid.PairGrid at 0x1f96271ca00>
In [125]:
fig,ax=plt.subplots(1,3,figsize=(30,8))

# Top 10 Manufacturing sites by Total Price/Value of Packs produced 
data.groupby(['Manufacturing Site'])['Line Item Value'].sum().nlargest(10).plot(kind='bar',ax=ax[0])
ax[0].set_ylabel('Shipment Value')

# Top 10 Countries by Total Price/Value of Packs produced 
data.groupby(['Country'])['Line Item Value'].sum().nlargest(10).plot(kind='bar',ax=ax[1])
ax[1].set_ylabel('Shipment Value')

# Top 10 Vendors sites by Total Price/Value of Packs produced 
data.groupby(['Vendor'])['Line Item Value'].sum().nlargest(10).plot(kind='bar',ax=ax[2])
ax[2].set_ylabel('Shipment Value')
Out[125]:
Text(0, 0.5, 'Shipment Value')
In [126]:
fig,ax=plt.subplots(1,3,figsize=(25,5))
data.groupby(['Shipment Mode'])['Line Item Value'].sum().nlargest(5).plot(kind='bar',ax=ax[0])
ax[0].set_ylabel('Shipment Value')
data.groupby(['Dosage Form'])['Line Item Value'].sum().nlargest(5).plot(kind='bar',ax=ax[1])
ax[1].set_ylabel('Shipment Value')
data.groupby(['Item Description'])['Line Item Value'].sum().nlargest(5).plot(kind='bar',ax=ax[2])
plt.xticks(rotation=15)
ax[2].set_ylabel('Shipment Value')
Out[126]:
Text(0, 0.5, 'Shipment Value')
In [127]:
fig,ax=plt.subplots(1,3,figsize=(25,5))
plt.subplots_adjust( wspace=0.5)

sns.barplot('Shipment Mode','Freight Cost (USD)',data=data,ax=ax[0])
sns.barplot('Product Group','Freight Cost (USD)',data=data,ax=ax[1])
sns.barplot('Weight_special','Freight Cost (USD)',data=data,ax=ax[2])
plt.xticks(rotation=15)
Out[127]:
(array([0, 1, 2]), <a list of 3 Text major ticklabel objects>)
In [128]:
# Checking for Null values
pd.DataFrame(data=data.isnull().sum().values,index=data.isnull().sum().index).T
Out[128]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price First Line Designation Weight (Kilograms) Weight_special Freight Cost (USD) Freight_cost_special Line Item Insurance (USD)
0 0 0 0 0 0 0 0 0 0 1736 0 0 0 360 0 0 0 0 0 0 0 0 0 3952 0 4126 0 287
In [129]:
# Getting Columns with null values

null_val_cols=(data.isnull().sum()[data.isnull().sum()>0]).index
null_val_cols
Out[129]:
Index(['Dosage', 'Shipment Mode', 'Weight (Kilograms)', 'Freight Cost (USD)',
       'Line Item Insurance (USD)'],
      dtype='object')
In [130]:
# Random Sample Imputation for Categorical Columns

for col in ['Dosage', 'Shipment Mode']:
    rand_samples=data[col].dropna().sample(data[col].isnull().sum())
    rand_samples.index=data[data[col].isnull()].index
    data.loc[data[col].isnull(),col]=rand_samples

# KNN Imputation for numerical columns

from sklearn.impute import KNNImputer

for col in ['Weight (Kilograms)', 'Freight Cost (USD)','Line Item Insurance (USD)'] :
        imputer=KNNImputer(n_neighbors=10)
        data[col]=imputer.fit_transform(data[[col]])
In [131]:
pd.DataFrame(data=data.isnull().sum().values,index=data.isnull().sum().index).T
Out[131]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price First Line Designation Weight (Kilograms) Weight_special Freight Cost (USD) Freight_cost_special Line Item Insurance (USD)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [132]:
data.shape
Out[132]:
(10324, 28)
In [133]:
# Encoding Categorical Columns

# frequency encode - 'Country', 'Vendor', 'Manufacturing Site', 'Brand', 'Item Description','Product Group', 'Sub Classification', 'Molecule/Test Type','Dosage Form',
#                       'Dosage', 'Managed By', 'Vendor INCO Term', 'Shipment Mode', 'Weight_special', 'Freight_cost_special'

# one hot encoding--'Fulfill Via','First Line Designation'
# ordinal encoding --- can be done in inco terms , shipment mode , -- domain knowledge is needed , moreover wedon't know personal preferences of the distributor , which might benefit him
In [134]:
"""for cols in categ_index:
    print(f"\n {data[cols].value_counts().head(10)}")"""

# Observed that in all the categorical columns , there are a lare number of unique values
# We would not like to include all the unique values ; rather , we encode the top 10 categories and the remaining are classified as Other
Out[134]:
'for cols in categ_index:\n    print(f"\n {data[cols].value_counts().head(10)}")'
In [135]:
for col in categ_index:
    for i in range( data.shape[0]):
        if data[col].loc[i] in list(data[col].value_counts().head(10).to_dict().keys()):
            continue
        else :
            data[col].loc[i]="Other"
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
In [136]:
for cols in categ_index:
    print(f"\n {data[cols].value_counts().head(10)}")
 Other            2512
South Africa     1406
Nigeria          1194
Côte d'Ivoire    1083
Uganda            779
Vietnam           688
Zambia            683
Haiti             655
Mozambique        631
Zimbabwe          538
Name: Country, dtype: int64

 SCMS from RDC                                            5404
Other                                                    1141
Orgenics, Ltd                                             754
S. BUYS WHOLESALER                                        715
Aurobindo Pharma Limited                                  668
Trinity Biotech, Plc                                      356
ABBVIE LOGISTICS (FORMERLY ABBOTT LOGISTICS BV)           347
PHARMACY DIRECT                                           326
MYLAN LABORATORIES LTD (FORMERLY MATRIX LABORATORIES)     317
HETERO LABS LIMITED                                       277
Name: Vendor, dtype: int64

 Aurobindo Unit III, India         3172
Other                             2052
Mylan (formerly Matrix) Nashik    1415
Hetero Unit III Hyderabad IN       869
Cipla, Goa, India                  665
Strides, Bangalore, India.         540
Alere Medical Co., Ltd.            481
Trinity Biotech, Plc               405
ABBVIE Ludwigshafen Germany        366
Inverness Japan                    345
Name: Manufacturing Site, dtype: int64

 Generic      7285
Other         978
Determine     799
Uni-Gold      373
Aluvia        250
Kaletra       165
Norvir        136
Stat-Pak      115
Bioline       113
Truvada        94
Name: Brand, dtype: int64

 Other                                                                   5824
Efavirenz 600mg, tablets, 30 Tabs                                        755
Nevirapine 200mg, tablets, 60 Tabs                                       623
Lamivudine/Zidovudine 150/300mg, tablets, 60 Tabs                        597
Lamivudine/Nevirapine/Zidovudine 150/200/300mg, tablets, 60 Tabs         580
HIV 1/2, Determine Complete HIV Kit, 100 Tests                           577
Lamivudine 150mg, tablets, 60 Tabs                                       378
HIV 1/2, Uni-Gold HIV Kit, 20 Tests                                      369
Zidovudine 300mg, tablets, 60 Tabs                                       317
Lamivudine/Tenofovir Disoproxil Fumarate 300/300mg, tablets, 30 Tabs     301
Name: Item Description, dtype: int64

 ARV     8550
HRDT    1728
ANTM      22
ACT       16
MRDT       8
Name: Product Group, dtype: int64

 Adult                   6595
Pediatric               1955
HIV test                1567
HIV test - Ancillary     161
Malaria                   30
ACT                       16
Name: Sub Classification, dtype: int64

 Other                                  4105
Efavirenz                              1125
Nevirapine                              877
Lamivudine/Nevirapine/Zidovudine        707
Lamivudine/Zidovudine                   689
Lopinavir/Ritonavir                     633
Lamivudine                              592
HIV 1/2, Determine Complete HIV Kit     577
Zidovudine                              529
Abacavir                                453
Name: Molecule/Test Type, dtype: int64

 Tablet                               3532
Tablet - FDC                         2749
Test kit                             1575
Capsule                               729
Oral solution                         727
Chewable/dispersible tablet - FDC     239
Oral suspension                       214
Test kit - Ancillary                  161
Chewable/dispersible tablet           146
Delayed-release capsules              131
Name: Dosage Form, dtype: int64

 Other            3625
300mg            1197
200mg            1098
600mg             927
150/300mg         732
150/300/200mg     708
10mg/ml           664
150mg             523
200/50mg          480
300/300mg         361
Name: Dosage, dtype: int64

 PMO - US                     10265
South Africa Field Office       57
Haiti Field Office               1
Ethiopia Field Office            1
Name: Managed By, dtype: int64

 N/A - From RDC    5404
EXW               2778
DDP               1443
FCA                397
CIP                275
DDU                 15
DAP                  9
CIF                  3
Name: Vendor INCO Term, dtype: int64

 From RDC       5404
Direct Drop    4920
Name: Fulfill Via, dtype: int64

 Air            6324
Truck          2941
Air Charter     674
Ocean           385
Name: Shipment Mode, dtype: int64

 Yes    7030
No     3294
Name: First Line Designation, dtype: int64

 Normal Measurement            6372
See ASN/DN Tag                2445
Weight Captured Separately    1507
Name: Weight_special, dtype: int64

 Normal Measurement                    6198
See ASN/DN Tag                        2445
Freight Included in Commodity Cost    1442
Invoiced Separately                    239
Name: Freight_cost_special, dtype: int64
In [137]:
# While Frequency encoding , manually give less frequency for "Other" category as , there are a large no. of insignificant categories which aggregated to a large no.
# We do this so that the insignificant categories don't get unnecessarily higher weightage , while training

# columns to do manual encoding changes- Country , Vendor , Manufacturing site ,Brand , Item description , Molecule/Test type,Dosage

# frequency encode - ,'Product Group', 'Sub Classification','Dosage Form','Managed By', 'Vendor INCO Term', 'Shipment Mode', 'Weight_special', 'Freight_cost_special'

country_map=data.Country.value_counts().to_dict()
country_map["Other"]=min(country_map.values())-75
data.Country=data.Country.map((country_map))

vendor_map=data.Vendor.value_counts().to_dict()
vendor_map["Other"]=2
data.Vendor=data.Vendor.map(vendor_map)

manuf_site_map=data['Manufacturing Site'].value_counts().to_dict()
manuf_site_map["Other"]=2
data['Manufacturing Site']=data['Manufacturing Site'].map(manuf_site_map)

brand_map=data.Brand.value_counts().to_dict()
brand_map["Other"]=5
data.Brand=data.Brand.map(brand_map)

item_map=data['Item Description'].value_counts().to_dict()
item_map["Other"]=1
data['Item Description']=data['Item Description'].map(item_map)

test_type_map=data['Molecule/Test Type'].value_counts().to_dict()
test_type_map["Other"]=5
data['Molecule/Test Type']=data['Molecule/Test Type'].map(test_type_map)

dosage_map=data.Dosage.value_counts().to_dict()
dosage_map["Other"]=2
data.Dosage=data.Dosage.map(dosage_map)

prod_group_map=data['Product Group'].value_counts().to_dict()
data['Product Group']=data['Product Group'].map(prod_group_map)

sub_class_map=data['Sub Classification'].value_counts().to_dict()
data['Sub Classification']=data['Sub Classification'].map(sub_class_map)

dosage_form_map=data['Dosage Form'].value_counts().to_dict()
data['Dosage Form']=data['Dosage Form'].map(dosage_form_map)

managed_by_map=data['Managed By'].value_counts().to_dict()
data['Managed By']=data['Managed By'].map(managed_by_map)

inco_term_map=data['Vendor INCO Term'].value_counts().to_dict()
data['Vendor INCO Term']=data['Vendor INCO Term'].map(inco_term_map)

shipment_mode_map=data['Shipment Mode'].value_counts().to_dict()
data['Shipment Mode']=data['Shipment Mode'].map(shipment_mode_map)

wt_spcl_map=data['Weight_special'].value_counts().to_dict()
data['Weight_special']=data['Weight_special'].map(wt_spcl_map)

freight_spcl_map=data['Freight_cost_special'].value_counts().to_dict()
data['Freight_cost_special']=data['Freight_cost_special'].map(freight_spcl_map)

# one hot encoding--'Fulfill Via','First Line Designation'

data['Fulfill Via']=pd.get_dummies(data['Fulfill Via'],drop_first=True,dtype='float64') #---From RDC-1 ; Direct Drop-0

data['First Line Designation']=pd.get_dummies(data['First Line Designation'],drop_first=True,dtype='float64') #---Yes-1 ; No-0
In [138]:
data
Out[138]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode Scheduled Delivery Date Delivered to Client Date Delivery Recorded Date Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price First Line Designation Weight (Kilograms) Weight_special Freight Cost (USD) Freight_cost_special Line Item Insurance (USD)
0 1083 2 2 5 1 1728 1567 5 1575 1098 10265 2778 0.0 6324 2006-06-02 2006-06-02 2006-06-02 30 19 551.00 29.00 0.97 1.0 13.000000 6372 780.340000 6198 240.117626
1 688 668 3172 7285 1 8550 1955 877 214 664 10265 2778 0.0 6324 2006-11-14 2006-11-14 2006-11-14 240 1000 6200.00 6.20 0.03 1.0 358.000000 6372 4521.500000 6198 240.117626
2 1083 2 2 799 577 1728 1567 577 1575 708 10265 397 0.0 6324 2006-08-27 2006-08-27 2006-08-27 100 500 40000.00 80.00 0.80 1.0 171.000000 6372 1653.780000 6198 240.117626
3 688 2 2 7285 378 8550 6595 592 3532 523 10265 2778 0.0 6324 2006-09-01 2006-09-01 2006-09-01 60 31920 127360.80 3.99 0.07 1.0 1855.000000 6372 16007.060000 6198 240.117626
4 688 668 3172 7285 1 8550 6595 5 729 2 10265 2778 0.0 6324 2006-08-11 2006-08-11 2006-08-11 60 38000 121600.00 3.20 0.05 1.0 7590.000000 6372 45450.080000 6198 240.117626
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10319 538 5404 2 7285 1 8550 1955 707 239 2 10265 5404 1.0 2941 2015-07-31 2015-07-15 2015-07-20 60 166571 599655.60 3.60 0.06 0.0 3424.441306 2445 11103.234819 2445 705.790000
10320 1083 5404 869 7285 597 8550 6595 689 2749 732 10265 5404 1.0 2941 2015-07-31 2015-08-06 2015-08-07 60 21072 137389.44 6.52 0.11 0.0 3424.441306 2445 11103.234819 2445 161.710000
10321 683 5404 2 7285 1 8550 6595 5 2749 2 10265 5404 1.0 2941 2015-08-31 2015-08-25 2015-09-03 30 514526 5140114.74 9.99 0.33 0.0 3424.441306 1507 11103.234819 1442 5284.040000
10322 538 5404 1415 7285 597 8550 6595 689 2749 732 10265 5404 1.0 2941 2015-09-09 2015-08-04 2015-08-11 60 17465 113871.80 6.52 0.11 1.0 1392.000000 6372 11103.234819 1442 134.030000
10323 538 5404 665 7285 1 8550 1955 689 239 2 10265 5404 1.0 2941 2015-08-31 2015-08-04 2015-08-11 60 36639 72911.61 1.99 0.03 0.0 3424.441306 1507 11103.234819 1442 85.820000

10324 rows × 28 columns

In [139]:
# Outlier detection , feature selection/creation , model testing

fig,ax=plt.subplots(4,2,figsize=(25,10),squeeze=False)
plt.subplots_adjust( 
                    wspace=0.7, 
                    hspace=1.1)
i,j=0,0
for col in num_index:
    sns.boxplot(x=data[col],ax=ax[i,j],whis=3,linewidth=3,)

    j+=1
    if j==2:
        j=0
        i+=1

# we should think carefully about what is an outlier, why is it an outlier, and what should you do with the data if you choose to keep the outliers in the data.
# Note that "outliers" are rare. If they are not rare, they are either no outliers or your method/data collection is seriousely flawed 

# It is important to figure out if a certain data value is a "true outlier" or a "influence or leverage point". 
# Is it due to some measurement error or is it maybe due to excessive variability in the data? 
    
In [140]:
# 1e6 - 10^6

# Manually set outlier thresholds above which data is considered as outlier , based on sparsity and rarity

# Line Item Value - 2800000
# Pack Price - 200
# Unit Price - 30
# Weight - 100000
# Freight Cost - 120000
# Line Item Insurance - 3700
# Line Item Quantity - 270000
In [141]:
num_index
Out[141]:
Index(['Line Item Value', 'Pack Price', 'Unit Price', 'Weight (Kilograms)',
       'Freight Cost (USD)', 'Line Item Insurance (USD)',
       'Unit of Measure (Per Pack)', 'Line Item Quantity'],
      dtype='object')
In [142]:
# Unit of Measure(per pack)--talthough we see outliers here , they might not be actually outliers. Since , Packs can be filled with pills , test kits or a combo of both ;
#                               since both occupy spaces varyingly , this leads too excessive variability in data.So ,we dont remove outliers here
                             

# Freight Cost - This can excessively vary depending on the distance and route of shipment for the particular project.However ,we can't call them as outliers as they are
# not rare and contain important information .Hence , we don't completely impute the outliers 
# We impute those points as outliers , after which the data points become very sparse ; so that our model doesn't overfit 

# Similarly for all other columns , they definitely have some important information . Hence , we don't completely impute the outliers 
# We manually impute those points as outliers , after which the data points become very sparse ; so that our model doesn't overfit 



# Outlier Imputation for Line Item Value

b=0
for idx,j in enumerate(data['Line Item Value']):
    if j>2800000:
        b+=1
        data.loc[idx,'Line Item Value']=data['Line Item Value'].mean()
print("\n'Line Item Value'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Pack Price

b=0
for idx,j in enumerate(data['Pack Price']):
    if j>200:
        b+=1
        data.loc[idx,'Pack Price']=data['Pack Price'].mean()
print("\n'Pack Price'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for  Unit Price

b=0
for idx,j in enumerate(data['Unit Price']):
    if j>30:
        b+=1
        data.loc[idx,'Unit Price']=data['Unit Price'].mean()
print("\n'Unit Price'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Weight

b=0
for idx,j in enumerate(data['Weight (Kilograms)']):
    if j>100000:
        b+=1
        data.loc[idx,'Weight (Kilograms)']=data['Weight (Kilograms)'].mean()
print("\n'Weight (Kilograms)'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Freight Cost

b=0
for idx,j in enumerate(data['Freight Cost (USD)']):
    if j>120000:
        b+=1
        data.loc[idx,'Freight Cost (USD)']=data['Freight Cost (USD)'].mean()
print("\n'Freight Cost (USD)'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Line Item Insurance

b=0
for idx,j in enumerate(data['Line Item Insurance (USD)']):
    if j>3700:
        b+=1
        data.loc[idx,'Line Item Insurance (USD)']=data['Line Item Insurance (USD)'].mean()
print("\n'Line Item Insurance (USD)'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Line Item Quantity
b=0
for idx,j in enumerate(data['Line Item Quantity']):
    if j>270000:
        b+=1
        data.loc[idx,'Line Item Quantity']=data['Line Item Quantity'].mean()
print("\n'Line Item Quantity'")
print(f"No. of outliers :{b}") 
'Line Item Value'
No. of outliers :27

'Pack Price'
No. of outliers :71

'Unit Price'
No. of outliers :4

'Weight (Kilograms)'
No. of outliers :5

'Freight Cost (USD)'
No. of outliers :17

'Line Item Insurance (USD)'
No. of outliers :33

'Line Item Quantity'
No. of outliers :37
In [143]:
fig,ax=plt.subplots(4,2,figsize=(25,10),squeeze=False)
plt.subplots_adjust( 
                    wspace=0.7, 
                    hspace=1.1)
i,j=0,0
for col in num_index:
    sns.boxplot(x=data[col],ax=ax[i,j],whis=3,linewidth=3,)

    j+=1
    if j==2:
        j=0
        i+=1
In [144]:
# Feature Creation & Target Column Creation

# New Features - important KPI's in logistics )

# --------- Delay in Delivery
data["dlvry_delay"]=data['Delivered to Client Date']-data['Scheduled Delivery Date'] 
# +ve days denotes Late delivery

# --------- Delay in Delivery_verification_time
data["dlvry_verif_time_delay"]=data['Delivery Recorded Date']-data['Delivered to Client Date']  

# time taken to place order , after Price Quote ws received by Client
#data['PQ First Sent to Client Date']-data['PO Sent to Vendor Date']

# value of goods = ( Line Item Value + Line Item insurance )/Line Item Quantity    -------- assigns higher value to sales driving shipments,rather than volume driving shipments
data["item_value"]=(data['Line Item Value']+data['Line Item Insurance (USD)'])/data['Line Item Quantity']

# Target column
data["Shipment_Price"]=data['Freight Cost (USD)']+data['Line Item Value']+data['Line Item Insurance (USD)']
In [158]:
# Converting nanoseconds to days
data["dlvry_delay"]=pd.to_numeric(data.dlvry_delay)/(60*60*24*(10**9))
data["dlvry_verif_time_delay"]=pd.to_numeric(data.dlvry_verif_time_delay)/(60*60*24*(10**9))
In [160]:
data.columns
Out[160]:
Index(['Country', 'Vendor', 'Manufacturing Site', 'Brand', 'Item Description',
       'Product Group', 'Sub Classification', 'Molecule/Test Type',
       'Dosage Form', 'Dosage', 'Managed By', 'Vendor INCO Term',
       'Fulfill Via', 'Shipment Mode', 'Scheduled Delivery Date',
       'Delivered to Client Date', 'Delivery Recorded Date',
       'Unit of Measure (Per Pack)', 'Line Item Quantity', 'Line Item Value',
       'Pack Price', 'Unit Price', 'First Line Designation',
       'Weight (Kilograms)', 'Weight_special', 'Freight Cost (USD)',
       'Freight_cost_special', 'Line Item Insurance (USD)', 'dlvry_delay',
       'dlvry_verif_time_delay', 'item_value', 'Shipment_Price'],
      dtype='object')
In [162]:
# Rearranging data and deleting date columns

data=data[['Country', 'Vendor', 'Manufacturing Site', 'Brand', 'Item Description',
       'Product Group', 'Sub Classification', 'Molecule/Test Type',
       'Dosage Form', 'Dosage', 'Managed By', 'Vendor INCO Term',
       'Fulfill Via', 'Shipment Mode', 'dlvry_delay','dlvry_verif_time_delay',
       'Unit of Measure (Per Pack)', 'Line Item Quantity', 'Line Item Value',
       'Pack Price', 'Unit Price', 'First Line Designation',
       'Weight (Kilograms)', 'Weight_special', 'Freight Cost (USD)',
       'Freight_cost_special', 'Line Item Insurance (USD)', 'item_value', 'Shipment_Price']]

data.head()
Out[162]:
Country Vendor Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Vendor INCO Term Fulfill Via Shipment Mode dlvry_delay dlvry_verif_time_delay Unit of Measure (Per Pack) Line Item Quantity Line Item Value Pack Price Unit Price First Line Designation Weight (Kilograms) Weight_special Freight Cost (USD) Freight_cost_special Line Item Insurance (USD) item_value Shipment_Price
0 1083 2 2 5 1 1728 1567 5 1575 1098 10265 2778 0.0 6324 0.0 0.0 30 19.0 551.0 29.00 0.97 1.0 13.0 6372 780.34 6198 240.117626 41.637770 1571.457626
1 688 668 3172 7285 1 8550 1955 877 214 664 10265 2778 0.0 6324 0.0 0.0 240 1000.0 6200.0 6.20 0.03 1.0 358.0 6372 4521.50 6198 240.117626 6.440118 10961.617626
2 1083 2 2 799 577 1728 1567 577 1575 708 10265 397 0.0 6324 0.0 0.0 100 500.0 40000.0 80.00 0.80 1.0 171.0 6372 1653.78 6198 240.117626 80.480235 41893.897626
3 688 2 2 7285 378 8550 6595 592 3532 523 10265 2778 0.0 6324 0.0 0.0 60 31920.0 127360.8 3.99 0.07 1.0 1855.0 6372 16007.06 6198 240.117626 3.997522 143607.977626
4 688 668 3172 7285 1 8550 6595 5 729 2 10265 2778 0.0 6324 0.0 0.0 60 38000.0 121600.0 3.20 0.05 1.0 7590.0 6372 45450.08 6198 240.117626 3.206319 167290.197626
In [165]:
# Feature Removal

plt.figure(figsize=(10,10))
sns.heatmap(data.drop(['Shipment_Price'],axis=1).corr())
Out[165]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f960ca0c10>
In [174]:
# Finding Higly Correlated Columns

def correlation(data,threshold):
    col_corr=set()
    cor=data.corr()
    for i in range(len(cor.columns)):
        for j in range(len(cor.columns)):
            if (abs(cor.iloc[i,j]) > threshold) and i!=j:
                if (cor.columns[j] in col_corr) or (cor.columns[i] in col_corr):
                    continue
                print("\n",cor.columns[i],"-----",cor.columns[j])
                print(abs(cor.iloc[i,j]))
                colname=cor.columns[i]
                col_corr.add(colname)
    return col_corr

a=correlation(data.drop(['Shipment_Price'],axis=1),0.9)
print('\n',a)
 Vendor ----- Vendor INCO Term
0.9367767103388067

 Vendor INCO Term ----- Fulfill Via
0.9367997828512018

 Line Item Value ----- Line Item Insurance (USD)
0.9120221547463065

 Weight_special ----- Freight_cost_special
0.9415039738045063

 {'Vendor INCO Term', 'Vendor', 'Line Item Value', 'Weight_special'}
In [176]:
# Removing the highly correlated columns - {'Vendor INCO Term', 'Vendor', 'Line Item Value', 'Weight_special'}

data=data.drop(['Vendor INCO Term', 'Vendor', 'Line Item Value', 'Weight_special'],axis=1)
data.head()
Out[176]:
Country Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Fulfill Via Shipment Mode dlvry_delay dlvry_verif_time_delay Unit of Measure (Per Pack) Line Item Quantity Pack Price Unit Price First Line Designation Weight (Kilograms) Freight Cost (USD) Freight_cost_special Line Item Insurance (USD) item_value Shipment_Price
0 1083 2 5 1 1728 1567 5 1575 1098 10265 0.0 6324 0.0 0.0 30 19.0 29.00 0.97 1.0 13.0 780.34 6198 240.117626 41.637770 1571.457626
1 688 3172 7285 1 8550 1955 877 214 664 10265 0.0 6324 0.0 0.0 240 1000.0 6.20 0.03 1.0 358.0 4521.50 6198 240.117626 6.440118 10961.617626
2 1083 2 799 577 1728 1567 577 1575 708 10265 0.0 6324 0.0 0.0 100 500.0 80.00 0.80 1.0 171.0 1653.78 6198 240.117626 80.480235 41893.897626
3 688 2 7285 378 8550 6595 592 3532 523 10265 0.0 6324 0.0 0.0 60 31920.0 3.99 0.07 1.0 1855.0 16007.06 6198 240.117626 3.997522 143607.977626
4 688 3172 7285 1 8550 6595 5 729 2 10265 0.0 6324 0.0 0.0 60 38000.0 3.20 0.05 1.0 7590.0 45450.08 6198 240.117626 3.206319 167290.197626
In [180]:
# Clustering data to fit separate models on the different clusters , to check if  model generalises better after clustering , or as a whole
# Finding Optimal No. of Clusters

from sklearn.cluster import KMeans

inertia=[]
for i in range(1,10):
    kmeans=KMeans(n_clusters=i)
    kmeans.fit(data)
    inertia.append(kmeans.inertia_)

elbow=pd.DataFrame({'Cluster':range(1,10),'Inertia':inertia})
plt.plot(elbow.Cluster,elbow.Inertia)
Out[180]:
[<matplotlib.lines.Line2D at 0x1f968aeeac0>]
In [181]:
# From the elbow curve above , it seems reasonable to choose 3 clusters

kmeans=KMeans(n_clusters=3)
kmeans.fit(data)
data['Cluster']=kmeans.predict(data)
data.head()
Out[181]:
Country Manufacturing Site Brand Item Description Product Group Sub Classification Molecule/Test Type Dosage Form Dosage Managed By Fulfill Via Shipment Mode dlvry_delay dlvry_verif_time_delay Unit of Measure (Per Pack) Line Item Quantity Pack Price Unit Price First Line Designation Weight (Kilograms) Freight Cost (USD) Freight_cost_special Line Item Insurance (USD) item_value Shipment_Price Cluster
0 1083 2 5 1 1728 1567 5 1575 1098 10265 0.0 6324 0.0 0.0 30 19.0 29.00 0.97 1.0 13.0 780.34 6198 240.117626 41.637770 1571.457626 1
1 688 3172 7285 1 8550 1955 877 214 664 10265 0.0 6324 0.0 0.0 240 1000.0 6.20 0.03 1.0 358.0 4521.50 6198 240.117626 6.440118 10961.617626 1
2 1083 2 799 577 1728 1567 577 1575 708 10265 0.0 6324 0.0 0.0 100 500.0 80.00 0.80 1.0 171.0 1653.78 6198 240.117626 80.480235 41893.897626 1
3 688 2 7285 378 8550 6595 592 3532 523 10265 0.0 6324 0.0 0.0 60 31920.0 3.99 0.07 1.0 1855.0 16007.06 6198 240.117626 3.997522 143607.977626 1
4 688 3172 7285 1 8550 6595 5 729 2 10265 0.0 6324 0.0 0.0 60 38000.0 3.20 0.05 1.0 7590.0 45450.08 6198 240.117626 3.206319 167290.197626 1
In [182]:
data.Cluster.value_counts()
Out[182]:
1    8791
0    1288
2     245
Name: Cluster, dtype: int64
In [186]:
# Separating the 3 data clusters

data_c1=data[data.Cluster==0].drop(['Cluster'],axis=1).copy(deep=True)
data_c2=data[data.Cluster==1].drop(['Cluster'],axis=1).copy(deep=True)
data_c3=data[data.Cluster==2].drop(['Cluster'],axis=1).copy(deep=True)
In [209]:
# Function to calculate ajusted r2 score

def adjusted_r2(r2,data):
    score=1-(1-r2)*(data.shape[0]-1)/(data.shape[0]-data.shape[1]-1)
    return score
In [210]:
# Regression models to try
# Linear regression , lasso / regularised ,SVM regressor, KNN , rand forest regressor , XGB regressor
In [233]:
#Trying LInear Regression

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score , RepeatedStratifiedKFold           
from sklearn.linear_model import LinearRegression

scores_linreg=[]

for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)
            
            linreg=LinearRegression()
            linreg.fit(X_train,y_train)
            r2=linreg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_linreg.append((adj_r2))

print(f"Adj r2 score for Entire Data : {scores_linreg[0]}")
print(f"Adj r2 score for Data Cluster 1 : {scores_linreg[1]}")
print(f"Adj r2 score for Data Cluster 2 : {scores_linreg[2]}")
print(f"Adj r2 score for Data Cluster 3 : {scores_linreg[3]}")
Adj r2 score for Entire Data : 0.8014625033426527
Adj r2 score for Data Cluster 1 : 0.6727115882484896
Adj r2 score for Data Cluster 2 : 0.8152797920963445
Adj r2 score for Data Cluster 3 : 0.6310064482748541
In [237]:
#Trying Out Elastic Net Regression

from sklearn.linear_model import ElasticNetCV ,ElasticNet
scores_elasticreg=[]
for i in [data,data_c1,data_c2,data_c3]:

            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

            elasticCV=ElasticNetCV(alphas=None,cv=10)
            elasticCV.fit(X_train,y_train)
            elastic_net_reg=ElasticNet(alpha=elasticCV.alpha_,l1_ratio=0.5)
            elastic_net_reg.fit(X_train,y_train)
            r2=elastic_net_reg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_elasticreg.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_elasticreg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_elasticreg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_elasticreg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_elasticreg[3]}")
r2 score for Entire Data : 0.6620559979120453
r2 score for Data Cluster 1 : 0.23333147581448643
r2 score for Data Cluster 2 : 0.48235291266186575
r2 score for Data Cluster 3 : 0.27089918804632207
In [238]:
#Trying Out Elastic Net Regression

from sklearn.linear_model import RidgeCV ,Ridge
scores_ridgereg=[]
for i in [data,data_c1,data_c2,data_c3]:

            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

            alphas=np.random.uniform(low=0,high=10,size=(50,))
            ridgeCV=RidgeCV(alphas=alphas,cv=10)
            ridgeCV.fit(X_train,y_train)
            ridge_reg=Ridge(alpha=ridgeCV.alpha_)
            ridge_reg.fit(X_train,y_train)
            r2=ridge_reg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_ridgereg.append((adj_r2))

print(f"r2 score for Entire Data : {scores_ridgereg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_ridgereg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_ridgereg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_ridgereg[3]}")
r2 score for Entire Data : 0.906502473170007
r2 score for Data Cluster 1 : 0.6625554703968244
r2 score for Data Cluster 2 : 0.7983882215352294
r2 score for Data Cluster 3 : 0.647439833490822
In [240]:
#Trying Out Support Vector Regression

from sklearn.svm import LinearSVR,SVR
from sklearn.preprocessing import StandardScaler    
scaler_svm=StandardScaler()
scores_SVM_reg=[]
for i in [data,data_c1,data_c2,data_c3]:
            #scaler_svm.fit(i.drop(['Shipment_Price'],axis=1))
            #X=scaler_svm.transform(i.drop(['Shipment_Price'],axis=1))
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

            svm_reg=SVR(kernel="rbf")
            svm_reg.fit(X_train,y_train)
            r2=svm_reg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_SVM_reg.append((adj_r2))

print(f"r2 score for Entire Data : {scores_SVM_reg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_SVM_reg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_SVM_reg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_SVM_reg[3]}")
r2 score for Entire Data : -0.18255493408086565
r2 score for Data Cluster 1 : -0.04523420128227129
r2 score for Data Cluster 2 : -0.23749163121148342
r2 score for Data Cluster 3 : -0.17498867583171096
In [241]:
#Trying Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor , plot_tree
from sklearn.metrics import r2_score

scores_dtree_reg=[]
for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            dtree_reg=DecisionTreeRegressor()
            dtree_reg.fit(X_train,y_train)
            pred=dtree_reg.predict(X_test)
            r2=r2_score(y_test,pred)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_dtree_reg.append((adj_r2))

print(f"r2 score for Entire Data  : {scores_dtree_reg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_dtree_reg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_dtree_reg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_dtree_reg[3]}")
r2 score for Entire Data  : 0.9756605019085524
r2 score for Data Cluster 1 : 0.7726668858094312
r2 score for Data Cluster 2 : 0.9499326972716455
r2 score for Data Cluster 3 : 0.6209724312628135
In [243]:
#Trying Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

scores_rand_forest=[]
for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            rand_forest=RandomForestRegressor(min_samples_leaf=10,n_estimators=30)
            rand_forest.fit(X_train,y_train)
            pred=rand_forest.predict(X_test)
            r2=r2_score(y_test,pred)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_rand_forest.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_rand_forest[0]}")
print(f"r2 score for Data Cluster 1 : {scores_rand_forest[1]}")
print(f"r2 score for Data Cluster 2 : {scores_rand_forest[2]}")
print(f"r2 score for Data Cluster 3 : {scores_rand_forest[3]}")
r2 score for Entire Data : 0.9674783918934399
r2 score for Data Cluster 1 : 0.846370105292314
r2 score for Data Cluster 2 : 0.9695294968973991
r2 score for Data Cluster 3 : 0.5897237971515654
In [245]:
#Trying XGBoost Regressor

from xgboost import XGBRegressor 

scores_xgb=[]
for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            xgb_reg=XGBRegressor(subsample=0.7,max_depth=9,colsample_bytree=0.8,eta=0.1)
            xgb_reg.fit(X_train,y_train)
            pred=xgb_reg.predict(X_test)
            r2=r2_score(y_test,pred)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_xgb.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_xgb[0]}")
print(f"r2 score for Data Cluster 1 : {scores_xgb[1]}")
print(f"r2 score for Data Cluster 2 : {scores_xgb[2]}")
print(f"r2 score for Data Cluster 3 : {scores_xgb[3]}")
r2 score for Entire Data : 0.9902594505950042
r2 score for Data Cluster 1 : 0.9499256667838698
r2 score for Data Cluster 2 : 0.988497404278208
r2 score for Data Cluster 3 : 0.8598913223396523
In [249]:
#Trying KNN Regression

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler    

scaler_knn=StandardScaler()
scores_KNN_reg=[]

for i in [data,data_c1,data_c2,data_c3]:
            scaler_knn.fit(i.drop(['Shipment_Price'],axis=1))
            X=scaler_knn.transform(i.drop(['Shipment_Price'],axis=1))
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            knn=KNeighborsRegressor(n_neighbors=10)
            knn.fit(X_train,y_train)
            r2=knn.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_KNN_reg.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_KNN_reg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_KNN_reg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_KNN_reg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_KNN_reg[3]}")
r2 score for Entire Data : 0.904319729470703
r2 score for Data Cluster 1 : 0.5923392868709882
r2 score for Data Cluster 2 : 0.8542445272682004
r2 score for Data Cluster 3 : 0.24052581272137985
In [251]:
scores=pd.DataFrame([scores_linreg,scores_elasticreg,scores_ridgereg,scores_SVM_reg,scores_dtree_reg,scores_rand_forest,scores_KNN_reg,scores_xgb]
             ,index=['Linreg-r2','Elastic-net r2','Ridge-r2','SVR-r2','Dtree-r2','Rand-Forest-r2','KNN-r2',"XGB-r2"],
             columns=['Entire Data','Cluster 1','Cluster 2','Cluster 3'])
scores
Out[251]:
Entire Data Cluster 1 Cluster 2 Cluster 3
Linreg-r2 0.801463 0.672712 0.815280 0.631006
Elastic-net r2 0.662056 0.233331 0.482353 0.270899
Ridge-r2 0.906502 0.662555 0.798388 0.647440
SVR-r2 -0.182555 -0.045234 -0.237492 -0.174989
Dtree-r2 0.975661 0.772667 0.949933 0.620972
Rand-Forest-r2 0.967478 0.846370 0.969529 0.589724
KNN-r2 0.904320 0.592339 0.854245 0.240526
XGB-r2 0.990259 0.949926 0.988497 0.859891
In [252]:
for i in scores.columns:
    print(f"Best Model for {i} : {scores[i].idxmax()} ")

# From the above results, we can conclude that models generalise worse after clustering ,as compared to the whole dataset.
# Hence we don't need to to cluster the data
Best Model for Entire Data : XGB-r2 
Best Model for Cluster 1 : XGB-r2 
Best Model for Cluster 2 : XGB-r2 
Best Model for Cluster 3 : XGB-r2 
In [253]:
# So, Best model is XGB Regression

X=data.drop(['Shipment_Price'],axis=1)
y=data['Shipment_Price']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

xgb_reg=XGBRegressor(subsample=0.7,max_depth=9,colsample_bytree=0.8,eta=0.1)
xgb_reg.fit(X_train,y_train)
Out[253]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, eta=0.1, gamma=0,
             gpu_id=-1, importance_type='gain', interaction_constraints='',
             learning_rate=0.100000001, max_delta_step=0, max_depth=9,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.7,
             tree_method='exact', validate_parameters=1, verbosity=None)
In [254]:
import pickle
filename="xgboost.pickle"
pickle.dump(xgb_reg,open(filename,'wb'))

model=pickle.load(open(filename,'rb'))
pred=model.predict(X_test)
r2_score(y_test,pred)
Out[254]:
0.9902839836071644
In [ ]: