import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Original Data Source : https://data.usaid.gov/HIV-AIDS/Supply-Chain-Shipment-Pricing-Data/a3rc-nmf6 ¶

This data set provides supply chain health commodity shipment and pricing data. Specifically, the data set identifies Antiretroviral (ARV) and HIV lab shipments to supported countries. In addition, the data set provides the commodity pricing and associated supply chain expenses necessary to move the commodities to countries for use. The dataset has similar fields to the Global Fund's Price, Quality and Reporting (PQR) data. PEPFAR and the Global Fund represent the two largest procurers of HIV health commodities. This dataset, when analyzed in conjunction with the PQR data, provides a more complete picture of global spending on specific health commodities. The data are particularly valuable for understanding ranges and trends in pricing as well as volumes delivered by country. The US Government believes this data will help stakeholders make better, data-driven decisions. Care should be taken to consider contextual factors when using the database. Conclusions related to costs associated with moving specific line items or products to specific countries and lead times by product/country will not be accurate.¶

# Reading the Data

pd.set_option('display.max_columns', 40)
data=pd.read_csv(r"C:\Users\Shashank Sundi\Desktop\I neuron\assignments\SCM SHIPMENT PRICING\SCMS_Delivery_History_Dataset.csv")
data.head()

Data Description¶

1. ID - identity number¶

2. Project Code - Project code for the shipment¶

3. PQ # - Price quote (PQ) number¶

4. PO / SO # - Price Order / Sales Order --- PO is a buyer's request to a seller (supplier) to order the goods/services¶

5. ASN/DN # - Shipment number: Advanced Shipment Note (ASN) for Direct Drop deliveries, or Delivery Note (DN) for from RDC deliveries¶

6. Country - Destination country , where shipment is to be delivered¶

7. Managed By - SCMS managing office: either the Program Management Office (PMO) in the U.S. or the relevant SCMS field office¶

8. Fulfill Via - Method through which the shipment was fulfilled: via Direct Drop from vendor or from stock available in the RDCs( retail distribution centers )¶

9. Vendor INCO Term - The vendor INCO term (also known as International Commercial Terms) for Direct Drop deliveries---a set of 11 internationally recognized rules which define the responsibilities of sellers and buyers. Incoterms specifies who is responsible for paying for and managing the shipment, insurance, documentation, customs clearance, and other logistical activities.¶

10. Shipment Mode - Method by which commodities are shipped¶

11. PQ First Sent to Client Date - Date the PQ is first sent to the client¶

12. PO Sent to Vendor Date - Date the PO is first sent to the vendor¶

13. Scheduled Delivery Date - Date on which delivery to client was recorded in SCMS information systems¶

14. Product Group - Product group for item, i.e. ARV, HRDT¶

15. Sub Classification - Identifies relevant product sub classifications, such as whether ARVs are pediatric or adult, whether a malaria product is an artemisinin-based combination therapy (ACT), etc.¶

16. Vendor - Vendor name¶

17. Item Description - Product name and formulation from Partnership for Supply Chain Management (PFSCM) Item Master¶

18. Molecule/Test Type - Active drug(s) or test kit type¶

19. Brand - Generic or branded name for the item¶

20. Dosage - Item dosage and unit¶

21. Dosage Form - Dosage form for the item (tablet, oral solution, injection, etc.).¶

22. Unit of Measure (Per Pack) - Pack quantity (pills or test kits) used to compute unit price¶

23. Line Item Quantity - Total quantity (packs) of commodity per line item¶

24. Line Item Value - Total value of commodity per line item¶

25. Pack Price - Cost per pack (i.e. month s supply of ARVs, pack of 60 test kits)¶

26. Unit Price - Cost per pill (for drugs) or per test (for test kits)¶

27. Manufacturing Site - Identifies manufacturing site for the line item for direct drop and from RDC deliveries¶

28. First Line Designation - Designates if the line in question shows the aggregated freight costs and weight associated with all items on the ASN DN¶

29. Weight (Kilograms) - Weight for all lines on an ASN DN¶

30. Freight Cost (USD) - Freight charges associated with all lines on the respective ASN DN¶

31. Line Item Insurance (USD) - Line item cost of insurance, created by applying an annual flat rate ( ) to commodity cost¶

# pack price = unit measure per pack * unit price
# Line item value =  no. of items in line * price of a pack

# Target column -----> Shipment Price( total value of transporting the shipment to destination)=freight cost + line item value + Line item insurance
# So we'll create a new feature for that later on

# Checking and Validating Data Types

pd.DataFrame(data=data.dtypes.values,index=data.dtypes.index).T

# we can see that columns - Scheduled Delivery Date , 	Delivered to Client Date,	Delivery Recorded Date ,PQ First Sent to Client Date,PO Sent to Vendor Date are dates , but are identified as 'object' type
# We can also see that columns - Weight (Kilograms)	, Freight Cost (USD) are actually numeric values , but are classified as 'object' type
# We need to change the data types of above identified columns

for i in ['Weight (Kilograms)', 'Freight Cost (USD)','PQ First Sent to Client Date','PO Sent to Vendor Date']:
    print(data[i].value_counts().head(),"\n\n")

Weight Captured Separately    1507
2                               29
6                               26
1                               23
5                               20
Name: Weight (Kilograms), dtype: int64 


Freight Included in Commodity Cost    1442
Invoiced Separately                    239
9736.1                                  36
6147.18                                 27
See DN-304 (ID#:10589)                  16
Name: Freight Cost (USD), dtype: int64 


Pre-PQ Process       2476
Date Not Captured     205
9/11/14               205
7/11/13               173
4/30/14               123
Name: PQ First Sent to Client Date, dtype: int64 


N/A - From RDC       5404
Date Not Captured     328
8/27/14                80
3/19/10                78
8/29/14                76
Name: PO Sent to Vendor Date, dtype: int64

# in above cell we observe that some of the weights were weighed separately and hence we don't have info about those weights , so we replace them with nan
# moreover , in freight cost column , the cost was either not read from the ASN/DN , or Invoiced Separately etc.  ; hence it says to read from the tag itself manually ,
#  which is not feasible for every product , so we replace it with nan

# To capture the importance of the fact that it was measured separately ,Weight Captured Separately , freight included in commo..etc. , we'll create a new feature , else we might lose 
# some important info

# Creating new feature features to capture importance of special cases where measurements could either not be read , or were recorded separately

# For Freight Cost
data['Freight_cost_special']=0
for i in range(data.shape[0]):

    if data['Freight Cost (USD)'].loc[i]  in ["Freight Included in Commodity Cost",'Invoiced Separately']:
        data['Freight_cost_special'].loc[i]=data['Freight Cost (USD)'].loc[i]

    elif data['Freight Cost (USD)'].loc[i].split(" ")[0]=="See":
        data['Freight_cost_special'].loc[i]="See ASN/DN Tag"

    else :
        data['Freight_cost_special'].loc[i]="Normal Measurement"


# For Shipment weight
data['Weight_special']=0
for i in range(data.shape[0]):

    if data['Weight (Kilograms)'].loc[i]  in ["Weight Captured Separately"]:
        data['Weight_special'].loc[i]=data['Weight (Kilograms)'].loc[i]

    elif data['Weight (Kilograms)'].loc[i].split(" ")[0]=="See":
        data['Weight_special'].loc[i]="See ASN/DN Tag"

    else :
        data['Weight_special'].loc[i]="Normal Measurement"

# For PQ First Sent to Client Date	

data['PQ_date_sent']=0
for i in range(data.shape[0]):

    if data['PQ First Sent to Client Date'].loc[i]  in ['Pre-PQ Process','Date Not Captured']:
        data['PQ_date_sent'].loc[i]=data['PQ First Sent to Client Date'].loc[i]
    else :
        data['PQ_date_sent'].loc[i]="Date Captured"

# For PO Sent to Vendor Date

data['PO_date_sent']=0
for i in range(data.shape[0]):

    if data['PO Sent to Vendor Date'].loc[i]  in ['N/A - From RDC' ,'Date Not Captured']:
        data['PO_date_sent'].loc[i]=data['PO Sent to Vendor Date'].loc[i]

    else :
        data['PO_date_sent'].loc[i]="Date Captured"

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)

# Converting data types to Datetime

# For PQ First Sent to Client Date	& For PO Sent to Vendor Date --- approximately 30-50% data is missing and the dates are not sequential , so we can't 
# impute it in any way . So , after we have captured the importance of missing dates in previous cell . We will drop the two columns----if we had the all dates , then we could have found another feature

for i in ['Scheduled Delivery Date', 'Delivered to Client Date','Delivery Recorded Date']:
    data[i]=pd.to_datetime(data[i], infer_datetime_format=True)

# Converting data types to Numeric

for i in ['Weight (Kilograms)', 'Freight Cost (USD)']:
    data[i]=pd.to_numeric(data[i],errors='coerce')

# Rearranging data
# and excluding the columns - 'ID', 'Project Code' ,'PQ' , 'PO/SO' , 'ASD/DN' , 'PQ First Sent to Client Date', 'For PO Sent to Vendor Date'

data=data[['Country','Vendor','Manufacturing Site', 'Brand', 'Item Description','Product Group', 'Sub Classification',
       'Molecule/Test Type','Dosage Form','Dosage','Managed By', 'Vendor INCO Term','Fulfill Via', 'Shipment Mode',
       'Scheduled Delivery Date', 'Delivered to Client Date','Delivery Recorded Date',  'Unit of Measure (Per Pack)',
        'Line Item Quantity','Line Item Value', 'Pack Price', 'Unit Price','First Line Designation', 'Weight (Kilograms)',
        'Weight_special', 'Freight Cost (USD)', 'Freight_cost_special','Line Item Insurance (USD)']]
data.head()

# Now , all dtypes are correct
pd.DataFrame(data=data.dtypes.values,index=data.dtypes.index).T

# getting index of object type column names
categ_index=data.dtypes[data.dtypes=='object'].index

# getting index of numeric type columns
floats=data.dtypes[data.dtypes=='float64'].index
ints=data.dtypes[data.dtypes=='int64'].index

num_index=floats.append(ints)

# getting index of datetime columns
date_index=data.dtypes[data.dtypes=='datetime64[ns]'].index

# Statistical properties of numerical columns

data[num_index].describe()

# Statistical properties of categorical columns

data[categ_index].describe()

EDA¶

# Checking Distribution of data in columns

fig,ax=plt.subplots(4,2,figsize=(30,15))

i,j=0,0
for col in num_index:
    sns.kdeplot(data[col],ax=ax[i,j],shade=True,linewidth=5,)
    j+=1
    if j==2:
        j=0
        i+=1

# Checking Normality of distribution using Normal Probability Plot

from scipy import stats

fig,ax=plt.subplots(4,2,figsize=(30,15))
plt.subplots_adjust( 
                    wspace=0.7, 
                    hspace=1.1)

a,b=0,0
for col in num_index:
    stats.probplot(data[col],plot=ax[a,b])
    ax[a,b].set_title(col)
    b+=1
    if b==2:
        b=0
        a+=1

# It is observed that most of the features are Right skewed (might have outliers in the right tail)

# Visualizing Correlation b/w features ,using Scatter Plots

sns.pairplot(data[num_index])

<seaborn.axisgrid.PairGrid at 0x1f96271ca00>

fig,ax=plt.subplots(1,3,figsize=(30,8))

# Top 10 Manufacturing sites by Total Price/Value of Packs produced 
data.groupby(['Manufacturing Site'])['Line Item Value'].sum().nlargest(10).plot(kind='bar',ax=ax[0])
ax[0].set_ylabel('Shipment Value')

# Top 10 Countries by Total Price/Value of Packs produced 
data.groupby(['Country'])['Line Item Value'].sum().nlargest(10).plot(kind='bar',ax=ax[1])
ax[1].set_ylabel('Shipment Value')

# Top 10 Vendors sites by Total Price/Value of Packs produced 
data.groupby(['Vendor'])['Line Item Value'].sum().nlargest(10).plot(kind='bar',ax=ax[2])
ax[2].set_ylabel('Shipment Value')

Text(0, 0.5, 'Shipment Value')

fig,ax=plt.subplots(1,3,figsize=(25,5))
data.groupby(['Shipment Mode'])['Line Item Value'].sum().nlargest(5).plot(kind='bar',ax=ax[0])
ax[0].set_ylabel('Shipment Value')
data.groupby(['Dosage Form'])['Line Item Value'].sum().nlargest(5).plot(kind='bar',ax=ax[1])
ax[1].set_ylabel('Shipment Value')
data.groupby(['Item Description'])['Line Item Value'].sum().nlargest(5).plot(kind='bar',ax=ax[2])
plt.xticks(rotation=15)
ax[2].set_ylabel('Shipment Value')

Text(0, 0.5, 'Shipment Value')

fig,ax=plt.subplots(1,3,figsize=(25,5))
plt.subplots_adjust( wspace=0.5)

sns.barplot('Shipment Mode','Freight Cost (USD)',data=data,ax=ax[0])
sns.barplot('Product Group','Freight Cost (USD)',data=data,ax=ax[1])
sns.barplot('Weight_special','Freight Cost (USD)',data=data,ax=ax[2])
plt.xticks(rotation=15)

(array([0, 1, 2]), <a list of 3 Text major ticklabel objects>)

# Checking for Null values
pd.DataFrame(data=data.isnull().sum().values,index=data.isnull().sum().index).T

# Getting Columns with null values

null_val_cols=(data.isnull().sum()[data.isnull().sum()>0]).index
null_val_cols

Index(['Dosage', 'Shipment Mode', 'Weight (Kilograms)', 'Freight Cost (USD)',
       'Line Item Insurance (USD)'],
      dtype='object')

# Random Sample Imputation for Categorical Columns

for col in ['Dosage', 'Shipment Mode']:
    rand_samples=data[col].dropna().sample(data[col].isnull().sum())
    rand_samples.index=data[data[col].isnull()].index
    data.loc[data[col].isnull(),col]=rand_samples

# KNN Imputation for numerical columns

from sklearn.impute import KNNImputer

for col in ['Weight (Kilograms)', 'Freight Cost (USD)','Line Item Insurance (USD)'] :
        imputer=KNNImputer(n_neighbors=10)
        data[col]=imputer.fit_transform(data[[col]])

pd.DataFrame(data=data.isnull().sum().values,index=data.isnull().sum().index).T

data.shape

(10324, 28)

# Encoding Categorical Columns

# frequency encode - 'Country', 'Vendor', 'Manufacturing Site', 'Brand', 'Item Description','Product Group', 'Sub Classification', 'Molecule/Test Type','Dosage Form',
#                       'Dosage', 'Managed By', 'Vendor INCO Term', 'Shipment Mode', 'Weight_special', 'Freight_cost_special'

# one hot encoding--'Fulfill Via','First Line Designation'
# ordinal encoding --- can be done in inco terms , shipment mode , -- domain knowledge is needed , moreover wedon't know personal preferences of the distributor , which might benefit him

"""for cols in categ_index:
    print(f"\n {data[cols].value_counts().head(10)}")"""

# Observed that in all the categorical columns , there are a lare number of unique values
# We would not like to include all the unique values ; rather , we encode the top 10 categories and the remaining are classified as Other

'for cols in categ_index:\n    print(f"\n {data[cols].value_counts().head(10)}")'

for col in categ_index:
    for i in range( data.shape[0]):
        if data[col].loc[i] in list(data[col].value_counts().head(10).to_dict().keys()):
            continue
        else :
            data[col].loc[i]="Other"

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py:671: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)

for cols in categ_index:
    print(f"\n {data[cols].value_counts().head(10)}")

 Other            2512
South Africa     1406
Nigeria          1194
Côte d'Ivoire    1083
Uganda            779
Vietnam           688
Zambia            683
Haiti             655
Mozambique        631
Zimbabwe          538
Name: Country, dtype: int64

 SCMS from RDC                                            5404
Other                                                    1141
Orgenics, Ltd                                             754
S. BUYS WHOLESALER                                        715
Aurobindo Pharma Limited                                  668
Trinity Biotech, Plc                                      356
ABBVIE LOGISTICS (FORMERLY ABBOTT LOGISTICS BV)           347
PHARMACY DIRECT                                           326
MYLAN LABORATORIES LTD (FORMERLY MATRIX LABORATORIES)     317
HETERO LABS LIMITED                                       277
Name: Vendor, dtype: int64

 Aurobindo Unit III, India         3172
Other                             2052
Mylan (formerly Matrix) Nashik    1415
Hetero Unit III Hyderabad IN       869
Cipla, Goa, India                  665
Strides, Bangalore, India.         540
Alere Medical Co., Ltd.            481
Trinity Biotech, Plc               405
ABBVIE Ludwigshafen Germany        366
Inverness Japan                    345
Name: Manufacturing Site, dtype: int64

 Generic      7285
Other         978
Determine     799
Uni-Gold      373
Aluvia        250
Kaletra       165
Norvir        136
Stat-Pak      115
Bioline       113
Truvada        94
Name: Brand, dtype: int64

 Other                                                                   5824
Efavirenz 600mg, tablets, 30 Tabs                                        755
Nevirapine 200mg, tablets, 60 Tabs                                       623
Lamivudine/Zidovudine 150/300mg, tablets, 60 Tabs                        597
Lamivudine/Nevirapine/Zidovudine 150/200/300mg, tablets, 60 Tabs         580
HIV 1/2, Determine Complete HIV Kit, 100 Tests                           577
Lamivudine 150mg, tablets, 60 Tabs                                       378
HIV 1/2, Uni-Gold HIV Kit, 20 Tests                                      369
Zidovudine 300mg, tablets, 60 Tabs                                       317
Lamivudine/Tenofovir Disoproxil Fumarate 300/300mg, tablets, 30 Tabs     301
Name: Item Description, dtype: int64

 ARV     8550
HRDT    1728
ANTM      22
ACT       16
MRDT       8
Name: Product Group, dtype: int64

 Adult                   6595
Pediatric               1955
HIV test                1567
HIV test - Ancillary     161
Malaria                   30
ACT                       16
Name: Sub Classification, dtype: int64

 Other                                  4105
Efavirenz                              1125
Nevirapine                              877
Lamivudine/Nevirapine/Zidovudine        707
Lamivudine/Zidovudine                   689
Lopinavir/Ritonavir                     633
Lamivudine                              592
HIV 1/2, Determine Complete HIV Kit     577
Zidovudine                              529
Abacavir                                453
Name: Molecule/Test Type, dtype: int64

 Tablet                               3532
Tablet - FDC                         2749
Test kit                             1575
Capsule                               729
Oral solution                         727
Chewable/dispersible tablet - FDC     239
Oral suspension                       214
Test kit - Ancillary                  161
Chewable/dispersible tablet           146
Delayed-release capsules              131
Name: Dosage Form, dtype: int64

 Other            3625
300mg            1197
200mg            1098
600mg             927
150/300mg         732
150/300/200mg     708
10mg/ml           664
150mg             523
200/50mg          480
300/300mg         361
Name: Dosage, dtype: int64

 PMO - US                     10265
South Africa Field Office       57
Haiti Field Office               1
Ethiopia Field Office            1
Name: Managed By, dtype: int64

 N/A - From RDC    5404
EXW               2778
DDP               1443
FCA                397
CIP                275
DDU                 15
DAP                  9
CIF                  3
Name: Vendor INCO Term, dtype: int64

 From RDC       5404
Direct Drop    4920
Name: Fulfill Via, dtype: int64

 Air            6324
Truck          2941
Air Charter     674
Ocean           385
Name: Shipment Mode, dtype: int64

 Yes    7030
No     3294
Name: First Line Designation, dtype: int64

 Normal Measurement            6372
See ASN/DN Tag                2445
Weight Captured Separately    1507
Name: Weight_special, dtype: int64

 Normal Measurement                    6198
See ASN/DN Tag                        2445
Freight Included in Commodity Cost    1442
Invoiced Separately                    239
Name: Freight_cost_special, dtype: int64

# While Frequency encoding , manually give less frequency for "Other" category as , there are a large no. of insignificant categories which aggregated to a large no.
# We do this so that the insignificant categories don't get unnecessarily higher weightage , while training

# columns to do manual encoding changes- Country , Vendor , Manufacturing site ,Brand , Item description , Molecule/Test type,Dosage

# frequency encode - ,'Product Group', 'Sub Classification','Dosage Form','Managed By', 'Vendor INCO Term', 'Shipment Mode', 'Weight_special', 'Freight_cost_special'

country_map=data.Country.value_counts().to_dict()
country_map["Other"]=min(country_map.values())-75
data.Country=data.Country.map((country_map))

vendor_map=data.Vendor.value_counts().to_dict()
vendor_map["Other"]=2
data.Vendor=data.Vendor.map(vendor_map)

manuf_site_map=data['Manufacturing Site'].value_counts().to_dict()
manuf_site_map["Other"]=2
data['Manufacturing Site']=data['Manufacturing Site'].map(manuf_site_map)

brand_map=data.Brand.value_counts().to_dict()
brand_map["Other"]=5
data.Brand=data.Brand.map(brand_map)

item_map=data['Item Description'].value_counts().to_dict()
item_map["Other"]=1
data['Item Description']=data['Item Description'].map(item_map)

test_type_map=data['Molecule/Test Type'].value_counts().to_dict()
test_type_map["Other"]=5
data['Molecule/Test Type']=data['Molecule/Test Type'].map(test_type_map)

dosage_map=data.Dosage.value_counts().to_dict()
dosage_map["Other"]=2
data.Dosage=data.Dosage.map(dosage_map)

prod_group_map=data['Product Group'].value_counts().to_dict()
data['Product Group']=data['Product Group'].map(prod_group_map)

sub_class_map=data['Sub Classification'].value_counts().to_dict()
data['Sub Classification']=data['Sub Classification'].map(sub_class_map)

dosage_form_map=data['Dosage Form'].value_counts().to_dict()
data['Dosage Form']=data['Dosage Form'].map(dosage_form_map)

managed_by_map=data['Managed By'].value_counts().to_dict()
data['Managed By']=data['Managed By'].map(managed_by_map)

inco_term_map=data['Vendor INCO Term'].value_counts().to_dict()
data['Vendor INCO Term']=data['Vendor INCO Term'].map(inco_term_map)

shipment_mode_map=data['Shipment Mode'].value_counts().to_dict()
data['Shipment Mode']=data['Shipment Mode'].map(shipment_mode_map)

wt_spcl_map=data['Weight_special'].value_counts().to_dict()
data['Weight_special']=data['Weight_special'].map(wt_spcl_map)

freight_spcl_map=data['Freight_cost_special'].value_counts().to_dict()
data['Freight_cost_special']=data['Freight_cost_special'].map(freight_spcl_map)

# one hot encoding--'Fulfill Via','First Line Designation'

data['Fulfill Via']=pd.get_dummies(data['Fulfill Via'],drop_first=True,dtype='float64') #---From RDC-1 ; Direct Drop-0

data['First Line Designation']=pd.get_dummies(data['First Line Designation'],drop_first=True,dtype='float64') #---Yes-1 ; No-0

data

# Outlier detection , feature selection/creation , model testing

fig,ax=plt.subplots(4,2,figsize=(25,10),squeeze=False)
plt.subplots_adjust( 
                    wspace=0.7, 
                    hspace=1.1)
i,j=0,0
for col in num_index:
    sns.boxplot(x=data[col],ax=ax[i,j],whis=3,linewidth=3,)

    j+=1
    if j==2:
        j=0
        i+=1

# we should think carefully about what is an outlier, why is it an outlier, and what should you do with the data if you choose to keep the outliers in the data.
# Note that "outliers" are rare. If they are not rare, they are either no outliers or your method/data collection is seriousely flawed 

# It is important to figure out if a certain data value is a "true outlier" or a "influence or leverage point". 
# Is it due to some measurement error or is it maybe due to excessive variability in the data?

# 1e6 - 10^6

# Manually set outlier thresholds above which data is considered as outlier , based on sparsity and rarity

# Line Item Value - 2800000
# Pack Price - 200
# Unit Price - 30
# Weight - 100000
# Freight Cost - 120000
# Line Item Insurance - 3700
# Line Item Quantity - 270000

num_index

Index(['Line Item Value', 'Pack Price', 'Unit Price', 'Weight (Kilograms)',
       'Freight Cost (USD)', 'Line Item Insurance (USD)',
       'Unit of Measure (Per Pack)', 'Line Item Quantity'],
      dtype='object')

# Unit of Measure(per pack)--talthough we see outliers here , they might not be actually outliers. Since , Packs can be filled with pills , test kits or a combo of both ;
#                               since both occupy spaces varyingly , this leads too excessive variability in data.So ,we dont remove outliers here
                             

# Freight Cost - This can excessively vary depending on the distance and route of shipment for the particular project.However ,we can't call them as outliers as they are
# not rare and contain important information .Hence , we don't completely impute the outliers 
# We impute those points as outliers , after which the data points become very sparse ; so that our model doesn't overfit 

# Similarly for all other columns , they definitely have some important information . Hence , we don't completely impute the outliers 
# We manually impute those points as outliers , after which the data points become very sparse ; so that our model doesn't overfit 



# Outlier Imputation for Line Item Value

b=0
for idx,j in enumerate(data['Line Item Value']):
    if j>2800000:
        b+=1
        data.loc[idx,'Line Item Value']=data['Line Item Value'].mean()
print("\n'Line Item Value'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Pack Price

b=0
for idx,j in enumerate(data['Pack Price']):
    if j>200:
        b+=1
        data.loc[idx,'Pack Price']=data['Pack Price'].mean()
print("\n'Pack Price'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for  Unit Price

b=0
for idx,j in enumerate(data['Unit Price']):
    if j>30:
        b+=1
        data.loc[idx,'Unit Price']=data['Unit Price'].mean()
print("\n'Unit Price'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Weight

b=0
for idx,j in enumerate(data['Weight (Kilograms)']):
    if j>100000:
        b+=1
        data.loc[idx,'Weight (Kilograms)']=data['Weight (Kilograms)'].mean()
print("\n'Weight (Kilograms)'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Freight Cost

b=0
for idx,j in enumerate(data['Freight Cost (USD)']):
    if j>120000:
        b+=1
        data.loc[idx,'Freight Cost (USD)']=data['Freight Cost (USD)'].mean()
print("\n'Freight Cost (USD)'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Line Item Insurance

b=0
for idx,j in enumerate(data['Line Item Insurance (USD)']):
    if j>3700:
        b+=1
        data.loc[idx,'Line Item Insurance (USD)']=data['Line Item Insurance (USD)'].mean()
print("\n'Line Item Insurance (USD)'")
print(f"No. of outliers :{b}") 

# Outlier Imputation for Line Item Quantity
b=0
for idx,j in enumerate(data['Line Item Quantity']):
    if j>270000:
        b+=1
        data.loc[idx,'Line Item Quantity']=data['Line Item Quantity'].mean()
print("\n'Line Item Quantity'")
print(f"No. of outliers :{b}")

'Line Item Value'
No. of outliers :27

'Pack Price'
No. of outliers :71

'Unit Price'
No. of outliers :4

'Weight (Kilograms)'
No. of outliers :5

'Freight Cost (USD)'
No. of outliers :17

'Line Item Insurance (USD)'
No. of outliers :33

'Line Item Quantity'
No. of outliers :37

fig,ax=plt.subplots(4,2,figsize=(25,10),squeeze=False)
plt.subplots_adjust( 
                    wspace=0.7, 
                    hspace=1.1)
i,j=0,0
for col in num_index:
    sns.boxplot(x=data[col],ax=ax[i,j],whis=3,linewidth=3,)

    j+=1
    if j==2:
        j=0
        i+=1

# Feature Creation & Target Column Creation

# New Features - important KPI's in logistics )

# --------- Delay in Delivery
data["dlvry_delay"]=data['Delivered to Client Date']-data['Scheduled Delivery Date'] 
# +ve days denotes Late delivery

# --------- Delay in Delivery_verification_time
data["dlvry_verif_time_delay"]=data['Delivery Recorded Date']-data['Delivered to Client Date']  

# time taken to place order , after Price Quote ws received by Client
#data['PQ First Sent to Client Date']-data['PO Sent to Vendor Date']

# value of goods = ( Line Item Value + Line Item insurance )/Line Item Quantity    -------- assigns higher value to sales driving shipments,rather than volume driving shipments
data["item_value"]=(data['Line Item Value']+data['Line Item Insurance (USD)'])/data['Line Item Quantity']

# Target column
data["Shipment_Price"]=data['Freight Cost (USD)']+data['Line Item Value']+data['Line Item Insurance (USD)']

# Converting nanoseconds to days
data["dlvry_delay"]=pd.to_numeric(data.dlvry_delay)/(60*60*24*(10**9))
data["dlvry_verif_time_delay"]=pd.to_numeric(data.dlvry_verif_time_delay)/(60*60*24*(10**9))

data.columns

Index(['Country', 'Vendor', 'Manufacturing Site', 'Brand', 'Item Description',
       'Product Group', 'Sub Classification', 'Molecule/Test Type',
       'Dosage Form', 'Dosage', 'Managed By', 'Vendor INCO Term',
       'Fulfill Via', 'Shipment Mode', 'Scheduled Delivery Date',
       'Delivered to Client Date', 'Delivery Recorded Date',
       'Unit of Measure (Per Pack)', 'Line Item Quantity', 'Line Item Value',
       'Pack Price', 'Unit Price', 'First Line Designation',
       'Weight (Kilograms)', 'Weight_special', 'Freight Cost (USD)',
       'Freight_cost_special', 'Line Item Insurance (USD)', 'dlvry_delay',
       'dlvry_verif_time_delay', 'item_value', 'Shipment_Price'],
      dtype='object')

# Rearranging data and deleting date columns

data=data[['Country', 'Vendor', 'Manufacturing Site', 'Brand', 'Item Description',
       'Product Group', 'Sub Classification', 'Molecule/Test Type',
       'Dosage Form', 'Dosage', 'Managed By', 'Vendor INCO Term',
       'Fulfill Via', 'Shipment Mode', 'dlvry_delay','dlvry_verif_time_delay',
       'Unit of Measure (Per Pack)', 'Line Item Quantity', 'Line Item Value',
       'Pack Price', 'Unit Price', 'First Line Designation',
       'Weight (Kilograms)', 'Weight_special', 'Freight Cost (USD)',
       'Freight_cost_special', 'Line Item Insurance (USD)', 'item_value', 'Shipment_Price']]

data.head()

# Feature Removal

plt.figure(figsize=(10,10))
sns.heatmap(data.drop(['Shipment_Price'],axis=1).corr())

<matplotlib.axes._subplots.AxesSubplot at 0x1f960ca0c10>

# Finding Higly Correlated Columns

def correlation(data,threshold):
    col_corr=set()
    cor=data.corr()
    for i in range(len(cor.columns)):
        for j in range(len(cor.columns)):
            if (abs(cor.iloc[i,j]) > threshold) and i!=j:
                if (cor.columns[j] in col_corr) or (cor.columns[i] in col_corr):
                    continue
                print("\n",cor.columns[i],"-----",cor.columns[j])
                print(abs(cor.iloc[i,j]))
                colname=cor.columns[i]
                col_corr.add(colname)
    return col_corr

a=correlation(data.drop(['Shipment_Price'],axis=1),0.9)
print('\n',a)

 Vendor ----- Vendor INCO Term
0.9367767103388067

 Vendor INCO Term ----- Fulfill Via
0.9367997828512018

 Line Item Value ----- Line Item Insurance (USD)
0.9120221547463065

 Weight_special ----- Freight_cost_special
0.9415039738045063

 {'Vendor INCO Term', 'Vendor', 'Line Item Value', 'Weight_special'}

# Removing the highly correlated columns - {'Vendor INCO Term', 'Vendor', 'Line Item Value', 'Weight_special'}

data=data.drop(['Vendor INCO Term', 'Vendor', 'Line Item Value', 'Weight_special'],axis=1)
data.head()

# Clustering data to fit separate models on the different clusters , to check if  model generalises better after clustering , or as a whole
# Finding Optimal No. of Clusters

from sklearn.cluster import KMeans

inertia=[]
for i in range(1,10):
    kmeans=KMeans(n_clusters=i)
    kmeans.fit(data)
    inertia.append(kmeans.inertia_)

elbow=pd.DataFrame({'Cluster':range(1,10),'Inertia':inertia})
plt.plot(elbow.Cluster,elbow.Inertia)

[<matplotlib.lines.Line2D at 0x1f968aeeac0>]

# From the elbow curve above , it seems reasonable to choose 3 clusters

kmeans=KMeans(n_clusters=3)
kmeans.fit(data)
data['Cluster']=kmeans.predict(data)
data.head()

data.Cluster.value_counts()

1    8791
0    1288
2     245
Name: Cluster, dtype: int64

# Separating the 3 data clusters

data_c1=data[data.Cluster==0].drop(['Cluster'],axis=1).copy(deep=True)
data_c2=data[data.Cluster==1].drop(['Cluster'],axis=1).copy(deep=True)
data_c3=data[data.Cluster==2].drop(['Cluster'],axis=1).copy(deep=True)

# Function to calculate ajusted r2 score

def adjusted_r2(r2,data):
    score=1-(1-r2)*(data.shape[0]-1)/(data.shape[0]-data.shape[1]-1)
    return score

# Regression models to try
# Linear regression , lasso / regularised ,SVM regressor, KNN , rand forest regressor , XGB regressor

#Trying LInear Regression

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score , RepeatedStratifiedKFold           
from sklearn.linear_model import LinearRegression

scores_linreg=[]

for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)
            
            linreg=LinearRegression()
            linreg.fit(X_train,y_train)
            r2=linreg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_linreg.append((adj_r2))

print(f"Adj r2 score for Entire Data : {scores_linreg[0]}")
print(f"Adj r2 score for Data Cluster 1 : {scores_linreg[1]}")
print(f"Adj r2 score for Data Cluster 2 : {scores_linreg[2]}")
print(f"Adj r2 score for Data Cluster 3 : {scores_linreg[3]}")

Adj r2 score for Entire Data : 0.8014625033426527
Adj r2 score for Data Cluster 1 : 0.6727115882484896
Adj r2 score for Data Cluster 2 : 0.8152797920963445
Adj r2 score for Data Cluster 3 : 0.6310064482748541

#Trying Out Elastic Net Regression

from sklearn.linear_model import ElasticNetCV ,ElasticNet
scores_elasticreg=[]
for i in [data,data_c1,data_c2,data_c3]:

            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

            elasticCV=ElasticNetCV(alphas=None,cv=10)
            elasticCV.fit(X_train,y_train)
            elastic_net_reg=ElasticNet(alpha=elasticCV.alpha_,l1_ratio=0.5)
            elastic_net_reg.fit(X_train,y_train)
            r2=elastic_net_reg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_elasticreg.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_elasticreg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_elasticreg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_elasticreg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_elasticreg[3]}")

r2 score for Entire Data : 0.6620559979120453
r2 score for Data Cluster 1 : 0.23333147581448643
r2 score for Data Cluster 2 : 0.48235291266186575
r2 score for Data Cluster 3 : 0.27089918804632207

#Trying Out Elastic Net Regression

from sklearn.linear_model import RidgeCV ,Ridge
scores_ridgereg=[]
for i in [data,data_c1,data_c2,data_c3]:

            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

            alphas=np.random.uniform(low=0,high=10,size=(50,))
            ridgeCV=RidgeCV(alphas=alphas,cv=10)
            ridgeCV.fit(X_train,y_train)
            ridge_reg=Ridge(alpha=ridgeCV.alpha_)
            ridge_reg.fit(X_train,y_train)
            r2=ridge_reg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_ridgereg.append((adj_r2))

print(f"r2 score for Entire Data : {scores_ridgereg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_ridgereg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_ridgereg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_ridgereg[3]}")

r2 score for Entire Data : 0.906502473170007
r2 score for Data Cluster 1 : 0.6625554703968244
r2 score for Data Cluster 2 : 0.7983882215352294
r2 score for Data Cluster 3 : 0.647439833490822

#Trying Out Support Vector Regression

from sklearn.svm import LinearSVR,SVR
from sklearn.preprocessing import StandardScaler    
scaler_svm=StandardScaler()
scores_SVM_reg=[]
for i in [data,data_c1,data_c2,data_c3]:
            #scaler_svm.fit(i.drop(['Shipment_Price'],axis=1))
            #X=scaler_svm.transform(i.drop(['Shipment_Price'],axis=1))
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

            svm_reg=SVR(kernel="rbf")
            svm_reg.fit(X_train,y_train)
            r2=svm_reg.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_SVM_reg.append((adj_r2))

print(f"r2 score for Entire Data : {scores_SVM_reg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_SVM_reg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_SVM_reg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_SVM_reg[3]}")

r2 score for Entire Data : -0.18255493408086565
r2 score for Data Cluster 1 : -0.04523420128227129
r2 score for Data Cluster 2 : -0.23749163121148342
r2 score for Data Cluster 3 : -0.17498867583171096

#Trying Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor , plot_tree
from sklearn.metrics import r2_score

scores_dtree_reg=[]
for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            dtree_reg=DecisionTreeRegressor()
            dtree_reg.fit(X_train,y_train)
            pred=dtree_reg.predict(X_test)
            r2=r2_score(y_test,pred)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_dtree_reg.append((adj_r2))

print(f"r2 score for Entire Data  : {scores_dtree_reg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_dtree_reg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_dtree_reg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_dtree_reg[3]}")

r2 score for Entire Data  : 0.9756605019085524
r2 score for Data Cluster 1 : 0.7726668858094312
r2 score for Data Cluster 2 : 0.9499326972716455
r2 score for Data Cluster 3 : 0.6209724312628135

#Trying Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

scores_rand_forest=[]
for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            rand_forest=RandomForestRegressor(min_samples_leaf=10,n_estimators=30)
            rand_forest.fit(X_train,y_train)
            pred=rand_forest.predict(X_test)
            r2=r2_score(y_test,pred)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_rand_forest.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_rand_forest[0]}")
print(f"r2 score for Data Cluster 1 : {scores_rand_forest[1]}")
print(f"r2 score for Data Cluster 2 : {scores_rand_forest[2]}")
print(f"r2 score for Data Cluster 3 : {scores_rand_forest[3]}")

r2 score for Entire Data : 0.9674783918934399
r2 score for Data Cluster 1 : 0.846370105292314
r2 score for Data Cluster 2 : 0.9695294968973991
r2 score for Data Cluster 3 : 0.5897237971515654

#Trying XGBoost Regressor

from xgboost import XGBRegressor 

scores_xgb=[]
for i in [data,data_c1,data_c2,data_c3]:
            X=i.drop(['Shipment_Price'],axis=1)
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            xgb_reg=XGBRegressor(subsample=0.7,max_depth=9,colsample_bytree=0.8,eta=0.1)
            xgb_reg.fit(X_train,y_train)
            pred=xgb_reg.predict(X_test)
            r2=r2_score(y_test,pred)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_xgb.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_xgb[0]}")
print(f"r2 score for Data Cluster 1 : {scores_xgb[1]}")
print(f"r2 score for Data Cluster 2 : {scores_xgb[2]}")
print(f"r2 score for Data Cluster 3 : {scores_xgb[3]}")

r2 score for Entire Data : 0.9902594505950042
r2 score for Data Cluster 1 : 0.9499256667838698
r2 score for Data Cluster 2 : 0.988497404278208
r2 score for Data Cluster 3 : 0.8598913223396523

#Trying KNN Regression

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler    

scaler_knn=StandardScaler()
scores_KNN_reg=[]

for i in [data,data_c1,data_c2,data_c3]:
            scaler_knn.fit(i.drop(['Shipment_Price'],axis=1))
            X=scaler_knn.transform(i.drop(['Shipment_Price'],axis=1))
            y=i['Shipment_Price']
            X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)
            knn=KNeighborsRegressor(n_neighbors=10)
            knn.fit(X_train,y_train)
            r2=knn.score(X_test,y_test)
            adj_r2=adjusted_r2(r2=r2,data=i)
            scores_KNN_reg.append((adj_r2))
            
print(f"r2 score for Entire Data : {scores_KNN_reg[0]}")
print(f"r2 score for Data Cluster 1 : {scores_KNN_reg[1]}")
print(f"r2 score for Data Cluster 2 : {scores_KNN_reg[2]}")
print(f"r2 score for Data Cluster 3 : {scores_KNN_reg[3]}")

r2 score for Entire Data : 0.904319729470703
r2 score for Data Cluster 1 : 0.5923392868709882
r2 score for Data Cluster 2 : 0.8542445272682004
r2 score for Data Cluster 3 : 0.24052581272137985

scores=pd.DataFrame([scores_linreg,scores_elasticreg,scores_ridgereg,scores_SVM_reg,scores_dtree_reg,scores_rand_forest,scores_KNN_reg,scores_xgb]
             ,index=['Linreg-r2','Elastic-net r2','Ridge-r2','SVR-r2','Dtree-r2','Rand-Forest-r2','KNN-r2',"XGB-r2"],
             columns=['Entire Data','Cluster 1','Cluster 2','Cluster 3'])
scores

for i in scores.columns:
    print(f"Best Model for {i} : {scores[i].idxmax()} ")

# From the above results, we can conclude that models generalise worse after clustering ,as compared to the whole dataset.
# Hence we don't need to to cluster the data

Best Model for Entire Data : XGB-r2 
Best Model for Cluster 1 : XGB-r2 
Best Model for Cluster 2 : XGB-r2 
Best Model for Cluster 3 : XGB-r2

# So, Best model is XGB Regression

X=data.drop(['Shipment_Price'],axis=1)
y=data['Shipment_Price']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2)

xgb_reg=XGBRegressor(subsample=0.7,max_depth=9,colsample_bytree=0.8,eta=0.1)
xgb_reg.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, eta=0.1, gamma=0,
             gpu_id=-1, importance_type='gain', interaction_constraints='',
             learning_rate=0.100000001, max_delta_step=0, max_depth=9,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.7,
             tree_method='exact', validate_parameters=1, verbosity=None)

import pickle
filename="xgboost.pickle"
pickle.dump(xgb_reg,open(filename,'wb'))

model=pickle.load(open(filename,'rb'))
pred=model.predict(X_test)
r2_score(y_test,pred)

0.9902839836071644

	Line Item Value	Pack Price	Unit Price	Weight (Kilograms)	Freight Cost (USD)	Line Item Insurance (USD)	Unit of Measure (Per Pack)	Line Item Quantity
count	1.032400e+04	10324.000000	10324.000000	6372.000000	6198.000000	10037.000000	10324.000000	10324.000000
mean	1.576506e+05	21.910241	0.611701	3424.441306	11103.234819	240.117626	77.990895	18332.534870
std	3.452921e+05	45.609223	3.275808	13526.968270	15813.026692	500.190568	76.579764	40035.302961
min	0.000000e+00	0.000000	0.000000	0.000000	0.750000	0.000000	1.000000	1.000000
25%	4.314593e+03	4.120000	0.080000	206.750000	2131.120000	6.510000	30.000000	408.000000
50%	3.047147e+04	9.300000	0.160000	1047.000000	5869.655000	47.040000	60.000000	3000.000000
75%	1.664471e+05	23.592500	0.470000	3334.000000	14406.570000	252.400000	90.000000	17039.750000
max	5.951990e+06	1345.640000	238.650000	857354.000000	289653.200000	7708.440000	1000.000000	619999.000000

	ID	Project Code	PQ #	PO / SO #	ASN/DN #	Country	Managed By	Fulfill Via	Vendor INCO Term	Shipment Mode	PQ First Sent to Client Date	PO Sent to Vendor Date	Scheduled Delivery Date	Delivered to Client Date	Delivery Recorded Date	Product Group	Sub Classification	Vendor	Item Description	Molecule/Test Type	Brand	Dosage	Dosage Form	Unit of Measure (Per Pack)	Line Item Quantity	Line Item Value	Pack Price	Unit Price	Manufacturing Site	First Line Designation	Weight (Kilograms)	Freight Cost (USD)	Line Item Insurance (USD)
0	1	100-CI-T01	Pre-PQ Process	SCMS-4	ASN-8	Côte d'Ivoire	PMO - US	Direct Drop	EXW	Air	Pre-PQ Process	Date Not Captured	2-Jun-06	2-Jun-06	2-Jun-06	HRDT	HIV test	RANBAXY Fine Chemicals LTD.	HIV, Reveal G3 Rapid HIV-1 Antibody Test, 30 T...	HIV, Reveal G3 Rapid HIV-1 Antibody Test	Reveal	NaN	Test kit	30	19	551.0	29.00	0.97	Ranbaxy Fine Chemicals LTD	Yes	13	780.34	NaN
1	3	108-VN-T01	Pre-PQ Process	SCMS-13	ASN-85	Vietnam	PMO - US	Direct Drop	EXW	Air	Pre-PQ Process	Date Not Captured	14-Nov-06	14-Nov-06	14-Nov-06	ARV	Pediatric	Aurobindo Pharma Limited	Nevirapine 10mg/ml, oral suspension, Bottle, 2...	Nevirapine	Generic	10mg/ml	Oral suspension	240	1000	6200.0	6.20	0.03	Aurobindo Unit III, India	Yes	358	4521.5	NaN
2	4	100-CI-T01	Pre-PQ Process	SCMS-20	ASN-14	Côte d'Ivoire	PMO - US	Direct Drop	FCA	Air	Pre-PQ Process	Date Not Captured	27-Aug-06	27-Aug-06	27-Aug-06	HRDT	HIV test	Abbott GmbH & Co. KG	HIV 1/2, Determine Complete HIV Kit, 100 Tests	HIV 1/2, Determine Complete HIV Kit	Determine	NaN	Test kit	100	500	40000.0	80.00	0.80	ABBVIE GmbH & Co.KG Wiesbaden	Yes	171	1653.78	NaN
3	15	108-VN-T01	Pre-PQ Process	SCMS-78	ASN-50	Vietnam	PMO - US	Direct Drop	EXW	Air	Pre-PQ Process	Date Not Captured	1-Sep-06	1-Sep-06	1-Sep-06	ARV	Adult	SUN PHARMACEUTICAL INDUSTRIES LTD (RANBAXY LAB...	Lamivudine 150mg, tablets, 60 Tabs	Lamivudine	Generic	150mg	Tablet	60	31920	127360.8	3.99	0.07	Ranbaxy, Paonta Shahib, India	Yes	1855	16007.06	NaN
4	16	108-VN-T01	Pre-PQ Process	SCMS-81	ASN-55	Vietnam	PMO - US	Direct Drop	EXW	Air	Pre-PQ Process	Date Not Captured	11-Aug-06	11-Aug-06	11-Aug-06	ARV	Adult	Aurobindo Pharma Limited	Stavudine 30mg, capsules, 60 Caps	Stavudine	Generic	30mg	Capsule	60	38000	121600.0	3.20	0.05	Aurobindo Unit III, India	Yes	7590	45450.08	NaN

	Country	Vendor	Manufacturing Site	Brand	Item Description	Product Group	Sub Classification	Molecule/Test Type	Dosage Form	Dosage	Managed By	Vendor INCO Term	Fulfill Via	Shipment Mode	First Line Designation	Weight_special	Freight_cost_special
count	10324	10324	10324	10324	10324	10324	10324	10324	10324	8588	10324	10324	10324	9964	10324	10324	10324
unique	43	73	88	48	184	5	6	86	17	54	4	8	2	4	2	3	4
top	South Africa	SCMS from RDC	Aurobindo Unit III, India	Generic	Efavirenz 600mg, tablets, 30 Tabs	ARV	Adult	Efavirenz	Tablet	300mg	PMO - US	N/A - From RDC	From RDC	Air	Yes	Normal Measurement	Normal Measurement
freq	1406	5404	3172	7285	755	8550	6595	1125	3532	990	10265	5404	5404	6113	7030	6372	6198

	Country	Vendor	Manufacturing Site	Brand	Item Description	Product Group	Sub Classification	Molecule/Test Type	Dosage Form	Dosage	Managed By	Vendor INCO Term	Fulfill Via	Shipment Mode	Scheduled Delivery Date	Delivered to Client Date	Delivery Recorded Date	Unit of Measure (Per Pack)	Line Item Quantity	Line Item Value	Pack Price	Unit Price	First Line Designation	Weight (Kilograms)	Weight_special	Freight Cost (USD)	Freight_cost_special	Line Item Insurance (USD)
0	1083	2	2	5	1	1728	1567	5	1575	1098	10265	2778	0.0	6324	2006-06-02	2006-06-02	2006-06-02	30	19	551.00	29.00	0.97	1.0	13.000000	6372	780.340000	6198	240.117626
1	688	668	3172	7285	1	8550	1955	877	214	664	10265	2778	0.0	6324	2006-11-14	2006-11-14	2006-11-14	240	1000	6200.00	6.20	0.03	1.0	358.000000	6372	4521.500000	6198	240.117626
2	1083	2	2	799	577	1728	1567	577	1575	708	10265	397	0.0	6324	2006-08-27	2006-08-27	2006-08-27	100	500	40000.00	80.00	0.80	1.0	171.000000	6372	1653.780000	6198	240.117626
3	688	2	2	7285	378	8550	6595	592	3532	523	10265	2778	0.0	6324	2006-09-01	2006-09-01	2006-09-01	60	31920	127360.80	3.99	0.07	1.0	1855.000000	6372	16007.060000	6198	240.117626
4	688	668	3172	7285	1	8550	6595	5	729	2	10265	2778	0.0	6324	2006-08-11	2006-08-11	2006-08-11	60	38000	121600.00	3.20	0.05	1.0	7590.000000	6372	45450.080000	6198	240.117626
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10319	538	5404	2	7285	1	8550	1955	707	239	2	10265	5404	1.0	2941	2015-07-31	2015-07-15	2015-07-20	60	166571	599655.60	3.60	0.06	0.0	3424.441306	2445	11103.234819	2445	705.790000
10320	1083	5404	869	7285	597	8550	6595	689	2749	732	10265	5404	1.0	2941	2015-07-31	2015-08-06	2015-08-07	60	21072	137389.44	6.52	0.11	0.0	3424.441306	2445	11103.234819	2445	161.710000
10321	683	5404	2	7285	1	8550	6595	5	2749	2	10265	5404	1.0	2941	2015-08-31	2015-08-25	2015-09-03	30	514526	5140114.74	9.99	0.33	0.0	3424.441306	1507	11103.234819	1442	5284.040000
10322	538	5404	1415	7285	597	8550	6595	689	2749	732	10265	5404	1.0	2941	2015-09-09	2015-08-04	2015-08-11	60	17465	113871.80	6.52	0.11	1.0	1392.000000	6372	11103.234819	1442	134.030000
10323	538	5404	665	7285	1	8550	1955	689	239	2	10265	5404	1.0	2941	2015-08-31	2015-08-04	2015-08-11	60	36639	72911.61	1.99	0.03	0.0	3424.441306	1507	11103.234819	1442	85.820000

	Entire Data	Cluster 1	Cluster 2	Cluster 3
Linreg-r2	0.801463	0.672712	0.815280	0.631006
Elastic-net r2	0.662056	0.233331	0.482353	0.270899
Ridge-r2	0.906502	0.662555	0.798388	0.647440
SVR-r2	-0.182555	-0.045234	-0.237492	-0.174989
Dtree-r2	0.975661	0.772667	0.949933	0.620972
Rand-Forest-r2	0.967478	0.846370	0.969529	0.589724
KNN-r2	0.904320	0.592339	0.854245	0.240526
XGB-r2	0.990259	0.949926	0.988497	0.859891

Original Data Source : https://data.usaid.gov/HIV-AIDS/Supply-Chain-Shipment-Pricing-Data/a3rc-nmf6¶

Data Description¶

1. ID - identity number¶

2. Project Code - Project code for the shipment¶

3. PQ # - Price quote (PQ) number¶

4. PO / SO # - Price Order / Sales Order --- PO is a buyer's request to a seller (supplier) to order the goods/services¶

5. ASN/DN # - Shipment number: Advanced Shipment Note (ASN) for Direct Drop deliveries, or Delivery Note (DN) for from RDC deliveries¶

6. Country - Destination country , where shipment is to be delivered¶

7. Managed By - SCMS managing office: either the Program Management Office (PMO) in the U.S. or the relevant SCMS field office¶

8. Fulfill Via - Method through which the shipment was fulfilled: via Direct Drop from vendor or from stock available in the RDCs( retail distribution centers )¶

10. Shipment Mode - Method by which commodities are shipped¶

11. PQ First Sent to Client Date - Date the PQ is first sent to the client¶

12. PO Sent to Vendor Date - Date the PO is first sent to the vendor¶

13. Scheduled Delivery Date - Date on which delivery to client was recorded in SCMS information systems¶

14. Product Group - Product group for item, i.e. ARV, HRDT¶

15. Sub Classification - Identifies relevant product sub classifications, such as whether ARVs are pediatric or adult, whether a malaria product is an artemisinin-based combination therapy (ACT), etc.¶

16. Vendor - Vendor name¶

17. Item Description - Product name and formulation from Partnership for Supply Chain Management (PFSCM) Item Master¶

18. Molecule/Test Type - Active drug(s) or test kit type¶

19. Brand - Generic or branded name for the item¶

20. Dosage - Item dosage and unit¶

21. Dosage Form - Dosage form for the item (tablet, oral solution, injection, etc.).¶

22. Unit of Measure (Per Pack) - Pack quantity (pills or test kits) used to compute unit price¶

23. Line Item Quantity - Total quantity (packs) of commodity per line item¶

24. Line Item Value - Total value of commodity per line item¶

25. Pack Price - Cost per pack (i.e. month s supply of ARVs, pack of 60 test kits)¶

26. Unit Price - Cost per pill (for drugs) or per test (for test kits)¶

27. Manufacturing Site - Identifies manufacturing site for the line item for direct drop and from RDC deliveries¶

28. First Line Designation - Designates if the line in question shows the aggregated freight costs and weight associated with all items on the ASN DN¶

29. Weight (Kilograms) - Weight for all lines on an ASN DN¶

30. Freight Cost (USD) - Freight charges associated with all lines on the respective ASN DN¶

31. Line Item Insurance (USD) - Line item cost of insurance, created by applying an annual flat rate ( ) to commodity cost¶

EDA¶

Original Data Source : https://data.usaid.gov/HIV-AIDS/Supply-Chain-Shipment-Pricing-Data/a3rc-nmf6 ¶