How to use deep learning to make predictions on tabular data with TensorFlow Keras

It has been ages since I last wrote a post. My role at work has changed and I no longer do engineering work, but I still take every opportunity to learn new technical skills. This time, it’s deep learning!

Deep learning has lots of use cases, the most popular of which is probably image classification. However, my approach to learning has always been to make sure that what I learn applies to my “daily life”. With this in mind, I chose to work on tabular / structured data.

Story

Organizations often seek to discover the secret to keeping their best talent at the firm. However, this is not as straightforward as it seems.

The goal of this project is to use deep learning to estimate the chance of employee attrition, based on the Employee Attrition dataset taken from Kaggle.

Code-first learning (TL;DR)

For those of you who prefer to dive right into the code, refer to my GitHub repository, Employee attrition Predictor. The notebook was built on Google Colab.

The First Model

The First Model is based on Red Dragon AI’s class example.

Steps

Install and import packages

pip -q install tf-nightly
%matplotlib inline
import matplotlib.pyplot as plt

import math
import tensorflow as tf
import numpy as np
from numpy import unique
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, TensorBoard

Read Files

This example reads the files from my Google Drive, but there is also a copy on my GitHub.

#Training and validation dataset
file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRTzmPbXWcC6mfBDE1MBg5HoHsYlvYtkZp8oJFHfIMNzqiG6P4cdGaceWsxW9JS6ip9vdJYCNrDEbOx/pub?gid=581336355&single=true&output=csv"
dataframe = pd.read_csv(file_url)
dataframe.shape

#Prediction dataset
pred_file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRsxR0nTrEaCqfro4FvGNDn6ZYYdQS0e2Tev1SMtJ5jYIjU0WGp77hp6btdJYkMl3XAk4lA01hxE30o/pub?gid=2057317048&single=true&output=csv"
pred_dataframe = pd.read_csv(pred_file_url)
pred_dataframe.shape

This is what the data looks like (first five rows, comma-separated):

id,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,AttritionBinary
1,50,No,Travel_Rarely,1126,Research & Development,1,2,Medical,1,997,4,Male,66,3,4,Research Director,4,Divorced,17399,6615,9,Y,No,22,4,3,80,1,32,1,2,5,4,1,3,0
2,36,No,Travel_Rarely,216,Research & Development,6,2,Medical,1,178,2,Male,84,3,2,Manufacturing Director,2,Divorced,4941,2819,6,Y,No,20,4,4,80,2,7,0,3,3,2,0,1,0
3,21,Yes,Travel_Rarely,337,Sales,7,1,Marketing,1,1780,2,Male,31,3,1,Sales Representative,2,Single,2679,4567,1,Y,No,13,3,2,80,0,1,3,3,1,0,1,0,1
4,50,No,Travel_Frequently,1246,Human Resources,28,3,Medical,1,644,1,Male,99,3,5,Manager,2,Married,18200,7999,1,Y,No,11,3,3,80,1,32,2,3,32,5,10,7,0
5,52,No,Travel_Rarely,994,Research & Development,7,4,Life Sciences,1,1118,2,Male,87,3,3,Healthcare Representative,2,Single,10445,15322,7,Y,No,19,3,4,80,0,18,4,3,8,6,4,0,0

Drop columns

Drop dataframe columns that are not useful

dataframe = dataframe.drop(['id','Attrition','EmployeeCount','EmployeeNumber'], axis=1)
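One quick way to spot columns such as EmployeeCount that carry no information is to count the distinct values in each column. The sketch below does this in plain Python on three made-up rows (with pandas, dataframe.nunique() gives the same counts). The helper name constant_columns is hypothetical and is not part of the original notebook.

```python
def constant_columns(rows):
    """Return the names of columns whose value is identical in every row."""
    if not rows:
        return []
    return [
        name for name in rows[0]
        if len({row[name] for row in rows}) == 1
    ]


# Illustrative rows only; they mimic the shape of the dataset, not its content.
rows = [
    {"Age": 50, "EmployeeCount": 1, "JobLevel": 4},
    {"Age": 36, "EmployeeCount": 1, "JobLevel": 2},
    {"Age": 21, "EmployeeCount": 1, "JobLevel": 1},
]
print(constant_columns(rows))
```

A column that never changes cannot help the model tell employees apart, so dropping it is safe.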

Preprocess data



Split the dataframe into training and validation sets, then convert the dataframes to datasets.

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)


print(
    "Using %d samples for training, %d for validation and %d for predicting"
    % (len(train_dataframe), len(val_dataframe), len(pred_dataframe))
)

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("AttritionBinary")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds


def pred_dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe)))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds



train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
pred_ds = pred_dataframe_to_dataset(pred_dataframe)

for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

Normalize integers, convert strings, do one hot encoding

Feature preprocessing with Keras layers

The following features are categorical features encoded as integers:

  • Education
  • EnvironmentSatisfaction
  • JobInvolvement
  • JobLevel
  • JobSatisfaction
  • PerformanceRating
  • RelationshipSatisfaction
  • StandardHours
  • StockOptionLevel
  • WorkLifeBalance

We will one-hot encode these features with a CategoryEncoding() layer.

We also have some categorical features encoded as strings:

  • BusinessTravel
  • Department
  • EducationField
  • Gender
  • JobRole
  • MaritalStatus
  • Over18
  • OverTime

We will first create an index of all possible values using the StringLookup() layer, then we will one-hot encode the output indices using a CategoryEncoding() layer.

Finally, the following features are continuous numerical features:
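The two-step idea (look up an integer index for each string, then one-hot encode the index) can be sketched in plain Python. This is only an illustration of the concept, not how the Keras layers are implemented: the helper names are hypothetical, the category values are examples, and reserving index 0 for unseen strings is an assumption of this sketch.

```python
def build_vocabulary(values):
    """Map each distinct string to an integer index; index 0 is kept for unknowns."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]


def encode(value, vocab):
    # Unknown strings fall back to index 0 (an assumption of this sketch).
    return one_hot(vocab.get(value, 0), len(vocab) + 1)


travel = ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely", "Non-Travel"]
vocab = build_vocabulary(travel)
print(vocab)
print(encode("Travel_Frequently", vocab))
print(encode("Unseen_Value", vocab))
```

Integer categorical features skip the lookup step and go straight to one_hot.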

  • Age
  • DailyRate
  • DistanceFromHome
  • HourlyRate
  • MonthlyIncome
  • MonthlyRate
  • NumCompaniesWorked
  • PercentSalaryHike
  • TotalWorkingYears
  • TrainingTimesLastYear
  • YearsAtCompany
  • YearsInCurrentRole
  • YearsSinceLastPromotion
  • YearsWithCurrManager

For each of these features, we will use a Normalization() layer to make sure the mean of each feature is 0 and its standard deviation is 1.
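As a worked example of what this normalization computes, the sketch below learns a mean and standard deviation from a handful of illustrative ages and rescales them. It uses only the standard library, and the helper names fit_normalizer and normalize are hypothetical stand-ins for the layer's adapt step and call step.

```python
import statistics


def fit_normalizer(values):
    """Learn the mean and (population) standard deviation of a feature."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std


def normalize(value, mean, std):
    """Shift and scale one value using the learned statistics."""
    return (value - mean) / std


ages = [50, 36, 21, 50, 52]  # illustrative values
mean, std = fit_normalizer(ages)
scaled = [normalize(a, mean, std) for a in ages]
print(scaled)
```

After scaling, the values have a mean of (approximately) 0 and a standard deviation of 1, so features measured on very different scales, such as Age and MonthlyIncome, become comparable.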

Below, we use three utility functions for these operations (all three were written by fchollet):

  • encode_numerical_feature applies feature-wise normalization to numerical features.
  • encode_string_categorical_feature first turns string inputs into integer indices, then one-hot encodes those indices.
  • encode_integer_categorical_feature one-hot encodes integer categorical features.

from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding
from tensorflow.keras.layers.experimental.preprocessing import StringLookup


def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = Normalization()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature


def encode_string_categorical_feature(feature, name, dataset):
    # Create a StringLookup layer which will turn strings into integer indices
    index = StringLookup()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the set of possible string values and assign them a fixed integer index
    index.adapt(feature_ds)

    # Turn the string input into integer indices
    encoded_feature = index(feature)

    # Create a CategoryEncoding for our integer indices
    encoder = CategoryEncoding(output_mode="binary")

    # Prepare a dataset of indices
    feature_ds = feature_ds.map(index)

    # Learn the space of possible indices
    encoder.adapt(feature_ds)

    # Apply one-hot encoding to our indices
    encoded_feature = encoder(encoded_feature)
    return encoded_feature


def encode_integer_categorical_feature(feature, name, dataset):
    # Create a CategoryEncoding for our integer indices
    encoder = CategoryEncoding(output_mode="binary")

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the space of possible indices
    encoder.adapt(feature_ds)

    # Apply one-hot encoding to our indices
    encoded_feature = encoder(feature)
    return encoded_feature

Build model

Note that the encoded inputs are concatenated before being passed to the dense layers.

# Categorical features encoded as integers

Education = keras.Input(shape=(1,), name="Education", dtype="int64")
EnvironmentSatisfaction = keras.Input(shape=(1,), name="EnvironmentSatisfaction", dtype="int64")
JobInvolvement = keras.Input(shape=(1,), name="JobInvolvement", dtype="int64")
JobLevel = keras.Input(shape=(1,), name="JobLevel", dtype="int64")
JobSatisfaction = keras.Input(shape=(1,), name="JobSatisfaction", dtype="int64")
PerformanceRating = keras.Input(shape=(1,), name="PerformanceRating", dtype="int64")
RelationshipSatisfaction = keras.Input(shape=(1,), name="RelationshipSatisfaction", dtype="int64")
StandardHours = keras.Input(shape=(1,), name="StandardHours", dtype="int64")
StockOptionLevel = keras.Input(shape=(1,), name="StockOptionLevel", dtype="int64")
WorkLifeBalance = keras.Input(shape=(1,), name="WorkLifeBalance", dtype="int64")

# Categorical feature encoded as string

BusinessTravel = keras.Input(shape=(1,), name="BusinessTravel", dtype="string")
Department = keras.Input(shape=(1,), name="Department", dtype="string")
EducationField = keras.Input(shape=(1,), name="EducationField", dtype="string")
Gender = keras.Input(shape=(1,), name="Gender", dtype="string")
JobRole = keras.Input(shape=(1,), name="JobRole", dtype="string")
MaritalStatus = keras.Input(shape=(1,), name="MaritalStatus", dtype="string")
Over18 = keras.Input(shape=(1,), name="Over18", dtype="string")
OverTime = keras.Input(shape=(1,), name="OverTime", dtype="string")

# Numerical features

Age = keras.Input(shape=(1,), name="Age")
DailyRate = keras.Input(shape=(1,), name="DailyRate")
DistanceFromHome = keras.Input(shape=(1,), name="DistanceFromHome")
HourlyRate = keras.Input(shape=(1,), name="HourlyRate")
MonthlyIncome = keras.Input(shape=(1,), name="MonthlyIncome")
MonthlyRate = keras.Input(shape=(1,), name="MonthlyRate")
NumCompaniesWorked = keras.Input(shape=(1,), name="NumCompaniesWorked")
PercentSalaryHike = keras.Input(shape=(1,), name="PercentSalaryHike")
TotalWorkingYears = keras.Input(shape=(1,), name="TotalWorkingYears")
TrainingTimesLastYear = keras.Input(shape=(1,), name="TrainingTimesLastYear")
YearsAtCompany = keras.Input(shape=(1,), name="YearsAtCompany")
YearsInCurrentRole = keras.Input(shape=(1,), name="YearsInCurrentRole")
YearsSinceLastPromotion = keras.Input(shape=(1,), name="YearsSinceLastPromotion")
YearsWithCurrManager = keras.Input(shape=(1,), name="YearsWithCurrManager")

all_inputs = [
    Education,
    EnvironmentSatisfaction,
    JobInvolvement,
    JobLevel,
    JobSatisfaction,
    PerformanceRating,
    RelationshipSatisfaction,
    StandardHours,
    StockOptionLevel,
    WorkLifeBalance,

    BusinessTravel,
    Department,
    EducationField,
    Gender,
    JobRole,
    MaritalStatus,
    Over18,
    OverTime,

    Age,
    DailyRate,
    DistanceFromHome,
    HourlyRate,
    MonthlyIncome,
    MonthlyRate,
    NumCompaniesWorked,
    PercentSalaryHike,
    TotalWorkingYears,
    TrainingTimesLastYear,
    YearsAtCompany,
    YearsInCurrentRole,
    YearsSinceLastPromotion,
    YearsWithCurrManager
]

# Integer categorical features
Education_encoded = encode_integer_categorical_feature(Education, "Education", train_ds)
EnvironmentSatisfaction_encoded = encode_integer_categorical_feature(EnvironmentSatisfaction, "EnvironmentSatisfaction", train_ds)
JobInvolvement_encoded = encode_integer_categorical_feature(JobInvolvement, "JobInvolvement", train_ds)
JobLevel_encoded = encode_integer_categorical_feature(JobLevel, "JobLevel", train_ds)
JobSatisfaction_encoded = encode_integer_categorical_feature(JobSatisfaction, "JobSatisfaction", train_ds)
PerformanceRating_encoded = encode_integer_categorical_feature(PerformanceRating, "PerformanceRating", train_ds)
RelationshipSatisfaction_encoded = encode_integer_categorical_feature(RelationshipSatisfaction, "RelationshipSatisfaction", train_ds)
StandardHours_encoded = encode_integer_categorical_feature(StandardHours, "StandardHours", train_ds)
StockOptionLevel_encoded = encode_integer_categorical_feature(StockOptionLevel, "StockOptionLevel", train_ds)
WorkLifeBalance_encoded = encode_integer_categorical_feature(WorkLifeBalance, "WorkLifeBalance", train_ds)

# String categorical features
BusinessTravel_encoded = encode_string_categorical_feature(BusinessTravel, "BusinessTravel", train_ds)
Department_encoded = encode_string_categorical_feature(Department, "Department", train_ds)
EducationField_encoded = encode_string_categorical_feature(EducationField, "EducationField", train_ds)
Gender_encoded = encode_string_categorical_feature(Gender, "Gender", train_ds)
JobRole_encoded = encode_string_categorical_feature(JobRole, "JobRole", train_ds)
MaritalStatus_encoded = encode_string_categorical_feature(MaritalStatus, "MaritalStatus", train_ds)
Over18_encoded = encode_string_categorical_feature(Over18, "Over18", train_ds)
OverTime_encoded = encode_string_categorical_feature(OverTime, "OverTime", train_ds)

# Numerical features
Age_encoded = encode_numerical_feature(Age, "Age", train_ds)
DailyRate_encoded = encode_numerical_feature(DailyRate, "DailyRate", train_ds)
DistanceFromHome_encoded = encode_numerical_feature(DistanceFromHome, "DistanceFromHome", train_ds)
HourlyRate_encoded = encode_numerical_feature(HourlyRate, "HourlyRate", train_ds)
MonthlyIncome_encoded = encode_numerical_feature(MonthlyIncome, "MonthlyIncome", train_ds)
MonthlyRate_encoded = encode_numerical_feature(MonthlyRate, "MonthlyRate", train_ds)
NumCompaniesWorked_encoded = encode_numerical_feature(NumCompaniesWorked, "NumCompaniesWorked", train_ds)
PercentSalaryHike_encoded = encode_numerical_feature(PercentSalaryHike, "PercentSalaryHike", train_ds)
TotalWorkingYears_encoded = encode_numerical_feature(TotalWorkingYears, "TotalWorkingYears", train_ds)
TrainingTimesLastYear_encoded = encode_numerical_feature(TrainingTimesLastYear, "TrainingTimesLastYear", train_ds)
YearsAtCompany_encoded = encode_numerical_feature(YearsAtCompany, "YearsAtCompany", train_ds)
YearsInCurrentRole_encoded = encode_numerical_feature(YearsInCurrentRole, "YearsInCurrentRole", train_ds)
YearsSinceLastPromotion_encoded = encode_numerical_feature(YearsSinceLastPromotion, "YearsSinceLastPromotion", train_ds)
YearsWithCurrManager_encoded = encode_numerical_feature(YearsWithCurrManager, "YearsWithCurrManager", train_ds)

all_features = layers.concatenate(
    [
      Education_encoded,
      EnvironmentSatisfaction_encoded,
      JobInvolvement_encoded,
      JobLevel_encoded,
      JobSatisfaction_encoded,
      PerformanceRating_encoded,
      RelationshipSatisfaction_encoded,
      StandardHours_encoded,
      StockOptionLevel_encoded,
      WorkLifeBalance_encoded,
      BusinessTravel_encoded,
      Department_encoded,
      EducationField_encoded,
      Gender_encoded,
      JobRole_encoded,
      MaritalStatus_encoded,
      Over18_encoded,
      OverTime_encoded,
      Age_encoded,
      DailyRate_encoded,
      DistanceFromHome_encoded,
      HourlyRate_encoded,
      MonthlyIncome_encoded,
      MonthlyRate_encoded,
      NumCompaniesWorked_encoded,
      PercentSalaryHike_encoded,
      TotalWorkingYears_encoded,
      TrainingTimesLastYear_encoded,
      YearsAtCompany_encoded,
      YearsInCurrentRole_encoded,
      YearsSinceLastPromotion_encoded,
      YearsWithCurrManager_encoded
    ]
)

# Multiple dense layers with dropout (regularization, aiming for higher validation accuracy)
x = layers.Dense(187, activation="tanh", name = "Dense_1")(all_features)
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu", name = "Dense_2")(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation="relu", name = "Dense_3")(x)
output = layers.Dense(1, activation="sigmoid",name = "Outputlayer")(x)

model = keras.Model(all_inputs, output)

# Use the Adam optimizer with an explicit learning rate (0.001 is also the Keras default)

opt = keras.optimizers.Adam(learning_rate=0.001)

model.compile(opt, "binary_crossentropy", metrics=["accuracy"])

model.summary()
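The "binary_crossentropy" loss passed to model.compile above measures how far a predicted probability is from the true 0/1 label. Here is a minimal single-example version in plain Python, as a sketch of the formula rather than the Keras implementation. The clipping constant eps is an assumption of this sketch.

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Loss for one example: small when p_pred is close to the label y_true."""
    p = min(max(p_pred, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# For an employee who left (label 1), a confident correct prediction
# costs little and a confident wrong one costs a lot.
print(binary_crossentropy(1, 0.9))
print(binary_crossentropy(1, 0.1))
```

Training lowers the average of this loss over all examples, which pushes the sigmoid output towards 1 for employees who left and towards 0 for those who stayed.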

View model

keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

Add checkpoints

Checkpoints let us save the model from the epoch with the highest validation accuracy.

!mkdir checkpoints
!ls
checkpoint = ModelCheckpoint('./checkpoints/best_weights.tf', monitor='val_accuracy', verbose=1, save_best_only=True, mode='auto')
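The effect of save_best_only=True can be sketched as "save only when the monitored value beats every earlier epoch". The function below is a hypothetical plain-Python illustration of that rule, and the accuracy numbers are made up, not results from this notebook.

```python
def best_checkpoint_epochs(val_accuracies):
    """Return the (1-based) epochs at which a save-best-only rule would save."""
    best = float("-inf")
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:  # strictly better than every earlier epoch
            best = acc
            saved.append(epoch)
    return saved


# Made-up validation accuracies for illustration.
print(best_checkpoint_epochs([0.80, 0.84, 0.83, 0.86, 0.86]))
```

Because each save overwrites the previous file, the checkpoint on disk at the end of training comes from the last epoch in this list.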

We store the return value of model.fit in history so we can plot the training curves later.

history = model.fit(train_ds, epochs=70, validation_data=val_ds, callbacks=[checkpoint])

Predict on sample data

sample = {
    "Education": 2,
    "EnvironmentSatisfaction": 4,
    "JobInvolvement": 4,
    "JobLevel": 2,
    "JobSatisfaction": 1,
    "PerformanceRating": 3,
    "RelationshipSatisfaction": 3,
    "StandardHours": 80,
    "StockOptionLevel": 2,
    "WorkLifeBalance": 3,
    "BusinessTravel": "Travel_Rarely",
    "Department":"Research & Development",
    "EducationField":"Medical",
    "Gender":"Female",
    "JobRole":"Manufacturing Director",
    "MaritalStatus":"Divorced",
    "Over18":"Y",
    "OverTime":"No",
    "Age": 53,
    "DailyRate": 1084,
    "DistanceFromHome": 13,
    "HourlyRate": 57,
    "MonthlyIncome": 4450,
    "MonthlyRate": 26250,
    "NumCompaniesWorked": 1,
    "PercentSalaryHike": 11,
    "TotalWorkingYears": 5,
    "TrainingTimesLastYear": 3,
    "YearsAtCompany": 4,
    "YearsInCurrentRole": 2,
    "YearsSinceLastPromotion": 1,
    "YearsWithCurrManager": 3,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
model.predict(input_dict)

This employee had a low chance of attrition!

Predict on 400+ employees and chart prediction outcome

predictions = []

for employee in pred_ds:
    input_dict = {name: tf.convert_to_tensor([value]) for name, value in employee.items()}
    probability = float(model.predict(input_dict)[0][0])  # scalar instead of a (1, 1) array
    employee_number = tf.get_static_value(employee["EmployeeNumber"])
    employee_age = tf.get_static_value(employee["Age"])
    employee_monthly_income = tf.get_static_value(employee["MonthlyIncome"])
    employee_satisfaction = tf.get_static_value(employee["JobSatisfaction"]) * tf.get_static_value(employee["RelationshipSatisfaction"])
    if(math.isnan(probability)):
      #Test data has missing features that cause the model to be unable to determine employee attrition %
      print(
        "We do not have a good prediction as to whether Employee Number %d will leave"
        % (employee_number)
      )
      predictions.append({
                          "Age": employee_age,
                          "MonthlyIncome":employee_monthly_income,
                          "Satisfaction":employee_satisfaction,
                          "prediction":0
                        })
    elif(probability <0.5):
      #Employee has low chance of leaving (Less than 50%)
      print(
          "Employee Number %d has a low chance of leaving (%f)"
          % (employee_number, probability)
      )
      predictions.append({
                          "Age": employee_age,
                          "MonthlyIncome":employee_monthly_income,
                          "Satisfaction":employee_satisfaction,
                          "prediction":1
                        })
    else:
      #Employee has high chance of leaving (50% or more)
      print(
          "Employee Number %d has a high chance of leaving (%f)"
          % (employee_number, probability)
      )
      predictions.append({
                    "Age": employee_age,
                    "MonthlyIncome":employee_monthly_income,
                    "Satisfaction":employee_satisfaction,
                    "prediction":2
                  })
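The three branches in the loop above boil down to one small mapping from probability to category code. The helper below is a hypothetical refactoring sketch, not part of the original notebook, and the threshold parameter is added for illustration.

```python
import math


def attrition_category(probability, threshold=0.5):
    """Map a predicted probability to the codes used in the loop above.

    0 = unknown (NaN prediction), 1 = likely to stay, 2 = likely to leave.
    """
    if math.isnan(probability):
        return 0
    return 1 if probability < threshold else 2


print([attrition_category(p) for p in (float("nan"), 0.12, 0.5, 0.93)])
```

Keeping the rule in one function makes it easy to try a different threshold, for example a lower one if missing a likely leaver is more costly than a false alarm.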

Chart attrition prediction against employee age (the Jupyter notebook shows other ways of plotting)

figure = plt.figure(figsize=(13,8))
predictions_frame= pd.DataFrame(predictions)
plt.hist([predictions_frame[predictions_frame['prediction']==2]['Age'],predictions_frame[predictions_frame['prediction']==1]['Age'],predictions_frame[predictions_frame['prediction']==0]['Age']], stacked=True, color = ['r','g','b'],
         label = ['High chance of attrition','Likely to stay','Unknown'])
plt.xlabel('Age')
plt.ylabel('Number of employees')
plt.legend()
plt.title("Employee attrition probability by age")

Improving accuracy

In the First Model, we normalized the continuous numerical features and used one-hot encoding for the categorical integers / strings.

One-hot encoding discards some relationship information between / within the categories: for example, the ordering of an ordinal rating is lost.

Therefore, we will try feature columns, which give more flexibility in how we treat each column and in what the model learns from it.
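A small sketch shows what "losing relationship information" means for one-hot vectors: every pair of categories ends up equally far apart, even when the original values are ordered. The example uses a made-up four-level rating and standard-library code only.

```python
import math


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# An ordered category with four levels (for example a satisfaction score).
levels = [one_hot(i, 4) for i in range(4)]
print(distance(levels[0], levels[1]))  # neighbouring levels
print(distance(levels[0], levels[3]))  # opposite ends of the scale
```

Both distances are identical, so the model has to relearn from the data that level 1 is closer to level 2 than to level 4.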

The Second Model

The Second Model uses TensorFlow’s feature columns to handle the columns as:

  1. Numerical columns
  2. Categorical columns
  3. Embedding columns
  4. Crossed feature columns

%pip install -q scikit-learn
from tensorflow import feature_column
from sklearn.model_selection import train_test_split
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('AttritionBinary')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 10
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['Age'])
  print('A batch of targets:', label_batch )

Choosing the feature columns

  • Numeric feature columns for continuous numerical values
  • Bucketized columns for age groups
  • Indicator columns for categorical data
  • Embedding columns for large categories of data that should not use one-hot encoding
  • Crossed column for data you want to be understood together
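Two of these column types are easy to illustrate without TensorFlow. The sketch below shows the idea behind bucketizing (turn a number into the index of the range it falls in) and crossing (hash a combination of categories into a fixed number of buckets). The helper names are hypothetical, and zlib.crc32 is only a stand-in hash: TensorFlow uses its own hashing, so the bucket numbers will differ.

```python
import bisect
import zlib


def bucket_index(value, boundaries):
    """Index of the bucket that value falls into; boundaries are left-inclusive."""
    return bisect.bisect_right(boundaries, value)


def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of categorical values into a fixed number of buckets."""
    key = "_X_".join(values).encode("utf-8")
    return zlib.crc32(key) % hash_bucket_size


age_boundaries = [1, 20, 35, 45, 70]  # the boundaries used for Age in this post
print([bucket_index(age, age_boundaries) for age in (19, 20, 36, 50)])
print(crossed_bucket(["Sales", "Sales Representative"], 100))
```

With five boundaries there are six buckets, so an age of 50 lands in bucket 4 (ages 45 to 69). A crossed column lets the model treat a specific department / job role pair as a feature of its own.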

feature_columns = []

# numeric cols
for header in [    
      'DailyRate',
      'DistanceFromHome',
      'HourlyRate',
      'MonthlyIncome',
      'MonthlyRate',
      'NumCompaniesWorked',
      'PercentSalaryHike',
      'TotalWorkingYears',
      'TrainingTimesLastYear',
      'YearsAtCompany',
      'YearsInCurrentRole',
      'YearsSinceLastPromotion',
      'YearsWithCurrManager',
    ]:
  feature_columns.append(feature_column.numeric_column(header))


# bucketized cols
# We bucketize Age here (a salary column could be bucketized the same way)
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 20, 35, 45, 70])
feature_columns.append(age_buckets)


# indicator_columns
indicator_column_names = [
    'EducationField',
    'Gender',
    'MaritalStatus',
    'Over18',
    'OverTime', 
    'Education',
    'EnvironmentSatisfaction',
    'JobInvolvement',
    'JobLevel',
    'JobSatisfaction',
    'PerformanceRating',
    'RelationshipSatisfaction',
    'StandardHours',
    'StockOptionLevel',
    'WorkLifeBalance',
    
]
for col_name in indicator_column_names:
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, dataframe[col_name].unique())
  indicator_column = feature_column.indicator_column(categorical_column)
  feature_columns.append(indicator_column)

# embedding columns: BusinessTravel
businessTravel = feature_column.categorical_column_with_vocabulary_list(
      'BusinessTravel', dataframe.BusinessTravel.unique())
BusinessTravel_embedding = feature_column.embedding_column(businessTravel, dimension=8)
feature_columns.append(BusinessTravel_embedding)

# embedding columns: Department
department = feature_column.categorical_column_with_vocabulary_list(
      'Department', dataframe.Department.unique())
department_embedding = feature_column.embedding_column(department, dimension=8)
feature_columns.append(department_embedding)

# embedding columns: JobRole
jobRole = feature_column.categorical_column_with_vocabulary_list(
      'JobRole', dataframe.JobRole.unique())
JobRole_embedding = feature_column.embedding_column(jobRole, dimension=8)
feature_columns.append(JobRole_embedding)

# crossed columns
Department_Job_feature = feature_column.crossed_column([department, jobRole], hash_bucket_size=100)
feature_columns.append(feature_column.indicator_column(Department_Job_feature))
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
model_2 = tf.keras.Sequential([
  feature_layer,
  layers.Dropout(.2),
  layers.Dense(73, activation='relu'),
  layers.Dropout(.2),
  layers.Dense(73, activation='relu'),
  layers.Dense(1)
])
opt = keras.optimizers.Adam(learning_rate=0.001)

model_2.compile(optimizer=opt,
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history_feature = model_2.fit(train_ds,
          validation_data=val_ds,
          epochs=70)
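Unlike the First Model, model_2 ends in a Dense(1) layer without an activation and is trained with from_logits=True, so its raw outputs are logits rather than probabilities. A sigmoid converts them, as in this minimal sketch (plain Python, with example logit values chosen for illustration):

```python
import math


def sigmoid(logit):
    """Convert a raw model output (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


for logit in (-2.0, 0.0, 2.0):
    print(logit, round(sigmoid(logit), 3))
```

A logit of 0 corresponds to a probability of 0.5, so when reading model_2's predictions the "likely to leave" cut-off is a logit of 0, not 0.5.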

Next, we compare the two models by their training and validation accuracy.

Comparing the history of both trainings

Green line: First Model Training accuracy
Red line: First Model Validation accuracy
Blue line: Second Model Training accuracy
Orange line: Second Model Validation accuracy
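Besides reading the curves, the two runs can be compared numerically by looking up each run's best validation accuracy. The sketch below uses placeholder numbers, not results from this post. With the real runs you would pass history.history["val_accuracy"] and history_feature.history["val_accuracy"]. The helper name best_epoch is hypothetical.

```python
def best_epoch(metric_values):
    """Return (1-based epoch, value) of the highest value in a metric series."""
    value = max(metric_values)
    return metric_values.index(value) + 1, value


# Placeholder numbers for illustration only.
first_model = {"val_accuracy": [0.81, 0.85, 0.84]}
second_model = {"val_accuracy": [0.80, 0.83, 0.85]}

for name, hist in (("First Model", first_model), ("Second Model", second_model)):
    epoch, value = best_epoch(hist["val_accuracy"])
    print(f"{name}: best val_accuracy {value:.2f} at epoch {epoch}")
```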

Conclusion

The First Model seemed to have slightly better accuracy than the Second Model, although their validation accuracies were rather similar.

Could using feature columns differently improve the accuracy of the Second Model? Maybe.

Thanks for reading and hope the content was useful to you!

This notebook can also be found at royleekiat/Employee_attrition_predictor.
