Advice:

  1. Start by cleaning the dataset and building the model in simple steps; do not reach for overly complex techniques or advanced features at the start. If the simple model performs poorly, add improvements one at a time.
  2. If you create a transformed variable (for example, X_train_resampled from X_train), actually use it in the following steps; do not leave it unused.
  3. Apply encoding and scaling after the train/test split, so that the encoder and scaler are fitted on the training data only.
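
As an illustration of the third point, the sketch below splits first and only then fits the encoder and scaler, by placing them inside a scikit-learn pipeline. It is a minimal example under stated assumptions, not the required solution: the data is synthetic, and the column names (age, bmi, gender, smoking_history, diabetes) and the choice of logistic regression are guesses made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real training data (column names are assumptions).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "bmi": rng.normal(27, 5, n),
    "gender": rng.choice(["Female", "Male"], n),
    "smoking_history": rng.choice(["never", "former", "current"], n),
})
df["diabetes"] = (df["bmi"] + rng.normal(0, 3, n) > 30).astype(int)

X = df.drop(columns="diabetes")
y = df["diabetes"]

# 1) Split first, so the validation rows stay unseen during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2) Encoding and scaling live inside the pipeline, so fit() learns
#    their parameters from X_train only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "smoking_history"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# 3) The already fitted transformers are reused (not refitted) on X_val.
val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline also means the same fitted steps are applied automatically when predicting on the second dataset.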

Project requirements:

Build a classification model that predicts whether a patient has diabetes. There are no restrictions on tools, new (engineered) fields, or data encoding methods.

About the dataset:

The dataset is a collection of medical and demographic data of patients, as well as their diagnosis of diabetes (positive or negative).

The data includes characteristics such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information.

Submission rules:

You will be provided with a second dataset that does not contain the target variable (target: diabetes). You need to score this dataset with your model and submit the result to Google Classroom in .csv format, with 2 columns: ID and prediction.

The prediction field must be a class prediction (predict), i.e. 1 or 0, not a probability (predict_proba).
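
The submission rules above could be met with something like the following sketch. It is only an illustration: the toy data, the feature column names (HbA1c_level, blood_glucose_level), the model, and the assumption that the ID is stored in the DataFrame index are all made up for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy numeric data standing in for the real, already prepared features
# (column names are assumptions).
X_train = pd.DataFrame({
    "HbA1c_level": [5.0, 5.5, 6.8, 7.5],
    "blood_glucose_level": [90, 100, 180, 220],
})
y_train = pd.Series([0, 0, 1, 1])

# Assumption: the ID of each test row is kept in the index.
X_test = pd.DataFrame(
    {"HbA1c_level": [5.2, 7.1], "blood_glucose_level": [95, 200]},
    index=pd.Index([101, 102], name="ID"),
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict() returns class labels (0 or 1);
# predict_proba() would return probabilities, which the rules do not allow.
submission = pd.DataFrame({
    "ID": X_test.index,
    "prediction": model.predict(X_test).astype(int),
})

# With no path, to_csv() returns the CSV as text; pass a filename
# (for example "submission.csv") to write the file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```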


Importing the datasets:

import pandas as pd

df_train = pd.read_csv('training_data.csv', index_col=0)
df_test = pd.read_csv('test_data.csv', index_col=0)