About data partitioning

   Jul 22, 2024     1 min read

This article is about data partitioning.

Hello!

Today we will learn about data partitioning.

Data partitioning is an important step in machine learning and statistical modeling. It refers to splitting a dataset into separate subsets for training, validation, and testing.

We will explain data partitioning below.

Purpose of data partitioning

Model training

Provides the training data that the model learns from.

Model validation

Provides validation data for evaluating model performance during development.

Model testing

Provides test data to evaluate the generalization ability of the model.

Types of data partitions

Training Data

This is the data used to train the model, that is, to adjust the model’s parameters.

Validation Data

This data is used to evaluate model performance and to tune hyperparameters. It is used to check performance after the model has been trained on the training data.

Test Data

It is used to evaluate the generalization ability of the model, that is, how well the model performs on data it has never seen before.

How to split data

Holdout method

This is the most basic method: the data is divided once into training, validation, and test sets at a fixed ratio.
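As a minimal sketch of the holdout method, the example below shuffles the samples and cuts them into three subsets. The function name `holdout_split`, the 70/15/15 ratio, and the fixed seed are assumptions chosen for illustration, not values taken from this article.

```python
import random

def holdout_split(samples, train_ratio=0.7, val_ratio=0.15, seed=0):
    """Shuffle samples and cut them into train/validation/test lists."""
    if train_ratio <= 0 or val_ratio < 0 or train_ratio + val_ratio >= 1:
        raise ValueError("ratios must leave a share for the test set")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train, val, test = holdout_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the original data is ordered, for example by date or by class, because an unshuffled cut would give each subset a different kind of data.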

Cross-Validation

This method divides the data into multiple folds and evaluates the model several times, each time using a different fold for validation and the remaining folds for training. It helps detect overfitting and shows how stable the model’s performance is across different splits.
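The fold rotation can be sketched as follows. This is an illustrative k-fold index generator, with `k_fold_indices` as a hypothetical helper name. It assumes the samples are already shuffled, so in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each sample lands in the validation fold exactly once, so averaging the per-fold scores uses all of the data for evaluation without ever scoring a model on samples it was trained on.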

Cautions

Data consistency

When partitioning data, make sure that the same sample does not appear in more than one subset. Otherwise information leaks from training into evaluation, and the measured performance becomes overly optimistic.
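One simple way to check for duplicated samples is to compare sample identifiers across the subsets. The sketch below assumes each sample has a unique, hashable ID. The helper name `find_overlap` is hypothetical.

```python
def find_overlap(train_ids, val_ids, test_ids):
    """Return the IDs that appear in more than one subset."""
    train_set, val_set, test_set = set(train_ids), set(val_ids), set(test_ids)
    return (train_set & val_set) | (train_set & test_set) | (val_set & test_set)

leaked = find_overlap([1, 2, 3], [4, 5], [5, 6])
print(leaked)  # {5}
```

An empty result means the subsets are disjoint. Any returned ID should be removed from all but one subset before training.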

Data distribution

The training, validation, and test sets must each be split so that they represent the characteristics of the entire dataset well.
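For classification data, one common way to keep the subsets representative is a stratified split, which samples each class separately so that class proportions stay similar. The following is a minimal sketch under that assumption. The name `stratified_split` and the 80/20 toy labels are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_indices, test_indices) with similar class shares in both."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Take the same fraction from every class.
        n_test = round(len(members) * test_ratio)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] == "b" for i in test_idx), len(test_idx))
```

A purely random split of an imbalanced dataset can leave a rare class nearly absent from the test set. Sampling per class avoids that.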

Uses

Model evaluation

The partitioned data is used to evaluate and compare the performance of trained models.

Hyperparameter tuning

The validation data is used to tune hyperparameters and optimize model performance.
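To illustrate how the three subsets work together, the sketch below picks a hyperparameter on the validation set and only then scores the chosen setting on the test set. The toy 1-D k-nearest-neighbors classifier, the data values, and the candidate values of `k` are all assumptions made up for this example.

```python
def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of points in data that are classified correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

# Toy (feature, label) pairs; the values are invented for illustration.
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.45, 1),
         (0.55, 0), (0.6, 1), (0.7, 1), (0.8, 1)]
val = [(0.15, 0), (0.25, 0), (0.48, 1), (0.65, 1)]
test = [(0.05, 0), (0.35, 0), (0.75, 1)]

# Choose k using validation accuracy only; the test set stays untouched.
best_k = max([1, 3, 5], key=lambda k: accuracy(train, val, k))
# Report the final score on the test set exactly once.
print(best_k, accuracy(train, test, best_k))
```

Keeping the test set out of the selection step is what makes the final score an honest estimate of generalization.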

Conclusion

Data partitioning is the important step of dividing data appropriately for model training, validation, and testing. It plays a key role in evaluating the model’s generalization ability and increasing confidence in the model.

Thank you!