Building your own ML problem: Datasets

Let's talk about the datasets that you need to build an ML problem on HackerEarth.

What kinds of datasets do I need while building a Machine Learning problem on HackerEarth?

Your dataset must be divided into two parts:

  • Training data set
  • Test data set

What is a training data set?

A training data set is the data that candidates will use to train their models.

What is a test data set?

A test data set is the unseen data for which candidates will predict an outcome. The test data must not include the outcome (the target variable).

How do I divide my dataset?

For example, the following data set of 10 rows can be divided as follows:

  • Training data set (50% of the rows)
  • Test data set (remaining 50% of the rows)

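Entire data set

| Outlook  | Temperature | Humidity | Wind  | Play |
|----------|-------------|----------|-------|------|
| Sunny    | Hot         | High     | False | No   |
| Rainy    | Mild        | High     | False | Yes  |
| Sunny    | Cool        | Normal   | False | Yes  |
| Overcast | Hot         | High     | False | Yes  |
| Rainy    | Mild        | High     | False | Yes  |
| Overcast | Hot         | Normal   | False | Yes  |
| Sunny    | Mild        | Normal   | True  | Yes  |
| Sunny    | Mild        | High     | False | No   |
| Overcast | Cool        | Normal   | True  | Yes  |
| Rainy    | Mild        | High     | True  | Yes  |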

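Training data set

| ID | Outlook  | Temperature | Humidity | Wind  | Play |
|----|----------|-------------|----------|-------|------|
| 1  | Sunny    | Hot         | High     | False | No   |
| 2  | Rainy    | Mild        | High     | False | Yes  |
| 3  | Sunny    | Cool        | Normal   | False | Yes  |
| 4  | Overcast | Hot         | High     | False | Yes  |
| 5  | Rainy    | Mild        | High     | False | Yes  |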

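Test data set (test.csv)

| ID | Outlook  | Temperature | Humidity | Wind  |
|----|----------|-------------|----------|-------|
| 6  | Overcast | Hot         | Normal   | False |
| 7  | Sunny    | Mild        | Normal   | True  |
| 8  | Sunny    | Mild        | High     | False |
| 9  | Overcast | Cool        | Normal   | True  |
| 10 | Rainy    | Mild        | High     | True  |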

Note: The test data set does not include the target variable (Play).
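The 50/50 split described above can be sketched in Python using only the standard library. This is a minimal illustration, not HackerEarth's own tooling: the function names `split_dataset` and `to_csv`, the `train_fraction` parameter, and the choice to number IDs by row position in the entire data set are all assumptions made for the example.

```python
import csv
import io

# The ten rows of the example data set, in the order listed above.
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
ROWS = [
    ["Sunny", "Hot", "High", "False", "No"],
    ["Rainy", "Mild", "High", "False", "Yes"],
    ["Sunny", "Cool", "Normal", "False", "Yes"],
    ["Overcast", "Hot", "High", "False", "Yes"],
    ["Rainy", "Mild", "High", "False", "Yes"],
    ["Overcast", "Hot", "Normal", "False", "Yes"],
    ["Sunny", "Mild", "Normal", "True", "Yes"],
    ["Sunny", "Mild", "High", "False", "No"],
    ["Overcast", "Cool", "Normal", "True", "Yes"],
    ["Rainy", "Mild", "High", "True", "Yes"],
]


def split_dataset(rows, train_fraction=0.5):
    """Split rows into (train, test) and prepend an ID to each row.

    The last column is treated as the target and is dropped from the
    test rows. IDs follow row position (an assumption of this sketch).
    """
    cut = int(len(rows) * train_fraction)
    with_ids = [[i] + row for i, row in enumerate(rows, start=1)]
    train = with_ids[:cut]
    test = [row[:-1] for row in with_ids[cut:]]  # remove the target column
    return train, test


def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


train_rows, test_rows = split_dataset(ROWS)
train_csv = to_csv(["ID"] + COLUMNS, train_rows)
test_csv = to_csv(["ID"] + COLUMNS[:-1], test_rows)  # header without Play
print(test_csv)
```

The important detail is that the target column is removed before the test file is written, so candidates never see the outcomes they are asked to predict.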

What does the candidate do after the models are trained?

After the models have been trained, candidates are expected to do the following:

  1. Predict the outcome for each row of the test data set
  2. Upload the prediction file to HackerEarth
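From the candidate's side, the two steps above might look like the following sketch. It is hedged throughout: the model is a deliberately trivial rule (predict the most frequent Play value seen in training for each Humidity value), the test rows are assumed to be the five rows of the example data set that are not in the training set, and the `ID,Play` layout of the prediction file is an assumption, since the required file format is not specified here.

```python
import csv
import io
from collections import Counter, defaultdict

# Training rows: (ID, Outlook, Temperature, Humidity, Wind, Play)
TRAIN = [
    (1, "Sunny", "Hot", "High", "False", "No"),
    (2, "Rainy", "Mild", "High", "False", "Yes"),
    (3, "Sunny", "Cool", "Normal", "False", "Yes"),
    (4, "Overcast", "Hot", "High", "False", "Yes"),
    (5, "Rainy", "Mild", "High", "False", "Yes"),
]

# Test rows: (ID, Outlook, Temperature, Humidity, Wind), no Play column.
# Assumed here to be the remaining five rows of the example data set.
TEST = [
    (6, "Overcast", "Hot", "Normal", "False"),
    (7, "Sunny", "Mild", "Normal", "True"),
    (8, "Sunny", "Mild", "High", "False"),
    (9, "Overcast", "Cool", "Normal", "True"),
    (10, "Rainy", "Mild", "High", "True"),
]

HUMIDITY = 3  # index of the Humidity column in both row layouts


def train(rows, feature_index=HUMIDITY):
    """Learn the most frequent target value for each value of one feature."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        target = row[-1]
        by_value[row[feature_index]][target] += 1
        overall[target] += 1
    rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    default = overall.most_common(1)[0][0]  # used for unseen feature values
    return rules, default


def predict(model, rows, feature_index=HUMIDITY):
    """Return (ID, predicted target) pairs for rows without a target."""
    rules, default = model
    return [(row[0], rules.get(row[feature_index], default)) for row in rows]


def prediction_csv(predictions):
    """Render predictions as CSV text with an assumed ID,Play header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ID", "Play"])
    writer.writerows(predictions)
    return buf.getvalue()


model = train(TRAIN)
predictions = predict(model, TEST)
output = prediction_csv(predictions)
print(output)
```

Writing `output` to a file would give the prediction file to upload. A real submission would use a stronger model, but the shape of the workflow stays the same: fit on the training data, predict for every test ID, and write one prediction per row.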