Predicting Singapore Pools 4d Lottery winning numbers with Machine Learning

Who wants to be a millionaire?

TL;DR

Introduction

My good buddy, Ah Seng, approached me with a strange request the other day.

It started from this forum thread on Hardwarezone.

Apparently, the subject, Ms Foyce Le Xuan, could predict winning 4d lottery combinations.

She posted proof of the winnings on her Instagram account week after week.

There was much debate in the forum on the strategies she used to maintain her consistent winning streak.

Although views on her methods differed, the strong consensus was that she had a piece of "magical software".

A programming tool that could generate winning lottery combinations using specialised algorithms.

The Bet

"Oei Terence, if we could get historical data of past winning 4d numbers, I am pretty sure we could build something similar leh!", said Ah Seng.

I was adamant that winning lottery numbers are random events.

If people could devise algorithms to predict future winning numbers using past patterns, it would be all over the news.

Nevertheless, I was keen to disprove his hypothesis.

Machine Learning or Learning Machine?

Using historical data to predict future events sounded like some CSI "zoom and enhance" AI Technology.

I had no prior knowledge about the obscure field of data science.

Thankfully, there are tons of machine learning resources available online.

The one offered by Google was pretty good - Machine Learning Crash Course.

The Tensorflow website provides extensive resources and documentation as well.

That being said, this exploration into the world of machine learning was definitely not a walk in the park for me.

Requirements

Python 3
Selenium
PhantomJS
Chrome Webdriver
Beautiful Soup
Tensorflow
Machine Learning Crash Course
Google Cloud Machine Learning Engine (optional)
Sanity

The Setup

Obtain past winning 4d lottery results from Singapore Pools website
Data prep and feature engineering
Training the model and hyper-parameter tuning
Prediction results

Obtain past winning 4d lottery results from Singapore Pools website

This was when I hit my first roadblock.

If you head over to Singapore Pools 4d Results page, you'll find that they only provide 4d results for the past three years.

With a bit of Google-Fu, I managed to locate a page within their website to check winning numbers for the past 20 years.

The steps taken to extract the data were as follows:

Download PhantomJS and Chrome Webdriver.
Set up a virtual environment.
Install Selenium and BeautifulSoup.
Create the scraper in Python.
Save the results to Excel for data prep.
A whole load of patience.

Day two of scraping

Data prep and feature engineering

My initial thoughts were that I had very little to work with.

The two features (Number, Date) were not sufficient to train the model accurately.

...
1001 . Sat, 12 Jan 2019 . Consolation Prize
1001 . Sun, 06 Sep 2015 . Starter Prize
1001 . Sat, 06 Sep 2014 . Starter Prize
...
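
Before any feature engineering, the raw rows have to be turned into something structured. Here is a minimal sketch of that step, assuming the scraped rows look exactly like the sample above ("number . date . prize"). The `Draw` type and the `parse_row` helper are names I made up for illustration; they are not from the original scraper.

```python
from datetime import date, datetime
from typing import NamedTuple


class Draw(NamedTuple):
    number: str      # kept as text so leading zeros survive ("0067")
    drawn_on: date
    prize: str


def parse_row(line: str) -> Draw:
    """Parse one 'number . date . prize' row (hypothetical helper)."""
    number, raw_date, prize = (part.strip() for part in line.split(" . "))
    # Dates look like "Sat, 12 Jan 2019"
    drawn_on = datetime.strptime(raw_date, "%a, %d %b %Y").date()
    return Draw(number.zfill(4), drawn_on, prize)


rows = [
    "1001 . Sat, 12 Jan 2019 . Consolation Prize",
    "1001 . Sun, 06 Sep 2015 . Starter Prize",
    "1001 . Sat, 06 Sep 2014 . Starter Prize",
]
draws = [parse_row(row) for row in rows]
print(draws[0])
```

Keeping the number as a string is a deliberate choice here: converting it to an integer would silently turn 0067 into 67.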

It turns out that we could actually derive meaningful metrics with a little feature engineering elbow grease.

...
0001 . 3 . 15 . 320 . 3 . 47 . 11 . 2000 . 668 . 21 . 1 . 6697 . 957 . 19 . 0 . 0 . 0 . 0 . 1 . 5
0010 . 6 . 14 . 257 . 3 . 37 . 9 . 2002 . 942 . 30 . 2 . 6029 . 861 . 17 . 0 . 0 . 0 . 0 . 1 . 5
9800 . 3 . 13 . 103 . 3 . 16 . 4 . 2005 . 130 . 4 . 0 . 5087 . 727 . 14 . 0 . 0 . 0 . 0 . 1 . 5
...
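
The derived columns are not named in this post. Reading the three sample rows, though, several of the leading columns are consistent with a calendar breakdown of a draw date: ISO weekday, day of month, day of year, a Sunday-based week number, month and year. The sketch below reproduces just that breakdown. The function name `calendar_features`, the dictionary keys and the draw date of 15 Nov 2000 for the first row are my inferences from the sample, not something the data confirms, and the remaining columns are left alone.

```python
from datetime import date


def calendar_features(d: date) -> dict:
    """Break a draw date into calendar columns (inferred, not confirmed)."""
    day_of_year = d.timetuple().tm_yday
    # Weekday of 1 January with Sunday=0 ... Saturday=6
    jan1_offset = date(d.year, 1, 1).isoweekday() % 7
    return {
        "weekday": d.isoweekday(),          # Monday=1 ... Sunday=7
        "day": d.day,
        "day_of_year": day_of_year,
        # Week 1 contains 1 January; a new week starts every Sunday
        "week_of_year": (day_of_year + jan1_offset - 1) // 7 + 1,
        "month": d.month,
        "year": d.year,
    }


# Assumed draw date for the sample row "0001 . 3 . 15 . 320 . 3 . 47 . 11 . 2000 ..."
print(calendar_features(date(2000, 11, 15)))
```

For that assumed date the sketch gives weekday 3, day 15, day of year 320, week 47, month 11 and year 2000, which lines up with the corresponding columns of the first sample row.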

Why not create binary vectors for numbers 0000 to 9999?

That would be an extremely bad idea. The computation resources required for such an intensive task are expensive. This form of data representation is extremely inefficient.

Instead, we should approach the problem using sparse representation.
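
To make the difference concrete, here is a rough sketch of the two representations for a single number. It assumes "sparse" means storing only the position of the non-zero entry; the helper names `dense_one_hot` and `sparse_one_hot` are hypothetical and this is not the encoding code used for the model.

```python
VOCAB_SIZE = 10_000  # every 4d number from 0000 to 9999


def dense_one_hot(number: str) -> list:
    """A full binary vector: 9,999 zeros and a single one."""
    vector = [0] * VOCAB_SIZE
    vector[int(number)] = 1
    return vector


def sparse_one_hot(number: str) -> dict:
    """Store only the index of the non-zero entry."""
    return {int(number): 1}


dense = dense_one_hot("0067")
sparse = sparse_one_hot("0067")
print(len(dense), "stored values in the dense vector")
print(len(sparse), "stored value in the sparse form:", sparse)
```

Both forms carry the same information, but the dense one stores 10,000 values per row while the sparse one stores one.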

The data was split into 70% training data, 25% test data and 5% evaluation data.
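
A 70/25/5 split can be sketched in a few lines. Whether the rows were shuffled first, and with what seed, is an assumption on my part; `split_dataset` and its parameters are illustrative names only.

```python
import random


def split_dataset(rows, train=0.70, test=0.25, seed=42):
    """Shuffle, then cut into training, test and evaluation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # assumed: random split, fixed seed
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains (about 5%)
    return training, testing, evaluation


training, testing, evaluation = split_dataset(range(1000))
print(len(training), len(testing), len(evaluation))
```

The evaluation set takes whatever is left over, so the three parts always add up to the full dataset even when the row count does not divide evenly.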

Training the model and hyper-parameter tuning

The model was trained 100 times using the Gradient Descent and Adam optimisers.

Each round took between 15 to 25 minutes on my MacBook Pro.

I could have leveraged Google's Cloud Machine Learning Engine to cut my training time by half, but I'm a cheapskate.

These were the ideal hyper-parameter settings after 100 tests:

ID	Optimizer	Learning Rate	Epochs	Batch Size	Dense Layers	L2 Regularization	Dropout
1	Adam	0.001	301	16	16	0.001	0
2	Adam	0.0001	301	16	16	0.001	0
3	Adam	0.0005	301	16	16	0.001	0
4	Gradient Descent	0.001	301	16	16	0.001	0
5	Gradient Descent	0.0001	301	16	16	0.001	0
6	Gradient Descent	0.0005	301	16	16	0.001	0
7	Adam	0.001	301	16	16	0	0.5
8	Adam	0.0001	301	16	16	0	0.5
9	Adam	0.0005	301	16	16	0	0.5
10	Gradient Descent	0.001	301	16	16	0	0.5
11	Gradient Descent	0.0001	301	16	16	0	0.5
12	Gradient Descent	0.0005	301	16	16	0	0.5
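
The twelve rows above are really a small grid: two regularisation choices, two optimisers and three learning rates. As a sketch of how such a grid could be generated rather than typed by hand (the variable names and dictionary keys are mine, and this is not the original tuning script):

```python
from itertools import product

OPTIMIZERS = ["Adam", "Gradient Descent"]
LEARNING_RATES = [0.001, 0.0001, 0.0005]
REGULARIZERS = [
    {"l2": 0.001, "dropout": 0.0},  # IDs 1 to 6
    {"l2": 0.0, "dropout": 0.5},    # IDs 7 to 12
]

grid = []
combos = product(REGULARIZERS, OPTIMIZERS, LEARNING_RATES)
for run_id, (reg, optimizer, learning_rate) in enumerate(combos, start=1):
    grid.append({
        "id": run_id,
        "optimizer": optimizer,
        "learning_rate": learning_rate,
        "epochs": 301,
        "batch_size": 16,
        "dense_layers": 16,
        **reg,
    })

for run in grid:
    print(run)
```

The order of the arguments to `product` is chosen so that the generated IDs line up with the table.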

The results for each of these hyper-parameter settings:

ID	Results	Training Metrics
1		Epoch 000: Loss: 729.569, Accuracy: 33.531% Epoch 050: Loss: 0.788, Accuracy: 84.624% Epoch 100: Loss: 0.328, Accuracy: 92.486% Epoch 150: Loss: 0.187, Accuracy: 95.526% Epoch 200: Loss: 0.148, Accuracy: 97.336% Epoch 250: Loss: 0.073, Accuracy: 99.030% Epoch 300: Loss: 0.055, Accuracy: 99.320% Test set accuracy: 94.861%
2	Overfitting	Epoch 000: Loss: 81.530, Accuracy: 39.018% Epoch 050: Loss: 0.195, Accuracy: 96.482% Epoch 100: Loss: 0.007, Accuracy: 100.000% Epoch 150: Loss: 0.001, Accuracy: 100.000% Epoch 200: Loss: 0.001, Accuracy: 100.000% Epoch 250: Loss: 0.000, Accuracy: 100.000% Epoch 300: Loss: 0.000, Accuracy: 100.000% Test set accuracy: 99.971%
3	Overfitting	Epoch 000: Loss: 45.983, Accuracy: 40.437% Epoch 050: Loss: 1.137, Accuracy: 89.518% Epoch 100: Loss: 0.000, Accuracy: 100.000% Epoch 150: Loss: 0.874, Accuracy: 95.961% Epoch 200: Loss: 0.000, Accuracy: 100.000% Epoch 250: Loss: 0.000, Accuracy: 100.000% Epoch 300: Loss: 0.000, Accuracy: 100.000% Test set accuracy: 99.971%
4		Epoch 000: Loss: 6.345, Accuracy: 48.559% Epoch 050: Loss: 1.155, Accuracy: 50.253% Epoch 100: Loss: 1.147, Accuracy: 48.979% Epoch 150: Loss: 1.145, Accuracy: 48.530% Epoch 200: Loss: 1.146, Accuracy: 49.240% Epoch 250: Loss: 1.143, Accuracy: 47.995% Epoch 300: Loss: 1.141, Accuracy: 47.749% Test set accuracy: 44.106%
5		Epoch 000: Loss: 3.950, Accuracy: 44.375% Epoch 050: Loss: 1.153, Accuracy: 43.840% Epoch 100: Loss: 1.146, Accuracy: 43.709% Epoch 150: Loss: 1.142, Accuracy: 43.753% Epoch 200: Loss: 1.141, Accuracy: 43.724% Epoch 250: Loss: 1.140, Accuracy: 43.738% Epoch 300: Loss: 1.139, Accuracy: 43.767% Test set accuracy: 42.973%
6		Epoch 000: Loss: 19.132, Accuracy: 43.984% Epoch 050: Loss: 1.162, Accuracy: 43.637% Epoch 100: Loss: 1.152, Accuracy: 43.637% Epoch 150: Loss: 1.147, Accuracy: 43.637% Epoch 200: Loss: 1.145, Accuracy: 43.637% Epoch 250: Loss: 1.143, Accuracy: 43.637% Epoch 300: Loss: 1.143, Accuracy: 43.637% Test set accuracy: 43.002%
7		Epoch 000: Loss: 143.438, Accuracy: 36.137% Epoch 050: Loss: 1.083, Accuracy: 58.766% Epoch 100: Loss: 0.655, Accuracy: 86.159% Epoch 150: Loss: 0.363, Accuracy: 87.303% Epoch 200: Loss: 0.196, Accuracy: 95.150% Epoch 250: Loss: 0.072, Accuracy: 99.826% Epoch 300: Loss: 0.028, Accuracy: 99.957% Test set accuracy: 99.884%
8	Overfitting	Epoch 000: Loss: 301.922, Accuracy: 29.390% Epoch 050: Loss: 0.337, Accuracy: 94.093% Epoch 100: Loss: 0.048, Accuracy: 98.972% Epoch 150: Loss: 0.376, Accuracy: 95.975% Epoch 200: Loss: 0.003, Accuracy: 100.000% Epoch 250: Loss: 0.365, Accuracy: 97.206% Epoch 300: Loss: 0.000, Accuracy: 100.000% Test set accuracy: 99.913%
9		Epoch 000: Loss: 60.548, Accuracy: 35.370% Epoch 050: Loss: 1.137, Accuracy: 43.347% Epoch 100: Loss: 1.137, Accuracy: 43.145% Epoch 150: Loss: 1.137, Accuracy: 43.101% Epoch 200: Loss: 1.137, Accuracy: 43.579% Epoch 250: Loss: 1.137, Accuracy: 43.478% Epoch 300: Loss: 1.137, Accuracy: 43.376% Test set accuracy: 43.002%
10		Epoch 000: Loss: 8.623, Accuracy: 44.781% Epoch 050: Loss: 1.162, Accuracy: 51.209% Epoch 100: Loss: 1.156, Accuracy: 50.427% Epoch 150: Loss: 1.155, Accuracy: 50.934% Epoch 200: Loss: 1.152, Accuracy: 50.442% Epoch 250: Loss: 1.153, Accuracy: 50.572% Epoch 300: Loss: 1.153, Accuracy: 50.109% Test set accuracy: 42.973%
11		Epoch 000: Loss: 10.269, Accuracy: 44.650% Epoch 050: Loss: 1.154, Accuracy: 43.840% Epoch 100: Loss: 1.146, Accuracy: 43.897% Epoch 150: Loss: 1.143, Accuracy: 43.753% Epoch 200: Loss: 1.141, Accuracy: 43.622% Epoch 250: Loss: 1.140, Accuracy: 43.724% Epoch 300: Loss: 1.140, Accuracy: 43.738% Test set accuracy: 43.118%
12		Epoch 000: Loss: 4.455, Accuracy: 44.737% Epoch 050: Loss: 1.151, Accuracy: 43.709% Epoch 100: Loss: 1.144, Accuracy: 43.666% Epoch 150: Loss: 1.141, Accuracy: 43.709% Epoch 200: Loss: 1.140, Accuracy: 43.724% Epoch 250: Loss: 1.139, Accuracy: 44.042% Epoch 300: Loss: 1.138, Accuracy: 43.709% Test set accuracy: 42.973%
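
Reading these logs by eye gets old quickly. As a small sketch, the lines can be parsed with a regular expression to compare the final training accuracy against the test set accuracy. The `parse_log` helper is hypothetical, and the sample string is the first row of the table above.

```python
import re

EPOCH_RE = re.compile(r"Epoch (\d+): Loss: ([\d.]+), Accuracy: ([\d.]+)%")
TEST_RE = re.compile(r"Test set accuracy: ([\d.]+)%")


def parse_log(log: str):
    """Return ([(epoch, loss, accuracy), ...], test_accuracy or None)."""
    epochs = [
        (int(epoch), float(loss), float(accuracy))
        for epoch, loss, accuracy in EPOCH_RE.findall(log)
    ]
    match = TEST_RE.search(log)
    test_accuracy = float(match.group(1)) if match else None
    return epochs, test_accuracy


log = (
    "Epoch 000: Loss: 729.569, Accuracy: 33.531% "
    "Epoch 050: Loss: 0.788, Accuracy: 84.624% "
    "Epoch 100: Loss: 0.328, Accuracy: 92.486% "
    "Epoch 150: Loss: 0.187, Accuracy: 95.526% "
    "Epoch 200: Loss: 0.148, Accuracy: 97.336% "
    "Epoch 250: Loss: 0.073, Accuracy: 99.030% "
    "Epoch 300: Loss: 0.055, Accuracy: 99.320% "
    "Test set accuracy: 94.861%"
)
epochs, test_accuracy = parse_log(log)
final_train_accuracy = epochs[-1][2]
gap = final_train_accuracy - test_accuracy
print(f"train {final_train_accuracy:.3f}% vs test {test_accuracy:.3f}% (gap {gap:.3f})")
```

A large gap between the two numbers is one common hint that a model has memorised its training data.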

A few take-aways based on the results.

In general, the Adam optimisation algorithm performed much better than Gradient Descent.
Accuracy was affected more by the learning rate than by any of the other parameters.
Adding L2 regularization and Dropout yielded minimal improvements.
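
One sanity check that helps when reading these accuracy figures is the majority-class baseline: how often would a model be right if it always guessed the most common prize category? The sketch below assumes the standard 4d prize structure of one first, one second and one third prize plus ten starter and ten consolation prizes per draw. That structure comes from the game rules, not from this post, and the class names are taken from the prediction output.

```python
from fractions import Fraction

# Assumed number of winning numbers per draw in each prize category
PRIZES_PER_DRAW = {
    "first_prize": 1,
    "second_prize": 1,
    "third_prize": 1,
    "starter_prize": 10,
    "consolation_prize": 10,
}

total = sum(PRIZES_PER_DRAW.values())
class_share = {name: Fraction(count, total) for name, count in PRIZES_PER_DRAW.items()}
majority_baseline = max(class_share.values())

print(f"Winning numbers per draw: {total}")
print(f"Always guessing one majority class: {float(majority_baseline):.3%}")
```

Under that assumption the baseline works out to about 43.5%, which is close to the accuracy of the runs that plateaued around 43%. That suggests, but does not prove, that those runs had settled on guessing a single category.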

Prediction results

The moment we've all been waiting for!

Based on the trained model, we predict the probability for a set of numbers from the evaluation data.

0067 . Wed, 14 Nov 2018 . Third Prize
0831 . Sun, 06 Oct 2013 . Second Prize
0965 . Wed, 31 Jul 2013 . Starter Prize

Version	Predictions
1	Example 0 prediction: first_prize (29.9%)
	Example 1 prediction: second_prize ( 0.2%)
	Example 2 prediction: starter_prize ( 0.5%)
2	Example 0 prediction: third_prize ( 0.0%)
	Example 1 prediction: second_prize ( 0.1%)
	Example 2 prediction: starter_prize ( 0.0%)
3	Example 0 prediction: third_prize ( 0.0%)
	Example 1 prediction: second_prize ( 0.0%)
	Example 2 prediction: starter_prize ( 0.0%)
4	Example 0 prediction: consolation_prize (41.5%)
	Example 1 prediction: consolation_prize (41.0%)
	Example 2 prediction: consolation_prize (41.4%)
5	Example 0 prediction: starter_prize ( 4.6%)
	Example 1 prediction: starter_prize ( 4.6%)
	Example 2 prediction: starter_prize ( 4.6%)
6	Example 0 prediction: starter_prize ( 4.5%)
	Example 1 prediction: starter_prize ( 4.5%)
	Example 2 prediction: starter_prize ( 4.5%)
7	Example 0 prediction: third_prize ( 4.3%)
	Example 1 prediction: second_prize (11.7%)
	Example 2 prediction: starter_prize ( 0.1%)
8	Example 0 prediction: third_prize ( 0.0%)
	Example 1 prediction: second_prize ( 0.3%)
	Example 2 prediction: starter_prize ( 0.0%)
9	Example 0 prediction: starter_prize ( 4.6%)
	Example 1 prediction: starter_prize ( 4.6%)
	Example 2 prediction: starter_prize ( 4.6%)
10	Example 0 prediction: starter_prize ( 6.0%)
	Example 1 prediction: starter_prize ( 6.1%)
	Example 2 prediction: starter_prize ( 6.0%)
11	Example 0 prediction: starter_prize ( 4.6%)
	Example 1 prediction: starter_prize ( 4.6%)
	Example 2 prediction: starter_prize ( 4.6%)
12	Example 0 prediction: starter_prize ( 4.6%)
	Example 1 prediction: starter_prize ( 4.6%)
	Example 2 prediction: starter_prize ( 4.6%)
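
For readers wondering where a line like "Example 0 prediction: third_prize" comes from, here is a rough sketch of the usual recipe: turn the model's raw outputs (logits) into probabilities with softmax, then report the most likely class. The logits below are made up, `softmax` and `describe` are illustrative helpers, and this is not necessarily how the figures above were produced. With five classes, the top softmax probability can never drop below 20%.

```python
import math

CLASS_NAMES = [
    "first_prize",
    "second_prize",
    "third_prize",
    "starter_prize",
    "consolation_prize",
]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(example_id, logits):
    """Format the most likely class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return f"Example {example_id} prediction: {CLASS_NAMES[best]} ({100 * probs[best]:4.1f}%)"


# Made-up logits, purely for illustration
print(describe(0, [0.1, 0.3, 2.0, 1.2, 1.1]))
print(describe(1, [0.0, 0.0, 0.0, 0.0, 0.0]))
```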

As you can see, the results were unsurprising.

None of the predictions was satisfactory.
The probability was low even though the corresponding test set accuracy was high.
Models that overfit produced 0% or close to 0% probability as expected.

Conclusion

If you do a quick Google search for the terms "4d prediction singapore", you’ll find no shortage of 4d prediction websites.

These predictions are often served with a healthy dose of snake oil.

The draw process, engineered by Singapore Pools, includes numerous variables to deter fraud. Trying to predict winning combinations would be a fool's errand.

To make an almost accurate prediction, you have to factor in draw machine configurations and draw ball weights, as well as unseen elements like the force of the jet air, atmospheric noise and gravitational pull.

Punters believe that there are patterns to lottery numbers which can help increase the probability of winning.

As with all things in life, you will start to see patterns by over-analysing any situation.

I have no answers as to how Ms Foyce predicted her winning numbers but it is most likely not through a magical black box algorithm machine.

As many have said, she could have bought the tickets in bulk and posted the winning combinations after the results were announced.

There are many ways to achieve internet notoriety, you know?

Fun fact: Your money is better spent elsewhere.
Wealth gained hastily will dwindle, but whoever gathers little by little will increase it. - (Proverbs 13:11-13)

Disclaimer: I am not a data scientist, neither am I trained in the field of machine learning. Several assumptions were made while training the models. If you are a data scientist, please feel free to chime in on the comments section.
