Preprocessing
Data Loader
With the Preprocessing module, you can load data. Some well-known data sets can also be loaded by name. First, let's load the MNIST data.
using LearningHorse.Preprocessing
train_data, test_data = dataloader("MNIST")
The first time you load a data set, you will be asked whether you want to download it, like this:
Can I download the mnist_train.csv(110MB)? (y/n)
The downloaded data is stored in /home_directory/learningdatasets/ by default. For more details, see dataloader.
Let's also load some local data.
#Please create train.csv directly under /home_directory/learningdatasets/ as follows:
#0.202027,0.246752,0.608301,0.406351,0.0260402,0.747845
#0.735292,0.332892,0.438458,0.0787028,0.797796,0.294831
#0.710725,0.213594,0.527118,0.579191,0.298599,0.23684
#0.288168,0.787194,0.809412,0.464031,0.960465,0.655897
df = dataloader("train.csv", header = false)
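To make clear what a header-less file contains, the sketch below reads the same four rows with Julia's standard DelimitedFiles library instead of dataloader. It only illustrates the file layout; the temporary path and the variable names `csv_text`, `path`, and `raw` are assumptions made for this example, and it does not show how dataloader works internally.

```julia
using DelimitedFiles  # Julia standard library

# The four sample rows shown above, written to a temporary file
# (a temporary directory is used here instead of learningdatasets/).
csv_text = """
0.202027,0.246752,0.608301,0.406351,0.0260402,0.747845
0.735292,0.332892,0.438458,0.0787028,0.797796,0.294831
0.710725,0.213594,0.527118,0.579191,0.298599,0.23684
0.288168,0.787194,0.809412,0.464031,0.960465,0.655897
"""
path = joinpath(mktempdir(), "train.csv")
write(path, csv_text)

# There is no header row, so every line is parsed as data.
raw = readdlm(path, ',', Float64)
println(size(raw))  # 4 rows, 6 columns
```

Because the file has no header, the first line must be treated as a data row, which is what `header = false` expresses in the dataloader call above.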
Data Preprocessing
Normalization
Using the data as is rarely yields a highly accurate model, so let's normalize it. The following three scalers are available in LearningHorse:
- Standard Scaler : $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-\mu}{\sigma}$
- MinMax Scaler : $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-\min(\boldsymbol{x})}{\max(\boldsymbol{x})-\min(\boldsymbol{x})}$
- Robust Scaler : $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-Q_2}{Q_3 - Q_1}$
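The three formulas can be sketched in plain Julia with the standard Statistics library. This is a minimal illustration of the math for a single feature vector, not LearningHorse's implementation; the function names `standard_scale`, `minmax_scale`, and `robust_scale` are hypothetical. It assumes that $\mu$ and $\sigma$ are the mean and standard deviation, and that $Q_1$, $Q_2$, $Q_3$ are the quartiles. Julia's `std` uses the corrected (n-1) estimator by default, which may differ from what the library does.

```julia
using Statistics  # Julia standard library

# Hypothetical helpers that mirror the formulas above for one feature vector.
standard_scale(x) = (x .- mean(x)) ./ std(x)   # std is the corrected estimator
minmax_scale(x) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))

function robust_scale(x)
    # Q1, Q2 (median), and Q3 from Julia's default quantile definition
    q1, q2, q3 = quantile(x, [0.25, 0.5, 0.75])
    return (x .- q2) ./ (q3 - q1)
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
println(standard_scale(x))
println(minmax_scale(x))
println(robust_scale(x))
```

The Robust Scaler depends only on the quartiles, so a single extreme value affects it much less than it affects the other two.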
For example, to normalize the iris data set with the Standard Scaler:
using LearningHorse.Preprocessing: fit!
data = Matrix(dataloader("iris"))
x, t = data[:, 1:4], data[:, 5]
scaler = Standard()
x = fit_transform!(scaler, x)
Once the scaler is fitted, it scales with the same values until it is fitted again.
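The fit-once behaviour can be illustrated with a small stand-in written in plain Julia. The type `ToyStandard` and the functions `toy_fit!` and `toy_transform` are hypothetical names invented for this sketch and are not part of LearningHorse. The sketch assumes scaling is done per column, with rows as samples.

```julia
using Statistics  # Julia standard library

# Hypothetical stand-in for a scaler; it only stores per-column statistics.
mutable struct ToyStandard
    μ::Vector{Float64}
    σ::Vector{Float64}
end
ToyStandard() = ToyStandard(Float64[], Float64[])

# Fitting stores the column means and standard deviations of x.
function toy_fit!(s::ToyStandard, x::AbstractMatrix)
    s.μ = vec(mean(x, dims = 1))
    s.σ = vec(std(x, dims = 1))
    return s
end

# Transforming reuses the stored statistics; it never refits.
toy_transform(s::ToyStandard, x::AbstractMatrix) = (x .- s.μ') ./ s.σ'

train = [1.0 10.0; 2.0 20.0; 3.0 30.0]
test = [2.0 20.0; 4.0 40.0]

s = toy_fit!(ToyStandard(), train)
println(toy_transform(s, test))  # scaled with the statistics of train
```

Reusing the values learned from the training data keeps training and test data on the same scale, which is why the scaler is not refitted for each new matrix.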
Data Splitting
Data splitting is done with DataSplitter.
Note that this function is broken in LearningHorse v0.3.2.
DS = DataSplitter(150, test_size = 0.3)
train_x, test_x = DS(x)
train_t, test_t = DS(t)
You can specify test_size and train_size, either as a number of samples or as a proportion of the data.
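Since DataSplitter is reported as broken in v0.3.2, the same idea can be sketched in plain Julia with the standard Random library. The helper `split_indices` is hypothetical and not part of LearningHorse. Its convention (a value below 1 is a fraction, anything else is a sample count) is an assumption, because the source does not say how DataSplitter distinguishes the two. The random `x` and `t` stand in for real features and labels.

```julia
using Random  # Julia standard library

# Hypothetical helper: returns (train indices, test indices) for n samples.
function split_indices(n::Integer; test_size = 0.3, rng = Random.default_rng())
    # Assumed convention: below 1 is a fraction, otherwise a sample count.
    n_test = test_size < 1 ? round(Int, n * test_size) : Int(test_size)
    perm = randperm(rng, n)  # one shuffle, shared by features and labels
    return perm[n_test+1:end], perm[1:n_test]
end

x = rand(150, 4)     # placeholder features, 150 samples
t = rand(1:3, 150)   # placeholder labels

train_idx, test_idx = split_indices(150, test_size = 0.3)
train_x, test_x = x[train_idx, :], x[test_idx, :]
train_t, test_t = t[train_idx], t[test_idx]
println((size(train_x), size(test_x)))
```

Applying the same index sets to both `x` and `t` keeps each sample paired with its label, which mirrors calling one DS object on both arrays in the example above.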