Preprocessing

Data Preprocessing

HorseML.Preprocessing.dataloader - Function
dataloader(name; header=true, dir="HorseMLdatasets")

Load a dataset for machine learning. name is either the name of a built-in dataset or the full path of the data file to be loaded. The following three dataset names can be passed as name:

  • MNIST : The MNIST dataset
  • iris : The iris dataset
  • BostonHousing : The Boston Housing dataset

These datasets are downloaded and saved in a dir folder created under the home directory (i.e. they are saved in /home_directory/HorseMLdatasets by default). When loading a data file, header specifies whether the header row is read.

Example

julia> dataloader("MNIST");

julia> dataloader("/home/ubuntu/data/data.csv", header = false)
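
The name-or-path rule and the default save location described above can be pictured with a small sketch. This is not HorseML's implementation: resolve_source and KNOWN_DATASETS are hypothetical names used only for illustration, and the sketch performs no download or file reading.

```julia
# Hypothetical helper, not part of HorseML. It only illustrates the
# name-or-path rule and the default save location described above.
const KNOWN_DATASETS = ("MNIST", "iris", "BostonHousing")

function resolve_source(name::AbstractString; dir::AbstractString = "HorseMLdatasets")
    if name in KNOWN_DATASETS
        # Built-in datasets are kept in a `dir` folder under the home directory.
        return (:dataset, joinpath(homedir(), dir))
    else
        # Anything else is treated as the path of a data file.
        return (:file, String(name))
    end
end

kind, location = resolve_source("iris")
file_kind, file_path = resolve_source("/home/ubuntu/data/data.csv")
```

Here kind is :dataset and location is the HorseMLdatasets folder under the home directory, while the second call is classified as a plain file path.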
HorseML.Preprocessing.databuilder - Function
databuilder(x, y; batches=1)

x is the feature data, y is the teacher data, and batches is the batch size. This function formats and returns the data used for the neural network. Note that x and y must be Arrays or DataFrames, so call this function after encoding, normalization, etc.

Example

julia> data = dataloader("iris");

julia> LE, OHE = LabelEncoder(), OneHotEncoder()
(LabelEncoder(Dict{Any, Any}()), OneHotEncoder())

julia> x, y = data[:, Not(:variety)], OHE(LE(data[:, :variety]));

julia> databuilder(x, y)
150-element Vector{Tuple{Matrix{Float64}, Matrix{Float64}}}:
 ([5.1; 3.5; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.9; 3.0; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.7; 3.2; 1.3; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.6; 3.1; 1.5; 0.2;;], [1.0; 0.0; 0.0;;])
 ([5.0; 3.6; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
 ([5.4; 3.9; 1.7; 0.4;;], [1.0; 0.0; 0.0;;])
 ([4.6; 3.4; 1.4; 0.3;;], [1.0; 0.0; 0.0;;])
 ([5.0; 3.4; 1.5; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.4; 2.9; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.9; 3.1; 1.5; 0.1;;], [1.0; 0.0; 0.0;;])
 ([5.4; 3.7; 1.5; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.8; 3.4; 1.6; 0.2;;], [1.0; 0.0; 0.0;;])
 ([4.8; 3.0; 1.4; 0.1;;], [1.0; 0.0; 0.0;;])
 ⋮
 ([6.0; 3.0; 4.8; 1.8;;], [0.0; 0.0; 1.0;;])
 ([6.9; 3.1; 5.4; 2.1;;], [0.0; 0.0; 1.0;;])
 ([6.7; 3.1; 5.6; 2.4;;], [0.0; 0.0; 1.0;;])
 ([6.9; 3.1; 5.1; 2.3;;], [0.0; 0.0; 1.0;;])
 ([5.8; 2.7; 5.1; 1.9;;], [0.0; 0.0; 1.0;;])
 ([6.8; 3.2; 5.9; 2.3;;], [0.0; 0.0; 1.0;;])
 ([6.7; 3.3; 5.7; 2.5;;], [0.0; 0.0; 1.0;;])
 ([6.7; 3.0; 5.2; 2.3;;], [0.0; 0.0; 1.0;;])
 ([6.3; 2.5; 5.0; 1.9;;], [0.0; 0.0; 1.0;;])
 ([6.5; 3.0; 5.2; 2.0;;], [0.0; 0.0; 1.0;;])
 ([6.2; 3.4; 5.4; 2.3;;], [0.0; 0.0; 1.0;;])
 ([5.9; 3.0; 5.1; 1.8;;], [0.0; 0.0; 1.0;;])
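
The output above is a vector of (features, targets) tuples in which each sample is stored as a column. A minimal sketch of that formatting step, assuming plain matrices with one sample per row, might look as follows. build_batches is a hypothetical name, and keeping a shorter final batch is an assumption rather than documented HorseML behavior.

```julia
# Hypothetical sketch, not HorseML's implementation.
# x: samples × features, y: samples × targets (both plain matrices).
function build_batches(x::AbstractMatrix, y::AbstractMatrix; batches::Integer = 1)
    size(x, 1) == size(y, 1) || throw(ArgumentError("x and y need the same number of rows"))
    n = size(x, 1)
    data = Tuple{Matrix{Float64}, Matrix{Float64}}[]
    for start in 1:batches:n
        # Assumption: a shorter final batch is kept rather than dropped.
        rows = start:min(start + batches - 1, n)
        # Transpose so that each column of a batch is one sample.
        push!(data, (Matrix{Float64}(x[rows, :]'), Matrix{Float64}(y[rows, :]')))
    end
    return data
end

x = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; 4.7 3.2 1.3 0.2]
y = [1.0 0.0 0.0; 1.0 0.0 0.0; 1.0 0.0 0.0]
batched = build_batches(x, y)
paired = build_batches(x, y; batches = 2)
```

With batches = 1 every element holds a 4×1 feature matrix and a 3×1 target matrix, which is the shape shown in the example output.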
HorseML.Preprocessing.DataSplitter - Type
DataSplitter(ndata; test_size=nothing, train_size=nothing)

Split the data into test data and training data. ndata is the number of samples, and you must specify either test_size or train_size. These parameters can be given either as a proportion or as a number of samples.

Note

If both test_size and train_size are specified, test_size takes precedence.

Example

julia> x = rand(20, 2);

julia> DS = DataSplitter(20, train_size = 0.3);

julia> train, test = DS(x, dims = 2);

julia> train |> size
(6, 2)

julia> test |> size
(14, 2)
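
The sizing rules above (a proportion or a count, with test_size taking precedence) can be sketched in plain Julia. This is an illustration under stated assumptions, not HorseML's code: split_rows and count_from are hypothetical names, rows are assumed to be samples, and shuffling the rows and rounding the proportion are assumptions.

```julia
using Random

# Hypothetical sketch, not HorseML's implementation.
# A size may be a proportion (Float) or an absolute count (Integer).
count_from(size_spec::AbstractFloat, ndata) = round(Int, size_spec * ndata)
count_from(size_spec::Integer, ndata) = Int(size_spec)

function split_rows(x::AbstractMatrix; test_size = nothing, train_size = nothing,
                    rng = Random.default_rng())
    ndata = size(x, 1)
    # test_size takes precedence when both are given.
    ntest = if test_size !== nothing
        count_from(test_size, ndata)
    elseif train_size !== nothing
        ndata - count_from(train_size, ndata)
    else
        throw(ArgumentError("specify either test_size or train_size"))
    end
    order = randperm(rng, ndata)   # shuffled row indices (an assumption)
    test_idx, train_idx = order[1:ntest], order[ntest+1:end]
    return x[train_idx, :], x[test_idx, :]
end

x = rand(20, 2)
train, test = split_rows(x; train_size = 0.3)
```

For 20 rows and train_size = 0.3 this yields 6 training rows and 14 test rows, the same sizes as in the example.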

Encoders

HorseML.Preprocessing.LabelEncoder - Type
LabelEncoder()

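LabelEncoder structure. Calling an instance as LE(label; count=false, decode=false) converts labels (such as strings) to class numbers (encode) or, with decode=true, converts class numbers back to labels (decode). With count=true, the number of samples in each class is returned as well.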

Example

julia> label = ["Apple", "Apple", "Pear", "Pear", "Lemon", "Apple", "Pear", "Lemon"]
8-element Vector{String}:
 "Apple"
 "Apple"
 "Pear"
 "Pear"
 "Lemon"
 "Apple"
 "Pear"
 "Lemon"

julia> LE = LabelEncoder()
LabelEncoder(Dict{Any, Any}())

julia> classes, count = LE(label, count=true) #Encode
([3.0 3.0 … 1.0 2.0], [3.0 2.0 3.0])

julia> LE(classes, decode=true) #Decode
1×8 Matrix{String}:
 "Apple" "Apple" "Pear" "Pear" "Lemon" "Apple" "Pear" "Lemon"
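
The encode and decode steps of a label encoder can be sketched with a dictionary. This is not HorseML's LabelEncoder: encode_labels and decode_labels are hypothetical names, and classes are numbered in order of first appearance, which is an assumption and differs from the numbering in the output above.

```julia
# Hypothetical sketch, not HorseML's LabelEncoder.
# Classes are numbered in order of first appearance (an assumption).
function encode_labels(labels::AbstractVector)
    mapping = Dict{eltype(labels), Int}()
    for l in labels
        # Assign the next free class number the first time a label is seen.
        get!(mapping, l, length(mapping) + 1)
    end
    classes = [mapping[l] for l in labels]
    counts = zeros(Int, length(mapping))
    for c in classes
        counts[c] += 1   # number of samples per class
    end
    return classes, counts, mapping
end

function decode_labels(classes::AbstractVector{<:Integer}, mapping::AbstractDict)
    inverse = Dict(v => k for (k, v) in mapping)   # class number => label
    return [inverse[c] for c in classes]
end

label = ["Apple", "Apple", "Pear", "Pear", "Lemon", "Apple", "Pear", "Lemon"]
classes, counts, mapping = encode_labels(label)
decoded = decode_labels(classes, mapping)
```

Decoding the encoded classes with the same mapping recovers the original labels.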
HorseML.Preprocessing.OneHotEncoder - Type
OneHotEncoder()

Convert data to one-hot format. If decode is specified, one-hot data is decoded back to class numbers instead.

Example

julia> x = [4.9 3.0; 4.6 3.1; 4.4 2.9; 4.8 3.4; 5.1 3.8; 5.4 3.4; 4.8 3.4; 5.2 4.1; 5.5 4.2; 5.5 3.5; 4.8 3.0; 5.1 3.8; 5.0 3.3; 6.4 3.2; 5.7 2.8; 6.1 2.9; 6.7 3.1; 5.6 2.5; 6.3 2.5; 5.6 3.0; 5.6 2.7; 7.6 3.0; 6.4 2.7; 6.4 3.2; 6.5 3.0; 7.7 3.8; 7.2 3.2; 7.2 3.0; 6.3 2.8; 6.1 2.6; 6.3 3.4; 6.0 3.0; 6.9 3.1; 6.7 3.1; 5.8 2.7; 6.8 3.2; 6.3 2.5]; # These data are also used in the explanations of other functions.

julia> t = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2];

julia> OHE = OneHotEncoder()
OneHotEncoder()

julia> ct = OHE(t)
37×3 Matrix{Float64}:
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 1.0 0.0 0.0
 0.0 1.0 0.0
 0.0 1.0 0.0
 0.0 1.0 0.0
 0.0 1.0 0.0
 ⋮
 0.0 1.0 0.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0
 0.0 0.0 1.0

julia> OHE(ct, decode = true)
37-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 2
 2
 2
 2
 ⋮
 2
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
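
One-hot encoding and its inverse can be sketched in a few lines. This is not HorseML's OneHotEncoder: onehot and onehot_decode are hypothetical names, and ordering the columns by sorted class value is an assumption. As in the example above, decoding returns 1-based class indices rather than the original 0-based values.

```julia
# Hypothetical sketch, not HorseML's OneHotEncoder.
function onehot(t::AbstractVector)
    classes = sort(unique(t))   # one column per distinct value (assumed order)
    out = zeros(Float64, length(t), length(classes))
    for (i, v) in enumerate(t)
        out[i, findfirst(==(v), classes)] = 1.0
    end
    return out
end

# Decoding returns the 1-based column index of the largest entry in each row.
onehot_decode(m::AbstractMatrix) = [argmax(view(m, i, :)) for i in 1:size(m, 1)]

t = [0, 0, 1, 2, 2]
ct = onehot(t)
idx = onehot_decode(ct)
```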

Scaler

HorseML.Preprocessing.Standard - Type
Standard()

Standard scaler. This scaler scales data as: $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-\mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.

Example

julia> x = [
    16.862463771320925 68.10823385851712
    15.382965696961577 65.4313485700859
    8.916228406218375 53.92034559524475
    10.560285659132695 59.17305391117168
    12.142253214135884 62.28708207525656
    5.362107221163482 43.604947901567414
    13.893239446341777 62.44348617377496
    11.871357065173395 60.28433066289655
    29.83792267802442 69.22281924803998
    21.327107214235483 70.15810991597944
    23.852372696012498 69.81780163668844
    26.269031430914108 67.61037566099782
    22.78907104644012 67.78105545358633
    26.73342178134947 68.59263965946904
    9.107259141706415 56.565383817343495
    29.38551885863976 68.1005579469209
    7.935966787763017 53.76264777936664
    29.01677894379809 68.69484161138638
    6.839609488194577 49.69794758177567
    13.95215840314148 62.058116579899085]; # These data are also used in the explanations of other functions.

julia> t = [169.80980778351542, 167.9081124078835, 152.30845618985222, 160.3110300206261, 161.96826472170756, 136.02842285615077, 163.98131131382686, 160.117817321485, 172.22758529098235, 172.21342437006865, 171.8939175591617, 169.83018083884602, 171.3878062674257, 170.52487535026015, 156.40282783981309, 170.6488327896672, 151.69267899906185, 172.32478221316322, 145.14365314788827, 163.79383292080666];

julia> scaler = Standard()
Standard(Float64[])

julia> fit!(scaler, x)
2×2 Matrix{Float64}:
 19.0591   64.467
  6.95818   6.68467

julia> transform(scaler, x)
20×2 Matrix{Float64}:
  0.0310374   0.44579
  0.0104337   0.218714
  1.01139     0.710027
  1.37285     0.954091
 -0.895893   -0.270589
  0.983361    0.47397
  1.39855     0.645624
 -1.31861    -0.901482
  0.0147702   0.241708
  0.29675     0.483076
 -0.338824   -0.104812
  0.432442    0.387922
  0.860418    0.567095
 -0.495306    0.140871
  0.963084    0.552767
 -1.38901    -1.05926
  0.469323    0.719196
  0.0669475   0.512023
 -1.47583    -1.44017
 -1.99789    -3.27656
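
The standardization formula can be checked on a small matrix with only the Statistics standard library. This is a sketch, not HorseML's Standard scaler: standardize is a hypothetical name, and scaling each column separately with the sample standard deviation (dividing by n - 1) is an assumption.

```julia
using Statistics

# Hypothetical sketch, not HorseML's Standard scaler.
function standardize(x::AbstractMatrix)
    μ = mean(x, dims = 1)    # per-column mean
    σ = std(x, dims = 1)     # per-column sample standard deviation (n - 1)
    return (x .- μ) ./ σ
end

x = [1.0 10.0; 2.0 20.0; 3.0 30.0]
z = standardize(x)
```

Both columns map to -1, 0, 1 here, because each has its mean in the middle row and a standard deviation equal to its step size.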
HorseML.Preprocessing.MinMax - Type
MinMax()

MinMax scaler. This scaler scales data as: $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-\min(\boldsymbol{x})}{\max(\boldsymbol{x})-\min(\boldsymbol{x})}$

Example

julia> scaler = MinMax()
MinMax(Float64[])

julia> fit!(scaler, x)
2×2 Matrix{Float64}:
  1.39855   0.954091
 -1.99789  -3.27656

julia> transform(scaler, x)
20×2 Matrix{Float64}:
 0.597368  0.879853
 0.591301  0.826179
 0.88601   0.942311
 0.992432  1.0
 0.324455  0.710522
 0.877757  0.886514
 1.0       0.927088
 0.199996  0.561398
 0.592578  0.831614
 0.675601  0.888666
 0.488471  0.749707
 0.715552  0.866175
 0.841559  0.908526
 0.442398  0.807779
 0.871787  0.905139
 0.179268  0.524104
 0.72641   0.944478
 0.607941  0.895508
 0.153707  0.43407
 0.0       0.0
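
The min-max formula maps the smallest value of each column to 0 and the largest to 1. A minimal sketch in plain Julia, assuming per-column scaling (minmax_scale is a hypothetical name, not HorseML's API):

```julia
# Hypothetical sketch, not HorseML's MinMax scaler.
function minmax_scale(x::AbstractMatrix)
    lo = minimum(x, dims = 1)   # per-column minimum
    hi = maximum(x, dims = 1)   # per-column maximum
    # Note: a constant column gives hi == lo and therefore a division by zero.
    return (x .- lo) ./ (hi .- lo)
end

x = [1.0 10.0; 2.0 30.0; 5.0 20.0]
s = minmax_scale(x)
```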
HorseML.Preprocessing.Robust - Type
Robust()

Robust scaler. This scaler scales data as: $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-Q_2}{Q_3 - Q_1}$, where $Q_1$, $Q_2$, and $Q_3$ are the first quartile, the median, and the third quartile.

Example

julia> scaler = Robust()
Robust(Float64[])

julia> fit!(scaler, x)
3×2 Matrix{Float64}:
 0.412913  0.739911
 0.602654  0.873014
 0.849116  0.905986

julia> transform(scaler, x)
20×2 Matrix{Float64}:
 -0.0121192   0.041181
 -0.0260262  -0.28201
  0.649595    0.417263
  0.89357     0.764633
 -0.637774   -0.978423
  0.630675    0.0812893
  0.910919    0.3256
 -0.923097   -1.87636
 -0.0230992  -0.249283
  0.16723     0.0942497
 -0.261766   -0.742477
  0.258819   -0.041181
  0.547691    0.213832
 -0.367388   -0.392802
  0.616988    0.193438
 -0.970618   -2.10092
  0.283712    0.430314
  0.0121192   0.135449
 -1.02922    -2.64305
 -1.38159    -5.25675
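
The robust scaling formula centers each value on the median and divides by the interquartile range. The sketch below uses quantile from the Statistics standard library. It is not HorseML's Robust scaler: robust_scale is a hypothetical name, and per-column scaling and Julia's default quantile interpolation are assumptions.

```julia
using Statistics

# Hypothetical sketch, not HorseML's Robust scaler.
function robust_scale(x::AbstractMatrix)
    # Per-column quantile, returned as a 1 × ncols matrix.
    q(p) = mapslices(col -> quantile(col, p), x, dims = 1)
    q1, q2, q3 = q(0.25), q(0.5), q(0.75)
    return (x .- q2) ./ (q3 .- q1)
end

x = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0; 5.0 50.0]
r = robust_scale(x)
```

Unlike min-max scaling, the result is not confined to [0, 1]. Values far from the median simply land far from zero, which is why the example output contains entries such as -5.25675.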