Preprocessing
Data Preprocessing
HorseML.Preprocessing.dataloader
— Function
dataloader(name; header=true, dir="HorseMLdatasets")
Load a dataset for machine learning. name is either the name of a built-in dataset or the full path of the data file to load. The following three dataset names can be passed as name:
MNIST: the MNIST dataset
iris: the iris dataset
BostonHousing: the Boston Housing dataset
These datasets are downloaded and saved in a folder named dir, created under the home directory (by default /home_directory/HorseMLdatasets). When loading a data file, header specifies whether to read the header row.
Example
julia> dataloader("MNIST");
julia> dataloader("/home/ubuntu/data/data.csv", header = false)
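The default save location described above can be reproduced with Julia's standard library. This is only a sketch of the path logic; dataset_dir is a hypothetical helper and not part of HorseML.

```julia
# Hypothetical helper (not part of HorseML): builds the folder path that the
# text describes, i.e. a `dir` folder directly under the home directory.
dataset_dir(dir::AbstractString = "HorseMLdatasets") = joinpath(homedir(), dir)

println(dataset_dir())           # the printed path depends on the machine
println(dataset_dir("mydata"))   # a custom folder name, mirroring the dir keyword
```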
HorseML.Preprocessing.databuilder
— Function
databuilder(x, y; batches=1)
Format the data for use with a neural network and return it. x is the feature data, y is the teacher (target) data, and batches is the batch size. x and y must be Arrays or DataFrames, so call this function after encoding, normalization, etc.
Example
julia> data = dataloader("iris");
julia> LE, OHE = LabelEncoder(), OneHotEncoder()
(LabelEncoder(Dict{Any, Any}()), OneHotEncoder())
julia> x, y = data[:, Not(:variety)], OHE(LE(data[:, :variety]));
julia> databuilder(x, y)
150-element Vector{Tuple{Matrix{Float64}, Matrix{Float64}}}:
([5.1; 3.5; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
([4.9; 3.0; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
([4.7; 3.2; 1.3; 0.2;;], [1.0; 0.0; 0.0;;])
([4.6; 3.1; 1.5; 0.2;;], [1.0; 0.0; 0.0;;])
([5.0; 3.6; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
([5.4; 3.9; 1.7; 0.4;;], [1.0; 0.0; 0.0;;])
([4.6; 3.4; 1.4; 0.3;;], [1.0; 0.0; 0.0;;])
([5.0; 3.4; 1.5; 0.2;;], [1.0; 0.0; 0.0;;])
([4.4; 2.9; 1.4; 0.2;;], [1.0; 0.0; 0.0;;])
([4.9; 3.1; 1.5; 0.1;;], [1.0; 0.0; 0.0;;])
([5.4; 3.7; 1.5; 0.2;;], [1.0; 0.0; 0.0;;])
([4.8; 3.4; 1.6; 0.2;;], [1.0; 0.0; 0.0;;])
([4.8; 3.0; 1.4; 0.1;;], [1.0; 0.0; 0.0;;])
⋮
([6.0; 3.0; 4.8; 1.8;;], [0.0; 0.0; 1.0;;])
([6.9; 3.1; 5.4; 2.1;;], [0.0; 0.0; 1.0;;])
([6.7; 3.1; 5.6; 2.4;;], [0.0; 0.0; 1.0;;])
([6.9; 3.1; 5.1; 2.3;;], [0.0; 0.0; 1.0;;])
([5.8; 2.7; 5.1; 1.9;;], [0.0; 0.0; 1.0;;])
([6.8; 3.2; 5.9; 2.3;;], [0.0; 0.0; 1.0;;])
([6.7; 3.3; 5.7; 2.5;;], [0.0; 0.0; 1.0;;])
([6.7; 3.0; 5.2; 2.3;;], [0.0; 0.0; 1.0;;])
([6.3; 2.5; 5.0; 1.9;;], [0.0; 0.0; 1.0;;])
([6.5; 3.0; 5.2; 2.0;;], [0.0; 0.0; 1.0;;])
([6.2; 3.4; 5.4; 2.3;;], [0.0; 0.0; 1.0;;])
([5.9; 3.0; 5.1; 1.8;;], [0.0; 0.0; 1.0;;])
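The output format above, one tuple of column matrices per sample, can be imitated in plain Julia. The following is a minimal sketch assuming one sample per row and batches = 1; pair_samples is a hypothetical name, and HorseML's databuilder may work differently internally.

```julia
# Hypothetical re-implementation for illustration only; assumes one sample
# per row and a batch size of 1.
function pair_samples(x::AbstractMatrix, y::AbstractMatrix)
    size(x, 1) == size(y, 1) ||
        throw(DimensionMismatch("x and y need the same number of rows"))
    # Each sample becomes a (features × 1, targets × 1) tuple of column matrices.
    return [(reshape(float.(x[i, :]), :, 1), reshape(float.(y[i, :]), :, 1))
            for i in axes(x, 1)]
end

x = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2]   # two iris-like feature rows
y = [1.0 0.0 0.0; 1.0 0.0 0.0]           # matching one-hot targets
data = pair_samples(x, y)
```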
HorseML.Preprocessing.DataSplitter
— Type
DataSplitter(ndata; test_size=nothing, train_size=nothing)
Split the data into training data and test data. ndata is the number of data points, and you must specify either test_size or train_size. Each of these parameters can be given as a proportion or as a number of data points. If both test_size and train_size are specified, test_size takes precedence.
Example
julia> x = rand(20, 2);
julia> DS = DataSplitter(20, train_size = 0.3);
julia> train, test = DS(x, dims = 2);
julia> train |> size
(6, 2)
julia> test |> size
(14, 2)
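The size rules above (proportion or count, with test_size taking precedence) can be sketched as follows. split_counts is a hypothetical helper, and treating values below 1 as proportions is an assumption, not HorseML's documented behavior.

```julia
# Hypothetical helper, not HorseML's implementation: resolves the number of
# training and test samples from the rules described in the text.
function split_counts(ndata::Integer; test_size = nothing, train_size = nothing)
    # Assumption: a value below 1 is a proportion, otherwise a sample count.
    count_of(s) = s < 1 ? round(Int, ndata * s) : Int(s)
    if test_size !== nothing            # test_size takes precedence
        ntest = count_of(test_size)
        return (train = ndata - ntest, test = ntest)
    elseif train_size !== nothing
        ntrain = count_of(train_size)
        return (train = ntrain, test = ndata - ntrain)
    end
    throw(ArgumentError("specify either test_size or train_size"))
end

split_counts(20, train_size = 0.3)                  # (train = 6, test = 14)
split_counts(20, test_size = 5, train_size = 0.3)   # (train = 15, test = 5)
```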
Encoders
HorseML.Preprocessing.LabelEncoder
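— Type
LabelEncoder()
LabelEncoder structure. An instance LE is called as LE(label; count=false, decode=false).
It converts labels (such as strings) to class numbers (encode), and converts class numbers back to labels (decode). With count=true, the per-class counts are returned as well, as the example shows.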
Example
julia> label = ["Apple", "Apple", "Pear", "Pear", "Lemon", "Apple", "Pear", "Lemon"]
8-element Vector{String}:
"Apple"
"Apple"
"Pear"
"Pear"
"Lemon"
"Apple"
"Pear"
"Lemon"
julia> LE = LabelEncoder()
LabelEncoder(Dict{Any, Any}())
julia> classes, count = LE(label, count=true) #Encode
([3.0 3.0 … 1.0 2.0], [3.0 2.0 3.0])
julia> LE(classes, decode=true) #Decode
1×8 Matrix{String}:
"Apple" "Apple" "Pear" "Pear" "Lemon" "Apple" "Pear" "Lemon"
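The encode and decode round trip can be illustrated with a plain Dict. This is a minimal sketch with hypothetical function names; it numbers classes in order of first appearance, which need not match the numbering HorseML produces.

```julia
# Minimal sketch of label encoding; not HorseML's implementation.
function encode_labels(labels::AbstractVector)
    mapping = Dict{eltype(labels), Int}()
    for l in labels
        # Assign the next class number the first time a label is seen.
        get!(mapping, l, length(mapping) + 1)
    end
    return [mapping[l] for l in labels], mapping
end

function decode_labels(classes::AbstractVector{<:Integer}, mapping::AbstractDict)
    inverse = Dict(v => k for (k, v) in mapping)   # class number => label
    return [inverse[c] for c in classes]
end

label = ["Apple", "Apple", "Pear", "Pear", "Lemon", "Apple", "Pear", "Lemon"]
classes, mapping = encode_labels(label)    # classes == [1, 1, 2, 2, 3, 1, 2, 3]
decode_labels(classes, mapping) == label   # true
```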
HorseML.Preprocessing.OneHotEncoder
— Type
OneHotEncoder()
Convert data to one-hot format. If decode=true is specified, the data is decoded instead.
Example
julia> x = [4.9 3.0; 4.6 3.1; 4.4 2.9; 4.8 3.4; 5.1 3.8; 5.4 3.4; 4.8 3.4; 5.2 4.1; 5.5 4.2; 5.5 3.5; 4.8 3.0; 5.1 3.8; 5.0 3.3; 6.4 3.2; 5.7 2.8; 6.1 2.9; 6.7 3.1; 5.6 2.5; 6.3 2.5; 5.6 3.0; 5.6 2.7; 7.6 3.0; 6.4 2.7; 6.4 3.2; 6.5 3.0; 7.7 3.8; 7.2 3.2; 7.2 3.0; 6.3 2.8; 6.1 2.6; 6.3 3.4; 6.0 3.0; 6.9 3.1; 6.7 3.1; 5.8 2.7; 6.8 3.2; 6.3 2.5]; # This data is also used in the examples of other functions.
julia> t = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2];
julia> OHE = OneHotEncoder()
OneHotEncoder()
julia> ct = OHE(t)
37×3 Matrix{Float64}:
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 1.0 0.0
0.0 1.0 0.0
0.0 1.0 0.0
0.0 1.0 0.0
⋮
0.0 1.0 0.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
0.0 0.0 1.0
julia> OHE(ct, decode = true)
37-element Vector{Int64}:
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
⋮
2
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
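One-hot encoding and decoding can be sketched in a few lines of plain Julia. The function names are hypothetical. Decoding by 1-based column index mirrors the 1, 2, 3 result in the example above, but whether HorseML computes it this way is an assumption.

```julia
# Sketch only: assumes integer class labels; the sorted distinct labels
# define the column order.
function onehot(t::AbstractVector{<:Integer})
    classes = sort(unique(t))
    out = zeros(Float64, length(t), length(classes))
    for (i, c) in enumerate(t)
        out[i, findfirst(==(c), classes)] = 1.0
    end
    return out
end

# Decoding returns the 1-based column index of the largest entry in each row.
onehot_decode(m::AbstractMatrix) = [argmax(row) for row in eachrow(m)]

t = [0, 0, 1, 2]
ct = onehot(t)
onehot_decode(ct)    # [1, 1, 2, 3]
```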
Scaler
HorseML.Preprocessing.Standard
— Type
Standard()
Standard scaler. This scaler scales data as: $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-\mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of each feature.
Example
julia> x = [
16.862463771320925 68.10823385851712
15.382965696961577 65.4313485700859
8.916228406218375 53.92034559524475
10.560285659132695 59.17305391117168
12.142253214135884 62.28708207525656
5.362107221163482 43.604947901567414
13.893239446341777 62.44348617377496
11.871357065173395 60.28433066289655
29.83792267802442 69.22281924803998
21.327107214235483 70.15810991597944
23.852372696012498 69.81780163668844
26.269031430914108 67.61037566099782
22.78907104644012 67.78105545358633
26.73342178134947 68.59263965946904
9.107259141706415 56.565383817343495
29.38551885863976 68.1005579469209
7.935966787763017 53.76264777936664
29.01677894379809 68.69484161138638
6.839609488194577 49.69794758177567
13.95215840314148 62.058116579899085]; # This data is also used in the examples of other functions.
julia> t = [169.80980778351542, 167.9081124078835, 152.30845618985222, 160.3110300206261, 161.96826472170756, 136.02842285615077, 163.98131131382686, 160.117817321485, 172.22758529098235, 172.21342437006865, 171.8939175591617, 169.83018083884602, 171.3878062674257, 170.52487535026015, 156.40282783981309, 170.6488327896672, 151.69267899906185, 172.32478221316322, 145.14365314788827, 163.79383292080666];
julia> scaler = Standard()
Standard(Float64[])
julia> fit!(scaler, x)
2×2 Matrix{Float64}:
19.0591 64.467
6.95818 6.68467
julia> transform(scaler, x)
20×2 Matrix{Float64}:
0.0310374 0.44579
0.0104337 0.218714
1.01139 0.710027
1.37285 0.954091
-0.895893 -0.270589
0.983361 0.47397
1.39855 0.645624
-1.31861 -0.901482
0.0147702 0.241708
0.29675 0.483076
-0.338824 -0.104812
0.432442 0.387922
0.860418 0.567095
-0.495306 0.140871
0.963084 0.552767
-1.38901 -1.05926
0.469323 0.719196
0.0669475 0.512023
-1.47583 -1.44017
-1.99789 -3.27656
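The standard scaling formula can be checked with Julia's Statistics standard library. This sketch assumes samples in rows (dims = 1) and the uncorrected standard deviation; the text does not say which variant HorseML uses, and standardize is a hypothetical name.

```julia
using Statistics

# Sketch of standard scaling for a samples × features matrix.
# Assumption: population (uncorrected) standard deviation.
function standardize(x::AbstractMatrix)
    μ = mean(x, dims = 1)
    σ = std(x, dims = 1, corrected = false)
    return (x .- μ) ./ σ, μ, σ
end

x = [1.0 10.0; 2.0 20.0; 3.0 30.0]
z, μ, σ = standardize(x)   # every column of z has mean 0 and standard deviation 1
```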
HorseML.Preprocessing.MinMax
— Type
MinMax()
MinMax scaler. This scaler scales data as: $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-\min(\boldsymbol{x})}{\max(\boldsymbol{x})-\min(\boldsymbol{x})}$
Example (x is the matrix returned by transform in the Standard example above)
julia> scaler = MinMax()
MinMax(Float64[])
julia> fit!(scaler, x)
2×2 Matrix{Float64}:
1.39855 0.954091
-1.99789 -3.27656
julia> transform(scaler, x)
20×2 Matrix{Float64}:
0.597368 0.879853
0.591301 0.826179
0.88601 0.942311
0.992432 1.0
0.324455 0.710522
0.877757 0.886514
1.0 0.927088
0.199996 0.561398
0.592578 0.831614
0.675601 0.888666
0.488471 0.749707
0.715552 0.866175
0.841559 0.908526
0.442398 0.807779
0.871787 0.905139
0.179268 0.524104
0.72641 0.944478
0.607941 0.895508
0.153707 0.43407
0.0 0.0
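The min-max formula maps each feature column onto the interval [0, 1]. A minimal sketch in plain Julia, with a hypothetical function name and samples assumed to be in rows:

```julia
# Sketch of min-max scaling per feature column (dims = 1).
function minmax_scale(x::AbstractMatrix)
    lo = minimum(x, dims = 1)
    hi = maximum(x, dims = 1)
    return (x .- lo) ./ (hi .- lo)
end

x = [1.0 10.0; 2.0 30.0; 5.0 20.0]
minmax_scale(x)    # [0.0 0.0; 0.25 1.0; 1.0 0.5]
```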
HorseML.Preprocessing.Robust
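— Type
Robust()
Robust scaler. This scaler scales data as: $\tilde{\boldsymbol{x}} = \frac{\boldsymbol{x}-Q_2}{Q_3 - Q_1}$, where $Q_1$, $Q_2$, and $Q_3$ are the first quartile, the median, and the third quartile of each feature.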
Example (x is the matrix returned by transform in the MinMax example above)
julia> scaler = Robust()
Robust(Float64[])
julia> fit!(scaler, x)
3×2 Matrix{Float64}:
0.412913 0.739911
0.602654 0.873014
0.849116 0.905986
julia> transform(scaler, x)
20×2 Matrix{Float64}:
-0.0121192 0.041181
-0.0260262 -0.28201
0.649595 0.417263
0.89357 0.764633
-0.637774 -0.978423
0.630675 0.0812893
0.910919 0.3256
-0.923097 -1.87636
-0.0230992 -0.249283
0.16723 0.0942497
-0.261766 -0.742477
0.258819 -0.041181
0.547691 0.213832
-0.367388 -0.392802
0.616988 0.193438
-0.970618 -2.10092
0.283712 0.430314
0.0121192 0.135449
-1.02922 -2.64305
-1.38159 -5.25675
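The robust scaling formula can be sketched with quantile from the Statistics standard library. Julia's default quantile definition appears consistent with the quartiles printed in the example above, but treating it as HorseML's method is an assumption, and robust_scale is a hypothetical name.

```julia
using Statistics

# Sketch of robust scaling per feature column (samples in rows).
function robust_scale(x::AbstractMatrix)
    # q(p) returns a 1 × nfeatures matrix holding the p-quantile of each column.
    q(p) = mapslices(col -> quantile(col, p), x, dims = 1)
    q1, q2, q3 = q(0.25), q(0.5), q(0.75)
    return (x .- q2) ./ (q3 .- q1)
end

x = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0; 5.0 50.0]
r = robust_scale(x)   # first column: -1.0, -0.5, 0.0, 0.5, 1.0
```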
HorseML.Preprocessing.fit!
— Function
fit!(scaler, x; dims=1)
Fit the scaler to x. dims is the dimension along which the data points (samples) are arranged.
HorseML.Preprocessing.transform
— Function
transform(scaler, x; dims=1)
Transform the data x with the fitted scaler. dims is the dimension along which the data points (samples) are arranged.
HorseML.Preprocessing.fit_transform!
— Function
fit_transform!(scaler, x; dims=1)
Fit the scaler to x, then transform x. dims is the dimension along which the data points (samples) are arranged.
HorseML.Preprocessing.inv_transform
— Function
inv_transform(scaler, x; dims=1)
Apply the inverse transformation to x, converting scaled data back to the original scale. dims is the dimension along which the data points (samples) are arranged.
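The relationship between fitting, transforming, and inverting can be illustrated for standard scaling, where the inverse is $\boldsymbol{x} = \tilde{\boldsymbol{x}}\sigma + \mu$. The helpers below are hypothetical stand-ins written in plain Julia; they are not HorseML's fit!, transform, or inv_transform, and they assume samples in rows.

```julia
using Statistics

# Hypothetical stand-ins for the fit / transform / inverse steps.
fit_stats(x) = (μ = mean(x, dims = 1), σ = std(x, dims = 1, corrected = false))
forward(s, x) = (x .- s.μ) ./ s.σ     # scale with the fitted statistics
inverse(s, z) = z .* s.σ .+ s.μ       # undo the scaling

x = [1.0 10.0; 2.0 20.0; 4.0 60.0]
s = fit_stats(x)
z = forward(s, x)
inverse(s, z) ≈ x    # true: the inverse recovers the original data
```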