Saturday, November 2, 2019

Python 3.7.5 : Intro about scikit-learn python module.

This Python module, named scikit-learn and imported as sklearn, is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. It comes with various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN.
The official webpage can be found here: https://scikit-learn.org/.
Let's install it on my Fedora 30 distro:
[mythcat@desk proiecte_github]$ mkdir sklearn_examples
[mythcat@desk proiecte_github]$ cd sklearn_examples/
[mythcat@desk sklearn_examples]$ pip3 install scikit-learn --user
[mythcat@desk sklearn_examples]$ python3
Python 3.7.5 (default, Oct 17 2019, 12:09:47) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sklearn
>>> print('sklearn: %s' % sklearn.__version__)
sklearn: 0.21.3 
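
All estimators in scikit-learn share the same fit / predict interface. As a quick test of the install, here is a minimal sketch of that interface using KMeans on random data (both the data and the choice of KMeans are just for illustration):

# minimal sketch of the common estimator interface: fit, then read the results
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
data = rng.rand(100, 2)          # 100 random points with 2 features
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
print(kmeans.labels_[:10])       # cluster index of the first 10 points
print(kmeans.cluster_centers_)   # coordinates of the 3 cluster centers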

First, this is a complex Python module with many examples on the web.
You can learn a lot about how data mining and data analysis can be done simply and efficiently.
The mathematical functions range from simple to complex, and the existing examples cover many learning points for Python programming.
I would start by discovering the input and output data sets and then continue with clear, everyday examples.
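For example, a quick way to discover the inputs and outputs of a data set is to load one of the toy data sets shipped with scikit-learn and inspect its shapes and names (a minimal sketch using load_iris):

# load a built-in data set and inspect its inputs and outputs
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4) - 150 samples with 4 features
print(iris.target.shape)    # (150,)   - one class label per sample
print(iris.feature_names)   # the names of the 4 input columns
print(iris.target_names)    # the 3 iris species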
Today I tested SVC from the sklearn Python module.
The SVMs were introduced initially in the 1960s and were later refined in the 1990s.
The core of this algorithm is a decision boundary that maximizes the distance to the nearest data points of all classes.
The Wikipedia article shows all the information about support-vector machines (SVM).
As applications, SVMs can be used in the medical field for cell counting or similar cell quantification, in astronomy, etc.
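A short sketch of this decision boundary idea: fit a linear SVC on two separable blobs and print the support vectors, the points nearest to the boundary that define the margin (make_blobs and the parameter values are just for illustration):

# fit a linear SVC on two blobs and print its support vectors
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = SVC(kernel='linear').fit(X, y)
print(clf.support_vectors_)   # the samples that define the margin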
This simple example uses multiple kernels and gamma parameters to classify the input data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
# use only the first two features: sepal length and sepal width
x = iris.data[:, :2]
y = iris.target

def plotSVC(title):
  # create a mesh to plot with dataset x and y
  x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
  y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
  # set the grid step to 1/100 of the x range
  h = (x_max - x_min) / 100
  # create the meshgrid 
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
  # divides the current figure into an m-by-n grid and creates axes in the position specified by p
  plt.subplot(1, 1, 1)
  # the model can then be used to predict new values
  Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
  # reshape your test data because prediction needs an array that looks like your training data
  Z = Z.reshape(xx.shape)
  # use plt to show result 
  plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8) 
  plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired)
  plt.xlabel('sepal length')
  plt.ylabel('sepal width')
  plt.xlim(xx.min(), xx.max())
  plt.title(title)
  plt.show()

# create the list of kernels for SVC
kernels = ['linear', 'rbf', 'poly']
# for each kernel show graphs
for kernel in kernels:
  svc = svm.SVC(kernel=kernel).fit(x, y)
  plotSVC('kernel=' + str(kernel))
# create the gamma values
# the gamma parameter defines how far the influence of a single training example reaches
gammas = [0.1, 1, 10, 100, 1000]
# for each gamma with the rbf kernel, show the graph
for gamma in gammas:
  svc = svm.SVC(kernel='rbf', gamma=gamma).fit(x, y)
  plotSVC('gamma=' + str(gamma))

See the last result for the rbf kernel and gamma 1000.