Data Mining With Python
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. It is applied in a wide range of domains, and its practices have become fundamental to many applications.
This article covers the tools used in practical data mining to find and describe structural patterns in data using Python. In recent years, Python has been used for the development of data-centric applications.
DATA IMPORTING AND VISUALIZATION
The first step of a data analysis is to obtain the data and load it into the user’s work environment. The data can be downloaded easily using the following Python capability:
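import urllib2
u = urllib2.urlopen(url)  # url holds the address of the CSV file
with open('iris.csv', 'w') as localFile:
    localFile.write(u.read())  # save the downloaded content to disk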
In the snippet above, the user has used the urllib2 library to access a file on a website and saved it to disk using the methods of the File object provided by the standard library. The file contains the iris dataset, a multivariate dataset consisting of 50 samples from each of three species of Iris flowers. Each sample has four features: the length and the width of the sepal and of the petal, in centimetres.
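The urllib2 library exists only in Python 2. On Python 3, a roughly equivalent download can be sketched with urllib.request from the standard library. The function name download and its parameters are hypothetical, chosen here for illustration only; the address of the file has to be supplied by the caller.

```python
import shutil
import urllib.request


def download(url, path):
    """Copy the resource found at `url` into the local file `path`."""
    # Open the remote resource and the local file, then stream one into the other
    with urllib.request.urlopen(url) as response, open(path, "wb") as local_file:
        shutil.copyfileobj(response, local_file)
    return path
```

Calling download(url, 'iris.csv') with the address of the dataset would then produce the same local file as the snippet above.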
The dataset is stored in the CSV format. It is convenient to parse the CSV file and store the information it contains in a more suitable data structure. The dataset has 5 columns: the first 4 columns contain the values of the features, while the last column gives the class of each sample. The CSV can be easily parsed using the function genfromtxt of the numpy library:
from numpy import genfromtxt, zeros
# read the first 4 columns
data = genfromtxt('iris.csv', delimiter=',', usecols=(0, 1, 2, 3))
# read the fifth column
target = genfromtxt('iris.csv', delimiter=',', usecols=(4,), dtype=str)
In the example above, the user has created a matrix with the features and a vector that contains the classes. The size of the dataset can be confirmed by looking at the shape of the loaded data structures, and the names of the classes can be listed by collecting the unique elements of the target vector:
print set(target) # build a collection of unique elements
set(['setosa', 'versicolor', 'virginica'])
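The parsing and inspection steps can be tried without the file on disk. The sketch below is written for Python 3 and feeds genfromtxt three illustrative rows in the layout described above (four numeric columns followed by the class name); the variable sample is an assumption made for this example, not part of the article's code.

```python
from io import StringIO

import numpy as np

# Three illustrative rows: four feature columns, then the class name
sample = """5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# genfromtxt accepts file-like objects as well as file names
data = np.genfromtxt(StringIO(sample), delimiter=",", usecols=(0, 1, 2, 3))
target = np.genfromtxt(StringIO(sample), delimiter=",", usecols=(4,), dtype=str)

print(data.shape)    # (number of samples, number of features)
print(target.shape)  # one class name per sample
print(sorted(str(name) for name in set(target)))  # unique class names
```

With the full file, the first dimension of both shapes would equal the number of samples in the dataset.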
An important task when working with new data is to understand what information the data contains and how it is structured. Visualization helps the user explore the information graphically and gain understanding of, and insight into, the data.
CLASSIFICATION
Classification is a data mining function that assigns the samples in a dataset to target classes. The models that implement this function are called classifiers. Using a classifier involves two basic steps: training and classification. The sklearn library contains implementations of many classification models. Before training, the class names in the target vector are converted to integers:
t = zeros(len(target))
t[target == 'setosa'] = 1
t[target == 'versicolor'] = 2
t[target == 'virginica'] = 3
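When the number of classes is larger, writing one assignment per class becomes tedious. As an alternative sketch (not the article's method), numpy.unique can derive the same kind of integer encoding automatically; the short target vector below is made up for illustration.

```python
import numpy as np

# Hypothetical target vector with one class name per sample
target = np.array(["setosa", "versicolor", "virginica", "setosa"])

# names holds the sorted unique class names, codes the index of each sample's name
names, codes = np.unique(target, return_inverse=True)
t = codes + 1  # shift to 1-based labels, matching the encoding used above

print(names)
print(t)
```

Because numpy.unique sorts the names alphabetically, the resulting codes happen to agree with the manual encoding (setosa 1, versicolor 2, virginica 3).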
Once the classifier has been trained, classification is done with the predict method, and it is easy to test it with one of the samples.
For the sample tested, the predicted class is equal to the correct one (setosa), but it is important to assess the classifier on a wider range of samples and to test it with data not used in the training process.
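The two steps, training and classification, together with a test on data kept out of training, can be sketched as follows for Python 3. Several choices here are assumptions made for illustration: the text above does not name a model, so Gaussian naive Bayes (GaussianNB) is used as one possible sklearn classifier; the copy of the iris dataset bundled with sklearn (load_iris) stands in for the CSV file; and the 40% hold-out size is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
data = iris.data
t = iris.target + 1  # 1 = setosa, 2 = versicolor, 3 = virginica

# Hold out 40% of the samples so the classifier is assessed on unseen data
train, test, t_train, t_test = train_test_split(
    data, t, test_size=0.4, random_state=0
)

classifier = GaussianNB()       # assumption: any sklearn classifier could be used
classifier.fit(train, t_train)  # training step

# predict expects a 2-D array, so the first sample is passed as a one-row slice
first_prediction = classifier.predict(data[:1])[0]
accuracy = classifier.score(test, t_test)  # fraction of held-out samples classified correctly

print(first_prediction, t[0])
print(accuracy)
```

Comparing first_prediction with t[0] mirrors the single-sample check described above, while accuracy gives the broader assessment on samples the classifier has not seen.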
CLUSTERING
Often there are no labels attached to the data that tell us the class of the samples. The user then has to analyse the data in order to group them on the basis of a similarity criterion, where groups are sets of similar samples. This kind of analysis is called unsupervised data analysis. One of the best-known clustering tools is the k-means algorithm, which can be run as follows:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='random') # initialization
kmeans.fit(data) # actual execution
The snippet above runs the algorithm and groups the data into 3 clusters (as specified by the parameter n_clusters, which early sklearn releases called k). Now the user can use the model to assign each sample to one of the clusters:
c = kmeans.predict(data)
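A self-contained version of the clustering steps can be sketched as follows for Python 3. The assumptions are labelled in the comments: the bundled load_iris copy of the dataset replaces the CSV file, and n_init and random_state are set explicitly only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data  # assumption: bundled copy of the iris features

# n_init restarts the algorithm from several random initializations and keeps the best run;
# random_state fixes the random choices so repeated runs give the same clusters
kmeans = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
kmeans.fit(data)

c = kmeans.predict(data)  # cluster index (0, 1 or 2) for every sample

print(c.shape)
print(kmeans.cluster_centers_.shape)  # one centre per cluster, one coordinate per feature
```

The cluster indices are arbitrary: cluster 0 does not necessarily correspond to the first class, which is why the comparison with the real labels needs dedicated scores.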
The user can then evaluate the results of the clustering by comparing them with the labels already available, using the completeness and homogeneity scores:
from sklearn.metrics import completeness_score, homogeneity_score
print completeness_score(t, c), homogeneity_score(t, c) # real labels first, then clusters
The completeness score approaches 1 when most of the data points that are members of a given class are elements of the same cluster, while the homogeneity score approaches 1 when all the clusters contain almost only data points that are members of a single class.
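To make the two scores concrete, the sketch below computes them in plain Python 3, assuming the usual entropy-based definitions: homogeneity is 1 - H(class | cluster) / H(class), and completeness is the same quantity with the roles of classes and clusters swapped. The function names are hypothetical and this is an illustration of the idea, not the sklearn implementation.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a labelling."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def conditional_entropy(labels, given):
    """Entropy of `labels` once the labelling `given` is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((k / n) * log(k / given_counts[g]) for (g, _), k in joint.items())


def homogeneity(classes, clusters):
    """Close to 1 when each cluster contains members of a single class."""
    h = entropy(classes)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(classes, clusters) / h


def completeness(classes, clusters):
    """Close to 1 when all members of a class fall in the same cluster."""
    return homogeneity(clusters, classes)


classes = [0, 0, 1, 1]
one_cluster = [0, 0, 0, 0]   # everything lumped together
all_split = [0, 1, 2, 3]     # every sample in its own cluster

print(homogeneity(classes, one_cluster), completeness(classes, one_cluster))
print(homogeneity(classes, all_split), completeness(classes, all_split))
```

Putting every sample in one cluster is complete but not homogeneous, while giving every sample its own cluster is homogeneous but only partly complete, so the two scores are best read together.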
The user can also visualize the result of the clustering and compare the assignments with the real labels visually:
subplot(211) # top figure with the real classes
subplot(212) # bottom figure with classes assigned automatically
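The two subplot calls above only select the panels; the plotting calls themselves are not shown. A complete figure could be sketched as follows with matplotlib on Python 3. The helper name plot_comparison, the choice of feature columns 0 and 2, the off-screen Agg backend and the four sample rows are all assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove for an interactive window
import matplotlib.pyplot as plt
import numpy as np


def plot_comparison(data, real, assigned, x=0, y=2):
    """Scatter two feature columns twice: coloured by real class, then by cluster."""
    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.scatter(data[:, x], data[:, y], c=real)
    top.set_title("real classes")
    bottom.scatter(data[:, x], data[:, y], c=assigned)
    bottom.set_title("classes assigned automatically")
    return fig


# Four illustrative rows in the iris layout, with made-up cluster assignments
data = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
])
real = np.array([1, 1, 2, 2])
assigned = np.array([0, 0, 1, 1])

fig = plot_comparison(data, real, assigned)
```

With the full dataset, passing the feature matrix, the vector t and the vector c would produce the top and bottom panels described above; fig.savefig can then write the figure to a file.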