<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pro-Tek Blog &#187; Data Mining With Python</title>
	<atom:link href="http://www.pro-tekconsulting.com/blog/tag/data-mining-with-python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.pro-tekconsulting.com/blog</link>
	<description>For UI developers / UI designers and UI trends</description>
	<lastBuildDate>Thu, 05 Sep 2019 03:59:47 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.0.34</generator>
	<item>
		<title>DATA MINING WITH PYTHON</title>
		<link>http://www.pro-tekconsulting.com/blog/data-mining-with-python/</link>
		<comments>http://www.pro-tekconsulting.com/blog/data-mining-with-python/#comments</comments>
		<pubDate>Tue, 10 Oct 2017 04:13:51 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[PYTHON]]></category>
		<category><![CDATA[Data Mining With Python]]></category>

		<guid isPermaLink="false">http://www.pro-tekconsulting.com/blog/?p=2225</guid>
		<description><![CDATA[<p>Data Mining With Python Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. It is applied in a wide range of domains and its techniques have become fundamental for several applications. This article is about the tools used in practical data mining for finding and describing structural patterns in [&#8230;]</p>
<p>The post <a rel="nofollow" href="http://www.pro-tekconsulting.com/blog/data-mining-with-python/">DATA MINING WITH PYTHON</a> appeared first on <a rel="nofollow" href="http://www.pro-tekconsulting.com/blog">Pro-Tek Blog</a>.</p>
]]></description>
				<content:encoded><![CDATA[<h4>Data Mining With Python</h4>
<p>Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. It is applied in a wide range of domains, and its techniques have become fundamental for many applications.</p>
<p>This article is about the tools used in practical data mining for finding and describing structural patterns in data using Python. In recent years, Python has been widely used for the development of data-centric applications.</p>
<p><strong>DATA IMPORTING AND VISUALIZATION</strong></p>
<p>The first step of any data analysis is obtaining the data and loading it into the working environment. The data can be downloaded easily with Python's standard library:</p>
<pre><code>import urllib.request

url = 'http://aima.cs.devopspython.edu/data/iris.csv'
with urllib.request.urlopen(url) as u, open('iris.csv', 'wb') as local_file:
    local_file.write(u.read())
</code></pre>
<p>The snippet above uses the standard-library module urllib.request to fetch a file from the web and saves it to disk using the methods of the file object. The file contains the iris dataset, a multivariate dataset that consists of 50 samples from each of three species of Iris flowers. Each sample has four features: the length and the width of the sepal and the petal, in centimetres.</p>
<p>The dataset is stored in CSV format. It is convenient to parse the CSV file and store the information it contains in a more suitable data structure. The dataset has 5 columns: the first 4 columns contain the values of the features, while the last column holds the class of each sample. The CSV can be parsed easily using the function genfromtxt of the numpy library:</p>
<pre><code>from numpy import genfromtxt, zeros

# read the first 4 columns
data = genfromtxt('iris.csv', delimiter=',', usecols=(0, 1, 2, 3))
# read the fifth column
target = genfromtxt('iris.csv', delimiter=',', usecols=(4), dtype=str)
</code></pre>
<p>The example above builds a matrix with the features and a vector containing the classes. The size of the dataset can be confirmed by looking at the shape of the loaded data structures:</p>
<pre><code>print(data.shape)    # (150, 4)
print(target.shape)  # (150,)
print(set(target))   # build a collection of unique elements
# {'setosa', 'versicolor', 'virginica'}
</code></pre>
<p>An important task when working with new data is to understand what information the data contains and how it is structured. Visualization helps the user explore the information graphically, in order to gain understanding and insight into the data.</p>
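<p>As a concrete sketch of such an exploration (not part of the original article), the snippet below plots two of the iris features per class. It uses scikit-learn's bundled copy of the dataset via load_iris as a convenient stand-in for the CSV downloaded above, and renders off-screen to a PNG file:</p>

```python
# Minimal visualization sketch: scatter two iris features, coloured by class.
# load_iris is used here as a stand-in for the CSV parsed in the article.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
data, labels = iris.data, iris.target  # (150, 4) features; classes 0..2

for cls, colour in zip(range(3), ("b", "r", "g")):
    plt.plot(data[labels == cls, 0], data[labels == cls, 1],
             colour + "o", label=iris.target_names[cls])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
plt.savefig("iris_scatter.png")
```

<p>Even this simple plot shows that setosa separates cleanly from the other two species on these two features.</p>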
<p><strong>CLASSIFICATION</strong></p>
<p>Classification is a data mining function that allocates samples in a dataset to target classes. The models that implement this function are called classifiers. There are two basic steps to using a classifier: training and classification. The library sklearn contains implementations of many classification models. Before training, the string labels are encoded as numbers:</p>
<pre><code>t = zeros(len(target))
t[target == 'setosa'] = 1
t[target == 'versicolor'] = 2
t[target == 'virginica'] = 3
</code></pre>
<p>Classification is done with the predict method, and it is easy to test it on one of the samples:</p>
<pre><code>print(classifier.predict(data[0].reshape(1, -1)))  # [ 1.]
print(t[0])  # 1.0
</code></pre>
<p>In this case the predicted class is equal to the correct one (setosa), but it is important to assess the classifier on a wider range of samples and to test it with data not used in the training process.</p>
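<p>One common way to do that assessment, sketched below under the same Gaussian Naive Bayes assumption, is to hold out part of the data with train_test_split and score the classifier only on the samples it never saw during training:</p>

```python
# Held-out evaluation sketch; GaussianNB is again a stand-in model.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
train_x, test_x, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

classifier = GaussianNB()
classifier.fit(train_x, train_y)             # train on 60% of the data
accuracy = classifier.score(test_x, test_y)  # fraction correct on held-out 40%
print(accuracy)
```

<p>Scoring on held-out data guards against an optimistic estimate: a model can fit its training samples perfectly and still generalize poorly.</p>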
<p><strong>CLUSTERING </strong></p>
<p>Now suppose we do not have labels attached to the data telling us the class of the samples; the data must be analysed in order to group them on the basis of some similarity criterion, where groups are sets of similar samples. This kind of analysis is called unsupervised data analysis. One of the most famous clustering tools is the k-means algorithm, which can be run as follows:</p>
<p class="western"><span style="font-family: Calibri, serif;"><span style="font-size: small;"><span lang="en">from sklearn.cluster import KMeans </span></span></span></p>
<p class="western"><span style="font-family: Calibri, serif;"><span style="font-size: small;"><span lang="en">kmeans = KMeans(k=3, init=&#8217;random&#8217;) # initialization</span></span></span></p>
<p class="western"><span style="font-family: Calibri, serif;"><span style="font-size: small;"><span lang="en">kmeans.fit(data) # actual execution</span></span></span></p>
<p>The snippet above runs the algorithm and groups the data in 3 clusters (as specified by the parameter k). Now the user can use the model to assign each sample to one of the clusters:</p>
<p class="western"><span style="font-family: Calibri, serif;"><span style="font-size: small;"><span lang="en">c = kmeans.predict(data)</span></span></span></p>
<p>The results of the clustering can then be evaluated against the labels we already have, using the completeness and homogeneity scores:</p>
<pre><code>from sklearn.metrics import completeness_score, homogeneity_score

print(completeness_score(t, c))  # 0.7649861514489815
print(homogeneity_score(t, c))   # 0.7514854021988338
</code></pre>
<p>The completeness score approaches 1 when most of the data points that are members of a given class are elements of the same cluster, while the homogeneity score approaches 1 when all the clusters contain almost only data points that are members of a single class.</p>
<p>The result of the clustering can also be visualized, comparing the assignments with the real labels side by side:</p>
<pre><code>from pylab import figure, subplot, plot, show

figure()
subplot(211)  # top figure with the real classes
plot(data[t == 1, 0], data[t == 1, 2], 'bo')
plot(data[t == 2, 0], data[t == 2, 2], 'ro')
plot(data[t == 3, 0], data[t == 3, 2], 'go')
subplot(212)  # bottom figure with classes assigned automatically
plot(data[c == 1, 0], data[c == 1, 2], 'bo', alpha=.7)
plot(data[c == 2, 0], data[c == 2, 2], 'go', alpha=.7)
plot(data[c == 0, 0], data[c == 0, 2], 'mo', alpha=.7)
show()
</code></pre>
<p>The post <a rel="nofollow" href="http://www.pro-tekconsulting.com/blog/data-mining-with-python/">DATA MINING WITH PYTHON</a> appeared first on <a rel="nofollow" href="http://www.pro-tekconsulting.com/blog">Pro-Tek Blog</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pro-tekconsulting.com/blog/data-mining-with-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
