analitics

Pages

Thursday, July 18, 2019

Python 3.7.3 : The pandas python module.

Since I started learning python programming language I have not found a more complex and complete module for viewing complex data.
The official documentation of this python module tells us:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.
The official webpage can be found here.
This python module is one of the most popular Python libraries for Data Science and Analytics.
You can install this python module with pip tool:
C:\Python373\Scripts>pip install pandas
Requirement already satisfied: pandas in c:\python373\lib\site-packages (0.24.2)
You can find many tutorials on web with this python module.
Today I will show you a short tutorial about this python module.
Most users use this both python modules:
import numpy as np
import pandas as pd
Most area of the pandas python module has a target into this list:
Window Functions, Aggregations, Missing Data, GroupBy, Merging/Joining, Concatenation, Date Functionality, Timedelta, Categorical Data,
Visualization, IO Tools, Sparse Data, Caveats & Gotchas, Comparison with SQL
There are two types of data structures in pandas: Series and DataFrames.
The pandas Series is a one-dimensional data structure.
The pandas DataFrame is a two (or more) dimensional data structure, like a table
Pandas provide few variants rolling, expanding and exponentially moving weights for window statistics.
Also have the sum, mean, median, variance, covariance, correlation, etc.
The several methods are available to perform aggregations on data.
Pandas provide functions for missing data like the isnull() and notnull().
Let's test the DataFrames with pandas and the Wikipedia example from my tutorial
C:\Python373>python.exe
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Inte
l)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> # get table from wikipedia
... import requests
>>> from bs4 import BeautifulSoup
>>> website_url = requests.get('https://en.wikipedia.org/w/index.php?title=Table
_of_food_nutrients').text
>>> soup = BeautifulSoup(website_url,'lxml')
>>>
>>> my_table = soup.find('table',{'class':'wikitable collapsible collapsed'})
>>> links = my_table.findAll('a')
>>> Food = []
>>> for link in links:
...     Food.append(link.get('title'))
...
>>> print(Food)
["Cows' milk (page does not exist)", 'Buttermilk', 'Fortified milk (page does no
t exist)', 'Powdered milk', "Goats' milk", 'Malted milk', 'Hot chocolate', 'Yogu
rt', 'Milk pudding (page does not exist)', 'Custard', 'Ice cream', 'Ice milk', '
Cream', 'Cheese', 'Cheddar cheese', 'American cheese', 'Processed cheese', 'Egg
(food)', 'Scrambled', 'Omelet', 'Yolk']
>>> import pandas
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df['Foods'] = Food
>>> print(df)
                                   Foods
0       Cows' milk (page does not exist)
1                             Buttermilk
2   Fortified milk (page does not exist)
3                          Powdered milk
4                            Goats' milk
5                            Malted milk
6                          Hot chocolate
7                                 Yogurt
8     Milk pudding (page does not exist)
9                                Custard
10                             Ice cream
11                              Ice milk
12                                 Cream
13                                Cheese
14                        Cheddar cheese
15                       American cheese
16                      Processed cheese
17                            Egg (food)
18                             Scrambled
19                                Omelet
20                                  Yolk
>>> df.describe()
              Foods
count            21
unique           21
top     Malted milk
freq              1
>>> df.apply(pd.Series.value_counts)
                                      Foods
Malted milk                               1
Ice milk                                  1
Omelet                                    1
Goats' milk                               1
Custard                                   1
Cheddar cheese                            1
American cheese                           1
Ice cream                                 1
Yolk                                      1
Cream                                     1
Cows' milk (page does not exist)          1
Yogurt                                    1
Fortified milk (page does not exist)      1
Egg (food)                                1
Powdered milk                             1
Milk pudding (page does not exist)        1
Cheese                                    1
Hot chocolate                             1
Buttermilk                                1
Processed cheese                          1
Scrambled                                 1
The last example is to show data with
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(76), index=pd.date_range('1/1/76', periods=76
))
>>> ts.plot()

>>> plt.show()