python-catalin: pandas

Showing posts with label pandas. Show all posts

Wednesday, September 4, 2024

Python 3.13.0rc1 : Use faker with build pandas and PyQt6.

In this tutorial I will show you how I build pandas and use faker and PyQt6.

The faker python module is a powerful library designed to generate fake data, which is particularly useful for testing, filling databases, and creating realistic-looking sample data.

I used python version 3.13.0rc1 and I install with pip tool faker and the pandas and PyQt6 is build with pip tool.

pip install faker
Collecting faker
...
Installing collected packages: six, python-dateutil, faker
Successfully installed faker-28.1.0 python-dateutil-2.9.0.post0 six-1.16.0

The pandas installation fail first time then today works, maybe comes with fixes ...

pip install pandas
...
Successfully built pandas
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.2 pytz-2024.1 tzdata-2024.1

Let's try one example to see how this works, I used copilot from microsoft to generate this first source code:

import sys
from PyQt6.QtWidgets import QApplication, QMainWindow, QTableView
from PyQt6.QtCore import Qt, QAbstractTableModel
import pandas as pd
from faker import Faker

# Generăm date false folosind Faker
fake = Faker()
data = {
    'Name': [fake.name() for _ in range(100)],
    'Address': [fake.address() for _ in range(100)],
    'Email': [fake.email() for _ in range(100)],
    'IP Address': [fake.ipv4() for _ in range(100)]  # Adăugăm adresa IP
}

# Creăm un DataFrame Pandas
df = pd.DataFrame(data)

# Definim un model pentru QTableView
class PandasModel(QAbstractTableModel):
    def __init__(self, df):
        super().__init__()
        self._df = df

    def rowCount(self, parent=None):
        return len(self._df)

    def columnCount(self, parent=None):
        return self._df.shape[1]

    def data(self, index, role=Qt.ItemDataRole.DisplayRole):
        if index.isValid():
            if role == Qt.ItemDataRole.DisplayRole:
                return str(self._df.iloc[index.row(), index.column()])
        return None

    def headerData(self, section, orientation, role=Qt.ItemDataRole.DisplayRole):
        if role == Qt.ItemDataRole.DisplayRole:
            if orientation == Qt.Orientation.Horizontal:
                return self._df.columns[section]
            else:
                return str(section)
        return None

# Aplicatia PyQt6
app = QApplication(sys.argv)
window = QMainWindow()
view = QTableView()

# Setăm modelul pentru QTableView
model = PandasModel(df)
view.setModel(model)

# Configurăm fereastra principală
window.setCentralWidget(view)
window.resize(800, 600)
window.show()

# Rulăm aplicația
sys.exit(app.exec())

Sunday, March 31, 2024

Python 3.12.1 : About multiprocessing performance.

Today I test a simple python script for Pool and ThreadPool python classes from multiprocessing python module.

The main goal was to test Python’s multiprocessing performance with my computer.

NumPy releases the GIL for many of its operations, which means you can use multiple CPU cores even with threads.

Processing large amounts of data with Pandas can be difficult, and with Polars dataframe library is a potential solution.

Sciagraph gives you both performance profiling and peak memory profiling information.

Let's teste only these class:

The multiprocessing.pool.Pool class provides a process pool in Python.

The multiprocessing.pool.ThreadPool class in Python provides a pool of reusable threads for executing spontaneous tasks.

This is the python script:

from time import time
import multiprocessing as mp
from multiprocessing.pool import ThreadPool
import numpy as np
import pickle

def main():
    arr = np.ones((1024, 1024, 1024), dtype=np.uint8)
    expected_sum = np.sum(arr)

    with ThreadPool(1) as threadpool:
        start = time()
        assert (
            threadpool.apply(np.sum, (arr,)) == expected_sum
        )
        print("Thread pool:", time() - start)

    with mp.get_context("spawn").Pool(1) as process_pool:
        start = time()
        assert (
            process_pool.apply(np.sum, (arr,))
            == expected_sum
        )
        print("Process pool:", time() - start)

if __name__ == "__main__":
    main()

This is the result:

python thread_process_pool_001.py
Thread pool: 1.6689703464508057
Process pool: 11.644825458526611

Thursday, July 18, 2019

Python 3.7.3 : The pandas python module.

Since I started learning python programming language I have not found a more complex and complete module for viewing complex data.
The official documentation of this python module tells us:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.
The official webpage can be found here.
This python module is one of the most popular Python libraries for Data Science and Analytics.
You can install this python module with pip tool:

C:\Python373\Scripts>pip install pandas
Requirement already satisfied: pandas in c:\python373\lib\site-packages (0.24.2)

You can find many tutorials on web with this python module.
Today I will show you a short tutorial about this python module.
Most users use this both python modules:

import numpy as np
import pandas as pd

Most area of the pandas python module has a target into this list:
Window Functions, Aggregations, Missing Data, GroupBy, Merging/Joining, Concatenation, Date Functionality, Timedelta, Categorical Data,
Visualization, IO Tools, Sparse Data, Caveats & Gotchas, Comparison with SQL
There are two types of data structures in pandas: Series and DataFrames.
The pandas Series is a one-dimensional data structure.
The pandas DataFrame is a two (or more) dimensional data structure, like a table
Pandas provide few variants rolling, expanding and exponentially moving weights for window statistics.
Also have the sum, mean, median, variance, covariance, correlation, etc.
The several methods are available to perform aggregations on data.
Pandas provide functions for missing data like the isnull() and notnull().
Let's test the DataFrames with pandas and the Wikipedia example from my tutorial

C:\Python373>python.exe
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Inte
l)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> # get table from wikipedia
... import requests
>>> from bs4 import BeautifulSoup
>>> website_url = requests.get('https://en.wikipedia.org/w/index.php?title=Table
_of_food_nutrients').text
>>> soup = BeautifulSoup(website_url,'lxml')
>>>
>>> my_table = soup.find('table',{'class':'wikitable collapsible collapsed'})
>>> links = my_table.findAll('a')
>>> Food = []
>>> for link in links:
...     Food.append(link.get('title'))
...
>>> print(Food)
["Cows' milk (page does not exist)", 'Buttermilk', 'Fortified milk (page does no
t exist)', 'Powdered milk', "Goats' milk", 'Malted milk', 'Hot chocolate', 'Yogu
rt', 'Milk pudding (page does not exist)', 'Custard', 'Ice cream', 'Ice milk', '
Cream', 'Cheese', 'Cheddar cheese', 'American cheese', 'Processed cheese', 'Egg
(food)', 'Scrambled', 'Omelet', 'Yolk']
>>> import pandas
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df['Foods'] = Food
>>> print(df)
                                   Foods
0       Cows' milk (page does not exist)
1                             Buttermilk
2   Fortified milk (page does not exist)
3                          Powdered milk
4                            Goats' milk
5                            Malted milk
6                          Hot chocolate
7                                 Yogurt
8     Milk pudding (page does not exist)
9                                Custard
10                             Ice cream
11                              Ice milk
12                                 Cream
13                                Cheese
14                        Cheddar cheese
15                       American cheese
16                      Processed cheese
17                            Egg (food)
18                             Scrambled
19                                Omelet
20                                  Yolk
>>> df.describe()
              Foods
count            21
unique           21
top     Malted milk
freq              1
>>> df.apply(pd.Series.value_counts)
                                      Foods
Malted milk                               1
Ice milk                                  1
Omelet                                    1
Goats' milk                               1
Custard                                   1
Cheddar cheese                            1
American cheese                           1
Ice cream                                 1
Yolk                                      1
Cream                                     1
Cows' milk (page does not exist)          1
Yogurt                                    1
Fortified milk (page does not exist)      1
Egg (food)                                1
Powdered milk                             1
Milk pudding (page does not exist)        1
Cheese                                    1
Hot chocolate                             1
Buttermilk                                1
Processed cheese                          1
Scrambled                                 1

The last example is to show data with

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(76), index=pd.date_range('1/1/76', periods=76
))
>>> ts.plot()

>>> plt.show()

python-catalin

analitics

Pages