python-catalin: openpyxl

Showing posts with label openpyxl. Show all posts

Sunday, August 30, 2020

Python 3.8.5 : Testing with openpyxl - part 002 .

Today I will show you how can use Levenshtein ratio and distance between two strings, see wikipedia.
I used three files created with LibreOffice and save it like xlsx file type.
All of these files come with the column A fill with strings of characters, in this case, numbers.
The script will read all of these files from the folder named xlsx_files and will calculate Levenshtein ratio and distance between the strings of name of these files and column A.
Finally, the result is shown into a graph with matplotlib python package.
Let's see the python script:

import os
from glob import glob

from openpyxl import load_workbook
import numpy as np 
import matplotlib.pyplot as plt 

def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance - distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate matrix of zeros with the indeces of each character of both strings
    for i in range(1, rows):
        for k in range(1,cols):
            distance[i][0] = i
            distance[0][k] = k
    for col in range(1, cols):
        for row in range(1, rows):
            # check the characters are the same in the two strings in a given position [i,j] 
            # then the cost is 0
            if s[row-1] == t[col-1]:
                cost = 0 
            else:             
                # calculate distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    if ratio_calc == True:
        # Ration computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
        return Ratio
    else:
        return distance[row][col]


PATH = "/home/mythcat/xlsx_files/"
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.xlsx'))]
result_files = [os.path.join(path, name) for path, subdirs, files in os.walk(PATH) for name in files]
#print(result)
row_0 = []

for r in result:
    n = 0
    wb = load_workbook(r)
    sheets = wb.sheetnames
    ws = wb[sheets[n]]
    for row in ws.rows:
            if (row[0].value) != None :
                rows = row[0].value
                row_0.append(rows)

print("All rows of column A ")
print(row_0)
files = []
for f in result_files:
    ff = str(f).split('/')[-1:][0]
    fff = str(ff).split('.xlsx')[0]
    files.append(fff)

print(files)
# define tree lists for levenshtein
list1 = []
list2 = []

for l in row_0:
    str(l).lower()
    for d in files:
        Distance = levenshtein_ratio_and_distance(str(l).lower(),str(d).lower())   
        Ratio = levenshtein_ratio_and_distance(str(l).lower(),str(d).lower(),ratio_calc = True)
        list1.append(Distance)
        list2.append(Ratio)
        
print(list1, list2)
# plotting the points  
plt.plot(list1,'g*', list2, 'ro' )
plt.show()

The result is this:

[mythcat@desk ~]$ python test_xlsx.py
All rows of column A 
[11, 2, 113, 4, 1111, 4, 4, 111, 2, 1111, 5, 4, 4, 3, 1111, 1, 2, 1113, 4, 115, 1, 2, 221, 1, 1,
 43536, 2, 34242, 3, 1]
['001', '002', '003']
[2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 4, 4, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 4, 4, 2, 3, 3, 3, 2, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 
2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 5, 5, 4, 3, 2, 3, 5, 4, 5, 3, 3, 2, 2, 3, 3] [0.4, 0.0, 0.0, 0.0, 
0.5, 0.0, 0.3333333333333333, 0.0, 0.3333333333333333, 0.0, 0.0, 0.0, 0.2857142857142857, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3333333333333333, 0.0, 0.0, 0.0, 0.5, 0.0, 0.2857142857142857, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.2857142857142857, 0.0, 0.0, 0.5,
 0.0, 0.0, 0.0, 0.5, 0.0, 0.2857142857142857, 0.0, 0.2857142857142857, 0.0, 0.0, 0.0, 0.3333333333333333,
 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.3333333333333333, 0.3333333333333333, 0.0, 0.5, 0.0, 0.0,
 0.5, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.5, 0.0, 0.0, 0.25, 0.25, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

Sunday, August 23, 2020

Python 3.8.5 : Testing with openpyxl - part 001 .

The Python executes the code line by line because is an interpreter language.
This allows users to solve issues in the programming area, fast and easy.
I use python versiono 3.8.5 build on Aug 12 2020 at 00:00:00, see the result of interactive mode:

[mythcat@desk ~]$ python
Python 3.8.5 (default, Aug 12 2020, 00:00:00) 
[GCC 10.2.1 20200723 (Red Hat 10.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

Today I will show you how to start using openpyxl python package.
Another tutorial about python and documents can be found here.
The openpyxl was created by Eric Gazoni, Charlie Clark, and is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.
Let's install the openpyxl python package:

[mythcat@desk ~]$ pip3 install openpyxl --user
Collecting openpyxl
...
Installing collected packages: openpyxl
Successfully installed openpyxl-3.0.5

I tested with the default example source code and works well.

from openpyxl import Workbook
wb = Workbook()

# grab the active worksheet
ws = wb.active

# Data can be assigned directly to cells
ws['A1'] = 42

# Rows can also be appended
ws.append([1, 2, 3])

# Python types will automatically be converted
import datetime
ws['A2'] = datetime.datetime.now()

# Save the file
wb.save("sample.xlsx")

The next example gets all data about asteroids close to planet Earth and put into xlsx file type.
The rows with dangerous asteroids are fill with the red color:

# check asteroids close to planet Earth and add it to file
# import json python package
import json, urllib.request, time

# import openpyxl python package
from openpyxl import Workbook
from openpyxl.styles import PatternFill
# use active worksheet
wb = Workbook()
ws = wb.active

today = time.strftime('%Y-%m-%d', time.gmtime())
print("Time is: " + today)
now = today
# retrieve data about asteroids approaching planet Earth into json format
url = "https://api.nasa.gov/neo/rest/v1/feed?start_date=" + today + "&end_date=" + today + "&api_key=DEMO_KEY"
response = urllib.request.urlopen(url)
result = json.loads(response.read())

print("Now, " + str(result["element_count"]) + " asteroids is close to planet Earth.")
asteroids = result["near_earth_objects"]

no_data = ""
dangerous = ""

ws.append(['today', 'name', 'dangerous?', 'no_data'])
# parsing all the JSON data and add to file
for asteroid in asteroids:
    for field in asteroids[asteroid]:

      try:
        name = "Asteroid Name: " + field["name"]

        if field["is_potentially_hazardous_asteroid"]:   
          dangerous = "... dangerous to planet Earth!"

        else:
          dangerous = "... not threat to planet Earth!"

      except:
        no_data = "no data"
      ws.append([today, name, dangerous, no_data]) 

# create a red patern to fill
redFill = PatternFill(start_color='FFFF0000',
                   end_color='FFFF0000',
                   fill_type='solid')

# check the row with the dangerous asteroid and fill it
for row in ws.rows:
 if row[2].value == "... dangerous to planet Earth!":
  for cell in row:
      cell.fill = redFill

# write all data to file 
wb.save(str(now)+"_asteroids.xlsx")

I run it and result working well:

[mythcat@desk ~]$ python asteroid_data.py 
Time is: 2020-08-23
Now, 9 asteroids is close to planet Earth.

... see the next screenshot:

Thursday, February 14, 2019

Using python with documents files.

Today I tested with python version 3.6.4 two python modules: python-docx and openpyxl.
This python modules let us to deal with document files like: docx, xlsx, xlsm, xltx, xltm.
First python module named python-docx is a Python library for creating and updating Microsoft Word (.docx) files.
The documentation of this python module can be found here.
Let's start with a simple example.

C:\Python364>cd Scripts
C:\Python364\Scripts>pip3.6.exe install python-docx
Collecting python-docx
...
Successfully installed python-docx-0.8.10

Let's start with the import step:

C:\Python364>python.exe
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import docx
>>> dir(docx)
['Document', 'ImagePart', 'RT', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '
__name__', '__package__', '__path__', '__spec__', '__version__', 'api', 'blkcntnr', 'compat', 'dml',
 'document', 'enum', 'exceptions', 'image', 'opc', 'oxml', 'package', 'parts', 'section', 'settings'
, 'shape', 'shared', 'styles', 'text']

Let's create a document with this python module and add some text and an image:

C:\Python364>python.exe
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import docx
>>> mydoc = docx.Document()
>>> mydoc.add_paragraph('This is a text')
>>> mydoc.add_picture('icon.png',width=docx.shared.Inches(1),height=docx.shared.Inches(1))
>>> mydoc.save('test.docx')

Another python module is openpyxl.
This python module let you to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.

C:\Python364\Scripts>pip3.6.exe install openpyxl
Collecting openpyxl
...
Successfully installed et-xmlfile-1.0.1 jdcal-1.4 openpyxl-2.6.0

Let's test it:

C:\Python364>python.exe
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from openpyxl import load_workbook
>>> w = load_workbook(filename='test.xlsx',read_only=True)
>>> print(w.sheetnames)
['Sheet1', 'TestSheet2']
>>> s=w['Sheet1']
>>> for row in s.rows:
...     for c in row:
...             print(c.value)
...
A1
None
ABC
None
None
NOP

python-catalin

analitics

Pages