Tuesday, June 21, 2016

Scrapy python module - part 001.

To install pip under Python 2.7.8, download get-pip.py over a secure (HTTPS) connection into the Python27 folder.
Then run these commands:

C:\Python27\python.exe get-pip.py
...
C:\Python27\Scripts>pip2.7.exe install urllib3
C:\Python27\Scripts>pip2.7 install requests
C:\Python27\Scripts>pip install Scrapy

Installing Scrapy also pulls in several other Python modules as dependencies:

Successfully built PyDispatcher pycparser
Installing collected packages: cssselect, queuelib, six, enum34, ipaddress, idna, pycparser, cffi, pyasn1, cryptography, pyOpenSSL, w3lib, lxml, parsel, PyDispatcher, zope.interface, Twisted, attrs, pyasn1-modules, service-identity, Scrapy
Successfully installed PyDispatcher-2.0.5 Scrapy-1.1.0 Twisted-16.2.0 attrs-16.0.0 cffi-1.7.0 cryptography-1.4 cssselect-0.9.2 enum34-1.1.6 idna-2.1 ipaddress-1.0.16 lxml-3.6.0 parsel-1.0.2 pyOpenSSL-16.0.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.14 queuelib-1.4.2 service-identity-16.0.0 six-1.10.0 w3lib-1.14.2 zope.interface-4.2.0



>>> import scrapy
>>> print scrapy.version_info
(1, 1, 0)


>>> help(scrapy)
PACKAGE CONTENTS
_monkeypatches
cmdline
command
commands (package)
conf
contracts (package)
contrib (package)
contrib_exp (package)
core (package)
crawler
downloadermiddlewares (package)
dupefilter
dupefilters
exceptions
exporters
extension
extensions (package)
http (package)
interfaces
item
link
linkextractor
linkextractors (package)
loader (package)
log
logformatter
mail
middleware
pipelines (package)
project
resolver
responsetypes
selector (package)
settings (package)
shell
signalmanager
signals
spider
spiderloader
spidermanager
spidermiddlewares (package)
spiders (package)
squeue
squeues
stats
statscol
statscollectors
telnet
utils (package)
xlib (package)
...


C:\Python27>c:\Python27\Scripts\scrapy.exe startproject test_scrapy
New Scrapy project 'test_scrapy', using template directory 'c:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Python27\test_scrapy

You can start your first spider with:
cd test_scrapy
scrapy genspider example example.com

C:\Python27>cd test_scrapy

C:\Python27\test_scrapy>tree
Folder PATH listing
Volume serial number is 9A67-3A80
C:.
└───test_scrapy
    └───spiders

Next, install the pypiwin32 Python module, which provides win32api:
pip install pypiwin32
...
Downloading pypiwin32-219-cp27-none-win_amd64.whl (7.3MB)
100% |################################| 7.3MB 61kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-219

... and then run the Scrapy benchmark:
C:\Python27\Scripts\scrapy.exe bench
2016-06-21 22:45:20 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-06-21 22:45:20 [scrapy] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOG_LEVEL': 'INFO', 'LOGSTATS_INTERVAL': 1}
2016-06-21 22:45:39 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-21 22:45:46 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-21 22:45:46 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-21 22:45:46 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-21 22:45:46 [scrapy] INFO: Spider opened
2016-06-21 22:45:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:48 [scrapy] INFO: Crawled 27 pages (at 1620 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:49 [scrapy] INFO: Crawled 59 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:50 [scrapy] INFO: Crawled 85 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:51 [scrapy] INFO: Crawled 123 pages (at 2280 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:52 [scrapy] INFO: Crawled 149 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:53 [scrapy] INFO: Crawled 181 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:54 [scrapy] INFO: Crawled 211 pages (at 1800 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:55 [scrapy] INFO: Crawled 237 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:56 [scrapy] INFO: Crawled 269 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:57 [scrapy] INFO: Closing spider (closespider_timeout)
2016-06-21 22:45:57 [scrapy] INFO: Crawled 307 pages (at 2280 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 97844,
'downloader/request_count': 317,
'downloader/request_method_count/GET': 317,
'downloader/response_bytes': 469955,
'downloader/response_count': 317,
'downloader/response_status_count/200': 317,
'dupefilter/filtered': 204,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2016, 6, 21, 19, 45, 57, 835000),
'log_count/INFO': 17,
'request_depth_max': 14,
'response_received_count': 317,
'scheduler/dequeued': 317,
'scheduler/dequeued/memory': 317,
'scheduler/enqueued': 6136,
'scheduler/enqueued/memory': 6136,
'start_time': datetime.datetime(2016, 6, 21, 19, 45, 46, 986000)}
2016-06-21 22:45:57 [scrapy] INFO: Spider closed (closespider_timeout)
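
The pages/min figures in the log can be checked by hand. The sketch below is only my reading of the numbers, not Scrapy's own code: it assumes each reported rate is the number of pages gained during one LOGSTATS_INTERVAL (1 second here) scaled up to a minute, and it takes the overall throughput from the start_time, finish_time and response_received_count values in the stats dump. The variable names are mine.

```python
from datetime import datetime

# Cumulative "Crawled N pages" counts copied from the log above,
# one sample per LOGSTATS_INTERVAL (1 second in the bench settings).
crawled = [0, 27, 59, 85, 123, 149, 181, 211, 237, 269, 307]
interval_seconds = 1

# Assumed formula: pages gained in the interval, scaled to 60 seconds.
rates = [
    (current - previous) * 60 // interval_seconds
    for previous, current in zip(crawled, crawled[1:])
]
print(rates)

# Overall throughput, using values from the final stats dictionary.
start_time = datetime(2016, 6, 21, 19, 45, 46, 986000)
finish_time = datetime(2016, 6, 21, 19, 45, 57, 835000)
responses = 317

elapsed = (finish_time - start_time).total_seconds()
pages_per_minute = responses / elapsed * 60
print(round(elapsed, 3), round(pages_per_minute))
```

The computed per-interval rates match the ten values printed in the log (1620, 1920, 1560, ...), and the whole run averages about 1753 pages per minute over 10.849 seconds.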

In the next tutorial I will try to use Scrapy.
If you have ideas about what the next step should be, just leave me a comment.


Thursday, May 19, 2016

News: The new python version 3.6.0.a1

I used the Windows x86-64 executable installer to install this version of Python.
I chose a few options and started the installer.

I read through the list of changes and PEP 0498.
Then I listed all the available Python modules with help('modules'):
Please wait a moment while I gather a list of all available modules...

__future__          aifc                http                setuptools
_ast                antigravity         idlelib             shelve
_bisect             argparse            imaplib             shlex
_bootlocale         array               imghdr              shutil
_bz2                ast                 imp                 signal
_codecs             asynchat            importlib           site
_codecs_cn          asyncio             inspect             smtpd
_codecs_hk          asyncore            io                  smtplib
_codecs_iso2022     atexit              ipaddress           sndhdr
_codecs_jp          audioop             itertools           socket
_codecs_kr          base64              json                socketserver
_codecs_tw          bdb                 keyword             sqlite3
_collections        binascii            lib2to3             sre_compile
_collections_abc    binhex              linecache           sre_constants
_compat_pickle      bisect              locale              sre_parse
_compression        builtins            logging             ssl
_csv                bz2                 lzma                stat
_ctypes             cProfile            macpath             statistics
_ctypes_test        calendar            macurl2path         string
_datetime           cgi                 mailbox             stringprep
_decimal            cgitb               mailcap             struct
_dummy_thread       chunk               marshal             subprocess
_elementtree        cmath               math                sunau
_functools          cmd                 mimetypes           symbol
_hashlib            code                mmap                symtable
_heapq              codecs              modulefinder        sys
_imp                codeop              msilib              sysconfig
_io                 collections         msvcrt              tabnanny
_json               colorsys            multiprocessing     tarfile
_locale             compileall          netrc               telnetlib
_lsprof             concurrent          nntplib             tempfile
_lzma               configparser        nt                  test
_markupbase         contextlib          ntpath              textwrap
_md5                copy                nturl2path          this
_msi                copyreg             numbers             threading
_multibytecodec     crypt               opcode              time
_multiprocessing    csv                 operator            timeit
_opcode             ctypes              optparse            tkinter
_operator           curses              os                  token
_osx_support        datetime            parser              tokenize
_overlapped         dbm                 pathlib             trace
_pickle             decimal             pdb                 traceback
_pydecimal          difflib             pickle              tracemalloc
_pyio               dis                 pickletools         tty
_random             distutils           pip                 turtle
_sha1               doctest             pipes               turtledemo
_sha256             dummy_threading     pkg_resources       types
_sha512             easy_install        pkgutil             typing
_signal             email               platform            unicodedata
_sitebuiltins       encodings           plistlib            unittest
_socket             ensurepip           poplib              urllib
_sqlite3            enum                posixpath           uu
_sre                errno               pprint              uuid
_ssl                faulthandler        profile             venv
_stat               filecmp             pstats              warnings
_string             fileinput           pty                 wave
_strptime           fnmatch             py_compile          weakref
_struct             formatter           pyclbr              webbrowser
_symtable           fractions           pydoc               winreg
_testbuffer         ftplib              pydoc_data          winsound
_testcapi           functools           pyexpat             wsgiref
_testimportmultiple gc                  queue               xdrlib
_testmultiphase     genericpath         quopri              xml
_thread             getopt              random              xmlrpc
_threading_local    getpass             re                  xxsubtype
_tkinter            gettext             reprlib             zipapp
_tracemalloc        glob                rlcompleter         zipfile
_warnings           gzip                runpy               zipimport
_weakref            hashlib             sched               zlib
_weakrefset         heapq               secrets
_winapi             hmac                select
abc                 html                selectors
Formatted string literals are a new kind of string literal, prefixed with 'f', that can contain replacement fields surrounded by curly braces; each field is an expression evaluated at run time.
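
Here is a minimal sketch of formatted string literals, assuming Python 3.6 or newer. The variable names and values are only for illustration.

```python
name = "Python"
version = 3.6

# A replacement field can hold any expression.
greeting = f"{name} {version} adds formatted string literals"
print(greeting)

# Format specifications go after a colon, as with str.format().
ratio = f"{1 / 3:.3f}"
print(ratio)

# Replacement fields can be nested inside the format specification.
width = 10
padded = f"[{name:>{width}}]"
print(padded)
```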
Not every module in the list can be imported on Windows. The crypt module, for example, is Unix-only, so the import fails:
>>> import crypt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python36\lib\crypt.py", line 3, in <module>
    import _crypt
ImportError: No module named '_crypt'
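
A listed module may still fail to import, as the traceback shows. One way to check a module before relying on it is to try the import and catch ImportError. This is a small sketch under that assumption; can_import is a hypothetical helper name of mine, not part of the standard library.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# The result for crypt depends on the platform and the Python version.
for name in ("json", "crypt"):
    print(name, can_import(name))
```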
Some of the changes are described in the whatsnew document.
You can read more and also download the new Python 3.6.0a1 release from here.
Very good work from the development team; they did a great job.

Wednesday, April 20, 2016

News: New PyPy 5.1 released.

The new PyPy 5.1 version comes with new features and bug fixes.
PyPy is a very compliant Python interpreter and supports x86 machines on most common operating systems.
This release adds full support for IBM s390x along with good optimizations.
Take a look here.

Saturday, March 19, 2016

Free ebook from O'Reilly - Functional Programming in Python.

You can download your free ebook from O'Reilly.
The book is Functional Programming in Python by David Mertz, published by O'Reilly and released in June 2015.
David Mertz is a director of the Python Software Foundation, and chair of its Trademarks and Outreach & Education Committees. He wrote the columns Charming Python and XML Matters for IBM developerWorks and the Addison-Wesley book Text Processing in Python. David has spoken at multiple OSCON and PyCon events.
This is the download link.

 

Thursday, February 4, 2016

Testing PyQt4 under Python 3.5.1.

Today I worked with Python 3.5.1.
Most of my source code was written for Python 2.7, so the next step was to use pip3.5 to update and upgrade some Python modules.
I tried to install PyQt4 with pip3.5, but it did not work.
So I used the old way: a whl file from here.
That worked, and most of the scripts ran.
The main problem was OpenGL: trying to use QtOpenGL raises errors.
As a result, it seems to me that this Python 3.5.1 setup currently lacks working OpenGL features.

The shortest source code in Python 3.5.1.

Just type this:

import antigravity

That will open your browser with a comic from the xkcd website.
The antigravity module is not new in Python 3.5.1: it has shipped with the standard library since Python 2.7 and 3.1, so it also works with Python 2.7.

Saturday, January 23, 2016

wmi python module - part 002.

According to Microsoft's MSDN documentation, the Win32_Process WMI class represents a process on an operating system.
Here are all of the properties of the class, including the inherited ones:

class Win32_Process : CIM_Process
{
  string   Caption;
  string   CommandLine;
  string   CreationClassName;
  datetime CreationDate;
  string   CSCreationClassName;
  string   CSName;
  string   Description;
  string   ExecutablePath;
  uint16   ExecutionState;
  string   Handle;
  uint32   HandleCount;
  datetime InstallDate;
  uint64   KernelModeTime;
  uint32   MaximumWorkingSetSize;
  uint32   MinimumWorkingSetSize;
  string   Name;
  string   OSCreationClassName;
  string   OSName;
  uint64   OtherOperationCount;
  uint64   OtherTransferCount;
  uint32   PageFaults;
  uint32   PageFileUsage;
  uint32   ParentProcessId;
  uint32   PeakPageFileUsage;
  uint64   PeakVirtualSize;
  uint32   PeakWorkingSetSize;
  uint32   Priority = NULL;
  uint64   PrivatePageCount;
  uint32   ProcessId;
  uint32   QuotaNonPagedPoolUsage;
  uint32   QuotaPagedPoolUsage;
  uint32   QuotaPeakNonPagedPoolUsage;
  uint32   QuotaPeakPagedPoolUsage;
  uint64   ReadOperationCount;
  uint64   ReadTransferCount;
  uint32   SessionId;
  string   Status;
  datetime TerminationDate;
  uint32   ThreadCount;
  uint64   UserModeTime;
  uint64   VirtualSize;
  string   WindowsVersion;
  uint64   WorkingSetSize;
  uint64   WriteOperationCount;
  uint64   WriteTransferCount;
};
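
The datetime properties above (CreationDate, InstallDate, TerminationDate) are returned by WMI as strings in the CIM datetime layout yyyymmddHHMMSS.mmmmmm+UUU, where the last part is the UTC offset in minutes. Below is a minimal sketch that converts such a string with the standard library only. The helper name parse_cim_datetime and the sample value are hypothetical, and wildcard values (asterisks in some fields) are not handled.

```python
from datetime import datetime, timedelta, timezone


def parse_cim_datetime(value):
    """Convert 'yyyymmddHHMMSS.mmmmmm+UUU' to a timezone-aware datetime."""
    base = datetime.strptime(value[:14], "%Y%m%d%H%M%S")
    microseconds = int(value[15:21])
    # The sign and the three-digit offset in minutes, e.g. '+120' or '-300'.
    offset_minutes = int(value[21:25])
    tz = timezone(timedelta(minutes=offset_minutes))
    return base.replace(microsecond=microseconds, tzinfo=tz)


# Hypothetical sample value, not taken from a real process.
sample = "20160123101500.123456+120"
created = parse_cim_datetime(sample)
print(created.isoformat())
```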
Let's make a simple example with the wmi Python module.
import wmi

c = wmi.WMI()
# Walk every running process and print a few of its properties.
for process in c.Win32_Process():
    name = process.Properties_("Name").Value
    pid = process.Properties_("ProcessId").Value
    parent = process.Properties_("ParentProcessId").Value
    termination = process.Properties_("TerminationDate").Value
    print(name, ' = pid -', pid, '+', parent, '|termination_date-', termination)
The output of this script is:
firefox.exe  = pid - 13788 + 2564 |termination_date- None
explorer.exe  = pid - 1048 + 772 |termination_date- None
sublime_text.exe  = pid - 11404 + 2564 |termination_date- None
plugin_host.exe  = pid - 7432 + 11404 |termination_date- None
cmd.exe  = pid - 9568 + 2564 |termination_date- None
conhost.exe  = pid - 14124 + 9568 |termination_date- None
conhost.exe  = pid - 9700 + 11208 |termination_date- None
Taskmgr.exe  = pid - 9424 + 13404 |termination_date- None
WmiPrvSE.exe  = pid - 9764 + 772 |termination_date- None
SpfService64.exe  = pid - 11908 + 684 |termination_date- None
python.exe  = pid - 1308 + 9568 |termination_date- None
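
The second number on each line is the ParentProcessId, so the output can be grouped into parent and child processes. This is a small sketch using only the rows printed above, with no WMI call. The variable names are mine.

```python
from collections import defaultdict

# (name, pid, parent pid) rows copied from the output above.
rows = [
    ("firefox.exe", 13788, 2564),
    ("explorer.exe", 1048, 772),
    ("sublime_text.exe", 11404, 2564),
    ("plugin_host.exe", 7432, 11404),
    ("cmd.exe", 9568, 2564),
    ("conhost.exe", 14124, 9568),
    ("conhost.exe", 9700, 11208),
    ("Taskmgr.exe", 9424, 13404),
    ("WmiPrvSE.exe", 9764, 772),
    ("SpfService64.exe", 11908, 684),
    ("python.exe", 1308, 9568),
]

# Map each parent pid to the list of its child processes.
children = defaultdict(list)
for name, pid, parent_pid in rows:
    children[parent_pid].append((name, pid))

# Print every listed process that started at least one other listed process.
for name, pid, _ in rows:
    if pid in children:
        started = ", ".join(child for child, _ in children[pid])
        print(f"{name} ({pid}) started: {started}")
```

For example, cmd.exe (9568) is the parent of both conhost.exe and the python.exe that ran the script.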

Friday, January 22, 2016

wmi python module - part 001.

The wmi Python module is named after Windows Management Instrumentation (WMI), Microsoft's implementation of Web-Based Enterprise Management (WBEM), and lets you use it from Python.
WMI is a set of extensions to the Windows Driver Model that provides an operating system interface.
It allows scripting languages like VBScript to manage Microsoft Windows personal computers and servers, both locally and remotely.
You can read about this Python module here.

C:\Python34\Scripts>pip install  wmi
...
Installing collected packages: wmi
Running setup.py install for wmi
warning: install_data: setup script did not provide a directory for 'readme.
txt' -- installing right in 'C:\Python34'
...
Successfully installed wmi
Cleaning up...

Let's try a first example:

C:\Python34>python
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AM
D64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import wmi
>>> remote_process = wmi.WMI (computer="home").new ("Win32_Process")
>>> for i in wmi.WMI ().Win32_OperatingSystem ():
...     print (i.Caption)
...
Microsoft Windows 10 Home

Now let's see another example you can use with the wmi Python module.
This example lets you watch your processes.

import wmi

c = wmi.WMI()
# Block until WMI reports a change to any process, then print its caption.
process_watcher = c.Win32_Process.watch_for("modification")
while True:
    changed_process = process_watcher()
    print(changed_process.Caption)

I used Python version 3.3.5 and Spyder (Scientific PYthon Development EnviRonment) to test the script.
You can change the argument of the .watch_for method to one of: creation, deletion, modification or operation.