analitics

Tuesday, June 21, 2016

Scrapy python module - part 001.

To install pip under Python 2.7.8, securely download get-pip.py into the Python27 folder, then run this command:

C:\Python27\python.exe get-pip.py
...
C:\Python27\Scripts>pip2.7.exe install urllib3
C:\Python27\Scripts>pip2.7 install requests
C:\Python27\Scripts>pip install Scrapy

Several Python modules are installed as dependencies:

Successfully built PyDispatcher pycparser
Installing collected packages: cssselect, queuelib, six, enum34, ipaddress, idna, pycparser, cffi, pyasn1, cryptography, pyOpenSSL, w3lib, lxml, parsel, PyDispatcher, zope.interface, Twisted, attrs, pyasn1-modules, service-identity, Scrapy
Successfully installed PyDispatcher-2.0.5 Scrapy-1.1.0 Twisted-16.2.0 attrs-16.0.0 cffi-1.7.0 cryptography-1.4 cssselect-0.9.2 enum34-1.1.6 idna-2.1 ipaddress-1.0.16 lxml-3.6.0 parsel-1.0.2 pyOpenSSL-16.0.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.14 queuelib-1.4.2 service-identity-16.0.0 six-1.10.0 w3lib-1.14.2 zope.interface-4.2.0



>>> import scrapy
>>> print scrapy.version_info
(1, 1, 0)
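Since version_info is a plain tuple, it compares lexicographically, which is handy for quick feature checks. A small sketch (the tuple value is taken from the session above):

```python
# scrapy.version_info is a plain tuple such as (1, 1, 0), so it can be
# compared lexicographically when checking for a minimum version.
version_info = (1, 1, 0)  # value printed in the session above

if version_info >= (1, 1):
    print("Scrapy 1.1 features available")
```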


>>> help(scrapy)
PACKAGE CONTENTS
_monkeypatches
cmdline
command
commands (package)
conf
contracts (package)
contrib (package)
contrib_exp (package)
core (package)
crawler
downloadermiddlewares (package)
dupefilter
dupefilters
exceptions
exporters
extension
extensions (package)
http (package)
interfaces
item
link
linkextractor
linkextractors (package)
loader (package)
log
logformatter
mail
middleware
pipelines (package)
project
resolver
responsetypes
selector (package)
settings (package)
shell
signalmanager
signals
spider
spiderloader
spidermanager
spidermiddlewares (package)
spiders (package)
squeue
squeues
stats
statscol
statscollectors
telnet
utils (package)
xlib (package)
...


C:\Python27>c:\Python27\Scripts\scrapy.exe startproject test_scrapy
New Scrapy project 'test_scrapy', using template directory 'c:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Python27\test_scrapy

You can start your first spider with:
cd test_scrapy
scrapy genspider example example.com

C:\Python27>cd test_scrapy

C:\Python27\test_scrapy>tree
Folder PATH listing
Volume serial number is 9A67-3A80
C:.
└───test_scrapy
└───spiders

Now you need to install win32api support with this Python module:
pip install pypiwin32
...
Downloading pypiwin32-219-cp27-none-win_amd64.whl (7.3MB)
100% |################################| 7.3MB 61kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-219

... and test Scrapy with its bench command:
C:\Python27\Scripts\scrapy.exe bench
2016-06-21 22:45:20 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-06-21 22:45:20 [scrapy] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOG_LEVEL': 'INFO', 'LOGSTATS_INTERVAL': 1}
2016-06-21 22:45:39 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-21 22:45:46 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-21 22:45:46 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-21 22:45:46 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-21 22:45:46 [scrapy] INFO: Spider opened
2016-06-21 22:45:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:48 [scrapy] INFO: Crawled 27 pages (at 1620 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:49 [scrapy] INFO: Crawled 59 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:50 [scrapy] INFO: Crawled 85 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:51 [scrapy] INFO: Crawled 123 pages (at 2280 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:52 [scrapy] INFO: Crawled 149 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:53 [scrapy] INFO: Crawled 181 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:54 [scrapy] INFO: Crawled 211 pages (at 1800 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:55 [scrapy] INFO: Crawled 237 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:56 [scrapy] INFO: Crawled 269 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:57 [scrapy] INFO: Closing spider (closespider_timeout)
2016-06-21 22:45:57 [scrapy] INFO: Crawled 307 pages (at 2280 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 97844,
'downloader/request_count': 317,
'downloader/request_method_count/GET': 317,
'downloader/response_bytes': 469955,
'downloader/response_count': 317,
'downloader/response_status_count/200': 317,
'dupefilter/filtered': 204,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2016, 6, 21, 19, 45, 57, 835000),
'log_count/INFO': 17,
'request_depth_max': 14,
'response_received_count': 317,
'scheduler/dequeued': 317,
'scheduler/dequeued/memory': 317,
'scheduler/enqueued': 6136,
'scheduler/enqueued/memory': 6136,
'start_time': datetime.datetime(2016, 6, 21, 19, 45, 46, 986000)}
2016-06-21 22:45:57 [scrapy] INFO: Spider closed (closespider_timeout)

In the next tutorial I will try to use Scrapy.
If you have some ideas about the next step, just leave me a comment.


Thursday, May 19, 2016

News: The new Python version 3.6.0a1

I used the Windows x86-64 executable installer to install this version of Python.
I chose some settings and started the installer.
I read all the new changes and PEP 0498.
Then I took a look at all the available Python modules:
Please wait a moment while I gather a list of all available modules...

__future__          aifc                http                setuptools
_ast                antigravity         idlelib             shelve
_bisect             argparse            imaplib             shlex
_bootlocale         array               imghdr              shutil
_bz2                ast                 imp                 signal
_codecs             asynchat            importlib           site
_codecs_cn          asyncio             inspect             smtpd
_codecs_hk          asyncore            io                  smtplib
_codecs_iso2022     atexit              ipaddress           sndhdr
_codecs_jp          audioop             itertools           socket
_codecs_kr          base64              json                socketserver
_codecs_tw          bdb                 keyword             sqlite3
_collections        binascii            lib2to3             sre_compile
_collections_abc    binhex              linecache           sre_constants
_compat_pickle      bisect              locale              sre_parse
_compression        builtins            logging             ssl
_csv                bz2                 lzma                stat
_ctypes             cProfile            macpath             statistics
_ctypes_test        calendar            macurl2path         string
_datetime           cgi                 mailbox             stringprep
_decimal            cgitb               mailcap             struct
_dummy_thread       chunk               marshal             subprocess
_elementtree        cmath               math                sunau
_functools          cmd                 mimetypes           symbol
_hashlib            code                mmap                symtable
_heapq              codecs              modulefinder        sys
_imp                codeop              msilib              sysconfig
_io                 collections         msvcrt              tabnanny
_json               colorsys            multiprocessing     tarfile
_locale             compileall          netrc               telnetlib
_lsprof             concurrent          nntplib             tempfile
_lzma               configparser        nt                  test
_markupbase         contextlib          ntpath              textwrap
_md5                copy                nturl2path          this
_msi                copyreg             numbers             threading
_multibytecodec     crypt               opcode              time
_multiprocessing    csv                 operator            timeit
_opcode             ctypes              optparse            tkinter
_operator           curses              os                  token
_osx_support        datetime            parser              tokenize
_overlapped         dbm                 pathlib             trace
_pickle             decimal             pdb                 traceback
_pydecimal          difflib             pickle              tracemalloc
_pyio               dis                 pickletools         tty
_random             distutils           pip                 turtle
_sha1               doctest             pipes               turtledemo
_sha256             dummy_threading     pkg_resources       types
_sha512             easy_install        pkgutil             typing
_signal             email               platform            unicodedata
_sitebuiltins       encodings           plistlib            unittest
_socket             ensurepip           poplib              urllib
_sqlite3            enum                posixpath           uu
_sre                errno               pprint              uuid
_ssl                faulthandler        profile             venv
_stat               filecmp             pstats              warnings
_string             fileinput           pty                 wave
_strptime           fnmatch             py_compile          weakref
_struct             formatter           pyclbr              webbrowser
_symtable           fractions           pydoc               winreg
_testbuffer         ftplib              pydoc_data          winsound
_testcapi           functools           pyexpat             wsgiref
_testimportmultiple gc                  queue               xdrlib
_testmultiphase     genericpath         quopri              xml
_thread             getopt              random              xmlrpc
_threading_local    getpass             re                  xxsubtype
_tkinter            gettext             reprlib             zipapp
_tracemalloc        glob                rlcompleter         zipfile
_warnings           gzip                runpy               zipimport
_weakref            hashlib             sched               zlib
_weakrefset         heapq               secrets
_winapi             hmac                select
abc                 html                selectors
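The listing above comes from help("modules") in the interactive interpreter. The same information can be gathered programmatically, for example:

```python
# Programmatic equivalent of help("modules"): collect the built-in module
# names plus the importable top-level modules found on sys.path.
import pkgutil
import sys

names = sorted(set(sys.builtin_module_names) |
               {info.name for info in pkgutil.iter_modules()})
print(len(names), "modules available")
```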
The new formatted string literals are a new kind of string literal, prefixed with 'f'; they can contain replacement fields surrounded by curly braces.
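A short sketch of the new syntax (Python 3.6+ only):

```python
# PEP 498 formatted string literals: the expressions inside {} are
# evaluated at runtime and formatted into the string.
name = "Python"
version = 3.6
print(f"Hello from {name} {version}!")  # Hello from Python 3.6!
print(f"{2 * 21}")                      # 42
```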
I don't think this Python install went entirely well; maybe it needs a restart:
>>> import crypt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python36\lib\crypt.py", line 3, in <module>
    import _crypt
ImportError: No module named '_crypt' 
Some of the changes can be seen in the What's New document.
You can read more and also download the new Python 3.6.0a1 release from python.org.
Very good work from the development team; they did a great job.