
Tuesday, June 21, 2016

Scrapy Python module - part 001.

To install pip under Python 2.7.8, securely download get-pip.py into the Python27 folder, then run it with this command:

C:\Python27\python.exe get-pip.py
...
C:\Python27\Scripts>pip2.7.exe install urllib3
C:\Python27\Scripts>pip2.7 install requests
C:\Python27\Scripts>pip install Scrapy

Several Python modules are installed as dependencies:

Successfully built PyDispatcher pycparser
Installing collected packages: cssselect, queuelib, six, enum34, ipaddress, idna, pycparser, cffi, pyasn1, cryptography, pyOpenSSL, w3lib, lxml, parsel, PyDispatcher, zope.interface, Twisted, attrs, pyasn1-modules, service-identity, Scrapy
Successfully installed PyDispatcher-2.0.5 Scrapy-1.1.0 Twisted-16.2.0 attrs-16.0.0 cffi-1.7.0 cryptography-1.4 cssselect-0.9.2 enum34-1.1.6 idna-2.1 ipaddress-1.0.16 lxml-3.6.0 parsel-1.0.2 pyOpenSSL-16.0.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.14 queuelib-1.4.2 service-identity-16.0.0 six-1.10.0 w3lib-1.14.2 zope.interface-4.2.0



>>> import scrapy
>>> print scrapy.version_info
(1, 1, 0)
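
You can also check the installed release from the command line with the version command:

C:\Python27\Scripts\scrapy.exe version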


>>> help(scrapy)
PACKAGE CONTENTS
_monkeypatches
cmdline
command
commands (package)
conf
contracts (package)
contrib (package)
contrib_exp (package)
core (package)
crawler
downloadermiddlewares (package)
dupefilter
dupefilters
exceptions
exporters
extension
extensions (package)
http (package)
interfaces
item
link
linkextractor
linkextractors (package)
loader (package)
log
logformatter
mail
middleware
pipelines (package)
project
resolver
responsetypes
selector (package)
settings (package)
shell
signalmanager
signals
spider
spiderloader
spidermanager
spidermiddlewares (package)
spiders (package)
squeue
squeues
stats
statscol
statscollectors
telnet
utils (package)
xlib (package)
...
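
Many of these modules back command-line features; for example, the shell module provides the interactive scrapy shell, which is handy for testing selectors. A quick sketch (example.com is just a placeholder target):

C:\Python27\Scripts\scrapy.exe shell "http://example.com"
...
>>> response.xpath('//title/text()').extract()
[u'Example Domain']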


C:\Python27>c:\Python27\Scripts\scrapy.exe startproject test_scrapy
New Scrapy project 'test_scrapy', using template directory 'c:\\python27\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Python27\test_scrapy

You can start your first spider with:
cd test_scrapy
scrapy genspider example example.com
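
For reference, running genspider with those arguments generates a spider skeleton roughly like this (a sketch; the exact template can differ between Scrapy versions):

# test_scrapy/spiders/example.py - generated skeleton (approximate)
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # fill in the scraping logic here
        pass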

C:\Python27>cd test_scrapy

C:\Python27\test_scrapy>tree
Folder PATH listing
Volume serial number is 9A67-3A80
C:.
└───test_scrapy
└───spiders
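
Note that tree without the /F switch lists only folders; the generated project also contains files, roughly:

test_scrapy/
    scrapy.cfg
    test_scrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py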

Now you need to install win32api support, which is provided by the pypiwin32 Python module:
pip install pypiwin32
...
Downloading pypiwin32-219-cp27-none-win_amd64.whl (7.3MB)
100% |################################| 7.3MB 61kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-219
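
A minimal check that the win32api bindings load (in the same Python 2.7 interpreter):

>>> import win32api
>>> win32api.GetVersion()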

... and test Scrapy with the bench command, which spawns a local HTTP server and crawls it at maximum speed:
C:\Python27\Scripts\scrapy.exe bench
2016-06-21 22:45:20 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
2016-06-21 22:45:20 [scrapy] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOG_LEVEL': 'INFO', 'LOGSTATS_INTERVAL': 1}
2016-06-21 22:45:39 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-21 22:45:46 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-21 22:45:46 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-21 22:45:46 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-21 22:45:46 [scrapy] INFO: Spider opened
2016-06-21 22:45:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:48 [scrapy] INFO: Crawled 27 pages (at 1620 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:49 [scrapy] INFO: Crawled 59 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:50 [scrapy] INFO: Crawled 85 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:51 [scrapy] INFO: Crawled 123 pages (at 2280 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:52 [scrapy] INFO: Crawled 149 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:53 [scrapy] INFO: Crawled 181 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:54 [scrapy] INFO: Crawled 211 pages (at 1800 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:55 [scrapy] INFO: Crawled 237 pages (at 1560 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:56 [scrapy] INFO: Crawled 269 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:57 [scrapy] INFO: Closing spider (closespider_timeout)
2016-06-21 22:45:57 [scrapy] INFO: Crawled 307 pages (at 2280 pages/min), scraped 0 items (at 0 items/min)
2016-06-21 22:45:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 97844,
'downloader/request_count': 317,
'downloader/request_method_count/GET': 317,
'downloader/response_bytes': 469955,
'downloader/response_count': 317,
'downloader/response_status_count/200': 317,
'dupefilter/filtered': 204,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2016, 6, 21, 19, 45, 57, 835000),
'log_count/INFO': 17,
'request_depth_max': 14,
'response_received_count': 317,
'scheduler/dequeued': 317,
'scheduler/dequeued/memory': 317,
'scheduler/enqueued': 6136,
'scheduler/enqueued/memory': 6136,
'start_time': datetime.datetime(2016, 6, 21, 19, 45, 46, 986000)}
2016-06-21 22:45:57 [scrapy] INFO: Spider closed (closespider_timeout)
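
With the project and the generated spider in place, a real crawl can be started from the project folder (a sketch; the skeleton's parse() is still empty, so the spider only fetches the start page):

C:\Python27\test_scrapy>C:\Python27\Scripts\scrapy.exe crawl example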

In the next tutorial I will try to use Scrapy for an actual crawl.
If you have ideas about the next step, just leave a comment.