nitro-python-crawler

nitro-python-crawler is a web crawler that crawls websites and extracts structured data from their pages.

Overview

Multi-threading
Free-proxy
Rotate IP and User-agents
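
As a rough illustration of what rotating User-Agents across threads can look like, here is a minimal sketch. It is not the code in crawler.py: the names `build_headers`, `crawl_all`, and the `fetch` callback are hypothetical, the User-Agent strings are sample values, and proxy rotation is left out.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample values; the project keeps its own list in user_agent_list.txt.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)",
]


def build_headers(user_agents, rng=random):
    """Pick one User-Agent at random and return it as request headers."""
    return {"User-Agent": rng.choice(user_agents)}


def crawl_all(urls, fetch, max_workers=4):
    """Call fetch(url, headers) for every URL on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda url: fetch(url, build_headers(USER_AGENTS)), urls))
```

With requests installed, `fetch` could be something like `lambda url, headers: requests.get(url, headers=headers, timeout=10).text`.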

Requirements

Python 2.7 or Python 3.4+
Works on Linux and macOS

Required Python3 Modules

requests
lxml
beautifulsoup4

Install

Install modules

Python 2.7.9+ and 3.4+ ship with pip

On Ubuntu (and similar Linux systems):

$ sudo pip3 install requests
$ sudo pip3 install lxml
$ sudo pip3 install beautifulsoup4

Git clone

$ git clone https://github.com/heehomoon/nitro-python-crawler.git

How to use

Put the URLs to crawl in url_list.txt, one per line

$ vi url_list.txt

    https://www.amazon.com/dp/B0054LHI5A
    https://www.amazon.com/dp/B01LZ3RLPC
    https://www.amazon.com/dp/B00Y2CQRZY
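
A file in this shape can be read with a few lines of Python 3. This is a sketch only: `load_urls` is a hypothetical helper, and the way crawler.py actually reads the file may differ.

```python
def load_urls(path="url_list.txt"):
    """Return the non-empty lines of the file, stripped of surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Blank lines and leading indentation are ignored, so the list can be spaced out for readability.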

Create an extractor method

$ vi extractor.py

    def getProdcutTitle(self, soup):
        # Amazon uses a different element id depending on the product type,
        # so try each known id in turn and keep the first match.
        title = ""

        for element_id in ('productTitle', 'ebooksProductTitle', 'fineArtTitle'):
            element = soup.find('span', {'id': element_id})
            if element:
                title = element.text
                break

        # Returns an empty string when none of the ids is present.
        return title.strip()
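
The lookup that an extractor method relies on can be tried in isolation on a small HTML snippet. The sketch below assumes beautifulsoup4 is installed; `SAMPLE_HTML` is made-up markup, not a real Amazon page, and the built-in `html.parser` is used so that lxml is not needed for the test.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the element the extractor looks for.
SAMPLE_HTML = """
<html><body>
  <span id="productTitle">
    Example Product
  </span>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
element = soup.find("span", {"id": "productTitle"})
title = element.text.strip() if element else ""
print(title)
```

Because `find()` returns `None` for a missing element, an extractor should check the result before reading `.text`.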

Execute

$ python3 crawler.py
