
[Python] Web scraping with Beautiful Soup + Pandas + PyQt5 + Selenium

showerlee 2017-12-07 11:34 PYTHON, Other

Beautiful Soup, pandas, and PyQt5 are a very handy set of Python modules for web scraping.

Beautiful Soup extracts the content we need from parsed HTML source.

pandas is similar in spirit, but it focuses on pulling tabular data out of the page.

PyQt5 serves here as a headless browser: it executes the JavaScript in the page, so we can scrape the values the JS actually renders.

I created a test page under my Flask env and will use these modules for some simple scraping tests against it:

http://flask.showerlee.com/scrapingtest/

Environment

OS:       Windows 7 x64   
Python:   Python3.6.2
Git Bash: Git-2.15.1.2-64-bit

I. Environment setup

1. Install and launch Git Bash

2. Install Python and verify the version

# python -V

Python 3.6.2

3. Install the scraping modules

# python -m pip install beautifulsoup4 lxml pandas html5lib pyqt5 selenium
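
To quickly confirm that all the modules installed cleanly, a one-liner that simply imports each of them (it prints OK only if every import succeeds):

# python -c "import bs4, lxml, pandas, html5lib, PyQt5, selenium; print('OK')"

OK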

II. Beautiful Soup demo

# vi ~/scrap1.py

Tip: io and sys are imported first so we can switch the default standard output encoding to utf-8. This ensures that whatever encoding the scraped page uses, printing the parsed content won't raise a UnicodeEncodeError.

import io
import sys
import bs4 as bs
import urllib.request

# Switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# Fetch the page and decode it as utf-8
sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')

# Parse the page source with BS, using lxml to normalize the markup
soup = bs.BeautifulSoup(sauce, 'lxml')

# Print the full page source
print(soup)

# Print the title tag's source
print(soup.title)

# Print the title tag's name
print(soup.title.name)

# Print the title tag's text
print(soup.title.string)
print(soup.title.text)

# Print the first p tag's source
print(soup.p)

# Print the source of all p tags
print(soup.find_all('p'))

# Print the text of every p tag
for paragraph in soup.find_all('p'):
    print(paragraph.text)

# Print all text on the page
print(soup.get_text())

# Print the text of every a tag
for url in soup.find_all('a'):
    print(url.text)

# Print the href of every a tag
for url in soup.find_all('a'):
    print(url.get('href'))

# Print the nav tag's source
nav = soup.nav
print(nav)

# Print the URLs inside the nav tag
for url in nav.find_all('a'):
    print(url.get('href'))

# Print the p-tag text inside the body tag
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)

# Print the text of every div with class="body"
for div in soup.find_all('div', class_='body'):
    print(div.text)

# Print the table tag's source
table = soup.table
# table = soup.find('table')
print(table)

# Print each table row's cells
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

# python scrap1.py

......
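
As a bridge to the next section: the table rows scraped above can be loaded straight into a pandas DataFrame. A minimal sketch, assuming the test page's table marks its header row with th cells (which the pandas output in the next section suggests it does):

import bs4 as bs
import urllib.request
import pandas as pd

sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')
soup = bs.BeautifulSoup(sauce, 'lxml')
table = soup.table

# th cells become column names; td rows become data rows
headers = [th.text for th in table.find_all('th')]
rows = [[td.text for td in tr.find_all('td')]
        for tr in table.find_all('tr') if tr.find_all('td')]

df = pd.DataFrame(rows, columns=headers or None)
print(df)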

III. Pandas demo

# vi ~/scrap2.py

import pandas as pd

# read_html returns a list of DataFrames, one per table on the page;
# header=0 uses each table's first row as the column names
dfs = pd.read_html(
    'http://flask.showerlee.com/scrapingtest/', header=0)

for df in dfs:
    print(df)

# python scrap2.py

  Program Name  Internet Points    Kittens?
0       Python        932914021  Definitely
1       Pascal              532    Unlikely
2         Lisp             1522   Uncertain
3           D#               12    Possibly
4        Cobol                3         No.
5      Fortran            52124        Yes.
6      Haskell               24        lol.
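
Since read_html returns every table found on the page, its match parameter is useful on pages with several tables: it narrows the list to tables whose text matches a pattern, and the result can be exported directly. A small sketch (the CSV filename is just an example):

import pandas as pd

# match keeps only tables whose text matches the given string/regex
dfs = pd.read_html('http://flask.showerlee.com/scrapingtest/',
                   header=0, match='Program Name')

# Export the first matching table (filename is arbitrary)
dfs[0].to_csv('programs.csv', index=False)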

IV. PyQt5 demo

First, let's skip JS rendering entirely and use Beautiful Soup directly to grab the content of the p tag with class="jstest".

# vi ~/scrap3.py

import io
import sys
import bs4 as bs
import urllib.request

# Switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# Fetch the page and decode it as utf-8
sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')

# Parse the page source with BS, using lxml to normalize the markup
soup = bs.BeautifulSoup(sauce, 'lxml')

js_test = soup.find('p', class_='jstest')

print(js_test.text)

# python scrap3.py

No js loaded

As you can see, what we actually scraped is the tag's content before any JS has run.

Now let's use PyQt5 to scrape the same p tag with class="jstest".

# vi ~/scrap4.py

import bs4 as bs
import sys
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl


class Page(QWebEnginePage):
    """Load a URL in a headless QWebEnginePage and capture the rendered HTML."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous: it hands the rendered HTML to the callback
        self.toHtml(self.Callable)
        # print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('http://flask.showerlee.com/scrapingtest/')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('p', class_='jstest')
    print(js_test.text)


if __name__ == '__main__':
    main()

# python scrap4.py

js loaded successfully

The JS was rendered and scraped successfully.

V. Selenium demo

First, download ChromeDriver from the official site and place it in a driver directory next to the script.

You need the driver release that matches your installed Chrome version: my Chrome is 62.0, so I picked driver version 2.35.

Note: the script is named scrap5.py here; calling it selenium.py would shadow the selenium package and break the import.

# vi scrap5.py

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time, sys, io

# Switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

def scrape():
    chromedriver = r".\driver\chromedriver.exe"
    URL = "http://flask.showerlee.com/scrapingtest/"

    driver = webdriver.Chrome(chromedriver)
    # Move the browser window off-screen instead of showing it
    driver.set_window_position(-10000, 0)

    try:
        driver.get(URL)
        # Give the page's JS time to run before reading the DOM
        time.sleep(10)
        result = driver.execute_script("return document.body.innerHTML")
        soup = BeautifulSoup(result, "lxml")
        print(soup)
    except TimeoutException as e:
        print(e)
    finally:
        driver.quit()

scrape()

# python scrap5.py

...
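
The fixed time.sleep(10) above is simple but slow. Selenium's explicit waits can poll for the JS-rendered content instead; a minimal sketch, keyed off the "js loaded successfully" text the demo page produces:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r".\driver\chromedriver.exe")
try:
    driver.get("http://flask.showerlee.com/scrapingtest/")
    # Wait (up to 10s) until the JS has replaced the placeholder text
    WebDriverWait(driver, 10).until(
        EC.text_to_be_present_in_element(
            (By.CLASS_NAME, "jstest"), "js loaded successfully"))
    print(driver.find_element(By.CLASS_NAME, "jstest").text)
finally:
    driver.quit()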

More documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

http://pandas.pydata.org/pandas-docs/stable/

http://pyqt.sourceforge.net/Docs/PyQt5/

https://sites.google.com/a/chromium.org/chromedriver/downloads

https://chromedriver.storage.googleapis.com/index.html
