[Python] Web scraping with Beautiful Soup + Pandas + PyQt5 + Selenium
Beautiful Soup, pandas, PyQt5, and Selenium together make a very handy set of Python modules for web scraping.
Beautiful Soup extracts the pieces of content we need from parsed HTML source.
Pandas is similar to Beautiful Soup, but it focuses on extracting tables from the page source.
PyQt5 is used here to act like a browser: it executes the JavaScript in the page and lets us scrape the HTML the JS actually produces.
I've set up a test page in my Flask environment and will run some simple scraping tests against it with these modules:
http://flask.showerlee.com/scrapingtest/
Environment
OS: Windows 7 x64
Python: Python3.6.2
Git Bash: Git-2.15.1.2-64-bit
I. Environment setup
1. Install and launch Git Bash
2. Install Python and verify the version
# python -V
Python 3.6.2
3. Install the scraping modules
# python -m pip install beautifulsoup4 lxml pandas html5lib pyqt5 selenium
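A quick way to confirm everything installed correctly is to import each module in one shot (note the PyPI package is selenium, not "selenum", and the import name for pyqt5 is PyQt5). If this prints nothing, the modules are ready:
# python -c "import bs4, lxml, pandas, html5lib, PyQt5, selenium"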
II. Beautiful Soup demo
# vi ~/scrap1.py
Tip: the io and sys modules are imported first to switch the default standard-output encoding to utf-8. This ensures that no matter what encoding the source page uses, printing the results parsed by BS will never raise a UnicodeEncodeError.
import io
import sys
import urllib.request

import bs4 as bs

# switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# fetch the page and decode it as utf-8
sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')

# parse the source with Beautiful Soup, using lxml as the parser
soup = bs.BeautifulSoup(sauce, 'lxml')

# the whole parsed document
print(soup)

# the <title> tag
print(soup.title)

# the tag name of <title>
print(soup.title.name)

# the text inside <title>
print(soup.title.string)
print(soup.title.text)

# the first <p> tag
print(soup.p)

# all <p> tags
print(soup.find_all('p'))

# the text of every <p> tag
for paragraph in soup.find_all('p'):
    print(paragraph.text)

# all text on the page
print(soup.get_text())

# the text of every <a> tag
for url in soup.find_all('a'):
    print(url.text)

# the href of every <a> tag
for url in soup.find_all('a'):
    print(url.get('href'))

# the <nav> tag
nav = soup.nav
print(nav)

# every URL inside <nav>
for url in nav.find_all('a'):
    print(url.get('href'))

# the text of every <p> inside <body>
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)

# every <div> with class="body"
for div in soup.find_all('div', class_='body'):
    print(div.text)

# the <table> tag
table = soup.table
# table = soup.find('table')
print(table)

# the cell texts of every table row
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
# python scrap1.py
......
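As a side note (not used in the script above), Beautiful Soup also understands CSS selectors through select(), which can replace some of the chained find_all() calls. A minimal sketch against the same test page:

import urllib.request

import bs4 as bs

sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')
soup = bs.BeautifulSoup(sauce, 'lxml')

# CSS-selector equivalent of soup.find_all('div', class_='body')
for div in soup.select('div.body'):
    print(div.text)

# every href inside <nav>, equivalent to the nav loop above
for a in soup.select('nav a[href]'):
    print(a['href'])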
III. Pandas demo
# vi ~/scrap2.py
import pandas as pd

# read_html pulls every <table> on the page into a DataFrame;
# header=0 uses the first table row as the column names
dfs = pd.read_html(
    'http://flask.showerlee.com/scrapingtest/', header=0)
for df in dfs:
    print(df)
# python scrap2.py
  Program Name  Internet Points    Kittens?
0       Python        932914021  Definitely
1       Pascal              532    Unlikely
2         Lisp             1522   Uncertain
3           D#               12    Possibly
4        Cobol                3         No.
5      Fortran            52124        Yes.
6      Haskell               24        lol.
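read_html() returns a list of DataFrames, one per <table> on the page, so the usual pandas operations apply afterwards. A short follow-up sketch (the column names are taken from the output above; programs.csv is just an illustrative filename):

import pandas as pd

dfs = pd.read_html('http://flask.showerlee.com/scrapingtest/', header=0)
df = dfs[0]  # the first (and here the only) table

# sort by points and dump the table to a CSV file
print(df.sort_values('Internet Points', ascending=False))
df.to_csv('programs.csv', index=False)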
IV. PyQt5 demo
First, without executing any JS, we use Beautiful Soup directly to grab the content of the <p> tag with class="jstest":
# vi ~/scrap3.py
import io
import sys
import urllib.request

import bs4 as bs

# switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# fetch the page and decode it as utf-8
sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')

# parse the source with Beautiful Soup, using lxml as the parser
soup = bs.BeautifulSoup(sauce, 'lxml')

js_test = soup.find('p', class_='jstest')
print(js_test.text)
# python scrap3.py
No js loaded
As you can see, what we scraped is the tag's content before any JavaScript ran: urllib.request only downloads the raw HTML and never executes the scripts in it.
Now we use PyQt5 to grab the same <p class="jstest"> content:
# vi ~/scrap4.py
import sys

import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication


class Page(QWebEnginePage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        # block here until Callable quits the event loop
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous: it delivers the HTML to a callback
        self.toHtml(self.Callable)
        # print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        # stop the event loop started in __init__ once the HTML arrives
        self.app.quit()


def main():
    page = Page('http://flask.showerlee.com/scrapingtest/')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('p', class_='jstest')
    print(js_test.text)


if __name__ == '__main__':
    main()
# python scrap4.py
js loaded successfully
The JavaScript was executed and its output scraped successfully.
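Because Page blocks inside __init__ until the page has finished loading, it is easy to wrap it into a tiny helper that renders any URL to its post-JS HTML. A sketch, assuming it is appended to scrap4.py after the class definition (render is my own name, not part of PyQt5):

def render(url):
    """Render a page with QtWebEngine and return the HTML after its JS has run."""
    return Page(url).html

soup = bs.BeautifulSoup(render('http://flask.showerlee.com/scrapingtest/'),
                        'html.parser')
print(soup.find('p', class_='jstest').text)

Note that Qt expects a single QApplication per process, so this helper is only safe for one render per run.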
V. Selenium demo
First, download chromedriver from the official site and put it in a driver directory next to the script.
Pick the driver release that matches your installed Chrome version: my Chrome is 62.0, so I chose driver 2.35.
# vi ~/scrap5.py
(Don't name the script selenium.py, or it will shadow the selenium package when imported.)
import io
import sys
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

# switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')


def scrape():
    chromedriver = r".\driver\chromedriver.exe"
    URL = "http://flask.showerlee.com/scrapingtest/"
    try:
        driver = webdriver.Chrome(chromedriver)
        # move the browser window off-screen so it stays out of the way
        driver.set_window_position(-10000, 0)
        driver.get(URL)
        # crude wait: give the page 10 seconds to finish running its JS
        time.sleep(10)
        # grab the DOM as it looks after the JS has run
        result = driver.execute_script(
            "return document.body.innerHTML").encode('utf-8')
    except TimeoutException as e:
        print(e)
        return
    soup = BeautifulSoup(result, "lxml")
    print(soup)
    driver.close()


scrape()
# python scrap5.py
...
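Two optional refinements (my own suggestions, not part of the script above): Chrome 59+ can run headless, which avoids the off-screen set_window_position trick, and WebDriverWait can replace the fixed time.sleep(10) by waiting only until the JS-filled element actually appears. A sketch under those assumptions, reusing the same chromedriver path:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # no visible browser window
options.add_argument('--disable-gpu')  # recommended for headless on Windows

driver = webdriver.Chrome(r".\driver\chromedriver.exe", chrome_options=options)
driver.get("http://flask.showerlee.com/scrapingtest/")

# wait (up to 10s) until the element the page's JS fills in is present
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "jstest")))
print(elem.text)
driver.quit()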
Further reading:
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
http://pandas.pydata.org/pandas-docs/stable/
http://pyqt.sourceforge.net/Docs/PyQt5/
https://sites.google.com/a/chromium.org/chromedriver/downloads
https://chromedriver.storage.googleapis.com/index.html