[Python] Running tasks in parallel with multiprocessing
When writing Python, if we need to run several tasks at the same time, we can use the built-in multiprocessing module to execute a function in parallel, with each task running in its own process.
1. Running a function in parallel (completion order is random)
# vi ~/multipro1.py
import multiprocessing

def spawn(num):
    print('Spawned! {}.'.format(num))

if __name__ == "__main__":
    # Start 50 processes; they all run concurrently
    for i in range(50):
        p = multiprocessing.Process(target=spawn, args=(i,))
        p.start()
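The 50 processes are scheduled independently by the operating system, so the numbers are not printed in order. A possible start of the output (the exact order varies from run to run):

Spawned! 1.
Spawned! 0.
Spawned! 3.
Spawned! 2.
...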
2. Running the functions one at a time (in order)
# vi ~/multipro2.py
import multiprocessing

def spawn(num):
    print('Spawned! {}.'.format(num))

if __name__ == "__main__":
    for i in range(50):
        p = multiprocessing.Process(target=spawn, args=(i,))
        p.start()
        # Wait for this process to finish before starting the next one
        p.join()
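Note that calling p.join() inside the loop makes the parent wait for each process before starting the next, so the numbers print in order but nothing actually runs in parallel. If the goal is to keep the parallelism and simply wait for all processes at the end, a common pattern (a minimal sketch, not from the original post) is to start everything first and join afterwards:

import multiprocessing

def spawn(num):
    print('Spawned! {}.'.format(num))

if __name__ == "__main__":
    # Start all processes first so they run concurrently
    procs = [multiprocessing.Process(target=spawn, args=(i,)) for i in range(50)]
    for p in procs:
        p.start()
    # Then wait for every one of them to finish
    for p in procs:
        p.join()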
Tip: the role of if __name__ == "__main__": here:
1. When the script is executed directly, the built-in variable __name__ is set to "__main__", so the if condition holds and the Python interpreter goes on to run the code below it.
2. When another script imports this one, __name__ is set to the module name "multipro1" rather than "__main__", so the if condition fails and the guarded code is never run on import.
So when writing everyday scripts, the structure below is recommended: the script still works normally when run directly, while other Python scripts can import and call test1(), test2(), and test3() from test.py without executing the code under the if guard.
# vi test.py
def test1():
    ...

def test2():
    ...

def test3():
    ...

if __name__ == "__main__":
    ...
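For example, a second script (the file name caller.py is made up here for illustration) can import test.py and call its functions; the block under the if guard is not executed during the import:

# vi caller.py
import test

# Importing test above did not run the guarded block;
# only these explicit calls execute the functions.
test.test1()
test.test2()
test.test3()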
3. Collecting return values with a process pool
# vi multipro3.py
from multiprocessing import Pool

def job(num):
    return num * 2

if __name__ == '__main__':
    p = Pool(processes=20)
    # map() blocks until all workers finish and keeps the input order
    data = p.map(job, range(5))
    data2 = p.map(job, [5, 2])
    p.close()
    print(data)   # [0, 2, 4, 6, 8]
    print(data2)  # [10, 4]
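Since Python 3.3 Pool can also be used as a context manager, which shuts the pool down automatically when the block exits (its __exit__ calls terminate()); an equivalent sketch:

from multiprocessing import Pool

def job(num):
    return num * 2

if __name__ == '__main__':
    # The pool is torn down automatically at the end of the with-block
    with Pool(processes=20) as p:
        print(p.map(job, range(5)))  # [0, 2, 4, 6, 8]
        print(p.map(job, [5, 2]))    # [10, 4]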
4. Crawling random website URLs with a process pool
# vi ~/multipro4.py
from multiprocessing import Pool
import bs4 as bs
import random
import requests
import string

# Return a URL with a random four-letter domain
def random_starting_url():
    starting = ''.join(random.SystemRandom().choice(
        string.ascii_lowercase) for _ in range(4))
    url = ''.join(['http://', starting, '.com'])
    return url

# Make a relative link absolute by prepending the site URL
def handle_local_links(url, link):
    if link.startswith('/'):
        return ''.join([url, link])
    else:
        return link

# Collect the URLs from the "a" tags inside the page's "body" tag
def get_links(url):
    try:
        # Request the URL (the timeout keeps a dead host from blocking the worker forever)
        resp = requests.get(url, timeout=10)
        # Parse the returned HTML
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        # Get the 'body' tag of the HTML
        body = soup.body
        # Get the URL from each "a" tag inside the 'body' tag
        links = [link.get('href') for link in body.find_all('a')]
        # Correct each URL if it is relative
        links = [handle_local_links(url, link) for link in links]
        # Keep the URLs as plain ASCII strings, dropping characters that do not fit
        links = [link.encode('ascii', 'ignore').decode('ascii') for link in links]
        return links
    except TypeError as e:
        print(e)
        print('Got a TypeError, probably got a None that we tried to iterate over')
        return []
    except IndexError as e:
        print(e)
        print('No valid link found, returning an empty list')
        return []
    except AttributeError as e:
        print(e)
        print('Likely got None for links, so we are throwing this one away')
        return []
    except Exception as e:
        print(str(e))
        return []

def main():
    # Number of worker processes
    process = 5
    # Number of sites to scrape
    site = 3
    p = Pool(processes=process)
    # Build the list of random URLs
    parse_us = [random_starting_url() for _ in range(site)]
    # Fetch and parse the URLs in parallel
    data = p.map(get_links, parse_us)
    # Flatten the per-site lists into a single list of URLs
    data = [url for url_list in data for url in url_list]
    p.close()
    # Write the result to a text file
    with open('urls.txt', 'w') as f:
        f.write(str(data))

if __name__ == '__main__':
    main()
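This script needs the third-party packages requests, beautifulsoup4, and lxml (pip install requests beautifulsoup4 lxml). Since most random four-letter .com domains do not resolve, many workers will simply print a connection error and return an empty list. Before running the full crawler, it can help to sanity-check get_links() on one known site, for example by temporarily replacing main() in the guard with a one-off call (the target URL below is only an example, not part of the original script):

# Hypothetical one-off test of get_links()
links = get_links('http://python.org')
print(len(links), 'links found')
print(links[:5])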
Article link: http://www.showerlee.com/archives/2157