Web Crawler Development

2023-09-07

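I need a dataset for my experiments, so I am writing a crawler to collect it. To get past anti-crawling mechanisms, I use Selenium to drive a real browser.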

Selenium

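Environment

I switched the whole environment over to Linux. On Windows I kept running into bugs and the browser would not open.

Installing Selenium

pip install selenium

For Edge, there is reportedly also this package (it targets Selenium 3 and has not been updated in a long time):

pip install msedge-selenium-tools

The Selenium version ends up a bit older, but it did get things running.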
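Since these notes mix Selenium 3 and Selenium 4 style code, it can help to check which major version is actually installed before picking an API style. The following is a minimal sketch using only the standard library; the helper names `parse_major` and `installed_major` are made up for illustration.

```python
from importlib import metadata
from typing import Optional


def parse_major(version: str) -> int:
    """Return the leading integer of a dotted version string such as '3.141.0'."""
    return int(version.split(".")[0])


def installed_major(package: str = "selenium") -> Optional[int]:
    """Return the installed major version of `package`, or None if it is not installed."""
    try:
        return parse_major(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Selenium major version:", installed_major())
```

A result of 3 suggests the older `find_element_by_*` and `executable_path=` style applies; 4 suggests the `By` and `Service` style.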

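Installing the browser and driver

The Chrome and ChromeDriver versions must match exactly, so it is best to download them together and keep a separate Chrome build dedicated to automated testing.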
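One way to catch a mismatch early is to compare the version strings printed by the browser and the driver. The sketch below assumes the output looks like `Google Chrome 116.0.5845.96` and `ChromeDriver 116.0.5845.96 (...)`, which is what `--version` usually prints; the function names and the sample version numbers are illustrative only.

```python
import re
from typing import Optional, Tuple

# Matches a four-part version number such as 116.0.5845.96.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def extract_version(text: str) -> Optional[Tuple[int, ...]]:
    """Pull the first four-part version number out of a `--version` output line."""
    match = _VERSION_RE.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def versions_match(chrome_output: str, driver_output: str, strict: bool = True) -> bool:
    """Compare browser and driver versions.

    strict=True requires all four parts to be equal;
    strict=False only compares the major version.
    """
    chrome = extract_version(chrome_output)
    driver = extract_version(driver_output)
    if chrome is None or driver is None:
        return False
    if strict:
        return chrome == driver
    return chrome[0] == driver[0]
```

The two input strings could come from running `chrome --version` and `chromedriver --version` through `subprocess.run`; that step is left out here because the binary names differ between systems.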

Chrome for Testing availability (googlechromelabs.github.io)

ChromeDriver - WebDriver for Chrome - Version Selection (google.com)

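Driver configuration

Firefox driver: geckodriver

Chrome driver: chromedriver, CNPM Binaries Mirror (npmmirror.com), taobao mirror as a fallback

IE driver: IEDriverServer

Edge driver: MicrosoftWebDriver (legacy Edge); Chromium-based Edge uses msedgedriver

Opera driver: operadriver

PhantomJS driver: phantomjs

Edge is used as the example here.

Download the archive and extract it to get msedgedriver.exe. Place the file at D:\Tools\edgedriver\msedgedriver.exe, then add that folder to the PATH environment variable so that Selenium can find it.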

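Configuring the driver on Linux

Chrome is used as the example on Linux.

How to Install Selenium Tools on Linux? - GeeksforGeeks

The key step, once you have chromedriver:

sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver

This puts chromedriver in /usr/bin, which is already on PATH.

You can keep it somewhere else, but then you have to add that location to PATH yourself.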
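To confirm that the driver is reachable before starting a crawl, a small standard-library check is enough. This is a sketch, not part of Selenium: `find_driver` and its `extra_dirs` parameter are hypothetical names introduced here.

```python
import os
import shutil
from typing import Optional, Sequence


def find_driver(name: str = "chromedriver", extra_dirs: Sequence[str] = ()) -> Optional[str]:
    """Return the full path to the driver binary, or None if it cannot be found."""
    # First look on the regular PATH, as Selenium would.
    found = shutil.which(name)
    if found:
        return found
    # Then fall back to directories that are not on PATH.
    if extra_dirs:
        return shutil.which(name, path=os.pathsep.join(extra_dirs))
    return None


if __name__ == "__main__":
    print(find_driver() or "chromedriver not found on PATH")
```

If the driver lives outside PATH, Selenium 4 also accepts the path explicitly through `Service(executable_path=...)` from `selenium.webdriver.chrome.service`, passed as `webdriver.Chrome(service=...)`.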

Test (Edge)

from selenium import webdriver
from msedge.selenium_tools import Edge, EdgeOptions
options = EdgeOptions()
options.use_chromium = True
options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" # path to the browser
driver = Edge(options=options, executable_path=r"D:\Tools\edgedriver\msedgedriver.exe") # path to the matching driver

driver.get("http://www.baidu.com")

The browser opened successfully.

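Test (Chrome)

from selenium import webdriver
option = webdriver.ChromeOptions()

option.binary_location = r"C:\Users\任昊\Desktop\SSTudy\Selenium\chrome-win64\chrome.exe"

driver = webdriver.Chrome(options=option)

This eventually opened as well. It failed with an unexplained error at first and later started working, again for no reason I could identify.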

Chrome options

from selenium import webdriver
options = webdriver.ChromeOptions()

# ChromeOptions API overview:
#   options.add_argument(arg)                    add a command-line switch
#   options.add_extension(path)                  add a packed .crx extension
#   options.add_encoded_extension(data)          add a base64-encoded extension
#   options.add_experimental_option(name, value) set an experimental option
#   options.debugger_address = "host:port"       attach to a running browser (a property, not a method)

# Set the user agent
options.add_argument('user-agent="MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"')

# Set the window size (Chrome expects width,height)
options.add_argument('window-size=1920,3000')

# Google's documentation mentions this switch as a workaround for a bug
options.add_argument('--disable-gpu')

# Hide scrollbars, useful for some special pages
options.add_argument('--hide-scrollbars')

# Do not load images, to speed things up
options.add_argument('blink-settings=imagesEnabled=false')

# Run without a visible window. On Linux without a display, startup fails unless this is set
options.add_argument('--headless')

# Disable the sandbox (often needed when running as root)
options.add_argument('--no-sandbox')

# Set the browser binary location manually
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"

# Add a .crx extension (raw string so the backslashes are kept; the file must exist)
options.add_extension(r'd:\crx\AdBlock_v2.17.crx')

# Disable JavaScript
options.add_argument("--disable-javascript")

# Drop the enable-automation switch; reportedly the webdriver property then reads as a normal value
options.add_experimental_option('excludeSwitches', ['enable-automation'])

# Block browser notification pop-ups
prefs = {
    'profile.default_content_setting_values': {
        'notifications': 2
    }
}
options.add_experimental_option('prefs', prefs)

driver = webdriver.Chrome(options=options)

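Browser address-bar commands

Enter the following in the address bar to see the corresponding information:

about:version - shows the current version

about:memory - shows the browser's memory usage on this machine

about:plugins - shows installed plugins

about:histograms - shows internal metrics histograms (not browsing history)

about:dns - shows DNS status

about:cache - shows cached pages

about:gpu - shows whether hardware acceleration is in use

chrome://extensions/ - lists installed extensions

Some of these no longer work, but I am leaving them here for now.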

Other

--user-data-dir="[PATH]"
# Set the User Data directory, so that user data such as bookmarks can live outside the system partition

--disk-cache-dir="[PATH]"
# Set the cache directory

--disk-cache-size=
# Set the maximum cache size, in bytes

--media-cache-size
# Set the maximum media cache size, in bytes

--first-run
# Reset to the initial state, as on first launch

--incognito
# Start in incognito mode

--disable-javascript
# Disable JavaScript; try adding this if pages feel slow

--omnibox-popup-count="num"
# Change the number of suggestions shown in the address-bar popup to num

--user-agent="xxxxxxxx"
# Change the User-Agent string in HTTP request headers; check the result on the about:version page

--disable-plugins
# Disable all plugins, which can improve speed; check the result on the about:plugins page

--disable-java
# Disable Java

--start-maximized
# Start maximized

--no-sandbox
# Disable the sandbox

--single-process
# Run everything in a single process

--process-per-tab
# Use a separate process for each tab

--process-per-site
# Use a separate process for each site

--in-process-plugins
# Run plugins inside the browser process instead of a separate one

--disable-popup-blocking
# Disable the pop-up blocker

--disable-images
# Disable images

--enable-udd-profiles
# Enable the account switching menu

--proxy-pac-url
# Use a PAC proxy

--lang=zh-CN
# Set the language to Simplified Chinese

--bookmark-menu
# Add a bookmark button to the toolbar

--enable-sync
# Enable bookmark sync
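Switches like these are usually collected into a list and then passed to `add_argument` one by one. The sketch below builds such a list from keyword arguments. `build_chrome_args` and `apply_args` are hypothetical helpers, and only a handful of the switches listed above are covered.

```python
from typing import List, Optional


def build_chrome_args(
    headless: bool = False,
    incognito: bool = False,
    lang: Optional[str] = None,
    user_data_dir: Optional[str] = None,
    user_agent: Optional[str] = None,
) -> List[str]:
    """Translate keyword settings into a list of Chrome command-line switches."""
    args: List[str] = []
    if headless:
        args.append("--headless")
    if incognito:
        args.append("--incognito")
    if lang:
        args.append(f"--lang={lang}")
    if user_data_dir:
        args.append(f"--user-data-dir={user_data_dir}")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def apply_args(options, args: List[str]):
    """Add each switch to a ChromeOptions-like object and return that object."""
    for arg in args:
        options.add_argument(arg)
    return options
```

With Selenium installed, `apply_args(webdriver.ChromeOptions(), build_chrome_args(headless=True, lang="zh-CN"))` would give an options object ready to pass to `webdriver.Chrome(options=...)`.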

Selenium-CheatSheet

Selenium Python Tutorial - Zhihu (zhihu.com)

Entering data into text fields and other form inputs

WebElement in Selenium: TextBox, Button, sendkeys(), click() (guru99.com)

Once you have located the WebElement, call:

send_keys("AAA")
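Putting the lookup and the typing together might look like the sketch below. `fill_field` is a hypothetical helper; it only relies on `find_element(by, value)`, `clear()` and `send_keys()`, which exist in both Selenium 3 and 4. The locator is a plain `(strategy, value)` tuple, where the string `"name"` has the same value as `By.NAME`.

```python
def fill_field(driver, locator, text):
    """Locate one element, clear any existing value, and type `text` into it."""
    element = driver.find_element(*locator)  # locator is a (strategy, value) pair
    element.clear()                          # remove whatever the field already contains
    element.send_keys(text)                  # type the new value
    return element


# Example call, assuming a page with an input whose name attribute is "q"
# (the field name is made up for illustration):
# fill_field(driver, ("name", "q"), "AAA")
```

Clearing first avoids appending to a pre-filled value, which is a common surprise with `send_keys` alone.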

Wait until clickable

How To Resolve Element Click Intercepted Exception in Selenium? (qasource.com)

WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
wait.until(ExpectedConditions.elementToBeClickable(element));
element.click();
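The snippet from the linked article is Java. A rough Python equivalent, using Selenium's `WebDriverWait` and `expected_conditions`, could look like this. The function name `click_when_clickable` is made up, and the imports sit inside the function only so that the module loads even where Selenium is not installed.

```python
def click_when_clickable(driver, locator, timeout=10):
    """Wait up to `timeout` seconds for the element to be clickable, then click it.

    `locator` is a (strategy, value) tuple such as ("xpath", "//button").
    Raises selenium.common.exceptions.TimeoutException if the wait expires.
    """
    # Imported lazily so importing this module does not require Selenium.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable(locator)
    )
    element.click()
    return element
```

Waiting on the locator rather than on an element found earlier means the lookup is retried too, which helps when the element is inserted into the page late.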

Demo 1: Scraping an app market (Selenium 3)

from selenium import webdriver
import time
option = webdriver.ChromeOptions()
option.binary_location = r"C:\Users\任昊\Desktop\SSTudy\Selenium\chrome-win64\chrome.exe" 

option.add_argument('blink-settings=imagesEnabled=false') 
# option.add_argument('--headless') 

driver=webdriver.Chrome(options=option)

for page in range(1,27):
    driver.get("https://www.shafa.com/car_apps?page={}".format(str(page)))

    download_btn = "/html/body/div[1]/div[3]/div[?]/div[2]"

    for i in range(1,11):
        btn = driver.find_element_by_xpath(download_btn.replace("?", str(i)))
        # print(btn)
        btn.click()
        time.sleep(20)
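The demo above uses `find_element_by_xpath`, which belongs to the Selenium 3 API; Selenium 4 removed the `find_element_by_*` helpers. The sketch below restructures the same loop around `find_element(by, value)`, which works in both versions. The names `page_url`, `button_xpath` and `crawl` are introduced here for illustration, and the URL and XPath are copied from the demo unchanged.

```python
import time

BASE_URL = "https://www.shafa.com/car_apps?page={}"
BUTTON_XPATH = "/html/body/div[1]/div[3]/div[{}]/div[2]"


def page_url(page: int) -> str:
    """Build the listing URL for one page."""
    return BASE_URL.format(page)


def button_xpath(index: int) -> str:
    """Build the XPath of the index-th download button (1-based)."""
    return BUTTON_XPATH.format(index)


def crawl(driver, pages=range(1, 27), per_page=10, delay=20.0):
    """Visit each page, click every download button, and return the click count."""
    clicks = 0
    for page in pages:
        driver.get(page_url(page))
        for i in range(1, per_page + 1):
            # "xpath" is the value of By.XPATH.
            driver.find_element("xpath", button_xpath(i)).click()
            clicks += 1
            time.sleep(delay)  # give each download time to start
    return clicks
```

Passing the driver in as a parameter keeps the loop independent of how the browser was configured, so the same function works with a plain `webdriver.Chrome` or another compatible driver.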

undetected-chromedriver

ultrafunkamsterdam/undetected-chromedriver: Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva/ Datadadome / CloudFlare IUAM) (github.com)

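This appears to be a modified (patched) ChromeDriver for automating Chrome.

Environment

Ubuntu22 x64
Chrome+Chromedriver linux64

Installation

pip install undetected-chromedriver

Usage

import undetected_chromedriver as uc

# Option 1: point at a specific browser and driver (fill in both paths)
# driver = uc.Chrome(browser_executable_path="",
#                    driver_executable_path="")

# Option 2: headless, with the library's default browser and driver lookup
driver = uc.Chrome(headless=True, use_subprocess=False)
driver.get('https://nowsecure.nl')
driver.save_screenshot('nowsecure.png')
driver.quit()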
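For longer crawls it is easy to leave browser processes behind when an exception interrupts the script. One possible guard is a small context manager that always calls `quit()`. This is a generic sketch, not part of undetected-chromedriver; `managed_driver` is a hypothetical name.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory()` and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises.
        driver.quit()


# Example use with the library from this section (requires Chrome to be installed):
# import undetected_chromedriver as uc
# with managed_driver(lambda: uc.Chrome(headless=True, use_subprocess=False)) as driver:
#     driver.get("https://nowsecure.nl")
#     driver.save_screenshot("nowsecure.png")
```

Taking a factory instead of a ready-made driver keeps browser start-up inside the managed scope, so the cleanup covers everything that happens after the driver exists.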