Tinkering Notes

Ops!! Ops!!

Scraping dynamic pages with Playwright

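Dynamic pages

Some pages load part of their data through XHR. To scrape that data you need the URL of the XHR call. Such a URL usually carries a token that expires after a while, so it cannot be hard-coded; the requests the page issues have to be captured at run time.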

XHR URL containing a token

https://****/****?Id=5&token=99aeacc27dc64c1124f1e25dc0666c10
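
To make the structure of such a URL concrete, here is a minimal sketch that splits it into its query parameters using only the Python standard library. The host `example.invalid` and the path `/data` are placeholders, not the masked site above; only the query string is taken from the sample. `query_params` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url: str) -> dict:
    """Return the query-string parameters of a URL as a flat dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list of values per key; keep the first of each.
    return {key: values[0] for key, values in parsed.items()}


# Placeholder host and path; the query string mirrors the sample URL.
params = query_params(
    "https://example.invalid/data?Id=5&token=99aeacc27dc64c1124f1e25dc0666c10"
)
print(params["Id"], params["token"])
```

Note that the parameter names are case-sensitive: the key is `Id`, not `id`.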

Playwright

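Playwright is a browser automation tool developed by Microsoft. Its network events let you monitor the requests a page sends and the responses it receives.

A small modification to the network-event sample code from the documentation is enough for this purpose: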

  1. Listen only for response events.
  2. If response.url contains the keyword, print that URL.
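from playwright.sync_api import sync_playwright

# Substring that identifies the XHR URL; matching is case-sensitive
KEYWORD = "token"


def on_response(response) -> None:
    # Print only the URLs of responses that contain the keyword
    if KEYWORD in response.url:
        print(response.url)


def run(playwright) -> None:
    # browser = playwright.chromium.launch(headless=False)
    browser = playwright.chromium.launch()
    context = browser.new_context()

    # Open new page
    page = context.new_page()

    # page.on("request", lambda request: print(request.url))
    page.on("response", on_response)

    # Go to the page, then wait until its network traffic settles
    # so that XHR calls fired after the load event are still seen
    page.goto("https://****/****")
    page.wait_for_load_state("networkidle")

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)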
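
As a variation on the listener above, the sketch below waits for the first matching response and returns its URL together with its decoded JSON body, instead of printing every match. It is only one possible approach, not the method used in this post. `is_target` and `fetch_xhr_json` are hypothetical helper names, and matching on a `token` query parameter is an assumption based on the sample URL. It relies on Playwright's `page.expect_response`, which accepts a predicate and blocks until a response satisfies it or the timeout elapses.

```python
from urllib.parse import urlsplit, parse_qs


def is_target(url: str, param: str = "token") -> bool:
    """True when the URL's query string carries the given parameter."""
    return param in parse_qs(urlsplit(url).query)


def fetch_xhr_json(page_url: str, param: str = "token", timeout_ms: int = 30_000):
    """Open page_url and return (url, json) of the first matching response."""
    # Imported here so is_target stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        try:
            page = browser.new_page()
            # Block until a response satisfies the predicate (or time out).
            with page.expect_response(
                lambda r: is_target(r.url, param), timeout=timeout_ms
            ) as info:
                page.goto(page_url)
            response = info.value
            # response.json() raises if the body is not valid JSON.
            return response.url, response.json()
        finally:
            browser.close()
```

Calling `fetch_xhr_json("https://****/****")` with the real page address would return the tokenized URL and its payload in one step. Matching on a parsed query parameter rather than a bare substring avoids false hits on unrelated URLs that merely contain the same letters.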

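Follow-up

  1. Power BI can run a Python script that returns a dataset. Since response.json() returns the response body as parsed JSON, loading it into pandas and handing the result back to Power BI should yield the dataset directly.
  2. Playwright also supports C#, so in theory VSTO could be used to return the data to Excel.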
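
The first follow-up idea could look roughly like the sketch below. The payload shape is an assumption: the real XHR response is not shown in this post, so the `data` key and the sample records are made up for illustration, and `to_rows` and `to_frame` are hypothetical helper names. Power BI's Python script source imports pandas DataFrames defined in the script, so the last step would be to assign the result of `to_frame` to a top-level variable.

```python
def to_rows(payload, key="data"):
    """Normalize a decoded JSON payload into a list of row dicts."""
    if isinstance(payload, dict):
        # Assumed layout: the records sit in a list under `key`.
        payload = payload.get(key) or []
    return [row for row in payload if isinstance(row, dict)]


def to_frame(payload, key="data"):
    """Build a pandas DataFrame from the payload (requires pandas)."""
    import pandas as pd  # imported lazily so to_rows works without pandas

    return pd.DataFrame(to_rows(payload, key))


# Made-up payload standing in for what response.json() might return.
sample = {"data": [{"Id": 5, "value": 1.5}, {"Id": 6, "value": 2.0}]}
rows = to_rows(sample)
print(len(rows))
```

If the real payload nests its records differently, only `to_rows` needs to change.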