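When matching page elements with XPath, we often run into a tag that contains further nested tags, when all we want is the text inside it.
For example, suppose we want the text of every p under div[@class='display-content'] shown in the figure. The XPath string(.) function handles this case: it returns the concatenated text of the node and all of its descendants.
Straight to the code: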
# Fetch the article body text
import requests
from lxml import etree

def get_details(url):
    # Request headers copied from a browser session; the Cookie value is session-specific
    headers = {
        'Accept': "*/*",
        'Accept-Encoding': "gzip, deflate",
        'Accept-Language': "zh-CN,zh;q=0.9,en;q=0.8",
        'Cache-Control': "no-cache",
        'Connection': "keep-alive",
        'Cookie': "SUV=1811281936496730; gidinf=x099980109ee0edb269b528280008252b495807e917b; _muid_=1548315571095387; IPLOC=CN4403; reqtype=pc; t=1557797597640; MTV_SRC=10010001",
        'Host': "v2.sohu.com",
        'Origin': "http://m.sohu.com",
        'Pragma': "no-cache",
        'Referer': "http://m.sohu.com/ch/8/?_f=m-index_important_hsdh&spm=smwp.home.nav-ch.1.1557825265945dy1ukUW",
        'User-Agent': "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36",
        'Postman-Token': "46314343-d211-4b4e-8d84-2b20462a5f54"
    }
    response = requests.get(url, headers=headers)
    # Parse the HTML and locate the container that holds the article body
    text = etree.HTML(response.text)
    tt = text.xpath("//div[@class='display-content']")
    # print(tt)
    if not tt:
        # No matching div on this page; return an empty string instead of raising IndexError
        return ""
    # string(.) concatenates the text of the div and every nested tag
    info = tt[0].xpath("string(.)")
    return info
The returned result is shown in the figure.
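To try the idea without hitting the network, here is a minimal sketch that runs string(.) against a small inline HTML snippet and contrasts it with text(). The sample markup and the helper names extract_text and direct_text_nodes are made up for illustration; only the display-content class comes from the example above.

```python
from lxml import etree

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="display-content">
  <p>First <b>bold</b> paragraph.</p>
  <p>Second paragraph with a <a href="#">link</a>.</p>
</div>
"""


def extract_text(html, xpath_expr="//div[@class='display-content']"):
    """Return all text under the first node matched by xpath_expr."""
    tree = etree.HTML(html)
    nodes = tree.xpath(xpath_expr)
    if not nodes:
        return ""
    # string(.) concatenates the text of the node and all its descendants.
    raw = nodes[0].xpath("string(.)")
    # Collapse the newlines and indentation left over from the markup.
    return " ".join(raw.split())


def direct_text_nodes(html):
    """For comparison: text() only yields text that sits directly inside each p."""
    tree = etree.HTML(html)
    pieces = tree.xpath("//div[@class='display-content']/p/text()")
    return [piece.strip() for piece in pieces if piece.strip()]


full_text = extract_text(SAMPLE_HTML)
partial = direct_text_nodes(SAMPLE_HTML)
print(full_text)
print(partial)
```

With text(), the words wrapped in b and a are missing, because they belong to the nested tags rather than to p itself. That gap is the reason for using string(.) here.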