from lxparse import LxParse
lx = LxParse()

list_html = ""
lx.parse_list(list_html)
# Specify a parsing rule explicitly
lx.parse_list(list_html, xpath_list='/div[@id="lx"]/a')

detail_html = ""
lx.parse_detail(detail_html)
# Specify parsing rules; if none are declared, the default rules are used
xpath_item = {
    'xpath_title': '',
    'xpath_source': '',
    'xpath_date': '',
    'xpath_author': '',
    'xpath_content': '',
}
lx.parse_detail(detail_html, item=xpath_item)

parse_detail returns: [image]

Test code

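The demo file contains parsing examples for both list pages and detail pages.
With the HTML saved locally, pages from Toutiao, Sina News, Baidu News, NetEase News, and Tencent News were tested and parsed correctly.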
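
The local-file workflow described here could be sketched roughly as follows. This is a minimal sketch and is not taken from the demo file: the helper name `load_html`, the file name `detail.html`, and the encoding fallback order are all assumptions made for illustration. The `lxparse` call is left in a comment and mirrors the usage shown earlier.

```python
from pathlib import Path


def load_html(path, encodings=("utf-8", "gb18030")):
    """Read a saved HTML page, trying each encoding in turn.

    Hypothetical helper: the fallback order is an assumption, chosen because
    some older Chinese-language pages are GBK-encoded rather than UTF-8.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters instead of failing.
    return raw.decode(encodings[0], errors="replace")


# Usage, assuming lxparse is installed and "detail.html" was saved beforehand:
# from lxparse import LxParse
# result = LxParse().parse_detail(load_html("detail.html"))
```

Reading bytes first and decoding explicitly avoids depending on the platform's default encoding, which may differ from the encoding of the saved page.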

Notes

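If lxparse fails to parse a page correctly, you can specify the parsing rules manually.
There are not many test cases yet, so if you run into problems, please open an issue so we can improve the library together.
Alternatively, follow the WeChat official account "Pythonlx" to get the group chat QR code and join the discussion.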
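
As an illustration of specifying rules manually, the sketch below builds the `xpath_item` dictionary from only the fields you want to override. The helper `build_xpath_item` and the XPath values are hypothetical placeholders, not part of lxparse. It also assumes that an empty string leaves the default rule in effect for that field, which the usage example suggests but does not state outright.

```python
# Field names taken from the xpath_item keys shown in the usage section.
FIELDS = ("title", "source", "date", "author", "content")


def build_xpath_item(**overrides):
    """Return an xpath_item dict with every key present.

    Hypothetical helper: fields that are not overridden are set to '',
    on the assumption that lxparse then falls back to its default rule.
    """
    unknown = set(overrides) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown field(s): {sorted(unknown)}")
    return {f"xpath_{name}": overrides.get(name, "") for name in FIELDS}


# Placeholder XPath expressions; adapt them to the page being parsed.
xpath_item = build_xpath_item(
    title="//h1",
    content='//div[@class="article"]',
)
# Then, assuming lxparse is installed:
# lx.parse_detail(detail_html, item=xpath_item)
```

Rejecting unknown field names early makes a typo such as `titel` fail loudly instead of being silently ignored.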


Keywords

python web crawl HtmlParse
