Web Scraping with BeautifulSoup: Basic Usage and Worked Examples (with Source Code)

Reader submission 658 2022-05-30

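1. Overview of Parsing Libraries
2. Basic Usage of BeautifulSoup
3. Tag Selectors
  3.1 Selecting Elements
  3.2 Getting the Tag Name
  3.3 Getting Attributes
  3.4 Getting Content
  3.5 Nested Selection
4. Child and Descendant Nodes
5. Parent and Ancestor Nodes
6. Sibling Nodes
7. Standard Selectors
  7.1 The text Argument
  7.2 find(name, attrs, recursive, text, **kwargs)
8. CSS Selectors
  8.1 Getting Attributes
  8.2 Getting Content
9. Summary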

1. Overview of Parsing Libraries

2. Basic Usage of BeautifulSoup

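As a minimal sketch of the basic workflow: parse a string, then read from the resulting tree. The markup and variable names are illustrative assumptions, and `html.parser` is used here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: <p>, <body> and <html> are never closed.
html = "<html><head><title>Demo page</title></head><body><p>Hello, <b>world</b>"

# 'html.parser' is bundled with Python; 'lxml' is a common, faster alternative.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
body_text = soup.get_text()      # all text in the document, concatenated

print(soup.prettify())           # re-indented markup with the open tags closed
print(title_text)
print(body_text)
```

The parser repairs the unclosed tags while building the tree, which is why `prettify()` prints a complete document.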

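3. Tag Selectors

3.1 Selecting Elements

    # Sample markup follows the well-known "three sisters" document;
    # the attribute values are assumed.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)        # the <title> tag
    print(type(soup.title))  # <class 'bs4.element.Tag'>
    print(soup.head)         # the <head> tag
    print(soup.p)            # only the first <p> tag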

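Accessing a tag as an attribute (`soup.p`) only ever yields the first match. The following sketch, using made-up markup, contrasts it with `find_all()` and shows what happens when the tag does not exist.

```python
from bs4 import BeautifulSoup

html = "<p>first</p><p>second</p><p>third</p>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p              # shorthand for soup.find("p")
all_p = soup.find_all("p")    # every <p>, in document order
missing = soup.table          # there is no <table>, so this is None

print(first_p.string)
print([p.string for p in all_p])
print(missing)
```

Because a missing tag gives `None`, chaining such as `soup.table.tr` raises `AttributeError` when the first step fails, so it is worth checking for `None` before going deeper.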

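3.2 Getting the Tag Name

    # Sample markup follows the well-known "three sisters" document;
    # the attribute values are assumed.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name)  # the tag's own name: title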


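3.3 Getting Attributes

    # Sample markup follows the well-known "three sisters" document;
    # the attribute values (including name="dromouse") are assumed.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs['name'])  # look the attribute up in the attrs dict
    print(soup.p['name'])        # equivalent shorthand on the tag itself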

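Square-bracket access raises `KeyError` when the attribute is absent, so `.get()` is often the safer choice on real pages. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<p class="title lead" name="intro">Hi</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

tag_name = p.name                 # the tag's name, "p" (not the name attribute)
name_value = p["name"]            # the name attribute; KeyError if it is absent
class_value = p["class"]          # class is multi-valued, so this is a list
missing = p.get("id")             # .get() returns None instead of raising
fallback = p.get("id", "no-id")   # or a default of your choice

print(tag_name, name_value, class_value, missing, fallback)
```

Note the difference between `p.name` (the tag name) and `p["name"]` (an attribute that happens to be called `name`).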

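3.4 Getting Content

    # Sample markup follows the well-known "three sisters" document;
    # the attribute values are assumed.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.string)  # text of the first <p>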

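`.string` only works when a tag has a single string inside it (possibly nested in one child tag). When the tag mixes text and child tags it returns `None`, and `get_text()` is the usual alternative. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<p>Plain text</p><p>Mixed <b>bold</b> text</p>"
soup = BeautifulSoup(html, "html.parser")
simple, mixed = soup.find_all("p")

simple_string = simple.string              # single child string
mixed_string = mixed.string                # None: several children, ambiguous
mixed_text = mixed.get_text()              # all nested text, concatenated
mixed_parts = list(mixed.stripped_strings) # each text piece, whitespace-stripped

print(simple_string)
print(mixed_string)
print(mixed_text)
print(mixed_parts)
```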

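3.5 Nested Selection

    # Sample markup follows the well-known "three sisters" document;
    # the attribute values are assumed.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Each step returns a Tag, so selections can be chained.
    print(soup.head.title.string)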


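4. Child and Descendant Nodes

    # Sample markup is a variant of the "three sisters" document;
    # the attribute values and the <span> wrapper are assumed.
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)  # direct children of the first <p>, as a list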


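    # Sample markup is a variant of the "three sisters" document;
    # the attribute values and the <span> wrapper are assumed.
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)  # a generator object, not a list
    for i, child in enumerate(soup.p.descendants):
        print(i, child)        # every nested node, depth first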

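The difference between `.contents`, `.children`, and `.descendants` is easier to see on a small tree. This sketch uses made-up markup and filters on `Tag` to skip the text nodes:

```python
from bs4 import BeautifulSoup, Tag

html = "<p>Names: <a><span>Elsie</span></a> and <a>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents   # list of direct children, text nodes included
# .children iterates over the same direct children without building a list.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
# .descendants walks the whole subtree depth first.
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(len(direct))
print(child_tags)
print(descendant_tags)
```

Only `.descendants` reaches the `<span>`, because it sits one level below the direct children of `<p>`.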

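5. Parent and Ancestor Nodes

    # Sample markup is a variant of the "three sisters" document;
    # the attribute values and the <span> wrapper are assumed.
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)  # the <p> that directly contains the first <a>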


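    # Sample markup is a variant of the "three sisters" document;
    # the attribute values and the <span> wrapper are assumed.
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # .parents yields every ancestor, from the nearest up to the document.
    print(list(enumerate(soup.a.parents)))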

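Printing every ancestor in full is noisy, because each one contains the next. A sketch that prints only the ancestor names, using made-up markup, may be easier to read:

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p><a>link</a></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

parent_name = a.parent.name                    # the immediate parent
ancestor_names = [t.name for t in a.parents]   # nearest ancestor first
nearest_div = a.find_parent("div")             # first ancestor matching a filter

print(parent_name)
print(ancestor_names)
print(nearest_div.name)
```

The last entry in `ancestor_names` is `[document]`, the name of the `BeautifulSoup` object itself, which sits above `<html>`.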

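6. Sibling Nodes

    # Sample markup is a variant of the "three sisters" document;
    # the attribute values and the <span> wrapper are assumed.
    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.next_siblings)))      # nodes after the first <a>
    print(list(enumerate(soup.a.previous_siblings)))  # nodes before the first <a>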

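Sibling iteration includes text nodes, so the node right after a tag is often just punctuation or whitespace. When only tags are wanted, `find_next_sibling()` and its relatives accept the same filters as `find()`. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<p><a id='one'>1</a>, <a id='two'>2</a> and <a id='three'>3</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a
last = soup.find("a", id="three")

raw_next = first.next_sibling                       # the text node ", "
next_tag_id = first.find_next_sibling("a")["id"]    # skips over text nodes
later_ids = [a["id"] for a in first.find_next_siblings("a")]
earlier_ids = [a["id"] for a in last.find_previous_siblings("a")]  # nearest first

print(repr(raw_next))
print(next_tag_id)
print(later_ids)
print(earlier_ids)
```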

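7. Standard Selectors

7.1 The text Argument

    # Sample markup: class names and ids are assumed.
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Matches text nodes, not tags. text= is the older name of the
    # string= argument (renamed in Beautiful Soup 4.4.0).
    print(soup.find_all(text='Foo'))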

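The text filter also accepts a compiled regular expression, and it can be combined with a tag name to get tags back instead of strings. The sketch below uses the newer `string=` spelling and made-up markup:

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                 # text nodes equal to "Foo"
prefixed = soup.find_all(string=re.compile("^Foo")) # text nodes starting with "Foo"
tags = soup.find_all("li", string="Foo")            # <li> tags whose .string is "Foo"

print(exact)
print(prefixed)
print(tags)
```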

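7.2 find(name, attrs, recursive, text, **kwargs)

find() returns the first matching element; find_all() returns all matching elements.

    # Sample markup: class names and ids are assumed.
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('ul'))        # the first <ul> only
    print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
    print(soup.find('page'))      # None: there is no <page> tag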

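The two methods also differ in what they return when nothing matches, which matters for error handling. A sketch with made-up markup that also shows the `class_`, `attrs`, and `limit` arguments:

```python
from bs4 import BeautifulSoup

html = (
    "<ul id='list-1'>"
    "<li class='element'>Foo</li><li class='element'>Bar</li>"
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li", class_="element")     # class_ avoids the Python keyword
all_li = soup.find_all("li", class_="element")   # list-like ResultSet
limited = soup.find_all("li", limit=1)           # stop after one match
by_attrs = soup.find("ul", attrs={"id": "list-1"})
missing_one = soup.find("table")                 # None when nothing matches
missing_all = soup.find_all("table")             # empty result, never None

print(first_li.string, len(all_li), len(limited))
print(by_attrs["id"], missing_one, len(missing_all))
```

A loop over `find_all()` is therefore always safe, while the result of `find()` should be checked for `None` before it is used.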

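8. CSS Selectors

Pass a CSS selector directly to select() to make the selection.

    # Sample markup: class names and ids are assumed, chosen to match
    # the selectors used below.
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))  # by class, descendant combinator
    print(soup.select('ul li'))                  # by tag name
    print(soup.select('#list-2 .element'))       # by id, then class
    print(type(soup.select('ul')[0]))            # <class 'bs4.element.Tag'>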


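    # Sample markup: class names and ids are assumed.
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # select() also works on a Tag, searching only inside that tag.
    for ul in soup.select('ul'):
        print(ul.select('li'))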

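When only one element is expected, `select_one()` avoids indexing into a list. The sketch below uses made-up markup and a few selector forms beyond the ones shown so far:

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("ul li")            # first match, or None
small = soup.select("ul.list-small > li")   # tag plus class, direct children only
by_id = soup.select("#list-1 li")           # everything under one id
nothing = soup.select_one("table")          # None when nothing matches

print(first.get_text())
print([li.get_text() for li in small])
print([li.get_text() for li in by_id])
print(nothing)
```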

8.1 Getting Attributes

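Elements returned by `select()` are ordinary `Tag` objects, so attributes are read the same way as in section 3.3. A minimal sketch, with markup and variable names that are illustrative assumptions:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

ids = []
classes = []
for ul in soup.select("ul"):
    ids.append(ul["id"])               # shorthand lookup on the tag
    classes.append(ul.attrs["class"])  # same lookup through the attrs dict

print(ids)
print(classes)
```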

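8.2 Getting Content

    # Sample markup: class names and ids are assumed.
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())  # text content of each <li>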

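On real pages the text usually carries stray whitespace. `get_text()` takes a separator and a `strip` flag for that; the sketch below, on made-up markup, shows how the two interact:

```python
from bs4 import BeautifulSoup

html = "<ul><li> Foo </li><li>Bar <b>bold</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
items = soup.select("li")

raw = [li.get_text() for li in items]                     # whitespace kept as-is
stripped = [li.get_text(strip=True) for li in items]      # pieces stripped, then joined
spaced = [li.get_text(" ", strip=True) for li in items]   # joined with a space instead

print(raw)
print(stripped)
print(spaced)
```

With `strip=True` alone the pieces are joined with no separator, so `Bar` and `bold` run together; passing a separator keeps them apart.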

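9. Summary

lxml is the recommended parser; fall back to html.parser when necessary.

Tag selectors (such as soup.p) offer only weak filtering, but they are fast.

Use find() to match a single result and find_all() to match multiple results.

If you are comfortable with CSS selectors, use select().

Remember the common ways of getting attribute values and text.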
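The first recommendation can be sketched as a small helper that prefers lxml and falls back to html.parser when lxml is not installed. The helper name `make_soup` and the markup are hypothetical, introduced only for illustration:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<ul><li class='element'>Foo</li><li class='element'>Bar</li></ul>")

items = [li.get_text() for li in soup.select("li.element")]  # CSS selector
first_class = soup.find("li")["class"]                       # find() plus attribute access

print(items)
print(first_class)
```

The two parsers can repair badly broken markup differently, so it is worth testing against both if the fallback may actually be used.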
