Getting Started with Python Web Scraping: The BeautifulSoup Module

Reader submission 816 2022-05-30

BeautifulSoup

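BeautifulSoup is a module that takes an HTML or XML string and parses it into a structured document. You can then use the methods it provides to quickly locate specific elements, which makes finding elements in HTML or XML simple.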

Installation:

pip install beautifulsoup4

Import:

from bs4 import BeautifulSoup

A simple BeautifulSoup example:

soup = BeautifulSoup(text, features="html.parser")

# Return the first matching object
v1 = soup.find("div")
v1 = soup.find(id="i1")
v1 = soup.find("div", id="i1")  # combined

# Return a list of matching objects
v2 = soup.find_all("div")
v2 = soup.find_all(id="i1")
v2 = soup.find_all("div", id="i1")  # combined

tag.text  # get the text
tag.attrs["href"]  # get an attribute (attrs is a dict, so it is indexed, not called)
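
The snippet above leaves `text` and `tag` undefined, so here is a minimal runnable sketch of the same calls. The markup is a made-up sample used only for illustration, and the built-in `html.parser` is chosen so that nothing beyond `beautifulsoup4` is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
text = """
<div id="i1"><a href="/first">first</a></div>
<div id="i2"><a href="/second">second</a></div>
"""

soup = BeautifulSoup(text, features="html.parser")

first_div = soup.find("div")      # first match, or None when nothing matches
all_divs = soup.find_all("div")   # list-like ResultSet with every match
missing = soup.find("table")      # there is no <table> in the sample

link = soup.find("a")
link_text = link.text             # text inside the tag
link_href = link.attrs["href"]    # attrs is a dict, so it is indexed with []

print(first_div["id"], len(all_divs), missing, link_text, link_href)
```

Because `find` returns `None` when there is no match, it is usually worth checking the result before reading attributes from it.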

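Code example

from bs4 import BeautifulSoup

# Sample document; its markup uses the tags, classes and ids that the examples below look up
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")  # the lxml parser needs the lxml package installed

# Find the first a tag
tag1 = soup.find(name='a')
# Find all a tags
tag2 = soup.find_all(name='a')
# Find the tag with id="link2"
tag3 = soup.select('#link2')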

1. name: the tag name

tag = soup.find('a')
name = tag.name  # get
print(name)
tag.name = 'span'  # set
print(soup)

2. attrs: the tag's attributes

tag = soup.find('a')
attrs = tag.attrs  # get the attribute dict
print(attrs)
tag.attrs = {'ik': 123}  # set: replace all attributes
tag.attrs['id'] = 'value'  # set: add or change one attribute
print(soup)
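
As a small sketch of how attribute access behaves in practice, the example below reads, defaults, sets and deletes attributes on a single tag. The markup is a made-up sample; the point it illustrates is that `class` comes back as a list with HTML parsers, and that `get` avoids a `KeyError` for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag sample, used only for this illustration.
html = '<a class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

classes = tag["class"]                 # multi-valued attribute: a list of names
tag_id = tag["id"]                     # single-valued attribute: a string
href = tag.get("href")                 # missing attribute: None instead of KeyError
href_default = tag.get("href", "#")    # or a caller-supplied default

tag["href"] = "http://example.com/elsie"   # add or overwrite one attribute
del tag["id"]                              # remove one attribute

print(classes, tag_id, href, href_default, tag.attrs)
```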

3. children: all direct children

body = soup.find('body')
v = body.children  # iterator over the direct children, text nodes included

4. descendants: all descendants, at every depth

body = soup.find('body')
v = body.descendants  # iterator over children, grandchildren and so on
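
To make the difference between `children` and `descendants` concrete, here is a minimal sketch on a made-up fragment. Both iterators also yield text nodes, so the example filters on `Tag` to list tag names only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, used only for this illustration.
html = "<body><div><p>one</p></div><span>two</span></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# children stops at the first level below <body>
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# descendants walks the whole subtree in document order
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

# Text nodes are part of the walk too: div, p, "one", span, "two"
descendant_count = len(list(body.descendants))

print(child_names, descendant_names, descendant_count)
```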

5. clear: remove everything inside the tag (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

6. decompose: remove the tag together with all of its contents, recursively

body = soup.find('body')
body.decompose()
print(soup)

7. extract: remove the tag together with all of its contents, and return the removed tag

body = soup.find('body')
v = body.extract()
print(soup)
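
The three removal methods are easy to mix up, so the following sketch applies each of them to a fresh copy of the same made-up fragment and records what is left. It assumes nothing beyond `beautifulsoup4` and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, re-parsed before each operation so the results are comparable.
html = "<div><p>keep</p><span>drop <b>me</b></span></div>"

soup = BeautifulSoup(html, "html.parser")
soup.find("span").clear()               # empties <span>, keeps the tag itself
after_clear = str(soup)

soup = BeautifulSoup(html, "html.parser")
soup.find("span").decompose()           # destroys <span> and everything in it
after_decompose = str(soup)

soup = BeautifulSoup(html, "html.parser")
removed = soup.find("span").extract()   # detaches <span> and hands it back
after_extract = str(soup)
removed_markup = str(removed)

print(after_clear)
print(after_decompose)
print(after_extract, removed_markup)
```

`extract` is the one to use when the removed subtree is still needed, for example to re-insert it somewhere else.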

8. decode: convert to a string (including the current tag); decode_contents: the same without the current tag

body = soup.find('body')
v = body.decode()
v = body.decode_contents()
print(v)

9. encode: convert to bytes (including the current tag); encode_contents: the same without the current tag

body = soup.find('body')
v = body.encode()
v = body.encode_contents()
print(v)
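
A short sketch of the four serialisation methods side by side, on a made-up fragment with one non-ASCII character so the difference between `str` and `bytes` is visible. It relies on the default encoding of `encode` being UTF-8.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html = "<body><p>café</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

as_text = body.decode()              # str, including the <body> tag itself
inner_text = body.decode_contents()  # str, only what is inside <body>
as_bytes = body.encode()             # bytes (UTF-8 by default), including <body>
inner_bytes = body.encode_contents() # bytes, only what is inside <body>

print(as_text, inner_text)
print(as_bytes, inner_bytes)
```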

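10. find: get the first matching tag

tag = soup.find('a')
print(tag)

# The next two calls are equivalent; `string` was called `text` before Beautiful Soup 4.4.0
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, string='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, string='Lacie')
print(tag)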

11. find_all: get all matching tags

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)
print(tags)

# `string` was called `text` before Beautiful Soup 4.4.0
tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, string='Lacie')
tags = soup.find_all(name='a', class_='sister', recursive=True, string='Lacie')
print(tags)

####### lists #######
v = soup.find_all(name=['a', 'div'])
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(string=['Tillie'])
print(v, type(v[0]))

v = soup.find_all(id=['link1', 'link2'])
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

####### regular expressions #######
import re

rep = re.compile('p')   # tag names containing "p"
rep = re.compile('^p')  # tag names starting with "p"
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

####### filtering with a function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')

v = soup.find_all(name=func)
print(v)

## get: read a tag attribute
tag = soup.find('a')
v = tag.get('id')
print(v)
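
The filter types listed above (list, regular expression, function, `limit`) can be tried on a small self-contained document. The markup and the helper name `has_class_but_no_href` below are made up for illustration; note that a function filter is applied to every tag, so it also matches the enclosing `<p>`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="other" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A list matches any of its values
by_list = soup.find_all(id=["link1", "link3"])

# A compiled pattern is matched against the attribute value
by_regex = soup.find_all(href=re.compile(r"^http://example\.com/"))

# A function receives each tag and returns True to keep it
def has_class_but_no_href(tag):
    return tag.has_attr("class") and not tag.has_attr("href")

by_func = soup.find_all(has_class_but_no_href)

# limit stops the search after that many matches
limited = soup.find_all("a", limit=2)

print([t["id"] for t in by_list])
print([t["id"] for t in by_regex])
print([t.name for t in by_func])
print(len(limited))
```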

12. has_attr: check whether the tag has a given attribute

tag = soup.find('a')
v = tag.has_attr('id')
print(v)

13. get_text: get the text inside the tag

tag = soup.find('a')
v = tag.get_text()
print(v)
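
`get_text` also accepts a `separator` and a `strip` flag, which helps when a tag contains several pieces of text. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html = "<div><p> Hello </p><p>world</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

raw = div.get_text()                             # pieces concatenated as they are
joined = div.get_text(separator="|", strip=True) # each piece stripped, then joined

print(repr(raw), repr(joined))
```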

14. index: find the position of a child within its parent tag

tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

tag = soup.find('body')
for i, v in enumerate(tag):
    print(i, v)

15. is_empty_element: whether the tag is an empty (void) or self-closing tag

# True for tags such as:
# 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'
tag = soup.find('br')
v = tag.is_empty_element
print(v)

16. Elements related to the current tag

tag.next
tag.next_element
tag.next_elements
tag.next_sibling
tag.next_siblings

tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings

tag.parent
tag.parents

tag.children
tag.descendants

17. Searching among a tag's related elements

tag.find_next(...)
tag.find_all_next(...)
tag.find_next_sibling(...)
tag.find_next_siblings(...)

tag.find_previous(...)
tag.find_all_previous(...)
tag.find_previous_sibling(...)
tag.find_previous_siblings(...)

tag.find_parent(...)
tag.find_parents(...)

# same arguments as find_all
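
One practical difference between the navigation attributes and the `find_*` methods is how they treat whitespace. In the sketch below (made-up markup), `next_sibling` returns the newline between the two `<li>` tags, while `find_next_sibling` skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical list, used only for this illustration.
html = """<ul>
<li id="a">one</li>
<li id="b">two</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li")

raw_next = first.next_sibling              # the "\n" text node after </li>
next_tag = first.find_next_sibling("li")   # the next <li> tag
parent_name = first.parent.name
ancestors = [p.name for p in first.parents]  # ends with the document object itself

print(repr(raw_next), next_tag["id"], parent_name, ancestors)
```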

18. select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Restricting the candidates to tags that have an href attribute.
# Older releases accepted a private _candidate_generator argument for this;
# since Beautiful Soup 4.7.0 select() is handled by Soup Sieve and no longer
# takes it, so the filter is written into the selector instead.
tags = soup.find('body').select("a[href]")
print(type(tags), tags)

tags = soup.find('body').select("a[href]", limit=1)
print(type(tags), tags)
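
A few of the selectors above, run against a small self-contained document so the results can be checked. The markup is a made-up sample; the sketch assumes a Beautiful Soup release that ships with Soup Sieve (4.7.0 or later), which `pip install beautifulsoup4` pulls in automatically.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("head > title")           # direct child
second = soup.select("p > a:nth-of-type(2)")   # second <a> inside <p>
later = soup.select("#link1 ~ .sister")        # all following siblings
adjacent = soup.select("#link1 + .sister")     # only the next sibling
ends_with = soup.select('a[href$="tillie"]')   # attribute suffix match

first_match = soup.select_one("a.sister")      # first match, or None
no_match = soup.select_one("table")

print([t["id"] for t in later], first_match["id"], no_match)
```

`select` always returns a list (possibly empty), whereas `select_one` returns a single tag or `None`, mirroring the relationship between `find_all` and `find`.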

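19. The tag's content

tag = soup.find('span')
print(tag.string)  # get
tag.string = 'new content'  # set
print(soup)

tag = soup.find('body')
print(tag.string)  # None when the tag has more than one child
tag.string = 'xxx'  # replaces everything inside <body>
print(soup)

tag = soup.find('body')
v = tag.stripped_strings  # generator over the text of all nested tags, whitespace stripped
print(list(v))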
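
The behaviour of `.string` depends on how many children a tag has, which the following sketch shows on a made-up fragment: a tag with a single text child returns that text, a tag with several children returns `None`, and `stripped_strings` collects every piece of text either way.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
html = "<div><span>only child</span><p> a <b>b</b> c </p></div>"
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")
p = soup.find("p")

span_string = span.string             # one text child: that text
p_string = p.string                   # several children: None
p_pieces = list(p.stripped_strings)   # every text piece, whitespace stripped

span.string = "new content"           # assignment replaces the tag's contents
span_markup = str(span)

print(span_string, p_string, p_pieces, span_markup)
```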

20. append: append a tag inside the current tag

tag = soup.find('body')
tag.append(soup.find('a'))  # an existing tag is moved, not copied
print(soup)

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.append(obj)
print(soup)

21. insert: insert a tag at a given position inside the current tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.insert(2, obj)
print(soup)
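
As an alternative to constructing `Tag` objects directly, the official documentation creates tags with `soup.new_tag`. The sketch below uses it with `append` and `insert` on a made-up fragment, and also shows that appending a tag that is already in the tree moves it rather than copying it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
soup = BeautifulSoup("<body><p>first</p></body>", "html.parser")
body = soup.find("body")

# Create a new tag through the soup object and add it at the end
new = soup.new_tag("i", id="it")
new.string = "I am new here"
body.append(new)

# Insert another new tag at position 0, i.e. before <p>
lead = soup.new_tag("b")
lead.string = "zero"
body.insert(0, lead)
after_insert = str(soup)

# Appending a tag that is already in the tree moves it to the new position
body.append(lead)
order_after_move = [child.name for child in body.children]

print(after_insert)
print(order_after_move)
```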

22. insert_after, insert_before: insert after or before the current tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.insert_before(obj)
tag.insert_after(obj)  # the same object is moved, so it ends up after <body>
print(soup)

23. replace_with: replace the current tag with the given tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('div')
tag.replace_with(obj)
print(soup)

24. Creating relationships between tags

tag = soup.find('div')
a = soup.find('a')
tag.setup(previous_sibling=a)  # low-level: directly resets the tag's parent, sibling and element links
print(tag.previous_sibling)

25. wrap: wrap the current tag in the given tag

from bs4.element import Tag
obj1 = Tag(name='div', attrs={'id': 'it'})
obj1.string = 'I am new here'
tag = soup.find('a')
v = tag.wrap(obj1)
print(soup)

tag = soup.find('a')
v = tag.wrap(soup.find('p'))
print(soup)

26. unwrap: remove the current tag but keep its contents

tag = soup.find('a')
v = tag.unwrap()
print(soup)
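
`wrap` and `unwrap` are inverse operations, which the following sketch shows on a made-up fragment: the link is first wrapped in a new `<div>`, then the `<a>` tag itself is unwrapped so that only its text remains.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for this illustration.
soup = BeautifulSoup('<p>Go to <a href="/home">home</a></p>', "html.parser")

# wrap() puts the tag inside the given wrapper and returns the wrapper
wrapper = soup.find("a").wrap(soup.new_tag("div"))
after_wrap = str(soup)

# unwrap() removes the tag but leaves its contents in place
soup.find("a").unwrap()
after_unwrap = str(soup)

print(wrapper.name)
print(after_wrap)
print(after_unwrap)
```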

References:

Wu Peiqi: http://www.cnblogs.com/wupeiqi/articles/6283017.html

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
