2022-05-30
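Capturing a Region of a PDF as an Image and Extracting Its Text
Requirement: the PDF files all share the same layout. For each page, capture the regions marked with red boxes in the original figure as images and extract the text they contain.
Testing the pdfplumber library
First, try pdfplumber to see whether it can extract the text: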
import pdfplumber

with pdfplumber.open("测试文档.pdf") as p:
    page = p.pages[0]
    print(page.extract_text())

Output:
Date of Test : 2020-11-05 R Test Engineer : ? e s KAYSER-THREDE Contact Name : WX u l 00 EVAluation Version: 2.1.7 sample.def ta 1 n t 0 8 Z0 Y, 6 X, g] 40 1 n [ . P o ati20 ag r e e cel o ac0 f J 071H 7 -20 .0; Vo = 15 / 2020-11HEAD00ead Acce 822-75 0-40 3.889 m1-0500E2ACleration -HFC 1080 /s; M = 11 RA / CFC SP 1 Res A_202 g]60 60 kg 1000ultant 0_11_ t [ 0 n 5 ulta40 13 s e _ r0 2 2 5 00 0 F -200 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 rid a time [ms] y , 6 .1 A1 Analysis Interval: 0 - 1000 [ms] naly.202 Max(61 ms) = 72 g; Min(4.3 ms) = 0.04043 g s0 cHoICn t=. A330m7 (s5(55.64. 6-1 6 -6 .539 m.61s )m; Hs)IC =3 665 =.7 340 g7; ( c5u5.m4 .- A 636m.3s m =s 7);0 H.1I8C g15 = 307 (55.4 - 66.3 ms) is: IA 11:2 T3
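This trial shows that pdfplumber handles PDFs containing rotated text very poorly: even text in regions with a normal reading order comes out interleaved with other fragments.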
Region screenshots and region text extraction with PyMuPDF
Official documentation: https://pymupdf.readthedocs.io/en/latest/index.html
GitHub: https://github.com/pymupdf/PyMuPDF
Installation:
pip install pymupdf
Taking a screenshot
First, try capturing the lower-left part of the page:
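from IPython.display import Image, display
import fitz  # PyMuPDF

pdfDoc = fitz.open("测试文档.pdf")
page = pdfDoc[0]
mat = fitz.Matrix(1, 1)  # zoom factors for x and y; e.g. 1.5 enlarges by 1.5x
rect = page.rect
# Lower-left region: left 80% of the width, bottom 13% of the height
clip = fitz.Rect(0, 0.87 * rect.height, rect.width * 0.8, rect.height)
# get_pixmap and tobytes were named getPixmap and getImageData in older releases
pix = page.get_pixmap(matrix=mat, alpha=False, clip=clip)
display(Image(pix.tobytes("png")))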
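The two arguments of fitz.Matrix(1, 1) are the zoom factors for width and height; the image captured above is rather small and can be enlarged by raising them.
fitz.Rect accepts several coordinate forms; here the (x0, y0, x1, y1) form is used to locate the region to capture.
page.get_pixmap (getPixmap in older PyMuPDF releases) takes the zoom matrix and the clip region and returns an image object, whose data can be read directly or written to a file.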
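The clip rectangles in this article are all expressed as fractions of the page size, and the zoom factor is chosen by eye. The following is a minimal sketch in plain Python of both calculations. The helper names fractional_clip and zoom_for_dpi, and the 595 x 842 point page size, are assumptions made for illustration; they are not part of PyMuPDF.

```python
def fractional_clip(page_width, page_height, fx0, fy0, fx1, fy1):
    """Convert fractions of the page size (0.0 to 1.0) into (x0, y0, x1, y1) in points."""
    if not (0 <= fx0 < fx1 <= 1 and 0 <= fy0 < fy1 <= 1):
        raise ValueError("fractions must satisfy 0 <= start < end <= 1")
    return (fx0 * page_width, fy0 * page_height,
            fx1 * page_width, fy1 * page_height)


def zoom_for_dpi(dpi):
    """PDF page units are points (1/72 inch), so a zoom of 1.0 corresponds to 72 dpi."""
    return dpi / 72


# Hypothetical A4-sized page of 595 x 842 points (assumption, not from the test PDF)
lower_left = fractional_clip(595, 842, 0, 0.87, 0.8, 1)
zoom = zoom_for_dpi(144)
print(lower_left, zoom)
```

With a real page, the results could be passed on as fitz.Rect(*lower_left) and fitz.Matrix(zoom, zoom), using page.rect.width and page.rect.height instead of the assumed page size.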
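Next, try capturing the upper-right part:

clip = fitz.Rect(0.8 * rect.width, 0.27 * rect.height, rect.width * 0.9, rect.height)
# prerotate (preRotate in older releases) rotates the rendering matrix; it modifies mat in place
pix = page.get_pixmap(matrix=mat.prerotate(-90), alpha=False, clip=clip)
display(Image(pix.tobytes("png")))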
mat.prerotate(-90) (preRotate in older PyMuPDF releases) rotates the captured region by 90 degrees counterclockwise.
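A quarter-turn rotation also swaps the width and height of the captured region, which is worth keeping in mind when sizing the output. The sketch below is plain Python geometry, not PyMuPDF code; rotated_bbox_size is a hypothetical helper that rotates the corners of a rectangle and measures the resulting bounding box.

```python
import math


def rotated_bbox_size(width, height, degrees):
    """Return (width, height) of the bounding box of a width x height
    rectangle after rotating it by `degrees` about the origin."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    xs = [x * cos_t - y * sin_t for x, y in corners]
    ys = [x * sin_t + y * cos_t for x, y in corners]
    # Round away floating-point noise from cos(90 degrees) not being exactly 0
    return round(max(xs) - min(xs), 6), round(max(ys) - min(ys), 6)


# A tall, narrow strip becomes a wide, short one after a quarter turn
print(rotated_bbox_size(100, 300, -90))
```

A tall strip along the page edge should therefore come out as a wide, short image after the rotation.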
Saving the image is simple: just call save (writeImage in older PyMuPDF releases):

pix.save("tmp.png")
Text extraction
Pass the fitz.Rect of the region whose text should be extracted as the clip argument:
a_text = page.get_text(clip=clip)  # getText in older PyMuPDF releases
print(a_text)
1. Page of J7822-75-HFCA_2020_11_05 13_25 Head Acceleration SP 1 Resultant 11HEAD0000E2ACRA / CFC1000 75 / 2020-11-05 0.0; Vo = 13.889 m/s; M = 1160 kg Friday, 6.11.2020 11:23 Analysis: IAT
The text extracted from this region looks quite good!
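When finer control over what counts as "inside the region" is needed, the selection can also be done by hand on word-level output. The sketch below assumes word tuples in the layout that PyMuPDF's page.get_text("words") returns, (x0, y0, x1, y1, text, block_no, line_no, word_no), and uses synthetic data in place of a real page. The helper words_in_rect is hypothetical, and this is not a description of how the clip argument works internally.

```python
def words_in_rect(words, rect):
    """Keep the text of words whose bounding box lies entirely inside rect.

    words: tuples starting with (x0, y0, x1, y1, text, ...)
    rect:  (x0, y0, x1, y1) in the same page coordinates
    """
    rx0, ry0, rx1, ry1 = rect
    return [
        w[4] for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]


# Synthetic word tuples standing in for page.get_text("words") output
sample_words = [
    (10, 740, 40, 750, "Max(61", 0, 0, 0),
    (45, 740, 60, 750, "ms)", 0, 0, 1),
    (500, 300, 520, 310, "Page", 1, 0, 0),
]
selected = words_in_rect(sample_words, (0, 732, 476, 842))
print(" ".join(selected))
```

The strict containment test could be relaxed to an overlap test if words that straddle the border of the region should be kept as well.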
Now try the lower-left part:

clip = fitz.Rect(0, 0.87 * rect.height, rect.width * 0.8, rect.height)
b_text = page.get_text(clip=clip)  # getText in older PyMuPDF releases
print(b_text)
Max(61 ms) = 72 g; Min(4.3 ms) = 0.04043 g
cont. A3ms(56.61 - 59.61 ms) = 65.74 g; cum. A3ms = 70.18 g
HIC = 307 (55.4 - 66.3 ms); HIC36 = 307 (55.4 - 66.3 ms); HIC15 = 307 (55.4 - 66.3 ms)
Analysis Interval: 0 - 1000 [ms]
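Ordering the text lines
The order of the extracted lines does not seem to match the order of the text on the original page. A custom sort with pandas can restore the original order: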
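import pandas as pd

tmp = pd.DataFrame(b_text.splitlines(), columns=["a"])
# Use the first two characters of each line as the sort key
tmp["b"] = tmp.a.str[:2].astype("category")
# The category order defines the desired line order. set_categories returns
# a new object; its inplace argument is gone in recent pandas releases.
tmp["b"] = tmp.b.cat.set_categories(['An', 're', 'vi', 'Ma', 'co', 'VC', 'ES'])
tmp.sort_values('b', inplace=True)  # lines with an unlisted key sort last
b_text = '\n'.join(tmp.a.to_list())
print(b_text)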
Analysis Interval: 0 - 1000 [ms]
Max(61 ms) = 72 g; Min(4.3 ms) = 0.04043 g
cont. A3ms(56.61 - 59.61 ms) = 65.74 g; cum. A3ms = 70.18 g
HIC = 307 (55.4 - 66.3 ms); HIC36 = 307 (55.4 - 66.3 ms); HIC15 = 307 (55.4 - 66.3 ms)
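The same prefix-based reordering can be done without pandas. The following is a minimal sketch using only the standard library; sort_lines_by_prefix is a hypothetical helper name, and the sample text is the lower-left output shown above, split into the four lines it appears to consist of.

```python
def sort_lines_by_prefix(text, prefix_order):
    """Reorder the lines of text so they follow prefix_order.

    Lines matching no prefix keep their relative order and go last.
    """
    def rank(line):
        for i, prefix in enumerate(prefix_order):
            if line.startswith(prefix):
                return i
        return len(prefix_order)

    # sorted() is stable, so lines with equal rank keep their original order
    return "\n".join(sorted(text.splitlines(), key=rank))


raw = "\n".join([
    "Max(61 ms) = 72 g; Min(4.3 ms) = 0.04043 g",
    "cont. A3ms(56.61 - 59.61 ms) = 65.74 g; cum. A3ms = 70.18 g",
    "HIC = 307 (55.4 - 66.3 ms); HIC36 = 307 (55.4 - 66.3 ms); HIC15 = 307 (55.4 - 66.3 ms)",
    "Analysis Interval: 0 - 1000 [ms]",
])
ordered = sort_lines_by_prefix(raw, ["An", "re", "vi", "Ma", "co", "VC", "ES"])
print(ordered)
```

Because the sort is stable, lines whose prefix is not listed (such as the HIC line here) stay in their extracted order at the end.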
Complete code
import fitz  # pip install PyMuPDF
import os
from IPython.display import Image, display
import pandas as pd

pdf_path = "测试文档.pdf"
os.makedirs("imgs", exist_ok=True)

zoom = 1.3  # zoom factor for both width and height
result = []
with fitz.open(pdf_path) as pdfDoc:
    for i in range(pdfDoc.page_count):  # pageCount in older PyMuPDF releases
        page_num = i + 1
        print("--------------", page_num, "--------------")
        page = pdfDoc[i]
        rect = page.rect

        # Region A: the strip to capture, rendered with a -90 degree rotation
        clip = fitz.Rect(0.8 * rect.width, 0.27 * rect.height,
                         rect.width * 0.9, rect.height)
        mat = fitz.Matrix(zoom, zoom).prerotate(-90)
        pix = page.get_pixmap(matrix=mat, alpha=False, clip=clip)  # render the region as an image
        pix.save(f"imgs/{page_num}_a.png")
        display(Image(pix.tobytes("png")))
        a_text = page.get_text(clip=clip)
        print(a_text)

        # Region B: lower-left part, rendered without rotation. prerotate
        # modifies a matrix in place, so a fresh unrotated matrix is used here.
        clip = fitz.Rect(0, 0.87 * rect.height, rect.width * 0.8, rect.height)
        mat = fitz.Matrix(zoom, zoom)
        pix = page.get_pixmap(matrix=mat, alpha=False, clip=clip)
        pix.save(f"imgs/{page_num}_b.png")
        display(Image(pix.tobytes("png")))
        b_text = page.get_text(clip=clip)

        # Restore the line order of region B using the two-character prefix
        tmp = pd.DataFrame(b_text.splitlines(), columns=["a"])
        tmp["b"] = tmp.a.str[:2].astype("category")
        tmp["b"] = tmp.b.cat.set_categories(
            ['An', 're', 'vi', 'Ma', 'co', 'VC', 'ES'])
        tmp.sort_values('b', inplace=True)
        b_text = '\n'.join(tmp.a.to_list())
        print(b_text)
        result.append((a_text, b_text))

df = pd.DataFrame(result, columns=["A", "B"])
df.to_excel("result.xlsx", index=False)
Output for the first 5 pages:
-------------- 1 --------------
1. Page of J7822-75-HFCA_2020_11_05 13_25 Head Acceleration SP 1 Resultant 11HEAD0000E2ACRA / CFC1000 75 / 2020-11-05 0.0; Vo = 13.889 m/s; M = 1160 kg Friday, 6.11.2020 11:23 Analysis: IAT
Analysis Interval: 0 - 1000 [ms]
Max(61 ms) = 72 g; Min(4.3 ms) = 0.04043 g
cont. A3ms(56.61 - 59.61 ms) = 65.74 g; cum. A3ms = 70.18 g
HIC = 307 (55.4 - 66.3 ms); HIC36 = 307 (55.4 - 66.3 ms); HIC15 = 307 (55.4 - 66.3 ms)
-------------- 2 --------------
2. Page of J7822-75-HFCA_2020_11_05 13_25 Head Acceleration X SP 1 11HEAD0000E2ACXA / CFC1000 75 / 2020-11-05 0.0; Vo = 13.889 m/s; M = 1160 kg Friday, 6.11.2020 11:23 Analysis: IAT
Analysis Interval: 0 - 1000 [ms]
Max(65.5 ms) = 8.15 g; Min(52.2 ms) = -7.426 g
-------------- 3 --------------
3. Page of J7822-75-HFCA_2020_11_05 13_25 Head Acceleration Y SP 1 11HEAD0000E2ACYA / CFC1000 75 / 2020-11-05 0.0; Vo = 13.889 m/s; M = 1160 kg Friday, 6.11.2020 11:23 Analysis: IAT
Analysis Interval: 0 - 1000 [ms]
Max(59.4 ms) = 71.87 g; Min(52 ms) = -9.89 g
-------------- 4 --------------
4. Page of J7822-75-HFCA_2020_11_05 13_25 Head Acceleration Z SP 1 11HEAD0000E2ACZA / CFC1000 75 / 2020-11-05 0.0; Vo = 13.889 m/s; M = 1160 kg Friday, 6.11.2020 11:23 Analysis: IAT
Analysis Interval: 0 - 1000 [ms]
Max(56.5 ms) = 20.39 g; Min(63.6 ms) = -23.43 g
-------------- 5 --------------
5. Page of J7822-75-HFCA_2020_11_05 13_25 Rib Left Upper Displacement Y SP 1 11RIBSLEUPE2DSYC / CFC180 75 / 2020-11-05 0.0; Vo = 13.889 m/s; M = 1160 kg Friday, 6.11.2020 11:23 Analysis: IAT
Analysis Interval: 0 - 1000 [ms]
Max(314.8 ms) = 0.2821 mm; Min(52.9 ms) = -33.24 mm
…