Marker：Windows环境折腾PDF转Markdown

L_ran · 2024 年1 月 10 日 23:24

前言

Marker 是 VikParuchuri 开发的一款将 PDF、EPUB 和 MOBI 转换为 Markdown的工具。据称比nougat快 10 倍，在大多数文档上更准确，并且产生错误的风险较低。

可能大多数人都不需要这玩意儿，毕竟这年头除了程序员谁会用 Markdown 格式啊？
当然还有 Obsidian 折腾型选手！
说实话这年头各种 ocr 准确率已经很高了，只要把 pdf 转成 word，然后复制粘贴进 markdown 文件也一样。
然而痛点在于，专业书中的各种公式，识别率那是惨不忍睹，就算准确率很高，在md文件中也只是一坨数字，还要手动一个个改成 LaTeX 公式。
可能有人会说，你看 pdf 或者纸质书不也一样吗？
我就不，我就要 All-in-One！

Marker 作者提供了在Linux和Mac系统下的安装方法，然而他并没有Windows系统，因此全文操作参照这篇如何在win环境下安装的教程，当然我们中国人有自己的坑。

github.com/VikParuchuri/marker

The way to install VikParuchuri/marker on Windows 10.

opened 03:39PM - 01 Dec 23 UTC

closed 03:38AM - 13 Dec 23 UTC

avan06

The most challenging aspect of installing Marker on Windows lies in the detectro…n2 package developed by Facebook Research. Facebook Research is not very Windows-friendly, and they basically do not support or provide installation guidance for Windows. The following records the process of installing VikParuchuri/marker on Windows 10. ___ To install the detectron2 package on Windows, you need to clone detectron2 and make some modifications before installation: 1. Compilation of detectron2 requires a C/C++ compiler. I have MSVC (Visual Studio 2022) cl.exe in my environment, and you must have a similar C/C++ compiler in your environment. Visual Studio Download: https://visualstudio.microsoft.com/vs/community/ 1. Compilation of detectron2 requires NVIDIA CUDA's nvcc. You must install the CUDA Toolkit first. I installed version 12.3. CUDA Toolkit Download: https://developer.nvidia.com/cuda-downloads 1. The torch package may also need to be installed. I installed the latest version provided by PyTorch: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 1. Install wheel: pip install wheel 1. Clone detectron2: git clone https://github.com/facebookresearch/detectron2.git 1. Fix the "identifier 'single_box_iou_rotated' is undefined" issue by Viliami. (Refer to: https://github.com/facebookresearch/detectron2/issues/1601#issuecomment-651560907) 1. Install the local detectron2. Install detectron2: pip install -e detectron2 If everything goes smoothly, detectron2 should be installed. If there are any issues, you'll need to check the error logs for further investigation. ___ Installing the Windows version of Tesseract and Ghostscript. 1. To install Tesseract OCR on Windows setup tesseract-ocr-w64-setup-5.3.3.20231005.exe or a newer version https://digi.bib.uni-mannheim.de/tesseract/ 1. To install Ghostscript on Windows setup gs10021w64.exe or a newer version https://ghostscript.readthedocs.io/en/gs10.02.0/Install.html ___ Installing the VikParuchuri/marker 1. git clone https://github.com/VikParuchuri/marker.git 1. Remove detectron2 from VikParuchuri/marker/requirements.txt and install it manually using the aforementioned steps 1. nougat in VikParuchuri/marker/requirements.txt is installing to the wrong repository. It needs to be removed from requirements.txt and installed from the repository developed by facebookresearch(https://github.com/facebookresearch/nougat). pip install nougat-ocr 1. Install the missing dependencies. pip install -r requirements.txt pip install ftfy pip install spellchecker pip install pyspellchecker pip install ocrmypdf pip install nltk pip install thefuzz pip uninstall python-magic pip install python-magic-bin

1. 安装3.9版本以上的Python

我就选择3.9了，因为texify要求 Python 版本为 3.9 或更高版本，版本太高的话也可能跟其他的库有不兼容问题。

2. pip换清华源

由于pip的下载速度很慢，所以我们要先把下载方式改成清华镜像。
在“C:\Users[你的用户名]”中建一个名为“pip”的文件夹，再在里面新建一个txt文本，填入如下代码：

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host=mirrors.aliyun.com

之后将txt文件名改为“pip.ini”即可。

3.下载代码

不知道大家的习惯是怎样的，毕竟我不是专业的程序员，可能就随意了一些。因此我首先用git下载了一下这个项目的代码。
git clone https://github.com/VikParuchuri/marker.git
然后切换到目录里面：
cd marker
开始安装detectron2：
git clone https://github.com/facebookresearch/detectron2.git
pip install -e detectron2
原文说需要修复什么错误，但我没遇到也就不管它了。

4. 安装软件

下载安装Visual Studio: https://visualstudio.microsoft.com/vs/community/
（意味不明，说是需要C/C++环境）
安装CUDA : https://developer.nvidia.com/cuda-downloads
有很多相关的教程，这里就不多赘述了, 注意 pytorch 版本要支持对应版本的 CUDA。
安装 tesseract-ocr-w64-setup-5.3.3.20231005.exe https://digi.bib.uni-mannheim.de/tesseract/
安装时有两个选项要下载语言的应该可以不勾，虽然本来就没勾上，但是我觉得没啥就让它下载了，结果下了接近两个小时。
安装 Ghostscript gs10021w64.exe https://ghostscript.readthedocs.io/en/gs10.02.0/Install.html

5. 安装各种依赖项

直接复制到控制台粘贴就好了：

pip install nougat-ocr
pip install wheel
pip install torch 
pip install python-magic
pip install python-magic-bin
pip install ftfy  
pip install spellchecker  
pip install pyspellchecker  
pip install ocrmypdf  
pip install nltk  
pip install thefuzz
pip install ray==2.7.1
pip install PyMuPDF
pip install pydantic
pip install --upgrade fastapi
pip install python-dotenv
pip install pydantic-settings
pip install texify
pip install pyspellchecker
pip install ocrmypdf
pip install scikit-learn
pip install ray

可能还有其它依赖项没装上，这时候直接把报错信息扔给GPT就可以了，它会告诉你需要安装什么。

6. 下载预训练模型

由于特殊的网络原因，大部分人没法连接到huggingface，经过摸索我终于知道怎么用离线模型了。
这个项目主要用了4个预训练模型：

在项目里新建一个叫“vikp”的文件夹，然后把4个预训练模型都下载分别放进各自名字的文件夹里。

正如上文所说，我不是那么专业，并不知道哪些文件可以不用下载，因此我一股脑儿都下了。

7.运行代码

作者给出的转换单本书的示例代码是：
python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10

--max_pages是要转化多少页，我一般都是整本书所以不需要。
--parallel_factor 数字越高占用的显存和CPU越多，默认是1。

因此我想转化一本路径是E:\Windows lock\Downloads\Documents\Econometric Analysis Global Edition (William H. Greene) (Z-Library).pdf的书到同目录下，代码就是：

python convert_single.py "E:\Windows lock\Downloads\Documents\Econometric Analysis Global Edition (William H. Greene) (Z-Library).pdf" "E:\Windows lock\Downloads\Documents\Econometric Analysis.md" --parallel_factor 2

但很快我就发现很多不对劲了，首先这本书很大，1000多页，而且控制台也没显示进度条。等了很久终于来了一句潜在的报错：您的输入文本经过分词后的长度超过了模型设定的最大序列长度（384），可能导致模型无法处理。

然后我仔细看了看，如果要用CUDA还是要自己调的。

首先TORCH_DEVICE: Optional[str] = NONE 要改成 cuda，然后是 INFERENCE_RAM 改成自己显存大小，项目作者原来填的40，可真是有钱呀。
VRAM_PER_TASK 是每个任务分配到的显存，并行任务数乘以这个数最好远远低于你的显存，爆显存了好像进度条就不动了，DEFAULT_LANG 是书所用的语言。

随后我使用 Acrobat 将 pdf 分割成了38份扔到了名为“Econometric Analysis”的文件夹中，或许在此之前应该优化一下文件大小。

再将“Econometric Analysis”文件夹放到项目文件夹下，这样运行代码就可以只使用相对路径。
因此我认为将大 pdf 拆分进行多线程任务是比较方便的，作者给出的示例代码是：
python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000

--workers是一次要转换的 pdf 数量。默认情况下设置为 1，上面也说了，数量不要太多避免超过显存，因此我设置的2
--max是要转换的 pdf 的最大数量。不填的话就会转换文件夹中的所有 pdf 文件，只能说非常好这就是我想要的。
--metadata_file是 json 文件的可选路径，其中包含有关 pdf 的元数据（主要是语言）。如果没有，就会使用默认语言 DEFAULT_LANG。
--min_length是在考虑处理之前需要从 pdf 中提取的最少字符数。这项可以避免对主要是图像的 pdf 进行 OCR 处理。（减慢一切速度）

因此我这边最后的指令就是：
python convert.py "./Econometric Analysis" "./Econometric Analysis" --workers 2 --min_length 10000

最后出来的结果虽然有些乱码，但已经很满意了，毕竟能节省一分时间也是好的。

老实说这真是我离AI最近的时候了。

小哥哥，做ai吗？
你是cv还是nlp？

PlayerMiller · 2024 年1 月 11 日 02:51

作为只有古早 N 卡的非专业没钱仔，根据楼主的描述，说一说可能的注意事项：

CUDA 需要 NVIDIA 卡，先检查自己的卡，如果有 N 卡 CPU，再确认是否有 N 卡 GPU，两种对应的 PyTorch 有区别。
PyTorch 1.3.0 版本后取消了对算力 3.7 以下的支持，见 Pytorch GitHub Issue #31285。在 NVIDIA GPU Compute Capability 查询自己卡的算力。

看了各种帖子，对 PyTorch 如何支持老卡主要有两种方法：

降版本：大 CUDA 和 cuDNN、PyTorch 和 PyTorch 的小 cuda、虚拟环境里的 Python 等，参“全流程各软硬件对照图”。驱动的版本应保持最新，不要降驱动版本。
从源码安装（build from source），耗时较长，步骤较复杂。

全流程各软硬件对照图，用于明确如何选择各软件版本：

相关解释可参 b 站视频“Windows 下 PyTorch 入门深度学习环境安装与配置 CPU GPU 版 | 土堆教程”P9（已复制精准空降），该教程从 P11 开始安装，之前为概念解释，对故障排查很有帮助。

如果不存在老卡问题，除了上面提到的土堆教程，也可参 b 站视频 Python深度学习：安装Anaconda、PyTorch（GPU版）安装。Anaconda 带 Python，不用重复安装 Python。

自己下载 whl，找版本时，一般 cuXXX 是 GPU 版，大小较大，如果只有几百 M，多半下载到了 CPU 版。cpXX 对应虚拟环境中的 Python 版本，如 cp39 指 Python 3.9 可以安装的版本。

以上基于自己经验，如果说的有不对，还望各位大佬包涵指正。

层主本人用整合包尝试过跑 AI（非此话题里的），老卡跑 AI 可能报错 no kernel image is available，即算力、CUDA 版本、PyTorch 版本不匹配，还在尝试老卡跑 AI 软件版本

虽然整合包没跑通，但 no kernel image is available 解决了，大 CUDA 和 cuDNN 10.0，PyTorch 1.3.0，小 cuda 9.2，在 Anaconda Python 3.7 的虚拟环境里用 torch.zeros(1).cuda() 检验不再报错，供参考。

如果要求 Python 3.8 或 3.9 以上，算力 3.7 以下的老卡大概无缘了吧…

RoyZhang · 2024 年1 月 15 日 00:48

在Linux上搞了2天代码，最后还是一直报错说连接不到网站

LeonX86 · 2024 年3 月 25 日 14:12

项目是跑通了，挺不错的项目
但是在后面突然有了灵感如果你的PDF文档是可以修改的（不是图片）可以采取一个非常邪门的方式光速整理好！！
用 Kimi 直接读取pdf，让大模型替你整理好格式和公式就行
写一下我的prompt

整个文档整理成markdown格式给我（包括公式） 所有的公式用latex表示 请使用  $行中公式$  和  $$单行公式$$的格式来写，公式不用添加```的代码块符号

最后在推荐一个免费的公式识别simpletex
调整格式可以用这个latexlive

来自一个人工智能专业的菜鸡

obanki · 2024 年4 月 15 日 10:16

~~要是谁闲着能把它包一下docker就好了…win配置太烦了.~~
看了一下，底下有人已经都可打包了啊, 你们在折腾啥
----他的包好大

github.com/VikParuchuri/marker

Add build-docker-containers.sh script and Dockerfiles for CPU and GPU…

VikParuchuri:master ← robin-collins:docker

opened 03:52AM - 09 Apr 24 UTC

robin-collins

+419 -0

Added cpu.Dockerfile and gpu.Dockerfile to enable containerization of the marker… app. Includes entrypoint.sh to enable pdf to markdown of both a single file, or multiple files. Add .github/workflows to build and push both CPU and GPU containers to ghcr.io Updated the README.md to include quickstart info using the container.

xiahan · 2025 年2 月 23 日 14:34

盖楼
simpletex
现在可以识别pdf了,虽然精度有点小瑕疵
目前免费,但是后来一直卡住,不好,估计要充值才能不卡住

下面这个新的,是免费的,也很快,只要下载一个软件就可以用了
MinerU