Update README.md

README.md (CHANGED)

@@ -46,222 +46,7 @@ Overview of Infinity-Parser training framework. Our model is optimized via reinf
# Quick Start

```shell
conda create -n Infinity_Parser python=3.11
conda activate Infinity_Parser

git clone https://github.com/infly-ai/INF-MLLM.git
cd INF-MLLM/Infinity-Parser

# Install PyTorch; see https://pytorch.org/get-started/previous-versions/ for your CUDA version
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install .
```
Before starting, make sure that **PyTorch** is correctly installed according to the official installation guide at [https://pytorch.org/](https://pytorch.org/).
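A quick way to confirm the install actually sees your GPU; a minimal check, assuming `torch` imports cleanly:

```python
import torch

# Report the installed build and whether a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

If `is_available()` prints `False`, reinstall PyTorch with the wheel matching your CUDA version before continuing.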
## Download Model Weights

```shell
pip install -r requirements.txt

python3 tools/download_model.py
```
## vLLM Inference

We recommend using the vLLM backend for accelerated inference. It supports image and PDF inputs, automatically parses the document content, and exports the results in Markdown format to a specified directory.

```shell
parser --model /path/model --input dir/PDF/Image --output output_folders --batch_size 128 --tp 1
```

Adjust the tensor-parallelism value (`--tp`: 1, 2, or 4) and the batch size according to the number of GPUs and the available memory.
<details>
<summary>Result folder contents</summary>

The result folder contains the following contents:

```
output_folders/
├── <file_name>/output.md
├── ...
├── ...
```

</details>
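Given the layout above, the per-document results can be gathered with a short script. This is a sketch: `output_folders` is the directory passed to `--output`, and `collect_results` is a hypothetical helper name, not part of the package:

```python
from pathlib import Path

def collect_results(output_dir):
    # Map each parsed document's name to the contents of its output.md.
    return {
        md.parent.name: md.read_text(encoding="utf-8")
        for md in sorted(Path(output_dir).glob("*/output.md"))
    }

# Usage: collect_results("output_folders")
```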
### Online Serving

<details>
<summary>Example</summary>

- Launch the vLLM server:

```shell
vllm serve /path/to/model --tensor-parallel-size=4 --served-model-name=Infinity_Parser
```

- Python client example:
```python
import base64

from openai import OpenAI

prompt = r'''You are an AI assistant specialized in converting PDF images to Markdown format. Please follow these instructions for the conversion:

1. Text Processing:
- Accurately recognize all text content in the PDF image without guessing or inferring.
- Convert the recognized text into Markdown format.
- Maintain the original document structure, including headings, paragraphs, lists, etc.

2. Mathematical Formula Processing:
- Convert all mathematical formulas to LaTeX format.
- Enclose inline formulas with \( \). For example: This is an inline formula \( E = mc^2 \)
- Enclose block formulas with \[ \]. For example: \[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

3. Table Processing:
- Convert tables to HTML format.
- Wrap the entire table with <table> and </table>.

4. Figure Handling:
- Ignore figure content in the PDF image. Do not attempt to describe or convert images.

5. Output Format:
- Ensure the output Markdown document has a clear structure with appropriate line breaks between elements.
- For complex layouts, try to maintain the original document's structure and format as closely as possible.

Please strictly follow these guidelines to ensure accuracy and consistency in the conversion. Your task is to accurately convert the content of the PDF image into Markdown format without adding any extra explanations or comments.
'''


def encode_image(image_path):
    # Read the image file and base64-encode it for a data URL.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def build_message(image_path, prompt):
    content = [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_image(image_path)}"
            },
        },
        {"type": "text", "text": prompt},
    ]
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": content},
    ]
    return messages


client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)


def request(messages):
    completion = client.chat.completions.create(
        messages=messages,
        model="Infinity_Parser",
        max_completion_tokens=8192,
        temperature=0.0,
        top_p=0.95,
    )
    return completion.choices[0].message.content


if __name__ == "__main__":
    img_path = "path/to/image.png"
    messages = build_message(img_path, prompt)
    print(request(messages))
```
</details>
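The `encode_image` helper above is plain base64 over the raw file bytes, which the server decodes back out of the data URL. A minimal round-trip check with stand-in bytes:

```python
import base64

# Stand-in bytes for an image file (the PNG magic header).
data = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(data).decode("utf-8")
assert base64.b64decode(encoded) == data
print(encoded)  # iVBORw0KGgo=
```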
## Inference with Transformers

<details>
<summary>Transformers Inference Example</summary>

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "infly/Infinity-Parser-7B"
prompt = "Please transform the document's contents into Markdown format."

print("Loading model and processor...")
# Default: load the model on the available device(s)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_path, torch_dtype="auto", device_map="auto"
# )

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor
# processor = AutoProcessor.from_pretrained(model_path)

# Recommended processor: bound the image resolution seen by the vision encoder
min_pixels = 256 * 28 * 28    # 448 * 448
max_pixels = 2304 * 28 * 28   # 1344 * 1344
processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)

print("Preparing messages for inference...")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
            },
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

print("Generating results...")
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</details>
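The `min_pixels`/`max_pixels` budgets in the example are multiples of the 28x28 vision-patch area; the `448 * 448` and `1344 * 1344` comments follow because 448 = 16 * 28 and 1344 = 48 * 28:

```python
# Pixel budgets expressed as (number of 28x28 patches) * (patch area).
min_pixels = 256 * 28 * 28   # 256 patches = 16 * 16, i.e. a 448 * 448 image
max_pixels = 2304 * 28 * 28  # 2304 patches = 48 * 48, i.e. a 1344 * 1344 image
assert min_pixels == 448 * 448
assert max_pixels == 1344 * 1344
print(min_pixels, max_pixels)  # 200704 1806336
```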
# Visualization
# Quick Start

Please refer to <a href="https://github.com/infly-ai/INF-MLLM/tree/main/Infinity-Parser#quick-start">Quick Start</a>.

# Visualization