Update README.md

2025-10-23 16:22:12 +09:00
parent 0a5c99b0ec
commit 7d5a46b11d
8 changed files with 134 additions and 174 deletions
--- a/README.md
+++ b/README.md
@@ -1,180 +1,163 @@
-<!-- markdownlint-disable first-line-h1 -->
-<!-- markdownlint-disable html -->
-<!-- markdownlint-disable no-duplicate-header -->
+# DeepSeek-OCR

+## Introduction

-<div align="center">
-  <img src="assets/logo.svg" width="60%" alt="DeepSeek AI" />
-</div>
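+DeepSeek-OCR is a system developed by DeepSeek AI that uses OCR (optical character recognition) to perform context compression. It aims to change how large volumes of text are processed: by treating text-heavy documents as images and compressing their information through a visual representation, it can represent text with far fewer tokens than conventional text encoding.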

+## Key Features

-<hr>
-<div align="center">
-  <a href="https://www.deepseek.com/" target="_blank">
-    <img alt="Homepage" src="assets/badge.svg" />
-  </a>
-  <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR" target="_blank">
-    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
-  </a>
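+- **High compression ratio**: Can compress about 10 text tokens into a single vision token.
+- **High accuracy**: Reaches about 97% OCR precision at a compression ratio of up to 10×.
+- **Fast processing**: Can process more than 200,000 pages per day on a single A100-40G GPU.
+- **Structured content handling**: Parses structured content such as tables, formulas, and diagrams effectively.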

-</div>
+## Project Structure

-<div align="center">
+```
+.
+├── docker-compose.yml
+├── Dockerfile.hf
+├── Dockerfile.vllm
+├── LICENSE
+├── README.md
+├── requirements.txt
+└── DeepSeek-OCR-master
+    ├── DeepSeek-OCR-hf
+    │   └── run_dpsk_ocr.py
+    └── DeepSeek-OCR-vllm
+        ├── config.py
+        ├── deepseek_ocr.py
+        ├── run_dpsk_ocr_eval_batch.py
+        ├── run_dpsk_ocr_image.py
+        ├── run_dpsk_ocr_pdf.py
+        ├── deepencoder
+        │   ├── build_linear.py
+        │   ├── clip_sdpa.py
+        │   └── sam_vary_sdpa.py
+        └── process
+            ├── image_process.py
+            └── ngram_norepeat.py
+```

-  <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
-    <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
-  </a>
-  <a href="https://twitter.com/deepseek_ai" target="_blank">
-    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
-  </a>
+## File Descriptions

-</div>
+### Top-Level Directory

+- `docker-compose.yml`: Defines and manages the Docker services (the `hf` and `vllm` variants).
+- `Dockerfile.hf`: Build configuration for the Docker image of the Hugging Face version.
+- `Dockerfile.vllm`: Build configuration for the Docker image of the vLLM version.
+- `requirements.txt`: List of the Python packages the project requires.

+### `DeepSeek-OCR-master/DeepSeek-OCR-hf`

-<p align="center">
-  <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"><b>📥 Model Download</b></a> |
-  <a href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf"><b>📄 Paper Link</b></a> |
-  <a href="https://arxiv.org/abs/2510.18234"><b>📄 Arxiv Paper Link</b></a> |
-</p>
+- `run_dpsk_ocr.py`: Script that runs OCR on a single image or PDF file using the Hugging Face `transformers` library. Variables such as paths must be edited directly in the script.

-<h2>
-<p align="center">
-  <a href="">DeepSeek-OCR: Contexts Optical Compression</a>
-</p>
-</h2>
+### `DeepSeek-OCR-master/DeepSeek-OCR-vllm`

-<p align="center">
-<img src="assets/fig1.png" style="width: 1000px" align=center>
-</p>
-<p align="center">
-<a href="">Explore the boundaries of visual-text compression.</a>       
-</p>
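+- `config.py`: Holds all settings for the vLLM scripts, including the model, input/output paths, prompt, and processing options.
+- `deepseek_ocr.py`: Core file that defines the DeepSeek-OCR model architecture in a form compatible with the vLLM framework.
+- `run_dpsk_ocr_image.py`: Runs OCR on a single image file according to the settings in `config.py`.
+- `run_dpsk_ocr_pdf.py`: Runs OCR on a single PDF file according to the settings in `config.py`.
+- `run_dpsk_ocr_eval_batch.py`: Runs OCR in batch on every image in the directory specified in `config.py` and saves the results for use in evaluation.
+- `deepencoder/`: Directory containing the scripts that build the vision encoder models, such as SAM and CLIP.
+- `process/`: Directory containing utility scripts, such as image preprocessing and the n-gram repetition-prevention logic.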

-## Release
- [2025/10/20]🚀🚀🚀 We release DeepSeek-OCR, a model to investigate the role of vision encoders from an LLM-centric viewpoint.
+## Installation

-## Contents
- [Install](#install)
- [vLLM Inference](#vllm-inference)
- [Transformers Inference](#transformers-inference)
-  
+1.  Clone the repository.

+    ```bash
+    git clone https://gitea.hmac.kr/kyy/deepseek_ocr.git
+    cd deepseek_ocr
+    ```

+2.  Install the required packages.
+    ```bash
+    pip install -r requirements.txt
+    ```

+## Usage
+
+### Hugging Face version
+
+1.  Open `DeepSeek-OCR-master/DeepSeek-OCR-hf/run_dpsk_ocr.py` and edit the `image_file` and `output_path` variables.
+2.  Run the script: `python DeepSeek-OCR-master/DeepSeek-OCR-hf/run_dpsk_ocr.py`
+
+### vLLM version
+
+1.  Open `DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py` and edit `INPUT_PATH`, `OUTPUT_PATH`, and any other settings.
+2.  Choose the script to run and execute it.
+    - Image: `python DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_image.py`
+    - PDF: `python DeepSeek-OCR-master/DeepSeek-OCR-vllm/run_dpsk_ocr_pdf.py`
+
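+## Using Docker
+
+Docker lets you run the scripts in an isolated environment that includes all dependencies.
+
+### 1. (Optional) Prepare input/output directories
+
+On the host machine, create directories to hold the input files and to receive the output.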

-## Install
->Our environment is cuda11.8+torch2.6.0.
-1. Clone this repository and navigate to the DeepSeek-OCR folder
 ```bash
-git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
-```
-2. Conda
-```Shell
-conda create -n deepseek-ocr python=3.12.9 -y
-conda activate deepseek-ocr
-```
-3. Packages
-
- download the vllm-0.8.5 [whl](https://github.com/vllm-project/vllm/releases/tag/v0.8.5) 
-```Shell
-pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
-pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
-pip install -r requirements.txt
-pip install flash-attn==2.7.3 --no-build-isolation
-```
-**Note:** if you want vLLM and transformers codes to run in the same environment, you don't need to worry about this installation error like: vllm 0.8.5+cu118 requires transformers>=4.51.1
-
-## vLLM-Inference
- VLLM:
->**Note:** change the INPUT_PATH/OUTPUT_PATH and other settings in the DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py
-```Shell
-cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
-```
-1. image: streaming output
-```Shell
-python run_dpsk_ocr_image.py
-```
-2. pdf: concurrency ~2500tokens/s(an A100-40G)
-```Shell
-python run_dpsk_ocr_pdf.py
-```
-3. batch eval for benchmarks
-```Shell
-python run_dpsk_ocr_eval_batch.py
-```
-## Transformers-Inference
- Transformers
-```python
-from transformers import AutoModel, AutoTokenizer
-import torch
-import os
-os.environ["CUDA_VISIBLE_DEVICES"] = '0'
-model_name = 'deepseek-ai/DeepSeek-OCR'
-
-tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
-model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
-model = model.eval().cuda().to(torch.bfloat16)
-
-# prompt = "<image>\nFree OCR. "
-prompt = "<image>\n<|grounding|>Convert the document to markdown. "
-image_file = 'your_image.jpg'
-output_path = 'your/output/dir'
-
-res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path = output_path, base_size = 1024, image_size = 640, crop_mode=True, save_results = True, test_compress = True)
-```
-or you can
-```Shell
-cd DeepSeek-OCR-master/DeepSeek-OCR-hf
-python run_dpsk_ocr.py
-```
-## Support-Modes
-The current open-source model supports the following modes:
- Native resolution:
-  - Tiny: 512×512 （64 vision tokens）✅
-  - Small: 640×640 （100 vision tokens）✅
-  - Base: 1024×1024 （256 vision tokens）✅
-  - Large: 1280×1280 （400 vision tokens）✅
- Dynamic resolution
-  - Gundam: n×640×640 + 1×1024×1024 ✅
-
-## Prompts examples
-```python
-# document: <image>\n<|grounding|>Convert the document to markdown.
-# other image: <image>\n<|grounding|>OCR this image.
-# without layouts: <image>\nFree OCR.
-# figures in document: <image>\nParse the figure.
-# general: <image>\nDescribe this image in detail.
-# rec: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.
-# '先天下之忧而忧'
+mkdir -p ./data/input
+mkdir -p ./data/output
+# Copy the input files into ./data/input.
+cp /path/to/your/file.pdf ./data/input/
 ```

+### 2. Edit the Docker Compose configuration

-## Visualizations
-<table>
-<tr>
-<td><img src="assets/show1.jpg" style="width: 500px"></td>
-<td><img src="assets/show2.jpg" style="width: 500px"></td>
-</tr>
-<tr>
-<td><img src="assets/show3.jpg" style="width: 500px"></td>
-<td><img src="assets/show4.jpg" style="width: 500px"></td>
-</tr>
-</table>
+Open `docker-compose.yml` and edit the `volumes` section so that the directories you just created are mounted into the container. This gives the container access to files on the host.

+```yaml
+services:
+  deepseek_ocr_vllm:
+    build:
+      context: .
+      dockerfile: Dockerfile.vllm
+    # ... (other settings)
+    volumes:
+      - ./DeepSeek-OCR-master/DeepSeek-OCR-vllm:/workspace
+      - ./data/input:/workspace/input # mount the input directory
+      - ./data/output:/workspace/output # mount the output directory
+    # ... (other settings)
+```

-## Acknowledgement
+### 3. Build and start the Docker container

-We would like to thank [Vary](https://github.com/Ucas-HaoranWei/Vary/), [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/), [MinerU](https://github.com/opendatalab/MinerU), [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [OneChart](https://github.com/LingyvKong/OneChart), [Slow Perception](https://github.com/Ucas-HaoranWei/Slow-Perception) for their valuable models and ideas.
+Run the following commands from the project's top-level directory, where `docker-compose.yml` is located.

-We also appreciate the benchmarks: [Fox](https://github.com/ucaslcl/Fox), [OminiDocBench](https://github.com/opendatalab/OmniDocBench).
+```bash
+# Build and start the vLLM service
+docker-compose build
+docker-compose up -d
+```

-## Citation
+To use the Hugging Face version, uncomment the `deepseek_ocr_hf` service in `docker-compose.yml`.

-```bibtex
-@article{wei2024deepseek-ocr,
-  title={DeepSeek-OCR: Contexts Optical Compression},
-  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
-  journal={arXiv preprint arXiv:2510.18234},
-  year={2025}
-}
+### 4. Run a script in the container
+
+1.  Attach to the running container.
+
+    ```bash
+    docker exec -it deepseek_ocr_vllm /bin/bash
+    ```
+
+2.  Inside the container, edit `config.py` so that it points to files in the mounted directories.
+
+    ```python
+    # /workspace/config.py
+    INPUT_PATH = '/workspace/input/file.pdf'  # mounted input file
+    OUTPUT_PATH = '/workspace/output/'        # mounted output directory
+    ```
+
+3.  Run the script you want.
+    ```bash
+    python run_dpsk_ocr_pdf.py
+    ```
+
+Results are saved to the `./data/output` directory on the host machine.
+
+## License
+
+This project is licensed under the terms stated in the [LICENSE](LICENSE) file.
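
Note on the Docker workflow in this diff (step 4): the scripts depend on `INPUT_PATH` and `OUTPUT_PATH` resolving to the mounted directories, so a quick pre-flight check inside the container can catch a missing volume mount before the model is loaded. The following is a minimal sketch and not part of the commit; the helper name `check_ocr_paths` is hypothetical, and the example paths are simply the ones shown in the `config.py` snippet.

```python
import os
from pathlib import Path


def check_ocr_paths(input_path: str, output_path: str) -> list[str]:
    """Return a list of problems found; an empty list means the paths look usable.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    problems = []
    src = Path(input_path)
    out = Path(output_path)

    # The input must be an existing regular file (for example, a mounted PDF).
    if not src.is_file():
        problems.append(f"input file not found: {src}")

    # The output must be an existing directory that the current user can write to.
    if not out.is_dir():
        problems.append(f"output directory not found: {out}")
    elif not os.access(out, os.W_OK):
        problems.append(f"output directory is not writable: {out}")

    return problems


if __name__ == "__main__":
    # Paths taken from the config.py example in step 4; adjust them to your setup.
    for problem in check_ocr_paths("/workspace/input/file.pdf", "/workspace/output/"):
        print(problem)
```

If the script prints nothing, both paths passed the checks. A reported problem usually points at the `volumes` section of `docker-compose.yml` or at a file that was never copied into `./data/input`.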