Initial Commit

This commit is contained in:
kyy
2025-03-18 04:34:57 +00:00
commit a4981faeef
60 changed files with 2160 additions and 0 deletions

5
.dockerignore Normal file

@@ -0,0 +1,5 @@
__pycache__/
*.pyc
*.pyo
*.pyd
projects/

32
Dockerfile Normal file

@@ -0,0 +1,32 @@
# Base stage: Install common dependencies
FROM python:3.10-slim AS base
# Set working directory and environment variables
WORKDIR /usr/src/app
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_NO_CACHE_DIR=1
# Copy dependency manifests first to leverage the Docker build cache
COPY pyproject.toml requirements.txt ./
# Install system and Python dependencies in a single layer
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
gcc \
libssl-dev && \
pip install --no-cache-dir --upgrade pip setuptools setuptools-scm && \
rm -rf /var/lib/apt/lists/*
# Copy project files
COPY autorag /usr/src/app/autorag
COPY dashboard.sh /usr/src/app/dashboard.sh
# Install base project
RUN pip install -r requirements.txt
# Set permissions for dashboard script
RUN chmod +x /usr/src/app/dashboard.sh
CMD ["bash", "dashboard.sh"]

70
README.md Normal file

@@ -0,0 +1,70 @@
# 📊 AutoRAG Dashboard
The AutoRAG Dashboard is a web-based dashboard for visualizing and analyzing experiment results.
---
## **🚀 Key Features**
### ✅ **1. Experiment Result Summary**
- **Automatically analyzes the experiment folder (`trial_dir`)** and summarizes the experiment results.
- Automatically selects the best-performing module based on `summary.csv`.
- Renders the configuration file used in the experiment (`config.yaml`) as Markdown.
### ✅ **2. Performance Comparison Across Experiments**
- Provides strip plots and box plots so you can **visually compare the score distribution of each module**.
- These comparisons make the experiment results easier to interpret at a glance.
### ✅ **3. Individual Query Inspection**
- Click the `detail` button to see detailed information for each query.
---
## **📥 Installation and Usage**
### **1. Run in a Docker Container (Recommended)**
The AutoRAG Dashboard is easiest to run in a Docker environment.
```bash
# 1. Build the Docker image
docker compose build
# 2. Start the AutoRAG Dashboard
docker compose up
```
Open [http://localhost:7690](http://localhost:7690) in your browser to view the dashboard.
---
## **📂 Project Structure**
```bash
.
├── autorag
│   ├── cli.py           # CLI for launching the dashboard
│   ├── dashboard.py     # Main dashboard logic
│   ├── schema/
│   ├── utils/
│   ├── VERSION
├── dashboard.sh         # Startup script run inside the Docker container
├── docker-compose.yml   # Docker Compose configuration
├── Dockerfile           # Docker build file
├── pyproject.toml       # Python package configuration
├── requirements.txt     # Required package list
└── projects/            # Experiment result storage (volume mounted)
```
---
## **📌 Additional Configuration**
### **📍 Changing the Port**
The dashboard runs on **port 7690** by default.
To change it, edit the following entry in `docker-compose.yml`:
```yaml
ports:
  - "7690:7690" # change to the port you want
```
### **📍 Changing the Experiment Data Path**
The default location of the experiment result folder (`trial_dir`) is `./projects/benchmark_sample`.
To use a different folder, adjust the `--trial_dir` option at launch:
```bash
python3 -m autorag.cli dashboard --trial_dir ./projects/custom_experiment
```

1
autorag/VERSION Normal file

@@ -0,0 +1 @@
0.3.14

113
autorag/__init__.py Normal file

@@ -0,0 +1,113 @@
import logging
import logging.config
import os
import sys
from typing import List, Any
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
from llama_index.core.base.llms.types import CompletionResponse
from llama_index.core.llms.mock import MockLLM
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.openai import OpenAIEmbeddingModelType
from llama_index.llms.openai import OpenAI
from llama_index.llms.openai_like import OpenAILike
from langchain_openai.embeddings import OpenAIEmbeddings
from rich.logging import RichHandler
from llama_index.llms.ollama import Ollama
version_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "VERSION")
with open(version_path, "r") as f:
__version__ = f.read().strip()
class LazyInit:
def __init__(self, factory, *args, **kwargs):
self._factory = factory
self._args = args
self._kwargs = kwargs
self._instance = None
def __call__(self):
if self._instance is None:
self._instance = self._factory(*self._args, **self._kwargs)
return self._instance
def __getattr__(self, name):
if self._instance is None:
self._instance = self._factory(*self._args, **self._kwargs)
return getattr(self._instance, name)
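# A minimal usage sketch for LazyInit (illustrative only; MockLLM and its
# max_tokens argument are just convenient stand-ins from the imports above):
#
#     lazy_llm = LazyInit(MockLLM, max_tokens=256)
#     llm = lazy_llm()           # the first call constructs the MockLLM instance
#     _ = lazy_llm.max_tokens    # attribute access also triggers construction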
rich_format = "[%(filename)s:%(lineno)s] >> %(message)s"
logging.basicConfig(
level="INFO", format=rich_format, handlers=[RichHandler(rich_tracebacks=True)]
)
logger = logging.getLogger("AutoRAG")
def handle_exception(exc_type, exc_value, exc_traceback):
logger = logging.getLogger("AutoRAG")
logger.error("Unexpected exception", exc_info=(exc_type, exc_value, exc_traceback))
sys.excepthook = handle_exception
class AutoRAGBedrock(Bedrock):
async def acomplete(
self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
return self.complete(prompt, formatted=formatted, **kwargs)
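# Descriptive note: the wrapper above makes acomplete fall back to the
# synchronous complete() call, presumably because the underlying Bedrock
# client does not provide a usable async completion path.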
generator_models = {
"openai": OpenAI,
"openailike": OpenAILike,
"mock": MockLLM,
"bedrock": AutoRAGBedrock,
"ollama": Ollama,
}
# embedding_models = {
# }
try:
    from llama_index.llms.huggingface import HuggingFaceLLM
    generator_models["huggingfacellm"] = HuggingFaceLLM
except ImportError:
    logger.info(
        "You are using the API version of AutoRAG. "
        "To use the local version, run pip install 'AutoRAG[gpu]'"
    )
# try:
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# embedding_models["hf_all_mpnet_base_v2"] = HuggingFaceEmbedding # 250312 변경 - 김용연
# embedding_models["hf_KURE-v1"] = HuggingFaceEmbedding # 250312 변경 - 김용연
# embedding_models["hf_snowflake-arctic-embed-l-v2.0-ko"] = HuggingFaceEmbedding # 250313 변경 - 김용연
# except ImportError:
# logger.info(
# "You are using API version of AutoRAG."
# "To use local version, run pip install 'AutoRAG[gpu]'"
# )
try:
import transformers
transformers.logging.set_verbosity_error()
except ImportError:
logger.info(
"You are using API version of AutoRAG."
"To use local version, run pip install 'AutoRAG[gpu]'"
)

40
autorag/cli.py Normal file

@@ -0,0 +1,40 @@
import logging
import os
import click
from autorag import dashboard
logger = logging.getLogger("AutoRAG")
autorag_dir = os.path.dirname(os.path.realpath(__file__))
version_file = os.path.join(autorag_dir, "VERSION")
with open(version_file, "r") as f:
__version__ = f.read().strip()
@click.group()
@click.version_option(__version__)
def cli():
pass
@click.command()
@click.option(
"--trial_dir",
type=click.Path(dir_okay=True, file_okay=False, exists=True),
required=True,
)
@click.option(
"--port", type=int, default=7690, help="Port number. The default is 7690."
)
def run_dashboard(trial_dir: str, port: int):
"""Runs the AutoRAG Dashboard."""
logger.info(f"Starting AutoRAG Dashboard on port {port}...")
dashboard.run(trial_dir, port=port)
cli.add_command(run_dashboard, "dashboard")
if __name__ == "__main__":
cli()

215
autorag/dashboard.py Normal file

@@ -0,0 +1,215 @@
import ast
import logging
import os
from typing import Dict, List
import matplotlib.pyplot as plt
import pandas as pd
import panel as pn
import seaborn as sns
import yaml
from bokeh.models import NumberFormatter, BooleanFormatter
from autorag.utils.util import dict_to_markdown, dict_to_markdown_table
pn.extension(
"terminal",
"tabulator",
"mathjax",
"ipywidgets",
console_output="disable",
sizing_mode="stretch_width",
css_files=[
"https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.2/css/all.min.css"
],
)
logger = logging.getLogger("AutoRAG")
def find_node_dir(trial_dir: str) -> List[str]:
trial_summary_df = pd.read_csv(os.path.join(trial_dir, "summary.csv"))
result_paths = []
for idx, row in trial_summary_df.iterrows():
node_line_name = row["node_line_name"]
node_type = row["node_type"]
result_paths.append(os.path.join(trial_dir, node_line_name, node_type))
return result_paths
def get_metric_values(node_summary_df: pd.DataFrame) -> Dict:
non_metric_column_names = [
"filename",
"module_name",
"module_params",
"execution_time",
"average_output_token",
"is_best",
]
best_row = node_summary_df.loc[node_summary_df["is_best"]].drop(
columns=non_metric_column_names, errors="ignore"
)
    assert len(best_row) == 1, "There must be exactly one best module."
return best_row.iloc[0].to_dict()
def make_trial_summary_md(trial_dir):
markdown_text = f"""# Trial Result Summary
- Trial Directory : {trial_dir}
"""
node_dirs = find_node_dir(trial_dir)
for node_dir in node_dirs:
node_summary_filepath = os.path.join(node_dir, "summary.csv")
node_type = os.path.basename(node_dir)
node_summary_df = pd.read_csv(node_summary_filepath)
best_row = node_summary_df.loc[node_summary_df["is_best"]].iloc[0]
metric_dict = get_metric_values(node_summary_df)
markdown_text += f"""---
## {node_type} best module
### Module Name
{best_row['module_name']}
### Module Params
{dict_to_markdown(ast.literal_eval(best_row['module_params']), level=3)}
### Metric Values
{dict_to_markdown_table(metric_dict, key_column_name='metric_name', value_column_name='metric_value')}
"""
return markdown_text
def node_view(node_dir: str):
non_metric_column_names = [
"filename",
"module_name",
"module_params",
"execution_time",
"average_output_token",
"is_best",
]
summary_df = pd.read_csv(os.path.join(node_dir, "summary.csv"))
bokeh_formatters = {
"float": NumberFormatter(format="0.000"),
"bool": BooleanFormatter(),
}
first_df = pd.read_parquet(os.path.join(node_dir, "0.parquet"), engine="pyarrow")
each_module_df_widget = pn.widgets.Tabulator(
pd.DataFrame(columns=first_df.columns),
name="Module DataFrame",
formatters=bokeh_formatters,
pagination="local",
page_size=20,
widths=150,
)
def change_module_widget(event):
if event.column == "detail":
filename = summary_df["filename"].iloc[event.row]
filepath = os.path.join(node_dir, filename)
each_module_df = pd.read_parquet(filepath, engine="pyarrow")
each_module_df_widget.value = each_module_df
df_widget = pn.widgets.Tabulator(
summary_df,
name="Summary DataFrame",
formatters=bokeh_formatters,
buttons={"detail": '<i class="fa fa-eye"></i>'},
widths=150,
)
df_widget.on_click(change_module_widget)
try:
fig, ax = plt.subplots(figsize=(10, 5))
metric_df = summary_df.drop(columns=non_metric_column_names, errors="ignore")
sns.stripplot(data=metric_df, ax=ax)
strip_plot_pane = pn.pane.Matplotlib(fig, tight=True)
fig2, ax2 = plt.subplots(figsize=(10, 5))
sns.boxplot(data=metric_df, ax=ax2)
box_plot_pane = pn.pane.Matplotlib(fig2, tight=True)
plot_pane = pn.Row(strip_plot_pane, box_plot_pane)
layout = pn.Column(
"## Summary distribution plot",
plot_pane,
"## Summary DataFrame",
df_widget,
"## Module Result DataFrame",
each_module_df_widget,
)
except Exception as e:
logger.error(f"Skipping make boxplot and stripplot with error {e}")
layout = pn.Column("## Summary DataFrame", df_widget)
layout.servable()
return layout
CSS = """
div.card-margin:nth-child(1) {
max-height: 300px;
}
div.card-margin:nth-child(2) {
max-height: 400px;
}
"""
def yaml_to_markdown(yaml_filepath):
markdown_content = ""
with open(yaml_filepath, "r", encoding="utf-8") as file:
try:
content = yaml.safe_load(file)
markdown_content += f"## {os.path.basename(yaml_filepath)}\n```yaml\n{yaml.safe_dump(content, allow_unicode=True)}\n```\n\n"
except yaml.YAMLError as exc:
            logger.error(f"Error in {yaml_filepath}: {exc}")
return markdown_content
def run(trial_dir: str, port: int = 7690):
trial_summary_md = make_trial_summary_md(trial_dir=trial_dir)
trial_summary_tab = pn.pane.Markdown(trial_summary_md, sizing_mode="stretch_width")
node_views = [
(str(os.path.basename(node_dir)), pn.bind(node_view, node_dir))
for node_dir in find_node_dir(trial_dir)
]
"""
수정 전
node_views = [
(str(os.path.basename(node_dir)), node_view(node_dir))
for node_dir in find_node_dir(trial_dir)
]
"""
yaml_file_markdown = yaml_to_markdown(os.path.join(trial_dir, "config.yaml"))
yaml_file = pn.pane.Markdown(yaml_file_markdown, sizing_mode="stretch_width")
tabs = pn.Tabs(
("Summary", trial_summary_tab),
*node_views,
("Used YAML file", yaml_file),
dynamic=True,
)
    '''
    Before the fix:
    template = pn.template.FastListTemplate(
        site="AutoRAG", title="Dashboard", main=[tabs], raw_css=[CSS]
    ).servable()
    template.show(port=port)
    '''
if CSS not in pn.config.raw_css:
pn.config.raw_css.append(CSS)
template = pn.template.FastListTemplate(
site="AutoRAG", title="Dashboard", main=[tabs]
).servable()
pn.serve(template, port=port, show=False)

3
autorag/schema/__init__.py Normal file

@@ -0,0 +1,3 @@
from .module import Module
from .node import Node
from .base import BaseModule

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

35
autorag/schema/base.py Normal file

@@ -0,0 +1,35 @@
from abc import ABCMeta, abstractmethod
from pathlib import Path
from typing import Union
import pandas as pd
class BaseModule(metaclass=ABCMeta):
@abstractmethod
def pure(self, previous_result: pd.DataFrame, *args, **kwargs):
pass
@abstractmethod
def _pure(self, *args, **kwargs):
pass
@classmethod
def run_evaluator(
cls,
project_dir: Union[str, Path],
previous_result: pd.DataFrame,
*args,
**kwargs,
):
instance = cls(project_dir, *args, **kwargs)
result = instance.pure(previous_result, *args, **kwargs)
del instance
return result
@abstractmethod
def cast_to_run(self, previous_result: pd.DataFrame, *args, **kwargs):
"""
This function is for cast function (a.k.a decorator) only for pure function in the whole node.
"""
pass

99
autorag/schema/metricinput.py Normal file

@@ -0,0 +1,99 @@
from dataclasses import dataclass
from typing import Optional, List, Dict, Callable, Any, Union
import numpy as np
import pandas as pd
@dataclass
class MetricInput:
query: Optional[str] = None
queries: Optional[List[str]] = None
retrieval_gt_contents: Optional[List[List[str]]] = None
retrieved_contents: Optional[List[str]] = None
retrieval_gt: Optional[List[List[str]]] = None
retrieved_ids: Optional[List[str]] = None
prompt: Optional[str] = None
generated_texts: Optional[str] = None
generation_gt: Optional[List[str]] = None
generated_log_probs: Optional[List[float]] = None
def is_fields_notnone(self, fields_to_check: List[str]) -> bool:
for field in fields_to_check:
actual_value = getattr(self, field)
if actual_value is None:
return False
try:
if not type_checks.get(type(actual_value), lambda _: False)(
actual_value
):
return False
except Exception:
return False
return True
@classmethod
def from_dataframe(cls, qa_data: pd.DataFrame) -> List["MetricInput"]:
"""
Convert a pandas DataFrame into a list of MetricInput instances.
qa_data: pd.DataFrame: qa_data DataFrame containing metric data.
:returns: List[MetricInput]: List of MetricInput objects created from DataFrame rows.
"""
instances = []
for _, row in qa_data.iterrows():
instance = cls()
for attr_name in cls.__annotations__:
if attr_name in row:
value = row[attr_name]
if isinstance(value, str):
setattr(
instance,
attr_name,
value.strip() if value.strip() != "" else None,
)
elif isinstance(value, list):
setattr(instance, attr_name, value if len(value) > 0 else None)
else:
setattr(instance, attr_name, value)
instances.append(instance)
return instances
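    # Illustrative usage sketch (assumes the DataFrame columns match the
    # dataclass field names above):
    #
    #     df = pd.DataFrame({"query": ["q1"], "generation_gt": [["a1"]]})
    #     inputs = MetricInput.from_dataframe(df)
    #     inputs[0].is_fields_notnone(["query", "generation_gt"])  # -> True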
@staticmethod
def _check_list(lst_or_arr: Union[List[Any], np.ndarray]) -> bool:
if isinstance(lst_or_arr, np.ndarray):
lst_or_arr = lst_or_arr.flatten().tolist()
if len(lst_or_arr) == 0:
return False
for item in lst_or_arr:
if item is None:
return False
item_type = type(item)
if item_type in type_checks:
if not type_checks[item_type](item):
return False
else:
return False
return True
type_checks: Dict[type, Callable[[Any], bool]] = {
str: lambda x: len(x.strip()) > 0,
list: MetricInput._check_list,
np.ndarray: MetricInput._check_list,
int: lambda _: True,
float: lambda _: True,
}

24
autorag/schema/module.py Normal file

@@ -0,0 +1,24 @@
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Callable, Dict
from autorag.support import get_support_modules
@dataclass
class Module:
module_type: str
module_param: Dict
module: Callable = field(init=False)
def __post_init__(self):
self.module = get_support_modules(self.module_type)
if self.module is None:
raise ValueError(f"Module type {self.module_type} is not supported.")
@classmethod
def from_dict(cls, module_dict: Dict) -> "Module":
_module_dict = deepcopy(module_dict)
module_type = _module_dict.pop("module_type")
module_params = _module_dict
return cls(module_type, module_params)
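# Illustrative sketch: every key except "module_type" becomes part of
# module_param, and the module type must be registered in autorag.support.
#
#     module = Module.from_dict({"module_type": "bm25", "bm25_tokenizer": "ko_kiwi"})
#     module.module_param  # -> {"bm25_tokenizer": "ko_kiwi"}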

143
autorag/schema/node.py Normal file

@@ -0,0 +1,143 @@
import itertools
import logging
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, List, Callable, Tuple, Any
import pandas as pd
from autorag.schema.module import Module
from autorag.support import get_support_nodes
from autorag.utils.util import make_combinations, explode, find_key_values
logger = logging.getLogger("AutoRAG")
@dataclass
class Node:
node_type: str
strategy: Dict
node_params: Dict
modules: List[Module]
run_node: Callable = field(init=False)
def __post_init__(self):
self.run_node = get_support_nodes(self.node_type)
if self.run_node is None:
raise ValueError(f"Node type {self.node_type} is not supported.")
def get_param_combinations(self) -> Tuple[List[Callable], List[Dict]]:
"""
This method returns a combination of module and node parameters, also corresponding modules.
:return: Each module and its module parameters.
:rtype: Tuple[List[Callable], List[Dict]]
"""
def make_single_combination(module: Module) -> List[Dict]:
input_dict = {**self.node_params, **module.module_param}
return make_combinations(input_dict)
combinations = list(map(make_single_combination, self.modules))
module_list, combination_list = explode(self.modules, combinations)
return list(map(lambda x: x.module, module_list)), combination_list
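    # Illustrative expansion sketch: with node_params {"top_k": [5, 10]} and a
    # module whose module_param is {"bm25_tokenizer": ["ko_kiwi", "ko_okt"]},
    # get_param_combinations yields the four cartesian-product dicts and repeats
    # the module callable once per generated dict.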
@classmethod
def from_dict(cls, node_dict: Dict) -> "Node":
_node_dict = deepcopy(node_dict)
node_type = _node_dict.pop("node_type")
strategy = _node_dict.pop("strategy")
modules = list(map(lambda x: Module.from_dict(x), _node_dict.pop("modules")))
node_params = _node_dict
return cls(node_type, strategy, node_params, modules)
def run(self, previous_result: pd.DataFrame, node_line_dir: str) -> pd.DataFrame:
logger.info(f"Running node {self.node_type}...")
input_modules, input_params = self.get_param_combinations()
return self.run_node(
modules=input_modules,
module_params=input_params,
previous_result=previous_result,
node_line_dir=node_line_dir,
strategies=self.strategy,
)
def extract_values(node: Node, key: str) -> List[str]:
"""
This function extract values from node's modules' module_param.
:param node: The node you want to extract values from.
:param key: The key of module_param that you want to extract.
:return: The list of extracted values.
It removes duplicated elements automatically.
"""
def extract_module_values(module: Module):
if key not in module.module_param:
return []
value = module.module_param[key]
if isinstance(value, str) or isinstance(value, int):
return [value]
elif isinstance(value, list):
return value
else:
raise ValueError(f"{key} must be str,list or int, but got {type(value)}")
values = list(map(extract_module_values, node.modules))
return list(set(list(itertools.chain.from_iterable(values))))
def extract_values_from_nodes(nodes: List[Node], key: str) -> List[str]:
"""
This function extract values from nodes' modules' module_param.
:param nodes: The nodes you want to extract values from.
:param key: The key of module_param that you want to extract.
:return: The list of extracted values.
It removes duplicated elements automatically.
"""
values = list(map(lambda node: extract_values(node, key), nodes))
return list(set(list(itertools.chain.from_iterable(values))))
def extract_values_from_nodes_strategy(nodes: List[Node], key: str) -> List[Any]:
"""
This function extract values from nodes' strategy.
:param nodes: The nodes you want to extract values from.
:param key: The key string that you want to extract.
:return: The list of extracted values.
It removes duplicated elements automatically.
"""
values = []
for node in nodes:
value_list = find_key_values(node.strategy, key)
if value_list:
values.extend(value_list)
return values
def module_type_exists(nodes: List[Node], module_type: str) -> bool:
"""
This function check if the module type exists in the nodes.
:param nodes: The nodes you want to check.
:param module_type: The module type you want to check.
:return: True if the module type exists in the nodes.
"""
    return any(
        any(module.module_type == module_type for module in node.modules)
        for node in nodes
    )

8
autorag/utils/__init__.py Normal file

@@ -0,0 +1,8 @@
from .preprocess import (
validate_qa_dataset,
validate_corpus_dataset,
cast_qa_dataset,
cast_corpus_dataset,
validate_qa_from_corpus_dataset,
)
from .util import fetch_contents, result_to_dataframe, sort_by_scores

Binary file not shown.

Binary file not shown.

Binary file not shown.

149
autorag/utils/preprocess.py Normal file

@@ -0,0 +1,149 @@
from datetime import datetime
import numpy as np
import pandas as pd
from autorag.utils.util import preprocess_text
def validate_qa_dataset(df: pd.DataFrame):
columns = ["qid", "query", "retrieval_gt", "generation_gt"]
assert set(columns).issubset(
df.columns
), f"df must have columns {columns}, but got {df.columns}"
def validate_corpus_dataset(df: pd.DataFrame):
columns = ["doc_id", "contents", "metadata"]
assert set(columns).issubset(
df.columns
), f"df must have columns {columns}, but got {df.columns}"
def cast_qa_dataset(df: pd.DataFrame):
def cast_retrieval_gt(gt):
if isinstance(gt, str):
return [[gt]]
elif isinstance(gt, list):
if isinstance(gt[0], str):
return [gt]
elif isinstance(gt[0], list):
return gt
elif isinstance(gt[0], np.ndarray):
return cast_retrieval_gt(list(map(lambda x: x.tolist(), gt)))
else:
raise ValueError(
f"retrieval_gt must be str or list, but got {type(gt[0])}"
)
elif isinstance(gt, np.ndarray):
return cast_retrieval_gt(gt.tolist())
else:
raise ValueError(f"retrieval_gt must be str or list, but got {type(gt)}")
def cast_generation_gt(gt):
if isinstance(gt, str):
return [gt]
elif isinstance(gt, list):
return gt
elif isinstance(gt, np.ndarray):
return cast_generation_gt(gt.tolist())
else:
raise ValueError(f"generation_gt must be str or list, but got {type(gt)}")
df = df.reset_index(drop=True)
validate_qa_dataset(df)
assert df["qid"].apply(lambda x: isinstance(x, str)).sum() == len(
df
), "qid must be string type."
assert df["query"].apply(lambda x: isinstance(x, str)).sum() == len(
df
), "query must be string type."
df["retrieval_gt"] = df["retrieval_gt"].apply(cast_retrieval_gt)
df["generation_gt"] = df["generation_gt"].apply(cast_generation_gt)
df["query"] = df["query"].apply(preprocess_text)
df["generation_gt"] = df["generation_gt"].apply(
lambda x: list(map(preprocess_text, x))
)
return df
def cast_corpus_dataset(df: pd.DataFrame):
df = df.reset_index(drop=True)
validate_corpus_dataset(df)
# drop rows that have empty contents
df = df[~df["contents"].apply(lambda x: x is None or x.isspace())]
def make_datetime_metadata(x):
if x is None or x == {}:
return {"last_modified_datetime": datetime.now()}
elif x.get("last_modified_datetime") is None:
return {**x, "last_modified_datetime": datetime.now()}
else:
return x
df["metadata"] = df["metadata"].apply(make_datetime_metadata)
    # check that every metadata dict has a datetime key
assert sum(
df["metadata"].apply(lambda x: x.get("last_modified_datetime") is not None)
) == len(df), "Every metadata must have a datetime key."
def make_prev_next_id_metadata(x, id_type: str):
if x is None or x == {}:
return {id_type: None}
elif x.get(id_type) is None:
return {**x, id_type: None}
else:
return x
df["metadata"] = df["metadata"].apply(
lambda x: make_prev_next_id_metadata(x, "prev_id")
)
df["metadata"] = df["metadata"].apply(
lambda x: make_prev_next_id_metadata(x, "next_id")
)
df["contents"] = df["contents"].apply(preprocess_text)
def normalize_unicode_metadata(metadata: dict):
result = {}
for key, value in metadata.items():
if isinstance(value, str):
result[key] = preprocess_text(value)
else:
result[key] = value
return result
df["metadata"] = df["metadata"].apply(normalize_unicode_metadata)
    # check that every metadata dict has prev_id and next_id keys
assert all(
"prev_id" in metadata for metadata in df["metadata"]
), "Every metadata must have a prev_id key."
assert all(
"next_id" in metadata for metadata in df["metadata"]
), "Every metadata must have a next_id key."
return df
def validate_qa_from_corpus_dataset(qa_df: pd.DataFrame, corpus_df: pd.DataFrame):
qa_ids = []
for retrieval_gt in qa_df["retrieval_gt"].tolist():
if isinstance(retrieval_gt, list) and (
retrieval_gt[0] != [] or any(bool(g) is True for g in retrieval_gt)
):
for gt in retrieval_gt:
qa_ids.extend(gt)
elif isinstance(retrieval_gt, np.ndarray) and retrieval_gt[0].size > 0:
for gt in retrieval_gt:
qa_ids.extend(gt)
no_exist_ids = list(
filter(lambda qa_id: corpus_df[corpus_df["doc_id"] == qa_id].empty, qa_ids)
)
assert (
len(no_exist_ids) == 0
), f"{len(no_exist_ids)} doc_ids in retrieval_gt do not exist in corpus_df."

751
autorag/utils/util.py Normal file

@@ -0,0 +1,751 @@
import ast
import asyncio
import datetime
import functools
import glob
import inspect
import itertools
import json
import logging
import os
import re
import string
from copy import deepcopy
from json import JSONDecoder
from typing import List, Callable, Dict, Optional, Any, Collection, Iterable
from asyncio import AbstractEventLoop
import emoji
import numpy as np
import pandas as pd
import tiktoken
import unicodedata
import yaml
from llama_index.embeddings.openai import OpenAIEmbedding
from pydantic import BaseModel as BM
from pydantic.v1 import BaseModel
logger = logging.getLogger("AutoRAG")
def fetch_contents(
corpus_data: pd.DataFrame, ids: List[List[str]], column_name: str = "contents"
) -> List[List[Any]]:
def fetch_contents_pure(
ids: List[str], corpus_data: pd.DataFrame, column_name: str
):
return list(map(lambda x: fetch_one_content(corpus_data, x, column_name), ids))
result = flatten_apply(
fetch_contents_pure, ids, corpus_data=corpus_data, column_name=column_name
)
return result
def fetch_one_content(
corpus_data: pd.DataFrame,
id_: str,
column_name: str = "contents",
id_column_name: str = "doc_id",
) -> Any:
if isinstance(id_, str):
if id_ in ["", ""]:
return None
fetch_result = corpus_data[corpus_data[id_column_name] == id_]
if fetch_result.empty:
raise ValueError(f"doc_id: {id_} not found in corpus_data.")
else:
return fetch_result[column_name].iloc[0]
else:
return None
def result_to_dataframe(column_names: List[str]):
"""
Decorator for converting results to pd.DataFrame.
"""
def decorator_result_to_dataframe(func: Callable):
@functools.wraps(func)
def wrapper(*args, **kwargs) -> pd.DataFrame:
results = func(*args, **kwargs)
if len(column_names) == 1:
df_input = {column_names[0]: results}
else:
df_input = {
column_name: result
for result, column_name in zip(results, column_names)
}
result_df = pd.DataFrame(df_input)
return result_df
return wrapper
return decorator_result_to_dataframe
def load_summary_file(
summary_path: str, dict_columns: Optional[List[str]] = None
) -> pd.DataFrame:
"""
Load a summary file from summary_path.
:param summary_path: The path of the summary file.
:param dict_columns: The columns that are dictionary type.
You must fill this parameter if you want to load summary file properly.
Default is ['module_params'].
:return: The summary dataframe.
"""
if not os.path.exists(summary_path):
raise ValueError(f"summary.csv does not exist in {summary_path}.")
summary_df = pd.read_csv(summary_path)
if dict_columns is None:
dict_columns = ["module_params"]
if any([col not in summary_df.columns for col in dict_columns]):
raise ValueError(f"{dict_columns} must be in summary_df.columns.")
def convert_dict(elem):
try:
return ast.literal_eval(elem)
        except Exception:
# convert datetime or date to its object (recency filter)
date_object = convert_datetime_string(elem)
if date_object is None:
raise ValueError(
f"Malformed dict received : {elem}\nCan't convert to dict properly"
)
return {"threshold": date_object}
summary_df[dict_columns] = summary_df[dict_columns].map(convert_dict)
return summary_df
def convert_datetime_string(s):
# Regex to extract datetime arguments from the string
m = re.search(r"(datetime|date)(\((\d+)(,\s*\d+)*\))", s)
if m:
args = ast.literal_eval(m.group(2))
if m.group(1) == "datetime":
return datetime.datetime(*args)
elif m.group(1) == "date":
return datetime.date(*args)
return None
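# Illustrative example: convert_datetime_string("datetime(2024, 3, 18)")
# returns datetime.datetime(2024, 3, 18); strings without a datetime/date
# pattern return None.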
def make_combinations(target_dict: Dict[str, Any]) -> List[Dict[str, Any]]:
"""
Make combinations from target_dict.
The target_dict key value must be a string,
and the value can be a list of values or single value.
If generates all combinations of values from target_dict,
which means generating dictionaries that contain only one value for each key,
and all dictionaries will be different from each other.
:param target_dict: The target dictionary.
:return: The list of generated dictionaries.
"""
dict_with_lists = dict(
map(
lambda x: (x[0], x[1] if isinstance(x[1], list) else [x[1]]),
target_dict.items(),
)
)
def delete_duplicate(x):
def is_hashable(obj):
try:
hash(obj)
return True
except TypeError:
return False
if any([not is_hashable(elem) for elem in x]):
# TODO: add duplication check for unhashable objects
return x
else:
return list(set(x))
dict_with_lists = dict(
map(lambda x: (x[0], delete_duplicate(x[1])), dict_with_lists.items())
)
combination = list(itertools.product(*dict_with_lists.values()))
combination_dicts = [
dict(zip(dict_with_lists.keys(), combo)) for combo in combination
]
return combination_dicts
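# Illustrative example (the order of generated dicts may vary, since duplicate
# values are removed via set()):
#
#     make_combinations({"top_k": [1, 2], "method": "bm25"})
#     # -> [{"top_k": 1, "method": "bm25"}, {"top_k": 2, "method": "bm25"}]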
def explode(index_values: Collection[Any], explode_values: Collection[Collection[Any]]):
"""
Explode index_values and explode_values.
The index_values and explode_values must have the same length.
It will flatten explode_values and keep index_values as a pair.
:param index_values: The index values.
:param explode_values: The exploded values.
:return: Tuple of exploded index_values and exploded explode_values.
"""
assert len(index_values) == len(
explode_values
), "Index values and explode values must have same length"
df = pd.DataFrame({"index_values": index_values, "explode_values": explode_values})
df = df.explode("explode_values")
return df["index_values"].tolist(), df["explode_values"].tolist()
def replace_value_in_dict(target_dict: Dict, key: str, replace_value: Any) -> Dict:
"""
Replace the value of a certain key in target_dict.
If there is no targeted key in target_dict, it will return target_dict.
:param target_dict: The target dictionary.
:param key: The key is to replace.
:param replace_value: The value to replace.
:return: The replaced dictionary.
"""
replaced_dict = deepcopy(target_dict)
if key not in replaced_dict:
return replaced_dict
replaced_dict[key] = replace_value
return replaced_dict
def normalize_string(s: str) -> str:
"""
Taken from the official evaluation script for v1.1 of the SQuAD dataset.
Lower text and remove punctuation, articles, and extra whitespace.
"""
def remove_articles(text):
return re.sub(r"\b(a|an|the)\b", " ", text)
def white_space_fix(text):
return " ".join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return "".join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
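# Illustrative example: normalize_string("The  Quick, Brown Fox!")
# returns "quick brown fox".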
def convert_string_to_tuple_in_dict(d):
"""Recursively converts strings that start with '(' and end with ')' to tuples in a dictionary."""
for key, value in d.items():
# If the value is a dictionary, recurse
if isinstance(value, dict):
convert_string_to_tuple_in_dict(value)
# If the value is a list, iterate through its elements
elif isinstance(value, list):
for i, item in enumerate(value):
# If an item in the list is a dictionary, recurse
if isinstance(item, dict):
convert_string_to_tuple_in_dict(item)
# If an item in the list is a string matching the criteria, convert it to a tuple
elif (
isinstance(item, str)
and item.startswith("(")
and item.endswith(")")
):
value[i] = ast.literal_eval(item)
# If the value is a string matching the criteria, convert it to a tuple
elif isinstance(value, str) and value.startswith("(") and value.endswith(")"):
d[key] = ast.literal_eval(value)
return d
def convert_env_in_dict(d: Dict):
"""
Recursively converts environment variable string in a dictionary to actual environment variable.
:param d: The dictionary to convert.
:return: The converted dictionary.
"""
env_pattern = re.compile(r".*?\${(.*?)}.*?")
def convert_env(val: str):
matches = env_pattern.findall(val)
for match in matches:
val = val.replace(f"${{{match}}}", os.environ.get(match, ""))
return val
for key, value in d.items():
if isinstance(value, dict):
convert_env_in_dict(value)
elif isinstance(value, list):
for i, item in enumerate(value):
if isinstance(item, dict):
convert_env_in_dict(item)
elif isinstance(item, str):
value[i] = convert_env(item)
elif isinstance(value, str):
d[key] = convert_env(value)
return d
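# Illustrative example (assumes MY_HOST is set in the environment):
#
#     os.environ["MY_HOST"] = "localhost"
#     convert_env_in_dict({"url": "http://${MY_HOST}:7690"})
#     # -> {"url": "http://localhost:7690"}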
async def process_batch(tasks, batch_size: int = 64) -> List[Any]:
"""
Processes tasks in batches asynchronously.
:param tasks: A list of no-argument functions or coroutines to be executed.
:param batch_size: The number of tasks to process in a single batch.
Default is 64.
:return: A list of results from the processed tasks.
"""
results = []
for i in range(0, len(tasks), batch_size):
batch = tasks[i : i + batch_size]
batch_results = await asyncio.gather(*batch)
results.extend(batch_results)
return results
def make_batch(elems: List[Any], batch_size: int) -> List[List[Any]]:
"""
Make a batch of elems with batch_size.
"""
return [elems[i : i + batch_size] for i in range(0, len(elems), batch_size)]
def save_parquet_safe(df: pd.DataFrame, filepath: str, upsert: bool = False):
output_file_dir = os.path.dirname(filepath)
if not os.path.isdir(output_file_dir):
raise NotADirectoryError(f"directory {output_file_dir} not found.")
if not filepath.endswith("parquet"):
raise NameError(
f'file path: {filepath} filename extension need to be ".parquet"'
)
if os.path.exists(filepath) and not upsert:
raise FileExistsError(
f"file {filepath} already exists."
"Set upsert True if you want to overwrite the file."
)
df.to_parquet(filepath, index=False)
def openai_truncate_by_token(
texts: List[str], token_limit: int, model_name: str
) -> List[str]:
try:
tokenizer = tiktoken.encoding_for_model(model_name)
except KeyError:
# This is not a real OpenAI model
return texts
def truncate_text(text: str, limit: int, tokenizer):
tokens = tokenizer.encode(text)
if len(tokens) <= limit:
return text
truncated_text = tokenizer.decode(tokens[:limit])
return truncated_text
return list(map(lambda x: truncate_text(x, token_limit, tokenizer), texts))
def reconstruct_list(flat_list: List[Any], lengths: List[int]) -> List[List[Any]]:
result = []
start = 0
for length in lengths:
result.append(flat_list[start : start + length])
start += length
return result
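# Illustrative example: reconstruct_list([1, 2, 3, 4], [3, 1])
# returns [[1, 2, 3], [4]].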
def flatten_apply(
func: Callable, nested_list: List[List[Any]], **kwargs
) -> List[List[Any]]:
"""
This function flattens the input list and applies the function to the elements.
After that, it reconstructs the list to the original shape.
Its speciality is that the first dimension length of the list can be different from each other.
:param func: The function that applies to the flattened list.
:param nested_list: The nested list to be flattened.
:return: The list that is reconstructed after applying the function.
"""
df = pd.DataFrame({"col1": nested_list})
df = df.explode("col1")
df["result"] = func(df["col1"].tolist(), **kwargs)
return df.groupby(level=0, sort=False)["result"].apply(list).tolist()
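# Illustrative example: the given function receives one flat list, and the
# results are regrouped into the original ragged shape.
#
#     flatten_apply(lambda xs: [x * 2 for x in xs], [[1, 2], [3]])
#     # -> [[2, 4], [6]]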
async def aflatten_apply(
func: Callable, nested_list: List[List[Any]], **kwargs
) -> List[List[Any]]:
"""
This function flattens the input list and applies the function to the elements.
After that, it reconstructs the list to the original shape.
Its speciality is that the first dimension length of the list can be different from each other.
:param func: The function that applies to the flattened list.
:param nested_list: The nested list to be flattened.
:return: The list that is reconstructed after applying the function.
"""
df = pd.DataFrame({"col1": nested_list})
df = df.explode("col1")
df["result"] = await func(df["col1"].tolist(), **kwargs)
return df.groupby(level=0, sort=False)["result"].apply(list).tolist()
def sort_by_scores(row, reverse=True):
"""
Sorts each row by 'scores' column.
The input column names must be 'contents', 'ids', and 'scores'.
And its elements must be list type.
"""
results = sorted(
zip(row["contents"], row["ids"], row["scores"]),
key=lambda x: x[2],
reverse=reverse,
)
reranked_contents, reranked_ids, reranked_scores = zip(*results)
return list(reranked_contents), list(reranked_ids), list(reranked_scores)
def select_top_k(df, column_names: List[str], top_k: int):
for column_name in column_names:
df[column_name] = df[column_name].apply(lambda x: x[:top_k])
return df
def filter_dict_keys(dict_, keys: List[str]):
result = {}
for key in keys:
if key in dict_:
result[key] = dict_[key]
else:
raise KeyError(f"Key '{key}' not found in dictionary.")
return result
def split_dataframe(df, chunk_size):
num_chunks = (
len(df) // chunk_size + 1
if len(df) % chunk_size != 0
else len(df) // chunk_size
)
result = list(
map(lambda x: df[x * chunk_size : (x + 1) * chunk_size], range(num_chunks))
)
result = list(map(lambda x: x.reset_index(drop=True), result))
return result
def find_trial_dir(project_dir: str) -> List[str]:
# Pattern to match directories named with numbers
pattern = os.path.join(project_dir, "[0-9]*")
all_entries = glob.glob(pattern)
# Filter out only directories
trial_dirs = [
entry
for entry in all_entries
if os.path.isdir(entry) and entry.split(os.sep)[-1].isdigit()
]
return trial_dirs
def find_node_summary_files(trial_dir: str) -> List[str]:
# Find all summary.csv files recursively
all_summary_files = glob.glob(
os.path.join(trial_dir, "**", "summary.csv"), recursive=True
)
    # Keep only node-level summary files, excluding trial- and node-line-level summaries
filtered_files = [
f for f in all_summary_files if f.count(os.sep) > trial_dir.count(os.sep) + 2
]
return filtered_files
def preprocess_text(text: str) -> str:
return normalize_unicode(demojize(text))
def demojize(text: str) -> str:
return emoji.demojize(text)
def normalize_unicode(text: str) -> str:
return unicodedata.normalize("NFC", text)
def dict_to_markdown(d, level=1):
"""
Convert a dictionary to a Markdown formatted string.
:param d: Dictionary to convert
:param level: Current level of heading (used for nested dictionaries)
:return: Markdown formatted string
"""
markdown = ""
for key, value in d.items():
if isinstance(value, dict):
markdown += f"{'#' * level} {key}\n"
markdown += dict_to_markdown(value, level + 1)
elif isinstance(value, list):
markdown += f"{'#' * level} {key}\n"
for item in value:
if isinstance(item, dict):
markdown += dict_to_markdown(item, level + 1)
else:
markdown += f"- {item}\n"
else:
markdown += f"{'#' * level} {key}\n{value}\n"
return markdown
def dict_to_markdown_table(data, key_column_name: str, value_column_name: str):
# Check if the input is a dictionary
if not isinstance(data, dict):
raise ValueError("Input must be a dictionary")
# Create the header of the table
header = f"| {key_column_name} | {value_column_name} |\n| :---: | :-----: |\n"
# Create the rows of the table
rows = ""
for key, value in data.items():
rows += f"| {key} | {value} |\n"
# Combine header and rows
markdown_table = header + rows
return markdown_table
def embedding_query_content(
    queries: List[str],
    contents_list: List[List[str]],
    embedding_model: Optional[Any] = None,  # a LlamaIndex embedding model instance
    batch: int = 128,
):
flatten_contents = list(itertools.chain.from_iterable(contents_list))
    openai_embedding_limit = 8000  # OpenAI embedding models accept at most ~8k input tokens
if isinstance(embedding_model, OpenAIEmbedding):
queries = openai_truncate_by_token(
queries, openai_embedding_limit, embedding_model.model_name
)
flatten_contents = openai_truncate_by_token(
flatten_contents, openai_embedding_limit, embedding_model.model_name
)
# Embedding using batch
embedding_model.embed_batch_size = batch
query_embeddings = embedding_model.get_text_embedding_batch(queries)
content_lengths = list(map(len, contents_list))
content_embeddings_flatten = embedding_model.get_text_embedding_batch(
flatten_contents
)
content_embeddings = reconstruct_list(content_embeddings_flatten, content_lengths)
return query_embeddings, content_embeddings
def to_list(item):
"""Recursively convert collections to Python lists."""
if isinstance(item, np.ndarray):
# Convert numpy array to list and recursively process each element
return [to_list(sub_item) for sub_item in item.tolist()]
elif isinstance(item, pd.Series):
# Convert pandas Series to list and recursively process each element
return [to_list(sub_item) for sub_item in item.tolist()]
elif isinstance(item, Iterable) and not isinstance(
item, (str, bytes, BaseModel, BM)
):
# Recursively process each element in other iterables
return [to_list(sub_item) for sub_item in item]
else:
return item
def convert_inputs_to_list(func):
"""Decorator to convert all function inputs to Python lists."""
@functools.wraps(func)
def wrapper(*args, **kwargs):
new_args = [to_list(arg) for arg in args]
new_kwargs = {k: to_list(v) for k, v in kwargs.items()}
return func(*new_args, **new_kwargs)
return wrapper
def get_best_row(
summary_df: pd.DataFrame, best_column_name: str = "is_best"
) -> pd.Series:
"""
From the summary dataframe, find the best result row by 'is_best' column and return it.
:param summary_df: Summary dataframe created by AutoRAG.
:param best_column_name: The column name that indicates the best result.
Default is 'is_best'.
You don't have to change this unless the column name is different.
:return: Best row pandas Series instance.
"""
bests = summary_df.loc[summary_df[best_column_name]]
assert len(bests) == 1, "There must be only one best result."
return bests.iloc[0]
def get_event_loop() -> AbstractEventLoop:
"""
Get asyncio event loop safely.
"""
try:
loop = asyncio.get_running_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
return loop
def find_key_values(data, target_key: str) -> List[Any]:
"""
Recursively find all values for a specific key in a nested dictionary or list.
:param data: The dictionary or list to search.
:param target_key: The key to search for.
:return: A list of values associated with the target key.
"""
values = []
if isinstance(data, dict):
for key, value in data.items():
if key == target_key:
values.append(value)
if isinstance(value, (dict, list)):
values.extend(find_key_values(value, target_key))
elif isinstance(data, list):
for item in data:
if isinstance(item, (dict, list)):
values.extend(find_key_values(item, target_key))
return values
def pop_params(func: Callable, kwargs: Dict) -> Dict:
"""
Pop parameters from the given func and return them.
It automatically deletes the parameters like "self" or "cls".
:param func: The function to pop parameters.
:param kwargs: kwargs to pop parameters.
:return: The popped parameters.
"""
ignore_params = ["self", "cls"]
target_params = list(inspect.signature(func).parameters.keys())
target_params = list(filter(lambda x: x not in ignore_params, target_params))
init_params = {}
kwargs_keys = list(kwargs.keys())
for key in kwargs_keys:
if key in target_params:
init_params[key] = kwargs.pop(key)
return init_params
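# Illustrative example: keys that match func's signature are removed from
# kwargs and returned.
#
#     def f(a, b=1): ...
#     kwargs = {"a": 1, "c": 3}
#     pop_params(f, kwargs)  # -> {"a": 1}; kwargs is now {"c": 3}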
def apply_recursive(func, data):
"""
Recursively apply a function to all elements in a list, tuple, set, np.ndarray, or pd.Series and return as List.
:param func: Function to apply to each element.
:param data: List or nested list.
:return: List with the function applied to each element.
"""
if (
isinstance(data, list)
or isinstance(data, tuple)
or isinstance(data, set)
or isinstance(data, np.ndarray)
or isinstance(data, pd.Series)
):
return [apply_recursive(func, item) for item in data]
else:
return func(data)
def empty_cuda_cache():
try:
import torch
if torch.cuda.is_available():
torch.cuda.empty_cache()
except ImportError:
pass
def load_yaml_config(yaml_path: str) -> Dict:
"""
Load a YAML configuration file for AutoRAG.
It contains safe loading, converting string to tuple, and insert environment variables.
:param yaml_path: The path of the YAML configuration file.
:return: The loaded configuration dictionary.
"""
if not os.path.exists(yaml_path):
raise ValueError(f"YAML file {yaml_path} does not exist.")
with open(yaml_path, "r", encoding="utf-8") as stream:
try:
yaml_dict = yaml.safe_load(stream)
except yaml.YAMLError as exc:
raise ValueError(f"YAML file {yaml_path} could not be loaded.") from exc
yaml_dict = convert_string_to_tuple_in_dict(yaml_dict)
yaml_dict = convert_env_in_dict(yaml_dict)
return yaml_dict
def decode_multiple_json_from_bytes(byte_data: bytes) -> list:
"""
Decode multiple JSON objects from bytes received from SSE server.
Args:
byte_data: Bytes containing one or more JSON objects
Returns:
List of decoded JSON objects
"""
# Decode bytes to string
try:
text_data = byte_data.decode("utf-8").strip()
except UnicodeDecodeError:
raise ValueError("Invalid byte data: Unable to decode as UTF-8")
# Initialize decoder and result list
decoder = JSONDecoder()
result = []
# Keep track of position in string
pos = 0
text_data = text_data.strip()
while pos < len(text_data):
try:
# Try to decode next JSON object
json_obj, json_end = decoder.raw_decode(text_data[pos:])
result.append(json_obj)
# Move position to end of current JSON object
pos += json_end
# Skip any whitespace
while pos < len(text_data) and text_data[pos].isspace():
pos += 1
except json.JSONDecodeError:
# If we can't decode at current position, move forward one character
pos += 1
return result
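# Illustrative example: two JSON objects concatenated in one SSE chunk.
#
#     decode_multiple_json_from_bytes(b'{"a": 1}\n{"b": 2}')
#     # -> [{"a": 1}, {"b": 2}]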

7
dashboard.sh Normal file

@@ -0,0 +1,7 @@
#!/bin/bash
echo "📊 Starting AutoRAG Dashboard..."
python3 -m autorag.cli dashboard \
	--trial_dir /usr/src/app/projects/benchmark_sample/1 \
	--port 7690

22
docker-compose.yml Normal file

@@ -0,0 +1,22 @@
services:
autorag-dashboard:
build:
context: .
container_name: autorag-dashboard
environment:
- CUDA_VISIBLE_DEVICES=0
      - OPENAI_API_KEY=${OPENAI_API_KEY} # supply the key via the host environment; never commit real keys
- BOKEH_ALLOW_WS_ORIGIN=172.16.10.175:7690
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
- ./:/usr/src/app/
tty: true
working_dir: /usr/src/app
networks:
- autorag-dashboard_network
ports:
- "7690:7690"
networks:
autorag-dashboard_network:
driver: bridge


@@ -0,0 +1,92 @@
vectordb:
- name: chroma_dragonkue2
db_type: chroma
client_type: persistent
embedding_model: huggingface_drangonku-v2-ko
collection_name: huggingface_drangonku-v2-ko
path: ${PROJECT_DIR}/resources/chroma
node_lines:
- node_line_name: retrieve_node_line # Arbitrary node line name
nodes:
- node_type: retrieval
strategy:
metrics: [ retrieval_f1, retrieval_recall, retrieval_precision,
retrieval_ndcg, retrieval_map, retrieval_mrr ]
speed_threshold: 10
top_k: 10
modules:
- module_type: bm25
bm25_tokenizer: [ ko_kiwi, ko_okt ]
- module_type: vectordb
vectordb: chroma_dragonkue2 # chromadb
- module_type: hybrid_cc
normalize_method: [ mm, tmm, z, dbsf ]
target_modules: ('bm25', 'vectordb')
weight_range: (0.6, 0.4)
test_weight_size: 101
- node_type: passage_reranker # re-ranker
strategy:
metrics:
- retrieval_recall
- retrieval_precision
- retrieval_map
modules:
- module_type: dragonkue2
top_k: 5
  - node_line_name: post_retrieve_node_line # generation node line
nodes:
- node_type: prompt_maker
strategy:
metrics:
- metric_name: bleu
- metric_name: meteor
- metric_name: rouge
- metric_name: sem_score
embedding_model: huggingface_drangonku-v2-ko # raise ValueError("Only one embedding model is supported")
lang: ko
generator_modules:
- module_type: llama_index_llm
llm: ollama
model: [ gemma3:12b, phi4, deepseek-r1:14b, aya-expanse:8b ]
request_timeout: 3000.0
modules:
- module_type: fstring
prompt:
- |
### Task:
Respond to the user query using the provided context.
### Guidelines:
- If you don't know the answer, clearly state that.
- If uncertain, ask the user for clarification.
- Respond in the same language as the user's query.
- If the context is unreadable or of poor quality, inform the user and provide the best possible answer.
- If the answer isn't present in the context but you possess the knowledge, explain this to the user and provide the answer using your own understanding.
- Do not use XML tags in your response.
### Output:
Provide a clear and direct response to the user's query.
<context>
{retrieved_contents}
</context>
<user_query>
{query}
</user_query>
- node_type: generator # Gen-LLM
strategy:
metrics:
- metric_name: bleu
- metric_name: meteor
- metric_name: rouge
- metric_name: sem_score
modules:
- module_type: llama_index_llm
llm: ollama
model: gemma3:12b # phi4, deepseek-r1:14b, aya-expanse:8b
temperature: 0.0
request_timeout: 30000.0
batch: 4


@@ -0,0 +1,92 @@
vectordb:
- name: chroma_dragonkue2
db_type: chroma
client_type: persistent
embedding_model: huggingface_drangonku-v2-ko
collection_name: huggingface_drangonku-v2-ko
path: ${PROJECT_DIR}/resources/chroma
node_lines:
- node_line_name: retrieve_node_line # Arbitrary node line name
nodes:
- node_type: retrieval
strategy:
metrics: [ retrieval_f1, retrieval_recall, retrieval_precision,
retrieval_ndcg, retrieval_map, retrieval_mrr ]
speed_threshold: 10
top_k: 10
modules:
- module_type: bm25
bm25_tokenizer: [ ko_kiwi ] # ko_kiwi, ko_okt
- module_type: vectordb
vectordb: chroma_dragonkue2 # chromadb
- module_type: hybrid_cc
normalize_method: [ mm, tmm, z, dbsf ]
target_modules: ('bm25', 'vectordb')
weight_range: (0.6, 0.4)
test_weight_size: 101
- node_type: passage_reranker # re-ranker
strategy:
metrics:
- retrieval_recall
- retrieval_precision
- retrieval_map
modules:
- module_type: dragonkue2
top_k: 5
  - node_line_name: post_retrieve_node_line # generation node line
nodes:
- node_type: prompt_maker
strategy:
metrics:
- metric_name: bleu
- metric_name: meteor
- metric_name: rouge
- metric_name: sem_score
embedding_model: huggingface_drangonku-v2-ko # raise ValueError("Only one embedding model is supported")
lang: ko
generator_modules:
- module_type: llama_index_llm
llm: ollama
model: gemma3:12b
request_timeout: 3000.0
modules:
- module_type: fstring
prompt:
- |
### 작업:
지침에 따라 제공된 컨텍스트를 활용하여 사용자 질문에 답변하세요.
### 지침:
- 답을 모를 경우, 모른다고 명확히 말하세요.
- 확신이 없다면, 사용자에게 추가 설명을 요청하세요.
- 사용자의 질문과 동일한 언어로 답변하세요.
- 컨텍스트가 읽기 어렵거나 품질이 낮을 경우, 이를 사용자에게 알리고 최선의 답변을 제공하세요.
- 컨텍스트에 답이 없지만 알고 있는 내용이라면, 이를 사용자에게 설명하고 자신의 지식을 바탕으로 답변하세요.
- XML 태그를 사용하지 마세요.
### 출력:
사용자의 질문에 대해 명확하고 직접적인 답변을 제공하세요.
<context>
{retrieved_contents}
</context>
<user_query>
{query}
</user_query>
- node_type: generator # Gen-LLM
strategy:
metrics:
- metric_name: bleu
- metric_name: meteor
- metric_name: rouge
- metric_name: sem_score
modules:
- module_type: llama_index_llm
llm: ollama
model: gemma3:12b # phi4, deepseek-r1:14b, aya-expanse:8b
temperature: 0.0
request_timeout: 300.0
batch: 8


@@ -0,0 +1,2 @@
filename,module_name,module_params,execution_time,average_output_token,bleu,meteor,rouge,sem_score,is_best
0.parquet,LlamaIndexLLM,"{'llm': 'ollama', 'model': 'gemma3:12b', 'temperature': 0.0, 'request_timeout': 300.0, 'batch': 8}",0.8519447922706604,259.05,14.57290077698799,0.47984407229799053,0.4400396825396825,0.8177114641079747,True


@@ -0,0 +1,2 @@
filename,module_name,module_params,execution_time,average_prompt_token,is_best
0.parquet,Fstring,"{'prompt': '### 작업: \n지침에 따라 제공된 컨텍스트를 활용하여 사용자 질문에 답변하세요. \n\n### 지침: \n- 답을 모를 경우, 모른다고 명확히 말하세요. \n- 확신이 없다면, 사용자에게 추가 설명을 요청하세요. \n- 사용자의 질문과 동일한 언어로 답변하세요. \n- 컨텍스트가 읽기 어렵거나 품질이 낮을 경우, 이를 사용자에게 알리고 최선의 답변을 제공하세요. \n- 컨텍스트에 답이 없지만 알고 있는 내용이라면, 이를 사용자에게 설명하고 자신의 지식을 바탕으로 답변하세요. \n- XML 태그를 사용하지 마세요. \n\n### 출력: \n사용자의 질문에 대해 명확하고 직접적인 답변을 제공하세요.\n\n<context>\n{retrieved_contents}\n</context>\n\n<user_query>\n{query}\n</user_query>\n'}",0.0003142237663269043,2751.85,True


@@ -0,0 +1,3 @@
node_type,best_module_filename,best_module_name,best_module_params,best_execution_time
prompt_maker,0.parquet,Fstring,"{'prompt': '### 작업: \n지침에 따라 제공된 컨텍스트를 활용하여 사용자 질문에 답변하세요. \n\n### 지침: \n- 답을 모를 경우, 모른다고 명확히 말하세요. \n- 확신이 없다면, 사용자에게 추가 설명을 요청하세요. \n- 사용자의 질문과 동일한 언어로 답변하세요. \n- 컨텍스트가 읽기 어렵거나 품질이 낮을 경우, 이를 사용자에게 알리고 최선의 답변을 제공하세요. \n- 컨텍스트에 답이 없지만 알고 있는 내용이라면, 이를 사용자에게 설명하고 자신의 지식을 바탕으로 답변하세요. \n- XML 태그를 사용하지 마세요. \n\n### 출력: \n사용자의 질문에 대해 명확하고 직접적인 답변을 제공하세요.\n\n<context>\n{retrieved_contents}\n</context>\n\n<user_query>\n{query}\n</user_query>\n'}",0.0003142237663269
generator,0.parquet,LlamaIndexLLM,"{'llm': 'ollama', 'model': 'gemma3:12b', 'temperature': 0.0, 'request_timeout': 300.0, 'batch': 8}",0.8519447922706604


@@ -0,0 +1,2 @@
filename,module_name,module_params,execution_time,passage_reranker_retrieval_recall,passage_reranker_retrieval_precision,passage_reranker_retrieval_map,is_best
0.parquet,DragonKue2,{'top_k': 5},0.12188564538955689,0.3,0.06,0.18916666666666665,True


@@ -0,0 +1,7 @@
filename,module_name,module_params,execution_time,retrieval_f1,retrieval_recall,retrieval_precision,retrieval_ndcg,retrieval_map,retrieval_mrr,is_best
0.parquet,VectorDB,"{'top_k': 10, 'vectordb': 'chroma_dragonkue2'}",0.10161013603210449,0.045454545454545456,0.25,0.025,0.14013009087326042,0.10625,0.10625,False
1.parquet,BM25,"{'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'}",1.9859044432640076,0.03636363636363636,0.2,0.02,0.07248116240107563,0.034999999999999996,0.034999999999999996,False
2.parquet,HybridCC,"{'top_k': 10, 'normalize_method': 'dbsf', 'target_modules': ('VectorDB', 'BM25'), 'weight': 0.516, 'target_module_params': ({'top_k': 10, 'vectordb': 'chroma_dragonkue2'}, {'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'})}",2.087514579296112,0.06363636363636363,0.35,0.035,0.20447427813233116,0.16041666666666665,0.16041666666666665,True
3.parquet,HybridCC,"{'top_k': 10, 'normalize_method': 'mm', 'target_modules': ('VectorDB', 'BM25'), 'weight': 0.51, 'target_module_params': ({'top_k': 10, 'vectordb': 'chroma_dragonkue2'}, {'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'})}",2.087514579296112,0.06363636363636363,0.35,0.035,0.20447427813233116,0.16041666666666665,0.16041666666666665,False
4.parquet,HybridCC,"{'top_k': 10, 'normalize_method': 'tmm', 'target_modules': ('VectorDB', 'BM25'), 'weight': 0.454, 'target_module_params': ({'top_k': 10, 'vectordb': 'chroma_dragonkue2'}, {'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'})}",2.087514579296112,0.05454545454545454,0.3,0.03,0.15007396002669662,0.10499999999999998,0.10499999999999998,False
5.parquet,HybridCC,"{'top_k': 10, 'normalize_method': 'z', 'target_modules': ('VectorDB', 'BM25'), 'weight': 0.516, 'target_module_params': ({'top_k': 10, 'vectordb': 'chroma_dragonkue2'}, {'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'})}",2.087514579296112,0.06363636363636363,0.35,0.035,0.20447427813233116,0.16041666666666665,0.16041666666666665,False
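The HybridCC rows fuse the VectorDB and BM25 rankings by convex combination: each module's scores are normalized (mm: min-max, tmm: theoretical min-max, z: z-score, dbsf: distribution-based score fusion) and mixed with the given weight, so weight = 0.516 leans slightly toward the semantic scores. A minimal sketch of the min-max ("mm") variant, assuming score dictionaries keyed by passage id and hypothetical helper names, not AutoRAG's implementation:

```python
def minmax(scores):
    """Min-max normalize a {passage_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}

def hybrid_cc(semantic, lexical, weight, top_k):
    """Convex combination: weight * semantic + (1 - weight) * lexical."""
    sem, lex = minmax(semantic), minmax(lexical)
    fused = {
        pid: weight * sem.get(pid, 0.0) + (1 - weight) * lex.get(pid, 0.0)
        for pid in set(sem) | set(lex)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# weight=0.516 slightly favors the VectorDB (semantic) side of the fusion.
print(hybrid_cc({"a": 0.9, "b": 0.2}, {"b": 12.0, "c": 8.0}, 0.516, 10))
```

Three of the four normalization methods land on identical metrics here, which is common on small evaluation sets where the fused rankings coincide.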

View File

@@ -0,0 +1,3 @@
node_type,best_module_filename,best_module_name,best_module_params,best_execution_time
retrieval,2.parquet,HybridCC,"{'top_k': 10, 'normalize_method': 'dbsf', 'target_modules': ('VectorDB', 'BM25'), 'weight': 0.516, 'target_module_params': ({'top_k': 10, 'vectordb': 'chroma_dragonkue2'}, {'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'})}",2.087514579296112
passage_reranker,0.parquet,DragonKue2,{'top_k': 5},0.1218856453895568

View File

@@ -0,0 +1,5 @@
node_line_name,node_type,best_module_filename,best_module_name,best_module_params,best_execution_time
retrieve_node_line,retrieval,2.parquet,HybridCC,"{'top_k': 10, 'normalize_method': 'dbsf', 'target_modules': ('VectorDB', 'BM25'), 'weight': 0.516, 'target_module_params': ({'top_k': 10, 'vectordb': 'chroma_dragonkue2'}, {'top_k': 10, 'bm25_tokenizer': 'ko_kiwi'})}",2.087514579296112
retrieve_node_line,passage_reranker,0.parquet,DragonKue2,{'top_k': 5},0.1218856453895568
post_retrieve_node_line,prompt_maker,0.parquet,Fstring,"{'prompt': '### 작업: \n지침에 따라 제공된 컨텍스트를 활용하여 사용자 질문에 답변하세요. \n\n### 지침: \n- 답을 모를 경우, 모른다고 명확히 말하세요. \n- 확신이 없다면, 사용자에게 추가 설명을 요청하세요. \n- 사용자의 질문과 동일한 언어로 답변하세요. \n- 컨텍스트가 읽기 어렵거나 품질이 낮을 경우, 이를 사용자에게 알리고 최선의 답변을 제공하세요. \n- 컨텍스트에 답이 없지만 알고 있는 내용이라면, 이를 사용자에게 설명하고 자신의 지식을 바탕으로 답변하세요. \n- XML 태그를 사용하지 마세요. \n\n### 출력: \n사용자의 질문에 대해 명확하고 직접적인 답변을 제공하세요.\n\n<context>\n{retrieved_contents}\n</context>\n\n<user_query>\n{query}\n</user_query>\n'}",0.0003142237663269
post_retrieve_node_line,generator,0.parquet,LlamaIndexLLM,"{'llm': 'ollama', 'model': 'gemma3:12b', 'temperature': 0.0, 'request_timeout': 300.0, 'batch': 8}",0.8519447922706604
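Read together, the summary files describe the selected pipeline end to end: HybridCC retrieval, DragonKue2 reranking, an Fstring prompt, and generation with gemma3:12b via Ollama. A quick way to pull the winning modules out of such a summary with pandas (the path below is illustrative; point it at your own trial_dir):

```python
import pandas as pd

# Trial-level summary written by AutoRAG; adjust the path to your trial_dir.
summary = pd.read_csv("projects/benchmark_sample/0/summary.csv")

# Each row records the winning module for one node of the pipeline.
for _, row in summary.iterrows():
    print(f"{row['node_line_name']}/{row['node_type']}: "
          f"{row['best_module_name']} {row['best_module_params']}")
```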

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,7 @@
vectordb:
  - client_type: persistent
    collection_name: huggingface_drangonku-v2-ko
    db_type: chroma
    embedding_model: huggingface_drangonku-v2-ko
    name: chroma_dragonkue2
    path: ../projects/daesan-dangjin_01/benchmark/resources/chroma
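This file declares the persistent Chroma collection that the retrieval rows above reference as vectordb: 'chroma_dragonkue2'. A minimal sketch for inspecting that collection directly with the standard chromadb client (the path is relative to wherever the benchmark ran):

```python
import chromadb

# Open the on-disk Chroma store declared in the YAML above.
client = chromadb.PersistentClient(
    path="../projects/daesan-dangjin_01/benchmark/resources/chroma"
)
collection = client.get_collection("huggingface_drangonku-v2-ko")

print(collection.count())   # number of embedded passages
print(collection.peek(1))   # one stored document with its metadata
```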

View File

@@ -0,0 +1,10 @@
[
    {
        "trial_name": "0",
        "start_time": "2025-03-13 07:47:00"
    },
    {
        "trial_name": "1",
        "start_time": "2025-03-13 08:03:47"
    }
]

149
pyproject.toml Normal file
View File

@@ -0,0 +1,149 @@
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"

[project]
name = "AutoRAG"
authors = [
    { name = "Marker-Inc", email = "vkehfdl1@gmail.com" }
]
description = 'Automatically Evaluate RAG pipelines with your own data. Find optimal structure for new RAG product.'
readme = "README.md"
requires-python = ">=3.10"
keywords = ['RAG', 'AutoRAG', 'autorag', 'rag-evaluation', 'evaluation', 'rag-auto', 'AutoML', 'AutoML-RAG']
license = { file = "LICENSE" }
classifiers = [
    "Intended Audience :: Developers",
    "Intended Audience :: Information Technology",
    "Intended Audience :: Science/Research",
    'Programming Language :: Python :: 3.10',
    'Programming Language :: Python :: 3.11',
    'Programming Language :: Python :: 3.12',
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Topic :: Software Development :: Libraries",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
urls = { Homepage = "https://github.com/Marker-Inc-Korea/AutoRAG" }
dynamic = ["version", "dependencies"]

[tool.poetry]
name = "AutoRAG"
version = "0.0.2" # initial version
description = "Automatically Evaluate RAG pipelines with your own data. Find optimal structure for new RAG product."
authors = ["Marker-Inc <vkehfdl1@gmail.com>"]

[tool.setuptools.dynamic]
version = { file = ["autorag/VERSION"] }
dependencies = { file = ["requirements.txt"] }

[tool.setuptools]
include-package-data = true

[tool.setuptools.packages.find]
where = ["."]
include = ["autorag*"]
exclude = ["tests"]

[tool.pytest.ini_options]
pythonpath = ["."]
testpaths = ["tests"]
addopts = ["--import-mode=importlib"] # default is prepend

[project.optional-dependencies]
ko = ["kiwipiepy >= 0.18.0", "konlpy"]
dev = ["ruff", "pre-commit"]
parse = ["PyMuPDF", "pdfminer.six", "pdfplumber", "unstructured", "jq", "unstructured[pdf]", "PyPDF2<3.0", "pdf2image"]
ja = ["sudachipy>=0.6.8", "sudachidict_core"]
gpu = ["torch", "sentencepiece", "bert_score", "optimum[openvino,nncf]", "peft", "llmlingua", "FlagEmbedding",
    "sentence-transformers", "transformers", "llama-index-llms-ollama", "llama-index-embeddings-huggingface",
    "llama-index-llms-huggingface", "onnxruntime"]
all = ["AutoRAG[gpu]", "AutoRAG[ko]", "AutoRAG[dev]", "AutoRAG[parse]", "AutoRAG[ja]"]

[project.entry-points.console_scripts]
autorag = "autorag.cli:cli"

[tool.ruff]
# Exclude a variety of commonly ignored directories.
exclude = [
    ".bzr",
    ".direnv",
    ".eggs",
    ".git",
    ".git-rewrite",
    ".hg",
    ".ipynb_checkpoints",
    ".mypy_cache",
    ".nox",
    ".pants.d",
    ".pyenv",
    ".pytest_cache",
    ".pytype",
    ".ruff_cache",
    ".svn",
    ".tox",
    ".venv",
    ".vscode",
    "__pypackages__",
    "_build",
    "buck-out",
    "build",
    "dist",
    "node_modules",
    "site-packages",
    "venv",
]

# Same as Black.
line-length = 88
indent-width = 4

# Match requires-python above (>=3.10).
target-version = "py310"

[tool.ruff.lint]
# Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default.
# Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
# McCabe complexity (`C901`) by default.
select = ["E4", "E7", "E9", "F"]
ignore = ["E722", "F821"]

# Allow fix for all enabled rules (when `--fix` is provided).
fixable = ["ALL"]
unfixable = ["B"]

# Allow unused variables when underscore-prefixed.
dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402", "F401"]
"**/{docs}/*" = ["E402", "F401"]
"test_*.py" = ["F401", "F811"]
"*_test.py" = ["F401", "F811"]
"resources/parse_data/*" = ["W292"]

[tool.ruff.format]
# Like Black, use double quotes for strings.
quote-style = "double"

# Unlike Black, indent with tabs rather than spaces.
indent-style = "tab"

# Like Black, respect magic trailing commas.
skip-magic-trailing-comma = false

# Like Black, automatically detect the appropriate line ending.
line-ending = "auto"

# Enable auto-formatting of code examples in docstrings. Markdown,
# reStructuredText code/literal blocks and doctests are all supported.
#
# This is currently disabled by default, but it is planned for this
# to be opt-out in the future.
docstring-code-format = true

# Set the line length limit used when formatting code snippets in
# docstrings.
#
# This only has an effect when the `docstring-code-format` setting is
# enabled.
docstring-code-line-length = "dynamic"
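
Since version and dependencies are declared dynamic, setuptools resolves them at build time from autorag/VERSION and requirements.txt. Roughly, it does the equivalent of the following (illustrative sketch only, not setuptools internals):

```python
from pathlib import Path

# Approximately what [tool.setuptools.dynamic] resolves at build time.
version = Path("autorag/VERSION").read_text().strip()
dependencies = [
    line.split("#", 1)[0].strip()   # drop inline comments
    for line in Path("requirements.txt").read_text().splitlines()
    if line.split("#", 1)[0].strip()  # skip blank / comment-only lines
]
print(version, len(dependencies))
```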

69
requirements.txt Normal file
View File

@@ -0,0 +1,69 @@
pydantic<2.10.0 # incompatible with llama index
numpy<2.0.0 # temporarily avoid numpy 2.0.0
pandas>=2.1.0
tqdm
tiktoken>=0.7.0 # for counting token
openai>=1.0.0
rank_bm25 # for bm25 retrieval
pyyaml # for yaml file
pyarrow # for pandas with parquet
fastparquet # for pandas with parquet
sacrebleu # for bleu score
evaluate # for meteor and other scores
rouge_score # for rouge score
rich # for pretty logging
click # for cli
cohere>=5.8.0 # for cohere services
tokenlog>=0.0.2 # for token logging
aiohttp # for async http requests
voyageai # for voyageai reranker
mixedbread-ai # for mixedbread-ai reranker
llama-index-llms-bedrock
scikit-learn
emoji
### Vector DB ###
pymilvus>=2.3.0 # for using milvus vectordb
chromadb>=0.5.0 # for chroma vectordb
weaviate-client # for weaviate vectordb
pinecone[grpc] # for pinecone vectordb
couchbase # for couchbase vectordb
qdrant-client # for qdrant vectordb
### API server ###
quart
pyngrok
### LlamaIndex ###
llama-index>=0.11.0
llama-index-core==0.12.24
# readers
llama-index-readers-file
# Embeddings
llama-index-embeddings-openai
llama-index-embeddings-ollama
# LLMs
llama-index-llms-openai>=0.2.7
llama-index-llms-openai-like
# Retriever
llama-index-retrievers-bm25
# WebUI
streamlit
gradio
### Langchain ###
langchain-core>=0.3.0
langchain-unstructured>=0.1.5
langchain-upstage
langchain-community>=0.3.0
# autorag dashboard
panel
seaborn
ipykernel
ipywidgets
ipywidgets_bokeh
# added libraries - 김용연
llama-index-llms-ollama
llama-index-embeddings-huggingface