data-filtering

A library to filter and deduplicate Q&A text datasets from CSV files.

0.1.21
PyPI
Maintainers
1

QA ๋ฐ์ดํ„ฐ ํ•„ํ„ฐ๋ง ๋ฐ ์ค‘๋ณต ์ œ๊ฑฐ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ (data_filtering)

์ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ์งˆ๋ฌธ(Q)๊ณผ ๋‹ต๋ณ€(A) ์Œ์œผ๋กœ ๊ตฌ์„ฑ๋œ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹(CSV ํŒŒ์ผ)์—์„œ ์ค‘๋ณต์„ ํ™•์ธํ•˜๊ณ , ํ’ˆ์งˆ ๊ธฐ์ค€์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ์„ ๋ณ„ํ•˜๋Š” ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Key Features

  • Support for several CSV input formats:
    • Separate question and answer columns
    • A single column that combines question and answer
    • General text data (single column)
  • Quality filtering:
    • Text length (minimum/maximum length configurable; can be enabled or disabled)
    • Language detection (target language and confidence threshold configurable; can be enabled or disabled)
  • Deduplication:
    • Exact duplicates: removes texts that are completely identical
    • Semantic duplicates: removes semantically similar texts, with support for several embedding backends.
      • Sentence-Transformers: uses local models or models hosted on Hugging Face (e.g. Qwen/Qwen3-Embedding-0.6B)
      • OpenAI API: uses OpenAI models such as text-embedding-3-small
      • Google Gemini API: uses Google models such as gemini-embedding-exp-03-07
      • Configurable similarity threshold (default: 0.80) and keep criterion for duplicates ('first', 'longest')
      • Built-in batch processing and automatic retry mechanism
  • Output:
    • Saves the filtered data to a new CSV file
    • Generates a report with processing details and statistics (HTML or TXT format)
  • Configuration: most behavior can be configured in detail through config/default_settings.yaml, and can be overridden with CLI arguments or function-call arguments.
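
The interaction between the similarity threshold and the keep criterion can be illustrated with a small sketch. This is not the library's implementation: it swaps in a simple greedy grouping over cosine similarity in pure Python (the README states that the library itself uses scipy hierarchical clustering), and the function names `cosine_similarity` and `semantic_dedup` as well as the toy 2-D embeddings are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(texts, embeddings, threshold=0.80, keep="first"):
    """Return the sorted indices of the texts to keep (hypothetical helper).

    Greedy grouping: each text joins the first group whose representative
    (the group's first member) is at least `threshold` similar to it;
    otherwise it starts a new group.
    """
    if keep not in ("first", "longest"):
        raise ValueError("keep must be 'first' or 'longest'")
    groups = []  # each group is a list of indices; group[0] is the representative
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine_similarity(embeddings[group[0]], emb) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    kept = []
    for group in groups:
        if keep == "longest":
            kept.append(max(group, key=lambda idx: len(texts[idx])))
        else:
            kept.append(group[0])
    return sorted(kept)


# Toy data: the first two questions point in nearly the same direction.
texts = [
    "How do I reset my password?",
    "How can I reset my password, please?",
    "What is the refund policy?",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

print(semantic_dedup(texts, embeddings, keep="first"))    # [0, 2]
print(semantic_dedup(texts, embeddings, keep="longest"))  # [1, 2]
```

With 'first' the earliest member of each group survives; with 'longest' the longest text in the group survives, which is why index 1 replaces index 0 in the second call.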

์„ค์น˜ ๋ฐ ํ™˜๊ฒฝ ์„ค์ •

1. ํ™˜๊ฒฝ ์ค€๋น„ (Conda ๊ถŒ์žฅ)

์ƒˆ๋กœ์šด ๊ฐ€์ƒ ํ™˜๊ฒฝ์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. Conda๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ:

conda create --name data-filtering-env python=3.11 # ์˜ˆ์‹œ ํ™˜๊ฒฝ ์ด๋ฆ„ ๋ฐ Python ๋ฒ„์ „
conda activate data-filtering-env

Python 3.11 ์ด์ƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.

2. ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜

๋ฐฉ๋ฒ• A: PyPI์—์„œ ์„ค์น˜ (๋ฐฐํฌ๋œ ๊ฒฝ์šฐ)

pip install data-filtering

(์ด ๋ช…๋ น์–ด๋Š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ PyPI์— ์ •์‹ ๋ฐฐํฌ๋œ ํ›„์— ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.)

๋ฐฉ๋ฒ• B: ์†Œ์Šค์—์„œ ์ง์ ‘ ๋นŒ๋“œ ๋ฐ ์„ค์น˜ (ํ˜„์žฌ ๊ฐœ๋ฐœ/ํ…Œ์ŠคํŠธ ๋‹จ๊ณ„)

  • Download or clone the source code:

    git clone https://github.com/yourusername/data-filtering.git # replace with the actual repository URL
    cd data-filtering
    
  • Install the required build tool:

    pip install build
    
  • Build the package: run the following command from the project root directory.

    python -m build
    

    This command creates a .whl file and a .tar.gz file in the dist directory.

  • Install the built package: install it from the generated .whl file.

    pip install dist/data_filtering-0.1.0-py3-none-any.whl # replace with the actual generated file name
    

    Alternatively, during development you can install in editable mode:

    pip install -e .
    

3. Check and install dependencies

The data-filtering package attempts to install its required dependencies automatically. The main dependencies are listed in the dependencies section of pyproject.toml.

To use semantic deduplication, you must install the additional packages for the backend you choose.

  • When using the Sentence-Transformers backend:

    pip install sentence-transformers scipy
    
    • In CPU-only environments, it is a good idea to also install the CPU build of PyTorch:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    
  • When using the OpenAI or Gemini API backend:

    pip install openai instructor scipy
    
  • API key setup (when using OpenAI/Gemini): API-based backends require an API key.

    # Set the OpenAI API key
    export OPENAI_API_KEY="your_openai_api_key"
    
    # Or set the Google API key (when using Gemini)
    export GOOGLE_API_KEY="your_google_api_key"
    

    You can also put the key directly in the config file (not recommended for security reasons).

  • Optional dependencies:

    • scipy: used for hierarchical clustering during semantic duplicate detection. It is installed together with the sentence-transformers or API backend packages shown above.
    • instructor: simplifies interaction with the OpenAI/Gemini APIs.
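
The two ways of supplying an API key (environment variable or config file) can be sketched as follows. The environment variable names and the api_settings.<backend>.api_key config path come from this README; the helper name `resolve_api_key` and the precedence (environment variable first, config file second) are assumptions for illustration and may not match what the library actually does.

```python
import os

# Environment variable names as given in this README.
ENV_VAR_BY_BACKEND = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def resolve_api_key(backend, config=None, environ=None):
    """Hypothetical helper: find the API key for an API-based backend.

    Assumed precedence: environment variable, then
    config["api_settings"][backend]["api_key"].
    """
    if backend not in ENV_VAR_BY_BACKEND:
        raise ValueError(f"no API key is defined for backend {backend!r}")
    environ = os.environ if environ is None else environ

    key = environ.get(ENV_VAR_BY_BACKEND[backend])
    if key:
        return key

    # Tolerate missing or null sections in a parsed YAML config.
    api_settings = (config or {}).get("api_settings") or {}
    backend_settings = api_settings.get(backend) or {}
    key = backend_settings.get("api_key")
    if key:
        return key

    raise RuntimeError(
        f"set {ENV_VAR_BY_BACKEND[backend]} or api_settings.{backend}.api_key"
    )


# Example with an explicit environment mapping instead of the real os.environ.
print(resolve_api_key("openai", config={}, environ={"OPENAI_API_KEY": "sk-example"}))
```

Preferring the environment variable keeps secrets out of files that might be committed to version control, which matches the README's security advice.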

Usage

The library can be used in two main ways: through the command-line interface (CLI) or by calling it directly from Python code.

1. Using the command-line interface (CLI)

If the package is installed correctly, the data-filtering-cli command is available in your terminal.

Basic usage:

data-filtering-cli <input_csv_path>

Example:

data-filtering-cli examples/sample_data.csv

Main options:

  • --config <config_path>: specifies a custom YAML settings file. (default: config/default_settings.yaml)
  • --q_col <column_name>: specifies the question column name in the CSV file. (overrides the settings file value)
  • --a_col <column_name>: specifies the answer column name in the CSV file. (overrides the settings file value)
  • --qa_col <column_name>: specifies the combined question+answer column name in the CSV file. (used instead of q_col and a_col)
  • --encoding <encoding>: specifies the encoding of the input CSV file. (e.g. utf-8, cp949)
  • --output_dir <path>: specifies the directory where the result files (filtered CSV, report) are saved.

CLI examples:

# Run with a custom settings file
data-filtering-cli data/my_qna_data.csv --config config/my_custom_settings.yaml

# Specify the question/answer column names directly and change the output directory
data-filtering-cli data/another_data.csv --q_col "Question" --a_col "Answer" --output_dir processed_results

(If the data-filtering-cli command is not recognized, check that your Python environment's bin or Scripts directory has been added to the system PATH. Alternatively, you may need to run the module directly: python -m data_filtering.main_processor <input_csv_path>.)
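
For readers who want to see how options like these are typically wired up, here is a minimal argparse sketch. The flag names mirror the options listed in this README; the functions `build_parser` and `cli_overrides`, and the choice to drop unset flags so they do not overwrite values from the settings file, are assumptions and not the package's actual code.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented data-filtering-cli options."""
    parser = argparse.ArgumentParser(
        prog="data-filtering-cli",
        description="Filter and deduplicate a Q&A CSV file.",
    )
    parser.add_argument("input_csv", help="path to the input CSV file")
    parser.add_argument("--config", help="path to a custom YAML settings file")
    parser.add_argument("--q_col", help="question column name")
    parser.add_argument("--a_col", help="answer column name")
    parser.add_argument("--qa_col", help="combined question+answer column name")
    parser.add_argument("--encoding", help="input file encoding, e.g. utf-8 or cp949")
    parser.add_argument("--output_dir", help="directory for the filtered CSV and report")
    return parser


def cli_overrides(argv):
    """Parse argv and keep only the options the user actually supplied.

    Unset flags are dropped so that they cannot clobber settings-file values.
    """
    args = build_parser().parse_args(argv)
    return {name: value for name, value in vars(args).items() if value is not None}


overrides = cli_overrides(
    ["data/another_data.csv", "--q_col", "Question", "--a_col", "Answer",
     "--output_dir", "processed_results"]
)
print(overrides)
```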

2. Using it as a library from Python code

You can run the filtering process from a Python script with the run function of the data_filtering module.

from data_filtering import run

# Use the default settings (the default_settings.yaml bundled inside the package)
run(input_csv_path="examples/sample_data.csv")

# Use a custom settings file and override some options with kwargs
run(
    input_csv_path="data/my_qna_data.csv",
    config_path="path/to/my_custom_settings.yaml", # path to the user's YAML file
    output_dir="custom_output", # override a top-level setting via kwargs
    deduplication={"semantic_threshold": 0.88} # override a nested setting via kwargs
)

# Specify column names directly (takes precedence over the config file)
run(
    input_csv_path="data/other_format.csv",
    q_col="Inquiry",
    a_col="Response"
)
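
The nested override in the second call (deduplication={"semantic_threshold": 0.88}) suggests that keyword arguments are merged into the loaded settings rather than replacing whole sections. A minimal sketch of such a recursive merge follows. The function name `deep_merge` and the sample defaults dictionary are assumptions for illustration; the README does not describe how run performs the merge internally.

```python
import copy


def deep_merge(base, overrides):
    """Return a new dict with `overrides` merged recursively into `base`.

    Nested dicts are merged key by key; any other value replaces the
    existing one. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative subset of the settings, using keys documented in this README.
defaults = {
    "output_dir": "./output",
    "deduplication": {"enable_exact": True, "semantic_threshold": 0.80},
}

settings = deep_merge(
    defaults,
    {"output_dir": "custom_output", "deduplication": {"semantic_threshold": 0.88}},
)
print(settings)
# {'output_dir': 'custom_output', 'deduplication': {'enable_exact': True, 'semantic_threshold': 0.88}}
```

Merging key by key means that overriding semantic_threshold leaves enable_exact at its default instead of dropping it.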

Configuration (data_filtering/config/default_settings.yaml)

The library's detailed behavior is controlled through data_filtering/config/default_settings.yaml. This file is bundled inside the package; for custom settings, copy it, edit the copy, and use that.

Main settings

  • Input/output settings:

    • input_csv: path to the input CSV file
    • output_dir: directory where result files are saved (default: ./output)
    • q_col, a_col: question/answer column names (for QA datasets)
    • qa_col: name of the column that combines question and answer (for single-column datasets)
    • encoding: input file encoding (default: utf-8)
  • Deduplication settings (deduplication):

    • enable_exact: enables exact deduplication (default: true)
    • enable_semantic: enables semantic deduplication (default: false)
    • backend: embedding backend (sentence-transformers, openai, gemini)
    • model: name of the model to use (e.g. Qwen/Qwen3-Embedding-0.6B, text-embedding-3-small)
    • semantic_threshold: semantic similarity threshold (0.0 to 1.0, default: 0.80)
    • keep_criterion: which item to keep among duplicates (first: the first item, longest: the longest text)
    • batch_size: batch size (when using an API backend, default: 32)
    • max_retries: maximum number of retries when an API call fails (default: 3)
  • Quality filter settings (quality_filters):

    • length: text length filter
      • enable: whether the filter is enabled
      • min: minimum length (default: 10)
      • max: maximum length (default: 1000)
    • language: language filter
      • enable: whether the filter is enabled
      • target: target language code (e.g. ko, en)
      • confidence_threshold: language detection confidence threshold (0.0 to 1.0)
  • Report settings (report):

    • format: output format (html or txt)
    • filename: report file name (without extension)
    • include_rejected_samples: whether to include rejected samples
    • num_rejected_samples: number of rejected samples to include (top N)
  • API settings (api_settings):

    • openai: OpenAI API settings
      • api_key: API key (using an environment variable is recommended for security)
      • base_url: custom endpoint URL (optional)
    • gemini: Google Gemini API settings
      • api_key: API key (using an environment variable is recommended for security)
      • base_url: defaults to the Google API endpoint
  • Quality filter settings (quality_filters):

    • Length filter (length): enabled flag (enable), minimum/maximum length (min, max)
    • Language filter (language): enabled flag (enable), target language (target), confidence threshold (confidence_threshold)
  • Report settings (report):

    • Report format (format: html or txt)
    • Report file name (filename)
    • Number of rejected samples to include in the report (include_rejected_samples)
  • Output CSV settings (output_csv):

    • File name of the filtered data CSV (filename)
    • List of columns to include in the final CSV (columns)

For custom settings, copy default_settings.yaml, edit the copy, and pass it with the --config option or the config_path argument of the run function.

Directory Structure

.
├── data_filtering/          # library source code package
│   ├── config/              # default settings directory
│   │   └── default_settings.yaml
│   ├── templates/           # HTML report template
│   │   └── report_template.html
│   ├── __init__.py
│   ├── data_handler.py
│   ├── duplication_handler.py
│   ├── main_processor.py    # CLI entry point and core logic
│   ├── quality_checker.py
│   └── report_generator.py
├── examples/                # example scripts and data
│   ├── run_example.py       # Python script showing library usage
│   └── sample_data.csv
├── tests/                   # test code
│   ├── pytest.ini           # Pytest settings (PYTHONPATH, etc.)
│   └── ... (test files for each module) ...
├── MANIFEST.in              # list of non-Python files to include in the package
├── pyproject.toml           # build system, project metadata, dependency declarations
├── README.md                # this file
├── requirements.txt         # dependency list for the development environment (optional)
└── setup.py                 # Setuptools configuration file

Running Tests

The project uses pytest to verify its functionality.

  • Activate the Conda environment (data-filtering-env).

  • Run the following command from the project root directory:

    pytest
    

์ฃผ์š” ์—…๋ฐ์ดํŠธ ์‚ฌํ•ญ

  • v0.1.2 (์ตœ์‹ ):
    • duplication_handler.py ๋ฆฌํŒฉํ† ๋ง
    • EmbeddingProvider ์ถ”์ƒ ํด๋ž˜์Šค ๋„์ž…์œผ๋กœ ์ž„๋ฒ ๋”ฉ ๋ฐฑ์—”๋“œ ๊ตฌ์กฐ ๊ฐœ์„ 
    • SentenceTransformerProvider, OpenAIEmbeddingProvider, GeminiEmbeddingProvider ๊ตฌํ˜„์ฒด ์ถ”๊ฐ€
    • ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ ๋ฐ ์žฌ์‹œ๋„ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋‚ด์žฅ
    • API ํ‚ค ๊ด€๋ฆฌ๋ฅผ ์œ„ํ•œ ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ์ง€์›
    • ์„ค์ • ํŒŒ์ผ ๊ตฌ์กฐ ๊ฐœ์„  ๋ฐ ๋ฌธ์„œํ™”

Keywords

data filtering
