kor-mark-search


Name: kor-mark-search
Version: 0.1.0
Home page: https://github.com/bill0077/kor-mark-search
Summary: A package suitable for searching queries in Korean-based Markdown, including features such as automatic typo correction.
Upload time: 2024-05-12 03:56:39
Maintainer: None
Docs URL: None
Author: bill0077
Requires Python: <4.0,>=3.10
License: MIT
Keywords: korean, markdown, search
Requirements: No requirements were recorded.

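# kor_mark_search
`kor_mark_search` is a search engine for finding queries in Korean Markdown documents inside a local folder.
It can still find matches when the query contains minor typos or was typed with the wrong Korean/English keyboard mode.

It builds an index from the Markdown documents inside the root folder, then searches that index for the given query. Because similar words are grouped into a single token when the index is built, typos in either the query or the Markdown documents have little effect on the results (special characters are not indexed).

ex) '컨테이너' ("container"), '컨테ㅇ너', '커테ㅣㅇ너', 'zjsxpdlsj', and 'zjsxpdjs' are all interpreted as the single token '컨테이너'.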
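
To make the example above concrete, here is a minimal sketch of one way such inputs could be compared. It is not taken from the package's source: `to_keys`, `same_group`, and `tolerance` are hypothetical names, and using `difflib` for the similarity measure is an assumption. The sketch spells each Hangul syllable out as the QWERTY keys of the standard two-set (dubeolsik) layout, so that '컨테이너' and 'zjsxpdlsj' become the same string, and then treats two inputs as one group when their key sequences are close enough.

```python
from difflib import SequenceMatcher

# QWERTY keys of the two-set (dubeolsik) layout, listed in Unicode jamo order.
CHO = ["r", "R", "s", "e", "E", "f", "a", "q", "Q", "t",
       "T", "d", "w", "W", "c", "z", "x", "v", "g"]
JUNG = ["k", "o", "i", "O", "j", "p", "u", "P", "h", "hk", "ho",
        "hl", "y", "n", "nj", "np", "nl", "b", "m", "ml", "l"]
JONG = ["", "r", "R", "rt", "s", "sw", "sg", "e", "f", "fr", "fa", "fq", "ft",
        "fx", "fv", "fg", "a", "q", "qt", "t", "T", "d", "w", "c", "z", "x",
        "v", "g"]


def to_keys(text: str) -> str:
    """Spell precomposed Hangul syllables out as QWERTY keystrokes."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00  # 0xAC00 is the first Hangul syllable
        if 0 <= offset < 11172:
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            out.append(CHO[cho] + JUNG[jung] + JONG[jong])
        else:
            out.append(ch)  # anything else is passed through unchanged
    return "".join(out)


def same_group(a: str, b: str, tolerance: float = 0.25) -> bool:
    """Hypothetical grouping rule: key sequences differ by at most `tolerance`."""
    ratio = SequenceMatcher(None, to_keys(a), to_keys(b)).ratio()
    return 1.0 - ratio <= tolerance


print(to_keys("컨테이너"))                 # zjsxpdlsj
print(same_group("컨테이너", "zjsxpdjs"))  # True
```

In this sketch `tolerance` plays a role loosely similar to the `alpha` parameter documented below, in that a larger value lets more inputs fall into one group. The package's actual grouping rule may differ.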

# Getting started
## Setting up the environment
`kor-mark-search` does not require any external packages. After `git clone`, you can run main.py to get started quickly.

Alternatively, install it with `pip3 install kor-mark-search`. Import the package with `import kor_mark_search` and use the `kor_mark_search.index_search.search` function.

minimal example (same as main.py):
```python
from kor_mark_search.index_search import search

while True:
  query = input('query:')
  result = search(query, 'YOUR_ROOT_PATH')
  print(result)
```

# kor_mark_search.index_search
## search
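Takes the query to search for and the root folder as input.
It returns all Markdown files sorted by how well they match the query, along with the tokens used for the search and their scores.

The first call takes some time because the index has to be built (an O(n^2) time complexity in the total document length n, although it grows very gently), but once the index exists, later calls load it and search against it.
Put your Markdown files in the root folder (configurable through an argument) and the index is built from the documents inside that folder.
This function searches for the query using the index. It has the following parameters:
- `root`: The folder containing the Markdown files to index. Defaults to 'root'.
- `skip_indexing`: A list of folders to exclude from indexing. Not set by default.
- `index_file`: The path where the index file is created. Defaults to 'index/path_to_root.json'.
- `alpha`: The threshold for deciding whether different tokens belong to the same group. The higher the value, the more tokens are expected to fall into one group.
- `beta`: The threshold for deciding whether the input was typed with the Korean/English key in the wrong state. The higher the value, the more readily a Korean/English key typo is assumed.
- `min_results`: At least `min_results` Markdown files are returned. Any beyond that are returned only if their relevance exceeds `beta`.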
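
The `min_results` rule can be read as a simple filter over the ranked files. The sketch below is only an illustration of that rule, not the package's code: `select_results` and `threshold` are hypothetical names (the README names `beta` as the cutoff), and the scores are made up.

```python
def select_results(scored, min_results=3, threshold=0.5):
    """Keep the top `min_results` files, plus any others scoring above `threshold`.

    `scored` is a list of (path, score) pairs in any order.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    selected = []
    for position, (path, score) in enumerate(ranked):
        # The first `min_results` entries are always kept; later entries
        # must clear the threshold on their own.
        if position < min_results or score > threshold:
            selected.append((path, score))
    return selected


scored = [("b.md", 0.7), ("d.md", 0.1), ("a.md", 0.9), ("c.md", 0.2)]
print(select_results(scored, min_results=1, threshold=0.5))
# [('a.md', 0.9), ('b.md', 0.7)]
```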

# kor_mark_search.index_builder
## load_index
Loads an index from a file.
- `path`: Path of the file to load
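
As described for `search`, the index is built once and then reloaded on later calls. A minimal sketch of that build-once, load-afterwards pattern follows. It assumes the index is a JSON-serializable dictionary, which matches the `.json` default path but is otherwise a guess. `load_or_build` and its `build` callback are hypothetical and not part of the package.

```python
import json
import os


def load_or_build(index_file, build):
    """Return the index stored at `index_file`, building it only if missing."""
    if os.path.exists(index_file):
        with open(index_file, encoding="utf-8") as f:
            return json.load(f)
    index = build()  # the slow step, run only on the first call
    os.makedirs(os.path.dirname(index_file) or ".", exist_ok=True)
    with open(index_file, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Korean text readable in the file
        json.dump(index, f, ensure_ascii=False)
    return index
```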

## build_index
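Builds the index for the root folder.
- `root`: The top-level folder to index
- `index_file`: Path of the file in which to save the generated index
- `skip_indexing`: Folders to exclude from indexing. Files in these folders are not indexed, wherever the folders are located.
- `alpha`: Same as the alpha used in `kor_mark_search.index_search.search` above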
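
The `skip_indexing` behavior (a folder is skipped wherever it appears under the root) can be sketched with `os.walk`. This is an illustrative traversal, not the package's implementation. `iter_markdown_files` is a hypothetical helper, and matching files by the `.md` extension is an assumption.

```python
import os


def iter_markdown_files(root, skip_indexing=()):
    """Yield Markdown file paths under `root`, skipping the named folders."""
    skip = set(skip_indexing)
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning `dirnames` in place stops os.walk from descending into
        # a skipped folder, at any depth.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in sorted(filenames):
            if name.lower().endswith(".md"):
                yield os.path.join(dirpath, name)
```

With `skip_indexing=["drafts"]`, both `root/drafts/` and `root/sub/drafts/` would be left out of the traversal.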

## add_index
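Adds a new Markdown file to an existing index. Adding a Markdown file that is already indexed with `add_index` again overwrites that file's index entry.
- `markdown_path`: Path of the Markdown file to add to the index
- `index_file`: Path of the existing index file
- `alpha`: Same as the alpha in `build_index`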
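
The overwrite behavior follows naturally if the index is keyed by file path. The sketch below assumes a hypothetical layout in which the index file is a JSON object mapping each Markdown path to its list of tokens. The real index format is not documented here, and `add_entry` and its whitespace tokenizer are placeholders for illustration only.

```python
import json
import os


def add_entry(markdown_path, index_file):
    """Add or replace the entry for `markdown_path` in a path-keyed JSON index."""
    index = {}
    if os.path.exists(index_file):
        with open(index_file, encoding="utf-8") as f:
            index = json.load(f)
    with open(markdown_path, encoding="utf-8") as f:
        tokens = f.read().split()  # placeholder tokenizer: split on whitespace
    # Assigning by key replaces any earlier entry for the same file.
    index[markdown_path] = tokens
    with open(index_file, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False)
    return index
```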
            
