semantic-code-search


Name: semantic-code-search
Version: 0.4.0
Home page: https://github.com/sturdy-dev/semantic-code-search
Summary: Search your codebase with natural language.
Upload time: 2022-12-23 10:38:08
Author: Kiril Videlov
Requires Python: >=3.8
Keywords: semantic code search
# Semantic Code Search

<p align="center">
  <img src="https://raw.githubusercontent.com/sturdy-dev/semantic-code-search/main/docs/readme-banner.png">
</p>
<p align='center'>
  Search your codebase with natural language. No data leaves your computer.
</p>
<p align='center'>
    <a href="https://github.com/sturdy-dev/semantic-code-search/blob/main/LICENSE.txt">
        <img alt="GitHub"
        src="https://img.shields.io/github/license/sturdy-dev/semantic-code-search">
    </a>
    <a href="https://pypi.org/project/semantic-code-search">
     <img alt="PyPi"
 src="https://img.shields.io/pypi/v/semantic-code-search">
    </a>
</p>
<p align="center">
  <a href="#overview">🔍 Overview</a> •
  <a href="#installation">🔧 Installation</a> •
  <a href="#usage">💻 Usage</a> •
  <a href="#command-line-flags">📖 Docs</a> •
  <a href="#how-it-works">🧠 How it works</a>
</p>

--------------------------------------------------------------------

## Overview

`sem` is a command-line application which allows you to search your git repository using natural language. For example, you can query for:

- 'Where are API requests authenticated?'
- 'Saving user objects to the database'
- 'Handling of webhook events'
- 'Where are jobs read from the queue?'

You will get a (visualized) list of code snippets and their `file:line` locations. You can use `sem` for exploring large codebases or, if you are as forgetful as I am, even small ones.

Basic usage:

```bash
sem 'my query'
```

This will present you with a list of code snippets that most closely match your search. You can select one and press `Return` to open it in your editor of choice.

How does this work? In a nutshell, it uses a neural network to generate code embeddings. More info [below](#how-it-works).

> NB: All processing is done on your hardware and no data is transmitted to the Internet.

## Installation

You can install `semantic-code-search` via `pip`.

### Pip (macOS, Linux, Windows)

```bash
pip3 install semantic-code-search
```

## Usage

TL;DR:

```bash
cd /my/repo
sem 'my query'
```

Run `sem --help` to see [all available options](#command-line-flags).

### Searching for code

Inside your repo, simply run

```bash
sem 'my query'
```

*(quotes can be omitted)*

> Note that you *need to* be inside a git repository or provide a path to a repo with the `-p` argument.

Before you get your *first* search results, two things need to happen:

- The app downloads its [model](#model) (~500 MB). This is done only once per installation.
- The app generates 'embeddings' of your code. This will be cached in an `.embeddings` file at the root of the repo and is reused in subsequent searches.

Depending on the project size, the above can take from a couple of seconds to minutes. Once this is complete, querying is very fast.

Example output:

```console
foo@bar:~$ cd /my/repo
foo@bar:~$ sem 'parsing command line args'
Embeddings not found in /Users/kiril/src/semantic-code-search. Generating embeddings now.
Embedding 15 functions in 1 batches. This is done once and cached in .embeddings
Batches: 100%|█████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.05s/it]
```

### Navigating search results

By default, a list of the top 5 matches is shown, containing:

- Similarity score
- File path
- Line number
- Code snippet

You can navigate the list using the `↑` `↓` arrow keys or `vim` bindings. Pressing `return` will open the relevant file at the line of the code snippet in your editor.

> NB: The editor used for opening can be set with the `--editor` argument.

Example results:

![example results](./docs/example-results.png)

### Command line flags

```bash
usage: sem [-h] [-p PATH] [-m MODEL] [-d] [-b BS] [-x EXT] [-n N]
           [-e {vscode,vim}] [-c] [--cluster-max-distance THRESHOLD]
           [--cluster-min-lines SIZE] [--cluster-min-cluster-size SIZE]
           [--cluster-ignore-identincal]
           ...

Search your codebase using natural language

positional arguments:
  query_text

optional arguments:
  -h, --help            show this help message and exit
  -p PATH, --path-to-repo PATH
                        Path to the root of the git repo to search or embed
  -m MODEL, --model-name-or-path MODEL
                        Name or path of the model to use
  -d, --embed           (Re)create the embeddings index for codebase
  -b BS, --batch-size BS
                        Batch size for embeddings generation
  -x EXT, --file-extension EXT
                        File extension filter (e.g. "py" will only return
                        results from Python files)
  -n N, --n-results N   Number of results to return
  -e {vscode,vim}, --editor {vscode,vim}
                        Editor to open selected result in
  -c, --cluster         Generate clusters of code that is semantically
                        similar. You can use this to spot near duplicates,
                        results are simply printed to stdout
  --cluster-max-distance THRESHOLD
                        How close functions need to be to one another to be
                        clustered. Distance 0 means that the code is
                        identical, smaller values (e.g. 0.2, 0.3) are stricter
                        and result in fewer matches
  --cluster-min-lines SIZE
                        Ignore clusters with code snippets smaller than this
                        size (lines of code). Use this if you are not
                        interested in smaller duplications (eg. one liners)
  --cluster-min-cluster-size SIZE
                        Ignore clusters smaller than this size. Use this if
                        you want to find code that is similar and repeated
                        many times (e.g. >5)
  --cluster-ignore-identincal
                        Ignore identical code / exact duplicates (where
                        distance is 0)
```

## How it works

In a nutshell, this application uses a [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) machine learning model to generate embeddings of methods and functions in your codebase. Embeddings are information-dense numerical representations of the semantics of the text/code they represent.

Here is a great blog post by Jay Alammar which explains the concept really nicely:
> <https://jalammar.github.io/illustrated-word2vec/>

When the app is run with the `--embed` argument, function and method definitions are first extracted from the source files and then used for sentence embedding. To avoid doing this for every query, the results are compressed and saved in an `.embeddings` file.
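
As an illustration of that step, here is a minimal Python sketch that embeds a handful of hypothetical extracted function snippets with the `sentence-transformers` library and caches them to disk. The snippet texts, the batch size, and the gzip-pickle cache format are assumptions made for the example, not the tool's actual implementation or on-disk layout.

```python
import gzip
import pickle

from sentence_transformers import SentenceTransformer

# Hypothetical input: in the real tool these snippets are extracted
# from the repository's source files (functions and methods).
functions = [
    {"file": "app/auth.py", "line": 42, "text": "def authenticate_request(req): ..."},
    {"file": "app/jobs.py", "line": 10, "text": "def read_job_from_queue(): ..."},
]

model = SentenceTransformer("krlvi/sentence-t5-base-nlpl-code_search_net")
vectors = model.encode(
    [f["text"] for f in functions], batch_size=32, show_progress_bar=True
)

# Cache vectors together with their file/line metadata so queries can reuse them.
# NOTE: illustrative format only, not the actual `.embeddings` layout.
with gzip.open(".embeddings", "wb") as fh:
    pickle.dump({"functions": functions, "vectors": vectors}, fh)
```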

When a query is being processed, an embedding is generated from the query text. This is then used in a 'nearest neighbor' search to discover functions or methods with similar embeddings. We are basically comparing the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors.
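
A matching sketch of the query side, reusing the illustrative cache format from above: the query is embedded with the same model, and `util.semantic_search` from `sentence-transformers` performs the cosine-similarity nearest-neighbour lookup. Again, this is only an approximation of what the tool does internally.

```python
import gzip
import pickle

from sentence_transformers import SentenceTransformer, util

# Load the illustrative cache written in the previous sketch.
with gzip.open(".embeddings", "rb") as fh:
    cache = pickle.load(fh)

model = SentenceTransformer("krlvi/sentence-t5-base-nlpl-code_search_net")
query_vector = model.encode("parsing command line args")

# Cosine-similarity nearest-neighbour search over the cached function vectors.
hits = util.semantic_search(query_vector, cache["vectors"], top_k=5)[0]
for hit in hits:
    fn = cache["functions"][hit["corpus_id"]]
    print(f"{hit['score']:.2f}  {fn['file']}:{fn['line']}")
```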

### Model

The application uses the [sentence transformer](https://www.sbert.net/) model architecture to produce 'sentence' embeddings for functions and queries. The particular model is [krlvi/sentence-t5-base-nlpl-code_search_net](https://huggingface.co/krlvi/sentence-t5-base-nlpl-code_search_net), which is based on a [SentenceT5-Base](https://github.com/google-research/t5x_retrieval#released-model-checkpoints) checkpoint with 110M parameters and a pooling layer.

It has been further trained on the [code_search_net](https://huggingface.co/datasets/code_search_net) dataset of 'natural language' — 'programming language' pairs with a [MultipleNegativesRanking](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py) loss function.
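
For a rough idea of what such training looks like (a hedged sketch, not the actual training script behind the released checkpoint), the classic `sentence-transformers` `fit` API pairs each natural-language description with its code and applies `MultipleNegativesRankingLoss`, which treats the other pairs in a batch as negatives. The base model name and the example pairs below are assumptions for illustration only.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Assumed starting point: a public SentenceT5-Base port (illustrative choice).
model = SentenceTransformer("sentence-transformers/sentence-t5-base")

# Each training example pairs a natural-language description with code,
# mirroring the code_search_net docstring/function pairs.
train_examples = [
    InputExample(texts=["Parse command line arguments", "def parse_args(argv): ..."]),
    InputExample(texts=["Read a job from the queue", "def read_job(queue): ..."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# In-batch negatives: every other pair in the batch acts as a negative example.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```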

You can experiment with your own sentence transformer models via the `-m` / `--model-name-or-path` parameter.

## Bugs and limitations

- Currently, the `.embeddings` index is not updated when repository files change. As a temporary workaround, `sem --embed` can be re-run occasionally.
- Supported languages: `{ 'python', 'javascript', 'typescript', 'ruby', 'go', 'rust', 'java', 'c', 'c++', 'kotlin' }`
- Supported text editors for opening results in: `{ 'vscode', 'vim' }`

## License

Semantic Code Search is distributed under [AGPL-3.0-only](LICENSE.txt). For Apache-2.0 exceptions — <kiril@codeball.ai>

            
