samecode

Name: samecode
Version: 0.5.1
Home page: https://github.com/aboutcode-org/ai-gen-code-search
Summary: A library to help detect approximately similar code, such as AI-generated code.
Upload time: 2024-11-28 14:51:44
Author: AboutCode.org and others
Requires Python: >=3.8
License: Apache-2.0
Keywords: utilities, open source
=========================================
  SameCode library
=========================================


``Search, detect, and identify AI-generated code and other copied code.``

The AI-Generated Code Search project provides open source tools to find code that may have been
generated using LLMs and GPT tools.

In this project, SameCode is a low-level Python library that exposes features to:

1. Break code content into code fragments
2. Compute fingerprints for approximate matching of these fragments
3. Provide related utilities for Hamming distance computation

These features are fundamental building blocks for the approximate matching of code fragments
and snippets.
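
The three features above can be illustrated end to end with a small, self-contained sketch. It
does not use SameCode itself: the ``fragments``, ``fingerprint`` and ``hamming`` helpers are
hypothetical names, and the fingerprint is a generic simhash-style bit-averaging scheme chosen
for illustration, which may differ from what ``BitAverageHaloHash`` actually computes.

```python
import hashlib


def fragments(code, n=3):
    """Break code into overlapping fragments of n whitespace-separated tokens."""
    tokens = code.split()
    # zip() stops at the shortest shifted list, so short inputs yield no fragments.
    return list(zip(*(tokens[i:] for i in range(n))))


def fingerprint(frags, size_in_bits=64):
    """Return an int fingerprint: each bit is a majority vote over fragment hashes."""
    columns = [0] * size_in_bits
    for frag in frags:
        digest = hashlib.sha256(" ".join(frag).encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(size_in_bits):
            # Vote +1 when the bit is set in this fragment hash, -1 otherwise.
            columns[bit] += 1 if (value >> bit) & 1 else -1
    result = 0
    for bit, total in enumerate(columns):
        if total > 0:
            result |= 1 << bit
    return result


def hamming(fp1, fp2):
    """Return the number of bit positions where two int fingerprints differ."""
    return bin(fp1 ^ fp2).count("1")


original = "def add(a, b):\n    return a + b\n"
modified = "def add(x, b):\n    return x + b\n"
fp_original = fingerprint(fragments(original))
fp_modified = fingerprint(fragments(modified))
print(hamming(fp_original, fp_original))  # identical inputs: 0
print(hamming(fp_original, fp_modified))  # a value between 0 and 64
```

With this kind of scheme, inputs that share many fragments tend to produce fingerprints with a
small Hamming distance, which is what makes approximate matching possible.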

WARNING: this is under heavy development and not yet a finished project!

Note that using this library alone is not straightforward. Consider looking at the design and
reference documentation at https://ai-gen-code-search.readthedocs.io for more details.
It is designed to be used in the context of a larger code matching feature with MatchCode and the
PurlDB: https://github.com/aboutcode-org/purldb


- PyPI: https://pypi.org/project/samecode/
- Homepage: https://github.com/aboutcode-org/ai-gen-code-search
- Documentation: https://ai-gen-code-search.readthedocs.io

Installation
------------

SameCode is a standalone library that does not provide a UI or a command line interface.

To install from PyPI::

  pip install samecode


For development, the preferred setup is to create a development environment with these commands::

    git clone https://github.com/aboutcode-org/ai-gen-code-search
    cd ai-gen-code-search
    make dev # to configure the environment
    make test # to run tests
    make check # to run code checks


Alternatively, a checkout of the https://github.com/aboutcode-org/ai-gen-code-search repo
can also be installed into an environment using pip's ``--editable`` option ::

    git clone https://github.com/aboutcode-org/ai-gen-code-search
    cd ai-gen-code-search
    python -m venv venv
    venv/bin/pip install --editable .

or built into a wheel and a source distribution and then installed::

    venv/bin/pip install build
    venv/bin/pyproject-build --wheel --sdist
    venv/bin/pip install dist/samecode*.whl


Usage
-------

SameCode provides these functions and classes:

In the module ``samecode.chunking``, the main functions are:

- ``ngrams(iterable, ngram_length)``
  Return an iterable of ngrams of length `ngram_length` given an `iterable` of strings.
  Each ngram is a tuple of `ngram_length` items.
  The returned iterable is empty if the input iterable contains fewer than
  `ngram_length` items.

- ``select_ngrams(ngrams, with_pos=False)``
  Return an iterable that is a subset of a sequence of ngrams, selected using the hailstorm
  algorithm. If `with_pos` is True, also include the starting position of each
  ngram in the original sequence.
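
As an illustration of the documented ``ngrams`` behaviour, here is a minimal pure-Python sketch.
The ``sketch_ngrams`` name is hypothetical and this is not the library's implementation; it only
mirrors the description above, including the empty result for inputs that are too short.

```python
from collections import deque


def sketch_ngrams(iterable, ngram_length):
    """Yield tuples of ngram_length consecutive items from iterable."""
    # A bounded deque keeps only the last ngram_length items seen so far.
    window = deque(maxlen=ngram_length)
    for item in iterable:
        window.append(item)
        if len(window) == ngram_length:
            yield tuple(window)


print(list(sketch_ngrams("this is a test".split(), 2)))
# [('this', 'is'), ('is', 'a'), ('a', 'test')]
print(list(sketch_ngrams(["only"], 2)))
# []
```

Because the sketch is a generator, it works on any iterable without loading it in memory first.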

In the module ``samecode.halohash``, the main functions and classes are:

- ``BitAverageHaloHash(msg=None, size_in_bits=128)``
     A bit matrix averaging hash object, with these methods and properties:

     ``digest_size``
         Digest size in bytes.

     ``b64digest(self)``
         Return a base64 "url safe"-encoded string representing this hash.

     ``hexdigest(self)``
         Return the hex-encoded hash value.

     ``digest(self)``
         Return a binary string representing this hash.

     ``distance(self, other)``
         Return the bit Hamming distance between this hash and another hash.

     ``hash(self)``
         Return this hash as a bitarray.

     ``update(self, msg)``
         Append a bytestring or sequence of bytestrings to the hash.

     ``BitAverageHaloHash.combine(hashes)`` (class method)
         Return a BitAverageHaloHash by summing and averaging the columns of the
         BitAverageHaloHashes in `hashes` together, putting the resulting
         columns into a new BitAverageHaloHash and returning it.

- ``bit_to_num(bits)``
     Return an int for a bitarray.
 
- ``bitarray_from_bytes(b)``
     Return a bitarray built from a byte string b.
 
- ``byte_hamming_distance(b1, b2)``
     Return the Hamming distance between the ``b1`` and ``b2`` byte strings.
 
- ``common_chunks(h1, h2, chunk_bytes_length=4)``
     Compute the number of common chunks of byte length ``chunk_bytes_length`` between two
     hashes ``h1`` and ``h2`` using their digest.
 
- ``common_chunks_from_hexdigest(h1, h2, chunk_bytes_length=4)``
     Compute the number of common chunks of byte length ``chunk_bytes_length`` between two
     strings ``h1`` and ``h2``, each representing a BitAverageHaloHash hexdigest value.
 
- ``decode_vector(b64_str)``
     Return a bit array from an encoded string representation.
 
- ``hamming_distance(bv1, bv2)``
     Return the Hamming distance between the ``bv1`` and ``bv2`` bitvectors as the number of
     differing bits across all positions (i.e. the count of bits set to one in an XOR between the
     two bit strings).

     ``bv1`` and ``bv2`` must both be either hash-like Halohash instances (with a ``hash()`` method)
     or bitarray instances (that can be manipulated as-is).
 
- ``slices(s, size)``
     Given a sequence s, return a sequence of non-overlapping slices of ``size``.
     Raise an AssertionError if the sequence length is not a multiple of ``size``.
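
The distance and chunk utilities above can be approximated on plain ``bytes`` objects. The sketch
below uses hypothetical ``sketch_*`` names and makes two assumptions that the text does not
confirm: chunks are compared position by position, and a ``ValueError`` is raised on bad input
where the library documents an ``AssertionError``.

```python
def sketch_byte_hamming_distance(b1, b2):
    """Return the number of differing bits between two equal-length byte strings."""
    if len(b1) != len(b2):
        raise ValueError("byte strings must have the same length")
    # XOR leaves a one bit wherever the two inputs differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(b1, b2))


def sketch_slices(s, size):
    """Return non-overlapping slices of length size from the sequence s."""
    if len(s) % size:
        # Assumption: the library documents an AssertionError for this case.
        raise ValueError("sequence length must be a multiple of size")
    return [s[i:i + size] for i in range(0, len(s), size)]


def sketch_common_chunks(d1, d2, chunk_bytes_length=4):
    """Count chunks that are equal at the same position in two digests."""
    pairs = zip(sketch_slices(d1, chunk_bytes_length),
                sketch_slices(d2, chunk_bytes_length))
    return sum(1 for c1, c2 in pairs if c1 == c2)


print(sketch_byte_hamming_distance(b"\x0f", b"\xf0"))  # 8
print(sketch_slices(b"abcdefgh", 4))  # [b'abcd', b'efgh']
print(sketch_common_chunks(b"abcdefgh", b"abcdXXXX"))  # 1
```

A low Hamming distance or a high number of common chunks both suggest that two fingerprints were
computed from similar content.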


See also code examples in the test suite under /tests.


Tests
--------

Run the tests with::

    pytest -vvs

or with::

    make test


License
-------

SPDX-License-Identifier: Apache-2.0



Acknowledgements, Funding, Support and Sponsoring
--------------------------------------------------------

|europa|
    
|ngisearch|   

Funded by the European Union. Views and opinions expressed are however those of the author(s) only
and do not necessarily reflect those of the European Union or European Commission. Neither the
European Union nor the granting authority can be held responsible for them. Funded within the
framework of the NGI Search project under grant agreement No 101069364.


This project is also supported and sponsored by:

- Generous support and contributions from users like you!
- Microsoft and Microsoft Azure
- AboutCode ASBL


|aboutcode| 


.. |ngisearch| image:: https://www.ngisearch.eu/download/FlamingoThemes/NGISearch2/NGISearch_logo_tag_icon.svg?rev=1.1
    :target: https://www.ngisearch.eu/
    :height: 50
    :alt: NGI logo


.. |ngi| image:: https://ngi.eu/wp-content/uploads/thegem-logos/logo_8269bc6efcf731d34b6385775d76511d_1x.png
    :target: https://www.ngi.eu/ngi-projects/ngi-search/
    :height: 37
    :alt: NGI logo

.. |europa| image:: etc/eu.funded.png
    :target: http://ec.europa.eu/index_en.htm
    :height: 120
    :alt: Europa logo

.. |aboutcode| image:: https://aboutcode.org/wp-content/uploads/2023/10/AboutCode.svg
    :target: https://aboutcode.org/
    :height: 30
    :alt: AboutCode logo

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/aboutcode-org/ai-gen-code-search",
    "name": "samecode",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "utilities, open source",
    "author": "AboutCode.org and others",
    "author_email": "info@aboutcode.org",
    "download_url": "https://files.pythonhosted.org/packages/e5/72/3ad3c58a9577fac0d76692e0f6ca093f76ec207a2c3e18450d7230e4be40/samecode-0.5.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A library to help detect approximately code such as AI-generated code.",
    "version": "0.5.1",
    "project_urls": {
        "Homepage": "https://github.com/aboutcode-org/ai-gen-code-search"
    },
    "split_keywords": [
        "utilities",
        " open source"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "02436fc48ce14503d661d64ae029cd8f73a11b67f007e780bdff986cbca40502",
                "md5": "7e308b66623023b922b735038f4f5280",
                "sha256": "e5d5c8f3b671644170b1dace8e8e61d704b7f98c6723780937e4d93d8ee128b4"
            },
            "downloads": -1,
            "filename": "samecode-0.5.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7e308b66623023b922b735038f4f5280",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 19541,
            "upload_time": "2024-11-28T14:51:42",
            "upload_time_iso_8601": "2024-11-28T14:51:42.561651Z",
            "url": "https://files.pythonhosted.org/packages/02/43/6fc48ce14503d661d64ae029cd8f73a11b67f007e780bdff986cbca40502/samecode-0.5.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e5723ad3c58a9577fac0d76692e0f6ca093f76ec207a2c3e18450d7230e4be40",
                "md5": "3c8113e30a79ed4623a5ebd6d9cc29b6",
                "sha256": "6a0b76cf510b95abb3da0759a181e9b06c9576cdbe76f61f9059401bb8afb8b3"
            },
            "downloads": -1,
            "filename": "samecode-0.5.1.tar.gz",
            "has_sig": false,
            "md5_digest": "3c8113e30a79ed4623a5ebd6d9cc29b6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 424482,
            "upload_time": "2024-11-28T14:51:44",
            "upload_time_iso_8601": "2024-11-28T14:51:44.729011Z",
            "url": "https://files.pythonhosted.org/packages/e5/72/3ad3c58a9577fac0d76692e0f6ca093f76ec207a2c3e18450d7230e4be40/samecode-0.5.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-28 14:51:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "aboutcode-org",
    "github_project": "ai-gen-code-search",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "samecode"
}
        