# Timescale Vector

- Package: timescale-vector
- Version: 0.0.7
- Home page: https://github.com/timescale/python-vector
- Summary: Python library for storing vector data in Postgres
- Upload time: 2024-08-26 18:24:02
- Author: Matvey Arye
- Requires Python: >=3.7
- License: Apache Software License 2.0
- Keywords: nbdev, jupyter, notebook, python

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

PostgreSQL++ for AI Applications.

- [Sign up for Timescale
  Vector](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=github&utm_medium=direct):
  Get 90 days free to try Timescale Vector on the Timescale cloud data
  platform. There is no self-managed version at this time.
- [Documentation](https://timescale.github.io/python-vector/): Learn the
  key features of Timescale Vector and how to use them.
- [Getting Started
  Tutorial](https://timescale.github.io/python-vector/tsv_python_getting_started_tutorial.html):
  Learn how to use Timescale Vector for semantic search on a real-world
  dataset.
- [Learn
  more](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/?utm_campaign=vectorlaunch&utm_source=github&utm_medium=direct):
  Learn more about Timescale Vector, how it works and why we built it.

If you prefer to use an LLM development or data framework, see Timescale
Vector’s integrations with
[LangChain](https://python.langchain.com/docs/integrations/vectorstores/timescalevector)
and
[LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/Timescalevector.html).

## Install

To install the main library use:

``` sh
pip install timescale_vector
```

We also use `dotenv` in our examples for passing around secrets and
keys. You can install that with:

``` sh
pip install python-dotenv
```

If you run into installation errors related to the psycopg2 package, you
will need to install some prerequisites. The timescale-vector package
explicitly depends on psycopg2 (the non-binary version). This adheres to
[the advice provided by
psycopg2](https://www.psycopg.org/docs/install.html#psycopg-vs-psycopg-binary).
Building psycopg from source [requires a few prerequisites to be
installed](https://www.psycopg.org/docs/install.html#build-prerequisites).
Make sure these are installed before trying to
`pip install timescale_vector`.

## Basic usage

First, import all the necessary libraries:

``` python
from dotenv import load_dotenv, find_dotenv
import os
from timescale_vector import client
import uuid
from datetime import datetime, timedelta
```

Load up your PostgreSQL credentials. The safest way is with a `.env` file:

``` python
_ = load_dotenv(find_dotenv(), override=True)
service_url = os.environ['TIMESCALE_SERVICE_URL']
```
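
A missing or malformed `TIMESCALE_SERVICE_URL` would otherwise only surface later as a connection error. The sketch below is one possible way to fail early. The helper name `read_service_url` is hypothetical, and the scheme check assumes the service URL is a standard PostgreSQL connection URI; neither is part of the library.

```python
import os
from urllib.parse import urlparse


def read_service_url(env=os.environ, key="TIMESCALE_SERVICE_URL"):
    """Return the connection string stored under `key`, or raise a clear error."""
    url = env.get(key)
    if not url:
        raise RuntimeError(f"{key} is not set; add it to your .env file")
    # Assumption: the service URL is a PostgreSQL connection URI.
    scheme = urlparse(url).scheme
    if scheme not in ("postgres", "postgresql"):
        raise RuntimeError(f"{key} should start with postgres:// or postgresql://")
    return url


# Example with a placeholder value (not a real service):
example_env = {"TIMESCALE_SERVICE_URL": "postgres://user:password@host:5432/dbname"}
print(read_service_url(example_env))
```

After `load_dotenv(...)` has run, calling `read_service_url()` with no arguments reads from the process environment.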

Next, create the client. This tutorial uses the sync client, but an
async client with an identical interface (using async functions) is
also available.

The client constructor takes three required arguments:

| name           | description                                                                               |
|----------------|-------------------------------------------------------------------------------------------|
| service_url    | Timescale service URL / connection string                                                 |
| table_name     | Name of the table to use for storing the embeddings. Think of this as the collection name |
| num_dimensions | Number of dimensions in the vector                                                        |

You can also specify the schema name, distance type, primary key type,
etc. as optional parameters. Please see the documentation for details.

``` python
vec = client.Sync(service_url, "my_data", 2)
```

Next, create the tables for the collection:

``` python
vec.create_tables()
```

Next, insert some data. The data record contains:

- A UUID to uniquely identify the embedding
- A JSON blob of metadata about the embedding
- The text the embedding represents
- The embedding itself

Because this data includes UUIDs which become primary keys, we ingest
with upserts.

``` python
vec.upsert([
    (uuid.uuid1(), {"animal": "fox"}, "the brown fox", [1.0, 1.3]),
    (uuid.uuid1(), {"animal": "fox", "action": "jump"}, "jumped over the", [1.0, 10.8]),
])
```

You can now create a vector index to speed up similarity search:

``` python
vec.create_embedding_index(client.DiskAnnIndex())
```

Now, you can query for similar items:

``` python
vec.search([1.0, 9.0])
```

    [[UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),
      {'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('4494c12c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

There are many search options, which we cover below in the
`Search options` section.

As one example, we will return one item using a similarity search
constrained by a metadata filter.

``` python
vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
```

    [[UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),
      {'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

The returned records contain 5 fields:

| name      | description                                             |
|-----------|---------------------------------------------------------|
| id        | The UUID of the record                                  |
| metadata  | The JSON metadata associated with the record            |
| contents  | The text content that was embedded                      |
| embedding | The vector embedding                                    |
| distance  | The distance between the query embedding and the vector |

You can access the fields by simply using the record as a dictionary
keyed on the field name:

``` python
records = vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
(records[0]["id"],records[0]["metadata"], records[0]["contents"], records[0]["embedding"], records[0]["distance"])
```

    (UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),
     {'action': 'jump', 'animal': 'fox'},
     'jumped over the',
     array([ 1. , 10.8], dtype=float32),
     0.00016793422934946456)

You can delete by ID:

``` python
vec.delete_by_ids([records[0]["id"]])
```

Or you can delete by metadata filters:

``` python
vec.delete_by_metadata({"action": "jump"})
```

To delete all records use:

``` python
vec.delete_all()
```

## Advanced usage

In this section, we will go into more detail about the library's
features. We will cover:

1.  Search filter options - how to narrow your search by additional
    constraints
2.  Indexing - how to speed up your similarity queries
3.  Time-based partitioning - how to optimize similarity queries that
    filter on time
4.  Setting different distance types to use in distance calculations

### Search options

The `search` function is very versatile and allows you to search for the
right vector in a wide variety of ways. We’ll describe the search
options in three parts:

1.  We’ll cover basic similarity search.
2.  Then, we’ll describe how to filter your search based on the
    associated metadata.
3.  Finally, we’ll talk about filtering on time when time-partitioning
    is enabled.

Let’s use the following data for our example:

``` python
vec.upsert([
    (uuid.uuid1(), {"animal": "fox", "action": "sit", "times": 1}, "the brown fox", [1.0, 1.3]),
    (uuid.uuid1(), {"animal": "fox", "action": "jump", "times": 100}, "jumped over the", [1.0, 10.8]),
])
```

The basic query looks like:

``` python
vec.search([1.0, 9.0])
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

You could provide a limit for the number of items returned:

``` python
vec.search([1.0, 9.0], limit=1)
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

#### Narrowing your search by metadata

We have two main ways to filter results by metadata:

- `filter` for equality matches on metadata.
- `predicates` for complex conditions on metadata.

Filters are more likely to be performant but are more limited in what
they can express, so we suggest using those if your use case allows it.

##### Filters

You could specify a match on the metadata as a dictionary where all keys
have to match the provided values (keys not in the filter are
unconstrained):

``` python
vec.search([1.0, 9.0], limit=1, filter={"action": "sit"})
```

    [[UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

You can also specify a list of filter dictionaries, where an item is
returned if it matches any dict:

``` python
vec.search([1.0, 9.0], limit=2, filter=[{"action": "jump"}, {"animal": "fox"}])
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]
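
The matching rules for filters can be sketched in plain Python. This is only an illustration of the semantics described above (a dict must match on every key, a list matches if any dict matches); the function name `matches_filter` is hypothetical, and the library itself evaluates filters inside PostgreSQL rather than in Python.

```python
def matches_filter(metadata, flt):
    """Return True if `metadata` satisfies `flt`.

    A dict matches when every key/value pair in it equals the metadata;
    a list of dicts matches when at least one of the dicts matches.
    """
    if isinstance(flt, list):
        return any(matches_filter(metadata, f) for f in flt)
    return all(key in metadata and metadata[key] == value
               for key, value in flt.items())


rows = [
    {"animal": "fox", "action": "sit", "times": 1},
    {"animal": "fox", "action": "jump", "times": 100},
]

sit_only = [r for r in rows if matches_filter(r, {"action": "sit"})]
either = [r for r in rows if matches_filter(r, [{"action": "jump"}, {"animal": "fox"}])]
print(len(sit_only), len(either))  # 1 2
```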

##### Predicates

Predicates allow for more complex search conditions. For example, you
could use greater than and less than conditions on numeric values.

``` python
vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("times", ">", 1))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

[`Predicates`](https://timescale.github.io/python-vector/vector.html#predicates)
objects are defined by the name of the metadata key, an operator, and a
value.

The supported operators are: `==`, `!=`, `<`, `<=`, `>`, `>=`

The type of the value determines the type of comparison to perform. For
example, passing in `"Sam"` (a string) performs a string comparison,
`10` (an int) performs an integer comparison, and `10.0` (a float)
performs a float comparison. Note that a value of `"10"` is a string and
therefore also triggers a string comparison, so it is important to use
the right type. Supported Python types are: `str`, `int`, and `float`.
One more example, with a string comparison:

``` python
vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump"))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

The real power of predicates is that they can also be combined using the
`&` operator (for combining predicates with AND semantics) and `|` (for
combining with OR semantics). So you can do:

``` python
vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump") & client.Predicates("times", ">", 1))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

As a sanity check, let’s show a case where no results are returned
because no record satisfies both predicates:

``` python
vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump") & client.Predicates("times", "==", 1))
```

    []

And one more example where we define the predicates as a variable and
use grouping with parentheses:

``` python
my_predicates = client.Predicates("action", "==", "jump") & (client.Predicates("times", "==", 1) | client.Predicates("times", ">", 1))
vec.search([1.0, 9.0], limit=2, predicates=my_predicates)
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

We also have some syntactic sugar for combining many predicates with AND
semantics. You can pass in multiple 3-tuples to
[`Predicates`](https://timescale.github.io/python-vector/vector.html#predicates):

``` python
vec.search([1.0, 9.0], limit=2, predicates=client.Predicates(("action", "==", "jump"), ("times", ">", 10)))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]
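
To make the combination rules and the type pitfall concrete, here is a small pure-Python model of predicates. The class `Pred` is a hypothetical stand-in, not the library's `Predicates` class, and it assumes that a string value causes the stored field to be compared as text, which is how we read the description above.

```python
import operator

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


class Pred:
    """Hypothetical model of a metadata predicate (not the library class)."""

    def __init__(self, test):
        self.test = test  # callable: metadata dict -> bool

    @classmethod
    def on(cls, key, op, value):
        compare = OPS[op]

        def test(metadata):
            if key not in metadata:
                return False
            field = metadata[key]
            if isinstance(value, str):
                # Assumption: a string value compares the field as text.
                return compare(str(field), value)
            if not isinstance(field, (int, float)):
                return False
            return compare(field, value)

        return cls(test)

    def __and__(self, other):
        return Pred(lambda m: self.test(m) and other.test(m))

    def __or__(self, other):
        return Pred(lambda m: self.test(m) or other.test(m))


rows = [
    {"animal": "fox", "action": "sit", "times": 1},
    {"animal": "fox", "action": "jump", "times": 100},
]
combined = Pred.on("action", "==", "jump") & (
    Pred.on("times", "==", 1) | Pred.on("times", ">", 1))
print([r["action"] for r in rows if combined.test(r)])  # ['jump']

# Numeric vs. string comparison of the same stored value:
print(Pred.on("times", ">", 9).test({"times": 100}))    # True  (100 > 9)
print(Pred.on("times", ">", "9").test({"times": 100}))  # False ("100" > "9" as text)
```

The last two lines show why passing `"9"` instead of `9` can silently change the result.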

#### Filter your search by time

When using time partitioning (see below), you can filter your search by
time very efficiently. With time partitioning, a timestamp is embedded
in the UUID-based ID associated with each embedding. Let us first
create a collection with time partitioning and insert some data (one
item from January 2018 and another from January 2019):

``` python
tpvec = client.Sync(service_url, "time_partitioned_table", 2, time_partition_interval=timedelta(hours=6))
tpvec.create_tables()

specific_datetime = datetime(2018, 1, 1, 12, 0, 0)
tpvec.upsert([
    (client.uuid_from_time(specific_datetime), {"animal": "fox", "action": "sit", "times": 1}, "the brown fox", [1.0, 1.3]),
    (client.uuid_from_time(specific_datetime + timedelta(days=365)), {"animal": "fox", "action": "jump", "times": 100}, "jumped over the", [1.0, 10.8]),
])
```

Then, you can filter using the timestamps by specifying a
`uuid_time_filter`:

``` python
tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime, specific_datetime+timedelta(days=1)))
```

    [[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

A
[`UUIDTimeRange`](https://timescale.github.io/python-vector/vector.html#uuidtimerange)
can specify a `start_date`, an `end_date`, or both (as in the example
above). Specifying only the `start_date` or the `end_date` leaves the
other end unconstrained.

``` python
tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime))
```

    [[UUID('ac8be800-0de6-11e9-a5fd-5a100e653c25'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

You have the option to define the inclusivity of the start and end dates
with the `start_inclusive` and `end_inclusive` parameters. Setting
`start_inclusive` to true results in comparisons using the `>=`
operator, whereas setting it to false applies the `>` operator. By
default, the start date is inclusive, while the end date is exclusive.
One example:

``` python
tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime, start_inclusive=False))
```

    [[UUID('ac8be800-0de6-11e9-a5fd-5a100e653c25'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456]]

Notice how the results are different when we use the
`start_inclusive=False` option because the first row has the exact
timestamp specified by `start_date`.
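
The boundary rules can be modeled in a few lines of plain Python. This is a sketch of the semantics described above, not the library's implementation; the function name `in_time_range` is hypothetical, and its keyword arguments simply mirror the `UUIDTimeRange` parameters.

```python
from datetime import datetime, timedelta


def in_time_range(ts, start_date=None, end_date=None,
                  start_inclusive=True, end_inclusive=False):
    """Return True if `ts` is inside the range; a missing bound is unconstrained."""
    if start_date is not None:
        if ts < start_date or (ts == start_date and not start_inclusive):
            return False
    if end_date is not None:
        if ts > end_date or (ts == end_date and not end_inclusive):
            return False
    return True


specific_datetime = datetime(2018, 1, 1, 12, 0, 0)
later = specific_datetime + timedelta(days=365)

# The row stamped exactly at start_date is kept by default ...
print(in_time_range(specific_datetime, start_date=specific_datetime))  # True
# ... and dropped when the start bound is exclusive.
print(in_time_range(specific_datetime, start_date=specific_datetime,
                    start_inclusive=False))  # False
print(in_time_range(later, start_date=specific_datetime,
                    start_inclusive=False))  # True
```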

We’ve also made it easy to integrate time filters using the `filter` and
`predicates` parameters described above using special reserved key names
to make it appear that the timestamps are part of your metadata. We
found this useful when integrating with other systems that just want to
specify a set of filters (often these are “auto retriever” type
systems). The reserved key names are `__start_date` and `__end_date` for
filters and `__uuid_timestamp` for predicates. Some examples below:

``` python
tpvec.search([1.0, 9.0], limit=4, filter={ "__start_date": specific_datetime, "__end_date": specific_datetime+timedelta(days=1)})
```

    [[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

``` python
tpvec.search([1.0, 9.0], limit=4, 
             predicates=client.Predicates("__uuid_timestamp", ">=", specific_datetime) & client.Predicates("__uuid_timestamp", "<", specific_datetime+timedelta(days=1)))
```

    [[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

### Indexing

Indexing speeds up queries over your data. By default, we set up indexes
to query your data by the UUID and the metadata.

But to speed up similarity search based on the embeddings, you have to
create additional indexes.

Note that a query without an index always returns an exact result, but
it is slow because it has to read all of the data you store for every
query. With an index, your queries will be orders of magnitude faster,
but the results are approximate (because
there are no known indexing techniques that are exact).
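
To see why an unindexed query is exact but slow, consider what an exhaustive search has to do: compute the distance from the query to every stored embedding and then sort. The sketch below models that in plain Python with cosine distance; the function names are hypothetical and the real work happens inside PostgreSQL, not in client code.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_search(query, rows, limit=10):
    """Score every row against the query, then keep the `limit` closest."""
    scored = [(cosine_distance(query, emb), text) for text, emb in rows]
    return sorted(scored)[:limit]


rows = [
    ("the brown fox", [1.0, 1.3]),
    ("jumped over the", [1.0, 10.8]),
]
results = exact_search([1.0, 9.0], rows, limit=1)
print(results[0][1])  # jumped over the
```

The cost grows linearly with the number of rows, which is the work an approximate index avoids by visiting only a small part of the data.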

Nevertheless, there are excellent approximate algorithms. There are 3
different indexing algorithms available on the Timescale platform:
StreamingDiskANN (the Timescale Vector index), pgvector HNSW, and
pgvector ivfflat. Below are the trade-offs between these algorithms:

| Algorithm        | Build speed | Query speed | Need to rebuild after updates |
|------------------|-------------|-------------|-------------------------------|
| StreamingDiskANN | Fast        | Fastest     | No                            |
| pgvector hnsw    | Slowest     | Faster      | No                            |
| pgvector ivfflat | Fastest     | Slowest     | Yes                           |

You can see
[benchmarks](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/)
on our blog.

We recommend using the Timescale Vector index for most use cases. This
can be created with:

``` python
vec.create_embedding_index(client.DiskAnnIndex())
```

Indexes are created for a particular distance metric type, so it is
important that the same distance metric is set on the client during
index creation as during queries. See the `Distance metrics` section
below.

Each of these indexes has a set of build-time options for controlling
the speed/accuracy trade-off when creating the index and an additional
query-time option for controlling accuracy during a particular query. We
have smart defaults for all of these options but will also describe the
details below so that you can adjust these options manually.

#### StreamingDiskANN index

The StreamingDiskANN index from pgvectorscale is a graph-based algorithm
that uses the [DiskANN](https://github.com/microsoft/DiskANN) algorithm.
You can read more about it on our
[blog](https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/)
announcing its release.

To create this index, run:

``` python
vec.create_embedding_index(client.DiskAnnIndex())
```

The above command will create the index using smart defaults. There are
a number of parameters you could tune to adjust the accuracy/speed
trade-off.

The parameters you can set at index build time are:

| Parameter name           | Description                                                                                                                                                                                      | Default value                               |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------|
| `storage_layout`         | `memory_optimized` which uses SBQ to compress vector data or `plain` which stores data uncompressed                                                                                              | memory_optimized                            |
| `num_neighbors`          | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower.                                                                              | 50                                          |
| `search_list_size`       | This is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds.                                    | 100                                         |
| `max_alpha`              | Is the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds.                                                                                 | 1.2                                         |
| `num_dimensions`         | The number of dimensions to index. By default, all dimensions are indexed. You can also index fewer dimensions to make use of [Matryoshka embeddings](https://huggingface.co/blog/matryoshka)    | 0 (all dimensions)                          |
| `num_bits_per_dimension` | Number of bits used to encode each dimension when using SBQ                                                                                                                                      | 2 for less than 900 dimensions, 1 otherwise |

To set these parameters, you could run:

``` python
vec.create_embedding_index(client.DiskAnnIndex(num_neighbors=50, search_list_size=100, max_alpha=1.0, storage_layout="memory_optimized", num_dimensions=0, num_bits_per_dimension=1))
```

You can also set parameters to control the accuracy vs. query speed
trade-off at query time. The parameters are set in the `search()`
function using the `query_params` argument.

| Parameter name     | Description                                                             | Default value |
|--------------------|-------------------------------------------------------------------------|---------------|
| `search_list_size` | The number of additional candidates considered during the graph search. | 100           |
| `rescore`          | The number of elements rescored (0 to disable rescoring)                | 50            |

We suggest using the `rescore` parameter to fine-tune accuracy.

``` python
vec.search([1.0, 9.0], limit=4, query_params=client.DiskAnnIndexParams(rescore=400, search_list_size=10))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

To drop the index, run:

``` python
vec.drop_embedding_index()
```

#### pgvector HNSW index

Pgvector provides a graph-based indexing algorithm based on the popular
[HNSW algorithm](https://arxiv.org/abs/1603.09320).

To create this index, run:

``` python
vec.create_embedding_index(client.HNSWIndex())
```

The above command will create the index using smart defaults. There are
a number of parameters you could tune to adjust the accuracy/speed
trade-off.

The parameters you can set at index build time are:

| Parameter name  | Description                                                                                                                                                                                                                                                            | Default value |
|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| m               | Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size.                                             | 16            |
| ef_construction | Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing ef_construction enables more accurate search results at the expense of lengthier index build times. | 64            |

To set these parameters, you could run:

``` python
vec.create_embedding_index(client.HNSWIndex(m=16, ef_construction=64))
```

You can also set a parameter to control the accuracy vs. query speed
trade-off at query time. The parameter is set in the `search()` function
using the `query_params` argument. You can set `ef_search` (default:
40). This parameter specifies the size of the dynamic candidate list
used during search. Higher values improve query accuracy while making
the query slower.

You can specify this value during search as follows:

``` python
vec.search([1.0, 9.0], limit=4, query_params=client.HNSWIndexParams(ef_search=10))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

To drop the index, run:

``` python
vec.drop_embedding_index()
```

#### pgvector ivfflat index

Pgvector provides a clustering-based indexing algorithm. Our [blog
post](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/)
describes how it works in detail. It provides the fastest index-build
speed but the slowest query speeds of any indexing algorithm.

To create this index, run:

``` python
vec.create_embedding_index(client.IvfflatIndex())
```

Note: *ivfflat should never be created on empty tables* because it needs
to cluster data, and that only happens when an index is first created,
not when new rows are inserted or modified. Also, if your table
undergoes a lot of modifications, you will need to rebuild this index
occasionally to maintain good accuracy. See our [blog
post](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/)
for details.

Pgvector ivfflat has a `lists` index parameter that is automatically set
with a smart default based on the number of rows in your table. If you
know that you’ll have a different table size, you can specify the number
of records to use for calculating the `lists` parameter as follows:

``` python
vec.create_embedding_index(client.IvfflatIndex(num_records=1000000))
```

You can also set the `lists` parameter directly:

``` python
vec.create_embedding_index(client.IvfflatIndex(num_lists=100))
```

You can also set a parameter to control the accuracy vs. query speed
trade-off at query time. The parameter is set in the `search()` function
using the `query_params` argument. You can set `probes`, which
specifies the number of clusters searched during a query. It is
recommended to set this parameter to `sqrt(lists)`, where `lists` is the
`num_lists` parameter used above during index creation. Higher values
improve query accuracy while making the query slower.
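
As a quick illustration of that rule of thumb, the helper below derives a `probes` value from the number of lists. The name `suggested_probes` is hypothetical, and rounding to the nearest integer (with a minimum of 1) is our own choice rather than something the library prescribes.

```python
import math


def suggested_probes(num_lists):
    """Rule of thumb: probes = sqrt(lists), rounded, and at least 1."""
    return max(1, round(math.sqrt(num_lists)))


print(suggested_probes(100))   # 10
print(suggested_probes(1000))  # 32
```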

You can specify this value during search as follows:

``` python
vec.search([1.0, 9.0], limit=4, query_params=client.IvfflatIndexParams(probes=10))
```

    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 100, 'action': 'jump', 'animal': 'fox'},
      'jumped over the',
      array([ 1. , 10.8], dtype=float32),
      0.00016793422934946456],
     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),
      {'times': 1, 'action': 'sit', 'animal': 'fox'},
      'the brown fox',
      array([1. , 1.3], dtype=float32),
      0.14489260377438218]]

To drop the index, run:

``` python
vec.drop_embedding_index()
```

### Time partitioning

In many use cases where you have many embeddings, time is an important
component associated with the embeddings. For example, when embedding
news stories, you often search by time as well as similarity (e.g.,
stories related to Bitcoin in the past week or stories about Clinton in
November 2016).

Yet, traditionally, searching by two components, “similarity” and
“time”, is challenging for Approximate Nearest Neighbor (ANN) indexes
and makes the similarity-search index less effective.

One approach to solving this is partitioning the data by time and
creating ANN indexes on each partition individually. Then, during
search, you can:

- Step 1: filter out partitions that don’t match the time predicate.
- Step 2: perform the similarity search on all matching partitions.
- Step 3: combine all the results from each partition in step 2, rerank,
  and filter out results by time.

Step 1 makes the search a lot more efficient by filtering out whole
swaths of data in one go.
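
The three steps can be sketched in plain Python. This is a toy model under stated assumptions: partitions are a dict keyed by a fixed-width time bucket, the per-partition search is exhaustive rather than an ANN index, and all names (`partition_key`, `partitioned_search`) are hypothetical. In the library, the partitioning is handled by TimescaleDB, not by client code.

```python
import math
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # arbitrary origin for bucketing (assumption)


def partition_key(ts, interval):
    """Index of the fixed-width time bucket that contains `ts`."""
    return (ts - EPOCH) // interval


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def partitioned_search(partitions, interval, query, start, end, limit):
    first = partition_key(start, interval)
    last = partition_key(end, interval)
    candidates = []
    for key, rows in partitions.items():
        # Step 1: skip whole partitions that cannot match the time range.
        if not first <= key <= last:
            continue
        # Step 2: similarity search inside the partition (exhaustive here).
        candidates.extend((cosine_distance(query, emb), ts, text)
                          for ts, text, emb in rows)
    # Step 3: combine, drop rows outside the exact time range, and rerank.
    in_range = [c for c in candidates if start <= c[1] < end]
    return sorted(in_range)[:limit]


interval = timedelta(hours=6)
t0 = datetime(2018, 1, 1, 12, 0, 0)
rows = [
    (t0, "the brown fox", [1.0, 1.3]),
    (t0 + timedelta(days=365), "jumped over the", [1.0, 10.8]),
]
partitions = {}
for row in rows:
    partitions.setdefault(partition_key(row[0], interval), []).append(row)

hits = partitioned_search(partitions, interval, [1.0, 9.0],
                          t0, t0 + timedelta(days=1), limit=4)
print([text for _, _, text in hits])  # ['the brown fox']
```

The second row lives in a partition a year away, so step 1 discards it without computing any distance.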

Timescale-vector supports time partitioning using TimescaleDB’s
hypertables. To use this feature, simply indicate the length of time for
each partition when creating the client:

``` python
from datetime import timedelta
from datetime import datetime
```

``` python
vec = client.Async(service_url, "my_data_with_time_partition", 2, time_partition_interval=timedelta(hours=6))
await vec.create_tables()
```

Then, insert data whose IDs are version 1 UUIDs, where the time
component of the UUID specifies the time of the embedding. For example,
to create an embedding for the current time, simply do:

``` python
id = uuid.uuid1()
await vec.upsert([(id, {"key": "val"}, "the brown fox", [1.0, 1.2])])
```
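
To see the time component for yourself, you can decode it with the standard library alone. The sketch below relies on the version 1 UUID layout, whose timestamp counts 100-nanosecond intervals since 1582-10-15; the helper name `uuid1_to_datetime` is hypothetical and not part of the library.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100 ns intervals since 1582-10-15 (UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_to_datetime(u):
    """Return the UTC time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("expected a version 1 UUID")
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


new_id = uuid.uuid1()
print(uuid1_to_datetime(new_id))  # close to the current UTC time

# The two IDs from the time-filter search output earlier in this document:
a = uuid.UUID("33c52800-ef15-11e7-8a12-ea51d07b6447")
b = uuid.UUID("ac8be800-0de6-11e9-a5fd-5a100e653c25")
print(uuid1_to_datetime(b) - uuid1_to_datetime(a))  # 365 days, 0:00:00
```

The 365-day gap matches the `timedelta(days=365)` used when those two rows were inserted.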

To insert data for a specific time in the past, create the UUID using
our
[`uuid_from_time`](https://timescale.github.io/python-vector/vector.html#uuid_from_time)
function:

``` python
specific_datetime = datetime(2018, 8, 10, 15, 30, 0)
await vec.upsert([(client.uuid_from_time(specific_datetime), {"key": "val"}, "the brown fox", [1.0, 1.2])])
```

You can then query the data by specifying a `uuid_time_filter` in the
search call:

``` python
rec = await vec.search([1.0, 2.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime-timedelta(days=7), specific_datetime+timedelta(days=7)))
```
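
The snippets in this section use top-level `await`, which works in a notebook but not in a plain Python script. A minimal sketch of how the same calls could be wrapped for a script is shown below. It uses only the client calls shown above; the function name `main` and the decision to import the package inside the function are our own choices.

```python
import asyncio
import uuid
from datetime import timedelta


async def main(service_url):
    # Imported inside the function so this file can be loaded even where
    # the package is not installed (a choice made for this sketch only).
    from timescale_vector import client

    vec = client.Async(service_url, "my_data_with_time_partition", 2,
                       time_partition_interval=timedelta(hours=6))
    await vec.create_tables()
    await vec.upsert([(uuid.uuid1(), {"key": "val"}, "the brown fox", [1.0, 1.2])])
    return await vec.search([1.0, 2.0], limit=4)


# In a script, run the coroutine with:
#     records = asyncio.run(main(service_url))
```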

### Distance metrics

By default, we use cosine distance to measure how similar an embedding
is to a given query. In addition to cosine distance, we also support
Euclidean/L2 distance. The distance type is set when creating the client
using the `distance_type` parameter. For example, to use the Euclidean
distance metric, you can create the client with:

``` python
vec = client.Sync(service_url, "my_data", 2, distance_type="euclidean")
```

Valid values for `distance_type` are `cosine` and `euclidean`.

It is important to note that you should use consistent distance types on
clients that create indexes and perform queries. That is because an
index is only valid for one particular type of distance measure.

Please note the Timescale Vector index only supports cosine distance at
this time.
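To get a feel for how the two metrics differ, the sketch below computes
both distances in plain Python for the small two-dimensional vectors used
in this README's examples. It only illustrates the math and assumes the
textbook definitions (cosine distance as one minus cosine similarity,
Euclidean distance as the L2 norm of the difference); the function names
are hypothetical, and the database performs the real computation.

``` python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance: also sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 9.0]
candidates = {"the brown fox": [1.0, 1.3], "jumped over the": [1.0, 10.8]}
for text, embedding in candidates.items():
    print(text, cosine_distance(query, embedding), euclidean_distance(query, embedding))
```

The two metrics produce values on different scales, which is one reason
an index built for one distance type cannot serve queries that use the
other.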

# LangChain integration

[LangChain](https://www.langchain.com/) is a popular framework for
developing applications powered by LLMs. Timescale Vector has a native
LangChain integration, enabling you to use Timescale Vector as a
vectorstore and leverage all its capabilities in your applications built
with LangChain.

Here are resources about using Timescale Vector with LangChain:

- [Getting started with LangChain and Timescale
  Vector](https://python.langchain.com/docs/integrations/vectorstores/timescalevector):
  You’ll learn how to use Timescale Vector for (1) semantic search, (2)
  time-based vector search, (3) self-querying, and (4) how to create
  indexes to speed up queries.
- [PostgreSQL Self
  Querying](https://python.langchain.com/docs/integrations/retrievers/self_query/timescalevector_self_query):
  Learn how to use Timescale Vector with self-querying in LangChain.
- [LangChain template: RAG with conversational
  retrieval](https://github.com/langchain-ai/langchain/tree/master/templates/rag-timescale-conversation):
  This template is used for conversational retrieval, which is one of
  the most popular LLM use-cases. It passes both a conversation history
  and retrieved documents into an LLM for synthesis.
- [LangChain template: RAG with time-based search and self-query
  retrieval](https://github.com/langchain-ai/langchain/tree/master/templates/rag-timescale-hybrid-search-time):
  This template shows how to use timescale-vector with the self-query
  retriever to perform hybrid search on similarity and time. This is
  useful any time your data has a strong time-based component.
- [Learn more about Timescale Vector and
  LangChain](https://blog.langchain.dev/timescale-vector-x-langchain-making-postgresql-a-better-vector-database-for-ai-applications/)

# LlamaIndex integration

LlamaIndex is a popular data framework for connecting custom data
sources to large language models (LLMs). Timescale Vector has a native
LlamaIndex integration, enabling you to use Timescale Vector as a
vectorstore and leverage all its capabilities in your applications built
with LlamaIndex.

Here are resources about using Timescale Vector with LlamaIndex:

- [Getting started with LlamaIndex and Timescale
  Vector](https://docs.llamaindex.ai/en/stable/examples/vector_stores/Timescalevector.html):
  You’ll learn how to use Timescale Vector for (1) similarity
  search, (2) time-based vector search, (3) faster search with indexes,
  and (4) retrieval and query engine.
- [Time-based
  retrieval](https://youtu.be/EYMZVfKcRzM?si=I0H3uUPgzKbQw__W): Learn
  how to power RAG applications with time-based retrieval.
- [Llama Pack: Auto Retrieval with time-based
  search](https://github.com/run-llama/llama-hub/tree/main/llama_hub/llama_packs/timescale_vector_autoretrieval):
  This pack demonstrates performing auto-retrieval for hybrid search
  based on both similarity and time, using the timescale-vector
  (PostgreSQL) vectorstore.  
- [Learn more about Timescale Vector and
  LlamaIndex](https://www.timescale.com/blog/timescale-vector-x-llamaindex-making-postgresql-a-better-vector-database-for-ai-applications/)

# PgVectorize

PgVectorize enables you to create vector embeddings from any data that
you already have stored in PostgreSQL. You can get more background
information in our [blog
post](https://www.timescale.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/)
announcing this feature, as well as a [“how we built
it”](https://www.timescale.com/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/)
post going into the details of the design.

To create vector embeddings, simply attach PgVectorize to any PostgreSQL
table, and it will automatically sync that table’s data with a set of
embeddings stored in Timescale Vector. For example, let’s say you have a
blog table defined in the following way:

``` python
import psycopg2
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from timescale_vector import client, pgvectorizer
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.timescalevector import TimescaleVector
from datetime import timedelta
```

``` python
with psycopg2.connect(service_url) as conn:
    with conn.cursor() as cursor:
        cursor.execute('''
        CREATE TABLE IF NOT EXISTS blog (
            id              SERIAL PRIMARY KEY NOT NULL,
            title           TEXT NOT NULL,
            author          TEXT NOT NULL,
            contents        TEXT NOT NULL,
            category        TEXT NOT NULL,
            published_time  TIMESTAMPTZ NULL --NULL if not yet published
        );
        ''')
```

You can insert some data as follows:

``` python
with psycopg2.connect(service_url) as conn:
    with conn.cursor() as cursor:
        cursor.execute('''
            INSERT INTO blog (title, author, contents, category, published_time) VALUES ('First Post', 'Matvey Arye', 'some super interesting content about cats.', 'AI', '2021-01-01');
        ''')
```

Now, say you want to embed these blogs in Timescale Vector. First, you
need to define an `embed_and_write` function that takes a set of blog
posts, creates the embeddings, and writes them into TimescaleVector. For
example, if using LangChain, it could look something like the following.

``` python
def get_document(blog):
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    docs = []
    for chunk in text_splitter.split_text(blog['contents']):
        content = f"Author {blog['author']}, title: {blog['title']}, contents:{chunk}"
        metadata = {
            "id": str(client.uuid_from_time(blog['published_time'])),
            "blog_id": blog['id'], 
            "author": blog['author'], 
            "category": blog['category'],
            "published_time": blog['published_time'].isoformat(),
        }
        docs.append(Document(page_content=content, metadata=metadata))
    return docs

def embed_and_write(blog_instances, vectorizer):
    embedding = OpenAIEmbeddings()
    vector_store = TimescaleVector(
        collection_name="blog_embedding",
        service_url=service_url,
        embedding=embedding,
        time_partition_interval=timedelta(days=30),
    )

    # delete old embeddings for all ids in the work queue. locked_id is a special column that is set to the primary key of the table being
    # embedded. For items that are deleted, it is the only key that is set.
    metadata_for_delete = [{"blog_id": blog['locked_id']} for blog in blog_instances]
    vector_store.delete_by_metadata(metadata_for_delete)

    documents = []
    for blog in blog_instances:
        # skip blogs that are not published yet, or are deleted (in which case it will be NULL)
        if blog['published_time'] is not None:
            documents.extend(get_document(blog))

    if len(documents) == 0:
        return
    
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    ids = [d.metadata["id"] for d in documents]
    vector_store.add_texts(texts, metadatas, ids)
```

Then, all you have to do is run the following code in a scheduled job
(cron job, Lambda job, etc.):

``` python
# this job should be run on a schedule
vectorizer = pgvectorizer.Vectorize(service_url, 'blog')
while vectorizer.process(embed_and_write) > 0:
    pass
```

Every time that job runs, it will sync the table with your embeddings.
It will sync all inserts, updates, and deletes to an embeddings table
called `blog_embedding`.
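The reason a single `embed_and_write` function can cover inserts,
updates, and deletes is the delete-then-reinsert pattern it follows. The
sketch below models that pattern with an in-memory dictionary standing in
for the embeddings table. It is a simplified illustration, not
PgVectorize itself: `sync_batch` and the shape of the work items are
assumptions made for this example.

``` python
def sync_batch(store, work_items):
    """Apply one batch of queued changes to an in-memory stand-in for the embeddings table.

    store maps a source row's primary key to the list of chunks derived from it.
    """
    for item in work_items:
        # Always drop what was stored for this row, so updates never leave stale chunks.
        store.pop(item["locked_id"], None)
        # Deleted or unpublished rows have no published_time, so nothing is re-added.
        if item.get("published_time") is not None:
            store[item["locked_id"]] = [item["contents"]]
    return store


store = {}
sync_batch(store, [{"locked_id": 1, "contents": "cats", "published_time": "2021-01-01"}])  # insert
sync_batch(store, [{"locked_id": 1, "contents": "dogs", "published_time": "2021-01-01"}])  # update
sync_batch(store, [{"locked_id": 1}])  # delete: only the key is set
print(store)
```

Because every change starts by removing the old entries for a row,
running the job more than once for the same change leaves the store in
the same state.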

Now, you can simply search the embeddings as follows (again, using
LangChain in the example):

``` python
embedding = OpenAIEmbeddings()
vector_store = TimescaleVector(
    collection_name="blog_embedding",
    service_url=service_url,
    embedding=embedding,
    time_partition_interval=timedelta(days=30),
)

res = vector_store.similarity_search_with_score("Blogs about cats")
res
```

    [(Document(metadata={'id': '334e4800-4bee-11eb-a52a-57b3c4a96ccb', 'author': 'Matvey Arye', 'blog_id': 1, 'category': 'AI', 'published_time': '2021-01-01T00:00:00-05:00'}, page_content='Author Matvey Arye, title: First Post, contents:some super interesting content about cats.'),
      0.12680577303752072)]

## Development

This project is developed with [nbdev](https://nbdev.fast.ai/). Please
see that website for the development process.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/timescale/python-vector",
    "name": "timescale-vector",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "nbdev jupyter notebook python",
    "author": "Matvey Arye",
    "author_email": "mat@timescale.com",
    "download_url": "https://files.pythonhosted.org/packages/a0/52/050137732a2953253d324613c79e15dba30a7fb305ed8f5d95cc076ddd8a/timescale-vector-0.0.7.tar.gz",
    "platform": null,
    "description": "# Timescale Vector\n\n<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->\n\nPostgreSQL++ for AI Applications.\n\n- [Signup for Timescale\n  Vector](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=github&utm_medium=direct):\n  Get 90 days free to try Timescale Vector on the Timescale cloud data\n  platform. There is no self-managed version at this time.\n- [Documentation](https://timescale.github.io/python-vector/): Learn the\n  key features of Timescale Vector and how to use them.\n- [Getting Started\n  Tutorial](https://timescale.github.io/python-vector/tsv_python_getting_started_tutorial.html):\n  Learn how to use Timescale Vector for semantic search on a real-world\n  dataset.\n- [Learn\n  more](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/?utm_campaign=vectorlaunch&utm_source=github&utm_medium=direct):\n  Learn more about Timescale Vector, how it works and why we built it.\n\nIf you prefer to use an LLM development or data framework, see Timescale\nVector\u2019s integrations with\n[LangChain](https://python.langchain.com/docs/integrations/vectorstores/timescalevector)\nand\n[LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/Timescalevector.html)\n\n## Install\n\nTo install the main library use:\n\n``` sh\npip install timescale_vector\n```\n\nWe also use `dotenv` in our examples for passing around secrets and\nkeys. You can install that with:\n\n``` sh\npip install python-dotenv\n```\n\nIf you run into installation errors related to the psycopg2 package, you\nwill need to install some prerequisites. The timescale-vector package\nexplicitly depends on psycopg2 (the non-binary version). 
This adheres to\n[the advice provided by\npsycopg2](https://www.psycopg.org/docs/install.html#psycopg-vs-psycopg-binary).\nBuilding psycopg from source [requires a few prerequisites to be\ninstalled](https://www.psycopg.org/docs/install.html#build-prerequisites).\nMake sure these are installed before trying to\n`pip install timescale_vector`.\n\n## Basic usage\n\nFirst, import all the necessary libraries:\n\n``` python\nfrom dotenv import load_dotenv, find_dotenv\nimport os\nfrom timescale_vector import client\nimport uuid\nfrom datetime import datetime, timedelta\n```\n\nLoad up your PostgreSQL credentials. Safest way is with a .env file:\n\n``` python\n_ = load_dotenv(find_dotenv(), override=True) \nservice_url  = os.environ['TIMESCALE_SERVICE_URL']\n```\n\nNext, create the client. In this tutorial, we will use the sync client.\nBut we have an async client as well (with an identical interface that\nuses async functions).\n\nThe client constructor takes three required arguments:\n\n| name           | description                                                                               |\n|----------------|-------------------------------------------------------------------------------------------|\n| service_url    | Timescale service URL / connection string                                                 |\n| table_name     | Name of the table to use for storing the embeddings. Think of this as the collection name |\n| num_dimensions | Number of dimensions in the vector                                                        |\n\nYou can also specify the schema name, distance type, primary key type,\netc. as optional parameters. Please see the documentation for details.\n\n``` python\nvec  = client.Sync(service_url, \"my_data\", 2)\n```\n\nNext, create the tables for the collection:\n\n``` python\nvec.create_tables()\n```\n\nNext, insert some data. 
The data record contains:\n\n- A UUID to uniquely identify the embedding\n- A JSON blob of metadata about the embedding\n- The text the embedding represents\n- The embedding itself\n\nBecause this data includes UUIDs which become primary keys, we ingest\nwith upserts.\n\n``` python\nvec.upsert([\\\n    (uuid.uuid1(), {\"animal\": \"fox\"}, \"the brown fox\", [1.0,1.3]),\\\n    (uuid.uuid1(), {\"animal\": \"fox\", \"action\":\"jump\"}, \"jumped over the\", [1.0,10.8]),\\\n])\n```\n\nYou can now create a vector index to speed up similarity search:\n\n``` python\nvec.create_embedding_index(client.DiskAnnIndex())\n```\n\nNow, you can query for similar items:\n\n``` python\nvec.search([1.0, 9.0])\n```\n\n    [[UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('4494c12c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\nThere are many search options which we will cover below in the\n`Advanced search` section.\n\nAs one example, we will return one item using a similarity search\nconstrained by a metadata filter.\n\n``` python\nvec.search([1.0, 9.0], limit=1, filter={\"action\": \"jump\"})\n```\n\n    [[UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. 
, 10.8], dtype=float32),\n      0.00016793422934946456]]\n\nThe returned records contain 5 fields:\n\n| name      | description                                             |\n|-----------|---------------------------------------------------------|\n| id        | The UUID of the record                                  |\n| metadata  | The JSON metadata associated with the record            |\n| contents  | the text content that was embedded                      |\n| embedding | The vector embedding                                    |\n| distance  | The distance between the query embedding and the vector |\n\nYou can access the fields by simply using the record as a dictionary\nkeyed on the field name:\n\n``` python\nrecords = vec.search([1.0, 9.0], limit=1, filter={\"action\": \"jump\"})\n(records[0][\"id\"],records[0][\"metadata\"], records[0][\"contents\"], records[0][\"embedding\"], records[0][\"distance\"])\n```\n\n    (UUID('4494c186-4a0d-11ef-94a3-6ee10b77fd09'),\n     {'action': 'jump', 'animal': 'fox'},\n     'jumped over the',\n     array([ 1. , 10.8], dtype=float32),\n     0.00016793422934946456)\n\nYou can delete by ID:\n\n``` python\nvec.delete_by_ids([records[0][\"id\"]])\n```\n\nOr you can delete by metadata filters:\n\n``` python\nvec.delete_by_metadata({\"action\": \"jump\"})\n```\n\nTo delete all records use:\n\n``` python\nvec.delete_all()\n```\n\n## Advanced usage\n\nIn this section, we will go into more detail about our feature. We will\ncover:\n\n1.  Search filter options - how to narrow your search by additional\n    constraints\n2.  Indexing - how to speed up your similarity queries\n3.  Time-based partitioning - how to optimize similarity queries that\n    filter on time\n4.  Setting different distance types to use in distance calculations\n\n### Search options\n\nThe `search` function is very versatile and allows you to search for the\nright vector in a wide variety of ways. We\u2019ll describe the search option\nin 3 parts:\n\n1.  
We\u2019ll cover basic similarity search.\n2.  Then, we\u2019ll describe how to filter your search based on the\n    associated metadata.\n3.  Finally, we\u2019ll talk about filtering on time when time-partitioning\n    is enabled.\n\nLet\u2019s use the following data for our example:\n\n``` python\nvec.upsert([\\\n    (uuid.uuid1(), {\"animal\":\"fox\", \"action\": \"sit\", \"times\":1}, \"the brown fox\", [1.0,1.3]),\\\n    (uuid.uuid1(),  {\"animal\":\"fox\", \"action\": \"jump\", \"times\":100}, \"jumped over the\", [1.0,10.8]),\\\n])\n```\n\nThe basic query looks like:\n\n``` python\nvec.search([1.0, 9.0])\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\nYou could provide a limit for the number of items returned:\n\n``` python\nvec.search([1.0, 9.0], limit=1)\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456]]\n\n#### Narrowing your search by metadata\n\nWe have two main ways to filter results by metadata: - `filters` for\nequality matches on metadata. 
- `predicates` for complex conditions on\nmetadata.\n\nFilters are more likely to be performant but are more limited in what\nthey can express, so we suggest using those if your use case allows it.\n\n##### Filters\n\nYou could specify a match on the metadata as a dictionary where all keys\nhave to match the provided values (keys not in the filter are\nunconstrained):\n\n``` python\nvec.search([1.0, 9.0], limit=1, filter={\"action\": \"sit\"})\n```\n\n    [[UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\nYou can also specify a list of filter dictionaries, where an item is\nreturned if it matches any dict:\n\n``` python\nvec.search([1.0, 9.0], limit=2, filter=[{\"action\": \"jump\"}, {\"animal\": \"fox\"}])\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\n##### Predicates\n\nPredicates allow for more complex search conditions. For example, you\ncould use greater than and less than conditions on numeric values.\n\n``` python\nvec.search([1.0, 9.0], limit=2, predicates=client.Predicates(\"times\", \">\", 1))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. 
, 10.8], dtype=float32),\n      0.00016793422934946456]]\n\n[`Predicates`](https://timescale.github.io/python-vector/vector.html#predicates)\nobjects are defined by the name of the metadata key, an operator, and a\nvalue.\n\nThe supported operators are: `==`, `!=`, `<`, `<=`, `>`, `>=`\n\nThe type of the values determines the type of comparison to perform. For\nexample, passing in `\"Sam\"` (a string) will do a string comparison while\na `10` (an int) will perform an integer comparison while a `10.0`\n(float) will do a float comparison. It is important to note that using a\nvalue of `\"10\"` will do a string comparison as well so it\u2019s important to\nuse the right type. Supported Python types are: `str`, `int`, and\n`float`. One more example with a string comparison:\n\n``` python\nvec.search([1.0, 9.0], limit=2, predicates=client.Predicates(\"action\", \"==\", \"jump\"))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456]]\n\nThe real power of predicates is that they can also be combined using the\n`&` operator (for combining predicates with AND semantics) and `|`(for\ncombining using OR semantic). So you can do:\n\n``` python\nvec.search([1.0, 9.0], limit=2, predicates=client.Predicates(\"action\", \"==\", \"jump\") & client.Predicates(\"times\", \">\", 1))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. 
, 10.8], dtype=float32),\n      0.00016793422934946456]]\n\nJust for sanity, let\u2019s show a case where no results are returned because\nor predicates:\n\n``` python\nvec.search([1.0, 9.0], limit=2, predicates=client.Predicates(\"action\", \"==\", \"jump\") & client.Predicates(\"times\", \"==\", 1))\n```\n\n    []\n\nAnd one more example where we define the predicates as a variable and\nuse grouping with parenthesis:\n\n``` python\nmy_predicates = client.Predicates(\"action\", \"==\", \"jump\") & (client.Predicates(\"times\", \"==\", 1) | client.Predicates(\"times\", \">\", 1))\nvec.search([1.0, 9.0], limit=2, predicates=my_predicates)\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456]]\n\nWe also have some semantic sugar for combining many predicates with AND\nsemantics. You can pass in multiple 3-tuples to\n[`Predicates`](https://timescale.github.io/python-vector/vector.html#predicates):\n\n``` python\nvec.search([1.0, 9.0], limit=2, predicates=client.Predicates((\"action\", \"==\", \"jump\"), (\"times\", \">\", 10)))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456]]\n\n#### Filter your search by time\n\nWhen using `time-partitioning`(see below). You can very efficiently\nfilter your search by time. Time-partitioning makes a timestamp embedded\nas part of the UUID-based ID associated with an embedding. 
Let us first\ncreate a collection with time partitioning and insert some data (one\nitem from January 2018 and another in January 2019):\n\n``` python\ntpvec = client.Sync(service_url, \"time_partitioned_table\", 2, time_partition_interval=timedelta(hours=6))\ntpvec.create_tables()\n\nspecific_datetime = datetime(2018, 1, 1, 12, 0, 0)\ntpvec.upsert([\\\n    (client.uuid_from_time(specific_datetime), {\"animal\":\"fox\", \"action\": \"sit\", \"times\":1}, \"the brown fox\", [1.0,1.3]),\\\n    (client.uuid_from_time(specific_datetime+timedelta(days=365)),  {\"animal\":\"fox\", \"action\": \"jump\", \"times\":100}, \"jumped over the\", [1.0,10.8]),\\\n])\n```\n\nThen, you can filter using the timestamps by specifing a\n`uuid_time_filter`:\n\n``` python\ntpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime, specific_datetime+timedelta(days=1)))\n```\n\n    [[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\nA\n[`UUIDTimeRange`](https://timescale.github.io/python-vector/vector.html#uuidtimerange)\ncan specify a start_date or end_date or both(as in the example above).\nSpecifying only the start_date or end_date leaves the other end\nunconstrained.\n\n``` python\ntpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime))\n```\n\n    [[UUID('ac8be800-0de6-11e9-a5fd-5a100e653c25'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. 
, 1.3], dtype=float32),\n      0.14489260377438218]]\n\nYou have the option to define the inclusivity of the start and end dates\nwith the `start_inclusive` and `end_inclusive` parameters. Setting\n`start_inclusive` to true results in comparisons using the `>=`\noperator, whereas setting it to false applies the `>` operator. By\ndefault, the start date is inclusive, while the end date is exclusive.\nOne example:\n\n``` python\ntpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime, start_inclusive=False))\n```\n\n    [[UUID('ac8be800-0de6-11e9-a5fd-5a100e653c25'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456]]\n\nNotice how the results are different when we use the\n`start_inclusive=False` option because the first row has the exact\ntimestamp specified by `start_date`.\n\nWe\u2019ve also made it easy to integrate time filters using the `filter` and\n`predicates` parameters described above using special reserved key names\nto make it appear that the timestamps are part of your metadata. We\nfound this useful when integrating with other systems that just want to\nspecify a set of filters (often these are \u201cauto retriever\u201d type\nsystems). The reserved key names are `__start_date` and `__end_date` for\nfilters and `__uuid_timestamp` for predicates. Some examples below:\n\n``` python\ntpvec.search([1.0, 9.0], limit=4, filter={ \"__start_date\": specific_datetime, \"__end_date\": specific_datetime+timedelta(days=1)})\n```\n\n    [[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. 
, 1.3], dtype=float32),\n      0.14489260377438218]]\n\n``` python\ntpvec.search([1.0, 9.0], limit=4, \n             predicates=client.Predicates(\"__uuid_timestamp\", \">=\", specific_datetime) & client.Predicates(\"__uuid_timestamp\", \"<\", specific_datetime+timedelta(days=1)))\n```\n\n    [[UUID('33c52800-ef15-11e7-8a12-ea51d07b6447'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\n### Indexing\n\nIndexing speeds up queries over your data. By default, we set up indexes\nto query your data by the UUID and the metadata.\n\nBut to speed up similarity search based on the embeddings, you have to\ncreate additional indexes.\n\nNote that if performing a query without an index, you will always get an\nexact result, but the query will be slow (it has to read all of the data\nyou store for every query). With an index, your queries will be\norder-of-magnitude faster, but the results are approximate (because\nthere are no known indexing techniques that are exact).\n\nNevertheless, there are excellent approximate algorithms. There are 3\ndifferent indexing algorithms available on the Timescale platform:\nTimescale Vector index, pgvector HNSW, and pgvector ivfflat. Below are\nthe trade-offs between these algorithms:\n\n| Algorithm        | Build speed | Query speed | Need to rebuild after updates |\n|------------------|-------------|-------------|-------------------------------|\n| StreamingDiskANN | Fast        | Fastest     | No                            |\n| pgvector hnsw    | Slowest     | Faster      | No                            |\n| pgvector ivfflat | Fastest     | Slowest     | Yes                           |\n\nYou can see\n[benchmarks](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/)\non our blog.\n\nWe recommend using the Timescale Vector index for most use cases. 
This\ncan be created with:\n\n``` python\nvec.create_embedding_index(client.DiskAnnIndex())\n```\n\nIndexes are created for a particular distance metric type. So it is\nimportant that the same distance metric is set on the client during\nindex creation as it is during queries. See the `distance type` section\nbelow.\n\nEach of these indexes has a set of build-time options for controlling\nthe speed/accuracy trade-off when creating the index and an additional\nquery-time option for controlling accuracy during a particular query. We\nhave smart defaults for all of these options but will also describe the\ndetails below so that you can adjust these options manually.\n\n#### StreamingDiskANN index\n\nThe StreamingDiskANN index from pgvectorscale is a graph-based algorithm\nthat uses the [DiskANN](https://github.com/microsoft/DiskANN) algorithm.\nYou can read more about it on our\n[blog](https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/)\nannouncing its release.\n\nTo create this index, run:\n\n``` python\nvec.create_embedding_index(client.DiskAnnIndex())\n```\n\nThe above command will create the index using smart defaults. 
There are\na number of parameters you could tune to adjust the accuracy/speed\ntrade-off.\n\nThe parameters you can set at index build time are:\n\n| Parameter name           | Description                                                                                                                                                                                      | Default value                               |\n|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------|\n| `storage_layout`         | `memory_optimized` which uses SBQ to compress vector data or `plain` which stores data uncompressed                                                                                              | memory_optimized                            |\n| `num_neighbors`          | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower.                                                                              | 50                                          |\n| `search_list_size`       | This is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds.                                    | 100                                         |\n| `max_alpha`              | Is the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds.                                                                                 | 1.2                                         |\n| `num_dimensions`         | The number of dimensions to index. By default, all dimensions are indexed. 
But you can also index less dimensions to make use of [Matryoshka embeddings](https://huggingface.co/blog/matryoshka) | 0 (all dimensions)                          |\n| `num_bits_per_dimension` | Number of bits used to encode each dimension when using SBQ                                                                                                                                      | 2 for less than 900 dimensions, 1 otherwise |\n\nTo set these parameters, you could run:\n\n``` python\nvec.create_embedding_index(client.DiskAnnIndex(num_neighbors=50, search_list_size=100, max_alpha=1.0, storage_layout=\"memory_optimized\", num_dimensions=0, num_bits_per_dimension=1))\n```\n\nYou can also set a parameter to control the accuracy vs.\u00a0query speed\ntrade-off at query time. The parameter is set in the `search()` function\nusing the `query_params` argment.\n\n| Parameter name     | Description                                                             | Default value |\n|--------------------|-------------------------------------------------------------------------|---------------|\n| `search_list_size` | The number of additional candidates considered during the graph search. | 100           |\n| `rescore`          | The number of elements rescored (0 to disable rescoring)                | 50            |\n\nWe suggest using the `rescore` parameter to fine-tune accuracy.\n\n``` python\nvec.search([1.0, 9.0], limit=4, query_params=client.DiskAnnIndexParams(rescore=400, search_list_size=10))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. 
, 1.3], dtype=float32),\n      0.14489260377438218]]\n\nTo drop the index, run:\n\n``` python\nvec.drop_embedding_index()\n```\n\n#### pgvector HNSW index\n\nPgvector provides a graph-based indexing algorithm based on the popular\n[HNSW algorithm](https://arxiv.org/abs/1603.09320).\n\nTo create this index, run:\n\n``` python\nvec.create_embedding_index(client.HNSWIndex())\n```\n\nThe above command will create the index using smart defaults. There are\na number of parameters you could tune to adjust the accuracy/speed\ntrade-off.\n\nThe parameters you can set at index build time are:\n\n| Parameter name  | Description                                                                                                                                                                                                                                                            | Default value |\n|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|\n| m               | Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size.                                             | 16            |\n| ef_construction | Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing ef_construction enables more accurate search results at the expense of lengthier index build times. 
| 64            |\n\nTo set these parameters, you could run:\n\n``` python\nvec.create_embedding_index(client.HNSWIndex(m=16, ef_construction=64))\n```\n\nYou can also set a parameter to control the accuracy vs.\u00a0query speed\ntrade-off at query time. The parameter is set in the `search()` function\nusing the `query_params` argument. You can set `ef_search` (default:\n40). This parameter specifies the size of the dynamic candidate list\nused during search. Higher values improve query accuracy while making\nthe query slower.\n\nYou can specify this value during search as follows:\n\n``` python\nvec.search([1.0, 9.0], limit=4, query_params=client.HNSWIndexParams(ef_search=10))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\nTo drop the index, run:\n\n``` python\nvec.drop_embedding_index()\n```\n\n#### pgvector ivfflat index\n\nPgvector provides a clustering-based indexing algorithm. Our [blog\npost](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/)\ndescribes how it works in detail. It provides the fastest index-build\nspeed but the slowest query speeds of any indexing algorithm.\n\nTo create this index, run:\n\n``` python\nvec.create_embedding_index(client.IvfflatIndex())\n```\n\nNote: *ivfflat should never be created on empty tables* because it needs\nto cluster data, and that only happens when an index is first created,\nnot when new rows are inserted or modified. Also, if your table\nundergoes a lot of modifications, you will need to rebuild this index\noccasionally to maintain good accuracy. 
See our [blog\npost](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/)\nfor details.\n\nPgvector ivfflat has a `lists` index parameter that is automatically set\nwith a smart default based on the number of rows in your table. If you\nknow that you\u2019ll have a different table size, you can specify the number\nof records to use for calculating the `lists` parameter as follows:\n\n``` python\nvec.create_embedding_index(client.IvfflatIndex(num_records=1000000))\n```\n\nYou can also set the `lists` parameter directly:\n\n``` python\nvec.create_embedding_index(client.IvfflatIndex(num_lists=100))\n```\n\nYou can also set a parameter to control the accuracy vs.\u00a0query speed\ntrade-off at query time. The parameter is set in the `search()` function\nusing the `query_params` argument. You can set `probes`. This\nparameter specifies the number of clusters searched during a query. It\nis recommended to set this parameter to `sqrt(lists)`, where `lists` is the\n`num_lists` parameter used above during index creation. Higher values\nimprove query accuracy while making the query slower.\n\nYou can specify this value during search as follows:\n\n``` python\nvec.search([1.0, 9.0], limit=4, query_params=client.IvfflatIndexParams(probes=10))\n```\n\n    [[UUID('456dbbbc-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 100, 'action': 'jump', 'animal': 'fox'},\n      'jumped over the',\n      array([ 1. , 10.8], dtype=float32),\n      0.00016793422934946456],\n     [UUID('456dbb6c-4a0d-11ef-94a3-6ee10b77fd09'),\n      {'times': 1, 'action': 'sit', 'animal': 'fox'},\n      'the brown fox',\n      array([1. , 1.3], dtype=float32),\n      0.14489260377438218]]\n\nTo drop the index, run:\n\n``` python\nvec.drop_embedding_index()\n```\n\n### Time partitioning\n\nIn many use cases where you have many embeddings, time is an important\ncomponent associated with the embeddings. 
For example, when embedding\nnews stories, you often search by time as well as similarity (e.g.,\nstories related to Bitcoin in the past week or stories about Clinton in\nNovember 2016).\n\nYet, traditionally, searching by two components \u201csimilarity\u201d and \u201ctime\u201d\nis challenging for Approximate Nearest Neighbor (ANN) indexes and makes\nthe similarity-search index less effective.\n\nOne approach to solving this is partitioning the data by time and\ncreating ANN indexes on each partition individually. Then, during\nsearch, you can:\n\n- Step 1: filter out partitions that don\u2019t match the time predicate.\n- Step 2: perform the similarity search on all matching partitions.\n- Step 3: combine all the results from each partition in step 2, rerank,\n  and filter out results by time.\n\nStep 1 makes the search a lot more efficient by filtering out whole\nswaths of data in one go.\n\nTimescale-vector supports time partitioning using TimescaleDB\u2019s\nhypertables. To use this feature, simply indicate the length of time for\neach partition when creating the client:\n\n``` python\nfrom datetime import timedelta\nfrom datetime import datetime\n```\n\n``` python\nvec = client.Async(service_url, \"my_data_with_time_partition\", 2, time_partition_interval=timedelta(hours=6))\nawait vec.create_tables()\n```\n\nThen, insert data where the IDs use UUIDs v1 and the time component of\nthe UUID specifies the time of the embedding. 
For example, to create an\nembedding for the current time, simply do:\n\n``` python\nid = uuid.uuid1()\nawait vec.upsert([(id, {\"key\": \"val\"}, \"the brown fox\", [1.0, 1.2])])\n```\n\nTo insert data for a specific time in the past, create the UUID using\nour\n[`uuid_from_time`](https://timescale.github.io/python-vector/vector.html#uuid_from_time)\nfunction:\n\n``` python\nspecific_datetime = datetime(2018, 8, 10, 15, 30, 0)\nawait vec.upsert([(client.uuid_from_time(specific_datetime), {\"key\": \"val\"}, \"the brown fox\", [1.0, 1.2])])\n```\n\nYou can then query the data by specifying a `uuid_time_filter` in the\nsearch call:\n\n``` python\nrec = await vec.search([1.0, 2.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime-timedelta(days=7), specific_datetime+timedelta(days=7)))\n```\n\n### Distance metrics\n\nBy default, we use cosine distance to measure how similar an embedding\nis to a given query. In addition to cosine distance, we also support\nEuclidean/L2 distance. The distance type is set when creating the client\nusing the `distance_type` parameter. For example, to use the Euclidean\ndistance metric, you can create the client with:\n\n``` python\nvec = client.Sync(service_url, \"my_data\", 2, distance_type=\"euclidean\")\n```\n\nValid values for `distance_type` are `cosine` and `euclidean`.\n\nIt is important to note that you should use consistent distance types on\nclients that create indexes and perform queries. That is because an\nindex is only valid for one particular type of distance measure.\n\nPlease note the Timescale Vector index only supports cosine distance at\nthis time.\n\n# LangChain integration\n\n[LangChain](https://www.langchain.com/) is a popular framework for\ndeveloping applications powered by LLMs. 
Timescale Vector has a native\nLangChain integration, enabling you to use Timescale Vector as a\nvectorstore and leverage all its capabilities in your applications built\nwith LangChain.\n\nHere are resources about using Timescale Vector with LangChain:\n\n- [Getting started with LangChain and Timescale\n  Vector](https://python.langchain.com/docs/integrations/vectorstores/timescalevector):\n  You\u2019ll learn how to use Timescale Vector for (1) semantic search, (2)\n  time-based vector search, (3) self-querying, and (4) how to create\n  indexes to speed up queries.\n- [PostgreSQL Self\n  Querying](https://python.langchain.com/docs/integrations/retrievers/self_query/timescalevector_self_query):\n  Learn how to use Timescale Vector with self-querying in LangChain.\n- [LangChain template: RAG with conversational\n  retrieval](https://github.com/langchain-ai/langchain/tree/master/templates/rag-timescale-conversation):\n  This template is used for conversational retrieval, which is one of\n  the most popular LLM use-cases. It passes both a conversation history\n  and retrieved documents into an LLM for synthesis.\n- [LangChain template: RAG with time-based search and self-query\n  retrieval](https://github.com/langchain-ai/langchain/tree/master/templates/rag-timescale-hybrid-search-time): This\n  template shows how to use timescale-vector with the self-query\n  retriever to perform hybrid search on similarity and time. This is\n  useful any time your data has a strong time-based component.\n- [Learn more about Timescale Vector and\n  LangChain](https://blog.langchain.dev/timescale-vector-x-langchain-making-postgresql-a-better-vector-database-for-ai-applications/)\n\n# LlamaIndex integration\n\n[LlamaIndex](https://www.llamaindex.ai/) is a popular data framework for connecting custom data\nsources to large language models (LLMs). 
Timescale Vector has a native\nLlamaIndex integration, enabling you to use Timescale Vector as a\nvectorstore and leverage all its capabilities in your applications built\nwith LlamaIndex.\n\nHere are resources about using Timescale Vector with LlamaIndex:\n\n- [Getting started with LlamaIndex and Timescale\n  Vector](https://docs.llamaindex.ai/en/stable/examples/vector_stores/Timescalevector.html):\n  You\u2019ll learn how to use Timescale Vector for (1) similarity\n  search, (2) time-based vector search, (3) faster search with indexes,\n  and (4) retrieval and query engine.\n- [Time-based\n  retrieval](https://youtu.be/EYMZVfKcRzM?si=I0H3uUPgzKbQw__W): Learn\n  how to power RAG applications with time-based retrieval.\n- [Llama Pack: Auto Retrieval with time-based\n  search](https://github.com/run-llama/llama-hub/tree/main/llama_hub/llama_packs/timescale_vector_autoretrieval):\n  This pack demonstrates performing auto-retrieval for hybrid search\n  based on both similarity and time, using the timescale-vector\n  (PostgreSQL) vectorstore.  \n- [Learn more about Timescale Vector and\n  LlamaIndex](https://www.timescale.com/blog/timescale-vector-x-llamaindex-making-postgresql-a-better-vector-database-for-ai-applications/)\n\n# PgVectorize\n\nPgVectorize enables you to create vector embeddings from any data that\nyou already have stored in PostgreSQL. You can get more background\ninformation in our [blog\npost](https://www.timescale.com/blog/a-complete-guide-to-creating-and-storing-embeddings-for-postgresql-data/)\nannouncing this feature, as well as a [\u201chow we built\nit\u201d](https://www.timescale.com/blog/how-we-designed-a-resilient-vector-embedding-creation-system-for-postgresql-data/)\npost going into the details of the design.\n\nTo create vector embeddings, simply attach PgVectorize to any PostgreSQL\ntable, and it will automatically sync that table\u2019s data with a set of\nembeddings stored in Timescale Vector. 
For example, let\u2019s say you have a\nblog table defined in the following way:\n\n``` python\nimport psycopg2\nfrom langchain.docstore.document import Document\nfrom langchain.text_splitter import CharacterTextSplitter\nfrom timescale_vector import client, pgvectorizer\nfrom langchain_openai import OpenAIEmbeddings\nfrom langchain_community.vectorstores.timescalevector import TimescaleVector\nfrom datetime import timedelta\n```\n\n``` python\nwith psycopg2.connect(service_url) as conn:\n    with conn.cursor() as cursor:\n        cursor.execute('''\n        CREATE TABLE IF NOT EXISTS blog (\n            id              SERIAL PRIMARY KEY NOT NULL,\n            title           TEXT NOT NULL,\n            author          TEXT NOT NULL,\n            contents        TEXT NOT NULL,\n            category        TEXT NOT NULL,\n            published_time  TIMESTAMPTZ NULL --NULL if not yet published\n        );\n        ''')\n```\n\nYou can insert some data as follows:\n\n``` python\nwith psycopg2.connect(service_url) as conn:\n    with conn.cursor() as cursor:\n        cursor.execute('''\n            INSERT INTO blog (title, author, contents, category, published_time) VALUES ('First Post', 'Matvey Arye', 'some super interesting content about cats.', 'AI', '2021-01-01');\n        ''')\n```\n\nNow, say you want to embed these blogs in Timescale Vector. First, you\nneed to define an `embed_and_write` function that takes a set of blog\nposts, creates the embeddings, and writes them into TimescaleVector. 
For\nexample, if using LangChain, it could look something like the following.\n\n``` python\ndef get_document(blog):\n    text_splitter = CharacterTextSplitter(\n        chunk_size=1000,\n        chunk_overlap=200,\n    )\n    docs = []\n    for chunk in text_splitter.split_text(blog['contents']):\n        content = f\"Author {blog['author']}, title: {blog['title']}, contents:{chunk}\"\n        metadata = {\n            \"id\": str(client.uuid_from_time(blog['published_time'])),\n            \"blog_id\": blog['id'], \n            \"author\": blog['author'], \n            \"category\": blog['category'],\n            \"published_time\": blog['published_time'].isoformat(),\n        }\n        docs.append(Document(page_content=content, metadata=metadata))\n    return docs\n\ndef embed_and_write(blog_instances, vectorizer):\n    embedding = OpenAIEmbeddings()\n    vector_store = TimescaleVector(\n        collection_name=\"blog_embedding\",\n        service_url=service_url,\n        embedding=embedding,\n        time_partition_interval=timedelta(days=30),\n    )\n\n    # delete old embeddings for all ids in the work queue. locked_id is a special column that is set to the primary key of the table being\n    # embedded. 
For items that are deleted, it is the only key that is set.\n    metadata_for_delete = [{\"blog_id\": blog['locked_id']} for blog in blog_instances]\n    vector_store.delete_by_metadata(metadata_for_delete)\n\n    documents = []\n    for blog in blog_instances:\n        # skip blogs that are not published yet, or are deleted (in which case it will be NULL)\n        if blog['published_time'] is not None:\n            documents.extend(get_document(blog))\n\n    if len(documents) == 0:\n        return\n    \n    texts = [d.page_content for d in documents]\n    metadatas = [d.metadata for d in documents]\n    ids = [d.metadata[\"id\"] for d in documents]\n    vector_store.add_texts(texts, metadatas, ids)\n```\n\nThen, all you have to do is run the following code in a scheduled job\n(cron job, Lambda job, etc.):\n\n``` python\n# this job should be run on a schedule\nvectorizer = pgvectorizer.Vectorize(service_url, 'blog')\nwhile vectorizer.process(embed_and_write) > 0:\n    pass\n```\n\nEvery time that job runs, it will sync the table with your embeddings.\nIt will sync all inserts, updates, and deletes to an embeddings table\ncalled `blog_embedding`.\n\nNow, you can simply search the embeddings as follows (again, using\nLangChain in the example):\n\n``` python\nembedding = OpenAIEmbeddings()\nvector_store = TimescaleVector(\n    collection_name=\"blog_embedding\",\n    service_url=service_url,\n    embedding=embedding,\n    time_partition_interval=timedelta(days=30),\n)\n\nres = vector_store.similarity_search_with_score(\"Blogs about cats\")\nres\n```\n\n    [(Document(metadata={'id': '334e4800-4bee-11eb-a52a-57b3c4a96ccb', 'author': 'Matvey Arye', 'blog_id': 1, 'category': 'AI', 'published_time': '2021-01-01T00:00:00-05:00'}, page_content='Author Matvey Arye, title: First Post, contents:some super interesting content about cats.'),\n      0.12680577303752072)]\n\n## Development\n\nThis project is developed with [nbdev](https://nbdev.fast.ai/). 
Please\nsee that website for the development process.\n",
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "Python library for storing vector data in Postgres",
    "version": "0.0.7",
    "project_urls": {
        "Homepage": "https://github.com/timescale/python-vector"
    },
    "split_keywords": [
        "nbdev",
        "jupyter",
        "notebook",
        "python"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b753224d52f77a60edc9be03a39e1ed15c37ac680dcd393ec526eeb668ee87ab",
                "md5": "ce743fb40dd2f824e515715acacdd478",
                "sha256": "b21e832ae90add3a07bb7adda9f747a325e902a1d6b85186f0738dacb2861ee2"
            },
            "downloads": -1,
            "filename": "timescale_vector-0.0.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ce743fb40dd2f824e515715acacdd478",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 31502,
            "upload_time": "2024-08-26T18:24:01",
            "upload_time_iso_8601": "2024-08-26T18:24:01.296356Z",
            "url": "https://files.pythonhosted.org/packages/b7/53/224d52f77a60edc9be03a39e1ed15c37ac680dcd393ec526eeb668ee87ab/timescale_vector-0.0.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a052050137732a2953253d324613c79e15dba30a7fb305ed8f5d95cc076ddd8a",
                "md5": "f6f04ced21895bd66bfba4e766db33ab",
                "sha256": "cc6e9517d1ac3caf37aa240ee3401d337952fc6030ea1962bbc7103ae555fb51"
            },
            "downloads": -1,
            "filename": "timescale-vector-0.0.7.tar.gz",
            "has_sig": false,
            "md5_digest": "f6f04ced21895bd66bfba4e766db33ab",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 51792,
            "upload_time": "2024-08-26T18:24:02",
            "upload_time_iso_8601": "2024-08-26T18:24:02.756714Z",
            "url": "https://files.pythonhosted.org/packages/a0/52/050137732a2953253d324613c79e15dba30a7fb305ed8f5d95cc076ddd8a/timescale-vector-0.0.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-26 18:24:02",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "timescale",
    "github_project": "python-vector",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "timescale-vector"
}
        