pyvgx Similarity Methods Description

pyvgx.Similarity.NewVector()

Create a new vector instance

pyvgx.Similarity.NewCentroid()

Create a new centroid instance

pyvgx.Similarity.rvec()

Create a new, random vector instance

pyvgx.Similarity.Fingerprint()

Compute vector fingerprint

pyvgx.Similarity.CreateProjectionSets()

pyvgx.Similarity.DeleteProjectionSets()

pyvgx.Similarity.Similarity()

Compute the similarity of two vectors

pyvgx.Similarity.EuclideanDistance()

Compute the Euclidean distance between two vectors

pyvgx.Similarity.Cosine()

Compute the cosine similarity of two vectors

pyvgx.Similarity.Jaccard()

Compute the Jaccard similarity of two vectors

pyvgx.Similarity.HammingDistance()

Compute the number of differing bits for two vector fingerprints

pyvgx.Similarity.AsDict()

Return a dictionary representation of a similarity instance

1. pyvgx.Similarity.NewVector()

Create a new pyvgx.Vector instance.

1.1. Syntax

pyvgx.Similarity.NewVector( elements ) -> pyvgx.Vector

1.2. Parameters

Parameter Type Description

elements

list

Feature Vectors: List of tuples [(dim, weight), …​] where dim is a dimension string and weight is a numeric weight for the dimension

Euclidean Vectors: List of numeric values [c1, c2, …​, cn] where n must be a multiple of 32. The maximum number of vector components is 65472. Vectors are normalized by scaling all components by a factor \( a = \displaystyle \frac{1}{| \text{c}_{max} |} \) where cmax is the component with the largest absolute value.

1.3. Return Value

This method returns a new pyvgx.Vector object associated with the Similarity object.

1.4. Remarks

Vector instances are bound to the Similarity object that spawned them. It is not possible to use a Vector created by one Similarity object with another Similarity object.

Similarity objects are themselves bound to Graph instances (as the pyvgx.Graph.sim member.) In practice this means Vector objects are bound to Graph instances, and Vector objects can therefore not be shared among separate Graph instances.

Use pyvgx.Vector.External() to export a Vector to Python list, which may then be used freely with any Similarity object or Graph object.

1.5. Example: Feature Vector

from pyvgx import *
system.Initialize()
g = Graph( "graph" )
A = Similarity.NewVector( g.sim, [("x",1),("y",0.7),("z",0.5)] )
B = g.sim.NewVector( [("x",1.1),("y",0.6),("q",0.5)] )
g.sim.Cosine( A, B )  # -> 0.8515625

1.6. Example: Euclidean Vector

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
A = Similarity.NewVector( g.sim, [x+0.1*random() for x in base] )
B = g.sim.NewVector( [x+0.1*random() for x in base] )
g.sim.Cosine( A, B )  # -> 0.9976

2. pyvgx.Similarity.NewCentroid()

Create a new pyvgx.Vector instance representing the centroid of multiple vectors.

2.1. Syntax

# -> pyvgx.Vector
pyvgx.Similarity.NewCentroid( [ v1, v2, ..., vn ] )

2.2. Parameters

Parameter Type Description

vk

vector or list

Vector instance or list of elements

2.3. Return Value

This method returns a new pyvgx.Vector object representing the centroid vector for all vectors v1 through vn.

2.4. Remarks

If the input vectors are considered as points in n-dimensional space, their centroid can be thought of as the geometric center (or center of mass.)

In Feature Vector mode, since vector objects have an upper limit for how many dimensions they can contain, it is likely for some information to be lost when generating the centroid. Less common features (or features with low weights) in the input vectors may not be included in the centroid.

2.5. Example: Feature Vector

from pyvgx import *
g = Graph( "graph" )

# Some things people may think of when they hear "pizza"
p1 = g.sim.NewVector( [("dough",2), ("cheese",1), ("tomato",1)] )
p2 = g.sim.NewVector( [("crust",2), ("tomato",1), ("sausage",1)] )
p3 = g.sim.NewVector( [("dough",2), ("crust",1), ("sauce",1)] )
p4 = g.sim.NewVector( [("pepperoni",2), ("cheese",1), ("tomato",1)] )
p5 = g.sim.NewVector( [("dough",2), ("tomato",1), ("cheese",1)] )
p6 = g.sim.NewVector( [("cheese",1), ("mushroom",1), ("broccoli",1)] )

# Compute the average impression of "pizza"
P = g.sim.NewCentroid( [p1, p2, p3, p4, p5, p6] )
P.external # -> [ ('dough', 0.875),
#      ('cheese', 0.625),
#      ('tomato', 0.625),
#      ('crust', 0.46875),
#      ('pepperoni', 0.28125),
#      ('sausage', 0.15625),
#      ('broccoli', 0.15625),
#      ('mushroom', 0.15625),
#      ('sauce', 0.15625) ]

2.6. Example: Euclidean Vector

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )

# Random Euclidean vectors
A = g.sim.NewVector( [2*random()-1 for x in range(768)] )
B = g.sim.NewVector( [2*random()-1 for x in range(768)] )
C = g.sim.NewVector( [2*random()-1 for x in range(768)] )
D = g.sim.NewVector( [2*random()-1 for x in range(768)] )

# Vectors have nothing in common
g.sim.Cosine( A, B ) -> near_zero
g.sim.Cosine( A, C ) -> near_zero
g.sim.Cosine( A, D ) -> near_zero
g.sim.Cosine( B, C ) -> near_zero
g.sim.Cosine( B, D ) -> near_zero
g.sim.Cosine( C, D ) -> near_zero

# Compute the centroid of random vectors
X = g.sim.NewCentroid( [A, B, C, D] )

# All vectors have something in common with centroid
g.sim.Cosine( A, X ) -> some_similarity
g.sim.Cosine( B, X ) -> some_similarity
g.sim.Cosine( C, X ) -> some_similarity
g.sim.Cosine( D, X ) -> some_similarity

3. pyvgx.Similarity.rvec()

Create a new, random pyvgx.Vector instance. (Euclidean mode only.)

3.1. Syntax

pyvgx.Similarity.rvec( n ) -> pyvgx.Vector

3.2. Parameters

Parameter Type Description

n

int

Number of dimensions, rounded up to nearest multiple of 32.

3.3. Return Value

This method returns a new randomized pyvgx.Vector object associated with the Similarity object.

3.4. Example

from pyvgx import *
system.Initialize()
g = Graph( "graph" )
A = g.sim.rvec(768)
B = g.sim.rvec(768)
g.sim.Cosine( A, B )  # -> -0.071999

4. pyvgx.Similarity.Fingerprint()

Compute 64-bit vector fingerprint.

4.1. Syntax

pyvgx.Similarity.Fingerprint( vector[, seed] ) -> int

4.2. Parameters

Parameter Type Description

vector

vector or list

Vector instance or list of elements

seed

int

Projection seed (default 0)

4.3. Return Value

This method returns a 64-bit integer representing a projection of the n-dimensional vector onto 64-dimensional binary space. The optional seed allows alternative projections.

4.4. Remarks

A vector’s fingerprint is a locality sensitive hash and represents a mapping of a high-dimensional object onto a low-dimensional "shadow" of that object. By using different values for seed it is possible to create multiple fingerprints forming a set of projections which more accurately represent the original vector.

4.5. Example

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
A = Similarity.NewVector( g.sim, [x+0.1*random() for x in base] )
fp0 = Similarity.Fingerprint( A )
fp1 = Similarity.Fingerprint( A, 1234 )
fp2 = Similarity.Fingerprint( A, 0xd5c098876a90fe2c )

5. pyvgx.Similarity.CreateProjectionSets()

Initialize a graph structure used for approximate nearest neighbor (ANN) vector search.

5.1. Syntax

pyvgx.Similarity.CreateProjectionSets( nseeds, ksize )

5.2. Parameters

Parameter Type Description

nseeds

int

Number of seeds (1 - 8) to use for ANN index projections

ksize

int

Size of projection keys, must be 8, 10, or 12

5.3. Return Value

This function returns a list of seed values.

5.4. Remarks

Use this function to initialize a graph structure used for approximate nearest neighbor search.

The structure depends on nseeds and ksize.

Projection keys are extracted from a vector’s fingerprint. The size of each key is equal to ksize. Keys are extracted from different bit regions of the fingerprint.

ksize=8

A total of 16 8-bit keys are extracted from the 64-bit LSH. Keys 1 - 8 are the eight 8-bit segments of the LSH, and keys 8 - 16 are the eight 8-bit segments of the LSH shifted four bits to the right (with rotation).

ksize=10

A total of 12 10-bit keys are extracted from the 64-bit LSH. Keys 1 - 6 come from the rightmost 60 bits of the LSH, and keys 7 - 12 come from the rightmost 60 bits of the LSH shifted five bits to the right (with rotation).

ksize=12

A total of 15 12-bit keys are extracted from the 64-bit LSH. Keys 1 - 5 come from the rightmost 60 bits of the LSH, keys 6 - 10 come from the rightmost 60 bits of the LSH shifted four bits to the right (with rotation), and keys 11 - 15 come from the rightmost 60 bits of the LSH shifted eight bits to the right (with rotation).

Special index nodes with connections are created as shown below:

_|  ->  _|000   ->  _0000|000
                    _0001|000
                    _0002|000
                    ...
                    _2FFF|000
        _|43F   ->  _0000|43F
                    _0001|43F
                    _0002|43F
                    ...
        ...

The root node is _| connecting to nseeds projection sets.

A projection set node _|set connects to n projection nodes, where n depends on ksize.

ksize=8 → n=4096

Number of keys is 16, each key is eight bits → n = 16 * 28 = 4096

ksize=10 → n=12288

Number of keys is 12, each key is 10 bits → n = 12 * 210 = 12288

ksize=12 → n=61440

Number of keys is 15, each key is 12 bits → n = 15 * 212 = 61440

A projection node _proj|set is an index anchor node that can be connected to a node representing an indexed item, when the item has a vector whose fingerprint contains a bit pattern segment mapped to by proj.

Note that set will have one of eight fixed values pre-determined by the system’s seed values.

Both set and proj in the vertex names are uppercase hex values.

There is no built-in indexing algorithm. This function only creates a skeleton indexing structure for service implementations to take advantage of. Extraction of LSH keys must be performed explicitly by calling

System must be initialized in euclidean mode

5.5. Example

from pyvgx import *
system.Initialize( "ann", euclidean=1 )

g = Graph("test")
g.sim.CreateProjectionSets( 2, 10 )     # -> [0, 1087]

6. pyvgx.Similarity.DeleteProjectionSets()

Destroy previously created ANN index structure.

6.1. Syntax

pyvgx.Similarity.DeleteProjectionSets()

6.2. Remarks

This removes all vertices created by a previous call to pyvgx.Similarity.CreateProjectionSets().

6.3. Example

from pyvgx import *
system.Initialize( "ann", euclidean=1 )

g = Graph("test")
g.CreateVertex( "my vertex" )
g.order                                 # -> 1

g.sim.CreateProjectionSets( 2, 10 )     # -> [0, 1087]
g.order                                 # -> 24580
g.size                                  # -> 24578

g.sim.DeleteProjectionSets()
g.order                                 # -> 1

"my vertex" in g                        # -> True

7. pyvgx.Similarity.Similarity()

Compute the similarity of two vectors.

7.1. Syntax

pyvgx.Similarity.Similarity( A, B ) # -> float

7.2. Parameters

Parameter Type Description

A

vector

First vector instance

B

vector

Second vector instance

7.3. Return Value

This method returns a similarity score for vectors A and B.

7.3.1. Feature Vectors

The score is computed as a composite of Cosine() and Jaccard() using the following formula:

score = Cosine( A, B )α · Jaccard( A, B )β, where α is the cosine exponent and β is the jaccard exponent.

The returned value is a number between 0.0 and 1.0.

7.3.2. Euclidean Vectors

This method is an alias for Cosine().

The returned value is a number between -1.0 and 1.0.

7.4. Example

from pyvgx import *
g = Graph( "graph" )

A = g.sim.NewVector( [("x",1),("y",1),("z",1)] )
B = g.sim.NewVector( [("x",0.5),("y",0.5),("z",0.5)] )
C = g.sim.NewVector( [("x",1),("y",0.5),("z",0.25)] )

g.sim.cosine_exp = 1.0    # jaccard_exp = 0.0 implied
g.sim.Similarity( A, B )  # -> 1.0
g.sim.Similarity( A, C )  # -> 0.8828125

g.sim.cosine_exp = 0.5    # cosine_exp = 0.5 implied
g.sim.Similarity( A, B )  # -> 0.70703125
g.sim.Similarity( A, C )  # -> 0.71484375

8. pyvgx.Similarity.EuclideanDistance()

Compute the Euclidean distance between two vectors.

8.1. Syntax

pyvgx.Similarity.EuclideanDistance( A, B ) # -> float

8.2. Parameters

Parameter Type Description

A

vector

First vector instance

B

vector

Second vector instance

8.3. Return Value

This method returns the Euclidean distance between vectors A and B in n-dimensional space.

\( dist(A,B) = \displaystyle \sqrt{\displaystyle \sum_{i=1}^n { (A_i - B_i)^2 }} \)

The returned value is a non-negative float.

8.4. Example

from pyvgx import *
system.Initialize( "vectors", euclidean=True )
g = Graph( "graph" )

A = g.sim.NewVector( [0.1, 0.2, 0.3, 0.4] + [0]*28 )
B = g.sim.NewVector( [0.9, 0.8, 0.7, 0.6] + [0]*28 )
C = g.sim.NewVector( [-0.5, 0.5, -0.5, 0.5] + [0]*28 )

g.sim.EuclideanDistance( A, B )  # -> 1.0957
g.sim.EuclideanDistance( A, C )  # -> 1.0482
g.sim.EuclideanDistance( B, C )  # -> 1.8721

9. pyvgx.Similarity.Cosine()

Compute the cosine of the angle between two vectors.

9.1. Syntax

pyvgx.Similarity.Cosine( A, B ) # -> float

9.2. Parameters

Parameter Type Description

A

vector

First vector instance

B

vector

Second vector instance

9.3. Return Value

This method returns the Cosine similarity of vectors A and B, which is the cosine of the angle Θ between the two vectors in n-dimensional space. This is computed as the dot product of the two vectors divided by the product of their magnitudes:

\( cos(\theta) = \displaystyle \frac{\displaystyle \sum_{i=1}^n {A_iB_i}}{ \sqrt{\displaystyle \sum_{i=1}^n {A_i^2}} \cdot \sqrt{\displaystyle \sum_{i=1}^n {B_i^2} }} \)

9.3.1. Feature Vectors

The returned value is a number between 0.0 and 1.0.

9.3.2. Euclidean Vectors

The returned value is a number between -1.0 and 1.0.

9.4. Example: Feature Vectors

from pyvgx import *
g = Graph( "graph" )

A = g.sim.NewVector( [("x",1),("y",1),("z",1)] )
B = g.sim.NewVector( [("x",0.5),("y",0.5),("z",0.5)] )
C = g.sim.NewVector( [("x",1),("y",0.5),("z",0.25)] )

g.sim.Cosine( A, B )  # -> 1.0
g.sim.Cosine( A, C )  # -> 0.8828125

9.5. Example: Euclidean Vectors

from pyvgx import *
system.Initialize( euclidean=True )
g = Graph( "graph" )

A = g.sim.NewVector( [0.1, 0.2, 0.3, 0.4] + [0]*28 )
B = g.sim.NewVector( [0.9, 0.8, 0.7, 0.6] + [0]*28 )
C = g.sim.NewVector( [-0.5, 0.5, -0.5, 0.5] + [0]*28 )

g.sim.Cosine( A, B )  # -> 0.8386
g.sim.Cosine( A, C )  # -> 0.1845
g.sim.Cosine( B, C )  # -> -0.0681

10. pyvgx.Similarity.Jaccard()

Compute the Jaccard similarity coefficient of two feature vectors.

10.1. Syntax

pyvgx.Similarity.Jaccard( A, B ) # -> float

10.2. Parameters

Parameter Type Description

A

vector

First vector instance

B

vector

Second vector instance

10.3. Return Value

This method returns the Jaccard similarity coefficient of feature vectors A and B, which can be thought of as the intersection divided by the union of the vectors (A ∩ B / A ∪ B) when the vectors are treated as sets of (weighted) dimensions:

weighted jaccard similarity = ∑min(Ai, Bi) / ∑max( Ai, Bi )

The returned value is a number between 0.0 and 1.0.

Jaccard is supported for feature vectors only, not for Euclidean vectors.

10.4. Example

from pyvgx import *
g = Graph( "graph" )

A = g.sim.NewVector( [("x",1),("y",1),("z",1)] )
B = g.sim.NewVector( [("x",0.5),("y",0.5),("z",0.5)] )
C = g.sim.NewVector( [("x",1),("y",0.5),("z",0.25)] )

g.sim.Jaccard( A, B )  # -> 0.5
g.sim.Jaccard( A, C )  # -> 0.58203125

11. pyvgx.Similarity.HammingDistance()

Compute the number of differing bits in vector fingerprints.

11.1. Syntax

pyvgx.Similarity.HammingDistance( A, B ) # -> int

11.2. Parameters

Parameter Type Description

A

vector

First vector instance

B

vector

Second vector instance

11.3. Return Value

This method compares the fingerprints of two vectors A and B and returns the number of bits that are different between the two, i.e. the hamming distance between the fingerprints.

Vector fingerprints are 64-bit integers derived from the weighted vector dimensions. Similar vectors have similar fingerprints, and therefore a small number of differing bit positions, i.e. a small hamming distance.

The returned value is an integer between 0 and 64.

11.4. Remarks

It is more efficient to compare fingerprints than to compute the Cosine/Jaccard for two vectors, especially for vectors with many dimensions. At the same time, the ability of a fingerprint to accurately represent its source vector is directly affected by the number of vector dimensions. That is, the vectors that produce the highest quality fingerprints are also the ones that benefit the most (speed-wise) from fingerprint comparison.

Using hamming distance as a direct measure of similarity should generally be restricted to very similar vectors (sim>0.9) whose fingerprints are nearly identical (hamdist<6), in applications such as duplicate detection.

Vectors with few dimensions (<10) do not reliably map to fingerprints. It is better to use Cosine/Jaccard to compare short vectors.
Using fingerprint similarity as a pre-filter can significantly speed up processing of queries that include vector similarity filters. The fingerprint pre-filter should set the hamming distance threshold such that a majority of result candidates are removed before the cosine/jaccard is evaluated, without adversely impacting recall of the overall query. This strategy works best if the cosine/jaccard threshold is high.

11.5. Example

from pyvgx import *
g = Graph( "graph" )

A = g.sim.NewVector( [("a",1),("b",1),("c",1),("d",1),("e",1)] )
B = g.sim.NewVector( [("a",1),("b",1),("c",1),("d",1),("f",1)] )

g.sim.HammingDistance( A, B )  # -> 10

12. pyvgx.Similarity.AsDict()

Return a dictionary representing the current configuration of the Similarity instance.

12.1. Syntax

pyvgx.Similarity.AsDict() # -> dict

12.2. Return Value

This methods returns the Similarity object’s current configuration parameters as a Python dictionary.

12.3. Example

from pyvgx import *
g = Graph( "graph" )

g.sim.AsDict()

PYVGX