| pyvgx Similarity Methods | Description |
|---|---|
Create a new vector instance |
|
Create a new centroid instance |
|
Create a new, random vector instance |
|
Compute vector fingerprint |
|
Compute the similarity of two vectors |
|
Compute the Euclidean distance between two vectors |
|
Compute the cosine similarity of two vectors |
|
Compute the Jaccard similarity of two vectors |
|
Compute the number of differing bits for two vector fingerprints |
|
Return a dictionary representation of a similarity instance |
1. pyvgx.Similarity.NewVector()
Create a new pyvgx.Vector instance.
1.1. Syntax
pyvgx.Similarity.NewVector( elements ) -> pyvgx.Vector
1.2. Parameters
| Parameter | Type | Description |
|---|---|---|
elements |
list |
Feature Vectors: List of tuples |
1.3. Return Value
This method returns a new pyvgx.Vector object associated with the Similarity object.
1.4. Remarks
Vector instances are bound to the Similarity object that spawned them. It is not possible to use a Vector created by one Similarity object with another Similarity object.
Similarity objects are themselves bound to Graph instances (as the pyvgx.Graph.sim member.) In practice this means Vector objects are bound to Graph instances, and Vector objects can therefore not be shared among separate Graph instances.
Use pyvgx.Vector.External() to export a Vector to Python list, which may then be used freely with any Similarity object or Graph object.
|
1.5. Example: Feature Vector
from pyvgx import *
system.Initialize()
g = Graph( "graph" )
A = Similarity.NewVector( g.sim, [("x",1),("y",0.7),("z",0.5)] )
B = g.sim.NewVector( [("x",1.1),("y",0.6),("q",0.5)] )
g.sim.Cosine( A, B ) # -> 0.8515625
1.6. Example: Euclidean Vector
from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
A = Similarity.NewVector( g.sim, [x+0.1*random() for x in base] )
B = g.sim.NewVector( [x+0.1*random() for x in base] )
g.sim.Cosine( A, B ) # -> 0.9976
2. pyvgx.Similarity.NewCentroid()
Create a new pyvgx.Vector instance representing the centroid of multiple vectors.
2.1. Syntax
# -> pyvgx.Vector
pyvgx.Similarity.NewCentroid( [ v1, v2, ..., vn ] )
2.2. Parameters
| Parameter | Type | Description |
|---|---|---|
vk |
vector or list |
Vector instance or list of elements |
2.3. Return Value
This method returns a new pyvgx.Vector object representing the centroid vector for all vectors v1 through vn.
2.4. Remarks
If the input vectors are considered as points in n-dimensional space, their centroid can be thought of as the geometric center (or center of mass.)
| In Feature Vector mode, since vector objects have an upper limit for how many dimensions they can contain, it is likely for some information to be lost when generating the centroid. Less common features (or features with low weights) in the input vectors may not be included in the centroid. |
2.5. Example: Feature Vector
from pyvgx import *
g = Graph( "graph" )
# Some things people may think of when they hear "pizza"
p1 = g.sim.NewVector( [("dough",2), ("cheese",1), ("tomato",1)] )
p2 = g.sim.NewVector( [("crust",2), ("tomato",1), ("sausage",1)] )
p3 = g.sim.NewVector( [("dough",2), ("crust",1), ("sauce",1)] )
p4 = g.sim.NewVector( [("pepperoni",2), ("cheese",1), ("tomato",1)] )
p5 = g.sim.NewVector( [("dough",2), ("tomato",1), ("cheese",1)] )
p6 = g.sim.NewVector( [("cheese",1), ("mushroom",1), ("broccoli",1)] )
# Compute the average impression of "pizza"
P = g.sim.NewCentroid( [p1, p2, p3, p4, p5, p6] )
P.external # -> [ ('dough', 0.875),
# ('cheese', 0.625),
# ('tomato', 0.625),
# ('crust', 0.46875),
# ('pepperoni', 0.28125),
# ('sausage', 0.15625),
# ('broccoli', 0.15625),
# ('mushroom', 0.15625),
# ('sauce', 0.15625) ]
2.6. Example: Euclidean Vector
from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
# Random Euclidean vectors
A = g.sim.NewVector( [2*random()-1 for x in range(768)] )
B = g.sim.NewVector( [2*random()-1 for x in range(768)] )
C = g.sim.NewVector( [2*random()-1 for x in range(768)] )
D = g.sim.NewVector( [2*random()-1 for x in range(768)] )
# Vectors have nothing in common
g.sim.Cosine( A, B ) -> near_zero
g.sim.Cosine( A, C ) -> near_zero
g.sim.Cosine( A, D ) -> near_zero
g.sim.Cosine( B, C ) -> near_zero
g.sim.Cosine( B, D ) -> near_zero
g.sim.Cosine( C, D ) -> near_zero
# Compute the centroid of random vectors
X = g.sim.NewCentroid( [A, B, C, D] )
# All vectors have something in common with centroid
g.sim.Cosine( A, X ) -> some_similarity
g.sim.Cosine( B, X ) -> some_similarity
g.sim.Cosine( C, X ) -> some_similarity
g.sim.Cosine( D, X ) -> some_similarity
3. pyvgx.Similarity.rvec()
Create a new, random pyvgx.Vector instance. (Euclidean mode only.)
3.1. Syntax
pyvgx.Similarity.rvec( n ) -> pyvgx.Vector
3.2. Parameters
| Parameter | Type | Description |
|---|---|---|
n |
int |
Number of dimensions, rounded up to nearest multiple of 32. |
3.3. Return Value
This method returns a new randomized pyvgx.Vector object associated with the Similarity object.
3.4. Example
from pyvgx import *
system.Initialize()
g = Graph( "graph" )
A = g.sim.rvec(768)
B = g.sim.rvec(768)
g.sim.Cosine( A, B ) # -> -0.071999
4. pyvgx.Similarity.Fingerprint()
Compute 64-bit vector fingerprint.
4.1. Syntax
pyvgx.Similarity.Fingerprint( vector[, seed] ) -> int
4.2. Parameters
| Parameter | Type | Description |
|---|---|---|
vector |
vector or list |
Vector instance or list of elements |
seed |
int |
Projection seed (default 0) |
4.3. Return Value
This method returns a 64-bit integer representing a projection of the n-dimensional vector onto 64-dimensional binary space. The optional seed allows alternative projections.
4.4. Remarks
A vector’s fingerprint is a locality sensitive hash and represents a mapping of a high-dimensional object onto a low-dimensional "shadow" of that object. By using different values for seed it is possible to create multiple fingerprints forming a set of projections which more accurately represent the original vector.
4.5. Example
from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
A = Similarity.NewVector( g.sim, [x+0.1*random() for x in base] )
fp0 = Similarity.Fingerprint( A )
fp1 = Similarity.Fingerprint( A, 1234 )
fp2 = Similarity.Fingerprint( A, 0xd5c098876a90fe2c )
5. pyvgx.Similarity.CreateProjectionSets()
Initialize a graph structure used for approximate nearest neighbor (ANN) vector search.
5.1. Syntax
pyvgx.Similarity.CreateProjectionSets( nseeds, ksize )
5.2. Parameters
| Parameter | Type | Description |
|---|---|---|
nseeds |
int |
Number of seeds (1 - 8) to use for ANN index projections |
ksize |
int |
Size of projection keys, must be 8, 10, or 12 |
5.3. Return Value
This function returns a list of seed values.
5.4. Remarks
Use this function to initialize a graph structure used for approximate nearest neighbor search.
The structure depends on nseeds and ksize.
Projection keys are extracted from a vector’s fingerprint. The size of each key is equal to ksize. Keys are extracted from different bit regions of the fingerprint.
- ksize=8
-
A total of 16 8-bit keys are extracted from the 64-bit LSH. Keys 1 - 8 are the eight 8-bit segments of the LSH, and keys 8 - 16 are the eight 8-bit segments of the LSH shifted four bits to the right (with rotation).
- ksize=10
-
A total of 12 10-bit keys are extracted from the 64-bit LSH. Keys 1 - 6 come from the rightmost 60 bits of the LSH, and keys 7 - 12 come from the rightmost 60 bits of the LSH shifted five bits to the right (with rotation).
- ksize=12
-
A total of 15 12-bit keys are extracted from the 64-bit LSH. Keys 1 - 5 come from the rightmost 60 bits of the LSH, keys 6 - 10 come from the rightmost 60 bits of the LSH shifted four bits to the right (with rotation), and keys 11 - 15 come from the rightmost 60 bits of the LSH shifted eight bits to the right (with rotation).
Special index nodes with connections are created as shown below:
_| -> _|000 -> _0000|000
_0001|000
_0002|000
...
_2FFF|000
_|43F -> _0000|43F
_0001|43F
_0002|43F
...
...
The root node is _| connecting to nseeds projection sets.
A projection set node _|set connects to n projection nodes, where n depends on ksize.
- ksize=8 → n=4096
-
Number of keys is 16, each key is eight bits → n = 16 * 28 = 4096
- ksize=10 → n=12288
-
Number of keys is 12, each key is 10 bits → n = 12 * 210 = 12288
- ksize=12 → n=61440
-
Number of keys is 15, each key is 12 bits → n = 15 * 212 = 61440
A projection node _proj|set is an index anchor node that can be connected to a node representing an indexed item, when the item has a vector whose fingerprint contains a bit pattern segment mapped to by proj.
Note that set will have one of eight fixed values pre-determined by the system’s seed values.
Both set and proj in the vertex names are uppercase hex values.
|
There is no built-in indexing algorithm. This function only creates a skeleton indexing structure for service implementations to take advantage of. Extraction of LSH keys must be performed explicitly by calling |
|
System must be initialized in euclidean mode |
5.5. Example
from pyvgx import *
system.Initialize( "ann", euclidean=1 )
g = Graph("test")
g.sim.CreateProjectionSets( 2, 10 ) # -> [0, 1087]
6. pyvgx.Similarity.DeleteProjectionSets()
Destroy previously created ANN index structure.
6.1. Syntax
pyvgx.Similarity.DeleteProjectionSets()
6.2. Remarks
This removes all vertices created by a previous call to pyvgx.Similarity.CreateProjectionSets().
6.3. Example
from pyvgx import *
system.Initialize( "ann", euclidean=1 )
g = Graph("test")
g.CreateVertex( "my vertex" )
g.order # -> 1
g.sim.CreateProjectionSets( 2, 10 ) # -> [0, 1087]
g.order # -> 24580
g.size # -> 24578
g.sim.DeleteProjectionSets()
g.order # -> 1
"my vertex" in g # -> True
7. pyvgx.Similarity.Similarity()
Compute the similarity of two vectors.
7.1. Syntax
pyvgx.Similarity.Similarity( A, B ) # -> float
7.2. Parameters
| Parameter | Type | Description |
|---|---|---|
A |
vector |
First vector instance |
B |
vector |
Second vector instance |
7.3. Return Value
This method returns a similarity score for vectors A and B.
7.3.1. Feature Vectors
score = Cosine( A, B )α · Jaccard( A, B )β, where α is the cosine exponent and β is the jaccard exponent.
The returned value is a number between 0.0 and 1.0.
7.3.2. Euclidean Vectors
This method is an alias for Cosine().
The returned value is a number between -1.0 and 1.0.
7.4. Example
from pyvgx import *
g = Graph( "graph" )
A = g.sim.NewVector( [("x",1),("y",1),("z",1)] )
B = g.sim.NewVector( [("x",0.5),("y",0.5),("z",0.5)] )
C = g.sim.NewVector( [("x",1),("y",0.5),("z",0.25)] )
g.sim.cosine_exp = 1.0 # jaccard_exp = 0.0 implied
g.sim.Similarity( A, B ) # -> 1.0
g.sim.Similarity( A, C ) # -> 0.8828125
g.sim.cosine_exp = 0.5 # cosine_exp = 0.5 implied
g.sim.Similarity( A, B ) # -> 0.70703125
g.sim.Similarity( A, C ) # -> 0.71484375
8. pyvgx.Similarity.EuclideanDistance()
Compute the Euclidean distance between two vectors.
8.1. Syntax
pyvgx.Similarity.EuclideanDistance( A, B ) # -> float
8.2. Parameters
| Parameter | Type | Description |
|---|---|---|
A |
vector |
First vector instance |
B |
vector |
Second vector instance |
8.3. Return Value
This method returns the Euclidean distance between vectors A and B in n-dimensional space.
\( dist(A,B) = \displaystyle \sqrt{\displaystyle \sum_{i=1}^n { (A_i - B_i)^2 }} \)
The returned value is a non-negative float.
8.4. Example
from pyvgx import *
system.Initialize( "vectors", euclidean=True )
g = Graph( "graph" )
A = g.sim.NewVector( [0.1, 0.2, 0.3, 0.4] + [0]*28 )
B = g.sim.NewVector( [0.9, 0.8, 0.7, 0.6] + [0]*28 )
C = g.sim.NewVector( [-0.5, 0.5, -0.5, 0.5] + [0]*28 )
g.sim.EuclideanDistance( A, B ) # -> 1.0957
g.sim.EuclideanDistance( A, C ) # -> 1.0482
g.sim.EuclideanDistance( B, C ) # -> 1.8721
9. pyvgx.Similarity.Cosine()
Compute the cosine of the angle between two vectors.
9.1. Syntax
pyvgx.Similarity.Cosine( A, B ) # -> float
9.2. Parameters
| Parameter | Type | Description |
|---|---|---|
A |
vector |
First vector instance |
B |
vector |
Second vector instance |
9.3. Return Value
This method returns the Cosine similarity of vectors A and B, which is the cosine of the angle Θ between the two vectors in n-dimensional space. This is computed as the dot product of the two vectors divided by the product of their magnitudes:
\( cos(\theta) = \displaystyle \frac{\displaystyle \sum_{i=1}^n {A_iB_i}}{ \sqrt{\displaystyle \sum_{i=1}^n {A_i^2}} \cdot \sqrt{\displaystyle \sum_{i=1}^n {B_i^2} }} \)
9.3.1. Feature Vectors
The returned value is a number between 0.0 and 1.0.
9.3.2. Euclidean Vectors
The returned value is a number between -1.0 and 1.0.
9.4. Example: Feature Vectors
from pyvgx import *
g = Graph( "graph" )
A = g.sim.NewVector( [("x",1),("y",1),("z",1)] )
B = g.sim.NewVector( [("x",0.5),("y",0.5),("z",0.5)] )
C = g.sim.NewVector( [("x",1),("y",0.5),("z",0.25)] )
g.sim.Cosine( A, B ) # -> 1.0
g.sim.Cosine( A, C ) # -> 0.8828125
9.5. Example: Euclidean Vectors
from pyvgx import *
system.Initialize( euclidean=True )
g = Graph( "graph" )
A = g.sim.NewVector( [0.1, 0.2, 0.3, 0.4] + [0]*28 )
B = g.sim.NewVector( [0.9, 0.8, 0.7, 0.6] + [0]*28 )
C = g.sim.NewVector( [-0.5, 0.5, -0.5, 0.5] + [0]*28 )
g.sim.Cosine( A, B ) # -> 0.8386
g.sim.Cosine( A, C ) # -> 0.1845
g.sim.Cosine( B, C ) # -> -0.0681
10. pyvgx.Similarity.Jaccard()
Compute the Jaccard similarity coefficient of two feature vectors.
10.1. Syntax
pyvgx.Similarity.Jaccard( A, B ) # -> float
10.2. Parameters
| Parameter | Type | Description |
|---|---|---|
A |
vector |
First vector instance |
B |
vector |
Second vector instance |
10.3. Return Value
This method returns the Jaccard similarity coefficient of feature vectors A and B, which can be thought of as the intersection divided by the union of the vectors (A ∩ B / A ∪ B) when the vectors are treated as sets of (weighted) dimensions:
weighted jaccard similarity = ∑min(Ai, Bi) / ∑max( Ai, Bi )
The returned value is a number between 0.0 and 1.0.
| Jaccard is supported for feature vectors only, not for Euclidean vectors. |
10.4. Example
from pyvgx import *
g = Graph( "graph" )
A = g.sim.NewVector( [("x",1),("y",1),("z",1)] )
B = g.sim.NewVector( [("x",0.5),("y",0.5),("z",0.5)] )
C = g.sim.NewVector( [("x",1),("y",0.5),("z",0.25)] )
g.sim.Jaccard( A, B ) # -> 0.5
g.sim.Jaccard( A, C ) # -> 0.58203125
11. pyvgx.Similarity.HammingDistance()
Compute the number of differing bits in vector fingerprints.
11.1. Syntax
pyvgx.Similarity.HammingDistance( A, B ) # -> int
11.2. Parameters
| Parameter | Type | Description |
|---|---|---|
A |
vector |
First vector instance |
B |
vector |
Second vector instance |
11.3. Return Value
This method compares the fingerprints of two vectors A and B and returns the number of bits that are different between the two, i.e. the hamming distance between the fingerprints.
Vector fingerprints are 64-bit integers derived from the weighted vector dimensions. Similar vectors have similar fingerprints, and therefore a small number of differing bit positions, i.e. a small hamming distance.
The returned value is an integer between 0 and 64.
11.4. Remarks
It is more efficient to compare fingerprints than to compute the Cosine/Jaccard for two vectors, especially for vectors with many dimensions. At the same time, the ability of a fingerprint to accurately represent its source vector is directly affected by the number of vector dimensions. That is, the vectors that produce the highest quality fingerprints are also the ones that benefit the most (speed-wise) from fingerprint comparison.
Using hamming distance as a direct measure of similarity should generally be restricted to very similar vectors (sim>0.9) whose fingerprints are nearly identical (hamdist<6), in applications such as duplicate detection.
| Vectors with few dimensions (<10) do not reliably map to fingerprints. It is better to use Cosine/Jaccard to compare short vectors. |
| Using fingerprint similarity as a pre-filter can significantly speed up processing of queries that include vector similarity filters. The fingerprint pre-filter should set the hamming distance threshold such that a majority of result candidates are removed before the cosine/jaccard is evaluated, without adversely impacting recall of the overall query. This strategy works best if the cosine/jaccard threshold is high. |
11.5. Example
from pyvgx import *
g = Graph( "graph" )
A = g.sim.NewVector( [("a",1),("b",1),("c",1),("d",1),("e",1)] )
B = g.sim.NewVector( [("a",1),("b",1),("c",1),("d",1),("f",1)] )
g.sim.HammingDistance( A, B ) # -> 10
12. pyvgx.Similarity.AsDict()
Return a dictionary representing the current configuration of the Similarity instance.
12.1. Syntax
pyvgx.Similarity.AsDict() # -> dict
12.2. Return Value
This methods returns the Similarity object’s current configuration parameters as a Python dictionary.
