1. Vector Objects

Vectors have two distinct representations in pyvgx: external and internal. The external vector is the original vector (with applicable normalization, truncation and quantization applied.) The internal vector is an encoded version of the external vector.

Vectors can be assigned to vertices to enable similarity matching for those vertices.

Vectors can also be created and used as separate objects. In this case the built-in default similarity configuration controls how vectors are compared.

Two vector modes are supported:

  1. Feature Vectors (default)

  2. Euclidean Vectors

Vector mode is chosen at system initialization and cannot be changed after initialization. The chosen vector mode applies to all graphs in the VGX instance.

1.1. Feature Vector

Feature vectors are lists of (<dimension>, <weight>) key-value pairs, where <dimension> is a string and <weight> is a number. The maximum length of <dimension> is 27 bytes. The range of <weight> is [0.0078125, 1.875], internally quantized into 64 discrete buckets. The maximum number of vector features is 48.

1.2. Euclidean Vector

Euclidean vectors are lists of numeric vector components \(A = \lbrace c_1, c_2, ..., c_n \rbrace\) where n is a multiple of 32 or 64. The required multiple for your system can be found by inspecting the value returned by pyvgx.avxbuild(). If 512 is returned the multiple is 64, otherwise 32.

The maximum number of vector components is 65472.

Euclidean vectors are internally normalized by a scaling factor. This is done because the internal representation uses 8-bit integers to store each component and must ensure maximum use of the 8-bit resolution. The scaling factor \(a\) is determined from |c|max, the component with the largest absolute value:

\( a = \displaystyle \frac{1}{| \text{c}|_{max} }\)

An internal (normalized) Euclidean vector \(A'\) is thus computed:

\(A' = a \cdot A = \lbrace c'_1, c'_2, ..., c'_n \rbrace\), which implies \(c'_k \in \lbrack -1, 1 \rbrack\) and \(c'_\text{max} \in \lbrace -1, 1 \rbrace\).

Consequently \(A'\) is distorted due to loss of resolution.

Internal Euclidean vector representation uses 8-bit integers to encode normalized values in the range \( \langle \frac{-1}{127}, 1 \rbrack \) as integers \( \lbrack 0, 127 \rbrack \), and normalized values in the range \( \lbrack -1, \frac{-1}{127} \rbrack \) as integers \( \lbrack 129, 255 \rbrack \). (Internal byte value \(128\) is reserved and not used.)

1.3. Example: Feature Vectors

from pyvgx import *

# Initialize in the default feature vectors mode
system.Initialize()

# Make two feature vectors
Coffee = Vector( [("beverage",1.5), ("black",0.5), ("hot",0.5)] )
Soda = Vector( [("beverage",1.5), ("bubbly",0.8), ("sweet",0.8)] )

# Compare using default similarity parameters
DefaultSimilarity.Similarity( Coffee, Soda ) # -> 0.7421875

# Change similarity parameters and compare again
DefaultSimilarity.jaccard_exp = 1.0
DefaultSimilarity.Similarity( Coffee, Soda ) # -> 0.375

1.4. Example: Euclidean Vectors

from pyvgx import *
from random import *

# Initialize in euclidean mode
system.Initialize( euclidean=True )

# Make two random Euclidean vectors with components in range [-1,1]
A = Vector( [2*random()-1 for i in range(768)] )
B = Vector( [2*random()-1 for i in range(768)] )

# Cosine of angle between random vectors is close to zero
DefaultSimilarity.Cosine( A, B ) # -> near zero

2. Vector Members

pyvgx Vector Member Description

pyvgx.Vector.length

Number of elements in vector

pyvgx.Vector.magnitude

Euclidean length of vector

pyvgx.Vector.fingerprint

64-bit integer representing the vector

pyvgx.Vector.external

Vector with original elements where terms are strings and weights are floats

pyvgx.Vector.internal

Enumerated version of vector as used internally for storage and similarity computation

2.1. pyvgx.Vector.length

Return the number of elements in the vector. Equivalent: len( pyvgx.Vector )

2.2. pyvgx.Vector.magnitude

Return the Euclidean length of the vector.

For Euclidean vectors the magnitude of the original vector is not preserved, since the vector is normalized.

2.3. pyvgx.Vector.fingerprint

Return a 64-bit integer representing the vector. This is a dimensionality-reduced (compressed) vector that can be used to speed up certain types of similarity computations. Two fingerprints are compared by counting the number of bit positions where the two differ, i.e. the hamming distance between the two numbers. Small hamming distance translates to high similarity.

2.4. pyvgx.Vector.external

Return the vector as originally defined.

2.4.1. Feature Vectors

Elements are pairs of string terms and floating point weights:

<external_vector> ::= [
    (<s_term>, <f_weight>),
    (<s_term>, <f_weight>),
    (<s_term>, <f_weight>),
    ...
]

where
    <s_term>    ::= <string_up_to_27_bytes>
    <f_weight>  ::= <float>

Vector terms longer than 27 bytes are truncated.

Vector weights are quantized into 64 distinct buckets, and have a range [0.0078125, 1.875].

2.4.2. Euclidean Vectors

Elements are floating point values:

<external_vector> ::= [ c1, c2, ..., cn ]

where n is a multiple of 32 and |ci| <= 1.

2.5. pyvgx.Vector.internal

Return the internal representation of the vector.

2.5.1. Feature Vectors

Elements are pairs of enumerated terms and integer weights:

<internal_vector> ::= [
    (<e_term>, <i_weight>),
    (<e_term>, <i_weight>),
    (<e_term>, <i_weight>),
    ...
]

where
    <e_term>    ::= <integer_mapping_of_s_term>
    <i_weight>  ::= <encoded_f_term>

Vector terms are mapped internally to integer values for more efficient storage and matching.

Internal vector weights are encoded as a small integer in the range [0, 63]

2.5.2. Euclidean Vectors

Elements are bytes, and the returned object is a bytearray:

<internal_vector> ::= bytearray(b'\xnn\xnn...\xnn')

where the number of bytes is a multiple of 32.

Vector components are mapped internally to 8-bit integer values for more efficient storage and matching.

3. Vector Methods

pyvgx Vector Method Description

pyvgx.Vector.Fingerprint()

Compute a 64-bit integer representing the vector

pyvgx.Vector.AsDict()

Dictionary representation of vector

pyvgx.Vector.Projections()

Return vector index projection information

3.1. pyvgx.Vector.Fingerprint()

Compute 64-bit vector fingerprint.

3.1.1. Syntax

pyvgx.Vector.Fingerprint( [seed] ) -> int

3.1.2. Parameters

Parameter Type Description

seed

int

Projection seed (default 0)

3.1.3. Return Value

This method returns a 64-bit integer representing a projection of the n-dimensional vector onto 64-dimensional binary space. The optional seed allows alternative projections.

3.1.4. Remarks

A vector’s fingerprint is a locality sensitive hash and represents a mapping of a high-dimensional object onto a low-dimensional "shadow" of that object. By using different values for seed it is possible to create multiple fingerprints forming a set of projections which more accurately represent the original vector.

3.1.5. Example

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
V = Vector( [x+0.1*random() for x in base] )
fp0 = V.Fingerprint()
fp1 = V.Fingerprint( 1234 )
fp2 = V.Fingerprint( 0xd5c098876a90fe2c )

3.2. pyvgx.Vector.AsDict()

Return the vector data as a dictionary.

3.2.1. Syntax

pyvgx.Vector.AsDict() -> dict

3.2.2. Return Value

This method returns all vector attributes in a dict.

3.2.2.1. Feature Vectors
<feature_vector_dict> ::= {
    'type'        : 'feature',
    'length'      : <number_of_elements>,
    'magnitude'   : <euclidean_length>,
    'fingerprint' : <int64>
    'external'    : [ (<str>, <float>), (<str>, <float>), ... ],
    'internal'    : [ (<int>, <int>), (<int>, <int>), ... ],
    'centroid'    : <bool>
}
3.2.2.2. Euclidean Vectors
<euclidean_vector_dict> ::= {
    'type'        : 'array',
    'length'      : <number_of_elements>,
    'magnitude'   : <normalized_euclidean_length>,
    'fingerprint' : <int64>
    'external'    : [ <float>, <float>, ... ],
    'internal'    : bytearray(b'\xnn\xnn...'),
    'centroid'    : <bool>
}

3.2.3. Example

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
V = Vector( [x+0.1*random() for x in base] )
V.AsDict()

3.3. pyvgx.Vector.Projections()

Return the vector’s LSH, "low confidence mask", and index projections for a given seed.

3.3.1. Syntax

pyvgx.Vector.Projections( seed[, lsh[, lcm[, reduce[, expand]]]] ) -> (lsh, lcm, [p1, p2, ...])

3.3.2. Parameters

Parameter Type Description

seed

int

Projection seed

lsh

int

Use this vector fingerprint instead of computing it from vector

lcm

int

Use this "low confidence mask" for the provided lsh

reduce

bool

Omit index projections for LSH segments where the corresponding lcm segment is non-zero

expand

bool

Create two index projections for LSH segments where the corresponding lcm segment has exactly one bit set to 1

3.3.3. Return Value

For a given projection seed, return a 3-tuple containing the vector’s fingerprint (LSH), the "low confidence mask" associated with the LSH, and a list of projection node names that map to the vector: (lsh, lcm, [p1, p2, …​])

lsh

The vector’s fingerprint for the given seed

lcm

Bitmask where the occurrence of a 1 indicates a low confidence bit value in the corresponding lsh position. When lcm is all zeros it means the entire lsh has high confidence bit values. Bit regions in lsh corresponding to non-zero regions in lcm are less likely to produce good projection keys.

[p1, p2, …​]

Projection node name pi corresponding to a node in the graph structure created by pyvgx.Similarity.CreateProjectionSets().

3.3.4. Remarks

Use pyvgx.Similarity.CreateProjectionSets() to initialize a graph structure suitable for building a vector index. Given a vector V, use V.Projections() to obtain the set of anchor nodes that should be connected to the item node representing V.

To search for probe vector Q, use Q.Projections() to obtain the set of anchor nodes for neighborhood queries that will find candidates.

Implementations may want to encode the arcs from projection nodes to item nodes with a portion of the vector LSH that was not used to generate the projection key. Use pyvgx.Vertex.ArcLSH() to obtain a suitable arc value for this purpose.

If lsh and lcm parameters are provided they will be used instead of computing them from the vector.

If parameter reduce is True the returned projection nodes will only be those generated from LSH segments where the corresponding lcm segment is zero. The purpose of this is to prevent the use of a projection that is likely to provide poor recall, thereby saving memory that may be better spent elsewhere to improve recall.

If parameter expand is True two projection nodes will be generated for LSH segments where the corresponding lcm segment has exactly one bit set to 1. The purpose of this is to increase recall by allowing more neighborhood queries to be performed.

Parameters reduce and expand are mutually exclusive.

3.3.5. Example

from pyvgx import *
system.Initialize( "vectors", euclidean=1 )
g = Graph("ann")

# Create a random vector
V = g.sim.NewVector( [random.random() for i in range(768)] )

# Compute LSH, mask, and projection node names
lsh, lcm, proj = V.Projections( 0 )

# Reduced number of projections
len( V.Projections( 0, reduce=True )[2] )   # <= len(proj)

# Expanded number of projections
len( V.Projections( 0, expand=True )[2] )   # >= len(proj)

print(hex(lsh)) # -> 0xa4b8d0ad7ae6a767
print(hex(lcm)) # -> 0x0108000400810000
print(proj)     # -> [b'_0367|000', b'_053B|000', b'_09A9|000',
                #     b'_0DCD|000', b'_13AE|000', b'_16BD|000',
                #     b'_1AB5|000', b'_1E15|000', b'_20D0|000',
                #     b'_25C6|000', b'_292E|000', b'_2F49|000']

4. Feature Vector Weight Encoding

Internal Encoding External Weight (quantized)

0

0.0078125

1

0.0087890625

2

0.009765625

3

0.0107421875

4

0.01171875

5

0.0126953125

6

0.013671875

7

0.0146484375

8

0.015625

9

0.017578125

10

0.01953125

11

0.021484375

12

0.0234375

13

0.025390625

14

0.02734375

15

0.029296875

16

0.03125

17

0.03515625

18

0.0390625

19

0.04296875

20

0.046875

21

0.05078125

22

0.0546875

23

0.05859375

24

0.0625

25

0.0703125

26

0.078125

27

0.0859375

28

0.09375

29

0.1015625

30

0.109375

31

0.1171875

32

0.125

33

0.140625

34

0.15625

35

0.171875

36

0.1875

37

0.203125

38

0.21875

39

0.234375

40

0.25

41

0.28125

42

0.3125

43

0.34375

44

0.375

45

0.40625

46

0.4375

47

0.46875

48

0.5

49

0.5625

50

0.625

51

0.6875

52

0.75

53

0.8125

54

0.875

55

0.9375

56

1.0

57

1.125

58

1.25

59

1.375

60

1.5

61

1.625

62

1.75

63

1.875


PYVGX