pyvgx.Vector

1. Vector Objects

Vectors have two distinct representations in pyvgx: external and internal. The external vector is the original vector (with applicable normalization, truncation and quantization applied.) The internal vector is an encoded version of the external vector.

Vectors can be assigned to vertices to enable similarity matching for those vertices.

Vectors can also be created and used as separate objects. In this case the built-in default similarity configuration controls how vectors are compared.

Two vector modes are supported:

Feature Vectors (default)
Euclidean Vectors

Vector mode is chosen at system initialization and cannot be changed after initialization. The chosen vector mode applies to all graphs in the VGX instance.

1.1. Feature Vector

Feature vectors are lists of (<dimension>, <weight>) key-value pairs, where <dimension> is a string and <weight> is a number. The maximum length of <dimension> is 27 bytes. The range of <weight> is [0.0078125, 1.875], internally quantized into 64 discrete buckets. The maximum number of vector features is 48.

1.2. Euclidean Vector

Euclidean vectors are lists of numeric vector components \(A = \lbrace c_1, c_2, ..., c_n \rbrace\) where n is a multiple of 32 or 64. The required multiple for your system can be found by inspecting the value returned by pyvgx.avxbuild(). If 512 is returned the multiple is 64, otherwise 32.

The maximum number of vector components is 65472.

Euclidean vectors are internally normalized by a scaling factor. This is done because the internal representation uses 8-bit integers to store each component and must ensure maximum use of the 8-bit resolution. The scaling factor \(a\) is determined from |c|_max, the component with the largest absolute value:

\( a = \displaystyle \frac{1}{| \text{c}|_{max} }\)

An internal (normalized) Euclidean vector \(A'\) is thus computed:

\(A' = a \cdot A = \lbrace c'_1, c'_2, ..., c'_n \rbrace\), which implies \(c'_k \in \lbrack -1, 1 \rbrack\) and \(c'_\text{max} \in \lbrace -1, 1 \rbrace\).

Consequently \(A'\) is distorted due to loss of resolution.

Internal Euclidean vector representation uses 8-bit integers to encode normalized values in the range \( \langle \frac{-1}{127}, 1 \rbrack \) as integers \( \lbrack 0, 127 \rbrack \), and normalized values in the range \( \lbrack -1, \frac{-1}{127} \rbrack \) as integers \( \lbrack 129, 255 \rbrack \). (Internal byte value \(128\) is reserved and not used.)

1.3. Example: Feature Vectors

from pyvgx import *

# Initialize in the default feature vectors mode
system.Initialize()

# Make two feature vectors
Coffee = Vector( [("beverage",1.5), ("black",0.5), ("hot",0.5)] )
Soda = Vector( [("beverage",1.5), ("bubbly",0.8), ("sweet",0.8)] )

# Compare using default similarity parameters
DefaultSimilarity.Similarity( Coffee, Soda ) # -> 0.7421875

# Change similarity parameters and compare again
DefaultSimilarity.jaccard_exp = 1.0
DefaultSimilarity.Similarity( Coffee, Soda ) # -> 0.375

1.4. Example: Euclidean Vectors

from pyvgx import *
from random import *

# Initialize in euclidean mode
system.Initialize( euclidean=True )

# Make two random Euclidean vectors with components in range [-1,1]
A = Vector( [2*random()-1 for i in range(768)] )
B = Vector( [2*random()-1 for i in range(768)] )

# Cosine of angle between random vectors is close to zero
DefaultSimilarity.Cosine( A, B ) # -> near zero

2. Vector Members

pyvgx Vector Member	Description
pyvgx.Vector.length	Number of elements in vector
pyvgx.Vector.magnitude	Euclidean length of vector
pyvgx.Vector.fingerprint	64-bit integer representing the vector
pyvgx.Vector.external	Vector with original elements where terms are strings and weights are floats
pyvgx.Vector.internal	Enumerated version of vector as used internally for storage and similarity computation

pyvgx Vector Member

Description

pyvgx.Vector.length

Number of elements in vector

pyvgx.Vector.magnitude

Euclidean length of vector

pyvgx.Vector.fingerprint

64-bit integer representing the vector

pyvgx.Vector.external

Vector with original elements where terms are strings and weights are floats

pyvgx.Vector.internal

Enumerated version of vector as used internally for storage and similarity computation

2.1. pyvgx.Vector.length

Return the number of elements in the vector. Equivalent: len( pyvgx.Vector )

2.2. pyvgx.Vector.magnitude

Return the Euclidean length of the vector.

For Euclidean vectors the magnitude of the original vector is not preserved, since the vector is normalized.

2.3. pyvgx.Vector.fingerprint

Return a 64-bit integer representing the vector. This is a dimensionality-reduced (compressed) vector that can be used to speed up certain types of similarity computations. Two fingerprints are compared by counting the number of bit positions where the two differ, i.e. the hamming distance between the two numbers. Small hamming distance translates to high similarity.

2.4. pyvgx.Vector.external

Return the vector as originally defined.

2.4.1. Feature Vectors

Elements are pairs of string terms and floating point weights:

<external_vector> ::= [
    (<s_term>, <f_weight>),
    (<s_term>, <f_weight>),
    (<s_term>, <f_weight>),
    ...
]

where
    <s_term>    ::= <string_up_to_27_bytes>
    <f_weight>  ::= <float>

Vector terms longer than 27 bytes are truncated.

Vector weights are quantized into 64 distinct buckets, and have a range [0.0078125, 1.875].

2.4.2. Euclidean Vectors

Elements are floating point values:

<external_vector> ::= [ c1, c2, ..., cn ]

where n is a multiple of 32 and |ci| <= 1.

2.5. pyvgx.Vector.internal

Return the internal representation of the vector.

2.5.1. Feature Vectors

Elements are pairs of enumerated terms and integer weights:

<internal_vector> ::= [
    (<e_term>, <i_weight>),
    (<e_term>, <i_weight>),
    (<e_term>, <i_weight>),
    ...
]

where
    <e_term>    ::= <integer_mapping_of_s_term>
    <i_weight>  ::= <encoded_f_term>

Vector terms are mapped internally to integer values for more efficient storage and matching.

Internal vector weights are encoded as a small integer in the range [0, 63]

2.5.2. Euclidean Vectors

Elements are bytes, and the returned object is a bytearray:

<internal_vector> ::= bytearray(b'\xnn\xnn...\xnn')

where the number of bytes is a multiple of 32.

Vector components are mapped internally to 8-bit integer values for more efficient storage and matching.

3. Vector Methods

pyvgx Vector Method	Description
pyvgx.Vector.Fingerprint()	Compute a 64-bit integer representing the vector
pyvgx.Vector.AsDict()	Dictionary representation of vector
pyvgx.Vector.Projections()	Return vector index projection information

pyvgx Vector Method

Description

pyvgx.Vector.Fingerprint()

Compute a 64-bit integer representing the vector

pyvgx.Vector.AsDict()

Dictionary representation of vector

pyvgx.Vector.Projections()

Return vector index projection information

3.1. pyvgx.Vector.Fingerprint()

Compute 64-bit vector fingerprint.

3.1.1. Syntax

pyvgx.Vector.Fingerprint( [seed] ) -> int

3.1.2. Parameters

Parameter	Type	Description
seed	int	Projection seed (default 0)

Parameter

Type

Description

seed

int

Projection seed (default 0)

3.1.3. Return Value

This method returns a 64-bit integer representing a projection of the n-dimensional vector onto 64-dimensional binary space. The optional seed allows alternative projections.

3.1.4. Remarks

A vector’s fingerprint is a locality sensitive hash and represents a mapping of a high-dimensional object onto a low-dimensional "shadow" of that object. By using different values for seed it is possible to create multiple fingerprints forming a set of projections which more accurately represent the original vector.

3.1.5. Example

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
V = Vector( [x+0.1*random() for x in base] )
fp0 = V.Fingerprint()
fp1 = V.Fingerprint( 1234 )
fp2 = V.Fingerprint( 0xd5c098876a90fe2c )

3.2. pyvgx.Vector.AsDict()

Return the vector data as a dictionary.

3.2.1. Syntax

pyvgx.Vector.AsDict() -> dict

3.2.2. Return Value

This method returns all vector attributes in a dict.

3.2.2.1. Feature Vectors

<feature_vector_dict> ::= {
    'type'        : 'feature',
    'length'      : <number_of_elements>,
    'magnitude'   : <euclidean_length>,
    'fingerprint' : <int64>
    'external'    : [ (<str>, <float>), (<str>, <float>), ... ],
    'internal'    : [ (<int>, <int>), (<int>, <int>), ... ],
    'centroid'    : <bool>
}

3.2.2.2. Euclidean Vectors

<euclidean_vector_dict> ::= {
    'type'        : 'array',
    'length'      : <number_of_elements>,
    'magnitude'   : <normalized_euclidean_length>,
    'fingerprint' : <int64>
    'external'    : [ <float>, <float>, ... ],
    'internal'    : bytearray(b'\xnn\xnn...'),
    'centroid'    : <bool>
}

3.2.3. Example

from pyvgx import *
from random import random
system.Initialize( euclidean=True )
g = Graph( "graph" )
base = [2*random()-1 for x in range(768)]
V = Vector( [x+0.1*random() for x in base] )
V.AsDict()

3.3. pyvgx.Vector.Projections()

Return the vector’s LSH, "low confidence mask", and index projections for a given seed.

3.3.1. Syntax

pyvgx.Vector.Projections( seed[, lsh[, lcm[, reduce[, expand]]]] ) -> (lsh, lcm, [p1, p2, ...])

3.3.2. Parameters

Parameter	Type	Description
seed	int	Projection seed
lsh	int	Use this vector fingerprint instead of computing it from vector
lcm	int	Use this "low confidence mask" for the provided lsh
reduce	bool	Omit index projections for LSH segments where the corresponding lcm segment is non-zero
expand	bool	Create two index projections for LSH segments where the corresponding lcm segment has exactly one bit set to 1

Parameter

Type

Description

seed

int

Projection seed

lsh

int

Use this vector fingerprint instead of computing it from vector

lcm

int

Use this "low confidence mask" for the provided lsh

reduce

bool

Omit index projections for LSH segments where the corresponding lcm segment is non-zero

expand

bool

Create two index projections for LSH segments where the corresponding lcm segment has exactly one bit set to 1

3.3.3. Return Value

For a given projection seed, return a 3-tuple containing the vector’s fingerprint (LSH), the "low confidence mask" associated with the LSH, and a list of projection node names that map to the vector: (lsh, lcm, [p1, p2, …])

lsh: The vector’s fingerprint for the given seed
lcm: Bitmask where the occurrence of a 1 indicates a low confidence bit value in the corresponding lsh position. When lcm is all zeros it means the entire lsh has high confidence bit values. Bit regions in lsh corresponding to non-zero regions in lcm are less likely to produce good projection keys.
[p1, p2, …]: Projection node name pi corresponding to a node in the graph structure created by pyvgx.Similarity.CreateProjectionSets().

3.3.4. Remarks

Use pyvgx.Similarity.CreateProjectionSets() to initialize a graph structure suitable for building a vector index. Given a vector V, use V.Projections() to obtain the set of anchor nodes that should be connected to the item node representing V.

To search for probe vector Q, use Q.Projections() to obtain the set of anchor nodes for neighborhood queries that will find candidates.

Implementations may want to encode the arcs from projection nodes to item nodes with a portion of the vector LSH that was not used to generate the projection key. Use pyvgx.Vertex.ArcLSH() to obtain a suitable arc value for this purpose.

If lsh and lcm parameters are provided they will be used instead of computing them from the vector.

If parameter reduce is True the returned projection nodes will only be those generated from LSH segments where the corresponding lcm segment is zero. The purpose of this is to prevent the use of a projection that is likely to provide poor recall, thereby saving memory that may be better spent elsewhere to improve recall.

If parameter expand is True two projection nodes will be generated for LSH segments where the corresponding lcm segment has exactly one bit set to 1. The purpose of this is to increase recall by allowing more neighborhood queries to be performed.

Parameters reduce and expand are mutually exclusive.

3.3.5. Example

from pyvgx import *
system.Initialize( "vectors", euclidean=1 )
g = Graph("ann")

# Create a random vector
V = g.sim.NewVector( [random.random() for i in range(768)] )

# Compute LSH, mask, and projection node names
lsh, lcm, proj = V.Projections( 0 )

# Reduced number of projections
len( V.Projections( 0, reduce=True )[2] )   # <= len(proj)

# Expanded number of projections
len( V.Projections( 0, expand=True )[2] )   # >= len(proj)

print(hex(lsh)) # -> 0xa4b8d0ad7ae6a767
print(hex(lcm)) # -> 0x0108000400810000
print(proj)     # -> [b'_0367|000', b'_053B|000', b'_09A9|000',
                #     b'_0DCD|000', b'_13AE|000', b'_16BD|000',
                #     b'_1AB5|000', b'_1E15|000', b'_20D0|000',
                #     b'_25C6|000', b'_292E|000', b'_2F49|000']

4. Feature Vector Weight Encoding

Internal Encoding	External Weight (quantized)
0	0.0078125
1	0.0087890625
2	0.009765625
3	0.0107421875
4	0.01171875
5	0.0126953125
6	0.013671875
7	0.0146484375
8	0.015625
9	0.017578125
10	0.01953125
11	0.021484375
12	0.0234375
13	0.025390625
14	0.02734375
15	0.029296875
16	0.03125
17	0.03515625
18	0.0390625
19	0.04296875
20	0.046875
21	0.05078125
22	0.0546875
23	0.05859375
24	0.0625
25	0.0703125
26	0.078125
27	0.0859375
28	0.09375
29	0.1015625
30	0.109375
31	0.1171875
32	0.125
33	0.140625
34	0.15625
35	0.171875
36	0.1875
37	0.203125
38	0.21875
39	0.234375
40	0.25
41	0.28125
42	0.3125
43	0.34375
44	0.375
45	0.40625
46	0.4375
47	0.46875
48	0.5
49	0.5625
50	0.625
51	0.6875
52	0.75
53	0.8125
54	0.875
55	0.9375
56	1.0
57	1.125
58	1.25
59	1.375
60	1.5
61	1.625
62	1.75
63	1.875