Fast vector similarity queries using CouchDB views

Problem: I have a corpus of text documents, and given a new document I want to quickly find the most similar documents in the corpus.

Prepare your data

This approach assumes your documents have a set of features that can be represented as a vector. For query speed, keep it to fewer than about 20 dimensions.
In my prototype, I used topic analysis, keyword score ranking and clustering in combination to generate a vector for natural language text.
Store that vector on your CouchDB documents as an array of values. Associate anything else you need in the same document, such as original text, metadata, etc.
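As a concrete sketch, a stored document might look like the dict below. The field names other than `_id` are illustrative, not prescribed by CouchDB; only `vectors` is what the view function reads.

```python
# A hypothetical CouchDB document for this scheme
doc = {
    "_id": "article-0042",
    "text": "Original natural-language text goes here.",
    "topics": ["databases", "search"],   # any extra metadata you need
    "vectors": [3.2, -1.7, 8.0, 0.4],    # the feature vector for this document
}
```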
Your view function would look something like this:
function (doc) {
  // Skip documents that don't carry a feature vector
  if (!doc.vectors) return;
  // Emit the entire vector as the key.
  // Math.floor converts each float into an integer so that
  // startkey/endkey range queries behave predictably.
  var intVector = doc.vectors.map(function (val) {
    return Math.floor(val);
  });
  emit(intVector, 1);
}

Query based on an input example document.

Calculate the vector for the input text exactly as you did for the documents in your database.
Pick a distance (let's say 5) for each dimension in the vector, and calculate a 'high' and 'low' version of each.
Query the CouchDB view with startkey and endkey set to those low and high vector arrays to find all documents with "close enough" keys.
Optionally, do a custom distance calculation to sort the result documents by how close their vectors are to the input.
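It helps to know how CouchDB orders these keys: complex (array) keys are compared element by element, much like Python compares lists. A small sketch with made-up integer vectors shows the idea:

```python
# CouchDB sorts array keys element by element, similar to Python
# list comparison. These vectors are hypothetical examples.
low = [3, -7, 2]     # startkey: each dimension minus the query range
high = [13, 3, 12]   # endkey: each dimension plus the query range
inside = [8, 0, 7]   # a stored document's integer vector

assert low <= inside <= high
# Caveat: an array-key range is lexicographic, so dimensions after the
# first are only loosely bounded; the custom distance sort in the next
# step tightens the final ordering.
```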
Here's how this would look in Python:
import math

import numpy as np

# analyze_db (a CouchDB database handle, e.g. from python-cloudant) and
# vectorize_document are defined elsewhere in the application.

def get_matching_docs(input_text):
    """
    Given an input text, 
    create a vector representation 
    and query the database for similar documents.
    """
    # adding or subtracting from the source vector 
    # to create startkey / endkey queries
    query_range = 5 
    # calculate our input vector - 
    # same algorithm as was used for the database
    doc_vector = vectorize_document(input_text)
    # make sure all are integers
    source_vector = [math.floor(x) for x in doc_vector]
    # offset each value by the query range, 
    # creating high end and low end vectors
    low_query_vector = [x - query_range for x in source_vector]
    high_query_vector = [x + query_range for x in source_vector]
    # call the couch view with startkey, endkey 
    # to get all documents that are in the range
    results = analyze_db.get_view_result('vectors', 'matching', 
        startkey=low_query_vector, 
        endkey=high_query_vector, 
        raw_result=True, reduce=False)
    # customizing the result order by a custom 'distance' check
    ordered_results = []
    for row in results["rows"]:
        # distance between the input vector and this document's vector
        # this is using numpy's linear algebra normalized distance
        dist = np.linalg.norm(np.array(source_vector)-np.array(row['key']))
        # collect document ids and vectors
        ordered_results.append({
            "_id": row["id"],
            "vec": row['key'],
            "dist": dist
        })
    # finally, sort ascending by distance
    ordered_results.sort(key=lambda x: x["dist"])
    # return the top 5 results (or however many you like)
    return ordered_results[:5]
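The distance-and-sort step can be tried on its own, with made-up rows standing in for the view result. Here `math.dist` from the standard library computes the same Euclidean distance as `np.linalg.norm` of the difference; the row contents are hypothetical.

```python
import math

# Hypothetical rows shaped like a CouchDB view result (doc id + key vector)
rows = [
    {"id": "a", "key": [10, 0, 5]},
    {"id": "b", "key": [9, 1, 4]},
    {"id": "c", "key": [2, 8, 1]},
]
source_vector = [9, 0, 4]

# math.dist(a, b) == np.linalg.norm(np.array(a) - np.array(b))
ranked = sorted(
    ({"_id": r["id"], "dist": math.dist(source_vector, r["key"])} for r in rows),
    key=lambda x: x["dist"],
)
# "b" is closest (distance 1.0); "c" is farthest
```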

CouchDB is an excellent fit for this kind of data analysis and querying: views with complex keys like the one above let you build arbitrary indexes over your data, and you can get very clever with them.
Posted: Wed, 20 May 2020 17:25:00 GMT