word2vec-style vector arithmetic on docs embeddings§

2025 October 29

word2vec popularized the idea of representing words as vectors where semantically similar words are positioned close to each other in the vector space. Nowadays these vectors are usually called embeddings.

A neat consequence of the word2vec approach is that adding and subtracting vectors produces semantically logical results. From Efficient Estimation of Word Representations in Vector Space (the word2vec paper):

Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen.
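The offset technique is easy to sketch with toy vectors and cosine similarity. The vectors below are not real word2vec embeddings; the three dimensions are hand-picked so that rough "royal", "male", and "female" directions exist:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked 3-d vectors; dimensions loosely mean (royal, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# vector("King") - vector("Man") + vector("Woman")
result = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the word whose vector is closest to the resultant vector.
best = max(vectors, key=lambda word: cosine(result, vectors[word]))
print(best)  # queen
```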

Does word2vec-style vector arithmetic work in technical writing contexts?

Experiments§

word2vec was published in 2013, and embedding models have come a long way since then. A word2vec vector could only represent a single word. A vector from a modern embedding model can represent arbitrary text: a word, paragraph, section, document, set of documents, etc.

My experiments follow the same basic pattern of vector("King") - vector("Man") + vector("Woman"), with one difference: they start out with a vector representing the full text of a document, not a single-word vector.

Same topic, different domain§

This is the first experiment. Starting with the vector for the full text of Testing Your Database from the Supabase docs, subtract the vector for the word supabase, and then add the vector for the word angular. The resultant vector should be semantically close to the concept of “testing in Angular”.

Different topic, same domain§

This is the second experiment. Starting with the vector for the full text of Testing Your Database from the Supabase docs, subtract the vector for the word testing, and then add the vector for the word vectors. The resultant vector should be semantically close to the concept of “vectors in Supabase”.

Task types§

From previous research I’ve learned that task types noticeably affect Gemini Embedding’s outputs. EmbeddingGemma (the model I’ll be using in these experiments) also supports task types. I’ll run both experiments twice: once with default task types, and again with customized task types.
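In sentence-transformers, a task type amounts to a prompt string prepended to the input text before embedding. Here's a minimal sketch of the two prompt shapes involved; the document pattern matches the one used in the appendix code, and the query pattern follows EmbeddingGemma's documented retrieval-query prompt, so treat both as illustrative rather than canonical:

```python
# Task-type prompts are plain strings prepended to the text before embedding.
def document_prompt(text, title="none"):
    # Document-style prompt, as used in the appendix code.
    return f"title: {title} | text: {text}"

def query_prompt(text):
    # Retrieval-query-style prompt, per EmbeddingGemma's model card.
    return f"task: search result | query: {text}"

print(document_prompt("Use pgvector to store embeddings.", title="Vector Columns"))
print(query_prompt("testing in Angular"))
```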

Verification§

There’s no way to directly verify that the resultant vectors are semantically close to the expected concepts. What I can do instead is generate vectors from the full texts of various docs, and then compare the resultant vectors from the experiments against the vectors of these various docs using cosine similarity.
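The verification step boils down to a few lines: score every candidate doc vector against the resultant vector with cosine similarity and sort. The 2-d vectors below are hypothetical stand-ins for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(query, docs):
    """Return (topic, similarity) pairs, most similar first."""
    scored = [(topic, cosine(query, vec)) for topic, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-d stand-ins for real doc embeddings.
docs = {
    "Vector Columns": [0.9, 0.2],
    "Testing Your Database": [0.3, 0.9],
}
resultant = [1.0, 0.1]
for topic, score in rank(resultant, docs):
    print(f'"{topic}" => {score:.3f}')
```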

Here’s the full list of docs that I use in the experiments (the URLs are in data.json in the appendix):

* Background Processing Using Web Workers (Angular)
* Refer To Locales By ID (Angular)
* Testing (Angular)
* Testing Services (Angular)
* LINESTRING (CockroachDB)
* Test Your Application Locally (CockroachDB)
* Actionability (Playwright)
* JUnit (Playwright)
* Writing Tests (Playwright)
* analysis_test (Skylib)
* bzl_library (Skylib)
* diff_test (Skylib)
* Branching (Supabase)
* Testing Your Database (Supabase)
* Testing Your Edge Functions (Supabase)
* Vector Columns (Supabase)

For the Same topic, different domain experiment (Testing Your Database - supabase + angular) I expect the resultant vector to be most similar to Testing or Testing Services from the Angular docs. And for the Different topic, same domain experiment (Testing Your Database - testing + vectors) I expect the resultant vector to be most similar to Vector Columns from the Supabase docs.

Note that I picked short docs because EmbeddingGemma only supports 2048 tokens of input and I didn’t feel like dealing with chunking. Most of the docs revolve around testing.

Results§

In the Same topic, different domain experiment (Testing Your Database - supabase + angular), with custom task types enabled, the resultant vector is most similar to Testing and Testing Services from the Angular docs, as expected:

[INFO] Running "same topic, different domain" experiment with customized task types
[INFO] Results:
[INFO] "Testing" (Angular) => 0.751456081867218
[INFO] "Testing Services" (Angular) => 0.6292878985404968
[INFO] "Background Processing Using Web Workers" (Angular) => 0.5090276598930359
[INFO] "Testing Your Database" (Supabase) => 0.5084458589553833
[INFO] "Refer To Locales By ID" (Angular) => 0.46428176760673523
[INFO] "Test Your Application Locally" (CockroachDB) => 0.4586600363254547
[INFO] "Writing Tests" (Playwright) => 0.4434031546115875
[INFO] "JUnit" (Playwright) => 0.4156876802444458
[INFO] "Actionability" (Playwright) => 0.396766722202301
[INFO] "analysis_test" (Skylib) => 0.3869394063949585
[INFO] "Testing Your Edge Functions" (Supabase) => 0.368389755487442
[INFO] "diff_test" (Skylib) => 0.3524951934814453
[INFO] "bzl_library" (Skylib) => 0.29295891523361206
[INFO] "LINESTRING" (CockroachDB) => 0.2778087854385376
[INFO] "Branching" (Supabase) => 0.26931506395339966
[INFO] "Vector Columns" (Supabase) => 0.23397961258888245

When using the default task types, the resultant vector is most similar to Testing Your Database, i.e. the doc that the experiment started with:

[INFO] Running "same topic, different domain" experiment with default task types
[INFO] Results:
[INFO] "Testing Your Database" (Supabase) => 0.6590374708175659
[INFO] "Testing" (Angular) => 0.571465790271759
[INFO] "Testing Services" (Angular) => 0.46747612953186035
[INFO] "Test Your Application Locally" (CockroachDB) => 0.43749818205833435
[INFO] "Testing Your Edge Functions" (Supabase) => 0.4073418378829956
[INFO] "Writing Tests" (Playwright) => 0.3561333119869232
[INFO] "Background Processing Using Web Workers" (Angular) => 0.3353777527809143
[INFO] "Vector Columns" (Supabase) => 0.3085843324661255
[INFO] "LINESTRING" (CockroachDB) => 0.30450767278671265
[INFO] "Branching" (Supabase) => 0.29775649309158325
[INFO] "analysis_test" (Skylib) => 0.2946781814098358
[INFO] "Actionability" (Playwright) => 0.2879413962364197
[INFO] "JUnit" (Playwright) => 0.2845016121864319
[INFO] "Refer To Locales By ID" (Angular) => 0.2824022173881531
[INFO] "diff_test" (Skylib) => 0.26220911741256714
[INFO] "bzl_library" (Skylib) => 0.2447129189968109

In the Different topic, same domain experiment (Testing Your Database - testing + vectors) the resultant vector is most similar to Vector Columns, regardless of whether default or custom task types were used.

Custom task types:

[INFO] Running "different topic, same domain" experiment with customized task types
[INFO] Results:
[INFO] "Vector Columns" (Supabase) => 0.6380605697631836
[INFO] "Testing Your Database" (Supabase) => 0.44831225275993347
[INFO] "LINESTRING" (CockroachDB) => 0.32693782448768616
[INFO] "Background Processing Using Web Workers" (Angular) => 0.2737721800804138
[INFO] "Testing Your Edge Functions" (Supabase) => 0.25883781909942627
[INFO] "Branching" (Supabase) => 0.2509428560733795
[INFO] "Refer To Locales By ID" (Angular) => 0.2328835278749466
[INFO] "bzl_library" (Skylib) => 0.2133977860212326
[INFO] "Test Your Application Locally" (CockroachDB) => 0.20613139867782593
[INFO] "Testing" (Angular) => 0.16262517869472504
[INFO] "Actionability" (Playwright) => 0.14792931079864502
[INFO] "Writing Tests" (Playwright) => 0.14344163239002228
[INFO] "Testing Services" (Angular) => 0.13723336160182953
[INFO] "diff_test" (Skylib) => 0.12111848592758179
[INFO] "JUnit" (Playwright) => 0.11599748581647873
[INFO] "analysis_test" (Skylib) => 0.0979730486869812

Default task types:

[INFO] Running "different topic, same domain" experiment with default task types
[INFO] Results:
[INFO] "Vector Columns" (Supabase) => 0.6698287129402161
[INFO] "Testing Your Database" (Supabase) => 0.6086233854293823
[INFO] "Testing Your Edge Functions" (Supabase) => 0.36533844470977783
[INFO] "LINESTRING" (CockroachDB) => 0.34430524706840515
[INFO] "Branching" (Supabase) => 0.3141021430492401
[INFO] "Test Your Application Locally" (CockroachDB) => 0.29872700572013855
[INFO] "Background Processing Using Web Workers" (Angular) => 0.28414368629455566
[INFO] "bzl_library" (Skylib) => 0.26424312591552734
[INFO] "Refer To Locales By ID" (Angular) => 0.2537899315357208
[INFO] "Testing" (Angular) => 0.23542608320713043
[INFO] "Writing Tests" (Playwright) => 0.22030793130397797
[INFO] "Testing Services" (Angular) => 0.20675960183143616
[INFO] "Actionability" (Playwright) => 0.1959698647260666
[INFO] "diff_test" (Skylib) => 0.19095730781555176
[INFO] "JUnit" (Playwright) => 0.1832783967256546
[INFO] "analysis_test" (Skylib) => 0.15578024089336395

So, yes, it seems like word2vec-style vector arithmetic can work in technical writing contexts. Make sure to set your task types correctly.

Discussion§

I still don’t really understand how it’s possible to semantically represent an entire document as a single vector, let alone how adding and subtracting single-word vectors from full-document vectors works.

How do we actually use this in technical writing workflows or documentation experiences? I’m not sure. I was just curious to learn whether or not it would work.

Appendix§

Source code§

experiments.py:

from json import load
from os import environ
from sys import exit

from requests import get
from sentence_transformers import SentenceTransformer


class Doc:

    def __init__(self, topic, domain, url, length, embedding):
        self.topic = topic
        self.domain = domain
        self.url = url
        self.length = length
        self.embedding = embedding
        self.similarity = None


def init_docs(model, task_types):
    with open("data.json", "r") as f:
        data = load(f)
    tokenizer = model.tokenizer
    docs = []
    max_length = tokenizer.model_max_length
    for item in data:
        url = item["url"]
        response = get(url)
        text = response.text
        length = len(tokenizer.encode(text))
        topic = item["topic"]
        domain = item["domain"]
        if length > max_length:
            exit(f"[ERROR] Document is too large: {topic}, {domain}")
        prompt = f"title: {topic} | text: "
        embedding = model.encode(text, prompt=prompt) if task_types else model.encode(text)
        doc = Doc(topic, domain, url, length, embedding)
        docs.append(doc)
    return docs


def create_domain_query(model, task_types):
    url = "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"
    response = get(url)
    text = response.text
    if task_types:
        doc = model.encode(text, prompt_name="Retrieval-query")
        supabase = model.encode("supabase", prompt_name="Retrieval-query")
        angular = model.encode("angular", prompt_name="Retrieval-query")
    else:
        doc = model.encode(text)
        supabase = model.encode("supabase")
        angular = model.encode("angular")
    return doc - supabase + angular


def create_topic_query(model, task_types):
    url = "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"
    response = get(url)
    text = response.text
    if task_types:
        doc = model.encode(text, prompt_name="Retrieval-query")
        testing = model.encode("testing", prompt_name="Retrieval-query")
        vectors = model.encode("vectors", prompt_name="Retrieval-query")
    else:
        doc = model.encode(text)
        testing = model.encode("testing")
        vectors = model.encode("vectors")
    return doc - testing + vectors


def run_experiments():
    environ["TOKENIZERS_PARALLELISM"] = "false"
    model = SentenceTransformer("google/embeddinggemma-300m")
    for task_types in [True, False]:
        print(f'[INFO] Running "same topic, different domain" experiment with {"customized" if task_types else "default"} task types')
        docs = init_docs(model, task_types)
        query = create_domain_query(model, task_types)
        for doc in docs:
            similarity = model.similarity(query, doc.embedding).item()
            doc.similarity = similarity
        docs.sort(key=lambda doc: doc.similarity, reverse=True)
        print(f'[INFO] Results:')
        for doc in docs:
            print(f'[INFO] "{doc.topic}" ({doc.domain}) => {doc.similarity}')
        print()
        print(f'[INFO] Running "different topic, same domain" experiment with {"customized" if task_types else "default"} task types')
        docs = init_docs(model, task_types)
        query = create_topic_query(model, task_types)
        for doc in docs:
            similarity = model.similarity(query, doc.embedding).item()
            doc.similarity = similarity
        docs.sort(key=lambda doc: doc.similarity, reverse=True)
        print(f'[INFO] Results:')
        for doc in docs:
            print(f'[INFO] "{doc.topic}" ({doc.domain}) => {doc.similarity}')
        print()


if __name__ == "__main__":
    run_experiments()

data.json:

[
  {
    "domain": "Angular",
    "topic": "Background Processing Using Web Workers",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/ecosystem/web-workers.md" 
  },
  {
    "domain": "Angular",
    "topic": "Refer To Locales By ID",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/guide/i18n/locale-id.md" 
  },
  {
    "domain": "Angular",
    "topic": "Testing",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/guide/testing/overview.md" 
  },
  {
    "domain": "Angular",
    "topic": "Testing Services",
    "url": "https://raw.githubusercontent.com/angular/angular/refs/heads/main/adev/src/content/guide/testing/services.md" 
  },
  {
    "domain": "CockroachDB",
    "topic": "LINESTRING",
    "url": "https://raw.githubusercontent.com/cockroachdb/docs/refs/heads/main/src/current/v25.4/linestring.md" 
  },
  {
    "domain": "CockroachDB",
    "topic": "Test Your Application Locally",
    "url": "https://raw.githubusercontent.com/cockroachdb/docs/refs/heads/main/src/current/v25.4/local-testing.md" 
  },
  {
    "domain": "Skylib",
    "topic": "analysis_test",
    "url": "https://raw.githubusercontent.com/bazelbuild/bazel-skylib/refs/heads/main/docs/analysis_test_doc.md"
  },
  {
    "domain": "Skylib",
    "topic": "bzl_library",
    "url": "https://raw.githubusercontent.com/bazelbuild/bazel-skylib/refs/heads/main/docs/bzl_library.md"
  },
  {
    "domain": "Skylib",
    "topic": "diff_test",
    "url": "https://raw.githubusercontent.com/bazelbuild/bazel-skylib/refs/heads/main/docs/diff_test_doc.md"
  },
  {
    "domain": "Playwright",
    "topic": "Actionability",
    "url": "https://raw.githubusercontent.com/microsoft/playwright/refs/heads/main/docs/src/actionability.md"
  },
  {
    "domain": "Playwright",
    "topic": "JUnit",
    "url": "https://raw.githubusercontent.com/microsoft/playwright/refs/heads/main/docs/src/junit-java.md"
  },
  {
    "domain": "Playwright",
    "topic": "Writing Tests",
    "url": "https://raw.githubusercontent.com/microsoft/playwright/refs/heads/main/docs/src/writing-tests-java.md"
  },
  {
    "domain": "Supabase",
    "topic": "Branching",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/deployment/branching.mdx"
  },
  {
    "domain": "Supabase",
    "topic": "Testing Your Database",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/database/testing.mdx"
  },
  {
    "domain": "Supabase",
    "topic": "Testing Your Edge Functions",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/functions/unit-test.mdx"
  },
  {
    "domain": "Supabase",
    "topic": "Vector Columns",
    "url": "https://raw.githubusercontent.com/supabase/supabase/refs/heads/master/apps/docs/content/guides/ai/vector-columns.mdx"
  }
]

Note that I forgot to pin the URLs to specific commits, i.e. I used the HEAD version of each URL. If you run the experiments a year or two from now (October 2025), your cosine similarity scores will probably be different, because the underlying text of the docs will probably have changed.