Embeddings are underrated#

Machine learning (ML) has the potential to advance the state of the art in technical writing. No, I’m not talking about text generation models like Claude Opus, Gemini Pro, Meta Llama 3, OpenAI GPT-4, etc. The ML technology that might end up having the biggest impact on technical writing is embeddings.

Embeddings aren't exactly new, but they have become much more widely accessible in the last couple years. What embeddings offer to technical writers is the ability to discover connections between texts at previously impossible scales.

I know that a lot of my fellow technical writers are worried about text generation models automating away our jobs. I think you’ll find embeddings much more palatable and interesting because there’s a lot less risk in this regard. Read on to see what I mean!

Building intuition about embeddings#

Here’s an overview, geared towards technical writers, of how you use embeddings and how they work.

Input and output#

Someone asks you to “make some embeddings”. What do you input? You input text. It could be a single word, or sentence, or paragraph, or section, or document, or set of documents, etc. You don’t need to provide the same amount of text every time.

What do you get back? If you provide a single word as the input, the output will be an array of numbers like this:

[-0.02387, -0.0353, 0.0456]

Now suppose your input is an entire set of documents. The output turns into this:

[0.0451, -0.0154, 0.0020]

A little strange, right? One input was drastically smaller than the other, yet they both produced an array of 3 numbers. (When you work with real embeddings, the arrays will have hundreds or thousands of numbers, not 3. More on that later.)

Here’s the first key insight. Because we always get back the same amount of numbers no matter how big or small the input text, we now have a way to mathematically compare any two pieces of arbitrary text to each other.

Huh? How can that be? Why would I want to use math to compare docs? And what do those numbers MEAN??

But first, how to literally make the embeddings#

The big service providers have made it very easy. Here’s how it’s done with Gemini:

import google.generativeai as gemini


gemini.configure(api_key='…')

text = 'Hello, world!'
response = gemini.embed_content(
    model='models/text-embedding-004',
    content=text,
    task_type='SEMANTIC_SIMILARITY'
)
embedding = response['embedding']

The size of the array depends on what model you’re using. Gemini’s text-embedding-004 returns an array of 768 numbers whereas Voyage AI’s voyage-3 returns an array of 1024 numbers. This is one of the reasons why you can’t use embeddings from different providers interchangeably. (The other and main reason is that the numbers from one model mean something completely different than the numbers from another model.)
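
If you want to see that size difference for yourself, here’s a minimal sketch using the Voyage AI Python client (the same client that appears in the appendix below):

import os


import voyageai

# Embed the same text with a different provider and compare vector sizes.
voyage = voyageai.Client(api_key=os.getenv('VOYAGE_API_KEY'))
response = voyage.embed(['Hello, world!'], model='voyage-3', input_type='document')
print(len(response.embeddings[0]))  # 1024 for voyage-3
# The Gemini example above returns 768 numbers for text-embedding-004.
# The two vectors live in different latent spaces and can't be mixed.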

Does it cost a lot of money?#

No.

Is it terrible for the environment?#

I don’t know. Once the model is created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are created (trained) in similar ways to text generation models, with all the energy usage that implies. I’ll update this section when I find out more.

What model is best?#

Ideally, your embedding model can accept a huge amount of input text, so that you never need to worry about it erroring out because you fed it too much text. As of October 2024, voyage-3 is the clear winner.

Organization    Model name                  Input limit (tokens)
Voyage AI       voyage-3                    32000
Nomic           Embed                       8192
Mistral         Embed                       8000
OpenAI          text-embedding-3-large      3072
Google          text-embedding-004          2048
Cohere          embed-english-v3.0          512

Very weird multi-dimensional space#

Back to the big mystery. What the hell do these numbers MEAN?!?!?!

I’m no expert here, but for our purposes of building very basic intuition, I’m fairly confident that it’s safe to begin our journey by thinking about coordinates on a map.

Suppose I give you three points and their coordinates:

Point    X-Coordinate    Y-Coordinate
A        3               2
B        1               1
C        -2              -2

There are 2 dimensions to this map: the X-Coordinate and the Y-Coordinate. Each point lives at the intersection of an X-Coordinate and a Y-Coordinate.

Is A closer to B or C?

[Figure: points A, B, and C plotted on a 2D coordinate grid]

A is much closer to B.
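
If you’d rather check that with arithmetic than by eyeballing the plot, a couple of lines of NumPy (used here purely for illustration) confirm it:

import numpy as np

# The three points from the table above.
a = np.array([3, 2])
b = np.array([1, 1])
c = np.array([-2, -2])

# Euclidean (straight-line) distance between points.
print(np.linalg.norm(a - b))  # ~2.24
print(np.linalg.norm(a - c))  # ~6.40, so A is much closer to B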

Here’s the mental leap. This is basically how embeddings work. Each number in the embedding array is a dimension, similar to our X-Coordinates and Y-Coordinates, similar to how we physically live in 3-dimensional space on Earth. When an embedding model sends you back an array of 1000 numbers, it’s telling you the point where that text semantically lives in its 1000-dimension space, relative to all other texts.


The concept of positioning items in a multi-dimensional space like this, where related items are clustered near each other, goes by the wonderful name of latent space.

The most famous example of the weird utility of this technology comes from the Word2vec paper, the foundational research that kickstarted interest in embeddings 11 years ago. In the paper they shared this anecdote:

embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")

Starting with the embedding for king, subtract the embedding for man, then add the embedding for woman. When you look around this vicinity of the latent space, you find the embedding for queen nearby.
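
Here’s a toy sketch of that arithmetic. The 3-dimensional vectors below are made up purely for illustration (real embeddings have hundreds or thousands of dimensions, and nobody can tell you what each one means), but they show the mechanics: subtract, add, then look for the nearest embedding.

import numpy as np
from numpy.linalg import norm

# Made-up 3-dimensional "embeddings", invented for this example only.
king = np.array([0.9, 0.8, 0.1])
man = np.array([0.1, 0.9, 0.1])
woman = np.array([0.1, 0.1, 0.9])
queen = np.array([0.9, 0.1, 0.9])

target = king - man + woman

def cosine(a, b):
    # 1.0 means "pointing in exactly the same direction".
    return np.dot(a, b) / (norm(a) * norm(b))

print(cosine(target, queen))  # ~1.0: queen is the nearby embedding
print(cosine(target, man))    # much lower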

There appears to be an unspoken rule in ML culture that this anecdote must always be followed by this quote from John Rupert Firth:

You shall know a word by the company it keeps!

We started the section by thinking about distance between points on a 2D map. It was a nice stepping stone for building intuition, but now we need to cast it aside, because embeddings operate in hundreds or thousands of dimensions. It’s (probably) impossible to visualize what “distance” looks like in 1000 dimensions. Also, we don’t know what each dimension represents, hence the section heading “Very weird multi-dimensional space”.[1] One dimension might represent something close to color. The king - man + woman ≈ queen anecdote suggests that these models contain some notion of gender. Explainable AI is the subfield of ML research dedicated to figuring out what these dimensions mean (among other things).

The mechanics of converting text into very weird multi-dimensional space are complex, as you might imagine. They are teaching machines to learn, after all. The Illustrated Word2vec is a good way to start your journey down that rabbit hole.

[1] I borrowed this phrase from Embeddings: What they are and why they matter.

Comparing embeddings#

After you’ve generated your embeddings, you’ll need some kind of “database” to keep track of what text each embedding is associated with. In the experiment discussed later, I got by with just a local JSON file:

{
    "authors": {
        "embedding": […]
    },
    "changes/0.1": {
        "embedding": […]
    },
    …
}

authors is the name of a page. embedding is the embedding for that page.

The mechanics of comparing embeddings involve a lot of linear algebra. I learned the basics from Linear Algebra for Machine Learning and Data Science. The big math and ML libraries like NumPy and scikit-learn can do the heavy lifting for you (i.e. very little math code on your end).
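
For example, computing the cosine similarity between two embeddings (the same measure the appendix code uses) takes only a few lines. The two vectors here are tiny placeholders for real embeddings you’ve already generated:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder vectors; in practice these come from your embeddings "database".
page_a = np.array([[0.012, -0.031, 0.044]])  # shape (1, n)
page_b = np.array([[0.015, -0.029, 0.038]])

# Closer to 1 means more semantically similar.
print(cosine_similarity(page_a, page_b)[0][0])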

Applications#

I could tell you exactly how I think we might advance the state of the art in technical writing with embeddings, but where’s the fun in that? Let’s cover a basic example to put the intuition-building ideas into practice and then wrap up this post.

Let a thousand embeddings bloom?#

As docs site owners, I wonder if we should start providing embeddings for our content freely to anyone who wants them, via a REST API or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs? (I have no idea if there are copyright or terms-of-usage problems with sharing embeddings.)
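
As a purely hypothetical sketch (the path and payload shape below are made up, not an existing standard), a docs site could publish something like this at a well-known URI such as /.well-known/embeddings.json:

{
    "model": "voyage-3",
    "pages": {
        "authors": {
            "embedding": […]
        },
        "changes/0.1": {
            "embedding": […]
        }
    }
}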

Parting words#

Three years ago, if you had asked me what 768-dimensional space is, I would have told you that it’s just some abstract concept that physicists and mathematicians need for unfathomable reasons. Embeddings gave me a reason to think about this idea more deeply, and actually apply it to my own work. I think that’s pretty cool.

Order-of-magnitude improvements in our ability to maintain our docs may very well still be possible after all… perhaps we just need an order of magnitude more dimensions!!

Appendix#

Implementation#

I created a Sphinx extension to generate an embedding for each doc. Sphinx automatically invokes this extension as it builds the docs.

import json
import os


import voyageai


VOYAGE_API_KEY = os.getenv('VOYAGE_API_KEY')
voyage = voyageai.Client(api_key=VOYAGE_API_KEY)


def on_build_finished(app, exception):
    # Write the doc-to-embedding mapping to disk once the build is done.
    with open(srcpath, 'w') as f:
        json.dump(data, f, indent=4)


def embed_with_voyage(text):
    try:
        embedding = voyage.embed([text], model='voyage-3', input_type='document').embeddings[0]
        return embedding
    except Exception as e:
        return None


def on_doctree_resolved(app, doctree, docname):
    text = doctree.astext()
    embedding = embed_with_voyage(text)  # Generate an embedding for each document!
    data[docname] = {
        'embedding': embedding
    }


# Use some globals because this is just an experiment and you can't stop me
def init_globals(srcdir):
    global filename
    global srcpath
    global data
    filename = 'embeddings.json'
    srcpath = f'{srcdir}/{filename}'
    data = {}


def setup(app):
    init_globals(app.srcdir)
    # https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx-core-events
    app.connect('doctree-resolved', on_doctree_resolved)  # This event fires on every doc that's processed
    app.connect('build-finished', on_build_finished)
    return {
        'version': '0.0.1',
        'parallel_read_safe': True,
        'parallel_write_safe': True,
    }

When the build finishes, the embeddings data is stored in embeddings.json like this:

{
    "authors": {
        "embedding": […]
    },
    "changes/0.1": {
        "embedding": […]
    },
    …
}

authors and changes/0.1 are docs. embedding contains the embedding for that doc.

The last step is to find the closest neighbor of each doc, i.e. the other page that’s most relevant to the page you’re currently on. Linear Algebra for Machine Learning and Data Science gave me a basic idea of what this math does.

import json


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def find_docname(data, target):
    for docname in data:
        if data[docname]['embedding'] == target:
            return docname
    return None


# Adapted from the Voyage AI docs
# https://web.archive.org/web/20240923001107/https://docs.voyageai.com/docs/quickstart-tutorial
def k_nearest_neighbors(target, embeddings, k=5):
    # Convert to numpy array
    target = np.array(target)
    embeddings = np.array(embeddings)
    # Reshape the query vector embedding to a matrix of shape (1, n) to make it
    # compatible with cosine_similarity
    target = target.reshape(1, -1)
    # Calculate the similarity for each item in data
    cosine_sim = cosine_similarity(target, embeddings)
    # Sort the data by similarity in descending order and take the top k items
    sorted_indices = np.argsort(cosine_sim[0])[::-1]
    # Take the top k related embeddings
    top_k_related_embeddings = embeddings[sorted_indices[:k]]
    top_k_related_embeddings = [
        list(row[:]) for row in top_k_related_embeddings
    ]  # convert to list
    return top_k_related_embeddings


with open('doc/embeddings.json', 'r') as f:
    data = json.load(f)
embeddings = [data[docname]['embedding'] for docname in data]
print('.. csv-table::')
print('   :header: "Target", "Neighbor"')
print()
for target in embeddings:
    neighbors = k_nearest_neighbors(target, embeddings, k=3)
    # ignore neighbors[0] because that is always the target itself
    nearest_neighbor = neighbors[1]
    target_docname = find_docname(data, target)
    target_cell = f'`{target_docname} <https://www.sphinx-doc.org/en/master/{target_docname}.html>`_'
    neighbor_docname = find_docname(data, nearest_neighbor)
    neighbor_cell = f'`{neighbor_docname} <https://www.sphinx-doc.org/en/master/{neighbor_docname}.html>`_'
    print(f'   "{target_cell}", "{neighbor_cell}"')

As you may have noticed, I did not actually implement the recommendation UI in this experiment. My main goal was to get basic data on whether the embeddings approach generates decent recommendations or not.

Results#

How to interpret the data: Target would be the page that you’re currently on. Neighbor would be the recommended page.

Target, Neighbor
authors, changes/0.6
changes/0.1, changes/0.5
changes/0.2, changes/1.2
changes/0.3, changes/0.4
changes/0.4, changes/1.2
changes/0.5, changes/0.6
changes/0.6, changes/1.6
changes/1.0, changes/1.3
changes/1.1, changes/1.2
changes/1.2, changes/1.1
changes/1.3, changes/1.4
changes/1.4, changes/1.3
changes/1.5, changes/1.6
changes/1.6, changes/1.5
changes/1.7, changes/1.8
changes/1.8, changes/1.6
changes/2.0, changes/1.8
changes/2.1, changes/1.2
changes/2.2, changes/1.2
changes/2.3, changes/2.1
changes/2.4, changes/3.5
changes/3.0, changes/4.3
changes/3.1, changes/3.3
changes/3.2, changes/3.0
changes/3.3, changes/3.1
changes/3.4, changes/4.3
changes/3.5, changes/1.3
changes/4.0, changes/3.0
changes/4.1, changes/4.4
changes/4.2, changes/4.4
changes/4.3, changes/3.0
changes/4.4, changes/7.4
changes/4.5, changes/4.4
changes/5.0, changes/3.5
changes/5.1, changes/5.0
changes/5.2, changes/3.5
changes/5.3, changes/5.2
changes/6.0, changes/6.2
changes/6.1, changes/6.2
changes/6.2, changes/6.1
changes/7.0, extdev/deprecated
changes/7.1, changes/7.2
changes/7.2, changes/7.4
changes/7.3, changes/7.4
changes/7.4, changes/7.3
changes/8.0, changes/8.1
changes/8.1, changes/1.8
changes/index, changes/8.0
development/howtos/builders, usage/extensions/index
development/howtos/index, development/tutorials/index
development/howtos/setup_extension, usage/extensions/index
development/html_themes/index, usage/theming
development/html_themes/templating, development/html_themes/index
development/index, usage/index
development/tutorials/adding_domain, extdev/domainapi
development/tutorials/autodoc_ext, usage/extensions/autodoc
development/tutorials/examples/README, tutorial/end
development/tutorials/extending_build, usage/extensions/todo
development/tutorials/extending_syntax, extdev/markupapi
development/tutorials/index, development/howtos/index
examples, index
extdev/appapi, extdev/index
extdev/builderapi, usage/builders/index
extdev/collectorapi, extdev/envapi
extdev/deprecated, changes/1.8
extdev/domainapi, usage/domains/index
extdev/envapi, extdev/collectorapi
extdev/event_callbacks, extdev/appapi
extdev/i18n, usage/advanced/intl
extdev/index, extdev/appapi
extdev/logging, extdev/appapi
extdev/markupapi, development/tutorials/extending_syntax
extdev/nodes, extdev/domainapi
extdev/parserapi, extdev/appapi
extdev/projectapi, extdev/envapi
extdev/testing, internals/contributing
extdev/utils, extdev/appapi
faq, usage/configuration
glossary, usage/quickstart
index, usage/quickstart
internals/code-of-conduct, internals/index
internals/contributing, usage/advanced/intl
internals/index, usage/index
internals/organization, internals/contributing
internals/release-process, extdev/deprecated
latex, usage/configuration
man/index, usage/index
man/sphinx-apidoc, man/sphinx-autogen
man/sphinx-autogen, usage/extensions/autosummary
man/sphinx-build, usage/configuration
man/sphinx-quickstart, tutorial/getting-started
support, tutorial/end
tutorial/automatic-doc-generation, usage/extensions/autosummary
tutorial/deploying, tutorial/first-steps
tutorial/describing-code, usage/domains/index
tutorial/end, usage/index
tutorial/first-steps, tutorial/getting-started
tutorial/getting-started, tutorial/index
tutorial/index, tutorial/getting-started
tutorial/more-sphinx-customization, usage/theming
tutorial/narrative-documentation, usage/quickstart
usage/advanced/intl, internals/contributing
usage/advanced/websupport/api, usage/advanced/websupport/quickstart
usage/advanced/websupport/index, usage/advanced/websupport/quickstart
usage/advanced/websupport/quickstart, usage/advanced/websupport/api
usage/advanced/websupport/searchadapters, usage/advanced/websupport/api
usage/advanced/websupport/storagebackends, usage/advanced/websupport/api
usage/builders/index, usage/configuration
usage/configuration, changes/1.2
usage/domains/c, usage/domains/cpp
usage/domains/cpp, usage/domains/c
usage/domains/index, extdev/domainapi
usage/domains/javascript, usage/domains/python
usage/domains/mathematics, usage/referencing
usage/domains/python, extdev/domainapi
usage/domains/restructuredtext, extdev/markupapi
usage/domains/standard, usage/domains/index
usage/extensions/autodoc, tutorial/automatic-doc-generation
usage/extensions/autosectionlabel, usage/quickstart
usage/extensions/autosummary, tutorial/automatic-doc-generation
usage/extensions/coverage, usage/extensions/autodoc
usage/extensions/doctest, tutorial/describing-code
usage/extensions/duration, tutorial/more-sphinx-customization
usage/extensions/example_google, usage/extensions/example_numpy
usage/extensions/example_numpy, usage/extensions/example_google
usage/extensions/extlinks, usage/extensions/intersphinx
usage/extensions/githubpages, tutorial/deploying
usage/extensions/graphviz, usage/extensions/math
usage/extensions/ifconfig, usage/extensions/doctest
usage/extensions/imgconverter, usage/extensions/math
usage/extensions/index, development/index
usage/extensions/inheritance, usage/extensions/graphviz
usage/extensions/intersphinx, usage/quickstart
usage/extensions/linkcode, usage/extensions/viewcode
usage/extensions/math, usage/configuration
usage/extensions/napoleon, usage/extensions/example_google
usage/extensions/todo, development/tutorials/extending_build
usage/extensions/viewcode, usage/extensions/linkcode
usage/index, tutorial/end
usage/installation, tutorial/getting-started
usage/markdown, extdev/parserapi
usage/quickstart, index
usage/referencing, usage/restructuredtext/roles
usage/restructuredtext/basics, usage/restructuredtext/directives
usage/restructuredtext/directives, usage/restructuredtext/basics
usage/restructuredtext/domains, usage/domains/index
usage/restructuredtext/field-lists, usage/restructuredtext/directives
usage/restructuredtext/index, usage/restructuredtext/basics
usage/restructuredtext/roles, usage/referencing
usage/theming, development/html_themes/index