%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Diaz Essay
% LaTeX Template
% Version 2.0 (13/1/19)
%
% This template originates from:
% http://www.LaTeXTemplates.com
%
% Authors:
% Vel ([email protected])
% Nicolas Diaz ([email protected])
%
% License:
% CC BY-NC-SA 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/)
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%----------------------------------------------------------------------------------------
% PACKAGES
%----------------------------------------------------------------------------------------
\documentclass[11pt, table]{diazessay} % Font size (can be 10pt, 11pt or 12pt)
\usepackage{graphicx}
\usepackage{anyfontsize}
\usepackage{tikz}
\usetikzlibrary{calc}
\usepackage[hyphens]{url}
\usepackage[
type={CC},
modifier={by-nc-sa},
version={3.0},
]{doclicense}
%----------------------------------------------------------------------------------------
% LINK COLORS
%----------------------------------------------------------------------------------------
\usepackage[]{xcolor}
%----------------------------------------------------------------------------------------
% FORMAL QUOTES
%----------------------------------------------------------------------------------------
% for adjustwidth environment
\usepackage[strict]{changepage}
% for formal definitions
\usepackage{framed}
% environment derived from framed.sty: see leftbar environment definition
\definecolor{formalshade}{rgb}{0.95,0.95,1}
\newenvironment{formal}{%
\def\FrameCommand{%
\hspace{1pt}%
{\color{w_lightblue}\vrule width 2pt}%
{\color{formalshade}\vrule width 4pt}%
\colorbox{formalshade}%
}%
\MakeFramed{\advance\hsize-\width\FrameRestore}%
\noindent\hspace{-4.55pt}% disable indenting first paragraph
\begin{adjustwidth}{}{7pt}%
\vspace{2pt}\vspace{2pt}%
}
{%
\vspace{2pt}\end{adjustwidth}\endMakeFramed%
}
%----------------------------------------------------------------------------------------
% BIBLIOGRAPHY STYLE
%----------------------------------------------------------------------------------------
\usepackage[numbers]{natbib} % bibliography style
\usepackage{amsmath, amssymb, latexsym}
\usepackage{minted} % code formatting
\usemintedstyle{borland}
%----------------------------------------------------------------------------------------
% TIKZ
%----------------------------------------------------------------------------------------
% 3D plots and tikz images
\usepackage{tikz}
\usepackage{tikz-3dplot}
\usetikzlibrary{shapes,arrows,chains,positioning,shapes.geometric,arrows.meta,backgrounds,fit, shapes, calc, matrix}
\usepackage{forest} % trees
\usepackage{float}
\newcommand{\empt}[2]{$#1^{\langle #2 \rangle}$}
%----------------------------------------------------------------------------------------
% IMAGE SHAPES
%----------------------------------------------------------------------------------------
% Define block styles
\tikzstyle{decision} = [diamond, draw, fill=w_lightblue,
text width=4.5em, text badly centered, node distance=3cm, inner sep=0pt]
\tikzstyle{block} = [rectangle, draw, fill=w_lightblue,
text width=5em, text centered, rounded corners, minimum height=4em]
\tikzstyle{line} = [draw, -latex']
\tikzstyle{cloud} = [draw, ellipse,fill=red!20, node distance=3cm,
minimum height=2em]
\tikzstyle{mybox} = [draw=red, fill=w_lightblue, very thick,
rectangle, rounded corners, inner sep=10pt, inner ysep=20pt]
\tikzstyle{fancytitle} =[fill=red, text=white]
\usepackage{tikzscale}
\usepackage{adjustbox}
%----------------------------------------------------------------------------------------
% TOC
%----------------------------------------------------------------------------------------
\setcounter{secnumdepth}{4} % how many sectioning levels to assign numbers to
\setcounter{tocdepth}{3} % how many sectioning levels to show in ToC
%----------------------------------------------------------------------------------------
% CAPTIONS
%----------------------------------------------------------------------------------------
\usepackage{subcaption}
\DeclareCaptionFormat{custom}
{%
\textbf{#1#2}\textit{\small #3}
}
\captionsetup{format=custom}
%----------------------------------------------------------------------------------------
% MATRIX
%----------------------------------------------------------------------------------------
\definecolor{dgreen}{RGB}{72,127,30}
\makeatletter
\tikzset{matrix dividers/.style={execute at end matrix={
\foreach \XX in {2,...,\the\pgf@matrix@numberofcolumns}
{\draw[#1] ($(\tikz@fig@name-1-\XX)!0.5!(\tikz@fig@name-1-\the\numexpr\XX-1)$)
coordinate (aux) (aux|-\tikz@fig@name.north)
-- (aux|-\tikz@fig@name.south);
}
\foreach \XX in {2,...,\the\pgfmatrixcurrentrow}
{\draw[#1] ($(\tikz@fig@name-\XX-1)!0.5!(\tikz@fig@name-\the\numexpr\XX-1\relax-1)$)
coordinate (aux) (aux-|\tikz@fig@name.west)
-- (aux-|\tikz@fig@name.east);
}
}},matrix frame/.style={execute at end matrix={
\draw[#1] (\tikz@fig@name.south west) rectangle (\tikz@fig@name.north east);
}}}
% from https://tex.stackexchange.com/a/386805/121799
\def\tikz@lib@matrix@empty@cell{\iftikz@lib@matrix@empty\node[name=\tikzmatrixname-\the\pgfmatrixcurrentrow-\the\pgfmatrixcurrentcolumn,empty node]{};\fi}
\makeatother
\tikzset{matrix of mathsf nodes/.style={%
matrix of nodes,
nodes={%
execute at begin node=$\mathsf\bgroup,%
execute at end node=\egroup$%
}%
}}
\usepackage{tikz-qtree,tikz-qtree-compat}
%----------------------------------------------------------------------------------------
% COLORS
%----------------------------------------------------------------------------------------
\definecolor{w_lightblue}{HTML}{D5E7F7}
\definecolor{lightblue}{HTML}{ADD8E6}
\definecolor{celestialblue}{rgb}{0.29, 0.59, 0.82}
\definecolor{asparagus}{rgb}{0.53, 0.66, 0.42}
\definecolor{azure}{rgb}{0.0, 0.5, 1.0}
%----------------------------------------------------------------------------------------
% TITLE SECTION
%----------------------------------------------------------------------------------------
\begin{document}
\begin{sloppypar}
\thispagestyle{empty}
\thispagestyle{empty}
\begin{tikzpicture}[overlay,remember picture]
% Background color
\fill[black!2] (current page.south west) rectangle (current page.north east);
% Image
\node at ($(current page.north)+(0,-1/2.5*\paperheight)$) {\includegraphics[width=0.5\textwidth]{figures/kandinsky.png}};
% Title
\node[align=center] at ($(current page.center)+(0,-5)$)
{
{\fontsize{40}{44} \selectfont {What are embeddings}} \\[1cm]
\rule[0.5ex]{4cm}{0.4pt} \\[1cm]
{\fontsize{30}{19.2} \selectfont \textcolor{blue}{\bfseries Vicki Boykis}}\\[3pt]
};
\end{tikzpicture}
\newpage
%----------------------------------------------------------------------------------------
% ABSTRACT AND META
%----------------------------------------------------------------------------------------
\begin{abstract}
Over the past decade, embeddings --- numerical representations of machine learning features used as input to deep learning models --- have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications have exploded, creating approaches specifically tailored to scale has become increasingly important.
Google's \href{https://arxiv.org/abs/1301.3781}{Word2Vec paper} made an important step in moving from simple statistical representations to the semantic meaning of words. The subsequent rise of the \href{https://arxiv.org/abs/1706.03762}{Transformer architecture} and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.
\end{abstract}
\section*{Colophon}
This paper is typeset with \LaTeX. The cover art is Kandinsky's "Circles in a Circle", 1923. \href{https://vickiboykis.com/2023/02/26/what-should-you-use-chatgpt-for/}{ChatGPT was used} to generate some of the figures.
\section*{Code, \LaTeX, and Website}
The latest version of the paper and code examples \href{https://github.com/veekaybee/what_are_embeddings}{are available here.} The \href{http://vickiboykis.com/what_are_embeddings/}{website for this project is here.}
\section*{About the Author}
Vicki Boykis is a machine learning engineer. Her website is \href{https://www.vickiboykis.com}{vickiboykis.com} and her semantic search side project is \href{https://viberary.pizza}{viberary.pizza}.
%----------------------------------------------------------------------------------------
% CREATIVE COMMONS LICENSE
%----------------------------------------------------------------------------------------
\section*{Acknowledgements}
I'm grateful to everyone who has graciously offered technical feedback but especially to Nicola Barbieri, Peter Baumgartner, Luca Belli, James Kirk, and Ravi Mody. All remaining errors, typos, and bad jokes are mine. Thank you to Dan for your patience, encouragement, for parenting while I was in the latent space, and for once casually asking, "How do you generate these 'embeddings', anyway?"
\section*{License}
\doclicenseThis
\newpage
%----------------------------------------------------------------------------------------
% TOC
%----------------------------------------------------------------------------------------
\tableofcontents
\newpage
%----------------------------------------------------------------------------------------
% ESSAY BODY
%----------------------------------------------------------------------------------------
\section{Introduction}
Implementing deep learning models has become an increasingly important machine learning strategy\footnote{Check out the machine learning industrial view Matt Turck \href{https://mattturck.com/mad2023/}{puts together every year, which has exploded in size.}} for companies looking to build data-driven products. In order to build and power deep learning models, companies collect and feed hundreds of millions of terabytes of multimodal\footnote{Multimodal means a variety of data usually including text, video, audio, and more recently as shown in \href{https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/}{Meta's ImageBind}, depth, thermal, and IMU.} data into deep learning models. As a result, \textbf{embeddings} --- deep learning models' internal representations of their input data --- are quickly becoming a critical component of building machine learning systems.
For example, they make up a significant part of Spotify's item recommender systems \citep{hansen2020contextual}, YouTube video recommendations of what to watch \citep{covington2016deep}, and Pinterest's visual search \citep{jing2015visual}. Even if they are not explicitly presented to the user through recommendation system UIs, embeddings are also used internally at places like Netflix to make content decisions around which shows to develop based on user preference popularity.
\begin{figure}[H]
\centering{\includegraphics[width=\textwidth]{figures/embeddings.png}}
\caption{Left to right: Products that use embeddings to generate recommended items: Spotify Radio, YouTube video recommendations, visual recommendations at Pinterest, and BERT embeddings in suggested Google search results.}
\end{figure}
The usage of embeddings to generate compressed, context-specific representations of content exploded in popularity after the publication of Google’s Word2Vec paper \citep{mikolov2013efficient}.
\begin{figure}[H]
\centering
\includegraphics[width=.7\linewidth]{figures/embeddings_1.png}
\caption{Embeddings papers in arXiv by month. It's interesting to note the decline in frequency of embeddings-specific papers, possibly in tandem with the rise of deep learning architectures like GPT. \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_2_embeddings_papers.ipynb}{Source}}
\end{figure}
Building and expanding on the concepts in Word2Vec, the Transformer \citep{vaswani2017attention} architecture, with its self-attention mechanism, a much more specialized case of calculating context around a given word, has become the de facto way to learn representations of growing multimodal vocabularies. Its rise in popularity, both in academia and in industry, has caused embeddings to become a staple of deep learning workflows.
However, the concept of embeddings can be elusive because they're neither data flow inputs nor output results; they are intermediate elements that live within machine learning services to refine models. So it's helpful to define them explicitly from the beginning.
As a general definition, embeddings are data that has been transformed into n-dimensional matrices for use in deep learning computations. The process of embedding (as a verb):
\begin{formal}
\begin{itemize}
\item \emph{Transforms} multimodal input into representations that are easier to perform intensive computation on, in the form of \textbf{vectors}, tensors, or graphs \citep{rao2019natural}. For the purpose of machine learning, we can think of vectors as a list (or array) of numbers.
\item \emph{Compresses} input information for use in a machine learning \textbf{task} --- the type of methods available to us in machine learning to solve specific problems --- such as summarizing a document or identifying tags or labels for social media posts or performing \textbf{semantic search} on a large text corpus. The process of compression changes variable feature dimensions into fixed inputs, allowing them to be passed efficiently into downstream components of machine learning systems.
\item \emph{Creates an embedding space} that is specific to the data the embeddings were trained on but that, in the case of deep learning representations, can also generalize to other tasks and domains through \textbf{transfer learning} --- the ability to switch contexts --- which is one of the reasons embeddings have exploded in popularity across machine learning applications.
\end{itemize}
\end{formal}
What do embeddings actually look like? Here is one single embedding, also called a \textbf{vector}, in three \textbf{dimensions}. We can think of this as a representation of a single element in our dataset. For example, this hypothetical embedding represents a single word "fly", in three dimensions. Generally, we represent individual embeddings as row vectors.
\begin{equation}
\begin{bmatrix}
1 & 4 & 9
\end{bmatrix}
\end{equation}
And here is a \textbf{tensor}, also known as a \textbf{matrix}\footnote{The difference between a matrix and a tensor is that it's a matrix if you're doing linear algebra and a tensor if you're an AI researcher.}, which is a multidimensional combination of vector representations of multiple elements. For example, this could be the representation of "fly" and "bird."
\begin{equation}
\begin{bmatrix}
1 & 4 & 9\\
4 & 5 & 6
\end{bmatrix}
\end{equation}
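As a minimal sketch of the same idea in code, the made-up values above can be written as plain Python lists. The variable names (\texttt{fly}, \texttt{bird}, \texttt{embeddings}, \texttt{n\_rows}, \texttt{n\_dims}) are illustrative assumptions, not part of any library.
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos,
breaklines
]{python3}
# Hypothetical three-dimensional embeddings, using the made-up values above.
fly = [1, 4, 9]
bird = [4, 5, 6]

# Stacking the row vectors gives a matrix with one row per embedded element.
embeddings = [fly, bird]

n_rows = len(embeddings)     # number of embedded elements
n_dims = len(embeddings[0])  # number of dimensions per embedding
print(n_rows, n_dims)
\end{minted}
Each row is one element of the dataset and each column is one dimension, which is the layout assumed in the rest of these sketches.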
These embeddings are the output of the process of \textbf{learning} embeddings, which we do by passing raw input data into a machine learning model. We transform that multidimensional input data by compressing it, through the algorithms we discuss in this paper, into a lower-dimensional space. The result is a set of vectors in an \textbf{embedding space.}
\begin{figure}[H]
\centering
\begin{tikzpicture}[node distance=4cm]
\node[draw, rectangle] (multimodal) at (0,0) [block] {
\begin{tabular}{c}
Word \\
Sentence \\
Image
\end{tabular}
};
\node[above] at (multimodal.north) {Multimodal data} [block];
\node[draw, rectangle, right of=multimodal, xshift=3cm] (embedding) [block] {
\begin{tabular}{c}
$[1,4,9]$ \\
$ [1,4,7]$ \\
$ [12,0,3]$
\end{tabular}
};
\node[above] at (embedding.north) {Embedding Space} [block];
\draw[->] (multimodal) -- node[above] {Algorithm } (embedding);
\end{tikzpicture}
\caption{The process of embedding.}
\label{fig:embedding}
\end{figure}
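To make the figure concrete, here is a toy sketch of an embedding function. It assumes a hand-written lookup table (\texttt{EMBEDDING\_TABLE}) and a hypothetical helper (\texttt{embed\_tokens}) that averages token vectors. Real embeddings are learned by a model rather than typed in, and averaging is only one of several ways to combine vectors.
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos,
breaklines
]{python3}
from statistics import fmean

# Hypothetical lookup table; in practice these values are learned by a model.
EMBEDDING_TABLE = {
    "fly": [1.0, 4.0, 9.0],
    "bird": [4.0, 5.0, 6.0],
}


def embed_tokens(tokens):
    """Average the vectors of known tokens into one fixed-length vector."""
    vectors = [EMBEDDING_TABLE[t] for t in tokens if t in EMBEDDING_TABLE]
    if not vectors:
        raise ValueError("no known tokens to embed")
    # zip(*vectors) groups the values dimension by dimension.
    return [fmean(dimension) for dimension in zip(*vectors)]


one_word = embed_tokens(["bird"])
two_words = embed_tokens(["bird", "fly"])
print(one_word, two_words)
\end{minted}
Both calls return a vector of length three even though the inputs differ in length, which is the sense in which embedding turns variable-sized input into a fixed-size representation.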
We often talk about item embeddings being in $X$ dimensions, ranging anywhere from 100 to 1000, with diminishing returns in usefulness somewhere beyond 200-300 in the context of using them for machine learning problems\footnote{Embedding size is tunable as a hyperparameter but so far there have only been a \href{https://aclanthology.org/I17-2006.pdf}{few papers on optimal embedding size, with most of the size of embeddings set through magic and guesswork}.}. This means that each item (image, song, word, etc.) is represented by a vector of length $X$, where each value is a coordinate in an $X$-dimensional space.
We just made up an embedding for "bird", but let's take a look at what a real one for the word "hold" would look like in the following quote, as generated by the BERT deep learning model:
\begin{formal}"Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly." --- Langston Hughes
\end{formal}
We've highlighted this quote because we'll be working with this sentence as our input example throughout this text.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos,
breaklines
]{python3}
import torch
from transformers import BertTokenizer, BertModel
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."""
# tokenize() does not add BERT's special tokens, so mark the sentence
# boundaries explicitly before tokenizing.
marked_text = "[CLS] " + text + " [SEP]"
# Tokenize the sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
# Print out the tokens.
print(tokenized_text)
['[CLS]', 'hold', 'fast', 'to', 'dreams', ',', 'for', 'if', 'dreams', 'die', ',', 'life', 'is', 'a', 'broken', '-', 'winged', 'bird', 'that', 'cannot', 'fly', '.', '[SEP]']
# BERT code truncated to show the final output, an embedding
[tensor([-3.0241e-01, -1.5066e+00, -9.6222e-01, 1.7986e-01, -2.7384e+00,
-1.6749e-01, 7.4106e-01, 1.9655e+00, 4.9202e-01, -2.0871e+00,
-5.8469e-01, 1.5016e+00, 8.2666e-01, 8.7033e-01, 8.5101e-01,
5.5919e-01, -1.4336e+00, 2.4679e+00, 1.3920e+00, -3.9291e-01,
-1.2054e+00, 1.4637e+00, 1.9681e+00, 3.6572e-01, 3.1503e+00,
-4.4693e-01, -1.1637e+00, 2.8804e-01, -8.3749e-01, 1.5026e+00,
-2.1318e+00, 1.9633e+00, -4.5096e-01, -1.8215e+00, 3.2744e+00,
5.2591e-01, 1.0686e+00, 3.7893e-01, -1.0792e-01, 5.1342e-01,
-1.0443e+00, 1.7513e+00, 1.3895e-01, -6.6757e-01, -4.8434e-01,
-2.1621e+00, -1.5593e+01, 1.5249e+00, 1.6911e+00, -1.2916e+00,
1.2339e+00, -3.6064e-01, -9.6036e-01, 1.3226e+00, 1.6427e+00,
1.4588e+00, -1.8806e+00, 6.3620e-01, 1.1713e+00, 1.1050e+00, ...
2.1277e+00])
\end{minted}
\caption{Analyzing Embeddings with BERT. See full notebook \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_4_bert.ipynb}{source}}
\end{figure}
We can see that this embedding is a PyTorch tensor object, a multidimensional matrix containing multiple levels of embeddings. That's because BERT's embedding representation has 13 different layers: the output of the initial embedding layer plus one output for each of the 12 encoder layers of the base model. Each level represents a different view of our given \textbf{token} --- or simply a sequence of characters. We can get the final embedding by pooling several layers, details we'll get into as we work our way up to understanding embeddings generated using BERT.
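The pooling step can be sketched without loading a model. The example below uses random numbers in nested Python lists as a stand-in for BERT's hidden states, with shapes assumed from the example above (13 layers, 23 tokens, and a hidden size of 768 for the base model). Summing the last four layers is one pooling choice among several and is an assumption made here for illustration. The helper \texttt{pool\_last\_layers} is hypothetical.
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos,
breaklines
]{python3}
import random

# Assumed shapes: 13 layers, 23 tokens in the sentence, 768 hidden units.
NUM_LAYERS, NUM_TOKENS, HIDDEN_SIZE = 13, 23, 768

random.seed(0)
# Stand-in for BERT's hidden states, indexed as [layer][token][dimension].
hidden_states = [
    [[random.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
     for _ in range(NUM_TOKENS)]
    for _ in range(NUM_LAYERS)
]


def pool_last_layers(hidden_states, token_index, num_layers=4):
    """Sum one token's vectors across the last `num_layers` layers."""
    layers = [layer[token_index] for layer in hidden_states[-num_layers:]]
    # zip(*layers) lines up the same dimension across the selected layers.
    return [sum(values) for values in zip(*layers)]


# Index 1 is "hold" in the tokenized sentence (index 0 is [CLS]).
hold_vector = pool_last_layers(hidden_states, token_index=1)
print(len(hold_vector))
\end{minted}
Whatever pooling rule is chosen, the result is a single vector per token with the same length as the model's hidden size.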
When we create an embedding for a word, sentence, or image that represents the artifact in the multidimensional space, we can do any number of things with this embedding. For example, for tasks that focus on content understanding in machine learning, we are often interested in comparing two given items to see how similar they are. Projecting text as a vector allows us to do so with mathematical rigor and compare words in a shared embedding space.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\draw[->] (-1.5,0) -- (1.5,0) node[right] {x};
\draw[->] (0,-1.5) -- (0,1.5) node[above] {y};
\draw[lightblue,thick,->] (0,0) -- (0.5,1) node[anchor=south west] {bird};
\draw[green,thick,->] (0,0) -- (-1,-0.5)
node[anchor=south east] {dog};
\draw[blue,thick,->] (0,0) -- (2,1.5)
node[anchor=south east] {fly};
\end{tikzpicture}
\caption{Projecting words into a shared embedding space}
\end{figure}
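One way to compare vectors like these is cosine similarity, which measures the angle between two vectors and ignores their length. The sketch below is illustrative only: the coordinates are read off the figure above, and choosing cosine similarity as the measure is an assumption at this point in the text.
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos,
breaklines
]{python3}
import math

# Two-dimensional coordinates read off the figure above.
bird = (0.5, 1.0)
dog = (-1.0, -0.5)
fly = (2.0, 1.5)


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


bird_fly = cosine_similarity(bird, fly)
bird_dog = cosine_similarity(bird, dog)
print(round(bird_fly, 3), round(bird_dog, 3))
\end{minted}
With these toy coordinates, "bird" scores about 0.894 against "fly" and about -0.8 against "dog", so "bird" sits closer in direction to "fly" than to "dog".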
\begin{figure}[H]
\centering
\includegraphics[width=1\textwidth]{figures/web_service_ml.png}
\caption{Embeddings in the context of an application.}
\end{figure}
Engineering systems based on embeddings can be computationally expensive to build and maintain \citep{sharir2020cost}. The need to create, store, and manage embeddings has also recently resulted in the explosion of an entire ecosystem of related products. Examples include the recent rise in the development of vector databases to facilitate production-ready use of nearest neighbors semantic queries in machine learning systems\footnote{For a survey of the vector database space today, refer to \href{https://dmitry-kan.medium.com/landscape-of-vector-databases-d241b279f486}{this article}.}, and the rise of embeddings as a service\footnote{Embeddings now are a key differentiator in pricing between \href{https://openai.com/pricing}{on-demand ML services}.}.
As such, it's important to understand their context as end-consumers, as product management teams, and as developers who work with them. But in my deep-dive into the embeddings reference material, I found that there are two types of resources: very deeply technical academic papers, for people who are already NLP experts, and surface-level marketing spam blurbs for people looking to buy embeddings-based tech. The two do not overlap in what they cover.
In \emph{Thinking in Systems}, Donella Meadows writes, "You think that because you understand 'one' that you must therefore understand 'two' because one and one make two. But you forget that you must also understand 'and.'" \citep{meadows2008thinking} In order to understand the current state of embedding architectures and be able to decide how to build them, we must understand how they came to be. In building my own understanding, I wanted a resource that was technical enough to be useful to ML practitioners, but one that also put embeddings in their correct business and engineering contexts as they become more often used in ML architecture stacks. This is, hopefully, that text.
In this text, we'll examine embeddings from three perspectives, working our way from the highest level view to the most technical. We'll start with the business context, followed by the engineering implementation, and finally look at the machine learning theory, focusing on the nuts and bolts of how they work. On a parallel axis, we'll also travel through time, surveying the earliest approaches and moving towards modern embedding approaches.
In writing this text, I strove to balance the need to have precise technical and mathematical definitions for concepts and my desire to stay away from explanations that make people's eyes glaze over. I've defined all technical jargon when it appears for the first time to build context. I include code as a frame of reference for practitioners, but don't go as deep as a code tutorial would\footnote{In other words, I wanted to straddle the "explanation" and "reference" quadrants of \href{https://diataxis.fr/}{the Diátaxis framework}.}. So, it would be helpful for the reader to have some familiarity with programming and machine learning basics, particularly after the sections that discuss business context. But ultimately, the goal is to educate anyone who is willing to sit through this, regardless of level of technical understanding.
It's worth also mentioning what this text does not try to be: it does not try to explain the latest advancements in GPT and generative models, it does not try to explain transformers in their entirety, and it does not try to cover all of the exploding field of vector databases and semantic search. I've tried my best to keep it simple and focus on really understanding the core concept of embeddings.
\section{Recommendation as a business problem}
Let's step back and look at the larger context with a concrete example before diving into implementation details. Let's build a social media network, \textbf{Flutter}, the premier social network for all things with wings. Flutter is a web and mobile app where birds can post short snippets of text, videos, images, and sounds, to let other birds, insects, and bats in the area know what’s up. Its business model is based on targeted advertising, and its app architecture includes a "home" feed based on birds that you follow, made up of small pieces of multimedia content called \textbf{“flits”}, which can be either text, videos, or photos. The home feed itself is, by default, in reverse chronological order and is curated by the user. But we also would like to offer personalized, recommended flits so that the user finds interesting content on our platform that they might not have known about before.
\begin{figure}[H]
\centering
\includegraphics[width=.5\textwidth]{figures/timeline.png}
\caption{Flutter's content timeline in a social feed with a blend of organic followed content, advertising, and recommendations.}
\end{figure}
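The default, non-personalized feed needs no machine learning at all. As a minimal sketch, assuming a hypothetical \texttt{Flit} record and a \texttt{home\_feed} helper (neither comes from a real library), it is a filter on followed accounts followed by a sort on timestamp:
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos,
breaklines
]{python3}
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Flit:
    """Hypothetical record for one post; the field names are illustrative."""
    author: str
    text: str
    posted_at: datetime


def home_feed(flits, following):
    """Return flits from followed accounts, newest first."""
    followed = [flit for flit in flits if flit.author in following]
    return sorted(followed, key=lambda flit: flit.posted_at, reverse=True)


flits = [
    Flit("owl", "Night shift again.", datetime(2023, 5, 1, 23, 0)),
    Flit("bat", "Echo echo.", datetime(2023, 5, 2, 1, 30)),
    Flit("wasp", "Buzz.", datetime(2023, 5, 2, 9, 0)),
]
feed = home_feed(flits, following={"owl", "bat"})
print([flit.author for flit in feed])
\end{minted}
The flit from "wasp" never appears because the user does not follow that account, which is exactly the discovery gap that recommended flits are meant to fill.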
How do we solve the problem of what to show in the timeline here so that our users find the content relevant and interesting, and balance the needs of our advertisers and business partners?
In many cases, we can approach engineering solutions without involving machine learning. In fact, we should definitely start without it \citep{zinkevich2017rules} because machine learning adds a tremendous amount of complexity to our working application \citep{sculley2014machine}. In the case of the Flutter home feed, though, machine learning forms a business-critical part of the product offering. From the business product perspective, the objective is to offer Flutter’s users content that is relevant\footnote{The specific definition of a relevant item in the recommendations space varies and is under intense academic and industry debate, but generally it means an item that is of interest to the user.}, interesting, and novel so they continue to use the platform. If we do not build discovery and personalization into our content-centric product, Flutter users will not be able to discover more content to consume and will disengage from the platform.
This is the case for many content-based businesses, all of which have feed-like surface areas for recommendations, including Netflix, Pinterest, Spotify, and Reddit. It also covers e-commerce platforms, which must surface relevant items to the user, and information retrieval platforms like search engines, which must provide relevant answers to users upon keyword queries. There is a new category of hybrid applications involving question-and-answering in semantic search contexts that is arising as a result of work around the GPT series of models, but for the sake of simplicity, and because that landscape changes every week, we'll stick to understanding the fundamental underlying concepts.
In subscription-based platforms\footnote{In ad-based services, the line between retention and revenue is a bit murkier, and we often have what's known as a multi-stakeholder problem, where the actual optimized function is a balance between meeting the needs of the user and meeting the needs of the advertiser \citep{zheng2017multi}. In real life, this can often result in a process of enshittification \citep{doctorow_2023} of the platform that leads to extremely suboptimal end-user experiences. So, when we create Flutter, we have to be very careful to balance these concerns, and we'll also assume for the sake of simplification that Flutter is a Good service that loves us as users and wants us to be happy.}, there is a clear business objective that's tied directly to the bottom line, as outlined in this 2021 paper \citep{steck2021deep} about Netflix's recsys:
\begin{quote}
The main task of our recommender system at Netflix is to help our members discover content that they will watch and enjoy to maximize their long-term satisfaction. This is a challenging problem for many reasons, including that every person is unique, has a multitude of interests that can vary in different contexts, and needs a recommender system most when they are not sure what they want to watch. Doing this well means that each member gets a unique experience that allows them to get the most out of Netflix. As a monthly subscription service, member satisfaction is tightly coupled to a person’s likelihood to retain with our service, which directly impacts our revenue.
\end{quote}
Knowing this business context, and given that personalized content is more relevant and generally gets higher rates of engagement \citep{jannach2010recommender} than non-personalized forms of recommendation on online platforms,\footnote{For more, see \href{http://www.recommenderbook.net/media/Recommender_Systems_An_Introduction_Chapter08_Case_study.pdf}{this case study} on personalized recommendations as well as \href{https://www.arxiv-vanity.com/papers/1906.03109/}{the intro section of this paper} which covers many personalization use-cases.} how and why might we use embeddings in machine learning workflows in Flutter to show users flits that are interesting to them personally? We need to first understand how web apps work and where embeddings fit into them.
\subsection{Building a web app}
Most of the apps we use today --- Spotify, Gmail, Reddit, Slack, and Flutter --- are all designed based on the same foundational software engineering patterns. They are all apps available on web and mobile clients. They all have a front-end where the user interacts with the various \textbf{product features} of the applications, an API that connects the front-end to back-end elements, and a database that processes data and remembers state.
\begin{formal}
As an important note, \textbf{features} have many different definitions in machine learning and engineering. In this specific case, we mean collections of code that make up some front-end element, such as a button or a panel of recommendations. We'll refer to these as \textbf{product features}, in contrast with \textbf{machine learning features}, which are input data into machine learning models.
\end{formal}
This application architecture is commonly known as the \textbf{model-view-controller} pattern \citep{fowler2012patterns}, or in common industry lingo, a \textbf{CRUD} app, named for the basic operations that its API allows to manage application state: create, read, update, and delete.
\begin{figure}[H]
\centering
\includegraphics[width=.9\textwidth]{figures/web_service.png}
\caption{Typical CRUD web app architecture}
\end{figure}
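As a minimal sketch of the four CRUD operations, assume a hypothetical in-memory \mintinline{python}{FlitStore} class standing in for the database. The class and its method names are illustrative assumptions, not part of any real Flutter API.

\begin{minted}{python}
from typing import Dict, Optional


class FlitStore:
    """Hypothetical in-memory stand-in for the app's database."""

    def __init__(self) -> None:
        self._flits: Dict[int, str] = {}
        self._next_id = 1

    def create(self, text: str) -> int:
        """Store a new flit and return its id."""
        flit_id = self._next_id
        self._flits[flit_id] = text
        self._next_id += 1
        return flit_id

    def read(self, flit_id: int) -> Optional[str]:
        """Return the flit text, or None if it does not exist."""
        return self._flits.get(flit_id)

    def update(self, flit_id: int, text: str) -> bool:
        """Replace the text of an existing flit; report whether it existed."""
        if flit_id not in self._flits:
            return False
        self._flits[flit_id] = text
        return True

    def delete(self, flit_id: int) -> bool:
        """Remove a flit; report whether it existed."""
        return self._flits.pop(flit_id, None) is not None


store = FlitStore()
flit_id = store.create("hello, birds")
store.update(flit_id, "hello, flock")
current_text = store.read(flit_id)
store.delete(flit_id)
\end{minted}

In a real application, each method would sit behind an API endpoint, and the dictionary would be replaced by a persistent database.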
When we think of structural components in the architectures of these applications, we might think first in terms of product features. In an application like Slack, for example, we have the ability to post and read messages, manage notifications, and add custom emojis. Each of these can be seen as an application feature. In order to create features, we have to combine common elements like databases, caches, and web services. All of this happens as the client talks to the API, which talks to the database to process data. At a more granular, program-specific level, we might think of foundational data structures like arrays or hash maps, and lower still, we might think about memory management and network topologies. These are all foundational elements of modern programming.
At the feature level, though, we see that an application includes not only the typical CRUD operations, such as the ability to post and read Slack messages, but also elements that go beyond operations that alter database state. Some features, such as \href{https://slack.engineering/personalized-channel-recommendations-in-slack/}{personalized channel suggestions}, \href{https://slack.engineering/search-at-slack/}{returning relevant results through search queries}, and \href{https://slack.engineering/email-classification/}{predicting Slack connection invites}, necessitate the use of machine learning.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{figures/web_service_ml.png}
\caption{CRUD App with Machine learning service}
\end{figure}
\subsection{Rules-based systems versus machine learning}
To understand where embeddings fit into these systems, it first makes sense to understand where machine learning fits in at Flutter, or any given company, as a whole. In a typical consumer company, the user-facing app is made up of product features written in code, typically written as services or parts of services. To add a new web app feature, we write code based on a set of business logic requirements. This code acts on data in the app to develop our new feature.
In a typical data-centric software development lifecycle, we start with the business logic. For example, let's take the ability to post messages. We'd like users to be able to input text and emojis in their language of choice, have the messages sorted chronologically, and render correctly on web and mobile. These are the business requirements. We use the input data, in this case, user messages, and format them correctly and sort chronologically, at low latency, in the UI.
\begin{figure}[H]
\centering
\includegraphics[width=.75\textwidth]{figures/app_flow.png}
\caption{A typical application development lifecycle}
\end{figure}
Machine learning-based systems are typically also services in the backend of web applications. They are integrated into production workflows, but they process data very differently. In these systems, we don't start with business logic. We start with input data that we use to build a model that will suggest the business logic for us. For more on the specifics of how to think about these data-centric engineering systems, see Kleppmann \citep{kleppmann2017designing}.
This requires thinking about application development slightly differently: when we write an application that includes machine learning models, we’re inverting the traditional app lifecycle. What we have instead is data plus our desired outcome. The data is combined into a model, and it is this model that generates the business logic that builds features.
\begin{figure}[H]
\centering
\includegraphics[width=.95\textwidth]{figures/ml-flow.png}
\caption{ML Development lifecycle}
\end{figure}
In short, the difference between programming and machine learning development is that we are not generating answers through business rules, but business rules through data. These rules are then re-incorporated into the application.
\begin{figure}[H]
\centering
\includegraphics[width=.75\textwidth]{figures/rules_ml.png}
\caption{Generating answers via machine learning. The top chart shows a classical programming approach with rules and data as inputs, while the bottom chart shows a machine learning approach with data and answers as inputs. \citep{chollet2021deep}}
\end{figure}
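The contrast can be sketched in a few lines. The example below is a toy under stated assumptions: the notion of an ``active'' bird, the hand-picked threshold of 10 posts, and the labeled examples are all hypothetical. In the first function a person writes the rule; in the second, the rule (a threshold) is derived from data plus answers.

\begin{minted}{python}
from typing import List, Tuple


def rule_based_is_active(posts: int) -> bool:
    """Classical programming: a human writes the rule (threshold chosen by hand)."""
    return posts >= 10


def learn_threshold(examples: List[Tuple[int, bool]]) -> int:
    """Machine learning in miniature: search for the threshold that
    reproduces the most labeled answers in the data."""
    candidates = sorted({posts for posts, _ in examples})
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((posts >= threshold) == label for posts, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Data plus answers (labels) go in; the rule (a threshold) comes out.
labeled = [(0, False), (2, False), (3, False), (5, True), (57, True), (120, True)]
learned = learn_threshold(labeled)


def learned_is_active(posts: int) -> bool:
    """Apply the learned rule inside the application."""
    return posts >= learned
\end{minted}

The two functions have the same shape once deployed; what differs is where the threshold came from, and that the learned one changes whenever the data does.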
As an example, with Slack, for the channel recommendations product feature, we are not hard-coding a list of channels that need to be called from the organization's API. We are feeding in data about the organization's users (what other channels they've joined, how long they've been users, what channels the people they've interacted with the most in Slack are in), and building a model on that data that recommends a non-deterministic, personalized list of channels for each user that we then surface through the UI.
\begin{figure}[H]
\centering
\includegraphics[width=.95\textwidth]{figures/mlnotml.png}
\caption{Traditional versus ML architecture and infra}
\end{figure}
\subsection{Building a web app with machine learning}
All machine learning systems can be examined through how they accomplish the four steps listed below. When we build models, our key questions should be, "what kind of input do we have and how is it formatted", and "what do we get as a result." We'll be asking this for each of the approaches we look at. When we build a machine learning system, we start by processing data and finish by serving a learned model artifact.
The four components of a machine learning system are\footnote{There are infinitely many layers of horror in ML systems \citep{kreuzberger2022machine}. These are still the foundational components.}:
\begin{formal}
\begin{itemize}
\item \textbf{Input data} - processing data from a database or streaming from a production application for use in modeling
\item \textbf{Feature Engineering and Selection} - The process of examining the data and cleaning it to pick features. In this case, we mean features as attributes of any given element that we use as inputs into machine learning. Examples of features are: user name, geographic location, how many times they've clicked on a button for the past 5 days, and revenue. This piece always takes the longest in any given machine learning system, and is also known as finding \textbf{representations} \citep{bengio2013representation} of the data that best fit the machine learning algorithm. This is where, in the new model architectures, we use embeddings as input.
\item \textbf{Model Building} - We select the features that are important and train our model, iterating on different performance metrics over and over again until we have an acceptable model we can use. Embeddings are also the output of this step that we can use in other, downstream steps.
\item \textbf{Model Serving} - Now that we have a model we like, we serve it to production, where it hits a web service, potentially a cache, and our API, from which it then propagates to the front-end for the user to consume as part of our web app.
\end{itemize}
\end{formal}
\begin{figure}[H]
\centering
\includegraphics[width=1\textwidth]{figures/ml_system.png}
\caption{CRUD app with ML}
\end{figure}
Within machine learning, there are many approaches we can use to fit different tasks. Machine learning workflows that are most effective are formulated as solutions to both a specific business need and a machine learning \textbf{task}. Tasks can best be thought of as approaches to modeling within the categorized solution space. For example, learning a regression model is a specific case of a task. Others include clustering, machine translation, anomaly detection, similarity matching, or semantic search. There are three highest-level types of ML tasks. The first is \textbf{supervised} learning, where we have training data that can tell us whether the results the model predicted are correct according to some model of the world. The second is \textbf{unsupervised} learning, where there is not a single ground-truth answer. An example here is clustering of our customer base: a clustering model can detect patterns in our data but won't explicitly label what those patterns are. The third is \textbf{reinforcement learning}, which is separate from these two categories and formulated as a sequential decision-making problem: we have an agent moving through an environment, and we'd like to understand how to move it optimally through that environment using explore-exploit techniques. We'll focus on supervised learning, with a look at unsupervised learning with PCA and Word2Vec.
\begin{figure}[H]
\centering
\resizebox{.8\textwidth}{!}{%
\begin{forest}
for tree={
align=center,
parent anchor=south,
child anchor=north,
font=\sffamily,
edge={->,thick},
l sep+=20pt,
}
[Machine Learning
[Supervised Machine Learning
[SVM]
[Regression]
[{Neural\\Networks}]
]
[{Unsupervised\\Machine\\Learning}, align=center
[Clustering]
[Dimensionality\\Reduction
[PCA]
]
]
[{Reinforcement\\Learning}]
]
\end{forest}}
\caption{Machine learning task solution space and model families}
\end{figure}
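To make the difference between the first two branches concrete, here is a small sketch on made-up data. The \mintinline{python}{likes} values and \mintinline{python}{churned} labels are hypothetical, and the midpoint threshold and the hand-rolled two-cluster k-means are deliberately simple stand-ins for real supervised and unsupervised models.

\begin{minted}{python}
import numpy as np

# Hypothetical feature: number of posts liked by each bird.
likes = np.array([4.0, 5.0, 6.0, 70.0, 110.0, 120.0])

# Supervised: labels (ground truth) are available, so predictions can be checked.
churned = np.array([1, 1, 1, 0, 0, 0])
# Threshold halfway between the two class means, learned from the labels.
threshold = (likes[churned == 1].mean() + likes[churned == 0].mean()) / 2
predicted = (likes < threshold).astype(int)
accuracy = float((predicted == churned).mean())

# Unsupervised: no labels; a tiny one-dimensional k-means with k=2
# groups the birds but does not say what the groups mean.
centers = np.array([likes.min(), likes.max()])
for _ in range(10):
    # Assign each bird to its nearest center, then move each center
    # to the mean of the birds assigned to it.
    assignment = np.abs(likes[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([likes[assignment == k].mean() for k in range(2)])
\end{minted}

The supervised half can report an accuracy because it has answers to compare against. The unsupervised half only returns group memberships, and interpreting those groups is left to us.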
\subsection{Formulating a machine learning problem}
As we saw in the last section, machine learning is a process that takes data as input to produce rules for how we should classify something or filter it or recommend it, depending on the task at hand. In any of these cases, for example, to generate a set of potential candidates, we need to construct a \textbf{model}.
A machine learning model is a set of instructions for generating a given output from data. The instructions are learned from the features of the input data itself. For Flutter, an example of a model we'd like to build is a \textbf{candidate generator} that picks flits similar to flits our birds have already liked, because we think users will like those, too. For the sake of building up the intuition for a machine learning workflow, let's pick a super-simple example that is not related to our business problem, linear regression, which gives us a continuous variable as output in response.
For example, let's say, given the number of posts a user has made and how many posts they've liked, we'd like to predict how many days they're likely to continue to stay on Flutter. For traditional \textbf{supervised} modeling approaches using tabular data, we start with our input data, or a \textbf{corpus} as it's generally known in machine learning problems that deal with text in the field known as \textbf{NLP} (natural language processing).
We're not doing NLP yet, though, so our input data may look something like this, where we have a UID (userid) and some attributes of that user, such as the number of times they've posted and number of posts they've liked. These are our machine learning \textbf{features}.
\begin{table}[H]
\centering
\caption{Tabular Input Data for Flutter Users}
\begin{tabular}{|l|l|l|}
\hline
\rowcolor[HTML]{D5E7F7}
bird\_id & bird\_posts & bird\_likes \\ \hline
012 & 2 & 5 \\ \hline
013 & 0 & 4 \\ \hline
056 & 57 & 70 \\ \hline
612 & 0 & 120 \\ \hline
\end{tabular}
\end{table}
We'll need part of this data to train our model, part of it to test the accuracy of the model we've trained, and part to tune meta-aspects of our model. These are known as \textbf{hyperparameters}.
We take two parts of this data as holdout data that we don't feed into the model. The first part, the \textbf{test set}, we use to validate the final model on data it's never seen before. We use the second split, called the \textbf{validation set}, to check our hyperparameters during the model training phase. In the case of linear regression, there are no true hyperparameters, but we'll need to keep in mind that we will need to tune the model's metadata for more complicated models.
Let's assume we have 100 of these values. A commonly accepted split is to use 80\% of the data for training and 20\% for testing. The reasoning is that we want our model to have access to as much data as possible so it learns a more accurate representation.
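A minimal sketch of such a split, assuming 100 rows of randomly generated stand-in data with two columns (the values are synthetic, not real Flutter data):

\begin{minted}{python}
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible

n_rows = 100
# Hypothetical feature matrix: columns are bird_posts and bird_likes.
X = rng.integers(low=0, high=150, size=(n_rows, 2))

# Shuffle row indices before splitting so the hold-out set is a random sample.
indices = rng.permutation(n_rows)
n_train = n_rows * 80 // 100  # 80 rows for training, the rest held out

train_idx, test_idx = indices[:n_train], indices[n_train:]
X_train, X_test = X[train_idx], X[test_idx]
\end{minted}

Shuffling first matters when the rows arrive in a meaningful order, for example sorted by sign-up date. A validation set can be carved out of the training rows in the same way.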
In general, our goal is to feed our input into the model, through a function that we pick, and get some predicted output, $f(X) \rightarrow y$.
\begin{figure}[H]
\includegraphics[width=.8\linewidth]{figures/function.png}
\caption{How inputs map to outputs in ML functions \citep{klein2013coding}}
\end{figure}
For our simple dataset, we can use the linear regression equation:
\begin{equation}
y = x_1\beta_1 + x_2\beta_2 + \varepsilon
\end{equation}
This tells us that the output, $y$, can be predicted by two input variables, $x_1$ (bird posts) and $x_2$ (bird likes) with their given weights, $\beta_1$ and $\beta_2$, plus an error term $\varepsilon$, or the distance between each data point and the regression line generated by the equation. Our task is to find the smallest sum of squared differences between each point and the line, in other words to minimize the error, because it will mean that, at each point, our predicted y is as close to our actual y as we can get it, given the other points.
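As a sketch of what minimizing the sum of squared differences looks like in practice, the example below fits the two weights with NumPy's least-squares solver. The target values are synthetic, generated from weights we chose ourselves (\mintinline{python}{true_beta}) purely so that we can check what the solver recovers. Like the equation above, the model has no intercept term.

\begin{minted}{python}
import numpy as np

# Hypothetical training rows: columns are bird_posts (x1) and bird_likes (x2).
X = np.array([[2.0, 5.0], [0.0, 4.0], [57.0, 70.0], [0.0, 120.0]])
# Synthetic target (days on the platform), generated from known weights
# so that we can check what the solver recovers.
true_beta = np.array([0.5, 0.25])
y = X @ true_beta

# Least squares picks the weights that minimize the sum of squared residuals.
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
\end{minted}

Because the synthetic target contains no noise, the recovered weights match the ones used to generate it and the residuals are numerically zero. With real data, the residuals are the error term $\varepsilon$.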
The heart of machine learning is this training phase, which is the process of finding a combination of model instructions and data that accurately represent our real data, which, in supervised learning, we can validate by checking the correct "answers" from the test set.
\begin{figure}[H]
\includegraphics[width=\linewidth]{figures/model_cycle.png}
\caption{The cycle of machine learning model development}
\end{figure}
As the first round of training starts, we have our data. We \textbf{train} (or build) our model by initializing it with a set of inputs, $X$. These are from the training data. $\beta_1$ and $\beta_2$ are either initialized by setting to zero or initialized randomly (depending on the model, different approaches work best), and we calculate $\hat{y}$, our predicted value for the model. $\varepsilon$ is derived from the data and the estimated coefficients once we get an output.
\begin{equation}
y = 2\beta_1 + 5\beta_2 + \varepsilon
\end{equation}
How do we know our model is good? We initialize it with some set of values, weights, and we iterate on those weights, usually by minimizing a \textbf{cost function}. The cost function is a function that models the difference between our model's predicted value and the actual output for the training data. The first output may not be optimal, so we iterate over the model space many times, optimizing for the specific metric that will make the model as representative of reality as possible and minimize the difference between the actual and predicted values. So in our case, we compare $\hat{y}$ to $y$. The average squared difference between an observation’s actual and predicted values is the cost, otherwise known as \textbf{MSE}, or mean squared error.
\begin{equation}
MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\end{equation}
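A minimal sketch of this calculation in plain Python follows. The function name \mintinline{python}{mean_squared_error} and the example values are our own illustrative choices.

\begin{minted}{python}
from typing import Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: two predictions are exact, one is off by 3.
cost = mean_squared_error([10.0, 20.0, 30.0], [10.0, 20.0, 33.0])
\end{minted}

Here the single miss contributes $3^2 = 9$, and averaging over three observations gives a cost of 3. Squaring means that large misses are penalized far more heavily than small ones.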
We'd like to minimize this cost, and we do so with \textbf{gradient descent}. When we say that the model \textbf{learns}, we mean that we find the right weights for the model through an iterative process: we feed the model data, evaluate the output, and adjust the weights in the direction that reduces the cost. We'll know the process is working because our loss should generally decrease from one training iteration to the next.
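The loop below is a sketch of batch gradient descent for the two-weight linear model, under several assumptions of our own: the features are small, already-scaled made-up numbers, the target is synthetic (generated from \mintinline{python}{true_beta}), and the learning rate of 0.1 and the 2000 iterations are hand-picked for this toy data.

\begin{minted}{python}
import numpy as np

# Hypothetical, already-scaled features (posts, likes) and a synthetic target.
X = np.array([[0.2, 0.5], [0.0, 0.4], [0.6, 0.7], [0.0, 1.2], [1.0, 0.1]])
true_beta = np.array([3.0, 1.5])
y = X @ true_beta

beta = np.zeros(2)  # initialize the weights at zero
learning_rate = 0.1
losses = []

for _ in range(2000):
    error = X @ beta - y                       # predicted minus actual
    losses.append(float(np.mean(error ** 2)))  # MSE for this iteration
    gradient = 2.0 / len(y) * X.T @ error      # derivative of MSE w.r.t. beta
    beta = beta - learning_rate * gradient     # step against the gradient
\end{minted}

Plotting \mintinline{python}{losses} would show the cost shrinking toward zero as \mintinline{python}{beta} approaches the weights that generated the data. With unscaled features or a learning rate that is too large, the same loop can diverge, which is one reason feature scaling and learning-rate choice get so much attention in practice.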
We have finally trained our model. Now, we test the model's predictions on the 20 values that we've kept as a hold-out set; i.e., the model has not seen these before, so we can be confident that they did not influence training. Because the output is a continuous value, we measure how close the model's predictions on the hold-out set are to the actual values, for example by computing the MSE again, rather than counting exact matches.
\subsubsection{The Task of Recommendations}
We just saw a simple example of machine learning as it relates to predicting continuous response variables. When our business question is, "What would be good content to show our users," we are facing the machine learning task of recommendation. Recommender systems are systems set up for \textbf{information retrieval}, a field closely related to NLP that's focused on finding relevant information in large collections of documents. The goal of information retrieval is to synthesize large collections of unstructured text documents. Within information retrieval, there are two complementary solutions in how we can offer users the correct content in our app: search, and recommendations.
\begin{formal}
\textbf{Search} is the problem of directed \citep{ekstrand2019recommender} information seeking, i.e. the user offers the system a specific query and would like a set of refined results. Search engines at this point are a well-established traditional solution in the space.
\textbf{Recommendation} is a problem where "man is the query." \citep{seaver2022computing} Here, we don't know what the person is looking for exactly, but we would like to infer what they like, and recommend items based on their learned tastes and preferences.
\end{formal}
The first industrial recommender systems were created to filter messages in email and newsgroups \citep{goldberg1992using} at the Xerox Palo Alto Research Center based on a growing need to filter incoming information from the web. The most common recommender systems today are those at Netflix, YouTube, and other large-scale platforms that need a way to surface relevant content to users.
The goal of recommender systems is to surface items that are relevant to the user. Within the framework of machine learning approaches for recommendation, the main machine learning task is to determine which items to show to a user in a given situation \citep{castells2023recommender}. There are several common ways to approach the recommendation problem.
\begin{formal}
\begin{itemize}
\item \textbf{Collaborative filtering} - The most common approach for creating recommendations is to formulate our data as a problem of finding missing user-item interactions in a given set of user-item interaction history. We start by collecting either explicit (ratings) data or \textbf{implicit} user interaction data like clicks, pageviews, or time spent on items. The simplest collaborative filtering approaches are \textbf{neighborhood models}, where ratings are predicted initially by finding users similar to our given target user. We use similarity functions to compute the closeness of users.
Another common approach is using methods such as \textbf{matrix factorization}, the process of representing users and items in a feature matrix made up of low-dimensional factor vectors, which in our case, are also known as embeddings, and learning those feature vectors through the process of minimizing a cost function. This process can be thought of as similar to Word2Vec \citep{levy2014neural}, a deep learning model which we'll discuss in depth in this document. There are many different approaches to collaborative filtering, including matrix factorization and \textbf{factorization machines}.
\item \textbf{Content filtering} - This approach uses metadata available about our items (for example in movies or music, the title, year released, genre, and so on) as initial or additional features input into models, and works well when we don't have much information about user activity, although it is often used in combination with collaborative filtering approaches. Many embeddings architectures fall into this category since they help us model the textual features for our items.
\item \textbf{Learn to Rank} - Learn to rank methods focus on ranking items in relation to each other based on a known set of preferred rankings and the error is the number of cases when pairs or lists of items are ranked incorrectly. Here, the problem is not presenting a single item, but a set of items and how they interplay. This step normally takes place after candidate generation, in a filtering step, because it's computationally expensive to rank extremely large lists.
\item \textbf{Neural Recommendations} - The process of using neural networks to capture the same relationships that matrix factorization does without explicitly having to create a user/item matrix and based on the shape of the input data. This is where deep learning networks, and recently, large language models, come into play. Examples of deep learning architectures used for recommendation include Word2Vec and BERT, which we'll cover in this document, and convolutional and recurrent neural networks for sequential recommendation (such as is found in music playlists, for example). Deep learning allows us to better model content-based recommendations and gives us representations of our items in an embedding space \citep{zhang2019deep}.
\end{itemize}
\end{formal}
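As an illustration of the neighborhood approach to collaborative filtering described above, the sketch below scores flits for one bird using cosine similarity between users. The interaction matrix is made up, and both the similarity function and the similarity-weighted scoring rule are one simple choice among many.

\begin{minted}{python}
import numpy as np

# Hypothetical implicit-feedback matrix: rows are birds, columns are flits,
# 1 means the bird liked the flit and 0 means no recorded interaction.
interactions = np.array([
    [1, 1, 0, 0],   # target bird
    [1, 1, 1, 0],   # similar tastes, also liked flit 2
    [0, 0, 0, 1],   # no overlap with the target bird
], dtype=float)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0


target = 0
similarities = np.array([
    cosine_similarity(interactions[target], interactions[other])
    for other in range(len(interactions))
])
similarities[target] = 0.0  # a bird should not vote for itself

# Score each flit by the similarity-weighted likes of the other birds,
# then hide flits the target bird has already interacted with.
scores = similarities @ interactions
scores[interactions[target] > 0] = -np.inf
recommended_flit = int(np.argmax(scores))
\end{minted}

The bird with overlapping likes dominates the score, so its one unseen flit (index 2) is recommended. The bird with no overlap has similarity zero and contributes nothing.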
Recommender systems have evolved their own unique architectures\footnote{For a good survey on the similarities and differences between search and recommendations, read \href{https://eugeneyan.com/writing/system-design-for-discovery/}{this great post on system design}.}, and they usually consist of a four-stage system made up of several machine learning models, each of which performs a different machine learning task.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{figures/recsys.png}
\caption{Recommender systems as a machine learning problem}
\end{figure}
\begin{formal}
\begin{itemize}
\item \textbf{Candidate Generation} - First, we ingest data from the web app. This data goes into the initial piece, which hosts our first-pass model generating \textbf{candidate recommendations}. This is where collaborative filtering takes place, and we whittle our list of potential candidates down from millions to thousands or hundreds.
\item \textbf{Filtering} - Once we have a generated list of candidates, we want to continue to filter them using business logic (for example, we don't want to show NSFW content or items that are not on sale). This is generally a heavily heuristic-based step.
\item \textbf{Ranking} - Next, we need a way to order the filtered list of recommendations based on what we think the user will prefer the most, so this stage is \textbf{ranking}. We then serve the ranked items out in the timeline or the ML product interface we're working with.
\item \textbf{Retrieval} - This is the piece where the web application usually hits a model endpoint to get the final list of items served to the user through the product UI.
\end{itemize}
\end{formal}
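A toy sketch of how the stages might chain together is shown below, with filtering applied before ranking so that only eligible items are ordered. Everything here is hypothetical: the \mintinline{python}{CATALOG} dictionary, its precomputed scores (which in a real system would come from models), and the function names.

\begin{minted}{python}
from typing import Dict, List

# Hypothetical catalog: flit id -> metadata, including a precomputed relevance score.
CATALOG: Dict[int, Dict] = {
    1: {"score": 0.9, "nsfw": False},
    2: {"score": 0.8, "nsfw": True},
    3: {"score": 0.4, "nsfw": False},
    4: {"score": 0.7, "nsfw": False},
    5: {"score": 0.1, "nsfw": False},
}


def generate_candidates(min_score: float = 0.3) -> List[int]:
    """Candidate generation: cheaply narrow the catalog to plausible items."""
    return [fid for fid, meta in CATALOG.items() if meta["score"] >= min_score]


def filter_candidates(candidates: List[int]) -> List[int]:
    """Filtering: apply business rules, here dropping NSFW flits."""
    return [fid for fid in candidates if not CATALOG[fid]["nsfw"]]


def rank(candidates: List[int]) -> List[int]:
    """Ranking: order the remaining flits from most to least relevant."""
    return sorted(candidates, key=lambda fid: CATALOG[fid]["score"], reverse=True)


def retrieve(top_k: int = 2) -> List[int]:
    """Retrieval: what a hypothetical model endpoint would return to the app."""
    return rank(filter_candidates(generate_candidates()))[:top_k]


feed = retrieve()
\end{minted}

Each stage shrinks or reorders the list handed to it, which is why the expensive ranking step only ever sees a small fraction of the catalog.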
Databases have become the fundamental tool in building backend infrastructure that performs data lookups. Embeddings have become similar building blocks in the creation of many modern search and recommendation product architectures. Embeddings are a type of \textbf{machine learning feature} (that is, model input data) that plays two roles: they are inputs at the feature engineering stage, and they are also the first set of results produced by our candidate generation stage, which are then incorporated into the downstream steps of ranking and retrieval to produce the final items the user sees.
\subsubsection{Machine learning features}
Now that we have a high-level conceptual view of how machine learning and recommender systems work, let's build towards a candidate generation model that will offer relevant flits.
Let's start by modeling a traditional machine learning problem and contrast it with our NLP problem. For example, let's say that one of our business problems is predicting whether a bird is likely to continue to stay on Flutter or to churn\footnote{An extremely common business problem to solve in almost every industry where either the size of the customer population or subscription-based revenue is important.}, that is, to disengage and leave the platform.
When we predict churn, we have a given set of machine learning feature inputs for each user and a final binary output of 1 or 0 from the model, 1 if the bird is likely to churn, or 0 if the user is likely to stay on the platform.
We might have the following inputs:
\begin{itemize}
\item How many posts the bird has clicked through in the past month (we'll call this \mintinline{python}{bird_posts} in our input data)
\item The geographical location of the bird from the browser headers (\mintinline{python}{bird_geo})
\item How many posts the bird has liked over the past month (\mintinline{python}{bird_likes})
\end{itemize}
\begin{table}[H]
\centering
\caption{Tabular Input Data for Flutter Users}
\begin{tabular}{|l|l|l|l|}
\hline
\rowcolor[HTML]{D5E7F7}
bird\_id & bird\_posts & bird\_geo & bird\_likes \\ \hline
012 & 2 & US & 5 \\ \hline
013 & 0 & UK & 4 \\ \hline
056 & 57 & NZ & 70 \\ \hline
612 & 0 & UK & 120 \\ \hline
\end{tabular}
\end{table}
We start by selecting our model features and arranging them in tabular format. We can formulate this data as a table (which, if we look closely, is also a matrix) based on rows of the bird id and our bird features.
Tabular data is any structured data. For example, for a given Flutter user we have their user id, how many posts they've liked, how old the account is, and so on. This approach works well for what we consider traditional machine learning approaches which deal with tabular data. As a general rule, the creation of the correct formulation of input data is perhaps the heart of machine learning: if we have bad input, we will get bad output. So in all cases, we want to spend our time putting together our input dataset and engineering features very carefully.
These are all discrete features that we can feed into our model and learn weights from, which is fairly easy as long as the features are numerical. But something important to note here is that, in our bird interaction data, we have both numerical and textual features (bird geography). So what do we do with these textual features? How do we compare "US" to "UK"?
The process of formatting data correctly to feed into a model is called \textbf{feature engineering}. When we have a single continuous, numerical feature, like “the age of the flit in days”, it’s easy to feed these features into a model. But, when we have textual data, we need to turn it into numerical representations so that we can compare these representations.
\subsection{Numerical Feature Vectors}
Within the context of working with text in machine learning, we represent features as numerical vectors. We can think of each row in our tabular feature data as a vector, and a collection of these rows, our tabular representation, as a matrix. For example, in the vector for our first user, \mintinline{python}{[012, 2, 'US', 5]}, we can see that this particular user is represented by four features. When we create vectors, we can run mathematical computations over them and use them as inputs into ML models in the numerical form we require.
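As a small sketch using NumPy (our choice of library here), the numeric columns of the table can be stored as a matrix in which each row is one bird's feature vector. The geography column is left out because it is still text at this point.

\begin{minted}{python}
import numpy as np

# Numeric columns from the table above: bird_posts and bird_likes.
# bird_geo is left out here because it is still text, not a number.
feature_matrix = np.array([
    [2, 5],
    [0, 4],
    [57, 70],
    [0, 120],
])

first_bird = feature_matrix[0]              # one row is one feature vector
n_birds, n_features = feature_matrix.shape  # 4 birds, 2 features each
column_means = feature_matrix.mean(axis=0)  # a computation over the whole matrix
\end{minted}

Once the data is in this form, operations such as averaging a column or multiplying by a weight vector apply to every bird at once.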
Mathematically, vectors are collections of coordinates that tell us where a given point is in space among many dimensions. For example, in two dimensions, we have a point $[2,5]$, representing \mintinline{python}{bird_posts} and \mintinline{python}{bird_likes}.
In three dimensions, with three features including the bird id, we would have a vector
\begin{equation}
\begin{bmatrix}
12 & 2 & 5\\
\end{bmatrix}
\end{equation}
which tells us where that user falls on all three axes.
\begin{figure}[htbp]
\begin{adjustbox}{center}
\tdplotsetmaincoords{60}{120}
\begin{tikzpicture}
[scale=3, tdplot_main_coords, axis/.style={->,blue,thick},
vector/.style={-stealth,azure,very thick},
vector guide/.style={dashed,azure,thick}]
%standard tikz coordinate definition using x, y, z coords
\coordinate (O) at (0,0,0);
%tikz-3dplot coordinate definition using x, y, z coords
\pgfmathsetmacro{\ax}{.12}
\pgfmathsetmacro{\ay}{.2}
\pgfmathsetmacro{\az}{.5}
\coordinate (P) at (\ax,\ay,\az);
%draw axes
\draw[axis] (0,0,0) -- (1,0,0) node[anchor=north east]{$x$};
\draw[axis] (0,0,0) -- (0,1,0) node[anchor=north west]{$y$};
\draw[axis] (0,0,0) -- (0,0,1) node[anchor=south]{$z$};
%draw a vector from O to P
\draw[vector] (O) -- (P);
%draw guide lines to components
\draw[vector guide] (O) -- (\ax,\ay,0);
\draw[vector guide] (\ax,\ay,0) -- (P);
\draw[vector guide] (P) -- (0,0,\az);
\draw[vector guide] (\ax,\ay,0) -- (0,\ay,0);
\draw[vector guide] (\ax,\ay,0) -- (\ax,0,0);
\node[tdplot_main_coords,anchor=east]
at (\ax,0,0){(\ax, 0, 0)};
\node[tdplot_main_coords,anchor=west]
at (0,\ay,0){(0, \ay, 0)};
\node[tdplot_main_coords,anchor=south]
at (0,0,\az){(0, 0, \az)};
\end{tikzpicture}
\end{adjustbox}
\caption{Projecting a vector into 3D space}
\label{fig:my_figure}
\end{figure}
But how do we represent "US" or "UK" in this space? Because modern models converge by performing operations on matrices \citep{lakshmanan2020machine}, we need to encode geography as some sort of numerical value so that the model can use it as an input\footnote{There are some models, specifically decision trees, where in principle you don't need to do text encoding because the tree can learn from categorical variables directly. Implementations differ, however: two of the most popular implementations, scikit-learn and XGBoost \citep{altay_2020}, have traditionally required categorical values to be encoded as numbers first.}. Once each row is a numerical vector, we can compare it to other points: each row of data tells us where to position each bird in relation to any other given bird based on the combination of features. That's really what our numerical features allow us to do.
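To illustrate what comparing points means, the sketch below measures the Euclidean distance between birds using only their numeric features. Euclidean distance is one common choice of measure, picked here for simplicity.

\begin{minted}{python}
import numpy as np

# (bird_posts, bird_likes) for birds 012, 013, and 056 from the table above.
bird_012 = np.array([2.0, 5.0])
bird_013 = np.array([0.0, 4.0])
bird_056 = np.array([57.0, 70.0])

# Euclidean distance: the straight-line distance between two points.
distance_012_013 = float(np.linalg.norm(bird_012 - bird_013))
distance_012_056 = float(np.linalg.norm(bird_012 - bird_056))
\end{minted}

Bird 012 sits much closer to bird 013 than to bird 056, which matches the intuition that the first two are both light users. No such calculation is possible for the geography column until it is turned into numbers.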
\subsection{From Words to Vectors in Three Easy Pieces}
In "Operating Systems: Three Easy Pieces", the authors write, "Like any system built by humans, good ideas accumulated in operating systems over time, as engineers learned what was important in their design." \citep{arpaci2018operating} Today's large language models were likewise built on hundreds of foundational ideas over the course of decades. There are, similarly, several fundamental concepts that make up the work of transforming words to numerical representations.
These show up over and over again, in every deep learning architecture and every NLP-related task\footnote{When we talk about tasks in NLP-based machine learning, we mean, very specifically, what the machine learning problem is formulated to do. Examples of tasks include ranking, recommendation, translation, and text summarization.}:
\begin{formal}
\begin{itemize}
\item \textbf{Encoding} - We need to represent our non-numerical, multimodal data as numbers so we can create models out of them. There are many different ways of doing this.
\item \textbf{Vectors} - We need a way to store the data we have encoded and to perform mathematical functions on it in an optimized way. We store encodings as vectors, usually floating-point representations.
\item \textbf{Lookup matrices} - Oftentimes, the end result we are looking for from encoding and embedding approaches is some approximation of the shape and format of our text, and we need to be able to quickly go from numerical to word representations across large chunks of text. So we use lookup tables, also known as hash tables, to help us map between the words and the numbers. Attention can loosely be thought of as a soft, learned version of the same lookup idea.
\end{itemize}
\end{formal}
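The three ideas can be sketched together in a few lines of native Python. This is only a toy illustration under our own assumptions: the sentence is example data, the names \mintinline{python}{word_to_id} and \mintinline{python}{id_to_word} are hypothetical, and the encoded values are plain integer ids rather than the floating-point vectors that learned embeddings use.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
# A toy sentence used as example data.
sentence = "no bird soars too high"
tokens = sentence.split()

# Encoding: assign each distinct word an integer id.
# This dictionary is our lookup table from words to numbers.
word_to_id = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

# The same lookup in the other direction, from numbers back to words.
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Vector: the sentence stored as a list of numbers.
encoded = [word_to_id[word] for word in tokens]
decoded = [id_to_word[idx] for idx in encoded]

print(encoded)            # [2, 0, 3, 4, 1]
print(decoded == tokens)  # True
\end{minted}
\caption{A toy sketch of encoding, vectors, and lookup tables}
\end{figure}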
As we go through the historical context of embeddings, we'll build our intuition from encoding to BERT and beyond\footnote{Original diagram from \href{http://mccormickml.com/2019/11/11/bert-research-ep-1-key-concepts-and-sources/}{this excellent guide on BERT}}. What we'll find as we go further into the document is that the explanations for each concept get successively shorter, because we've already done the hard work of understanding the building blocks at the beginning.
\begin{figure}[H]
\centering
\includegraphics[width=.6\textwidth]{figures/pyramid.png}
\caption{Pyramid of fundamental concepts building to BERT}
\end{figure}
\section{Historical Encoding Approaches}
Compressing content into lower dimensions for compact numerical representations and calculations is not a new idea. For as long as humans have been overwhelmed by information, we've been trying to synthesize it so that we can make decisions based on it. Early approaches have included one-hot encoding, TF-IDF, bag-of-words, LSA, and LDA.
The earlier approaches were \textbf{count-based} methods. They focused on counting how many times a word appeared relative to other words and generating encodings based on that. LDA and LSA can be considered statistical approaches, but they are still concerned with inferring the properties of a dataset through heuristics rather than modeling. \textbf{Prediction-based} approaches came later and instead learned the properties of a given text through models such as support vector machines, Word2Vec, BERT, and the GPT series of models, all of which use learned embeddings instead.
\begin{figure}[H]
\centering
\resizebox{1\textwidth}{!}{%
\begin{forest}
for tree={
align=center,
parent anchor=south,
child anchor=north,
font=\sffamily,
edge={->,thick},
l sep+=20pt,
}
[\textbf{Embedding Methods} [\textbf{Count-based Methods} [\textbf{TF-IDF}]
[\textbf{One-Hot Encoding}]
[\textbf{Bag-of-words}]
[\textbf{LSA}]
[\textbf{LDA}]
]
[\textbf{Prediction-based Methods} [\textbf{SVM}]
[\textbf{Word2Vec}]
[\textbf{BERT family}]
[\textbf{GPT family}]
]
]
\end{forest}}
\caption{Embedding Method Solution Space}
\end{figure}
\begin{formal}
\textbf{A Note on the Code}
In looking at these approaches programmatically, we'll start by using \mintinline{python}{scikit-learn}, the de facto standard machine learning library for smaller datasets, with some implementations in native Python for clarity in understanding the functionality that scikit-learn wraps. As we move into deep learning, we'll move to PyTorch, a library that's quickly becoming the industry standard for deep learning implementations. There are many different ways of implementing the concepts we discuss here; these are just the easiest to illustrate using Python's ML lingua franca libraries.
\end{formal}
\subsection{Early Approaches}
The first approaches to generating textual features were count-based, relying on simple counts or a high-level understanding of statistical properties. They were \textbf{descriptive}, in contrast to models, which are \textbf{predictive} and attempt to guess a value based on a set of input values. The first methods were \textbf{encoding methods}, a precursor to embedding. Encoding still often happens as the first stage of data preparation for input into more complex modeling approaches. There are several encoding methods that create numerical features from text so that we can map the geography feature into the vector space:
\begin{itemize}
\item Ordinal encoding
\item Indicator encoding
\item One-Hot encoding
\end{itemize}
In all these cases, what we are doing is creating a new feature that maps to the text feature column but is a numerical representation of the variable so that we can project it into that space for modeling purposes. We'll motivate these examples with simple code snippets from scikit-learn, the most common library for demonstrating basic ML concepts. We'll start with \textbf{count-based} approaches.
\subsection{Encoding}
\textbf{Ordinal encoding}
Let's again come back to our dataset of flits. We encode our data using sequential numbers. For example, "1" is "finch", "2" is "bluejay", and so on. We can use this method only if the variables have a natural ordered relationship to each other. That is not the case here: "bluejay" is not "more" than "finch", so the relationship would be incorrectly represented in our model. The same is true if, in our flit data, we encode "NZ" as 0, "UK" as 1, and "US" as 2.
\begin{table}[H]
\centering
\caption{Bird Geographical Location Encoding}
\begin{tabular}{|>{\centering\arraybackslash}m{2cm}|>{\centering\arraybackslash}m{2cm}|>{\centering\arraybackslash}m{2cm}|>{\centering\arraybackslash}m{2cm}|>{\centering\arraybackslash}m{2.5cm}|}
\hline
\rowcolor[HTML]{D5E7F7}
bird\_id & bird\_posts & bird\_geo & bird\_likes & enc\_bird\_geo \\ \hline
012 & 2 & US & 5 & 2 \\ \hline
013 & 0 & UK & 4 & 1 \\ \hline
056 & 57 & NZ & 70 & 0 \\ \hline
612 & 0 & UK & 120 & 1 \\ \hline
\end{tabular}
\end{table}
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python3}
from sklearn.preprocessing import OrdinalEncoder

# our categorical geography feature, one row per bird
data = [['US'], ['UK'], ['NZ']]
print(data)
# [['US'], ['UK'], ['NZ']]

# categories are sorted alphabetically, so NZ -> 0, UK -> 1, US -> 2
encoder = OrdinalEncoder()
result = encoder.fit_transform(data)
print(result)
# [[2.]
#  [1.]
#  [0.]]
\end{minted}
\caption{Ordinal Encoding in scikit-learn \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_22_ordinal.ipynb}{source}}
\end{figure}
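For intuition, the same mapping can be written in a few lines of native Python. This is a simplified sketch of the idea, not scikit-learn's actual implementation: it assumes a single column of string categories, and the function name \mintinline{python}{ordinal_encode} is our own.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
def ordinal_encode(values):
    """Map each category to its position in the sorted list of categories."""
    categories = sorted(set(values))
    mapping = {category: float(i) for i, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping

# the bird_geo column for our four birds
encoded, mapping = ordinal_encode(["US", "UK", "NZ", "UK"])
print(mapping)  # {'NZ': 0.0, 'UK': 1.0, 'US': 2.0}
print(encoded)  # [2.0, 1.0, 0.0, 1.0]
\end{minted}
\caption{A native Python sketch of ordinal encoding}
\end{figure}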
\subsubsection{Indicator and one-hot encoding}
Indicator encoding, given $n$ categories (e.g. "US", "UK", and "NZ"), encodes the variable into $n-1$ new binary features, one for each category except a baseline. So, if we have three categories, indicator encoding produces two indicator variables. Why would we do this? If the categories are mutually exclusive, as they usually are in point-in-time geolocation estimates, the remaining category is implied by the others: if a bird is in neither the US nor the UK, we know it must be in NZ. Dropping the redundant variable reduces computational overhead.
If we instead use all $n$ variables and they are very closely correlated, there is a chance we'll fall into something known as the \textbf{indicator variable trap}: we can predict one variable from the others, which means we no longer have feature independence. This generally isn't a risk for geolocation, since there are more than two or three possible locations, and if you're not in the US, it's not guaranteed that you're in the UK. So, if we have three categories (US, UK, and NZ) and prefer more compact representations, we can use indicator encoding. However, many modern ML approaches don't require linear feature independence and use L1 regularization\footnote{Regularization is a way to prevent our model from \textbf{overfitting}. Overfitting means our model can exactly predict outcomes based on the training data, but it can't handle new inputs that we show it, which means it can't generalize.} to prune feature inputs that don't minimize the error, and as such simply use one-hot encoding.
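A minimal native Python sketch of indicator encoding follows. It assumes the alphabetically first category is the one we drop as the baseline, which is our own choice for illustration; the function name \mintinline{python}{indicator_encode} is hypothetical.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
def indicator_encode(values):
    """Encode n categories as n-1 binary columns, dropping a baseline."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the baseline
    rows = [[1 if value == c else 0 for c in kept] for value in values]
    return kept, rows

columns, rows = indicator_encode(["US", "UK", "NZ"])
print(columns)  # ['UK', 'US']
print(rows)     # [[0, 1], [1, 0], [0, 0]]
\end{minted}
\caption{A native Python sketch of indicator encoding}
\end{figure}
Here NZ is the baseline: a row of all zeros means "neither UK nor US", which can only be NZ.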
\textbf{One-hot encoding} is the most commonly used of the count-based methods. This process creates a new variable for each category of the feature that we have. Everywhere the element is present in the sentence, we place a "1" in the vector. We are creating a mapping of all the elements in the feature space, where $0$ indicates a non-match and $1$ indicates a match, and comparing how similar those vectors are.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
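from sklearn.preprocessing import OneHotEncoder
import numpy as np

# one row per bird, one categorical column (bird_geo)
data = np.asarray([['US'], ['UK'], ['NZ']])

# handle_unknown='ignore' encodes unseen categories as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(data)

# learned categories, sorted alphabetically: these label the columns
print(enc.categories_)
# [array(['NZ', 'UK', 'US'], dtype='<U2')]

# transform returns a sparse matrix, so convert it to a dense array
onehotlabels = enc.transform(data).toarray()
print(onehotlabels)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]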
\end{minted}
\caption{One-Hot Encoding in scikit-learn \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_22_ordinal.ipynb}{source}}
\end{figure}
\begin{table}[H]
\centering
\caption{Our one-hot encoded data with labels}
\begin{tabular}{llll}
\hline
\rowcolor[HTML]{D5E7F7}
bird\_id & US & UK & NZ \\
\hline
012 & 1 & 0 & 0 \\
\hline
013 & 0 & 1 & 0 \\
\hline
056 & 0 & 0 & 1 \\
\hline
\end{tabular}
\end{table}
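To see what comparing one-hot vectors looks like in practice, here is a small native Python sketch. It assumes the column order US, UK, NZ from the table and uses the dot product as the measure of similarity, which is one common choice among several.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
# one-hot vectors with columns in the order US, UK, NZ
one_hot = {
    "US": [1, 0, 0],
    "UK": [0, 1, 0],
    "NZ": [0, 0, 1],
}

def dot(a, b):
    """Dot product: sum of the element-wise products of two vectors."""
    return sum(x * y for x, y in zip(a, b))

print(dot(one_hot["US"], one_hot["US"]))  # 1: a match
print(dot(one_hot["US"], one_hot["UK"]))  # 0: a non-match
print(dot(one_hot["UK"], one_hot["NZ"]))  # 0: a non-match
\end{minted}
\caption{A sketch of comparing one-hot vectors with a dot product}
\end{figure}
With one-hot vectors, the comparison can only say "same category" or "different category": every pair of distinct categories is equally dissimilar.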
Now that we've encoded our textual features as vectors, we can feed them into the model we're developing to predict churn. The function we've been learning will minimize the loss of the model, or the distance between the model's prediction and the actual value, by learning the right parameters for each of these features. The learned model will then return a value between 0 and 1: the probability that the event, churn, has taken place, given the input features of our particular bird. Since this is a supervised model, we then evaluate it for accuracy by feeding our test data into the model and comparing the model's predictions against the actual data, which tells us whether each bird has churned or not.
What we've built is a standard \textbf{logistic regression model}. These days, the machine learning community has generally converged on using gradient-boosted decision tree methods for dealing with tabular data, but we'll see that neural networks build on simple linear and logistic regression models to generate their output, so it's a good starting point.
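As a rough sketch of what such a model computes for one bird, the snippet below applies a sigmoid to a weighted sum of the features and measures the loss against the true label. The weights and bias are made-up numbers chosen purely for illustration, not learned values, and the function names are our own.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(features, weights, bias):
    """Probability of churn: sigmoid of the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def log_loss(y_true, p):
    """Loss for one example: small when p is close to the true label."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# features: [bird_posts, bird_likes, US, UK, NZ] for bird 012
bird_012 = [2, 5, 1, 0, 0]
weights = [-0.05, -0.02, 0.3, -0.1, 0.2]  # hypothetical, not learned
bias = 0.1                                # hypothetical, not learned

p = predict_churn(bird_012, weights, bias)
print(round(p, 2))         # 0.55
print(log_loss(1, p) > 0)  # True
\end{minted}
\caption{A sketch of a logistic regression prediction for a single bird}
\end{figure}
Training amounts to adjusting the weights and bias so that this loss, averaged over all birds in the training data, becomes as small as possible.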
\subsubsection*{Embeddings as larger feature inputs}
Once we have encoded our feature data, we can use this input for any type of model that accepts tabular features. In our machine learning task, we were looking for output that indicated whether a bird was likely to leave the platform based on their location and some usage data. Now, we'd like to focus specifically on surfacing flits that are similar to other flits the user has already interacted with, so we'll need feature representations of either our users or our content.
Let's go back to the original business question we posed at the beginning of this document: how do we recommend interesting new content for Flutter users, given what we know about the past content they consumed (i.e. liked and shared)?
In the traditional \textbf{collaborative filtering} approach to recommendations, we start by constructing a user-item matrix based on our input data that, when factored, gives us the latent properties of each flit and allows us to recommend similar ones.
In our case, we have Flutter users who might have liked a given flit. What other flits would we recommend given the textual properties of that one?
Here’s an example. We have a flit that our bird users liked.
\begin{formal}
"Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."
\end{formal}
We also have other flits we may or may not want to surface in our bird's feed.
\begin{formal}
"No bird soars too high if he soars with his own wings."
\end{formal}
\begin{formal}
"A bird does not sing because it has an answer, it sings because it has a song."
\end{formal}
How would we turn this into a machine learning problem that takes features as input and a prediction as an output, knowing what we know about how to do this already? First, in order to build this matrix, we need to turn each word into a feature that's a column value and each user remains a row value.
\begin{flushleft}
The best way to think of the difference between tabular and free-form representations as model inputs is that a row of tabular data looks like this, \mintinline{python}{["012", 2, "US", 5]}, and a "row" or document of text data looks like this, \mintinline{python}{["No bird soars too high if he soars with his own wings."]}. In both cases, each of these is a vector, or a list of values that represents a single item: a bird in the first case, a flit in the second.
\end{flushleft}
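As a first, naive pass at turning each word into a column, the sketch below builds a word-count matrix from our three flits in native Python. It assumes each row is a flit (a document) and uses a deliberately crude tokenizer of our own that lowercases and strips periods and commas. The function name \mintinline{python}{tokenize} is hypothetical.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
flits = [
    "Hold fast to dreams, for if dreams die, "
    "life is a broken-winged bird that cannot fly.",
    "No bird soars too high if he soars with his own wings.",
    "A bird does not sing because it has an answer, "
    "it sings because it has a song.",
]

def tokenize(text):
    """Split on whitespace, strip periods and commas, and lowercase."""
    return [word.strip(".,").lower() for word in text.split()]

# every distinct word becomes one column of the matrix
vocabulary = sorted({word for flit in flits for word in tokenize(flit)})

# one row per flit, one count per vocabulary word
matrix = [[tokenize(flit).count(word) for word in vocabulary]
          for flit in flits]

bird_col = vocabulary.index("bird")
soars_col = vocabulary.index("soars")
print([row[bird_col] for row in matrix])   # [1, 1, 1]
print([row[soars_col] for row in matrix])  # [0, 2, 0]
\end{minted}
\caption{A sketch of turning flits into a word-count matrix}
\end{figure}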
In traditional machine learning, rows are our user data about a single bird and columns are features about the bird. In recommendation systems, our rows are the individual data about each user, and our columns represent the data about each flit. If we can factor this matrix, that is, decompose it into two matrices ($Q$ and $P^T$) whose product is our original matrix ($R$), we can learn the "latent factors", or features, that allow us to group similar users and items together to recommend them.
Another way to think about this is that in traditional ML, we have to actively engineer features, but they are then available to us as matrices. In text and deep-learning approaches, we don't need to do feature engineering, but need to perform the extra step of generating valuable numeric features anyway.
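The factorization $R = QP^T$ can be sketched with NumPy. The example below makes several simplifying assumptions of our own: the latent factors are made-up numbers, $R$ is fully observed with no missing ratings, and we use a truncated singular value decomposition as a convenient stand-in for whatever factorization method a production recommender would actually use.
\begin{figure}[H]
\begin{minted}
[
frame=lines,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
linenos
]{python}
import numpy as np

# Hypothetical latent factors: 4 users and 5 flits, 2 factors each.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
P = np.array([[5.0, 1.0],
              [4.0, 0.0],
              [1.0, 5.0],
              [0.0, 4.0],
              [2.0, 2.0]])

# The user-item matrix is the product of the two factor matrices.
R = Q @ P.T

# Going the other way: recover a rank-k factorization from R alone.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
Q_hat = U[:, :k] * s[:k]  # one row of k latent factors per user
Pt_hat = Vt[:k, :]        # one column of k latent factors per flit

print(R.shape)                         # (4, 5)
print(np.allclose(Q_hat @ Pt_hat, R))  # True
\end{minted}
\caption{A sketch of factoring a small user-item matrix}
\end{figure}
The recovered factors are generally not the same numbers as the ones we started from, since a factorization is not unique, but their product reproduces $R$.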
\begin{tikzpicture}[nmat/.style={matrix of mathsf nodes,inner sep=0pt,nodes in empty cells,column sep=-\pgflinewidth,
row sep=-\pgflinewidth,nodes={text height=1.7ex,text depth=0.2ex,inner
sep=2pt,minimum width=1.8ex},matrix dividers={thin},matrix
frame={thick}},font=\sffamily,
empty node/.style={fill=none}]
\matrix[nmat,nodes={fill=w_lightblue}] (mat1) {
1& & 3 & & & 5 & & & 5 & & 4 & \\
& & 5 & 4 & & & 4& & & 2 & 1 & 3 \\
2 & 4 & & 1 & 2 & & 3 & & 4 & 3 & 5 & \\
& 2& 4 & & 5 & & & 4 & & & 2& \\