<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[John Hallman]]></title><description><![CDATA[Pre-training / post-training researcher at OpenAI, retired math olympian and figure skater]]></description><link>https://www.hallman.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Xs5Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948a13a3-fb28-4d37-a937-bc5c1287ce24_1024x1024.png</url><title>John Hallman</title><link>https://www.hallman.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 06:28:52 GMT</lastBuildDate><atom:link href="https://www.hallman.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[John Hallman]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[johnscontemplations@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[johnscontemplations@substack.com]]></itunes:email><itunes:name><![CDATA[John Hallman]]></itunes:name></itunes:owner><itunes:author><![CDATA[John Hallman]]></itunes:author><googleplay:owner><![CDATA[johnscontemplations@substack.com]]></googleplay:owner><googleplay:email><![CDATA[johnscontemplations@substack.com]]></googleplay:email><googleplay:author><![CDATA[John Hallman]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Reflections on figure skating]]></title><description><![CDATA[Looking back 10 years after retirement]]></description><link>https://www.hallman.com/p/reflections-on-figure-skating</link><guid isPermaLink="false">https://www.hallman.com/p/reflections-on-figure-skating</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Wed, 11 Mar 2026 23:10:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!E_9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b1b103-94c1-4bc6-933d-d3c53363664e_1500x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Figure skating is having a moment right now, in large part thanks to Alyssa Liu and the Winter Olympics. 
It&#8217;s been a pleasure to follow along, and I&#8217;m amazed and delighted that someone like Alyssa was able to win, after years of dominance by skaters who prioritized jumps and slimness over health and happiness.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!E_9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16b1b103-94c1-4bc6-933d-d3c53363664e_1500x1000.jpeg" alt=""></figure></div><p>Like Alyssa, I too spent most of my early life figure skating due to a mixture of passion and parental pressure. Of course, unlike Alyssa, I never got close to the Olympics, though I was decent enough that my Instagram videos to this day still impress my friends and garner lots of likes.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;9c6332f2-7f0b-4538-86b5-0b105d08fcd4&quot;,&quot;duration&quot;:null}"></div><p><em>A rusty John completing a 2A at the Yerba Buena ice skating rink.</em></p><p>Before AI research and math olympiads, figure skating was my main activity and the center of my identity. I was a member of the Swedish national team, was ranked second in my age group in Sweden for much of my career, and competed internationally at Junior Grand Prix events and other competitions. There was even a time, when I was really young and really, really naive, when I dreamt of winning the Olympics :&#8217;)</p><p>I was never good enough for that, though I was good enough to observe and meet people who were. I practiced in similar ways, in similar locations, and competed to similar music in similar outfits. I got to see what made the difference between good and great.</p><p>This year, it&#8217;ll be 10 years since I retired from figure skating. I&#8217;m now entering an age where my relationship with figure skating is starting to come back, but in a very different form &#8211; friends and coworkers are starting to think about whether to have their kids try out figure skating. Presumably one day I will consider this too.</p><p>10 years is enough time to have gained some perspective on what figure skating gave me and what it cost me. 
For anyone curious about the sport, either for themselves or for their kids, here are my reflections on the good and the bad of this lovely sport.</p><h2>1) A complex mixture of arts and athletics</h2><p>Figure skating is a lovely sport in large part because it combines arts and athletics in a way that feels unique and pretty well balanced. During competitions, you are scored separately on the technical tricks performed (the technical element score, TES) and the artistry of your performance (the program component score, PCS). There are certainly skaters with less artistic ability who win through sheer technical dominance, but in general skaters try quite hard to improve on both axes.</p><p>The breadth of skill needed for skating makes it a very interesting sport to learn and to perform &#8211; one day you&#8217;re in ballet or hip hop class practicing your musicality and facial expressions, the next day you&#8217;re at the gym doing squats to strengthen your jumps, and the day after you&#8217;re running a 5k to push your stamina. Two days later you&#8217;re at a competition mixing dance with jumps and spins.</p><p>There are lots of layers of complexity that many audience members are not aware of &#8211; the spin positions and twists in step sequences are not picked at random. The rules update yearly, requiring skaters to showcase creativity and skill in order to maximize the points earned from the different elements in their program. Some skaters find this aspect of the sport fun, while others find it annoying. Either way, understanding how to improve as a skater and maximize your score is a key part of the sport, and honestly very intellectually demanding.</p><h2>2) Athletics, bodies, and injuries</h2><p>Like all sports, figure skating takes a heavy toll on the body. Ankle, knee, hip, and back injuries from jumps are extremely common, though unlike contact sports you&#8217;re mostly safe from extreme injuries like concussions. Some people disagree with this, but my experience after talking to many skaters is that it is possible to practice jumps with appropriate falling technique that lets you avoid hitting your head on the ice.</p><p>I myself had a bad hip injury and a bad knee injury, both of which slightly impair me to this day. They&#8217;ll probably get worse with the years, but I find this to be a worthy sacrifice for the experiences I had as a figure skater.</p><p>The experience varies a lot, however, and some people suffer much worse injuries and bodily damage. In the 2010s, an infamous Russian coach named Eteri Tutberidze received a lot of criticism for her intense training regimen, which produced girls who won senior international competitions around the ages of 15-17 before burning out and being forced into retirement before reaching adulthood.</p><p>Today, with the increase in the minimum age for participating in senior international competitions, this phenomenon has waned a bit. The U.S. also treats its skaters better, but under the pressure to perform it remains common for skaters to suffer injuries, eating disorders, body dysmorphia, and so on.</p><h2>3) Individual vs team sports</h2><p>Although team skating as a category exists, and pair skating / ice dancing remains quite popular, the vast majority of public mindshare is taken up by individual figure skating. Needless to say, individual sports are very different from team sports, and this mainly shows in the way sports psychology and mentality affect how successful you are as an athlete.</p><p>Team sports certainly have stars with big egos that can be hard to work with, but this tends to be penalized in ways that you don&#8217;t see in figure skating. Your wins are your own work, and your losses your own fault. There is no teammate to blame, no bad passes. Similarly, there is no opponent on the field. You are not fighting against anyone; the enemy is yourself &#8211; no one else can prevent you from landing that jump or completing that spin.</p><p>The result is that skaters sometimes develop a different sports psychology than team sports athletes. I don&#8217;t think figure skating is likely to produce dramatic differences in people, such as turning an otherwise lovely person into a diva, but you do get less exposure to team dynamics, which some people may consider an unfortunate life experience to miss out on.</p><h2>4) Mental fortitude</h2><p>The most intense moment of my life to date was the beginning of my short program at the Youth Olympics in Innsbruck in 2012. It was my first serious international competition, and the audience was larger than at any other competition I had participated in. I was nervous as all hell.</p><p>Leading into the first jump, I found myself unable to think at all. My fight-or-flight response shut down my brain, and so my body operated on autopilot. Fortunately, I landed my first triple jump on sheer muscle memory, and the rest of the program went decently well.</p><p>There were lots of other intense moments. Arguments with parents and coaches, tough losses at important competitions, bad practice sessions where you fall on every jump, the first time you attempt a new jump &#8211; the list goes on. Figure skating is hardly unique among sports in this, but it requires mental fortitude in a way that academics simply doesn&#8217;t &#8211; I never felt nearly the same intensity in my academic career, even at the IMO, as I did in figure skating.</p><p>Does this mental fortitude transfer to other domains in your life? I think so, but it&#8217;s not obvious. My impression is that mental fortitude like &#8220;grit&#8221; is useful in academics, but you end up having to re-develop it a bit, because grit in sports and in academics feels rather different. Knowing what it feels and looks like is very valuable though, and I do think it helped me in my math career.</p><h2>5) How much do you want it</h2><p>As with other sports, how far you can go is somewhat bounded by how talented you are. However, if you are lucky and are born with good genetics and a good &#8220;sense&#8221; for how skating works, you&#8217;ll at some point have to ask how much you want it. This is a tough decision to make, but a great one to face early in your life! Competing internationally, not to mention gunning for the Olympics, takes tremendous sacrifice, and there&#8217;s no rule that says that participating in the Olympics is a better way to live life than any other path you could take. I chose academics at the age of 18 and never regretted that decision.</p><p>My best advice here is that most people don&#8217;t spend their time particularly effectively, so ask yourself what life would look like if you went for it versus if you didn&#8217;t, and see what your gut reaction says about which path you want more. 
Trust your gut; it knows what you want better than your brain does.</p><div><hr></div><p>Hope this was helpful. If you enjoy skating or want to explore a new hobby, come visit the Yerba Buena Ice Skating &amp; Bowling Center! They have both freestyle sessions for experienced skaters and public ice for beginners on Saturdays. I&#8217;m usually at the freestyle session at 9:00am.</p>]]></content:encoded></item><item><title><![CDATA[Information externalities]]></title><description><![CDATA[A look at the negative impacts of increased information distribution and access. Also game theory.]]></description><link>https://www.hallman.com/p/information-externalities</link><guid isPermaLink="false">https://www.hallman.com/p/information-externalities</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Wed, 14 Aug 2024 06:04:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5shW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e924b5c-29bd-4360-ad83-a6732700346f_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5shW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e924b5c-29bd-4360-ad83-a6732700346f_1024x1024.webp" alt="An abstract image of 'information struggling': a digital human figure composed of circuits and binary code at the center of a whirlwind of data symbols, graphs, and documents."></figure></div><p>Can more information make us collectively worse off?</p><p>For most of my life, I would have rejected as ridiculous and authoritarian any claim that more information is bad for society. However, modern society is weird. In particular, the internet is weird, and somehow, it appears that the forms of information dissemination that have emerged from the internet (news, Twitter, social media) have not been universally good.</p><p>I recently took a break from social media and found that my life quality went up quite noticeably, and that I didn&#8217;t feel like I was missing out as much as I thought I would. This surprised me, because I was effectively restricting my information intake, and yet it clearly had a positive impact on my life.</p><p>There are the standard arguments for why information consumption is bad: it&#8217;s the sugar/fast food of our generation &#8211; we evolved in information-sparse settings and as such crave it at a primal level and are unable to healthily regulate our intake; modern sources are full of low-quality or misleading content; it distracts you from the key things you should be focusing on; it produces FOMO and commitment issues; and so on.</p><p>Today I want to write a bit about information dissemination and why it can be bad. Of course, on the whole, I still believe that more information is good, and that the internet has been and will continue to be a net good for humanity. But it is also worth keeping track of its negative effects. Hopefully this inspires some intervention on your part, or at least is a fun read.</p><h3>Networks and Braess&#8217;s paradox</h3><p>In 1968, mathematician Dietrich Braess formalized an observation by economist Arthur Pigou that it is possible to construct a hypothetical city grid such that adding a road <em>increases</em> commute time for <em>all</em> residents (not just on average!). This became known as <a href="https://en.wikipedia.org/wiki/Braess%27s_paradox">Braess&#8217;s paradox</a>.</p><p>The setting is as follows: some number of citizens live in a city and commute between the same two endpoints, passing through one of two midpoints, A or B. They are self-interested actors and choose between the available paths based on commute time. Commute time on any part of a road is a function of the number of other commuters on that road at that point in time.</p><p>In the example below, there are four roads with commute time <em>t</em> specified as either constant or a linear function of the number of travelers <em>T</em>. Suppose there are 4000 commuters. If <em>a</em> and <em>b</em> commuters pick the top path (through A) and the bottom path (through B) respectively, the top path has commute time <em>a</em>/100 + 45 and the bottom has <em>b</em>/100 + 45. This is a stable system &#8211; whenever one path is more crowded than the other, commuters switch to the emptier one to reduce congestion and minimize their own commute time. 
This stabilizes at an equal commute time of 2000/100 + 45 = 65 minutes.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wMP4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50382911-1b3a-4f85-85b7-93dac76d295e_500x129.png" alt=""><figcaption class="image-caption">Source: <a href="https://en.wikipedia.org/wiki/Braess%27s_paradox#/media/File:Braess_paradox_road_example.svg">Wikipedia</a></figcaption></figure></div><p>Suppose a road is added from point A to B with <em>zero</em> commute time. Any person traveling on the top path, once they get to A, can either take the constant 45-minute road to the end or take the new road over to the final stretch of the bottom path, which costs <em>T</em>/100 minutes for <em>T</em> travelers. That stretch initially carries only the 2000 bottom-path commuters, so the detour costs about 20 minutes, and every top-path commuter ends up taking it.</p><p>By then, the final stretch carries 2000 + 2000 = 4000 travelers, and the bottom path&#8217;s commute has climbed to 45 + 40 = 85 minutes. The commuters through B catch on to what is happening and switch to the faster route through the detour as well. Through self-interested decision making, everyone ends up on the same path, which converges to a commute of 4000/100 &#215; 2 = 80 minutes &#8211; worse than the original 65 minutes before the &#8220;instant shortcut&#8221; from A to B appeared. This is also stable &#8211; no one has any incentive to switch to another path, since the alternative routes take 85 minutes.</p><p>People who have studied game theory or economics will see what is happening: two of the four road sections have commute times that scale with the number of travelers, which means that commuting through them produces an <em>externality</em> &#8211; every extra car slows everyone else on that section down. The detour from A to B encourages paths with higher externality, leading to a worse Nash equilibrium in the game of commuters picking paths.</p>
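<p>To make the dynamics concrete, here&#8217;s a minimal best-response simulation of the example above. This is my own sketch rather than anything from the original sources &#8211; the route names and the one-commuter-switches-at-a-time rule are modeling assumptions &#8211; but the road costs match the diagram:</p><pre><code># Braess's paradox via best-response dynamics.
N = 4000  # commuters

def travel_times(top, bottom, detour):
    # Two legs congest at T/100 minutes for T travelers; the rest cost 45.
    a_leg = (top + detour) / 100     # shared congestible leg into A
    b_leg = (bottom + detour) / 100  # shared congestible leg out of B
    return {"top": a_leg + 45, "bottom": 45 + b_leg, "detour": a_leg + b_leg}

def equilibrium(routes):
    counts = dict.fromkeys(routes, 0)
    counts[routes[0]] = N            # start with everyone piled on one route
    while True:
        t = travel_times(counts.get("top", 0), counts.get("bottom", 0),
                         counts.get("detour", 0))
        occupied = [r for r in routes if counts[r]]
        worst, best = max(occupied, key=t.get), min(routes, key=t.get)
        if t[worst] - t[best] &lt; 1e-9:
            return counts, {r: round(t[r], 2) for r in routes}
        counts[worst] -= 1           # one commuter defects to the fastest route
        counts[best] += 1

print(equilibrium(["top", "bottom"]))            # 2000/2000 split, 65 min each
print(equilibrium(["top", "bottom", "detour"]))  # everyone detours, 80 min each
</code></pre><p>Without the shortcut the commuters settle at 65 minutes each; with it, self-interested switching drags everyone onto the detour at 80 minutes &#8211; exactly the worse equilibrium described above.</p>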
<h3>Braess discovers Instagram</h3><p><a href="https://www.afterbabel.com/">Jonathan Haidt</a> has written extensively about the harms of social media, in particular for young teenagers. He makes a strong case for the internet being harmful through two mechanisms: addiction and externalities.</p><p>The addiction argument is well understood &#8211; apps compete for attention, and do a pretty good job of it. Addiction is also easy to argue against: it&#8217;s just plain bad, and we should regulate apps to reduce addictive patterns such as infinite scrolling. The externality argument is more interesting in my opinion, because the externalities arise from properties of social media that are intrinsically positive. We seek out information on what our peers are doing because it is useful information for us. The anxiety and FOMO produced by these apps are a consequence of a behavior we seek out for our own good.</p><p>As with Braess&#8217;s paradox, the introduction of social media has made possible new behaviors which on an individual basis are beneficial but which produce externalities that make everyone worse off:</p><ul><li><p>The &#8220;road&#8221; in Braess&#8217;s paradox is, in the context of social media, the ability for people to selectively share information about their lives and to easily consume the information shared by their peers at a previously impractical scale.</p></li><li><p>The equivalent of the old road system was to focus on a smaller friend group, build reputation through face-to-face interactions, and figure out how you fit in within smaller communities.</p></li><li><p>The advantage of social media was clear &#8211; curating your social media presence became the most efficient way to scale your reputation, a form of personal marketing. Tracking your peers was important to stay up to date on events, trends, and social dynamics.</p></li><li><p>Reduced time spent offline in smaller groups was one of the externalities produced by social media, forcing even those who were reluctant to move to social media as well.</p></li><li><p>Over time, social status dynamics eventually stabilized. The new hierarchy might look different than it did before, but status is mostly zero-sum, so people traded their old lifestyles for a new one which, based on the evidence collected by Haidt and others, appears to have left everyone on average worse off.</p></li><li><p>The other externality, which harms even those who were originally supportive of the move to social media, is the highly manipulative nature of the information shared on these platforms &#8211; the pool of people on a platform is much larger, so the most successful and attractive people you see there will outperform anyone you meet in real life. On top of that, what people share is skewed to look better than reality. The result is a reduction in self-esteem due to social comparison.</p></li></ul><p>Of course, there are massive benefits to social media as well. Not everyone goes on these apps to signal and to compare themselves to others. However, if you find yourself mostly using these apps to share how awesome your life is and to compare your posts to your peers, it may be worth reflecting on what Braess would say and maybe taking a break.</p><h3>Braess goes job hunting</h3><p>The internet has also changed the market dynamics for employers and job hunters. Most companies today hire through digital platforms. As with social media, this affects how candidates present themselves to companies, and the way that companies look for people to hire.</p><p>In the world before the internet, cold outreach meant more. Finding a phone number and making a call, writing a letter, or showing up at someone&#8217;s office was harder and rarer. It meant something, and was usually quite a positive signal about the candidate. The world was smaller, so companies were more likely to know the candidate, or at least had more time to assess them.</p><p>Today, there are AI tools that let candidates automatically apply to thousands of companies at once, while recruiters can send thousands of customized emails to candidates with minimal effort. Cold outreach works worse than ever. The exceptions are exceptions precisely because they don&#8217;t apply to the average candidate or company &#8211; you are uniquely talented, have particular expertise relevant to the company you&#8217;re applying to, know someone already working there, or reach out to someone or some company that does not get a lot of inbound (sometimes for good reasons).</p><p>A few years ago, I found myself managing the recruiting pipeline for the startup I was working at. This was in the midst of the Covid-stimulus-induced tech boom, when every company desperately wanted to grow and talent was hard to come by. Nevertheless, we put up a single job posting and in a day had hundreds of applicants. After a week it was over a thousand. We did no advertising, did not post it on any college hiring platforms, and did nothing else to justify this level of attention. We were also not a particularly well-known startup at the time.</p><p>My original plan was to personally look through every resume and interview anyone who looked promising. That idea quickly went out the window. I ended up trying all the usual strategies you&#8217;d think of &#8211; delegating resume screening, filtering based on GPA or school, even having candidates complete an online coding assessment. I never got a chance to look through most applications.</p><p>My point being &#8211; in this new world where any candidate and company can blast out applications, it is impossible to consider every option. Pre-screening is unavoidable. The challenge becomes figuring out how to screen effectively. Many companies exclusively hire people they know to avoid this headache, for better or worse. Many use the signals we are familiar with &#8211; GPA, school, and previous experience.</p><p>These signals are effective because they are hard to fake. Candidates know that companies will ask for evidence if they get to the stage where they extend an offer, so exaggeration or lying doesn&#8217;t work. On the other side, candidates are far more likely to pick companies with established brands, even sacrificing salary, work-life balance, and more. They know that when they interview for their next job, future employers will have little time to assess how good a job they did on the projects they were assigned, and as such are more likely to focus on their previous company and title.</p><p>If you already have the signals companies and candidates are looking for, this is great &#8211; you have more options than ever. If you don&#8217;t, you&#8217;re likely to find the job market more frustrating than it would have been in the past. The increased focus on signaling is a negative externality on society. It costs time and resources for people to acquire signals such as getting into a good college, doing extracurriculars, collecting publications, and preparing for job interviews. Some of this work is authentic and valuable, but a vast amount of effort and resources goes to waste every year on junk research, extracurriculars, and studying for tests and interviews in ways that will never be useful.</p><p>Unlike social media, there is no easy solution here. You can&#8217;t convince a few friends to delete their apps and touch grass. Companies need to hire, and it is good for social mobility and the world that they don&#8217;t exclusively consider people they already know. However, search and matching is a tough problem, and we currently don&#8217;t have good scalable solutions. GPA, the school you went to, and previous job experience do correlate positively with job performance and are hard-to-fake signals, so they will stick around for the foreseeable future.</p><p>The good news is that acquiring these signals usually requires you to learn or do something genuinely useful. Unfortunately, this is eroding. Grade inflation is commonplace, wealthy parents buy spots at top schools, and a single early internship at a good company through family connections can jumpstart a college student&#8217;s career. It is also a waste of talent &#8211; plenty of incredible people never bother with, or never realize the value of, pursuing signaling credentials.</p><p>Overall, the U.S. is still better than other places in the world. For example, it is a common sentiment in East Asian countries that your life trajectory is determined by the college you get into. Anecdotally, I&#8217;m also seeing a growing trend towards hiring through your network, which makes signaling less important and reputation (a much more robust and truthful signal) more valuable.</p><p>Nevertheless, across the board, we are seeing dissatisfaction with current digital platforms at a surprising scale. They provide enough value that they aren&#8217;t going anywhere anytime soon, though it&#8217;ll be interesting to see whether social media, dating apps, and networking platforms can continue to grow in the next couple of years. My read is that the current implementation of the internet solved some problems while creating others, and that a new and improved version will be necessary for people to transition over at a greater scale and permanence.</p>]]></content:encoded></item><item><title><![CDATA[Are you paying Attention?]]></title><description><![CDATA[A deeper look into the method that revolutionized AI]]></description><link>https://www.hallman.com/p/are-you-paying-attention</link><guid isPermaLink="false">https://www.hallman.com/p/are-you-paying-attention</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Tue, 23 Apr 2024 06:42:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CA4W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae508db-b10b-4ebf-b89e-72f173c62c93_428x581.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!CA4W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae508db-b10b-4ebf-b89e-72f173c62c93_428x581.png" alt="Attention is all you need: Discovering the Transformer paper | by Eduardo Mu&#241;oz | Towards Data Science"><figcaption class="image-caption">Source: <a href="https://arxiv.org/abs/1706.03762">Google Brain</a></figcaption></figure></div><blockquote><p><em>&#8220;Attention is All You Need&#8221;</em><br> &#8211; Google researchers in 2017</p></blockquote><p>Amongst the many mysteries of modern deep learning, the remarkable success of attention is one of my favorites. There&#8217;s something about it that just seems like it shouldn&#8217;t quite work as well as it does. Nevertheless, for all the hate, and all the attempts over many years to replace it, it is still around!</p><p>Let&#8217;s dig into it a bit today.</p><p>We start with a simple but enduring problem &#8211; sequence modeling. 
Simple and infinitely expressive.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!enS8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302dd05a-95e4-4516-af9c-68b0ba9628d4_1516x258.png" alt=""></figure></div><p>Sure, there are some details around BPE tokenization and hard-coded semantic rules that we humans dearly depend on and that deep learning researchers so despise, but it&#8217;s a hack that&#8217;s worked so far, so we shrug and leave that for another day, as we have for the past many years.</p><p>We are working with sequences of integers, each integer mapping in some arbitrary manner to its own class embedded with rich information &#8211; each integer value associated with its own concept and all of its deep context.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lNo2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd27910c-9a16-4278-a698-4633095ef99c_444x354.png" alt=""></figure></div>
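<p>As a toy illustration of the setup &#8211; with a made-up six-word vocabulary, not a real tokenizer:</p><pre><code># Text becomes a sequence of integer ids; the modeling task is simply
# "given the ids so far, predict the next id" at every position.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}
ids = [vocab[w] for w in "the cat sat on the mat .".split()]
print(ids)  # [0, 1, 2, 3, 0, 4, 5]
</code></pre>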
<p>We find ourselves in a setting where the relationships between the elements in sequences are, though not entirely trivial, still relatively simple. Preposition elements point to elements before them; pronoun elements map back to noun elements in sometimes complex ways, but still mostly simple enough to be deduced from context; adjective elements adjust the nuance of subsequent elements. Simple rules, but nevertheless a large slew of them &#8211; too many to write down, and sufficiently diverse that general learning methods must be used.</p><p>In comes Attention. 
Or, to be specific, Self-Attention.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kAAR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kAAR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 424w, https://substackcdn.com/image/fetch/$s_!kAAR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 848w, https://substackcdn.com/image/fetch/$s_!kAAR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 1272w, https://substackcdn.com/image/fetch/$s_!kAAR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kAAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png" width="1422" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e599f706-dfc0-46ce-918c-8c435a250930_1422x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64906,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kAAR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 424w, https://substackcdn.com/image/fetch/$s_!kAAR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 848w, https://substackcdn.com/image/fetch/$s_!kAAR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 1272w, https://substackcdn.com/image/fetch/$s_!kAAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe599f706-dfc0-46ce-918c-8c435a250930_1422x470.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">We want to predict a word that (1) follows &#8220;the&#8221;, and (2) cats usually sit on</figcaption></figure></div><p>Attention models both content and position. How content is modeled &#8211; QKV vectors &#8211; is better understood than positional information. It is also more intuitive. Different parts of the embedding vector model different concepts such as punctuation, time, language, etc. The value matrices pull out subsets of these embedding spaces to model specific concepts, the key matrices pull out related concepts necessary to index this content, and query matrices pull out the subset of the embedding space that determine what indexed content to retrieve.</p><h3>Side note: Concept capacity in <em>d-</em>dimensional spaces</h3><p>Some justification for why QKV vectors are capable of modeling all of human language &#8211; at the bare minimum, they can just subsample the embedding concept space that is the <em>d</em>-dimensional unit sphere, and <em>d</em>-dimensional spheres can fit a lot of information! We can assume we are working on unit spheres due to the layer norms so commonly used today enforcing norm-agnosticism in modern day Transformers.</p><p>In <em>d</em>-dimensions, as <em>d</em> increases, one can fit many concepts. Let&#8217;s quickly contemplate how many. Suppose each concept is it&#8217;s own vector in this <em>d</em>-dimensional sphere. To differentiate concept vectors, each vector must have some minimum cosine distance, let&#8217;s call this <em>x</em> for now, from all other vectors. An upper bound on the number of vectors can be calculated by looking at the total <em>d-</em>dimensional volume occupied by each vector, i.e. at distance <em>x</em> or less.</p><p>The <em>d-</em>volume taken up by a single unit vector <em>u</em> is the space of the <em>d-</em>sphere with distance <em>x</em> or less from this vector is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;O((1-x)^d)&quot;,&quot;id&quot;:&quot;WXZTJLVFSP&quot;}" data-component-name="LatexBlockToDOM"></div><p>since we must allocate <em>x</em> distance of any competing vector in the unit direction <em>u</em>, and the remaining <em>1-x</em> is allocated over the remaining <em>d-1</em> dimensions (since vectors are unit norm). 
If we fix <em>x=0.5</em> for now (<em>x</em> is clearly not the dominating factor, since it only enters through the base of an exponential), we see we can fit roughly <em>O(2^d)</em> concepts in a <em>d</em>-sphere.</p><p>Running a simulation confirms this &#8211; modulo constants, the mean cosine distance between vectors and their nearest neighbors appears to plateau as you increase <em>d</em> and set the number of concept vectors to <em>2^d</em>.</p>
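<p>A minimal version of that simulation might look as follows (my own sketch of the experiment, not the exact notebook code; kept to small <em>d</em>, since the pairwise similarity matrix grows as <em>4^d</em>):</p><pre><code class="language-python">import numpy as np

def mean_nn_cosine_distance(d, rng):
    n = 2 ** d                                     # number of "concept" vectors
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # project onto the unit sphere
    sims = v @ v.T                                 # all pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)                # exclude self-matches
    return (1.0 - sims.max(axis=1)).mean()         # mean distance to nearest neighbor

rng = np.random.default_rng(0)
for d in range(4, 13):
    print(d, mean_nn_cosine_distance(d, rng))      # roughly flat as d grows
</code></pre>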
<div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/9bfe806b-b193-464a-ad53-fc745a12bbf1_888x387.png" alt="" /></figure></div><p>This bodes well for deep learning. Usually parameter count grows as <em>O(d^3)</em>, as you need to scale the feedforward matrices as <em>O(d^2)</em> and the number of layers proportionally to <em>d</em>. Concepts growing exponentially in <em>d</em> means that we win as we scale up. Yes, costs go up superlinearly, but the capacity of our space grows even faster. Worth noting, of course, is that with increasing complexity, the total number of concepts to model probably grows exponentially too.</p><h3>Position embeddings</h3><p>Enough of concepts! The most fascinating part of attention is not this, but rather the often overlooked way it handles <em>positional information</em>. As is commonly known, the model has no built-in concept of sequence order, unlike its predecessors RNNs and LSTMs, and the newer SSMs. To work around this, position information is embedded into the model by adding hard-coded position vectors, either onto the initial embeddings or onto the QK vectors.
Amazingly, this is sufficient for models to capture positional information over contexts of up to millions of tokens!</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/15af5f9c-5551-49fc-864a-f1f2f9abc23c_640x356.png" alt="" /><figcaption class="image-caption">Imagine if you had to read text like this</figcaption></figure></div>
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Imagine if you had to read text like this</figcaption></figure></div><p>These position embeddings are rich in information, and yet they are simple, man-made objects. The original position embedding vector from the Attention is All You Need (AIAYN) paper is constructed element-wise for a token at position <em>pos</em> by setting the value of each odd and even dimension <em>i</em> as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE(pos, 2i) = \\sin(pos *\\text{base_freq}_{2i})&quot;,&quot;id&quot;:&quot;AUPYYZMISJ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE(pos, 2i+1) = \\cos(pos *\\text{base_freq}_{2i})&quot;,&quot;id&quot;:&quot;TWBTENJDGW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where the base frequency of each dimension is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{base_freq}_{2i} = 10000^{-2i/d}&quot;,&quot;id&quot;:&quot;JFWVVKZPJU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tiAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd5653d-45d1-4fa5-9857-39f23454283d_320x176.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tiAO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd5653d-45d1-4fa5-9857-39f23454283d_320x176.gif 424w, https://substackcdn.com/image/fetch/$s_!tiAO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd5653d-45d1-4fa5-9857-39f23454283d_320x176.gif 848w, https://substackcdn.com/image/fetch/$s_!tiAO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd5653d-45d1-4fa5-9857-39f23454283d_320x176.gif 1272w, 
<div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/5cd5653d-45d1-4fa5-9857-39f23454283d_320x176.gif" alt="" /><figcaption class="image-caption">My reaction upon first seeing how position embeddings are constructed</figcaption></figure></div><p>People wiser than me may disagree, but when I first read AIAYN, I did not find it intuitive at all why these position embeddings should work at scale. Let&#8217;s first just grok what these embeddings are even doing.</p><p>The overall structure of the position embedding is to group embedding dimensions into pairs and, for each pair, to calculate the sine and cosine of the position index times some base frequency for that group. The base frequencies start off at their largest and shrink exponentially as you iterate through the groups.
Let&#8217;s look at what these position embedding values look like as you scan across the position index, for varying group indices.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/7dbbb9d3-3c5f-45d6-b2f3-df19aa68a86a_735x387.png" alt="" /></figure></div>
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lcRS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lcRS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 424w, https://substackcdn.com/image/fetch/$s_!lcRS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 848w, https://substackcdn.com/image/fetch/$s_!lcRS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 1272w, https://substackcdn.com/image/fetch/$s_!lcRS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lcRS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png" width="735" height="387" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:387,&quot;width&quot;:735,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lcRS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 424w, https://substackcdn.com/image/fetch/$s_!lcRS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 848w, https://substackcdn.com/image/fetch/$s_!lcRS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 1272w, https://substackcdn.com/image/fetch/$s_!lcRS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F552ebe54-54cc-434a-b8ef-bfa4eb73bedf_735x387.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you scan across the position of a sequence (i.e. go from first token to the last token), the position embedding values oscillate rapidly for the smaller indices of the position embedding, and very slowly for the larger indices. The smaller indices are very sensitive to changes in position, while differences in value for the larger indices only appear at greater position distances. The smaller indices are useful for differentiating tokens at short distances from each other, while the larger indices are useful at long distances.</p><h3>Digging into AIAYN position embeddings</h3><p>How much information is embedded within these position embeddings? Let&#8217;s take a look, starting with how well they can differentiate position indices. 
Each group <em>i</em> consists of a value <em>sin(pos f_i)</em> and <em>cos(pos f_i)</em> for base frequency <em>f_i</em>, so at distance <em>k</em> the squared Euclidean difference is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;[\\sin((pos+k) f_i) - \\sin(pos f_i)]^2 + [\\cos((pos+k) f_i) - \\cos(pos f_i)]^2 =&quot;,&quot;id&quot;:&quot;FOVMMOSDJI&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sin((pos+k) f_i)^2 + \\cos((pos+k) f_i)^2 + \\sin(pos f_i)^2 + \\cos(pos f_i)^2&quot;,&quot;id&quot;:&quot;FDSAVKOBXV&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;- 2[\\cos((pos+k) f_i) \\cos(pos f_i) + \\sin((pos+k) f_i) \\sin(pos f_i)] =&quot;,&quot;id&quot;:&quot;SYFPDXGPUE&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2(1 - \\cos(k f_i))&quot;,&quot;id&quot;:&quot;MOLFIDMJIN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Wow, that simplifies nicely. It&#8217;s entirely a function of distance! And the shape of the difference values exactly matches that of the values in the position embedding themselves (sinusoidal, with frequency dependent on the group index <em>i</em>). Add this up across all groups <em>i</em> and you get something not analytical, but still quite nice:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/dcac4256-a491-4123-a9ba-ff2b839baa62_711x387.png" alt="" /></figure></div>
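<p>The curve above is easy to reproduce numerically (a quick sketch; the choices of <em>d</em> and offset range are mine):</p><pre><code class="language-python">import numpy as np

d = 128
f = 10000.0 ** (-np.arange(0, d, 2) / d)       # base frequency f_i of each group
k = np.arange(512)[:, None]                    # position offsets 0..511
sq_dist = (2.0 * (1.0 - np.cos(k * f))).sum(axis=1)  # sum of 2(1 - cos(k f_i))
print(sq_dist[:5])                             # grows smoothly from 0 with offset
</code></pre>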
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcac4256-a491-4123-a9ba-ff2b839baa62_711x387.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:387,&quot;width&quot;:711,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29252,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YzRJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcac4256-a491-4123-a9ba-ff2b839baa62_711x387.png 424w, https://substackcdn.com/image/fetch/$s_!YzRJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcac4256-a491-4123-a9ba-ff2b839baa62_711x387.png 848w, https://substackcdn.com/image/fetch/$s_!YzRJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcac4256-a491-4123-a9ba-ff2b839baa62_711x387.png 1272w, https://substackcdn.com/image/fetch/$s_!YzRJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcac4256-a491-4123-a9ba-ff2b839baa62_711x387.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s do a quick sanity check. Is it possible with these position embeddings to produce a QK mapping which &#8220;selects&#8221; (i.e. attends maximally to) the token at distance <em>k</em> away from the current one? Turns out that the answer is yes, as is expected and fortunate for us. First, let&#8217;s start with QK vectors of equal dimension to the embedding dimension. 
<p>Nice. These position embeddings seem to work. One can differentiate tokens and select positions in the attention mechanism. Unfortunately, these position embeddings are not perfect. The structure of these position embeddings is such that identifying the distance between two tokens requires comparing the values group-wise &#8211; non-equal frequencies desynchronize rapidly, so comparing values across groups produces no clearly meaningful signal. This is bad for the attention mechanism: it effectively forces the QK mappings to be quite simple, just picking out subsets of the groups and mapping them onto the same indices in the Q and K vectors. More complex interactions between dimensions in the QK mappings are discouraged. It would be ideal if this could be avoided. Thankfully, people thought of this and came up with improvements.</p><h3>Modern position embeddings</h3><p>A lot of position embedding approaches have been tried. GPT-1 through GPT-3 used learned embeddings. GPT-4&#8217;s architecture is unknown. Nowadays, nearly all open-source models use RoPE embeddings. These position embeddings are quite nice &#8211; they are applied directly to the QK vectors!
This means that the QK mappings are free to extract and mix subsets of the embedding vector in an arbitrary manner, since position information is applied post-hoc.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/ec7aeb36-049c-4594-9d77-2d60dd52b1ca_1244x156.png" alt="" /><figcaption class="image-caption">Equation (13) from RoPE &#8211; rotation of the output values of QK mappings when d=2</figcaption></figure></div><p>RoPE goes back to
hardcoded sinusoidal values, but with some modifications justified through elegant math. In short, every consecutive pair of values in the output of each QK mapping is once again treated as a group, and is <em>rotated</em> by some angle based on its position and the base frequency of that group.</p>
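<p>In code, the rotation looks roughly like this (a minimal NumPy sketch of the idea, not any particular library&#8217;s implementation; shapes and names are mine):</p><pre><code class="language-python">import numpy as np

def apply_rope(x):
    seq, d = x.shape
    f = 10000.0 ** (-np.arange(0, d, 2) / d)        # base frequency per group
    a = np.arange(seq)[:, None] * f                 # rotation angle per position/group
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(a) - x2 * np.sin(a)  # 2-D rotation of each pair
    out[:, 1::2] = x1 * np.sin(a) + x2 * np.cos(a)
    return out

q = apply_rope(np.random.default_rng(0).standard_normal((16, 64)))
</code></pre><p>This is applied to the Q and K vectors after their projections, leaving the embeddings and the V vectors untouched.</p>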
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The elegance of this approach is that (1) the embedding vectors themselves are undisturbed by positional information, (2) translation invariance &#8211; rotation of one vector at position <em>m </em>and another at <em>n</em> is equivalent to rotation at 0 and <em>n - m</em>.</p><p>It is clear that the same argument as before with additive position embeddings apply here, and as such selection of a token at distance <em>k</em> away can be produced just by setting the appropriate Q and K values before the position rotation, although this case is non-trivial since there are no biases in attention mechanisms and the input vectors to the QK mapping can be arbitrary, so one is not guaranteed to be able to fix the values one wants. My guess for why this doesn&#8217;t matter is that the embedding vector presumably learns some hardcoded constant values to help the network out.</p><h3>Expanding context</h3><p>One significant benefit of hardcoded position embeddings is that there is some underlying structure to the values added post QK mapping that one would hope that the attention mechanism would pick up. 
<h3>Expanding context</h3><p>One significant benefit of hard-coded position embeddings is that there is underlying structure to the values applied post QK mapping &#8211; structure one would hope the attention mechanism picks up on. If this is the case, then one ought to be able to simply extend the context window beyond what the model was trained for and see natural interpolation and improved performance.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/d7f613cf-a912-4c0e-853a-7d96b7e29a79_1082x554.png" alt="" /><figcaption class="image-caption">The Position Interpolation paper</figcaption></figure></div>
<p>Unfortunately, this is not what happens by default. When one naively adds more context, one tends to see perplexity explode. There are a few hypothetical explanations, but the most common one is that the model never learns to ignore context with unseen position embeddings, so attention scores get messed up.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/f0cb1b8a-620d-4dc8-b4dd-737ae0c1f7f7_1470x780.png" alt="" /><figcaption class="image-caption">The YaRN paper &#8211; models degrade catastrophically with naively increased context</figcaption></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0cb1b8a-620d-4dc8-b4dd-737ae0c1f7f7_1470x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219880,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iOO9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0cb1b8a-620d-4dc8-b4dd-737ae0c1f7f7_1470x780.png 424w, https://substackcdn.com/image/fetch/$s_!iOO9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0cb1b8a-620d-4dc8-b4dd-737ae0c1f7f7_1470x780.png 848w, https://substackcdn.com/image/fetch/$s_!iOO9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0cb1b8a-620d-4dc8-b4dd-737ae0c1f7f7_1470x780.png 1272w, https://substackcdn.com/image/fetch/$s_!iOO9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0cb1b8a-620d-4dc8-b4dd-737ae0c1f7f7_1470x780.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The YaRN paper &#8211; models degrade catastrophically with naively increased attention</figcaption></figure></div><p>The easiest solution is Position Interpolation &#8211; simply drag out the position embeddings by reducing the base frequencies by a constant proportional to the increase in context length. 
<p>YaRN improves on this by recalling that the RoPE frequency groups play different roles: the earlier, high-frequency indices encode short-range position information, while the latter, low-frequency indices encode long-distance information. One only needs to stretch the latter to improve long-context performance &#8211; the former can be left untouched. In addition, one must rescale the attention logits (a temperature on the softmax) to account for the increased number of tokens. I would go over the math here, but the intuition and empirical results are robust enough that I find it worth skipping.</p>
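<p>For those who want the flavor of it anyway, here is a rough sketch of the &#8220;interpolate by parts&#8221; idea (my own simplification with illustrative ramp cutoffs &#8211; YaRN derives them from each frequency&#8217;s wavelength relative to the trained context length):</p><pre><code class="language-python">import numpy as np

def yarn_like_freqs(dim, base=10000.0, scale=4.0, lo=0.25, hi=0.75):
    # Vanilla RoPE frequencies, highest (shortest-range) first.
    i = np.arange(dim // 2)
    freqs = base ** (-2.0 * i / dim)
    # Ramp from 0 (keep: short-range, high-frequency groups) to
    # 1 (fully interpolate: long-range, low-frequency groups).
    # The lo/hi cutoffs here are illustrative, not the paper's values.
    t = np.clip((i / (dim // 2) - lo) / (hi - lo), 0.0, 1.0)
    return (1 - t) * freqs + t * (freqs / scale)

def qk_scale(scale):
    # YaRN's empirically fitted attention temperature: queries and keys
    # are each multiplied by this, so logits grow by its square.
    return 0.1 * np.log(scale) + 1.0
</code></pre>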
<p>In either case, I find it interesting and a little ironic that the original motivation for hard-coded position embeddings was that the model would generalize beyond the context it was trained on, and yet the only way people have been able to get them to generalize has been to &#8220;stretch out&#8221; the space the model was trained on to cover more context.</p><h3>Overall thoughts</h3><p>People don&#8217;t spend enough time thinking about the position embeddings of Transformers. From all I can see, this is literally the <em>main</em> difference between Transformers and every other contending NLP model family &#8211; LSTMs, SSMs, convolutions, etc. Rather than hard-code sequential information into the very architecture of the model, the Transformer does something radical &#8211; it <em>demotes</em> the importance of position, stating <em>&#8220;you are not important enough to be an intrinsic part of the architecture. We will only pay attention to you through the position embeddings you pass as input!&#8221;</em></p><p>Or perhaps it says something else&#8230; that position is so important and complex that it must be <em>promoted</em>, that hard-coded architectural information is so limiting that the model must instead learn the nature of positional relations by itself. I like this interpretation a lot. Language is incredibly complex. The fact that books contain words referring back across pages upon pages of earlier text suggests that information flow is not particularly well-behaved: it is not ordered sequentially. Perhaps, then, we should not force our models to learn in this manner.</p><p>This is one reason I am concerned for SSMs, despite all the hype around them right now. I would love to be proven wrong. It&#8217;s great to have competition, and many contending architectures. Nevertheless, sequential structure baked in around position feels intrinsically limiting to these models &#8211; you&#8217;re forcing them to understand the world through a restricted view. The O(n^2) attention penalty must be beaten, yes, but perhaps it can be done without going back to the dark age of architecturally hard-coded positional information.</p><p>Perhaps this is a lesson we should extend to other parts of the Transformer&#8230; like&#8230; idk&#8230; the universally hated BPE tokenizer? &#128579;</p><h3>What&#8217;s next?</h3><p>Infinite context. Jk. Increasing context length is definitely coming, but infinite context requires some dramatic changes, including getting rid of this business of naively extending the attention mechanism. Thankfully we can wait quite a bit &#8211; the unit sphere in <em>d</em> dimensions holds exponentially many nearly-orthogonal directions, and thus exponentially growing amounts of positional information (a quick empirical illustration follows below). We can probably scale to context lengths in the millions with just existing attention techniques. Already at this scale, we are struggling to find use cases that can actually consume that much context (while affording the computational cost).</p>
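<p>A quick empirical version of that claim (my own toy experiment): random unit vectors in <em>d</em> dimensions interfere with each other at a rate of roughly 1/sqrt(d), so growing the QK dimension buys far more than linear gains in how many positions stay distinguishable:</p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n, d):
    # Sample n random unit vectors in R^d and report the largest pairwise
    # |cosine similarity| -- i.e. how confusable the worst pair is.
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    sims = np.abs(v @ v.T)
    np.fill_diagonal(sims, 0.0)
    return sims.max()

for d in (64, 256, 1024):
    print(d, round(max_abs_cosine(2000, d), 3))
# Worst-case interference shrinks roughly like 1/sqrt(d), so the number of
# nearly-orthogonal directions (distinguishable positions) grows exponentially in d.
</code></pre>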
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/558026ef-6e5b-4a00-921e-0e27078fbb0d_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:457822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tDZd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558026ef-6e5b-4a00-921e-0e27078fbb0d_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!tDZd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558026ef-6e5b-4a00-921e-0e27078fbb0d_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!tDZd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558026ef-6e5b-4a00-921e-0e27078fbb0d_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!tDZd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F558026ef-6e5b-4a00-921e-0e27078fbb0d_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Let the battle between intelligences commence!</figcaption></figure></div><p>About a year ago, OpenAI announced GPT-4. This foundation model stunned the world and, most would argue, held the spot as the best commercially available LLM until the announcement of Claude-3 last week.</p><p>During this time, we&#8217;ve seen massive movement in the AI industry across the entire stack. Nvidia is slowly seeing growing competition in AMD&#8217;s MI300X chip, Google&#8217;s home-grown TPUs, and startups like Groq and Cerebras. 
There are now foundation model alternatives to GPT-4, including Gemini Ultra, Claude 3, and Mistral Large, with more expected soon. The code generation space is no longer entirely monopolized by GitHub Copilot, with Codeium, Cursor, and other startups making significant progress.</p><p>Now that we are a bit further into the AI revolution, the bits and pieces are starting to fall into place, so I figured it would be a good time to take stock and make some predictions for where I think things might be headed.</p><h3>1. The consumer AI space will be winner-take-all</h3><p>I define the consumer AI space as ChatGPT Plus, Gemini Advanced, Claude Pro, and all the other monthly subscriptions, and I believe that this market will be very big. For comparison, Netflix, Spotify, and Amazon Prime each have roughly a quarter billion subscribers. Suppose the AI assistant market grows to roughly that size over time, and that pricing stays in the $10-20 per month range. That amounts to $30-60b in revenue per year. For comparison, the net income of Apple in 2023 was roughly $100b, for Microsoft and Google roughly $75b, and for Meta roughly $40b &#8211; and these are all trillion-dollar companies. If the consumer AI space can maintain healthy margins, it could literally be a trillion-dollar opportunity.</p>
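<p>The back-of-the-envelope math, spelled out (my own arithmetic, using the round numbers above):</p><pre><code class="language-python"># Consumer AI market sizing with Netflix/Spotify/Prime-scale adoption.
subscribers = 250e6
for price in (10, 20):  # $ per month
    annual = subscribers * price * 12
    print(f"${price}/mo -> ${annual / 1e9:.0f}B per year")
# $10/mo -> $30B per year
# $20/mo -> $60B per year
</code></pre>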
<p>Of course, it is not yet clear whether the future of consumer AI will be a paid subscription or a free, good-enough version. Most people I&#8217;ve talked to still do not pay for ChatGPT Plus, not to mention any of the more recent alternatives such as Gemini Advanced or Claude Pro. However, everyone I know who pays for these subscriptions literally cannot imagine life without them. People who work in the AI space are trying all of the alternatives, but feel overwhelmed. It seems obvious to me that people will continue to want an AI assistant for the common use cases we&#8217;ve seen, such as question answering and research, and it feels almost predetermined that this space, just like search and social media, will converge towards a single winner.</p><p>My sense is that either OpenAI, Google, or Apple <em>should</em> win this space if they execute properly. The OpenAI case is straightforward &#8211; ChatGPT is a word that has already found its place in the common vocabulary, and most people I know who pay for an AI assistant pay for ChatGPT Plus and nothing else. The cases for Google and Apple are different but also quite obvious &#8211; they can integrate directly into your devices, calendars, emails, and search, and have a good chance of winning by nature of existing distribution channels.</p><h3>2. Open-source is falling behind</h3><p>A year ago, when the LLM revolution was taking off, there was significant uncertainty about how open-source vs closed-source models would play out. Some felt it was unlikely that open-source would be able to compete with closed-source labs, effectively giving away for free a technology that costs significant amounts of money to produce, while others felt that the crowd-sourced wisdom and work of the masses would allow open-source models to catch up to and surpass those of closed labs, and pointed to other developer tools where open-source won out.</p><p>Roughly one year has passed, and my impression is that open-source is falling behind. In particular, it is falling behind in precisely the ways that people were worried about a year ago &#8211; the community has produced tremendous advances in fine-tuning and other post-training techniques, but has only been able to apply them to small models on the scale of 7B parameters or less. There are some larger open-source models like Llama 2 70B and Mixtral 8x7B, but these are effectively generous donations by large research labs and have not been reproduced by smaller players. Furthermore, it does not appear that the open-source community yet has the compute capacity to significantly improve on these larger models the way it has improved on the smaller ones. Even if it did, these models are a significant step behind the frontier of GPT-4, Claude 3, and Gemini Ultra.</p><p>Nothing surprising has played out here so far &#8211; GPU, data, and talent costs are prohibitively high right now, and there was never a clear economic argument for how a company could survive while giving away its product for free. Meta AI doesn&#8217;t need to make a profit, so it will probably continue to hand out Llamas for free, but Mistral and Musk&#8217;s xAI have grokked the economics behind LLMs and have since late last year stopped open-sourcing their best models.</p><p>The one counterargument to the above, which should bring hope to the open-source community, is that open-sourcing models has proven to be an incredibly powerful marketing tool for tier-2 research labs. One of the best models on HuggingFace today is <a href="https://huggingface.co/Qwen">Qwen 72B</a>, a model produced by Alibaba Cloud. Prior to this release, I, as well as many of my friends in the AI space, had never heard of Alibaba&#8217;s AI research lab, but with it they&#8217;ve proven to be quite a formidable research team. The same dynamic played out with Mistral AI a few months earlier. The marketing value of open-sourcing models means that we are likely to continue to see improvements in open-source models, even though the economics of giving away your product for free doesn&#8217;t make sense at first glance.</p><p>Unfortunately, I don&#8217;t think that this dynamic is going to be sufficient, and I expect the gap between open and closed-source models to continue to grow. The reason ties into the next section &#8211; the cost of these models is getting serious.</p><h3>3. On the scale and cost of foundation models</h3><p>The big AI research labs no longer publish their research, so it is much harder to get a sense of what it takes to produce the models we&#8217;ve seen come out over the past few months, but my takeaway from the bits and pieces we&#8217;ve seen is that the scale and cost of these models is starting to get really serious.</p><p>A few example points:</p><ul><li><p>Google DeepMind mentions in their Gemini paper that their most advanced model was trained <em>across data centers</em>. Gemini Ultra was not trained on a single massive Google-sized data center &#8211; it was trained on <em>multiple</em> such data centers.</p></li><li><p>It was revealed that Reddit had come to an agreement with Google and other anonymous AI labs to sell access to their data for $60 million per year. These labs are scrubbing the internet for all the data they can find, and this does not come cheap.</p></li><li><p>Over the course of 2023, Anthropic raised $750 million and came to an agreement with Google to raise another $2 billion over time. It is likely that a significant amount of this money went towards training Claude 3.
Similarly, Musk&#8217;s xAI is supposedly looking to raise $6 billion at a $20 billion valuation, whereas it was previously rumored to be looking to raise only $1 billion. xAI currently has only around 20 employees, so most of the money is not going towards labor. Presumably most of it is going towards compute.</p></li><li><p>Meta AI is building out a 350,000 H100 GPU data center this year, with each H100 going for around $30k, which adds up to $10 billion or so in H100 costs alone.</p></li></ul><p>All together, it looks like the industry will spend tens if not hundreds of billions of dollars over the next few years, and most of it looks to be going towards compute first and data second. Amusingly, if we suppose that compute is the bottleneck to scaling AI further, that the total expenditure for the next few years is around $100 billion, and that the AI labs want this number to grow 100x, then we suddenly find ourselves surpassing Sam Altman&#8217;s $7 trillion number. Perhaps he knew all along.</p>
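<p>Spelling out that arithmetic (my own rough numbers, taken straight from the bullets above):</p><pre><code class="language-python"># The H100 build-out math from the list above.
h100s, unit_cost = 350_000, 30_000
print(f"Meta H100 capex: ${h100s * unit_cost / 1e9:.2f}B")  # $10.50B

# If ~$100B of near-term spend needs to grow 100x to keep scaling...
print(f"Implied spend: ${100e9 * 100 / 1e12:.0f}T")  # $10T, past Altman's $7T
</code></pre>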
<h3>4. Frontier model capabilities are hard to evaluate</h3><p>When ChatGPT first came out, people were stunned that it could solve LeetCode-style coding problems, answer riddles, and call functions given a proper interface. Back in those good old days, evaluation was easy, because the capabilities of those models were simply not that extensive. We had not yet saturated MMLU (undergraduate-level academic knowledge questions), GSM8K (grade-school math problems), HellaSwag (common knowledge), and many other beloved benchmarks.</p><p>With the release of Claude 3 and its stunning display of capabilities, we are entering a very interesting era in which the average human being is, in many ways, insufficiently intelligent to properly evaluate frontier foundation models. Now, this does not mean that GPT-4 and Claude 3 are AGI-level models (there are still significant capability gaps), merely that these models now have capabilities in some domains that significantly surpass those of average humans.</p><p>One particularly relevant example is the new <a href="https://arxiv.org/abs/2311.12022">GPQA benchmark</a> (Google-Proof Question Answering), which asks PhD-level questions that are sufficiently difficult that non-PhDs with access to Google and significant time to think are unable to answer them. On the diamond set containing their highest-quality questions, highly skilled non-expert humans with access to Google get around 22% correct, GPT-4 gets 36%, Claude 3 gets 50%, and PhD experts get 81%. In other words, current frontier model capabilities sit somewhere between skilled and expert humans, which means that we now need PhDs to evaluate them on scientific question-answering tasks.</p><p>I have two takeaways from this observation. First of all, model intelligence is no longer the bottleneck for most LLM applications, since a vast portion of cognitive work in the modern economy does not depend on PhD-level expertise. Rather, context, reasoning, consistency, style, cost, latency, and other factors become more important. The risk here is that AI research labs overly focus on the aspects of models that are easy to evaluate, such as these PhD-level science questions, rather than on other aspects that are harder to evaluate but more important for producing downstream economic value (<a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart&#8217;s law</a>). Or, alternatively, that we in due time discover that LLMs&#8217; strengths lie in information compression, and that other capabilities like reasoning and planning do not emerge or grow as rapidly as we scale models up further. This will be very interesting and important to watch over the next few years.</p><p>The second takeaway is that, in my opinion, running out of data no longer appears to be as big a problem as it has been made out to be. The point of collecting more data is for the model to learn more kinds of facts, reasoning, etc., but more data doesn&#8217;t help if all of it is low quality or sampled from the same domain. In a year or two, once frontier models solve GPQA and reach PhD-level capabilities across scientific domains, it becomes far less clear what value more internet-quality data contributes to these models.</p><p>More likely, the big AI labs will pivot to generating their own custom datasets by collaborating with experts, universities, and other leading institutions, as well as training in settings unbounded by the quality of existing data (e.g. it&#8217;s been rumored that Claude 3 was extensively trained using reinforcement learning). Simulations, sandboxed code environments, and other data generation approaches will likely make up a growing portion of the training data going into these models over time.</p><p>One big question here is how this affects scaling laws &#8211; the work that has been published by OpenAI and Google DeepMind only covers standard hyper-parameters like model size, token count, expert count, and some RLHF-related ones. These existing laws clearly don&#8217;t hold in the speculated new regime where data quality is significantly higher and quantity significantly lower. Though the labs are presumably aware of this and planning accordingly, they have not published anything on the topic, which means that we, the public, will have to come to terms with having incomplete foresight into how scaling laws will play out in the coming few years.</p>
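<p>For reference, the published form of these laws &#8211; e.g. the Chinchilla fit from DeepMind&#8217;s <a href="https://arxiv.org/abs/2203.15556">Hoffmann et al. (2022)</a> &#8211; predicts loss purely from parameter count N and token count D, with no notion of data quality at all:</p><pre><code class="language-python">def chinchilla_loss(n_params, n_tokens):
    # Parametric fit L(N, D) = E + A/N**alpha + B/D**beta, with the
    # constants fitted in Hoffmann et al. (2022).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Chinchilla itself: 70B parameters, 1.4T tokens -> loss of about 1.9.
print(chinchilla_loss(70e9, 1.4e12))
</code></pre><p>A fit like this implicitly assumes &#8220;more of the same&#8221; data, which is exactly the assumption the speculated new training regimes would break.</p>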
<h3>5. Google&#8217;s Gemini fiasco is concerning for alignment</h3><p>As many of you know, Google DeepMind&#8217;s Gemini model was recently roasted for generating diverse people in situations where it made no sense.</p><div class="captioned-image-container"><figure><figcaption class="image-caption">Source: <a href="https://www.theverge.com/2024/2/22/24079876/google-gemini-ai-photos-people-pause">The Verge</a></figcaption></figure></div><p>I&#8217;m not going to comment on the cultural forces or the leadership dynamics at Google that led to this outcome. Instead, I want to highlight that this demonstrated a terrible failure on Google DeepMind&#8217;s part to properly align their AI models, and although no harm was caused by Gemini generating diverse Nazis, this is rather discomforting when thinking ahead to the impact that future models will have and how fast the technology is progressing.</p><p>My concern is that Google DeepMind is an absolutely outstanding team, amongst the very best in the world, with near-endless resources, and despite all of this it failed to align the model to its own world view. Some may object and claim that the model indeed was aligned &#8211; that Google wanted the model to be excessively pro-diversity &#8211; but I would object back. No human being who is supportive of diversity (as far as I know) would ever consider depicting diverse Nazis, because it makes no sense once you think about the underlying motivation for why diversity is important.</p><p>Humans who support diversity do so because diverse viewpoints lead to better outcomes (or, as they might say more bluntly, a room full of old white men probably doesn&#8217;t fully understand the preferences of minorities, women, LGBTQ+ people, etc), and because it helps break down stereotypes about which kinds of people are expected to serve which roles in society. The motivation behind the &#8220;diversity intervention&#8221; in foundation models is that, relative to a diversity-egalitarian society, our historical data is skewed towards a distribution that is suboptimal due to historical factors, and thus the models need to be recalibrated post-training to correct for this deficit.</p><p>Diverse Nazis do not serve either of these motivating reasons. The fact that Gemini proactively generated such images means that it failed to properly grasp the values and principles it had been trained to obey.
It did not understand the sociopolitical dynamics that made diversity such a critical priority for the Google team, and as such failed to responsibly promote diversity once deployed in the real world. This, despite the fact that Google DeepMind is one of the best AI research labs in the world, and that diversity clearly is one of the most important values for the Google leadership team.</p><p>Why was Gemini improperly aligned &#8211; why did it fail to capture the underlying dynamics of how to be pro-diversity? My guess is that it has to do with spurious correlations and an insufficiently intelligent base model. There&#8217;s a good amount of research showing how smaller models are susceptible to bias due to spurious correlation (search &#8220;transformers text spurious correlation&#8221; on Google Scholar for examples), where, for example, the mere presence of a word like &#8220;Spielberg&#8221; can trick a model into thinking that a movie review is positive even when it is not. As we&#8217;ve scaled up models and their capabilities have improved, this has become less of an issue, but it is still something we see from time to time.</p>
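<p>A toy illustration of that failure mode (entirely synthetic, my own construction): a small bag-of-words classifier trained on reviews where &#8220;spielberg&#8221; only ever co-occurs with positive labels will happily call a negative review positive just because the name appears:</p><pre><code class="language-python">from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic training set: "spielberg" only ever appears in positive reviews.
reviews = [
    "spielberg delivers a masterpiece", "a wonderful spielberg classic",
    "spielberg at his finest, loved it", "great acting and a moving story",
    "dull plot and terrible pacing", "a boring mess, complete waste of time",
    "awful dialogue, i hated it", "flat characters and no tension",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

# A clearly negative review -- but it mentions spielberg.
print(model.predict(["a dull, terrible film, even for spielberg"]))  # likely [1]
</code></pre>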
<p>In this case, I think that Gemini Pro was likely a smaller model (relative to GPT-4 and Gemini Ultra) that lacked a deep understanding of the world, and so if the post-training process lacked explanations for why promoting diversity is good, along with examples of when to abstain from promoting it, it makes sense that the model would blindly produce diverse images in all situations. It&#8217;s also not clear how much of this behavior was due to the system prompt versus actually ingrained into the model by training, but in either case the concern remains &#8211; for all we know, the system prompt did not explicitly state that the model should present people as diverse &#8220;in all circumstances with no exceptions&#8221;, and so you would expect a reasonably intelligent system with common sense to behave better than Gemini did.</p><p>A question this brings up for me is how this will change as model capabilities continue to grow. A more intelligent base model will probably have a better understanding of the underlying motivations behind diverse representation, and as such should avoid making the mistakes Gemini made, but the underlying problem persists. As we strive to align models to our values, there will be other deeply complex issues that the models are insufficiently capable of understanding. All we can hope is that by then we will have figured out how to make models more aware of their limitations, so that they don&#8217;t make harmful decisions that they believe are aligned with our goals and values.</p><h3>6. Applications &#8211; performance, cost, and integration</h3><p>Based on what I can see on the internet and from conversations with friends, it appears that certain LLM use cases have taken off much faster than others. I&#8217;m sure this will change as capabilities improve and new applications come out, but for now, some of the most common use cases I&#8217;ve seen are:</p><ol><li><p>Research (recommendations, explaining concepts)</p></li><li><p>Role-play (AI friends/relationships, therapy)</p></li><li><p>Writing (homework, articles)</p></li><li><p>Coding (either through Copilot or ChatGPT)</p></li></ol><p>There are some applications and use cases I&#8217;ve heard a ton of people ask for that do not yet exist or do not yet work very well, such as scheduling/AI executive assistants, automating call center work (requested by every friend working at a B2C startup), better coding (e.g. AI writing complete pull-requests), and lots of niche B2B paperwork tasks.</p><p>Amusingly, the current limitations also explain the most common current use cases:</p><ol><li><p>LLMs&#8217; greatest strength right now is memorizing an internet-scale number of facts. This makes recommendations and research a great use case.</p></li><li><p>LLMs hallucinate and make mistakes. When talking to someone/something, mistakes are common and not a significant issue.</p></li><li><p>LLMs have seen millions of essays and articles. If you have an idea but need help writing it up, they can help you get the structure right. They do not yet have the creativity or critical thinking to write original essays by themselves.</p></li><li><p>A large portion of programming (not software engineering) is just boilerplate code &#8211; figuring out the right functions to call and what data to pass to them. LLMs handle this wonderfully.</p></li></ol><p>What&#8217;s holding LLMs back from doing more? Three things &#8211; performance, cost, and integration.</p><p>Performance is obvious. Any builder who has spent time with OpenAI&#8217;s GPT-4 API will have stories about how it will forget a rule repeated multiple times in the prompt, make basic reasoning mistakes, and generally lack the common sense to perform human tasks without having every detail specified in the prompt &#8211; and even then find ways to make dumb mistakes. The good news here is that models have continued to get better over time, and there are no signs of this trend slowing down.</p><p>Cost is also obvious. The Claude 3 release was amazing, but the cost is still prohibitive at $15 for a million prompt tokens and $75 for a million generated ones. Most real tasks in the enterprise world, based on my experience and conversations with a range of people, require tens of thousands of tokens in the prompt and hundreds if not thousands of tokens to be generated. $1 to automate a single task &#8211; not counting development, cloud, and so on &#8211; is for many tasks even more expensive than human labor. The good news is, once again, that costs have been coming down and do not look to be slowing down.</p>
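<p>To make the cost claim concrete, here is a rough per-task estimate using the list prices above (the token counts are hypothetical, just representative of the enterprise tasks described):</p><pre><code class="language-python"># Claude 3 Opus list prices, per token.
PROMPT_PRICE = 15 / 1e6   # $15 per million prompt tokens
OUTPUT_PRICE = 75 / 1e6   # $75 per million generated tokens

def task_cost(prompt_tokens, output_tokens):
    return prompt_tokens * PROMPT_PRICE + output_tokens * OUTPUT_PRICE

# A representative enterprise task: ~50k tokens of context, ~2k generated.
print(f"${task_cost(50_000, 2_000):.2f} per task")  # $0.90
</code></pre>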
<p>Lastly, we have integration, which in my opinion is the most interesting one. When I was working at AKASA (an AI healthcare startup), I saw several tasks at hospitals that were handled manually and were very automatable, but were extraordinarily painful to automate due to difficulties both reading data from and writing data to legacy healthcare systems. Based on what I hear from people working in other enterprises, the story is more or less the same across industries.</p><p>This is one reason I don&#8217;t believe in the narrative that &#8220;GPT wrappers have no moat&#8221;. Sure, calling an AI API is easy, but you know what else is (relatively) easy? Spinning up a CRUD app and cold calling small businesses &#8211; yet that describes half of all B2B SaaS startups in YC, and they have had no problem building real businesses with healthy margins. Integration is a valid moat because it is necessary, slow, and often really hard! The main reason I think Google or Apple could steal the consumer AI market from OpenAI and others is integration into existing products that people already use (mail, calendar, devices).</p><p>I find it interesting that people seem rather unimaginative and short-sighted when talking about how to integrate LLMs. They talk about RAG, connecting to their database, and maintaining memory. Is integrating LLMs into a codebase really necessary if, at some point in the future, we can give an LLM a sandboxed terminal, have it pull an entire codebase, and have it implement and push entire pull-requests containing entirely new features? If an LLM were given a computer and a phone, just like the other call center workers, would we need custom integration? All of the above, of course, presumes we know that the model is aligned and safe to deploy.</p><p>We&#8217;re nowhere near these kinds of capabilities yet, but if current trends continue, it seems reasonable to expect this to happen eventually. As performance goes up and cost comes down, integration will remain the final bottleneck to unlocking the value of AI in applications. Let&#8217;s be more creative and serious about what integrating AI into our enterprises and society might look like.</p><h3>7. Lastly, a personal announcement</h3><p>I&#8217;m excited to announce that I&#8217;ll be joining OpenAI&#8217;s applied research team next week! As such, I&#8217;ll unfortunately not be writing as much about LLMs and AI research for some time, to steer clear of potential leaks. That aside, I&#8217;ll continue to write about other topics that interest me. Stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[Survival is getting more expensive]]></title><description><![CDATA[A quick look at some economic data]]></description><link>https://www.hallman.com/p/survival-is-getting-more-expensive</link><guid isPermaLink="false">https://www.hallman.com/p/survival-is-getting-more-expensive</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Thu, 26 Oct 2023 06:03:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xMO4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe674ac90-0d92-4cb1-8973-a156be7b3311_2628x868.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sometimes, I can&#8217;t help but think that life seems to have gotten more expensive over the past few decades.
Here&#8217;s a casual look at some data to see whether this hypothesis holds.</p><h3>Median income</h3><p>The Federal Reserve Bank of St. Louis publishes a ton of useful historical price data. Their records show that median income in the U.S. has grown from ~$22.5k in 1985 to ~$75k today.</p><p>(<a href="https://fred.stlouisfed.org/series/MEHOINUSA646N">Source</a>)</p><h3>Median house prices</h3><p>FRED once more comes in clutch, informing us that the median house back in 1985 cost around $80k, which has risen to around $430k today.</p><p>(<a href="https://fred.stlouisfed.org/series/MSPUS">Source</a>)</p><h3>Personal healthcare expenditure</h3><p>FRED also tracks personal healthcare expenditure, which I find more useful than spending per capita given the confusing spending patterns of healthcare in the U.S.</p><p>In this case, spending has gone from around $300/y to $3,000/y over the same timespan, from 1985 until today.</p><p>(<a href="https://fred.stlouisfed.org/series/DHLCRC1Q027SBEA">Source</a>)</p><h3>College</h3><p>The Education Data Initiative tracks historical prices and informs us that public college tuition plus other fees have grown from around $1.3k/y back in 1985 to $9.5k/y today.</p><p>(<a href="https://educationdata.org/average-cost-of-college-by-year">Source</a>)</p><h3>Childcare</h3><p>FRED only reports an index here, which is disappointing but perhaps makes sense given the broad range of childcare offerings across the country.
With 1982-1984 as the baseline (=100), 1985 saw the index rise by 15% to 115, while today it sits at 840.</p><p>(<a href="https://fred.stlouisfed.org/series/CUSR0000SEEB">Source</a>)</p><h3>Food</h3><p>The index for food in urban centers has gone from 105 in 1985 to 320 today.</p><p>(<a href="https://fred.stlouisfed.org/series/CPIUFDSL">Source</a>)</p><h3>Transportation</h3><p>Car prices, in inflation-adjusted terms, have remained flat or even decreased a bit since the mid-1900s!</p><p>(<a href="https://www.titlemax.com/discovery-center/planes-trains-and-automobiles/how-much-did-popular-cars-cost-every-year-since-1950-in-2020-dollars/">Source</a>)</p><h2>Breakdown of consumer expenditure</h2><p>The Bureau of Labor Statistics reports breakdowns of average household expenditure patterns, and finds the following for the median household in 2022:</p><p>Housing: 33.3%</p><p>Transportation: 16.8%</p><p>Food: 12.8%</p><p>Personal insurance and pensions: 12.0%</p><p>Healthcare: 8.0%</p><p>Entertainment: 4.7%</p><p>Other: 4.1%</p><p>Cash contributions: 3.8%</p><p>Apparel and services: 2.7%</p><p>Education: 1.8%</p><p>(<a href="https://www.bls.gov/news.release/pdf/cesan.pdf">Source</a>)</p><p>Given this breakdown, we can calculate an alternative inflation metric &#8211; the average of the inflation rates for each of the categories above for which we have data, weighted by how much the average person spends on each category today.</p><p>Feel free to do the math yourself, but I arrive at the following numbers:</p><p>Housing: +437.5% (4.5%/y)</p><p>Transportation: let&#8217;s leave this at 0</p><p>Food: +204.8% (3.0%/y)</p><p>Personal insurance and pensions: let&#8217;s keep this at 0 too</p><p>Healthcare: +900% (6.2%/y)</p><p>Entertainment: let&#8217;s leave this and everything else at 0</p><p>Note that wages have grown +233.3% over this time period (3.2%/y).</p><p>Counting only the rising costs of housing, food, and healthcare (leaving everything else at 0), we get an inflation rate of 0.333 x 4.5 + 0.128 x 3.0 + 0.08 x 6.2 = 2.38%.</p><p>However, noting that these three categories account for only 54% of total expenditure, we can renormalize by dividing by 0.54 and get a different number: 4.4%, the weighted inflation of &#8220;survival&#8221; (housing + food + healthcare).</p><p>That is only 1.2 percentage points per year above median wage growth over the same period, but small differences compound &#8211; 1.2% growth over 10 years results in a 12.7% total price change, and a 57% increase over the 38 years from 1985 until today.</p>
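<p>For the skeptical, here is the whole calculation as a few lines of Python (same numbers as above):</p><pre><code class="language-python">years = 2023 - 1985  # 38 years

def annualized(ratio):
    # Convert a total price ratio over the period into a yearly growth rate.
    return ratio ** (1 / years) - 1

# Total 1985 -> today price ratios from the series above.
housing = annualized(430 / 80)       # ~4.5%/y
food = annualized(320 / 105)         # ~3.0%/y
healthcare = annualized(3000 / 300)  # ~6.2%/y
wages = annualized(75 / 22.5)        # ~3.2%/y

# BLS 2022 expenditure weights, renormalized over these three categories.
survival = (0.333 * housing + 0.128 * food + 0.08 * healthcare) / 0.541
print(f"survival: {survival:.1%}/y vs wages: {wages:.1%}/y")  # ~4.4% vs ~3.2%
</code></pre>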
<p>Of course, this doesn&#8217;t account for improvements in the quality of goods, distributional effects (e.g. inequality, city vs rural real estate, etc.), deflation in many other categories, and other improvements in quality of life over the same time period.</p><p>That said, the feeling that &#8220;survival&#8221; is getting more expensive seems reasonable from this initial glance.</p>]]></content:encoded></item><item><title><![CDATA[It is still early for open-source AI]]></title><description><![CDATA[Despite all the recent action in the open-source AI space, we are still years away from unlocking its true potential]]></description><link>https://www.hallman.com/p/it-is-still-early-for-open-source</link><guid isPermaLink="false">https://www.hallman.com/p/it-is-still-early-for-open-source</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Sat, 29 Jul 2023 00:38:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!23WJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ebf0c50-5fbc-4d96-8f93-8fc0ea514821_1440x998.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI applications fall on a spectrum.</p><p>Some companies build applications on top of APIs, and some companies train their own models from scratch. Some fall in between, leveraging open-source models, perhaps even fine-tuning them further on their own data, or using the fine-tuning APIs provided by closed-source LLM providers.</p><p>No one knows for sure which approach will perform best in the long term. Some predict a world where a few companies provide APIs that all other companies end up using, while others predict that training will become so easy and ubiquitous that everyone will have their own model.</p><p>At this point in time, most successful AI companies either train models from scratch (Midjourney, ChatGPT, Character AI) or build on top of APIs (Jasper, Copy AI, Harvey). Fewer companies have launched successful products built on top of open-source models.</p><p>That said, there is a lot of optimism around open-source. With the <a href="https://www.semianalysis.com/p/google-we-have-no-moat-and-neither">leaked Google memo</a>, the release of <a href="https://ai.meta.com/llama/">LLaMA 2</a> and other increasingly powerful models, and <a href="https://crfm.stanford.edu/2023/03/13/alpaca.html">other</a> <a href="https://huggingface.co/timdettmers/guanaco-65b">amazing</a> <a href="https://github.com/cocktailpeanut/dalai">developments</a>, there&#8217;s a growing sense that open-source will in time catch up to closed-source and then surpass it.</p><p>There are some strong arguments to be made. Model training is incredibly expensive and requires expertise that is hard to find and hire. It makes economic sense for companies to either build on top of APIs or skip the pre-training stage by grabbing an off-the-shelf open-source model. Between the two, open-source gives you significantly more control over your model and data.</p><p>In the long term, I too am excited about open-source. Unfortunately, there are reasons why I think we won&#8217;t see significant adoption of open-source LLMs in applications for another several years. Past that point, it is still an open question whether open-source can catch up to closed-source.</p><h2>Open-source AI faces challenges</h2><p>A company that wants to build a product on top of an open-source model faces several challenges today.
The first one is size.</p><div class="captioned-image-container"><figure><figcaption class="image-caption">Source: <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">Huggingface Open LLM Leaderboard</a></figcaption></figure></div><p>The open-source community focuses primarily on small models &#8211; on the order of 13 billion parameters or less. A few bigger labs have released models in the 30-70 billion parameter range, and BLOOMZ tops out at ~176B, but almost all of the fine-tuned variants of LLaMA are 13B parameters or less, as are most of the newly released models trained from scratch.</p><p>Unfortunately, there is a reason that all of the top AI labs in the world have been focusing on growing model size &#8211; it is a prerequisite for performance. Training smaller models on more data helps up to a point, after which performance saturates, just as <a href="https://arxiv.org/abs/2001.08361">scaling laws predict</a>.</p><p>The scaling challenge would be less of an issue if the performance of open-source LLMs at the smaller scales of 7 to 13 billion parameters were sufficient for the use cases companies are considering, but unfortunately it is not. Small models don&#8217;t make for good chatbots, don&#8217;t automate workflows, and don&#8217;t generate working code beyond a function or two. This aligns with <a href="https://arxiv.org/pdf/2206.07682.pdf">research on emergent abilities</a>, which finds that many skills only appear at larger scales of tens of billions of parameters.</p><p>There are certainly still use cases for models at this scale around basic prediction and text understanding, but they will not produce the revolutionary companies that we have been promised.</p><p>Of course, not all LLMs are small. <a href="https://ai.meta.com/llama/">LLaMA 2</a> (now Llama?) just came out, and although it certainly is a game changer, it still lags behind GPT-4 in performance, in particular on coding problems. It is impressive that Meta has been able to squeeze more performance out of Llama without increasing its size much, but the gap between LLaMA 1 and 2 is nothing like that between GPT-3 and GPT-4, which suggests they will need to scale up further to be competitive.</p><p>This leads us to the second challenge that open-source LLMs face: infrastructure. Training and serving LLMs is hard and expensive, and bigger models are disproportionately harder and more expensive. Models consume compute in proportion to their size, but performance improves only roughly with the logarithm of size.
<p>Bigger models also bring infrastructure challenges: training instability, node failures, communication bandwidth, and so on.</p><p>Of course, the fact that there is demand and that technical solutions exist today means that the market will eventually catch up and provide solutions. MosaicML is only the beginning &#8211; there is still much to do around managed fine-tuning and serving. It will take time.</p><p>The fact that there is so much to do ties into the third challenge: customization.</p><h2>Open-source vs closed-source</h2><p>The key question is not whether open-source LLMs will advance, but whether they will catch up to closed-source APIs. Failing to do so probably means that they fall out of favor, leading to the aforementioned bifurcation of companies that either train from scratch or depend on APIs. After all, you can&#8217;t win a market with a categorically worse product.</p><p>When it comes to raw performance, open- and closed-source LLMs have significant and contrasting advantages. Open-source moves faster and allows for more customization. Closed-source has an easier time with large-scale coordination, such as executing expensive training runs, and provides a better user experience. The resemblance to the classical open- vs closed-source software dynamic is conspicuous.</p><p>At this point in time, closed-source LLMs have the advantage for precisely the two reasons mentioned above &#8211; they have produced better models and have made it easier for both users and developers to access their powers. To catch up, open-source needs to prove either that openness leads to better-quality models, or that it can produce a better developer experience.</p><p>For the most part, OpenAI is crushing the developer experience. They provide fine-tuned models for chat, embeddings, function calling, image generation, and voice-to-text, packaged with great documentation and pretty good latency and uptime. It seems unlikely that open-source will surpass them anytime soon.</p><p>No, the way that open-source competes with closed-source is by allowing for greater customization, which will enable medium-sized tech companies to tailor their LLMs specifically to their needs. GPT-4 API calls everywhere will be replaced by custom, fine-tuned models tailored to each industry and customer.</p><p>This brings us back to the third challenge &#8211; almost none of the tools you need to customize LLMs (to the degree that they outperform GPT-4) exist today. There are hundreds of open-source tools that have yet to consolidate, and although they are all promising, they are also very fresh, filled with bugs, and often missing critical features. Furthermore, the set of people in the world who can realize the benefits of open-source LLMs is small and hard to access, and they are probably more likely to want to work for closed-source LLM API providers anyway.</p><h2>What the future holds</h2><p>Now, that&#8217;s enough bashing for today. We are still early, and I am confident that open-source will overcome the challenges I mentioned. What worries me is not whether open-source will improve, but whether it can catch up to OpenAI.</p><p>I am in the category of people who believe that LLM quality still has a long way to go (if you disagree, try building even a remotely complex multi-LLM-call workflow, even with GPT-4 &#8211; the sketch below shows why such chains are brittle), and so for the foreseeable future, having the best LLMs will matter a ton for adoption.</p>
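<p>A back-of-the-envelope way to see the brittleness: if each step of a chained workflow succeeds independently, reliability decays geometrically with chain length. The 90% per-step figure below is an assumption for illustration, not a measured number for any particular model.</p><pre><code># Toy model of why multi-call LLM workflows are brittle: if each step in a
# chain succeeds independently with probability p, the end-to-end success
# rate decays as p**n. The 90% per-step rate is an illustrative assumption.

def end_to_end_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (1, 3, 5, 10):
    print(f"{n:2d} chained calls at 90% each -> {end_to_end_success(0.9, n):.0%}")

# Ten 90%-reliable steps yield a workflow that fails roughly 2 times in 3.
</code></pre>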
<p>OpenAI is ahead of the pack, and although there is a lot of diversity in open-source, the peak remains solidly behind.</p><p>My sense is that a lot will depend on the biggest research labs, in particular Meta. As far as I know, no other lab has the resources, infrastructure, and talent to match GPT-4 while simultaneously having demonstrated the willingness to give such a model away for free.</p><p>The reason open-source needs these mega labs is that open-source by itself likely won&#8217;t be able to attract the funding necessary to pre-train models at the scale of GPT-4 and beyond, and without pre-training, none of the incredible advances in fine-tuning and quantization that we have seen from the open-source community matter.</p><p>In the long run, I think both open and closed-source models will find their niche. Eventually, both will progress far enough to find tons of use cases, and different companies will build on different platforms depending on their needs and resources. Closed-source will probably remain ahead in terms of raw performance and developer experience, but once open-source and industry figure out how to make fine-tuning 100B+ parameter LLMs feasible for mere mortals, open-source will finally unlock its key differentiating advantage: customization.</p><p>At that point, I think the balance will shift, and we will see many more kinds of LLMs. Or at least I am hoping for it, since a world with tons of LLMs is more exciting than a world with only a few. It will take time to get there though, and that time is proportional to how much the big labs are willing to spend on models that they then give away. These decisions are up to a handful of individuals, so it is hard to say how things will play out. For now, we can just sit back and wait.</p>]]></content:encoded></item><item><title><![CDATA[Don't count Google out just yet]]></title><description><![CDATA[They were caught asleep at the wheel, but the AI revolution has only just begun]]></description><link>https://www.hallman.com/p/dont-count-google-out-just-yet</link><guid isPermaLink="false">https://www.hallman.com/p/dont-count-google-out-just-yet</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Tue, 25 Apr 2023 03:05:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Xs5Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948a13a3-fb28-4d37-a937-bc5c1287ce24_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The narrative spreading through the tech industry right now is that Google is on the way out &#8211; yet another example of an incumbent that got too comfortable in its massive success and failed to innovate, suffering setbacks at the hands of smaller, more innovative competitors. Google Search is losing users to ChatGPT, and Google will have to kill its own golden goose to compete. Despite incredible technology and research capabilities, it &#8220;just can&#8217;t ship a product&#8221;.</p><p>I would certainly be worried if I were an executive at Google, but not as worried as the media would have you believe I should be. For all the talk, my sense is that Google is really well-positioned to take advantage of the current AI wave.
Here&#8217;s why I think people are writing Google off way too early, and why they may even want to revisit their position.</p><h3>Current pro-Google arguments</h3><p>Let me first quickly go over the more obvious reasons Google is not doing that badly right now &#8211; exceptional talent, massive datasets, AI compute chips, practically infinite resources, and the fact that competitive pressures are likely to shake things up and force Google to start moving faster than it previously has.</p><p><strong>Exceptional talent.</strong> Google AI and DeepMind (now just Google DeepMind) are two of the three best AI research labs in the world (guess the third). Between them, the two labs have contributed the <a href="https://arxiv.org/abs/1706.03762">Transformer</a>, the <a href="https://arxiv.org/abs/2010.11929">Vision Transformer</a>, <a href="https://www.nature.com/articles/nature16961">AlphaGo</a>, <a href="https://www.nature.com/articles/s41586-021-03819-2">AlphaFold</a>, <a href="https://research.google/pubs/pub45381/">TensorFlow</a>, <a href="https://www.deepmind.com/blog/using-jax-to-accelerate-our-research">JAX</a>, the medical Q&amp;A model <a href="https://arxiv.org/abs/2212.13138">Med-PaLM</a>, the original discoveries of <a href="https://arxiv.org/abs/2206.07682">emergent behaviors</a> and <a href="https://arxiv.org/abs/2201.11903">chain-of-thought prompting</a>, as well as the current SotA <a href="https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training">scaling laws for LLMs</a>.</p><p><strong>Massive datasets.</strong> Google has decades of search data, orders of magnitude more than what its competitors have, not to mention YouTube&#8217;s video data, which companies have barely started tapping into.</p><p><strong>AI compute.</strong> There&#8217;s a decent case to be made that compute will be the biggest bottleneck for AI in the next decade. If so, Google is sitting comfortably: PaLM was trained on Google&#8217;s custom <a href="https://arxiv.org/abs/2204.02311v5">Pathways</a> AI infrastructure with a compute budget of around 2560 zettaflops (2.56x10^24 FLOPs) over 64 days, for a throughput of 4x10^22 FLOPs per day. The <a href="https://arxiv.org/abs/2005.14165">GPT-3 paper</a> doesn&#8217;t disclose training duration, but given estimates that GPT-3 took 1-2 months to train, combined with the reported 314 zettaflops (3.14x10^23 FLOPs) used during training, we can deduce that their throughput was probably no higher than 1x10^22 FLOPs per day. PaLM came out later, but the gap here is still 4x in the best case (the back-of-the-envelope arithmetic is sketched after this list).</p><p><strong>Infinite resources.</strong> Google is sitting on over a hundred billion dollars and is profitable, unlike some of its competitors.</p><p><strong>Competitive pressure.</strong> Yes, Google was asleep at the wheel, and it lost the first battle to OpenAI. However, for a company that has been winning so much for so long while putting in so little effort, this was probably inevitable. It&#8217;s not that surprising that Google productized so little of its research &#8211; it had no reason to! Now that it has every reason to, it makes sense that its behavior will change moving forward.</p>
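<p>For the curious, here is the back-of-the-envelope arithmetic behind the compute point. The PaLM figures are reported in the paper; the GPT-3 training duration is an outside estimate, so the resulting throughput is an upper bound, not a disclosed number.</p><pre><code># Back-of-the-envelope check of the training-throughput comparison above.
# PaLM: ~2.56e24 FLOPs over ~64 days (reported).
# GPT-3: ~3.14e23 FLOPs; the one-month duration is an estimate, not disclosed.

palm_flops, palm_days = 2.56e24, 64
gpt3_flops, gpt3_days = 3.14e23, 31   # assume the fast end of 1-2 months

palm_throughput = palm_flops / palm_days   # ~4.0e22 FLOPs/day
gpt3_throughput = gpt3_flops / gpt3_days   # ~1.0e22 FLOPs/day

print(f"PaLM : {palm_throughput:.1e} FLOPs/day")
print(f"GPT-3: {gpt3_throughput:.1e} FLOPs/day (upper bound)")
print(f"Gap  : ~{palm_throughput / gpt3_throughput:.0f}x in Google's favor")
</code></pre>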
<p>These are all great reasons not to count Google out yet, but they are subsumed by a more important point &#8211; when it comes to AI, Google is <em>ridiculously well-diversified</em>.</p><h3>Google is an AI portfolio</h3><p>Except for actually building products, Google + DeepMind is the leader in practically every subfield of AI. They have the best hardware, top-tier frameworks (JAX more so than TensorFlow nowadays), arguably the best LLMs (yes, GPT-4 beats PaLM, but PaLM is a year old by now &#8211; let&#8217;s see what Google&#8217;s response looks like), hands-down the best reinforcement learning, the best AI for biotech, the best self-driving cars, and the best fundamental AI research (i.e. understanding how the models work; see the previous point about Google DeepMind discovering emergence and chain-of-thought).</p><p>Many of these endeavors are likely to be fruitless. However, even a single win can produce astounding financial gain and reaffirm Google&#8217;s position in the market. It could be Waymo, <a href="https://www.isomorphiclabs.com/">Isomorphic Labs</a>, Bard, or who knows what will come out next month. Google has multiple ongoing projects that are redefining the limits of humanity&#8217;s current technological capabilities, and although progress has been slow until now, it would be foolish to assume that the executives at Google won&#8217;t look to these products with a more determined eye now that competition is heating up.</p><p>If Google takes the current moment seriously enough, it could also productize the groundbreaking research it has been sitting on but never done anything with &#8211; its <a href="https://cloud.google.com/blog/topics/healthcare-life-sciences/sharing-google-med-palm-2-medical-large-language-model">Med-PaLM 2</a> model is the first LLM to pass medical exams and match human performance on medical Q&amp;A, <a href="https://www.deepmind.com/publications/competition-level-code-generation-using-deep-language-models">AlphaCode</a> performs better than the median participant in international coding competitions, and <a href="https://arxiv.org/abs/2301.04104v1">DreamerV3</a> is the closest thing humanity has today to an autonomous AI agent that can interact with an arbitrary environment and get any (sufficiently simple) task done.</p><p>Google might be fine even if none of these endeavors work out. After all, Microsoft and OpenAI have revealed that new opportunities have emerged in business productivity software and AI APIs. Google has missed out on these so far, but first-mover advantages can be fickle. Google was not the first search engine, just as Facebook wasn&#8217;t the first social media platform. In general, big tech companies have done well for themselves in the past decade by waiting for startups to invent new products and then copying them.</p><p>In fact, even if Google never ships another product for the rest of its existence, it can still benefit massively from the coming AI revolution. All it has to do is build a platform of AI tooling for other companies to build on top of, because, believe it or not, AI is still an incredibly difficult space to build in. There are certainly some products that require little more than an LLM API, but for the truly ambitious companies in healthcare, law, finance, software development, etc., existing APIs are not yet sufficient.
This means we will need to train more and better models with more data and more compute, which means more demand for infrastructure and AI expertise.</p><p>In short, Google has been the leader in AI for so long that it is sitting on the best talent, infrastructure, and technology of any company in the world, by a wide margin. Google is <em>currently</em> losing the battle in <em>one specific area</em> of AI, and even there, it is not clear whether this is due to an insurmountable disadvantage &#8211; be it OpenAI&#8217;s first-mover advantage and resulting data moat, a complete inability to execute, or something else &#8211; or whether Bard will be able to catch up to ChatGPT given enough time. Overall, it seems almost impossible to imagine an AI-centric future where Google is not a significant player.</p><h3>What should Google do?</h3><p>Google executives and engineers should be ecstatic. Over the past several years, no company has spent as much as Google building up talent, infrastructure, and institutional knowledge around AI. Now, the final puzzle piece has fallen into place &#8211; a fast-moving challenger has revealed to the world that AI is even bigger than we previously anticipated. Every person and company wants to disrupt themselves, their competitors, and their industry with shiny new AI tooling.</p><p>And no company is in a better position to offer those tools than Google is today.</p><p>Though there certainly will be big winners, it seems rather improbable that AI will be winner-take-all. After all, there are so many subfields to disrupt &#8211; chatbots and search, biotech, art and entertainment, self-driving cars, business productivity software, etc. Google should acknowledge and rejoice in the massive opportunity ahead of it. Don&#8217;t try to tackle it all. Rather, focus on core strengths: take on big bets that few other companies are capable of, wait for the dust to settle to see which products are worth investing in, and in the meantime, build the infrastructure that will allow the next generation of startups to succeed.</p><p>From the POV of someone building in the space, I think Google should start by building an AI hub similar to Hugging Face (or acquiring them) and fight back against OpenAI by doing what its competitor once promised to do &#8211; <strong>go open-source</strong>. Google already has some of the best open-source models available today (<a href="https://arxiv.org/abs/2210.11416">Flan</a>-<a href="https://arxiv.org/abs/1910.10683">T5</a> and <a href="https://arxiv.org/abs/2205.05131v1">UL2</a>). These are currently significantly worse than GPT-3/4, but this doesn&#8217;t have to be the case. If Google built a platform around its slightly more capable models, say the 80B parameter <a href="https://arxiv.org/abs/2204.14198">Flamingo</a> or <a href="https://arxiv.org/abs/2203.15556">Chinchilla</a>, along with a hub of LLM tooling such as <a href="https://arxiv.org/abs/2210.11416">fine-tuning</a> and <a href="https://arxiv.org/abs/2104.08691">prompt-tuning</a> services powered by its TPU chips, my bet is that the industry would be happy to invest in Google&#8217;s platform over OpenAI&#8217;s, given the existential discomfort developers feel when depending on mission-critical closed-source APIs &#8211; not to mention the significant advantages around data privacy and security.</p>
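<p>To underline how accessible these open checkpoints already are: below is a minimal sketch of running Flan-T5 with the Hugging Face <code>transformers</code> library. The checkpoint name and generation settings are one reasonable choice among many, not an official recommendation.</p><pre><code># Minimal sketch: running Google's open Flan-T5 via Hugging Face transformers.
# Requires: pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"   # one of several released sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = "Answer the question: why is the sky blue?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>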
<h3>Conclusion</h3><p>Google is in a better position than people think. There are tons of opportunities in AI, and it is well-positioned to play a role in every single one of them. It doesn&#8217;t need to know how to build products &#8211; it can copy, acquire, or simply be an infrastructure provider for the next generation of startups.</p><p>More importantly, it would be a tragedy for the industry and for humanity if Google squandered this opportunity and died a slow death because of the launch of a chatbot. Google has incredible technology that could make the world a better place &#8211; we could have better and cheaper healthcare, legal services, software, and more.</p><p>AI isn&#8217;t magic, and change won&#8217;t happen for free. The technology is nascent and painfully difficult to build with. Many AI startups being born today are likely to fail not because of competition but because of the difficulty of the challenge they take on. My bet, and my hope, is that Google sees this, steps up to the opportunity, and helps guide the industry and the world toward safe, effective, and abundant AI.</p>]]></content:encoded></item><item><title><![CDATA[The unreasonable effectiveness of LLMs]]></title><description><![CDATA[Why are LLMs so much more effective than we expected?]]></description><link>https://www.hallman.com/p/the-unreasonable-effectiveness-of</link><guid isPermaLink="false">https://www.hallman.com/p/the-unreasonable-effectiveness-of</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Wed, 08 Mar 2023 03:10:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Xs5Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948a13a3-fb28-4d37-a937-bc5c1287ce24_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are all the rage right now. While they are still deeply flawed and should be used with a healthy dose of caution, I have to admit that the performance of these models has surpassed my expectations by a wide margin.</p><p>I have been fascinated by the ML/AI space since AlphaGo in 2016. I started exploring (very basic) ML research in 2018, got my first publication in 2020, and shortly afterwards pivoted into industry as an ML engineer, where I have been ever since.</p><p>Though I am no expert, I am sufficiently up to speed with the research community to know that the performance of the most recent generation of LLMs came as a surprise.
I, for one, predicted that LLMs would <em>never</em> display reasoning-type capabilities at any scale, and I was very wrong.</p><p>Here are my thoughts on why I was wrong &#8211; why LLMs have been better than expected, and how they might develop further.</p><h3>A survey of current art</h3><p>What can LLMs do right now?</p><ul><li><p>LLMs are starting to display reasoning capabilities. The canonical example is math problems &#8211; the training and architecture of LLMs would lead one to expect that they would never learn to solve math problems, but it turns out that LLMs can make up for this by reasoning in words (aka <a href="https://arxiv.org/abs/2201.11903">chain-of-thought prompting</a> &#8211; see the sketch after this list).</p></li><li><p>LLMs can be <a href="https://arxiv.org/abs/2302.04761">trained to utilize tools</a> to address weaknesses intrinsic to LLMs. Examples include querying search engines, using calculators and translators, and even accessing <a href="https://arxiv.org/abs/2207.05987">code documentation</a> and <a href="https://arxiv.org/abs/2301.13246">tests</a> to improve code generation.</p></li><li><p>The current state-of-the-art applications of LLMs depend heavily on prompt engineering, which is brittle and inefficient. Eventually, we will move on to techniques that can <a href="https://arxiv.org/abs/2010.15980">figure out optimal prompts</a>, or even <a href="https://arxiv.org/abs/2101.00190">skip prompts entirely</a>. Recent applications of such techniques reveal that LLMs are close to human performance on several impressive tasks, including <a href="https://arxiv.org/abs/2212.13138">medical QA</a> and <a href="https://arxiv.org/abs/2112.09332">web navigation</a>.</p></li><li><p>Beyond all of the tasks LLMs are capable of right now, the most exciting recent finding has been that scaling up LLMs <a href="https://arxiv.org/abs/2206.07682">leads to <em>emergent abilities</em></a>. In short, an LLM&#8217;s performance on certain tasks is <em>non-linear</em> &#8211; new skills unlock once the model reaches a certain size. Examples include solving math word problems, arithmetic with multiple digits, and chain-of-thought prompting.</p></li></ul>
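<p>To make the first bullet concrete, here is a minimal sketch of few-shot chain-of-thought prompting in the style of the paper linked above. The worked exemplar is the classic one from that paper; <code>complete()</code> is a hypothetical stand-in for whatever LLM API you use.</p><pre><code># Few-shot chain-of-thought prompting: prepend a worked example whose answer
# spells out its intermediate steps, nudging the model to reason in words
# before answering. complete() is a hypothetical stand-in for any LLM API.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def cot_prompt(question: str) -> str:
    return COT_EXEMPLAR + "Q: " + question + "\nA:"

# answer = complete(cot_prompt("..."))  # hypothetical LLM call
print(cot_prompt("If I buy 3 apples and eat one, how many are left?"))
</code></pre>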
<p>Overall, my impression of LLMs, given the results and research publicly available today, is that the technology is at the same time vastly overhyped (we are still not close to AGI) and vastly underestimated (LLMs can do more than just &#8220;autocomplete&#8221;).</p><h3>Why do LLMs work?</h3><p>As I do not have access to the SotA models, some of my understanding is derived from research papers, which convey inherently lossy information, so take all of the below with a grain of salt. With that disclaimer, here&#8217;s my current theory:</p><ol><li><p><strong>Language is more general than we think.</strong> My impression of why people are so impressed by ChatGPT is that it appears to be able to do &#8220;more&#8221; than just generate cool text, but this shouldn&#8217;t surprise us. Language can be used to represent a ton of tasks and behaviors. In our pursuit of predicting the next token with greater accuracy, it made sense that a bunch of cognitive capabilities (common sense, reasoning, knowledge) would be useful for improving the performance of our LLMs, but we didn&#8217;t know whether these models would be able to pick up such abilities from the low-information-density, unstructured data we were feeding them. It now appears that they in fact do (at least sometimes) &#8211; but why is this the case?</p></li><li><p><strong>Complex tasks are just many simple tasks.</strong> My impression is that humans overestimate how complex our thinking really is &#8211; most thinking appears advanced but can be broken down into successively smaller cognitive steps, eventually reaching a scale where each step can be solved using bounded knowledge and compute. The scaling up of the number of attention heads and layers in LLMs may simply have reached the point where the parameters have enough capacity to memorize the &#8220;rule-book&#8221; for various simple cognitive tasks and how to combine the results. For example, it should not surprise people that LLMs can be taught to solve long arithmetic, given that there is a finite algorithm for doing so that can be described in simple words (written out in the sketch after this list). That said, even if this is in fact what LLMs do, what is it about the current models and the way we train them that lets them do this?</p></li><li><p><strong>Neural networks + big data can learn anything.</strong> Although neural networks are still poorly understood, we now have a few insightful puzzle pieces that help shed light on the situation. In particular: (1) neural networks are <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">universal approximators</a>, which means they can approximate any function (with a few limitations that don&#8217;t really apply to humans), (2) models trained with gradient descent actually do <a href="https://arxiv.org/abs/1811.03804">converge to approximately global minima</a>, and (3) as we collect more text data (of sufficiently high quality and diversity), our training objective &#8220;converges&#8221; towards our true objective &#8211; understanding human language. Put these together and we get an amazing result: a sufficiently large model with sufficient data can learn human language. But what does &#8220;sufficiently large&#8221; mean, and are LLMs and internet-scale data anywhere close to it?</p></li><li><p><strong>Transformers are really efficient.</strong> Theoretical <a href="https://arxiv.org/abs/1811.03804">convergence</a> and <a href="https://arxiv.org/abs/1805.10769">approximation</a> results suggest that neural network architecture matters a ton &#8211; merely adding residual layers can produce an <a href="https://arxiv.org/abs/1811.03804">exponential reduction</a> in required parameter counts. Recent research shows that Transformers (the architecture behind LLMs) are <a href="https://arxiv.org/abs/2210.10749">very efficient</a> because they parallelize and compose computation (at least in certain toy settings). No one really knows how to quantify how complex human language is, or how much of it is captured within our current internet datasets, but it appears that current SotA LLMs may not be too far off.</p></li></ol>
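<p>Point 2 is easy to make concrete: the &#8220;rule-book&#8221; for long addition really is a short, purely local procedure. Written out as code, it never needs more than single-digit arithmetic and a carry &#8211; exactly the kind of finite algorithm a large model could plausibly memorize and apply step by step.</p><pre><code># The long-addition "rule-book" from point 2, written out explicitly:
# process one column at a time, right to left, carrying as you go.

def add_longhand(a: str, b: str) -> str:
    a, b = a.zfill(len(b)), b.zfill(len(a))     # pad to equal length
    digits, carry = [], 0
    for x, y in zip(reversed(a), reversed(b)):  # rightmost column first
        total = int(x) + int(y) + carry
        digits.append(str(total % 10))          # write the ones digit
        carry = total // 10                     # carry the tens digit
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(add_longhand("987654321", "123456789"))   # 1111111110
</code></pre>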
<h3>Closing thoughts</h3><p>My takeaways from the past few months are as follows:</p><ol><li><p><strong>Be humble &#8211; you never know what to expect.</strong> In the past I felt deeply confident that LLM research was a waste of time, as I thought their performance would saturate quickly; I could not see any practical use cases for them, not to mention that they were hideously expensive to train. I was very wrong, mainly because I was projecting forward a highly non-linear function (LLM performance) from few and non-representative data points (almost all models up until 2020 had fewer than 1B parameters).</p></li><li><p><strong>Sometimes, more is different.</strong> Emergent abilities in LLMs reveal something profound about the universe &#8211; in some circumstances, a large quantity of objects interacting with each other in simple ways produces complex and amazing emergent phenomena. Small changes to the interactions, or large changes to the quantity of objects, can lead to unexpected results.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Welcome to Contemplations!]]></title><description><![CDATA[A quick intro and what to expect]]></description><link>https://www.hallman.com/p/welcome-to-contemplations</link><guid isPermaLink="false">https://www.hallman.com/p/welcome-to-contemplations</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Tue, 06 Dec 2022 12:08:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Xs5Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948a13a3-fb28-4d37-a937-bc5c1287ce24_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Who are you and what is this?</h3><p>My name is John. My background is in mathematics and machine learning, but I spend perhaps just as much time pondering the circumstances we humans find ourselves in. For over a decade, I have maintained a personal journal named <em>Contemplations</em>, and my hope is for this substack to be the public continuation of these reflections.</p><h3>What should I expect?</h3><p>I mainly write about technology, society, and introspection, and I expect these subjects to be the primary focus of this substack as well.</p><p>Contemplations tend to be unstructured. I make no promises about the clarity of my thoughts, or how well researched they will be. The primary purpose of sharing my thoughts publicly is rather to serve as a record of what I thought in the past, and to let feedback catch errors in my thinking process.</p><p>I make no promises about how frequently I will be posting.
If anything, expect it to be highly infrequent and irregular.</p><h3>Why choose to write publicly?</h3><p>The reason I chose not to write publicly until now is that I found the thought of sharing my contemplations with the world frightening. People process feedback differently, and I tend to take feedback perhaps not personally, but at least seriously. Additionally, people on the internet are not always known for kind words.</p><p>However, I have come to believe that the very reason I find public writing frightening is the reason it is healthy.</p><p>When you keep a thought locked in your mind, it is likely to face less critique and opposition than if it were announced publicly to the world. Most ideas have valid critiques, and not all of them are damning, but knowing which ideas and critiques hold water is a hard task. Institutions and societies spend years and decades debating some ideas. Don&#8217;t try to carry this burden yourself.</p><p>If you find yourself with ideas or thoughts that you want to build your life around, then it only makes sense to test their validity first, and thankfully most individuals have the foresight to take at least this precaution. However, this is an outcome-centric view of idea validation. There is value in challenging even ideas that have no meaningful impact on your life outcomes. It is through the <em>challenging of ideas</em> that we are forced to become better thinkers.</p><p>By announcing a thought to the world, you gain information and wisdom at the expense of mental and emotional energy. There is a mental cost to researching and expressing a thought in writing, and an emotional toll in receiving (or not receiving) feedback and digesting its implications. In return, we gain a better understanding of which thoughts others relate to or disagree with, why, and where the gaps lie in our own understanding or thinking process.</p><p>Looking back, I believe I have overlooked the potential advantages of public writing &#8211; I kept my contemplations private for years and discussed only a handful of them with my closest friends. This is tantamount to being an overprotective parent to your ideas, shielding them from harm but undermining their long-term success.</p><p>Correspondingly, I believe that going public will be difficult, but positive in the long term. The only way to find out whether I am right is to try.</p><h3>Hmm, I don&#8217;t know if I buy what you are saying&#8230;</h3><p>If you find yourself thinking this, then I just ask that you comment and share why. In doing so, I hope we can come to a better shared understanding, all while reaping the rewards of having honed our minds together.</p>
]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is John&#8217;s Contemplations, a newsletter about technology, humanity, and living the good life.]]></description><link>https://www.hallman.com/p/coming-soon</link><guid isPermaLink="false">https://www.hallman.com/p/coming-soon</guid><dc:creator><![CDATA[John Hallman]]></dc:creator><pubDate>Tue, 06 Dec 2022 10:39:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Xs5Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948a13a3-fb28-4d37-a937-bc5c1287ce24_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>This is John&#8217;s Contemplations</strong>, a newsletter about technology, humanity, and living the good life.</p>]]></content:encoded></item></channel></rss>