armand0e left over a slop dataset. he was right. here is what happened.

armand0e left glint research today. the reason is a dataset that should never have gone up under our name, a thing called Complete-FABLE.5-traces-2M, two million rows marketed as fable 5 traces. it is not two million fable 5 traces. it is barely any. it got pushed onto our org by one person with no review, and because i trusted the teammate who pushed it, i never opened the file myself, so it sat there for weeks. then armand0e checked it, flagged it, traced where it actually came from, and told me to take it down or he was gone. i did not move fast enough. he left. he was right to leave. the dataset is down now and the person who uploaded it no longer has write access to the org. this is the honest version, because honest is supposed to be the whole point of this place.

if you have read this blog before you know the house rule. we publish what a thing actually is, not the dressed up version. i have used that line to dunk on other people more than once. so when we are the ones who broke it, i have to write that down too. that is the deal. it does not only count when it makes us look good.

what the dataset actually was

Complete-FABLE.5-traces-2M went up on our org as two million fable 5 reasoning traces. here is the problem. glint research does not have two million fable 5 traces. we do not have one hundred thousand. the real collection is a few thousand traces i pulled locally off the actual fable 5 chats in claude code. so where did the other 1.9 million-plus rows come from? they came from cloning other datasets. some of them were duplicates of our own original set. some of them were obvious reposts from random sources with zero fable content in them, just generic vibe-coding instruction junk relabeled as fable. and one of the sources was a private dataset of ours that should never have been public at all. so it was three failures stacked into one upload: padded with clones, padded with fake sources, and a privacy leak on top.
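
a breakdown like the one above is the kind of thing a short script can surface before anything goes public. a minimal sketch, assuming each row is a dict with hypothetical `source` and `text` fields (the real schema of the dataset is not described here, and this is not the check that was actually run):

```python
import hashlib
from collections import Counter

def audit_rows(rows):
    """Count rows per source and count exact-duplicate texts.

    `rows` is an iterable of dicts with hypothetical "source" and "text" keys.
    """
    per_source = Counter()
    seen_hashes = set()
    duplicates = 0
    for row in rows:
        per_source[row["source"]] += 1
        # Hash the stripped text so exact clones collide on the same digest.
        digest = hashlib.sha256(row["text"].strip().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            duplicates += 1
        else:
            seen_hashes.add(digest)
    return per_source, duplicates

# Toy rows with made-up source names, only to show the shape of the output.
rows = [
    {"source": "fable-5-local", "text": "trace a"},
    {"source": "cloned-repost", "text": "trace a"},
    {"source": "cloned-repost", "text": "generic instruction"},
]
per_source, duplicates = audit_rows(rows)
print(per_source.most_common())
print("exact duplicates:", duplicates)
```

exact hashing only catches verbatim clones. relabeled junk still needs a human reading rows, which is the part that got skipped.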

people did like it. it got reposts, it got follows, a couple hundred likes. i want to be clear about one thing, because the heat-of-the-moment version of this floated the idea that we left it up to farm the attention. that is not what happened. we did not know the data was bad. likes are not verification, a heart on a dataset card means somebody clicked, it does not mean a single row is real, and nobody on our side had actually read the rows. that is the real failure. it is quieter than chasing clout and it is honestly not much better.

how it got up

a teammate built the 2M set and pushed it straight into the org overnight. no heads up, no review, no "hey can someone look at this before it goes out." it was just live on our hub by morning, with a ping in the chat after the fact. the rest of us saw the ping, saw it was already public, and moved on with the day instead of opening it. that last part is ours and we take it fully. i had the access and the time to read the thing and i did not. armand0e read it. i did not. that is the whole difference between us in this story and it is not a flattering one for me.

and then the other half, which i am not going to soften: nobody should be able to push a dataset to this org overnight, silently, with zero review, and have it be public to the entire world before one other person has looked at a single row. that is not a small process gap. that is the gate just not existing. one person uploaded two million rows under all of our names while everybody else was asleep, and the first anyone heard of it was a ping saying it was already up. the bad data is one problem. the fact that bad data could go straight to public on our org with nobody checking is the bigger one, and it is the one we actually had to fix.

armand0e

armand0e has a simple rule. when bad data gets uploaded in our org, you fix it. that is it. that is the standard. he opened the dataset, read the source list, found the clones and the fake sources and the private set, and laid it all out row by row. then he told me plainly: if this stays up, i leave, remove me. and i did not take it down fast enough to keep him. so he left.

he was right and i want that on the record in my own words, not buried in a discord log. he held the line on the thing this org is supposed to care about, and i was the one who could have opened the file the morning it appeared and did not. you do not get many people who will walk over data integrity. most people will let it slide for the follow count. he would not. that is the kind of person you want and i lost him by being slow to act once he flagged it. respect to armand0e. the door is open if you ever want it, and i would not blame you if you do not.

what we did about it

the 2M dataset is removed from huggingface. the teammate who uploaded it no longer has write access to the org. the org account that shrugged it off in chat got demoted to moderator and lost write too. that is not me being dramatic, that is just matching the access to the judgment shown. if you can push two million unreviewed rows public overnight, or wave it off when someone flags it, you do not get the publish button. write access is trust, and trust is the thing that got spent here.

the real dataset is still up and is actually what it says it is. Fable-5-traces is the verified one, real fable 5 traces from claude code, plus native claude traces in the claude folder, collected by hand. if you trained anything on the 2M set, drop it and use the real one, or at minimum filter out everything that is not actually fable. i am sorry you have to redo that. it should never have been up for you to grab in the first place.

the new rule

no dataset goes public from this org without a second person checking the sources first, and no more overnight solo pushes to public, period. because datasets are the thing people quietly build on and the thing we left an open door on. we had an unwritten version of this and we assumed everyone would just behave. one overnight upload blew straight through it. so now it is written down, here, in public, where i cannot pretend i did not say it.

that is the whole post. a bad dataset got pushed onto our org overnight with nobody checking it, it was padded and fake and leaked a private set, we should have reviewed it the moment it appeared and we did not, and a good teammate left because i was too slow to pull it once he flagged it. we take full blame for not reviewing it. the data is down. the access is fixed. the rule is real now. and armand0e was right.

/lane
glint research, 2026, down a teammate i should have kept, one slop dataset deleted, one house rule learned the expensive way

the qkvae works now. 96.6% of the picture, 80% of the tokens, 1 million params.

ok so the qkvae got good. it is the image tokenizer i mumbled about last week. you hand it a picture, it turns the picture into tokens, you hand the tokens back, it gives you the picture. the whole game is doing that with as few tokens as possible while keeping the picture looking like the picture. and it does. on a big detailed cityscape it landed 96.6% structural similarity while spending 80% of its token budget. when i let it use the whole budget it hit 97.6%. it is 1.06 million parameters. it is not released yet. let me explain why i have been staring at it for two days.

[image: original cityscape on the left, qkvae reconstruction on the right]
left is the original. right is what the qkvae gave back. 1.06M params. find the seam.

i keep saying "the picture looks like the picture" because that is the entire job and it is harder than it sounds. you are throwing away most of the data and asking the model to put it back from a handful of codes. most tokenizers at this size smear the detail into mush. this one keeps the windows on the buildings.

what a qkvae even is

it is an fsq autoencoder. you give it an image, the encoder squishes it down to a grid of discrete codes, and any model that reads tokens can read those codes. the decoder turns the codes back into pixels. small conv net on each end, a finite scalar quantizer in the middle, three sizes (1M, 5M, 20M) that all share the same codebook so you can swap them without retraining whatever is downstream. that part is normal. the part that got good is what it does with the token budget.
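
for anyone who has not met a finite scalar quantizer before, here is a minimal sketch of the middle step. it follows the usual fsq recipe (bound each latent channel with tanh, round to a small number of levels, read the digits as one integer). the level counts, the function names, and the odd-levels-only simplification are assumptions for illustration, not the qkvae code:

```python
import math

def fsq_quantize(latent, levels):
    """Quantize one latent vector into per-channel digits in [0, L-1].

    Assumes every entry of `levels` is odd, which keeps the rounding symmetric.
    """
    digits = []
    for value, num_levels in zip(latent, levels):
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half      # squashed into (-half, half)
        digits.append(int(round(bounded)) + half)
    return digits

def digits_to_token(digits, levels):
    """Pack the digits into a single integer token (mixed-radix number)."""
    token = 0
    for digit, num_levels in zip(digits, levels):
        token = token * num_levels + digit
    return token

def token_to_digits(token, levels):
    """Inverse of digits_to_token."""
    digits = []
    for num_levels in reversed(levels):
        digits.append(token % num_levels)
        token //= num_levels
    return digits[::-1]

LEVELS = [5, 5, 5]                 # hypothetical: 5 * 5 * 5 = 125 possible codes
digits = fsq_quantize([0.0, 10.0, -10.0], LEVELS)
token = digits_to_token(digits, LEVELS)
print(digits, token)
```

because the codebook is just the grid of level combinations, there is nothing to learn or store for it, which is one plausible reason three model sizes can share it.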

it spends tokens where the picture is hard

here is the trick. it does not spend the same number of tokens on every part of the image. a flat patch of sky collapses down to one token. a building covered in lit windows keeps all of its tokens. it walks the grid as a quadtree and merges any block that is basically uniform, and the merge only happens when it actually saves tokens, so a fully detailed image just costs the full grid and never more. the token count moves with how complex the picture actually is.
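
the merge rule can be sketched in a few lines. this is a toy version under stated assumptions: codes sit in a plain 2d list, a block merges only when every code in it is identical, and the cost of storing the tree shape itself is ignored. the real qkvae merge criterion ("basically uniform") is looser than exact equality and is not shown in this post:

```python
def _count(grid, row, col, height, width):
    """Tokens needed for the block starting at (row, col)."""
    first = grid[row][col]
    uniform = all(
        grid[r][c] == first
        for r in range(row, row + height)
        for c in range(col, col + width)
    )
    if uniform:
        return 1  # the whole block collapses to one token
    # Split into up to four children; odd sizes give uneven halves.
    top, left = (height + 1) // 2, (width + 1) // 2
    total = 0
    for r, h in ((row, top), (row + top, height - top)):
        for c, w in ((col, left), (col + left, width - left)):
            if h > 0 and w > 0:
                total += _count(grid, r, c, h, w)
    return total

def quadtree_token_count(grid):
    """Token count for a rectangular grid of codes after quadtree merging."""
    return _count(grid, 0, 0, len(grid), len(grid[0]))

sky_and_windows = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 3, 4],
]
print(quadtree_token_count(sky_and_windows), "of", 4 * 4)
```

every leaf covers at least one cell, so the count can never exceed the full grid, which matches the "never more" property described above.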

[image: qkvae result reading 96.6 percent match at 80 percent of the token budget]
96.6% match. 36.17 dB. 27829 of 35000 tokens. one little model.

on that cityscape the full grid was 35000 tokens. it used 27829, which is just under 80%, and still came back at 96.6%. so it found about 7000 tokens (7171, to be exact) of sky and haze and dark water that it could merge away, kept everything that mattered, and you have to look hard to find what it dropped. give it a screenshot or a logo or anything with big flat regions and it merges way harder than that.

any size in, same size out

it is fully convolutional, so you can hand it a 182 by 28 strip or a wide cityscape and it gives you back the exact same dimensions. no resizing to a square, no shrinking to a thumbnail, no letterboxing. the picture goes in at its native resolution and comes out at its native resolution. this sounds boring until you have used a tokenizer that forces everything to 256 by 256 and mangles your aspect ratio on the way in.

the part i did not expect: the 1M keeps winning

there are three sizes and you would assume more parameters means a better picture every time. on raw quality the 20M does win, it has the highest fidelity, that is real. but i built a little lab that grades the models on efficiency, quality kept per token, and sometimes the 1M just beats both of the bigger ones on the same image. the small one spends its tokens better. it is 1.06 million parameters beating a model twenty times its size on the metric that actually matters when tokens are the budget. i did not design it to do that. it just does, on the right image, and i find it very funny.

be honest about the numbers

house rule, same as always. 96.6% is structural similarity to the original, the match score, measured at 80% of the tokens on one detailed cityscape, 36.17 dB, tau 0.594. that is a hard image. flat images score higher and use a fraction of the tokens. dense images cost more. at the full token budget the same picture goes to 97.6%. it is a lossy autoencoder, so it is never pixel perfect, and i am not going to pretend it is. the demo shows you the real reconstruction and the real token count every time, so you can catch me if i am lying.
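
the 36.17 dB figure reads like a peak signal-to-noise ratio. that is an assumption, since the metric is not named here. if it is psnr on 8-bit pixels, this is how the number relates to per-pixel error (function names are illustrative):

```python
import math

def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_from_psnr(db, max_value=255.0):
    """Invert the formula: the mean squared error implied by a dB value."""
    return max_value ** 2 / (10.0 ** (db / 10.0))

implied_mse = mse_from_psnr(36.17)
print(round(implied_mse, 1), round(math.sqrt(implied_mse), 1))
```

under that assumption 36.17 dB works out to a mean squared error of roughly 15.7, so the typical pixel is off by about 4 levels out of 255.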

when you can try it

not yet. it is not released. i want to clean up the demo and the model cards first, and the 5090 is busy with other things, and i keep finding one more thing to tweak. soon. on hf, like everything else, no early access, no waitlist, no token sale. a 1 million parameter image tokenizer that keeps the windows on the buildings and occasionally embarrasses the 20M model. go look at the side by side again. find the seam. i will wait.

/lane
glint research, 2026, 4 people, one tiny tokenizer, 96.6% of a city in 80% of the tokens, still not released because i keep touching it

free time is dangerous. trained more things. bought an fpga.

so glint and shard are training in the background. which means the 5090 has been doing everything. but i am bored. so i use the 1% remaining to train models on even less budget. 99% less budget if you will. how fun! i uploaded a qkvae, a 1k param model, and bought a "Sipeed Tang Nano 20K GW2AR-18 QN88 FPGA Development Board with 64Mbits SDRAM 828K Block SRAM Linux RISCV Single Board Computer for Retro Game Console Support microSD RGB LCD JTAG Port" from amazon. fun times.

this is what happens when you give me a few days with nothing scheduled. i should be writing the minimythos paper. i should be sleeping. i should be doing literally anything other than training more models and buying hardware.

the qkvae

a qkvae is up on the hub. that is the announcement. more details when i have had more than four hours of sleep. it is small, it is weird, and it did something i did not expect. the model card explains it. the model card is also short because i wrote it at 2am and then passed out.

the 1k param model

a 1k param model. yes. another one. it is on huggingface. blink is one. this is another. they are different models. go look at it. tell me what it does. it does something.

the fpga

ok the big one. i bought a Sipeed Tang Nano 20K GW2AR-18 QN88 FPGA Development Board with 64Mbits SDRAM 828K Block SRAM Linux RISCV Single Board Computer for Retro Game Console Support microSD RGB LCD JTAG Port from amazon. yes that is the full name. no i am not abbreviating it. it is a small fpga with 64 megabits of sdram and 828k of block sram. it runs a soft riscv core that boots linux. it has a microsd slot. it has rgb output. it has jtag. it is the size of a stick of gum. it cost less than a nice dinner. i have wanted one for like 2 years and i finally caved.

why? because running a 1k param model on a $30 fpga is funny. that is the only reason. it is the same reason blink exists. it is funny. i am going to try to get blink running on it. if i succeed i will post a video. if i fail i will also post a video because failure is the best content.

anyway

the glint run is training. shard is training. the 5090 is busy. the fpga is in the mail. the qkvae is on the hub. the 1k model is on the hub. this blog post is on the blog. the minimythos paper is still unwritten. that is the state of things.

more soon. probably about the fpga. probably about the qkvae. probably about the 1k model. probably about how the minimythos paper is going to be late by like 2 weeks. normal service resumes shortly.

/lane
glint research, 2026, 4 people, 1 fpga, no chill, infinite free time, still no minimythos paper

blink is out. 1,087 parameters. yes, that's the whole model.

blink is live on huggingface. it is the smallest model glint research has ever shipped and probably ever will: 1,087 parameters. not 1.087 billion. not 1.087 million. one thousand and eighty-seven actual numbers. it trained on 100 billion tokens of fineweb-edu, which works out to about 92 million tokens for every single parameter, a ratio so absurd we had to do it just to see what would happen. what happened is blink. go say hi. it will say something back. it will not be a word, but it will be something.
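
the ratio is easy to check. both inputs come straight from the paragraph above:

```python
tokens = 100_000_000_000   # 100 billion training tokens
params = 1_087             # blink's parameter count

tokens_per_param = tokens / params
print(f"{tokens_per_param / 1e6:.1f} million tokens per parameter")
```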

if you have read this blog before you already know blink. it is the "1k debug model" i kept mentioning, the thing we break first so we do not break the 50M model. it was never supposed to be a release. it was a test rig. but it sat there at the bottom of every config sweep being quietly fascinating, so we trained it for real, ran it through the full pipeline, and now it is a model with a card and a leaderboard row like a grown-up. the runt of the litter got a name tag.

what it actually is

blink is a byte model. dim=3. the entire residual stream is a three-number vector. it is one transformer block looped eight times with a tiny LoRA per loop, one attention head, and tied embeddings. there is no tokenizer to download because it reads raw bytes. the released weights are the slerp-tournament champion, the actual thing the merge produced, scored on the same tiny-lm leaderboard harness as everything else. it ships in two flavors, a base champion and an instruct champion that formats like a chatbot and reasons like a 1k model, which is to say barely.
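
"reads raw bytes" means the input ids are just the utf-8 bytes of the prompt. a minimal sketch of that step, plus a back-of-envelope count that assumes a plain 256-entry embedding table with no extra special tokens (that table size is an assumption, not something the card text here confirms):

```python
def to_byte_ids(prompt):
    """Turn a prompt into model input ids: one id per utf-8 byte."""
    return list(prompt.encode("utf-8"))

def from_byte_ids(ids):
    """Turn generated ids back into text, replacing invalid byte sequences."""
    return bytes(ids).decode("utf-8", errors="replace")

ids = to_byte_ids("the ")
print(ids)  # [116, 104, 101, 32]

DIM = 3                       # from the post: the residual stream is 3 numbers
ASSUMED_VOCAB = 256           # assumption: one row per possible byte value
embedding_params = ASSUMED_VOCAB * DIM
print(embedding_params, "of 1087 parameters would sit in the tied embedding")
```

if that assumption holds, the looped block, the per-loop LoRA, and everything else share what is left.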

each weight file is about 13 kilobytes. the whole model fits in a single network packet. you could email it. i keep saying this because i cannot get over it.

be honest about the numbers

house rule, you know it by now: we publish what the champion actually scores, not the best checkpoint we could find and dress up. on the leaderboard harness blink lands around 71 byte perplexity on wikitext, 52.8% on blimp, and 26.6% on arc-easy. blimp chance is 50% and arc chance is 25%, so it learned a little grammar and a little reasoning, which for 1,087 parameters is more than we had any right to expect. the wikitext number gets worse during training, on purpose, because a three-dimensional residual stream cannot hold both wikipedia and filtered web text at once and it picks the one we trained it on. that is the model being 1k parameters. we are not hiding it.

what does it sound like? you give it "the " and it gives you back "ar n n c tiseos t at or areeeat ton al teat". you can see it reaching for words and mostly missing. every so often it lands one. "the". "at". "let". a tiny model briefly remembering that english has structure, then losing the thread. it is the most honest thing we make. it does not pretend to know more than it does, which is a bar a lot of bigger models still cannot clear.

why ship a model that can't talk

because the question that matters is not how good a small model can get. it is where the floor is. blink is the floor probe. if 92 million tokens per parameter cannot push a 1k model past chance on grammar, that tells you something real about the architecture. it pushed past chance. that tells you something too. every weird idea we want to try on the 50M model gets tried on blink first, and blink is small enough that you can watch it learn, fail, and occasionally surprise you in an afternoon instead of a week.

it is on hf right now with a single-file inference script, so you can run a state-of-the-art-sized-for-1999 language model on a potato. no early access, no waitlist, no token sale. four people, one 5090, a 13 kb model. go break it. tell me what it says.

/lane
glint research, 2026, 1,087 parameters, 100 billion tokens, one 5090, and a model you can fit in a text message

minimythos: 50M params, 50B tokens, real training pipeline

minimythos is not a codename anymore. it is a real model, and the run is live right now. 50 million parameters. 50 billion tokens of fineweb-edu. built on top of the fable-inspired architecture we have been iterating on for months. this is the first model out of the minimythos pipeline, and it exists because of thousands of hours of config tuning that almost broke us a few times.

we have been quiet about this for months. every time someone asked "when is minimythos dropping" in the discord, we said "soon" or "working on it" or "lane broke the tokenizer again." and all of that was true. but the real holdup was never the model. it was getting the config right.

getting the config right took thousands of hours. not because we were being thorough. because we kept breaking things. the shard model at 50M is deceptively simple on paper: one shared transformer block looped 16 times with per-iteration LoRA adapters, factorized embeddings like ALBERT, a custom tokenizer. in practice every change to one variable broke three other things. we ran config sweeps. we ran more config sweeps. we ran the sweep sweeps. the results are in the journal. there are pages of them.

the pipeline itself is straightforward. train a model, save checkpoints every couple thousand steps, benchmark each one on BLiMP, WikiText-2, and ARC-Easy, SLERP-merge the best checkpoints together, benchmark the merged result, and keep doing that until the merged model is way better than any single checkpoint. the SLERP merge is done per-parameter with sign-flip handling. it sounds simple. getting it to actually work at 50M scale was not simple.

the architecture

shard is our internal name for this family of models. the key idea is parameter sharing plus recursive depth, with SLERP merging of checkpoints after training. instead of one monolithic set of weights, you get one shared block that gets called multiple times with different LoRA adapters each time. the adapters specialize. then you merge the checkpoints with SLERP. the merged model is better than any individual checkpoint because it gets the specialization from each one without paying the parameter cost of having them all at once.

the concrete numbers for the 50M model: factorized embedding (vocab to 128-dim then to model width, like ALBERT), 4 unique prelude layers, the shared middle block looped up to 16 times with rank-8 LoRA per loop, 4 unique coda layers. the effective depth is something like 20+ layers but the actual parameter count stays at 50M because almost everything is shared. the learning rate schedule took a week to tune. arman did it. i am the one not allowed to touch it. the last time i touched the LR schedule the model started outputting the entire bee movie script.
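
to see why the factorized embedding and the shared block keep the count down, here is the arithmetic as a sketch. the 128-dim bottleneck, the rank-8 LoRA, and the 4 + 16 + 4 layout come from the paragraph above. the vocab size and model width are hypothetical placeholders, since neither is given in this post:

```python
def embedding_params(vocab, width, bottleneck=None):
    """Embedding parameter count, optionally factorized through a bottleneck."""
    if bottleneck is None:
        return vocab * width
    return vocab * bottleneck + bottleneck * width

def lora_params(width, rank, loops):
    """LoRA cost for one square weight matrix: two low-rank factors per loop."""
    return loops * 2 * width * rank

VOCAB, WIDTH = 32_000, 512    # hypothetical values, for illustration only

full = embedding_params(VOCAB, WIDTH)
factored = embedding_params(VOCAB, WIDTH, bottleneck=128)
adapters = lora_params(WIDTH, rank=8, loops=16)
effective_depth = 4 + 16 + 4  # prelude + looped shared block + coda

print(f"embedding: {full:,} full vs {factored:,} factorized")
print(f"lora for one matrix across 16 loops: {adapters:,}")
print(f"effective depth: {effective_depth} layers")
```

with those placeholder sizes the factorized table is about a quarter of the full one, and sixteen sets of adapters for a matrix cost less than half of one full copy of it.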

there is also a 1k debug model called blink and a 1M model called glint. same architecture, fewer params, used for testing config changes without burning a day on a full run. they were invaluable during the config tuning period. every change went through blink first. most of them broke blink. the ones that broke blink also broke the 50M model, which is exactly why the small models exist.

the training loop

the training pipeline is tuned for a single node with however many GPUs we could fit without tripping the breaker. the data is fineweb-edu, sample-100BT configuration. it gets filtered, deduplicated, and re-ranked by difficulty before it goes anywhere near the model. the difficulty scoring uses a learned scorer from the fable-5 tracing work. it is not hand-tuned. we tried hand-tuning. it went badly.

the core loop: train until the loss plateaus or the checkpoint schedule fires. save a checkpoint. run BLiMP, WikiText-2, and ARC-Easy against it. record the scores. then SLERP the best 3-5 checkpoints together. run the same benchmarks on the merged result. if the merged model beats the best individual checkpoint, keep the merge. if it does not, try a different combination. repeat.
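
the "merge, benchmark, keep it only if it wins" part of that loop can be sketched like this. `score_fn` and `merge_fn` are hypothetical stand-ins for the benchmark harness and the merge, and the toy example uses plain floats instead of real checkpoints:

```python
from itertools import combinations

def search_merges(checkpoints, score_fn, merge_fn, top_k=3, ts=(0.25, 0.5, 0.75)):
    """Try pairwise merges of the top-k checkpoints.

    Returns (best_model, best_score, was_merged). A merge is kept only when it
    scores strictly higher than the best unmerged checkpoint seen so far.
    """
    ranked = sorted(checkpoints, key=score_fn, reverse=True)[:top_k]
    best_model, best_score, was_merged = ranked[0], score_fn(ranked[0]), False
    for a, b in combinations(ranked, 2):
        for t in ts:
            candidate = merge_fn(a, b, t)
            score = score_fn(candidate)
            if score > best_score:
                best_model, best_score, was_merged = candidate, score, True
    return best_model, best_score, was_merged

# Toy stand-ins: a "checkpoint" is one number and the benchmark likes 2.0.
toy_score = lambda x: -(x - 2.0) ** 2
toy_merge = lambda a, b, t: (1 - t) * a + t * b

model, score, merged = search_merges([0.0, 1.0, 4.0], toy_score, toy_merge)
print(model, score, merged)
```

every candidate gets scored, nothing is accepted on faith, and the unmerged best stays the answer if no merge beats it.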

the SLERP merge itself is per-parameter with sign-flip handling based on Git Re-Basin. when two checkpoints are close in weight space, we fall back to linear interpolation. when they are far apart, we use actual spherical interpolation. the interpolation parameter is swept at 0.25, 0.5, and 0.75 for every pair in the top-K. we benchmark every merge. we do not guess.
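
a minimal sketch of the interpolation step for one flattened parameter, with the linear fallback for checkpoints that are nearly parallel. the threshold value and function name are assumptions, and the sign-flip handling is left out on purpose because this post does not say how it works:

```python
import math

def slerp(a, b, t, close_threshold=1e-3):
    """Spherically interpolate between two equal-length weight vectors.

    Falls back to linear interpolation when the angle between the vectors is
    tiny (or when either vector is all zeros), where slerp is ill-conditioned.
    """
    def lerp():
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return lerp()

    dot = sum(x * y for x, y in zip(a, b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < close_threshold:
        return lerp()

    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Sweep the same interpolation points the post mentions.
for t in (0.25, 0.5, 0.75):
    print(t, slerp([1.0, 0.0], [0.0, 1.0], t))
```

for two unit vectors at a right angle the midpoint stays on the unit circle, which plain averaging would not give you.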

this is thousands of hours of work. most of it is waiting for benchmarks to finish. the benchmarks run in a subprocess with a timeout. if a benchmark hangs, it gets killed. the checkpoint saver writes to a temp file and atomically renames it into place so we never lose a checkpoint if training gets interrupted. we lost checkpoints in early versions. we do not lose checkpoints anymore.
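
both safety nets are small. a sketch of each using only the python standard library. the function names and the benchmark command are hypothetical, and the real pipeline may differ:

```python
import os
import subprocess
import sys
import tempfile

def save_checkpoint_atomically(path, data):
    """Write bytes to a temp file in the target directory, then rename it.

    os.replace is atomic on the same filesystem, so a reader sees either the
    old checkpoint or the complete new one, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def run_benchmark(command, timeout_seconds):
    """Run a benchmark command; return its stdout, or None if it hung."""
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run has already killed the child
    return done.stdout

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "step_2000.bin")
        save_checkpoint_atomically(target, b"fake weights")
        print(os.listdir(workdir))
    print(run_benchmark([sys.executable, "-c", "print('ok')"], 30))
```

the temp file lives in the same directory as the target because a rename across filesystems is not atomic.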

what the team actually did

minimythos is not a solo project. there are a lot of people involved. the short version: i did most of the coding. shane edited a few lines. everyone else is here to run reality checks, catch the dumb ideas before i waste a week on them, and hand me papers that i then spend three weeks implementing badly before getting it right.

  • lane. lead developer, architecture, training, the push to beat gemma 4 e4b at under 100M parameters. also writing model cards and panicking.
  • armand0e. moral support.
  • shane. edited a few lines of code. kept the training node alive while lane fought the tokenizer. (also magebreaker on discord)
  • enderchefcoder. moral support, coding stuff.
  • dragonoid. reality checks. lots of papers. more papers than i know what to do with.
  • moon_senpai. reality checks. sometimes papers.
  • amytimed. reality checks. sometimes papers.
  • costikoooo. reality checks. sometimes papers.
  • datdanboi25. reality checks. sometimes papers.
  • finnyboy. reality checks. sometimes papers.
  • pedrodev2026. reality checks. sometimes papers.

the continuous thinking thing

something we added recently is COCONUT-style continuous latent thinking. instead of reasoning in language space like chain-of-thought, the model feeds its last hidden state back as the next input embedding. it creates "continuous thoughts" that let the model run multiple reasoning steps inside the forward pass without generating tokens. there is no explicit thinking loss. the cross-entropy loss supervises the thinking through backprop. it is a neat trick. the 1M model gets 46% better loss with 4 thinking steps and 8 loops than with no thinking at all. the blink model cannot fit thinking because even one thinking step pushes it over the 1k parameter limit.
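
the feedback loop itself is tiny. a sketch of the control flow only, where `step` is a hypothetical stand-in for the model that maps the sequence of input embeddings so far to the last hidden state. the toy step below is not a transformer, it just makes the loop runnable:

```python
def run_with_continuous_thoughts(step, input_embeddings, n_thoughts):
    """Feed the last hidden state back as the next input, n_thoughts times.

    No token is sampled during the thinking steps. Returns the hidden state
    from the final pass and the full sequence of embeddings that was used.
    """
    sequence = [list(embedding) for embedding in input_embeddings]
    for _ in range(n_thoughts):
        hidden = step(sequence)
        sequence.append(hidden)      # the "continuous thought"
    return step(sequence), sequence

def toy_step(sequence):
    """Stand-in model: halves the most recent embedding."""
    return [0.5 * x for x in sequence[-1]]

final_hidden, sequence = run_with_continuous_thoughts(
    toy_step, [[2.0, 4.0]], n_thoughts=2
)
print(len(sequence), final_hidden)
```

in training, the only loss sits on the final output, so gradients reach the thinking steps through the appended hidden states.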

the data

the entire training corpus is fineweb-edu. sample-100BT configuration. 50 billion tokens. nothing else in pretraining. the agent traces from fable-5 are in the codebase and in the journal as reference but they are not in the training mix for this run. this is a language model. it learns from text.

the quality filter is the part that took the most time. not the SLERP. not the architecture. the quality filter. getting the threshold right so we keep enough data to hit 50B tokens without letting in the garbage. the fineweb-edu sample is already filtered. we filter it again. then we rank it by the learned difficulty scorer and keep only the top-scoring slice. the exact percentage is in the training config. we are not going to post a fake number here because we do not want people to argue with us about it in the comments.
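
the rank-and-keep step looks roughly like this. the scorer here is a placeholder (document length) and the fraction is made up, because the learned scorer and the real threshold are not in this post:

```python
def keep_top_fraction(documents, score_fn, fraction):
    """Rank documents by score, highest first, and keep the top fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(documents, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

docs = ["a", "bbb", "cc", "dddd"]
kept = keep_top_fraction(docs, score_fn=len, fraction=0.5)   # placeholder scorer
print(kept)
```

the tension described above lives in that one `fraction` argument: too low and the run falls short of 50B tokens, too high and the garbage comes back.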

what this means

minimythos is the first model out of the minimythos pipeline. shard is the product. smaller is where we are going. we do not have a target size yet because the scaling curves from this run will tell us what makes sense. if the architecture holds at 50M parameters we will make it smaller. if it does not, we publish the failure analysis and go back to the drawing board. that is the deal.

the 1T token glint run is still happening in parallel. minimythos is not replacing that. minimythos is the production line. glint is the research line. they are running at the same time on different parts of the same cluster. ender manages the tokenizer for both. lane manages the panic for both.

when you can try it

the run is in progress right now. when it finishes, the weights, the model card, the full evaluation numbers, the training config, and the evaluation harness all go up on huggingface. the model will be there first. the blog post and the evaluation numbers will follow. all on huggingface. no early access lists. no discord exclusives. no token sale. we are a 4-person research org that builds tiny models and occasionally wins arguments on the internet.

if you want to follow the run as it happens: hf for the model when it ships, ko-fi if you want to help fund the next GPU so we stop asking shane to run the training loop from his apartment at 3am. the journal is in the repo if you want the raw training logs with no filter applied.

/lane
glint research, 2026, 4 people, 1 model, thousands of hours of config tuning, still wondering why anyone cares about a 50M model

minimythos is real. our first shard model trains on 50B tokens.

minimythos is happening. 50 million params, 50 billion tokens. we train lots of models, save checkpoints, slerp them, benchmark everything, and repeat until we have something way better than a baseline trained model. fineweb-edu data. that's it.

we have been quiet about this for months. every time someone asked "when is minimythos dropping" in the discord, we said "soon" or "working on it" or "lane broke the tokenizer again." and all of that was true.

minimythos is a 50M model trained on 50B tokens of fineweb-edu. we are not calling it glint-2 because it is not the same thing. glint is the research line. minimythos is the first model we are shipping as a real usable thing.

how it works

we train the model. we save checkpoints every x steps. we benchmark every checkpoint. we slerp them together. we benchmark all the slerp combinations. we see what does what. we repeat until we have a model that is much better than a baseline trained model without the slerp step. that's the whole idea.

the data is fineweb-edu. nothing fancy. no custom mixes. no synthetic data. no agent traces. just the good educational web data that everyone uses. the trick is not the data. the trick is what we do with the checkpoints after training.

we are not going to share too much about the details. not because we are secretive. because the details are still moving. we are running a lot of experiments in parallel and the recipe changes weekly. when it settles we will write it up properly.

when can you try it

the run is in progress. weights and a demo on hf soon enough. no early access, no waitlist, no token sale. we are a 4-person research org that builds tiny models. we will ship it when it is done.

if you want to follow along: hf for the model, discord for the training logs, ko-fi if you want to help fund the next run.

/lane
glint research, 2026, first shard model inbound, still running on 5090s, still 4 people in a trench coat pretending to be a lab

the blog is back. here's everything that happened.

ok so the blog is back. it has been dead for a really long time. the last post on here is from when glint-0.3 was our big release. so. yeah. a lot has happened. buckle up i guess, or don't, whatever, you can close the tab. im not your dad.

also hi. im lane. i run this place. you might know me as compactai on hf. if you do, no you don't, get out of here. ok lets go.

we beat supralabs in followers. YAYAYAYAYAYAYA

this is the headline. this is the lede. this is why you opened this post and if you scrolled past it i will personally come to your house and reset your wifi.

we beat supralabs. supralabs. the supralabs. the ones with nova, the ones with the bigger model cards, the ones who have had more followers than us for the entire lifespan of this org. 142 to 141. that's it. that's the gap. one follower. one single human being decided that glint research was the hill they wanted to die on and i love them. i dont know who they are. but i love them.

ok so. follower count means nothing. it really doesn't. it has no correlation with model quality, no correlation with research output, no correlation with anything except maybe how many twitter bros you know. im aware of this. im saying it out loud so when i lose my mind about it later you know i knew.

and yet.

i have been checking this number every day for like 4 months. i have a graph. i have a spreadsheet. i have, on at least one occasion, asked a friend with a spare account to follow supralabs and then immediately unfollow them, just to see if i could detect it. (i couldn't. i have no idea what im doing.) i have lost sleep over this. i have won sleep over this. the trajectory was clear and the gap was closing and the only question was when.

it was last week. and i refreshed the page and it was 142 and 141 and i did not breathe for about 30 seconds. and then i closed my laptop and went for a walk. and then i came back and refreshed again. and it was still true. so.

WE ARE BIGGER THAN SUPRALABS NOW. i am writing it in all caps because it deserves all caps. if you're from supralabs and you read this, no hard feelings. (some hard feelings.) ok a lot of hard feelings. come beat us back. please. i need something to obsess over.

minimythos is real and it's coming

ok business. sort of. minimythos. the sub-100m model weve been quietly building. yes, it has a name now, yes, it's a real thing, no, you can't try it yet, yes, i know that's annoying.

the short version: glint-1.3 is a 982k parameter model and if you ask it a question there is a roughly 90% chance it will just say "chuamliamce" and walk off into the sunset. it's funny. it is not useful. we made it to prove a point about what 1m params can do at high training token count (100B on fineweb-edu, on a single 5090, 138k tok/s, the whole thing is a flex). and it proved the point. and now we want a model that you can actually use.
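
for scale, here is what those two numbers imply about wall-clock time, assuming the 138k tok/s held for the whole run with no restarts or eval pauses (that steady-throughput part is an assumption):

```python
tokens = 100_000_000_000        # 100B tokens of fineweb-edu
tokens_per_second = 138_000     # throughput quoted above

seconds = tokens / tokens_per_second
days = seconds / 86_400
print(f"{days:.1f} days of continuous training")
```

so a bit over eight days of one gpu doing nothing else.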

so minimythos. under 100 million params. instruction-tuned (glint-1.3 is base only, which is why it says "chuamliamce" instead of answering you). trained on a real dataset. the kind of small that you can run on a macbook without the fan sounding like a jet engine. the kind of good that you can actually paste into a chat and have it not embarrass you in front of your coworkers.

what we know going in: the model souping worked (per-group slerp, you can see it in the glint-1.3 card, the merged model gets +4.5 points on BLiMP over the best individual checkpoint, which for tiny models is kind of insane). the curriculum stuff worked. the tokenizer experiment we're not allowed to talk about worked, sorry, you'll see it in the paper. what we don't know: whether all of that carries over when you go from 1m to 80m params or whether the tricks stop working. we have guesses. they might be wrong. we'll find out.

honestly the worst part of minimythos is the name. i keep typo-ing it. there is no mythological creature called "minimythos." the name is "mythos" (small myth) and the "mini" is because it's the small one, and our naming is a disaster. shane wanted to call it "glint-2." arman wanted to call it "glint-mobile." ender suggested something im not allowed to print. minimythos won because i am a coward and nobody fought me on it. if you hate the name, blame me, im used to it.

1T tokens. yeah. really.

this is the part where i get to say the thing ive been biting my tongue about for like 6 months.

the next glint model (the actual one, not the 1m param research thing, not minimythos, the next one after that) is going to be trained on up to 1 trillion tokens of pretraining data. possibly more. possibly less. depends on the run.

and i think it might beat everything in its class on whatever leaderboard we throw it at. i think. i have a paper's worth of evidence that suggests it. the team has a paper's worth of evidence that suggests it. the ablations suggest it. the scaling behavior suggests it. none of this means it will actually happen. but it has never been this plausible before.

ok we are NOT going to beat gpt-4. we are NOT going to beat claude. we are NOT going to beat llama-3.1-405b. those models are 50-500x our size, they have teams of 200, they have a datacenter. we have 4 people and whatever 5090s we can fit in a room without tripping over them.

what we might beat: every other model under ~500m params on most reasonable benchmarks. maybe a few above that. maybe a few that are way above that if the day is right. the trick is that nobody else is doing 1T tokens on a small model with the kind of data curation pipeline we have. everyone is either small-model-small-data or big-model-lots-of-data. the diagonal is open. we are running down the diagonal.

i have been writing the pretraining config in my head for months. i have rewritten the data mix maybe 30 times. arman has opinions about the learning rate schedule. shane keeps asking me if im sure about the init scheme. ender is in charge of making sure we don't accidentally train on a duplicate of the test set, which is a real problem we have actually had to deal with, which is a whole other post.

no name. no release date. no promises. pretraining is hard and a lot of stuff can go sideways and we have 14 different failure modes bookmarked in a doc that i will share if you ask nicely. but the run is happening. it's not "if" anymore. it's "when."

ok one more thing. (a note on slerp, and on the small-model thing)

edit note (next day): TL;DR: i had a long argument in another discord about small models. the original version of this section had direct quotes and usernames from specific people, plus a few jabs. some of those people asked me to take the quotes, the names, and the mockery out, and threatened to report the org to huggingface for ToS. i removed all of it. every quote, every name. for the record: i do not agree with the framing that any of it was abusive. it was harsh constructive feedback. i was being mean about a technical disagreement, not about the people. but i respect the ask, so here is the same technical content without the receipts. the substance did not change. only the receipts did.

i had a long argument recently with people who think small models are pointless. some of them are not wrong to be skeptical. some of them are wrong in specific ways that i want to walk through, because the specifics are interesting. this is not a callout. this is feedback. if you have said any of these things to me in any discord, this is the response.

what slerp actually is, in case anyone is confused

slerp = spherical linear interpolation. it is a math operation. you have two vectors, you want a third vector that is "between" them. regular linear interpolation (lerp) draws a straight line. slerp draws an arc on a sphere. that is the entire idea.

lerp(a, b, t) = a + t * (b - a)

slerp(a, b, t) = ( sin((1 - t) * θ) / sin(θ) ) * a
              + ( sin( t     * θ) / sin(θ) ) * b

  where θ = arccos( dot(a, b) / (|a| * |b|) )

that's it. that's the math. it has been a thing in computer graphics since the 80s. it is how you rotate an object smoothly between two orientations without the object going through the floor.
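if you would rather read it as code, here is a minimal sketch of both formulas in plain python. the eps fallback and the zero-vector guard are my additions for the sketch (the formula divides by sin(θ), which blows up when the two vectors point the same way), not part of the formula itself.

```python
import math

def lerp(a, b, t):
    """Straight-line interpolation between two equal-length vectors."""
    return [x + t * (y - x) for x, y in zip(a, b)]

def slerp(a, b, t, eps=1e-8):
    """Spherical interpolation, following the formula above.

    ASSUMPTION of this sketch: if either vector is all zeros, or the two
    are (nearly) parallel so sin(theta) is ~0, fall back to lerp.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return lerp(a, b, t)
    # clamp so rounding error cannot push acos outside its domain
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        return lerp(a, b, t)
    weight_a = math.sin((1 - t) * theta) / sin_theta
    weight_b = math.sin(t * theta) / sin_theta
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

a, b = [1.0, 0.0], [0.0, 1.0]
mid_lerp = lerp(a, b, 0.5)     # cuts through the inside of the circle
mid_slerp = slerp(a, b, 0.5)   # stays on the unit circle
print(math.hypot(*mid_lerp), math.hypot(*mid_slerp))
```

the lerp midpoint comes out shorter than either input. the slerp midpoint keeps length 1. that is the arc versus the straight line.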

for neural network weights the intuition is: trained weights kind of live on a high-dimensional sphere (loosely, after normalization, but the intuition holds even when it's not exactly true). straight-line interpolation cuts through the middle of the sphere. spherical interpolation walks along the surface. and for reasons that are not fully understood by anyone, walking along the surface can give you a model that performs better than either of the two you started with.

the cheating-math claim

on glint-1.3 we soup'd three checkpoints together using per-group slerp. the result is in the model card. the merged model gets 68.7% on blimp, the best individual checkpoint got 64.2%. that is a +4.5 point superadditive gain. superadditive is the technical term for: the merged model is better than the best thing you put in.
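to make "per-group slerp over three checkpoints" concrete, here is a toy sketch. it is NOT the actual merge config (that one is in the repo). the assumptions are mine: "per-group" is read as "each named parameter group gets its own angle", and since slerp only takes two inputs the checkpoints are folded in one at a time with t = 1/k, which is what gives equal weights in the straight-line case. the group names and numbers are made up.

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Two-vector slerp with a lerp fallback for degenerate cases."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return [x + t * (y - x) for x, y in zip(a, b)]
    theta = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    sin_theta = math.sin(theta)
    if sin_theta < eps:  # nearly parallel: straight line is fine
        return [x + t * (y - x) for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * theta) / sin_theta
    weight_b = math.sin(t * theta) / sin_theta
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

def soup(checkpoints):
    """Fold N checkpoints into one, one parameter group at a time.

    ASSUMPTION: the k-th checkpoint is folded in with t = 1/k
    (a running average, done along the arc instead of the line).
    """
    merged = {name: list(weights) for name, weights in checkpoints[0].items()}
    for k, ckpt in enumerate(checkpoints[1:], start=2):
        for name in merged:
            # each group computes its own angle, independent of the others
            merged[name] = slerp(merged[name], ckpt[name], 1.0 / k)
    return merged

# toy checkpoints: hypothetical group names, flattened made-up weights
ckpts = [
    {"embed": [1.0, 0.0], "mlp": [0.0, 2.0]},
    {"embed": [0.0, 1.0], "mlp": [2.0, 0.0]},
    {"embed": [1.0, 1.0], "mlp": [1.0, 1.0]},
]
merged = soup(ckpts)
print(merged)
```

the fold order matters a little with slerp (it is not associative the way averaging is), so treat the ordering as one more knob.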

and the claim that the scores are fake. ok. the scores are on the model card. the eval harness is in the HF repo. not on github. IN THE HF REPO. right there. in the files. next to the weights. in the same place you downloaded the model from. you literally cannot get a more direct line to whether this is real than clicking files and reading them. how simple is that.

the test sets are public. the checkpoints are public. the merge config is public. the eval script is public. all in the same place. one click. i cannot make this more public than it already is. if you think the scores are fake, run the eval yourself. you have a 5090? great. you have a colab? great. you can do it in an afternoon. i will wait.

nobody has done it. because they are not fake.
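if you have never looked at how a blimp number gets made: the benchmark is pairs of sentences, one grammatical and one not, and the model gets a point when it gives the grammatical one the higher probability. so coin-flipping is 50%. here is a sketch of just the scoring rule. it is not our harness (that is in the repo). sentence_logprob is a hypothetical hook where your model goes, and the toy scores are invented.

```python
def blimp_accuracy(pairs, sentence_logprob):
    """Fraction of minimal pairs where the grammatical sentence wins.

    pairs: list of (grammatical, ungrammatical) sentence tuples.
    sentence_logprob: callable returning a model's total log-probability
    for a sentence (hypothetical hook; plug your own model in here).
    """
    correct = sum(
        1 for good, bad in pairs
        if sentence_logprob(good) > sentence_logprob(bad)
    )
    return correct / len(pairs)

# toy stand-in for a model: a lookup table of made-up log-probabilities
toy_scores = {
    "the cats sleep.": -4.0,
    "the cats sleeps.": -6.5,
    "she has eaten.": -5.0,
    "she have eaten.": -4.5,
}
pairs = [
    ("the cats sleep.", "the cats sleeps."),
    ("she has eaten.", "she have eaten."),
]
accuracy = blimp_accuracy(pairs, toy_scores.get)
print(accuracy)  # 0.5: the toy "model" gets one pair right and one wrong
```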

the doing-the-published-formula framing

ok. we are not cheating math. we are not beating math. we are doing math. specifically, we are taking three tensors of weights, computing the angle between them, and interpolating along the surface of a sphere. that is the operation. it has a name. it is called slerp. it is in textbooks. it has been in computer graphics since the 80s. it is in mergekit. the model soup paper does the straight-line version of the same idea (plain weight averaging). we are not doing anything new. we are doing the thing that already exists and we are showing that it works for tiny models.

if cheating math means running the published equation and publishing the result, then yes. we are cheating. and you should be mad at every paper that has ever used the formula. which is all of them.

the just-invent-better-architecture counter

this is the just-build-a-rocket-bro of ml twitter. it is technically advice. it is also useless. we are doing the work. it is just slow and looks like a small model that mostly outputs "chuamliamce."

the 1m param research line is explicitly the we-are-trying-weird-stuff line. we are running tokenizer experiments. we are running training-recipe experiments. we are running architecture experiments (glimmer-1 is the start of one). if something better than transformers comes out of this it will come out of the small-model research first because that's where you can afford to break things. the big labs cannot afford to break things. they have shareholders. we have a discord and a 5090.

the vega 7 / 5090 analogy

this is the analogy: you are trying to get big-model performance out of a 1M model by throwing 1T tokens at it. that is the overclocking. you are not making the chip bigger. you are just running it hotter. and a vega 7 does not become a 5090 by being run hotter. it crashes, or it just stays a vega 7 with worse thermals.

fine. we are not doing that. we are not running the 1M model hotter by feeding it 1T tokens and calling it a day. we are doing three distinct things that the analogy collapses into one:

  • we are training longer than the standard recipe says we should. this part is the overclock and we know it. the data has to come from somewhere and the bottleneck is the data not the params. ok.
  • we are merging checkpoints with slerp, which the standard recipe does not do. this is not overclocking. it is a different operation. we are not running the same model hotter. we are taking three models and combining them. the 5090 is not a vega 7 run hot. it is a different chip. this is the part of the work that actually does the heavy lifting.
  • we are doing curriculum + data curation work that the standard recipe does not do. also not overclocking. we are picking which 1T tokens to train on. a vega 7 with a better cooling solution is still a vega 7. a vega 7 running a workload it was actually tuned for is a different thing. that is what curriculum does.

so the analogy is wrong because it implies we are doing one thing (running a small model hot) when we are doing three things (running it longer, merging it, curating the data). the 5090 is not a vega 7 with more tokens. the 5090 is a chip that was architected differently. we are not claiming to be a 5090. we are claiming that the vega 7 with 1T tokens comparison is leaving out the slerp and the curriculum. with all three, you get something that is not a 5090 but is also not the vega 7 the analogy is picturing. that is the whole pitch.

the kaplan 2020 scaling laws paper

this gets dropped a lot, like a trump card. ok. i have read it. more than once. we have all read it. arman has it printed out. here is what it actually says: for a given compute budget, there is an OPTIMAL model size. go smaller or bigger than that at the same budget and you end up with worse loss than the optimal size would have given you.

what it does NOT say is: you cannot train a small model on a lot of data. in fact, the 2020 paper and every follow-up explicitly note that the small-model-big-data corner of the chart is UNDER-EXPLORED. everyone runs the same diagonal. we are not running the same diagonal. we are running the other one. the laws do not forbid it. the laws just say nobody has done it well, which is a different statement.
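a rough sketch of what "the other diagonal" means in numbers. it assumes the usual rule of thumb from the scaling-law literature that training compute is about 6 * params * tokens FLOPs. the budget below is hypothetical, picked so the numbers come out round. it is not our actual budget.

```python
def tokens_for_budget(compute_flops, n_params):
    """How many training tokens a fixed FLOP budget buys at a given size.

    ASSUMPTION: compute ~= 6 * params * tokens (rule-of-thumb estimate).
    """
    return compute_flops / (6 * n_params)

budget = 6e18  # hypothetical FLOP budget, for illustration only
for n_params in (1e6, 1e7, 1e8):
    tokens = tokens_for_budget(budget, n_params)
    print(f"{n_params:.0e} params -> {tokens:.0e} tokens")
```

same compute, three very different places to spend it. the laws tell you which spot minimizes loss for the budget. they do not say the other spots are off limits.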

the-law-says-i-cant is a great way to never discover anything. the law said flight was impossible. the law said heavier-than-air was a fantasy. the law said a 1M model could not get 52% on BLiMP. we did it. the law is now updated. we are trying to do that again.

the diminishing returns data point

this is a real datapoint and i want to engage with it honestly. ok. what was the model, what was the eval, what was the seed. because:

  • if you ran a 1M model on 24B and then 34B tokens and saw 0% improvement on wikitext perplexity, that is consistent with what we see too. wikitext saturates fast for tiny models. the curve flattens.
  • if you saw 0% improvement on blimp, that is also kind of expected at that scale, blimp has a ceiling around 65-70% for small models.
  • if you saw 0% improvement on arc-easy, that is the one i would push on. arc-easy should still be moving at that range.
  • if you saw 0% improvement on the downstream task you actually care about, the eval you wrote yourself... that is the only one that matters and we have no way to reproduce it.

the thing nobody in that thread was doing: running slerp. we are not seeing 0% improvement because we are not just adding tokens. we are adding tokens AND merging checkpoints. the merged model is what gets the gain. the individual checkpoints plateau. the soup keeps improving. that is the whole point. the claim that more data alone doesnt help is true if you only train one model. it is less true if you train many and merge.

the chat-capability ceiling

ok. show me the proof. what does really good at chat even mean. because:

  • smollm2-135m can answer basic questions without falling over. it is 135M, not 1M. but it is proof that the floor is not zero at this scale.
  • phi-1.5 at 1.3B is genuinely chatty. we are not at 1.3B. but we are closer to 1.3B than to 1M in terms of research effort.
  • tinyllama 1.1B is a real chat model on a real phone.
  • none of these are 1M. and we are not claiming 1M is really good at chat today. we are claiming we are working on it.

never is a strong word. the people saying never are usually the people who have not tried. we have tried. we are still trying. the model is not really good at chat yet. it is closer to weirdly consistent at saying chuamliamce. but never is a claim about the future, and the future is not in your dataset.

the data-capacity limit theory

this is at least a technical question so i will answer it like a technical question.

the model weights do not need to fit the training data. the training data is the training data. the weights are a compressed representation of the patterns the model found in the data. that is what training does. you do not need 1TB of weights to learn from 1TB of text. a 1M param model with 4-bit weights is 0.5MB. the model has 0.5MB of space for learned patterns. that is fine. that is how compression works. a 100KB jpeg is not limited to 100KB of photo content. it is a representation of a much larger photo.
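the 0.5MB number is just arithmetic, and here it is spelled out. MB here means 10^6 bytes, and the 1TB of text is the illustrative figure from the paragraph above, not a measured dataset size.

```python
def weight_bytes(n_params, bits_per_weight):
    """Raw storage for the weights, ignoring any file-format overhead."""
    return n_params * bits_per_weight / 8

size = weight_bytes(1_000_000, 4)   # 1M params at 4 bits each
text_bytes = 1e12                   # illustrative 1TB of training text
ratio = text_bytes / size           # how much bigger the text is

print(f"weights: {size / 1e6} MB")
print(f"text is {ratio:,.0f}x larger than the weights")
```

nobody expects the weights to hold the text. they hold whatever patterns survived being squeezed that hard.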

what matters is: does the model learn useful things from a lot of data. and the answer is: yes, it does, even at this scale, especially when you do the curriculum and the slerp and the rest of it. not as much as a bigger model. but enough to be interesting. and that is what we are testing.

the claim that 1M can only learn 8-10MB of unique incompressible data is true if you are storing the data verbatim. it is not what we are doing. we are not storing the data. we are storing a model of the data. those are different.

and the meta point: vibes are not evidence

the claim that this is impossible, made without a training log, without an eval trace, without a failed run, is just a vibe. and vibes are not evidence. if you think we are wrong: fork the recipe. train a 1M model. soup it. publish the result. we will read it. we will cite it. we will probably dm you about it. that is the deal. that is how research works.

the proof is in the training runs. come run some with us. or run some against us. we do not care. we just want the data.

the other stuff (less cool but still cool)

a few things happened that don't deserve their own section but im going to mention anyway because this is my blog and i can:

  • fable-5-traces is now trending on huggingface. 2.5k+ likes. trending for 5 days straight. it is a dataset of distillation traces from the (now gone) fable 5 model. 28 models have been trained or fine-tuned on it in the last 2 weeks. it is the most popular thing ive ever made and i did not see it coming.
  • glimmer-1 shipped. 11,900 parameters. yes, eleven thousand nine hundred. an entire llm in 12k params. it scores 25% on arc-easy and 52% on blimp and it is the dumbest and also the most beautiful thing we have ever made. it is also a stunt. we know.
  • anthos-1 (text-to-image, our only non-text model) crossed 1000 downloads. it generates flowers. only flowers. it is the most single-purpose model on huggingface and i love it.
  • glint-trace shipped. distillation traces from larger models. 73 downloads as of writing this. somebody out there is using it. i have no idea who. if that person is reading this, hi, please tell me what you are doing with it.
  • the discord is alive. for a long time it was just me, the bot, and the sound of my own typing. now there are actual people in there. people who are smarter than me about things. this is a mixed blessing but mostly good.

the team is 4 now. me, shane, arman, ender. shane does infra and pretraining ablations. arman does data and post-training. ender does the things nobody else wants to do (tokenizer work, evaluation harness maintenance, making sure my code doesn't catch fire). i do whatever's left, which is mostly writing model cards and panicking.

whats next for the blog

more posts. real ones. not model cards (model cards dont count, they're spec sheets, dont @ me). pretraining updates, ablations, the stuff that broke, the stuff that worked and then broke, the failed runs, the "we spent 3 weeks on this and got nothing" posts. maybe a post about the time glint-1 convinced itself it was named greg. that one is coming.

if you want to follow along: hf for the models, discord for the chaos, ko-fi if you want us to keep doing this instead of getting real jobs.

ok that's it. that's the post. blog is alive again. supralabs who. minimythos is coming. 1T tokens is real. see you in like 3 weeks when i forget to write one and then panic about it on a sunday night and post something unhinged at 2am.

/lane
glint research, 2026, still tiny, still trying, now with 142 friends and 2.5k likes on a dataset we did not expect to take off

older posts

nothing here yet. the blog was dead, remember? more coming soon (probably).