ok so the blog is back. it has been dead for a really long time. the last post on here is from when glint-0.3 was our big release. so. yeah. a lot has happened. buckle up i guess, or don't, whatever, you can close the tab. im not your dad.
also hi. im lane. i run this place. you might know me as compactai on hf. if you do, no you don't, get out of here. ok lets go.
we beat supralabs in followers. YAYAYAYAYAYAYA
this is the headline. this is the lede. this is why you opened this post and if you scrolled past it i will personally come to your house and reset your wifi.
we beat supralabs. supralabs. the supralabs. the ones with nova, the ones with the bigger model cards, the ones who have had more followers than us for the entire lifespan of this org. 142 to 141. that's it. that's the gap. one follower. one single human being decided that glint research was the hill they wanted to die on and i love them. i dont know who they are. but i love them.
ok so. follower count means nothing. it really doesn't. it has no correlation with model quality, no correlation with research output, no correlation with anything except maybe how many twitter bros you know. im aware of this. im saying it out loud so when i lose my mind about it later you know i knew.
and yet.
i have been checking this number every day for like 4 months. i have a graph. i have a spreadsheet. i have, on at least one occasion, asked a friend with a spare account to follow supralabs and then immediately unfollow them, just to see if i could detect it. (i couldn't. i have no idea what im doing.) i have lost sleep over this. i have won sleep over this. the trajectory was clear and the gap was closing and the only question was when.
it was last week. and i refreshed the page and it was 142 and 141 and i did not breathe for about 30 seconds. and then i closed my laptop and went for a walk. and then i came back and refreshed again. and it was still true. so.
WE ARE BIGGER THAN SUPRALABS NOW. i am writing it in all caps because it deserves all caps. if you're from supralabs and you read this, no hard feelings. (some hard feelings.) ok a lot of hard feelings. come beat us back. please. i need something to obsess over.
minimythos is real and its coming
ok business. sort of. minimythos. the sub-100m model weve been quietly building. yes, it has a name now, yes, it's a real thing, no, you can't try it yet, yes, i know that's annoying.
the short version: glint-1.3 is a 982k parameter model and if you ask it a question there is a roughly 90% chance it will just say "chuamliamce" and walk off into the sunset. it's funny. it is not useful. we made it to prove a point about what 1m params can do at high training token count (100B on fineweb-edu, on a single 5090, 138k tok/s, the whole thing is a flex). and it proved the point. and now we want a model that you can actually use.
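for scale, here is the back-of-envelope on that run. a minimal sketch using only the numbers in this post (100B tokens, 138k tok/s) and assuming the throughput holds steady for the whole run, which it never quite does. the function name is made up for illustration.

```python
# numbers from the post: 100B training tokens at 138k tokens/second
TOKENS = 100e9
TOKENS_PER_SECOND = 138_000
SECONDS_PER_DAY = 86_400


def training_days(tokens: float, tokens_per_second: float) -> float:
    """wall-clock days, assuming constant throughput and zero restarts."""
    return tokens / tokens_per_second / SECONDS_PER_DAY


days = training_days(TOKENS, TOKENS_PER_SECOND)
print(f"about {days:.1f} days on one card")
```

that comes out to a bit over 8 days of wall-clock on a single gpu, before counting evals, restarts, or anything catching fire.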
so minimythos. under 100 million params. instruction-tuned (glint-1.3 is base only, which is why it says "chuamliamce" instead of answering you). trained on a real dataset. the kind of small that you can run on a macbook without the fan sounding like a jet engine. the kind of good that you can actually paste into a chat and have it not embarrass you in front of your coworkers.
what we know going in: the model souping worked (per-group slerp, you can see it in the glint-1.3 card, the merged model gets +4.5 points on BLiMP over the best individual checkpoint, which for tiny models is kind of insane). the curriculum stuff worked. the tokenizer experiment we're not allowed to talk about worked, sorry, you'll see it in the paper. what we don't know: whether all of that carries over when you go from 1m to 80m params or whether the tricks stop working. we have guesses. they might be wrong. we'll find out.
honestly the worst part of minimythos is the name. i keep typo-ing it. there is no mythological creature called "minimythos." the name is "mythos" (small myth) and the "mini" is because it's the small one, and our naming is a disaster. shane wanted to call it "glint-2." arman wanted to call it "glint-mobile." ender suggested something im not allowed to print. minimythos won because i am a coward and nobody fought me on it. if you hate the name, blame me, im used to it.
1T tokens. yeah. really.
this is the part where i get to say the thing ive been biting my tongue about for like 6 months.
the next glint model (the actual one, not the 1m param research thing, not minimythos, the next one after that) is going to be trained on up to 1 trillion tokens of pretraining data. possibly more. possibly less. depends on the run.
and i think it might beat everything in its class on whatever leaderboard we throw it at. i think. i have a paper's worth of evidence that suggests it. the team has a paper's worth of evidence that suggests it. the ablations suggest it. the scaling behavior suggests it. none of this means it will actually happen. but it has never been this plausible before.
ok we are NOT going to beat gpt-4. we are NOT going to beat claude. we are NOT going to beat llama-3.1-405b. those models are hundreds to thousands of times our size, they have teams of 200, they have a datacenter. we have 4 people and whatever 5090s we can fit in a room without tripping over them.
what we might beat: every other model under ~500m params on most reasonable benchmarks. maybe a few above that. maybe a few that are way above that if the day is right. the trick is that nobody else is doing 1T tokens on a small model with the kind of data curation pipeline we have. everyone is either small-model-small-data or big-model-lots-of-data. the diagonal is open. we are running down the diagonal.
i have been writing the pretraining config in my head for months. i have rewritten the data mix maybe 30 times. arman has opinions about the learning rate schedule. shane keeps asking me if im sure about the init scheme. ender is in charge of making sure we don't accidentally train on a duplicate of the test set, which is a real problem we have actually had to deal with, which is a whole other post.
no name. no release date. no promises. pretraining is hard and a lot of stuff can go sideways and we have 14 different failure modes bookmarked in a doc that i will share if you ask nicely. but the run is happening. it's not "if" anymore. it's "when."
ok one more thing. (a note on slerp, and on the small-model thing)
edit note (next day): TL;DR: i had a long argument in another discord about small models. the original version of this section had direct quotes and usernames from specific people, plus a few jabs. some of those people asked me to take the quotes, the names, and the mockery out, and threatened to report the org to huggingface for ToS. i removed all of it. every quote, every name. for the record: i do not agree with the framing that any of it was abusive. it was harsh constructive feedback. i was being mean about a technical disagreement, not about the people. but i respect the ask, so here is the same technical content without the receipts. the substance did not change. only the receipts did.
i had a long argument recently with people who think small models are pointless. some of them are not wrong to be skeptical. some of them are wrong in specific ways that i want to walk through, because the specifics are interesting. this is not a callout. this is feedback. if you have said any of these things to me in any discord, this is the response.
what slerp actually is, in case anyone is confused
slerp = spherical linear interpolation. it is a math operation. you have two vectors, you want a third vector that is "between" them. regular linear interpolation (lerp) draws a straight line. slerp draws an arc on a sphere. that is the entire idea.
    lerp(a, b, t)  = a + t * (b - a)
    slerp(a, b, t) = ( sin((1 - t) * θ) / sin(θ) ) * a + ( sin(t * θ) / sin(θ) ) * b
    where θ = arccos( dot(a, b) / (|a| * |b|) )
that's it. that's the math. it has been a thing in computer graphics since the 80s. it is how you rotate an object smoothly between two orientations without the object going through the floor.
for neural network weights the intuition is: trained weights kind of live on a high-dimensional sphere (loosely, after normalization, but the intuition holds even when it's not exactly true). straight-line interpolation cuts through the middle of the sphere. spherical interpolation walks along the surface. and for reasons that are not fully understood by anyone, walking along the surface can give you a model that performs better than either of the two you started with.
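if code is easier to read than the formula, here is a minimal sketch of both operations in plain python. it is a direct transcription of the equations above, not our actual merge code. the fallback to lerp when the two vectors are nearly parallel is an assumption added so the division by sin(θ) does not blow up.

```python
import math


def lerp(a, b, t):
    """straight-line interpolation between two vectors (lists of floats)."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def slerp(a, b, t, eps=1e-8):
    """spherical interpolation: walk along the arc between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return lerp(a, b, t)  # the angle is undefined for a zero vector
    # clamp to [-1, 1] so float rounding cannot push acos out of its domain
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return lerp(a, b, t)  # assumption: fall back when nearly (anti)parallel
    weight_a = math.sin((1 - t) * theta) / sin_theta
    weight_b = math.sin(t * theta) / sin_theta
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def length(v):
    return math.sqrt(sum(x * x for x in v))


a, b = [1.0, 0.0], [0.0, 1.0]
print("lerp midpoint length: ", round(length(lerp(a, b, 0.5)), 4))
print("slerp midpoint length:", round(length(slerp(a, b, 0.5)), 4))
```

for two unit vectors at a right angle, the lerp midpoint shrinks to a length of about 0.71 (it cut through the inside of the circle) while the slerp midpoint stays at length 1. that is the whole difference.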
the cheating-math claim
on glint-1.3 we soup'd three checkpoints together using per-group slerp. the result is in the model card. the merged model gets 68.7% on blimp, the best individual checkpoint got 64.2%. that is a +4.5 point superadditive gain. superadditive is the technical term for: the merged model is better than the best thing you put in.
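the merge config in the repo is the real thing. what follows is only a rough sketch of what per-group slerp over three checkpoints could look like. assumptions, loudly: each checkpoint is a dict mapping a group name to a flat list of weights, and the checkpoints are folded in one at a time with t = 1/2, then t = 1/3, so each one ends up weighted roughly equally. the names `soup_checkpoints` and `slerp`, the toy group names, and the fold order are illustrative, not the shipped recipe.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """spherical interpolation between two flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return [x + t * (y - x) for x, y in zip(a, b)]
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:  # nearly parallel: plain lerp is fine
        return [x + t * (y - x) for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * theta) / sin_theta
    weight_b = math.sin(t * theta) / sin_theta
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def soup_checkpoints(checkpoints):
    """fold checkpoints together, slerping each named group separately.

    assumption: checkpoint k (counting from 1) is merged in with t = 1/k,
    the same schedule a running average would use.
    """
    merged = {name: list(weights) for name, weights in checkpoints[0].items()}
    for k, checkpoint in enumerate(checkpoints[1:], start=2):
        t = 1.0 / k
        merged = {
            name: slerp(merged[name], checkpoint[name], t) for name in merged
        }
    return merged


# toy checkpoints with two hypothetical parameter groups
ckpt_a = {"attn": [1.0, 0.0], "mlp": [0.5, 0.5, 0.0]}
ckpt_b = {"attn": [0.0, 1.0], "mlp": [0.4, 0.6, 0.1]}
ckpt_c = {"attn": [1.0, 0.0], "mlp": [0.6, 0.4, 0.1]}

souped = soup_checkpoints([ckpt_a, ckpt_b, ckpt_c])
for name, weights in souped.items():
    print(name, [round(w, 3) for w in weights])
```

"per-group" here just means the angle is computed separately for each named group instead of once over the whole flattened model, so a group where the checkpoints disagree a lot gets a different arc than a group where they barely moved.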
and the claim that the scores are fake. ok. the scores are on the model card. the eval harness is in the HF repo. not on github. IN THE HF REPO. right there. in the files. next to the weights. in the same place you downloaded the model from. you literally cannot get a more direct line to whether this is real than clicking files and reading them. how simple is that.
the test sets are public. the checkpoints are public. the merge config is public. the eval script is public. all in the same place. one click. i cannot make this more public than it already is. if you think the scores are fake, run the eval yourself. you have a 5090? great. you have a colab? great. you can do it in an afternoon. i will wait.
nobody has done it. because they are not fake.
the doing-the-published-formula framing
ok. we are not cheating math. we are not beating math. we are doing math. specifically, we are taking three tensors of weights, computing the angle between them, and interpolating along the surface of a sphere. that is the operation. it has a name. it is called slerp. it is in textbooks. it has been in computer graphics since the 80s. it is in mergekit. the model soups paper does the straight-line averaging version of the same idea. we are not doing anything new. we are doing the thing that already exists and we are showing that it works for tiny models.
if cheating math means running the published equation and publishing the result, then yes. we are cheating. and you should be mad at every paper that has ever used the formula. which is all of them.
the just-invent-better-architecture counter
this is the just-build-a-rocket-bro of ml twitter. it is technically advice. it is also useless. we are doing the work. it is just slow and looks like a small model that mostly outputs "chuamliamce."
the 1m param research line is explicitly the we-are-trying-weird-stuff line. we are running tokenizer experiments. we are running training-recipe experiments. we are running architecture experiments (glimmer-1 is the start of one). if something better than transformers comes out of this it will come out of the small-model research first because that's where you can afford to break things. the big labs cannot afford to break things. they have shareholders. we have a discord and a 5090.
the vega 7 / 5090 analogy
this is the analogy: you are trying to get big-model performance out of a 1M model by throwing 1T tokens at it. that is the overclocking. you are not making the chip bigger. you are just running it hotter. and a vega 7 does not become a 5090 by being run hotter. it crashes, or it just stays a vega 7 with worse thermals.
fine. we are not doing that. we are not running the 1M model hotter by feeding it 1T tokens and calling it a day. we are doing three distinct things that the analogy collapses into one:
- we are training longer than the standard recipe says we should. this part is the overclock and we know it. the data has to come from somewhere and the bottleneck is the data not the params. ok.
- we are merging checkpoints with slerp, which the standard recipe does not do. this is not overclocking. it is a different operation. we are not running the same model hotter. we are taking three models and combining them. the 5090 is not a vega 7 run hot. it is a different chip. this is the part of the work that actually does the heavy lifting.
- we are doing curriculum + data curation work that the standard recipe does not do. also not overclocking. we are picking which 1T tokens to train on. a vega 7 with a better cooling solution is still a vega 7. a vega 7 running a workload it was actually tuned for is a different thing. that is what curriculum does.
so the analogy is wrong because it implies we are doing one thing (running a small model hot) when we are doing three things (running it longer, merging it, curating the data). the 5090 is not a vega 7 with more tokens. the 5090 is a chip that was architected differently. we are not claiming to be a 5090. we are claiming that the vega 7 with 1T tokens comparison is leaving out the slerp and the curriculum. with all three, you get something that is not a 5090 but is also not the vega 7 the analogy is picturing. that is the whole pitch.
the kaplan 2020 scaling laws paper
this gets dropped a lot, like a trump card. ok. i have read it. more than once. we have all read it. arman has it printed out. here is what it actually says: loss falls off as a power law in model size, data size, and compute, and for a given compute budget there is an OPTIMAL way to split that budget between params and tokens.
what it does NOT say is: you cannot train a small model on a lot of data. compute-optimal is a statement about how to spend a fixed budget, not a ban on training a small model way past that point. and as far as we can tell the small-model-big-data corner of the chart is UNDER-EXPLORED. everyone runs the same diagonal. we are not running the same diagonal. we are running the other one. the laws do not forbid it. it is just that nobody has done it well, which is a different statement.
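to put a number on how far off the usual diagonal this is: a minimal sketch that computes tokens per parameter for the glint-1.3 run (982k params, 100B tokens, both from this post) and compares it to the roughly 20 tokens per parameter rule of thumb usually quoted from the chinchilla follow-up work. that 20 is an outside assumption, not a number from our runs, and the 500M / 1T row is a hypothetical config for illustration only.

```python
# assumption: commonly quoted compute-optimal rule of thumb, not from this post
RULE_OF_THUMB_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens: float, params: float) -> float:
    return tokens / params


def overtrain_factor(tokens: float, params: float) -> float:
    """how many times past the rule-of-thumb ratio a run goes."""
    return tokens_per_param(tokens, params) / RULE_OF_THUMB_TOKENS_PER_PARAM


runs = {
    "glint-1.3 (numbers from the post)": (982_000, 100e9),
    "hypothetical 500M model on 1T": (500e6, 1e12),
}

for name, (params, tokens) in runs.items():
    ratio = tokens_per_param(tokens, params)
    factor = overtrain_factor(tokens, params)
    print(f"{name}: {ratio:,.0f} tokens/param, {factor:,.0f}x the rule of thumb")
```

glint-1.3 lands at around 100k tokens per parameter, thousands of times past the rule of thumb. that is what "the other diagonal" means in practice.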
the-law-says-i-cant is a great way to never discover anything. the law said flight was impossible. the law said heavier-than-air was a fantasy. the law said a 1M model could not get 68.7% on BLiMP. we did it. the law is now updated. we are trying to do that again.
the diminishing returns data point
this is a real datapoint and i want to engage with it honestly. ok. what was the model, what was the eval, what was the seed. because:
- if you ran a 1M model on 24B tokens and then 34B tokens and saw 0% improvement on wikitext perplexity, that is consistent with what we see too. wikitext saturates fast for tiny models. the curve flattens.
- if you saw 0% improvement on blimp, that is also kind of expected at that scale, blimp has a ceiling around 65-70% for small models.
- if you saw 0% improvement on arc-easy, that is the one i would push on. arc-easy should still be moving at that range.
- if you saw 0% improvement on the downstream task you actually care about, the eval you wrote yourself... that is the only one that matters and we have no way to reproduce it.
the thing nobody in that thread was doing: running slerp. we are not seeing 0% improvement because we are not just adding tokens. we are adding tokens AND merging checkpoints. the merged model is what gets the gain. the individual checkpoints plateau. the soup keeps improving. that is the whole point. the claim that more data alone doesnt help is true if you only train one model. it is less true if you train many and merge.
the chat-capability ceiling
ok. show me the proof. what does really good at chat even mean. because:
- smollm2-135m can answer basic questions without falling over. it is 135M, not 1M. but it is proof that the floor is not zero at this scale.
- phi-1.5 at 1.3B is genuinely chatty. we are not at 1.3B. but we are closer to 1.3B than to 1M in terms of research effort.
- tinyllama 1.1B is a real chat model on a real phone.
- none of these are 1M. and we are not claiming 1M is really good at chat today. we are claiming we are working on it.
never is a strong word. the people saying never are usually the people who have not tried. we have tried. we are still trying. the model is not really good at chat yet. it is closer to weirdly consistent at saying chuamliamce. but never is a claim about the future, and the future is not in your dataset.
the data-capacity limit theory
this is at least a technical question so i will answer it like a technical question.
the model weights do not need to fit the training data. the training data is the training data. the weights are a compressed representation of the patterns the model found in the data. that is what training does. you do not need 1TB of weights to learn from 1TB of text. a 1M param model with 4-bit weights is 0.5MB. the model has 0.5MB of space for learned patterns. that is fine. that is how compression works. a 100KB jpeg is not limited to 100KB of photo content. it is a representation of a much larger photo.
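the size arithmetic, in case anyone wants to check it. a minimal sketch: parameter count times bits per weight, divided by 8 to get bytes. it only counts the raw weights and ignores any file-format overhead or quantization scales, which is a simplifying assumption. `weight_bytes` is a name made up for this example.

```python
def weight_bytes(params: int, bits_per_weight: int) -> float:
    """raw weight storage in bytes, ignoring metadata and quantization scales."""
    return params * bits_per_weight / 8


PARAMS = 1_000_000  # the round "1M" figure used in the argument

for bits in (32, 16, 8, 4):
    megabytes = weight_bytes(PARAMS, bits) / 1e6
    print(f"{bits:>2}-bit weights: {megabytes:.2f} MB")
```

at 4 bits that is 0.50 MB, the number quoted above. at full 32-bit precision the same model is 4.00 MB.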
what matters is: does the model learn useful things from a lot of data. and the answer is: yes, it does, even at this scale, especially when you do the curriculum and the slerp and the rest of it. not as much as a bigger model. but enough to be interesting. and that is what we are testing.
the claim that 1M can only learn 8-10MB of unique incompressible data is true if you are storing the data verbatim. it is not what we are doing. we are not storing the data. we are storing a model of the data. those are different.
and the meta point: vibes are not evidence
the claim that this is impossible, made without a training log, without an eval trace, without a failed run, is just a vibe. and vibes are not evidence. if you think we are wrong: fork the recipe. train a 1M model. soup it. publish the result. we will read it. we will cite it. we will probably dm you about it. that is the deal. that is how research works.
the proof is in the training runs. come run some with us. or run some against us. we do not care. we just want the data.
the other stuff (less cool but still cool)
a few things happened that don't deserve their own section but im going to mention anyway because this is my blog and i can:
- fable-5-traces is now trending on huggingface. 2.5k+ likes. trending for 5 days straight. it is a dataset of distillation traces from the (now gone) fable 5 model. 28 models have been trained or fine-tuned on it in the last 2 weeks. it is the most popular thing ive ever made and i did not see it coming.
- glimmer-1 shipped. 11,900 parameters. yes, eleven thousand nine hundred. an entire llm in 12k params. it scores 25% on arc-easy and 52% on blimp and it is the dumbest and also the most beautiful thing we have ever made. it is also a stunt. we know.
- anthos-1 (text-to-image, our only non-text model) crossed 1000 downloads. it generates flowers. only flowers. it is the most single-purpose model on huggingface and i love it.
- glint-trace shipped. distillation traces from larger models. 73 downloads as of writing this. somebody out there is using it. i have no idea who. if that person is reading this, hi, please tell me what you are doing with it.
- the discord is alive. for a long time it was just me, the bot, and the sound of my own typing. now there are actual people in there. people who are smarter than me about things. this is a mixed blessing but mostly good.
the team is 4 now. me, shane, arman, ender. shane does infra and pretraining ablations. arman does data and post-training. ender does the things nobody else wants to do (tokenizer work, evaluation harness maintenance, making sure my code doesn't catch fire). i do whatever's left, which is mostly writing model cards and panicking.
whats next for the blog
more posts. real ones. not model cards (model cards dont count, they're spec sheets, dont @ me). pretraining updates, ablations, the stuff that broke, the stuff that worked and then broke, the failed runs, the "we spent 3 weeks on this and got nothing" posts. maybe a post about the time glint-1 convinced itself it was named greg. that one is coming.
if you want to follow along: hf for the models, discord for the chaos, ko-fi if you want us to keep doing this instead of getting real jobs.
ok that's it. that's the post. blog is alive again. supralabs who. minimythos is coming. 1T tokens is real. see you in like 3 weeks when i forget to write one and then panic about it on a sunday night and post something unhinged at 2am.
/lane
glint research, 2026, still tiny, still trying, now with 142 friends and 2.5k likes on a dataset we did not expect to take off