I'm one of the many people who Soumith hired to Meta and PyTorch. I had the privilege of working on PyTorch with him and lots of the folks on this post.
As his longtime colleague, the one thing I would want people to know about him and this decision is that Soumith has always viewed PyTorch as a community project. He consistently celebrated the contributions of his co-creators Adam and Sam, and he extended the same view towards Yangqing and the Caffe2 crew that we merged into PyTorch. At the very beginning, by Soumith's highly intentional design, PyTorch was aimed at being truly developed by and for the AI research community, and for many years that was the key way in which we grew the framework, the FB PT team, and the wider community. At every single stage of PT's lifecycle, he always ensured that our conception of PT and its community grew to include and celebrate the new people and organizations growing what was possible with PT. He's an incredible talent magnet, and thus more and more smart people kept dedicating their blood, sweat, and tears to making PT bigger and better for more people.
I've worked with some very well known and highly compensated leaders in tech, but *no one* has done the job he has done of ameliorating the bus-factor problem with his baby. PT has a unique level of broad support that few other open source technologies can reach. In a world of unbounded AI salaries, people who want to move AI research methods forward still freely give their time and attention to PyTorch and its ecosystem. It's the great lever of this era of AI that is moving the world, *due in large part* to the strength of the community he fostered and can now let continue without his direct involvement.
His departure is the end of an era, but it's also operationally a true non-event. PyTorch is going strong and can afford to let one of its creators retire from stewardship. This is precisely what success looks like in open source software.
He deserves our congratulations and our thanks. Enjoy your PT retirement, man.
Also worked with Soumith. The man is a legend, moves mountains and completely changed the course of my career because he liked something I wrote. No arrogance, no politics, just an extremely down to earth and chill guy who elevates everyone around him.
What I find most interesting with this is that it shows they believe there is nothing unique at Meta related to AI. There is no resource, whether people or computing power, that they can't get elsewhere for whatever they believe would be more interesting to them.
I mention this because it feels analogous to military research, where people "dream" of how advanced the military is, how far ahead they are of public research... and yet, it seems to be a recurring myth they love to sustain.
So the signal I get here is that AI "labs" in BigTech have nothing worth waiting for around the corner; it's just more of the same, and boring for people who stick around.
I think you might be reading a bit too much into this.
He’s been with Meta for 11 years and is likely in a very comfortable financial position, given the substantial stock options he’s received over that time.
He also mentioned the arrival of a new child, and it’s well known that Meta's work-life balance isn’t always ideal.
On top of that, Meta, like many major tech companies, has been shifting its focus toward LLM-based AI, moving away from more traditional PyTorch use cases.
Considering all of this, it seems like a natural time for him to move on and pursue new, more exciting opportunities.
> On top of that, Meta, like many major tech companies, has been shifting its focus toward LLM-based AI, moving away from more traditional PyTorch use cases.
This is very wrong. Meta is on the forefront of recommendation algorithms and that's all done with traditional ML models made using PyTorch.
Some recommendations are uncanny, except that I don't want any of them in my Facebook news feed and no matter how often I select "never show me this feed again," it keeps trying.
Everyone is fine-tuning constantly, though. Training an entire model in excess of a few billion parameters from scratch is on pretty much nobody's personal radar; you have a handful of well-funded groups using PyTorch to do that. The masses are still using PyTorch, just on small training jobs.
Fine-tuning is great for known, concrete use cases where you have the data in hand already, but how much of the industry does that actually cover? Managers have hated those use cases since the beginning of the deep learning era — huge upfront cost for data collection, high latency cycles for training and validation, slow reaction speed to new requirements and conditions.
That's wrong. Llama.cpp / Candle don't offer anything that PyTorch cannot do (design-wise). What they offer is a smaller deployment footprint.
What's modern about LLMs is the training infrastructure and the single-coordinator pattern, which PyTorch has only just started on and which is still inferior to many internal implementations: https://pytorch.org/blog/integration-idea-monarch/
Pytorch is still pretty dominant in cloud hosting. I’m not aware of anyone not using it (usually by way of vLLM or similar). It’s also completely dominant for training. I’m not aware of anyone using anything else.
It’s not dominant in terms of self-hosted where llama.cpp wins but there’s also not really that much self-hosting going on (at least compared with the amount of requests that hosted models are serving)
About the military, from my limited experience, they are significantly behind the civilian state of the art, except for technology that has few applications outside of the military, like stealth.
In fact everything secret tends to be behind. Secrecy is a huge burden, and seriously limits all forms of collaboration.
In addition, because military projects are often big and highly politicized, you get all the inefficiencies that go with that. Classification is also convenient for hiding screwups and corruption.
I spent a dozen years as a US defense contractor across a broad spectrum of places (from R&D for the future to working with E3's today), and worked at internet scale and start-up B2B stuff in the other dozen years of my working career.
I think that the major difference about deployed military technologies- in contrast to both military R&D and the entire commercial side- is that they are, by and large, incredibly rock solid and reliable. If they aren't, they don't actually get used. It takes a lot of effort to get them that way. I remember once at a testing ground for our robot tanks of the far future, right next door was an outdoor test-track. And they were testing a kitchen trailer (a kitchen for ~200 men that can be towed by a Humvee). And they drove it around the track continuously for three weeks, stopping only long enough to change drivers/vehicles, and four times a day they would halt, make meals for 200 people, and then pack up and get back to driving. This was one of several reliability tests that the kitchen trailer had to get through before it was accepted for service.
Our R&D stuff couldn't handle that (it needed 3-4 engineers to carefully monitor it at all times), but the stuff that needed to be in the hands of some random 18 year old with a two week training course had to be rock solid to use, do regular maintenance on, and fix, even when they were only getting four hours of sleep a night. If it wasn't up to that level, then the troops ended up ignoring it, leaving it behind when they went out to do their job. And by and large, from what I could tell, most of the stuff they had was that reliable. There were some cool things that we were doing in the R&D space, but we were a long way from that level.
One thing I meant to add: this extensive testing- and the enormous amount of documentation/training materials necessary to take an 18 year old with average ASVAB scores and produce someone who can cook meals for 200 other soldiers on four hours of sleep a night- is both why military things cost so much, relative to commercial grade stuff, and why they don't get updated particularly often. Frequent software updates that change menus around play havoc with the detailed training powerpoints that the military relies on to produce those 18 year old tool operators.
Secret Squirrel projects (which I was near but never read into) can get away with lower reliability because they can count on the users to be much better trained and prepared, though again, from my brief encounters with these sorts, they will ignore anything they don't trust to be completely reliable. Reliability matters far more than cutting edge for like 99.9% of military gear.
> a 318-page report [...] said the SAIC software was incomplete, inadequate and so poorly designed that it would be essentially unusable under real-world conditions. Even in rudimentary tests, the system did not comply with basic requirements
I figured the reason Palantir was so successful was because it was a SV software company instead of a defense contractor dabbling in IT or specialized government consultancy.
Post Cold War, most militaries shifted to COTS and less boutique development. Turns out, you only need to put resources in a few places to stay ahead (stealth, sensing and measuring, space, hypersonics, drones, etc).
The military doesn't have the luxury of things being unreliable. It puts a pressure on them that corporations don't necessarily have: they'd rather have a less-effective but proven system than a potentially-more-effective but riskier system (especially since each system they have comes with massive logistics support).
Ironically, corporations can afford to take more risks of failure (financially and project-wise) than militaries because failure for them doesn't mean actual human death (and when it can, you see processes come in that look a lot more like military processes).
It's actually the commercial/consumer side that gets more reliability than the military side.
The military should have very reliable systems, and they often know the point at which their systems will fail (MTBF calculations are easier to develop with their record keeping). However, the military also has an almost unlimited budget and body count to keep just reliable enough things working much better than they should. It's also really bad about actually competing companies against each other.
The commercial sector, targeting consumers, is where you actually get reliable systems. Why? Because consumers will go towards either the cheapest option (reliability is replaced with ubiquity in the market, it's replaceable) or the more reliable but more expensive options. They (individuals) don't have an unlimited budget or unlimited time to maintain everything in their life. There's competition in the commercial world that's completely absent in the military world
The two major exceptions are where COTS products have taken over (definitionally, DOD is using commercial, often consumer-targeted, products instead of military specific products) and special forces. Special forces often bypasses normal acquisitions processes and so ends up having a better chance to compete vendors against each other than other parts of the military.
This doesn't mean everything the DOD procures through normal acquisitions is inherently unreliable, but reliability is only one of many factors and often only really discovered after selection and full-rate production has started. By that point, the DOD is committed to it for years to come. Each DOD procurement is separate enough from others that you don't even get huge opportunities for reuse. The F-35, to pick something from this century, didn't get components that were shared with other aircraft in the DOD fleet. It's almost all new, which means a lot of things were learned about its reliability after it started flying. It has new comms, new radar, new almost everything. Even the engine (though that probably used many subcomponents shared with other engines) was a new engine just used by the F-35.
> What I find most interesting with this is that it shows they believe there is nothing unique at Meta related to AI
Whether or not this is the case, I don't get this as being the reason for Soumith leaving - it sounds as if he is just ready for a change.
Still, it is noticeable that with many of the AI companies claiming that their version of "AGI" is just around the corner, developers and staff don't appear to be particularly excited about this (I assume they realize it is just hype, not some momentous advance around the corner), and leave to pursue different things, such as Mira Murati starting a fine-tuning company, Karpathy going back to education, others switching ship (typically from OpenAI to Anthropic), etc.
"Ready for change" is just the polite way to say, "I can't stand it here anymore. I'd rather roll the dice on a new place because reversion-to-mean means it's probably going to be better than whatever this has become."
There are a lot of things I don't like about my current job, but not enough for it to make sense to gamble on a new place. It's easier to push for change from my current position than to count on any new place being any better.
But if it gets worse and I do leave, I'll definitely be telling the interviewer, "I was just ready for a change."
On the other hand, while I know nothing about Soumith, he clearly has enough financial runway (see my calc below) to not have to work again.
As far as I know, we all get one life. If one can help it (modulo other constraints), one should not get trapped by prestige, achievement, short-term admiration by others, impact and external facing factors. To see an alternate reality, it helps to escape the bubble, for example, by spending time in a completely different culture or environment where no one knows or cares about what one did.
I admire people taking such decisions. It's easy to be on autopilot in life. People who wear their success lightly are rare, but they're more philosophically aware, in my opinion at least. I wish him good luck!
> see an alternate reality, it helps to escape the bubble, for example, by spending time in a completely different culture
I'm in a similar position now and need to make a decision. The problem is that after leaving the IT world for a while, it will be hard to get back. I'll have to change my life completely and discard all the knowledge and expertise I have. That will be fun, interesting, eye-opening, etc., but there's no way back.
I don't know you, don't know your situation, but this does not seem to match the experiences of many of my friends who left for a while and then came back. "Spent two years starting a restaurant" and "had to take care of my parents" were not blockers for getting another computer related job in due time. There are few truly irrevocable decisions in our life.
Now, the current job market makes this significantly harder than it was in the 2010's, but that's floating over all of us- if your company does an Amazon tomorrow, would you get a job as nice as you currently have? Maybe, maybe not.
In executive roles, your expertise really is in management acumen a lot of the time. But as an individual contributor--or adjacent--once you're out of a technical space for a few years, it's increasingly hard to get back in even if you've casually kept a finger in.
Exactly, the only way to stay current is to keep doing something at least half time. The good thing is it doesn't have to be the same as your previous job. Just keep the brain working and learning.
I think age plays an important part in the decision to move away from a place. I think in your 20s or very early 30s you have far more leeway to kind of go away and start again, but a lot of the hope to actually be able to find that unicorn workplace fades away as you approach your late 30s. Once into your 40s, depending on your trade, you're dead on arrival unless you successfully manage to rebrand yourself as a consultant, whatever the fuck that means.
Can be*, that's not necessarily always true. I've quit jobs plenty of times without having any plan for the future or particular drama-reason for leaving, just "It's not as fun here anymore, despite this being a great place to work", I'm sure I'm not the only one who does so.
What I've never done though, is leaving a place without being 100% honest exactly why I'm leaving. I won't say "I was just ready for change" if that wasn't the reason, I have no reason not to be honest about why I'm leaving.
I've generally had 10+ year tenures other than a return to school that was basically always in my plan and dot-bomb (leaving a company I wasn't really a fit with anyway). But, yeah, I've always been ready to move on at about that ten year point which is actually fairly long by a lot of people's standards in the tech industry.
I do disagree though that, unless there's some actionable change that would specifically benefit you like more money, my answer outside of private conversations with people I know well is going to be some variant of "time for a change." Anything else just invites arguments and conversations I don't want to have.
I do want to push back on this a little. People leave all the time for this "I wanna see what else is out there" especially at such senior levels and with so much financial security as he inevitably has from working at Meta for 11 years. It is not always a gamble for many of them, and many of them are not so skeptical and cynical of other places they could go and bring their expertise
I don't think that's the read? Guy says he wants to work on something small. If you want to work on something big you probably want to be in a big corp to have the resources to do the big thing.
Also absolutely unknown if the "new thing" is AI-related at all!
> If you want to work on something big you probably want to be in a big corp to have the resources to do the big thing.
If anything, the reverse seems to be true, if you want to work on something big, you want to be in a small company, sufficiently funded, filled with great people, yet not "big", that's when "something big" seems to be more likely to happen.
In contrast, as far as I can think, the bigger a company gets, the less likely they are to actually come up with "something big", it seems like most of the times you need (creative) constraints in order for the results to end up being actually innovative, otherwise you end up like IBM and Meta, throwing money on stuff and getting some results, but nothing really out of the ordinary considering what's happening elsewhere in their ecosystems.
Well, he left, so whatever is coming next, AI-related or not, "small" or not (small for them might be reaching just a million people; he wrote that he "lead the software layer that powers the entire AI industry," so his notion of scale is probably unlike mine, maybe yours too) is more exciting to him than whatever he could do next with all of Meta's resources.
Edit: to be clear, I didn't mean to imply their next thing is AI related, solely that they obviously know more about AI at Meta than e.g. XR at Meta, just because that's their expertise.
Pretty crazy/bizarre that a VP/Fellow engineer would have such little say in what they do at Meta. In my mind, companies would do everything possible to retain them. They are a special and rare breed.
> where people "dream" of how advanced the military is
If you've ever worked on "advanced military grade" equipment, you'd know better.
It tends to be what you'd euphemistically call "well-proven technology", built down to a price by the lowest bidder, by comparatively unskilled labour.
The most shocking thing about the "captured" Russian drones is they use name-brand Raspberry Pis inside. I'm prepared to bet the American versions use whatever AliExpress crap is on special this week. The UK stuff definitely does.
I mean, these things do exist. There are always tons of big and small tech projects floating around in the special operations community. Cutting-edge sets of hybrid night/thermal vision. Classified helicopters. Hand-built rifles with custom cartridges. Classified medical tech. Advanced fixed wing aircraft with unique capabilities. Advanced dive gear. So on.
"Big Army" doesn't see that stuff for decades, if ever, and mostly never due to cost. And I'm not even getting into classified submarine and nuclear tech, fixed wing drones and aircraft flying at night out of known test facilities, etc.
There's tons of actually advanced tech out there in military circles.
Yeah. The DOD is enormous and definitely has your boring everyday stuff, but also tons of skunk-works R&D, just not very public. An organization that big has all kinds of nooks and crannies, so it isn't really that monolithic.
If you can afford to support yourself, which I'm sure he can, there's a serenity to working on small projects that are not in the public eye. It may simply be that he craves some quiet time that enables him to focus on his family and himself.
Negative, what you should have taken away is that it's the people. He mentions standing up clusters. Small shops can't afford clusters. Ignore the technical aspect of this article and read it for what it is: a thank-you note to the people he has worked with on amazing projects. Research in a bubble of 1 isn't very useful. Research in a small team with a Meta budget is extremely useful. With the right people.
Having friends who are at or near both FAIR and other AI parts of Meta, resources are not the issue, anymore at least (there had been a massive squeeze for the last two years, though). But PyTorch and FAIR use(d) an AWS-based cluster. (PyTorch is used everywhere else inside Facebook, though... well, not everywhere.)
There is/are plenty of interesting things happening at big tech, and Meta specifically. If you like computer vision, then Meta is pretty much still the world leader. Much as it pains me to say it.
> I mention this because it feels analogous to military research, where people "dream" of how advanced the military is, how far ahead they are of public research... and yet, it seems to be a recurring myth they love to sustain.
I don't think that you can read this from the blog post at all, but it gives me a chuckle to think how the quest for AGI at Meta may be "The Men Who Stare at Goats" all over again.
I'm totally speculating. I have no extra information there.
It just makes me think of all the staff, technical staff, that left OpenAI recently. Altman was making grand claims about what was coming next.
Well, we know what followed. Namely, I don't think any researcher who left, knowing what was in the pipeline, feels like they missed much in terms of access.
The non-fiction book behind it is probably a better comparison than the film adaptation, if you think Meta are doing goat-staring (I don't think they're especially bad on this issue compared to their rivals).
Unlimited compute resources aren’t literally unique but there are only a small handful of places in the world that have that.
Vast quantities of private data, especially text communications and images. Very few places have that. Coupled with a culture that puts zero privacy protections on that data. Even Google likes to think they’re doing the right thing, so I think that makes Meta unique.
That man has an infectious enthusiasm. I remember the DCGAN paper inspired me to try getting the (Lua) Torch code to work, and I tried it on the Oxford flowers dataset early on. It worked surprisingly well, and Soumith Chintala even shared it around on social media, surprised at how well it worked on such a small dataset. Of course back then we didn't really appreciate the problem of mode collapse.
Pytorch and old Lua Torch were a pleasure to work with compared to the contemporary Tensorflow. Lots of S.C's code was copied around liberally, it had its quirks (I remember the DCGAN code had a pretty odd way of doing parameter passing) but it was also really easy to understand and made random people like me feel like we had suddenly stumbled onto something crazy powerful (which we had!). It was wonderfully hackable.
“ What's next for me? Something small. Something new. Something I don't fully understand yet. Something uncomfortable. I could have moved to something else inside Meta. But I needed to know what's out there. I needed to do something small again. I couldn't live with the counterfactual regret of never trying something outside Meta.”
Ah yes, shades of Siddhartha. I almost forgot about the part where he worked for a megacorp that was ripping society’s social fabric apart and wanted to do something else for a while.
He was. He didn't just parachute into Meta to start working on PyTorch. He worked in many areas of the product, was a member of the senior technical staff, and was knowledgeable about many aspects of the company.
As a loyal JAX user, I hope they can play catch-up. PyTorch has dominated the AI scene since TF1 fumbled the ball at the 10-yard line. What Matt Johnson has done turning Autograd into JAX is hopefully going to be worthy of as much praise as what Soumith has received.
I see good answers already, but here's a concrete example:
At my university we had to decide between the two libraries so, as a test, we decided to write a language model from scratch. The first minor problem with TF was that (if memory serves me right) you were supposed to declare your network "backwards" - instead of saying "A -> B -> C" you had to declare "C(B(A))". The major problem, however, was that there was no way to add debug messages - either your network worked or it didn't. To make matters worse, the "official" TF tutorial on how to write a Seq2Seq model didn't compile because the library had changed, and the bug reports for that were met for years with "we are changing the API so we'll fix the example once we're done".
PyTorch, by comparison, had the advantage of a Python-based interface - you simply defined classes like you always did (including debug statements!), connected them as variables, and that was that. So when I and my beginner colleagues had to decide which library to pick, "the one that's not a nightmare to debug" sounded much better than "the one that's more efficient if you have several billion training datapoints and a cluster". My colleagues and I then went on to become professionals, and we all brought PyTorch with us.
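To make the contrast concrete, here is roughly what that PyTorch side looked like - a minimal sketch with made-up layer names and sizes, not the actual model we wrote:

```python
import torch
import torch.nn as nn

# A toy language-model-ish network, written as an ordinary Python class.
class TinyModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        print("embedded:", x.shape)   # ordinary debug statement, runs eagerly
        x, _ = self.rnn(x)
        print("rnn output:", x.shape)
        return self.out(x)

model = TinyModel()
logits = model(torch.randint(0, 1000, (2, 16)))  # just call it like a function
print(logits.shape)                              # torch.Size([2, 16, 1000])
```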
This was also my experience. TensorFlow's model of constructing then evaluating a computation graph felt at odds with Python's principles. It made it extremely difficult to debug because you couldn't print tensors easily! It didn't feel like Python at all.
Also the API changed constantly so examples from docs or open source repos wouldn't work.
They also had that weird thing about all tensors having a unique global name. I remember I tried to evaluate a DQN network twice in the same script and it errored because of that.
It's somewhat vindicating to see many people in this thread shared my frustrations. Considering the impact of these technologies I think a documentary about why TensorFlow failed and PyTorch took off would be a great watch.
In 2018, I co-wrote a blog post with the inflammatory title “Don’t use TensorFlow, try PyTorch instead” (https://news.ycombinator.com/item?id=17415321).
As it gained traction here, it was changed to “Keras vs PyTorch” (some edgy things that work for a private blog are not good for a corporate one). Yet the initial title stuck, and you can see it resonated well with the crowd.
TensorFlow (while a huge step on top of Theano) had issues with a strange API, mixing needlessly complex parts (even for the simplest layers) with magic-box-like optimization.
There was Keras, which I liked and used before it was cool (when it still supported the Theano backend), and it was the right decision for TF to incorporate it as the default API. But it was 1–2 years too late.
At the same time, I initially looked at PyTorch as some intern’s summer project porting from Lua to Python. I expected an imitation of the original Torch.
Yet the more it developed, the better it was, with (at least to my mind) the perfect level of abstraction. On the one hand, you can easily add two tensors, as if it were NumPy (and print its values in Python, which was impossible with TF at that time). On the other hand, you can wrap anything (from just a simple operation to a huge network) in an nn.Module. So it offered this natural hierarchical approach to deep learning. It offered building blocks that can be easily created, composed, debugged, and reused. It offered a natural way of picking the abstraction level you want to work with, so it worked well for industry and experimentation with novel architectures.
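A rough sketch of that hierarchy (the module names and sizes here are mine, purely for illustration): plain tensor math at the bottom, a small nn.Module around it, a bigger nn.Module around that.

```python
import torch
import torch.nn as nn

# Bottom level: plain tensor math, NumPy-style, printable at any point.
a, b = torch.randn(3), torch.randn(3)
print(a + b)

# Middle level: wrap a small operation in a module.
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.fc(x))  # same "add two tensors" as above

# Top level: compose modules into a bigger module.
class Net(nn.Module):
    def __init__(self, dim=32, depth=4):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 10)

    def forward(self, x):
        return self.head(self.blocks(x))

print(Net()(torch.randn(8, 32)).shape)  # torch.Size([8, 10])
```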
Actually, I opened “Teaching deep learning” and smiled as I saw how it evolved:
> There is a handful of popular deep learning libraries, including TensorFlow, Theano, Torch and Caffe. Each of them has Python interface (now also for Torch: PyTorch)
> [...]
> EDIT (July 2017): If you want a low-level framework, PyTorch may be the best way to start. It combines relatively brief and readable code (almost like Keras) but at the same time gives low-level access to all features (actually, more than TensorFlow).
> EDIT (June 2018): In Keras or PyTorch as your first deep learning framework I discuss pros and cons of starting learning deep learning with each of them.
The original TensorFlow had an API similar to the original Lua-based Torch (the predecessor to PyTorch) that required you to first build the network, node by node, then run it. PyTorch used a completely different, and much more convenient approach, where the network is built automatically for you just by running the forward pass code (and will then be used for the backward pass), using both provided node types and arbitrary NumPy compatible code. You're basically just writing differentiable code.
This new PyTorch approach was eventually supported by TensorFlow as well ("eager execution"), but the PyTorch approach was such a huge improvement that there was an immediate shift by many developers from TF to PyTorch, and TF never seemed able to regain the momentum.
TF also suffered from having a confusing array of alternate user libraries built on top of the core framework, none of which had great documentation, while PyTorch had a more focused approach and fantastic online support from the developer team.
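"Writing differentiable code" means exactly that: ordinary Python control flow that autograd traces as it runs. A toy example, not taken from either framework's docs:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Arbitrary Python control flow: the graph is built as this code runs.
y = x
for _ in range(3):
    y = y * y if y < 100 else y + 1.0

y.backward()
# For this input the loop squares three times, so y == x**8 and
# x.grad == 8 * x**7 == 1024.0.
print(y.item(), x.grad.item())
```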
LuaTorch was eager-execution too. The problem with LuaTorch was the GC. You cannot rely on a traditional GC to do a good job here: each tensor is megabytes (at the time), now gigabytes, large, so you need to collect them aggressively rather than at intervals. (Python's reference-counting system solves this issue, and of course, by "collecting" I don't mean freeing the memory - PyTorch has a simple slab allocator to manage CUDA memory.)
With Lua Torch the model execution was eager, but you still had to construct the model graph beforehand - it wasn't "define by run" like PyTorch.
Back in the day, having completed Andrew Ng's ML course, I built my own C++ NN framework copying this graph-mode Lua Torch API. One of the nice things about explicitly building a graph was that my framework supported having the model generate a GraphViz DOT representation of itself so I could visualize it.
Ah, I get what you mean now. I was mixing up the nn module and the tensor execution bits. (To be fair, the PyTorch nn module carries over many of these quirks!)
I'm no machine learning engineer but I dabbled professionally with both frameworks a few years ago and the developer experience didn't even compare. The main issue with TF was that you could only choose between a powerful but incomprehensible, poorly documented [1], ultra-verbose and ever-changing low-level API, and an abstraction layer (Keras) that was too high level to be really useful.
Maybe TF has gotten better since but at the time it really felt like an internal tool that Google decided to just throw into the wild. By contrast PyTorch offered a more reasonable level of abstraction along with excellent API documentation and tutorials, so it's no wonder that machine learning engineers (who are generally more interested in the science of the model than the technical implementation) ended up favoring it.
[1] The worst part was that Google only hosted the docs for the latest version of TF, so if you were stuck on an older version (because, oh I don't know, you wanted a stable environment to serve models in production), well tough luck. That certainly didn't gain TF any favors.
For me it was about 8 years ago. Back then TF was already bloated but had two weaknesses. Their bet on static compute graphs made writing code verbose and debugging difficult.
The few people I know back then used keras instead. I switched to PyTorch for my next project which was more "batteries included".
First, the migration to 2.0 in 2019 to add eager mode support was horribly painful. Then, starting around 2.7, backward compatibility kept being broken. Not being able to load previously trained models with a new version of the library is wildly painful.
I just remember TF1 being super hard to use as a beginner and Google repeatedly insisting it had to be that way. People talk about the layering API, but it's more than that, everything about it was covered with sharp edges.
Imagine a total newbie trying to fine-tune an image classifier, reusing some open source example code, about a decade ago.
If their folder of 10,000 labelled images contains one image that's a different size to the others, the training job will fail with an error about unexpected dimensions while concatenating.
But it won't be able to say the file's name, or that the problem is an input image of the wrong size. It'll just say it can't concatenate tensors of different sizes.
An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
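For what it's worth, the "data cleansing script" in question is about ten lines, which is exactly the problem: a newbie doesn't yet know they need it. A hypothetical sketch (the directory, glob pattern, and expected size are all made up):

```python
from pathlib import Path
from PIL import Image

EXPECTED_SIZE = (224, 224)        # whatever the example code assumes
data_dir = Path("data/train")

bad = []
for path in sorted(data_dir.glob("**/*.jpg")):
    with Image.open(path) as img:
        if img.size != EXPECTED_SIZE:
            bad.append((path, img.size))

for path, size in bad:
    # The filename the framework's error message never told you about.
    print(f"{path}: {size}, expected {EXPECTED_SIZE}")
```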
> An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
Even seasoned developers will bounce away from frameworks or libraries - no matter if old dogs or the next hot thing - if the documentation isn't up to speed or simple, common tasks require wading through dozens of pages of documentation.
Writing good documentation is hard enough, writing relevant "common usage examples" is even harder... but keeping them up to date and working is a rarely seen art.
And the greatest art of all of it is logging. Soooo many libraries refuse to implement detailed structured logging in internal classes (despite particularly Java and PHP offering very powerful mechanisms), making it much more difficult to troubleshoot problems in the field.
I personally believe TF1 was serving the needs of its core users. It provided a compileable compute graph with autodiff, and you got very efficient training and inference from it. There was a steep learning curve, but if you got past it, things worked very, very well. Distributed TF never really took off - it was buggy, and I think they made some wrong early bets in the design for performance reasons that should have been sacrificed in favor of simplicity.
I believe some years after the TF1 release, they realized the learning curve was too steep and they were losing users to PyTorch. I think the Cloud team was also attempting to sell customers on their amazing DL tech, which was falling flat. So they tried to keep the TF brand while totally changing the product under the hood by introducing imperative programming and gradient tapes. They killed TF1, upsetting those users, while not having a fully functioning TF2, all the while having plenty of documentation pointing to TF1 references that didn't work. Any new grad student made the simple choice of using a tool that was user-friendly and worked, which was PyTorch. And most old TF1 users hopped on the bandwagon.
I only remember 2015 TF and I was wondering: why would I use Python to assemble a computational graph when what I really want is to write code and then differentiate through it?
Not OP. I prefer JAX for non-AI tasks in scientific computing because of the different mental model than PyTorch. In JAX, you think about functions and gradients of functions. In PyTorch you think about tensors which accumulate a gradient while being manipulated through functions. JAX just suits my way of thinking much better.
I also like that jax.jit forces you to write "functional" functions free of side effects or inplace array updates. It might feel weird at first (and not every algorithm is suited for this style) but ultimately it leads to clearer and faster code.
I am surprised that JIT in PyTorch gets so little attention. Maybe it's less impactful for PyTorch's usual use case of large networks, as opposed to general scientific computing?
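A minimal sketch of that mental model (the loss function here is just an arbitrary example): you write a pure function, then ask JAX for its gradient as another function.

```python
import jax
import jax.numpy as jnp

# A pure function of its inputs: no state, no side effects.
def loss(w, x, y):
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

grad_loss = jax.grad(loss)      # gradient w.r.t. the first argument, as a new function
fast_grad = jax.jit(grad_loss)  # and a compiled version of that function

key = jax.random.PRNGKey(0)     # reusing one key is fine for a demo
w = jax.random.normal(key, (3,))
x = jax.random.normal(key, (10, 3))
y = jnp.zeros(10)

print(fast_grad(w, x, y))       # just values out; no .backward(), no accumulated .grad
```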
>I also like that jax.jit forces you to write "functional" functions free of side effects or inplace array updates. It might feel weird at first (and not every algorithm is suited for this style) but ultimately it leads to clearer and faster code.
It's not weird. It's actually the most natural way of doing things for me. You just write down your math equations as JAX and you're done.
> You just write down your math equations as JAX and you're done.
It's natural when your basic unit is a whole vector (tensor), manipulated by some linear algebra expression. It's less natural if your basic unit is an element of a vector.
If you're solving sudoku, for example, the obvious 'update' is in-place.
In-place updates are also often the right answer for performance reasons, such as writing the output of a .map() operation directly to the destination tensor. Jax leans heavily on compile-time optimizations to turn the mathematically-nice code into computer-nice code, so the delta between eager-Jax and compiled-Jax is much larger than the delta between eager-Pytorch and compiled-Pytorch.
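For completeness, JAX's answer to the in-place update is the functional .at[...].set(...) form, which the compiler can often turn back into a real in-place write under jit. A toy sketch reusing the sudoku example from above (indices and shapes made up):

```python
import jax
import jax.numpy as jnp

@jax.jit
def place_digit(board, row, col, digit):
    # Functionally "pure": returns a new board. Under jit, XLA is often able
    # to lower this to an in-place buffer update anyway.
    return board.at[row, col].set(digit)

board = jnp.zeros((9, 9), dtype=jnp.int32)
board = place_digit(board, 0, 3, 7)
print(board[0, 3])  # 7
```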
Not OP. I have production/scale experience in PyTorch and toy/hobby experience in JAX. I wish I had the time or liberty to use JAX more. It consists of a small, orthogonal set of ideas that combine like Lego blocks. I can attempt to reason from first principles about performance. The documentation is super readable and strives to make you understand things.
JAX seems well engineered. One would argue so was TensorFlow. But the ideas behind JAX were built outside Google (Autograd), so it has struck the right balance of being close to idiomatic Python/NumPy.
PyTorch is where the tailwinds are, though. It is a wildly successful project which has acquired a ton of code over the years. So it is a little harder to figure out how something works (say torch.compile) from first principles.
JAX seems great but the Google ghost is a lot to stomach. The risk of JAX getting axed or replaced with a JAX 2.0 - completely incompatible with existing JAX code - is not insignificant.
For anyone that’s curious, the underlying Torch library is also a joy to work with, as are the many other torch bindings. For example, Rust has tch and Burn which both work with libtorch.
PyTorch of course has the benefit of being dynamically debuggable. Can't forget the first time I breakpointed my PyTorch model and wrote PyTorch calls inside the terminal to inspect the behavior. That's still something I miss a lot now that I'm working with only "fast" compiled code.
> Every major AI company and hardware vendor are on a speed dial. This kind of power is really hard to give up. But curiosity ultimately won out in my head.
A simple feeling has such a power.
May he get an opportunity to create one more powerful tool before retiring.
The second I stop being curious I stop finding new and exciting things to do, and I stop feeling fulfillment. It’s one of the biggest signs saying “it’s time to move on”.
I feel so strongly for the people who can't afford the luxury. I've been there: unfulfilling jobs for years because of bills or résumé building.
> It's taught in classrooms from MIT to rural India. The tools I dreamed about making accessible? They are. The barrier to entry I wanted to lower? It's almost gone.
I have an ironic sense that there are classrooms in rural India with better pedagogy and lower barriers to entry than some of our elite engineering programs.
Many elite engineering programs in the United States (I don't know if this is what you mean by "our") are elite solely due to social status (ranking publications need to produce rankings that feel right, or they're ignored, and some accept bribes to rank specific programs) and research output, with little to do with quality of pedagogy. Instead, pedagogy is generally poor because the elite researchers usually view teaching as a chore, and many don't have any real skill in it either.
I thought traditional Indian pedagogy was heavily criticized for being heavily based on rote memorization over conceptual understanding or problem solving. And also being heavily hierarchical and exam-oriented.
This isn't to say engineering programs in the US can't be improved, but there seems to be widespread consensus that they don't suffer from the kinds of serious problems that ones in India commonly do.
All I can say is 'thanks!'. It does take a team, and a community, but individuals matter. I use pytorch daily, it has made it possible for me to play with ideas that I would have only dreamed of. It is a big part of my life so, thanks for your contributions and best of luck on the next thing!
What an amazing impact this man has had. He seems like a great guy too. I don't know him at all personally but his replies online have always been so nice. I really wish him all the best with whatever he does with his time next
If someone made $2 million/year over 10 years, after taxes that's about $1 million/year (NYC has local taxes too). Let's say all of it was saved and invested in the S&P 500 or Meta.
SP500: tripled over 10 years i.e. ~12% a year. Reinvesting dividends gives ~14% a year
Meta: 8x over 10 years i.e. ~23% a year.
If growth was uniform over the 10 years and compensation/savings were uniform as well, the total portfolio would be:
((1+r)^11 - 1)/r (a geometric series, since each year's contribution compounds for a different amount of time:
1 (this year) + (1+r) (previous year) + (1+r)^2 (the year before that), and so on)
SP500: 14% -> $23M
Meta: 23% -> $38M
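The same back-of-the-envelope calculation in code, with the assumptions spelled out (a uniform $1M/yr saved and a constant return, both simplifications):

```python
def future_value(annual_savings, r, years):
    # Each year's contribution compounds for a different number of years:
    # this year's grows by (1+r)^0, last year's by (1+r)^1, ...,
    # the first year's by (1+r)^(years-1). Summing gives ((1+r)^years - 1)/r.
    return sum(annual_savings * (1 + r) ** k for k in range(years))

print(round(future_value(1.0, 0.14, 11), 1))  # ~23.0 ($M), the S&P 500 case
print(round(future_value(1.0, 0.23, 11), 1))  # ~38.0 ($M), the Meta case
```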
Now, it's entirely possible, the compensation for a position like this runs into $10s of millions and one can easily account for non-uniform compensation.
Even in NYC, actually even in Manhattan, $10M is more than comfortable for retirement. It lets you draw $300-400K a year (a 3-4% withdrawal rate, adjusted for inflation annually). If one is taking a short sabbatical, then it's a no-brainer.
This seems to assume unusual optimism or foresight, most people don’t invest their life savings 100% into stocks and don’t hold on to 100% of their company vests through ups and downs. You might as well say “assuming he put all his money in NVDA…”
It's a back-of-the-envelope calculation, not a precise one. It doesn't take foresight to invest in the S&P 500. DCAing (dollar-cost averaging) into an index fund is actually the recommended savings strategy, with a short-term buffer of 6 months to 2 years of cash savings depending on your plans (sabbatical etc.), especially when one is decades away from retirement age.
I only included meta because he works/worked at meta and it's not unusual for people to just leave their rsus in their accounts after they vested. I agree though that one shouldn't pick stocks that happened to explode (e.g. nvda).
There are several unrealistic assumptions I did make:
* Presumably when someone starts, they earn less than in recent years. He probably wasn't making huge amounts his first few years. Amounts invested in earlier years are smaller but have more time to compound and amounts invested in recent years are larger but have had less time to compound.
* Returns aren't constant.
* I pulled the $2 million/yr out of thin air. It could be $1 million/yr or even $10 million/yr. I have no idea what the overall head of a project like PyTorch would make.
* Everyone's expenses are different. In and around NYC, one can live on $80k/year or $120-150k/year, as well as on $1 million/yr. I assumed zero expenses since I wanted a nice even $1 million/yr of savings. Maybe it was $500k/yr of savings, in which case all the numbers should be halved.
In any case, I can't see how one wouldn't end up with at least $10 million in a position like this with 10 years at meta. Unless one buys a $5 million unit in Manhattan and is burdened by a high mortgage.
To me it sounds as if he is trying to open a new chapter in his life. Good for him, but I wonder if everything was really as smooth as described. People often write how everything is always perfect on their blog. Well - could be. But it could also be that not everything was perfect but no description followed on the blog.
You're asking a lot of questions but are you willing to think about it? For one, no it's not "everyone's style" because you wouldn't have asked whether it was, you'd know.
PyTorch is one of those tools that’s so simple and easy to take apart that you feel like you might’ve been able to make it yourself. I can’t imagine how much engineering effort was behind all those moments where I thought to myself, “of course it should work like that, how can it be any other way?”
The choice of the dynamic computation graph [1] of PyTorch made it easier to debug and implement, leading to higher adoption, even though running speed was initially slower (and therefore training cost higher).
Other decisions follow from this one.
TensorFlow started with static graphs and had to move to dynamic ones at version 2.0, which broke everything and left fragmentation between TensorFlow 1, TensorFlow 2, Keras, and JAX.
PyTorch's compilation of this computation graph erased TensorFlow's remaining edge.
Is the battle over? From a purely computational point of view, PyTorch's solution is very far from optimal, and billions of dollars of electricity and GPUs are burned every year, but major players are happy with circular deals to entrench their positions. So at the pace of current AI code development, probably one or two years before Pytorch is old history.
> at the pace of current AI code development, probably one or two years before Pytorch is old history.
Ehhh, I don’t know about that.
Sure, new AI techniques and new models are coming out pretty fast, but when I go to work with a new AI project, they’re often using a version of PyTorch or CUDA from when the project began a year or two ago. It’s been super annoying having to update projects to PyTorch 2.7.0 and CUDA 12.8 so I can run them on RTX 5000 series GPUs.
All this to say: If PyTorch was going to be replaced in a year or two, we’d know the name of its killer by now, and they’d be the talk of HN. Not to mention that at this point all of the PhDs flooding into AI startups wrote their grad work in PyTorch, it has a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at. I don’t even know what that would be.
Bear in mind that it took a few years for Tensorflow to die out due to lock in, and we all knew about PyTorch that whole time.
> a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at
The cost of migrating higher-level code to a newer framework is going to zero. You ask your favorite agent (or intern) to port and check that the migration is exact. We already see this across the multitude of deep-learning frameworks.
The day an optimization trick appears that PyTorch can't do but another framework can, one that reduces your training cost 10x, PyTorch is going the way of the dodo.
The day an architecture that can't be implemented in PyTorch gets superior performance, it's bye-bye Python.
We see this with architectures which require real-time rendering like Gaussian Splatting (Instant Nerf), or the caching strategies for LLM sequence generation.
PyTorch has 3 main selling points:
- Abstracting away the GPU (or device) specific code, which is needed because of Nvidia's mess of custom optimized kernels, which you are forced to adapt to if you don't want to write custom kernels yourself.
This matters less if you don't mind writing optimized kernels because the machine writes them, or if you don't need CUDA because you can't use Nvidia hardware (for example, you are in China), or if you use custom silicon, like Groq, and need your own kernels anyway.
- Automatic differentiation. It's one of its weak points, because they went for easy instead of optimal and shut themselves off from some architectures. A language like Julia, thanks to dynamic low-level compilation, can do things PyTorch won't even dream about (though Julia has its own problems, mainly around memory allocations). With PyTorch's introduction of the "scan" function [2], we have come full circle back to Theano, TensorFlow's/Keras's ancestor, where scan was usually the pain point of the kind of automatic-differentiation strategy PyTorch chose.
The optimal solution, as all physics PhDs who have written simulations know, is writing custom adjoint code via source code transformation or symbolically: it's not hard but very tedious, so it's now a great fit for your LLM (or intern, or PhD candidate running "student gradient descent"), provided you prove or check that your gradient calculation is correct.
- Cluster orchestration and serialization: a model can be shared with less security risk than arbitrary source code, because you only share weights. A model can be split between machines dynamically. But this is also a big weakness, because your code rusts as you become dependent on versioning; you are locked to the specific version number your model was trained on.
There are two types of stops: soft stops and hard stops.
- A soft stop is when the dynamic-graph computation overhead is too much. You can still calculate, but if you were to write the function manually or with a better framework, you could be 10x faster.
Typical examples involve manually unrolling a loop, or doing kernel fusion. Another typical example is when you have lots of small objects or need to do loops in Python because the problem doesn't vectorize well. Or exploiting sparsity by ignoring the zeros.
- A hard stop is when computing the function becomes impossible, because the memory needed to do the computation in a non-optimal way explodes. Sometimes you can get away with just writing customized kernels.
The typical example where you can get away with it is custom attention layers.
Typical examples where you can't get away with it are physics simulations. For example, the force is the gradient of the energy, but you have n^2 interactions between the particles, so if you preserve anything more than zero memory per interaction during the forward pass, your memory consumption explodes. And with things like Lagrangian or Hamiltonian neural networks, where you try to discover the dynamics of an energy-conserving system, you need to be able to differentiate at least three times in a row (see the sketch after this list).
There are also energy-expending stops, where you need to find workarounds to make things work - for example if you want your parameters to change shape during the optimization process, like learning point clouds of growing size - and they spread you thin, so they won't be standardized.
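On the "differentiate three times in a row" point: the PyTorch mechanism for that is create_graph=True, which keeps each backward pass itself differentiable. A toy sketch with a made-up 1-D energy, not a real simulation:

```python
import torch

def energy(q):
    return (q ** 4).sum()  # toy potential; stand-in for a learned Hamiltonian

q = torch.randn(5, requires_grad=True)

# Force = -dE/dq, computed with a graph so we can differentiate again.
force = -torch.autograd.grad(energy(q), q, create_graph=True)[0]

# Differentiate a second and a third time, as higher-order losses require.
d2 = torch.autograd.grad(force.sum(), q, create_graph=True)[0]
d3 = torch.autograd.grad(d2.sum(), q)[0]
print(d3)  # equals -24*q for this toy energy
```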
I don't know the full list, but back when it came out, TF felt like a crude set of bindings to the underlying c++/CUDA workhorse. PyTorch felt, in contrast, pythonic. It was much closer in feeling to numpy.
I think it was mostly the eager evaluation that made it possible to debug every step in the network forward/backward passes.
Tensorflow didn't have that at the time which made debugging practically impossible.
I'm not sure if such an overview exists, but when Caffe2 was still a thing and JAX was a big contender, dynamic vs static computational graphs seemed to be a major focus point for people ranking the frameworks.
I read one post on his blog and found that Adam Paszke reached out to the author and got an internship. I wonder if it was that easy to get an internship at FAIR. I thought they hired only PhDs.
I was pretty involved in the PyTorch ecosystem in the early days around 2016 and Adam was nothing short of a genius and prolific developer whose contributions to the codebase and community were immense. I think he was like an undergrad in Poland at the time. My understanding is that his contributions came before the internship, but I don’t know.
My memory is that Soumith was really open to other people's contributions and questions, no matter their credentials. He was a great leader who felt approachable to the open-source community.
I didn't know that. Soumith Chintala certainly paid it forward. He was very helpful and supportive of random strangers (like me!) in the early pytorch days. I count him with Andrej Karpathy and Chris Olah as one of the people who really made machine learning accessible to regular software engineers.
Given the technical achievements and industry impact, I'm struck that his emphasis is on the people, and in particular growing people to take on challenges. His influence might lie as much in what they do. The connection linking them seems to be building curiosity and wonder into delight for users.
If there's a soul to silicon valley, that's it. However many jerks and political/power players I encounter, I remain inspired by the few like this.
It is notable (but perhaps not surprising) that this is mostly about the people and the work itself. The writing is silent on the downstream impacts on the world. In contrast, there are fields (global development, medicine, etc.) where people tend to focus on the impact on humanity (especially when reaching a milestone in their career).
If you take advice from reformed Internet trolls, consider turning off all your devices and trying to give yourself at least a week, but ideally a month offline staring at your new baby. You'll never get that time back and there's nothing your brain will appreciate more than loading up those memories as they grow.
Ironically, one HN front-page item today is this: "Meta projected 10% of 2024 revenue came from scams and banned goods, Reuters reports"
Glad you're leaving, hopefully you're in a good place financially. Take a page from Bill Gates and work on something that attempts to improve society. Stay away from surveillance capitalism and enshittification.
Many of the comments here are judging PyTorch in hindsight, which is unfair.
When Soumith Chintala co-launched PyTorch, and for many years after, the alternatives for fast, interactive, convenient development were much worse. There was no Jax.
Every single AI researcher I know, including me, who tried PyTorch back then immediately wanted to switch to it, because it was so much better. Andrej Karpathy described what PyTorch felt like back then when he tweeted, on May 2017, "I've been using PyTorch a few months now and I've never felt better. I have more energy. My skin is clearer. My eyesight has improved."[a]
THANK YOU SOUMITH for your hard work over all these years! Your hard work has made a difference for a huge number of people, including many of us here on HN.
We wish you success in your future endeavors, whatever they turn out to be!
Please ignore all the petty criticism.
---
[a] https://x.com/karpathy/status/868178954032513024
There was Chainer, which originated the define-by-run model that characterized PyTorch’s effectiveness. It was developed by a much smaller, much less influential company in Japan. Early PyTorch is transparent about the debt owed to Chainer.
Thanks. Yes, I remember Chainer, but only vaguely. I kinda remember looking at it, but not actually using it.
My recollection is that when I looked at Chainer back then, it didn't offer a comprehensive library of preexisting components for deep learning. When I tried PyTorch, on the other hand, I vividly remember it as already having lots of prebuilt components (common layers, activation functions, etc.) in `torch.nn`, so it was easier and faster to get going.
These memories are vague, so I could be wrong.
Yes, exactly—not many people know about Chainer nowadays. Back in 2016, PyTorch's interface was actually inferior to Chainer's, and I think Chainer's design was really ahead of its time.
The company is called Preferred Networks, and they're still around, and have some branched-off subsidiaries too.
PyTorch was partly inspired by the python Autograd library (circa 2015 [1]) to the point where they called their autodiff [2] system "autograd" [3]. Jax is the direct successor of the Autograd library and several of the Autograd developers work on Jax to this day. Of course, for that matter, PyTorch author Adam Paszke is currently on the JAX team and seems to work on JAX and Dex these days.
[1] https://pypi.org/project/autograd/#history
[2] https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/read...
[3] https://web.archive.org/web/20170422051747/http://pytorch.or...
Yes, PyTorch borrowed from Autograd, Chainer, etc.
...but PyTorch felt friendlier and more Pythonic, and it came with a comprehensive library of prebuilt components for deep learning in `torch.nn`.
See https://news.ycombinator.com/item?id=45848768
I'm one of the many people who Soumith hired to Meta and PyTorch. I had the privilege of working on PyTorch with him and lots of the folks on this post.
As his longtime colleague, the one thing I would want people to know about him and this decision is that Soumith has always viewed PyTorch as a community project. He consistently celebrated the contributions of his co-creators Adam and Sam, and he extended the same view towards the Yangqing and the Caffe2 crew that we merged into PyTorch. At the very beginning, by Soumith's highly intentional design, PyTorch was aimed at being truly developed by and for the AI research community and for many years that was the key way in which we grew the framework, FB PT team, and the wider community. At every single stage of PT's lifecycle, he always ensured that our conception of PT and its community grew to include and celebrate the new people and organizations growing what was possible with PT. He's an incredible talent magnet, and thus more and more smart people kept dedicating their blood, sweat, and tears to making PT bigger and better for more people.
I've worked with some very well known and highly compensated leaders in tech, but *no one* has done the job he has done with ameliorating a bus factor problem with his baby. PT has a unique level of broad support that few other open source technology can reach. In a world of unbounded AI salaries, people who want to move AI research methods forward still freely give their time and attention to PyTorch and its ecosystem. It's the great lever of this era of AI that is moving the world, *due in large part* to the strength of the community he fostered and can now let continue without his direct involvement.
His departure is the end of an era, but it's also operationally a true non-event. PyTorch is going strong and can afford to let one of its creators retire from stewardship. This is precisely what success looks like in open source software.
He deserves our congratulations and our thanks. Enjoy your PT retirement, man.
Also worked with Soumith. The man is a legend, moves mountains and completely changed the course of my career because he liked something I wrote. No arrogance, no politics, just an extremely down to earth and chill guy who elevates everyone around him.
Wishing him the best!
What did you write?
What I find most interesting about this is that it shows they believe there is nothing unique at Meta related to AI. There is no resource, whether people or computing power, that they can't get elsewhere for whatever they believe would be more interesting for them.
I mention this because it feels analogous to military research, where people "dream" of how advanced the military is, how far ahead it is of public research... and yet, it seems to be a recurring myth they love to sustain.
So the signal I get here is that AI "labs" in Big Tech have nothing worth waiting for around the corner; it's just more of the same, and boring for the people who stick around.
I think you might be reading a bit too much into this.
He’s been with Meta for 11 years and is likely in a very comfortable financial position, given the substantial stock options he’s received over that time.
He also mentioned the arrival of a new child, and it’s well known that Meta's work-life balance isn’t always ideal.
On top of that, Meta, like many major tech companies, has been shifting its focus toward LLM-based AI, moving away from more traditional PyTorch use cases.
Considering all of this, it seems like a natural time for him to move on and pursue new, more exciting opportunities.
> On top of that, Meta, like many major tech companies, has been shifting its focus toward LLM-based AI, moving away from more traditional PyTorch use cases.
This is very wrong. Meta is on the forefront of recommendation algorithms and that's all done with traditional ML models made using PyTorch.
Meta is definitely at the forefront of recommendation algorithms. However, the leadership team has likely shifted its focus to LLMs.
Some recommendations are uncanny, except that I don't want any of them in my Facebook news feed and no matter how often I select "never show me this feed again," it keeps trying.
> toward LLM-based AI, moving away from more traditional PyTorch use cases
Wait, are LLMs not built with PyTorch?
GP is likely saying that “building with AI” these days is mostly prompting pretrained models rather than training your own (using PyTorch).
Everyone is fine-tuning constantly, though. Training an entire model in excess of a few billion parameters is pretty much on nobody's personal radar; you have a handful of well-funded groups using PyTorch to do that. The masses are still using PyTorch, just on small training jobs.
Building AI, and building with AI.
Fine-tuning is great for known, concrete use cases where you have the data in hand already, but how much of the industry does that actually cover? Managers have hated those use cases since the beginning of the deep learning era — huge upfront cost for data collection, high latency cycles for training and validation, slow reaction speed to new requirements and conditions.
llama.cpp and Candle are a lot more modern for these things than PyTorch/libtorch, though libtorch is still the de-facto standard.
That's wrong. llama.cpp and Candle don't offer anything (design-wise) that PyTorch cannot do. What they offer is a smaller deployment footprint.
What's modern about LLMs is the training infrastructure and the single-coordinator pattern, which PyTorch has only just started on and which is still inferior to many internal implementations: https://pytorch.org/blog/integration-idea-monarch/
Pytorch is still pretty dominant in cloud hosting. I’m not aware of anyone not using it (usually by way of vLLM or similar). It’s also completely dominant for training. I’m not aware of anyone using anything else.
It’s not dominant in terms of self-hosted where llama.cpp wins but there’s also not really that much self-hosting going on (at least compared with the amount of requests that hosted models are serving)
About the military, from my limited experience, they are significantly behind the civilian state of the art, except for technology that has few applications outside of the military, like stealth.
In fact everything secret tends to be behind. Secrecy is a huge burden, and seriously limits all forms of collaboration.
In addition, because military projects are often big and highly politicized, you get all the inefficiencies that go with that. Classification is also convenient for hiding screwups and corruption.
I spent a dozen years as a US defense contractor across a broad spectrum of places (from R&D for the future to working with E3's today), and worked at internet scale and start-up B2B stuff in the other dozen years of my working career.
I think that the major difference about deployed military technologies- in contrast to both military R&D and the entire commercial side- is that they are, by and large, incredibly rock solid and reliable. If they aren't, they don't actually get used. It takes a lot of effort to get them that way. I remember once at a testing ground for our robot tanks of the far future, right next door was an outdoor test-track. And they were testing a kitchen trailer (a kitchen for ~200 men that can be towed by a Humvee). And they drove it around the track continuously for three weeks, stopping only long enough to change drivers/vehicles, except for the four times a day when they would halt, make meals for 200 people, and then pack up and get back to driving. This was one of several reliability tests that the kitchen trailer had to get through before it was accepted for service.
Our R&D stuff couldn't handle that (it needed 3-4 engineers to carefully monitor it at all times), but the stuff that needed to be in the hands of some random 18 year old with a two week training course had to be rock solid to use, do regular maintenance on, and fix, even when they were only getting four hours of sleep a night. If it wasn't up to that level, then the troops ended up ignoring it, leaving it behind when they went out to do their job. And by and large, from what I could tell, most of the stuff they had was that reliable. There were some cool things that we were doing in the R&D space, but we were a long way from that level.
One thing I meant to add: this extensive testing- and the enormous amount of documentation/training materials necessary to take an 18 year old with average ASVAB scores and produce someone who can cook meals for 200 other soldiers on four hours of sleep a night- is both why military things cost so much, relative to commercial grade stuff, and why they don't get updated particularly often. Frequent software updates that change menus around play havoc with the detailed training powerpoints that the military relies on to produce those 18 year old tool operators.
Secret Squirrel projects (which I was near but never read into) can get away with lower reliability because they can count on the users to be much better trained and prepared, though again, from my brief encounters with these sorts, they will ignore anything they don't trust to be completely reliable. Reliability matters far more than cutting edge for like 99.9% of military gear.
I just assume all government software is poorly written by huge consulting companies, like the famous FBI one https://en.wikipedia.org/wiki/Virtual_Case_File
> a 318-page report [...] said the SAIC software was incomplete, inadequate and so poorly designed that it would be essentially unusable under real-world conditions. Even in rudimentary tests, the system did not comply with basic requirements
I figured the reason Palantir was so successful was because it was a SV software company instead of a defense contractor dabbling in IT or specialized government consultancy.
Post Cold War, most militaries shifted to COTS and less boutique development. Turns out, you only need to put resources in a few places to stay ahead (stealth, sensing and measuring, space, hypersonics, drones, etc).
It's MUCH cheaper and quicker.
The military doesn't have the luxury of things being unreliable. It puts a pressure on them that corporations don't necessarily have: they'd rather have a less-effective but proven system than a potentially-more-effective but riskier system (especially since each system they have comes with massive logistics support).
Ironically, corporations can afford to take more risks of failure (financially and project-wise) than militaries because failure for them doesn't mean actual human death (and when it can, you see processes come in that look a lot more like military processes).
It's actually the commercial/consumer side that gets more reliability than the military side.
The military should have very reliable systems, and they often know the point at which their systems will fail (MTBF calculations are easier to develop with their record keeping). However, the military also has an almost unlimited budget and body count to keep just reliable enough things working much better than they should. It's also really bad about actually competing companies against each other.
The commercial sector, targeting consumers, is where you actually get reliable systems. Why? Because consumers will go towards either the cheapest option (reliability is replaced with ubiquity in the market, it's replaceable) or the more reliable but more expensive options. They (individuals) don't have an unlimited budget or unlimited time to maintain everything in their life. There's competition in the commercial world that's completely absent in the military world
The two major exceptions are where COTS products have taken over (definitionally, DOD is using commercial, often consumer-targeted, products instead of military specific products) and special forces. Special forces often bypasses normal acquisitions processes and so ends up having a better chance to compete vendors against each other than other parts of the military.
This doesn't mean everything the DOD procures through normal acquisitions is inherently unreliable, but reliability is only one of many factors and often only really discovered after selection and full-rate production has started. By that point, the DOD is committed to it for years to come. Each DOD procurement is separate enough from others that you don't even get huge opportunities for reuse. The F-35, to pick something from this century, didn't get components that were shared with other aircraft in the DOD fleet. It's almost all new, which means a lot of things were learned about its reliability after it started flying. It has new comms, new radar, new almost everything. Even the engine (though that probably used many subcomponents shared with other engines) was a new engine just used by the F-35.
> What I find most interesting with this is that it shows they believe there is nothing unique at Meta related to AI
Whether or not this is the case, I don't get this as being the reason for Sousmith leaving - it sounds as if he is just ready for a change.
Still, it is noticeable that with many of the AI companies claiming that their version of "AGI" is just around the corner, developers and staff don't appear to be particularly excited about this (I assume they realize it is just hype, not some momentous advance around the corner), and leave to pursue different things, such as Mira Murati starting a fine-tuning company, Karpathy going back to education, others switching ship (typically from OpenAI to Anthropic), etc.
"Ready for change" is just the polite way to say, "I can't stand it here anymore. I'd rather roll the dice on a new place because reversion-to-mean means it's probably going to be better than whatever this has become."
There are a lot of things I don't like about my current job, but not enough for it to make sense to gamble on a new place. It's easier to push for change from my current position than to count on any new place being any better.
But if it gets worse and I do leave, I'll definitely be telling the interviewer, "I was just ready for a change."
On the other hand, while I know nothing about Soumith, he clearly has enough financial runway (see my calc below) to not have to work again.
As far as I know, we all get one life. If one can help it (modulo other constraints), one should not get trapped by prestige, achievement, short-term admiration by others, impact and external facing factors. To see an alternate reality, it helps to escape the bubble, for example, by spending time in a completely different culture or environment where no one knows or cares about what one did.
I admire people taking such decisions. It's easy to be on autopilot in life. People who wear their success lightly are rare, but they tend to be more philosophically aware, in my opinion at least. I wish him good luck!
> see an alternate reality, it helps to escape the bubble, for example, by spending time in a completely different culture
I'm in a similar position now and need to make a decision. The problem is that after leaving the IT world for a while, it will be hard to get back in. I'll have to change my life completely and discard all the knowledge and expertise I have. That will be fun, interesting, eye-opening, etc., but there's no way back.
I don't know you, don't know your situation, but this does not seem to match the experiences of many of my friends who left for a while and then came back. "Spent two years starting a restaurant" and "had to take care of my parents" were not blockers for getting another computer related job in due time. There are few truly irrevocable decisions in our life.
Now, the current job market makes this significantly harder than it was in the 2010's, but that's floating over all of us- if your company does an Amazon tomorrow, would you get a job as nice as you currently have? Maybe, maybe not.
In executive roles, your expertise really is in management acumen a lot of the time. But as an individual contributor--or adjacent--once you're out of a technical space for a few years, it's increasingly hard to get back in even if you've casually kept a finger in.
Exactly, the only way to stay current is to keep doing something at least half-time. The good thing is it doesn't have to be the same as your previous job. Just keep your brain working and learning.
I think age plays an important part in the decision to move away from a place. I think in your 20s or very early 30s you have far more leeway to kind of go away and start again, but a lot of the hope to actually be able to find that unicorn workplace fades away as you approach your late 30s. Once into your 40s, depending on your trade, you're dead on arrival unless you successfully manage to rebrand yourself as a consultant, whatever the fuck that means.
Age does factor in various ways. It can be "it's now or never" or it may be "I might as well hold on for a few" or something in between.
> is just the polite way to say
Can be*, that's not necessarily always true. I've quit jobs plenty of times without having any plan for the future or particular drama-reason for leaving, just "It's not as fun here anymore, despite this being a great place to work", I'm sure I'm not the only one who does so.
What I've never done, though, is leave a place without being 100% honest about exactly why I'm leaving. I won't say "I was just ready for a change" if that wasn't the reason; I have no reason not to be honest about why I'm leaving.
I've generally had 10+ year tenures other than a return to school that was basically always in my plan and dot-bomb (leaving a company I wasn't really a fit with anyway). But, yeah, I've always been ready to move on at about that ten year point which is actually fairly long by a lot of people's standards in the tech industry.
I do disagree though that, unless there's some actionable change that would specifically benefit you like more money, my answer outside of private conversations with people I know well is going to be some variant of "time for a change." Anything else just invites arguments and conversations I don't want to have.
I do want to push back on this a little. People leave all the time for this "I wanna see what else is out there" reason, especially at such senior levels and with as much financial security as he inevitably has from working at Meta for 11 years. It is not always a gamble for many of them, and many of them are not so skeptical and cynical about the other places they could go and bring their expertise to.
I don't think that's the read? Guy says he wants to work on something small. If you want to work on something big you probably want to be in a big corp to have the resources to do the big thing.
Also absolutely unknown if the "new thing" is AI-related at all!
> If you want to work on something big you probably want to be in a big corp to have the resources to do the big thing.
If anything, the reverse seems to be true, if you want to work on something big, you want to be in a small company, sufficiently funded, filled with great people, yet not "big", that's when "something big" seems to be more likely to happen.
In contrast, as far as I can tell, the bigger a company gets, the less likely it is to actually come up with "something big". It seems like most of the time you need (creative) constraints for the results to end up being actually innovative; otherwise you end up like IBM and Meta, throwing money at stuff and getting some results, but nothing really out of the ordinary considering what's happening elsewhere in their ecosystems.
Well, he left, so whatever is coming next, AI-related or not, "small" or not (small for him might be reaching just a million people; he wrote that he "lead the software layer that powers the entire AI industry", so his notion of scale is probably unlike mine, and maybe yours too), is more exciting to him than whatever he could do next with all of Meta's resources.
Edit: to be clear, I didn't mean to imply their next thing is AI related, solely that they obviously know more about AI at Meta than e.g. XR at Meta, just because that's their expertise.
Your assumption is a bad read because it only works if his set of life priorities contains nothing else but maximizing his impact in the world of AI.
If he has just one other priority in that set (which could still include a robotic min/max of AI impact), then your assumption fails.
It reads to me as if he was the victim of office politics and decided to say "fuck it" instead of being transferred to something else within Meta.
Pretty crazy/bizarre that a VP/Fellow engineer would have such little say in what they do at Meta. In my mind, companies would do everything possible to retain them. They are a special and rare breed.
> It reads to me as if he was the victim of office politics and decided to say "fuck it" instead of being transferred to something else within Meta.
It looks like he'd already been transferred once (to Infra) and maybe didn't want to do it again.
> where people "dream" of how advanced the military is
If you've ever worked on "advanced military grade" equipment, you'd know better.
It tends to be what you'd euphemistically call "well-proven technology", built down to a price by the lowest bidder, by comparatively unskilled labour.
The most shocking thing about the "captured" Russian drones is they use name-brand Raspberry Pis inside. I'm prepared to bet the American versions use whatever AliExpress crap is on special this week. The UK stuff definitely does.
Isn't that exactly the point parent was trying to make? Maybe I misunderstood their comment, but it seems like you're repeating what they said.
Post cup-of-tea (not NATO-spec, just black thanks, tea with just tea in it) I realise you're correct.
You read it right, I think they agree. Maybe when I wrote "dream" in quotes the irony was lost.
I mean, these things do exist. There are always tons of big and small tech projects floating around in the special operations community. Cutting-edge sets of hybrid night/thermal vision. Classified helicopters. Hand-built rifles with custom cartridges. Classified medical tech. Advanced fixed wing aircraft with unique capabilities. Advanced dive gear. So on.
"Big Army" doesn't see that stuff for decades, if ever, and mostly never due to cost. And I'm not even getting into classified submarine and nuclear tech, fixed wing drones and aircraft flying at night out of known test facilities, etc.
There's tons of actually advanced tech out there in military circles.
Yeah. The DOD is enormous and definitely has your boring everyday stuff, but also tons of skunk-works R&D; it's just not very public. An organization that big has all kinds of nooks and crannies, so it isn't really that monolithic.
If you can afford to support yourself, which I'm sure he can, there's a serenity to working on small projects that are not in the public eye. It may simply be that he craves some quiet time that enables him to focus on his family and himself.
Negative; what you should have taken away is that it's the people. He mentions standing up clusters. Small shops can't afford clusters. Ignore the technical aspect of this article and read it for what it is: a thank-you note to the people he has worked with on amazing projects. Research in a bubble of one isn't very useful. Research in a small team with a Meta budget is extremely useful. With the right people.
I don't think that's what is being said.
Having friends who are at or near both FAIR and other AI parts of Meta, resources are not the issue, anymore at least (there had been a massive squeeze for the last two years though). But PyTorch and FAIR use(d) an AWS-based cluster. (PyTorch is used everywhere else inside Facebook though. Well, not everywhere...)
There is/are plenty of interesting things happening at big tech, and Meta specifically. If you like computer vision, then Meta is pretty much still the world leader. Much as it pains me to say it.
> I mention this because it feels analogous to military research, where people "dream" of how advanced the military is, how forward they are compared to public research... and yet, it seems to be a recurring myth they love to sustain.
I don't think that you can read this from the blog post at all, but it gives me a chuckle to think how the quest for AGI at Meta may be "The Men Who Stare at Goats" all over again.
I'm totally speculating. I have no extra information there.
It just makes me think of all the staff, technical staff, that left OpenAI recently. Altman was making grand claims about what was coming next.
Well, we know what followed; namely, I don't think any researcher who left knowing what was in the pipeline feels like they missed much in terms of access.
Just checked, BTW, and... the premise looks fun but the score is too low (https://www.rottentomatoes.com/m/men_who_stare_at_goats). Was it actually good as a movie, not just the idea behind it?
It's more the idea behind it. Considering the great cast, the movie could have been much better.
The non-fiction book behind it is probably better comparison than the film adaptation, if you think Meta are doing goat-staring (I don't think they're especially bad on this issue compared to their rivals).
Or he just takes for granted the resources he has.
Nothing unique? Totally disagree.
Unlimited compute resources aren’t literally unique but there are only a small handful of places in the world that have that.
Vast quantities of private data, especially text communications and images. Very few places have that. Coupled with a culture that puts zero privacy protections on that data. Even Google likes to think they’re doing the right thing, so I think that makes Meta unique.
That man has an infectious enthusiasm. I remember the DCGAN paper inspired me to try getting the (Lua) Torch code to work, and I tried it on the Oxford flowers dataset early on. It worked surprisingly well, and Soumith Chintala even shared it around on social media, surprised at how well it worked on such a small dataset. Of course back then we didn't really appreciate the problem of mode collapse.
Pytorch and old Lua Torch were a pleasure to work with compared to the contemporary Tensorflow. Lots of S.C's code was copied around liberally, it had its quirks (I remember the DCGAN code had a pretty odd way of doing parameter passing) but it was also really easy to understand and made random people like me feel like we had suddenly stumbled onto something crazy powerful (which we had!). It was wonderfully hackable.
What is the Oxford flowers dataset ? Where is it available ?
https://www.robots.ox.ac.uk/~vgg/data/flowers/
“ What's next for me? Something small. Something new. Something I don't fully understand yet. Something uncomfortable. I could have moved to something else inside Meta. But I needed to know what's out there. I needed to do something small again. I couldn't live with the counterfactual regret of never trying something outside Meta.”
Shades of Siddhartha. Back to the forest.
Ah yes, shades of Siddhartha. I almost forgot about the part where he worked for a megacorp that was ripping society’s social fabric apart and wanted to do something else for a while.
I don't think he was involved in that though.
He was. He didn't just parachute into Meta to start working on PyTorch. He worked in many areas of the product, was a member of the senior technical staff, and was knowledgeable about many aspects of the company.
As a loyal JAX user, I hope they can play catchup. PyTorch has dominated the AI scene since TF1 fumbled the ball at 10th yard line. What Matt Johnson has done turning Autograd into JAX is hopefully going to be worthy of as much praise as what Soumith has received.
> PyTorch has dominated the AI scene since TF1 fumbled the ball at 10th yard line
can you explain why you think TensorFlow fumbled?
I see good answers already, but here's a concrete example:
In my University we had to decide between both libraries so, as a test, we decided to write a language model from scratch. The first minor problem with TF was that (if memory serves me right) you were supposed to declare your network "backwards" - instead of saying "A -> B -> C" you had to declare "C(B(A))". The major problem, however, was that there was no way to add debug messages - either your network worked or it didn't. To make matters worse, the "official" TF tutorial on how to write a Seq2Seq model didn't compile because the library had changed but the bug reports for that were met for years with "we are changing the API so we'll fix the example once we're done".
PyTorch, by comparison, had the advantage of a Python-based interface - you simply defined classes like you always did (including debug statements!), connected them as variables, and that was that. So when I and my beginner colleagues had to decide which library to pick, "the one that's not a nightmare to debug" sounded much better than "the one that's more efficient if you have several billions training datapoints and a cluster". Me and my colleagues then went on to become professionals, and we all brought PyTorch with us.
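To make the contrast concrete, here is a minimal sketch of the PyTorch side of that workflow: an ordinary Python class with plain print statements in the forward pass. The model, sizes, and data below are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy sequence model; the sizes here are arbitrary, for illustration only."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        print("after embedding:", x.shape)   # plain print debugging, runs eagerly
        x, _ = self.rnn(x)
        print("after GRU:", x.shape)
        return self.out(x)

tokens = torch.randint(0, 100, (4, 16))      # batch of 4 sequences, length 16
logits = TinyLM()(tokens)
print(logits.shape)                          # torch.Size([4, 16, 100])
```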
The inability to use print debug to tell me the dimensions of my hidden states was 100% why TF was hard for me to use as a greenhorn MSc student.
Another consequence of this was that PyTorch let you use regular old Python for logic flow.
This was also my experience. TensorFlow's model of constructing then evaluating a computation graph felt at odds with Python's principles. It made it extremely difficult to debug because you couldn't print tensors easily! It didn't feel like Python at all.
Also the API changed constantly so examples from docs or open source repos wouldn't work.
They also had that weird thing about all tensors having a unique global name. I remember I tried to evaluate a DQN network twice in the same script and it errored because of that.
It's somewhat vindicating to see many people in this thread shared my frustrations. Considering the impact of these technologies I think a documentary about why TensorFlow failed and PyTorch took off would be a great watch.
In 2018, I co-wrote a blog post with the inflammatory title “Don’t use TensorFlow, try PyTorch instead” (https://news.ycombinator.com/item?id=17415321). As it gained traction here, it was changed to “Keras vs PyTorch” (some edgy things that work for a private blog are not good for a corporate one). Yet the initial title stuck, and you can see it resonated well with the crowd.
TensorFlow (while a huge step on top of Theano) had issues with a strange API, mixing needlessly complex parts (even for the simplest layers) with magic-box-like optimization.
There was Keras, which I liked and used before it was cool (when it still supported the Theano backend), and it was the right decision for TF to incorporate it as the default API. But it was 1–2 years too late.
At the same time, I initially looked at PyTorch as some intern’s summer project porting from Lua to Python. I expected an imitation of the original Torch. Yet the more it developed, the better it was, with (at least to my mind) the perfect level of abstraction. On the one hand, you can easily add two tensors, as if it were NumPy (and print its values in Python, which was impossible with TF at that time). On the other hand, you can wrap anything (from just a simple operation to a huge network) in an nn.Module. So it offered this natural hierarchical approach to deep learning. It offered building blocks that can be easily created, composed, debugged, and reused. It offered a natural way of picking the abstraction level you want to work with, so it worked well for industry and experimentation with novel architectures.
So, while in 2016–2017 I was using Keras as the go-to for deep learning (https://p.migdal.pl/blog/2017/04/teaching-deep-learning/), in 2018 I saw the light of PyTorch and didn’t feel a need to look back. In 2019, even for the intro, I used PyTorch (https://github.com/stared/thinking-in-tensors-writing-in-pyt...).
Actually, I opened “Teaching deep learning” and smiled as I saw how it evolved:
> There is a handful of popular deep learning libraries, including TensorFlow, Theano, Torch and Caffe. Each of them has Python interface (now also for Torch: PyTorch)
> [...]
> EDIT (July 2017): If you want a low-level framework, PyTorch may be the best way to start. It combines relatively brief and readable code (almost like Keras) but at the same time gives low-level access to all features (actually, more than TensorFlow).
> EDIT (June 2018): In Keras or PyTorch as your first deep learning framework I discuss pros and cons of starting learning deep learning with each of them.
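As a small, hedged illustration of the "pick your abstraction level" point a few paragraphs up: NumPy-like tensor arithmetic at the bottom, a one-line operation wrapped in an nn.Module, and modules composed into a bigger block. All names and sizes here are invented.

```python
import torch
import torch.nn as nn

# NumPy-like tensor arithmetic: inspectable at any point.
a, b = torch.randn(3), torch.randn(3)
print(a + b)

# Wrap a single operation in an nn.Module...
class Scale(nn.Module):
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
    def forward(self, x):
        return x * self.factor

# ...and compose modules hierarchically into a bigger one.
block = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), Scale(0.5), nn.Linear(16, 4))
print(block(torch.randn(2, 8)).shape)   # torch.Size([2, 4])
```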
The original TensorFlow had an API similar to the original Lua-based Torch (the predecessor to PyTorch) that required you to first build the network, node by node, then run it. PyTorch used a completely different, and much more convenient approach, where the network is built automatically for you just by running the forward pass code (and will then be used for the backward pass), using both provided node types and arbitrary NumPy compatible code. You're basically just writing differentiable code.
This new PyTorch approach was eventually supported by TensorFlow as well ("eager execution"), but the PyTorch approach was such a huge improvement that there had been an immediate shift by many developers from TF to PyTorch, and TF never seemed able to regain the momentum.
TF also suffered from having a confusing array of alternate user libraries built on top of the core framework, none of which had great documentation, while PyTorch had a more focused approach and fantastic online support from the developer team.
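A tiny sketch of what define-by-run means in practice: the loop count below depends on the data, and autograd simply records whichever path was actually taken. Purely illustrative; the function and shapes are made up.

```python
import torch

# Define-by-run: the graph is recorded as ordinary Python executes,
# so data-dependent control flow "just works".
def forward(x):
    y = x
    while y.norm() < 10:          # loop count depends on the data itself
        y = y * 2 + torch.sin(y)
    return y.sum()

x = torch.randn(5, requires_grad=True)
loss = forward(x)
loss.backward()                   # backward pass follows whatever path was taken
print(x.grad)
```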
LuaTorch was eager execution too. The problem with LuaTorch was the GC: you cannot rely on a traditional GC here, since each tensor was megabytes (at the time; now gigabytes) large, and you need to collect them aggressively rather than at intervals. Python's reference-counting system solves this issue (and by "collecting" I don't mean freeing the memory; PyTorch has a simple slab allocator to manage CUDA memory).
With Lua Torch the model execution was eager, but you still had to construct the model graph beforehand - it wasn't "define by run" like PyTorch.
Back in the day, having completed Andrew Ng's ML course, I then built my own C++ NN framework copying this graph-mode Lua Torch API. One of the nice things about explicitly building a graph was that my framework supported having the model generate a GraphViz DOT representation of itself so I could visualize it.
Ah, I get what you mean now. I was mixing up the nn module and the tensor execution bits (to be fair, the PyTorch nn module carries over many of these quirks!).
I'm no machine learning engineer but I dabbled professionally with both frameworks a few years ago and the developer experience didn't even compare. The main issue with TF was that you could only choose between a powerful but incomprehensible, poorly documented [1], ultra-verbose and ever-changing low-level API, and an abstraction layer (Keras) that was too high-level to be really useful.
Maybe TF has gotten better since but at the time it really felt like an internal tool that Google decided to just throw into the wild. By contrast PyTorch offered a more reasonable level of abstraction along with excellent API documentation and tutorials, so it's no wonder that machine learning engineers (who are generally more interested in the science of the model than the technical implementation) ended up favoring it.
[1] The worst part was that Google only hosted the docs for the latest version of TF, so if you were stuck on an older version (because, oh I don't know, you wanted a stable environment to serve models in production), well tough luck. That certainly didn't gain TF any favors.
For me it was about 8 years ago. Back then TF was already bloated and had two weaknesses: its bet on static compute graphs made writing code verbose and debugging difficult.
The few people I know back then used keras instead. I switched to PyTorch for my next project which was more "batteries included".
First, the migration to 2.0 in 2019 to add eager mode support was horribly painful. Then, starting around 2.7, backward compatibility kept being broken. Not being able to load previously trained models with a new version of the library is wildly painful.
I just remember TF1 being super hard to use as a beginner and Google repeatedly insisting it had to be that way. People talk about the layering API, but it's more than that, everything about it was covered with sharp edges.
Imagine a total newbie trying to fine-tune an image classifier, reusing some open source example code, about a decade ago.
If their folder of 10,000 labelled images contains one image that's a different size to the others, the training job will fail with an error about unexpected dimensions while concatenating.
But it won't be able to say the file's name, or that the problem is an input image of the wrong size. It'll just say it can't concatenate tensors of different sizes.
An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
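For what it's worth, a hedged sketch of the kind of pre-flight check that would have surfaced the bad file; it assumes Pillow is installed, and the folder name and file extension are placeholders.

```python
from pathlib import Path
from PIL import Image   # assumes Pillow is available

# Pre-flight check: report any image whose size differs from the first one,
# with its filename, before training ever starts. Directory name is made up.
def check_image_sizes(folder="data/labelled_images"):
    expected = None
    for path in sorted(Path(folder).glob("*.jpg")):
        size = Image.open(path).size
        if expected is None:
            expected = size
        elif size != expected:
            print(f"{path} is {size}, expected {expected}")

check_image_sizes()
```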
> An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
Even seasoned developers will bounce away from frameworks or libraries - no matter if old dogs or the next hot thing - if the documentation isn't up to speed or simple, common tasks require wading through dozens of pages of documentation.
Writing good documentation is hard enough, writing relevant "common usage examples" is even harder... but keeping them up to date and working is a rarely seen art.
And the greatest art of all of it is logging. Soooo many libraries refuse to implement detailed structured logging in internal classes (despite particularly Java and PHP offering very powerful mechanisms), making it much more difficult to troubleshoot problems in the field.
Greenfielding TF2.X and not maintaining 1.X compatibility
I personally believe TF1 was serving the needs of its core users. It provided a compilable compute graph with autodiff, and you got very efficient training and inference from it. There was a steep learning curve, but if you got past it, things worked very, very well. The distributed TF never really took off; it was buggy, and I think they made some wrong early bets in the design for performance reasons that should have been sacrificed in favor of simplicity.
I believe some years after the TF1 release, they realized the learning curve was too steep and they were losing users to PyTorch. I think the Cloud team was also attempting to sell customers on their amazing DL tech, which was falling flat. So they tried to keep the TF brand while totally changing the product under the hood by introducing imperative programming and gradient tapes. They killed TF1, upsetting those users, while not having a fully functioning TF2, all the while having plenty of documentation pointing to TF1 references that didn't work. Any new grad student made the simple choice of using a tool that was user-friendly and worked, which was PyTorch. And most old TF1 users hopped on the bandwagon.
I only remember 2015 TF and I was wondering: why would I use Python to assemble a computational graph when what I really want is to write code and then differentiate through it?
Do you have experience in both JAX and PyTorch? Why do you prefer JAX?
Not OP. I prefer JAX for non-AI tasks in scientific computing because of the different mental model than PyTorch. In JAX, you think about functions and gradients of functions. In PyTorch you think about tensors which accumulate a gradient while being manipulated through functions. JAX just suits my way of thinking much better.
I also like that jax.jit forces you to write "functional" functions free of side effects or inplace array updates. It might feel weird at first (and not every algorithm is suited for this style) but ultimately it leads to clearer and faster code.
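A minimal sketch of that mental-model difference, with made-up shapes: in JAX the gradient is itself a function, which can then be JIT-compiled, and there is no .backward() or stored gradient state.

```python
import jax
import jax.numpy as jnp

# In JAX you think in terms of functions and transformations of functions.
def loss(w, x, y):
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))     # gradient of the function, then JIT-compile it

w = jnp.zeros(3)
x = jnp.ones((8, 3))
y = jnp.ones(8)
print(grad_loss(w, x, y))               # gradient w.r.t. w; no .backward(), no mutated state
```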
I am surprised that JIT in PyTorch gets so little attention. Maybe it's less impactful for PyTorch's usual use case of large networks, as opposed to general scientific computing?
>I also like that jax.jit forces you to write "functional" functions free of side effects or inplace array updates. It might feel weird at first (and not every algorithm is suited for this style) but ultimately it leads to clearer and faster code.
It's not weird. It's actually the most natural way of doing things for me. You just write down your math equations as JAX and you're done.
> You just write down your math equations as JAX and you're done.
It's natural when your basic unit is a whole vector (tensor), manipulated by some linear algebra expression. It's less natural if your basic unit is an element of a vector.
If you're solving sudoku, for example, the obvious 'update' is in-place.
In-place updates are also often the right answer for performance reasons, such as writing the output of a .map() operation directly to the destination tensor. Jax leans heavily on compile-time optimizations to turn the mathematically-nice code into computer-nice code, so the delta between eager-Jax and compiled-Jax is much larger than the delta between eager-Pytorch and compiled-Pytorch.
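A small, hedged sketch of the two update styles; values and shapes are arbitrary.

```python
import jax.numpy as jnp
import torch

# Functional update in JAX: .at[...] returns a *new* array (under jit the
# compiler can often turn this back into an in-place write).
a = jnp.zeros(5)
b = a.at[2].set(7.0)
print(a, b)           # a is unchanged, b has the update

# In-place update in PyTorch: the obvious, mutable style.
t = torch.zeros(5)
t[2] = 7.0
print(t)
```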
Not OP. I have production/scale experience in PyTorch and toy/hobby experience in JAX. I wish I had the time or liberty to use JAX more. It consists of a small, orthogonal set of ideas that combine like Lego blocks. I can attempt to reason from first principles about performance. The documentation is super readable and strives to make you understand things.
JAX seems well engineered. One would argue so was TensorFlow. But the ideas behind JAX were built outside Google (Autograd), so it has struck the right balance of being close to idiomatic Python/NumPy.
PyTorch is where the tailwinds are, though. It is a wildly successful project which has acquired a ton of code over the years. So it is a little harder to figure out how something works (say, torch.compile) from first principles.
JAX seems great but the Google ghost is a lot to stomach. The risk of JAX getting axed or replaced with a JAX 2.0 - completely incompatible with existing JAX code - is not insignificant.
For anyone that’s curious, the underlying Torch library is also a joy to work with, as are the many other torch bindings. For example, Rust has tch and Burn which both work with libtorch.
PyTorch of course has the benefit of being dynamically debuggable. Can't forget the first time I breakpointed my PyTorch model and wrote PyTorch calls inside the terminal to inspect the behavior. That's still something I miss a lot now that I'm working with only "fast" compiled code.
I wrote some truly awful code back in the day because of that, but god it was glorious.
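A tiny, illustrative example of that workflow using Python's built-in breakpoint(); the module and shapes are made up.

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        h = self.fc(x)
        breakpoint()   # drops into pdb; at the prompt you can evaluate e.g.
                       #   h.shape, h.mean(), torch.relu(h)
        return h

Probe()(torch.randn(3, 4))
```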
>>Every major AI company and hardware vendor are on a speed dial. This kind of power is really hard to give up. But curiosity ultimately won out in my head.
A simple feeling has such power. May he get an opportunity to create one more powerful tool before retiring.
If the curiosity dies, the entire thing crumbles.
The second I stop being curious I stop finding new and exciting things to do, and I stop feeling fulfillment. It’s one of the biggest signs saying “it’s time to move on”.
I feel so strongly for the people who can't afford the luxury. I've been there: unfulfilling jobs for years because of bills or résumé building.
Gosh, given you’ve been there I have to ask what allowed you to get out of that and pursue only things that interest and excite you?
> It's taught in classrooms from MIT to rural India. The tools I dreamed about making accessible? They are. The barrier to entry I wanted to lower? It's almost gone.
I have an ironic sense that there are classrooms in rural India with better pedagogy and lower barriers to entry than some of our elite engineering programs.
Many elite engineering programs in the United States (I don't know if this is what you mean by "our") are elite solely due to social status (ranking publishers need to produce rankings that feel right, or they're ignored, and they accept bribes to rank specific programs) and research output, with little to do with quality of pedagogy. Instead, pedagogy is generally poor because the elite researchers usually view teaching as a chore, and many don't have any real skill in it either.
I thought traditional Indian pedagogy was heavily criticized for being heavily based on rote memorization over conceptual understanding or problem solving. And also being heavily hierarchical and exam-oriented.
This isn't to say engineering programs in the US can't be improved, but there seems to be widespread consensus that they don't suffer from the kinds of serious problems that ones in India commonly do.
All I can say is 'thanks!'. It does take a team, and a community, but individuals matter. I use pytorch daily, it has made it possible for me to play with ideas that I would have only dreamed of. It is a big part of my life so, thanks for your contributions and best of luck on the next thing!
What an amazing impact this man has had. He seems like a great guy too. I don't know him at all personally but his replies online have always been so nice. I really wish him all the best with whatever he does with his time next
I wonder how much this guy has earned from Meta in total. Would it reach $100M?
Considering Meta was trying to Poach AI talent for $250M, I wouldn’t be surprised if this guy has his own 8-figure income
If someone made $2 million/year over 10 years, after taxes, it would be $1 million (NYC has local taxes too). Let's say all of it was saved and invested in SP500 or Meta.
SP500: tripled over 10 years i.e. ~12% a year. Reinvesting dividends gives ~14% a year
Meta: 8x over 10 years i.e. ~23% a year.
If growth was uniform over 10 years and compensation/savings was uniform over 10 years, the total portfolio would be a geometric series, since each year's contributions grow for a different amount of time:
1 (this year) + (1+r) (previous year) + (1+r)^2 (the year before that) + ... = ((1+r)^11 - 1)/r
SP500: 14% -> $23M Meta: 23% -> $38M
Now, it's entirely possible, the compensation for a position like this runs into $10s of millions and one can easily account for non-uniform compensation.
Even in NYC, actually even in Manhattan, $10M is more than comfortable for retirement. It lets you draw $300-$400K (3-4% per year adjusted for inflation annually). If one is taking a short sabbatical, then it's a no-brainer.
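For the curious, a small sketch that just evaluates the geometric series above for a given return; the dollar figures in the comment are rough, and the result is sensitive to the exact assumptions (number and timing of contributions, rate of return).

```python
# Rough sketch of the compounding arithmetic above: equal yearly
# contributions, each growing at rate r for a different number of years.
def future_value(annual_savings, r, years):
    return sum(annual_savings * (1 + r) ** k for k in range(years + 1))

for label, r in [("S&P 500 ~14%", 0.14), ("Meta ~23%", 0.23)]:
    print(label, round(future_value(1.0, r, 10), 1), "x annual savings")
```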
This seems to assume unusual optimism or foresight, most people don’t invest their life savings 100% into stocks and don’t hold on to 100% of their company vests through ups and downs. You might as well say “assuming he put all his money in NVDA…”
It's a back-of-the-envelope calculation not a precise one. It doesn't take foresight to invest in SP500. DCAing (dollar-cost averaging) into an index fund is actually the recommended savings strategy with a short-term cash balance of 6 months-2 years of cash savings depending on your plans (sabbatical etc.), especially when one is decades away from retirement age.
I only included meta because he works/worked at meta and it's not unusual for people to just leave their rsus in their accounts after they vested. I agree though that one shouldn't pick stocks that happened to explode (e.g. nvda).
There are several unrealistic assumptions I did make:
* Presumably when someone starts, they earn less than in recent years. He probably wasn't making huge amounts his first few years. Amounts invested in earlier years are smaller but have more time to compound and amounts invested in recent years are larger but have had less time to compound.
* Returns aren't constant.
* I pulled the $2 million/yr out of thin air. It could be $1 million/yr or even $10 million/yr. I have no idea what the overall head of a project like PyTorch would make.
* Everyone's expenses are different. In and around NYC, one can live on $80k/year or $120-150k/year, as well as on $1 million/yr. I assumed zero since I wanted a nice even $1 million/yr of savings. Maybe it was $500k/yr of savings, in which case all the numbers should be halved.
In any case, I can't see how one wouldn't end up with at least $10 million in a position like this with 10 years at meta. Unless one buys a $5 million unit in Manhattan and is burdened by a high mortgage.
I just want to say: thank you. It is hard to overstate how much Pytorch has accelerated ML/AI development, across the board.
His homepage says he wants to build a robot. So he is probably going to work with robots for his next role.
He is an investor in Anthropic; I didn't know you could do that while working for Meta.
Could be Meta is quite liberal in this area. Or it could be one of those "For my friend, anything; for everyone else, it's corporate policy" situations.
To me it sounds as if he is trying to open a new chapter in his life. Good for him, but I wonder if everything was really as smooth as described. People often write as if everything is always perfect on their blogs. Well, could be. But it could also be that not everything was perfect and it simply didn't make it into the post.
This is the end of an era. Amazing work soumith.
Counterfactual Regret Minimization irl
Soumith's 2nd release? https://github.com/pytorch/pytorch/releases/tag/v0.1.1
Also, looking at the contribution history for a long career is very interesting; reflects the changing roles over time https://github.com/soumith
There's no context around 'FAIR' - is it https://www.go-fair.org/?
Facebook Artificial Intelligence Research, c.f. https://engineering.fb.com/category/ai-research/#:~:text=Art...
Now retconned as "Fundamental AI Research" since the Meta rebrand.
Interesting, is Yann Lecun still heading FAIR? From the outside looking in, it seems like he is getting sidelined
it's https://ai.meta.com/research/
Is this also partially AI generated? What's with the repeated short phrases? Is this just everyone's style now?
I felt it very human the writing on this post
You're asking a lot of questions but are you willing to think about it? For one, no it's not "everyone's style" because you wouldn't have asked whether it was, you'd know.
PyTorch is one of those tools that’s so simple and easy to take apart that you feel like you might’ve been able to make it yourself. I can’t imagine how much engineering effort was behind all those moments where I thought to myself, “of course it should work like that, how can it be any other way?”
Can anyone recommend a technical overview describing the design decisions PyTorch made that led it to win out?
The choice of the dynamic computation graph [1] of PyTorch made it easier to debug and implement, leading to higher adoption, even though running speed was initially slower (and therefore training cost higher).
Other decisions follow from this one.
TensorFlow started with static graphs and had to move to dynamic ones in version 2.0, which broke everything. This led to fragmentation between TensorFlow 1, TensorFlow 2, Keras, and JAX.
Pytorch's compilation of this computation graph erased the remaining edge of Tensorflow.
Is the battle over? From a purely computational point of view, the PyTorch solution is very far from optimal, and billions of dollars of electricity and GPUs are burned every year, but the major players are happy with circular deals to entrench their positions. So at the pace of current AI code development, it's probably one or two years before PyTorch is old history.
[1] https://www.geeksforgeeks.org/deep-learning/dynamic-vs-stati...
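A hedged sketch of that compilation point, assuming PyTorch 2.x: the same eager Python function can be handed to torch.compile, which traces it into a graph and can fuse kernels, narrowing the gap with the old static-graph frameworks.

```python
import torch

def f(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

# Eager mode: runs op by op, easy to step through and debug.
x = torch.randn(1000)
print(f(x).mean())

# torch.compile (PyTorch 2.x): traces the same Python into a graph and
# can fuse kernels before running it.
f_compiled = torch.compile(f)
print(f_compiled(x).mean())
```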
Someone’s got to prototype the next generation of architectures.
> at the pace of current AI code development, probably one or two years before Pytorch is old history.
Ehhh, I don’t know about that.
Sure, new AI techniques and new models are coming out pretty fast, but when I go to work with a new AI project, they’re often using a version of PyTorch or CUDA from when the project began a year or two ago. It’s been super annoying having to update projects to PyTorch 2.7.0 and CUDA 12.8 so I can run them on RTX 5000 series GPUs.
All this to say: If PyTorch was going to be replaced in a year or two, we’d know the name of its killer by now, and they’d be the talk of HN. Not to mention that at this point all of the PhDs flooding into AI startups wrote their grad work in PyTorch, it has a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at. I don’t even know what that would be.
Bear in mind that it took a few years for Tensorflow to die out due to lock in, and we all knew about PyTorch that whole time.
> a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at
The cost of migrating higher-level code to a newer framework is going to zero. You ask your favorite agent (or intern) to port it and check that the migration is exact. We already see this across the multitude of deep-learning frameworks.
The day an optimization trick appears that PyTorch can't do but another framework can, one that reduces your training cost 10x, PyTorch is going the way of the dodo.
The day an architecture that can't be implemented in PyTorch gets superior performance, it's bye-bye Python.
We already see this with architectures that require real-time rendering, like Gaussian splatting (Instant NeRF), or with the caching strategies for LLM sequence generation.
PyTorch has 3 main selling points:
- Abstracting away GPU- (or device-) specific code, which is needed because of Nvidia's mess: custom optimized kernels that you are forced to adapt to if you don't want to write custom kernels yourself.
This matters less if you don't mind writing optimized kernels because the machine writes them, or if you don't need CUDA because you can't use Nvidia hardware (for example, you are in China), or if you use custom silicon, like Groq, and need your own kernels anyway.
- Automatic differentiation. This is one of its weak points, because they went for easy instead of optimal and shut themselves off from some architectures. A language like Julia, thanks to dynamic low-level compilation, can do things PyTorch won't even dream about (but Julia has its own problems, mainly related to memory allocations). With PyTorch's introduction of the "scan" function [2], we have come full circle back to Theano, TensorFlow's/Keras's ancestor; that is usually the pain point of the automatic differentiation strategy PyTorch chose.
The optimal solution, as all physics PhDs who have written simulations know, is writing custom adjoint code via "source code transformation" or symbolically (see the toy sketch after this list): it's not hard but very tedious, so it's now a great fit for your LLM (or intern, or PhD candidate running "student gradient descent"), provided you prove or check that your gradient calculation is correct.
- Cluster orchestration and serialization: a model can be shared with fewer security risks than arbitrary source code, because you only share weights, and a model can be split between machines dynamically. But this is also a big weakness, because your code rusts as you become dependent on versioning; you are locked to the specific version your model was trained on.
[2] "https://docs.pytorch.org/xla/master/features/scan.html
What would stop PyTorch from implementing whatever optimization trick becomes important? Even if it requires a different API.
There are two types of stops: soft stops and hard stops.
- A soft stop is when the dynamic-graph computation overhead is too high: you can still calculate, but if you were to write the function manually, or with a better framework, you could be 10x faster.
Typical examples involve manually unrolling a loop, or doing kernel fusion. Other typical cases are when you have lots of small objects or need to loop in Python because the code doesn't vectorize well, or when you could exploit sparsity by skipping the zeros.
- A hard stop is when computing the function becomes impossible, because the memory needed to do the computation in a non-optimal way explodes. Sometimes you can get away with just writing customized kernels.
The typical example where you can get away with it is custom attention layers.
Typical examples where you can't get away with it are physics simulations. For example, the force is the gradient of the energy, but you have n^2 interactions between the particles, so if you preserve anything more than zero memory per interaction during the forward pass, your memory consumption explodes. And with things like Lagrangian or Hamiltonian neural networks, where you are trying to discover the dynamics of an energy-conserving system, you need to be able to differentiate at least three times in a row (a small sketch follows below).
There are also energy-expending stops, where you need to find workarounds to make it work, like when your parameters change shape during the optimization process (learning point clouds of growing size, for example); these spread you thin, so they won't be standardized.
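(For the "differentiate at least three times in a row" point, a minimal sketch with a made-up scalar "energy"; each call keeps the graph so the result can be differentiated again:)

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    energy = x ** 4  # toy scalar "energy"

    # create_graph=True keeps the graph so the result can be differentiated again.
    g1, = torch.autograd.grad(energy, x, create_graph=True)  # 4x^3  -> 32
    g2, = torch.autograd.grad(g1, x, create_graph=True)      # 12x^2 -> 48
    g3, = torch.autograd.grad(g2, x)                         # 24x   -> 48
    print(g1.item(), g2.item(), g3.item())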
I don't know the full list, but back when it came out, TF felt like a crude set of bindings to the underlying c++/CUDA workhorse. PyTorch felt, in contrast, pythonic. It was much closer in feeling to numpy.
I think it was mostly the eager evaluation that made it possible to debug every step in the network forward/backward passes. Tensorflow didn't have that at the time which made debugging practically impossible.
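(A tiny sketch of what that looks like in practice, with a made-up model: in eager mode you can inspect any intermediate tensor in the forward pass, and hook the backward pass, straight from Python:)

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    x = torch.randn(2, 4)

    h = model[0](x)                  # eager mode: inspect (or breakpoint on) any intermediate
    print(h.shape, h.mean().item())

    h.register_hook(lambda g: print("grad w.r.t. first layer output:", tuple(g.shape)))
    model[2](model[1](h)).sum().backward()  # the hook fires during the backward pass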
I would highly recommend the podcast by ezyang https://pytorch-dev-podcast.simplecast.com/ for a collection of design discussions on the different parts of the library.
I’m not sure if such an overview exists, but back when Caffe2 was still a thing and JAX was a big contender, dynamic vs. static computational graphs seemed to be a major focus point for people ranking the frameworks.
I read one post on his blog and found that Adam Paszke reached out to the author and got an internship. I wonder if it was really that easy to get an internship at FAIR; I thought they hired only PhDs.
I was pretty involved in the PyTorch ecosystem in the early days around 2016 and Adam was nothing short of a genius and prolific developer whose contributions to the codebase and community were immense. I think he was like an undergrad in Poland at the time. My understanding is that his contributions came before the internship, but I don’t know.
My memory is that Soumith was really open to other people’s contributions and questions, no matter their credentials. He was a great leader who felt approachable to the open-source community.
I didn't know that. Soumith Chintala certainly paid it forward. He was very helpful and supportive of random strangers (like me!) in the early pytorch days. I count him with Andrej Karpathy and Chris Olah as one of the people who really made machine learning accessible to regular software engineers.
Chris Olah is the goat.
I reached out to him myself years ago and was surprised at getting a response.
And the response was incredibly generous. I probably wouldn't have had the confidence to do my switch if it wasn't for Olah.
And as I got further into this path, I learned that Olah had done the same for some of my mentors and colleagues.
Every time Olah speaks, I listen.
You can't do anything if you never try.
Given the technical achievements and industry impact, I'm struck that his emphasis is on the people, and in particular growing people to take on challenges. His influence might lie as much in what they do. The connection linking them seems to be building curiosity and wonder into delight for users.
If there's a soul to silicon valley, that's it. However many jerks and political/power players I encounter, I remain inspired by the few like this.
Respect.
Very proud as a Swiss that Soumith has a .ch domain!
Probably because his first name is Chintala
That'd be his last name
true haha
You forgot to thank Jürgen. /scnr
It is notable (but perhaps not surprising) that this is mostly about the people and the work itself. The writing is silent on the downstream impacts on the world. In contrast, there are fields (global development, medicine, etc.) where people tend to focus on the impact on humanity (especially when reaching a milestone in their career).
Nice, that is the dream career!
Sounds like you had a momentous run.
If you take advice from reformed Internet trolls, consider turning off all your devices and trying to give yourself at least a week, but ideally a month offline staring at your new baby. You'll never get that time back and there's nothing your brain will appreciate more than loading up those memories as they grow.
Good luck.
https://news.ycombinator.com/item?id=45847465
Seems relevant.
Good luck, good job, and may more fantastical journeys lie ahead.
Firstly, Good work.
Ironic, but one HN front-page item today is this: "Meta projected 10% of 2024 revenue came from scams and banned goods, Reuters reports"
Glad you're leaving, hopefully you're in a good place financially. Take a page from Bill Gates and work on something that attempts to improve society. Stay away from surveillance capitalism and enshittification.
The last few years must have been incredibly exhausting. Thanks for your work, good luck, and 73.
Look, I get that some pages require JavaScript, but hiding the page with a style that is only unset by JS, with no <noscript> anywhere, is just... I just get a white page. Changing that style gives a perfectly readable page, so it seems a bit... pointless.
It's just a Python wrapper around Torch, an open-source C program by a Swiss non-profit, so Meta could compete with Google's TensorFlow.
I don't know why this is celebrated so much, a big company rebranding a non-profit for profit.
But I guess that's the norm for AI now.
Wait, PyTorch and the ecosystem are much more than just that. You can't be serious?
It's just Torch for idiots