I struggle with these world models from the perspective of video games (so this is one particular perspective).
I'm not a game developer myself, but some of my favorite games carry a deep sense of intentionality. For instance, there is typically not a single item misplaced in a FromSoftware game (or, more recently, Lies of P). Almost every object is placed intentionally.
Games which lack this intentionality often feel dead in contrast. You run into experiences which break immersion, or pull you out of the experience that the developer is trying to convey to you.
It's difficult for me to imagine world models getting to a place where this sort of intentionality is captured. The best frontier LLMs fail at this in writing all the time, and even in code, and the surface of experiences for those mediums often feels "smaller" than the user interaction profile of a video game.
It's also not clear how these world models could be used modularly by humans hoping to develop intentional experiences. I don't know much about their usage (LLMs are somewhat modular: they can produce text, humans can work on it, other LLMs can work on it). Is the same true for the video output here?
All this to say, I'm impressed with these world models, but as with LLMs and writing, it's not really clear what we are building towards. The ability to create less satisfying, less humane experiences faster? Perhaps the most immediate benefit is letting robotic systems simulate actions (by conjuring a world and imagining the implications).
In general, I have the feeling that we are hurtling towards a world with less intentionality behind all the things we experience. Everything becomes impersonal, more noisy, etc.
By and large I agree, but it doesn’t need to be either/or.
Many of the most popular games of the past decade are procedurally generated and have nothing “intentionally” placed, apart from tuning and tweaking the balance of the seeding algorithms.
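To make the "tuning the seeding algorithms" point concrete, here's a toy Python sketch (the item names and weights are invented for illustration): the designer never places any individual object; they only tune the distribution a seeded RNG draws from, and the same seed reproduces the same layout.

    import random

    # Hypothetical knobs a designer tunes instead of placing items by hand.
    ITEM_WEIGHTS = {"herb": 0.60, "chest": 0.25, "rare_relic": 0.05, "nothing": 0.10}

    def populate_region(seed: int, num_slots: int = 20) -> list[str]:
        """Fill a region's item slots from a seeded RNG.

        The same seed always yields the same layout, so the designer's
        'intent' lives entirely in ITEM_WEIGHTS, not in any one placement.
        """
        rng = random.Random(seed)
        items, weights = zip(*ITEM_WEIGHTS.items())
        return rng.choices(items, weights=weights, k=num_slots)

    print(populate_region(seed=42))  # deterministic, reproducible region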
Right, and I've wondered how these world models might be used in a careful way (just as agents can be used carefully to accelerate work).
Are video game developers using these systems in their workflows? Would love to learn more!
Which game would that be apart from Minecraft?
Outputting video of that quality/consistency at 1 minute in length, from a 2.6B model, seems insane?
They all look like video games. I guess Unreal Engine is used to create synthetic data for training.
What’s the long-term utility of world models?
There’s no doubt they’re technically impressive, but what does one do with them?
Games. Build campaigns in hours instead of months. Make it possible for users to create their own campaigns, move the action to different game worlds - 'gimme Mario Kart in the ${favourite_game} world', etc.
They can be base models for a bunch of things. Turning text-conditioned video generation models into robotics VLAs is a fun exercise.
This one is probably too small to be useful for that, and not diverse enough? But I could be wrong.
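For anyone curious what turning a text-conditioned video model into a VLA looks like mechanically, here's a rough PyTorch sketch (the backbone interface, shapes, and names are assumptions for illustration, not the API of any model mentioned here): freeze the pretrained video backbone and train a small action head on its latents.

    import torch
    import torch.nn as nn

    class VideoBackboneVLA(nn.Module):
        """Sketch: reuse a pretrained video model's latents to predict actions.

        `backbone` stands in for any text-conditioned video model that
        returns per-frame latent features of shape (B, T, latent_dim).
        """
        def __init__(self, backbone: nn.Module, latent_dim: int, action_dim: int):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False  # keep the learned world dynamics frozen
            self.action_head = nn.Sequential(
                nn.Linear(latent_dim, 512), nn.GELU(), nn.Linear(512, action_dim)
            )

        def forward(self, frames: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            latents = self.backbone(frames, text_emb)  # (B, T, latent_dim), assumed
            return self.action_head(latents[:, -1])    # action for the latest frame

The bet is that the backbone already encodes physics and object permanence from video pretraining, so the action head only has to learn the mapping from "what will happen" to "what to do".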
Put them in a robot so that it can navigate the physical world like humans. Self-driving cars.
It's a step towards something else?
Digital twin?
It ain’t open source until it’s released. It’s baitware.
The first video, with the guy walking up the mountain in snow, has consistency issues with the cave entrance. Which is "expected" at this model size?!
Most videos seem to have some issues like that, e.g. the book on the table in the library video takes on a different shape every now and then.
The 'Refiner' seems to do the opposite of what's intended, if the examples are representative: in all cases the first-stage images look better than the 'refined' ones. Less clutter, more realistic, less 'cowbell' for those who know the phrase.
My dreams have it too, which is unexpected at that model size!
All video models are terrible at consistency. Even closed source ones.
Seedance 2.0 and Kling 3 are regarded as the best closed-source video models we have. I've subscribed to a few AI video subreddits, and the consensus atm is that they are good for anything but long-form videos with humans.
No surprise that we're very good at spotting even the most subtle differences when looking at other people.
So, where is the download? I can't find it on GitHub, and on your web page the download button is disabled.
Also, will this run on an RTX 4090 with 24GB of memory?
Thank you!
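For a back-of-envelope answer, assuming bf16 weights and ignoring activations, video latents, and any VAE/decoder (which for one-minute clips could be substantial):

    params = 2.6e9        # model size quoted in the thread
    bytes_per_param = 2   # bf16/fp16
    weights_gb = params * bytes_per_param / 1e9
    print(f"weights alone: ~{weights_gb:.1f} GB")  # ~5.2 GB

    # That leaves roughly 19 GB of a 24GB 4090 for everything else, so it
    # plausibly fits -- but peak usage depends on resolution and clip length,
    # which aren't pinned down here.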
Scroll down and there are more videos -- seems like the models will be there "soon".
The most exciting part is that it’s open-source — innovation is going to compound fast.
Bot comment.
Given that's where everything is going, why not just get there faster by open-sourcing Seedance 2.0, Happyhorse, Veo 3, and all the others?
The trouble is the lack of training data available to these models compared to ones like Seedance and Kling, which seem to be tapping into their unlimited video inventories. Models like LTX are technically good, but they struggle with slightly different camera movements or with the subject interacting with objects. As a recent example, we had to use sample videos generated by closed-source models and then use those for the final video.
I tend to think of these NV Labs models as architectural demos and 'free razor blades' -- they're more intended to inform internal R&D, give customers something that lets them do what they want quickly, and advance the state of the art.
In this case, what looks interesting is the one-minute coherence and the massive speedup -- they claim 36x over open models with similar capabilities. You can tell they aren't aiming for state-of-the-art visuals -- the output quality looks very SD 1.5.