In media ingestion this is called "eager" processing. Historically it was used for things like pulling thumbnails from images and video and pre-generating common sizes. This follows the same pattern and makes all the sense in the world. My only concern is that, due to the non-deterministic nature of LLMs, new models will reveal new information about your data.
For example, you might identify a car in an image, but the important context is that the car is running a red light. A new model might pick that up while an old one doesn't. These context adjustments might sometimes require you to rerun your LLM processing, or to keep a one-to-many relationship between an asset and multiple runs so you can take the best one or combine results.
Actual usage will also reveal the most commonly used assets, so you can target the most trafficked ones for reprocessing and save a ton on processing that way.
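A rough sketch of what I mean, in Python. Everything here is hypothetical (the `Asset` and `DescriptionRun` names, the `views` counter, the model identifiers); it only illustrates keeping several runs per asset and reprocessing the busiest assets first when a new model shows up.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptionRun:
    model: str  # hypothetical model identifier
    text: str   # description that model produced


@dataclass
class Asset:
    asset_id: str
    views: int = 0  # traffic counter taken from real usage
    runs: list = field(default_factory=list)  # one asset, many runs

    def add_run(self, model: str, text: str) -> None:
        self.runs.append(DescriptionRun(model, text))

    def has_run_from(self, model: str) -> bool:
        return any(run.model == model for run in self.runs)

    def combined_description(self) -> str:
        # Naive "combine results": join all runs in order, skipping exact repeats.
        seen = set()
        parts = []
        for run in self.runs:
            if run.text not in seen:
                seen.add(run.text)
                parts.append(run.text)
        return " ".join(parts)


def reprocess_queue(assets, new_model, limit):
    """Return the most-viewed assets that new_model has not described yet."""
    pending = [a for a in assets if not a.has_run_from(new_model)]
    pending.sort(key=lambda a: a.views, reverse=True)
    return pending[:limit]
```

Joining the runs is the crudest possible way to combine them; picking a single best run would need some scoring step that I'm not showing here.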
> we don't send images to the model at query time. We describe each image once, at indexing time, with a cheap vision model, store the descriptions as text, and retrieve them alongside ordinary text chunks
This is what I've been doing in my Obsidian infodump for a while. If I know that an image is important, I generate a text description (Mermaid if possible, English if not) and paste it after the image in a block. This lets agents "see" the image even if they can't actually see it. Though my process is manual, the improvements in outcomes for agents that rely on text search/retrieval are very real and worth it.
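If I ever automate it, it would probably look something like this sketch. Assumptions: images are embedded with Obsidian's `![[file.png]]` syntax, descriptions are single-line strings that already exist in a dict keyed by file name, and the `> Image description:` blockquote prefix is just a convention I made up for this example.

```python
import re

# Matches Obsidian-style embeds such as ![[diagram.png]] or ![[diagram.png|300]]
EMBED = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
PREFIX = "> Image description:"  # made-up marker, not an Obsidian feature


def add_descriptions(note: str, descriptions: dict) -> str:
    """Insert a one-line text description after each known image embed."""
    lines = note.split("\n")
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        match = EMBED.search(line)
        if not match:
            continue
        desc = descriptions.get(match.group(1).strip())
        if desc is None:
            continue  # no description for this image yet
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if not next_line.startswith(PREFIX):
            # Only add the block once, so rerunning the script is safe.
            out.append(f"{PREFIX} {desc}")
    return "\n".join(out)
```

Because the description sits in the note as plain text, anything that greps or indexes the vault picks it up without touching the image.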
That cookie popup just makes me wanna leave and never come back
I think they've fixed it now.
Thanks! Yep fixed