This looks like many LLM-assisted data projects that help and are flexible, but aren't repeatable and aren't fast enough to be interactive. Nao is a good execution of the concept.
I built Buckaroo, a data table UI for Jupyter and Pandas/Polars, that first lets you look at the data in a modern, performant table with histograms, formatting, and summary stats.
Yesterday I released autocleaning for Buckaroo. It looks at data and heuristically chooses cleaning methods with definite code. This is fast (less than 500ms). Multiple cleaning strategies can be cycled through so you can choose the best approach for your data. For the simple problems we shouldn't need to consult an LLM to do the obvious things.
All of this is open source and extensible.
[1] https://youtube.com/shorts/4Jz-Wgf3YDc
[2] https://github.com/paddymul/buckaroo
[3] https://marimo.io/p/@paddy-mullen/buckaroo-auto-cleaning Live WASM notebook that you can play with - no downloads or installs required
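To make the heuristic idea above concrete, here is a minimal, hypothetical sketch in pandas (not Buckaroo's actual implementation; the 0.9 threshold and the strategy name are invented for illustration):

    import pandas as pd

    def suggest_cleaning(series: pd.Series):
        """Pick a cleaning step for a column using a cheap heuristic (illustrative only)."""
        non_null = series.dropna()
        if non_null.empty or non_null.dtype != object:
            return None
        # Mostly-numeric strings: strip common formatting, then cast to a number.
        numeric = pd.to_numeric(
            non_null.astype(str).str.replace(r"[,$%\s]", "", regex=True),
            errors="coerce",
        )
        if numeric.notna().mean() > 0.9:
            return "strip_formatting_and_to_numeric"
        return None

    df = pd.DataFrame({"price": ["$1,200", "$950", "$15", None], "note": ["a", "b", "c", "d"]})
    print({col: suggest_cleaning(df[col]) for col in df.columns})
    # {'price': 'strip_formatting_and_to_numeric', 'note': None}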
Whelp! I've been building something really similar. Mine's nowhere near as complete as Buckaroo, but I really think embedded apps in notebooks can be very useful.
Thanks for sharing. I like the view you built to visualize the profiling of your data; I think that's indeed key to understanding your data.
Cool idea! How did you train your tab model? Fill in the middle or is it based on edit history like cursor? Someone posted this yesterday and I found it fascinating https://www.coplay.dev/blog/a-brief-history-of-cursor-s-tab-...
Yes, we use fill-in-the-middle models (Mistral and our own trained Qwen). And we feed them your data context: we have our own SQL parser to provide the right data schema context depending on where your cursor is in the query.
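As a rough illustration of the approach (not nao's actual prompt format; the special tokens and the table-selection heuristic here are assumptions), a FIM request with schema context might be assembled like this:

    # Hypothetical sketch: build a fill-in-the-middle prompt with only the
    # schemas of tables referenced around the cursor.
    def build_fim_prompt(sql_before: str, sql_after: str, schemas: dict) -> str:
        relevant = {
            table: cols
            for table, cols in schemas.items()
            if table in sql_before or table in sql_after
        }
        context = "\n".join(
            f"-- {table}({', '.join(cols)})" for table, cols in relevant.items()
        )
        return f"{context}\n<fim_prefix>{sql_before}<fim_suffix>{sql_after}<fim_middle>"

    schemas = {
        "orders": ["id", "customer_id", "amount", "created_at"],
        "customers": ["id", "name", "country"],
    }
    print(build_fim_prompt(
        sql_before="select c.name, sum(o.amount)\nfrom orders o\njoin customers c on ",
        sql_after="\ngroup by 1",
        schemas=schemas,
    ))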
I hadn't realized you trained your own model. That's an important differentiator. How do you get training data of schemas in the wild?
Been using this for several weeks now and it's genuinely improved my workflow—I'm choosing it over VSCode and extensions more than half the time.
The chat for exploratory data analysis ("what can you tell me about this column I just added?"), the worksheets and column lineage are real game-changers for dbt development. These features feel purposefully designed for how I actually work.
Claire and Christophe are super responsive to feedback, implementing features and fixes quickly. You can see the product evolving in all the right directions!
Thanks for your kind message — and for helping us make nao better!
This is really slick. I watched the YouTube video (a couple of times; I didn't grok what was happening immediately) and I really love how this accelerates feedback cycles. Very, very cool.
Thank you! Actually this is exactly what we're targeting: we've seen that data teams often have a longer feedback loop than software engineers. Our goal is to shorten it and bring data as close as possible to your dev flow.
Hi, this is a great idea. I'm trying to get it working but I'm having some trouble in the last step of "Configure dbt project". I'm looking for a support link on your website but can't find any.
Here is our contact page, feel free to contact us and we'll help you set up: https://docs.getnao.io/docs/support/support Also, we'll be adding a new dbt onboarding flow tomorrow!
Does this only work if I'm writing raw SQL? Can I use this today if my project uses Postgres but has queries written in TypeScript using a query builder like Kysely?
Yeah, at the moment the Tab is made to work best with raw SQL (either pure SQL files or SQL in a string).
But if you use the chat/agent, you can explain that you're using Kysely and give it the warehouse context; it will probably handle this.
I did not know Kysely, but from the gif on the project's landing page it looks like the autocomplete is great? It's different from a tab, I agree though.
How much data is hitting your models/prompts? I am okay with you knowing about my schema, but a lot of warehouse data is sensitive data. I saw you have enterprise plans, and maybe that is my answer, but I’d love to know ahead of time if data/results are hitting your servers in addition to the code, or if it’s code-only.
The content of the data never goes to the models unless you specifically grant access. Our servers only store embeddings of your codebase and data schema. The content of the data is only accessed locally by your computer. When you ask our agent to run a query for you, it will execute it on your warehouse, then ask you for permission to read the result. If you don't allow it, you'll be able to preview it locally without sending it to the LLM. The enterprise version is for you if you want to be sure the prompts and context you send to nao don't go through a public LLM endpoint and get trained on.
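In other words, the flow is roughly the following (a hypothetical sketch of the consent gate described above; run_on_warehouse and send_to_llm are placeholders, not nao's API):

    def run_and_maybe_share(sql: str, run_on_warehouse, send_to_llm):
        rows = run_on_warehouse(sql)   # executed with your own warehouse credentials
        print(rows[:5])                # local preview only, nothing leaves your machine
        if input("Share these results with the LLM? [y/N] ").strip().lower() == "y":
            return send_to_llm(f"Query:\n{sql}\n\nResults:\n{rows[:50]}")
        return None                    # results stay local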
Like the looks of this. Any chance you'll be adding support for SQLite at some point?
Oh yes! It should be fairly easy; we have DuckDB coming in the next release, and we can also add SQLite. You use SQLite to develop locally, I guess?
I can't really tell what those databases are that are coming soon; a "hover" over the icons would be nice. Is SQL Server coming anytime soon? My coworkers are working on some data integrity work right now and it might be a nice tool for them.
It's Databricks, Iceberg and Redshift, which were the most requested in the first survey we did. But judging from this post and a broader audience, it seems SQLite wins at least! We'll also add SQL Server to the list.
Yes I second this. I use sqlite for local use and also for prototyping data designs, so sqlite support is very useful indeed - not a deal breaker but definitely a tick item.
Will also give nao a shot as soon as this is shipped. A LOT of non-corp data work happens in SQLite and DuckDB.
Yes just local, but love to use Nao to quickly analyze datasets
Also, how does it do with transitive joins across multiple tables that may not have FK/PK relationships? Other key features that would put this over the top: usage analysis and query rewriting for inefficient existing queries.
For the joins, we give the right context so that the model can infer the relationships: the schema of each table, plus how the joins are already done in your repository/query history. Usage analysis is definitely on the roadmap: we want to access the data warehouse logs to measure usage of each table.
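A rough sketch of that second ingredient, mining join conditions already present in a repo of SQL files so the model can infer relationships without FK/PK metadata (the regex is simplistic and purely illustrative, not how nao parses SQL):

    import re
    from pathlib import Path

    JOIN_RE = re.compile(r"join\s+(\w+)(?:\s+\w+)?\s+on\s+([^\n;]+)", re.IGNORECASE)

    def known_joins(repo: str) -> list[str]:
        joins = set()
        for path in Path(repo).rglob("*.sql"):
            for table, condition in JOIN_RE.findall(path.read_text()):
                joins.add(f"{table}: {condition.strip()}")
        return sorted(joins)

    # Feed these lines into the prompt alongside each table's schema.
    print(known_joins("models/"))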
Do you support Exasol? In the current climate we don’t want to be too dependent on US cloud services, so we are moving our performance sensitive dwh workloads off Snowflake to Exasol.
Not yet, but we are willing to develop these specific connectors on request. You can reach out!
Just one question what makes you pick Exasol rather than going with an open-source warehouse tech (e.g. Clickhouse or a lake with Trino)?
We tested those, but none of them could reach the performance we needed, especially under high concurrent load (we have a large number of concurrent workloads). Exasol is just crazy fast.
Would this work with Hydra? https://news.ycombinator.com/item?id=43937852
We support Postgres (and DuckDB is coming very soon), so yes, probably, since Hydra is a mix of both, but I'd have to try it.
Sweeeet. Let's give it a go!
I've met one of the founders, Christophe. Smart, perfect vision and huge energy. I can say that I have no doubt they'll succeed with Nao! Congrats!
Thanks for the kind comments, he's surely a great guy :)
*blushes*
Does anyone have any links for more LLM-based tools that are aimed at data engineering and data science?
I'm working on a repo listing these; I hope to finish it soon.
Congrats on the HN launch! Really excited to give this a try; I think this could be a huge unlock for my team.
One quick issue: I'm unable to connect to my Postgres instance that requires SSL.
SSH tunneling seems to be broken as well: when the box is checked, I am unable to select a private key path and the connect button is gone.
Parsing DB URI would be a helpful feature as well!
Thanks so much, excited to get this up and running when everything is fixed!
Ignore the SSH tunneling issue, I didn't see that I had to scroll... it's been a long week. Regardless, SSL-enabled connections would be huge.
Thanks for your feedback, we'll add SSL to the connection setup soon.
Great idea! How does your tab model compare to other ones from Cursor/Windsurf..?
When it comes to SQL writing we are more relevant. When it comes to speed, it's hard to benchmark exactly against Cursor and Windsurf, but we are a bit slower (around ~600ms on average), and we know what we have to improve to speed it up.
Next on the list is a next-edit suggestion dedicated to data work, especially with dbt (or SQL transformations), where changing a query means you also have to update the downstream queries.
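For readers wondering what that looks like mechanically, downstream dbt models can be found from the child_map in dbt's manifest.json (produced by dbt compile/run); this is a minimal sketch, not nao's implementation, and the project/model ids are made up:

    import json

    def downstream_models(manifest_path: str, node_id: str) -> set[str]:
        with open(manifest_path) as f:
            child_map = json.load(f)["child_map"]
        seen, stack = set(), [node_id]
        while stack:
            for child in child_map.get(stack.pop(), []):
                if child.startswith("model.") and child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    print(downstream_models("target/manifest.json", "model.my_project.stg_orders"))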
The founders showcased a demo at the Data Council conference. Looked cool!
Glad you liked it!
This looks awesome! I wish I could connect to my Postgres DB using SSL.
Thanks for suggesting, we should set this up!
Add Dataform support please, for us Google/BigQuery-native orgs :-)
Yes, it's on our roadmap; some users have already asked for it!
Nice! I think this space is growing. There are a few others I'm aware of in the space worth checking out: https://julius.ai/, https://cipher42.ai (I built the early version of this).
We've heard of Julius a lot, but did not know about Cipher42; there are a few other folks around. We feel there is a pain point, and data teams are a bit abandoned at the moment when it comes to working with AI, so it makes sense. Curious to hear about your journey building Cipher42: did you stop working on it?
Any plans to add support for ClickHouse? If so, what does that timeline look like?
Probably in a few months. For now we're focusing on making the experience great for a restricted number of warehouses. But you can reach out by email and we'll keep you updated.
That’s exactly what I was looking for months ago. I will check out Nao for sure.
Great! Let us know if it's how you imagined it when you try it
Awesome product!
Thank you!
Well done!
Thank you!
Does this mean we will have people “vibe coding” data warehouses now? Might cause a few issues…
We say "data vibing" so it feels unique to the data community! But in all seriousness, this is already an issue, people are already asking ChatGPT (or Cursor/whatever else) to generate SQL for them, but the next steps do not exist, if you "vibe code" for data you want to have the easiest feedback loop you can get to check if the output is good, and that's what we are working on: identifying the downstream impacts in the IDE and proposing fixes, a table diff view, new UI/UX to test your outputs.
The goal for us is to be the best way to do data with AI.
Ok, but how do you know it’s good?
With data I think that is very hard. I wrote a SQL query (without AI) that ran and showed what looked like correct numbers, only to realise years later that it was incorrect.
When doing more complex calculations, it's not clear to me how to check if the output is correct.
Usually what we've seen is data people keeping notebooks/worksheets on the side with a bunch of manual SQL queries that they run to validate data consistency. The process is highly manual and time-consuming. Most of the time teams know what kind of checks they want to run on the data to validate it; our goal here is to provide them with the best toolbox to do that, in the IDE.
Though, I'd say this is like writing tests in software: you can't catch everything the first time (even with 100% code coverage), especially in data, where most of the time things break because of upstream producers.
It will still require observability tools monitoring data live in the near future.
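As a hedged example of what such a toolbox of checks can look like in practice (table names and checks below are made up; run_query stands for any callable mapping a SQL string to a scalar):

    checks = [
        ("orders has no duplicate ids",
         "select count(*) - count(distinct id) from analytics.orders",
         lambda v: v == 0),
        ("no orders created in the future",
         "select count(*) from analytics.orders where created_at > current_timestamp",
         lambda v: v == 0),
    ]

    def run_checks(run_query):
        # Run each check and report pass/fail, notebook- or CI-friendly.
        for name, sql, ok in checks:
            value = run_query(sql)
            print(f"{'PASS' if ok(value) else 'FAIL'}: {name} (value={value})")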