What's the difference between this and asking claude to do data analysis?
Two things:
1. You may not want to expose bits and pieces of your data and metadata to an external LLM, and you may not want your data used for training. If the LLM runs on your own machine, as in this case, you are covered there.
2. Claude can do a lot, but doing multi-step analysis consistently and reliably is not guaranteed because of the non-deterministic nature of LLMs: each run may take a different route. Nile local offers a set of data primitives such as query, build-pipe, and discover that reduce the non-determinism and bring reliability and transparency (how the answer was derived) to the data analysis.
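To make the second point concrete, here is a minimal sketch (hypothetical, not Nile's actual API) of the idea behind deterministic data primitives: each step of the analysis is a named function whose inputs and outputs are logged, so the same pipeline always takes the same route and the derivation of the answer can be inspected afterwards.

```python
# Hypothetical sketch of deterministic, traceable pipeline primitives.
# Class and step names are illustrative, not Nile local's real interface.
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    steps: list = field(default_factory=list)
    trace: list = field(default_factory=list)  # records how the answer was derived

    def add(self, name, fn):
        """Register a named, deterministic step."""
        self.steps.append((name, fn))
        return self

    def run(self, data):
        """Apply every step in order, logging each intermediate result."""
        for name, fn in self.steps:
            data = fn(data)
            self.trace.append((name, repr(data)))  # transparent, replayable log
        return data


rows = [{"region": "EU", "sales": 10}, {"region": "US", "sales": 30}]
pipe = (Pipeline()
        .add("filter_us", lambda rs: [r for r in rs if r["region"] == "US"])
        .add("sum_sales", lambda rs: sum(r["sales"] for r in rs)))
result = pipe.run(rows)  # same input -> same route -> same answer, every time
```

Unlike a free-form LLM session, re-running this pipeline on the same data cannot take a different route, and `pipe.trace` shows exactly how the result was produced.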
Very cool idea. The part I would love to hear more about is how you are thinking about the boundary between notebook/IDE convenience and actual data lake guarantees. For example, what exactly is versioned, how reproducible are transformations, and how much lineage visibility do I get once I start mixing SQL, PySpark, natural language queries, and imported web/DB data?
Everything, including the actual data, schema, and transforms, is versioned and tracked at the job-run level.
You get job-run-level lineage for any dataset created in the system.
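As a rough illustration of what job-run-level versioning and lineage can look like (this is a hypothetical sketch, not Nile's implementation), each run can record its upstream dataset versions, the transform applied, the output schema, and a content hash, so any dataset can be traced back through the runs that produced it:

```python
# Hypothetical sketch of job-run-level versioning and lineage.
# Names like Catalog, record_run, and lineage are illustrative assumptions.
import hashlib
import json


class Catalog:
    def __init__(self):
        self.runs = []  # append-only log of job runs

    def record_run(self, inputs, transform_name, output):
        """Snapshot one job run: upstream versions, transform, schema, data hash."""
        run = {
            "run_id": len(self.runs) + 1,
            "inputs": inputs,                 # upstream run ids (dataset versions)
            "transform": transform_name,      # which transform was applied
            "schema": sorted(output[0].keys()) if output else [],
            "data_hash": hashlib.sha256(
                json.dumps(output, sort_keys=True).encode()).hexdigest(),
        }
        self.runs.append(run)
        return run["run_id"]

    def lineage(self, run_id):
        """Walk upstream run ids to reconstruct how a dataset was derived."""
        run = self.runs[run_id - 1]
        chain = [run["transform"]]
        for upstream in run["inputs"]:
            chain = self.lineage(upstream) + chain
        return chain


cat = Catalog()
raw_id = cat.record_run([], "import_csv", [{"id": 1, "v": 2}])
agg_id = cat.record_run([raw_id], "aggregate", [{"total": 2}])
```

With this shape, `cat.lineage(agg_id)` reconstructs the ordered chain of transforms behind the aggregated dataset, which is the kind of visibility the question about mixing SQL, PySpark, and natural language queries is after.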
When you say local, do you mean I could run it without Wi-Fi? I have some work files I could use some help on but can't connect to other LLMs.
Can I run it on my MacBook? Do I need to set up the LLM myself?
Yes. I would recommend a machine with at least 16 GB of RAM, but I was able to run it on an 8 GB MacBook Air; the LLM assist lagged, though.
You don't need to set up the LLM locally; the tool does that for you. You can choose which model to use; Gemma and Qwen are supported now.