I'm surprised the article doesn't mention the biggest use of steering vectors, which is the potential to remove refusals from models (a.k.a. abliteration or uncensoring).
There was an earlier paper that found that "most refusals lie along a single direction": you can identify that vector and "nerf" it so the model skips refusals and answers "any" request normally. This was very doable for earlier models trained with SFT for refusals; it seems to be a bit more complicated for newer models, but still doable to some extent.
There are already libraries that automate this process of identifying the direction, modifying the weights, and releasing the result as an uncensored model. Steering instead lets you apply that vector change dynamically at inference time, so you don't need to swap models if the abliteration process somehow hurts accuracy on other, unrelated tasks.
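The "nerf the vector" step can be sketched with a toy numpy example, assuming the common difference-of-means construction: estimate the refusal direction from activations on refusal-triggering vs. harmless prompts, then project it out of the hidden states. Everything here (the synthetic activations, the `ablate` helper) is a hypothetical stand-in, not any specific library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-ins for hidden states collected at some layer on two prompt sets.
# A synthetic offset along dimension 0 plays the role of the refusal signal.
acts_harmful = rng.normal(size=(100, d_model)) + 2.0 * np.eye(d_model)[0]
acts_harmless = rng.normal(size=(100, d_model))

# Difference-of-means refusal direction, normalized to a unit vector.
refusal_dir = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(hidden, direction, strength=1.0):
    """Remove (strength=1.0) or scale down the component of each hidden
    state along `direction`; smaller strengths steer partially."""
    proj = hidden @ direction              # per-row projection coefficients
    return hidden - strength * np.outer(proj, direction)

h = rng.normal(size=(4, d_model))
h_ablated = ablate(h, refusal_dir)

# After full ablation, the component along the direction is ~0
# up to floating-point error.
print(np.abs(h_ablated @ refusal_dir).max())
```

In a real model you'd apply `ablate` inside a forward hook on the residual stream, which is what makes the dynamic on/off toggle possible without touching the weights.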
not sure why you're fixated on censoring. if we invert your POV, censoring includes not reporting falsehoods like "vaccines are harmful". Science and logic often tackle these subjects via censoring, but a model given an equal sampling of the Internet would think vaccines are harmful. a less naive correction would censor this problematic content.
so i'm confused as to why you think unmasking whatever bias you think is censored will result in an improvement in the generic use case.
> inspired to write this post by antirez’s recent project DwarfStar 4, which is a version of llama.cpp that’s been stripped down to run only DeepSeek-V4-Flash
This is not true; it is its own project.
Indebted to llama.cpp, sure, but not a stripped-down version.
> you can already exercise extremely fine-grained control by tweaking the language of your prompt.
maybe i suck at prompting, but i find it impossible to overcome the biases from its training data, post-training, etc.
you can only pattern-mine the training data using prompts. you don't really have any sort of fine-grained control.