I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and never revealed.
https://github.com/microsoft/VibeVoice/issues/102
Indeed. We now live in a world where freeware is named open source. We are very sorry, Stallman.
> we should stop calling this type of model open source. They are indeed "open weight"
This ship has sailed. It’s now in the same category as hacker/cracker and the pronunciation of GIF.
I think you mean GIF.
It's the same as GIS, you wouldn't say jizz now would you?
I absolutely do, every single time it comes up.
The developer of the format declared the pronunciation 30+ years ago. It has always been jif.
i am absolutely going to from now on
How do you pronounce giraffe?
How do you pronounce gift?
I take it that you haven’t met the Arcgees people…
And "hallucination" which should have been "delusion".
Way early on (spring 2023) people tried to stop it, but no luck.
I mean, you have "AI" which means just about anything in marketing speak, "Agentic" is kind of becoming similar, hopefully they don't goof that one too badly, would be nice to know what you are trying to sell me. Used to be "Cloud" meant storage not just hosting (I guess it still does).
Then there's "Smart" in front of Car, Phone, TV, and so on... Meaning different things.
I do think "Open Weight" should be more commonly used. On the other hand, there are definitely communities that spring up to build the training and inference infrastructure around open models.
Openwashing is the new greenwashing, which, coincidentally, seems to have gone out of fashion a few hundred datacentres ago.
it was replaced with abundancewashing
This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual.
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
Yeah, I don't get why it is suddenly getting so much attention today; it is all over Twitter too.
well duh, they updated the news section
https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...
which is microsoft for "we removed two dead links". AI innovation knows no limits!
I think this was all covered when they said it was released by Microsoft?
The nuance is lost on LLM agentic dominant partakers.
looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested
In the past month or so, I added two models to my app Whisper Memos (https://whispermemos.com):
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?
Does Cohere work with longer transcripts? Do you have to do some magic to merge recordings over 35 seconds long?
I've had good experiences with the Mistral Voxtral models (I've used the API, but some of the model-variants are open weight)
Have you tried qwen?
Any non-Musk alternatives that are comparable in quality and cost?
Voxtral competes on price ($0.003/min) and quality. Speechmatics has best in class accuracy but is a bit more expensive ($0.004/min)
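For anyone comparing, a quick back-of-envelope conversion of the rates quoted in this thread onto a common per-hour basis (these numbers come from the comments above, not from official pricing pages, so treat them as assumptions):

```python
# Rough cost-per-hour comparison using the rates mentioned in this thread.
# Per-minute rates quoted upthread; not verified against official pricing.
rates_per_min = {"Voxtral": 0.003, "Speechmatics": 0.004}

rates_per_hr = {name: round(rate * 60, 3) for name, rate in rates_per_min.items()}
rates_per_hr["Grok STT"] = 0.10  # quoted directly per hour upthread

print(rates_per_hr)  # cost in USD per hour of audio transcribed
# → {'Voxtral': 0.18, 'Speechmatics': 0.24, 'Grok STT': 0.1}
```

So at these quoted rates, Grok would be the cheapest per hour, with Voxtral close behind.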
Our default is still OpenAI Whisper. Grok is just a choice for users who might prefer it.
I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice now has this feature.
My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?
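If anyone wants to try the Whisper + Pyannote route: the glue step is just matching transcript segments to diarization turns by time overlap. A minimal sketch of that logic (the data shapes here are my simplified assumptions about what the two tools give you, reduced to plain dicts and tuples):

```python
# Assign a speaker label to each transcript segment by maximum temporal
# overlap with diarization turns. Assumed shapes (simplified):
#   segments: [{"start": float, "end": float, "text": str}]  (e.g. from Whisper)
#   turns:    [(start, end, speaker_label)]                   (e.g. from pyannote)

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for (t0, t1, speaker) in turns:
            ov = overlap(seg["start"], seg["end"], t0, t1)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

segments = [{"start": 0.0, "end": 2.0, "text": "hello"},
            {"start": 2.0, "end": 5.0, "text": "hi there"}]
turns = [(0.0, 1.9, "SPEAKER_00"), (1.9, 5.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

This is only the merge step, of course; getting good segmentation and speaker embeddings is the part the models themselves compete on.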
Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.
Especially when "vibe coded" can have a negative connotation meaning quickly put together without understanding.
I’m just surprised they put the name of the e-waste slop company in their product
Which makes it even weirder that they get offended when people use "Microslop". They are the ones leaning into the marketing.
Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?
It’s confusing (at least for me) because the project covers a number of things including what you are mentioning.
Look at the "News" section in the readme - the original TTS model is gone from this repo (you can still find it in other places), but the STT/ASR, long-form TTS, and streaming TTS models are newer.
Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/
Note that this just covers the Speech-to-Text/speech-recognition aspect (à la Whisper); there are also models for long-form Text-To-Speech and streaming Text-To-Speech.
“VibeVoice can only handle up to an hour of audio”
Why?
Holy moly, a Microsoft AI product that isn't named Copilot!
Missed opportunity to call it Vopilot
So we've really just settled on Vibe as the verb for AI then?
I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
it'll probably be something we're not even talking about yet - we still have 7 months in which to make the world even worse
Why use precise technical language when you can just vibe with your AI system?
You have selected Microsoft Sam as the computer's default voice.
My friends and I had fun in the computer lab with Microsoft Sam, inputting long strings of characters to create funny sound effects. Sususususususu.
What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?
open weights i would say S2: https://github.com/rodrigomatta/s2.cpp
Locally maybe https://voicebox.sh/
Elevenlabs in the cloud.
Local? No idea. Cloud? Eleven Labs, probably. But it's described as "cloning", not "training". Not sure what the distinction is, or why it matters if the end result is that you can generate any TTS that sounds like you. There might very well be an important one, I just don't know it.
Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243
Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.
For me it's giving very poor results.
Seems quite heavy for an STT model; Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to the additional accuracy and speaker diarisation?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck