Voice First Experiments: Adventures in AI and Transcription

Okay, it is Wednesday, January 4th, and this is my second day trying to do voice-first on my computer. There have been some interesting experiments. Yesterday, I tried to do a blog post just by speaking into my iPhone’s voice memo app. It wasn’t quite the experience I was looking for. First off, getting the memo off of the iPhone and onto the MacBook was a little bit difficult. The app itself on the MacBook isn’t great either; the way the file storage works isn’t user-friendly at all.

So what I wound up doing was recording the audio with the iPhone voice memo app, transferring the memo to an iCloud folder synced to my MacBook, and then running that file through Whisper. The transcription was actually pretty good. I tried running it through GPT a couple of times to pare it down or clean it up, but I wasn’t happy with the results, mainly because my original post was about 3,000 characters, a little too long to strike a real balance between the prompt and the result from GPT. So yeah, I didn’t have too good a time with that. I did generate a title out of it; GPT is really good at summarization, we all know that. I tried GPT through the Playground, I tried ChatGPT, several different things, and wasn’t really happy. I get the sense that it wasn’t quite as frictionless as just typing it out myself, which is kind of why I’ve always typed to begin with. But we’ll try it again today and see how things go.
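For reference, once the file was synced over, the Whisper step itself was only a few lines of Python. This is a minimal sketch assuming the openai-whisper package; the filename is a placeholder for the synced memo.

```python
# Minimal sketch, assuming `pip install openai-whisper` and ffmpeg on the PATH.
import whisper

model = whisper.load_model("base")           # larger models trade speed for accuracy
result = model.transcribe("voice-memo.m4a")  # placeholder name for the synced memo
print(result["text"])
```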

So hopefully we’ll have some better luck today. One thing I have also been playing around with is Whisper Mic, a program on GitHub that uses your computer’s microphone to basically just listen. It’s been running right now, all night actually; I didn’t realize it, but we turned it on last night just to capture things. It’s kind of clunky, but it does what it’s supposed to do if you give it good-quality audio. If I’m speaking right in front of my MacBook, it seems to produce a pretty good transcription, but while it was running in the background this morning, with people walking around the house a couple feet away or across the room, it wasn’t picking things up very well. I’m not sure how it’s going to handle the coughing either, excuse me. I also noticed it didn’t do a very good job with Elder’s speech. She’s 10, and it did not seem to pick up her words as well as it has for me. There’s some testing to do there to figure out how that works.
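For the curious, the core of a listen-and-transcribe loop is pretty small. This is not the actual Whisper Mic code, just a sketch of the same idea using the speech_recognition package, which can chunk microphone input and hand each phrase to a local Whisper model.

```python
# Not Whisper Mic itself -- a rough approximation of the always-listening idea.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # helps a little with room noise
    while True:
        audio = recognizer.listen(source)         # blocks until a phrase ends
        text = recognizer.recognize_whisper(audio, model="base")
        print(text)
```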

I also want to play around with some of these multi-speaker (diarization) models. I’ve seen some demos where you specify the number of speakers in a clip, feed it in alongside Whisper, and it’s able to basically label which parts are speaker one and which are speaker two.
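One library I keep seeing in those demos is pyannote.audio. I haven’t run this myself yet, so treat it as a sketch of the published usage; the audio file, token, and speaker count are placeholders.

```python
# Untested sketch of speaker diarization with pyannote.audio.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="HF_TOKEN")  # placeholder token
diarization = pipeline("standup.wav", num_speakers=2)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```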

One thing that I did have some success with yesterday was using this to grab information and summarize a meeting. We had a quick standup yesterday: our designer came in and gave us an update on his work over the break. As he went into work mode, I managed to hit the record button in ClickUp while he was speaking. That gave me a video, but I was not able to get it from there to where I wanted it very easily. I figured I could pull the video straight down from ClickUp, but it was in some kind of WebM wrapper and I was getting an error trying to pull it into FFmpeg. What I wound up doing was just letting it play while my computer’s voice memo program recorded it again; then I was able to run that through Whisper, feed the transcript into GPT, and ask it to pull out the requirements for the next task and so on. It did summarize it. Again, it was a very short clip, two or three minutes I guess, so the context was not overloading the ChatGPT buffer, which is one of the main problems with all of this, obviously.
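If the WebM had cooperated, the whole pipeline would have been short. Here’s a sketch of what I was aiming for, assuming ffmpeg can actually read the file and using the openai package’s Completion endpoint; the filenames and prompt wording are just placeholders.

```python
# Sketch of the standup-summary pipeline; assumes OPENAI_API_KEY is set.
import subprocess
import whisper
import openai

# Unwrap the WebM into plain audio (this is the step that errored on me).
subprocess.run(["ffmpeg", "-i", "standup.webm", "standup.wav"], check=True)

transcript = whisper.load_model("base").transcribe("standup.wav")["text"]

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Pull out the requirements for the next task from this standup:\n\n{transcript}",
    max_tokens=500,
)
print(response.choices[0].text)
```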

There were a couple other things. I had some interesting conversations with my friend Chris about AI; he’s in the machine vision industry, and he and I had a long discussion around dinnertime last night. There are also some interesting developments at work that I can’t really talk about, other than to say that I’m trying to build a department of AI, so we’ll see if any of the founders bite. That’s what I’m going to do: present it to them and put some of this information together. Hopefully having more of a record of things will help, and we’ll see how I can take these transcripts and use them in a way that’s actually useful; it will be very interesting to figure that out. Maybe some embeddings work, text embeddings, building some search around audio. Like, how expensive would it be to run Whisper Mic 24/7, log everything it catches into a database, and just do semantic search on that? I think it would be pretty interesting to have your entire life stored in that way.
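As a rough sketch of that idea, assuming OpenAI’s embedding endpoint, with an in-memory list standing in for the database and made-up log entries:

```python
# Toy semantic search over transcript chunks; assumes OPENAI_API_KEY is set.
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

log = ["designer walked through the new onboarding screens",
       "talked with Chris about machine vision at dinner"]
vectors = [embed(chunk) for chunk in log]

# Rank logged chunks by cosine similarity to the query.
query = embed("what did the designer show us?")
scores = [v @ query / (np.linalg.norm(v) * np.linalg.norm(query)) for v in vectors]
print(log[int(np.argmax(scores))])  # best-matching chunk
```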

Speaking of which, I have been trying something called Rewind AI. Basically, the way it works is that it records a screenshot of your computer screen every X seconds or whatever, crawls all the text from your screen, and then loads it into some kind of database using a compression algorithm they have that’s proprietary. Then, if you’re looking for something, you can press Command-Shift and swipe up twice on the mouse, and it gives you a search over anything you’ve seen, said, or heard. Well, heard, that’s interesting; I’m going to have to experiment with that to see how it works. It also has an interesting little history function where you can slide back and see what your screen looked like.
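I have no idea how Rewind actually implements any of this, but a toy version of the screenshot-and-OCR loop might look like the following, assuming pyautogui and pytesseract (plus the tesseract binary installed).

```python
# Toy approximation of the Rewind idea, not their implementation:
# screenshot every 30 seconds, OCR it, append the text to a searchable log.
import time
import pyautogui
import pytesseract

while True:
    shot = pyautogui.screenshot()             # PIL image of the screen
    text = pytesseract.image_to_string(shot)  # needs the tesseract binary
    with open("screen-log.txt", "a") as f:
        f.write(text + "\n---\n")
    time.sleep(30)
```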

So that’s probably it for right now. We’re going to do a test: I’m going to use the iPhone voice memo again today to see how it compares to what Whisper Mic is pulling out. That will be my next task, and yeah, that’s it for now.