Turning text into voice: Freedom for your content!
github.com/W01fw00d/text-to-voice
- Post header illustration by @gelabert.art
Some context
I used to manage and write in a collaborative narrative writing group (mostly in Spanish); I even gave a 5-minute talk about it! At some point, we started recording audio whenever we met to read a finished book together. Later, I decided to edit those recordings and turn them into a podcast.
That was cool enough, but... what about the first books we wrote, before we had a good mic to record them? Did I really want to use my eyes to enjoy them again, after an 8-hour session of computer work?
What I needed
- A task that could read through every chapter of a story and generate an audio file with a voice narrating it
- Dynamically add the chapter number at the beginning
- Dynamically change the narrator voice when reading the chapter title and the dialogues (I wanted at least two voices for dialogues, to indicate two different speakers)
- Dynamically add the opening and ending songs assigned to that book to every one of its chapters
- Support both Spanish and English
- Output in .mp3 so I just have to upload it to my podcast platform of choice: ivoox (free)
How it went
- Issue: The library node-gtts is cool and useful, but it didn't allow me to change voices using markdown or anything fancy like that, and it didn't offer more than 2 different Spanish voices
- Solution: I wrote my own loop and called the library once per paragraph, indicating which voice to use for each one. I had to use Portuguese and Italian, which are phonetically quite similar to Spanish, in order to have more voices available
- Issue: When I merged the voice file with the song files, the output file was corrupted (probably missing headers)
- Solution: I was using a rather naive and old approach for combining files, based on an archived repo. I had the same issue with the audioconcat library. Finally, using the fluent-ffmpeg library directly worked perfectly, and it took much less effort than I had imagined
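The two solutions above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the repo: the `VOICES` table, its language codes, and the function names `planVoices` and `mergeChapterAudio` are all hypothetical, and speakers are assumed to simply alternate on every dialogue paragraph. The merge helper receives the `ffmpeg` function exported by fluent-ffmpeg as a parameter, so the sketch itself has no dependencies.

```javascript
// Hypothetical voice table. node-gtts picks a voice through a language code,
// so Portuguese and Italian stand in as extra speakers for Spanish text.
const VOICES = {
  chapterWord: 'Capítulo',
  title: 'es-es',
  narration: 'es',
  speakers: ['pt', 'it'],
};

// Build one { lang, text } entry per paragraph, with the chapter number first.
// `paragraphs` is assumed to be pre-classified: [{ kind, text }, ...].
function planVoices(chapterNumber, title, paragraphs, voices = VOICES) {
  const plan = [
    { lang: voices.title, text: `${voices.chapterWord} ${chapterNumber}. ${title}` },
  ];
  let dialogueCount = 0;
  for (const { kind, text } of paragraphs) {
    if (kind === 'dialogue') {
      // Naive assumption: speakers alternate on each dialogue paragraph.
      const lang = voices.speakers[dialogueCount % voices.speakers.length];
      plan.push({ lang, text });
      dialogueCount += 1;
    } else {
      plan.push({ lang: voices.narration, text });
    }
  }
  return plan;
}

// Concatenate opening song, narration and ending song into one file.
// `ffmpeg` is the function exported by fluent-ffmpeg, passed in by the caller.
function mergeChapterAudio(ffmpeg, { opening, narration, ending, output, tmpDir }) {
  return new Promise((resolve, reject) => {
    ffmpeg()
      .input(opening)
      .input(narration)
      .input(ending)
      .on('error', reject)
      .on('end', () => resolve(output))
      .mergeToFile(output, tmpDir);
  });
}
```

Each plan entry could then be handed to node-gtts (for instance `require('node-gtts')(entry.lang).save(filePath, entry.text, callback)`), and the merge would be called as `mergeChapterAudio(require('fluent-ffmpeg'), { ... })`. The language codes are placeholders, so check them against the list the library actually supports.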
Lessons learned
- Parsing narrative text can be expensive to develop: every writer uses different conventions for marking dialogues, scene changes... You need a system that is flexible and "open", because you'll have to change it whenever you find an unexpected chapter format! I even ended up using some regex...
- It's a good idea to spend some time researching a library before building your solution on it. Sometimes, even if a library seems to be the perfect fit for your need, it's better to use one with more powerful features or just write the needed code yourself
- Audio editing takes a lot of effort for someone like me who doesn't really know about sampleRates and similar audio file attributes. For this kind of feature, it's better to delegate to some program or library
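To illustrate the first lesson, here is one way a flexible, "open" parser might look: the writing conventions live in a table of regex patterns that can be swapped or extended when an unexpected chapter format shows up. This is only a sketch under assumptions. The names `DEFAULT_CONVENTIONS`, `classifyParagraph` and `parseChapter` are hypothetical, and the patterns (a dash or an opening quote at the start of a paragraph for dialogue, `***` or `---` for a scene break) are guesses at common conventions, not the rules used in the repo.

```javascript
// Hypothetical convention table. Order matters: the first matching pattern wins.
const DEFAULT_CONVENTIONS = [
  // A line made only of asterisks ("***", "* * *") or of three or more hyphens.
  { kind: 'sceneBreak', pattern: /^\s*(\*\s*){3,}$|^\s*-{3,}\s*$/ },
  // Spanish-style dialogue: the paragraph opens with a dash (\u2014, \u2013 or "-").
  { kind: 'dialogue', pattern: /^\s*[\u2014\u2013-]\s*\S/ },
  // English-style dialogue: the paragraph opens with a quote mark.
  { kind: 'dialogue', pattern: /^\s*["\u201C\u00AB]/ },
];

// Return the kind of the first convention that matches, else 'narration'.
function classifyParagraph(text, conventions = DEFAULT_CONVENTIONS) {
  const match = conventions.find(({ pattern }) => pattern.test(text));
  return match ? match.kind : 'narration';
}

// Split a chapter on blank lines and classify every non-empty paragraph.
function parseChapter(rawText, conventions = DEFAULT_CONVENTIONS) {
  return rawText
    .split(/\r?\n\s*\r?\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0)
    .map((text) => ({ kind: classifyParagraph(text, conventions), text }));
}
```

A book that marks dialogue differently would only need its own table, for example `parseChapter(text, [{ kind: 'dialogue', pattern: /^>/ }])`, while the splitting and classification code stays untouched.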
The fruits of hard work
You can access the code on my GitHub and check the current status of the project in the issues section!
You can even check an example of the final output! It's a specific chapter that I wrote in English
Is there a future for this project?
While I finish converting all the old books, I'll extract some features (dialogue interpretation, adding opening and closing songs...) for more general use (something like a podcast-chapter-generator).
I would also like to remove the burden of manually uploading every chapter to ivoox (they don't offer an API, so I'm thinking of using Cypress or a similar automation tool in order to, as always, make my life a bit easier...)