Intro to Rust for JS developers - Part 4
Very excited as we reach the last milestone of our little project. It's time to send the article to OpenAI and get audio back from it. First, let's dive into the OpenAI API documentation to learn more about the models we will use.
1. OpenAI API: Whisper and TTS models
Whisper and TTS models
Whisper is a state-of-the-art open-source speech recognition model that offers highly accurate speech-to-text, available through the OpenAI API via endpoints for transcriptions and translations. The opposite direction, text-to-speech, is handled by OpenAI's dedicated TTS models, exposed through the same audio API; those are the ones we will use here.
Be mindful: the API is NOT FREE. OpenAI uses a usage-based pricing model for its API, where you pay per request. You can check the current prices on the official pricing page. At the time of writing, the TTS (Text-To-Speech) model costs $0.015 per 1K characters. A request is limited to 4096 characters, so the maximum cost per request is 4.096 × $0.015 = $0.06144.
Using a paid API may be scary, as costs could spiral out of control in some cases. But OpenAI offers a pre-paid model that simply stops serving requests once all credits are consumed, so there is no risk of getting a bill of thousands of dollars like with AWS 😅. In my case I put in $50 as a learning budget to experiment with different OpenAI APIs. This tutorial can be completed for less than a dollar. Still, be careful not to put an infinite loop in your code 😅
For more details, you may refer to the official OpenAI documentation:
- https://platform.openai.com/docs/guides/text-to-speech
- https://platform.openai.com/docs/api-reference/audio/createSpeech
The audio API offers two endpoints for speech-to-text (transcriptions and translations), but the one we are interested in is the text-to-speech endpoint: POST https://api.openai.com/v1/audio/speech. It takes 3 required params:
- model: `tts-1` or `tts-1-hd`. The HD model costs twice as much as the standard one, so we will stick to `tts-1`.
- voice: The voice to use when generating the audio. Supported voices are `alloy`, `echo`, `fable`, `onyx`, `nova`, and `shimmer`.
- input: The text to generate audio for. The maximum length is 4096 characters.
The default response format is `mp3`.
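Put together, a minimal request body for this endpoint looks like the following (the input value here is just an example; we will build exactly this JSON from Rust later in this section):

```json
{
  "model": "tts-1",
  "input": "Hello from Rust!",
  "voice": "alloy"
}
```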
OpenAI API account
The API account is different from the one used for ChatGPT: you need to create an account at https://platform.openai.com/ and add credits. To call the API you need two values: an API key and an organization ID.
- The API key can be created in the API keys section.
- The organization ID can be found in the Settings section.
With enough understanding of how this API works, we can break this last milestone into small, manageable tasks:
- First, we need to connect to the OpenAI API and make a request to transform text to speech.
- The API has a limit of 4096 characters per call, so we need to split the article content into multiple chunks of 4096 characters max.
- Finally, we have to merge the audio files generated from those chunks into a single mp3 file.
2. Connect to OpenAI
API module
In previous sections we created modules by adding code to a separate file. We can also organize a module's code into a separate folder.
To do so, you give the folder the module's name and add a required file named mod.rs. The mod.rs file allows you to control which parts of your code are visible and accessible to other parts of your codebase: you use it to define the public interface of the module directory.
Inside the mod.rs file, you can use the `pub` keyword to make specific items (functions, structs, enums, or other modules) within that directory visible to code outside the directory. These items can then be accessed using the module's name.
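For illustration, a mod.rs could look like this (hypothetical module and function names, not part of our project):

```rust
// my_module/mod.rs
mod internal_helpers;     // private: usable only inside my_module
pub mod public_api;       // public: accessible as my_module::public_api

pub use public_api::run;  // re-export: callable directly as my_module::run
```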
Let's create a new module (directory) called api with two files, mod.rs and openai_client.rs. It will hold an OpenAI client: a struct storing everything our requests need (the API URL, the headers, and an HTTP client), and a method that calls the endpoint transforming text to speech.
api/
├── mod.rs
└── openai_client.rs
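The mod.rs here simply declares the client module as public so that code outside the api folder can use it (we will rely on this from main.rs later):

```rust
// api/mod.rs
pub mod openai_client;
```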
// openai_client.rs
use reqwest::header::HeaderMap;
use reqwest::Client;

pub struct OpenAIClient {
    url: String,
    headers: HeaderMap,
    http: Client,
}

impl OpenAIClient {
    pub fn new() -> OpenAIClient {
        // TODO
    }

    pub async fn text_to_speech(&self, text: String, output_path: &str) {
        // TODO
    }
}
Environment variables
To handle secrets like API keys, we need to add a `.env` file in the root directory. Within this file, define the environment variables needed to call the OpenAI API, and don't forget to add it to the `.gitignore` file to avoid publishing your keys to GitHub.
OPENAI_API_KEY=YOUR_API_KEY_HERE
OPENAI_ORG_ID=YOUR_ORG_ID_HERE
OPENAI_API_URL="https://api.openai.com/v1/audio/speech"
We can rely on the dotenv crate to seamlessly read these environment variables. Let's add it to our project:
cargo add dotenv
The `dotenv` crate has a function `dotenv()` responsible for loading the environment variables from the `.env` file in the root directory.
When you call `dotenv()`, it attempts to read the `.env` file and parse its contents, setting the environment variables accordingly. If the file is not found or there are any issues with parsing, it returns the error in its `Result`.
Depending on your use case, if you just want to ignore errors from reading env variables, you can call `dotenv().ok();`:
use dotenv::dotenv;

fn main() {
    dotenv().ok();
    // Environment variables loaded successfully
    // ... (rest of your code)
}
It is a concise way of saying: "attempt to load environment variables from the `.env` file, and ignore any potential errors." It ensures that your program doesn't crash if there are issues with loading the `.env` file or parsing its contents. Instead, it gracefully continues execution, allowing you to handle any errors as needed in the rest of your code.
If you want to handle errors more explicitly and take specific actions when an error occurs, you can modify the code as follows:
use dotenv::dotenv;

fn main() {
    match dotenv() {
        Ok(_) => {
            // Environment variables loaded successfully
            // ... (rest of your code)
        }
        Err(err) => {
            println!("Error loading environment variables: {}", err);
        }
    }
}
In our project, we need those env variables when creating a new instance of OpenAIClient, and at the same time we need to build the headers required to send HTTP requests to the OpenAI API:
// openai_client.rs
// at the top of the file, the imports have grown to:
use dotenv::dotenv;
use reqwest::header::{HeaderMap, HeaderValue, AUTHORIZATION, CONTENT_TYPE};
use reqwest::Client;
use std::env;

// inside impl OpenAIClient:
pub fn new() -> OpenAIClient {
    dotenv().ok();
    let api_key: String = env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not found!");
    let api_org: String = env::var("OPENAI_ORG_ID").expect("OPENAI_ORG_ID not found!");
    let api_url: String = env::var("OPENAI_API_URL").expect("OPENAI_API_URL not found!");
    let http_client = Client::new();
    let mut custom_headers = HeaderMap::new();
    custom_headers.insert(CONTENT_TYPE, HeaderValue::from_static("application/json"));
    let authorization = format!("Bearer {}", api_key);
    custom_headers.insert(
        AUTHORIZATION,
        HeaderValue::from_str(&authorization).expect("Could not get Authorization header!"),
    );
    custom_headers.insert(
        "OpenAI-Organization",
        HeaderValue::from_str(&api_org).expect("Could not get OpenAI-Organization header!"),
    );
    OpenAIClient {
        url: api_url,
        headers: custom_headers,
        http: http_client,
    }
}
Let's break down this code:
- The `.expect()` method is called on the result returned by `env::var("ENV_VAR")`. It takes a single argument, which is a custom error message that you provide. If the `Result` is an `Ok` variant (meaning the environment variable was found and its value retrieved successfully), `.expect()` simply returns the value inside the `Ok`. However, if the `Result` is an `Err` variant (indicating an error occurred, such as the environment variable not being set), `.expect()` will panic and terminate the program with the provided error message.
- `Client::new()` creates a new HTTP client instance using the reqwest crate, similar to what we have done in previous sections.
- `HeaderMap::new()` initializes an empty custom HTTP header map to store additional headers that will be sent with HTTP requests. It allows you to store and manage multiple HTTP headers within a single structure. Headers in a `HeaderMap` are stored as key-value pairs, where the key is the name of the header (e.g., "Content-Type") and the value is the corresponding header value (e.g., "application/json").
- `HeaderValue` is a type that represents the value of an HTTP header, such as the value of "Content-Type" or "Authorization". It provides methods to convert between different types and representations of header values, including converting to and from `String`, `&str`, and byte slices (`&[u8]`), and it ensures that header values are correctly formatted according to the HTTP specification. When constructing a `HeaderValue`, you typically use the `from_static` method to create a static header value from a string literal, or the `from_str` method to parse a string into a `HeaderValue` instance.
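To make the difference between the two constructors concrete, here is a small standalone sketch (not part of our client; the token value is made up):

```rust
use reqwest::header::{HeaderMap, HeaderValue, CONTENT_TYPE};

fn main() {
    let mut headers = HeaderMap::new();

    // from_static: for values known at compile time (string literals);
    // it panics at construction time if the literal is not a valid header value
    headers.insert(CONTENT_TYPE, HeaderValue::from_static("application/json"));

    // from_str: for values built at runtime; it returns a Result because
    // the string may contain bytes that are invalid in an HTTP header
    let token = format!("Bearer {}", "my-secret-token");
    let auth = HeaderValue::from_str(&token).expect("invalid header value");

    println!("{:?} {:?}", headers, auth);
}
```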
Implement text to speech
Now we can start implementing the code that sends text to the API endpoint and saves the generated audio as an mp3 file. The `text_to_speech` method should:
- First, create the request body as JSON
- Make a POST request to the API with the text to transform
- Parse the response and save the mp3 file
To be able to send a JSON payload with the POST request, we have to enable the json feature of the `reqwest` crate and add another crate, `serde_json`, that can create JSON objects:
cargo add reqwest --features json
cargo add serde_json
// openai_client.rs
use serde_json::json;
use std::fs::File;
use std::io::Write;

impl OpenAIClient {
    // previous code

    pub async fn text_to_speech(&self, text: String, output_path: &str) {
        let request_body = json!({
            "model": "tts-1",
            "input": text,
            "voice": "alloy"
        });
        let req = self
            .http
            .post(&self.url)
            .headers(self.headers.clone())
            .json(&request_body)
            .send();
        match req.await {
            Ok(res) => {
                // Extracts the binary audio data from the response
                let mp3 = res.bytes().await.unwrap();
                // Attempts to create a new file at the specified output_path
                match File::create(output_path) {
                    Ok(mut file) => {
                        // If the file was created successfully, it writes the MP3 audio data to the file
                        file.write_all(&mp3).unwrap();
                        println!("Audio file saved ✔️");
                    }
                    Err(_) => {
                        println!("❌ Failed to save audio file to {}", output_path);
                    }
                }
            }
            Err(e) => {
                println!("❌ Failed to generate audio: {}", e);
            }
        }
    }
}
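One subtlety worth knowing: if the API answers with an HTTP error (for example a 401 with a JSON error body), the code above would happily save that body as an .mp3 file. A stricter variant could use reqwest's `error_for_status` before reading the bytes; here is a sketch of just that match arm:

```rust
// Inside text_to_speech, before reading the bytes:
match req.await {
    Ok(res) => match res.error_for_status() {
        // Only reached when the API answered with a 2xx status
        Ok(res) => {
            let mp3 = res.bytes().await.unwrap();
            // ... save the file as before
        }
        // A 4xx/5xx status is turned into an Err by error_for_status()
        Err(e) => println!("❌ The API returned an error status: {}", e),
    },
    Err(e) => println!("❌ Failed to generate audio: {}", e),
}
```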
Let's try this out by calling it in the main.rs file. Remember that the API does not accept texts with more than 4096 characters. As we have not implemented any control on the text size yet, for the moment we will send only the article title, just to make sure the implementation works as expected.
Create an output folder at the root of the project. It is where the generated audio files will be saved.
use api::openai_client::OpenAIClient;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // previous code
    for (index, arg) in args.iter().skip(1).enumerate() {
        let url = arg.to_string();
        match get_article(url).await {
            Ok(article) => {
                let article_title = article.describe();
                println!("✅ Article fetched!");
                println!("⏩ Title: {}", article_title);
                println!("🎤 Audiofy...");
                let client = OpenAIClient::new();
                let output_path = format!("output/{}.mp3", article_title);
                // For testing purposes: we send the title only at the moment
                client.text_to_speech(article_title, &output_path).await;
            }
            Err(e) => {
                print!("❌ Failed to process argument at index {}: {:?}", index, e);
            }
        }
    }
    Ok(())
}
Now run:
cargo run https://www.sadry.dev/articles/intro-to-rust-for-js-devs-part-1
Be mindful that the API may take a few minutes to finish processing. Eventually, you should find an mp3 file in the output directory.
Congratulations, you have successfully “audiofied” the title of the article!
3. Audiofy an article
First of all, let's create a new module called audio that will handle processing an article and transforming it into an mp3 file.
In the main.rs file you should end up with only one call to an `audiofy` function. Create a new folder called audio with two files, audiofy.rs and mod.rs:
// audio/audiofy.rs
use crate::api::openai_client::OpenAIClient;

pub async fn audiofy(text: String) {
    println!("🎤 Audiofy...");
    let client = OpenAIClient::new();
    let output_path = format!("output/{}.mp3", text);
    // For testing purposes: we send the title only at the moment
    client.text_to_speech(text, &output_path).await;
}
// audio/mod.rs
mod audiofy;
// only audiofy accessible outside this module
pub use audiofy::audiofy;
In the previous section, the whole openai_client module was declared as public using a `pub mod openai_client` declaration in mod.rs. For this module, we expose only one function, and we use it as follows:
// main.rs
mod audio;
// ...
use audio::audiofy;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // previous code
    match get_article(url).await {
        Ok(article) => {
            println!("✅ Article fetched!");
            let article_title = article.describe();
            println!("⏩ Title: {}", article_title);
            audiofy(article_title).await;
        }
        Err(e) => {
            print!("❌ Failed to process argument at index {}: {:?}", index, e);
        }
    }
    // ...
}
At this point the CLI should work as before; no change to the logic has been made.
Now, let's handle the size limit. As mentioned before, an OpenAI API request is limited to 4096 characters. If the payload is larger than that, the request will fail. So before any attempt, we need to split our text into multiple chunks of 4096 characters max each. Fortunately, Rust offers a helper function that does exactly that: chunks. ...Well, it is not that straightforward!
From the docs it's not obvious how to use it, but we can see that it is a method on the primitive type slice. First we need to clarify what a slice is.
In Rust, a slice is a fundamental concept: a type that represents a view into a contiguous sequence of elements. Slices let you work with a portion of an array, vector, or other data structure without copying the data, making them efficient and versatile.
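For instance, a minimal standalone illustration:

```rust
fn main() {
    let numbers = vec![10, 20, 30, 40, 50];
    // A slice borrows a view into the vector without copying the data
    let middle: &[i32] = &numbers[1..4];
    println!("{:?}", middle); // [20, 30, 40]
}
```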
In our case we have our article text as a `&str`, and we need its characters. `.chars()` returns an iterator of characters (of type `std::str::Chars`) from a `&str`.
`Chars` is not a slice type; it is an iterator type, so we cannot apply `.chunks()` to it directly. To get something sliceable we can use `.collect::<Vec<char>>()` to collect the characters into a vector.
`Vec<char>` is not a slice either. It is a vector, but we can call slice methods like `.chunks()` on it because `Vec<char>` dereferences to a slice (`&[char]`).
The `.chunks()` method itself creates an iterator that yields slices (not vectors) of the original collection, breaking it into chunks of a specified size. It's important to note that the pieces produced by `.chunks()` are slices, not vectors.
At this point you may think we are done, with code as follows:
text.chars()
    .collect::<Vec<char>>()
    .chunks(API_LIMIT)
Not at all! The result of this expression is an iterator over character slices (`std::slice::Chunks<'_, char>`), which is not suitable for our API call. We need to convert each chunk back into a `String`.
// audio/audiofy.rs
const API_LIMIT: usize = 4096;

fn split_into_chunks(text: &str) -> Vec<String> {
    text.chars()
        .collect::<Vec<char>>()
        .chunks(API_LIMIT)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect()
}
- `chunk.iter()` converts each chunk (a slice of characters) produced by `.chunks()` into an iterator over its characters.
- `.collect::<String>()` collects those characters into a `String`, so each chunk becomes a separate string.
- The final `.collect()` gathers the strings produced by `.map()` into the `Vec<String>` we return. Each element stays under the API limit, ready to be sent as a separate payload.
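A quick sanity check of this behavior (a hypothetical snippet, assuming `split_into_chunks` and `API_LIMIT` from above are in scope):

```rust
fn main() {
    // 10,000 characters should yield chunks of 4096, 4096, and 1808
    let text = "a".repeat(10_000);
    let chunks = split_into_chunks(&text);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].len(), 4096);
    assert_eq!(chunks[2].len(), 1808);
    println!("All chunks within the API limit ✔️");
}
```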
Process the article's content
The next step is to update our `audiofy` function and the main part of our program to split the article's content into smaller parts of 4096 characters max, and send them to the API.
First we need to add a method to the `Article` implementation that returns the content of an article:
// article.rs
impl Article {
    async fn new(url: String) -> Result<Article, ValidationError> {
        // previous code
    }

    pub fn get_content(&self) -> Option<String> {
        self.content.clone()
    }
}
Next, we want to update the `audiofy` call to take two arguments: the article content, and the path where to save the audio file once it's ready:
// main.rs
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // ...
    Ok(article) => {
        println!("✅ Article fetched!");
        let article_title = article.describe();
        println!("⏩ Title: {}", article_title);
        let output_path = format!("output/{}.mp3", article_title);
        match article.get_content() {
            Some(article_content) => {
                audiofy(article_content, &output_path).await;
            }
            None => {
                println!("❌ Article has no content!");
            }
        }
    }
    // ...
}
The way we'll set up `audiofy` is straightforward. First, it splits the article into chunks of no more than 4096 characters each. Next, it converts each chunk into audio using the OpenAI API and stores each audio segment in a temporary folder. Finally, it merges all these segments into a single MP3 file, saved at the location given as an argument.
// audio/audiofy.rs
use std::fs::{create_dir, read_dir, remove_dir_all};
use std::io;

const API_LIMIT: usize = 4096;
const OUTPUT_DIR: &str = "output";
const TEMP_DIR: &str = "temp";
//...

pub async fn audiofy(text: String, output_path: &str) {
    println!("🎤 Audiofy...");
    let client = OpenAIClient::new();
    // Get a vector of chunks of 4096 characters max
    let text_chunks = split_into_chunks(&text);
    // Create a temp folder to save the audio file of each chunk
    let temp_path = format!("{}/{}", OUTPUT_DIR, TEMP_DIR);
    create_dir(&temp_path).expect("Failed to create temp directory");
    // Call text_to_speech on each chunk to get the mp3 from it
    println!("=> {} chunks to transform:", text_chunks.len());
    for (i, chunk) in text_chunks.iter().enumerate() {
        println!("Processing chunk at index {}...", i);
        let chunk_path = format!("{}/chunk_{}.mp3", temp_path, i);
        client.text_to_speech(chunk.to_string(), &chunk_path).await;
    }
    // Merge all audio files into one final mp3 file saved into the output path
    merge_audio_chunks(&temp_path, output_path);
    // Delete the temp dir with all its content
    remove_dir_all(&temp_path).expect("Failed to remove directory");
}
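One caveat: `create_dir` returns an error if the directory already exists, for example after a previous run that crashed before the cleanup step. A small defensive variant (a sketch, not required for the happy path):

```rust
use std::fs::{create_dir_all, remove_dir_all};
use std::path::Path;

// Recreate the temp dir from a clean state before processing
fn reset_temp_dir(path: &str) {
    if Path::new(path).exists() {
        remove_dir_all(path).expect("Failed to clean temp directory");
    }
    create_dir_all(path).expect("Failed to create temp directory");
}
```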
To merge the MP3 files into one audio file, we'll use ffmpeg. It's a powerful open-source project for processing video, audio, and other multimedia files.
ffmpeg needs to be installed separately from our Rust application. If you're on macOS, simply run `brew install ffmpeg` in your terminal. For other operating systems, you can download it from the official website.
To concatenate multiple audio files using ffmpeg, we need to execute the following command (doc):
ffmpeg -i "concat:file1.mp3|file2.mp3|file3.mp3" -c copy output.mp3
To run it from our code, we will leverage `std::process::Command`. This part of the Rust standard library lets our program start and communicate with external processes, like running a command in a terminal.
// audio/audiofy.rs
use std::process::Command;

fn merge_audio_chunks(input_path: &str, output_path: &str) {
    let files = list_file_paths_in_directory(input_path).unwrap();
    // Build the argument for ffmpeg's -i (input) option using the concat protocol:
    // files.join("|") joins the file paths into a single string separated by the
    // | character, which is then prefixed with "concat:"
    let concat_command_arg = format!("concat:{}", files.join("|"));
    Command::new("ffmpeg")
        .arg("-i")
        .arg(concat_command_arg)
        .arg("-c")
        .arg("copy")
        .arg(output_path)
        .output()
        .expect("Failed to create your MP3 file");
    println!("\n✅ Your podcast is ready in: {}", output_path)
}
The `list_file_paths_in_directory` function accepts a directory path (input_path) as its parameter and outputs a list of file paths found inside that directory. The actual code follows below.
`format!("concat:{}", files.join("|"))` creates the argument for the `-i` (input) option of ffmpeg, using the concat protocol. It combines the list of file paths into one string, with each path separated by the `|` character, and this combined string is then marked for the concat protocol by prefixing it with "concat:".
In the final step, we initialize a new ffmpeg process using `Command::new("ffmpeg")`, which sets up the execution of the ffmpeg command. We then add the input flag via `.arg("-i")`, followed by the pre-constructed concatenation argument with `.arg(concat_command_arg)` and the stream-copy option (`-c copy`) from the command above. The location of the resulting audio file is set by `.arg(output_path)`, where `output_path` is a variable holding the full path and filename of the output MP3. The execution of the ffmpeg command and the capture of its output are triggered by `.output()`, which returns a `Result` containing the outcome of the operation. If an error occurs and the command fails to execute, `.expect("Failed to create your MP3 file")` will halt the program and display the specified error message, ensuring the user is aware of the failure to create the MP3 file.
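Note that `.output()` only returns an `Err` when the process could not be started at all (for example, if ffmpeg is not installed). If ffmpeg starts but exits with an error, `.expect()` will not catch it. A stricter variant, as a sketch with the same arguments as above:

```rust
use std::process::Command;

fn run_ffmpeg_concat(concat_arg: &str, output_path: &str) {
    let output = Command::new("ffmpeg")
        .arg("-i")
        .arg(concat_arg)
        .arg("-c")
        .arg("copy")
        .arg(output_path)
        .output()
        .expect("Failed to start ffmpeg");

    // A non-zero exit code still comes back as Ok, so check it explicitly
    if !output.status.success() {
        eprintln!("ffmpeg failed:\n{}", String::from_utf8_lossy(&output.stderr));
    }
}
```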
Let's go back to the `list_file_paths_in_directory` function. It is designed to list all the file paths within a given directory and return them as a vector of strings (`Vec<String>`).
The `std::fs` module exposes a function `read_dir` that takes a directory path and returns an iterator over the entries in that directory.
// audio/audiofy.rs
fn list_file_paths_in_directory(path: &str) -> io::Result<Vec<String>> {
    let mut file_paths = Vec::new();
    let entries = read_dir(path)?;
    for entry in entries {
        let entry = entry?;
        let path = entry.path();
        if path.is_file() {
            if let Some(path_str) = path.to_str() {
                file_paths.push(path_str.to_string());
            }
        }
    }
    // read_dir does not guarantee any ordering, so sort the paths
    // (lexicographically) to merge the chunks in a stable order
    file_paths.sort();
    Ok(file_paths)
}
The implementation is pretty straightforward. However, one part that might need a bit more explanation is this:
if let Some(path_str) = path.to_str() {
    file_paths.push(path_str.to_string());
}
The `if let` syntax is a concise way to handle pattern matching in Rust, especially useful when you're interested in only one pattern and want to ignore others. It's equivalent to:
match path.to_str() {
    Some(path_str) => file_paths.push(path_str.to_string()),
    None => (), // Do nothing if the path cannot be converted to a string
}
And with that, we've reached a very exciting milestone: our setup is complete and can process an article in its entirety. We can now take any article, regardless of its length, and convert it into speech. Let's celebrate 🎉
4. To be continued
I've hit my first big goal with these tutorials: getting comfortable with Rust by creating a CLI that turns articles into mp3s. Sure, the code isn't perfect or what a pro Rust developer might write, but it's solid enough for me, and it helps me jump into and navigate a Rust codebase on GitHub.
Now, I want to add a couple of cool things to our CLI:
- Implement a safeguard to avoid calling the API if the payload is more than 4096 characters. This will prevent unnecessary fees for invalid requests.
- Install the CLI globally to be able to run it from my terminal without having to navigate to the codebase.
Tackling these upgrades is our next challenge for the final article in this series.