Voice AI assistant using JavaScript, PHP, and ChatGPT

My disappointment with my Amazon Echo/Alexa device doubled every time I tried to use it. After some recent exploration with live video streaming, I wanted to pair my desire for a quirky voice assistant with my desire to learn more about audio stream handling in JavaScript. The result is a cheesy AI voice assistant that sometimes feeds me a good dad joke.

Click here to give it a try for yourself!

How it works

Establishing Audio Input. The script begins by creating a media stream from the mic input. It then samples the background amplitude (sound level) for five seconds to determine how much background noise is present, so it can distinguish between background noise and the user talking to it. This eliminates the need for a “wake word” like “Alexa” or “Hey Google”. Once the baseline amplitude has been determined, we multiply it by 1.7 (a 70% buffer) and enforce a minimum of 10, which gives a bit more room to distinguish background noise from the user speaking to the assistant. The resulting value is shown as the “Threshold” on the UI.

const checkAmplitude = () => {
	whatsHappeningDiv.innerHTML = 'Calibrating . . .';
	analyser.getByteFrequencyData(dataArray);
	const average = dataArray.reduce((acc, val) => acc + val, 0) / bufferLength;
	amplitudeSum += average;
	count++;

	if (count >= 100) { // 100 * 50ms (0.05seconds) = 5 seconds
		backgroundAmplitude = Math.max(10, 1.7 * (amplitudeSum / count)); // The average amplitude detected during calibration, i.e. the background noise. Multiplying by 1.7 adds a 70% buffer; the minimum of 10 keeps the threshold above 0 for high-quality mics or very quiet environments.
		clearInterval(timer);
		resolve();
		whatsHappeningDiv.innerHTML = 'I\'m ready to listen.';
	}
};
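To make the threshold arithmetic concrete, here is a minimal standalone sketch of the same calculation. The function name `computeThreshold` and its parameters are hypothetical (they are not part of the script); the 1.7 multiplier and the floor of 10 are taken from the snippet above.

```javascript
// Hypothetical helper: derive the speech threshold from calibration samples.
// `samples` holds the per-tick average levels collected while calibrating.
function computeThreshold(samples, multiplier = 1.7, floor = 10) {
	if (samples.length === 0) {
		return floor; // nothing measured yet, fall back to the floor
	}
	const mean = samples.reduce((acc, val) => acc + val, 0) / samples.length;
	// Scale the background level up by 70% and never go below the floor.
	return Math.max(floor, multiplier * mean);
}

// A silent room: the floor of 10 applies.
console.log(computeThreshold([0, 0, 0]));
// A noisier room averaging 20: the threshold lands at roughly 34.
console.log(computeThreshold([18, 20, 22]));
```

Keeping the calculation in a pure function like this makes it easy to try other multipliers without waiting through a five-second calibration each time.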

Listening for Input. Next, the script listens to the audio stream indefinitely until the amplitude exceeds the baseline threshold defined above. The incoming amplitude is displayed to the user alongside the threshold amplitude. The user can recalibrate the baseline threshold by refreshing the page.

function updateAmplitude(stream) {
	const audioContext = new AudioContext();
	const analyser = audioContext.createAnalyser();
	const microphone = audioContext.createMediaStreamSource(stream);
	microphone.connect(analyser);
	analyser.fftSize = 32; //It's better to keep this low so the response generation is faster (less to analyze and average)
	const bufferLength = analyser.frequencyBinCount;
	const dataArray = new Uint8Array(bufferLength);

	const checkAmplitude = () => {
		analyser.getByteFrequencyData(dataArray);
		//console.log('Frequency Data Array:', dataArray); // Log the array data
		const average = dataArray.reduce((acc, val) => acc + val, 0) / bufferLength;
		const amplitudeDisplay = document.getElementById('amplitudeDisplay');
		amplitudeDisplay.textContent = 'Amplitude: ' + average.toFixed(0) + '. Threshold: ' + Math.round(backgroundAmplitude) + '.';

			if (average > backgroundAmplitude && !isRecording && !isPaused && !isWaitingForResponse && !isAudioPlaying) {
				startRecording();
				console.log('Recording STARTED due to high amplitude.' + 'recording: ' + isRecording + 'pause:' + isPaused + 'waiting: ' + isWaitingForResponse + 'playing: ' + isAudioPlaying);
			} else if (average < backgroundAmplitude && isRecording && !isPaused) {
				if (!lowAmplitudeStartTime) {
					lowAmplitudeStartTime = Date.now();
				} else if (Date.now() - lowAmplitudeStartTime >= 3000) {
					stopRecording(); //If there's more than 3 seconds of quiet, stop recording.
					console.log('Recording STOPPED due to low amplitude.');
				}
			} else {
				lowAmplitudeStartTime = null;
			}
	};

	timer = setInterval(checkAmplitude, 50);
}
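The start/stop logic above is a small state machine, and it can be easier to reason about when pulled out of the timer callback. The sketch below is a simplified, hypothetical version (`createVoiceGate` is not part of the script, and it ignores the paused/waiting/playing flags) that takes the level and the current time as arguments so it can run without a microphone.

```javascript
// Hypothetical sketch: the start/stop decision as a function of
// (level, time), so it can be exercised without a microphone.
function createVoiceGate(threshold, quietMs = 3000) {
	let recording = false;
	let quietSince = null;

	return function step(level, now) {
		if (!recording) {
			if (level > threshold) {
				recording = true;
				quietSince = null;
				return 'start';
			}
			return null;
		}
		if (level < threshold) {
			if (quietSince === null) {
				quietSince = now; // first quiet tick, start the clock
			} else if (now - quietSince >= quietMs) {
				recording = false;
				quietSince = null;
				return 'stop';
			}
		} else {
			quietSince = null; // speech resumed, reset the clock
		}
		return null;
	};
}

const step = createVoiceGate(10);
console.log(step(5, 0));    // below threshold, nothing happens: null
console.log(step(25, 50));  // above threshold: start
console.log(step(4, 100));  // first quiet tick: null
console.log(step(4, 3100)); // 3 seconds of quiet: stop
```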

Recording Audio. Once the amplitude exceeds the baseline threshold, the script begins recording audio. When it detects audio below the threshold for 3 seconds or more, *or* when the recording reaches 10 seconds, the script stops recording, wraps the audio in a blob labelled as MP3, and posts it to the PHP handler.

function startRecording() {
	if (!isRecording && !isWaitingForResponse && !isAudioPlaying) {
		mediaRecorder.start();
		isRecording = true;
		whatsHappeningDiv.innerHTML = 'Listening . . .';
		humanTextRequestDiv.innerHTML = '';
		assistantResponseTextDiv.innerHTML = '';
		recordingTimeout = setTimeout(stopRecording, 10000); //If listening for more than 10 seconds, stop.
		lowAmplitudeStartTime = null;
	}
}
...
function saveRecording(blob) {
	isWaitingForResponse = true;
	const xhr = new XMLHttpRequest();
	xhr.onload = function () {
		isWaitingForResponse = false;
		if (xhr.status === 200) {
			const responseJson = JSON.parse(xhr.responseText);
			humanTextRequestDiv.innerHTML = '<strong>What I heard:</strong> ' + responseJson.human_text_request;
			assistantResponseTextDiv.innerHTML = '<strong>My response:</strong> ' + responseJson.assistant_response_text;
			console.log('Recording saved successfully.');
			audioPlayer.src = responseJson.audio_src;
			audioPlayer.load();
			audioPlayer.play();
			console.log(responseJson);
			console.log('Audio src updated and reloaded: ' + responseJson.audio_src);
		} else {
			whatsHappeningDiv.innerHTML = 'Failed to save recording: ' + xhr.statusText;
			console.error('Failed to save recording:', xhr.statusText);
		}
	};
	xhr.open('POST', 'write_file.php');
	console.log('File sent to POST handler.');
	whatsHappeningDiv.innerHTML = 'Thinking about what you said . . .';
	xhr.send(blob);
}
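One thing worth knowing about the recording step: labelling a `Blob` as `audio/mp3` does not change how `MediaRecorder` encoded the audio, and the container a browser actually produces (commonly WebM, Ogg, or MP4) varies by browser. `MediaRecorder.isTypeSupported` can tell you what is available. Below is a minimal sketch; `pickRecordingType` and the candidate list are hypothetical, and the support check is passed in as a function so the example also runs outside a browser.

```javascript
// Hypothetical helper: return the first MIME type the recorder supports.
// In a browser, pass `type => MediaRecorder.isTypeSupported(type)`.
function pickRecordingType(isSupported, candidates) {
	for (const type of candidates) {
		if (isSupported(type)) {
			return type;
		}
	}
	return ''; // empty string means "no preference"
}

// Illustrative candidate list (an assumption, adjust for your targets).
const candidates = ['audio/webm;codecs=opus', 'audio/ogg;codecs=opus', 'audio/mp4'];

// Stand-in check that pretends only Ogg is available.
const fakeSupport = type => type.startsWith('audio/ogg');
console.log(pickRecordingType(fakeSupport, candidates)); // audio/ogg;codecs=opus
```

In the page, a non-empty result could be passed as `new MediaRecorder(stream, { mimeType })` and reused as the `Blob` type, so the label matches the actual encoding.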

Transcribing Audio to Text. From here, we shift from JavaScript to the PHP handler. The handler uses OpenAI’s Whisper model through the speech-to-text (transcriptions) endpoint to transcribe the recording.

//A function to handle the repeat cURL calls
function callOpenAPI($url, $postData, $headers) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    $response = curl_exec($ch);

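    if (curl_errno($ch)) {
        // Encode the error as JSON so json_decode() below receives a string
        $response = json_encode(['errors' => curl_error($ch)]);
    }

    curl_close($ch);
    $decoded = json_decode($response);
    // Non-JSON bodies (such as the MP3 bytes from the speech endpoint) are returned as-is
    return $decoded === null ? $response : $decoded;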
}

// Step 1: Transcribe the input
$postTranscriptionData = [
    'model' => 'whisper-1',
    'file' => curl_file_create($filename),
    'response_format' => 'verbose_json',
    'language' => 'en',
];
$response = callOpenAPI('https://api.openai.com/v1/audio/transcriptions', $postTranscriptionData, $headers);
$text = isset($response->text) ? $response->text : '';
$json_summary->human_text_request = $text;

Prompting ChatGPT for Response. I wanted some quirky feedback so I tweaked my prompt to return some dad jokes, but otherwise it’s pretty straightforward using the OpenAI Chat Completions endpoint.

// Step 2: Generate response from Chat GPT
$headers = [
    "Content-Type: application/json",
    "Authorization: Bearer $openAIToken",
];
$postGPTData = [
    'model' => 'gpt-4-turbo-preview',
    'messages' => [
        ['role' => 'system', 'content' => "You're a virtual assistant who is interpreting voice requests from users. They like to joke and enjoy sarcasm but appreciate factual and succinct responses. Keep your responses to their requests and questions to less than 100 words unless they ask for something longer. They love a good edgy or sarcastic dad joke if you can incorporate one as part of your response -- but don't make it too corny."],
        ['role' => 'user', 'content' => $text],
    ],
];
$response = callOpenAPI('https://api.openai.com/v1/chat/completions', json_encode($postGPTData), $headers);
$assistant_response = isset($response->choices[0]->message->content) ? $response->choices[0]->message->content : '';
$json_summary->assistant_response_text = $assistant_response;
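For readers more at home in JavaScript, the request body and the response handling in this step can be sketched as follows. The helper names `buildChatPayload` and `extractReply` are hypothetical; the payload shape and the `choices[0].message.content` path mirror the PHP above.

```javascript
// Hypothetical helper mirroring the PHP $postGPTData array.
function buildChatPayload(userText, systemPrompt, model = 'gpt-4-turbo-preview') {
	return {
		model,
		messages: [
			{ role: 'system', content: systemPrompt },
			{ role: 'user', content: userText },
		],
	};
}

// Pull the assistant's text out of a decoded response, defaulting to ''
// when the expected fields are missing (for example on an API error).
function extractReply(response) {
	return response?.choices?.[0]?.message?.content ?? '';
}

const payload = buildChatPayload('What time is it?', 'You are a succinct assistant.');
console.log(JSON.stringify(payload, null, 2));

// A trimmed-down stand-in for a decoded response body.
const fakeResponse = { choices: [{ message: { role: 'assistant', content: 'Time to get a watch.' } }] };
console.log(extractReply(fakeResponse)); // Time to get a watch.
```

Defaulting to an empty string matches the `isset(...) ? ... : ''` fallback in the PHP, which keeps the later steps from failing on an unexpected response.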

Generating Text to Speech. Lastly, we convert the ChatGPT response to speech and save it as an MP3 file as well. The transcript, the response text, and the audio path are then bundled and passed back to the JavaScript as JSON, which updates the page and plays the audio for the user.

// Step 3: Generate speech response
$postTTSData = [
    'model' => 'tts-1',
    'input' => $assistant_response,
    'voice' => 'onyx',
];
$response = callOpenAPI('https://api.openai.com/v1/audio/speech', json_encode($postTTSData), $headers);
file_put_contents("recordings/{$file_id}_response.mp3", $response);
//$json_summary->test = $response;
$json_summary->audio_src = "recordings/{$file_id}_response.mp3";

The full JavaScript:

document.addEventListener('DOMContentLoaded', () => {
	const audioPlayer = document.getElementById('responseAudio');
	const whatsHappeningDiv = document.getElementById('whats_happening');
	const humanTextRequestDiv = document.getElementById('human_text_requestDiv');
	const assistantResponseTextDiv = document.getElementById('assistant_response_textDiv');
	const toggleMute = document.getElementById('toggle_mute');

	let mediaRecorder;
	let audioChunks = [];
	let isRecording = false;
	let isPaused = false;
	let isWaitingForResponse = false;
	let isAudioPlaying = false;
	let timer;
	let recordingTimeout;
	let lowAmplitudeStartTime;
	let backgroundAmplitude = 0;

	function calculateBackgroundAmplitude(stream) {
		return new Promise((resolve, reject) => {
			const audioContext = new AudioContext();
			const analyser = audioContext.createAnalyser();
			const microphone = audioContext.createMediaStreamSource(stream);
			microphone.connect(analyser);

			analyser.fftSize = 32; //It's better to keep this low so the response generation is faster (less to analyze and average)

			const bufferLength = analyser.frequencyBinCount;
			const dataArray = new Uint8Array(bufferLength);

			let amplitudeSum = 0;
			let count = 0;

			const checkAmplitude = () => {
				whatsHappeningDiv.innerHTML = 'Calibrating . . .';
				analyser.getByteFrequencyData(dataArray);
				const average = dataArray.reduce((acc, val) => acc + val, 0) / bufferLength;
				amplitudeSum += average;
				count++;

				if (count >= 100) { // 100 * 50ms (0.05seconds) = 5 seconds
					backgroundAmplitude = Math.max(10, 1.7 * (amplitudeSum / count)); // The average amplitude detected during calibration, i.e. the background noise. Multiplying by 1.7 adds a 70% buffer; the minimum of 10 keeps the threshold above 0 for high-quality mics or very quiet environments.
					clearInterval(timer);
					resolve();
					whatsHappeningDiv.innerHTML = 'I\'m ready to listen.';
				}
			};

			timer = setInterval(checkAmplitude, 50);
		});
	}

	if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
		navigator.mediaDevices.getUserMedia({ audio: true })
			.then(async stream => {
				await calculateBackgroundAmplitude(stream); // Call function to calculate background amplitude
				updateAmplitude(stream);
				mediaRecorder = new MediaRecorder(stream);
				mediaRecorder.ondataavailable = event => {
					audioChunks.push(event.data);
				};
				mediaRecorder.onstop = () => {
					const audioBlob = new Blob(audioChunks, { 'type': 'audio/mp3' });
					saveRecording(audioBlob);
					audioChunks = [];
				};
			})
			.catch(error => {
				console.error('Error accessing microphone:', error);
				whatsHappeningDiv.innerHTML = 'Please allow microphone access and then refresh the page if needed.';
			});
	} else {
		console.error('getUserMedia not supported in this browser.');
		whatsHappeningDiv.innerHTML = 'Your browser isn\'t supported.';
	}

	function startRecording() {
		if (!isRecording && !isWaitingForResponse && !isAudioPlaying) {
			mediaRecorder.start();
			isRecording = true;
			whatsHappeningDiv.innerHTML = 'Listening . . .';
			humanTextRequestDiv.innerHTML = '';
			assistantResponseTextDiv.innerHTML = '';
			recordingTimeout = setTimeout(stopRecording, 10000); //If listening for more than 10 seconds, stop.
			lowAmplitudeStartTime = null;
		}
	}

	function stopRecording() {
		if (isRecording) {
			clearTimeout(recordingTimeout);
			whatsHappeningDiv.innerHTML = 'No longer listening . . .';
			mediaRecorder.stop();
			isRecording = false;
		}
	}

	/*
	I intend to add ability to mute TTS audio playback at some point.
	function muteAudio() {
		audioPlayer.muted = !audioPlayer.muted;
		toggleMute.innerHTML = audioPlayer.muted ? '<a onclick="muteAudio()">Toggle Mute Version 1</a>' : '<a onclick="muteAudio()">Toggle Mute Version 2</a>';
	}
	*/
	
	function handleAudioEvent(event) {
		if (event.type === 'play') {
			isAudioPlaying = true;
			console.log('Audio is playing.');
		} else if (event.type === 'ended') {
			isAudioPlaying = false;
			console.log('Audio has stopped playing.');
			whatsHappeningDiv.innerHTML = 'I\'m ready to listen again.';
		}
	}

	audioPlayer.addEventListener('play', handleAudioEvent);
	audioPlayer.addEventListener('ended', handleAudioEvent);

	function saveRecording(blob) {
		isWaitingForResponse = true;
		const xhr = new XMLHttpRequest();
		xhr.onload = function () {
			isWaitingForResponse = false;
			if (xhr.status === 200) {
				const responseJson = JSON.parse(xhr.responseText);
				humanTextRequestDiv.innerHTML = '<strong>What I heard:</strong> ' + responseJson.human_text_request;
				assistantResponseTextDiv.innerHTML = '<strong>My response:</strong> ' + responseJson.assistant_response_text;
				console.log('Recording saved successfully.');
				audioPlayer.src = responseJson.audio_src;
				audioPlayer.load();
				audioPlayer.play();
				console.log(responseJson);
				console.log('Audio src updated and reloaded: ' + responseJson.audio_src);
			} else {
				whatsHappeningDiv.innerHTML = 'Failed to save recording: ' + xhr.statusText;
				console.error('Failed to save recording:', xhr.statusText);
			}
		};
		xhr.open('POST', 'write_file.php');
		console.log('File sent to POST handler.');
		whatsHappeningDiv.innerHTML = 'Thinking about what you said . . .';
		xhr.send(blob);
	}

	function updateAmplitude(stream) {
		const audioContext = new AudioContext();
		const analyser = audioContext.createAnalyser();
		const microphone = audioContext.createMediaStreamSource(stream);
		microphone.connect(analyser);
		analyser.fftSize = 32; //It's better to keep this low so the response generation is faster (less to analyze and average)
		const bufferLength = analyser.frequencyBinCount;
		const dataArray = new Uint8Array(bufferLength);

		const checkAmplitude = () => {
			analyser.getByteFrequencyData(dataArray);
			//console.log('Frequency Data Array:', dataArray); // Log the array data
			const average = dataArray.reduce((acc, val) => acc + val, 0) / bufferLength;
			const amplitudeDisplay = document.getElementById('amplitudeDisplay');
			amplitudeDisplay.textContent = 'Amplitude: ' + average.toFixed(0) + '. Threshold: ' + Math.round(backgroundAmplitude) + '.';

				if (average > backgroundAmplitude && !isRecording && !isPaused && !isWaitingForResponse && !isAudioPlaying) {
					startRecording();
					console.log('Recording STARTED due to high amplitude.' + 'recording: ' + isRecording + 'pause:' + isPaused + 'waiting: ' + isWaitingForResponse + 'playing: ' + isAudioPlaying);
				} else if (average < backgroundAmplitude && isRecording && !isPaused) {
					if (!lowAmplitudeStartTime) {
						lowAmplitudeStartTime = Date.now();
					} else if (Date.now() - lowAmplitudeStartTime >= 3000) {
						stopRecording(); //If there's more than 3 seconds of quiet, stop recording.
						console.log('Recording STOPPED due to low amplitude.');
					}
				} else {
					lowAmplitudeStartTime = null;
				}
		};

		timer = setInterval(checkAmplitude, 50);
	}

});

The full PHP script:

<?php
//A function to handle the repeat cURL calls
function callOpenAPI($url, $postData, $headers) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    $response = curl_exec($ch);

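    if (curl_errno($ch)) {
        // Encode the error as JSON so json_decode() below receives a string
        $response = json_encode(['errors' => curl_error($ch)]);
    }

    curl_close($ch);
    $decoded = json_decode($response);
    // Non-JSON bodies (such as the MP3 bytes from the speech endpoint) are returned as-is
    return $decoded === null ? $response : $decoded;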
}

//Handling the audio input
$audio_data = file_get_contents('php://input');
$file_id = uniqid();
$filename = "recordings/$file_id.mp3";
file_put_contents($filename, $audio_data);

//Setting the standard header and prepping the response
$json_summary = new stdClass();
$openAIToken="";
$headers = [
    "Authorization: Bearer $openAIToken",
];

////////////////////////////////////////////////////////////
// Step 1: Transcribe the input
$postTranscriptionData = [
    'model' => 'whisper-1',
    'file' => curl_file_create($filename),
    'response_format' => 'verbose_json',
    'language' => 'en',
];
$response = callOpenAPI('https://api.openai.com/v1/audio/transcriptions', $postTranscriptionData, $headers);
$text = isset($response->text) ? $response->text : '';
$json_summary->human_text_request = $text;

////////////////////////////////////////////////////////////
// Step 2: Generate response from Chat GPT
$headers = [
    "Content-Type: application/json",
    "Authorization: Bearer $openAIToken",
];
$postGPTData = [
    'model' => 'gpt-4-turbo-preview',
    'messages' => [
        ['role' => 'system', 'content' => "You're a virtual assistant who is interpreting voice requests from users. They like to joke and enjoy sarcasm but appreciate factual and succinct responses. Keep your responses to their requests and questions to less than 100 words unless they ask for something longer. They love a good edgy or sarcastic dad joke if you can incorporate one as part of your response -- but don't make it too corny."],
        ['role' => 'user', 'content' => $text],
    ],
];
$response = callOpenAPI('https://api.openai.com/v1/chat/completions', json_encode($postGPTData), $headers);
$assistant_response = isset($response->choices[0]->message->content) ? $response->choices[0]->message->content : '';
$json_summary->assistant_response_text = $assistant_response;

////////////////////////////////////////////////////////////
// Step 3: Generate speech response
$postTTSData = [
    'model' => 'tts-1',
    'input' => $assistant_response,
    'voice' => 'onyx',
];
$response = callOpenAPI('https://api.openai.com/v1/audio/speech', json_encode($postTTSData), $headers);
file_put_contents("recordings/{$file_id}_response.mp3", $response);
//$json_summary->test = $response;
$json_summary->audio_src = "recordings/{$file_id}_response.mp3";

////////////////////////////////////////////////////////////
// Step 4: Return json
echo json_encode($json_summary);
?>

If you want to use the same CSS styling I have for my demo, here’s the full page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>The Better Alexa</title>
    <style>
        body {
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh;
            margin: 0;
            background: linear-gradient(to bottom, #1f0036, #000000);
        }

        .centered-content {
            text-align: center;
            color: white;
            font-family: Helvetica, sans-serif;
        }

        .whats_happening {
            padding-bottom: 50px;
            font-size: 30px;
            font-weight: bold;

            --background: linear-gradient(to right, #553c9a 20%, #FFCC00 40%, #ee4b2b 60%, #ee4b2b 80%);
            background: linear-gradient(to right, #FEAC5E 20%, #C779D0 40%, #4BC0C8 60%, #FEAC5E 80%);
            background-size: 200% auto;

            color: #000;
            background-clip: text;
            -webkit-background-clip: text;
            text-fill-color: transparent;
            -webkit-text-fill-color: transparent;

            animation: shine 20s linear infinite;
        }

        @keyframes shine {
            to {
                background-position: 200% center;
            }
        }

        .toggle_mute {
            color: #ffffff;
            --font-size:80px;
            padding-bottom: 20px;
        }
        .human_text_requestDiv {
            color: #777;
            padding-bottom: 20px;
        }
        .assistant_response_textDiv {
            color: #ffffff;
        }
        .amplitudeDisplay {
            color: #555;
            font-size: 10px;
            padding-top: 50px;
        }

    </style>
</head>
<body>
    <div class="centered-content">
        <audio src="#" controls id="responseAudio" name="responseAudio" style="display: none;"></audio>
        <div id="whats_happening" class="whats_happening">Checking mic . . .</div>
        <div id="toggle_mute" class="toggle_mute"></div>
        <div id="human_text_requestDiv" class="human_text_requestDiv"></div>
        <div id="assistant_response_textDiv" class="assistant_response_textDiv"></div>
        <div id="amplitudeDisplay" class="amplitudeDisplay"></div>
    </div>
</body>
<script>
    document.addEventListener('DOMContentLoaded', () => {
        const audioPlayer = document.getElementById('responseAudio');
        const whatsHappeningDiv = document.getElementById('whats_happening');
        const humanTextRequestDiv = document.getElementById('human_text_requestDiv');
        const assistantResponseTextDiv = document.getElementById('assistant_response_textDiv');
        const toggleMute = document.getElementById('toggle_mute');

        let mediaRecorder;
        let audioChunks = [];
        let isRecording = false;
        let isPaused = false;
        let isWaitingForResponse = false;
        let isAudioPlaying = false;
        let timer;
        let recordingTimeout;
        let lowAmplitudeStartTime;
        let backgroundAmplitude = 0;

        function calculateBackgroundAmplitude(stream) {
            return new Promise((resolve, reject) => {
                const audioContext = new AudioContext();
                const analyser = audioContext.createAnalyser();
                const microphone = audioContext.createMediaStreamSource(stream);
                microphone.connect(analyser);

                analyser.fftSize = 32; //It's better to keep this low so the response generation is faster (less to analyze and average)

                const bufferLength = analyser.frequencyBinCount;
                const dataArray = new Uint8Array(bufferLength);

                let amplitudeSum = 0;
                let count = 0;

                const checkAmplitude = () => {
					whatsHappeningDiv.innerHTML = 'Calibrating . . .';
                    analyser.getByteFrequencyData(dataArray);
                    const average = dataArray.reduce((acc, val) => acc + val, 0) / bufferLength;
                    amplitudeSum += average;
                    count++;

                    if (count >= 100) { // 100 * 50ms (0.05seconds) = 5 seconds
                        backgroundAmplitude = Math.max(10, 1.7 * (amplitudeSum / count)); // The average amplitude detected during calibration, i.e. the background noise. Multiplying by 1.7 adds a 70% buffer; the minimum of 10 keeps the threshold above 0 for high-quality mics or very quiet environments.
                        clearInterval(timer);
                        resolve();
						whatsHappeningDiv.innerHTML = 'I\'m ready to listen.';
                    }
                };

                timer = setInterval(checkAmplitude, 50);
            });
        }

        if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
            navigator.mediaDevices.getUserMedia({ audio: true })
                .then(async stream => {
                    await calculateBackgroundAmplitude(stream); // Call function to calculate background amplitude
                    updateAmplitude(stream);
                    mediaRecorder = new MediaRecorder(stream);
                    mediaRecorder.ondataavailable = event => {
                        audioChunks.push(event.data);
                    };
                    mediaRecorder.onstop = () => {
                        const audioBlob = new Blob(audioChunks, { 'type': 'audio/mp3' });
                        saveRecording(audioBlob);
                        audioChunks = [];
                    };
                })
                .catch(error => {
                    console.error('Error accessing microphone:', error);
                    whatsHappeningDiv.innerHTML = 'Please allow microphone access and then refresh the page if needed.';
                });
        } else {
            console.error('getUserMedia not supported in this browser.');
            whatsHappeningDiv.innerHTML = 'Your browser isn\'t supported.';
        }

        function startRecording() {
            if (!isRecording && !isWaitingForResponse && !isAudioPlaying) {
                mediaRecorder.start();
                isRecording = true;
                whatsHappeningDiv.innerHTML = 'Listening . . .';
                humanTextRequestDiv.innerHTML = '';
                assistantResponseTextDiv.innerHTML = '';
                recordingTimeout = setTimeout(stopRecording, 10000); //If listening for more than 10 seconds, stop.
                lowAmplitudeStartTime = null;
            }
        }

        function stopRecording() {
            if (isRecording) {
                clearTimeout(recordingTimeout);
                whatsHappeningDiv.innerHTML = 'No longer listening . . .';
                mediaRecorder.stop();
                isRecording = false;
            }
        }

        /*
		I intend to add ability to mute TTS audio playback at some point.
		function muteAudio() {
            audioPlayer.muted = !audioPlayer.muted;
            toggleMute.innerHTML = audioPlayer.muted ? '<a onclick="muteAudio()">Toggle Mute Version 1</a>' : '<a onclick="muteAudio()">Toggle Mute Version 2</a>';
        }
		*/
		
        function handleAudioEvent(event) {
            if (event.type === 'play') {
                isAudioPlaying = true;
                console.log('Audio is playing.');
            } else if (event.type === 'ended') {
                isAudioPlaying = false;
                console.log('Audio has stopped playing.');
                whatsHappeningDiv.innerHTML = 'I\'m ready to listen again.';
            }
        }

        audioPlayer.addEventListener('play', handleAudioEvent);
        audioPlayer.addEventListener('ended', handleAudioEvent);

        function saveRecording(blob) {
            isWaitingForResponse = true;
            const xhr = new XMLHttpRequest();
            xhr.onload = function () {
                isWaitingForResponse = false;
                if (xhr.status === 200) {
                    const responseJson = JSON.parse(xhr.responseText);
                    humanTextRequestDiv.innerHTML = '<strong>What I heard:</strong> ' + responseJson.human_text_request;
                    assistantResponseTextDiv.innerHTML = '<strong>My response:</strong> ' + responseJson.assistant_response_text;
                    console.log('Recording saved successfully.');
                    audioPlayer.src = responseJson.audio_src;
                    audioPlayer.load();
                    audioPlayer.play();
                    console.log(responseJson);
                    console.log('Audio src updated and reloaded: ' + responseJson.audio_src);
                } else {
                    whatsHappeningDiv.innerHTML = 'Failed to save recording: ' + xhr.statusText;
                    console.error('Failed to save recording:', xhr.statusText);
                }
            };
            xhr.open('POST', 'write_file.php');
            console.log('File sent to POST handler.');
            whatsHappeningDiv.innerHTML = 'Thinking about what you said . . .';
            xhr.send(blob);
        }

        function updateAmplitude(stream) {
            const audioContext = new AudioContext();
            const analyser = audioContext.createAnalyser();
            const microphone = audioContext.createMediaStreamSource(stream);
            microphone.connect(analyser);
            analyser.fftSize = 32; //It's better to keep this low so the response generation is faster (less to analyze and average)
            const bufferLength = analyser.frequencyBinCount;
            const dataArray = new Uint8Array(bufferLength);

            const checkAmplitude = () => {
                analyser.getByteFrequencyData(dataArray);
                //console.log('Frequency Data Array:', dataArray); // Log the array data
                const average = dataArray.reduce((acc, val) => acc + val, 0) / bufferLength;
                const amplitudeDisplay = document.getElementById('amplitudeDisplay');
                amplitudeDisplay.textContent = 'Amplitude: ' + average.toFixed(0) + '. Threshold: ' + Math.round(backgroundAmplitude) + '.';

                    if (average > backgroundAmplitude && !isRecording && !isPaused && !isWaitingForResponse && !isAudioPlaying) {
                        startRecording();
                        console.log('Recording STARTED due to high amplitude.' + 'recording: ' + isRecording + 'pause:' + isPaused + 'waiting: ' + isWaitingForResponse + 'playing: ' + isAudioPlaying);
                    } else if (average < backgroundAmplitude && isRecording && !isPaused) {
                        if (!lowAmplitudeStartTime) {
                            lowAmplitudeStartTime = Date.now();
                        } else if (Date.now() - lowAmplitudeStartTime >= 3000) {
                            stopRecording(); //If there's more than 3 seconds of quiet, stop recording.
                            console.log('Recording STOPPED due to low amplitude.');
                        }
                    } else {
                        lowAmplitudeStartTime = null;
                    }
            };

            timer = setInterval(checkAmplitude, 50);
        }

    });
</script>
</html>

Comparing AWS Textract and AWS Rekognition to extract text from images using PHP

A few months ago I tried using AWS Rekognition to detect text in images. The results were okay for casual use cases but overall the quality was pretty poor (primarily because Rekognition isn’t intended to be used as an OCR product).

A few days ago (May 29), AWS announced the general availability of Textract, an actual OCR product. Out of curiosity, I ran the same image I had run through Rekognition through Textract to compare the results. While Textract isn’t 100% accurate, it’s a huge improvement over Rekognition (as should be expected, since OCR is what it’s intended for).

Below is a side-by-side comparison of the results from the two services:

| Textract DetectedText | Textract Confidence | Rekognition DetectedText | Rekognition Confidence |
|---|---|---|---|
| DANS PUMP AND GO | 98% | DANS PLMP AND GO | 98% |
| 15238 MAIN ST | 99% | 15238 MAIN ST | 100% |
| NEWTOWN | 100% | NEWTOWN | 100% |
| CAROLINA 93812 | 96% | CAROLINA 93812 | 97% |
| ST-TX: 11089984 | 99% | ST-TX: 11089987 (555) 708-2224 | 98% |
| (555) 708-2224 | 100% | | |
| 2014-02-25 IW424534:9338300 07:09 | 99% | 2014-02-25 TW420534: 34:9338300 07:09 | 94% |
| TERMINAL: 509338300 OPER: A | 89% | TERMINAL: 509338300 OPER: A | 99% |
| Fuel | 99% | Fuel (G) ($/G) | 99% |
| (G) ($/G) | 98% | | |
| ($) | 95% | ($) | 99% |
| Pump 9 | 80% | Pump 9 | 97% |
| Premium | 93% | Prem ium 40.000 1.345 53.80* | 98% |
| 40.000 1.345 53.80* | 98% | | |
| Total Owed | 97% | Total Owed 53.80 | 98% |
| 53.8 | 100% | | |
| TOTAL PAID | 100% | TOTAL PAID | 100% |
| CREDIT CARD | 100% | CREDIT CARD 53.80 | 99% |
| 53.8 | 99% | | |
| VISA | 100% | VISA *kkkkkkkkkkk4597 | 98% |
| $4,444,444,440,597 | 58% | | |
| INV. 972821 AUTH. 545633 | 99% | INV. 972821 AUTH. 545633 | 99% |
| Purchase | 98% | Purchase | 100% |
| S 0010010010 00 127 | 89% | S 0010010010 | 98% |
00 APPROVED – THANK YOU 92%
94%
IMPORTANT 100%
95%
Retain This Copy For Your Records 100%

As always, here’s my source used for this test:

<?php
require '/home/bitnami/vendor/autoload.php'; 

use Aws\Textract\TextractClient;
use Aws\Exception\AwsException;

$client = new TextractClient([
'version' => 'latest',
'region' => 'us-west-2',
'credentials' => [
  'key' => '', //IAM user key
  'secret' => '', //IAM user secret
]]);

try{
  $result = $client->detectDocumentText([
    'Document' => [
      'S3Object' => [
        'Bucket' => 'fkrekognition',
        'Name' => 'receipt_preview.jpg'
      ],
    ],
  ]);
} catch (AwsException $e){
  echo "<pre>Error: $e</pre>";
  exit; //stop here; $result is undefined if the call failed
}

echo "<h1>Textract</h1>";
echo "<img src=https://s3-us-west-2.amazonaws.com/fkrekognition/receipt_preview.jpg><br />";
$i=0;
echo "<table border=1 cellspacing=0><tr><td>#</td><td>BlockType</td><td>Text</td><td>Confidence</td></tr>";
foreach ($result['Blocks'] as $phrase) {
  if($phrase['BlockType']=="LINE"){
    $i++;
    echo "<tr><td>$i</td><td>".$phrase['BlockType']."</td><td>".$phrase['Text']."</td><td>".round($phrase['Confidence'])."%</td></tr>";
  }
}
echo "</table>";
  
echo "<h1>Raw Output</h1><pre>";
print_r($result);
echo "</pre>";
?>
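
For anyone working in JavaScript instead, the LINE filter in the loop above can be sketched as a small pure function. This is only an illustration: `extractLines` is a hypothetical helper name, and the sample object is hand-written to mimic just the `Blocks`, `BlockType`, `Text`, and `Confidence` fields the PHP script reads.

```javascript
// Hypothetical helper: keep only LINE blocks from a Textract-style response
// and round each confidence score, mirroring the PHP loop above.
function extractLines(response) {
  const blocks = (response && response.Blocks) || [];
  return blocks
    .filter((block) => block.BlockType === 'LINE')
    .map((block, index) => ({
      number: index + 1,
      text: block.Text,
      confidence: Math.round(block.Confidence),
    }));
}

// Hand-written sample shaped like the fields used above (not real API output).
const sampleResponse = {
  Blocks: [
    { BlockType: 'PAGE' },
    { BlockType: 'LINE', Text: 'TOTAL PAID', Confidence: 99.6 },
    { BlockType: 'WORD', Text: 'TOTAL', Confidence: 99.7 },
    { BlockType: 'LINE', Text: 'VISA', Confidence: 97.2 },
  ],
};

console.log(extractLines(sampleResponse));
```

PAGE and WORD blocks are skipped, so the row numbers count lines only, just like the `$i` counter in the PHP version.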

 

Reducing Amazon Connect Telephony Costs by 46% while Improving Caller Experience

The “Call Me” concept isn’t new but it’s low-hanging fruit that many don’t take advantage of. Using Amazon Connect, we’ll create a simple UI to improve the caller experience while saving 46% on our telephony costs (assuming we’re making US-destined calls with a US East/West instance) by diverting inbound toll-free calls to outbound DID calls. This is an extension of the “Placing Outbound Calls Using Amazon Connect API” post I did a couple months ago. That post should be your starting point if the code examples below aren’t lining up for you.

The Benefits

The result of a “Call Me” UI is a streamlined caller experience whereby the point of conversion (whether that’s a sale, lead, support request, or other) is merged with a “Call Me” experience that allows you to control the population they speak to and how they get to that population. Beyond the caller experience side (where they benefit from not having to repeat their issue multiple times, not losing their self-service history once they make contact, etc.), there’s a financial benefit (at least with Amazon Connect). Because the Call Me experience uses outbound DID dialing, the cost per minute is ~46% lower than inbound toll-free dialing:

Example based on 10,000 inbound TFN dials per day. This assumes US-bound dialing with US east/west instance types.
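
The percentage itself is simple arithmetic on per-minute rates. Here is a minimal sketch of that math. The two rates and the average call length below are placeholders made up so the numbers are easy to follow; they are not Amazon Connect’s published prices, so plug in the current rates for your region.

```javascript
// Percentage saved by moving a minute of talk time from one rate to another.
function savingsPercent(inboundRatePerMinute, outboundRatePerMinute) {
  if (inboundRatePerMinute <= 0) {
    throw new RangeError('inboundRatePerMinute must be greater than 0');
  }
  return ((inboundRatePerMinute - outboundRatePerMinute) / inboundRatePerMinute) * 100;
}

// Daily cost for a given call volume, average call length, and rate.
function dailyCost(callsPerDay, averageMinutesPerCall, ratePerMinute) {
  return callsPerDay * averageMinutesPerCall * ratePerMinute;
}

// PLACEHOLDER values for illustration only; not real pricing.
const inboundTollFreeRate = 0.05;
const outboundDidRate = 0.027;
const callsPerDay = 10000; // the volume used in the example above
const averageMinutesPerCall = 5; // assumed

console.log(savingsPercent(inboundTollFreeRate, outboundDidRate).toFixed(1) + '% saved per minute');
console.log('Inbound daily cost: ' + dailyCost(callsPerDay, averageMinutesPerCall, inboundTollFreeRate).toFixed(2));
console.log('Outbound daily cost: ' + dailyCost(callsPerDay, averageMinutesPerCall, outboundDidRate).toFixed(2));
```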

Beyond the immediate telephony cost savings and user experience improvement, there’s also the added benefit of transfer reduction and better staff tiering as you know the customer-selected issue before they call (and can route to the correct population/tier based on that issue selection). Additionally, there’s likely a reduction in caller identification, authentication, etc. It’s a win-win that takes very little effort to implement.

What we’re doing

  1. Creating a simple form to allow the customer to enter their phone number and also pass some basic contextual attributes that we’ll present to the agent.
  2. Setting up a contact flow to deliver a custom greeting based on contact attributes we pass via the outbound call.
  3. Placing an outbound call to the customer.
  4. Surfacing the contact attributes to the agent via the Streams API (assumes you already have this installed).

You can download the full demo here.

Caller Experience Demonstration:

Agent Experience Demonstration:

Step 1: Creating the “Call Me” UI/Form

To make this look a bit spiffier than just generic forms, I’ll use the Cerulean Bootstrap theme.
We’ll include hidden fields to mimic the following attributes:

  • Last page the user was on before trying to contact
  • The authentication status of the user (if they’re authenticated, no need to go through this step on the call)
  • The VIP status of the user (are they searching for expensive items, a very loyal customer, etc?)

And we’ll surface the following fields to the user to verify accuracy and collect additional information up front:

  • Their phone number/the number we should call
  • The issue they’re calling about
<form id="callMeForm" method="POST" action="handler.callme.php">
<input type="hidden" value="http://example.com/help/lostpassword" id="lastPage" name="lastPage">
<input type="hidden" value="false" id="authenticatedStatus" name="authenticatedStatus">
<input type="hidden" value="true" id="vipStatus" name="vipStatus">
<fieldset>
  <div class="form-group">
  <label for="issueSelected">What can we help you with?</label>
  <select class="form-control" id="issueSelected" name="issueSelected">
    <option>Pre-Purchase</option>
    <option>Purchase Experience</option>
    <option>Post-Purchase</option>
    <option>Other</option>
  </select>
  </div>
  <div class="form-group">
  <label class="col-form-label" for="phoneNumber">We'll call you at:</label>
  <input type="text" class="form-control" placeholder="+15555555555" id="phoneNumber" name="phoneNumber">
  </div>
  <button type="submit" class="btn btn-primary">Call Me Now</button>
</fieldset>
</form>

Step 2: Setting up the contact flow

Starting with a blank contact flow, we’ll set it to:

  1. Detect the value of the “VIP” attribute we set
    1. If VIP=true, we’ll play a special prompt
    2. If VIP is anything other than true, we’ll play a standard, more generic prompt.
  2. Locate the caller’s name (via stored attribute) and pass it to the contact flow to greet the caller by name.
  3. After greeting, terminate the contact.

The full contact flow:

In order to play the caller’s name as part of the prompt, we’ll reference the user-defined attribute we set in step 3 (see referenced code example zip file): “Hello VIP caller $.Attributes.CustomerFirstName”.
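
The branching in the flow boils down to a few lines of logic. The sketch below only models that logic; it is not something Connect runs. `chooseGreeting` is a hypothetical function, the wording of the standard greeting is assumed, and the attribute names match the ones passed on the outbound call.

```javascript
// Hypothetical model of the contact flow branch: VIP callers get the special
// prompt, everyone else gets a generic one. Contact attributes are strings,
// so "true" is compared as text.
function chooseGreeting(attributes) {
  const firstName = attributes.CustomerFirstName || '';
  if (attributes.VIP === 'true') {
    return ('Hello VIP caller ' + firstName).trim();
  }
  // Assumed wording for the standard prompt.
  return ('Hello ' + firstName).trim();
}

console.log(chooseGreeting({ VIP: 'true', CustomerFirstName: 'Kevin' }));
console.log(chooseGreeting({ VIP: 'false', CustomerFirstName: 'Kevin' }));
```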

Step 3: Placing the Outbound Call to the Caller

Using the snippet from the “Placing Outbound Calls Using Amazon Connect API” post, we’ll simply add an associative array of key/value pairs, which will be available for reference within the contact flow (i.e. greeting by name based on VIP status) and also stored in the contact trace record:

//Include AWS SDK
require '/home/bitnami/vendor/autoload.php'; 

//New Connect client
$client = new Aws\Connect\ConnectClient([
'region'  => 'us-west-2', //the region of your Connect instance
'version' => 'latest',
'credentials' => [
  'key' => '', //IAM user key
  'secret' => '', //IAM user secret
]
]);

//Capture form fields - should do some additional sanitization and validation here but this will suffice as a proof of concept
$lastPage=$_POST['lastPage'];
$authenticatedStatus=$_POST['authenticatedStatus'];
$vipStatus=$_POST['vipStatus'];
$issueSelected=$_POST['issueSelected'];
$phoneNumber=$_POST['phoneNumber'];
$customerFirstName="Kevin";

//Place the call
$result = $client->startOutboundVoiceContact([
  'Attributes' => array("LastPageViewed"=>"$lastPage", 
          "Authenticated"=>"$authenticatedStatus", 
          "VIP"=>"$vipStatus", 
          "IssueSelected"=>"$issueSelected",
          "CustomerFirstName"=>"$customerFirstName"),
  'ContactFlowId' => '', // REQUIRED
  'DestinationPhoneNumber' => "$phoneNumber", // REQUIRED
  'InstanceId' => '', // REQUIRED
  'QueueId' => '', // Use either QueueId OR SourcePhoneNumber. SourcePhoneNumber must be claimed in your Connect instance.
  //'SourcePhoneNumber' => '', // Use either QueueId OR SourcePhoneNumber. SourcePhoneNumber must be claimed in your Connect instance.
]);
  
echo "<pre>";
print_r($result);
echo "</pre>";
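
The comment in the snippet admits the form fields deserve more validation than a proof of concept gives them. Here is one way that check could look, sketched in JavaScript. Everything in it is an assumption for illustration: `validateCallMeForm` and `ALLOWED_ISSUES` are hypothetical names, and the rules (whitelisting the dropdown values, forcing the flags to "true"/"false", keeping only http(s) URLs) are just reasonable defaults.

```javascript
// Hypothetical validator for the "Call Me" form fields before they are passed
// along as contact attributes.
const ALLOWED_ISSUES = ['Pre-Purchase', 'Purchase Experience', 'Post-Purchase', 'Other'];

function validateCallMeForm(fields) {
  const errors = [];

  // E.164: a plus sign, then 2 to 15 digits, the first of which is not 0.
  const phoneNumber = String(fields.phoneNumber || '').trim();
  if (!/^\+[1-9]\d{1,14}$/.test(phoneNumber)) {
    errors.push('phoneNumber must be in E.164 format, e.g. +15555555555');
  }

  // Only accept values that exist in the form's dropdown.
  const issueSelected = String(fields.issueSelected || '');
  if (!ALLOWED_ISSUES.includes(issueSelected)) {
    errors.push('issueSelected is not one of the form options');
  }

  // Hidden flags should only ever be the strings "true" or "false".
  const asFlag = (value) => (value === 'true' ? 'true' : 'false');

  // Only keep the last page if it parses as an http(s) URL.
  let lastPage = '';
  try {
    const url = new URL(String(fields.lastPage || ''));
    if (url.protocol === 'http:' || url.protocol === 'https:') {
      lastPage = url.href;
    }
  } catch (err) {
    lastPage = '';
  }

  return {
    ok: errors.length === 0,
    errors,
    phoneNumber,
    attributes: {
      LastPageViewed: lastPage,
      Authenticated: asFlag(fields.authenticatedStatus),
      VIP: asFlag(fields.vipStatus),
      IssueSelected: issueSelected,
    },
  };
}
```

The same rules translate directly to PHP if you would rather keep the check inside handler.callme.php.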

Step 4: Displaying the Connect contact attributes for the agent

For this, we’ll use the Streams API (assuming you already have this set up and in place). Using the same styling from the Caller side demo, we’ll create an agent UI. I’ve plugged in the various API references below so I believe it’s pretty straightforward to follow:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Call Me Demo</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <link rel="stylesheet" href="./style.css">
    <link rel="stylesheet" href="./_variable.css">
    <link rel="stylesheet" href="./_bootswatch.css">
  <script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
  <style type="text/css">
  .ccp {
    width: 350px; 
    height: 465px; 
    padding: 0px;
  }
  
  .ccp iframe {
    border: none;
  }
  </style>
</head>   
<body>
    <div class="navbar navbar-expand-lg fixed-top navbar-dark bg-dark">
      <div class="container">
        <a href="../" class="navbar-brand">Call Me Demo</a>
        <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarResponsive" aria-controls="navbarResponsive" aria-expanded="false" aria-label="Toggle navigation">
          <span class="navbar-toggler-icon"></span>
        </button>
        <div class="collapse navbar-collapse" id="navbarResponsive">
      
      
          <ul class="nav navbar-nav ml-auto">
            <li class="nav-item">
              <a class="nav-link" href="#" target="_blank">user</a>
            </li>
          </ul>
        </div>
      </div>
    </div>
  
  
    <div class="container">

      <div class="bs-docs-section">
        <div class="row">
          <div class="col-lg-12">
            <div class="page-header">
              <h1 id="forms">Incoming Call</h1>
            </div>
          </div>
        </div>
    
    <div class="row">
      <div class="col-lg-4">
        <div id="ccpDiv" class="ccp" >
        <!-- your contact control panel will display here -->
        </div>
      </div>
      
      <div class="col-lg-8">
        <div id="connectAttributesDiv">
        <!-- the contact attributes will display here -->
        </div>
      </div>
    
    </div>
      </div>
    </div>

<script src="../amazon-connect-streams/amazon-connect-v1.2.0-34-ga.js"></script> <!-- replace this with your streams JS file -->
<script>
window.contactControlPanel = window.contactControlPanel || {};

var ccpUrl = "https://<YOUR Connect CCP URL HERE>.awsapps.com/connect/ccp#/"; //Plug in your Connect CCP address here

//Contact Control Panel https://github.com/aws/amazon-connect-streams/blob/master/Documentation.md#connectcoreinitccp
connect.core.initCCP(document.getElementById("ccpDiv"), {
  ccpUrl: ccpUrl,        
  loginPopup: true,         
  softphone: {
    allowFramedSoftphone: true
  }
});


//Subscribe a method to be called on each incoming contact
connect.contact(eventListener); //https://github.com/aws/amazon-connect-streams/blob/master/Documentation.md#connectcontact

//The function to call on each incoming contact
function eventListener(contact) {
  window.contactControlPanel.contact = contact;
  updateAttributeElement(contact.getAttributes()); //https://github.com/aws/amazon-connect-streams/blob/master/Documentation.md#contactgetattributes
  contact.onEnded(clearAttributeElement); //https://github.com/aws/amazon-connect-streams/blob/master/Documentation.md#contactonended
}

//Loops through attributes object and prints out the corresponding key:value
function updateAttributeElement(msg){
  var connectAttributesDiv = document.getElementById("connectAttributesDiv");
  for (var key in msg) {
    if (msg.hasOwnProperty(key)) {
      var label = document.createElement('strong');
      label.textContent = key;
      connectAttributesDiv.appendChild(label);
      //Text nodes keep caller-supplied values from being parsed as HTML
      connectAttributesDiv.appendChild(document.createTextNode(': ' + msg[key]['value']));
      connectAttributesDiv.appendChild(document.createElement('br'));
    }
  }
}

//Clears the previous contact attributes onEnded (disconnect) of contact
function clearAttributeElement(){
  document.getElementById("connectAttributesDiv").innerHTML = "";

}
</script>
</body>
 
</html>

The end result is a “Call Me” framework that can be used to capture and pass session attributes through to the contact experience:

Lambda Data Dips within Amazon Connect Contact Flows

I’ve read many different guides on this but none seemed to provide end-to-end guidance or were cluttered with other noise unrelated to Lambda or Connect.

The power of Lambda function inclusion in the contact flow is immense – perform security functions, look up/validate/store data, look up customer data for CRM integration, etc. While learning this, I created a simple Lambda function that multiplies the caller’s input by 10, stores both numbers, and returns the output to the caller – I’ll dive into querying DynamoDB in the near future.

What we’re doing

Using Amazon Connect and AWS Lambda, we’ll create a phone number which accepts a user’s DTMF input, multiplies it by 10, saves the results as contact attributes, and regurgitates those numbers to the caller. The final experience can be had by calling +1 571-327-3066 (select option 2).

Step 1-Create your Lambda Function

Visit the Lambda console and select “Create Function”. For this example, I’m going to use the following details:
Name: “FKLambdaDataDip”
Runtime: Node.js 8.10
Role: Create a custom role (and use the default values on the subsequent popup)

Step 2-Creating the Resource Policy

Now that the Lambda function exists, copy the ARN from the top right of the page:

Using the AWS CLI, we’ll create a resource policy for the function & Connect:

aws lambda add-permission --function-name function:<YOUR_LAMBDA_FUNCTION_NAME> --statement-id 1 --principal connect.amazonaws.com --action lambda:InvokeFunction --source-account <YOUR_AWS_ACCOUNT_NUMBER> --source-arn <YOUR_AWS_CONNECT_INSTANCE_ARN>

You can find your Connect ARN in the admin console and your AWS account ID on your AWS account page.

Step 3-Granting Connect permission to invoke your Lambda function

From the Connect admin page, select “Contact Flows” from the left menu. Under the AWS Lambda heading, select your function from the drop down and click “+Add Lambda Function”.

You should now be able to successfully invoke your Lambda function via your Amazon Connect contact flow.

Step 4-Creating the Amazon Connect Contact Flow

I’m going to outline my high-level flow before finishing my actual Lambda function. We’ll come back and plug in all the variable names and details. Here’s the visual of my flow:

Step 5-Finalizing the AWS Lambda Function

As noted, our function will simply multiply the number entered by 10 and return it.

exports.handler = function(event, context, callback) {

var receivedCallerSubmittedNumber = event['Details']['Parameters']['callerSubmittedNumber'];
var calculated = receivedCallerSubmittedNumber * 10;

var resultMap = {
    sentLambdaCalculatedNumber:calculated
}

callback(null, resultMap);
}

Note that we’re getting to the “callerSubmittedNumber” variable via event['Details']['Parameters']['callerSubmittedNumber']. This is because the JSON published from Connect to Lambda has this structure (where our Connect attributes are passed in the Parameters section):

{
    "Details": {
        "ContactData": {
            "Attributes": {},
            "Channel": "VOICE",
            "ContactId": "4a573372-1f28-4e26-b97b-XXXXXXXXXXX",
            "CustomerEndpoint": {
                "Address": "+1234567890",
                "Type": "TELEPHONE_NUMBER"
            },
            "InitialContactId": "4a573372-1f28-4e26-b97b-XXXXXXXXXXX",
            "InitiationMethod": "INBOUND | OUTBOUND | TRANSFER | CALLBACK",
            "InstanceARN": "arn:aws:connect:aws-region:1234567890:instance/c8c0e68d-2200-4265-82c0-XXXXXXXXXX",
            "PreviousContactId": "4a573372-1f28-4e26-b97b-XXXXXXXXXX",
            "Queue": "QueueName",
            "SystemEndpoint": {
                "Address": "+1234567890",
                "Type": "TELEPHONE_NUMBER"
            }
        },
        "Parameters": {
            "sentAttributeKey": "sentAttributeValue"
        }
    },
    "Name": "ContactFlowEvent"
}
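
Because the event shape is known, the function can be exercised locally before it is wired into a contact flow. The sketch below is a copy of the handler with basic input checking added (the checking is an addition for this example, not part of the function above) plus a trimmed-down, made-up event in the same shape.

```javascript
// Same idea as the Lambda function above, with a guard for missing or
// non-numeric input. Passing an Error to the callback makes the invocation fail.
function handler(event, context, callback) {
  const parameters = (event && event.Details && event.Details.Parameters) || {};
  const raw = parameters.callerSubmittedNumber;
  const submitted = Number(raw);

  if (raw === undefined || raw === '' || !Number.isFinite(submitted)) {
    callback(new Error('callerSubmittedNumber is missing or not a number'));
    return;
  }

  callback(null, { sentLambdaCalculatedNumber: submitted * 10 });
}

// Trimmed-down event with the same shape Connect sends (values are made up).
const sampleEvent = {
  Details: {
    ContactData: { Channel: 'VOICE', Attributes: {} },
    Parameters: { callerSubmittedNumber: '7' },
  },
  Name: 'ContactFlowEvent',
};

handler(sampleEvent, {}, (err, result) => {
  console.log(err ? 'Error: ' + err.message : result);
});
```

Note that the DTMF input arrives as a string, which is why it is converted with `Number()` before the multiplication.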

Step 6-Finalizing the Amazon Connect Contact Flow

Back in the Contact Flow Designer, we’ll edit the “Invoke AWS Lambda Function” module to plug in our Function ARN (again, copied from the Lambda function’s page). This is the same function ARN that you set up the policy for in step 2.

In the next “Set contact attributes” module, we’ll set the attribute “Destination Key” to “lambdaCalculatedNumber”, the type to “External”, and the “Attribute” to “sentLambdaCalculatedNumber”.
Lastly, we’ll edit the last prompt of the flow to play back the number by configuring it to “Text to speech”, “Enter Dynamically”, “External” as the type, and “sentLambdaCalculatedNumber” as the Attribute.

Save and publish your contact flow.

As the variable and key assignments can be a bit confusing and as the documentation provided by Connect on this is of poor quality, I’ve recorded what I set each of mine to in this demo. Connect’s own documentation actually has some typos in it that will result in errors from Lambda (at the time of writing this, at least).

Step 7-Testing

Once you associate your contact flow with a number, you can now test. Beyond dialing and hearing the response, we can see it recorded alongside the contact attributes:

I’ve set up a test number for this demo: +1 571-327-3066 (select option 2). Dial to experience the end result.

Placing Outbound Calls Using Amazon Connect API & PHP

Amazon Connect is the AWS answer to costly contact center telephony platforms. There are no upfront costs and overall usage is EXTREMELY cheap when compared to legacy telephony platforms – you essentially just pay per minute.

I wanted to play with this a bit so I set up an instance and created a simple script to place outbound calls which will allow the call recipient to choose from hearing Abbott and Costello’s famous “Who’s on first?” bit or running their call through a sample Lambda script to identify their state (call 1-571-327-3066 for a demo, minus the outbound experience). Real-world use cases for this could include automating calls to remind customers of upcoming appointments, notifying a group of an emergency situation, creating a “Don’t call us, we’ll call you!” customer service setup (so that you don’t have to expose your company’s phone number), scheduling wake-up calls, etc.

What we’re doing

Using Amazon Connect, we’ll:

  1. Configure our instance for application integration
  2. Create a sample contact flow with basic IVR and Lambda integration
  3. Use the Connect API to place a phone call (with PHP)

This assumes you already have your Amazon Connect instance set up with a single number claimed. If not, this takes ~5 minutes to do.

Step 1: Configure your instance for application integration

In order to interact with Connect outside of the Connect console, you have to add an approved origin. From the AWS console, select “Application Integration” and add the domain which will house our script (from step three below).

Step 2: Create the contact flow

As noted above, my example will call the user and give them an option to listen to “Who’s on First?” or interact with a Lambda function (which will detect state based on area code). You could easily use a pre-defined contact flow for this or create your own. Here’s the contact flow I’m using:

Step 3: Use the Connect API to place an outbound call

Like all other API interactions, you’ll need credentials. To do this, I created a temporary IAM user that has the AmazonConnectFullAccess policy attached.

The next thing you’ll need to do is get your instance ID, contact flow ID, and queue ID. Connect could make this a bit easier but it’s still simple to locate.

  • Getting your instance ID: Navigate to the Connect page in the AWS console and on the “Overview” page, you’ll see your instance ARN. It’s formatted similar to “arn:aws:connect:us-west-2:99999999:instance/”. Your instance ID is after the “…instance/” portion. This is also in the queue and contact flow ARNs.
  • Getting your contact flow and queue IDs: From the Connect console, navigate to the contact flow and queue you want to use. On both pages, you’ll see “Show additional queue information”. On click, this will display the ARN. The tail of each ARN (after “…/queue/” or “…/contact-flow/”) contains the ID. Both ARNs also contain your instance ID.
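
Since all three IDs live inside the ARNs, they can also be pulled out with a few lines of code instead of by eye. This is a minimal sketch: `parseConnectArn` is a hypothetical helper, and the ARN in the usage line is made up to match the format described above.

```javascript
// Hypothetical helper: split a Connect ARN into its region, account, and the
// instance / queue / contact flow IDs it contains.
function parseConnectArn(arn) {
  const parts = String(arn).split(':');
  if (parts.length < 6 || parts[0] !== 'arn' || parts[2] !== 'connect') {
    throw new Error('Not an Amazon Connect ARN: ' + arn);
  }

  const ids = {
    region: parts[3],
    accountId: parts[4],
    instanceId: null,
    queueId: null,
    contactFlowId: null,
  };

  // The resource part looks like "instance/<id>/queue/<id>", so walk it in pairs.
  const resource = parts.slice(5).join(':').split('/');
  for (let i = 0; i + 1 < resource.length; i += 2) {
    if (resource[i] === 'instance') ids.instanceId = resource[i + 1];
    if (resource[i] === 'queue') ids.queueId = resource[i + 1];
    if (resource[i] === 'contact-flow') ids.contactFlowId = resource[i + 1];
  }
  return ids;
}

// Made-up ARN for illustration.
console.log(parseConnectArn('arn:aws:connect:us-west-2:123456789012:instance/aaaa-1111/queue/bbbb-2222'));
```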

The script itself is pretty straightforward. I’ve set it up so that the numbers to dial are loaded into an array and from there, it just loops through each and places the call:

<?php
//Include AWS SDK
require '/home/bitnami/vendor/autoload.php'; 

//New Connect client
$client = new Aws\Connect\ConnectClient([
'region'  => 'us-west-2', //the region of your Connect instance
'version' => 'latest',
'credentials' => [
  'key' => '<yourIAMkey>', //IAM user key
  'secret' => '<yourIAMsecret>', //IAM user secret
]
]);

$dialNumbers=array('<phonenumber1>','<phonenumber2>');
foreach ($dialNumbers as $number){
  $result = $client->startOutboundVoiceContact([
    'ContactFlowId' => '<contactFlowId>', // REQUIRED
    'DestinationPhoneNumber' => "$number", // REQUIRED
    'InstanceId' => '<yourConnectInstanceId>', // REQUIRED
    'QueueId' => '<yourConnectQueueId>', // Use either QueueId OR SourcePhoneNumber. SourcePhoneNumber must be claimed in your Connect instance.
    //'SourcePhoneNumber' => '', // Use either QueueId OR SourcePhoneNumber. SourcePhoneNumber must be claimed in your Connect instance.
  ]);
  
  echo "<pre>";
  print_r($result);
  echo "</pre>";
  echo "<hr />";
}
?>

The phone numbers must be formatted in E.164 format. The US, for example, would be +15555555555.
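
If your numbers are stored in a looser format, they can be cleaned up before being handed to the script. Here is a rough sketch for US numbers only; `toE164US` is a hypothetical helper and the handling of 10 and 11 digit inputs is an assumption about how your data looks.

```javascript
// Hypothetical helper: turn common US phone formats into E.164, or return
// null when the input can't be converted safely.
function toE164US(input) {
  const trimmed = String(input).trim();
  const digits = trimmed.replace(/\D/g, '');

  if (trimmed.startsWith('+')) {
    // Already has a country code: strip the formatting and check the length.
    return /^[1-9]\d{1,14}$/.test(digits) ? '+' + digits : null;
  }
  if (digits.length === 10) {
    return '+1' + digits; // assume a US number without the country code
  }
  if (digits.length === 11 && digits.startsWith('1')) {
    return '+' + digits; // US number with a leading 1
  }
  return null;
}

console.log(toE164US('(555) 555-5555'));
console.log(toE164US('555-1234'));
```

Returning null instead of guessing keeps badly formatted numbers out of the dial list.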

You’ll get a response with the following details:

Aws\Result Object
(
    [data:Aws\Result:private] => Array
        (
            [ContactId] => c###4
            [@metadata] => Array
                (
                    [statusCode] => 200
                    [effectiveUri] => https://connect.us-west-2.amazonaws.com/contact/outbound-voice
                    [headers] => Array
                        (
                            [content-type] => application/json
                            [content-length] => 52
                            [connection] => keep-alive
                            [date] => Wed, 21 Nov 2018 21:53:39 GMT
                            [x-amzn-requestid] => e79###6
                            [access-control-allow-origin] => *
                            [access-control-allow-headers] => Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token
                            [x-amz-apigw-id] => Qu4###_og=
                            [access-control-allow-methods] => GET,OPTIONS,POST
                            [x-amzn-trace-id] => Root=1-5b####dcb6a90;Sampled=1
                            [x-cache] => Miss from cloudfront
                            [via] => 1.1 85d####da.cloudfront.net (CloudFront)
                            [x-amz-cf-id] => zlUCJR####B0Lmw==
                        )

                    [transferStats] => Array
                        (
                            [http] => Array
                                (
                                    [0] => Array
                                        (
                                        )

                                )

                        )

                )

        )

    [monitoringEvents:Aws\Result:private] => Array
        (
        )

)

 

Consuming RTSP Stream and Saving to AWS S3

I wanted to stream and record my home security cameras to the cloud for three reasons: 1) if the NVR is stolen, I’ll have the footage stored remotely, 2) (more realistically) I want to increase the storage availability without having to add hard drives, and 3) I want to increase the ease-of-access for my recordings. There are a number of services that do this for you (such as Wowza) and you can also purchase systems that do this out-of-the-box. The downside to services like Wowza is cost — at least $50/month for a single channel streaming without any recording – and the out-of-the-box solutions are expensive and run on proprietary platforms that limit your use and access…plus it’s more fun to do it yourself and learn something.

The solution I arrived at was to use AWS Lightsail and S3. This gives me the low cost, ease of scale, and accessibility I desire. Due primarily to the transfer rate limits, Lightsail will only work for small, home setups but you could “upgrade” from Lightsail to EC2 to mitigate that. After all, Lightsail is just a pretty UI that takes away all the manual config work needed to set up an EC2 instance (in fact, Lightsail utilizes EC2 behind the scenes). If you prefer not to use Lightsail or EC2 at all, you could swap in a Raspberry Pi to do the grunt work locally and pass the files to S3. This would cut the monthly cost by ~$5 but comes with the maintenance of local hardware.

What we’re doing

In this guide, we’ll capture an RTSP stream from a Hikvision security camera NVR (which includes most Nightowl, LaView, and many more as they all use a branded form of Hikvision’s software) and save the files to AWS S3 by:

  1. Creating an AWS Lightsail instance
  2. Installing openRTSP (via LiveMedia-Utils package)
  3. Capturing the RTSP stream and saving it locally to the Lightsail instance
  4. Installing the AWS PHP SDK and using it to sweep the video files from the Lightsail instance to S3

While the details below are specific to my setup, any RTSP stream (such as the NASA stream from the International Space Station) and any Linux server will work as well. Substitute as desired.

Step 1: Creating the Lightsail Instance

I’m going to use the $5/month LAMP w/PHP7 type so that we can have the 2TB of transfer. In my testing, this was sufficient for the number of cameras/channels I’m handling. You should do your own testing to determine whether this is right for you. Keep in mind that transfer is measured both in AND out and we’ll be transferring these files out to S3.
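
To get a feel for whether 2TB is enough before testing, the transfer can be estimated from the stream bitrate. This is a back-of-the-envelope sketch only: `estimateMonthlyTransferGB` is a hypothetical helper, and the 4 cameras at 512 kbps in the usage line are assumed numbers, not measurements from my setup.

```javascript
// Rough monthly transfer estimate for continuously recorded streams.
// Every byte is counted twice: once coming in from the camera and once
// going out to S3.
function estimateMonthlyTransferGB(cameraCount, bitrateKbps, daysPerMonth = 30) {
  const secondsPerMonth = daysPerMonth * 24 * 60 * 60;
  const bytesPerCamera = ((bitrateKbps * 1000) / 8) * secondsPerMonth;
  const inboundBytes = cameraCount * bytesPerCamera;
  const totalBytes = inboundBytes * 2; // in + out
  return totalBytes / 1e9; // decimal gigabytes
}

// Assumed example: 4 cameras, 512 kbps each.
console.log(estimateMonthlyTransferGB(4, 512).toFixed(0) + ' GB per month');
```

Sub-streams (like the /102 channel used later) run at a much lower bitrate than main streams, which makes a big difference to this number.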

  1. Navigate to Lightsail
  2. Select [Create Instance].
  3. Here’s a screenshot of the instance I’m using:

Although 100% optional, I’d recommend going ahead and assigning a static IP  and setting up a connection in PuTTY. Otherwise, the web terminal window provided in the Lightsail UI will work – I find it a bit buggy, though.

Step 2: Install LiveMedia-Utils package

The LiveMedia-Utils package contains openRTSP which is the key to consuming and storing the feeds from our NVR. Once connected to our Lightsail instance, let’s:

sudo apt-get install livemedia-utils
cd /usr/src
sudo wget http://www.live555.com/liveMedia/public/live555-latest.tar.gz
sudo tar -xzf live555-latest.tar.gz
cd live
sudo ./genMakefiles linux
sudo make
sudo make install

At this point, openRTSP should be ready to go.

Step 3: Capturing the RTSP stream

I want to keep my video files contained so let’s create a new directory for them:

mkdir /home/bitnami/recordings
cd /home/bitnami/recordings

And now we’re ready to test! I’d recommend reviewing the list of options openRTSP offers before diving in. Here’s my set of options:

openRTSP -D 1 -c -B 10000000 -b 10000000 -4 -Q -F CAM1 -d 300 -P 300 -t -u <USERNAME> <PASSWORD> rtsp://<MYCAMIP>:554/Streaming/Channels/102

Some explanations:
-D 1 | Quit if nothing is received for 1 or more seconds
-c | Play continuously, even after -d timeframe
-B 10000000 | Input buffer of 10MB.
-b 10000000 | Output buffer of 10MB (to the .mp4 file)
-4 | Write in .mp4 format
-Q | Display QOS statistics on exit
-F CAM1 | Prefix the .mp4 files with “CAM1”
-d 300 | Run openRTSP for this many seconds – essentially, the length of your files.
-P 300 | Start a new file every 300 seconds – essentially, the length of your individual files (so each 5 minute block of time will be a unique file)
-t | Use TCP instead of UDP
-u <> | My cam’s username, password, and the RTSP URL.
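
With more than one camera, it can help to generate the option list instead of retyping it. The sketch below builds the same set of flags as an argument array; `buildOpenRtspArgs` is a hypothetical helper, and the IP address, username, and password in the usage line are placeholders.

```javascript
// Hypothetical helper: build the openRTSP argument list used above for one
// camera. Each array element is one command-line argument.
function buildOpenRtspArgs(options) {
  const { prefix, segmentSeconds, bufferBytes, username, password, url } = options;
  return [
    '-D', '1', // quit if nothing is received for 1 second
    '-c', // play continuously
    '-B', String(bufferBytes), // input buffer
    '-b', String(bufferBytes), // output buffer
    '-4', // write .mp4
    '-Q', // QOS statistics on exit
    '-F', prefix, // file name prefix
    '-d', String(segmentSeconds),
    '-P', String(segmentSeconds), // new file every N seconds
    '-t', // TCP instead of UDP
    '-u', username, password,
    url,
  ];
}

// Placeholder camera details for illustration.
const args = buildOpenRtspArgs({
  prefix: 'CAM1',
  segmentSeconds: 300,
  bufferBytes: 10000000,
  username: 'user',
  password: 'pass',
  url: 'rtsp://192.0.2.10:554/Streaming/Channels/102',
});
console.log('openRTSP ' + args.join(' '));
```

The array can be handed to Node’s `child_process.spawn('openRTSP', args)`, which avoids shell quoting problems with passwords that contain special characters.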

You can use tmux to let the openRTSP command continue to run in the background (otherwise, it’ll die when you close your terminal window). So:

tmux
openRTSP -D 1 -c -B 10000000 -b 10000000 -4 -Q -F CAM2 -d 300 -P 300 -t -u <username> <password> <rtspURL>

Then press ctrl+b followed by d to hop out of tmux and you can close the terminal window.

You should see your video files start populating in the /home/bitnami/recordings directory now:

Step 4: Install the AWS PHP SDK and move recordings to S3

As S3 is cheaper and since we only have 40GB of storage with our Lightsail instance, I’m going to move my recordings from Lightsail to S3 using PHP.

Before proceeding, install the AWS PHP SDK.

Now that the SDK is installed, we can create a simple script and cron to filter through the files in the /home/bitnami/recordings directory, determine their age, move the oldest to S3, and delete the file from Lightsail. If my files are 5 minutes long, I’ll have my cron run every 5 minutes. Yes, there are more efficient ways of doing this but I’m okay with being scrappy in this situation.

I’d recommend taking a snapshot of your instance now that everything is set up, tested, and finalized. This enables you to tinker and try new things without worrying about having to repeat this process if you screw something up.

I’ll create a directory for my cron script and its log to live and then create my cron file:

mkdir /home/bitnami/cron
cd /home/bitnami/cron
sudo nano move.php

Here’s the script (move.php) I wrote to handle the directory list, sorting, movement to S3, and deletion from Lightsail:

<?php
//Include AWS SDK
require '/home/bitnami/vendor/autoload.php'; 

//Start S3 client
$s3 = new Aws\S3\S3Client([
'region'  => 'us-west-2',
'version' => 'latest',
'credentials' => [
  'key' => '<iamkey>', //IAM user key
  'secret' => '<iamsecret>', //IAM user secret
]
]);

//Set timezone and get current time
date_default_timezone_set('America/Los_Angeles');
$currentTime=strtotime("now");
 
 //Get a list of all the items in the directory, ignoring those we don't want to mess with
$files = array_diff(scandir("/home/bitnami/recordings",1), array('.', '..','.mp4','_cron_camsstorevideos.sh'));

//Loop through those files
foreach($files as $file){
  $lastModified=date ("Y-m-d H:i:s", filemtime("/home/bitnami/recordings/$file"));//Separate out the "pretty" timestamp as we'll use it to rename our files.
  $lastModifiedEpoch=strtotime($lastModified);//Get the last modified time
  if($currentTime-$lastModifiedEpoch>30){ //If the difference between now and when the file was last modified is > 30 seconds (meaning it's finished writing to disk), take actions
    echo "\r\n Taking action! $file was last modified: " . date ("F d Y H:i:s", filemtime("/home/bitnami/recordings/$file"));
    //Save to S3
    $result = $s3->putObject([
    'Bucket' => '<bucketname>', //the S3 bucket name you're using
    'Key'    => "CAM1VIDEO @ $lastModified.mp4", //The new filename/S3 key for our video (we'll use the last modified time for this)
    'SourceFile' => "/home/bitnami/recordings/$file", //The source file for our video
    'StorageClass' => 'ONEZONE_IA' //I'm using one zone, infrequent access (IA) storage for this because it's cheaper
    ]);
    
    //Delete file from lightsail
    unlink("/home/bitnami/recordings/$file");
  }
}
?>
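
The key decision in move.php is the “has this file finished writing?” check. Pulled out on its own, that logic looks roughly like the sketch below. The function names are hypothetical, and one detail differs on purpose: the timestamp here is formatted in UTC to keep the example short, whereas move.php uses America/Los_Angeles.

```javascript
// Hypothetical helper: a recording is treated as finished once it has not
// been modified for more than minAgeSeconds (30 in move.php).
function selectFinishedFiles(files, nowMs, minAgeSeconds = 30) {
  return files.filter((file) => (nowMs - file.mtimeMs) / 1000 > minAgeSeconds);
}

// Hypothetical helper: build an S3 key from the last-modified time, in the
// same "<label> @ YYYY-MM-DD HH:MM:SS.mp4" pattern (UTC in this sketch).
function buildS3Key(cameraLabel, mtimeMs) {
  const stamp = new Date(mtimeMs).toISOString().slice(0, 19).replace('T', ' ');
  return cameraLabel + ' @ ' + stamp + '.mp4';
}

// Made-up file list: one file still being written, one finished.
const now = Date.UTC(2019, 0, 2, 3, 10, 0);
const recordings = [
  { name: 'CAM1-00001.mp4', mtimeMs: now - 20 * 1000 },
  { name: 'CAM1-00000.mp4', mtimeMs: now - 320 * 1000 },
];

for (const file of selectFinishedFiles(recordings, now)) {
  console.log(file.name + ' -> ' + buildS3Key('CAM1VIDEO', file.mtimeMs));
}
```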

That’s it! As long as you have the write policy applied to your bucket, you should be good to go:

The last thing I’ll do is set a crontab to run the move.php script every 5 minutes and log the output:

*/5 * * * * sudo php /home/bitnami/cron/move.php >> /home/bitnami/cron/move.log 2>&1

Using AWS Rekognition to Detect Text in Images with PHP

A couple years ago, I tinkered with a solution to use a webcam to capture images of receipts, convert the images to raw text, and store in a database. My scrappy solution worked okay but it lacked the accuracy to make it viable for anything real-world.

With AWS Rekognition launching since then, I figured I’d try it out and see how it compares. I used a fake receipt to see how it’d do.

Like every other AWS product I’ve used, it was incredibly easy to work with. I’ll share the simple script I used at the bottom of this post but, needless to say, there’s not much to it.

While use was a breeze, the results were disappointing, primarily because Rekognition is limited to ONLY 50 words in an image. So clearly it’s not a full-on OCR tool.

Somewhat more disappointing was the limited range of confidence scores Rekognition returned (for each text detection, it provides a confidence score). The overall output was pretty accurate but not accurate enough for me to consider it “wow” worthy. Despite this, all of the confidence scores were above 93%.

AWS Rekognition has a long way to go before it’s competitive as an OCR service. Its performance in object detection/facial recognition (which is the heart and primary use case of Rekognition) may be better but I haven’t tested that at this point.

You can view the full analysis and output of the receipt image here.

Below is the code used to generate the output linked above:

<?php
require '/home/vendor/autoload.php'; 
use Aws\Rekognition\RekognitionClient;

$client = new Aws\Rekognition\RekognitionClient([
    'version'     => 'latest',
    'region'      => 'us-west-2',
    'credentials' => [
        'key'    => 'IAM KEY',
        'secret' => 'IAM SECRET'
    ]
]);

$result = $client->detectText([
    'Image' => [
        'S3Object' => [
            'Bucket' => 'S3 BUCKET CONTAINING IMAGE',
            'Name' => 'receipt_preview.jpg',
        ],
    ],
]);

echo "<h1>Rekognition</h1>";
$i=0;
echo "<table border=1 cellspacing=0><tr><td>#</td><td>DetectedText</td><td>Type</td><td>ID</td><td>ParentId</td><td>Confidence</td></tr>";
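foreach ($result['TextDetections'] as $phrase) {
    $i++;
    //ParentId is only returned for WORD detections, so fall back to a blank cell for LINE detections
    $parentId = $phrase['ParentId'] ?? '';
    echo "<tr><td>$i</td><td>".$phrase['DetectedText']."</td><td>".$phrase['Type']."</td><td>".$phrase['Id']."</td><td>".$parentId."</td><td>".round($phrase['Confidence'])."%</td></tr>";
}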
echo "</table>";

echo "<h1>Raw Output</h1><pre>";
print_r($result);
echo "</pre>";
?>
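
Since every detection comes back with a Type and a Confidence, one way to turn the response into raw text is to keep only the LINE detections above a cutoff. This is a rough sketch rather than part of the script above: the function name extractLines, the 90% cutoff, and the sample array are all made up for illustration. Only the field names (DetectedText, Type, Confidence) come from the Rekognition response.

```php
<?php
//Hypothetical helper: collapse Rekognition TextDetections into raw lines of text.
//$minConfidence is an assumed cutoff (0-100); tune it for your own images.
function extractLines(array $textDetections, float $minConfidence = 90.0): array {
    $lines = [];
    foreach ($textDetections as $detection) {
        //LINE detections hold whole lines; WORD detections repeat the same text word by word
        if (($detection['Type'] ?? '') !== 'LINE') {
            continue;
        }
        //Skip anything Rekognition wasn't confident about
        if (($detection['Confidence'] ?? 0) < $minConfidence) {
            continue;
        }
        $lines[] = $detection['DetectedText'];
    }
    return $lines;
}

//Made-up sample shaped like $result['TextDetections']
$sample = [
    ['DetectedText' => 'TOTAL 12.50', 'Type' => 'LINE', 'Id' => 0, 'Confidence' => 98.2],
    ['DetectedText' => 'TOTAL', 'Type' => 'WORD', 'Id' => 1, 'ParentId' => 0, 'Confidence' => 98.9],
    ['DetectedText' => 'T0TL', 'Type' => 'LINE', 'Id' => 2, 'Confidence' => 61.4],
];

echo implode("\n", extractLines($sample));
```

With the real response, you would pass $result['TextDetections'] instead of $sample.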

 

Converting text to speech with AWS Polly

I wanted to try my hand at using the AWS Polly text-to-speech service. Polly offers several different voices and supports multiple languages, most of which sound pretty good, especially if you use SSML when passing text. SSML is where the character of the speech (rate, tone, pitch, etc) comes into play. See here for more detail.

I’ve created a script that calls the AWS Polly API from PHP and stores the output in an S3 bucket. Click here to try it out.

Step 1: Creating the IAM User

This has been outlined in many prior posts so I won’t go into detail.  We’ll be using the AmazonS3FullAccess and AmazonPollyFullAccess permission policies (screenshot of my user summary here).  If you don’t plan on saving the results to S3, you don’t need the S3 policy attached (obviously).

Step 2:  Converting our text to speech

<?php
require '/home/bitnami/vendor/autoload.php'; 

//Prep the Polly client and plug in our IAM credentials
use Aws\Polly\PollyClient;
$clientPolly = new PollyClient([
'version' => 'latest',
'region' => 'us-west-2', //I have all of my AWS stuff in USW2 but it's merely preference given my location.
'credentials' => [
  'key' => '', //IAM user key
  'secret' => '', //IAM user secret
 ]]);

//Configure our inputs and desired output details
$resultPolly = $clientPolly->synthesizeSpeech([
'OutputFormat' => 'mp3',
'Text' => "<speak>This is my example.</speak>",
'TextType' => 'ssml',
'VoiceId' => 'Joey', //A list of voices is available here: https://docs.aws.amazon.com/polly/latest/dg/voicelist.html
]);

//Get our results
$resultPollyData = $resultPolly->get('AudioStream')->getContents();

?>
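
Because the Text above is passed as SSML, any user-supplied text needs to be escaped before it is wrapped in a speak tag, or a stray & or < will break the request. Here is a minimal sketch of one way to do that while also setting rate and pitch through a prosody tag. The helper name buildSsml and its defaults are my own invention, and not every Polly voice honors every SSML tag, so treat the prosody attributes as an assumption to verify against the SSML docs.

```php
<?php
//Hypothetical helper: wrap plain text in SSML with a prosody tag.
//$rate and $pitch are passed through as attribute values, e.g. 'slow' or '+5%'.
function buildSsml(string $text, string $rate = 'medium', string $pitch = 'medium'): string {
    //Escape &, <, > and quotes so the text and attributes can't break the XML
    $safeText  = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $safeRate  = htmlspecialchars($rate, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $safePitch = htmlspecialchars($pitch, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return "<speak><prosody rate=\"$safeRate\" pitch=\"$safePitch\">$safeText</prosody></speak>";
}

echo buildSsml('Tom & Jerry', 'slow', '+5%');
```

The returned string could then be used as the 'Text' value in synthesizeSpeech, with 'TextType' left as 'ssml'.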

Step 3: Saving the MP3 file to S3

<?php
//Set timezone to use in the file name
date_default_timezone_set('America/Los_Angeles');

//Prep S3 client and plug in the IAM credentials
$clientS3 = new Aws\S3\S3Client([
'version' => 'latest',
'region' => 'us-west-2', //Region of your S3 bucket
'credentials' => [
  'key' => '', //Same IAM user as above
  'secret' => '',  //Same as above
 ]]);

//Put the Polly MP3 file to S3
$resultS3 = $clientS3->putObject([
  'Key'         => date("Y-m-d_H:i:s").'.mp3',
  'ACL'         => 'public-read',
  'Body'        => $resultPollyData,
  'Bucket'      => 'fkpolly',
  'ContentType' => 'audio/mpeg'
]);

//Return a link to the file
echo "<a href=\"".$resultS3['ObjectURL']."\">Listen</a>";
?>

Try It Out

I’ve saved my example for ongoing tinkering.  Click here to try it out.

Using Natural Language Processing (AWS Comprehend) to Analyze Text (Cardi B Lyrics)

Humans spend a lot of time reading, analyzing, and responding through text (emails, chats, etc). A lot of this is inefficient or not for pleasure (such as the amount of payroll companies spend to read through feedback emails or the amount of time I spend sifting through Outlook each day). Using Natural Language Processing (NLP), we can reduce the inefficient and not-for-pleasure reading we do so that time can be re-invested into something more productive or fulfilling.

For fun, I scrappily ran the lyrics to Cardi B’s “I Like It” through AWS Comprehend to see what its response would be. I also ran a review of Mission Impossible: Fallout through the same service. The full output for Cardi B can be viewed here. The full output for Mission Impossible: Fallout can be viewed here.

While these are low-value examples, a more real-world use case for Comprehend could be detecting the language of emails sent to your company to adjust the routing destination in real time (should they go to your English team or your Spanish team?). Another example would be using Comprehend to collect feedback on your new product launch or ad campaign. For example, we could easily capture Twitter mentions for a brand, funnel those into an S3 bucket, and run the contents of that bucket through Comprehend to split out negative vs positive vs neutral vs mixed sentiment mentions. From there, we could surface the most frequent adjectives and entities mentioned within each sentiment group. It’s a cheap, quick way to capture and analyze customer feedback that would otherwise be ignored.

Initial Setup
In each of the last couple of posts, I’ve outlined how to create an IAM user for your project so I won’t repeat that here. After we have our IAM user created and the credentials added to our ~/.aws/credentials file, we’ll import the AWS PHP SDK and ComprehendClient. Next, we’ll create a Comprehend client and define the API version, region, and credentials profile to use:

<?php

require '/home/vendor/autoload.php'; 
use Aws\Comprehend\ComprehendClient;

$client = new Aws\Comprehend\ComprehendClient([
    'version'     => '2017-11-27',
    'region'      => 'us-west-2',
    'profile'     => 'fk-comprehend',
]);
$review="Your Comprehend text";

Exploring the Options
In this example, we’ll detect the language(s), entities (objects, businesses, etc), key phrases, sentiment, and syntax (parts of speech) of our sample texts (Cardi B lyrics and a movie review). For all of these except DetectDominantLanguage, the language is a required input. If we use Comprehend to identify the language first, we can simply pass its output into the later functions. For each output, Comprehend also spits out a confidence score which tells you how confident it is in that output. This could be used to ignore low-confidence suggestions, thus increasing the accuracy of the models you build using Comprehend.

DetectDominantLanguage Example
This will detect the language and spit out the ISO abbreviation.

//Detecting Dominant Language
$result = $client->detectDominantLanguage([
    "Text" => "$review",
]);

echo "<h1>DetectDominantLanguage</h1><pre>";
print_r($result);
echo "</pre>";

foreach ($result['Languages'] as $phrase) {
    echo "Language ".$phrase['LanguageCode']." has a confidence score of ".round($phrase['Score']*100)."%.<br />";
}
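
To reuse the detected language as the LanguageCode input for the other calls (or to route an email to the right team), we need to pick a single code out of the Languages list. The sketch below is one hedged way to do it: the function name pickLanguageCode, the 0.5 cutoff, and the 'en' fallback are assumptions of mine, and the sample array is made up to mimic $result['Languages'].

```php
<?php
//Hypothetical helper: pick the language code to reuse as "LanguageCode" in later calls.
//Falls back to $fallback when nothing scores at or above $minScore.
function pickLanguageCode(array $languages, float $minScore = 0.5, string $fallback = 'en'): string {
    $bestCode = $fallback;
    $bestScore = $minScore;
    foreach ($languages as $language) {
        //Keep the highest-scoring language that clears the cutoff
        if ($language['Score'] >= $bestScore) {
            $bestCode = $language['LanguageCode'];
            $bestScore = $language['Score'];
        }
    }
    return $bestCode;
}

//Made-up sample shaped like $result['Languages']
$sampleLanguages = [
    ['LanguageCode' => 'es', 'Score' => 0.21],
    ['LanguageCode' => 'en', 'Score' => 0.78],
];

echo pickLanguageCode($sampleLanguages);
```

The returned code could replace the hard-coded "en" in the examples that follow.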

DetectSentiment Example

//Detecting Sentiment
$result = $client->detectSentiment([
    "LanguageCode" => "en",
    "Text" => "$review",
]);

echo "<h1>DetectSentiment</h1><pre>";
print_r($result);
echo "</pre>";

echo "Sentiment: ".$result['Sentiment']."<br />";
echo "Positive: ".round($result['SentimentScore']['Positive']*100)."%<br />";
echo "Negative: ".round($result['SentimentScore']['Negative']*100)."%<br />";
echo "Neutral: ".round($result['SentimentScore']['Neutral']*100)."%<br />";
echo "Mixed: ".round($result['SentimentScore']['Mixed']*100)."%<br />";
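
For the brand-mentions idea from earlier, each mention would get its own detectSentiment call and the results would be split by sentiment. Here is a small sketch of the tallying step only, assuming the results have already been collected into an array. The function name tallySentiments and the sample data are hypothetical; the four sentiment labels are the ones Comprehend returns in the Sentiment field.

```php
<?php
//Hypothetical helper: count how many detectSentiment results fall into each sentiment.
function tallySentiments(array $results): array {
    $tally = ['POSITIVE' => 0, 'NEGATIVE' => 0, 'NEUTRAL' => 0, 'MIXED' => 0];
    foreach ($results as $result) {
        $sentiment = $result['Sentiment'] ?? null;
        //Ignore anything that isn't one of the four known labels
        if (is_string($sentiment) && array_key_exists($sentiment, $tally)) {
            $tally[$sentiment]++;
        }
    }
    return $tally;
}

//Made-up sample: one entry per analyzed mention
$sampleResults = [
    ['Sentiment' => 'POSITIVE'],
    ['Sentiment' => 'NEGATIVE'],
    ['Sentiment' => 'POSITIVE'],
    ['Sentiment' => 'MIXED'],
];

print_r(tallySentiments($sampleResults));
```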

DetectKeyPhrases Example

//Detecting KeyPhrases
$result = $client->detectKeyPhrases([
    "LanguageCode" => "en",
    "Text" => "$review",
]);

echo "<h1>DetectKeyPhrases</h1><pre>";
print_r($result);
echo "</pre>";

foreach ($result['KeyPhrases'] as $phrase) {
    echo "Phrase ".$phrase['Text']." has a score of ".round($phrase['Score']*100)."%.<br />";
}

DetectSyntax Example

//Detecting Syntax
$result = $client->detectSyntax([
    "LanguageCode" => "en",
    "Text" => "$review",
]);

echo "<h1>DetectSyntax</h1><pre>";
print_r($result);
echo "</pre>";

foreach ($result['SyntaxTokens'] as $syntax) {
    echo "Token ".$syntax['Text']." is tagged as ".$syntax['PartOfSpeech']['Tag']." (with ".round($syntax['PartOfSpeech']['Score']*100)."% confidence).<br />";
}

DetectEntities Example

//Detecting Entities
$result = $client->detectEntities([
    "LanguageCode" => "en",
    "Text" => "$review",
]);

echo "<h1>DetectEntities</h1><pre>";
print_r($result);
echo "</pre>";

foreach ($result['Entities'] as $entity) {
    echo "Entity ".$entity['Text']." is of type ".$entity['Type']." (".round($entity['Score']*100)."% confidence).<br />";
}
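
The confidence scores also make it easy to drop shaky detections before counting which entities come up most often. This is only a sketch under my own assumptions: topEntities, the 0.8 cutoff, and the sample data are invented for illustration, while Text and Score are the fields Comprehend returns for each entity.

```php
<?php
//Hypothetical helper: count the most frequently mentioned entities above a confidence cutoff.
function topEntities(array $entities, float $minScore = 0.8, int $limit = 3): array {
    $counts = [];
    foreach ($entities as $entity) {
        //Ignore low-confidence detections
        if ($entity['Score'] < $minScore) {
            continue;
        }
        $name = $entity['Text'];
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }
    //Sort by count, highest first, keeping the entity names as keys
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

//Made-up sample shaped like $result['Entities']
$sampleEntities = [
    ['Text' => 'Tom Cruise', 'Type' => 'PERSON', 'Score' => 0.99],
    ['Text' => 'Paris', 'Type' => 'LOCATION', 'Score' => 0.90],
    ['Text' => 'Tom Cruise', 'Type' => 'PERSON', 'Score' => 0.95],
    ['Text' => 'it', 'Type' => 'OTHER', 'Score' => 0.40],
];

print_r(topEntities($sampleEntities));
```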

The Results for Cardi B and Tom Cruise

The full output for Cardi B can be viewed here. This one is the more interesting of the two as “I Like It” has a Spanish verse. You can see how Comprehend dealt with it when it was passed as English. It also does a good job of determining when “bitch” is a noun vs an adjective except in the line “Where’s my pen? Bitch I’m signin'”; I’m unsure as to why.

The full output for Mission Impossible: Fallout can be viewed here. The interesting piece here is the sentiment analysis: NEUTRAL (8% positive, 36% negative, 39% neutral, and 17% mixed). After reading the review, I would say this is pretty in line with the reviewer and Comprehend did a good job of identifying the overall sentiment of the article.

Automatically Creating Lightsail Instance Snapshots

Given the target audience of Lightsail, I would expect UI-based functionality for automating snapshots and other common tasks; however, this doesn’t exist.  Creating snapshots is an important task – I create snapshots before I make any major changes and every few days.  In the event I screw something up or if something happens to my instance, I can simply spin up a new instance from an old snapshot – no big deal.

In addition to the lack of UI-based functionality, the default IAM policies don’t apply to Lightsail, either.  Given the age of Lightsail, I would think this would be built into IAM default policies by this point.

In the guide below, we’ll:

  1. Create an IAM policy to manage our Lightsail snapshots
  2. Create an IAM user to use that IAM policy
  3. Add our IAM user to our AWS credentials file
  4. Create a Lightsail snapshot using the AWS CLI

Beyond creating snapshots, the AWS CLI offers all commands needed to manage Lightsail – I encourage you to explore: https://docs.aws.amazon.com/cli/latest/reference/lightsail/index.html

Create the needed IAM Policy

  1. From the IAM page of the AWS Console, select Policies.
  2. From there, click “Create Policy” and select the JSON tab.  We’ll use this policy, which limits actions to just the creation and listing of instance snapshots:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "lightsail:GetInstanceSnapshot",
                    "lightsail:GetInstanceSnapshots",
                    "lightsail:CreateInstanceSnapshot"
                ],
                "Resource": "*"
            }
        ]
    }

    Here’s a screenshot of the inputs.

  3. Click “Review Policy” and then we’ll give it a name (I’ve used “LightsailSnapshotCreate”) and then click “Create Policy”

Create the needed IAM User

  1. Back on the IAM page of the AWS Console, click on Users.
  2. We’ll name the user (I’ve used “FKSnapshotCreate”) and check the “Programmatic Access” box.  Screenshot.
  3. On the next page, we’ll attach the policy we created (“LightsailSnapshotCreate”) and create the user.  Screenshot.
  4. Lastly, we’ll copy the access ID and key into our credentials file.  Screenshot.

Adding the IAM User Access Key/Access ID to our Credentials file

  1. Open your credentials file and add in another user (as described here).
  2. In my project, I have a user named “fk-createsnapshot” so my credentials file looks something like this.

Creating a snapshot using AWS CLI

  1. Now that we have everything configured, we just need to run the command.  We’ll do this with the following command:
    aws lightsail create-instance-snapshot --instance-name FigarosKingdomWP --instance-snapshot-name FK-2018-09-09 --profile fk-createsnapshot --region us-west-2

    “--instance-name FigarosKingdomWP” – this is the name of the instance from the Lightsail console that you’re wanting to snapshot.
    “--instance-snapshot-name FK-2018-09-09” – this is the name of the snapshot.  It can be anything you like.
    “--profile fk-createsnapshot” – this is the IAM User (same one we created above) that we want to use with this command.
    “--region us-west-2” – this is the region of the instance.

  2. Once executed, you should see output similar to this:
  3. Browse to the Lightsail Console and you should see your new snapshot: screenshot.

Creating a shell script to automate snapshots

We’ll create a shell script (I’m using fk-createsnapshot.sh) and add in our command.  I’ve also added variables for the date so that the snapshot name matches the date.  You can find more about this here.  Here’s my fk-createsnapshot.sh:

#!/bin/bash 
aws lightsail create-instance-snapshot --instance-name FigarosKingdomWP --instance-snapshot-name FK-$(date +%Y-%m-%d) --profile fk-createsnapshot --region us-west-2

Adding a cron job to run the shell script

In our crontab (crontab -e), we’ll set it to run at 3am every day (adjust to your script location):

0 3 * * * /home/bitnami/fk-createsnapshot.sh

It may be a good idea to set it to run sooner just to verify that everything is working as intended.  Otherwise, you’ll start seeing new snapshots appear in your Lightsail console every day at 3am!

Using this same approach, you can take pretty much all actions for your Lightsail instances directly through the CLI.  Check out the AWS CLI documentation for more.