Echo Removal

The questions below are due on Thursday April 03, 2025; 02:00:00 PM.
 
You are not logged in.

Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.

Consider a person standing near a microphone, with a wall off in the distance, as shown below.

In such a situation, audio recorded from the microphone will be corrupted by an echo that results because the microphone picks up not only sound wave from the nearby source but also the reflection of the sound wave off the wall.

Listen to the audio file phrase_echo.wav in this week's code distribution, which mimics this kind of situation.

We can model this process as a convolution y[n] = (x * h_1)[n], where the signal x is the speaker's original utterance, y is the sound picked up by the microphone, and h_1 is a filter of the following form:

h_1[n] = \cases{1&if $n=0$\cr A&if $n=n_0$\cr0&otherwise\cr}
where A is a value between 0 and 1 that represents the relative amplitude of the echo, and n_0 is a positive number that represents a time delay.

Part a. Make a plot of h_1[n] versus n for some value of A and n_0. Determine an expression for H_1(\Omega), which represents the DTFT of h_1[n]. Be prepared to discuss these results during your check-in.

Part b. Having just heard about convolution, Ben Bitdiddle believes that he can remove the echo by convolving the recorded signal y[n] with the following signal:

h_2 = \cases{1&if $n=0$\cr-A&if $n=n_0$\cr0&otherwise\cr}
His idea is based on a special case where the speaker's original utterance x[n] = \delta[n], so that the output with echo will be
y[n] = \delta[n]+A\delta[n{-}n_0]
Convolving y[n] with Ben's h_2[n] yields a single nonzero term at n{=}0:
h_2[0]\times y[0] = 1\times 1 = 1\,.
and just two terms at n{=}n_0:
h_2[0]\times y[n_0] + h_2[n_0]\times y[0] = 1\times A + (-A)\times 1 = 0
The term at n{=}0 preserves the original utterance while the -A at n{=}n_0 cancels the echo. Ben therefore argues that he has solved the problem.

It turns out that Ben is incorrect. What is the flaw in his reasoning? Be prepared to discuss this during your check-in.

Part c. Design an improved filter h_3 such that, if y[n] = (x * h_1)[n], then (y * h_3)[n] = x[n]. Plot h_3[n] versus n for some value of A and n_0. How many non-zero terms will h_3 have? Determine the first four values of n for which h_3[n]\ne0 and label their values on your plot. Be prepared to discuss these results at your check-in.

Part d. The lab handout includes a sound file phrase_echo.wav that was created by convolving an input speech signal with h_1[n] with A{=}0.9 and n_0{=}1323. We would like to remove the echo from this signal (y[n]) by convolving it with h_3[n].

One difficulty with this approach is that h_3[n] has infinite length. However, if we can approximate h_3[n] using its first N samples, then we can implement the convolution with the provided function called causal_convolution which convolves two sequences of finite length, both of which are presumed to be zero for negative time indices. Look at the code for causal_convolution. How does this function implement convolution? Create a python list h_3 that contains the first 5 non-zero terms in h_3[n]. Use the causal_convolution function to convolve h_3 with phrase_echo.wav. Listen to your output. You may find the result surprising! Can you explain what is happening, in terms of the kernel we used above?

How would you expect the output to change if we added more terms? Try it out!

Part e. Check In.

  • Show a staff member your plots of h_1[\cdot].
  • Describe what was wrong with Ben Bitdiddle's approach to echo removal.
  • Show your sketch of h_3[\cdot], and explain your process for determining h_3[\cdot].
  • Play the result of convolving the echo-corrupted signal with your approximation of h_3 containing only 5 non-zero values. Do the same for 9 nonzero samples. Explain the results you hear.

Check In:
Please describe your approach to each of the parts of this lab.

Part f. Echo Removal: Frequency Domain Approach. As we saw in lecture, we could also perform convolution by operating in the frequency domain. Start by finding the DTFT of h_3. What is the relation between H_3(\Omega) and H_1(\Omega)? Devise a frequency-domain method for removing the echo in Python, and use it to remove the echo from phrase_echo.wav.

We have provided a pair of functions -- fft (Fast Fourier Transform) and ifft (inverse FFT) -- to speed up the frequency-domain calculations. The function fft takes a single argument, which is a list of samples of the time domain signal x[\cdot]. Then fft returns a list of complex-valued samples of the DTFT of x[\cdot]. The output list always has the same length as the input list. For example, if the input is a list of N samples of x[\cdot], then the output list will contain N samples of the DTFT, where the k^{\rm th} sample is given by

X[k] = N\left[X\big(\Omega\big)\right]_{\Omega=2\pi k/N}

The function ifft inverts the operation of fft. It takes a list of samples of a DTFT and returns samples of the corresponding time-domain signal.

Please note that numerical errors can introduce small imaginary parts to an output of ifft that should theoretically be purely real. You may need to remove such imaginary parts before writing your output signal to a .wav file.

  • Upload your Python code for removing the echo.
  • Upload your .wav file for the signal with the echo removed.
  • Describe your frequency-domain method. How did you make use of the DTFT of h_3?
  • Compare results for the time and frequency approaches. What are the important differences?

Please upload a single .zip file that contains your code and the resulting .wav files:

 No file selected