Data
For our data, we have selected recordings from the Library of Congress National Jukebox (https://www.loc.gov/collections/national-jukebox/). We looked for recordings with heavy background noise, loud pops, and quiet passages. These are the specific artifacts we plan to mitigate with our remastering program.
We chose our dataset to span genres at opposite ends of the musical spectrum, from jazz to rock and from instrumental to vocal. We have also chosen songs that contain less background noise than our noisiest files. One specific benchmark file has crash cymbals that sound similar to, and are even at the same volume as, some of the noise we are trying to filter out. We want to make sure we catch only the noise, focusing on reducing false positives in filtering so we do not remove any parts of the actual music.
In addition to these old recordings, we would also like to use “ideal” recordings as reference tracks. These references act as a mapping, or calibration, to clarify the song after we filter out the noise.
Initial Filtering Steps
So far we have worked on cleaning up the background noise through a mixture of Gaussian and moving-average filters. The Gaussian filter works remarkably well at removing the white noise from the recordings, and the moving-average filter also does a good job of eliminating high-frequency noise. Below is a comparison of the frequency magnitudes in the Fourier domain for the original song versus the Gaussian- and moving-average-filtered versions:
The first plot, the magnitude FFT of a 30-second sample of a song, gives an overview of the frequency content so we can compare it against the filtered FFT plots to determine the relative effect of each filter. The second plot is the magnitude FFT of the sample after filtering with a Gaussian window of size 4096, and the third is the magnitude FFT after a moving-average filter of window size 10.

As the plots show, the Gaussian filter is the most effective: it cuts down most of the high frequencies that contribute to noise, raising the signal-to-noise ratio. This makes sense, since the background noise is approximately white (broadband) noise, so smoothing with a Gaussian window removes it efficiently. Listening confirms this: the noise, while not fully eliminated, is lowered to the point where the music is much more distinct from it. The moving-average filter does not remove the noise as well as the Gaussian filter, though it still works reasonably. In listening to its output, the moving-average filter produces a duller sound, and the plot shows what appears to be a harmonic pattern in the frequency attenuation (the comb-like nulls characteristic of a moving average). Both filters reduce the noise, however, and both improve the sound quality considerably.
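As a rough illustration of the two filters (our project code is in MATLAB; this is a minimal NumPy sketch with a made-up test signal, and the sample rate, tone frequency, and window parameters are assumptions), smoothing a noisy tone with a normalized Gaussian window and with a moving average both reduce the energy above a chosen cutoff:

```python
import numpy as np
from scipy.signal.windows import gaussian

fs = 8000                                  # assumed sample rate for the demo
t = np.arange(0, 1, 1 / fs)
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * t)        # stand-in for the music
noisy = clean + 0.3 * rng.standard_normal(len(t))  # added wideband noise

# Gaussian window smoothing (normalized so the DC gain is 1)
win = gaussian(101, std=2.0)
win /= win.sum()
gauss_out = np.convolve(noisy, win, mode='same')

# Moving-average filter with window size 10, as in the report
ma_out = np.convolve(noisy, np.ones(10) / 10, mode='same')

def hf_energy(x, cutoff_hz=2000):
    """Energy above cutoff_hz, a crude proxy for residual noise."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / fs)
    return np.sum(np.abs(X[f > cutoff_hz]) ** 2)
```

Comparing hf_energy(noisy) against hf_energy(gauss_out) and hf_energy(ma_out) mirrors the FFT-magnitude comparison in the plots: the Gaussian window attenuates the band above 2 kHz far more strongly than the length-10 moving average.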
The current issue we have come across is that the Gaussian filter introduces a sort of "rubber banding" effect at large window sizes: the song starts, then a time-shifted copy starts playing a short time later, then another copy a short time after that, and so on. This is a problem because we need a larger window to reach the noise reduction we want. We have tried several correction methods, such as cutting out uncorrupted portions and then shifting, but these attempts have failed. We are currently investigating whether the filter is introducing an unwanted phase shift.
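One property worth checking here: a symmetric FIR window such as the Gaussian has linear phase, i.e. a constant group delay of (N-1)/2 samples, so an uncompensated full convolution shifts the whole output later by half the window length. The NumPy sketch below (illustrative; our actual code is MATLAB, and the sample rate, tone, and window std are assumed demo values) shows that trimming that delay realigns the output:

```python
import numpy as np
from scipy.signal.windows import gaussian

fs = 8000
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 220 * t)            # stand-in for the song

N = 4097                                   # large window, like the size-4096 case
win = gaussian(N, std=8.0)                 # std chosen so a 220 Hz tone survives
win /= win.sum()

full = np.convolve(x, win)                 # full convolution: output is delayed
delay = (N - 1) // 2                       # group delay of a symmetric FIR
aligned = full[delay:delay + len(x)]       # compensate by dropping the delay

# Cross-correlation peak location checks for any residual shift
lag = np.argmax(np.correlate(aligned, x, mode='full')) - (len(x) - 1)
```

If the rubber-banding comes from processing the song in overlapping blocks, each block would carry this same (N-1)/2-sample delay, which could sound exactly like time-shifted copies starting one after another; trimming the delay per block (or using a zero-phase scheme such as forward-backward filtering) could be the fix.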
Cleaning Background Noise
Conventional digital filtering can separate frequencies that are mainly signal from frequencies that are mainly noise, but it breaks down when the noise frequencies overlap the signal frequencies. This often appears as magnetic tape hiss, electronic noise in analog circuits, wind blowing into receivers, or even crowd noise that lies in the same 200 Hz to 3.2 kHz range as the voice. Linear filtering will not work in this case because the signals overlap in both the time and frequency domains, so we must process the audio with nonlinear methods.
We have explored two different nonlinear techniques so far. The first is used for reducing wideband noise, and the second is used for separating signals that have been multiplied or convolved with each other rather than mixed through addition.
Methods:
1) In a short segment of speech, the amplitudes of the frequency components are highly irregular, but there is still a clear difference between background noise and the desired signal: most of the desired signal is concentrated in a few large-amplitude frequencies, while random noise is irregular but spread more uniformly at low amplitude.
The basis of this method is to partially separate the signals by looking at the amplitude of each frequency: if the amplitude is large, it is most likely part of the signal and should be kept, while smaller amplitudes should be removed.
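A common way to implement this idea is short-time spectral gating: frame the audio, keep the large-amplitude FFT bins, and zero the small ones. Below is a minimal NumPy sketch (our implementation will be in MATLAB; the frame size, hop, and the median-based noise-floor estimate are all illustrative assumptions):

```python
import numpy as np

def spectral_gate(x, fs, frame=1024, hop=512, threshold=3.0):
    """Suppress FFT bins whose magnitude falls below a noise-floor
    estimate; large-amplitude bins (mostly signal) are kept."""
    window = np.hanning(frame)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * window
        X = np.fft.rfft(seg)
        mag = np.abs(X)
        floor = np.median(mag)            # crude per-frame noise-floor estimate
        X[mag < threshold * floor] = 0    # zero out low-amplitude bins
        out[start:start + frame] += np.fft.irfft(X) * window
        norm[start:start + frame] += window ** 2
    return out / np.maximum(norm, 1e-8)   # overlap-add normalization
```

A fixed multiple of the per-frame median magnitude is a crude noise-floor estimate; a real implementation would estimate the floor from noise-only passages and apply a softer attenuation than hard zeroing to avoid "musical noise" artifacts.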
2) We can also process the audio using homomorphic signal processing. This method handles cases in which noise or interference is mixed in through multiplication or convolution (i.e., nonlinearly), and is built on the idea of turning this nonlinear problem into a linear one. First we apply a Fourier transform, which turns any convolution of signals into multiplication. We then take the logarithm of the transformed signal, which turns that multiplication into addition ( log(x*y) = log(x) + log(y) ). Once we identify the portion of this sum that makes up the background noise, we subtract it from the overall signal and apply the inverse logarithm and inverse transform to get the resulting signal.
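The two transform steps can be checked numerically. The NumPy snippet below (illustrative, with random stand-in signals) verifies that zero-padded FFTs turn convolution into multiplication, and that the logarithm then turns that multiplication into addition:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(64), rng.standard_normal(64)

n = len(a) + len(b) - 1            # zero-pad so circular convolution == linear
A, B = np.fft.fft(a, n), np.fft.fft(b, n)
Y = np.fft.fft(np.convolve(a, b), n)

# Step 1: convolution in time becomes multiplication in frequency
step1 = np.allclose(Y, A * B)

# Step 2: the log turns that multiplication into addition, log(x*y) = log(x) + log(y)
step2 = np.allclose(np.log(np.abs(Y)),
                    np.log(np.abs(A)) + np.log(np.abs(B)), atol=1e-6)
```

In this additive log-spectral domain, a slowly varying noise or channel component can be subtracted out before applying the exponential and the inverse transform, which is the separation step the method relies on.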
Sound Sharpening
After filtering the sound, one problem we noticed was that the filtered song sounded dull and somewhat far away. Since this reduces sound quality considerably and cannot be corrected through simple filters, we needed a way to apply a transformation to the song to correct it. Using the concept of a reference track from the paper End-to-end Music Remastering System Using Self-Supervised and Adversarial Training by Junghyun Koo et al., we devised the following two methods to obtain a transformation that corrects the song.
Conceptual overview:
Conceptually, if the song we want to sharpen was recorded on equipment with a known distortion pattern, such as a reel-to-reel tape recorder, gramophone, or other such device, we can reasonably infer that there is some statistical distribution representing a mapping, or transfer function, of the distortion for that recording method. Furthermore, if we know the distribution of distortion transfer functions and the additional distortion introduced by filtering out the noise, we can create a single transfer function that maps the distorted/unsharpened song to the median sharpened song. This transfer function can then take a "dull" signal and "sharpen" it via multiplication in the frequency domain.
Method 1:
The first method we have come up with to obtain the transfer-function mapping between the distorted and sharpened song uses a separate reference track from the song we want to sharpen. Ideally this track has a wide frequency content range so that a full mapping can be performed and no additional distortion is introduced. The reference track must exist in an ideal recording at optimal audio quality, and also in a modified version that represents the distortion introduced by the recording and filtering. From this pair we can build a transfer function that maps the distorted song to the sharpened song. This is shown in the following equations, where R_ideal is the ideal sharp reference track, R_dist is the distorted reference track, H is the transfer function, X is the song we want to sharpen, and Y is the sharpened song:
H = R_ideal / R_dist        Y = X · H
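In the frequency domain this amounts to a deconvolution, which can be sketched in a few lines of NumPy (illustrative only; the eps regularization term is our assumption to avoid divide-by-zero, not part of the method as stated):

```python
import numpy as np

def sharpen_with_reference(x, r_ideal, r_dist, eps=1e-6):
    """Method 1 sketch: build H = R_ideal / R_dist from the reference
    pair, then multiply the song by H in the frequency domain."""
    n = len(x)
    R_ideal = np.fft.rfft(r_ideal, n)
    R_dist = np.fft.rfft(r_dist, n)
    H = R_ideal / (R_dist + eps)       # eps guards against near-zero bins
    return np.fft.irfft(np.fft.rfft(x) * H, n)
```

Note the caveat from above: wherever the reference has little frequency content, R_dist is near zero and H blows up, which is why the reference should span the full band.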
Method 2:
The second method we have come up with uses a portion of the song we want to sharpen as the reference track. The first step in creating the mapping is to use an audio-processing application like Audacity to clean up and sharpen the portion of the song used as the reference. Care must be taken that the frequency content of the reference track is representative of the entire song, or there may be unwanted additional distortion at frequencies outside the reference track's frequency content. Using the sharpened and original excerpts, we can create a transfer function H that maps the distorted song onto the sharpened song, and multiply the entire song by H to sharpen all of it. This is shown in the following equations, where X_sample is the original distorted reference track, X_corrected is the corrected reference track, H is the transfer function, X is the song we want to sharpen, and Y is the sharpened song:
H = X_corrected / X_sample        Y = X · H
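Method 2 can be sketched the same way, except H is estimated from frames of the cleaned excerpt and averaged so it generalizes beyond that excerpt (the frame averaging and the magnitude-only H are our assumptions, not part of the method as stated):

```python
import numpy as np

def transfer_from_excerpt(x_sample, x_corrected, n_fft, eps=1e-6):
    """Method 2 sketch: estimate |H| = |X_corrected| / |X_sample| from a
    cleaned excerpt of the song itself, averaged over frames."""
    frames = len(x_sample) // n_fft
    num = np.zeros(n_fft // 2 + 1)
    den = np.zeros(n_fft // 2 + 1)
    for i in range(frames):
        s = slice(i * n_fft, (i + 1) * n_fft)
        num += np.abs(np.fft.rfft(x_corrected[s]))
        den += np.abs(np.fft.rfft(x_sample[s]))
    return num / (den + eps)           # magnitude-only H; phase is left alone
```

The resulting |H| would then be applied frame by frame to the whole song. Averaging over frames smooths the estimate, which partly addresses the warning above about frequencies the excerpt does not cover.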
As we have not yet implemented this in MATLAB, we do not know which method will be most effective for our purposes. The main difference is that method 1 works backwards from the ideal to estimate the distortion introduced by the recording and filtering, while method 2 works forwards from the already distorted and filtered data to correct the distortion. While in theory both should work provided we know the distortion transfer function, in practice method 2 will likely be more effective: building a mapping from the distorted data to the ideal is less likely to introduce artifacts than one from the ideal to the distortion, since it is hard to replicate the exact distortion a given recording method introduces. Additionally, because the distortion across a set of recording devices forms a probability distribution, a transfer function that works well for one recording will likely be ineffective for others, and this will have to be addressed when switching between songs.
Equalizing
For the equalizer, we have come up with two implementations. The first uses a traditional approach: we multiply the input signal by a transfer function that amplifies a certain range of frequencies and leaves the magnitude of the others unchanged. The second is more experimental: we send the signal through a filter that extracts the desired frequency band, then add the extracted signal back to the original to amplify that band. For the filters, we will use a low-pass filter to extract the lows, a high-pass filter to extract the highs, and a bandpass filter to extract the mids.
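The second, experimental approach (band extraction plus add-back) is straightforward to prototype. Here is a sketch assuming a Butterworth bandpass and zero-phase filtering so the extracted band adds back in phase (our project will use MATLAB; the filter order and type here are assumptions):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def boost_band(x, fs, low_hz, high_hz, gain):
    """Extract a band and add it back scaled, boosting that band by ~gain."""
    sos = butter(4, [low_hz, high_hz], btype='bandpass', fs=fs, output='sos')
    band = sosfiltfilt(sos, x)         # zero-phase, so the band adds in phase
    return x + (gain - 1.0) * band     # gain = 1 leaves the signal unchanged
```

With gain = 2 on a 500–2000 Hz band, a 1 kHz component roughly doubles while a 100 Hz component is untouched, which is the mid-boost behavior described; the lows and highs would use low-pass and high-pass filters in the same pattern.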
There have been no challenges so far regarding the equalizer. One possible hurdle is developing a user interface for it, since none of us has experience making user interfaces in MATLAB, so we may encounter problems there.
One thing we have learned is how equalizers work. Before, they always sounded complicated, but in reality an equalizer is just a transfer function whose gain can be adjusted for certain frequencies. Understanding this is important for implementing our own equalizer for the project.