Robert Harley: Let’s start with an overview of the concept of MQA—not the technology but the objective.
J. Robert Stuart: MQA embodies a philosophy as well as a group of technologies. The philosophy starts with doing the very best that we can do with the recording. It goes back to the original event, because sound is analog; it’s created in analog and it’s analog in the air. When we listen to it, it’s analog. Our experience of it is analog.
We start with the idea that you’re sitting in a concert, and the performers are in front of you. Then we ask what we can put between ourselves and the performers that doesn’t change the sound. There is something that does change the sound, and that’s air—but totally necessary! One of the building blocks of our philosophy with MQA is that the degree to which air blurs and attenuates the sound is a yardstick for the acceptable degradation of a recording system.
All the analog methods of recording, which is to say moving the signal through time and space, are beset with the problems of analog: noise, distortion, unrepeatability, and fragility. An analog recording typically isn’t the same twice; it’s lossy, and it degrades every time you play it, whether it’s vinyl or magnetic tape. Since the early ’80s the world has wanted to use digital techniques for moving the data around for a couple of reasons, partly because it’s convenient (it’s the language of the Internet) and partly because it’s dependable.
But all that skirts around the central problem that MQA is trying to focus on, which is that converting the sound to digital is an unnatural act in the first place. The whole proposition of digital always was that once the signal is digital, we can do what we want with it. But it’s those two gateways where a lot of problems occur—taking that analog signal into digital, and digital back to analog. The philosophy of MQA is to try to get the signal in and out of digital in the most optimal way, and then use insights to transmit it efficiently.
One problem with digital audio is that the industry, because of a period of low bandwidth, got interested in lossy compression. There’s been this general trend downward, with sound quality traded off against convenience and efficiency. Applying lossy compression is not part of the recording process, nor should it ever be.
Our view is that the archival master recording is the most important thing. We should be putting something in the archive which is an over-complete description, so that in 20 years’ time, when we come to take it out, there’s more available to reconstruct the signal.
At various times in the history of recorded music, the thing that was distributed was the same as was in the archive, and at various other times it was different. But we believe that the very, very best thing should be put in the archive even if you can’t distribute it. And MQA is about that; but it’s also about delivering the sound of the studio and of the existing archive, as authentically as possible, to the listeners today.
What’s wrong with the current digital chain from the source master to the playback device? What are some of the problems in current digital that MQA addresses?
Well, the first problem is with the question, specifically what you mean by the “master”? The idea that the master is a digital file that came out of the A-to-D is a descriptor that is used pedantically by the audiophile community but loosely in the studio. In the MQA world, we suggest that the master is actually the sound that created the file in the first place. Getting access to that is the most important thing. We want access to it without it being polluted by what the A-to-D converter does to it, and we want to play it back without the pollution of the D-to-A. The playback chain has problems, but so does the recording chain. Typical analog-to-digital conversion has concepts embedded that are not ideal from the point of view of the human listener seeking high resolution.
A purely academic exercise could have created entirely new A-to-D and D-to-A converters that would be fantastic. But pragmatically, there are hundreds of millions of DAC chips out there. You’re not going to go to Apple and say, “I need you to change the chip in my iPhone.” That’s not going to happen. So we worked out how to get the best out of those DACs that are already out there.
When the engineers listen in the studio, their DAC is almost certainly not your DAC. You as a listener can’t hear it as it was heard in the studio. MQA is not only about accessing the sound inside the file, but also about managing the DAC in the studio and in the decoder on your device at home so that the two sound much closer to each other. We’re actually drilling upstream to the analog sound in the studio and downstream to the analog sound in your playback device. What we’re trying to do, conceptually, is directly connect together the modulators at both ends: the high-speed delta-sigma modulators in the A-to-D and the DAC. That’s the essence of a large step forward in transparency and accuracy, because when you do that, it all sounds more like the original analog sound.
If you’re in the recording studio you have access to the microphone feed, or the analog tape recorder, and can compare it to digital. We’ve been to scores of studios that tell us if they take an analog signal and feed it into the A-to-D and then straight into the D-to-A, what comes out the other end doesn’t sound like what goes in. It doesn’t because of the brickwall digitizing process, which creates pre-echo, post-echo, quantization of the wrong type, arithmetic noise, and temporal blur. Because of the pre-ringing and post-ringing you have to wait a long time to find out when a transient happens. That is unnatural because it’s too loosely connected to the natural world of sound.
We’ve designed MQA so that doesn’t happen. In fact, there’s no pre-ring and basically no post-ring, and everything’s compact and tight in the time domain.
The MQA file contains information about the A-to-D converter so that the decoder can correct some of the A-to-D converter’s problems.
That’s right. Part of the encoding process adds ancillary data such as date code and copyright owner. The encoder has information about the A-to-D, which tells the decoder how it was encapsulated so the decoder at the other end can use the best decapsulation formula, producing the shortest temporal blur possible for that content. It will vary according to the musical content.
What advancements have made MQA possible today?
MQA is based on insights from two sciences that have really changed in the last two decades. One is digital sampling theory, where there have been significant advances. The Nyquist Theorem, which says we can’t record signals of more than half the sample rate, is actually a guideline. In other areas of information processing, it’s been understood for some time that, within certain conditions, you can critically undersample. In other words, you can get more information down the pipe than the sampling theorem would indicate, provided the signals have a finite rate of innovation, which is true for music. For example, in medical scanning, it’s known that body imaging can be done beyond the Nyquist limit. And in areas of radio astronomy, a similar thing is true. If you want to find a planet circling a star, you have to use apodizing or undersampling in the optical domain. For some reason the audio industry has ignored these advances: “We’ve got Nyquist, and we’re done.”
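To make the classical half-sample-rate limit concrete, here is a minimal Python sketch of what happens to a pure tone above that limit when it is sampled with no anti-alias filter. The function name and the example frequencies are illustrative assumptions, not anything from MQA, and the sketch does not model the finite-rate-of-innovation sampling mentioned in the answer above.

```python
def alias_frequency(f_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after ideal sampling with no anti-alias filter."""
    # Sampling cannot distinguish f from f + k * sample_rate, so reduce modulo the rate.
    f = f_hz % sample_rate_hz
    # Fold the result into the representable range 0 .. sample_rate / 2.
    return min(f, sample_rate_hz - f)


if __name__ == "__main__":
    # Hypothetical test tones in a 48 kHz channel.
    for tone in (10_000.0, 24_000.0, 30_000.0):
        print(f"{tone:.0f} Hz is captured as {alias_frequency(tone, 48_000.0):.0f} Hz")
```

Tones at or below half the sample rate come back unchanged; a tone above it is folded down to a lower frequency, which is why conventional converters filter the input first.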
The other area of tremendous advancements has been in neuroscience.
Can you talk about how neuroscience influenced MQA’s technical direction?
We wanted to understand one of the paradoxes of digital audio, which was why 96kHz sounds better than 48, and 192 sounds a little better than 96, and 384 sounds a little better than 192, and 768 can actually sound better than 384! Each time we’ve got a doubling of data rate, but we’ve got the rapidly diminishing return of incremental sound quality.
That’s a puzzle because we can’t really hear frequencies above 20kHz. From the pure frequency-domain point of view, hi-res sampling doesn’t seem to make a lot of sense, which is why you see some skeptics who don’t actually think it through to the next stage.
And yet we observe that this is the case, and it is understood by producers and recording engineers. If you said to a recording engineer, “You can’t hear an improvement with 192k because theory says you can’t,” they’d think you were nuts. They hear it clearly; it’s part of their everyday experience. In general, higher rates sound better because when you sample at a high rate, there’s less ringing in the file and the ringing is shorter. But the file is still full of ringing. Every sample is a mini-transient that ignites that system. The higher rates sound better because the ringing is shorter, and there’s more chance to ameliorate arithmetic noise in the filters and quantizers. (In fact we separated these two mechanisms in our recent listening tests described at AES last October.) Acoustically, this is like a sheer curtain on top of the audio. If we did the same kind of processing in the visual domain, we’d think it was a joke. The idea that every edge has got parallel lines near it because of the way the filters ring would be unacceptable. The mess we’re in has been a mixture of lack of understanding combined with pragmatism.
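The ringing described above can be illustrated with a generic linear-phase low-pass filter. The following is a minimal sketch under stated assumptions: a plain truncated-sinc design, a hypothetical length of 101 taps, and a cutoff at 0.45 of the sample rate. It is not the filter of any particular converter and not MQA's filter. It shows that such a filter rings equally before and after its main peak, and that the same design occupies less time as the sample rate rises.

```python
import math


def linear_phase_lowpass(num_taps: int, cutoff_fraction: float) -> list[float]:
    """Truncated-sinc low-pass filter; cutoff_fraction is cutoff / sample_rate (0 to 0.5)."""
    centre = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - centre
        if x == 0:
            taps.append(2 * cutoff_fraction)  # limit of the sinc at its centre
        else:
            taps.append(math.sin(2 * math.pi * cutoff_fraction * x) / (math.pi * x))
    return taps


def ringing_energy(taps: list[float]) -> tuple[float, float]:
    """Return (energy before the main peak, energy after the main peak)."""
    peak = max(range(len(taps)), key=lambda i: abs(taps[i]))
    pre = sum(t * t for t in taps[:peak])
    post = sum(t * t for t in taps[peak + 1:])
    return pre, post


def filter_span_us(num_taps: int, sample_rate_hz: float) -> float:
    """Time occupied by the whole impulse response, in microseconds."""
    return (num_taps - 1) / sample_rate_hz * 1e6


if __name__ == "__main__":
    taps = linear_phase_lowpass(num_taps=101, cutoff_fraction=0.45)
    pre, post = ringing_energy(taps)
    print(f"energy before peak: {pre:.6f}, after peak: {post:.6f}")
    for rate in (48_000, 96_000, 192_000):
        print(f"{rate} Hz: response spans {filter_span_us(101, rate):.0f} microseconds")
```

The sketch only demonstrates the symmetry and the time scaling; whether the ringing is audible is the claim made in the interview, not something this code tests.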
Imagine where we started. It’s weird that faster and faster sampling can sound better. We puzzled about it. And then a light bulb went on. We concluded that the problem is a mismatch between how (specifically linear) PCM samples and reconstructs music, and the human neuro system.
All through my career, we’ve looked to psychoacoustics to understand the bridge between the measurement domain and the observed domain. It’s very useful. There’s a lot you can learn from it.
Now neuroscience has brought insights that are just terrifically exciting to me. That doesn’t mean there are neuroscientists working on high-end audio; nobody’s funding it. But enough comes out of the work being done that we can apply to our understanding of audio and human hearing.
The more we work and study in this area, the more we realize that our hearing is incredibly exquisite. In fact, hearing’s arguably our most important sense. It’s probably the primary sense of survival, whereas vision you could argue is about purpose; seeking and hunting. But hearing is defensive, it works in the dark, it works around corners, and it’s also massively robust, which is why it gets us into a lot of trouble when it comes to recording and reproducing music. It’s why we can tolerate MP3 and worse; it’s why we can listen to music across the park and still follow the song.
Neuroscience has shown us that our hearing is highly adapted to its natural role, and is arguably at its best in dealing with the sounds that it’s been dealing with for the longest period of time, which are environmental sounds.
Humans and many other animals to a large extent share similar physical and neural structures for hearing. Another thing we all have in common is we live on the same planet. The background sounds are environmental—the wind, the rain, running water, rustling leaves, and sounds like that are part of the natural world. Wherever we go on the planet and use a microphone to measure a signal, you see that the environmental sounds are incredibly consistent. There are louder and quieter districts, but one thing they all have in common is that the sound structure is very “pink.” [Pink noise has decreasing amplitude with frequency and constant power per octave.] There’s more energy at low frequencies than at mid frequencies, and more at mid than at high.
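The bracketed definition of pink noise can be checked with a little arithmetic. This is a minimal sketch assuming an ideal 1/f power spectral density in arbitrary units; the function names are illustrative. Integrating 1/f across any octave gives the same value, ln 2, while the density itself falls by about 3 dB for each octave climbed.

```python
import math


def pink_band_power(f_low_hz: float, f_high_hz: float) -> float:
    """Power of ideal 1/f noise between two frequencies (arbitrary units).

    The integral of 1/f from f_low to f_high is ln(f_high / f_low).
    """
    return math.log(f_high_hz / f_low_hz)


def pink_density_change_db(octaves: float) -> float:
    """Change in power spectral density after moving up by the given number of octaves."""
    return -10 * math.log10(2 ** octaves)


if __name__ == "__main__":
    for low in (100.0, 1_000.0, 10_000.0):
        print(f"{low:.0f}-{2 * low:.0f} Hz: power {pink_band_power(low, 2 * low):.4f}")
    print(f"density change per octave: {pink_density_change_db(1):.2f} dB")
```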
Why is that important? Well, it’s telling us that the background noise in itself has structure, which means it’s predictable at the micro level as well as at the macro level. It’s actually fractal. And when you know that, you know something very important about how to code it, because it is not a full spectrum, rectangular signal. It’s not a full-scale tone at 19kHz in a CD channel, nor full-scale at 36kHz in a 96kHz channel, not by any stretch. And in fact, environmental sounds have almost no frequency; they are stop-start transients.
Against that background, birds and animals communicate with a variety of different vocalizations, which are optimized for what has to be communicated and the distance over which they want to do it. The major insight from neuroscience is to say: “Here are the classes of sounds to which the species is adapted, and these are the sounds we should use to test.”
That’s where neuroscience diverges from psychoacoustics because the psychoacoustician takes a sort of engineering viewpoint. She’ll say, “Here’s a human, I’m going to play him sine waves at different levels, beeps and chirps and hisses, and I’ll find out what the hearing system can do with those kind of signals.” But these are artificial signals, testing a very complicated, highly adapted machine. In fact, you can learn more by playing recordings of the rainforest floor and vocalizations, whilst doing the things that neuroscience has been able to do that it couldn’t do fifteen years ago, specifically, real-time functional MRI that shows how neurons respond.
Most information in environmental sounds is transient. The human hearing system deals with these three types of sounds—animal sounds, speech sounds, and environmental sounds—differently. It’s important to know that music, statistically, lives in the area between the three sounds I mentioned.
We believe that music evolved to please our ears, not the other way around. It isn’t like music came out of nowhere. These sounds have predictability, they have structure relating to movement, they build a language of emotion, and they have a statistical basis that tells us a lot about how to encode it efficiently. Just knowing this allows us to make savings in the data rate and to capture a much more precise time-frequency balance than we can if we design a system that is based on an arbitrary frequency-based model.
There are a very large number of neurons traveling to the ear, which was a puzzle for me. It’s a much more complicated model than the idea that your ears are microphones that feed signals to your brain. There are neurons going to your ears because of an active process inside the cochlea, controlled by the cortex and using the outer hair cells to adaptively change sensitivity and selectivity; neuroscientific modeling suggests that our ears respond differently to the three types of sounds I mentioned. Literally, the ears are tuned continuously in real-time according to what you’re listening for. It’s like how your eyes automatically focus without you being aware of it.
There are some quite interesting studies showing that the human is approximately five times more sensitive to time than to frequency. The traditional audio engineer brought up in what I would call “the BBC era,” and then “the digital era,” would believe that the two are equivalent, so that 20kHz meant 50 microseconds, approximately.
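The "20kHz meant 50 microseconds" equivalence is simply the reciprocal of the frequency. Here is a minimal sketch of that arithmetic, with the sample spacing at common rates printed for scale. The function name and the list of rates are illustrative assumptions, and the spacing between samples is not claimed to equal a system's temporal resolution.

```python
def period_us(frequency_hz: float) -> float:
    """Duration of one cycle (or one sample interval) in microseconds."""
    return 1e6 / frequency_hz


if __name__ == "__main__":
    print(f"one cycle at 20 kHz lasts {period_us(20_000):.1f} microseconds")
    # Spacing between successive samples at common sample rates.
    for rate in (44_100, 48_000, 96_000, 192_000):
        print(f"{rate} Hz: samples are {period_us(rate):.2f} microseconds apart")
```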
In several circumstances humans beat the time-frequency uncertainty limit by factors of five, or even ten or more. Experienced listeners, particularly musicians and conductors, score better on these tests; it has more to do with experience than age. The human has greater temporal resolution than microphones currently in use. A human can hear a three-microsecond time interval in the right circumstances, whereas a microphone typically blurs at least 10 microseconds.
So what have we got here? We’ve got an understanding that the ear actually switches into different modes to listen to different types of sounds. We’ve got an understanding that time is more important than frequency. And the other thing we have is an understanding that as signals go up the brain stem, features are extracted using highly non-linear processes, and temporal uncertainty is reduced by parallel processing.
They’ve found neurons that respond to the envelope of a signal. Hardwired in the hearing system are the features of natural sounds, including that they tend to start loud and get softer. When you hear a piano note, it starts loud, drops off. If you try to do a time-frequency balance with sounds that are time-reversed, you get a very bad result. We don’t expect sounds to build up and then stop suddenly; it’s unnatural. And yet in the classical digital sampling process, we had an envelope of the ring of the pulse. Instead of a transient starting from nothing and then stopping as the original sound does, you’ve got this ringing before and after. It was once believed that this ring was so high in frequency that you couldn’t hear it. Wrong: You hear the envelope.
If the recording system adds pre-ring and post-ring to transient sounds, that means they are extended in duration, with uncertain positioning that fires the startle reflex. The sounds are alarming, startling. We did things in player design to try to fix this, and we can remove the pre-ringing to make the presentation as good as it can be. But you can’t authentically fix both the post- and pre-ring at playback without access to the studio.
Psychoacoustic and neuroscientific insights suggest that ten microseconds is a good target for resolution. And by resolution, we don’t mean bits and we don’t mean sample rate; they are comparatively meaningless numbers from the coding. Resolution is better described as the shortest event that can be carried by the system, by means of which we can tell two things apart. If a twig snaps behind you, you know from the transient’s leading and trailing edges where it is and you know how far away it is. Whenever there’s a sound, there are also reflections and reverberation. Our brains are able to separate direct sound from the reverberation if they’re not blurred. But if these sounds are blurred or smeared by a digital filter, or by an analog filter that’s too slow, then the details become smudged together. The listener will hear the sound as bright, or be unable to judge depth; you can’t locate the sounds, or there’s something wrong with the decay.
Temporal resolution is critically important. In the natural world the only sounds that have pitch are the lower speech frequencies, animal or bird songs. Everything else has to do with fricatives, sibilance, transients, and so on. Our hearing is adapted to the sounds that are the most helpful to survival.
The more we read about recent neuroscience research, the more we learn the ways in which neurons combine and fire in pattern recognition, the more we appreciate that our hearing is a totally exquisite machine and that you mess with coding a recording at your peril.
The idea that you could make a better codec by learning which bits to throw away is total anathema. That’s one of the big distinctions with MQA. We are definitely not saying that neuroscience teaches us better which bits to throw away. No. It teaches us which bits to focus on most.
Compare it to vision. You’ve got maybe a hundred million “pixels” in the eye. Quite a large percentage of them are in the fovea, and wherever you focus attention, you have extreme resolution. Your ears do that. They can focus their resolution, their attention, on different sound types, streams and locations. Our hearing is most sensitive at moderate loudness and MQA uses its highest precision in that region. We do know three things you can’t hear: You can’t hear continuous frequencies much above 20kHz, you can’t hear sounds that are quieter than the threshold, and you can’t hear sounds that are well below environmental background noise. There are limits on the hearing system, but everything else, everything that’s inside the triangle of music in our origami drawings, is critical.
We don’t discard any information in the area of the music. We are folding it instead of cutting it, and using an area that is quieter than silence to bury some of the folded information and bring it back. When we do that, we end up with a very low data rate, and no invasion of the audio.
It’s hard to think of conventional PCM as anything but crude and inefficient. A lot of those bits are baggage.
Oh, total baggage. People like 24-bit LPCM in the studio because it gives more room for error. It allows smaller steps but can also code 60 to 80dB below silence and below the thermal noise of air! It’s ridiculous in terms of distribution channel capacity to transmit information far below silence; somewhere down there data’s being used that has no meaning.
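For a sense of scale, the span that a given word length can code follows from the number of quantization steps. This is a minimal sketch using the textbook figure 20·log10(2^N); it ignores dither and noise shaping, and it does not say where "silence" sits on that scale, which is the reference point used in the answer above. The function name is an illustrative assumption.

```python
import math


def pcm_range_db(bits: int) -> float:
    """Ratio between full scale and one quantization step, in decibels."""
    return 20 * math.log10(2 ** bits)


if __name__ == "__main__":
    for bits in (16, 24):
        print(f"{bits}-bit LPCM spans about {pcm_range_db(bits):.1f} dB")
    extra = pcm_range_db(24) - pcm_range_db(16)
    print(f"the extra 8 bits add about {extra:.1f} dB below the 16-bit floor")
```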
Similarly, there’s a lot of unused coding area in the first octave above 20kHz, and even more around the next octave if it’s 192kHz, and even more than that if it’s 384 kHz; these areas have no signal, simply benign inaudible noise. The higher the sample rate the more inefficient it gets. It gets progressively, preposterously inefficient.
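The growth in raw data rate is easy to quantify: uncompressed LPCM costs sample rate times word length times channel count. This is a minimal sketch assuming two channels and no container or packing overhead; the function name is an illustrative assumption.

```python
def pcm_bit_rate(sample_rate_hz: int, bits: int, channels: int = 2) -> int:
    """Raw bit rate of uncompressed LPCM in bits per second."""
    return sample_rate_hz * bits * channels


if __name__ == "__main__":
    cd_rate = pcm_bit_rate(44_100, 16)
    print(f"CD (44.1 kHz / 16-bit): {cd_rate / 1e6:.2f} Mbit/s")
    for rate in (48_000, 96_000, 192_000, 384_000):
        bps = pcm_bit_rate(rate, 24)
        print(f"{rate} Hz / 24-bit: {bps / 1e6:.2f} Mbit/s ({bps / cd_rate:.1f}x CD)")
```

Each doubling of the sample rate doubles the bit rate, regardless of how much signal actually occupies the added bandwidth.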
What’s been the response by recording and mastering engineers who have heard MQA?
It’s been uniformly superb. In the early stages, when we were doing our research, we decided that the only thing we could do was go to the source to let mastering and recording engineers listen to our system. And after about a dozen of these experiences, where they gave us the same answer, basically “How the hell did you do that?”, we realized that we really were on to something. Because we removed the defects in the file they were giving us and better managed the DAC, they were hearing something much closer to the analog source. They’d get excited, because they’d hear things that they’re not able to hear outside the studio: the natural sound of the recording. Another unexpected thing that happened was that, by taking away the grunge, the engineer could consider how to make better choices.
From the label point of view, MQA is a set of tools that allows a rebirth of the ideal of recorded music, with everyone moving in the same direction. It solves the industry’s problem of having up to 40 different deliverables [file formats and physical media] to make, each of which has actually got nothing to do with making the record. We can get the studio sound to the listener at a data rate that can be streamed, or in a small download file, with a single mechanical. That also means the customer can buy the file and play it anywhere, without hassle.
The MQA light on the decoder indicates the end-to-end authentication.
Exactly. The instructions to the decoder are buried in the file in the studio. The engineers can listen to it on an MQA-managed DAC and know that what they hear is exactly what the listener with an MQA DAC will hear. When the MQA light comes on you know that everything in the chain is right—you haven’t screwed up the data, the computer settings are right, and it’s bit-perfect from the studio to the listener.
The decoder can be software that’s built into a DAC or other decoding chips across a wide variety of devices.
Exactly. The decoder can be soft or hard or a mixture. We’re talking to chipmakers about building part of MQA decoding into the DAC. Making a DAC that understands how to render MQA natively is easier than making a DAC that understands how to render 48kHz natively. We imagine and hope that over time some of the pragmatic parts—which have to do with how to optimize the performance of any DAC—can be done ever better.
When do you expect the first MQA-encoded recordings to be released?
Second quarter, maybe sooner. One thing we haven’t talked about is backward compatibility. The hierarchical download file that we end up with is typically 1x, which means it’s 44.1 or 48kHz and 24-bit. And that file, if it finds a decoder, unwraps perfectly. You’ve heard it; I mean, it’s extraordinary.
If you don’t have a decoder, you can play it back without a decoder because it is PCM. MQA turns PCM into PCM. When you play it back, it’ll play back on a legacy system sounding better than a CD. And it sounds better than CD because the noise floor is properly managed and the signal has been pre-apodized.
So you get great sound on legacy players, and it means that you can take the single mechanical and put it in your car, you can put it in Sonos, you can put it in iTunes, you can put it on a phone, and get better-than-CD quality. But when it hits the decoder, the decoder will then give you the best that the DAC can produce. So if it’s in an iPhone, for example, that can’t go faster than 48k, it’ll authenticate it and manage the DAC as best it can. On an iPad, it will play back at 96k and give you a better rendering. If the DAC can do 192, 384 or 768 kHz, it will be programmed by the maker to produce the best sound.
This has obviously been an epic technical project. But then after completing the engineering, you have the hurdles of establishing it with record companies, hardware manufacturers, and consumers. Many of the principles on which MQA is based challenge the current paradigm. Why take on this mammoth, uphill battle at this stage in your career?
It had to be done. I’ve worked in this area and cared passionately about sound quality and recorded music all my life, and Peter Craven’s just the same. We thought that this is a problem that had to be solved, and when we discovered a way to solve it, we had a very bad moment because we realized that we were tilting at windmills and dismantling a few sacred cows. You and I had a very interesting discussion about this [at CES a month earlier]. In fact, I just got the Kuhn book you recommended [Thomas S. Kuhn’s The Structure of Scientific Revolutions]. So we knew we had a terrible uphill battle because we’re flying in the face of established thinking. But I don’t care, because we know it’s true, and we know it’s right, and it should be done. We also understand the nature of the technical problems, the problem of the music industry, and the fact there needs to be a solution like this. And recorded music is important.
So there is great music in the archive that has to be brought out, and we have to teach people how to put the best thing in the archive, because the art form is so important, music is so important to us, we have to do it right.
The studios have got all this content they won’t release because they don’t want to put their big, fat, so-called “master files” out to the stores because they will end up on a Torrent site. If you’re a media company, imagine how bad it is after a decade when your entire cupboard is in the cloud and in the public domain. That’s not going to work. So they have to have a manageable way of improving the archive and selling the content in order to keep the business going for our benefit, because we want these great recordings.
Also, how could you have known how to do this and not done it?
Well, without wishing to draw any comparisons of significance, I have walked on the beach and thought, “Now I know how Copernicus felt.” [laughs] He knew something, but it was going to be painful to tell the story. But we might make a big improvement in how the next generations can enjoy music, so how can we stay silent? I guess I’m luckier, because the worst we’ll get is the wrath of audiophiles and scientists, not excommunication.