Acoustic Properties of Sound
Sound waves are the foundation of spoken language. When you speak, your vocal folds vibrate and send longitudinal waves through the air. These waves carry specific acoustic information that listeners decode into meaningful speech. Three core properties define any sound wave:
- Frequency measures how many wave cycles occur per second, expressed in Hertz (Hz). Higher frequency means higher perceived pitch. Human speech typically falls between about 85 Hz and 8,000 Hz.
- Amplitude is the maximum displacement of the wave from its resting position. It relates to how loud a sound is, measured in decibels (dB). A larger amplitude means a louder sound.
- Wavelength is the physical distance between consecutive wave peaks (or troughs), measured in meters. Wavelength is inversely proportional to frequency: higher frequency means shorter wavelength.
These three properties connect through the speed of sound equation:
v = f × λ

Here, v is the speed of sound (approximately 343 m/s in air at 20°C), f is frequency, and λ is wavelength. So if you know any two values, you can calculate the third.
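As a quick sanity check, here is a minimal sketch that applies the speed-of-sound relation to the speech range mentioned above (the function names are illustrative):

```python
# Sketch: relating frequency and wavelength via v = f * wavelength,
# using v ~ 343 m/s (air at 20 degrees C).

SPEED_OF_SOUND = 343.0  # m/s in air at 20 C

def wavelength(frequency_hz: float) -> float:
    """Return wavelength in meters for a given frequency in Hz."""
    return SPEED_OF_SOUND / frequency_hz

def frequency(wavelength_m: float) -> float:
    """Return frequency in Hz for a given wavelength in meters."""
    return SPEED_OF_SOUND / wavelength_m

# Endpoints of the typical speech range mentioned above:
print(f"85 Hz   -> {wavelength(85.0):.2f} m")    # about 4 m
print(f"8000 Hz -> {wavelength(8000.0):.3f} m")  # about 4.3 cm
```

Note how the two ends of the speech range differ by roughly two orders of magnitude in wavelength, which is one reason low and high speech frequencies behave so differently in rooms.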
Speech Production and Perception

Articulatory vs. acoustic characteristics
The source-filter theory is the key framework here. Think of speech production in two stages: the larynx generates a raw sound (the source), and then the vocal tract shapes that sound (the filter). Different configurations of the tongue, lips, and jaw create different filter shapes, which is why changing your mouth position changes the sound that comes out.
Vowels are characterized by formants, which are the resonant frequencies of the vocal tract. Two formants matter most for identifying vowels:
- F1 (the first formant) correlates inversely with tongue height. A high F1 value means a low tongue position (as in "ah"), and a low F1 means a high tongue position (as in "ee").
- F2 (the second formant) correlates with tongue advancement. A high F2 means the tongue is pushed forward (front vowels like "ee"), while a low F2 means the tongue is pulled back (back vowels like "oo").
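The F1/F2 logic above can be sketched as a toy nearest-neighbor classifier. The reference values below are approximate averages for an adult male speaker and are illustrative, not normative:

```python
# Sketch: mapping measured (F1, F2) pairs to rough vowel qualities
# by nearest-neighbor distance. Reference formant values are
# approximate and vary by speaker.

import math

# (F1 Hz, F2 Hz) reference points, approximate adult male averages
REFERENCE_VOWELS = {
    "i (as in 'ee')": (270, 2290),  # high front: low F1, high F2
    "u (as in 'oo')": (300, 870),   # high back: low F1, low F2
    "a (as in 'ah')": (730, 1090),  # low: high F1, lower F2
}

def nearest_vowel(f1: float, f2: float) -> str:
    """Return the reference vowel closest to the measured formants."""
    return min(
        REFERENCE_VOWELS,
        key=lambda v: math.dist((f1, f2), REFERENCE_VOWELS[v]),
    )

print(nearest_vowel(280, 2200))  # low F1 + high F2 -> "ee"-like
```

Real formant-based classification has to normalize for speaker differences (vocal tract length shifts all formants), but the geometric idea is the same.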
Consonants produce their own distinct acoustic signatures:
- Stops (like /p/ or /b/) show a brief period of silence followed by a burst of energy on a spectrogram.
- Fricatives (like /s/ or /f/) generate continuous turbulent noise, visible as fuzzy high-frequency energy.
- Nasals (like /m/ or /n/) display anti-formants, which are bands of reduced energy caused by sound resonating in the nasal cavity.
Voice onset time (VOT) measures the delay between when a stop consonant is released and when the vocal folds start vibrating. A short VOT (or even a negative one, where voicing starts before the release) signals a voiced stop like /b/. A longer, positive VOT signals a voiceless or aspirated stop like /p/.
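The VOT categories can be sketched as a simple threshold rule. The ~25 ms crossover used here is an illustrative English-like value; the actual boundary is language-dependent:

```python
# Sketch: labeling a stop consonant by voice onset time (VOT).
# The 25 ms short-lag/long-lag threshold is approximate and
# language-dependent.

def classify_stop(vot_ms: float) -> str:
    """Label a bilabial stop from its VOT in milliseconds."""
    if vot_ms < 0:
        return "prevoiced (voicing begins before release)"
    if vot_ms < 25:  # short-lag: perceived as voiced
        return "voiced, e.g. /b/"
    return "voiceless aspirated, e.g. /p/"  # long-lag

print(classify_stop(-60))  # prevoiced
print(classify_stop(10))   # short-lag /b/
print(classify_stop(70))   # long-lag /p/
```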
One more concept: coarticulation. Your mouth doesn't reset to a neutral position between every sound. Instead, articulatory gestures overlap. You might round your lips during a consonant because the next vowel is rounded (anticipatory coarticulation), or a vowel might be nasalized because the preceding consonant was nasal (carryover coarticulation). This overlap creates smooth acoustic transitions between sounds rather than sharp boundaries.

Process of speech perception
Hearing speech involves a chain of events from the ear to the brain:
- The outer ear collects sound waves and funnels them down the ear canal, which naturally amplifies certain frequencies.
- The middle ear transmits vibrations through three tiny bones called ossicles (malleus, incus, stapes), amplifying the signal as it passes to the inner ear.
- In the inner ear, the cochlea converts mechanical vibrations into electrical nerve signals. The basilar membrane inside the cochlea is organized tonotopically, meaning different positions along it respond to different frequencies (high frequencies near the base, low frequencies near the apex, the tip).
- The auditory nerve carries these electrical signals to the auditory cortex in the brain, where they're processed as speech.
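The tonotopic mapping in the cochlea is often approximated by the Greenwood function, f = A(10^(ax) − k). The human constants below (A = 165.4, a = 2.1, k = 0.88, with x the fractional distance from apex to base) are the commonly cited fit:

```python
# Sketch: Greenwood function for the human cochlea.
# f = A * (10**(a*x) - k), with x in [0, 1] measured from the
# apex (tip) to the base. Constants are the standard human fit.

def greenwood_hz(x: float) -> float:
    """Approximate best frequency (Hz) at fractional cochlear position x."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

print(f"apex   (x=0.0): {greenwood_hz(0.0):8.1f} Hz")  # low frequencies
print(f"middle (x=0.5): {greenwood_hz(0.5):8.1f} Hz")
print(f"base   (x=1.0): {greenwood_hz(1.0):8.1f} Hz")  # high frequencies
```

The endpoints land near 20 Hz and 20 kHz, matching the conventional limits of human hearing, with the speech-critical range occupying roughly the middle of the membrane.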
Beyond this physical pathway, perception involves some fascinating cognitive processes:
- Categorical perception is why you hear a clean distinction between /b/ and /p/ even though the acoustic difference (VOT) is actually a smooth continuum. Your brain sorts sounds into discrete phoneme categories rather than perceiving every tiny acoustic variation.
- Top-down processing means you use context and linguistic knowledge to fill in gaps. If a word is partially masked by noise, your brain can often reconstruct it from the surrounding sentence.
- The McGurk effect shows that speech perception isn't purely auditory. If you hear "ba" but see someone mouth "ga," you'll often perceive "da." Visual cues from lip and mouth movements actively shape what you think you're hearing.
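Categorical perception of the VOT continuum is often modeled as a steep logistic identification function. The boundary (~25 ms) and slope below are illustrative, not measured values:

```python
# Sketch: a logistic identification function over the VOT continuum.
# A steep slope makes responses cluster at the category endpoints
# even though the acoustic input varies smoothly.

import math

def prob_hear_p(vot_ms: float, boundary: float = 25.0,
                slope: float = 0.5) -> float:
    """Probability a listener reports /p/ rather than /b/ at this VOT."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

# Smooth acoustic continuum, near-categorical responses:
for vot in (0, 15, 25, 35, 50):
    print(f"VOT {vot:2d} ms -> P(/p/) = {prob_hear_p(vot):.2f}")
```

Only tokens very close to the boundary produce uncertain responses, which is exactly the pattern identification experiments find.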
Tools for acoustic analysis
Linguists use several tools to study the acoustic properties of speech. Most of these are available in Praat, a free software program widely used in phonetics research.
- Waveforms plot amplitude over time. They're useful for identifying syllable boundaries, measuring segment durations, and finding the exact moment of a stop burst for VOT measurement.
- Spectrograms display frequency on the vertical axis and time on the horizontal axis, with intensity shown as darkness or color. Formants appear as dark horizontal bands, and you can see transitions between sounds, bursts, and fricative noise all in one image.
- Formant analysis tracks F1 and F2 values over time. Plotting F1 against F2 for different vowels produces a vowel space chart, which is a standard way to compare vowel systems across speakers or languages.
- VOT measurement pinpoints the time gap between a stop release burst and the onset of voicing. For example, English aspirated /p/ might have a VOT around 50–80 ms, while unaspirated /b/ might be near 0 ms or slightly negative.
- Pitch tracking traces the fundamental frequency (F0) contour across an utterance, revealing intonation patterns (like the rise at the end of a question) and stress placement.
- Intensity analysis measures relative loudness across segments, helping identify stressed syllables and prominence patterns in connected speech.
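To make pitch tracking concrete, here is a minimal autocorrelation F0 estimator, the same basic idea Praat's pitch tracker refines. It is tested on a synthetic 200 Hz sine; real speech additionally needs windowing, a voicing decision, and octave-error handling:

```python
# Sketch: autocorrelation pitch estimation. Find the lag (in samples)
# that maximizes the signal's correlation with a delayed copy of
# itself, restricted to a plausible F0 range, then convert to Hz.

import math

def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 as sample_rate / best_lag within [f0_min, f0_max]."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i - lag]
                   for i in range(lag, len(samples)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag

# 0.1 s of a 200 Hz sine at 16 kHz: the period is exactly 80 samples.
rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(1600)]
print(f"estimated F0: {estimate_f0(tone, rate):.1f} Hz")
```

The 75–500 Hz search range mirrors Praat's default pitch floor and a typical ceiling for speech; narrowing it is the standard defense against halving and doubling errors.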