• Telco (voice) channel

    From Don Y@3:633/10 to All on Friday, June 05, 2026 13:27:53
    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?
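    One way to guesstimate the passband of the current call is to average the
    power spectrum of whatever arrives over a few seconds and see where it
    falls off. A minimal numpy sketch, assuming the far end is captured at a
    rate well above 8 kHz (16 kHz in the check below). The 512-point frames
    and the 20 dB threshold are arbitrary choices. On live speech the result
    describes speech-times-channel, not the channel alone.

```python
import numpy as np


def long_term_spectrum(x, fs, nfft=512):
    """Average power spectrum over Hann-windowed frames with 50% overlap."""
    x = np.asarray(x, dtype=float)
    win = np.hanning(nfft)
    hop = nfft // 2
    frames = [x[i:i + nfft] * win for i in range(0, len(x) - nfft + 1, hop)]
    psd = np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in frames], axis=0)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, psd


def estimate_passband(x, fs, drop_db=20.0, nfft=512):
    """Lowest and highest frequency whose long-term level is within drop_db of the peak."""
    freqs, psd = long_term_spectrum(x, fs, nfft)
    level_db = 10.0 * np.log10(psd + 1e-20)
    inside = np.nonzero(level_db >= level_db.max() - drop_db)[0]
    return float(freqs[inside[0]]), float(freqs[inside[-1]])
```

    A conventional narrowband path should come out somewhere near 300 Hz to
    3.4 kHz. A wideband ("HD voice") path should show energy out toward 7 kHz.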

    --- PyGate Linux v1.5.15
    * Origin: Dragon's Lair, PyGate NNTP<>Fido Gate (3:633/10)
  • From John R Walliker@3:633/10 to All on Friday, June 05, 2026 23:23:12
    On 05/06/2026 21:27, Don Y wrote:
    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    Back in those days - well around year 2000 at least - a rather
    different model actually applied. This was dc to 4kHz.
    I tested phone calls between the UK and USA and Canada and
    also across Europe and was usually able to pass dc across the
    phone network.
    This did require using ISDN equipment at each end. Even that
    far back most of the global phone network was digital and it
    was only the codecs at each end that defined the frequency
    response. Many codecs had a high-pass corner at around
    150Hz and an anti-aliasing filter at about 3.7kHz.
    However, the GSM standard rate codec had a high-pass at
    about 80Hz. There was a problem because the GSM specification
    for that codec had an error in the filter coefficients which
    meant that it only slightly attenuated dc. This was a problem
    when echo cancelers treated dc just like any other frequency.
    A major international project had severe difficulties because
    the silence inserted between voice recognizer prompts was
    accidentally coded in u-law when it should have been A-law.
    The code mismatch resulted in a dc offset during the silences
    which caused the silences to take priority when the GSM codec
    happened to be working in half duplex mode causing the
    user responses to be lost.
    The bottom line is - if you really want to know the channel
    response you will have to measure it yourself.
    John
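    The u-law/A-law mix-up above can be shown directly. This sketch implements
    the two G.711 expansions, following the widely circulated public-domain
    reference decoder (Sun's g711.c). Treat the exact values as illustrative.
    The u-law code for zero (0xFF) decodes to a steady +848 when it is read as
    A-law. That is about 2.6% of full scale, a constant dc offset where
    silence was intended.

```python
def ulaw_to_linear(code):
    """Expand an 8-bit G.711 u-law code to a 16-bit linear sample."""
    bias = 0x84
    u = ~code & 0xFF                    # u-law codes are transmitted inverted
    t = ((u & 0x0F) << 3) + bias        # mantissa plus bias
    t <<= (u & 0x70) >> 4               # segment (exponent)
    return bias - t if u & 0x80 else t - bias


def alaw_to_linear(code):
    """Expand an 8-bit G.711 A-law code to a 16-bit linear sample."""
    a = code ^ 0x55                     # A-law inverts alternate bits
    t = (a & 0x0F) << 4                 # mantissa
    seg = (a & 0x70) >> 4               # segment (exponent)
    if seg == 0:
        t += 8
    else:
        t += 0x108
        if seg > 1:
            t <<= seg - 1
    return t if a & 0x80 else -t


ULAW_SILENCE = 0xFF
print(ulaw_to_linear(ULAW_SILENCE))  # 0: decoded correctly, silence
print(alaw_to_linear(ULAW_SILENCE))  # 848: same byte decoded as A-law
```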


  • From Don Y@3:633/10 to All on Friday, June 05, 2026 16:26:30
    On 6/5/2026 3:23 PM, John R Walliker wrote:
    On 05/06/2026 21:27, Don Y wrote:
    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    Back in those days - well around year 2000 at least - a rather
    different model actually applied. This was dc to 4kHz.
    I tested phone calls between the UK and USA and Canada and
    also across Europe and was usually able to pass dc across the
    phone network.
    This did require using ISDN equipment at each end. Even that
    far back most of the global phone network was digital and it
    was only the codecs at each end that defined the frequency
    response. Many codecs had a high-pass corner at around
    150Hz and an anti-aliasing filter at about 3.7kHz.

    But, one can't be sure of how point A and point B are
    interconnected, even if you know the type of "connection"
    at one end.

    E.g., I can call a neighbor using an old-fashioned land line.
    Or, call the same neighbor from a cell phone. Or, call the
    *neighbor's* cell phone. etc.

    However, the GSM standard rate codec had a high-pass at
    about 80Hz. There was a problem because the GSM specification
    for that codec had an error in the filter coefficients which
    meant that it only slightly attenuated dc. This was a problem
    when echo cancelers treated dc just like any other frequency.
    A major international project had severe difficulties because
    the silence inserted between voice recognizer prompts was
    accidentally coded in u-law when it should have been A-law.
    The code mismatch resulted in a dc offset during the silences
    which caused the silences to take priority when the GSM codec
    happened to be working in half duplex mode causing the
    user responses to be lost.
    The bottom line is - if you really want to know the channel
    response you will have to measure it yourself.

    That's just not practical. If I called you, today, how could
    *you* characterize the connection? Would that characterization
    remain unchanged even if the call went on for hours? What if
    we dropped the connection and reestablished it?

    Now, instead of "I" and "you", imagine two arbitrary individuals
    with no technical expertise and no *interest* in any of that...
    Does party A "sound different" (than expectations) because of
    a head cold? Or, some characteristic of the particular connection?

  • From Don Y@3:633/10 to All on Friday, June 05, 2026 16:27:46
    On 6/5/2026 3:01 PM, Nioclás Pól Caileán de Ghloucester wrote:
    Telephone-company employees must sign documentation promising that they
    will not enact unfaithful propagations of persons' voices, but
    completely faithful telephones do not exist.

    It's not just that they aren't "faithful" but, rather, that
    they aren't *repeatable*.

  • From Martin Brown@3:633/10 to All on Saturday, June 06, 2026 15:55:01
    On 05/06/2026 21:27, Don Y wrote:
    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    You may be able to infer a rough idea of the channel frequency response
    by taking an FFT of a chunk digitised at 44kHz. But I'd be more inclined
    to low pass filter everything above 4kHz and use 8b @ 8kHz.

    It should be good enough for the intended purpose of speech at that.

    --
    Martin Brown


  • From Don Y@3:633/10 to All on Saturday, June 06, 2026 12:36:33
    On 6/6/2026 7:55 AM, Martin Brown wrote:
    On 05/06/2026 21:27, Don Y wrote:
    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    You may be able to infer a rough idea of the channel frequency response
    by taking an FFT of a chunk digitised at 44kHz. But I'd be more inclined
    to low pass filter everything above 4kHz and use 8b @ 8kHz.

    It should be good enough for the intended purpose of speech at that.

    I typically am dealing with speech that is conveyed in open air
    or via a relatively high bandwidth channel. I continually update my
    models based on my knowledge of the speaker's identity AND THE
    "unimposing" CHARACTERISTICS OF THE CHANNEL.

    But, also have to handle the likely possibility of a telecom channel
    for the same speaker.

    This both distorts their perceived speech AND would bias the
    model that has been built without those constraints.

    If all you are doing is *recognition*, it's not a problem. But,
    for diarization and identification, it lowers their effectiveness.
    Lack of sidetone and the latency of modern telco already leaves
    you with a significant challenge!

    As an "inverse filter" can't reconstruct characteristics that have
    been discarded by a lower sampling frequency, I have two choices:
    - run the trained models through an appropriate filter that
    models the telco channel
    - build a separate model for *just* the telco channel
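    The first option might be sketched as below. In practice it means pushing
    the wideband training audio through a crude channel model and re-deriving
    the models from the result. Assumptions: a 2nd-order 150 Hz high-pass and
    a 4th-order 3.4 kHz low-pass (corner frequencies taken loosely from the
    figures quoted in this thread), built from RBJ "Audio EQ Cookbook"
    biquads, followed by decimation to 8 kHz. It does not model codec
    artifacts, echo cancellers or packet loss.

```python
import math


def biquad_coeffs(kind, f0, fs, q):
    """RBJ cookbook low-pass/high-pass biquad, returned as (b0, b1, b2), (a1, a2) with a0 = 1."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    cw = math.cos(w0)
    if kind == "lowpass":
        b = [(1 - cw) / 2, 1 - cw, (1 - cw) / 2]
    elif kind == "highpass":
        b = [(1 + cw) / 2, -(1 + cw), (1 + cw) / 2]
    else:
        raise ValueError(kind)
    a0 = 1 + alpha
    return [bi / a0 for bi in b], [-2 * cw / a0, (1 - alpha) / a0]


def biquad(x, b, a):
    """Direct form I filtering of a sequence of samples."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out


def telco_channel(x, fs=16000, hp_hz=150.0, lp_hz=3400.0):
    """Hypothetical narrowband channel: high-pass, 4th-order Butterworth low-pass, decimate to 8 kHz."""
    if fs % 8000:
        raise ValueError("sketch assumes fs is a multiple of 8 kHz")
    y = biquad(x, *biquad_coeffs("highpass", hp_hz, fs, 1 / math.sqrt(2)))
    for q in (0.5412, 1.3066):  # two sections with these Qs form a 4th-order Butterworth
        y = biquad(y, *biquad_coeffs("lowpass", lp_hz, fs, q))
    return y[:: fs // 8000]
```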

    But, if that channel can change from call to call... <shrug>

    I *might* be able to use PRESUMED knowledge of the party's identity
    to help characterize the current channel. But, that would defeat identification.


  • From Phil Hobbs@3:633/10 to All on Saturday, June 06, 2026 20:46:42
    Martin Brown <'''newspam'''@nonad.co.uk> wrote:
    On 05/06/2026 21:27, Don Y wrote:
    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    You may be able to infer a rough idea of the channel frequency response
    by taking an FFT of a chunk digitised at 44kHz. But I'd be more inclined
    to low pass filter everything above 4kHz and use 8b @ 8kHz.

    It should be good enough for the intended purpose of speech at that.


    To use 8 bits, you normally had to use companding digitizers (mu-law over
    here, A-law in Europe).
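    A minimal sketch of why companding makes 8 bits adequate for speech. It
    uses the continuous mu-law curve with mu = 255. G.711 itself uses a
    segmented, piecewise-linear approximation with a different 8-bit code
    layout, so this is illustrative only.

```python
import math

MU = 255.0


def mu_law_encode(x, bits=8):
    """Compress x in [-1, 1] with the continuous mu-law curve, then quantize to a signed code."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits
    return round(y * levels)


def mu_law_decode(code, bits=8):
    """Map a quantized mu-law code back to a linear value in [-1, 1]."""
    levels = 2 ** (bits - 1) - 1
    y = code / levels
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def linear_round_trip(x, bits=8):
    """Plain (uniform) quantization to the same number of bits, for comparison."""
    levels = 2 ** (bits - 1) - 1
    return round(max(-1.0, min(1.0, x)) * levels) / levels
```

    A quiet sample of 0.003 (about -50 dBFS) survives the mu-law round trip
    almost unchanged. The same sample rounds to zero under uniform 8-bit
    quantization.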

    Haven't seen one of those in a looong time.

    Cheers

    Phil Hobbs

    --
    Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC / Hobbs ElectroOptics Optics, Electro-optics, Photonics, Analog Electronics

  • From Carlos E.R.@3:633/10 to All on Saturday, June 06, 2026 22:44:23
    On 2026-06-06 00:01, Nioclás Pól Caileán de Ghloucester wrote:

    ...

    Accuracy
    Many file formats that compress audio discard some of the audio signal
    information whilst doing so. Converting to such a format and then
    converting back again will not produce an exact copy of the original
    audio. This is the case for many formats used in telephony (e.g. A-law,

    A-law does not compress in that sense. It only changes the amplitude
    encoding (companding); it does not change the bandwidth.

    GSM) where low signal bandwidth is more important than high audio
    fidelity, and for many formats used in portable music players (e.g. MP3,
    Vorbis) where adequate fidelity can be retained even with the large
    compression ratios that are needed to make portable players
    practical.

    Cheers, Carlos.
    ES??, EU??;

  • From Jeremiah Jones@3:633/10 to All on Saturday, June 06, 2026 14:29:40
    Don Y <blockedofcourse@foo.invalid> wrote:

    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    The Telco minimum standard is 300 to 3400 hz (or 3200 for some
    equipment manufacturers, depending on who you talk to). LPF
    is required before digitizing to avoid aliasing.
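    The low-pass is needed because anything above half the sampling rate
    (4 kHz at 8 kHz sampling) folds back into the voice band. A small sketch
    of that folding arithmetic for a pure tone and ideal sampling:

```python
def alias_frequency(f_hz, fs_hz=8000.0):
    """Frequency at which a tone of f_hz appears after sampling at fs_hz with no anti-alias filter."""
    f = f_hz % fs_hz
    return min(f, fs_hz - f)


print(alias_frequency(5000))  # 3000.0: a 5 kHz tone lands in the voice band
print(alias_frequency(3400))  # 3400.0: in-band tones are unaffected
```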

    In some implementations, such as some special circuit voice
    lines, the least significant bit is reserved for supervisory
    signalling.
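    One common form of this is T1 "robbed-bit" signalling. With superframe
    (SF/D4) framing, the least significant bit of each channel's sample is
    overwritten in frames 6 and 12 of every 12-frame superframe (the A and B
    signalling bits). A sketch for one channel, assuming its 8-bit codes
    arrive one per frame, in order, starting at frame 1 of a superframe:

```python
def rob_bits(codes, a_bit, b_bit):
    """Overwrite the LSB of frames 6 and 12 of each 12-frame superframe with the A/B bits."""
    out = list(codes)
    for i, code in enumerate(codes):
        frame = i % 12 + 1              # 1-based frame number within the superframe
        if frame == 6:
            out[i] = (code & 0xFE) | (a_bit & 1)
        elif frame == 12:
            out[i] = (code & 0xFE) | (b_bit & 1)
    return out
```

    This is why such channels are usually treated as good for 56 kbit/s of
    clear data rather than the full 64 kbit/s.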

    Cell uses a much narrower bandwidth. They do some data
    compression to save bits, the nature of which i can't tell
    you, but the saving is dramatic. You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

  • From Don Y@3:633/10 to All on Saturday, June 06, 2026 15:06:56
    On 6/6/2026 2:29 PM, Jeremiah Jones wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:

    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    The Telco minimum standard is 300 to 3400 hz (or 3200 for some
    equipment manufacturers, depending on who you talk to). LPF
    is required before digitizing to avoid aliasing.

    Yes, but that was when PSTN was *the* means of moving voice.
    How do more modern network protocols address this? Do they
    deliberately cripple themselves thinking there is a land-line
    (over copper, SLIC06, etc.) on at least one end of the line
    (so why go out of your way to provide MORE fidelity)?

    In some implementations, such as some special circuit voice
    lines, the least significant bit is reserved for supervisory
    signalling.

    Cell uses a much narrower bandwidth. They do some data
    compression to save bits, the nature of which i can't tell
    you, but the saving is dramatic. You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

    That is exactly the issue.

    If you speak into a microphone hard-wired to a device that
    digitizes your speech at a high frequency and resolution and
    have models of your speech built in that environment, how
    do you match those models to speech that has made its way to the
    analyzer/models via channels of dubious characteristics?

    POTS relied on the fact that humans can recognize even severely
    distorted speech and can rely on knowledge of WHO they telephoned
    (or who reassured themselves at the far end) to identify the
    called parties.

    Note how some vendors now rely on speech as a biometric
    authenticator. How reliable could *that* be if I placed a
    call to them from an old exchange through a dubious PBX, etc.?

  • From Jeremiah Jones@3:633/10 to All on Saturday, June 06, 2026 23:25:14
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 6/6/2026 2:29 PM, Jeremiah Jones wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:

    In the POTS days, one could model a telephone channel as
    roughly 3KHz BW (~300-3KHz) with the copper crapping out
    at about 4KHz.

    Digitizing to 8b at 8KHz and you were pretty much golden.

    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    The Telco minimum standard is 300 to 3400 hz (or 3200 for some
    equipment manufacturers, depending on who you talk to). LPF
    is required before digitizing to avoid aliasing.

    Yes, but that was when PSTN was *the* means of moving voice.
    How do more modern network protocols address this? Do they
    deliberately cripple themselves thinking there is a land-line
    (over copper, SLIC06, etc.) on at least one end of the line
    (so why go out of your way to provide MORE fidelity)?

    The PSTN is still the go-to carrier for voice calls. There
    are VOIP and various ATM vehicles, which would likely use
    telco routers and switches or telco TDM facilities... if you
    need wider sound bandwidth. I'm not sure what more modern
    protocols you mean. I'm not up to date on that.

    3400 hz sampled at 8 khz gives you pretty decent voice
    quality. Not very good for music, but understandable for
    speech, and for people like me with poor high-frequency
    hearing, extra audio bandwidth wouldn't help much. Telephone
    mikes and sound elements are also cheap and crappy. If you
    want fidelity i'd start with a good quality headset.

    Most of the information in a human voice signal is well within
    the sub-1khz range. More so for men than women.

    24 telco voice channels fit nicely on a T1. Bandwidth was not
    always cheap. Do you remember back when it was easy to run up
    $40 on one phone call? It wasn't a matter of deliberate
    crippling, it was about what's doable at a reasonable price
    that fits what the customer needs.
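    The T1 arithmetic, as a quick check: each 125 us frame carries one 8-bit
    sample per channel plus one framing bit, at 8000 frames per second.

```python
def t1_line_rate(channels=24, bits_per_sample=8, sample_rate=8000, framing_bits=1):
    """Bits per second on a T1: one sample per channel plus framing bits, per frame."""
    frame_bits = channels * bits_per_sample + framing_bits   # 193 bits for a standard T1
    return frame_bits * sample_rate


print(t1_line_rate())                               # 1544000 (1.544 Mbit/s)
print(t1_line_rate(channels=1, framing_bits=0))     # 64000 per voice channel
```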


    In some implementations, such as some special circuit voice
    lines, the least significant bit is reserved for supervisory
    signalling.

    Cell uses a much narrower bandwidth. They do some data
    compression to save bits, the nature of which i can't tell
    you, but the saving is dramatic. You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

    That is exactly the issue.

    If you speak into a microphone hard-wired to a device that
    digitizes your speech at a high frequency and resolution and
    have models of your speech built in that environment, how
    do you match those models to speech that has made its way to the
    analyzer/models via channels of dubious characteristics?

    Not sure what you mean. The only part of the "speech model"
    that matters is the spectral content. Signal/noise matters,
    but that's a hardware issue.

    POTS relied on the fact that humans can recognize even severely
    distorted speech and can rely on knowledge of WHO they telephoned
    (or who reassured themselves at the far end) to identify the
    called parties.

    Note how some vendors now rely on speech as a biometric
    authenticator. How reliable could *that* be if I placed a
    call to them from an old exchange through a dubious PBX, etc.?

    I don't know, but probably just fine. One would think they
    are built for telco channel transmission.

    In most cases the PBX or other customer facilities do the
    A/D and D/A conversion at the customer premise, so analog
    signals traveling down long copper spans are not a source of
    degradation.

    What's the alternative... a call over Instagram? A
    high-priced special circuit from your carrier?

  • From Don Y@3:633/10 to All on Sunday, June 07, 2026 00:07:40
    On 6/6/2026 11:25 PM, Jeremiah Jones wrote:
    With cellular and the fact that legacy plant is still
    in place (i.e., you have no real control over how the
    signal is routed -- from second to second), is there
    a better model?

    Or, a technique to guesstimate the characteristics of
    the "current" channel, dynamically?

    The Telco minimum standard is 300 to 3400 hz (or 3200 for some
    equipment manufacturers, depending on who you talk to). LPF
    is required before digitizing to avoid aliasing.

    Yes, but that was when PSTN was *the* means of moving voice.
    How do more modern network protocols address this? Do they
    deliberately cripple themselves thinking there is a land-line
    (over copper, SLIC06, etc.) on at least one end of the line
    (so why go out of your way to provide MORE fidelity)?

    The PSTN is still the go-to carrier for voice calls. There
    are VOIP and various ATM vehicles, which would likely use
    telco routers and switches or telco TDM facilities... if you
    need wider sound bandwidth. I'm not sure what more modern
    protocols you mean. I'm not up to date on that.

    There are more ways of moving voice signals from speaker A to
    listener B than in decades past. Each one "colors" the speech
    in ways that are easily accommodated by human listeners but
    regarded as distinct by analytical mechanisms.

    3400 hz sampled at 8 khz gives you pretty decent voice
    quality. Not very good for music, but understandable for
    speech, and for people like me with poor high-frequency
    hearing, extra audio bandwidth wouldn't help much. Telephone
    mikes and sound elements are also cheap and crappy. If you
    want fidelity i'd start with a good quality headset.

    What I want is repeatability and, ideally, with some form
    of quantitative description that lets me relate THIS audio
    stream to an audio stream that I would encounter when sharing
    a room with the speaker.

    Most of the information in a human voice signal is well within
    the sub-1khz range. More so for men than women.

    But, that's because the human listener is "accommodating".
    One can distort their voice and still be understood by
    someone intent on that understanding. Witness how you
    can understand someone with a severe head cold, despite
    their voice being quantitatively "different".

    Or, someone with laryngitis.

    A machine can similarly understand what they are *saying*.
    But, it's harder for a machine to identify and authenticate
    that speaker. "What does Bob sound like when he's got
    a head cold?"

    Anything that the channel does to the signal falls into the
    same problem domain.
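    One standard way to take a fixed channel coloration out of the features
    is log-spectral (or cepstral) mean normalization. A linear, time-invariant
    channel multiplies every frame's spectrum by the same |H(f)|^2. In the log
    domain that becomes a constant offset, which subtracting the per-utterance
    mean cancels. It does not help with nonlinear codecs, with bandwidth that
    was discarded outright, or with a channel that changes mid-call. A numpy
    sketch; the 3-tap "channel" below is a hypothetical stand-in.

```python
import numpy as np


def log_spectra(x, nfft=512):
    """Log power spectrum of each non-overlapping Hann-windowed frame (frames x bins)."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // nfft
    frames = x[: n_frames * nfft].reshape(n_frames, nfft) * np.hanning(nfft)
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-10)


def mean_normalize(features):
    """Subtract each bin's average over the utterance (log-spectral mean normalization)."""
    return features - features.mean(axis=0, keepdims=True)


rng = np.random.default_rng(0)
room = rng.standard_normal(512 * 64)                        # stand-in for "in the room" audio
phone = np.convolve(room, [1.0, -0.6, 0.2])[: len(room)]    # hypothetical fixed channel
a, b = log_spectra(room), log_spectra(phone)
before = np.mean(np.abs(a - b))                             # mismatch caused by the channel
after = np.mean(np.abs(mean_normalize(a) - mean_normalize(b)))
```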

    E.g., I have a bridge that lets my *wired* home phones
    use my cellular connection. But, the quality of the
    connection is beneath the quality of the wired phones
    on a wired landline *or* the cell phone that is connected
    to the bridge.

    24 telco voice channels fit nicely on a T1. Bandwidth was not
    always cheap. Do you remember back when it was easy to run up
    $40 on one phone call? It wasn't a matter of deliberate
    crippling, it was about what's doable at a reasonable price
    that fits what the customer needs.

    In some implementations, such as some special circuit voice
    lines, the least significant bit is reserved for supervisory
    signalling.

    Cell uses a much narrower bandwidth. They do some data
    compression to save bits, the nature of which i can't tell
    you, but the saving is dramatic. You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

    That is exactly the issue.

    If you speak into a microphone hard-wired to a device that
    digitizes your speech at a high frequency and resolution and
    have models of your speech built in that environment, how
    do you match those models to speech that has made its way to the
    analyzer/models via channels of dubious characteristics?

    Not sure what you mean. The only part of the "speech model"
    that matters is the spectral content. Signal/noise matters,
    but that's a hardware issue.

    But "you" sound different when your speech travels to "me"
    via different channels. I can understand what you are trying to
    *convey* -- much like you could understand me in a helium
    environment. But, you likely wouldn't recognize my "voice"
    in that environment. You would have to rely on other cues.

    POTS relied on the fact that humans can recognize even severely
    distorted speech and can rely on knowledge of WHO they telephoned
    (or who reassured themselves at the far end) to identify the
    called parties.

    Note how some vendors now rely on speech as a biometric
    authenticator. How reliable could *that* be if I placed a
    call to them from an old exchange through a dubious PBX, etc.?

    I don't know, but probably just fine. One would think they
    are built for telco channel transmission.

    They rely on other information -- your caller ID (which can
    be forged), your knowledge of an account number, etc. And,
    if their (machine's) confidence in the authentication isn't
    high enough, they let a human operator perform that task
    before granting you access to "confidential" information
    (even though it is YOUR information).

    When someone you know phones you, you are able to *identify* the
    caller without checking their CID or waiting for them to introduce
    themselves. Because you likely have spoken with them over the
    telephone (channel) previously and know what their "telephone
    voice" sounds like.

    *Learning* what they sound like requires some prior "training".
    If someone called you "out of the blue" and made a demand that
    you *should* honor (e.g., your boss), you would likely need
    some confirmation prior to acting on it -- and would probably
    try to get that confirmation (authentication) surreptitiously
    (so you don't have to explain how you didn't recognize the caller).
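    Machine speaker verification typically mirrors this "prior training". It
    enrolls by averaging a fixed-length voiceprint (embedding) taken from
    several earlier calls. A new call is accepted only if its embedding is
    close enough to that template. A pure-Python sketch; the embedding
    vectors, the cosine-similarity measure and the 0.8 threshold are
    hypothetical placeholders. A template enrolled from wideband room audio
    may not match telephone-channel calls, which is the problem raised in
    this thread.

```python
import math


def enroll(embeddings):
    """Template = element-wise average of embeddings from several prior calls."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def verify(template, embedding, threshold=0.8):
    """Accept the caller only if the new embedding is close enough to the template."""
    return cosine(template, embedding) >= threshold
```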

    In most cases the PBX or other customer facilities do the
    A/D and D/A conversion at the customer premise, so analog
    signals traveling down long copper spans are not a source of
    degradation.

    What's the alternative... a call over Instagram? A
    high-priced special circuit from your carrier?

    I'm not asking (or requiring) anything special. Rather,
    I am trying to learn the characteristics of a channel that
    has presented itself to me (my device) without any input
    from me (or the other party!) in that selection. I.e.,
    you can't tell your provider what sort of connection you want
    for THIS call.

    If I phone home and simply state "Meet me out front",
    my other half will do so, even if annoyed at the terse,
    imperative nature of my comment. Because she would
    *recognize* me as the caller and would realize there
    was a reason for my demand AND for its brevity as well as
    the likelihood of my making such a demand.

    I want a machine to be able to act with that same level
    of reliability without having to resort to supplemental
    interactions to bolster confidence in its authentication
    of the caller.

    So, if I am out of town, I can instruct the machine to
    recognize a neighbor's "commands" (to perform a limited
    set of actions, as my proxy) without having to quiz them
    on other information (effectively, shared secrets -- even
    if not truly "secret") to get that level of confidence
    in their authentication:

    "What day did Don depart?"
    "When did he say he was going to return?"
    "Where was he going?"
    "What's your pet's name?"

    I.e., a recording of said neighbor (or, an AI generated
    simulant) could sound correct and yet not respond correctly.

  • From Don Y@3:633/10 to All on Sunday, June 07, 2026 00:40:41
    On 6/7/2026 12:07 AM, Don Y wrote:

    If I phone home and simply state "Meet me out front",
    my other half will do so, even if annoyed at the terse,
    imperative nature of my comment. Because she would
    *recognize* me as the caller and would realize there
    was a reason for my demand AND for its brevity as well as
    the likelihood of my making such a demand.

    I want a machine to be able to act with that same level
    of reliability without having to resort to supplemental
    interactions to bolster confidence in its authentication
    of the caller.

    So, if I am out of town, I can instruct the machine to
    recognize a neighbor's "commands" (to perform a limited
    set of actions, as my proxy) without having to quiz them
    on other information (effectively, shared secrets -- even
    if not truly "secret") to get that level of confidence
    in their authentication:

    "What day did Don depart?"
    "When did he say he was going to return?"
    "Where was he going?"
    "What's your pet's name?"

    I.e., a recording of said neighbor (or, an AI generated
    simulant) could sound correct and yet not respond correctly.

    A good summary of the various technologies involved:
    <https://en.wikipedia.org/wiki/Speaker_recognition>
    Including:

    "In 2023 Vice News and The Guardian separately demonstrated
    they could defeat standard financial speaker-authentication
    systems using AI-generated voices generated from about five
    minutes of the target's voice samples."

    (I don't think you even need THAT much training material)

    Note that the banking application inherently constrains the types
    of "content" that is likely exchanged making it easier to spoof.

  • From Liz Tuddenham@3:633/10 to All on Sunday, June 07, 2026 08:54:59
    Jeremiah Jones <jj@j.j> wrote:

    [...]
    You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

    Some mobile 'phones produce sounds that are barely recognisable as a
    voice, let alone any particular voice. This is made even worse by
    people who stand them on a table and switch them to 'hands-free' with a
    loud television or radio in the background.


    --
    ~ Liz Tuddenham ~
    (Remove the ".invalid"s and add ".co.uk" to reply)
    www.poppyrecords.co.uk

  • From Don Y@3:633/10 to All on Sunday, June 07, 2026 12:30:52
    On 6/7/2026 12:54 AM, Liz Tuddenham wrote:
    Jeremiah Jones <jj@j.j> wrote:

    [...]
    You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

    Some mobile 'phones produce sounds that are barely recognisable as a
    voice, let alone any particular voice. This is made even worse by
    people who stand them on a table and switch them to 'hands-free' with a
    loud television or radio in the background.

    But that's also true of wired (and "wireless") phones. I
    had a "single chip" phone many decades ago that felt like
    I was using a can-on-a-string -- a mild breeze would blow it
    off the table!

    The fact that they can get any sort of audio quality
    (think: "music player") out of a cell phone suggests
    they are putting SOME effort into that.

  • From Jeremiah Jones@3:633/10 to All on Sunday, June 07, 2026 16:49:04
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 6/6/2026 11:25 PM, Jeremiah Jones wrote:

    (snip for focus)

    3400 Hz sampled at 8 kHz gives you pretty decent voice
    quality. Not very good for music, but understandable for
    speech, and for people like me with poor high-frequency
    hearing, extra audio bandwidth wouldn't help much. Telephone
    mikes and sound elements are also cheap and crappy. If you
    want fidelity i'd start with a good quality headset.
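
    As a rough illustration of why 3400 Hz fits at 8 kHz: sampling at fs
    puts the Nyquist limit at fs/2 = 4000 Hz, and anything above that folds
    back into the band. The sketch below (plain Python; the name
    `alias_frequency` is made up for illustration) computes where a tone
    lands after sampling. The 600 Hz between 3400 Hz and 4000 Hz is the
    room left for the anti-aliasing filter to roll off.

```python
def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a real tone at f_hz after sampling at fs_hz."""
    r = f_hz % fs_hz                              # f and f + m*fs look identical
    return r if r <= fs_hz / 2 else fs_hz - r     # fold into 0..fs/2

FS = 8000                      # telephone sampling rate, Hz
NYQUIST = FS / 2               # 4000 Hz
GUARD_BAND = NYQUIST - 3400    # room for the anti-alias filter skirt
PCM_BIT_RATE = 8 * FS          # 8-bit samples -> 64,000 b/s per channel

for tone in (300, 3400, 5000, 7000):
    print(tone, "Hz ->", alias_frequency(tone, FS), "Hz")
```

    A 5000 Hz component that slipped past the filter would come out as
    3000 Hz, squarely inside the voice band.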

    What I want is repeatability and, ideally, with some form
    of quantitative description that lets me relate THIS audio
    stream to an audio stream that I would encounter when sharing
    a room with the speaker.

    That would be signal-to-noise ratio (SNR).
    SNR = St / (St - Sr)
    if St is the transmitted signal, i.e., out of the speaker's
    mouth, and Sr is the received signal, that which goes into
    your ear.
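
    The formula above is easiest to apply as a ratio of *powers*, usually
    expressed in dB, with the difference St - Sr taken as the noise. A
    minimal sketch, assuming the two recordings are already time-aligned
    and level-matched (the name `snr_db` is hypothetical). With this
    definition, any filtering or level change the channel applies also
    counts as "noise", since it makes Sr differ from St.

```python
import math

def snr_db(transmitted, received):
    """SNR in dB, treating (transmitted - received) as the noise.

    Assumes both sequences share a sample rate and are time-aligned and
    level-matched; any residual delay, gain or filtering error is counted
    as noise.
    """
    if len(transmitted) != len(received):
        raise ValueError("sequences must be the same length")
    signal_power = sum(s * s for s in transmitted)
    noise_power = sum((s - r) ** 2 for s, r in zip(transmitted, received))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)

st = [1.0, -1.0, 1.0, -1.0]
sr = [0.9, -0.9, 0.9, -0.9]       # received at 90% amplitude
print(round(snr_db(st, sr), 2))   # 20.0
```

    Here the entire "noise" is a 10% level change; nothing random was added.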

    Simple to say; accurate measurement could be pretty difficult.

    You could send a known signal thru the network that would
    quantify frequency and phase distortion and latency. I don't
    know what off-the-shelf equipment would do that, but there
    must be some.
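
    One way to send a known signal through the network and measure it, as
    suggested above: transmit a noise-like probe, find the latency from the
    peak of the cross-correlation, then compare the probe and the aligned
    recording one frequency at a time. This is a sketch in plain Python with
    hypothetical names; it simulates the channel instead of using real test
    equipment, and the simulated channel (a 5-sample delay plus a
    y[k] = x[k] - 0.5*x[k-1] tilt) is an assumption for the demo. Phase
    could be compared the same way from the DFT bins; only magnitude is
    shown.

```python
import math
import random

def estimate_delay(probe, received, max_lag):
    """Lag (samples) at which `received` correlates best with `probe`."""
    best_lag, best_score = 0, -math.inf
    for lag in range(max_lag + 1):
        score = sum(p * received[i + lag]
                    for i, p in enumerate(probe) if i + lag < len(received))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def bin_magnitude(x, freq_hz, fs_hz):
    """Magnitude of the DFT of x evaluated at a single frequency."""
    w = 2 * math.pi * freq_hz / fs_hz
    re = sum(v * math.cos(w * n) for n, v in enumerate(x))
    im = sum(v * math.sin(w * n) for n, v in enumerate(x))
    return math.hypot(re, im)

def channel_gain_db(probe, aligned, freq_hz, fs_hz):
    """Channel gain at one frequency: received vs. sent magnitude, in dB."""
    return 20 * math.log10(bin_magnitude(aligned, freq_hz, fs_hz)
                           / bin_magnitude(probe, freq_hz, fs_hz))

FS, N = 8000, 800
rng = random.Random(1)
probe = [rng.uniform(-1.0, 1.0) for _ in range(N)]   # noise-like test signal

# Simulated channel (an assumption): tilt y[k] = x[k] - 0.5*x[k-1], applied
# circularly (probe[-1] wraps) so the per-bin ratios are exact for the demo,
# followed by a 5-sample delay.
received = [0.0] * 5 + [probe[k] - 0.5 * probe[k - 1] for k in range(N)]

delay = estimate_delay(probe, received, max_lag=20)
aligned = received[delay:delay + N]
for f in (500, 1000, 2000, 3000):
    print(f, "Hz:", round(channel_gain_db(probe, aligned, f, FS), 2), "dB")
```

    A real probe would be longer, repeated and averaged, since additive
    noise on the line would otherwise scatter the per-frequency estimates.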


    Most of the information in a human voice signal is well within
    the sub-1 kHz range. More so for men than women.

    But, that's because the human listener is "accommodating".

    No. It's because that's just the spectral content of a human
    voice. That one can be objectively measured & quantified.
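
    The "sub-1 kHz" claim is measurable in the sense described: take the
    spectrum of a recording and compute what fraction of its energy lies
    below a cutoff. A sketch in plain Python using a direct (slow) DFT, with
    a made-up two-tone signal standing in for speech and a hypothetical
    function name. It measures energy, which is not necessarily the same
    thing as information or intelligibility.

```python
import math

def energy_fraction_below(x, fs_hz, cutoff_hz):
    """Fraction of x's spectral energy at frequencies below cutoff_hz.

    Direct O(N^2) DFT over the non-negative frequencies; interior bins are
    counted twice to stand in for their negative-frequency mirror.
    """
    n = len(x)
    total = below = 0.0
    for k in range(n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
        weight = 1.0 if k == 0 or 2 * k == n else 2.0
        power = weight * (re * re + im * im)
        total += power
        if k * fs_hz / n < cutoff_hz:
            below += power
    return below / total if total else 0.0

FS, N = 8000, 400
# Made-up test signal: a strong 200 Hz component plus a weaker 2000 Hz one.
x = [math.cos(2 * math.pi * 200 * i / FS) + 0.5 * math.cos(2 * math.pi * 2000 * i / FS)
     for i in range(N)]
print(round(energy_fraction_below(x, FS, 1000), 3))   # 0.8
```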


    One can distort their voice and still be understood by
    someone intent on that understanding. Witness how you
    can understand someone with a severe head cold, despite
    their voice being quantitatively "different".

    Or, someone with laryngitis.

    A machine can similarly understand what they are *saying*.
    But, it's harder for a machine to identify and authenticate
    that speaker. "What does Bob sound like when he's got
    a head cold?"

    Anything that the channel does to the signal falls into the
    same problem domain.

    Not exactly. What the channel does to the signal is noise,
    which falls under SNR. If Bob has a head cold, that's just a
    difference in the original signal.

    If you are trying to identify the caller via a "voice print",
    it still adds to the confusion all the same. IIRC a voice
    print can't be spoofed by another human voice, but i'm pretty
    sure it can be spoofed by an AI synthesis of Bob's voice. I
    speculate the voice print would not be greatly affected if Bob
    has a head cold, since it sounds like a product of the
    speaker's physiology, like larynx, tongue etc, at least
    according to my very incomplete understanding.


    E.g., I have a bridge that lets my *wired* home phones
    use my cellular connection. But, the quality of the
    connection is beneath the quality of the wired phones
    on a wired landline *or* the cell phone that is connected
    to the bridge.

    24 telco voice channels fit nicely on a T1. Bandwidth was not
    always cheap. Do you remember back when it was easy to run up
    $40 on one phone call? It wasn't a matter of deliberate
    crippling, it was about what's doable at a reasonable price
    that fits what the customer needs.
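
    The arithmetic behind "24 channels on a T1": each voice channel (DS0)
    is 8 bits x 8000 samples/s, and a T1 frame carries one sample from each
    of the 24 channels plus one framing bit, sent 8000 times a second. A
    plain-Python sketch of the numbers:

```python
CHANNELS = 24             # DS0 voice channels per T1
BITS_PER_SAMPLE = 8       # one 8-bit PCM sample per channel per frame
FRAMES_PER_SECOND = 8000  # one frame every 125 microseconds
FRAMING_BITS = 1          # one framing bit per frame

ds0_rate = BITS_PER_SAMPLE * FRAMES_PER_SECOND            # 64,000 b/s
frame_bits = CHANNELS * BITS_PER_SAMPLE + FRAMING_BITS    # 193 bits
t1_rate = frame_bits * FRAMES_PER_SECOND                  # 1,544,000 b/s
payload_rate = CHANNELS * ds0_rate                        # 1,536,000 b/s

print(ds0_rate, frame_bits, t1_rate, payload_rate)   # 64000 193 1544000 1536000
```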

    In some implementations, such as some special circuit voice
    lines, the least significant bit is reserved for supervisory
    signalling.
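
    In the common T1 "robbed-bit" scheme the LSB is not taken from every
    sample: with D4 (superframe) framing it is overwritten only in frames 6
    and 12 of each 12-frame superframe, carrying the A and B signalling
    bits (ESF framing does the same in frames 6, 12, 18 and 24 for A, B, C
    and D). A sketch for one channel, with a hypothetical function name:

```python
def rob_bits(superframe, a_bit, b_bit):
    """Insert A/B signalling bits into one channel's 12 samples (D4 framing).

    superframe: 12 eight-bit PCM samples for one channel, frames 1..12.
    Frames 6 and 12 have their least significant bit overwritten.
    """
    if len(superframe) != 12:
        raise ValueError("a D4 superframe has 12 frames")
    out = list(superframe)
    out[5] = (out[5] & 0xFE) | (a_bit & 1)     # frame 6 carries the A bit
    out[11] = (out[11] & 0xFE) | (b_bit & 1)   # frame 12 carries the B bit
    return out

samples = [0xFF] * 12
print([hex(s) for s in rob_bits(samples, a_bit=0, b_bit=0)])

# Only 2 of every 12 samples lose their LSB, which voice barely notices.
# A data user cannot tell which samples were touched, so only 7 bits per
# sample are dependable:
DATA_RATE = 7 * 8000   # 56,000 b/s
```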

    Cell uses a much narrower bandwidth. They do some data
    compression to save bits, the nature of which i can't tell
    you, but the saving is dramatic. You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.
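
    For scale, the nominal bit rates: G.711 PCM uses 64 kb/s, the original
    GSM full-rate codec 13 kb/s, and AMR narrowband modes range from 4.75
    to 12.2 kb/s. The narrowing is in bit rate; these narrowband codecs
    still carry roughly the same 300-3400 Hz audio band. Plain arithmetic,
    with a hypothetical helper name:

```python
G711_KBPS = 64.0   # 8-bit PCM at 8000 samples/s

# Nominal speech codec bit rates in kb/s.
CODEC_KBPS = {
    "GSM full rate": 13.0,
    "AMR-NB 12.2 mode": 12.2,
    "AMR-NB 4.75 mode": 4.75,
}

def compression_ratio(codec_kbps, reference_kbps=G711_KBPS):
    """How many times fewer bits than 64 kb/s PCM the codec uses."""
    return reference_kbps / codec_kbps

for name, kbps in CODEC_KBPS.items():
    print(f"{name}: {kbps} kb/s, about {compression_ratio(kbps):.1f}x fewer bits than G.711")
```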

    That is exactly the issue.

    If you speak into a microphone hard-wired to a device that
    digitizes your speech at a high frequency and resolution and
    have models of your speech built in that environment, how
    do you match those models to speech that has made its way to the
    analyzer/models via channels of dubious characteristics?

    Not sure what you mean. The only part of the "speech model"
    that matters is the spectral content. Signal/noise matters,
    but that's a hardware issue.

    But "you" sound different when your speech travels to "me"
    via different channels. I can understand what you are trying to
    *convey* -- much like you could understand me in a helium
    environment. But, you likely wouldn't recognize my "voice"
    in that environment. You would have to rely on other cues.

    That would be a pretty serious signal degradation.

    (snip)

    In most cases the PBX or other customer facilities do the
    A/D and D/A conversion at the customer premise, so analog
    signals traveling down long copper spans are not a source of
    degradation.

    What's the alternative... a call over Instagram? A
    high-priced special circuit from your carrier?

    I'm not asking (or requiring) anything special. Rather,
    I am trying to learn the characteristics of a channel that
    has presented itself to me (my device) without any input
    from me (or the other party!) in that selection. I.e.,
    you can't tell your provider what sort of connection you want
    for THIS call.

    I see.

    To the extent that a phone call is digitized the entire way,
    except for the first and last mile (or fraction thereof),
    different call paths aren't going to make any difference.

    It's also probable that, if you use the PSTN, you are going to
    get the same path every time anyway. The PSTN mostly uses TDM
    transport which is pre-determined, rather than multiple paths
    like the Internet does.

    (snip the rest, as I have nothing useful to add to it)

  • From Don Y@3:633/10 to All on Monday, June 08, 2026 00:02:21
    On 6/7/2026 4:49 PM, Jeremiah Jones wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 6/6/2026 11:25 PM, Jeremiah Jones wrote:

    (snip for focus)

    3400 Hz sampled at 8 kHz gives you pretty decent voice
    quality. Not very good for music, but understandable for
    speech, and for people like me with poor high-frequency
    hearing, extra audio bandwidth wouldn't help much. Telephone
    mikes and sound elements are also cheap and crappy. If you
    want fidelity i'd start with a good quality headset.

    What I want is repeatability and, ideally, with some form
    of quantitative description that lets me relate THIS audio
    stream to an audio stream that I would encounter when sharing
    a room with the speaker.

    That would be signal-to-noise ratio (SNR).
    SNR = St / (St - Sr)
    if St is the transmitted signal, i.e., out of the speaker's
    mouth, and Sr is the received signal, that which goes into
    your ear.

    I think it is more than that. I think the channel "colors"
    the content.

    What prompted the post was a demo I gave the other day.
    With my ear buds in place, I was successfully authenticated
    (context independent).

    I borrowed one of my guests' cell phones and then placed a call
    to "myself". And, was NOT authenticated. My voice hadn't
    changed in the few minutes between those two tests. I didn't
    choose a different utterance.

    What had changed was the channel between my mouth and the recognizer.

    When I had initially designed the mechanism, I verified it with a
    POTS connection. Now, the station set was replaced with a cell phone
    and the copper line with a packet-switched network operating over a
    wireless connection.

    To the machine, my voice was no longer strictly what it should have
    been.

    Simple to say; accurate measurement could be pretty difficult.

    You could send a known signal thru the network that would
    quantify frequency and phase distortion and latency. I don't
    know what off-the-shelf equipment would do that, but there
    must be some.

    In theory, KNOWING the identity of the remote party AND recognizing
    whatever utterance, I could dynamically determine an inverse function
    that maps what I'm *hearing* from him to what he is likely *saying*.
    That function would characterize the channel.
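
    The "inverse function" idea can be sketched as an LMS adaptive FIR
    equalizer: given the received samples and a clean reference of what was
    sent, it adjusts filter taps until the filtered received signal tracks
    the reference. The sketch assumes that clean reference is available,
    which is exactly the circularity raised in the next paragraph. The
    channel (x[n] + 0.5*x[n-1]), tap count and step size are made-up demo
    values, white noise stands in for speech, and the names are
    hypothetical.

```python
import random

def lms_equalizer(received, reference, taps=8, mu=0.01):
    """LMS-adapt FIR weights w so sum(w[k] * received[n - k]) tracks reference[n].

    Returns the final weights and the per-sample error history.
    """
    w = [0.0] * taps
    errors = []
    for n in range(len(received)):
        # Most recent `taps` received samples, newest first (zeros before start).
        window = [received[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, window))
        e = reference[n] - y
        w = [wk + mu * e * xk for wk, xk in zip(w, window)]   # LMS update
        errors.append(e)
    return w, errors

rng = random.Random(7)
sent = [rng.uniform(-1.0, 1.0) for _ in range(20000)]   # stand-in for speech
# Hypothetical channel: each sample plus half of the previous one.
heard = [sent[n] + 0.5 * (sent[n - 1] if n else 0.0) for n in range(len(sent))]

weights, err = lms_equalizer(heard, sent)
tail_mse = sum(e * e for e in err[-2000:]) / 2000
print([round(v, 2) for v in weights[:4]])   # close to the true inverse 1, -0.5, 0.25, -0.125
```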

    But, if I want to normalize the channel SO I CAN AUTHENTICATE THE CALLER,
    I've already corrupted that analysis by using my knowledge of his
    speech to adjust the channel so it *matches* his speech. That's rife
    with fallacy.
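
    A common blind alternative in speaker recognition is cepstral mean
    normalization (CMN): a fixed linear channel multiplies the spectrum, so
    it becomes a constant offset in the log spectrum (and in the cepstrum,
    which is a linear transform of it). Subtracting each band's average over
    the utterance removes that offset without any model of the talker. It
    also removes the talker's own long-term average spectrum, so it has to
    be applied to enrollment and test audio alike, and it does nothing for
    additive noise, codec nonlinearities, or a channel that changes
    mid-call. A sketch on log band energies, with hypothetical names and
    made-up data:

```python
import math

def mean_normalize(frames):
    """Subtract each band's average over all frames (log-domain features)."""
    n = len(frames)
    means = [sum(frame[b] for frame in frames) / n for b in range(len(frames[0]))]
    return [[v - m for v, m in zip(frame, means)] for frame in frames]

def through_channel(frames, band_gains):
    """Hypothetical fixed linear channel: a per-band gain on magnitude,
    which becomes an additive log(gain) in every log-domain frame."""
    return [[v + math.log(g) for v, g in zip(frame, band_gains)] for frame in frames]

# Made-up log band energies: 6 frames x 4 bands.
clean = [[math.log(1.0 + (3 * t + 5 * b) % 11) for b in range(4)] for t in range(6)]
phone = through_channel(clean, [1.0, 0.5, 0.25, 2.0])

diff = max(abs(a - b)
           for fa, fb in zip(mean_normalize(clean), mean_normalize(phone))
           for a, b in zip(fa, fb))
print(diff < 1e-9)   # True: the fixed channel is gone after normalization
```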

    Most of the information in a human voice signal is well within
    the sub-1 kHz range. More so for men than women.

    But, that's because the human listener is "accommodating".

    No. It's because that's just the spectral content of a human
    voice. That one can be objectively measured & quantified.

    But only if it is faithfully presented to the analyzer.
    If it is altered in transit, then all bets are off.
    (Well, you can claim to make a match by enlarging the
    confidence interval you're willing to tolerate for
    "certainty")

    One can distort their voice and still be understood by
    someone intent on that understanding. Witness how you
    can understand someone with a severe head cold, despite
    their voice being quantitatively "different".

    Or, someone with laryngitis.

    A machine can similarly understand what they are *saying*.
    But, it's harder for a machine to identify and authenticate
    that speaker. "What does Bob sound like when he's got
    a head cold?"

    Anything that the channel does to the signal falls into the
    same problem domain.

    Not exactly. What the channel does to the signal is noise,
    which falls under SNR. If Bob has a head cold, that's just a
    difference in the original signal.

    Shirley the instrument used doesn't differ from every other
    instrument that could have been used to convey the signal?
    I.e., all cell phones, carriers, etc. are 100% interchangeable.
    The mechanical dimensions of transducers don't attenuate or
    amplify certain frequencies, etc.

    It's just a fluke that the cell phone used in my test (above)
    happened to corrupt the signal delivered to the network?

    If you are trying to identify the caller via a "voice print",
    it still adds to the confusion all the same. IIRC a voice
    print can't be spoofed by another human voice, but i'm pretty
    sure it can be spoofed by an AI synthesis of Bob's voice. I
    speculate the voice print would not be greatly affected if Bob
    has a head cold, since it sounds like a product of the
    speaker's physiology, like larynx, tongue etc, at least
    according to my very incomplete understanding.

    Your physiology changes in each of these "abnormal" situations.

    When you "get a cold", your larynx swells, which drives the
    fundamental frequency of all of your voiced sounds lower. The vocal
    cords can become coated with mucus, further deepening the voice.
    Coughing or clearing phlegm irritates them, making your voice sound
    more raspy or "shredded". The nasal passages tend to get blocked or
    narrowed, reducing the prominence of nasal sounds (m, ng, etc.) as the
    resonance is altered by the physical changes to that "chamber".
    Soft tissues don't vibrate as easily, etc.

    Most "impersonators" rely on other mechanisms to reinforce the
    perception of their voice. A machine can note the difference
    in spectral content while a human listener is more willing to
    tolerate deviations.
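
    As one concrete example of a machine "noting the difference", the
    fundamental frequency mentioned above can be estimated from the
    autocorrelation of a voiced segment: the lag of the strongest peak in a
    plausible pitch range is one period. This is a crude sketch (real pitch
    trackers add normalization, voicing decisions and octave-error checks);
    the harmonic test signal and the function names are made up.

```python
import math

def estimate_f0(x, fs_hz, fmin_hz=60.0, fmax_hz=400.0):
    """Crude F0 estimate: the autocorrelation peak within a plausible pitch range."""
    lo = int(fs_hz / fmax_hz)                    # shortest period considered (samples)
    hi = min(int(fs_hz / fmin_hz), len(x) - 1)   # longest period considered
    best_lag, best_r = lo, -math.inf
    for lag in range(lo, hi + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return fs_hz / best_lag

def harmonic_tone(f0_hz, fs_hz, n, harmonics=5):
    """Made-up 'voiced' test signal: harmonics of f0 with 1/k amplitudes."""
    return [sum(math.cos(2 * math.pi * k * f0_hz * i / fs_hz) / k
                for k in range(1, harmonics + 1))
            for i in range(n)]

FS, N = 8000, 1280
print(estimate_f0(harmonic_tone(125, FS, N), FS))   # 125.0
print(estimate_f0(harmonic_tone(100, FS, N), FS))   # 100.0
```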

    When I am over-tired, my voice gets very breathy and more
    monotonic. A human can understand what I am saying -- as
    can a machine. But, expecting the machine to correctly
    *authenticate* me from such speech samples would be a stretch.

    An AI (or other synthetic voice) is actually using a model
    built *from* the speaker's voice so it more accurately
    reflects the original speaker's voice.

    To defeat an AI, you have to possess one or more "secrets"
    that the AI can't deduce from the situation. Or, hope
    the AI is sluggish in responding to your prompts (which
    can be spoken or visual, etc. -- "How many fingers am
    I holding up?")

    E.g., I have a bridge that lets my *wired* home phones
    use my cellular connection. But, the quality of the
    connection is beneath the quality of the wired phones
    on a wired landline *or* the cell phone that is connected
    to the bridge.

    24 telco voice channels fit nicely on a T1. Bandwidth was not
    always cheap. Do you remember back when it was easy to run up
    $40 on one phone call? It wasn't a matter of deliberate
    crippling, it was about what's doable at a reasonable price
    that fits what the customer needs.

    In some implementations, such as some special circuit voice
    lines, the least significant bit is reserved for supervisory
    signalling.

    Cell uses a much narrower bandwidth. They do some data
    compression to save bits, the nature of which i can't tell
    you, but the saving is dramatic. You can tell the difference
    in sound quality, tho, if you're calling from a landline to
    another landline, as opposed to calling a cell phone.

    That is exactly the issue.

    If you speak into a microphone hard-wired to a device that
    digitizes your speech at a high frequency and resolution and
    have models of your speech built in that environment, how
    do you match those models to speech that has made its way to the
    analyzer/models via channels of dubious characteristics?

    Not sure what you mean. The only part of the "speech model"
    that matters is the spectral content. Signal/noise matters,
    but that's a hardware issue.

    But "you" sound different when your speech travels to "me"
    via different channels. I can understand what you are trying to
    *convey* -- much like you could understand me in a helium
    environment. But, you likely wouldn't recognize my "voice"
    in that environment. You would have to rely on other cues.

    That would be a pretty serious signal degradation.

    There are no standards for the station sets on each end of a
    comm link. Because humans can recognize seriously degraded
    speech, fidelity isn't an important criterion.

    (snip)

    In most cases the PBX or other customer facilities do the
    A/D and D/A conversion at the customer premise, so analog
    signals traveling down long copper spans are not a source of
    degradation.

    What's the alternative... a call over Instagram? A
    high-priced special circuit from your carrier?

    I'm not asking (or requiring) anything special. Rather,
    I am trying to learn the characteristics of a channel that
    has presented itself to me (my device) without any input
    from me (or the other party!) in that selection. I.e.,
    you can't tell your provider what sort of connection you want
    for THIS call.

    I see.

    To the extent that a phone call is digitized the entire way,
    except for the first and last mile (or fraction thereof),
    different call paths aren't going to make any difference.

    It's also probable that, if you use the PSTN, you are going to
    get the same path every time anyway. The PSTN mostly uses TDM
    transport which is pre-determined, rather than multiple paths
    like the Internet does.

    "ATM is a core protocol used in the synchronous optical networking
    and synchronous digital hierarchy (SONET/SDH) backbone of the public
    switched telephone network and in the Integrated Services Digital
    Network (ISDN) but has largely been superseded in favor of
    next-generation networks based on IP technology."

    IP doesn't ensure that one datagram follows the same path as the
    preceding or following.
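
    One thing path variation over IP measurably changes is packet timing.
    RTP receivers (RFC 3550, section 6.4.1) keep a running interarrival
    jitter estimate, J += (|D| - J)/16, where D is the change in transit
    time between consecutive packets. A sketch with a hypothetical function
    name and made-up timestamps in milliseconds:

```python
def interarrival_jitter(send_times, arrival_times):
    """Running jitter estimate in the style of RFC 3550 section 6.4.1.

    send_times: sender timestamps; arrival_times: local arrival times,
    both in the same units (ms here). Returns the final estimate J.
    """
    j = 0.0
    for i in range(1, len(send_times)):
        # D = change in one-way transit time between consecutive packets
        d = (arrival_times[i] - arrival_times[i - 1]) - (send_times[i] - send_times[i - 1])
        j += (abs(d) - j) / 16.0   # exponential smoothing with gain 1/16
    return j

sent = [0, 20, 40, 60]            # one packet every 20 ms
steady = [100, 120, 140, 160]     # constant 100 ms transit: no jitter
wobbly = [100, 125, 140, 165]     # transit alternates 100 / 105 ms
print(interarrival_jitter(sent, steady))   # 0.0
print(interarrival_jitter(sent, wobbly))
```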

    (snip the rest, as I have nothing useful to add to it)

