So, let’s do this again, but this time cleanly. In a Facebook post, Michael Seemann explains why the Facebook app does not listen to every word you say, all of the time.

He is right. A telephone is a device with limited power supply, limited cooling and limited, metered connectivity. It has an operating system that monitors and manages these critical resources, hard. You can’t listen to things all of the time and expect not to be noticed. Like, “the battery is empty and my LTE budget is gone” noticed.

Other devices, such as an Alexa, a Sonos One or a Google Home, are on cabled power and unmetered Wi-Fi. They could theoretically get away with listening all of the time.

So how much data is that? Let’s do the math.

Let’s assume one human talks, on average, one hour each day. We are not recording environmental noise or gaps in speech. Just the words.

Let’s also assume there are, on average, at most 20,000 talking days (55 years) in a human life. At one hour per day, that is 20,000 hours of speech.

Let’s assume a really frugal codec, like G.723.1, at around 750 bytes/s (about 6 kbit/s).

One hour of talking is 750 bytes/s × 3600 s = 2,700,000 bytes per hour, or about 2.6 MB per hour (counting 1 MB as 2^20 bytes). A less frugal codec would consume about 10x this.

A human lifetime of speech is then around 50 GB of recording (20,000 hours × 2.7 MB per hour = 54 GB).
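The per-person numbers can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text. The constant and function names are made up for illustration, and the inputs are the assumptions stated above (750 bytes/s, one hour per day, 20,000 talking days).

```python
# Assumptions from the text, not measurements.
CODEC_BYTES_PER_SECOND = 750      # roughly G.723.1
SECONDS_PER_HOUR = 3600
TALKING_HOURS_PER_LIFE = 20_000   # 1 hour/day over 20,000 days


def bytes_per_hour(rate=CODEC_BYTES_PER_SECOND):
    """Bytes needed to store one hour of encoded speech."""
    return rate * SECONDS_PER_HOUR


def lifetime_bytes(rate=CODEC_BYTES_PER_SECOND, hours=TALKING_HOURS_PER_LIFE):
    """Bytes needed to store a lifetime of encoded speech."""
    return bytes_per_hour(rate) * hours


if __name__ == "__main__":
    per_hour = bytes_per_hour()
    per_life = lifetime_bytes()
    print(f"{per_hour} bytes/hour = {per_hour / 2**20:.2f} MiB/hour")
    print(f"{per_life / 1e9:.0f} GB per lifetime")
```

The exact result is 54 GB per lifetime, which the rest of the calculation rounds to 50 GB. Passing `rate=7500` gives the "less frugal codec" case.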

There are currently about 7.5 billion (7.5 × 10^9) humans alive. Let’s assume that five times as many humans have ever lived, each with 20,000 hours, or 50 GB, of space requirements.

So 37 500 000 000 humans consume 50 GB each. That’s 1 875 000 000 000 GB = 1 875 000 000 TB = 1 875 000 PB = 1 875 EB of storage.

I am simplifying by rounding up to 2000 EB, or 200 million 10 TB drives. We can have 12 of these in a 1U server, so that is 16 666 667 servers, or about 16.7 million.

Google and Facebook each have around 3 million servers. So, no, not possible, by a factor of about 5 to 6.

On the other hand, if the assumptions hold, it is barely possible for all living humans: 7.5 billion × 50 GB is 375 EB, or roughly 3.1 million servers. Quite possible for a few hundred million people.
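The server counts can be reproduced the same way. Again a sketch under the assumptions of this post: 50 GB per person, 10 TB drives, 12 drives per 1U server, decimal units throughout. The helper names are illustrative only.

```python
# Decimal storage units.
GB = 10**9
TB = 10**12
EB = 10**18

# Assumptions from the text.
BYTES_PER_PERSON = 50 * GB
DRIVE_BYTES = 10 * TB
DRIVES_PER_SERVER = 12


def ceil_div(a, b):
    """Integer division that rounds up."""
    return -(-a // b)


def drives_needed(total_bytes):
    """Number of 10 TB drives needed to hold total_bytes."""
    return ceil_div(total_bytes, DRIVE_BYTES)


def servers_needed(total_bytes):
    """Number of 12-drive servers needed to hold total_bytes."""
    return ceil_div(drives_needed(total_bytes), DRIVES_PER_SERVER)


if __name__ == "__main__":
    living = 7_500_000_000
    cases = {
        "everyone who ever lived": 5 * living * BYTES_PER_PERSON,
        "rounded up to 2000 EB": 2000 * EB,
        "all living humans": living * BYTES_PER_PERSON,
    }
    for label, total in cases.items():
        print(f"{label}: {total // EB} EB, "
              f"{drives_needed(total):,} drives, "
              f"{servers_needed(total):,} servers")
```

Without rounding up to 2000 EB, the "everyone who ever lived" case needs 15,625,000 servers instead of 16,666,667. The conclusion does not change either way.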

Published in Computer Science, Data Centers

1. So anyone who does not see the smartphone as a real danger and does not trust the “mute switches” on Google Home and Echo can plug those devices into a radio-controlled power socket without IoT, or discuss the whole thing with their head of security ;)

2. AndreasLobinger

did someone mention the term VAD (voice activity detection)? (hint: hiding this 1h = 2.6MB in uplink is not really a problem if you can buffer…)

3. Onur

Why store the data in a sound format? They are not interested in tonality, pitch or timbre, for now at least. It is much easier to transcribe it into text and keep that instead. Since transcription is currently done at HQ, you need to send the audio files over the network. Transmitting ~26MB/h with a bad audio codec is hardly a big deal in terms of network resources. At HQ you can have ring buffers that dispose of the audio data as soon as the transcription is done, to avoid using storage resources as well.

It would be interesting to know how many local resources are consumed when the mic is always on, running a codec and writing to local storage. IMHO, this feels like something you can get away with in terms of battery, I/O and compute. Especially when you optimise away the “blank times” where no speech is detected, such as sleep or reading. Depending on how many resources you dare to consume, the recording may include metadata like location or time. I am guessing apps that do snoring analysis would be a good basic indicator of how feasible this is.

You can then batch this for periodic upload to HQ, since you don’t need real-time transcription of this kind of data.

• kris

It certainly cannot be better than the “talk time” given in the spec sheet of the handset, because what you are doing is basically a permanent background VoIP call. You will notice that.

• Onur

Maybe it can:

Most of the talk time is consumed by the transmission power of whatever GSM technology is in use. The energy use of transmission is constant during a VoIP call, even if you are not utilising the full bandwidth. So you can save a lot of battery by only transmitting in periodic bursts.

So what we need to know is the resource use of recording things locally, without transmitting them.

I am not knowledgeable in this field by any means, just making educated guesses.

• Markus

Wikipedia says:
“The complexity of the algorithm is below 16 MIPS. 2.2 kilobytes of RAM is needed for codebooks.”

Let’s use a low-power µC, let’s say an STM32L0, which takes about 5 mA at 1.8 V and 32 MHz. That’s 9 mW. Let’s assume a battery capacity of 3000 mAh at 3.7 V = 11100 mWh.
So this system could run for 11100 mWh / 9 mW = 1233 h = 51 days on battery.

So if you charge your phone every 5 days, it will drain less than 10% additional power.

This is of course a simplified calculation. Writing to memory card and other stuff will also consume some power, but one could also put the codec into hardware and reduce power by a great deal.
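The estimate above, as a small Python sketch. The figures (5 mA at 1.8 V, a 3000 mAh battery at 3.7 V, a 5-day charge cycle) are the assumptions from this comment, not measured values, and the function names are illustrative only. Like the comment, it covers the codec workload alone and ignores the microphone, storage writes and everything else.

```python
def load_mw(current_ma=5.0, voltage_v=1.8):
    """Power draw of the microcontroller in milliwatts."""
    return current_ma * voltage_v


def battery_mwh(capacity_mah=3000.0, voltage_v=3.7):
    """Battery energy in milliwatt-hours."""
    return capacity_mah * voltage_v


def runtime_hours():
    """Hours the codec alone could run on a full battery."""
    return battery_mwh() / load_mw()


def share_of_charge(days_between_charges=5.0):
    """Fraction of one battery charge used by the codec per charge cycle."""
    used_mwh = load_mw() * days_between_charges * 24
    return used_mwh / battery_mwh()


if __name__ == "__main__":
    hours = runtime_hours()
    print(f"{hours:.0f} h = {hours / 24:.1f} days")
    print(f"{share_of_charge():.1%} of one charge per 5-day cycle")
```

With these inputs, the codec alone uses a bit under 10% of one charge over five days.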

4. bjoern

This calculation is BS. Try again:
– reduce to current IoT owners
– store only a few weeks/months in full
– assume phonetic recognition

5. Waiting for “Hey Siri” exclusively on local resources already works.
Why should I trust the phone to not listen to other keywords to begin recording and transmitting, probably piggybacked on the transmissions that happen when I am actually and willfully interacting with Siri?