
Why "Anonymized" Data
Isn't Always Private

11 min read

"Anonymized" is a marketing word, not a guarantee. Datasets that strip out names and email addresses are routinely re-identified using linkage attacks, device fingerprints, and quasi-identifier combinations. The defensible alternative is not better anonymization — it is collecting less data in the first place.

🎯 The Promise of Anonymization (and Why It Fails)

Most analytics vendors describe their data as "anonymized." The implied contract is simple: we collect a lot, but we strip the parts that point to a real person. The user is protected because the leftover data is just numbers.

The problem is that "the parts that point to a real person" turns out to be almost everything once you have enough of it. Researchers have repeatedly demonstrated that small bundles of innocuous-looking attributes uniquely identify most of the population. The classic study by Latanya Sweeney showed that 87% of Americans are uniquely identified by just three fields: ZIP code, birth date, and gender. None of those is a "personal identifier" by itself.

The uncomfortable truth: If your dataset contains enough behavioral signal to be useful for analytics, it almost certainly contains enough signal to be re-identified.

🔗 Linkage Attacks: How Anonymous Becomes Identifiable

A linkage attack joins an "anonymous" dataset to an external source that has overlapping fields. The overlap turns into an identity. Here is the mental model:

Anatomy of a linkage attack

  1. An attacker has a "harmless" dataset: timestamps + city + device model + app version.
  2. They obtain a second dataset that contains a real identity plus one of those fields — for example, a leaked CRM dump or a public review with a timestamp.
  3. They join the two on the overlapping fields (sketched below). A handful of matches is enough; the rest of the rows are then trivially linked to the same person.
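To make the mechanics concrete, here is a minimal sketch of that join in Swift. The record shapes and field names are hypothetical stand-ins for whatever quasi-identifiers the two datasets happen to share, and the five-second tolerance is an arbitrary illustration, not a rule.

```swift
import Foundation

// Hypothetical record shapes: an "anonymous" analytics row and a leaked
// CRM row that happen to share quasi-identifiers.
struct AnalyticsRow {
    let timestamp: Date      // millisecond-precision event time
    let city: String
    let deviceModel: String
}

struct CRMRow {
    let email: String        // real identity
    let signupTime: Date     // overlaps with an analytics timestamp
    let city: String
    let deviceModel: String
}

// Join the two datasets on the overlapping fields. A match within a few
// seconds on the same city + device model is often enough to link identity.
func linkIdentities(analytics: [AnalyticsRow], crm: [CRMRow]) -> [(String, AnalyticsRow)] {
    var links: [(String, AnalyticsRow)] = []
    for a in analytics {
        for c in crm where c.city == a.city && c.deviceModel == a.deviceModel {
            if abs(a.timestamp.timeIntervalSince(c.signupTime)) < 5 {
                links.append((c.email, a))   // the "anonymous" row now has a name
            }
        }
    }
    return links
}
```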

The classic real-world example is the AOL search-log release in 2006. AOL replaced usernames with random IDs and published the search history of 650,000 users for research. Within days, journalists had identified specific individuals by reading their queries. The lesson: behavioral histories are themselves identifiers.

Quasi-identifiers in mobile analytics

"Quasi-identifier" is a fancy name for any field that is not directly identifying but narrows down who you are. In mobile analytics the usual suspects are:

  • Precise timestamp (ms): only one person opens the app at exactly 14:32:07.482 from a specific IP.
  • Device model + OS + locale: unusual combinations (e.g. an older device with a rare locale) are unique within a city.
  • App version: beta-channel users are a small population that becomes traceable.
  • IP address: identifies a household and, with rotation logs, an individual.
  • Custom properties: emails or names accidentally pasted in by developers; it happens constantly.
  • Session paths over time: a user's behavioral fingerprint is as unique as a real fingerprint.

🪞 Device Fingerprinting Defeats Pseudonyms

Even if every event is tagged with a fresh random ID, a passive device fingerprint can stitch sessions back together. Common signals: screen dimensions, OS version, time-zone, locale, installed font set, accelerometer noise pattern, language preferences. Combined, they form a near-unique signature that survives reinstall.

This is why platform vendors have been steadily restricting fingerprinting APIs and treating fingerprinting as functionally equivalent to assigning a persistent user ID. Stripping the explicit ID from your dataset does not help if the fingerprint is still implicitly there.
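As a rough illustration of how little it takes, the sketch below folds a few passive signals into a single digest. The signal set and struct are hypothetical (no vendor's actual fingerprinting code looks exactly like this), but the principle holds: no explicit ID is ever collected, yet the output behaves like a persistent device ID for most devices.

```swift
import CryptoKit
import Foundation

// Hypothetical bundle of passive signals; real fingerprinting scripts use
// dozens more (fonts, canvas rendering, sensor noise, battery behavior).
struct PassiveSignals {
    let screenSize: String   // e.g. "390x844"
    let osVersion: String
    let timeZone: String
    let locale: String
    let fontCount: Int
}

// Concatenate the signals and hash them. The result is stable across
// sessions and reinstalls even though no ID was ever assigned.
func fingerprint(_ s: PassiveSignals) -> String {
    let blob = "\(s.screenSize)|\(s.osVersion)|\(s.timeZone)|\(s.locale)|\(s.fontCount)"
    return SHA256.hash(data: Data(blob.utf8))
        .map { String(format: "%02x", $0) }
        .joined()
}
```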

💡 Mental model

Anonymization tries to remove identity after data is collected. Fingerprinting puts identity back after anonymization. The arms race always favors the attacker — they only need to win once.

📚 Anonymized vs De-identified vs Pseudonymized

Vendors use these words interchangeably. Regulators do not. Knowing the distinction matters when you are evaluating a privacy claim.

  • Pseudonymized: direct identifiers replaced by tokens; a mapping table exists somewhere. Reversible: yes.
  • De-identified: direct identifiers removed; quasi-identifiers usually remain. Reversible: often, in practice.
  • Anonymized (true): re-identification is structurally impossible even with auxiliary data. Reversible: no.
  • "Anonymized" (marketing): whatever the vendor wants the word to mean. Reversible: usually.

Modern privacy frameworks treat data as personal data as long as re-identification is reasonably likely. That is a much higher bar than "we removed the email field." If your retained dataset contains stable device fingerprints, precise timestamps, or behavioral histories, regulators will treat it as personal data regardless of how it is labeled.

⚠️ Common Pitfalls in Mobile Analytics

1. The "we just hash it" myth

Hashing an email or IDFA does not anonymize it — anyone who knows the email, or can enumerate plausible candidates, can reproduce the hash and match it against your dataset. Hashing is obfuscation, not anonymization.
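A short sketch of why this fails, assuming a plain SHA-256 over the raw email (the specific hash function does not change the argument): the hash is deterministic, so anyone who can guess candidate emails can recompute it and test for membership.

```swift
import CryptoKit
import Foundation

func sha256Hex(_ input: String) -> String {
    SHA256.hash(data: Data(input.utf8))
        .map { String(format: "%02x", $0) }
        .joined()
}

// The "anonymized" dataset: emails replaced by their hashes.
let storedHashes: Set<String> = [sha256Hex("alice@example.com")]

// An attacker with a list of known or guessed emails simply hashes each
// candidate and checks for a match. Nothing identifying was removed.
let candidate = "alice@example.com"
print(storedHashes.contains(sha256Hex(candidate)))   // true
```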

2. Custom properties leak PII

SDKs that accept properties: {...} end up with email addresses, full names, and free-text in production. Developers paste them in by mistake. Once it is in the pipeline, you own it.

3. Persistent pseudonyms

A "user_id" that is just a random UUID stored in UserDefaults or SharedPreferences is still personal data. It tracks the same person across years of usage.

4. Stored IP addresses

Persisting raw IPs alongside session data turns "anonymous" events into a household-level log. Geolocate transiently, then drop the IP.
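A sketch of that pattern, with `lookupCountry` standing in for whatever GeoIP resolver you actually use (it is a hypothetical placeholder, not a real API):

```swift
import Foundation

// The stored event type has no IP field at all, so persisting the IP by
// accident is structurally impossible.
struct StoredEvent: Codable {
    let eventName: String
    let country: String
}

// Hypothetical stand-in for a GeoIP lookup.
func lookupCountry(for ip: String) -> String {
    // ... resolve a country code from the IP ...
    return "DE"
}

func ingest(eventName: String, clientIP: String) -> StoredEvent {
    let country = lookupCountry(for: clientIP)   // IP used transiently, in memory only
    return StoredEvent(eventName: eventName, country: country)
    // clientIP goes out of scope here and is never written anywhere
}
```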

5. Sub-second timestamps

Millisecond precision is rarely needed for product analytics but is enough to single out a user in a population. Coarsen to minute resolution if your queries do not need more.
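Coarsening is nearly a one-liner; the sketch below truncates to the minute before anything is persisted (the cutoff is an assumption, so pick whatever resolution your queries genuinely need):

```swift
import Foundation

// Truncate a timestamp to minute resolution before storing it.
func coarsenToMinute(_ date: Date) -> Date {
    let seconds = date.timeIntervalSince1970
    return Date(timeIntervalSince1970: (seconds / 60).rounded(.down) * 60)
}
```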

6. Long-lived session IDs

A "session" that lasts days or weeks is just a user ID with extra steps. Effective rotation (e.g. every two hours, or on app restart) is what makes a session non-identifying over time.

🛡️ A Better Standard: Avoid Collection (ROA)

The most reliable anonymization is the one you do not have to do. We call this Return of Avoidance (ROA): the best way to handle sensitive data is to never collect it in the first place. If a field is not in the database, it cannot leak, cannot be subpoenaed, cannot be re-identified, and cannot be misused.

What ROA looks like in practice

  • Strict allowlist for stored fields. Anything else is rejected at the API layer (a sketch of this check follows the list).
  • No persistent user identifier. Cross-session tracking is structurally impossible.
  • RAM-only session IDs with frequent rotation. Nothing written to disk on the device.
  • No custom properties. Free-text inputs are the leading cause of accidental PII collection.
  • Country-only geolocation from IP, then immediately discard the IP.
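As a rough sketch of what "rejected at the API layer" can look like, the check below accepts an event only if every submitted key is on the allowlist. The field names match the five-field example later in this article; the validation code itself is illustrative, not anyone's production implementation.

```swift
import Foundation

let allowedFields: Set<String> = ["event_name", "session_id", "timestamp", "platform", "country"]

enum IngestError: Error {
    case unexpectedField(String)
}

// Reject the entire event if it carries any field outside the allowlist;
// unknown fields are where accidental PII sneaks in.
func validate(payload: [String: Any]) throws {
    for key in payload.keys where !allowedFields.contains(key) {
        throw IngestError.unexpectedField(key)
    }
}
```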

Session IDs are not user IDs, and custom properties were the biggest leak vector. Removing them was not a feature cut — it was a privacy upgrade.

Audit Checklist for Your Pipeline

Run through this list against any analytics pipeline that claims to be anonymous. If you cannot answer "yes" to all of them, your data is pseudonymous at best.

  • Is there a strict allowlist of stored fields, enforced server-side?
  • Are device IDs (IDFA, GAID, Android ID) absent from both SDK and server?
  • Are IP addresses processed transiently and never persisted next to events?
  • Is the session identifier rotated on a short cadence (hours, not weeks)?
  • Is timestamp precision capped at the minute level for product analytics?
  • Are custom properties forbidden, or strictly typed and allowlisted?
  • Can a developer prove the SDK writes zero bytes to disk for analytics?
  • Do retention policies actually delete data, or just hide it?

💡 How Respectlytics implements ROA

The API stores exactly five fields per event: event_name, session_id, timestamp, platform, and country. Anything else is rejected. Session IDs rotate every two hours and live only in device RAM. IP addresses are processed transiently for country lookup, then dropped before write.

There is no anonymization step because there is nothing user-identifying to anonymize. That is the whole point.

Frequently Asked Questions

Is anonymized data always private?

No. Anonymized datasets are routinely re-identified through linkage attacks and quasi-identifier combinations. A small number of innocuous fields — ZIP code, birth date, gender — uniquely identify most people. Privacy comes from minimizing collection, not stripping fields after the fact.

What is re-identification in analytics?

Re-identification links anonymous records back to a real person, usually by combining quasi-identifiers (timestamps, locations, device attributes) with auxiliary data. Common vectors in mobile analytics: precise timestamps, device fingerprints, and persistent pseudonymous IDs.

What is the difference between anonymized and pseudonymized data?

Pseudonymized data replaces direct identifiers with reversible tokens. Anonymized data is meant to remove the link entirely. In practice, most "anonymized" analytics data still contains enough quasi-identifiers to be re-identified — meaning it is technically pseudonymous.

How does device fingerprinting defeat anonymization?

Fingerprinting combines passive signals (model, OS, locale, screen, fonts, sensor noise) into a near-unique signature that persists across sessions. Even with random per-event IDs, the fingerprint reconstructs a per-device profile.

How can mobile apps avoid re-identification risks?

Avoid them at the source. No device IDs, no custom properties, no stored IPs, frequent session rotation, coarse timestamps. The most defensible architecture is one where re-identification is structurally impossible — not one that bolts an anonymization step on top of a pipeline that collects everything.

Does Respectlytics anonymize data?

Respectlytics avoids collection. Five fields stored per event, no device IDs, no custom properties, no stored IPs, RAM-only session IDs that rotate every two hours. There is no anonymization step because there is nothing identifying to strip.

Legal Disclaimer: This information is provided for educational purposes and does not constitute legal advice. Regulations and requirements change over time. Consult your legal team to determine the requirements that apply to your specific situation.

Genuinely private analytics, by design.

No device IDs. No custom properties. No persistent user IDs. Five fields stored per event.