Maya Chen

2025-11-18

Insights

What we've learned from a billion records

After scanning across hundreds of customer environments, the patterns of where sensitive data ends up are remarkably consistent.

What the data shows

Argus has scanned a wide variety of customer environments: different industries, different sizes, different data architectures. The patterns of where sensitive data ends up are remarkably consistent, and most of those places are not where the security team expects.

The patterns we see everywhere

Three patterns repeat across nearly every customer environment we've worked with, regardless of industry or company size. None are surprising once they're named, but they're hard to find without continuous scanning across the full environment.

The first: developer test environments contain production-grade personal data far more often than anyone admits. The data gets there through a hundred small decisions — copying production for realistic testing, snapshotting a database before a migration, pulling a sample to debug a customer issue. Each decision is defensible in isolation; the cumulative effect is that test environments often have weaker controls than production but similar data sensitivity.

Three places sensitive data hides

The second: log tables and observability data routinely contain identifiers and other personal data that fall outside normal classification reviews. Stack traces include user emails. Request logs include session tokens that map to user identity. Application metrics include user-specific dimensions that aggregate into identifiable patterns. Logs and metrics like these get retained longer than the production data they describe and are rarely covered by the access controls designed for sensitive data.
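To make the log pattern concrete, here is a minimal sketch of checking log lines for two of the identifier types mentioned above. It is an illustration, not a description of how Argus classifies data: the regular expressions, the `session=`/`token=` key format, and the function names are all assumptions chosen for the example, and a real classifier would need far broader coverage.

```python
import re

# Illustrative patterns only; both are assumptions made for this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical token format: a "session=" or "token=" key followed by
# 16 or more URL-safe characters.
TOKEN_RE = re.compile(r"\b(?:session|token)=([A-Za-z0-9_-]{16,})")


def find_identifiers(line):
    """Return a list of (kind, value) pairs found in one log line."""
    findings = [("email", m.group(0)) for m in EMAIL_RE.finditer(line)]
    findings += [("session_token", m.group(1)) for m in TOKEN_RE.finditer(line)]
    return findings


def scan_log(lines):
    """Map 1-based line numbers to their findings, skipping clean lines."""
    report = {}
    for number, line in enumerate(lines, start=1):
        findings = find_identifiers(line)
        if findings:
            report[number] = findings
    return report


# Made-up log lines for demonstration.
sample_log = [
    "2025-01-01T00:00:00Z INFO request ok path=/health",
    "2025-01-01T00:00:01Z ERROR lookup failed for jane.doe@example.com",
    "2025-01-01T00:00:02Z DEBUG GET /cart?session=abcDEF0123456789xyz",
]
report = scan_log(sample_log)
```

On the sample lines, `report` flags line 2 for an email address and line 3 for a session token, while the health-check line passes clean. Pattern matching like this only catches identifiers with a recognizable shape, which is part of why this category tends to slip past one-off reviews.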

The third: shared analytics workspaces are where the long tail of unintended exposure lives. Notebooks with cached query results. CSV exports left in shared drives. Saved query history with literal values that include identifiers. The analytics workflow optimizes for speed and iteration, which is in tension with data minimization, and the result is sensitive data scattered across artifacts nobody's tracking.
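Cached notebook output is one of the easier artifacts to inspect, because a Jupyter notebook is a JSON file that stores each cell's last output alongside its source. The sketch below walks those cached outputs and reports cells that contain email addresses. It assumes the version 4 notebook layout (`cells`, `outputs`, `output_type`); the email pattern and the function names are hypothetical choices for illustration only.

```python
import json
import re

# Illustrative email pattern; an assumption for this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _as_text(value):
    # Notebook JSON stores multi-line text as one string or a list of strings.
    return "".join(value) if isinstance(value, list) else str(value)


def output_text(output):
    """Collect the plain-text content of one cached cell output."""
    kind = output.get("output_type")
    if kind == "stream":
        return _as_text(output.get("text", ""))
    if kind in ("execute_result", "display_data"):
        return _as_text(output.get("data", {}).get("text/plain", ""))
    if kind == "error":
        return "\n".join(output.get("traceback", []))
    return ""


def emails_in_notebook(notebook):
    """Return {cell_index: sorted unique emails} for code cells with cached output."""
    hits = {}
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # this sketch inspects cached outputs only, not cell source
        text = "\n".join(output_text(o) for o in cell.get("outputs", []))
        found = sorted(set(EMAIL_RE.findall(text)))
        if found:
            hits[index] = found
    return hits


def scan_notebook_file(path):
    """Load an .ipynb file from disk and scan its cached outputs."""
    with open(path, encoding="utf-8") as handle:
        return emails_in_notebook(json.load(handle))


# Tiny in-memory notebook with made-up data.
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["Contact ops@example.com"]},
        {"cell_type": "code", "source": ["df.head()"], "outputs": [
            {"output_type": "execute_result",
             "data": {"text/plain": ["   email\n",
                                     "0  a.user@example.com\n",
                                     "1  b.user@example.org"]}},
        ]},
        {"cell_type": "code", "source": ["print('done')"], "outputs": [
            {"output_type": "stream", "name": "stdout", "text": "done\n"},
        ]},
    ]
}
hits = emails_in_notebook(sample_notebook)
```

Here `hits` points at cell index 1, where a cached `df.head()` result still holds two addresses even though the query that produced it may never run again. The same walk could be extended to cell source or to other identifier types.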

What changes when teams find this early

The teams that catch these patterns early have moved away from the model of classifying production databases once a year and toward continuous discovery as the operational baseline. It's a different way to think about the work — less project, more infrastructure. Less audit cycle, more always-on visibility.

The shift takes time and isn't always easy to justify in a budget cycle. But the teams that have made it stop having the same conversations about exposure incidents that everyone else is still having every quarter.
