Proposal: Add Sensitive Data Labels #176
Comments
Very good idea, but who implements it and how will be a challenge. Do you see this across all otel data types? |
Couldn't this be handled with metadata about semantic conventions? E.g. in the YAML definitions we add a field |
@jkowall I see this across all types, since "background" tasks have the potential to leak data, such as appending the SQL query to a metric capturing latency (not my greatest example, but it highlights that this shouldn't be restricted to spans and metrics that contain a user context). @Oberon00 potentially? I haven't looked at it in more detail, but it would complicate managing data processors that look for special tags, so to speak, within the labels or have to do some level of regex matching. Sorry, I am not ruling out sensitive label prefixes in the semantic convention, but I also understand those become hard to manage once the convention evolves. |
I think this is a good idea. I am not convinced that the semantic convention is the place where an attribute should be defined as sensitive or not. It could easily change between applications. It probably needs to be defined by the user themselves, and the protocol handles the rest. |
What do you mean by prefix? And why would it become difficult to manage once they evolve? I was talking about adding an additional field to semantic conventions. For example, this is the current net.peer.ip definition in YAML: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.6.1/semantic_conventions/trace/general.yaml#L36-L41 We could extend it with a field:
    - id: peer.ip
      sensitive: pii
      brief: ...
|
Two questions:
|
The
Can't answer this fully, but maybe? I imagined it would mean the users could set "attributes" and "sensitive attributes" separately. That's just how I initially imagined it though, haven't put much more thought into it than that. |
Makes sense. Although if we just want to cover security we could simplify by always treating it as sensitive.
Who is the "user" in that case? EDIT: See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/glossary.md#user-roles |
Sorry, I meant user as in application owner. |
Hey @Oberon00
I had made an assumption that the label keys would be transformed to something like:
While adopting the semantic convention internally at Atlassian, there have been a few times when fields changed (sometimes it was us internally updating the labels, sometimes the semantic convention we were extending had changed). Moreover, we've had AWS implement a conflicting semantic convention for databases (@jamesmoessis correct me if I am wrong). My biggest concern with this approach is that the convention changes in a non-backwards-compatible way, so that data that was once considered sensitive (and should still be) no longer is, which then allows sensitive data into systems that shouldn't contain it.
An application that is running in different environments could consider internal testing / synthetic users non-sensitive compared to real-life users (this is more applicable for Real User Monitoring).
I would argue so, I don't think the semantic convention can extend far enough to cover all cases where data regulations are concerned. |
No, my suggestion was to just have this metadata in the YAML, and at any point where you would want to hide/remove sensitive data, you would look up the key in the YAML (or probably not directly from the YAML but some data file generated from it) to determine if it is sensitive. So a key would be either sensitive or not, but any particular value for that key would have the same sensitivity, which does not support the use cases brought up above in #176 (comment) |
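A minimal sketch of the key lookup described above, in Python. The `SENSITIVITY` table stands in for a data file generated from the semantic convention YAML; the table contents, the `pii` value, and the `redact_by_key` helper are hypothetical and not part of any OpenTelemetry API.

```python
from typing import Any, Dict, Mapping

# Hypothetical table generated from the semantic convention YAML.
# Only keys that carry a `sensitive` field would appear here.
SENSITIVITY: Dict[str, str] = {
    "net.peer.ip": "pii",
}

REDACTED = "[REDACTED]"


def redact_by_key(attributes: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of attributes with the values of sensitive keys masked."""
    return {
        key: REDACTED if key in SENSITIVITY else value
        for key, value in attributes.items()
    }


span_attributes = {"net.peer.ip": "203.0.113.7", "http.method": "GET"}
print(redact_by_key(span_attributes))
```

Because the decision is made on the key alone, every value recorded under `net.peer.ip` is masked, which is exactly the limitation noted above: the same key cannot be sensitive in one environment and non-sensitive in another.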
It is true that we have changed the semantic conventions much more often than the protobuf. But that's because the protobuf is stable, while the semantic conventions are still experimental (with the sole exception of service.name IIRC). For sensitivity in particular, making a key non-sensitive would be impossible for the reason you mentioned. Instead, a new key would have to be used. |
@MovieStoreGuy, thank you for opening this proposal: while we all love having our technical observability, the boundaries of privacy concerns are crossed quickly, so having it in mind with OpenTelemetry would be great! From the discussion it appears that there is no "one-size-fits-all" solution, so here are a few thoughts from my end: Adding a "sensitive" field to the semantic conventions will only work for very few fields, since "personally identifiable" is often not well defined:
This problem is not new and also reflected in many "official" definitions of PII, e.g. https://www.gsa.gov/reference/gsa-privacy-program/rules-and-policies-protecting-pii-privacy-act:
Now, of course one could argue that none of that is a concern for the OpenTelemetry standard, but while I think there cannot be a simple and complete solution, there are multiple ways to tackle this:
|
I really like the idea of this. If it were possible to include a privacy module inside the app exporter and the collector, with the ability to both observe and anonymise fields based on some provided configuration, it would address my immediate concerns of leaking this type of data to vendors who aren't prepared to accept it. I have a vague idea this already exists as |
For the SDKs, we don't really have this, but there is an issue that would probably be a prerequisite for easier filtering: open-telemetry/opentelemetry-specification#1089. Actually the privacy filtering use case was already brought up there: open-telemetry/opentelemetry-specification#1089 (comment) |
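To make the "anonymise fields based on some provided configuration" idea concrete, here is a rough Python sketch. The `POLICY` mapping, the action names (`hash`, `drop`, `keep`), and the `apply_policy` function are assumptions made for illustration; they do not describe an existing SDK or collector feature, and the attribute keys are only examples.

```python
import hashlib
from typing import Any, Dict, Mapping

# Hypothetical per-deployment policy supplied by the application owner:
# attribute key -> action. Keys not listed are kept unchanged.
POLICY: Dict[str, str] = {
    "enduser.id": "hash",
    "db.statement": "drop",
}


def apply_policy(attributes: Mapping[str, Any], policy: Mapping[str, str]) -> Dict[str, Any]:
    """Apply drop/hash/keep actions to a copy of the attributes."""
    result: Dict[str, Any] = {}
    for key, value in attributes.items():
        action = policy.get(key, "keep")
        if action == "drop":
            continue  # remove the attribute entirely
        if action == "hash":
            # Unsalted SHA-256 is pseudonymisation at best, not anonymisation.
            result[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            result[key] = value
    return result


attrs = {"enduser.id": "alice", "db.statement": "SELECT 1", "http.method": "GET"}
print(apply_policy(attrs, POLICY))
```

Since the policy is plain data owned by the application owner, the same code could run with different policies per environment, for example a stricter one in production than for synthetic test users.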
The Pitch
The idea is rather simple: extend the current OTLP to include an additional dictionary that is intended to store sensitive data.
The Rationale
Making the use of sensitive data explicit instead of implicit means that data vendors and data processors (pre and post) can first of all:
As we collect more telemetry from users' experiences, the likelihood of including UGC and PII increases, and this should help the systems that collect it to be clear on the actions they take on it.
The outcome
The idea I have in mind, and would love further discussion on, is how this is consumed by the client (the application actually sending the data).
The idea is that OTLP is extended to include another dictionary that is explicit about its sensitive nature, and the SDK implements an AddSensitiveLabel method to make it clear what the data is. From that, the otel-collector could filter, drop, or transform those labels if they are known.
From there, the exported data vendors / processes can set up ingestion policies on what should happen when sensitive data arrives.
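As a rough illustration of the shape being proposed, here is a Python sketch with a separate dictionary for sensitive labels. The `Record` class, the `add_sensitive_label` method (a Python spelling of the proposed `AddSensitiveLabel`), and the `drop_sensitive` step are all hypothetical; none of this exists in OTLP or in any SDK today.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Record:
    """Hypothetical telemetry record carrying two separate dictionaries."""
    labels: Dict[str, str] = field(default_factory=dict)
    sensitive_labels: Dict[str, str] = field(default_factory=dict)

    def add_label(self, key: str, value: str) -> None:
        self.labels[key] = value

    def add_sensitive_label(self, key: str, value: str) -> None:
        # The caller states explicitly that this value is sensitive.
        self.sensitive_labels[key] = value


def drop_sensitive(record: Record) -> Record:
    """Collector-style step: forward only the non-sensitive labels."""
    return Record(labels=dict(record.labels))


record = Record()
record.add_label("http.method", "GET")
record.add_sensitive_label("enduser.id", "alice")

exported = drop_sensitive(record)
print(exported.labels, exported.sensitive_labels)
```

Keeping the two dictionaries apart means a processor never has to guess from key names or regex matches which values are sensitive; dropping is only one possible policy, and a transform or filter step could be swapped in at the same place.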