There’s not a person out there who would disagree that IoT devices are generating and sending more structured and unstructured data than ever. Most would even agree that a lot of IoT devices being developed today are performing more complicated data processing tasks and are even storing data on-device to save network load and bandwidth.
So with all this data generated and multiple data flows, what is happening (or what can happen) to ensure data privacy? To get into that, let’s go over the classic five W’s to get a deeper look into data privacy in IoT.
Who does this data affect, who is it regarding, and what’s being done to ensure privacy?
- There is a phrase that has come to encompass a large part of what we consider data privacy – data anonymization. Data Anonymization is a type of information sanitization with the intent being privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous. Here are a few different methods of data anonymization in use today:
- Data Masking — Hiding data with altered values.
You can create a mirror version of a database and apply modification techniques such as character shuffling, encryption, and word or character substitution. For example, you can replace a character in a value with a symbol such as “*” or “x”. Done well, data masking makes reverse engineering or detection very difficult.
- Pseudonymization — A data management and de-identification method that replaces private identifiers with fake identifiers or pseudonyms, for example replacing the identifier “John Smith” with “Mark Spencer”.
Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.
- Generalization — Deliberately removes some of the data to make it less identifiable.
Data can be modified into a set of ranges or a broad area with appropriate boundaries. You can remove the house number in an address, but make sure you don’t remove the road name. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy.
- Data Swapping — Also known as shuffling and permutation, a technique used to rearrange the dataset attribute values so they don’t correspond with the original records.
Swapping attributes (columns) that contain identifying values, such as date of birth, may have more impact on anonymization than swapping less identifying ones, such as membership type.
- Data Perturbation — Modifies the original dataset slightly by applying techniques that round numbers and add random noise.
The perturbation needs to be in proportion to the range of values. A small base may lead to weak anonymization, while a large base can reduce the utility of the dataset. For example, you can use a base of 5 for rounding values like age or house number because it’s proportional to the original value. You can multiply a house number by 15 and the result may still look credible, but a base as high as 15 can make age values look fake.
- Synthetic Data — Algorithmically manufactured information that has no connection to real events.
Synthetic data is used to create artificial datasets instead of altering the original dataset or using it as is and risking privacy and security. The process involves creating statistical models based on patterns found in the original dataset. You can use standard deviations, medians, linear regression or other statistical techniques to generate the synthetic data.
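As a rough illustration of the methods above, here is a minimal Python sketch using only the standard library. The sample records, field names, helper function names, and the demo key are all hypothetical and exist only for this example. It is a sketch under those assumptions, not a production-grade anonymization pipeline.

```python
import hashlib
import hmac
import random
import statistics

# Hypothetical sample records; every field name and value is invented.
records = [
    {"name": "John Smith", "age": 34, "address": "12 Oak Road", "phone": "555-0101"},
    {"name": "Ana Lopez", "age": 47, "address": "80 Elm Street", "phone": "555-0199"},
    {"name": "Wei Chen", "age": 29, "address": "7 Pine Avenue", "phone": "555-0142"},
]

def mask(value, visible=4, symbol="*"):
    """Data masking: replace all but the last `visible` characters with a symbol."""
    hidden = max(len(value) - visible, 0)
    return symbol * hidden + value[hidden:]

def pseudonymize(value, secret_key):
    """Pseudonymization: map an identifier to a stable pseudonym via a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user-" + digest[:12]

def generalize_address(address):
    """Generalization: drop a leading house number but keep the road name."""
    number, _, road = address.partition(" ")
    if number[:1].isdigit() and road:
        return road
    return address

def round_to_base(value, base=5):
    """Perturbation by rounding: snap a number to a nearby multiple of `base`."""
    return base * round(value / base)

def swap_column(rows, column, rng):
    """Data swapping: shuffle one column's values across the records."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

def synthetic_ages(real_ages, count, rng):
    """Synthetic data: sample ages from a normal model fitted to the real ones."""
    mu = statistics.mean(real_ages)
    sigma = statistics.stdev(real_ages)
    return [max(0, round(rng.gauss(mu, sigma))) for _ in range(count)]

# Assumption: a real key would come from a secrets store, never from source code.
KEY = b"demo-key-not-for-production"
rng = random.Random(42)

anonymized = [
    {
        "name": pseudonymize(r["name"], KEY),
        "age": round_to_base(r["age"]),
        "address": generalize_address(r["address"]),
        "phone": mask(r["phone"]),
    }
    for r in swap_column(records, "age", rng)
]
```

With these helpers, `mask("555-0101")` returns `****0101` and `generalize_address("12 Oak Road")` returns `Oak Road`. The keyed hash is one possible way to generate pseudonyms: the same input always maps to the same pseudonym, so records can still be joined for analytics, but anyone holding the key can recompute the mapping, so the key has to be protected like any other secret.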
Where are some of the most current data privacy laws being drafted and implemented?
- More countries are creating specific laws about keeping their citizens’ data private. These laws are country-specific, but because citizens travel internationally and data about them may be gathered in other countries, it’s important to keep a global view on privacy and respect each country’s data privacy laws. To do that efficiently, many organizations adopt one of the most stringent frameworks as their baseline: the EU’s General Data Protection Regulation (GDPR). Too many articles have been written about GDPR to count, so we’ll forgo the details for now.
What kind of data are we talking about?
- Structured Data – This kind of data is easily organized and acted upon by computers. GPS coordinates, motion detection, and temperature readings are all good examples of structured data.
- Unstructured Data – This kind of data is not easily classified or understood by computers. Digital video, images, and audio are all prime examples of unstructured data.
When we talk about schedules and timeframes in data privacy, what does that mean?
- How frequently is the data stored?
- This depends entirely on the type of device and its settings. It could be a camera recording video 24/7, or it could be a motion sensor creating data only when triggered. It also depends on the device’s power supply and the corresponding energy efficiency requirements. For example, a device that runs on a button cell might record data less frequently to save power.
- How frequently is it wiped/cleaned/anonymized?
- This, on the other hand, depends more on organizational policy and can vary widely. Hopefully, best practices and scheduled cleans are adhered to and everyone’s data remains safe and private. It can also depend on the cost of data storage. If data is stored on a cloud service such as AWS, their pricing is often calculated on a data-volume basis. So, a company might have a policy in place to delete unused data sets after a certain period.
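A retention policy like the one described above could be sketched as follows. The 90-day window, the function names, and the mapping of data set names to last-used timestamps are all assumptions made for illustration; an actual policy would follow the organization’s own rules and its storage provider’s tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 90-day window is illustrative, not a recommendation.
RETENTION = timedelta(days=90)

def is_expired(last_used, now, retention=RETENTION):
    """Return True when a data set has gone unused for longer than the window."""
    return now - last_used > retention

def purge(datasets, now, retention=RETENTION):
    """Keep only the data sets that are still inside the retention window.

    `datasets` maps a hypothetical data set name to its last-used timestamp.
    """
    return {
        name: last_used
        for name, last_used in datasets.items()
        if not is_expired(last_used, now, retention)
    }

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    datasets = {
        "door-sensor-2023": now - timedelta(days=200),
        "camera-clips-recent": now - timedelta(days=3),
    }
    print(sorted(purge(datasets, now)))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the policy easy to test and to run on a schedule.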
How are people defining best practices for data privacy?
- Looking at the above info on data and privacy, you may be wondering whether on-device or cloud storage is better, or whether to keep the complete data set on-device and send only the data that requires analytical processing to the cloud. There are many in-between practices, which also depend on the application’s use case. So the short and somewhat incomplete answer is that it depends on a few things: how the organization is run and what types of devices are used.
- Either way, the focus here is to ensure that the data remains private. Other than where the data is being stored, it’s important to look at things like if/how the data is encrypted while flowing from point A to B, if/how the data files themselves are password protected, and if secondary security measures like 2FA are being implemented to give an extra layer of security. What kind of security measures are taken on the projects you’ve been working on?
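For the “encrypted while flowing from point A to B” part, one common option is TLS. Below is a minimal sketch of a client-side TLS configuration using Python’s standard `ssl` module. The function name and the choice of TLS 1.2 as the minimum version are assumptions for this example, and a constrained device may use a different TLS stack entirely.

```python
import ssl

def make_tls_context():
    """Build a client-side TLS context that verifies the server it talks to."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumption: refusing anything older than TLS 1.2 suits this deployment.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

if __name__ == "__main__":
    context = make_tls_context()
    print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A context like this can then be handed to a client library that accepts one, so the device refuses to send data to a server it cannot authenticate.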
Data Privacy in IoT from Step One
To touch on the subject at more of a developer level, secure by design is an approach that builds security into the design and manufacturing process so that the devices being created are as secure as possible.
Data doesn’t stop at the device, so no matter how secure your device is, network security should be considered too. Is the data you work with being passed on to a public cloud, a hybrid cloud, or something else? Depending on the type of network the data goes through, different measures may need to be taken to maintain data privacy. The last type of security, often overlooked in the tech world, is physical security. It shouldn’t need to be said that keeping data centers and storage devices physically safe and secure is very important.
When talking about data security and privacy, it seems a little ironic that humans are almost always the weakest link. However, this point is more about security and less about privacy so for now, we’ll leave it for another article.
Ensuring data privacy requires ongoing effort and diligence. Because this often gets overlooked, it’s best to build privacy into the system and device from the start. TechDesign works on IoT projects with creators, innovators, and startups to build the best possible version of a product, from step one all the way through to market and after-service strategies. Learn more about data privacy in IoT by contacting them today.