What are some database design best practices or strategies for dealing with instrument time series data from a SCADA system?

Stephen Young 0 Reputation points
2025-12-15T15:57:43.48+00:00

I am interested in learning some best practices for database design for dealing with the copious amount of data that can come from a SCADA system for lab/research machinery, where several instruments may be logged at high frequency. I was part of a project in which a pre-commercial demonstrator facility was built and used to test and research a particular technology, and while we managed to develop procedures that worked, more or less, for coping with the amount of data we were collecting, the approach felt clunky and inelegant. When it came time to conduct a second-look audit/analysis of the data from the machine, I found myself needing to dynamically select databases and tables in the queries issued by the program that retrieved and analyzed the raw instrument data, and from some reading online, this need to dynamically determine the database and table to query is usually a symptom of poor database design.

Most tutorials and information I can find on databases are aimed at web apps or sales databases with user or transaction data. While sometimes helpful (I'd imagine an online store the likes of Amazon's accumulates a lot of sales data), they often feel too generic to be applicable to my context. I would like to know of any strategies or best practices for designing databases for the specific context of SCADA and instrument data.

For those interested, additional context on the pilot plant, with some details on my specific setup, is given below:

I have a MS SQL Server instance which contains the raw instrument data from a pilot plant (additional context on the pilot plant and the data from it has been published here: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP3/LZ56EG). The pilot plant had a locally hosted OPC server (TOP Server) for interacting with equipment and monitoring instrumentation through our custom-built human-machine interface (HMI), written in MATLAB, on our main control PC. The control PC also locally hosted a MS SQL Server instance which was populated with instrument data through TOP Server's DataLogger plug-in. DataLogger was configured to poll a set list of tags on the OPC server (around 50 unique tags in our case) at a configurable frequency (once every second, once every 5 minutes, and so on). For each sample it would write the name of the tag, the timestamp of when the value was retrieved, the value itself, and the tag quality. We would normally set the logging frequency to once per second during periods of active testing/operation and once every 5 minutes while the machine was offline or on stand-by.

This data would be queried from the SQL Server through MATLAB for offline analysis of the instrument data collected during experiments. Our experiments eventually became standardized and automated, and we recorded the start and end time of each experiment. Once the table holding the data log reached around 2 GB in size, the time to execute a query for an experiment's data became noticeable. Our procedure at that point was to create a new table and re-configure the data logger to write the instrument log to it. Whenever the set of tags that DataLogger polled changed, we would instead create a new database and a new table on the server, so that it was readily apparent to us that a change had been made.

Tables with logged instrument data would have the columns:

  • TableIndex (int, not null)
  • TagTimestamp (datetime, not null)
  • TagItemID (varchar(50), not null)
  • TagValue (sql_variant, not null)
  • TagQuality (int, not null)

This was the only data contained in the database; other aspects that, in retrospect, could have been additional tables in the database were managed with spreadsheets. We were a group of Mining Engineering grad students without formal training in designing and managing databases, and I imagine there are plenty of reasons not to do it the way we did, even though it worked for our purposes. Feel free to tell me what some of those reasons would be.

One reason that became apparent to me after conducting the second-look analysis was that the structure of the database required me to programmatically determine which database and table contained the log for a given experiment. As a result, the SQL queries sent to the server could not be fully parameterized, since the database and table names were filled in with string formatting. Because this database can only be accessed locally, the likelihood of a SQL injection attack is very low, but it could become a problem if the SQL Server were ever made remotely accessible.
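
For illustration, the retrieval pattern ended up looking something like the following (the table name here is made up). From what I have read since, QUOTENAME plus sp_executesql would at least keep the values parameterized and the spliced-in name escaped:

    -- Illustrative only: table name chosen at run time, values parameterized.
    DECLARE @table sysname = N'InstrumentLog_Run12',  -- hypothetical table name
            @sql   nvarchar(max);

    SET @sql = N'SELECT TagTimestamp, TagItemID, TagValue, TagQuality
                 FROM dbo.' + QUOTENAME(@table) + N'
                 WHERE TagTimestamp BETWEEN @start AND @end;';

    EXEC sys.sp_executesql @sql,
         N'@start datetime, @end datetime',
         @start = '2021-06-01T08:00:00',
         @end   = '2021-06-01T17:00:00';
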

Any guidance on a "good" or better way to structure the database would be greatly appreciated. It is not likely that this database will ever be used again, but I am curious for my own benefit and want to learn from this experience.

To summarize, I am looking for the following things:

  • materials on strategies and best practices for database design for logging instrument data from a SCADA system
  • strategies to cope with copious amounts of data which can come from an instrument data log
  • specific problems with the structure of the database of the pilot plant
  • potential remedies for restructuring the database
SQL Server | Other

2 answers

  1. Q&A Assist
    2025-12-15T15:58:17.5533333+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    When designing a database for logging instrument data from a SCADA system, especially in the context of high-frequency data collection, there are several best practices and strategies you can consider:

    1. Schema Design: Instead of creating a new table for each experiment or change in tags, consider a more normalized schema in which a single table holds all instrument data. For example, you could have a Measurements table that includes columns for MeasurementID, TagTimestamp, TagItemID, TagValue, and TagQuality. This allows for easier querying and removes the need for dynamic SQL (a concrete sketch follows at the end of this answer).
    2. Data Types: Use appropriate data types for your columns. Instead of sql_variant for TagValue, consider using specific types like float or decimal if the values are numeric. This can improve performance and make your queries more efficient.
    3. Indexing: Implement indexing on frequently queried columns, such as TagTimestamp and TagItemID. This can significantly speed up query performance, especially as the size of your data grows.
    4. Partitioning: For large datasets, consider partitioning your tables based on time (e.g., monthly or yearly). This can help manage the size of the table and improve query performance by allowing the database engine to scan only relevant partitions.
    5. Data Retention Policies: Implement data retention policies to manage the volume of data. For example, you might choose to archive or delete data older than a certain threshold, which can help maintain performance and manage storage costs.
    6. Batch Processing: If you are dealing with large amounts of data, consider using batch processing to aggregate or summarize data before it is written to the database. This can reduce the volume of data stored and improve performance during analysis.
    7. Use of Metadata: Create a separate table for metadata about experiments, such as ExperimentID, StartTime, EndTime, and any relevant parameters. This allows you to keep your measurements table focused on the data itself while still being able to relate measurements to specific experiments.
    8. Avoid Dynamic SQL: As you noted, the need for dynamic SQL can indicate a design issue. By structuring your database to avoid the need for dynamic table names, you can simplify your queries and improve security.
    9. Documentation and Training: Ensure that all team members understand the database structure and best practices for querying and managing data. This can help prevent issues related to misuse or misunderstanding of the database.

    By implementing these strategies, you can create a more robust and efficient database design for your SCADA system that will facilitate easier analysis and management of the data collected from your instruments.
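
    Taken together, points 1, 3, 7, and 8 might look something like the sketch below. All object names (Experiments, Measurements, GetExperimentData) are illustrative, and TagValue is typed as float on the assumption that the tags are numeric:

        -- Experiment metadata (point 7): one row per run.
        CREATE TABLE dbo.Experiments (
            ExperimentID int IDENTITY PRIMARY KEY,
            StartTime    datetime2(0)  NOT NULL,
            EndTime      datetime2(0)  NULL,
            Notes        nvarchar(200) NULL
        );

        -- One table for all instrument data (point 1), with a typed value
        -- column instead of sql_variant (point 2).
        CREATE TABLE dbo.Measurements (
            ExperimentID int          NOT NULL REFERENCES dbo.Experiments,
            TagTimestamp datetime2(3) NOT NULL,
            TagItemID    varchar(50)  NOT NULL,
            TagValue     float        NOT NULL,
            TagQuality   int          NOT NULL
        );

        -- Clustered index matching the dominant query pattern (point 3):
        -- all tags for one experiment, ordered by time.
        CREATE CLUSTERED INDEX CIX_Measurements
            ON dbo.Measurements (ExperimentID, TagTimestamp, TagItemID);
        GO

        -- Retrieval is fully parameterized; no dynamic SQL needed (point 8).
        CREATE PROCEDURE dbo.GetExperimentData
            @ExperimentID int
        AS
            SELECT TagTimestamp, TagItemID, TagValue, TagQuality
            FROM   dbo.Measurements
            WHERE  ExperimentID = @ExperimentID
            ORDER BY TagTimestamp, TagItemID;
        GO

    If some tags report non-numeric values, a second value column (or a separate table) per type is an alternative to falling back on sql_variant.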



  2. Erland Sommarskog 129.1K Reputation points MVP Volunteer Moderator
    2025-12-15T22:21:32.84+00:00

    First of all, I am not familiar with SCADA, so I cannot speak exactly in that context.

    But it seems that partitioning would serve you well. I can see that there could be a point in filling data into an empty staging table while an experiment runs. That table would have the columns you describe, plus some column which identifies the experiment as such. That could be a plain id which refers to another table where you have data about the experiment.

    Once the experiment has completed, you can switch that table into the main, partitioned table. This is a metadata-only operation, so it is swift.

    Thus, you will have one table with all the experiments instead of umpteen different tables with the same structure. And, yes, they would be in the same database. Having multiple databases only makes things messy, I think.

    Here, I have left out the technical details of the exact commands in the interest of brevity.
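
    That said, a minimal sketch of what the switch pattern could look like, assuming the main table is partitioned by an experiment id (all object names are illustrative, and the staging table must match the main table's columns, indexes, and filegroup):

        -- One partition boundary per experiment id; add new boundaries
        -- over time with ALTER PARTITION FUNCTION ... SPLIT RANGE.
        CREATE PARTITION FUNCTION pfExperiment (int)
        AS RANGE RIGHT FOR VALUES (1, 2, 3);

        CREATE PARTITION SCHEME psExperiment
        AS PARTITION pfExperiment ALL TO ([PRIMARY]);

        -- The main table holding all completed experiments.
        CREATE TABLE dbo.ExperimentLog (
            ExperimentID int          NOT NULL,
            TagTimestamp datetime2(3) NOT NULL,
            TagItemID    varchar(50)  NOT NULL,
            TagValue     float        NOT NULL,
            TagQuality   int          NOT NULL,
            CONSTRAINT PK_ExperimentLog
                PRIMARY KEY CLUSTERED (ExperimentID, TagTimestamp, TagItemID)
        ) ON psExperiment (ExperimentID);

        -- Staging table for the run in progress: same structure, plus a
        -- CHECK constraint pinning every row to the target partition.
        CREATE TABLE dbo.ExperimentLogStaging (
            ExperimentID int          NOT NULL CHECK (ExperimentID = 3),
            TagTimestamp datetime2(3) NOT NULL,
            TagItemID    varchar(50)  NOT NULL,
            TagValue     float        NOT NULL,
            TagQuality   int          NOT NULL,
            CONSTRAINT PK_ExperimentLogStaging
                PRIMARY KEY CLUSTERED (ExperimentID, TagTimestamp, TagItemID)
        ) ON [PRIMARY];

        -- When the run completes: a metadata-only operation, near-instant
        -- regardless of how many rows the staging table holds.
        ALTER TABLE dbo.ExperimentLogStaging
            SWITCH TO dbo.ExperimentLog PARTITION $PARTITION.pfExperiment(3);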

