
Software systems and computational methods

Data storage format for analytical systems based on metadata and dependency graphs between CSV and JSON

Alpatov Aleksey Nikolaevich

ORCID: 0000-0001-8624-1662

Assistant professor, MIREA - Russian Technological University

78 Prospekt Vernadskogo str., Moscow, 119454, Russia

aleksej01-91@mail.ru
Bogatireva Anna Alekseevna

Student, MIREA – Russian Technological University

111033, Russia, Moscow, 4A Tankovy ave., 24 sq.

pecherni@gmail.com

DOI: 10.7256/2454-0714.2024.2.70229

EDN: TVEPRE

Received: 25-03-2024

Published: 01-04-2024


Abstract: In the modern information society, the volume of data is constantly growing, and its effective processing is becoming key for enterprises; the transmission and storage of this data also play a critical role. Big data used in analytics systems is most often transmitted in one of two popular formats: CSV for structured data and JSON for unstructured data. However, existing file formats may not be effective or flexible enough for certain data analysis tasks: for example, they may not support complex data structures or provide sufficient control over metadata, while analytical tasks may require additional information about the data, such as metadata or a data schema. Based on the above, the subject of this study is a data format based on the combined use of CSV and JSON for processing and analyzing large amounts of information. A way of using these two formats together to implement a new data format is proposed. For this purpose, notation is introduced for the data structure, which includes CSV files, JSON files, metadata, and a dependency graph. Various types of functions are described, such as aggregating, transforming, and filtering functions, and examples of applying these functions to data are given. The proposed approach is a technique that can significantly simplify the analysis and processing of information. It relies on a formalized approach that establishes clear rules and procedures for working with data, which contributes to more efficient processing. Another aspect of the proposed approach is a criterion for choosing the most appropriate data storage format, based on the mathematical principles of information theory and entropy. Introducing an entropy-based criterion for choosing a data format makes it possible to evaluate the information content and compactness of the data. The approach is based on calculating the entropy of the candidate formats together with weights reflecting the importance of each data value; by comparing the entropies, the required data transmission format can be determined. This approach takes into account not only the compactness of the data but also the context of its use, as well as the possibility of including additional meta-information in the files themselves and supporting analysis-ready data.


Keywords:

Data storage formats, JSON, CSV, Analysis Ready Data, Metadata, Data processing, Data analysis, Integration of data formats, Apache Parquet, Big Data


 

Introduction

The task of integrating data across scientific fields is extremely important. The development of big data has produced a large number of disparate tools used in scientific research and written with different technology stacks; most importantly, the data formats they use vary from tool to tool, which makes data reformatting necessary, and for big data this can be difficult.

To solve this problem, a variety of programming interfaces (APIs, Application Programming Interfaces) can be used [1]. This approach has a number of undoubted advantages, but one significant disadvantage is the high cost of data serialization and deserialization, which can become a serious bottleneck for applications. In addition, there is no standardization of data representation, so developers end up building custom integrations on top of existing or bespoke systems. As an alternative, data can be exchanged using common file formats optimized for high performance in analytical systems. In fact, format conversion is a fairly common way to integrate data [2]. Most often, when highly specialized platforms are not used, data is accumulated in the CSV and JSON formats; in systems such as Hadoop, formats such as Avro can also be used.

CSV (Comma-Separated Values) is a text format designed to represent tabular data [3]. A table row corresponds to a line of text containing one or more fields separated by commas. The first line may contain the names of the data columns, but this is not required. The format does not provide data typing: all values are treated as strings. CSV is the most commonly used format for storing structured data exported from relational databases.

JSON (JavaScript Object Notation) is a text data interchange format based on JavaScript; the format itself is independent of JS and can be used in any programming language [4]. It is designed for unstructured data and gained popularity with the spread of REST and NoSQL approaches to data storage.

Avro is a compact and fast binary serialization format developed by the Apache Software Foundation [5]. It carries a data schema in JSON format, which makes it self-describing. Avro supports various data types and data compression and is efficient for transferring data between different systems. It is widely used in Big Data and distributed systems for data exchange and storage.

The main difficulty in using such formats for storing and presenting big data is guaranteeing the availability of metadata describing the datasets so that they can be used in the future. Metadata greatly simplifies the integration of disparate data by giving users additional information about how a dataset was collected or generated, which allows them to evaluate its quality. Metadata may also contain information about the methods used in the research that produced the dataset. Unfortunately, providing metadata is not yet a generally accepted practice, so metadata is often missing or incomplete, which complicates the further use of a dataset [6]. The analysis also showed that the formats considered here do not perform particularly well [7], although their advantage is full platform independence. For example, they may not support complex data structures or provide sufficient control over metadata, or they may be suboptimal for processing large amounts of data because of their structure or performance limitations when reading and writing.

Thus, the development and use of file formats aimed at optimizing the storage of big data is an important and urgent task.

 

Format development

Let's define a special case of data in which neither CSV nor JSON will be the preferred data transfer format. This case is presented below, in Figure 1, as an example of a fragment of a class diagram.

Figure 1 – A fragment of the class diagram

The diagram shows two classes: Service (the list of products/services) and ServiceCost (the price history of a product). Service contains the product name, its purchase price, a description of its characteristics, and the remaining stock. ServiceCost contains the product name, its price, and the date on which that price was assigned.

For Service, the optimal transfer format is a CSV table, while ServiceCost is more convenient to transfer as JSON: if this data were added to the same CSV file, all product data would be duplicated for each of its prices. It is currently proposed to transfer the data as two files packed into an archive. Listings 1 and 2 show the structure of these files.

Listing 1 – Transfer of Service in the service.csv file

sofa_low,10000.00,description,8
armchair_rocking,8000.00,description,14
bed_double,18000.00,description,6
chair_computer,12000.00,description,2
bed_children,16000.00,description,5

Listing 2 – Passing ServiceCost in the service_cost.json file

{
    "sofa_low": [
        {
            "price": "21000.00",
            "date": "2021.01.16"
        },
        {
            "price": "19000.00",
            "date": "2021.02.21"
        }
    ],
    "armchair_rocking": [
        {
            "price": "15000.00",
            "date": "2021.01.16"
        }
    ]
}

 

This representation is still redundant, since it contains many extra characters, and at the same time it carries insufficient information, because the files contain no information about the data types.

Let us introduce a criterion for choosing an effective data transfer method for the Service and ServiceCost classes, using information theory and entropy for this purpose. Entropy is a measure of the information uncertainty of data [8]. It is calculated by summing, over all data values, the product of each value's probability of occurrence and its base-2 logarithm (Shannon's formula) [9]. Let H_CSV and H_JSON be the entropies of the data in the files service.csv and service_cost.json, respectively. We introduce additional weighting factors for each probability of occurrence of a data value: let w_i and v_j be the weighting factors for the probabilities of data values appearing in CSV and JSON, respectively. These coefficients may reflect the importance of each value or the context in which it is used. Then entropy can be used to evaluate information content and as a criterion for choosing a data format, as shown in (1) and (2).

 

H_{CSV} = -\sum_{i=1}^{n} w_i p_i \log_2 p_i,                                              (1)

where p_i is the probability of the i-th data value appearing in the CSV records,

n is the number of distinct values in the CSV file.

 

H_{JSON} = -\sum_{j=1}^{m} v_j q_j \log_2 q_j,                                              (2)

where q_j is the probability of the j-th data value appearing in the JSON records,

m is the number of distinct values in the JSON file.

 

Let us compare the entropy of the data, using the CSV and JSON formats as an example, to determine the optimal data transfer format. Lower entropy may indicate a more compact representation of the data. In other words, if H_CSV < H_JSON, then CSV may be the more suitable format for data transmission; otherwise, if H_JSON < H_CSV, JSON may be preferable. Alternatively, to determine the preferred of the two formats, a slightly different approach can be used, which weights the two entropies with additional coefficients, as shown in (3).

 

K = \alpha H_{CSV} - \beta H_{JSON},                                                 (3)

where \alpha and \beta are coefficients reflecting the importance of the entropy of each format.

If K < 0, the preferred format is CSV. If K > 0, the preferred format is JSON. Otherwise, when K = 0, the two data storage formats are equally preferable.

This approach allows you to take into account not only the compactness of the data, but also the possibility of including key information in the files themselves. For the task of developing a new format, such key information may be additional meta-information, or, for example, the presence of noise in the data. If the data contains a lot of noise or random values, high entropy may indicate the uncertainty and complexity of analyzing this data. The proposed evaluation criterion takes into account the context of data use, their value, the degree of complexity and possible consequences of errors in data analysis due to high entropy.
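As an illustration of criterion (1)-(3), the sketch below computes weighted entropies for two candidate serializations of the same data and selects a format. It is a minimal example rather than part of the proposed format: the functions weighted_entropy and choose_format, the default unit weights, and the coefficients alpha and beta are assumptions made only for this sketch.

import math
from collections import Counter

def weighted_entropy(values, weights=None):
    # Weighted Shannon entropy, as in formulas (1) and (2); a value's weight
    # defaults to 1.0 when no importance coefficient is supplied.
    counts = Counter(values)
    total = sum(counts.values())
    weights = weights or {}
    entropy = 0.0
    for value, count in counts.items():
        p = count / total
        entropy -= weights.get(value, 1.0) * p * math.log2(p)
    return entropy

def choose_format(csv_values, json_values, alpha=1.0, beta=1.0):
    # Criterion (3): alpha and beta reflect the importance of each format's entropy.
    h_csv = weighted_entropy(csv_values)
    h_json = weighted_entropy(json_values)
    if alpha * h_csv < beta * h_json:
        return 'CSV'
    if alpha * h_csv > beta * h_json:
        return 'JSON'
    return 'either'

How the raw CSV and JSON representations are split into individual data values (fields, leaf values, or characters) is itself a modelling choice that affects the comparison.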

Based on the above, it is also possible to formulate the basic data format requirements for the Big Data domain. These include:

1. Standardization and prevalence. Standardization means the establishment of common and unified standards for representing and processing data. For example, standardized data formats include XML (eXtensible Markup Language), which is used for structuring and exchanging data in various fields, including web services. In the context of big data storage formats, standardization helps create generally accepted and unified ways of representing information. This facilitates data exchange between different systems and applications, simplifies software development, and ensures data compatibility between different tools. The prevalence of a data format reflects how widely it is used in the industry and across applications. The more widespread the format, the more likely it is that many tools and systems support it; this matters greatly in big data analysis, where a developed ecosystem for working with the format is needed, including programming languages, data analysis tools, and storage systems.

2. Compactness. The format should take up a minimum amount of storage space. This is important for the efficient use of storage and data transmission over the network, especially in the case of large amounts of information. Compact formats can save storage resources, speed up data transfer and reduce network load.

3. Compression capability. Support for data compression mechanisms can significantly reduce the amount of data stored and speed up its transmission over the network.

4. Metadata storage. Storing metadata in a data format plays an important role in ensuring that the information stored in this format is fully and correctly interpreted. Metadata is data about data and contains information about the structure, types, format, and other characteristics of the data.

5. Support for data ready for analysis. Analysis-Ready Data (ARD) is a concept in the field of big data analysis that involves preprocessing and preparing data before analysis. It is designed to simplify and speed up the data analysis process by providing ready-to-use datasets. Currently, this concept is being widely developed in the field of satellite image analysis [10].

The proposed data storage format is based on a synthesis of two common data storage formats, CSV and JSON. This solution has a number of advantages: CSV stores data in a tabular structure with explicitly defined columns, while JSON has a more flexible structure, so combining them provides convenient storage of tabular data while preserving hierarchy and nested objects. In addition, using JSON for data storage provides data typing, which makes the data easier to understand and work with, and an additional metafile can contain type information for the CSV part, which provides complete typing.

 

Modification of the data storage format with metadata

Column names can be added to the CSV file, and since CSV cannot store data type information, type information can be placed in a separate metafile. The modified CSV file and the metafile with type information are presented in Listings 3 and 4.

Listing 3 – Modification of service.csv.

name,purchase_price,description,remainder
sofa_low,10000.00,description,8
armchair_rocking,8000.00,description,14
bed_double,18000.00,description,6
chair_computer,12000.00,description,2
bed_children,16000.00,description,5

Listing 4 – The service.meta metafile.

{
    "name": "string",
    "purchase_price": "double",
    "description": "string",
    "remainder": "int"
}
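As a brief illustration, the types from the metafile in Listing 4 can be applied directly to the CSV part with pandas. This is a minimal sketch, not part of the handler described below, and it assumes the files from Listings 3 and 4 are available on disk.

import json
import pandas

# Read the column types declared in service.meta (Listing 4) and apply them
# to the dataframe loaded from service.csv (Listing 3).
with open('service.meta') as meta_file:
    column_types = json.load(meta_file)
service = pandas.read_csv('service.csv').astype(column_types)
print(service.dtypes)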

 

The JSON format, when deserialized in various programming languages, provides for data typing, but to unify the format, data types will be stored in a separate metafile (listing 5).

Listing 5 – The service_cost.meta metafile.

{
    "price": "double",
    "date": "date"
}

 

Thus, the developed format is an archive consisting of the four files described above.

The format can be expanded with additional data in CSV or JSON format with appropriate metafiles if necessary.

"PandasKeepFormat" is proposed as the name of the format, the format extension accordingly looks like PDKF.

 

Mathematical description of the format and applicable functions over the data structure

Next, let's consider a mathematical description for the data format proposed in this paper, which combines CSV and JSON, thus defining a set of valid operations on data.

Let:

· C is the set of CSV files, representing sets of rows and columns.

· J is the set of JSON files containing structured data in JSON format.

· M is the set of metadata describing the structure of the data in CSV or JSON, such as data types, column names, and other characteristics.

 

Then, the new data format can be defined as follows, as shown in formula 4.

S = C \cup J, \qquad D = A = \{ (s, m) \mid s \in S, m \in M \}                                                    (4)

In the above equality, S = C ∪ J is the set combining data from CSV and JSON; the elements of this set can be either tabular (CSV) or hierarchical (JSON) data. A is the set of pairs (s, m) in which the first element s represents data from S, and the second element m is the corresponding metadata from M. Thus, each data entry from S is accompanied by its own metadata.

This approach makes it possible to combine data and metadata; however, it does not take into account more complex dependencies in the data. For example, some data may depend on other data, and for some tasks it may also be important to understand the context and the order in which the data is used. In some analytical scenarios there may be a need to perform complex operations involving related data elements; for example, calculations that combine data from multiple sources may use information about relationships. A feature of the storage format proposed above is that it can be extended relatively easily, thanks to the possibility of adding additional data, as mentioned earlier. For example, functions defined in an implementation of this format are added to the function dictionary and can be applied to the data stored in the structure.

To expand this model, it is proposed to introduce elements from graph theory and category theory. To do this, we additionally introduce into the model a directed graph G, whose vertices V represent elements from S and whose edges E represent connections or dependencies between these elements; in other words, G = (V, E).
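For the running example, the dependency graph G = (V, E) could be recorded, for instance, as a plain adjacency dictionary over the names of the data parts; this representation is only an illustration and is not fixed by the format.

# Vertices are the data parts stored in the archive; an edge points from a part
# to the parts it depends on.
dependency_graph = {
    'service': [],                  # the product list depends on nothing
    'service_cost': ['service'],    # the price history refers to products by name
}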

To work with the data defined by the proposed format, it is necessary to additionally define a set of applicable functions on the data. In this paper, applicable functions, in the context of data structures, are understood as functions or operations that can be correctly applied to data of the proposed structure; that is, functions that correspond to the type or structure of the data and can be used to process or modify it. Accordingly, let F be the set of functions that can be applied to the data from D, taking into account its structure and metadata. Assume that each function has certain characteristics, such as aggregation, filtering, transformation, combination, etc. Let us define some of these functions.

Let D be the data structure, C the set of CSV files, J the set of JSON files, M the set of metadata, and f ∈ F a function from the set of applicable functions, whose elements can perform various operations such as filtering, sorting, aggregation, etc.

Let us denote a data element in D as d_{ij}, where i and j are the coordinates of the element in the coordinate grid; in other words, 1 ≤ i ≤ n and 1 ≤ j ≤ m, where n is the number of rows and m is the number of columns.

Then, taking into account the above, the aggregating function can be expressed as f_agg: D → R, a function that collapses a set of data elements d_{ij} (for example, a column) into a single value such as a sum or an average.

The transform function f_trans: D → D' converts data from one structure into a new structure.

The column data extraction function extracts a specific column from CSV or JSON data (if the data is organized appropriately). Let d be an element from D, and let m_d be the corresponding element from M. Then the function f_col(d, m_d, column_name) returns the values of the specified column. For example, if d is a CSV file, m_d is its metadata, and column_name is "age", then f_col returns the age column.

The value filtering function leaves only those data rows in which a certain column satisfies a given condition. Similar to the data extraction function, it can be expressed as f_filter(d, m_d, condition), where condition determines which rows of data should be kept after filtering. The condition is checked against a specific column of the data, and only the rows that satisfy it remain in the resulting dataset. For example, if d is a JSON file, m_d is its metadata, and the condition is price > 100, then f_filter leaves only the records with a price above $100.
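The sketch below illustrates the column-extraction and filtering functions over a dataframe obtained from the format. The names extract_column and filter_rows are illustrative, and the metadata argument is assumed to be the dictionary read from the corresponding .meta file.

import pandas

def extract_column(data, metadata, column):
    # f_col: return a single column, provided it is described in the metadata.
    if column not in metadata:
        raise KeyError(f'Column {column!r} is not described in the metadata')
    return data[column]

def filter_rows(data, column, condition):
    # f_filter: keep only the rows whose value in the given column satisfies the condition.
    return data[data[column].map(condition)]

# Example: keep only the price records above 100.
# expensive = filter_rows(service_cost, 'price', lambda price: float(price) > 100)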

Taking into account the above, finally, the new data format can be defined as shown in formula 5.

D = (A, G, F),                                                       (5)

where A is the set of pairs of data and metadata defined in (4), G = (V, E) is the dependency graph over the elements of S, and F is the set of applicable functions.

 

Implementation of the format handler

This section presents the implementation of the PDKF format handler in the Python programming language, which converts the data from a file into a dictionary of Pandas dataframes. A fragment of the handler's source code is presented in Listing 6.

Listing 6 – Implementation of the format handler.

import pandas
import json
from zipfile import ZipFile
from io import BytesIO


def read_file(file_name, file_content):
    if file_name[-3:] == 'csv':
        data = pandas.read_csv(BytesIO(file_content))
        return data
    elif file_name[-4:] == 'json':
        # The JSON part is an object whose keys are entity names and whose values
        # are lists of records (see Listing 2); each list is flattened into rows.
        json_dict = json.loads(file_content.decode())
        data = pandas.DataFrame()
        for records in json_dict.values():
            if len(data) == 0:
                data = pandas.json_normalize(records)
            else:
                data = pandas.concat([data, pandas.json_normalize(records)], ignore_index=True)
        return data
    else:
        raise Exception(f'Wrong file format: {file_name}')


def read_pdkf(file):
    if file[-4:] != 'pdkf':
        raise Exception('Wrong file format')
    dataframe_dict = {}
    file_zip = ZipFile(file, 'r', allowZip64=True)
    try:
        # First pass: read the data files (CSV and JSON) into dataframes.
        for file_name in file_zip.namelist():
            if file_name[-4:] != 'meta':
                file_content = file_zip.read(file_name)
                dataframe_dict[file_name[:file_name.find('.')]] = read_file(file_name, file_content)
        # Second pass: apply the column types declared in the .meta files.
        for file_name in file_zip.namelist():
            if file_name[-4:] == 'meta':
                file_content = file_zip.read(file_name)
                meta = json.loads(file_content.decode())
                for column, dtype in meta.items():
                    dataframe_dict[file_name[:-5]][column] = dataframe_dict[file_name[:-5]][column].astype(dtype)
    finally:
        file_zip.close()
    return dataframe_dict

 

The following dependencies are required to implement the handler:

· Pandas is a popular library for big data analytics; it provides the DataFrame data type.

· zipfile is a built-in library for working with ZIP archives.

 

The above code reads CSV and JSON files from a ZIP archive with the specified structure and creates the corresponding pandas DataFrames for further data processing. In addition, it processes the metadata and converts the DataFrame column types according to the types specified in the metadata. An example of using this module and some manipulations with the resulting data are presented in Listing 7.

Listing 7 – An example of using the module.

import pdkf

dataframe_dict = pdkf.read_pdkf('file.pdkf')
for key, dataframe in dataframe_dict.items():
    print(key)
    dataframe.info()

 

In this case, information about each dataframe read from the file is displayed. Afterwards, any manipulations available in the Pandas module can be performed on the data, such as analytics, forecasting based on this data, etc.
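For example, assuming the archive contains the service part from Listing 3, a couple of typical Pandas manipulations on the loaded dataframes might look as follows; the aggregations are chosen purely for illustration.

import pdkf

frames = pdkf.read_pdkf('file.pdkf')
service = frames['service']
print(service['purchase_price'].mean())   # average purchase price of the products
print(service.nlargest(3, 'remainder'))   # three products with the largest stock remainder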

The developed module is published in the Python package index PyPI [https://pypi.org/project/pdkf/] under the MIT license, which allows it to be installed with the console command "pip install pdkf". Figure 2 shows a screenshot of the developed solution in PyPI, the official software repository for Python and an analogue of CRAN for the R language.

 

Figure 2 – Placement of the pdkf format handler module in the PYPI repository

 

Conclusion

The proposed solution suggests using a combination of CSV and JSON formats for optimal data storage and transmission. For the convenience of data perception and ensuring complete typing, it is proposed to additionally use metafiles that contain information about data types for the corresponding CSV and JSON files. The modification of the CSV format includes adding column names and storing information about data types in a separate metafile. The JSON format already provides data typing during deserialization, but it is also proposed to store type information in a metafile to unify the format.

Thus, the proposed solution is an archive consisting of CSV and JSON files, as well as corresponding metafiles, which provide structured data storage with preservation of types and metadata. The format can be supplemented with additional data in CSV or JSON format with appropriate metafiles, if necessary.

References
1. Malcolm, R., Morrison, C., Grandison, T., Thorpe, S., Christie, K., Wallace, A., & Campbell, A. (2014). Increasing the accessibility to big data systems via a common services API. In 2014 IEEE International Conference on Big Data (Big Data), 883-892.
2. Wu, T. (2009). System of teaching quality analyzing and evaluating based on Data Warehouse. Computer Engineering and Design, 30(6), 1545-1547.
3. Vitagliano, G., Hameed, M., Jiang, L., Reisener, L., Wu, E., & Naumann, F. (2023). Pollock: A data loading benchmark. Proceedings of the VLDB Endowment, 16(8), 1870-1882.
4. Xiaojuan, L., & Yu, Z. (2023). A data integration tool for the integrated modeling and analysis for EAST. Fusion Engineering and Design, 195, 113933.
5. Lemzin, A. (2023). Streaming data processing. Asian Journal of Research in Computer Science, 15(1), 11-21.
6. Hughes, L. D., Tsueng, G., DiGiovanna, J., Horvath, T. D., Rasmussen, L. V., Savidge, T. C., & NIAID Systems Biology Data Dissemination Working Group. (2023). Addressing barriers in FAIR data practices for biomedical data. Scientific Data, 10(1), 98.
7. Gohil, A., Shroff, A., Garg, A., & Kumar, S. (2022). A compendious research on big data file formats. In 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), 905-913.
8. Elsukov, P. Yu. (2017). Information asymmetry and information uncertainty. ITNOU: Information Technologies in Science, Education and Management, 4(4), 69-76.
9. Bromiley, P. A., Thacker, N. A., & Bouhova-Thacker, E. (2004). Shannon entropy, Renyi entropy, and information. Statistics and Inf. Series (2004-004), 2-8.
10. Dwyer, J. L., Roy, D. P., Sauer, B., Jenkerson, C. B., Zhang, H. K., & Lymburner, L. (2018). Analysis ready data: Enabling analysis of the Landsat archive. Remote Sensing, 10(9), 1363.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The article is devoted to the urgent task of integrating and optimizing data storage in various formats, especially focusing on CSV and JSON formats. The study addresses the issues of effective presentation of large amounts of data in order to simplify the process of analytics and processing. The authors propose an innovative approach to combining data storage formats using metadata and graph structures to optimize and improve data availability. The research includes theoretical analysis, the development of a new data format, a mathematical description of the proposed methods, as well as practical implementation in the Python programming language. In the era of "big data", the relevance of the research is undeniable. The problem of data integration and efficient storage remains an important task, especially in the context of the growing volume of unstructured data and the need for their analysis. The scientific novelty of the work lies in the proposal of the original data storage format "PandasKeepFormat", which synthesizes the advantages of CSV and JSON formats with the addition of metadata and dependency graphs. This makes it much easier to work with large amounts of data and provides new opportunities for analytics. The article has a logical and consistent presentation of the material, starting with the justification of the relevance of the problem and ending with the practical implementation of the proposed solution. The presentation style is clear and accessible, the material is structured, and the content is informative and rich. The conclusions of the article emphasize the importance of the developed format for simplifying the processes of analytics and data integration. The work will be of interest to a wide range of readers, including specialists in the field of information technology, data analytics, as well as software developers. The article makes a significant contribution to the development of methods of working with big data and deserves publication. It would be useful to consider the possibility of further development of the study, for example, by conducting a comparative analysis of the performance of the new format with existing solutions based on real datasets. I recommend that you accept it for publication, taking into account the relevance of the topic, the scientific novelty and the quality of the material presented. To further develop the work presented in the article on a new data storage format, several directions can be proposed, which include conducting an experimental study to compare the performance of the proposed PandasKeepFormat format with other existing data formats such as Parquet, ORC or Avro. It is important to consider various aspects, including the speed of reading and writing data, the efficiency of data compression and memory consumption. I also suggest exploring the possibility of supplementing the format with support for a wider range of data types, including complex data structures and custom types, which may increase its applicability to various specialized areas of data analysis. One of the important areas of development is the development of plug-ins or modules for integration with popular libraries and frameworks for working with data in order to facilitate the use of the new format in existing ecosystems. Optimizing the format and algorithms of serialization and deserialization for efficient use in distributed computing systems is also a key aspect, including the possibilities of efficient partitioning and parallel processing of data. 
It is important to pay attention to data security by developing encryption and data security mechanisms embedded in the format to enhance the protection of information confidentiality during storage and transmission. Improved metadata management, including advanced metadata management and versioning approaches, will provide better compatibility and opportunities for data format evolution without losing information about previous versions. Creating tools to visualize the data structure and metadata, as well as to monitor the performance of working with data in a new format, will allow users to manage and analyze their data more effectively. Finally, organizing a platform for interacting with users and developers interested in using and developing the new format, as well as collecting and analyzing feedback, will help identify user needs and identify areas for further improvements. These steps contribute to improving the effectiveness of the proposed data storage format, expanding its applicability and popularity in the scientific and professional community.