INSIGHTS, DATA VISUALIZATION

When ML Meets Data Visualization

January 25, 2022

Data visualization is a compelling way to uncover hidden insights in data and present them persuasively; every business benefits from making data easier to understand. Visuals help people grasp large volumes of data at a glance, and they make insights more transparent, which supports faster decision-making.

Various tools are available for creating visualizations from datasets. However, many of them have a steep learning curve because specifications must be written manually in code. Requiring the user to select among data columns and decide which statistical transformation to apply can be daunting, and typical users with limited time, statistics, and visualization skills may struggle with complex datasets.

Not all tools require writing code by hand: many recommender systems use predetermined rules to automatically generate visualizations for the user to browse and select. While effective for specific use cases, these rule-based approaches are costly to create and have limited scalability. In contrast, machine learning (ML)-based systems learn the relationship between data and visualizations directly, either by training models on analyst interactions or by implicitly learning the rules from examples.

Two ML models that transform data into visualizations are described in detail here. One is a feedforward neural network model (VizML), and the other is a neural translation model (Data2Vis).

VizML  

The VizML model is a fully connected feedforward neural network (NN) with three hidden layers of 1,000 neurons each, using ReLU activation functions. It is implemented in PyTorch and provides visualization recommendations after learning visualization choices from a large corpus of data. First, five key design choices that analysts make while creating visualizations are identified. Then, using one million dataset-visualization pairs collected from a popular online visualization platform, models are trained to predict these design choices. The neural network predicts these design choices with higher accuracy than random guessing or simple classifiers.
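The architecture described above can be sketched roughly in PyTorch as follows. This is a minimal illustration only: the input dimensionality (here the 841 dataset-level features) and the number of output classes depend on the prediction task, so treat these values as assumptions.

```python
import torch.nn as nn

# Minimal sketch of a VizML-style feedforward network:
# three hidden layers of 1,000 ReLU units each.
# input_dim=841 (dataset-level features) and num_classes are
# task-dependent assumptions, not fixed properties of the model.
class VizMLNet(nn.Module):
    def __init__(self, input_dim=841, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, num_classes),  # e.g. scatter vs. line vs. bar
        )

    def forward(self, x):
        return self.net(x)
```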

Five steps describe how the VizML model is trained:

Step – 1: Data Source

A corpus of 2.3 million dataset-visualization pairs is collected from Plotly, an online visualization platform.

Step – 2: Raw Corpus

Once collected, the corpus is cleaned, and the data is described in preparation for feature extraction.

Step – 3: (a) Feature Extraction

After the data is described, features are extracted: each dataset is mapped to 841 features, each single column to 81 features, and each pair of columns to 30 features, using 16 aggregation functions. Each column is described across four categories: Dimensions, Type, Values, and Names. Each pair of columns is described with 30 pairwise-column features that fall into two categories: Values and Names. The 841 dataset-level features are obtained by aggregating these single-column and pairwise-column features with the 16 aggregation functions, which convert them into scalar values. For example, given a dataset, the number of columns can be counted, the percentage of categorical columns can be computed, and the mean correlation between all pairs of quantitative columns can be calculated. Feature extraction is illustrated using the example of an Automobile MPG dataset.

Source: K. Hu, M. A. Bakker, S. Li, T. Kraska, and C. Hidalgo, “VizML: A machine learning approach to visualization recommendation,” in Proceedings of the ACM Conference on Human Factors in Computing Systems, ser. CHI ’19, 2019.
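As a rough illustration of how column-level properties are aggregated into dataset-level features, the sketch below computes the three example features mentioned above with pandas. The MPG-style dataframe and its column names are assumptions for demonstration, not the actual corpus.

```python
import pandas as pd

# Hypothetical Automobile-MPG-style dataset for illustration.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 31.0],
    "horsepower": [130.0, 165.0, 96.0, 65.0],
    "origin": ["USA", "USA", "Europe", "Japan"],
})

# Dataset-level features obtained by aggregating column-level properties.
num_columns = df.shape[1]
pct_categorical = (df.dtypes == object).mean()

# Mean absolute correlation between all pairs of quantitative columns.
quantitative = df.select_dtypes("number")
corr = quantitative.corr().abs()
pairs = [corr.iloc[i, j]
         for i in range(len(corr)) for j in range(i + 1, len(corr))]
mean_correlation = sum(pairs) / len(pairs) if pairs else 0.0

print(num_columns, pct_categorical, mean_correlation)
```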

(b) Design Choices

An analyst’s design choices are extracted by parsing the traces that associate collections of data with visual elements on the online visualization platform. These visualizations are specified using encodings that map from data to the retinal properties (e.g., position, length, or color) of graphical marks (e.g., points, lines, or rectangles). Examples of encoding-level design choices include the mark type (such as scatter, line, or bar), which column is represented on which axis, and whether an axis is shared. By aggregating these encoding-level design choices, the visualization-level design choices of a chart are characterized.

Source: K. Hu, M. A. Bakker, S. Li, T. Kraska, and C. Hidalgo, “VizML: A machine learning approach to visualization recommendation,” in Proceedings of the ACM Conference on Human Factors in Computing Systems, ser. CHI ’19, 2019.
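For intuition, the sketch below parses a simplified, Plotly-like trace dictionary to recover two encoding-level choices: the mark type and the columns mapped to each axis. The trace structure shown is a hypothetical simplification, not the platform's exact schema.

```python
# Hypothetical, simplified trace: real Plotly traces carry many more fields.
trace = {
    "type": "scatter",
    "xsrc": "horsepower",   # column encoded on the x-axis
    "ysrc": "mpg",          # column encoded on the y-axis
}

def extract_design_choices(trace):
    """Recover encoding-level design choices from one trace."""
    return {
        "mark_type": trace.get("type"),   # e.g. scatter, line, bar
        "x_column": trace.get("xsrc"),
        "y_column": trace.get("ysrc"),
    }

print(extract_design_choices(trace))
```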

Step – 4: Models

After deduplicating the datasets, models are trained to predict the design choices. Two visualization-level prediction tasks use dataset-level features to predict visualization-level design choices.

Source: K. Hu, M. A. Bakker, S. Li, T. Kraska, and C. Hidalgo, “VizML: A machine learning approach to visualization recommendation,” in Proceedings of the ACM Conference on Human Factors in Computing Systems, ser. CHI ’19, 2019.

The three encoding-level prediction tasks use features about individual columns to predict how they are visually encoded. 

Source: K. Hu, M. A. Bakker, S. Li, T. Kraska, and C. Hidalgo, “VizML: A machine learning approach to visualization recommendation,” in Proceedings of the ACM Conference on Human Factors in Computing Systems, ser. CHI ’19, 2019.
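As a rough sketch of one such prediction task, the snippet below trains an MLP like the one described earlier with cross-entropy loss, mapping dataset-level feature vectors to a 3-class visualization type (scatter vs. line vs. bar). The random placeholder batch and the choice of optimizer are illustrative assumptions, not the paper's exact training setup.

```python
import torch
import torch.nn as nn

# Placeholder batch: 32 datasets, 841 dataset-level features each,
# labeled with one of 3 visualization types (scatter / line / bar).
features = torch.randn(32, 841)
labels = torch.randint(0, 3, (32,))

model = nn.Sequential(
    nn.Linear(841, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 3),
)
optimizer = torch.optim.Adam(model.parameters())  # optimizer choice is illustrative
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step.
optimizer.zero_grad()
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
print(float(loss))
```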

Step – 5: Recommended choices

For the Visualization Type and Mark Type tasks, the 2-class task predicts line vs. bar, and the 3-class task predicts scatter vs. line vs. bar.

A diagram of the data processing and analysis flow in VizML is given below.

Source: K. Hu, M. A. Bakker, S. Li, T. Kraska, and C. Hidalgo, “VizML: A machine learning approach to visualization recommendation,” in Proceedings of the ACM Conference on Human Factors in Computing Systems, ser. CHI ’19, 2019. 

To develop ML-based recommenders for their own systems, developers could begin by identifying user design choices and extracting simple features from the data. These features and design choices can be used to train models, or developers can use pre-trained models such as VizML. With models in hand, customized measures of visualization effectiveness can be refined further by collecting usage analytics.

Data2Vis 

Data2Vis is a trainable sequence-to-sequence neural translation model that automatically generates visualizations from a given dataset. It produces visualizations in a fraction of the time needed to create them manually and has the potential to scale.

Prior work relied on rule-based approaches, but Data2Vis departs from them in both technical strategy and conceptual formulation. Data2Vis implements a deep learning model that creates visualization specifications using rules learned from examples rather than a predefined set of hand-written rules.

Visualization generation is framed as a language translation problem in which data specifications are mapped to visualization specifications written in the declarative language Vega-Lite. A multilayered attention-based encoder-decoder network with long short-term memory (LSTM) units is trained on a corpus of visualization specifications. The model uses a 2-layer bidirectional RNN encoder and a 2-layer RNN decoder, each with 512 LSTM units.
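To make the translation framing concrete, a source/target pair might look like the following. The field names and the exact Vega-Lite specification are illustrative, not taken from the Data2Vis training corpus.

```python
# Source sequence: a single data row, serialized as JSON text.
source = '{"horsepower": 130.0, "mpg": 18.0, "origin": "USA"}'

# Target sequence: a Vega-Lite specification the model should emit
# for that row's dataset (illustrative example).
target = (
    '{"mark": "point", '
    '"encoding": {'
    '"x": {"field": "horsepower", "type": "quantitative"}, '
    '"y": {"field": "mpg", "type": "quantitative"}}}'
)
```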

The model needs to meet specific learning objectives: first, it must select a subset of fields to focus on while creating visualizations; next, it must understand each field's data type; and finally, it must apply transformations to the fields according to their type. To ease learning, some of these transformations are performed explicitly. String and numeric field names in the dataset are replaced with the short notations "str" and "num", and a matching backward transformation is applied to the target sequence to keep field names consistent. These transformations reduce the vocabulary size, which in turn reduces training time and the number of hidden layers the model needs to converge.
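A minimal sketch of this forward transformation is shown below; the exact naming scheme (e.g., whether an index is appended to "str"/"num") is an assumption made for illustration. The returned mapping would be used for the backward transformation, restoring the original field names in the generated specification.

```python
import json

def abstract_field_names(row):
    """Replace field names with short 'str'/'num' tokens and return the
    transformed row plus a mapping for the backward transformation."""
    new_row, mapping = {}, {}
    str_i = num_i = 0
    for name, value in row.items():
        if isinstance(value, str):
            token = f"str{str_i}"; str_i += 1
        else:
            token = f"num{num_i}"; num_i += 1
        new_row[token] = value
        mapping[token] = name   # used to restore names after decoding
    return new_row, mapping

row = {"horsepower": 130.0, "origin": "USA"}
abstract_row, mapping = abstract_field_names(row)
print(json.dumps(abstract_row), mapping)
```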

The model gave good results after being trained on Vega-Lite visualizations and was then trained on real-world datasets. To generate the training data, a pair consisting of a single data row and the corresponding target specification was iteratively generated from each example file. Each example was sampled 50 times, i.e., 50 different data rows paired with the same Vega-Lite specification, resulting in a total of 215,000 pairs used to train the model.
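The pair-generation step can be sketched as follows; the sample count per example mirrors the description above, while the file layout, dictionary keys, and helper names are hypothetical.

```python
import json
import random

def generate_pairs(example_files, samples_per_example=50):
    """Yield (source, target) pairs: one data row paired with the example's
    Vega-Lite specification, sampled 50 times per example file."""
    for path in example_files:
        with open(path) as f:
            example = json.load(f)
        spec = json.dumps(example["spec"])   # target sequence (hypothetical key)
        rows = example["data"]               # list of data rows (hypothetical key)
        for row in random.choices(rows, k=samples_per_example):
            yield json.dumps(row), spec      # source, target
```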

Character vocabularies for the source and target sequences (84 and 45 symbols, respectively) were generated. A dropout rate of 0.5 was applied at the input of each cell, and a maximum source and target sequence length of 500 was used. The entire model was then trained end-to-end with the Adam optimizer (a stochastic gradient descent variant) at a fixed learning rate of 0.0001, minimizing the negative log-likelihood of the target characters. The model was trained for a total of 20,000 steps with a batch size of 32. A translation performance log perplexity of 0.032 was achieved, which suggests the model excels at predicting visualization specifications similar to those in the test set. Thus, Data2Vis generates visualizations comparable to those created manually.
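Putting the architecture and hyperparameters together, a stripped-down PyTorch sketch might look like the following. Attention is omitted for brevity, and the placeholder batches, sequence lengths, and the way the bidirectional encoder state is folded into the decoder state are assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, HIDDEN = 84, 45, 512  # vocabulary sizes and unit count from the text

class Seq2Seq(nn.Module):
    """Character-level encoder-decoder sketch: 2-layer bidirectional LSTM
    encoder and 2-layer LSTM decoder with 512 units (attention omitted)."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, HIDDEN)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, HIDDEN)
        self.encoder = nn.LSTM(HIDDEN, HIDDEN, num_layers=2, bidirectional=True,
                               dropout=0.5, batch_first=True)
        self.decoder = nn.LSTM(HIDDEN, HIDDEN, num_layers=2,
                               dropout=0.5, batch_first=True)
        self.bridge_h = nn.Linear(2 * HIDDEN, HIDDEN)  # fold encoder directions
        self.bridge_c = nn.Linear(2 * HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)

    def forward(self, src, tgt):
        _, (h, c) = self.encoder(self.src_emb(src))
        # h, c: (num_layers * 2, batch, HIDDEN); combine the two directions per layer.
        h = self.bridge_h(torch.cat([h[0::2], h[1::2]], dim=-1))
        c = self.bridge_c(torch.cat([c[0::2], c[1::2]], dim=-1))
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h.contiguous(), c.contiguous()))
        return self.out(dec_out)

model = Seq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood of target characters

# One illustrative step on placeholder batches (batch size 32 as in the text).
src = torch.randint(0, SRC_VOCAB, (32, 100))
tgt = torch.randint(0, TGT_VOCAB, (32, 120))
optimizer.zero_grad()
logits = model(src, tgt[:, :-1])
loss = loss_fn(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```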

An image is provided below to show how Data2Vis works. A user can load a random dataset from the Rdataset collection or paste a dataset (in JSON format) and select “Generate”; Data2Vis then generates Vega-Lite specifications from the dataset using beam search.

Source: V. Dibia and C. Demiralp, “Data2vis: Automatic generation of data visualizations using sequence-to-sequence recurrent neural networks,” IEEE Computer Graphics and Applications, vol. 39, no. 5, pp. 33–46, 2019. 

Data2Vis can learn patterns in data visualizations that generalize to a variety of real-world datasets. Analysts can initialize their visualization tasks with Data2Vis and iteratively correct its output while generating intermediate visualizations.

Conclusion

ML gives systems the ability to learn and improve from experience automatically, without explicit programming. Data visualization is a compelling way to uncover hidden insights and frame them as a story, but it remains a challenging road that demands significant human effort and expertise. Enabling users with little or no programming experience to rapidly create expressive data visualizations empowers them and brings data visualization into their workflow. With the help of ML algorithms, visual presentations can become both effective and effortless. As technology continues to advance, designing interactions with ML-based recommenders is a crucial direction for future work.

About Viz-CoE:

With an intention to lead the way in exploring and adopting new technologies, tools, techniques, or practices in the area of visualization, RoundSqr has kick-started a new initiative called “Viz-CoE”, Visualization Center of Excellence. Contributions from our subject matter experts, clients, partners, industry leaders, colleagues, and interns through perspectives, training courses, success stories, and opinions bring this initiative to life.