Data Transformations Guide

Complete guide to data transformations using the shaper system in RING-5.

Overview

RING-5 transforms data before visualization through its shaper system, which rests on three ideas:

Shapers: Individual transformation functions
Pipelines: Sequential chains of shapers
Immutability: All operations return new DataFrames

What is a Shaper?

A shaper is a data transformation function that:

Takes a DataFrame as input
Returns a new DataFrame (never modifies in-place)
Performs one specific transformation
Can be chained with other shapers
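The four properties above can be illustrated with a minimal, dependency-free sketch. It is not RING-5's implementation: a list of row dicts stands in for the DataFrame, and the names keep_columns, keep_rows_where, and chain are hypothetical helpers introduced only for this example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]            # stand-in for a DataFrame (assumption)
Shaper = Callable[[Table], Table]


def keep_columns(columns: List[str]) -> Shaper:
    """Build a shaper that keeps only the named columns."""
    def shaper(table: Table) -> Table:
        # Build new rows instead of editing the input rows.
        return [{c: row[c] for c in columns} for row in table]
    return shaper


def keep_rows_where(column: str, value: object) -> Shaper:
    """Build a shaper that keeps rows whose column equals value."""
    def shaper(table: Table) -> Table:
        return [dict(row) for row in table if row[column] == value]
    return shaper


def chain(*shapers: Shaper) -> Shaper:
    """Compose shapers so they run left to right."""
    def run(table: Table) -> Table:
        for shaper in shapers:
            table = shaper(table)
        return table
    return run


data = [
    {"benchmark": "mcf", "config": "baseline", "ipc": 1.5, "cycles": 900},
    {"benchmark": "mcf", "config": "tx_lazy", "ipc": 1.8, "cycles": 750},
]
result = chain(keep_rows_where("config", "tx_lazy"),
               keep_columns(["benchmark", "ipc"]))(data)
print(result)        # [{'benchmark': 'mcf', 'ipc': 1.8}]
print(len(data[0]))  # 4: the input rows still have all their columns
```

Each shaper does one thing and returns fresh rows, which is what makes chaining safe: no step can disturb the data another step started from.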

Available Shapers

1. Column Selector

Purpose: Select specific columns from the dataset

Configuration:

{
    "type": "columnSelector",
    "columns": ["benchmark", "config", "ipc"]
}

Use Case: Remove unnecessary columns before plotting

Example:

# Before: 50 columns
# After: 3 columns (benchmark, config, ipc)
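As a rough sketch of what a column selector does with this configuration, the following applies it to a list of row dicts (a stand-in for the DataFrame). The function name select_columns and the error raised for a missing column are assumptions made for illustration, not RING-5 behavior.

```python
def select_columns(rows, config):
    """Apply a columnSelector-style config to a list of row dicts."""
    columns = config["columns"]
    if rows:
        missing = [c for c in columns if c not in rows[0]]
        if missing:
            raise KeyError(f"columns not found: {missing}")
    # Build fresh dicts so the caller's rows are left untouched.
    return [{c: row[c] for c in columns} for row in rows]


rows = [
    {"benchmark": "mcf", "config": "baseline", "ipc": 1.5, "cycles": 900, "misses": 42},
    {"benchmark": "mcf", "config": "tx_lazy", "ipc": 1.8, "cycles": 750, "misses": 40},
]
config = {"type": "columnSelector", "columns": ["benchmark", "config", "ipc"]}

selected = select_columns(rows, config)
print(list(selected[0]))  # ['benchmark', 'config', 'ipc']
print(len(rows[0]))       # 5: the original rows are unchanged
```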

2. Sort

Purpose: Sort rows by a custom order of values in one or more columns

Configuration:

{
    "type": "sort",
    "order_dict": {
        "benchmark": ["mcf", "omnetpp", "xalancbmk"],
        "config": ["baseline", "tx_lazy", "tx_eager"]
    }
}

Use Case: Control the display order in plots

Example:

# Rows follow the listed benchmark order: mcf, omnetpp, xalancbmk
# and the listed config order: baseline, tx_lazy, tx_eager
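One way to picture a sort driven by order_dict is to turn each listed value into its position and sort on those positions. The sketch below does this for a list of row dicts; sort_rows is a hypothetical helper, and placing unlisted values after the listed ones is an assumption, not documented RING-5 behavior.

```python
def sort_rows(rows, order_dict):
    """Sort rows by the value order given for each column in order_dict."""
    # Map every listed value to its position, column by column.
    ranks = {
        column: {value: position for position, value in enumerate(values)}
        for column, values in order_dict.items()
    }

    def key(row):
        # Unlisted values sort after all listed ones (assumption).
        return tuple(
            ranks[column].get(row[column], len(ranks[column]))
            for column in order_dict
        )

    return sorted(rows, key=key)


rows = [
    {"benchmark": "xalancbmk", "config": "tx_eager"},
    {"benchmark": "mcf", "config": "tx_lazy"},
    {"benchmark": "mcf", "config": "baseline"},
    {"benchmark": "omnetpp", "config": "baseline"},
]
order_dict = {
    "benchmark": ["mcf", "omnetpp", "xalancbmk"],
    "config": ["baseline", "tx_lazy", "tx_eager"],
}
for row in sort_rows(rows, order_dict):
    print(row["benchmark"], row["config"])
# mcf baseline
# mcf tx_lazy
# omnetpp baseline
# xalancbmk tx_eager
```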

3. Mean Calculator

Purpose: Compute aggregated means (arithmetic, geometric, harmonic)

Configuration:

{
    "type": "mean",
    "meanVars": ["ipc", "execution_time"],
    "meanAlgorithm": "geomean",  # or "amean", "hmean"
    "groupingColumns": ["config"],
    "replacingColumn": "benchmark"
}

Use Case: Add geomean rows for multi-benchmark comparisons

Example:

# Original data: mcf, omnetpp, xalancbmk
# Result: mcf, omnetpp, xalancbmk, geomean
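The effect of the mean shaper can be approximated with Python's statistics module. The sketch below groups rows by the grouping columns, computes the chosen mean of each variable, and appends one summary row per group whose replacing column holds the algorithm name. The helper add_mean_rows and the list-of-dicts data model are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import fmean, geometric_mean, harmonic_mean

ALGORITHMS = {"amean": fmean, "geomean": geometric_mean, "hmean": harmonic_mean}


def add_mean_rows(rows, mean_vars, algorithm, grouping_columns, replacing_column):
    """Append one summary row per group, labelled with the algorithm name."""
    mean_fn = ALGORITHMS[algorithm]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in grouping_columns)].append(row)

    summary = []
    for key, members in groups.items():
        new_row = dict(zip(grouping_columns, key))
        new_row[replacing_column] = algorithm  # e.g. benchmark = "geomean"
        for var in mean_vars:
            new_row[var] = mean_fn([m[var] for m in members])
        summary.append(new_row)
    return [dict(r) for r in rows] + summary


rows = [
    {"benchmark": "mcf", "config": "baseline", "ipc": 1.0},
    {"benchmark": "omnetpp", "config": "baseline", "ipc": 2.0},
    {"benchmark": "xalancbmk", "config": "baseline", "ipc": 4.0},
]
result = add_mean_rows(rows, ["ipc"], "geomean", ["config"], "benchmark")
print([r["benchmark"] for r in result])  # ['mcf', 'omnetpp', 'xalancbmk', 'geomean']
print(round(result[-1]["ipc"], 6))       # 2.0, the cube root of 1.0 * 2.0 * 4.0
```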

4. Normalize

Purpose: Normalize values to a baseline

Configuration:

{
    "type": "normalize",
    "normalizeVars": ["ipc", "throughput"],
    "normalizerColumn": "config",
    "normalizerValue": "baseline",
    "groupBy": ["benchmark"]
}

Use Case: Show relative performance improvements

Example:

# Baseline IPC = 1.5
# tx_lazy IPC = 1.8
# Result: tx_lazy normalized = 1.2 (20% improvement)
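The arithmetic in this example (1.8 / 1.5 = 1.2) can be reproduced with a short sketch that divides every value by the baseline row of its own group. The function normalize below is a hypothetical stand-in operating on a list of row dicts, not the RING-5 shaper itself.

```python
def normalize(rows, normalize_vars, normalizer_column, normalizer_value, group_by):
    """Divide each variable by the baseline row found in the same group."""
    baselines = {}
    for row in rows:
        if row[normalizer_column] == normalizer_value:
            baselines[tuple(row[c] for c in group_by)] = row

    result = []
    for row in rows:
        # Raises KeyError when a group has no baseline row.
        base = baselines[tuple(row[c] for c in group_by)]
        new_row = dict(row)
        for var in normalize_vars:
            new_row[var] = row[var] / base[var]
        result.append(new_row)
    return result


rows = [
    {"benchmark": "mcf", "config": "baseline", "ipc": 1.5},
    {"benchmark": "mcf", "config": "tx_lazy", "ipc": 1.8},
]
normalized = normalize(rows, ["ipc"], "config", "baseline", ["benchmark"])
for row in normalized:
    print(row["config"], round(row["ipc"], 3))
# baseline 1.0
# tx_lazy 1.2
```

Grouping by benchmark matters: each benchmark is compared against its own baseline rather than against a single global value.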

5. Filter (Condition Selector)

Purpose: Filter rows based on conditions

Configuration:

{
    "type": "conditionSelector",
    "column": "benchmark",
    "mode": "equals",  # or "not_equals", "contains", "greater_than", "less_than"
    "threshold": "mcf"
}

Use Case: Focus on specific benchmarks or configurations

Example:

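# Configuration above: keeps only rows where benchmark equals "mcf"
# With column "ipc", mode "greater_than", threshold 1.0: keeps rows with ipc > 1.0,
# which removes low-performing configurations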
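The filter modes can be thought of as a lookup from mode name to comparison function. The following sketch assumes strict comparisons for greater_than and less_than and substring matching for contains; filter_rows and the MODES table are illustrative names, not part of RING-5.

```python
import operator

# Mode names follow this guide; the comparison semantics are assumptions.
MODES = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "contains": lambda value, threshold: str(threshold) in str(value),
}


def filter_rows(rows, config):
    """Keep the rows whose column passes the configured comparison."""
    test = MODES[config["mode"]]
    return [dict(row) for row in rows
            if test(row[config["column"]], config["threshold"])]


rows = [
    {"benchmark": "mcf", "ipc": 0.8},
    {"benchmark": "omnetpp", "ipc": 1.4},
    {"benchmark": "xalancbmk", "ipc": 1.0},
]
only_mcf = filter_rows(rows, {"type": "conditionSelector", "column": "benchmark",
                              "mode": "equals", "threshold": "mcf"})
fast = filter_rows(rows, {"type": "conditionSelector", "column": "ipc",
                          "mode": "greater_than", "threshold": 1.0})
print([r["benchmark"] for r in only_mcf])  # ['mcf']
print([r["benchmark"] for r in fast])      # ['omnetpp']
```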

6. Transformer

Purpose: Convert column data types

Configuration:

{
    "type": "transformer",
    "column": "config",
    "target_type": "factor",  # or "numeric", "string"
    "order": ["baseline", "tx_lazy", "tx_eager"]
}

Use Case: Control categorical ordering in plots

Example:

# Convert config to categorical
# Set specific order for legend/axis
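Why the column type matters is easiest to see with numbers stored as text. In this small sketch, a hypothetical "threads" column sorts differently before and after conversion; to_numeric is an illustrative helper on a list of row dicts, not the RING-5 transformer.

```python
def to_numeric(rows, column):
    """Return new rows with the given column converted to float."""
    return [{**row, column: float(row[column])} for row in rows]


rows = [{"threads": "8"}, {"threads": "16"}, {"threads": "2"}]

as_text = sorted(row["threads"] for row in rows)
as_numbers = sorted(row["threads"] for row in to_numeric(rows, "threads"))

print(as_text)     # ['16', '2', '8']: text compares character by character
print(as_numbers)  # [2.0, 8.0, 16.0]
```

The same reasoning applies to categorical (factor) columns: an explicit order replaces the default alphabetical one on axes and legends.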

Building Pipelines

Example 1: Basic Filtering and Sorting

pipeline = [
    # Step 1: Select relevant columns
    {
        "type": "columnSelector",
        "columns": ["benchmark", "config", "ipc"]
    },
    # Step 2: Filter benchmarks
    {
        "type": "conditionSelector",
        "column": "benchmark",
        "mode": "contains",
        "threshold": "spec"
    },
    # Step 3: Sort data
    {
        "type": "sort",
        "order_dict": {
            "benchmark": ["mcf", "omnetpp", "xalancbmk"]
        }
    }
]

Example 2: Normalization Pipeline

pipeline = [
    # Step 1: Normalize to baseline
    {
        "type": "normalize",
        "normalizeVars": ["ipc", "execution_time"],
        "normalizerColumn": "config",
        "normalizerValue": "baseline",
        "groupBy": ["benchmark"]
    },
    # Step 2: Add geometric mean
    {
        "type": "mean",
        "meanVars": ["ipc"],
        "meanAlgorithm": "geomean",
        "groupingColumns": ["config"],
        "replacingColumn": "benchmark"
    },
    # Step 3: Sort for presentation
    {
        "type": "sort",
        "order_dict": {
            "benchmark": ["mcf", "omnetpp", "xalancbmk", "geomean"]
        }
    }
]

Example 3: Multi-Stage Aggregation

pipeline = [
    # Step 1: Filter out warmup phase
    {
        "type": "conditionSelector",
        "column": "phase",
        "mode": "not_equals",
        "threshold": "warmup"
    },
    # Step 2: Aggregate per benchmark
    {
        "type": "mean",
        "meanVars": ["ipc"],
        "meanAlgorithm": "amean",
        "groupingColumns": ["benchmark", "config"],
        "replacingColumn": "seed"
    },
    # Step 3: Compute geomean across benchmarks
    {
        "type": "mean",
        "meanVars": ["ipc"],
        "meanAlgorithm": "geomean",
        "groupingColumns": ["config"],
        "replacingColumn": "benchmark"
    }
]
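To make the idea of a pipeline concrete, here is a minimal sketch of a runner that looks up each step by its "type" and feeds the output of one step into the next. The registry, the two simplified shapers, and run_pipeline are hypothetical; RING-5's own dispatch may work differently.

```python
def column_selector(rows, step):
    return [{c: row[c] for c in step["columns"]} for row in rows]


def condition_selector(rows, step):
    # Only two modes are sketched here.
    keep_equal = {"equals": True, "not_equals": False}[step["mode"]]
    return [dict(row) for row in rows
            if (row[step["column"]] == step["threshold"]) == keep_equal]


REGISTRY = {
    "columnSelector": column_selector,
    "conditionSelector": condition_selector,
}


def run_pipeline(rows, pipeline):
    """Apply each step in order; the output of one step feeds the next."""
    for index, step in enumerate(pipeline):
        shaper = REGISTRY.get(step["type"])
        if shaper is None:
            raise ValueError(f"step {index}: unknown shaper type {step['type']!r}")
        rows = shaper(rows, step)
    return rows


rows = [
    {"benchmark": "mcf", "phase": "warmup", "ipc": 0.9},
    {"benchmark": "mcf", "phase": "measure", "ipc": 1.5},
]
pipeline = [
    {"type": "conditionSelector", "column": "phase",
     "mode": "not_equals", "threshold": "warmup"},
    {"type": "columnSelector", "columns": ["benchmark", "ipc"]},
]
print(run_pipeline(rows, pipeline))  # [{'benchmark': 'mcf', 'ipc': 1.5}]
```

Order matters in this sketch: the filter on "phase" has to run before the column selector drops that column.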

Using Pipelines in the UI

In Manage Plots

Navigate to Manage Plots
Select or create a plot
Scroll to Data Processing Pipeline
Click Add transformation
Select shaper type from dropdown
Configure shaper parameters
Click Add to Pipeline
Repeat for additional transformations
Click Update Plot to apply

Pipeline Editor Features

Reorder: Drag shapers to reorder pipeline
Edit: Click shaper to modify configuration
Delete: Remove shaper from pipeline
Preview: See transformed data before plotting

Pipeline Persistence

Export Pipeline

Configure pipeline in plot
Click Export Pipeline
Save JSON file locally

Format:

{
  "pipeline": [
    {
      "type": "columnSelector",
      "columns": ["benchmark", "config", "ipc"]
    },
    {
      "type": "normalize",
      "normalizeVars": ["ipc"],
      "normalizerColumn": "config",
      "normalizerValue": "baseline"
    }
  ]
}

Import Pipeline

Click Import Pipeline
Select JSON file
Pipeline is loaded into plot

Use Cases:

Reuse pipelines across different datasets
Share pipelines with collaborators
Maintain consistent transformations
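Because an exported pipeline is plain JSON in the format shown above, it can also be written and read with Python's json module. The sketch below round-trips a pipeline through a temporary file; export_pipeline and import_pipeline are hypothetical helpers, and the check for a "type" field is an assumption about what a valid step needs.

```python
import json
import tempfile
from pathlib import Path

pipeline = [
    {"type": "columnSelector", "columns": ["benchmark", "config", "ipc"]},
    {"type": "normalize", "normalizeVars": ["ipc"],
     "normalizerColumn": "config", "normalizerValue": "baseline"},
]


def export_pipeline(steps, path):
    """Write the steps under a top-level "pipeline" key."""
    Path(path).write_text(json.dumps({"pipeline": steps}, indent=2), encoding="utf-8")


def import_pipeline(path):
    """Read the steps back and check that each one names a shaper type."""
    document = json.loads(Path(path).read_text(encoding="utf-8"))
    steps = document["pipeline"]
    for step in steps:
        if "type" not in step:
            raise ValueError("every pipeline step needs a 'type' field")
    return steps


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pipeline.json"
    export_pipeline(pipeline, target)
    restored = import_pipeline(target)

print(restored == pipeline)  # True
```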

Advanced Techniques

Conditional Transformations

Apply different transformations based on data characteristics:

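# `data` is the loaded DataFrame; `pipeline` is the list of shaper configs.
# If the data has a seed column, average over seeds before anything else
if "seed" in data.columns:
    pipeline.insert(0, {
        "type": "mean",
        "meanVars": ["ipc"],
        "meanAlgorithm": "amean",
        "groupingColumns": ["benchmark", "config"],
        "replacingColumn": "seed"
    })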

Multi-Column Operations

Transform multiple columns simultaneously:

{
    "type": "normalize",
    "normalizeVars": ["ipc", "throughput", "latency"],
    "normalizerColumn": "config",
    "normalizerValue": "baseline",
    "groupBy": ["benchmark"]
}

Nested Grouping

Group by multiple levels:

{
    "type": "mean",
    "meanVars": ["ipc"],
    "meanAlgorithm": "amean",
    "groupingColumns": ["config", "benchmark"],
    "replacingColumn": "seed"
}
# Groups by config AND benchmark, averaging over seeds

Best Practices

DO

Start Simple: Begin with one shaper, verify, then add more
Check Data: Review transformed data in Data Managers
Use Column Selector Early: Remove unused columns first
Normalize After Filtering: Select and filter first, then normalize, so ratios are computed only on the rows you plot
Name Columns Clearly: Rename columns for better plots

DON’T

Don’t Chain Too Many: Keep pipelines under 5-6 shapers
Don’t Normalize Twice: Multiple normalizations produce incorrect results
Don’t Filter Too Aggressively: Ensure data remains after filters
Don’t Ignore Errors: Pipeline failures indicate data issues

Troubleshooting

Pipeline Fails

Symptoms: Error message, plot doesn’t update

Solutions:

Check column names: A typo in a column name makes the shaper fail
Verify data exists: Filters may exclude all data
Review shaper order: Some shapers depend on previous transformations
Check for missing values: Handle NaN/null before aggregating
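The first check in this list can be automated before a pipeline runs. The sketch below collects every column name a step refers to, using the configuration keys from this guide, and reports names that are not in the dataset. It is a simplification: it does not track columns removed or added by earlier steps, and find_unknown_columns is a hypothetical helper.

```python
# Config keys that hold column names, taken from the shaper configs in this guide.
COLUMN_LIST_KEYS = ("columns", "meanVars", "groupingColumns", "normalizeVars", "groupBy")
COLUMN_NAME_KEYS = ("column", "normalizerColumn", "replacingColumn")


def referenced_columns(step):
    """List every column name mentioned by one pipeline step."""
    names = []
    for key in COLUMN_LIST_KEYS:
        names.extend(step.get(key, []))
    for key in COLUMN_NAME_KEYS:
        if key in step:
            names.append(step[key])
    names.extend(step.get("order_dict", {}))  # keys of order_dict are columns
    return names


def find_unknown_columns(pipeline, available):
    """Return (step index, shaper type, column) for each unknown column."""
    problems = []
    for index, step in enumerate(pipeline):
        for name in referenced_columns(step):
            if name not in available:
                problems.append((index, step["type"], name))
    return problems


pipeline = [
    {"type": "columnSelector", "columns": ["benchmark", "config", "ipc"]},
    {"type": "normalize", "normalizeVars": ["ipc"], "normalizerColumn": "confg",
     "normalizerValue": "baseline", "groupBy": ["benchmark"]},
]
problems = find_unknown_columns(pipeline, {"benchmark", "config", "ipc"})
print(problems)  # [(1, 'normalize', 'confg')]
```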

Unexpected Results

Symptoms: Plot shows incorrect data

Solutions:

Preview each step: Apply shapers one at a time
Check data types: Ensure numeric columns are numeric
Verify normalization baseline: Baseline value must exist in data
Review grouping columns: Grouping affects aggregation results
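The baseline check in particular is easy to script. This sketch lists every group that has no row with the normalizer value, which is the situation where normalization cannot produce a meaningful ratio. The helper groups_missing_baseline and the list-of-dicts data are assumptions for illustration.

```python
def groups_missing_baseline(rows, normalizer_column, normalizer_value, group_by):
    """Return the group keys that contain no baseline row."""
    seen = set()
    with_baseline = set()
    for row in rows:
        key = tuple(row[c] for c in group_by)
        seen.add(key)
        if row[normalizer_column] == normalizer_value:
            with_baseline.add(key)
    return sorted(seen - with_baseline)


rows = [
    {"benchmark": "mcf", "config": "baseline", "ipc": 1.5},
    {"benchmark": "mcf", "config": "tx_lazy", "ipc": 1.8},
    {"benchmark": "omnetpp", "config": "tx_lazy", "ipc": 1.1},
]
missing = groups_missing_baseline(rows, "config", "baseline", ["benchmark"])
print(missing)  # [('omnetpp',)]
```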

Performance Issues

Symptoms: Slow pipeline execution

Solutions:

Reduce data size: Filter early in pipeline
Simplify aggregations: Avoid complex nested grouping
Use column selector: Remove unused columns immediately
Cache results: Store intermediate transformations

Common Patterns

Pattern 1: Speedup Calculation

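pipeline = [
    # Step 1: Divide each execution time by the baseline of the same benchmark
    {"type": "normalize", "normalizeVars": ["execution_time"],
     "normalizerColumn": "config", "normalizerValue": "baseline",
     "groupBy": ["benchmark"]},
    # Step 2: Summarize across benchmarks with a geometric mean
    {"type": "mean", "meanVars": ["execution_time"],
     "meanAlgorithm": "geomean", "groupingColumns": ["config"],
     "replacingColumn": "benchmark"}
]
# Result: Execution time relative to baseline (below 1.0 is faster);
# speedup is the reciprocal, baseline_time / time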
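A small worked example, with made-up timings, shows how normalized execution time and speedup relate. Dividing a configuration's time by the baseline time gives a ratio below 1.0 when the configuration is faster; speedup is the inverse ratio, and the geometric mean of the speedups equals the reciprocal of the geometric mean of the normalized times.

```python
from statistics import geometric_mean

# Illustrative timings in seconds (not measured data).
baseline_time = {"mcf": 10.0, "omnetpp": 10.0}
tx_lazy_time = {"mcf": 5.0, "omnetpp": 8.0}

normalized = [tx_lazy_time[b] / baseline_time[b] for b in baseline_time]  # [0.5, 0.8]
speedups = [baseline_time[b] / tx_lazy_time[b] for b in baseline_time]    # [2.0, 1.25]

g_normalized = geometric_mean(normalized)
g_speedup = geometric_mean(speedups)

print(round(g_normalized, 3))              # 0.632
print(round(g_speedup, 3))                 # 1.581
print(round(g_normalized * g_speedup, 6))  # 1.0
```

This reciprocal property is one reason the geometric mean is the usual choice for summarizing normalized results: it gives the same conclusion whether times or speedups are averaged.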

Pattern 2: Top-K Selection

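pipeline = [
    # Assumes the sort shaper accepts "descending" for a numeric column
    {"type": "sort", "order_dict": {"ipc": "descending"}},
    # Assumes the data already has a 0-based numeric "rank" column
    # (0 = highest IPC); with 1-based ranks, use threshold 11
    {"type": "conditionSelector", "column": "rank",
     "mode": "less_than", "threshold": 10}
]
# Result: Top 10 configurations by IPC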
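If the data has no rank column yet, the same selection can be sketched directly in Python: sort by the metric in descending order, keep the first k rows, and attach a rank. The helper top_k and the 1-based "rank" it adds are illustrative choices on a list of row dicts, not RING-5 behavior.

```python
def top_k(rows, column, k):
    """Return the k rows with the largest value in column, ranked from 1."""
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return [{**row, "rank": position}
            for position, row in enumerate(ranked[:k], start=1)]


rows = [
    {"config": "a", "ipc": 1.2},
    {"config": "b", "ipc": 1.9},
    {"config": "c", "ipc": 1.5},
    {"config": "d", "ipc": 0.7},
]
for row in top_k(rows, "ipc", 2):
    print(row["rank"], row["config"], row["ipc"])
# 1 b 1.9
# 2 c 1.5
```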

Pattern 3: Outlier Removal

pipeline = [
    {"type": "conditionSelector", "column": "ipc",
     "mode": "greater_than", "threshold": 0.5},
    {"type": "conditionSelector", "column": "ipc",
     "mode": "less_than", "threshold": 10.0}
]
# Result: Keeps only rows with 0.5 < ipc < 10.0 (both bounds excluded)

Integration with Other Features

With Data Managers

Load data via Data Source
Apply Data Managers (Outlier Remover, Seeds Reducer)
Use shapers for plot-specific transformations

When to use which:

Data Managers: Global transformations for all plots
Shapers: Plot-specific transformations

With Portfolios

Pipelines are saved with plots in portfolios:

Load portfolio → Pipelines restored
Pipelines are reusable across sessions

Next Steps

Creating Plots: Learn about Plot Creation
Advanced Plotting: Explore plot types in plots/
API Reference: See Shaper API
Custom Shapers: Build custom transformations (advanced)

Need Help? Check Troubleshooting or open a GitHub issue.