9.6. Data Analysis with AI

Data analysis workflows typically involve multiple tools: a spreadsheet application for viewing CSVs, a code editor for writing scripts, a terminal for running them, and another editor for writing up findings. Backend.AI GO's Cowork menu brings all of this into one autonomous workflow: describe your analysis goal, and the agent reads your data, runs the calculations, and produces a report—entirely on your local machine.

This guide walks through setting up Cowork for data analysis and provides concrete examples using CSV files, JSON data, and Python-based processing.

Why Use Cowork for Data Analysis?

| Traditional workflow | Cowork workflow |
| --- | --- |
| Open CSV in a spreadsheet, write formulas manually | Agent parses the CSV, performs calculations autonomously |
| Write and run Python scripts in separate tools | Agent writes and executes Python in a sandboxed environment |
| Copy results into a report by hand | Agent writes the final report directly to your file system |
| Repeat for every new dataset | Reuse global instructions for consistent formatting |

A key advantage is privacy: your data never leaves your machine. Analysis runs entirely through the local model and local tool execution—no cloud uploads required.

Prerequisites

Before you begin, make sure you have:

  • Backend.AI GO installed and running
  • At least one model loaded — a capable model (7B+ parameters recommended) for best analysis quality
  • Data files on your local file system (CSV, JSON, TXT, or other text formats)

Built-in Data Analysis Tools

The Cowork menu provides several built-in tools for data analysis workflows. No plugins or extensions are required.

| Tool | Description | Typical Use |
| --- | --- | --- |
| csv_reader | Parse CSV files with configurable delimiters, column selection, and row limits | Load sales data, filter columns, preview structure |
| json_query | Query structured JSON data using JSONPath expressions | Extract fields from API exports, filter nested objects |
| run_python | Execute Python scripts in a sandboxed environment | pandas analysis, statistics, data processing |
| read_file | Read any text file (CSV, JSON, TXT, logs) | Load raw data files |
| write_file | Save results, reports, and processed data | Export analysis output |
| search_content | Search across files using regex patterns | Find specific entries in log files |
| calculator | Evaluate mathematical expressions | Quick spot calculations |

Setting Up for Data Analysis

Step 1: Grant Folder Access

The agent needs permission to read your data files and (optionally) write results.

  1. Click the Folder Permissions toggle in the task input area at the bottom of the Cowork page.

  2. Click Add Folder and select the folder containing your data files.

  3. Choose a permission level:

Permission Levels

  • Read Only: The agent can read files but cannot create or modify them. Use this when you want the agent to explore and analyze without making changes.
  • Read & Write: The agent can read existing files and create or modify files. Use this when you want the agent to save analysis results, generated reports, or processed datasets.
  • Full Access: The agent can also delete and move files. Use with caution.

A common pattern is to add your raw data folder as Read Only and a separate results/ folder as Read & Write.

Step 2: Configure Global Instructions (Optional)

Global Instructions let you set persistent preferences that apply to all Cowork tasks.

  1. Open the Settings drawer (gear icon in the header).

  2. Go to the Instructions tab.

  3. Enter analysis preferences such as:

    Use Python with pandas for data manipulation.
    Present all numbers with 2 decimal places.
    Always include a statistical summary (count, mean, min, max, std) for numeric columns.
    Save output files to the 'results' subfolder.
    Use Markdown tables for tabular output.
    
  4. Enable the instructions and close the drawer.

These instructions will apply to all future Cowork tasks until you change them.

Step-by-Step Example: Analyzing Sales Data

This example demonstrates a complete data analysis workflow using a CSV file.

Scenario: You have a file sales_2024.csv with columns date, product_category, region, units_sold, and revenue. You want to identify top-performing categories and calculate month-over-month growth.

Step 1: Open Cowork

Click the Cowork icon in the sidebar to open the Cowork interface.

Step 2: Set Up Folder Permissions

  1. Click the Folder Permissions toggle.

  2. Add the folder containing sales_2024.csv with Read Only permission.

  3. Add (or create) a results/ folder with Read & Write permission for the output.

Step 3: Enter the Analysis Task

In the task input at the bottom of the page, describe what you want:

Analyze the file sales_2024.csv in my data folder. Give me a summary of total revenue by product category, identify the top 5 performing months, and calculate month-over-month growth rates. Save the analysis as sales_analysis.md in the results folder.

Press Enter (or click Start Task) to begin.

Step 4: Watch the Agent Work

The agent breaks the task into steps and executes them autonomously. You can watch the progress in the step viewer:

  1. Inspect structure — The agent uses csv_reader to examine the file's columns, data types, and a few sample rows to understand the schema.

  2. Load data — It uses read_file to load the full dataset.

  3. Run analysis — The agent writes and executes a Python script with run_python:

    • Parses dates and groups data by month and category
    • Calculates total revenue per category
    • Ranks months by revenue
    • Computes month-over-month growth rates
  4. Write report — The agent formats the results as a Markdown document and uses write_file to save sales_analysis.md.

  5. Present summary — A final summary with key findings is shown in the conversation.
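The script the agent generates in the "Run analysis" step typically resembles the following pandas sketch. The rows below are hypothetical stand-ins for sales_2024.csv; in a real run the agent loads the content with read_file and passes it into the script as a string:

```python
import io

import pandas as pd

# Hypothetical stand-in for sales_2024.csv (normally supplied via read_file)
csv_text = """date,product_category,region,units_sold,revenue
2024-01-15,Electronics,EU,10,1000
2024-02-10,Electronics,EU,12,1500
2024-02-20,Software,US,5,500
2024-03-05,Software,US,8,900
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
df["month"] = df["date"].dt.to_period("M")

# Total revenue per category, highest first
by_category = df.groupby("product_category")["revenue"].sum().sort_values(ascending=False)

# Monthly totals, top months, and month-over-month growth in percent
monthly = df.groupby("month")["revenue"].sum().sort_index()
top_months = monthly.sort_values(ascending=False).head(5)
mom_growth = monthly.pct_change().mul(100).round(2)

print(by_category)
print(top_months)
result = {"by_category": by_category.to_dict(), "mom_growth": mom_growth.to_dict()}
```

The agent then renders these tables as Markdown in the saved report.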

Tool Approval

The first time the agent uses run_python or write_file, you may be prompted to approve the tool. You can approve once or for the entire session. Read operations within permitted folders are auto-approved by default.

Step 5: Steer the Agent Mid-Task

If you want to adjust the analysis while the agent is running, use the Steering input:

  • "Also break down revenue by region within each category"
  • "Add a visualization of month-over-month growth as an ASCII chart"
  • "Focus only on the Electronics and Software categories"

The agent incorporates your guidance without restarting from scratch.

Step 6: Follow Up

After the initial analysis completes, you can continue in the same session:

Create a bar chart of revenue by category using matplotlib and save it as revenue_by_category.png in the results folder.

The agent retains the context from the previous analysis and continues from where it left off.
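For this follow-up, the generated script usually looks something like the sketch below. The category totals are hypothetical, and because the sandbox blocks direct file system access, the chart is rendered to an in-memory buffer that the agent then saves with the write_file tool:

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering; no display in the sandbox
import matplotlib.pyplot as plt

# Hypothetical category totals from the earlier analysis
categories = ["Electronics", "Software", "Hardware"]
revenue = [125000.00, 98000.00, 64000.00]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, revenue)
ax.set_ylabel("Revenue (USD)")
ax.set_title("Revenue by Category")
fig.tight_layout()

# Render to memory; the agent saves these bytes as revenue_by_category.png
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
print(f"rendered {len(png_bytes)} bytes")
```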

Example: JSON Data Analysis

For structured JSON data—such as API exports or configuration files—use json_query to extract specific fields before processing.

Scenario: You have an api_export.json file with a nested structure and want to extract and analyze specific metrics.

  1. Add the folder containing api_export.json with Read Only permission.

  2. Enter a task:

    Load api_export.json. Use json_query with JSONPath $.data[*].metrics.revenue to extract all revenue values. Calculate the total, average, and top 10 entries by revenue. Save the results as revenue_report.md.

The agent will:

  1. Use json_query to extract the revenue field from each record in the array
  2. Use run_python to calculate statistics on the extracted values
  3. Use write_file to save the formatted report
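Steps 1 and 2 roughly correspond to the plain-Python sketch below; the JSON here is a hypothetical stand-in for api_export.json:

```python
import json
import statistics

# Hypothetical stand-in for api_export.json (normally supplied via read_file)
json_text = """{
  "data": [
    {"id": 1, "metrics": {"revenue": 1200.0}},
    {"id": 2, "metrics": {"revenue": 800.0}},
    {"id": 3, "metrics": {"revenue": 1500.0}}
  ]
}"""

doc = json.loads(json_text)

# Equivalent of JSONPath $.data[*].metrics.revenue
revenues = [entry["metrics"]["revenue"] for entry in doc["data"]]

total = sum(revenues)
average = statistics.mean(revenues)
top = sorted(revenues, reverse=True)[:10]

print(f"total={total:.2f} average={average:.2f}")
result = {"total": total, "average": average, "top": top}
```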

JSONPath Syntax

json_query uses standard JSONPath expressions:

  • $.field — top-level field
  • $.array[*].field — field from every element in an array
  • $.array[?(@.status == "active")] — filter by condition
  • $..field — recursive search for a field at any depth
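As a rough guide, the array and filter expressions map onto these plain-Python equivalents over a hypothetical document:

```python
import json

# Hypothetical document to illustrate the expressions above
doc = json.loads('{"items": [{"name": "a", "status": "active"}, '
                 '{"name": "b", "status": "idle"}]}')

# $.items[*].name : the field from every element of the array
names = [item["name"] for item in doc["items"]]

# $.items[?(@.status == "active")] : elements matching a condition
active = [item for item in doc["items"] if item["status"] == "active"]

print(names, [item["name"] for item in active])
```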

Python Execution Details

The run_python tool runs Python scripts in a sandboxed environment. Understanding its constraints helps you write effective analysis tasks.

Available Libraries

The following libraries are available by default:

  • Standard library: math, statistics, json, csv, collections, itertools, functools, datetime, re, io, string, decimal, fractions, random
  • Data analysis: pandas, numpy
  • Visualization: matplotlib

Sandboxing Restrictions

For security, a Python import hook blocks dangerous modules at the top level. The following modules are blocked:

  • os, subprocess, shutil — command execution and file system manipulation
  • socket, http, urllib, ftplib, smtplib, telnetlib — network access
  • pickle, shelve, marshal — unsafe deserialization
  • ctypes — native code execution
  • multiprocessing, signal, resource — process and system management
  • importlib, pkgutil, zipimport — import system manipulation
  • tempfile, glob, pathlib — file system access
  • code, codeop, compileall — dynamic code compilation
  • xmlrpc — remote procedure calls

Modules that are safe for computation (math, json, csv, re, datetime, random, collections, itertools, typing, etc.) are allowed. The sys, io, and threading modules are also available because they are required internally by many standard library modules.

Note that this is an application-level defense via import hooks, not kernel-level sandboxing. It is a strong default barrier against accidental or agent-generated dangerous code, but it should not be treated as a hardened security boundary.

If your script needs file I/O, ask the agent to use the read_file and write_file tools to load and save data, passing the content into the Python script as a variable.
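For example, a CSV loaded with read_file can be injected into the script as a string and parsed entirely in memory; a minimal sketch with hypothetical content:

```python
import csv
import io

# The agent injects the file content (fetched via read_file) as a string
file_content = "name,score\nalice,90\nbob,85\n"

# Parse in memory with io.StringIO; no direct file system access needed
rows = list(csv.DictReader(io.StringIO(file_content)))
avg = sum(int(r["score"]) for r in rows) / len(rows)
print(f"average score: {avg:.2f}")
result = avg
```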

Timeout

Scripts have a configurable timeout (default: 30 seconds, maximum: 300 seconds). For large datasets, ask the agent to process data in chunks or request a higher timeout in your task description:

Run the analysis with a 120-second timeout—the dataset is large.

Capturing Results

Results are captured via:

  • print() output — anything printed to stdout is returned
  • result variable — assign your final value to a variable named result and it will be included in the output
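A minimal script using both mechanisms:

```python
import statistics

values = [3, 1, 4, 1, 5]

# 1. stdout capture: anything printed is returned to the agent
print("median:", statistics.median(values))

# 2. result variable: the value assigned to `result` is included in the output
result = {"count": len(values), "mean": statistics.mean(values)}
```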

Tips for Data Analysis

  • Start with a structural overview. Ask the agent to use csv_reader first to show column names, data types, and a few sample rows before diving into full analysis. This helps catch encoding issues or unexpected formats early.

  • Pre-filter large JSON files. For large JSON datasets, use json_query to extract only the fields you need before passing data to run_python. This reduces memory usage and speeds up analysis.

  • Use folder-specific instructions. Attach instructions to your data folder describing the schema: column meanings, units, known data quality issues. The agent will apply this context automatically.

  • Chain tasks iteratively. Start with exploration, then analysis, then reporting. Each step builds on the last without losing context:

    1. "Describe the structure and content of sales_2024.csv"
    2. "Now analyze revenue trends by category"
    3. "Generate a formatted report from the analysis"
  • Use Read Only for raw data. Add your original data files with Read Only permission to prevent accidental modification. Use a separate folder with Read & Write for outputs.

  • Include column context in your prompt. If the agent doesn't know what your columns mean, tell it: "The rev column represents monthly revenue in USD. The cat_id column maps to product categories defined in categories.json."

Troubleshooting

| Problem | Solution |
| --- | --- |
| csv_reader reports encoding errors | Specify the encoding in your task: "The CSV uses Latin-1 encoding". Common encodings: utf-8, latin-1, cp1252. |
| run_python times out | Break the analysis into smaller steps, or ask for a higher timeout. For very large files, ask the agent to sample the data first. |
| json_query returns no results | Check your JSONPath expression. Ask the agent to first run json_query with $ (root) to show the top-level structure, then refine the path. |
| Agent modifies the wrong files | Use Read Only permission for source data. Only grant Read & Write to your output folder. |
| Analysis results are inconsistent | Add explicit formatting instructions in Global Instructions (e.g., "Always round to 2 decimal places", "Use ISO 8601 date format"). |
| run_python fails with import error | The required library may not be available. Ask the agent to use standard library alternatives, or use csv_reader and json_query directly instead of Python imports. |