# Skill: AI data analyst
## Purpose
Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Generate publication-quality charts, statistical reports, and actionable insights from data files or databases.
## When to use this skill
- You need to **analyze datasets** to understand patterns, trends, or relationships.
- You want to perform **statistical tests** or build predictive models.
- You need **data visualizations** (charts, graphs, dashboards) to communicate findings.
- You're doing **exploratory data analysis** (EDA) to understand data structure and quality.
- You need to **clean, transform, or merge** datasets for analysis.
- You want **reproducible analysis** with documented methodology and code.
## Key capabilities
Compared with point-solution data analysis tools, this skill offers:
- **Full Python ecosystem**: Access to pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn, plotly, and more.
- **Runs locally**: Your data stays on your machine; no uploads to third-party services.
- **Reproducible**: All analysis is code-based and version controllable.
- **Customizable**: Extend with any Python library or custom analysis logic.
- **Publication-quality output**: Generate professional charts and reports.
- **Statistical rigor**: Access to comprehensive statistical and ML libraries.
## Inputs
- **Data sources**: CSV files, Excel files, JSON, Parquet, or database connections.
- **Analysis goals**: Questions to answer or hypotheses to test.
- **Variables of interest**: Specific columns, metrics, or dimensions to focus on.
- **Output preferences**: Chart types, report format, statistical tests needed.
- **Context**: Business domain, data dictionary, or known data quality issues.
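All of these source formats load into a common `pandas.DataFrame`. A minimal sketch, using a tiny in-memory dataset written to CSV so the round-trip is self-contained (the file name `sample.csv` and the columns are illustrative):

```python
import pandas as pd

# Tiny in-memory dataset standing in for a real source file.
df = pd.DataFrame({"region": ["north", "south"], "revenue": [1200.0, 950.0]})
df.to_csv("sample.csv", index=False)

loaded = pd.read_csv("sample.csv")

# Other formats follow the same pattern:
#   pd.read_excel("data.xlsx")      (requires openpyxl)
#   pd.read_json("data.json")
#   pd.read_parquet("data.parquet") (requires pyarrow)
#   pd.read_sql(query, connection)  (requires a DB driver)
```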
## Out of scope
- Real-time streaming data analysis (use appropriate streaming tools).
- Extremely large datasets requiring distributed computing (use Spark/Dask instead).
- Production ML model deployment (use ML ops tools and infrastructure).
- Live dashboarding (use BI tools like Tableau/Looker for operational dashboards).
## Conventions and best practices
### Python environment
- Use **virtual environments** to isolate dependencies.
- Install only necessary packages for the specific analysis.
- Document all dependencies in `requirements.txt` or `environment.yml`.
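A `requirements.txt` for a typical analysis might look like the following; the packages and pinned versions are illustrative, not prescriptive:

```
pandas==2.2.*
numpy==1.26.*
scipy==1.13.*
matplotlib==3.8.*
```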
### Code structure
- Write **self-contained scripts** that can be re-run by others.
- Use **clear variable names** and add comments for complex logic.
- **Separate concerns**: data loading, cleaning, analysis, visualization.
- Save **intermediate results** to files when analysis is multi-stage.
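One way to separate those concerns is a small function per stage plus a `__main__` guard, so the script is both importable and runnable. A sketch with inline data standing in for a real file:

```python
import pandas as pd

def load_data() -> pd.DataFrame:
    # In practice: pd.read_csv("data.csv"); inline data keeps this sketch runnable.
    return pd.DataFrame({"value": [3.0, None, 7.0, 5.0]})

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing values; document this choice in the report.
    return df.dropna().reset_index(drop=True)

def analyze(df: pd.DataFrame) -> dict:
    # Keep analysis outputs in a plain dict so they are easy to log or serialize.
    return {"n": len(df), "mean": df["value"].mean()}

if __name__ == "__main__":
    results = analyze(clean_data(load_data()))
    print(results)
```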
### Data handling
- **Never modify source data files** – work on copies or in-memory dataframes.
- **Document data transformations** clearly in code comments.
- **Handle missing values** explicitly and document approach.
- **Validate data quality** before analysis (check for nulls, outliers, duplicates).
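The quality checks above can be expressed in a few lines of pandas; the 1.5×IQR rule below is one common outlier convention, not the only valid one:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 11.0, None, 11.0, 300.0]})

n_missing = df["price"].isna().sum()   # missing values
n_dupes = df.duplicated().sum()        # exact duplicate rows

# Flag (don't silently drop) outliers with the 1.5*IQR rule.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
```

Note that the checks operate on an in-memory DataFrame, so the source file is never modified.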
### Visualization best practices
- Choose **appropriate chart types** for the data and question.
- Use **clear labels, titles, and legends** on all charts.
- Apply **appropriate color schemes** (colorblind-friendly when possible).
- Include **sample sizes and confidence intervals** where relevant.
- Save visualizations in **high-resolution formats** (PNG 300 DPI, SVG for vector graphics).
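A minimal matplotlib sketch applying these practices (the data, titles, and file names are placeholders; the hex color is from the colorblind-friendly Okabe-Ito palette):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [2.1, 4.0, 5.9, 8.2]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o", color="#0072B2")  # Okabe-Ito blue
ax.set_title("Revenue by quarter (n=4)")    # include sample size in the title
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue ($k)")
fig.tight_layout()
fig.savefig("revenue.png", dpi=300)  # high-resolution raster
fig.savefig("revenue.svg")           # vector format for print
plt.close(fig)
```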
### Statistical analysis
- **State assumptions** for statistical tests clearly.
- **Check assumptions** before applying tests (normality, homoscedasticity, etc.).
- **Report effect sizes**, not just p-values.
- **Use appropriate corrections** for multiple comparisons.
- **Explain practical significance** in addition to statistical significance.
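As a concrete sketch of assumption checking plus effect-size reporting, here is a two-sample comparison with SciPy: a Shapiro-Wilk normality check, Welch's t-test (which does not assume equal variances), and Cohen's d. The synthetic data and seed are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=10.0, scale=2.0, size=50)
b = rng.normal(loc=11.0, scale=2.0, size=50)

# Check the normality assumption before a t-test (Shapiro-Wilk).
p_norm_a = stats.shapiro(a).pvalue
p_norm_b = stats.shapiro(b).pvalue

# Welch's t-test: robust to unequal variances.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# Cohen's d as an effect size, using the pooled standard deviation.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (a.mean() - b.mean()) / pooled_sd
```

Report `cohens_d` alongside `p_value`: a tiny effect can be "significant" in a large sample, which is exactly the practical-vs-statistical distinction above.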
## Required behavior
1. **Understand the question**: Clarify what insights or decisions the analysis should support.
2. **Explore the data**: Check structure, types, missing values, distributions, outliers.
3. **Clean and prepare**: Handle missing data, outliers, and transformations appropriately.
4. **Analyze systematically**: Apply appropriate statistical methods or ML techniques.
5. **Visualize effectively**: Create clear, informative charts that answer the question.
6. **Generate insights**: Translate statistical findings into actionable business insights.
7. **Document thoroughly**: Explain methodology, assumptions, limitations, and conclusions.
8. **Make reproducible**: Ensure others can re-run the analysis and get the same results.
## Required artifacts
- **Analysis script(s)**: Well-documented Python code performing the analysis.
- **Visualizations**: Charts saved as high-quality image files (PNG/SVG).
- **Analysis report**: Markdown or text document summarizing:
- Research question and methodology
- Data description and quality assessment
- Key findings with supporting statistics
- Visualizations with interpretations
- Limitations and caveats
- Recommendations or next steps
- **Requirements file**: `requirements.txt` with all dependencies.
- **Sample data** (if appropriate and non-sensitive): Small sample for reproducibility.
## Implementation checklist
### 1. Data exploration and preparation
- [ ] Load data and inspect structure (shape, columns, types)
- [ ] Check for missing values, duplicates, outliers
- [ ] Generate summary statistics (mean, median, std, min, max)
- [ ] Visualize distributions of key variables
- [ ] Document data quality issues found
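The first three exploration items map to one-liners in pandas; a sketch on a toy frame (column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, None, 47],
    "segment": ["a", "b", "b", "a"],
})

shape = df.shape                      # (rows, columns)
dtypes = df.dtypes.to_dict()          # column types
missing = df.isna().sum().to_dict()   # missing values per column
summary = df["age"].describe()        # count/mean/std/min/quartiles/max
```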
### 2. Data cleaning and transformation
- [ ] Handle missing values (impute, drop, or flag)
- [ ] Address outliers if needed (cap, transform, or document)
- [ ] Create derived variables if needed
- [ ] Normalize or scale variables for modeling
- [ ] Split data if doing train/test analysis
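A sketch covering several of these cleaning steps with plain pandas/numpy (median imputation with an audit flag, z-score scaling, and a seeded 80/20 split; the column and threshold are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [40.0, None, 55.0, 70.0, 90.0, 60.0]})

# Median imputation; flag imputed rows instead of silently filling them.
df["income_imputed"] = df["income"].isna()
df["income"] = df["income"].fillna(df["income"].median())

# Z-score scaling for modeling.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Reproducible 80/20 train/test split with a fixed seed.
rng = np.random.default_rng(0)
mask = rng.random(len(df)) < 0.8
train, test = df[mask], df[~mask]
```

For larger projects, `sklearn.model_selection.train_test_split` and `sklearn.preprocessing.StandardScaler` do the same jobs with more options.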
### 3. Analysis execution
- [ ] Choose appropriate analytical methods
- [ ] Check statistical assumptions
- [ ] Execute analysis with proper parameters
- [ ] Calculate confidence intervals and effect sizes
- [ ] Perform sensitivity analyses if appropriate
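When the sampling distribution is awkward analytically, a bootstrap gives confidence intervals with few assumptions. A sketch for a 95% CI on the mean (the synthetic sample and resample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(loc=100.0, scale=15.0, size=200)

# Bootstrap 95% CI: resample with replacement, take percentiles of the means.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```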
### 4. Visualization
- [ ] Create exploratory visualizations
- [ ] Generate publication-quality final charts
- [ ] Ensure all charts have clear labels and titles
- [ ] Use appropriate color schemes and styling
- [ ] Save in high-resolution formats
### 5. Reporting
- [ ] Write clear summary of methods used
- [ ] Present key findings with supporting evidence
- [ ] Explain practical significance of results
- [ ] Document limitations and assumptions
- [ ] Provide actionable recommendations
### 6. Reproducibility
- [ ] Test that script runs from clean environment
- [ ] Document all dependencies
- [ ] Add comments explaining non-obvious code
- [ ] Include instructions for running analysis
## Verification
Run the following to verify the analysis:
```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install dependencies
pip install -r requirements.txt
# Run analysis script
python analysis.py
# Check outputs generated
ls -lh outputs/
```