by Kanth
In Simple Terms:
A Data Analyst finds patterns, answers business questions, and supports decisions using data.
Main Responsibilities:
Data Collection: Gather data from sources (SQL databases, APIs, Excel sheets, CRM tools, etc.)
Data Cleaning: Fix missing values, remove duplicates, format inconsistencies.
Exploratory Data Analysis (EDA): Find patterns, trends, and interesting insights in data.
Data Visualization: Build dashboards and charts (Power BI, Tableau, Excel) to explain the findings.
Business Reporting: Present insights to non-technical stakeholders (managers, marketing, sales, finance).
Statistical Analysis: Sometimes, basic stats like averages, standard deviation, correlation, hypothesis testing.
Automation: Writing simple scripts (Python, SQL) to automate data pulls and reporting.
Step-1 Master Core Skills (Theory + Hands-on)
Step-2 Practice Real Interview Questions
Step-3 Work on Portfolio Projects
Interviewers love when you show real-world practice.
Examples:
Step-4: Prepare for Non-Technical (Behavioral) Questions
Common questions:
Step-5 Mock Interviews and Resume Practice
Resume must highlight:
- Data Analytics: Focus on inspecting, cleaning, transforming, and modelling data to find useful insights.
- Data Science: Broader β includes advanced predictive modeling, machine learning, and AI.
- Data Engineering: Focuses on building and maintaining data pipelines and infrastructure.
- Descriptive Analytics: What happened? (e.g., monthly sales report)
- Diagnostic Analytics: Why did it happen? (e.g., churn analysis)
- Predictive Analytics: What could happen? (e.g., sales forecasting)
- Prescriptive Analytics: What should we do about it? (e.g., recommendation engines)
- Define the problem/business objective.
- Collect data.
- Clean data (handle missing values, outliers).
- Perform exploratory data analysis (EDA).
- Apply statistical methods or models.
- Interpret results.
- Communicate findings through visualizations and reports.
- Data Cleaning is the process of fixing or removing incorrect, corrupted, or incomplete data.
- Importance: Garbage in, garbage out β clean data ensures accurate results and decision-making.
- Deletion: Remove missing entries if the impact is small.
- Imputation: Replace missing values using mean, median, mode, or predictive modeling.
- Flagging: Mark missing data separately to consider during analysis.
- Excel: Quick analysis, pivot tables.
- SQL: Database querying.
- Python/R: Statistical analysis, modeling.
- Power BI/Tableau: Data visualization.
- SAS/SPSS: Statistical analysis (especially in banking and pharma industries).
- KPIs (Key Performance Indicators): Strategic measures tied to business objectives (e.g., Customer Retention Rate).
- Metrics: Broader measures that track the status of a specific process (e.g., website page views).
- INNER JOIN: Common records from both tables.
- LEFT JOIN: All records from left table, matched ones from right.
- RIGHT JOIN: All records from right table, matched ones from left.
- FULL JOIN: All records when there is a match in either table.
- Data Visualization is the graphical representation of information (charts, graphs).
- Importance: Helps stakeholders quickly grasp complex patterns, outliers, and trends.
- Structured Data: Organized into rows/columns (e.g., SQL databases).
- Unstructured Data: No formal structure (e.g., images, videos, emails, social media posts).
- A statistical method to test an assumption about a population.
- Example: “Is the average delivery time less than 30 minutes?”
- Key terms: Null Hypothesis (H0), Alternative Hypothesis (H1), p-value, confidence intervals.
- Mean, Median, Mode
- Standard Deviation, Variance
- Correlation, Covariance
- Regression Analysis (Linear, Logistic)
- A/B Testing
- OLAP (Online Analytical Processing): Used for complex queries, reporting, business intelligence (e.g., analyzing sales trends).
- OLTP (Online Transaction Processing): Used for day-to-day transactions (e.g., banking systems).
- Use simple language β avoid jargon.
- Focus on the business impact rather than technical metrics.
- Use storytelling: what the data says, why it matters, and what actions to take.
- Include visuals like charts and key summary points.
“At my previous role, I analyzed customer churn data for an e-commerce company. Using Python (Pandas, Matplotlib) and SQL, I identified that customers in certain regions were more likely to churn due to delayed deliveries. This led the company to optimize its logistics network, reducing churn by 12% over six months.”
- KPIsΒ areΒ measurable valuesΒ that demonstrate how effectively a company is achieving key business objectives.
- Example:
- For Sales Team:Β Monthly Revenue Growth
- For Customer Support:Β First Response Time
- Purpose: Focus the business on what matters most.
- KPIs: AlwaysΒ aligned to business goalsΒ (e.g., profit margin, customer churn rate).
- Metrics: Any data points you track (e.g., page views, app downloads), not always tied to a goal.
- Think of KPIs asΒ “critical” metrics.
- Understand business objectives.
- Make sure KPIs are:
- Specific
- Measurable
- Achievable
- Relevant
- Time-boundΒ (SMART criteria).
- PrioritizeΒ actionable KPIs, not vanity numbers.
- AΒ business dashboardΒ is aΒ visual displayΒ of the most important information needed to achieve business objectives.
- Importance:
- Provides at-a-glance views
- Speeds up decision-making
- Unifies data from multiple sources
- Strategic Dashboard: Tracks long-term KPIs (executive-level).
- Operational Dashboard: Monitors daily activities.
- Analytical Dashboard: Deep dive into trends and analysis.
- Tactical Dashboard: For middle management, focuses on processes.
- Static Dashboard: Data updated periodically (e.g., daily, weekly).
- Real-Time Dashboard: Updates automatically as new data flows in (good for monitoring critical operations).
- Know your audience: Executives need summary, Analysts need drill-downs.
- Focus on theΒ most critical KPIsΒ (not too many!).
- Use theΒ right chart types:
- Line chart for trends
- Bar chart for comparisons
- Pie chart for parts of a whole
- EnsureΒ clarity,Β minimalism, andΒ fast loading.
- Total Revenue
- Sales Conversion Rate
- Customer Lifetime Value (CLTV)
- Marketing:
- Cost per Acquisition (CPA)
- Return on Ad Spend (ROAS)
- Website Conversion Rate
- First Response Time
- Customer Satisfaction Score (CSAT)
- Ticket Resolution Time
- Net Promoter Score (NPS)
- E-commerce: Cart Abandonment Rate, Repeat Purchase Rate
- Banking: Non-Performing Assets (NPA) %, Loan Disbursement Time
- Healthcare: Patient Wait Time, Readmission Rates
- Education: Student Retention Rate, Course Completion Rate
- AΒ hierarchical breakdownΒ of KPIs into contributing metrics.
- Example:
- Overall Revenue
- Revenue per Region
- Revenue per Product Line
- Revenue per Salesperson
- Helps diagnose which part of the business is underperforming.
- BenchmarkingΒ is comparing your KPI performance against:
- Industry standards
- Historical data
- Competitor performance
- Helps set realistic targets.
- Use simple language: “Revenue grew by 10% this quarter due to 15% more new customers.”
- Focus onΒ impact and actionsΒ rather than just numbers.
- Use visuals: KPIs in green/red to show status (traffic light system).
- Poor data quality
- Lack of clear ownership of KPIs
- Tracking vanity metrics (like number of followers) instead of actionable KPIs
- Delay in data reporting affecting real-time decision-making
“In my previous role, I built a Sales Performance dashboard using Power BI. After identifying that Sales Conversion Rates were dropping in specific regions, we optimized the lead nurturing process. As a result, conversion rates improved by 18% over the next quarter.”
- SQL (Structured Query Language)Β is used to interact with databases.
- In Data Analytics, SQL helps toΒ extract,Β filter,Β aggregate, andΒ manipulateΒ data efficiently for analysis.
- DDL (Data Definition Language):Β CREATE,Β ALTER,Β DROP
- DML (Data Manipulation Language):Β SELECT,Β INSERT,Β UPDATE,Β DELETE
- DCL (Data Control Language):Β GRANT,Β REVOKE
- TCL (Transaction Control Language):Β COMMIT,Β ROLLBACK
- DQL (Data Query Language):Β SELECT
- INNER JOIN: Returns matching rows from both tables.
- LEFT JOIN: All rows from left table + matched rows from right.
- RIGHT JOIN: All rows from right table + matched rows from left.
- FULL JOIN: All rows when there is a match in either table.
- SELF JOIN: Joining a table to itself.
- WHERE: Filters rowsΒ beforeΒ grouping.
- HAVING: Filters groupsΒ afterΒ aggregation (GROUP BY).
- AΒ SubqueryΒ is a query inside another query.
- Can be used inΒ SELECT,Β FROM, orΒ WHEREΒ clauses.
- UNION: Combines results of two queries and removes duplicates.
- UNION ALL: Combines results andΒ keeps duplicates.
- GROUP BYΒ combined withΒ HAVING COUNT > 1Β identifies duplicates.
- Primary Key: Unique identifier for a table (no NULLs).
- Foreign Key: A field in one table that links to the Primary Key of another.
- Perform calculations across rows related to the current row.
- Examples:Β ROW_NUMBER(),Β RANK(),Β DENSE_RANK(),Β SUM() OVER (PARTITION BY…).
- ROW_NUMBER(): Unique sequence for each row.
- RANK(): Leaves gaps after ties.
- DENSE_RANK(): No gaps after ties.
- Aggregate functions perform calculations on multiple rows:
- SUM(),Β AVG(),Β COUNT(),Β MIN(),Β MAX()
Β
- Normalization: Organizing data to minimize redundancy.
- Types:
- 1NF (First Normal Form): Atomic columns.
- 2NF: No partial dependency.
- 3NF: No transitive dependency.
- Use indexes.
- Select only necessary columns.
- UseΒ WHEREΒ filters early.
- AvoidΒ SELECT *.
- Use joins efficiently.
- Avoid nested subqueries if not needed.
- Temporary result set that can be referenced within a query.
Β
-- SubQuery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
-- Using Dense Rank
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
) tmp
WHERE rnk = 2;
- Power BIΒ is aΒ business intelligenceΒ tool by Microsoft used toΒ analyze, visualize, and share dataΒ across teams or organizations.
- Importance: It turns raw data intoΒ interactive dashboards and reports, helping decision-makers act faster.
- Power BI Desktop: Create reports (used by developers/analysts).
- Power BI Service: Cloud-based service to share and collaborate on reports.
- Power BI Mobile: App to access reports on mobile devices.
- Power BI Gateway: Bridges on-premise data to cloud services.
- Power BI Report Server: For on-premises report hosting.
- DAX (Data Analysis Expressions): A collection ofΒ functions,Β operators, andΒ constantsΒ used to performΒ calculationsΒ andΒ data analysis.
- Think of DAX asΒ Excel formulasΒ but much moreΒ powerfulΒ for big datasets!
- Dashboard: AΒ single-pageΒ view showing visuals from multiple reports (good for KPIs, summary).
- Report:Β Multi-page, detailed set of visualizations built on one dataset.
- Bar Charts, Line Charts, Pie Charts, Tables, Matrix, Cards, Maps, Tree Maps, Gauge Charts, Waterfall Charts, Scatter Plots, and custom visuals from the marketplace.
- SlicersΒ are visual filters that allow users to filter dataΒ directly on the report.
- Example: Add a year slicer to let users select data for specific years.
- Measure: Dynamic calculation at report run-time (lighter and faster).
- Example:Β Total Profit = SUM(Sales[Profit])
- Calculated Column: Static calculation stored in the data model (heavier memory usage).
- Example: Adding a Profit Margin column asΒ Sales[Profit]/Sales[SalesAmount].
- RLSΒ restricts data access for users based on filters.
- Example: A manager can only see the sales data for his own region, not other regions.
- Files: Excel, CSV, XML, JSON.
- Databases: SQL Server, Oracle, MySQL, PostgreSQL.
- Cloud: Azure, Salesforce, Google Analytics, SharePoint, etc.
- APIs/Web: REST APIs, OData.
- Import Mode: Loads data into Power BI memory (fast performance, but needs refresh).
- DirectQuery: Data stays in the source; queries are sent live (no data storage, slower depending on network/database speed).
- Power Query: ETL (Extract, Transform, Load) tool for shaping and preparing data.
- M Language: The underlying language behind Power Query transformations.
- Example: Filtering, merging tables, changing column types.
- Reduce number of visuals on a page.
- Use Import mode where possible.
- Minimize calculated columns and use measures.
- Apply filters and limit data load.
- Optimize DAX queries (avoid repeated calculations).
- Star Schema: Fact table (large transactional table) in the center, surrounded by dimension tables.
- Benefits:
- Simpler queries
- Faster performance
- Easier data modeling
- Example: Sales (Fact Table) connected to Customers, Products, Regions (Dimension Tables).
- Relationships connect different tables usingΒ primary-foreign keyΒ logic.
- Types: One-to-Many (1:Many), Many-to-One, Many-to-Many.
- Example: Linking Customer ID in Sales table with Customer ID in Customer table.
- Calculated Column: Adds a column to an existing table using DAX.
- Calculated Table: Creates aΒ new tableΒ based on a DAX expression.
- SnowflakeΒ is aΒ cloud-based data warehouseΒ platform.
- Popular because:
- Handles large volumes of structured/semi-structured data (JSON, Avro, Parquet).
- Auto-scaling compute and storage separately (pay for what you use).
- Easy integration with BI tools (Power BI, Tableau) and Data Engineering pipelines.
- 3-Layered Architecture:
- Storage: Stores structured and semi-structured data.
- Compute: Virtual warehouses for processing queries.
- Cloud Services: Authentication, metadata, query optimization.
- Key difference:Β Separation of Storage and ComputeΒ β they can scale independently.
- Virtual WarehouseΒ = Compute engine.
- Runs queries, performs DML operations (insert, update).
- CanΒ scale up/downΒ (larger warehouses = faster) andΒ auto-suspendΒ when idle to save cost.
- Multi-cluster architecture
- Automatic scaling
- Time Travel (data recovery)
- Zero Copy Cloning (instant table/database copying)
- Semi-structured data support
- Secure Data Sharing without ETL
- Time TravelΒ allows users to access historical data for aΒ specified periodΒ (default 1 day, up to 90 days for enterprise).
- Example: Recover accidentally deleted data.
- Query example:
SELECT * FROM sales AT (TIMESTAMP => ‘2025-04-01 00:00:00’);
- After Time Travel expires,Β Fail-safeΒ provides aΒ 7-day recovery window.
- Managed by Snowflake support team (not directly by users).
- Create aΒ zero-copy cloneΒ of a table, schema, or database.
- Cloning is instant and storage-efficient.
- No need to duplicate data physically at the time of cloning.
- Snowflake automatically divides large tables intoΒ micro-partitionsΒ (~16 MB compressed size).
- Benefits:
- Faster query performance through partition pruning.
- No manual partitioning needed.
- Caching Layers:
- Result Cache: Stores results of previous queries.
- Metadata Cache: Caches file locations, table schemas.
- Data Cache: Stores local copies of data in SSD storage.
- Result Cache Example:
- If you rerun a query, Snowflake returns results from cache without reprocessing.
- Star Schema: Central fact table with de-normalized dimension tables.
- Snowflake Schema: Fact table connected toΒ normalizedΒ dimension tables (i.e., dimensions have sub-dimensions).
- Snowflake (the tool) supports both, but Snowflake Schema increases query complexity slightly for storage optimization.
- RBAC (Role-Based Access Control)Β model.
- UsersΒ are assignedΒ RolesΒ (Admin, Reader, Writer), and Roles have permissions on objects (databases, tables).
- UseΒ VARIANTΒ data type.
- You can store JSON, XML, Avro, ORC, or Parquet directly.
- Powerful for modern data analytics where APIs generate semi-structured outputs.
- Stream: Change Data Capture (CDC) mechanism β tracks changes (INSERT, UPDATE, DELETE) to a table.
- Task: Scheduled jobs to run SQL scripts or procedures.
- Secure Data SharingΒ allows sharing live, read-only dataΒ across different Snowflake accountsΒ without copying data.
- No data movement. Itβs instant and secure.
- Example use case: Vendors sharing sales data with distributors.
- Use correct Warehouse size (scale up for large queries, scale out for concurrency).
- Minimize large SELECT * queries.
- Use clustering keys if the table grows very large (>1 TB).
- Always monitor and auto-suspend warehouses when idle.
- Filter queries early to take advantage of micro-partition pruning.