Glossary — Data, AI & Analytics Terms

A

2 terms

Apache Iceberg: An open table format that adds database-grade reliability (schema evolution, hidden partitioning, time travel, and ACID transactions) on top of object storage such as S3 or GCS. Iceberg is one of the foundational formats that make the modern lakehouse possible.
Aqua: Dview's high-concurrency query engine. Aqua serves dashboards and ad-hoc analytics over centralized data with autoscaling, caching, and predictable performance — without per-query cost surprises.

B

1 term

Batch Processing: Running a data job over a bounded chunk of data on a schedule (hourly, daily, weekly) rather than continuously. Batch is simpler to reason about and cheaper than streaming, and remains the right choice for most analytical workloads.

C

1 term

CDC (Change Data Capture): A technique for detecting and propagating only the rows that have changed in a source database, rather than re-copying entire tables. CDC dramatically reduces load on operational systems and makes near-real-time replication economical.
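
As a rough illustration of the CDC idea above, the sketch below applies a stream of change events to a replica instead of re-copying the table. The event shape (`op`, `key`, `row`) and the `apply_changes` helper are hypothetical, chosen for the example; real CDC tools define their own event formats.

```python
from typing import Any


def apply_changes(replica: dict[Any, dict], events: list[dict]) -> dict[Any, dict]:
    """Apply an ordered list of change events to an in-memory replica.

    Hypothetical event shape: {"op": "insert" | "update" | "delete", "key": ..., "row": {...}}
    """
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: merge the changed columns into the existing row, if any.
            replica[key] = {**replica.get(key, {}), **event["row"]}
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return replica


replica = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Lin", "city": "Pune"},
}
events = [
    {"op": "update", "key": 1, "row": {"city": "Paris"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Sam", "city": "Oslo"}},
]
apply_changes(replica, events)
```

Only three small events cross the wire here, however large the source table is, which is where the reduced load on the operational system comes from.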

D

11 terms

Data Catalog: A searchable inventory of every dataset in the organization — table, file, dashboard — enriched with descriptions, owners, lineage, and quality scores. The catalog is the front door of any data platform: if a dataset is not in the catalog, it effectively does not exist.
Data Fabric: An architectural approach that weaves together storage, compute, governance, and access across multiple clouds and silos so that data can be queried and governed as if it were in one place. Dview's platform is a data fabric implementation.
Data Lakehouse: A storage architecture that combines the cheap, open-format scale of a data lake with the transactional consistency, schema enforcement, and SQL performance of a warehouse. Lakehouses replace the legacy split between operational lakes and analytical warehouses.
Data Lineage: An auditable graph of how every column flows through every transformation — from the source system that produced it to the dashboard that displays it. Lineage powers impact analysis, compliance reporting, and confident debugging.
Data Masking: Replacing sensitive values (credit cards, IDs) with realistic-looking but non-identifying tokens — either irreversibly (anonymization) or reversibly (tokenization) — so that the data remains useful for analytics without exposing the underlying secret.
Data Mesh: An organizational pattern where domain teams own their data products end-to-end, supported by a central self-serve platform. Mesh emphasizes ownership and decentralization; it is complementary to (not a replacement for) lakehouse architecture.
Data Observability: Continuous monitoring of pipelines and datasets for freshness, volume, schema, distribution, and lineage anomalies. Where data quality asks 'is this row right?', observability asks 'is this entire pipeline behaving as expected right now?'.
Data Product: A curated, versioned, documented dataset built and maintained with the same rigor as a software product — owner, SLA, contract, deprecation policy. Treating datasets as products is the central idea behind data mesh.
Data Quality: How accurate, complete, consistent, timely, and unique a dataset is for its intended purpose. Modern platforms encode quality as automated tests on every pipeline run, blocking bad data from reaching downstream consumers.
Data Warehouse: A relational store optimized for analytical queries on structured data. Classic warehouses (Snowflake, BigQuery, Redshift) excel at SQL performance; lakehouses now offer comparable performance over open formats with more flexibility.
DSense: Dview's natural-language interface for enterprise data: business users ask questions in plain English and DSense returns SQL-grounded, citation-backed answers using any LLM, with VPC deployment and built-in security guardrails.
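
To make the two flavors in the Data Masking entry above concrete, here is a minimal sketch contrasting an irreversible mask with a reversible token. The `anonymize` function and `TokenVault` class are hypothetical names for this example, the tokens are not format-preserving, and a salted hash alone is not a complete anonymization scheme for low-entropy values; production systems typically rely on a dedicated tokenization service.

```python
import hashlib
import secrets


def anonymize(value: str, salt: bytes) -> str:
    """Irreversible masking: salted SHA-256 digest, truncated for readability."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()[:16]


class TokenVault:
    """Reversible masking: random tokens plus a lookup table kept in a secured store."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same input always maps to the same token.
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


salt = secrets.token_bytes(16)
vault = TokenVault()
card = "4111111111111111"

masked = anonymize(card, salt)   # cannot be turned back into the card number
token = vault.tokenize(card)     # can be reversed, but only through the vault
```

Because both outputs are stable for a given input, joins and group-bys on the masked column still work for analytics.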

E

3 terms

ELT (Extract, Load, Transform): A data integration pattern that loads raw data into the target store first and transforms it there, leveraging the target's compute. ELT has largely displaced ETL as the default pattern now that warehouses and lakehouses are powerful enough to run the transformations themselves.
Embedding: A dense vector representation of text, images or other data where semantically similar inputs end up close together in the vector space. Embeddings power retrieval-augmented generation, semantic search, and recommendation.
ETL (Extract, Transform, Load): The original data integration pattern: pull data from sources, transform it on a separate engine, then load the result into the warehouse. ETL is still appropriate when target compute is constrained or transformations are heavy.
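
The "close together in the vector space" part of the Embedding entry above is usually measured with cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors purely for illustration; they are made-up numbers, not the output of any real embedding model, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example.
toy_embeddings = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "giraffe": [0.0, 0.1, 0.9],
}


def nearest(query: str) -> str:
    """Return the other term whose vector is most similar to the query term's vector."""
    q = toy_embeddings[query]
    others = {k: v for k, v in toy_embeddings.items() if k != query}
    return max(others, key=lambda k: cosine_similarity(q, others[k]))
```

With these toy numbers, `nearest("invoice")` returns `"receipt"`; semantic search and RAG retrieval apply the same comparison at a much larger scale.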

F

2 terms

Federated Query: Running a single query that transparently spans multiple underlying data stores — for example joining a Postgres table to a Parquet file in S3 without moving the data. Federated query is the backbone of a data fabric.
Fiber: Dview's no-code data pipeline product. Fiber connects 100+ source systems to a centralized lakehouse with auto-schema sync, CDC, and analytics-ready delivery — without engineers writing custom connectors.

H

1 term

Hallucination: When a large language model generates confident output that is factually wrong or unsupported by its inputs. RAG and other techniques that ground answers in trusted data are the primary defenses against hallucination in enterprise applications.

L

1 term

LLM (Large Language Model): A neural network trained on massive text corpora that can generate, summarize, translate and reason over natural language. In enterprise data, LLMs are typically used together with retrieval (RAG) so answers stay grounded in the customer's own data.

M

2 terms

Materialized View: A precomputed query result stored as a table and refreshed on a schedule or on data change. Materialized views trade storage and freshness for query speed, and are essential for high-concurrency dashboards.
Metadata: Data about data: schemas, owners, descriptions, freshness, sample values, lineage, quality scores. Modern platforms treat metadata as a first-class citizen — a queryable asset in its own right.
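
The storage-and-freshness trade-off in the Materialized View entry above can be sketched with Python's built-in `sqlite3` module. SQLite has no native materialized views, so this example emulates one with an ordinary table and a manual refresh function; the table names and `refresh_sales_by_region` are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL);
    INSERT INTO orders VALUES ('north', 10.0), ('north', 5.0), ('south', 7.0);
""")


def refresh_sales_by_region(conn: sqlite3.Connection) -> None:
    """Full refresh: recompute the aggregate and replace the stored result."""
    with conn:  # the DELETE and INSERT are committed together
        conn.execute("DELETE FROM sales_by_region")
        conn.execute(
            "INSERT INTO sales_by_region "
            "SELECT region, SUM(amount) FROM orders GROUP BY region"
        )


def read_totals(conn: sqlite3.Connection) -> dict:
    """Cheap read: no aggregation happens at query time."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))


refresh_sales_by_region(conn)
before = read_totals(conn)

conn.execute("INSERT INTO orders VALUES ('south', 3.0)")
stale = read_totals(conn)      # still the old totals until the next refresh

refresh_sales_by_region(conn)
fresh = read_totals(conn)
```

The `stale` read shows the freshness cost: dashboards read quickly from the stored result, but only see new orders after a refresh.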

O

2 terms

OLAP (Online Analytical Processing): Workloads that aggregate large amounts of historical data to answer analytical questions, such as dashboards, reports, and ad-hoc analysis. OLAP systems are typically columnar, read-optimized, and tolerant of higher latency.
OLTP (Online Transaction Processing): Workloads that read and write small amounts of data with strict consistency and low latency, such as placing an order or recording a payment. OLTP systems are typically row-oriented and write-optimized.

P

3 terms

Parquet: An open columnar file format that compresses well and lets query engines skip irrelevant columns and row groups. Parquet is the de facto on-disk format underneath most modern lakehouse tables.
PII (Personally Identifiable Information): Data that can identify an individual — name, email, government ID, IP address — directly or in combination with other fields. PII triggers regulatory obligations under GDPR, CCPA, India's DPDP Act, and similar laws.
Pipeline: An ordered sequence of steps that move and reshape data from source to consumer. A modern pipeline is version-controlled, observable, idempotent, and exposes its lineage and quality state to downstream users.
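
Of the properties listed in the Pipeline entry above, idempotency is the easiest to show in a few lines: re-running a step with the same input should leave the target unchanged. The sketch below is one common way to get there, an upsert keyed on a primary key; `load_batch` and the record shape are hypothetical.

```python
def load_batch(target: dict[str, dict], batch: list[dict]) -> dict[str, dict]:
    """Upsert each record by its primary key, so replays overwrite instead of duplicating."""
    for record in batch:
        target[record["id"]] = record
    return target


batch = [
    {"id": "a1", "amount": 40},
    {"id": "b2", "amount": 15},
]

table: dict[str, dict] = {}
load_batch(table, batch)
after_first_run = dict(table)

# Simulate a retry after a failure: the same batch is loaded again.
load_batch(table, batch)
after_retry = dict(table)
```

An append-only load would have produced four rows after the retry; keying on `id` keeps the result at two.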

R

3 terms

RAG (Retrieval-Augmented Generation): An architecture where an LLM is given the most relevant chunks of trusted data at query time, retrieved from a vector store or SQL warehouse. RAG is the standard pattern for grounding LLM answers in enterprise data.
RBAC (Role-Based Access Control): An access model where permissions are granted to named roles, and users inherit permissions by holding roles. RBAC scales better than per-user permissions and is the foundation of most enterprise data security postures.
Row-Level Security: Access control that filters rows of a table based on the identity or attributes of the requesting user — for example, a regional manager only sees rows for their region. Critical for shared, multi-tenant analytics.
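
The regional-manager example in the Row-Level Security entry above can be sketched as a filter applied before any rows reach the user. The `User` class, role names, and `visible_rows` function are hypothetical; in practice the policy usually lives in the database or query engine rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    name: str
    role: str                      # hypothetical roles: "admin", "regional_manager"
    region: Optional[str] = None


SALES = [
    {"order_id": 1, "region": "north", "amount": 120},
    {"order_id": 2, "region": "south", "amount": 80},
    {"order_id": 3, "region": "north", "amount": 45},
]


def visible_rows(user: User, rows: list[dict]) -> list[dict]:
    """Return only the rows this user is allowed to see."""
    if user.role == "admin":
        return list(rows)
    if user.role == "regional_manager":
        return [r for r in rows if r["region"] == user.region]
    return []  # default deny for unknown roles


north_manager = User("Priya", "regional_manager", region="north")
admin = User("Omar", "admin")
guest = User("Lee", "guest")
```

Here `visible_rows(north_manager, SALES)` yields orders 1 and 3 only, while the unknown `guest` role sees nothing, which is the safer default.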

S

5 terms

SCD (Slowly Changing Dimension): A modeling pattern for tracking how an entity (e.g. a customer's address) changes over time. Type 1 overwrites; Type 2 keeps history with effective-from / effective-to columns. Type 2 is the workhorse of dimensional modeling.
Schema Drift: When the structure of incoming data changes unexpectedly — a column is renamed, a type widens, a field disappears. Detecting drift early is the difference between a five-minute heads-up and a broken dashboard at 9am.
Schema Evolution: The ability of a table format to change its schema over time (adding columns, renaming columns, widening types) without breaking historical queries. Apache Iceberg, Delta Lake, and Apache Hudi all support schema evolution.
Star Schema: A dimensional model with a central fact table (orders, events) joined to surrounding dimension tables (customer, product, date). Star schemas are intuitive, queryable by BI tools, and the dominant pattern in analytical warehouses.
Streaming: Processing data continuously as events arrive, typically with latency of seconds or less. Streaming powers real-time fraud detection, alerting, and live dashboards; it is more operationally complex than batch and is best reserved for genuinely time-sensitive use cases.
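
The Type 2 behavior described in the SCD entry above (keep history with effective-from and effective-to columns) can be sketched as follows. The row layout, the use of `None` for an open-ended `effective_to`, and the helper names are assumptions for this example; real dimensional models often also carry surrogate keys and a current-row flag.

```python
from datetime import date
from typing import Optional


def apply_scd2(history: list[dict], key: str, new_attrs: dict, as_of: date) -> list[dict]:
    """Close the current row for `key` if its attributes changed, then append a new current row."""
    current = next(
        (row for row in history if row["key"] == key and row["effective_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return history  # nothing changed, so no new version is recorded
        current["effective_to"] = as_of
    history.append(
        {"key": key, "attrs": dict(new_attrs), "effective_from": as_of, "effective_to": None}
    )
    return history


def attrs_as_of(history: list[dict], key: str, when: date) -> Optional[dict]:
    """Look up the version of `key` that was in effect on the given date."""
    for row in history:
        if row["key"] != key:
            continue
        still_open = row["effective_to"] is None or when < row["effective_to"]
        if row["effective_from"] <= when and still_open:
            return row["attrs"]
    return None


history: list[dict] = []
apply_scd2(history, "cust-1", {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(history, "cust-1", {"city": "Mumbai"}, date(2024, 6, 1))
```

A Type 1 approach would simply overwrite `city`; here both versions survive, so a report dated 2023 can still resolve the customer to the address that was valid at the time.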

T

1 term

Text-to-SQL: Translating a natural-language question into an executable SQL query against the right tables. Modern text-to-SQL combines schema retrieval, LLM generation, and validation against a semantic layer to keep results trustworthy.

V

2 terms

Vector Database: A database optimized for similarity search over high-dimensional vectors (embeddings). Vector DBs are the retrieval half of the RAG pattern, returning the most semantically relevant chunks for a query in milliseconds.
VPC Deployment: Running a service inside the customer's own Virtual Private Cloud so that data never leaves their network perimeter. VPC deployment is the gold standard for regulated industries — banks, AMCs, healthcare — that cannot share data with vendor-managed multi-tenant environments.