Data Lake Insight (DLI)
Data Lake Insight (DLI) is a serverless big data query and analysis service fully compatible with Apache Spark and Apache Flink ecosystems. DLI supports standard SQL and is compatible with Spark and Flink SQL. It also supports multiple access modes and is compatible with mainstream data formats. DLI supports SQL statements and Spark applications for heterogeneous data sources, including CloudTable, RDS, DWS, CSS, OBS, custom databases on ECSs, and offline databases.
Spark is a unified analysis engine that is ideal for large-scale data processing. It focuses on query, compute, and analysis. DLI optimizes performance and reconstructs services based on open-source Spark. It is compatible with the Apache Spark ecosystem and interfaces, and delivers 2.5x the performance of open-source Spark. This allows DLI to query and analyze exabytes of data within hours.
Flink is a distributed compute engine that is ideal for both batch processing (static and historical data sets) and stream processing (real-time data streams, producing results in real time). DLI enhances features and security based on open-source Flink and provides the Stream SQL feature required for data processing.
DLI lets you explore terabytes of data in your data lake within seconds using standard SQL, with zero O&M burden.
Fully compatible with Apache Spark and Flink; stream & batch processing and interactive analysis in one place.
On-demand, shared access to pooled resources, flexible scaling based on preset priorities.
Seamlessly migrate your offline applications to the cloud with serverless technology. DLI is fully compatible with Apache Spark, Apache Flink, and Presto ecosystems and APIs.
Analyze your data across databases. No migration required. A unified view of your data gives you a comprehensive understanding of your data and helps you innovate faster. There are no restrictions on data formats or cloud data sources, or on whether the database was created online or offline.
DLI decouples storage from computing so that you can reduce costs while improving resource utilization.
DLI has a comprehensive permission control mechanism and supports fine-grained authorization through Identity and Access Management (IAM). You can create policies in IAM to manage DLI permissions. You can use both DLI's own permission control mechanism and the IAM service for permission management.
When using DLI on the cloud, enterprise users need to manage DLI resources (queues) used by employees in different departments, including creating, deleting, using, and isolating resources. In addition, data of different departments needs to be managed, including data isolation and sharing.
DLI uses IAM for refined enterprise-level multi-tenant management. IAM provides identity authentication, permissions management, and access control, helping you securely access your cloud resources.
With IAM, you can use your cloud account to create IAM users for your employees and assign permissions to the users to control their access to specific resource types. For example, some software developers in your enterprise may need to use DLI resources but should not delete them or perform any high-risk operations. To guarantee this result, you can create IAM users for the software developers and grant them only the permissions required for using DLI resources.
Roles: A type of coarse-grained authorization mechanism that defines permissions related to user responsibilities. This mechanism provides only a limited number of service-level roles for authorization. When using roles to grant permissions, you need to also assign other roles on which the permissions depend to take effect. However, roles are not an ideal choice for fine-grained authorization and secure access control.
Policies: A type of fine-grained authorization mechanism that defines permissions required to perform operations on specific cloud resources under certain conditions. This mechanism allows for more flexible policy-based authorization, meeting requirements for secure access control. For example, you can grant DLI users only the permissions for managing a certain type of ECSs.
IAM provides the following system-defined permissions for DLI:
- All permissions for DLI (system-defined policy)
- DLI read permissions (system-defined policy)
- DLI Service Admin (system-defined role)
In addition, DLI provides the following service-level permission types:
- Queue management permissions
- Data permissions. For details, see SQL Syntax of Batch Jobs > Data Permissions Management > Data Permissions List in the Data Lake Insight SQL Syntax Reference.
- Flink job permissions
- Package group permissions
- Datasource connection permissions. For details, see Permission-related APIs > Granting Users with the Data Usage Permission in the Data Lake Insight API Reference.
You can use SQL statements in the SQL job editor to run data queries. DLI supports SQL 2003 and is compatible with Spark SQL.
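To make the idea concrete, here is the kind of standard SQL (SELECT with GROUP BY and ORDER BY) you would enter in the SQL editor. The query is run against an in-memory SQLite database purely for illustration; SQLite stands in for DLI here, and the table and data are made up.

```python
import sqlite3

# Build a throwaway table so the standard-SQL query has something to run on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 80.0)],
)

# A plain SQL 2003-style aggregation query, as you might submit in a SQL job.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 80.0)]
```

The same statement text, unchanged, would be valid input for any engine that accepts standard SQL aggregation syntax.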
On the Overview page, click 'SQL Editor' in the navigation pane on the left, or click 'Create Job' in the upper right corner of the SQL Jobs pane. The SQL Editor page is displayed.
A message is displayed, indicating that a temporary DLI data bucket will be created. The created bucket is used to store temporary data generated by DLI, such as job logs. You cannot view job logs if you choose not to create the bucket. You can periodically delete objects in the bucket or transition them between storage classes. A default bucket name is provided.
SQL jobs allow you to execute SQL statements entered in the SQL Editor, import data, and export data.
SQL job management provides the following functions:
- Searching for jobs: Search for jobs that meet the search criteria.
- Viewing job details: Display job details.
- Terminating a job: Stop a job in the ‘Submitting’ or ‘Running’ status.
- Exporting query results: A maximum of 1000 records can be displayed in the query result on the console. To view more or all data, you can export the data to OBS.
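The export function above exists because the console caps the displayed result at 1,000 records. The sketch below illustrates that distinction with a hypothetical helper (`export_results` is not a DLI API; the CSV target stands in for an OBS object): the full result set is written out even though only the first 1,000 rows would ever appear on screen.

```python
import csv
import io

CONSOLE_LIMIT = 1000  # the console displays at most 1,000 records

def export_results(rows, fileobj):
    """Hypothetical helper: write the complete result set to CSV,
    the way exporting to OBS captures rows beyond the display cap."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "value"])  # header row
    for row in rows:
        writer.writerow(row)
    return len(rows)

rows = [(i, i * 2) for i in range(2500)]  # more rows than the console shows
displayed = rows[:CONSOLE_LIMIT]          # what you would see on screen
buf = io.StringIO()
exported = export_results(rows, buf)      # what the export captures
print(len(displayed), exported)  # 1000 2500
```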
Resources in Queue Management
Queues in DLI are computing resources, which are the basis for using DLI. All executed jobs require computing resources.
Currently, DLI provides two types of queues: for SQL and for general use. SQL queues are used to run SQL jobs. General-use queues are compatible with Spark queues of earlier versions and are used to run Spark and Flink jobs.
DLI database and table management provide the following functions:
- Database Permission Management
- Table Permission Management
- Creating a database or a table
- Deleting a database or a table
- Modifying the owners of databases and tables
- Importing data to the table
- Exporting data from DLI to OBS
- Viewing metadata
- Previewing data
To facilitate SQL operation execution, DLI allows you to customize query templates or save the SQL statements in use as templates. After a template is saved, you do not need to write the SQL statements again; you can perform the SQL operations directly from the template.
SQL templates include sample templates and custom templates. The default sample template contains 22 standard TPC-H query statements, which can meet most TPC-H test requirements.
SQL template management provides the following functions:
- Sample templates
- Custom templates
- Creating a template
- Executing the template
- Searching for a template
- Modifying a template
- Deleting a template
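The template workflow above can be sketched in a few lines. Everything here is an illustration of the idea, not DLI's template API: templates are kept in a plain dict, and parameter substitution uses Python string formatting.

```python
# In-memory template store (an assumption for illustration only).
templates = {}

def save_template(name, sql):
    """Save a SQL statement under a template name."""
    templates[name] = sql

def render_template(name, **params):
    """Fill a saved template's placeholders and return runnable SQL."""
    return templates[name].format(**params)

# Save once...
save_template(
    "daily_revenue",
    "SELECT order_date, SUM(amount) FROM orders "
    "WHERE order_date = '{day}' GROUP BY order_date",
)

# ...then reuse with different parameters, no rewriting needed.
sql = render_template("daily_revenue", day="2024-01-01")
print(sql)
```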
DLI supports the native Spark datasource capability and extends it. With DLI datasource connections, you can access other data storage services through SQL statements, Spark jobs, and Flink jobs, and import, query, analyze, and process data in those services.
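The following toy example shows what a cross-source query looks like in principle: two separate databases stand in for two external services, and one SQL statement joins across them without moving data. SQLite's `ATTACH` is used purely for demonstration; the `rds` alias and all table data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                  # "source A", e.g. OBS-backed data
conn.execute("ATTACH DATABASE ':memory:' AS rds")   # "source B", e.g. an RDS database

conn.execute("CREATE TABLE clicks (user_id INTEGER, ad TEXT)")
conn.execute("CREATE TABLE rds.users (user_id INTEGER, name TEXT)")
conn.execute("INSERT INTO clicks VALUES (1, 'banner')")
conn.execute("INSERT INTO rds.users VALUES (1, 'alice')")

# A single SQL statement queries both sources in place.
rows = conn.execute(
    "SELECT u.name, c.ad FROM clicks c "
    "JOIN rds.users u ON u.user_id = c.user_id"
).fetchall()
print(rows)  # [('alice', 'banner')]
```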
Global variables can be used to simplify complex parameters. For example, long, hard-to-read values can be replaced with short variable names to improve the readability of SQL statements.
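A minimal sketch of that idea (the mechanism, not DLI's actual implementation): a long value is defined once under a short name, then expanded inside SQL text wherever it is referenced. The `{name}` placeholder syntax and the OBS path are assumptions for illustration.

```python
# Define the long value once, under a short readable name.
global_vars = {
    "events_path": "obs://my-bucket/warehouse/2024/01/events/part-00000",
}

def expand(sql, variables):
    """Replace {name} placeholders in a SQL string with variable values."""
    for name, value in variables.items():
        sql = sql.replace("{" + name + "}", value)
    return sql

# The statement stays short and readable; expansion happens before execution.
sql = expand("SELECT * FROM source_table WHERE path = '{events_path}'", global_vars)
print(sql)
```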
Application data stored in relational databases needs analysis to derive more value. For example, big data from registration details helps with commercial decision-making.
- Complicated queries run poorly, if at all, on large relational databases.
- Comprehensive analysis is difficult because database and table partitions are spread across multiple relational databases. Business data analysis might overload the available resources.
- SQL experience transferability
Hit the ground running with new services. DLI supports standard ANSI SQL 2003 relational database syntax so there is almost no learning curve.
- Versatile, robust performance
Distributed in-memory computing models effortlessly handle complicated queries, cross-partition analysis, and business intelligence processing.
Cloud Data Migration (CDM)
Associative analysis combines information from multiple channels to improve conversion rates.
- Cross-source analysis
Advertisement CTR data stored in OBS and user registration data in RDS can be directly queried without migration to DLI.
- Only SQL needed
Interconnected data sources are mapped to a table created using nothing but SQL statements.
When multiple departments need to manage resources independently, fine-grained permissions management improves data security and operations efficiency.
- Easier permissions assignment
Grant permissions by column or by specific operation, such as INSERT INTO/OVERWRITE, and set metadata to read-only.
- Unified management
A single IAM account handles permissions for all staff users.
Genome analysis relies on third-party analysis libraries, which are built on the Spark distributed framework.
- High technical skills are required to install analysis libraries such as ADAM and Hail.
- Every time you create a cluster, you have to install these analysis libraries again.
- Custom images
Instead of installing libraries in a technically demanding process, package them into custom images uploaded directly to the Software Repository for Container (SWR). When using DLI to create a cluster, custom images in SWR are automatically pulled so you don't have to reinstall these libraries.
Real-time Risk Control
Almost every aspect of financial services requires comprehensive risk management and mitigation.
- There is very little tolerance for excessive latency when it comes to risk control.
- High throughput
Real-time data analysis in DLI with the help of an Apache Flink dataflow model keeps latency low. A single CPU processes 1,000 to 20,000 messages per second.
- Ecosystem coverage
Save real-time data streams to multiple cloud services such as CloudTable and SMN for comprehensive application.
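The low-latency dataflow idea behind real-time risk control can be sketched in plain Python (this is not Flink, and the 10-second window and alert threshold are arbitrary assumptions): events are grouped into tumbling time windows per account, and an unusually high count in one window flags a risk.

```python
from collections import defaultdict

WINDOW_SECONDS = 10   # tumbling window size (assumed)
ALERT_THRESHOLD = 3   # events per window that trigger an alert (assumed)

def window_counts(events):
    """events: iterable of (timestamp_seconds, account_id).
    Returns event counts keyed by (window_index, account_id)."""
    counts = defaultdict(int)
    for ts, account in events:
        window = ts // WINDOW_SECONDS
        counts[(window, account)] += 1
    return counts

# Account 'a' fires four events within one 10-second window.
events = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (15, "b")]
alerts = [k for k, v in window_counts(events).items() if v >= ALERT_THRESHOLD]
print(alerts)  # [(0, 'a')] — window 0, account 'a' exceeded the threshold
```

A real Flink job would express the same logic declaratively with keyed windows, processing events continuously rather than from a list.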
Big Data Analysis
Massive volumes of data include petabytes of satellite images and many data types, including structured remote sensing grid data, vector data, and unstructured spatial location data. The analysis and mining of all this data needs efficient tools.
- Spatial data analysis
Spark algorithm operators in DLI enable real-time stream processing and offline batch processing. They support massive volumes of many data types, including structured remote sensing image data, unstructured 3D models, and laser point clouds.
- CEP SQL functionality
SQL statements are all that is needed for yaw detection and geo-fencing.
- Heavy data processing
Quickly migrate up to exabytes of remote sensing images to the cloud, then slice them into data sources for distributed batch processing.
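To illustrate the geo-fencing use case mentioned above, here is a toy fence check in plain Python (not DLI's CEP SQL): any position outside a rectangular fence is flagged. The coordinates and fence bounds are made-up values.

```python
# Rectangular geo-fence expressed as lon/lat bounds (assumed values).
FENCE = {"min_lon": 116.0, "max_lon": 117.0, "min_lat": 39.5, "max_lat": 40.5}

def outside_fence(lon, lat, fence=FENCE):
    """Return True if the point lies outside the rectangular fence."""
    inside = (fence["min_lon"] <= lon <= fence["max_lon"]
              and fence["min_lat"] <= lat <= fence["max_lat"])
    return not inside

# A vehicle track: the last point has strayed beyond the fence.
track = [(116.3, 39.9), (116.8, 40.1), (118.2, 41.0)]
breaches = [p for p in track if outside_fence(*p)]
print(breaches)  # [(118.2, 41.0)]
```

A CEP SQL version would express the same condition as a pattern over the event stream, so a breach is detected the moment the offending position arrives.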