Data Lake Insight (DLI)
Data Lake Insight (DLI) is a serverless big data query and analysis service fully compatible with Apache Spark and Apache Flink ecosystems. DLI supports standard SQL and is compatible with Spark and Flink SQL. It also supports multiple access modes and is compatible with mainstream data formats. DLI supports SQL statements and Spark applications for heterogeneous data sources, including CloudTable, RDS, DWS, CSS, OBS, custom databases on ECSs, and offline databases.
Spark is a unified analysis engine that is ideal for large-scale data processing. It focuses on query, compute, and analysis. DLI optimizes performance and reconstructs services based on open-source Spark. It is compatible with the Apache Spark ecosystem and interfaces, and delivers 2.5x the performance of open-source Spark. This allows DLI to query and analyze exabytes of data within hours.
Flink is a distributed compute engine that is ideal for both batch processing (static and historical data sets) and stream processing (real-time data streams, producing results in real time). DLI enhances features and security based on open-source Flink and provides the Stream SQL feature required for data processing.
DLI lets you explore terabytes of data in your data lake within seconds using standard SQL, with zero O&M burden.
Fully compatible with Apache Spark and Flink; stream & batch processing and interactive analysis in one place.
On-demand, shared access to pooled resources, flexible scaling based on preset priorities.
Seamlessly migrate your offline applications to the cloud with serverless technology. DLI is fully compatible with Apache Spark, Apache Flink, and Presto ecosystems and APIs.
Analyze your data across databases. No migration required. A unified view of your data gives you a comprehensive understanding of your data and helps you innovate faster. There are no restrictions on data formats or cloud data sources, or on whether the database was created online or offline.
DLI decouples storage from computing so that you can reduce costs while improving resource utilization.
DLI has a comprehensive permission control mechanism and supports fine-grained authorization through Identity and Access Management (IAM). You can create policies in IAM to manage DLI permissions. You can use both DLI's own permission control mechanism and the IAM service for permission management.
When using DLI on the cloud, enterprise users need to manage DLI resources (queues) used by employees in different departments, including creating, deleting, using, and isolating resources. In addition, data of different departments needs to be managed, including data isolation and sharing.
DLI uses IAM for refined enterprise-level multi-tenant management. IAM provides identity authentication, permissions management, and access control, helping you securely access your cloud resources.
With IAM, you can use your cloud account to create IAM users for your employees and assign permissions to the users to control their access to specific resource types. For example, some software developers in your enterprise may need to use DLI resources but should not delete them or perform any high-risk operations. To guarantee this result, you can create IAM users for the software developers and grant them only the permissions required for using DLI resources.
Roles: A type of coarse-grained authorization mechanism that defines permissions related to user responsibilities. This mechanism provides only a limited number of service-level roles for authorization. When using roles to grant permissions, you need to also assign other roles on which the permissions depend to take effect. However, roles are not an ideal choice for fine-grained authorization and secure access control.
Policies: A type of fine-grained authorization mechanism that defines permissions required to perform operations on specific cloud resources under certain conditions. This mechanism allows for more flexible policy-based authorization, meeting requirements for secure access control. For example, you can grant DLI users only the permissions for managing a certain type of ECSs.
IAM provides the following system-defined permissions for DLI:
- All permissions for DLI (system-defined policy)
- DLI read permissions (system-defined policy)
- DLI Service Admin (system-defined role)
In addition, DLI provides the following service-level permission types:
- Queue management permissions
- Data permissions. For details, see SQL Syntax of Batch Jobs > Data Permissions Management > Data Permissions List in the Data Lake Insight SQL Syntax Reference.
- Flink job permissions
- Package group permissions
- Datasource connection permissions. For details, see Permission-related APIs > Granting Users with the Data Usage Permission in the Data Lake Insight API Reference.
You can use SQL statements in the SQL job editor to run data queries. DLI supports SQL 2003 and is compatible with Spark SQL.
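To make the idea concrete, here is the kind of standard SQL (SELECT with GROUP BY and ORDER BY) you would enter in the SQL editor. The query is run against an in-memory SQLite database purely for illustration; SQLite stands in for DLI here, and the table and data are made up.

```python
import sqlite3

# Build a throwaway table so the standard-SQL query has something to run on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 80.0)],
)

# A plain SQL 2003-style aggregation query, as you might submit in a SQL job.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 80.0)]
```

The same statement text, unchanged, would be valid input for any engine that accepts standard SQL aggregation syntax.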
On the Overview page, click 'SQL Editor' in the navigation pane on the left, or click 'Create Job' in the upper right corner of the SQL Jobs pane. The SQL Editor page is displayed.
A message is displayed, indicating that a temporary DLI data bucket will be created. The created bucket is used to store temporary data generated by DLI, such as job logs. You cannot view job logs if you choose not to create the bucket. You can periodically delete objects in the bucket or transition them between storage classes. A default bucket name is provided.
SQL jobs allow you to execute SQL statements entered in the SQL Editor, import data, and export data.
SQL job management provides the following functions:
- Searching for jobs: Search for jobs that meet the search criteria.
- Viewing job details: Display job details.
- Terminating a job: Stop a job in the ‘Submitting’ or ‘Running’ status.
- Exporting query results: A maximum of 1000 records can be displayed in the query result on the console. To view more or all data, you can export the data to OBS.
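The export function above exists because the console caps the displayed result at 1,000 records. The sketch below illustrates that distinction with a hypothetical helper (`export_results` is not a DLI API; the CSV target stands in for an OBS object): the full result set is written out even though only the first 1,000 rows would ever appear on screen.

```python
import csv
import io

CONSOLE_LIMIT = 1000  # the console displays at most 1,000 records

def export_results(rows, fileobj):
    """Hypothetical helper: write the complete result set to CSV,
    the way exporting to OBS captures rows beyond the display cap."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "value"])  # header row
    for row in rows:
        writer.writerow(row)
    return len(rows)

rows = [(i, i * 2) for i in range(2500)]  # more rows than the console shows
displayed = rows[:CONSOLE_LIMIT]          # what you would see on screen
buf = io.StringIO()
exported = export_results(rows, buf)      # what the export captures
print(len(displayed), exported)  # 1000 2500
```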
Resources in Queue Management
Queues in DLI are computing resources, which are the basis for using DLI. All executed jobs require computing resources.
Currently, DLI provides two types of queues: for SQL and for general use. SQL queues are used to run SQL jobs. General-use queues are compatible with Spark queues of earlier versions and are used to run Spark and Flink jobs.
DLI database and table management provide the following functions:
- Database Permission Management
- Table Permission Management
- Creating a database or a table
- Deleting a database or a table
- Modifying the owners of databases and tables
- Importing data to the table
- Exporting data from DLI to OBS
- Viewing metadata
- Previewing data
To facilitate SQL operation execution, DLI allows you to customize query templates or save the SQL statements in use as templates. After a template is saved, you do not need to write the SQL statements again; you can perform the SQL operations directly from the template.
SQL templates include sample templates and custom templates. The default sample template contains 22 standard TPC-H query statements, which can meet most TPC-H test requirements.
SQL template management provides the following functions:
- Sample templates
- Custom templates
- Creating a template
- Executing the template
- Searching for a template
- Modifying a template
- Deleting a template
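The template workflow above can be sketched in a few lines. Everything here is an illustration of the idea, not DLI's template API: templates are kept in a plain dict, and parameter substitution uses Python string formatting.

```python
# In-memory template store (an assumption for illustration only).
templates = {}

def save_template(name, sql):
    """Save a SQL statement under a template name."""
    templates[name] = sql

def render_template(name, **params):
    """Fill a saved template's placeholders and return runnable SQL."""
    return templates[name].format(**params)

# Save once...
save_template(
    "daily_revenue",
    "SELECT order_date, SUM(amount) FROM orders "
    "WHERE order_date = '{day}' GROUP BY order_date",
)

# ...then reuse with different parameters, no rewriting needed.
sql = render_template("daily_revenue", day="2024-01-01")
print(sql)
```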
DLI supports the native Spark datasource capability and extends it. With DLI datasource connections, you can access other data storage services through SQL statements, Spark jobs, and Flink jobs, and import, query, analyze, and process data in those services.
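The following toy example shows what a cross-source query looks like in principle: two separate databases stand in for two external services, and one SQL statement joins across them without moving data. SQLite's `ATTACH` is used purely for demonstration; the `rds` alias and all table data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                  # "source A", e.g. OBS-backed data
conn.execute("ATTACH DATABASE ':memory:' AS rds")   # "source B", e.g. an RDS database

conn.execute("CREATE TABLE clicks (user_id INTEGER, ad TEXT)")
conn.execute("CREATE TABLE rds.users (user_id INTEGER, name TEXT)")
conn.execute("INSERT INTO clicks VALUES (1, 'banner')")
conn.execute("INSERT INTO rds.users VALUES (1, 'alice')")

# A single SQL statement queries both sources in place.
rows = conn.execute(
    "SELECT u.name, c.ad FROM clicks c "
    "JOIN rds.users u ON u.user_id = c.user_id"
).fetchall()
print(rows)  # [('alice', 'banner')]
```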
Global variables can be used to simplify complex parameters. For example, long, hard-to-read values can be replaced with short variable names to improve the readability of SQL statements.
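A minimal sketch of that idea (the mechanism, not DLI's actual implementation): a long value is defined once under a short name, then expanded inside SQL text wherever it is referenced. The `{name}` placeholder syntax and the OBS path are assumptions for illustration.

```python
# Define the long value once, under a short readable name.
global_vars = {
    "events_path": "obs://my-bucket/warehouse/2024/01/events/part-00000",
}

def expand(sql, variables):
    """Replace {name} placeholders in a SQL string with variable values."""
    for name, value in variables.items():
        sql = sql.replace("{" + name + "}", value)
    return sql

# The statement stays short and readable; expansion happens before execution.
sql = expand("SELECT * FROM source_table WHERE path = '{events_path}'", global_vars)
print(sql)
```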
Application data stored in relational databases needs analysis to derive more value. For example, big data from registration details helps with commercial decision-making.
- Complicated queries run poorly, if at all, on large relational databases.
- Comprehensive analysis is difficult because database and table partitions are spread across multiple relational databases. Business data analysis might overload the available resources.
- SQL experience transferability
Hit the ground running with new services. DLI supports standard ANSI SQL 2003 relational database syntax so there is almost no learning curve.
- Versatile, robust performance
Distributed in-memory computing models effortlessly handle complicated queries, cross-partition analysis, and business intelligence processing.
Cloud Data Migration (CDM)
Associative analysis combines information from multiple channels to improve conversion rates.
- Cross-source analysis
Advertisement CTR data stored in OBS and user registration data in RDS can be directly queried without migration to DLI.
- Only SQL needed
Interconnected data sources are mapped to a table created using nothing but SQL statements.
When multiple departments need to manage resources independently, fine-grained permissions management improves data security and operations efficiency.
- Easier permissions assignment
Grant permissions by column or by specific operation, such as INSERT INTO/OVERWRITE, and set metadata to read-only.
- Unified management
A single IAM account handles permissions for all staff users.
Genome analysis relies on third-party analysis libraries, which are built on the Spark distributed framework.
- High technical skills are required to install analysis libraries such as ADAM and Hail.
- Every time you create a cluster, you have to install these analysis libraries again.
- Custom images
Instead of installing libraries in a technically demanding process, package them into custom images uploaded directly to the Software Repository for Container (SWR). When using DLI to create a cluster, custom images in SWR are automatically pulled so you don't have to reinstall these libraries.
Real-time Risk Control
Almost every aspect of financial services requires comprehensive risk management and mitigation.
- There is very little tolerance for excessive latency when it comes to risk control.
- High throughput
Real-time data analysis in DLI with the help of an Apache Flink dataflow model keeps latency low. A single CPU processes 1,000 to 20,000 messages per second.
- Ecosystem coverage
Save real-time data streams to multiple cloud services such as CloudTable and SMN for comprehensive application.
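The low-latency dataflow idea behind real-time risk control can be sketched in plain Python (this is not Flink, and the 10-second window and alert threshold are arbitrary assumptions): events are grouped into tumbling time windows per account, and an unusually high count in one window flags a risk.

```python
from collections import defaultdict

WINDOW_SECONDS = 10   # tumbling window size (assumed)
ALERT_THRESHOLD = 3   # events per window that trigger an alert (assumed)

def window_counts(events):
    """events: iterable of (timestamp_seconds, account_id).
    Returns event counts keyed by (window_index, account_id)."""
    counts = defaultdict(int)
    for ts, account in events:
        window = ts // WINDOW_SECONDS
        counts[(window, account)] += 1
    return counts

# Account 'a' fires four events within one 10-second window.
events = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (15, "b")]
alerts = [k for k, v in window_counts(events).items() if v >= ALERT_THRESHOLD]
print(alerts)  # [(0, 'a')] — window 0, account 'a' exceeded the threshold
```

A real Flink job would express the same logic declaratively with keyed windows, processing events continuously rather than from a list.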
Big Data Analysis
Massive volumes of data include petabytes of satellite images and many data types, including structured remote sensing grid data, vector data, and unstructured spatial location data. The analysis and mining of all this data needs efficient tools.
- Spatial data analysis
Spark algorithm operators in DLI enable real-time stream processing and offline batch processing. They support massive volumes of many data types, including structured remote sensing image data, unstructured 3D models, and laser point clouds.
- CEP SQL functionality
SQL statements are all that is needed for yaw detection and geo-fencing.
- Heavy data processing
Quickly migrate up to exabytes of remote sensing images to the cloud, then slice them into data sources for distributed batch processing.
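To illustrate the geo-fencing use case mentioned above, here is a toy fence check in plain Python (not DLI's CEP SQL): any position outside a rectangular fence is flagged. The coordinates and fence bounds are made-up values.

```python
# Rectangular geo-fence expressed as lon/lat bounds (assumed values).
FENCE = {"min_lon": 116.0, "max_lon": 117.0, "min_lat": 39.5, "max_lat": 40.5}

def outside_fence(lon, lat, fence=FENCE):
    """Return True if the point lies outside the rectangular fence."""
    inside = (fence["min_lon"] <= lon <= fence["max_lon"]
              and fence["min_lat"] <= lat <= fence["max_lat"])
    return not inside

# A vehicle track: the last point has strayed beyond the fence.
track = [(116.3, 39.9), (116.8, 40.1), (118.2, 41.0)]
breaches = [p for p in track if outside_fence(*p)]
print(breaches)  # [(118.2, 41.0)]
```

A CEP SQL version would express the same condition as a pattern over the event stream, so a breach is detected the moment the offending position arrives.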