Develop Python UDFs Using The MaxCompute Client

1. Experiment

1.1 Knowledge points

This experiment uses Alibaba Cloud MaxCompute and describes how to use the MaxCompute client to create and verify a user-defined function (UDF). The abundance of data collection methods has led to explosive growth in industry data. Big data now reaches hundreds of TB, PB, or even EB, orders of magnitude beyond what traditional software can handle. MaxCompute is built to solve this problem: it specializes in storing large volumes of structured data and in analyzing and modeling big data.

1.2 Experiment process

  • Install the MaxCompute client
  • Develop a Python UDF
  • Verify the UDF

1.3 Cloud resources required

  • ECS
  • MaxCompute

1.4 Prerequisites

  • If you use your own Alibaba Cloud account instead of the account provided by this lab, choose the same operating system, Ubuntu 16.04, for your ECS instance so that the experiment runs smoothly.
  • Before starting the experiment, confirm that the previous experiment has been closed and exited normally.

2. Start the experiment environment

Click Start Lab in the upper right corner of the page to start the experiment.

image desc

After the experiment environment starts, the system deploys the resources required by this experiment in the background, including the ECS instance, RDS instance, Server Load Balancer instance, and OSS bucket. It also provides a username and password for logging on to the Web console of Alibaba Cloud.

image desc

After the experiment environment is started and related resources are properly deployed, the experiment starts a countdown. You have two hours to perform experimental operations. After the countdown ends, the experiment stops, and related resources are released. During the experiment, pay attention to the remaining time and arrange your time wisely. Next, use the username and password provided by the system to log on to the Web console of Alibaba Cloud and view related resources:

Go to the logon page of Alibaba Cloud console.

image desc

Fill in the sub-user account and click Next.

image desc

Fill in the sub-user password and click Log on.

image desc

After you successfully log on to the console, the following page is displayed.

image desc

3. Install the MaxCompute client

3.1 Create a DataWorks project

In the Alibaba Cloud console, choose DataWorks.

image desc

On the Workspace List page, choose US(Silicon Valley) and click Create Workspace.

image desc

Set the project name, select Basic Mode for Mode, and click Next.

image desc

Select the MaxCompute engine and click Next.

image desc

Set the Instance Display Name, and click Create Workspace.

image desc

The creation is complete.

image desc

After a while, the status will be displayed as Normal, and the creation is successful.

image desc

3.2 Create an AccessKey

As shown below, click AccessKey Management.

image desc

Click Create AccessKey. After the AccessKey is created, the AccessKeyID and AccessKeySecret are displayed. The AccessKeySecret is displayed only once, so click Download CSV File to save it.

image desc

3.3 Install the MaxCompute client

Click Elastic Compute Service, as shown in the following figure.

image desc

We can see one running ECS instance in the US(Silicon Valley) region. Click it to go to the ECS console as shown in the following figure.

image desc

Copy this ECS instance’s Internet IP address and remotely log on to this ECS (Ubuntu system) instance. For details of remote login, refer to login

image desc

The default account name and password of the ECS instance:

Account name: root

Password: nkYHG890..

After successful logon, run the following command to update the APT installation source:

apt update

image desc

Run the following command to download the JDK installation package:

wget https://labex-ali-data.oss-us-west-1.aliyuncs.com/spark-analysis/jdk-8u181-linux-x64.tar.gz

image desc

Run the following command to extract the downloaded installation package into the /usr/local directory:

tar -zxf jdk-8u181-linux-x64.tar.gz -C /usr/local/

image desc

Run the vim /etc/profile command to open this file and then add the following code to the end of this file:

export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin

image desc

Run the source /etc/profile command to make your changes take effect:

image desc

Run the following commands to create a client directory and download the MaxCompute client package into it:

cd && mkdir client && cd client 

wget https://labex-ali-data.oss-us-west-1.aliyuncs.com/tunnel/odpscmd_public.zip

image desc

Run the following command to install the unzip extraction tool:

apt update && apt -y install unzip

image desc

Run the following commands to extract the downloaded package and list the extracted files:

unzip odpscmd_public.zip

ls

image desc

After the extraction is complete, you can see that there are four directories.

3.4 Configure the MaxCompute client

Run vim conf/odps_config.ini to open the MaxCompute client configuration file. Modify the configuration as shown in the following figure: fill in the project name and the AccessKey created in section 3.2, and set the endpoints as follows.

end_point=http://service.us-west-1.maxcompute.aliyun-inc.com/api
tunnel_endpoint=http://dt.us-west-1.maxcompute.aliyun-inc.com

image desc

Save the file and exit.
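
Before starting the client, it can help to confirm that the configuration file is filled in. The following Python sketch is an illustration, not part of the client: it assumes the file consists of plain key=value lines and that the keys project_name, access_id, access_key, and end_point must be non-empty. The function names and the list of required keys are assumptions, so adjust them to match the template shipped in your conf directory.

```python
import os

# Keys assumed to be required; adjust to match your odps_config.ini template.
REQUIRED_KEYS = ("project_name", "access_id", "access_key", "end_point")


def parse_config(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


if __name__ == "__main__":
    path = "conf/odps_config.ini"  # relative to the client directory
    if os.path.exists(path):
        with open(path) as handle:
            settings = parse_config(handle.read())
        print("Missing or empty keys:", missing_keys(settings))
```

Run it with python3 from the client directory. An empty list suggests that the basic settings are present; it does not verify that the AccessKey or endpoints are valid.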

Run the following command to start the MaxCompute client:

bin/odpscmd --config=conf/odps_config.ini

image desc

After the MaxCompute client is started, run the following command to check whether the client has been connected to the MaxCompute project:

whoami;

image desc

As shown in the preceding figure, the connection is successful.

4. Develop a Python UDF

Next we will create a UDF in Python and use it to look up the geolocations of specific IP addresses.

First, create a table named ipresource to store IP ranges and their corresponding geolocations.

Then, create a function that converts a dotted-decimal IP address into an integer, so that the address can be compared against the IP ranges in the ipresource table to retrieve its corresponding geolocation.

Finally, upload the function to MaxCompute and use an SQL statement to call it.

4.1 Create the ipresource table

Run the following command to create a table named ipresource:

CREATE TABLE IF NOT EXISTS ipresource 
(
    start_ip BIGINT,
    end_ip BIGINT,
    start_ip_arg STRING,
    end_ip_arg STRING,
    country STRING,
    area STRING,
    city STRING,
    county STRING,
    isp STRING
);

image desc

Open a new Linux CLI and run vim DataIP.txt to create a file named DataIP.txt. Copy and paste the following content to the file. Save the file and exit.

16834560,16834623,"1.0.224.0","1.0.224.63","Thailand","Phuket","","","TOT"
16834624,16834815,"1.0.224.64","1.0.224.255","Thailand","","","","TOT"
16834816,16835071,"1.0.225.0","1.0.225.255","Thailand","Phuket","","","TOT"
16835072,16835199,"1.0.226.0","1.0.226.127","Thailand","","","","TOT"
16835200,16841471,"1.0.226.128","1.0.250.255","Thailand","Phuket","","","TOT"
16841472,16841599,"1.0.251.0","1.0.251.127","Thailand","","","","TOT"
16841600,16841663,"1.0.251.128","1.0.251.191","Thailand","Phuket","","","TOT"
16841664,16841727,"1.0.251.192","1.0.251.255","Thailand","","","","TOT"
16841728,16842239,"1.0.252.0","1.0.253.255","Thailand","Phuket","","","TOT"
16842240,16842271,"1.0.254.0","1.0.254.31","Thailand","Song Kafu","","","TOT"
16842272,16842303,"1.0.254.32","1.0.254.63","Thailand","Narathiwat","","","TOT"
16842304,16842367,"1.0.254.64","1.0.254.127","Thailand","Phuket","","","TOT"
16842368,16842383,"1.0.254.128","1.0.254.143","Thailand","Song Kafu","","","TOT"
16842384,16842399,"1.0.254.144","1.0.254.159","Thailand","Phuket","","","TOT"
16842400,16842431,"1.0.254.160","1.0.254.191","Thailand","Song Kafu","","","TOT"
16842432,16842623,"1.0.254.192","1.0.255.127","Thailand","Phuket","","","TOT"
16842624,16842751,"1.0.255.128","1.0.255.255","Thailand","Song Kafu","","","TOT"
16842752,16843007,"1.1.0.0","1.1.0.255","China","Fujian","Fuzhou","","telecom"
16843008,16843263,"1.1.1.0","1.1.1.255","Australia","Sydney","Sydney","","TOT"
16843264,16844799,"1.1.2.0","1.1.7.255","China","Fujian","Fuzhou","","telecom"
16844800,16845055,"1.1.8.0","1.1.8.255","China","Guangdong","Zhuhai","","telecom"
16845056,16859135,"1.1.9.0","1.1.63.255","China","Guangdong","Guangzhou","","telecom"
16910592,16941055,"1.2.9.0","1.2.127.255","China","Guangdong","Guangzhou","","telecom"

image desc
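
Before uploading, the file can be checked locally. In each row, the first two columns are the integer forms of the dotted-decimal addresses in the third and fourth columns. The sketch below is an optional check run with python3 on the ECS instance, not a step of the lab. It uses the standard csv and ipaddress modules, and the function name check_rows is hypothetical.

```python
import csv
import io
import ipaddress


def check_rows(text):
    """Return (line_number, reason) pairs for rows that look inconsistent."""
    problems = []
    for number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 9:
            problems.append((number, "expected 9 fields, got %d" % len(row)))
            continue
        start_ip, end_ip, start_arg, end_arg = row[:4]
        try:
            start_ok = int(ipaddress.IPv4Address(start_arg)) == int(start_ip)
            end_ok = int(ipaddress.IPv4Address(end_arg)) == int(end_ip)
        except ValueError:
            problems.append((number, "unparsable value"))
            continue
        if not start_ok or not end_ok:
            problems.append((number, "integer and dotted forms disagree"))
        elif int(start_ip) > int(end_ip):
            problems.append((number, "start_ip is greater than end_ip"))
    return problems


sample = (
    '16834560,16834623,"1.0.224.0","1.0.224.63","Thailand","Phuket","","","TOT"\n'
    '16842752,16843007,"1.1.0.0","1.1.0.255","China","Fujian","Fuzhou","","telecom"\n'
)
print(check_rows(sample))  # an empty list means no problems were found
```

To check the whole file, pass open("/root/DataIP.txt").read() to check_rows instead of the two sample rows.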

Return to the MaxCompute CLI and run the following command to populate the ipresource table with the content of the DataIP.txt file:

tunnel upload /root/DataIP.txt ipresource;

image desc

Run the following command to view the content of the ipresource table to make sure that it has been successfully populated:

select * from ipresource limit 10;

image desc

4.2 Create a UDF

Return to the Linux CLI and run vim SelectIP.py to create a Python file named SelectIP.py. Copy and paste the following content to the SelectIP.py file. Save the file and exit.

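from functools import reduce  # reduce is a builtin only on Python 2

from odps.udf import annotate


@annotate("string->bigint")
class SelectIP(object):
    def evaluate(self, ip):
        # Convert a dotted-decimal IPv4 string to its integer value.
        try:
            return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))
        except Exception:
            # Return 0 for NULL or malformed input.
            return 0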

image desc
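
The conversion logic can be tried locally before the script is uploaded. The sketch below repeats the same expression in a plain function, so it runs without the odps package. The name ip_to_int is hypothetical and is not part of the lab files.

```python
from functools import reduce


def ip_to_int(ip):
    """Same conversion expression as SelectIP.evaluate, without the odps dependency."""
    try:
        return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))
    except Exception:
        return 0


# Each octet shifts the running total left by 8 bits before being added.
assert ip_to_int('1.2.24.2') == (1 << 24) + (2 << 16) + (24 << 8) + 2
assert ip_to_int('1.0.224.0') == 16834560   # first start_ip in DataIP.txt
assert ip_to_int(None) == 0                 # NULL input falls back to 0
assert ip_to_int('not.an.ip') == 0          # malformed input falls back to 0
print(ip_to_int('1.2.24.2'))                # 16914434
```

Note that the expression does not check that there are exactly four octets or that each octet is between 0 and 255. For example, '1.2.3' converts to 66051 rather than 0.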

Return to the MaxCompute CLI and run the following command to add the Python script you just created:

add py /root/SelectIP.py;

image desc

Run the following command to create a UDF named SelectIP using the Python script you just added:

create function SelectIP as SelectIP.SelectIP using SelectIP.py;

image desc

5. Verify the UDF

Run the following command to call the SelectIP UDF you just created:

select SelectIP('1.2.24.2');

image desc

Output similar to that shown in the preceding screenshot appears. This indicates that the command is successfully executed.

Use the following command to look up the geolocation of 1.2.24.2.

select * from ipresource where SelectIP('1.2.24.2') >= start_ip and SelectIP('1.2.24.2') <= end_ip;

image desc
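
The WHERE clause in this query is an inclusive range test on the integer form of the address. As a rough local illustration, assuming a few rows copied from DataIP.txt and a hypothetical helper named lookup, the same test can be sketched in Python:

```python
# A few (start_ip, end_ip, country, area, city) rows taken from DataIP.txt.
ROWS = [
    (16842752, 16843007, "China", "Fujian", "Fuzhou"),
    (16843008, 16843263, "Australia", "Sydney", "Sydney"),
    (16845056, 16859135, "China", "Guangdong", "Guangzhou"),
    (16910592, 16941055, "China", "Guangdong", "Guangzhou"),
]


def lookup(ip_int, rows):
    """Return every row whose inclusive [start_ip, end_ip] range contains ip_int."""
    return [row for row in rows if row[0] <= ip_int <= row[1]]


ip_int = (1 << 24) + (2 << 16) + (24 << 8) + 2   # integer form of 1.2.24.2
print(lookup(ip_int, ROWS))
# [(16910592, 16941055, 'China', 'Guangdong', 'Guangzhou')]
```

An address that falls between two ranges, or a malformed address that the UDF converts to 0, matches no row in this data, so the query returns an empty result in those cases.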

Reminder:
Before you leave this lab, remember to log out of your Alibaba RAM account before you click the ‘stop’ button of your lab. Otherwise, you may encounter issues when opening a new lab session in the same browser:

image desc

image desc

6. Experiment summary

This experiment describes how to use the MaxCompute client to create and verify a UDF. MaxCompute is currently available in 16 countries and regions around the world, providing customers in finance, technology, bio-medical, energy, transportation, media, and other industries with big data storage and computing services.