Develop Python UDFs Using The MaxCompute Client
1. Experiment
1.1 Knowledge points
This experiment uses Alibaba Cloud MaxCompute. It describes how to use the MaxCompute client to create and verify a user-defined function (UDF). The proliferation of data collection methods has led to explosive growth in industry data. Datasets now reach hundreds of terabytes, petabytes, or even exabytes, orders of magnitude beyond what traditional software can handle. MaxCompute is built to solve this problem. It specializes in storing large volumes of structured data and in analyzing and modeling big data.
1.2 Experiment process
- Install the MaxCompute client
- Develop a Python UDF
- Verify the UDF
1.3 Cloud resources required
1.4 Prerequisites
- If you use your own Alibaba Cloud account instead of the account provided by this lab, select Ubuntu 16.04 as the operating system of your ECS instance so that the steps work as written.
- Before starting the experiment, confirm that any previous experiment has been stopped and exited properly.
2. Start the experiment environment
Click Start Lab in the upper right corner of the page to start the experiment.
After the experiment environment is successfully started, the system has deployed resources required by this experiment in the background, including the ECS instance, RDS instance, Server Load Balancer instance, and OSS bucket. An account consisting of the username and password for logging on to the Web console of Alibaba Cloud is also provided.
After the experiment environment is started and related resources are properly deployed, the experiment starts a countdown. You have two hours to perform experimental operations. After the countdown ends, the experiment stops, and related resources are released. During the experiment, pay attention to the remaining time and arrange your time wisely. Next, use the username and password provided by the system to log on to the Web console of Alibaba Cloud and view related resources:
Go to the logon page of Alibaba Cloud console.
Fill in the sub-user account and click Next.
Fill in the sub-user password and click Log on.
After you successfully log on to the console, the following page is displayed.
3. Install the MaxCompute client
3.1 Create a DataWorks project
In the Alibaba Cloud console, choose DataWorks.
On the Workspace List page, choose US(Silicon Valley) and click Create Workspace.
Set the project name, select Basic Mode for Mode, and click Next.
Select the MaxCompute engine and click Next.
Set the Instance Display Name, and click Create Workspace.
The creation is complete.
After a while, the status will be displayed as Normal, and the creation is successful.
3.2 Create an AccessKey
As shown below, click AccessKey Management.
Click Create AccessKey. After the AccessKey has been created, the AccessKeyID and AccessKeySecret are displayed. The AccessKeySecret is displayed only once, so click Download CSV File to save it.
3.3 Install the MaxCompute client
Click Elastic Compute Service, as shown in the following figure.
We can see one running ECS instance in the US(Silicon Valley) region. Click it to go to the ECS console as shown in the following figure.
Copy this ECS instance’s Internet IP address and remotely log on to this ECS (Ubuntu system) instance. For details on remote login, refer to login.
The default account name and password of the ECS instance:
Account name: root
Password: nkYHG890..
After successful logon, run the following command to update the APT installation source:
apt update
Run the following command to download the installation package:
wget https://labex-ali-data.oss-us-west-1.aliyuncs.com/spark-analysis/jdk-8u181-linux-x64.tar.gz
Run the following command to extract the downloaded installation package into the /usr/local directory:
tar -zxf jdk-8u181-linux-x64.tar.gz -C /usr/local/
Run the vim /etc/profile command to open the file, then add the following lines to the end of it:
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin
Run the source /etc/profile command to make your changes take effect.
Run the following commands to create a client directory in your home directory and download the MaxCompute client package into it:
cd && mkdir client && cd client
wget https://labex-ali-data.oss-us-west-1.aliyuncs.com/tunnel/odpscmd_public.zip
Run the following command to install the unzip extraction tool:
apt update && apt -y install unzip
Run the following commands to extract the downloaded package and list the result:
unzip odpscmd_public.zip
ls
After the extraction is complete, you can see that there are four directories.
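Run vim conf/odps_config.ini to open the MaxCompute client configuration file. Set project_name to the name of the project created in section 3.1, set access_id and access_key to the AccessKey pair created in section 3.2, and set the endpoints as follows: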
end_point=http://service.us-west-1.maxcompute.aliyun-inc.com/api
tunnel_endpoint=http://dt.us-west-1.maxcompute.aliyun-inc.com
Save the file and exit.
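Before starting the client, it can help to confirm that no required entry was left empty. The following is a minimal sketch, assuming the configuration file consists of plain key=value lines and that project_name, access_id, and access_key are the entries holding the project and AccessKey values. Only end_point and tunnel_endpoint appear in the text of this lab; the helper names and the sample values are hypothetical.

```python
# Hypothetical helper: check a key=value config such as conf/odps_config.ini.
# The key names below are assumptions, except end_point, which this lab shows.
REQUIRED_KEYS = ("project_name", "access_id", "access_key", "end_point")


def parse_config(text):
    """Return a dict of key -> value, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """List the required keys that are absent or left empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


# Placeholder values for illustration; the endpoints are the ones used in this lab.
sample = """
project_name=my_project
access_id=
access_key=
end_point=http://service.us-west-1.maxcompute.aliyun-inc.com/api
tunnel_endpoint=http://dt.us-west-1.maxcompute.aliyun-inc.com
"""

print(missing_keys(parse_config(sample)))  # ['access_id', 'access_key']
```

To check the real file, read it with open("conf/odps_config.ini").read() and pass the text to parse_config.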
Run the following command to start the MaxCompute client:
bin/odpscmd --config=conf/odps_config.ini
After the MaxCompute client is started, run the following command to check whether the client has been connected to the MaxCompute project:
whoami;
As shown in the preceding figure, the connection is successful.
4. Develop a Python UDF
Next we will create a UDF in Python and use it to look up the geolocations of specific IP addresses.
First, create a table named ipresource to store IP ranges and their corresponding geolocations.
Then, create a function to look up the IP range of a certain IP address in the ipresource table, in order to retrieve its corresponding geolocation.
Finally, upload the function to MaxCompute and use an SQL statement to call it.
4.1 Create the ipresource table
Run the following command to create a table named ipresource:
CREATE TABLE IF NOT EXISTS ipresource
(
    start_ip BIGINT,
    end_ip BIGINT,
    start_ip_arg STRING,
    end_ip_arg STRING,
    country STRING,
    area STRING,
    city STRING,
    county STRING,
    isp STRING
);
Open a new Linux CLI and run vim DataIP.txt to create a file named DataIP.txt. Copy and paste the following content into the file. Save the file and exit.
16834560,16834623,"1.0.224.0","1.0.224.63","Thailand","Phuket","","","TOT"
16834624,16834815,"1.0.224.64","1.0.224.255","Thailand","","","","TOT"
16834816,16835071,"1.0.225.0","1.0.225.255","Thailand","Phuket","","","TOT"
16835072,16835199,"1.0.226.0","1.0.226.127","Thailand","","","","TOT"
16835200,16841471,"1.0.226.128","1.0.250.255","Thailand","Phuket","","","TOT"
16841472,16841599,"1.0.251.0","1.0.251.127","Thailand","","","","TOT"
16841600,16841663,"1.0.251.128","1.0.251.191","Thailand","Phuket","","","TOT"
16841664,16841727,"1.0.251.192","1.0.251.255","Thailand","","","","TOT"
16841728,16842239,"1.0.252.0","1.0.253.255","Thailand","Phuket","","","TOT"
16842240,16842271,"1.0.254.0","1.0.254.31","Thailand","Song Kafu","","","TOT"
16842272,16842303,"1.0.254.32","1.0.254.63","Thailand","Narathiwat","","","TOT"
16842304,16842367,"1.0.254.64","1.0.254.127","Thailand","Phuket","","","TOT"
16842368,16842383,"1.0.254.128","1.0.254.143","Thailand","Song Kafu","","","TOT"
16842384,16842399,"1.0.254.144","1.0.254.159","Thailand","Phuket","","","TOT"
16842400,16842431,"1.0.254.160","1.0.254.191","Thailand","Song Kafu","","","TOT"
16842432,16842623,"1.0.254.192","1.0.255.127","Thailand","Phuket","","","TOT"
16842624,16842751,"1.0.255.128","1.0.255.255","Thailand","Song Kafu","","","TOT"
16842752,16843007,"1.1.0.0","1.1.0.255","China","Fujian","Fuzhou","","telecom"
16843008,16843263,"1.1.1.0","1.1.1.255","Australia","Sydney","Sydney","","TOT"
16843264,16844799,"1.1.2.0","1.1.7.255","China","Fujian","Fuzhou","","telecom"
16844800,16845055,"1.1.8.0","1.1.8.255","China","Guangdong","Zhuhai","","telecom"
16845056,16859135,"1.1.9.0","1.1.63.255","China","Guangdong","Guangzhou","","telecom"
16910592,16941055,"1.2.9.0","1.2.127.255","China","Guangdong","Guangzhou","","telecom"
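Each row stores a range twice: as integers (start_ip, end_ip) and as dotted-quad strings (start_ip_arg, end_ip_arg). A quick local check that the two forms agree can catch copy-and-paste damage before the upload. The following is a minimal sketch, assuming Python 3 is available; it uses the standard ipaddress module as an independent reference, and the function name is hypothetical.

```python
import csv
import io
import ipaddress


def inconsistent_rows(lines):
    """Yield rows whose integer bounds disagree with their dotted-quad bounds."""
    for row in csv.reader(lines):
        start, end, start_arg, end_arg = row[0], row[1], row[2], row[3]
        if (int(start) != int(ipaddress.IPv4Address(start_arg))
                or int(end) != int(ipaddress.IPv4Address(end_arg))):
            yield row


# Two rows copied from DataIP.txt; a real check would pass an open file instead.
sample = io.StringIO(
    '16834560,16834623,"1.0.224.0","1.0.224.63","Thailand","Phuket","","","TOT"\n'
    '16842752,16843007,"1.1.0.0","1.1.0.255","China","Fujian","Fuzhou","","telecom"\n'
)
bad = list(inconsistent_rows(sample))
print(bad)  # [] when every row is consistent
```

To check the whole file, open DataIP.txt and pass the file object to inconsistent_rows in place of the sample.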
Return to the MaxCompute CLI and run the following command to populate the ipresource table with the content of the DataIP.txt file:
tunnel upload /root/DataIP.txt ipresource;
Run the following command to view the content of the ipresource table to make sure that it has been successfully populated:
select * from ipresource limit 10;
4.2 Create a UDF
Return to the Linux CLI and run vim SelectIP.py to create a Python file named SelectIP.py. Copy and paste the following content into the SelectIP.py file. Save the file and exit.
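from functools import reduce  # builtin on Python 2, must be imported on Python 3

from odps.udf import annotate


@annotate("string->bigint")
class SelectIP(object):
    def evaluate(self, ip):
        try:
            # Fold the octets into one integer: each step shifts left 8 bits and adds the next octet
            return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))
        except Exception:
            # NULL or malformed input maps to 0
            return 0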
Return to the MaxCompute CLI and run the following command to add the Python script you just created:
add py /root/SelectIP.py;
Run the following command to create a UDF named SelectIP using the Python script you just added:
create function SelectIP as SelectIP.SelectIP using SelectIP.py;
5. Verify the UDF
Run the following command to call the SelectIP UDF you just created:
select SelectIP('1.2.24.2');
Output similar to that shown in the preceding screenshot appears. This indicates that the command is successfully executed.
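The value returned by the query can be cross-checked locally. The following is a minimal sketch of the same conversion as a plain function, assuming a local Python interpreter; select_ip is a hypothetical stand-in for SelectIP.evaluate and needs no odps module.

```python
from functools import reduce


def select_ip(ip):
    """Local stand-in for SelectIP.evaluate (no odps dependency)."""
    try:
        return reduce(lambda x, y: (x << 8) + y, map(int, ip.split(".")))
    except Exception:
        return 0


# 1.2.24.2 -> 1*2**24 + 2*2**16 + 24*2**8 + 2
print(select_ip("1.2.24.2"))   # 16914434
print(select_ip("not-an-ip"))  # 0 (malformed input falls back to 0)
print(select_ip(None))         # 0 (NULL input falls back to 0)
```

Note that this folding does not check the number of octets: an input such as "1.2" yields 258 rather than 0.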
Use the following command to look up the geolocation of 1.2.24.2:
select * from ipresource where SelectIP('1.2.24.2') >= start_ip and SelectIP('1.2.24.2') <= end_ip;
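The WHERE clause keeps the rows whose [start_ip, end_ip] range contains the integer form of the address. The same logic can be sketched locally as follows, assuming Python 3; the lookup function is hypothetical and uses the standard ipaddress module in place of the UDF.

```python
import csv
import ipaddress

# Two rows copied from DataIP.txt in this lab.
ROWS = [
    '16845056,16859135,"1.1.9.0","1.1.63.255","China","Guangdong","Guangzhou","","telecom"',
    '16910592,16941055,"1.2.9.0","1.2.127.255","China","Guangdong","Guangzhou","","telecom"',
]
COLUMNS = ["start_ip", "end_ip", "start_ip_arg", "end_ip_arg",
           "country", "area", "city", "county", "isp"]


def lookup(ip, lines):
    """Return the first row whose [start_ip, end_ip] range contains ip, else None."""
    target = int(ipaddress.IPv4Address(ip))
    for values in csv.reader(lines):
        row = dict(zip(COLUMNS, values))
        if int(row["start_ip"]) <= target <= int(row["end_ip"]):
            return row
    return None


match = lookup("1.2.24.2", ROWS)
print(match["country"], match["area"], match["city"])  # China Guangdong Guangzhou
```

With the sample data in this lab, 1.2.24.2 (16914434) falls in the last row, 16910592 to 16941055, so the query is expected to return that row.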
Reminder:
Before you leave this lab, log out of your Alibaba Cloud RAM account before you click the ‘stop’ button of the lab. Otherwise you may encounter issues when opening a new lab session in the same browser.
6. Experiment summary
This experiment describes how to use the MaxCompute client to create and verify a UDF. MaxCompute is currently available in 16 countries and regions around the world, providing customers in finance, technology, bio-medical, energy, transportation, media, and other industries with big data storage and computing services.