Data Management Platform
Check out the source code and run the unit tests:
mvn test
Integration tests run under the pig profile and require Pig and Couchbase to be installed on the local machine:
mvn verify -P pig
Build the package using the pig profile:
mvn package -P pig
The dist directory will contain the following files:
- log_analysis.pig: Pig script implementing the daily analytics
- log_analysis.properties.example: parameters required by the log_analysis.pig script
- model_training.pig: Pig script for training a category classifier
- model_training.properties.example: parameters required by the model_training.pig script
- pig-udfs.jar: required Pig UDFs and their dependencies
- user_profiles.pig: Pig script for building user profiles
- user_profiles.properties.example: parameters required by the user_profiles.pig script
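The commands in the rest of this document read files named model_training.properties, log_analysis.properties and user_profiles.properties, while the dist directory only ships the .example versions. One way to create the working copies is sketched below in bash; the function name init_properties is hypothetical, and the sketch assumes that each properties file is meant to start as a copy of its example.

```shell
#!/usr/bin/env bash
# Copy each *.properties.example in a dist directory to *.properties,
# leaving alone any properties file that already exists.
init_properties() {
  local dist_dir="$1" example target
  for example in "$dist_dir"/*.properties.example; do
    [ -e "$example" ] || continue   # glob matched nothing
    target="${example%.example}"    # strip the .example suffix
    if [ ! -e "$target" ]; then
      cp "$example" "$target"
      echo "created $target"
    fi
  done
}
```

Calling init_properties dist would create the three properties files on the first run and do nothing on later runs, so edited values are not overwritten.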
The DMP is composed of several jobs that need to be properly scheduled and run.
The classifier needs to be trained using some SCD files. Adjust the properties file and then run:
pig -f model_training.pig -m model_training.properties
This will produce a model file.
Please refer to the model_training.properties.example for details on the required properties.
Daily logs are processed and user profiles extracted. Adjust the properties file and then run:
pig -f log_analysis.pig -m log_analysis.properties
This will produce a file with daily user profiles.
Please refer to the log_analysis.properties.example for details on the required properties.
We suggest keeping invariant parameters in the properties file and passing the parameters that vary between runs on the command line:
pig -f log_analysis.pig -m log_analysis.properties -p input=.. -p output_dir=.. -p model_file=..
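The split between fixed and per-run parameters could be wrapped in a small bash function such as the sketch below. The parameter names input, output_dir and model_file come from the command above; the directory layout (one subdirectory per day under LOG_ROOT and PROFILE_ROOT), the default paths and the PIG_BIN variable are assumptions made only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the paths below are hypothetical defaults, not part of the DMP.
LOG_ROOT="${LOG_ROOT:-/data/logs}"              # assumed: one directory per day
PROFILE_ROOT="${PROFILE_ROOT:-/data/profiles}"  # assumed: one directory per day
MODEL_FILE="${MODEL_FILE:-/data/model/category.model}"

# Run the daily log analysis for one day (YYYY-MM-DD).
# Invariant parameters stay in log_analysis.properties; the per-day
# ones are passed with -p.
run_log_analysis() {
  local day="$1"
  "${PIG_BIN:-pig}" -f log_analysis.pig -m log_analysis.properties \
    -p "input=$LOG_ROOT/$day" \
    -p "output_dir=$PROFILE_ROOT/$day" \
    -p "model_file=$MODEL_FILE"
}
```

For example, run_log_analysis 2014-03-28 would process that day's logs. Setting PIG_BIN=echo prints the arguments instead of launching Pig, which can serve as a dry run.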
User profiles can be computed over a time interval. Typically one schedules two or more executions of this job: one for daily metrics and one or more for weekly, monthly, or other longer-period metrics. Adjust the properties file and then run:
pig -f user_profiles.pig -m user_profiles.properties
User profiles will be stored in Couchbase.
Please refer to the user_profiles.properties.example for details on the required properties.
As with the log analysis, keep invariant parameters in the properties file and pass the varying parameters on the command line.
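Since the jobs have to run in a sensible order, a daily schedule could chain them as in the bash sketch below. It assumes, beyond what is stated in this document, that user_profiles.pig reads the output of log_analysis.pig and should therefore run only when the log analysis succeeded. PIG_BIN and the function name run_daily_jobs are hypothetical.

```shell
#!/usr/bin/env bash
# Run the daily jobs in order and stop at the first failure.
# Assumption: user_profiles.pig depends on the output of log_analysis.pig.
run_daily_jobs() {
  local pig="${PIG_BIN:-pig}"
  "$pig" -f log_analysis.pig -m log_analysis.properties || return 1
  "$pig" -f user_profiles.pig -m user_profiles.properties || return 1
}
```

A scheduler such as cron could call run_daily_jobs once per day; the exit status is non-zero if either job fails.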
Ensure that the Pig directory contains:
- the file (or a symlink to) piggybank.jar
- the directory lib containing the Avro dependencies listed in the log_analysis.pig script:
  - avro-*.jar
  - jackson-core-asl-*.jar
  - jackson-mapper-asl-*.jar
  - json-simple-*.jar
  - snappy-java-*.jar
Ensure that lib is in the same directory as piggybank.jar.
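The layout described above can be checked before launching a job. The bash sketch below is one possible check, not part of the DMP; the function name check_pig_dir is hypothetical, and the jar patterns are the ones listed above.

```shell
#!/usr/bin/env bash
# Verify that a Pig directory holds piggybank.jar (file or symlink)
# and a lib/ directory with the required jars next to it.
# Prints one line per missing item; returns non-zero if any is missing.
check_pig_dir() {
  local pig_dir="$1" missing=0 pattern candidate found
  if [ ! -e "$pig_dir/piggybank.jar" ]; then
    echo "missing: $pig_dir/piggybank.jar"
    missing=1
  fi
  for pattern in 'avro-*.jar' 'jackson-core-asl-*.jar' \
                 'jackson-mapper-asl-*.jar' 'json-simple-*.jar' \
                 'snappy-java-*.jar'; do
    found=0
    # $pattern is left unquoted on purpose so that the glob expands
    for candidate in "$pig_dir"/lib/$pattern; do
      if [ -e "$candidate" ]; then found=1; break; fi
    done
    if [ "$found" -eq 0 ]; then
      echo "missing: $pig_dir/lib/$pattern"
      missing=1
    fi
  done
  return "$missing"
}
```

For instance, check_pig_dir /path/to/pig (a placeholder path) prints nothing and returns 0 when everything is in place.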
User profiles are stored in Couchbase in JSON format and are accessed by composite keys made of the user uuid and the date.
Here is a sample document with key 0c6d8636f8be87e657f9b14ed07b54de::2014-03-28:
{
    "uuid": "0c6d8636f8be87e657f9b14ed07b54de",
    "date": "2014-03-28",
    "period": 1,
    "page_categories": {
        "图书音像": 0.24210526315789474,
        "服装服饰": 0.6421052631578947,
        "玩乐爱好": 0.07368421052631578,
        "电脑办公": 0.03684210526315789,
        "鞋包配饰": 0.005263157894736842
    },
    "product_categories": {
        "服装服饰": 0.875,
        "运动户外": 0.125
    },
    "product_price": {
        "0 - 20": 0.020833333333333332,
        ......
        "400 - 420": 0.020833333333333332,
        "440 - 460": 0.020833333333333332
    },
    "product_source": {
        "1号店官网": 0.041666666666666664,
        "V+": 0.041666666666666664,
        "京东商城": 0.0625,
        "卓越亚马逊": 0.0625,
        "天猫": 0.3125,
        "当当网": 0.14583333333333334,
        "淘宝网": 0.3333333333333333
    }
}
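Going by the sample key above, a document key is the user uuid, the separator ::, and the date. The bash sketch below builds and splits such keys; the function names are hypothetical, and the format is inferred only from the sample key.

```shell
#!/usr/bin/env bash
# Build the document key for a user profile: <uuid>::<date>.
profile_key() {
  local uuid="$1" day="$2"
  printf '%s::%s\n' "$uuid" "$day"
}

# Split a key back into its parts; prints "<uuid> <date>".
split_profile_key() {
  local key="$1"
  printf '%s %s\n' "${key%%::*}" "${key##*::}"
}
```

For example, profile_key 0c6d8636f8be87e657f9b14ed07b54de 2014-03-28 prints the key of the sample document.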