Ray Lindsay, Senior Data Miner, Australian Taxation Office

Using ensembles of models to predict revenue for unlodged tax returns

Approximating curves by many piecewise constants

We will describe the statistical challenges that were overcome in one of the first deployments of data mining in the ATO. The model aimed to predict the revenue likely to be raised from a large number of outstanding returns. The challenges included a significant proportion of missing values, an initial attempt to model a step function, and issues with storing and processing the very large datasets required for scoring.

Because of the missing values, regression or neural network models could assign many records the default prediction (the mean for interval variables, the most common level for nominal variables). Tree models are more robust to missing values but produce a relatively small number of distinct predicted values. When making predictions for millions of records, one needs some way of distinguishing among the top 1% (say) of predictions. Combining the first-stage model (a classification tree) with the two second-stage models (regression trees) introduces some granularity, because records that share a node in one model do not necessarily share a node in the other two. Much finer granularity in revenue prediction can be achieved by employing a relatively small number of tree models, where the features input into each stream are randomly selected and a random sample of the training data is used, essentially a form of bagging. This approach is somewhat similar in philosophy to RandomForests, which could not be used for this problem because of technical issues and incompatibility with large numbers of missing values.
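The granularity argument can be illustrated with a minimal sketch. This is not the ATO model: it assumes synthetic data, one-split regression trees standing in for full trees, and a separate leaf for records with a missing value, which is only one of several ways a tree can handle them. All function names and parameters are hypothetical. The sketch compares the number of distinct predictions from a single tree with the number from an average of several trees, each fitted on a bootstrap sample and a random subset of features.

```python
import random

def avg(xs):
    return sum(xs) / len(xs)

def sse(xs, m):
    return sum((x - m) ** 2 for x in xs)

def fit_stump(rows, y, features):
    """Fit a one-split regression tree on the best of the given features.
    Assumption: records missing the split feature (None) get their own leaf."""
    best = None
    for f in features:
        present = [(r[f], t) for r, t in zip(rows, y) if r[f] is not None]
        missing = [t for r, t in zip(rows, y) if r[f] is None]
        mm = avg(missing) if missing else avg(y)
        values = sorted({v for v, _ in present})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for v, t in present if v <= thr]
            right = [t for v, t in present if v > thr]
            lm, rm = avg(left), avg(right)
            err = sse(left, lm) + sse(right, rm) + sse(missing, mm)
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm, mm)
    if best is None:  # no feature offered a usable split: constant prediction
        m = avg(y)
        return (features[0], 0.0, m, m, m)
    return best[1:]

def predict_stump(stump, row):
    f, thr, lm, rm, mm = stump
    v = row[f]
    if v is None:
        return mm
    return lm if v <= thr else rm

def fit_ensemble(rows, y, n_trees, n_feats, rng):
    """Bagging-style ensemble: bootstrap rows, random feature subset per tree."""
    n, p = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        feats = rng.sample(range(p), n_feats)        # random feature subset
        stumps.append(fit_stump([rows[i] for i in idx], [y[i] for i in idx], feats))
    return stumps

def predict_ensemble(stumps, row):
    return avg([predict_stump(s, row) for s in stumps])

# Synthetic data: 6 integer features, about 30% of values missing.
rng = random.Random(0)
P = 6
rows = [[None if rng.random() < 0.3 else rng.randint(0, 9) for _ in range(P)]
        for _ in range(500)]
y = [sum(v for v in r if v is not None) + rng.gauss(0, 1) for r in rows]

single = fit_stump(rows, y, list(range(P)))
ensemble = fit_ensemble(rows, y, n_trees=10, n_feats=2, rng=rng)

single_preds = {round(predict_stump(single, r), 6) for r in rows}
ens_preds = {round(predict_ensemble(ensemble, r), 6) for r in rows}
print("distinct predictions, single tree:", len(single_preds))
print("distinct predictions, ensemble:   ", len(ens_preds))
```

A single one-split tree can produce at most three distinct predictions here (left leaf, right leaf, missing leaf). Because the trees in the ensemble partition the records differently, their average can take many more distinct values, which is what makes it possible to rank records within the top of the distribution.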

  Last reviewed: 1 April, 2009 
 
University of Wollongong
Wollongong NSW 2522 Australia