20 short-answer questions, 2-3 sentences each
4 long-answer questions, 4-5 sentences each
Time limit is 80 min

Week 6:
Issues with Data Science

Data Management Maturity

Data governance and data management are often used interchangeably.

Better to treat them as separate levels

• Data Management is what you do to handle the data

o Resources, practices, enacting policies

• Data Governance is making sure that it is done

appropriately

o Policies, training, providing resources

o Planning and understanding

Governance and management

DCC data (curation) lifecycle model

https://www.dcc.ac.uk/guidance/curation-lifecycle-model


Capability Maturity Model
• Good management happens all through the data lifecycle

• 4 key process areas:

o Data acquisition, processing and quality assurance
Goal: Reliably capture and describe scientific data in a way that

facilitates preservation and reuse

o Data description and representation
Goal: Create quality metadata for data discovery, preservation, and

provenance functions

o Data dissemination
Goal: Design and implement interfaces for users to obtain and

interact with data

o Repository services/preservation
Goal: Preserve collected data for long-term use

• Good data governance uses a good management system

o A mature system manages data all through the data lifecycle and throughout all projects.

K Crowston & J Qin (2011) A Capability Maturity Model for Scientific Data Management: Evidence from the Literature. Proceedings of the American Society for Information Science & Technology V48

https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/meet.2011.14504801036


Capability Maturity Model


• Data management and governance are not things just

arranged for each project.

• They should be universal in how an organisation

thinks about and approaches data

o at all times
o in all divisions
o in all projects
o for all stakeholders

Universality

End of
Data Management Maturity

Week 6:
Issues with Data Science

Ethics of Linked Data

• Connecting elements within multiple structured data
sets

• Allows data relating to an element to be collected
from multiple data sets

• Expands the knowledge base of a single data set

• Linked Open Data (LOD) allows the links and data to
be freely shared and accessed
o Companies use it but tend not to contribute their own data

Linked Data

Sir Tim Berners-Lee, the
inventor of WWW and HTML,
wanted a semantic web, using
linked data
1. Name/Identify things with URIs

2. Use HTTP URIs so things can be
looked up

3. Standardise the format of data
about things with URIs

4. On the web, use the URIs when
mentioning things

CC BY 2.0

2012 Olympics: The Medium Isn't the Message?

Semantic Web

https://www.w3.org/DesignIssues/LinkedData.html

• Resource Description Framework (RDF) is another style of language for
representing (subject, verb, object) triples, where the “verb” is formally
called the predicate. Triples are used to represent semantics. RDF is a core
representation language for Linked Open Data and the Semantic Web.

• RDF can be represented in different formats, for instance as XML or
simply as line delimited lists.
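
A minimal sketch of the triple idea, assuming plain Python tuples rather than a real RDF library. The example.org URIs are hypothetical placeholders; the predicate URIs are taken from the schema.org, FOAF and Dublin Core vocabularies. The output is only N-Triples-like, one statement per line, and is not a validated serialisation.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The example.org URIs below are hypothetical, used only for illustration.
BOB = "http://example.org/bob#me"
MONA_LISA = "http://example.org/mona-lisa"

triples = [
    (BOB, "http://schema.org/birthDate", '"1990-07-04"'),
    (BOB, "http://xmlns.com/foaf/0.1/topic_interest", MONA_LISA),
    (MONA_LISA, "http://purl.org/dc/terms/title", '"Mona Lisa"'),
]


def term(value):
    """Literals are already quoted; anything else is treated as a URI."""
    return value if value.startswith('"') else f"<{value}>"


def to_lines(statements):
    """Write one statement per line, in an N-Triples-like style."""
    return [f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements]


def about(subject, statements):
    """Collect every (predicate, object) pair known about one subject."""
    return [(p, o) for s, p, o in statements if s == subject]


for line in to_lines(triples):
    print(line)
```

Because Bob's interest points at the same URI that carries the title, data about the painting held in another data set could be joined on that URI. This is the "linking" in linked data.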




[RDF example from the RDF 1.1 Primer: only the literal values 1990-07-04 (a date) and Mona Lisa (a title) survive]

https://www.w3.org/TR/rdf11-primer/

Format of linked (open) data

http://www.w3.org/RDF/

http://purl.org/dc/terms/

https://www.w3.org/TR/rdf11-primer/

• Ethics – the moral handling of data, e.g., not selling on
others’ private data to scammers

• People have rights
o privacy
o access
o erasure
o … etc.

• Companies have rights

o ownership of data
o intellectual property
o copyright
o confidentiality

Ethics

• Business models
o Data has become a valuable asset
o Data has become a valuable product

• Data from different services can be linked by companies
by buying out other companies or establishing new
services for other companies to use.

Alphabet      Facebook     Microsoft

Google        Facebook     Skype
YouTube       Instagram    Hotmail
Gmail         Oculus VR    Bing
Android       WhatsApp     Windows
Chrome        Giphy        Xbox Live / Minecraft / Bethesda
DoubleClick                GitHub

Companies using linked data

• Business models

o Multiple departments have separate systems

o Departments interact, so why can’t their data?

o Law enforcement needs to know what everyone
else knows!

• Problems

o Who should know what?

o How do you manage who should know what?

o What priorities do you give to the rights of people?

Governments using linked data

• What can you do?

• What should you do?

• How do you make sure the right thing is done?

Breaking it down

See: “The curly fry conundrum: Why social media ‘likes’ say more

than you might think” by Jennifer Golbeck

e.g. Target ® predicting which women are pregnant based on their

purchases

• Many things can be predicted from Facebook “likes”

• Homophily (tendency to associate with similar individuals)
is important for enabling prediction

• We often don’t own or manage corporate/internet/app

data about ourselves

• The source data is critical for advertisers, so we cannot expect companies to be banned/excluded from using it

• So how can we manage confidentiality?

Confidentiality

• for many apps/websites, you must accept their privacy and data sharing policies to use their services fully;

• the interface for selecting privacy preferences should

move away from individual Internet platforms and be put

into the hands of individual consumers;

• users could have an open source agent that brokers their confidentiality preferences

• but would that be feasible and would businesses ever

agree?

Confidentiality (cont.)

See: “Empower consumers to control their privacy in the

Internet of Everything” by Carla Rudder (blog)

https://enterprisersproject.com/article/2015/7/empower-consumers-control-their-privacy-internet-everything

1. Corporations: want to use data for business advantage;

‣ opposing consumers
2. Security conscious: concerned with individual freedom, liberty, mass surveillance;

‣ opposing intelligence orgs like National Security Agency
3. Open data: want open accessibility, support FOI requests

‣ opposing security experts concerned with leaks
4. Big data and civil rights: concerned about big data and citizens;

‣ opposing data brokers selling consumer data

Politics of Confidentiality

See: “Four political camps in the big data world” by Cathy O’Neil (blog)

Four political camps in the big data world

See: Facebook Doesn’t Tell Users Everything and Facebook Privacy:
Social Network Buys Data

Facebook buys 3rd-party data (from brokers) to obtain a
user’s activity, income, etc.

• keeps upwards of 52,000 features about users, many
provided to advertisers

• bought data is used as a complement,

• it is public, offline data, e.g., from Oracle’s Datalogix,

• but is not revealed to users

Facebook and Personal Data

https://www.propublica.org/article/facebook-doesnt-tell-users-everything-it-really-knows-about-them

http://www.ibtimes.com/facebook-privacy-social-network-buys-data-third-party-brokers-fill-user-profiles-2466651

https://en.wikipedia.org/wiki/Datalogix

See: “Can Facebook influence an election result?” by Michael Brand
(ex-Monash, opinion on ABC news via The Conversation) and also

“How Facebook could swing the election” by Caitlin Dewey (article,
Washington Post)

• implicit data: Facebook can predict who you will vote for

• their “I voted” button encourages people to vote (as they see
which of their friends have)

• studies show it significantly increased voting in the 2010 US election

• they can therefore subtly affect your voting

• could Facebook deploy “I voted” button selectively to favour

certain parties in certain areas?

Facebook and Voting

http://www.abc.net.au/news/2016-09-28/can-facebook-influence-an-election-result/7881660

https://www.washingtonpost.com/news/the-intersect/wp/2016/09/30/how-facebook-could-swing-the-election-and-who-will-benefit-if-it-does/

See “Machine logic: our lives are ruled by big tech’s decisions by data”, and
“If prejudice lurks among us, can our analytics do any better?”

Predictive models built on large populations are used to
filter/make key life decisions like release from jail, treatment
in hospital, getting a loan, news/videos you see (e.g.,
Facebook) …

• ML algorithms do the filtering

• ML algorithms can also produce prejudice (i.e., are biased)

• decisions made en masse, not personalised

• decisions are centralised (who writes the algorithms?)

• perhaps this is OK … perhaps

Population-level Prediction

https://www.theguardian.com/technology/2016/oct/08/algorithms-big-tech-data-decisions

https://www.oreilly.com/ideas/if-prejudice-lurks-among-us-can-our-analytics-do-any-better

Philip R. “Phil” Zimmermann,

• creator of the Pretty Good Privacy (PGP) email

encryption software

• Interview in 2013:
“the biggest threat to privacy was Moore’s Law

… the ability of computers to track us doubles every

eighteen months

…The natural flow of technology tends to move in the
direction of making surveillance easier”

Zimmermann’s law

https://en.wikipedia.org/wiki/Pretty_Good_Privacy

https://en.wikipedia.org/wiki/Email_encryption

https://web.archive.org/web/20130815064716/http:/gigaom.com/2013/08/11/zimmermanns-law-pgp-inventor-and-silent-circle-co-founder-phil-zimmermann-on-the-surveillance-society/

Australian govt interface:
• Australian JobSearch

• Australian Taxation Office
• Centrelink
• Child Support
• Department of Health Applications Portal
• Department of Veterans’ Affairs

• HousingVic Online Services
• Medicare
• My Aged Care
• My Health Record
• National Disability Insurance Scheme

• National Redress Scheme
• State Revenue Office Victoria

Government linked data

https://my.gov.au

• My.gov.au gives the public access to their data

o Greater dependency on online interfaces

o Less pen and paper data processing

o More automation of processing

o Cf. RoboDebt, Census

• Less clear what access each government can have to

the data

Government data access

• “require some telecommunications service providers to
retain specific telecommunications data (the data set)

relating to the services they offer for at least 2 years”
o Who talks to whom on the phone & when
o Who emails whom & when
o The IP address

• What doesn’t it include?
o information about telecommunications content or web

browsing history

• Who has access to the data without a warrant?
o 20 intelligence agencies, criminal law enforcement agencies,

ATO, ASIC and ACCC
o Civil litigation exemption

(Australian) Data retention laws


https://www.homeaffairs.gov.au/about-us/our-portfolios/national-security/lawful-access-telecommunications/data-retention-obligations

• Rights vs functionality

• Change in responsibilities

o Change in processes and technology in response

• Where does automation and AI fit?

o Where is the responsibility and accountability?

o Snowden and the NSA surveillance

Data retention laws – issues

End of
Ethics of Linked Data

Week 6:
Issues with Data Science

AI Veracity

• Various factors can affect the “accuracy” of any
analysis

o Data quality

o Choice of analysis

o Design of analysis

o Choice of data

• It is easy for the modelling to misrepresent what the
data is supposed to reflect.

o Even statistical analysis can be biased!

Can you trust the analysis?

Chris is an excellent driver.
They have applied for new
car insurance, but a ML

system automatically
evaluates their application.
What personal data should be
considered?

a) Driving record?

b) Payment metrics?

c) Location?

Should the system reject the
application purely due to
where Chris lives?


Question

https://www.crimestatistics.vic.gov.au/crime-statistics/latest-crime-data-by-area

Google trains ML systems to recognize some
common items in pictures. What do you think it

thought was in these hands in 2020?

a) Banana

b) Gun

c) Monocular

d) Tool

Question

https://algorithmwatch.org/en/story/google-vision-racism/


• Not all bias is in the numbers

• Bias can also be in how you have designed the

research

o Are the variables appropriate for all situations being
modelled?

o Are assumptions made about the stakeholders who the
data relates to?

o Are assumptions being made about the context of the
data?

Bias of design

• Sometimes the data used to train a ML system is biased,
regardless of its volume

o Narrow
o Regional

o Undertested in varied contexts

• Biased system may discriminate in its results, for instance
by

o gender
o ethnic associations

o generalities

• Biased system may not be as accurate in its results for
unfamiliar contexts and subjects

Bias of data

• Bias like this can appear in any automated processing

o Google: Shows ads for high paying jobs to men more
than women

o Jailtime: Sees black Americans as more at risk of
reoffending than white Americans

o Student applications: ML used to recognize bias in the
decision process and to add bias to the system

• Automated systems will only be as good as the
underlying data

Not just about image recognition


https://towardsdatascience.com/bias-in-artificial-intelligence-a3239ce316c9

https://www.fastcompany.com/90342596/schools-are-quietly-turning-to-ai-to-help-pick-who-gets-in-what-could-go-wrong

• Automated systems may speed up the processes, but
humans are better at understanding the context

o Human-in-the-loop

• Need the human perspective in the design,
understanding and review of the process, how it is
utilised and its results

Great responsibilities
Great transparency

Human perspective

https://appen.com/blog/human-in-the-loop/

• Need to incorporate legal requirements into the
system

o Not discriminating by race, gender, sex, age, etc unless
allowed

• Need to respect the rights of the individual

• Privacy-by-design

o Factor the right to privacy into the design of any DS
system, not as an afterthought.

o Also factor rights and legal requirements into how any
system is used

Legality by design

https://www.oaic.gov.au/privacy/privacy-for-organisations/privacy-by-design/

End of
AI Veracity

Week 6:
Issues with Data Science

Sampling

• When collecting data for processing, it has to be
relevant
o Can you get all data relating to the scenario you are

modelling?

o Can you only get a random sample of data? The sample data
has to be representative of the population being modelled

o How large a sample do you need?

o What known variables are included in the data?

o Is the sample data distributed to match the required
strata/categories?

• Observe the population before you make any
unqualified assumptions

Sampling populations

• Blind experiments or A/B testing may be used to show whether there is a relationship between variables

• The experimental scenario needs to be divided into:

o A: Sample is subjected to the known variable

o B: Sample is not subjected to the known variable (the Control
set)

• The validity of the hypothesis is based on whether A has a different response to B, where the response is the target variable.
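
The A/B split described above can be sketched as follows. This is a minimal illustration, assuming a simple random 50/50 assignment; the function names and the outcome counts are hypothetical and invented for the example, not taken from the unit material.

```python
import random


def assign_groups(subjects, seed=0):
    """Randomly split subjects into group A (treated) and group B (control)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def response_rate(responses):
    """Fraction of a group showing the target response (1 = yes, 0 = no)."""
    return sum(responses) / len(responses)


group_a, group_b = assign_groups(range(100))

# Hypothetical outcomes, invented purely for illustration.
responses_a = [1] * 12 + [0] * 38   # A: subjected to the known variable
responses_b = [1] * 5 + [0] * 45    # B: control set

difference = response_rate(responses_a) - response_rate(responses_b)
print(f"A: {response_rate(responses_a):.2f}  B: {response_rate(responses_b):.2f}")
```

Random assignment matters because it stops the experimenter (or the subjects) from choosing who receives the treatment. Whether the difference is large enough to mean anything is a separate question, covered by significance testing.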

https://en.wikipedia.org/wiki/A/B_testing#/media/File:A-B_testing_example.png (CC BY-SA 4.0)

A/B testing

https://en.wikipedia.org/wiki/A/B_testing

How much of a difference in results is enough?
• Must test the statistical significance

o p value: the probability (0 to 1) of getting results at least as extreme as yours
if the variable being tested had no effect; it measures your “surprise”

• Hypothesis: Aspirin reduces heart attack
o Sample: studied 100 men for 5 years

Group HA: 50 men take aspirin daily
Group HP: 50 men take placebo daily (control)

o Results:
‣ High p: HA 4 heart attacks, HP 5 heart attacks, so both around 1 in 10 men
‣ Low p: HP 10, HA 1, so very different and significant!
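
To make the aspirin example concrete, here is a sketch that computes a p value for both sets of results. It assumes a two-sided Fisher exact test, which is one common choice for small 2x2 counts; the slides do not say which test is intended, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(events_a, n_a, events_b, n_b):
    """Two-sided Fisher exact test for a 2x2 table of event counts.

    Sums the probability of every possible split of the events that is
    no more likely than the split actually observed.
    """
    total_events = events_a + events_b
    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    # Number of ways group A could have had exactly k of the events.
    ways = {
        k: comb(n_a, k) * comb(n_b, total_events - k)
        for k in range(lowest, highest + 1)
    }
    observed = ways[events_a]
    as_extreme = sum(w for w in ways.values() if w <= observed)
    return as_extreme / comb(n_a + n_b, total_events)


# Counts from the example: 50 men on aspirin (HA), 50 on placebo (HP).
p_similar = fisher_exact_two_sided(4, 50, 5, 50)     # HA 4, HP 5
p_different = fisher_exact_two_sided(1, 50, 10, 50)  # HA 1, HP 10

print(f"HA 4 vs HP 5:  p = {p_similar:.3f}")
print(f"HA 1 vs HP 10: p = {p_different:.4f}")
```

The first result gives a high p value (no evidence of a difference), while the second falls below the conventional 0.05 threshold.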

Significance testing

https://en.wikipedia.org/wiki/Statistical_significance

• How much difference is enough? (p<0.05?)
• More data gives a more accurate impression, but how much is enough?
• Should you publish experimental results that challenge previous runs of the same experiment?
o Negative results shouldn’t be forgotten
o Old experiments may be flawed
o New data may give a better understanding of the context
• Can you cross-validate your results?
o k-fold testing: experiment with k combinations of test and training data

Significance chasing

• Is data science interested in finding patterns in data (observation) or in experimentation (testing outcomes)?
• Both models and theories/hypotheses are research artefacts
o Need to demonstrate how they match evidence
o Scientific method isn’t the only valid research methodology!
• Still need to make sure any modelling or other research outcomes are valid!

Challenging the scientific method?

• In 2009 Google claimed it had worked out a correlation between some search terms and a growth in flu cases
o Could identify the trends 2 weeks before it became a health problem!
• But this has problems!
o Not openly sharing their methods – IP!
o Not openly sharing their data – privacy and proprietary
o Inconsistent in temporal perspectives
o Overestimates the infections!
• “greater value can be obtained by combining GFT with other near–real time health data”

Google Flu Trends

Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176) (March 14): 1203–1205.
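
Returning to the k-fold testing mentioned under "Significance chasing": a minimal sketch of how the k test/training combinations can be generated, assuming contiguous folds over item indices. The function name is hypothetical, and in practice the data would usually be shuffled before splitting.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every item is used for testing exactly once and for training k-1 times, so a result that holds across all folds is less likely to be an accident of one particular split.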
https://dash.harvard.edu/handle/1/12016836

• Data science allows us to expand what we can do with data
o Growth laws
o Dealing with the Vs
• Data science allows us to reinterpret scenarios
o New ways to approach old problems
• Data science is not standalone
o Combine with existing methods
o Human-in-the-loop
• Data science doesn’t have to be just about making better models
o Use data science to solve real problems

Data science and society

https://www.technologyreview.com/2020/08/18/1007196/ai-research-machine-learning-applications-problems-opinion/

End of
Sampling

Week 6:
Issues with Data Science

Future of Data Science

Gartner’s hype cycle

https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020/

• Traditional technology reaches its limits
• DNA storage becomes a reality
• Expansion of electronic physical experiences
• Farms and factories face automation
• CIOs become Chief Operating Officers
• Change is driven by recording work conversations
• Increase in freelance customer service experts
• More attention to a “voice of society” metric in organisations
• On-site childcare entices employees
• Handling malicious content becomes a priority

Gartner’s predictions for 2021+

https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-predictions-for-2021-and-beyond/

The 2021 Hype Cycle for Emerging Technologies

https://www.gartner.com/smarterwithgartner/3-themes-surface-in-the-2021-hype-cycle-for-emerging-technologies

• Growth of big data technologies has allowed multiple types of data to be combined
o Structured and unstructured data, e.g., sales records and customer feedback
o Multimedia, e.g., video and textual data, image and textual data
• Growth of IT has allowed better processing capability (Moore’s Law)
o New ways to use multiple models relating to different data sets (Bell’s Law), e.g., visual interpretation of gestures and audio interpretation of speech vs world knowledge
o ML using neural networks (NN) and deep learning

Combining data

• Very much a data science process
o Gather data
o Analyse data
o Produce conclusions
o Make decisions
o Act on the decisions
• Many uses
o Manufacturing “robots”
o Robotic vacuum cleaner
o Adaptive energy systems
o Chatbot
o Stock market agent
o Independent agents in modelling, e.g., public behaviour during pandemic

Autonomous devices

• For instance,
o Drones & other aircraft (autopilot!)
o Trucks on freeways & mining sites
o Trains
o Suburban cars
• Collect data from various sources
o Local: speed limits
o Internal: sensors, cameras, radar
o External: road maps, weather
• Actions
o Plans: routes, known objectives
o Instinct: dynamic, adaptable responses, preempting actions of other entities

Autonomous vehicles

Gartner’s Legal Tech Hype Curve

https://www.artificiallawyer.com/2020/07/27/gartner-legal-tech-hype-curve-2020-positions/

• Digital ethics is currently of great interest
o GDPR
• Laws for autonomous devices
o Military weapons like drones and gun turrets
o Responsibility for errors by other vehicles
• Researchers are looking at where the legal holes are, how they can be filled, what is possible to implement and how AI can help the law
o Dr Campbell Wilson - AI for Law Enforcement & Community Safety

Laws for Data Science?
https://research.monash.edu/en/projects/ai-for-law-enforcement-community-safety

• Data Science is not just about Machine Learning
• Data Science is not just about coding
• Data Science is about helping society use data better at every stage of the data lifecycle
• Data science is not just for IT
• Data science is now recognised as having a multi-disciplinary role in all industries
o Remember the multiple dimensions of a data scientist's skillset!

Future for Data Science

End of
Future of Data Science

Week 6:
Issues with Data Science

Revisiting the Unit Content

• Data science
o History
o Definition
o Machine Learning
• Data scientist
o Skills
o Roles
o Job descriptions & requirements
• Data science process & value chain

Week 1 - Overview

• Data science
o History
o Impact
‣ On other disciplines
‣ On society
‣ Scientific method
‣ Futurology
• R
o R Markdown
o ggplot2: visualising and aesthetics
‣ Graphs & facets
o Data wrangling
‣ wrangling verbs
‣ Tidy data

Week 1 - Impact

• Types of analysis
• Modelling
o Influence diagrams
• Growth laws
• Business models
o SaaS
• Basic statistics
o Mean, variance
o Variable types
o Outliers and box plots
• Choosing visualisations

Week 2 – Visualising statistics

• Big data
o The Vs
o Growth laws
• NIST Case studies
o Analysis framework
• Data quality
o Wrangling
o Missing data & strategies
o Imputation
o NaN and NA
‣ Shadow matrix in R

Week 2 – Big Data

• Sharing data
o Open data
o Data sources
o Complexities of using shared data
o Getting data
• Data standards
o Formats: machine-readable, containers, markups
o Metadata
o Semi-structured data: XML, JSON
o Predictive Model Markup Language
• Combining data
o joins

Week 3: Data sources

• Scripting languages: R, Python, Unix shell code
o Wildcards
o Piping
o Directing input/output
o Moving files and directories
o Analysing file contents: grep, awk
o Handling big data
• Standardisation
o software
o workflow
o processes

Week 3: Big Data and Standards

• Temporal data
o Temporal elements
o Extraction and conversion
o Visualisation
• Statistical modelling
o Variables: dependent, independent
o Causation vs correlation
o Regression modelling
‣ Model family
‣ Learning parameters/fitting a model
‣ Simple linear regression model

Week 3 – Modelling data

• Truth of data
o Error
o Correlation coefficient
• Simple linear regression
o Residuals: Mean Square Error
o Goodness of fit: fitting variation
‣ R-squared
• Polynomial regression
o Degree
o Underfitting, overfitting
o Testing and training
o Bias-Variance tradeoff
o No Free Lunch Theorem
o Multiple models & ensembles

Week 4 – Fitted modelling

• Segmentation
• Regression trees
o ANOVA
• Classification trees
• Clustering
o Centroids
o K-means
o Hierarchical trees: dendrograms

Week 4 – Grouping data

• RDBMS: SQL
o Unstructured data: NoSQL
• Distributed systems
o Hadoop
o Map-Reduce
o Spark
• NIST Big Data Reference Framework
• Data Science Tools & Services
o Open source software
o Case studies
o APIs
o SaaS

Week 5 – Data Science Tools

• Data management
• Data lifecycles
• Data governance
o Legal requirements: Privacy Act, GDPR, licenses
o Ethical requirements
o Rights
o Privacy
o Confidentiality
• Stakeholders
• Data management plans
• Data curation

Week 5 – Data management

• Data management capability maturity
• Linked data: Semantic web, RDF
o Confidentiality
o Privacy
• Surveillance
o Data retention laws
• AI veracity
o Bias
o Human-in-the-loop
o Sampling
‣ A/B testing
‣ Significance testing: p-value, k-fold testing
o Scientific method

Week 6 – Issues

• We have covered a lot of areas because data science has a broad influence.
• Hope you’ve learnt a lot from the unit.
• Best of luck for the final exam assessment task!

The end? Or the start?
If …

If you are interested in doing a minor thesis on the topic of applying Data Science techniques for educational research, please feel free to send an email via: [email protected]

Please help us improve by filling out the SETU surveys on Moodle.

Lastly …

End of FIT5145 Lectures !!!!

Week 5:
Data Management & Governance

Storing Big Data

From Big data on Wikipedia:
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving target, ...

Big Data

https://en.wikipedia.org/wiki/Big_data

Summary
BIG DATA is ANY attribute that challenges CONSTRAINTS of a system’s CAPABILITY or BUSINESS NEED.

The Four V’s of Big Data

“The Four V’s of Big Data” by IBM (infographic)

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

New approaches are needed to handle this complexity
• Storing it
• Analysing it
• Visualising it
• Using it
➢ New software
➢ New methods
➢ New hardware

Big Data is complex

• Collection: getting the data
• Wrangling: data preprocessing, cleaning
• Analysis: discovery (learning, visualisation, etc.)
• Presentation: arguing that results are significant and useful
• Engineering: storage and computational resources
• Governance: overall management of data
• Operationalisation: putting the results to work

Our Standard Value Chain

RDBMS and SQL Review

• Relational Database Management Systems (RDBMS)
• SQL: Structured Query Language
• Rather like a large-scale set of Excel spreadsheets with better indexing and retrieval
• Transaction oriented with support for correctness, distribution, ...
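
As a small illustration of the RDBMS review points above (fixed table format, SQL queries, transactions), here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table name, columns and rows are hypothetical, invented for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A relational table has a fixed format: every row has the same columns.
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# 'with conn' wraps the inserts in one transaction:
# either all rows are committed, or none are if an error occurs.
with conn:
    conn.executemany(
        "INSERT INTO sales (customer, amount) VALUES (?, ?)",
        [("alice", 20.0), ("bob", 35.5), ("alice", 4.5)],
    )

# SQL describes what result is wanted; the database decides how to retrieve it.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()

print(totals)
conn.close()
```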
• Businesses function in a continuously changing environment:
‣ fixed formats as per RDBMS not suitable
‣ usage varies, requires complex analytical queries
• Need to reach insights faster and act on them in real time
‣ stream processing

Business Context

• Stores a graph, commonly as triples, e.g., (subject, verb, object)
• Commonly used to store Linked Open Data

Graph Database Example

Semi-structured data is data that is presented in XML or JSON:
• see some examples at the JSON link below
• Note YAML (“YAML Ain’t Markup Language”, originally “Yet Another Markup Language”), which is just an indentation-based (easier to read) version of JSON
• standard libraries for reading/writing/manipulating semi-structured data exist in Python, Perl, Java
• don’t need to know all the details of XML (and related Schema languages), there are many good online tutorials, e.g. W3schools.com

Semi-Structured Data

https://en.wikipedia.org/wiki/JSON

http://www.w3schools.com/

• No fixed format
• Semi-structured, key-value pairs, hierarchical
• “Friendly” alternative to XML
• Self-documenting structure

JSON Example

REST API Terminology

API: Application Programming Interface
• Routines providing programmatic access to an application.
REST: REpresentational State Transfer
• a stateless API usually running over HTTP
• Watch a simple introduction to REST-based APIs in this video: REST API concepts and examples by WebConcepts
SaaS: Software as a Service
• The provisioning of software in a Web browser and/or via an API over the Web as a subscription service.

https://www.youtube.com/watch?v=7YcW25PHnAA

• Use SQL database when:
‣ data is structured and unchanging
• Use NoSQL database when:
‣ Storing large volume of data with little to no structure
‣ Data changes rapidly
• NoSQL databases offer a rich variety beyond traditional relational data.
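
The semi-structured JSON format described earlier in this section can be sketched with Python's standard json module. The record below is hypothetical and only illustrates the key-value, hierarchical, self-documenting structure; it is the kind of document a NoSQL store would hold without a fixed schema.

```python
import json

# A hypothetical record: keys name their values, and values can nest.
text = """
{
  "title": "Mona Lisa",
  "creator": {"name": "Leonardo da Vinci"},
  "tags": ["painting", "portrait"]
}
"""

record = json.loads(text)           # JSON text -> Python dict/list/str values
creator = record["creator"]["name"]  # walk the hierarchy by key

# No fixed format: a new field can be added without changing any schema.
record["location"] = "Louvre"

round_trip = json.loads(json.dumps(record, indent=2))
print(creator, sorted(record))
```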
SQL and Beyond SQL Databases (NoSQL)

• In-database analytics: the analytics is done within the DB
• In-memory database: the DB content resides in memory
• Cache: data stored in-memory
• Key-value: value accessible by key, e.g., hash table
• Information silo: an insular information system incapable of reciprocal operation with other, related information systems
‣ If two big banks merge, then initially their RDBMSs will be siloed
‣ In a big insurance company, the customer RDBMSs for auto and home insurance may be siloed

Database Background Concepts

End of
Storing Big Data

Week 5:
Data Management & Governance

Hadoop, Spark & Map-Reduce

Interactive: bringing humans into the loop
Streaming: massive data streaming through system with little storage
Batch: data stored and analysed in large blocks, “batches”, easier to develop and analyse

Overview: Processing

In-memory: in RAM, i.e., not going to disk
Parallel processing: performing tasks in parallel
Distributed computing: across multiple machines
Scalability: to handle a growing amount of work; to be enlarged to accommodate growth (not just “big”)
Data parallel: processing can be done independently on separate chunks of data
Yes: process all documents in a collection to extract names
No: convert a wiring diagram into a physical design (optimisation)

Processing Background Concepts

• Legacy systems provide powerful statistical tools on the desktop, such as SAS, R, and Matlab, but often without distributed or multi-processor support.
• Supporting distributed/multi-processor computation requires special redesign of algorithms

Distributed Analytics

Simple distributed processing framework developed at Google
• published by Dean and Ghemawat of Google in 2004
• intended to run on commodity hardware; it has fault-tolerant infrastructure
• from a distributed systems perspective, is quite simple

Map-Reduce

For a simple word-count task:
(1) divide data across machines
(2) map() to key-value pairs
(3) sort and merge() identical keys

Map-Reduce Example

• requires simple data parallelism followed by some merge (“reduce”) process
• Google probably stopped using it in 2005
• Google now uses “Cloud Dataflow”, available commercially as open source

Map-Reduce (cont.)

https://cloud.google.com/dataflow/

https://cloud.google.com/dataflow/what-is-google-cloud-dataflow

Open-source Java implementation of Map-Reduce
• originally developed by Doug Cutting while at Yahoo!
• architecture:
Common: Java libraries and utilities
MapReduce: core paradigm
• huge tool ecosystem
• well past the peak of the hype curve

Hadoop

https://en.wikipedia.org/wiki/Doug_Cutting

• another (open source) Apache top-level project: Apache Spark
• developed at AMPLab at UC Berkeley
• builds on Hadoop infrastructure
• interfaces in Java, Scala, Python, R
• provides in-memory analytics
• works with some of the Hadoop ecosystem

Spark

http://spark.apache.org/

https://amplab.cs.berkeley.edu/

• Hadoop provides an inexpensive and open source platform for parallelising processing:
‣ based on a simple Map-Reduce architecture
‣ not suited to streaming (suitable for offline processing)
• Spark is a more recent development than Hadoop
‣ includes Map-Reduce capabilities
‣ provides real-time, in-memory processing
‣ much faster than Hadoop

Summary: Hadoop and Spark

End of
Hadoop, Spark & Map-Reduce

Week 5:
Data Management & Governance

Data Science Tools

Here’s how you learn about which tools are important!
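
Looking back at the Map-Reduce word-count task from the previous section, the three steps (divide, map to key-value pairs, merge identical keys) can be sketched on a single machine. This is only an illustration of the idea in plain Python, not Hadoop or Spark code; the function names and input text are hypothetical.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Sort/merge step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# (1) The data is divided into chunks; on a cluster each chunk would sit
#     on a different machine and be mapped independently (data parallel).
chunks = ["big data is big", "data science uses data"]

# (2) map() each chunk, then (3) merge identical keys and reduce.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))

print(counts)
```

The map step needs no communication between chunks, which is why the task parallelises so easily; only the merge step has to bring data for the same key together.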
BOSSIE is Best Open Source Software awards: • BOSSIE awards 2015 for Big Data and BOSSIE awards 2016 for Big Data • BOSSIE awards 2017 for machine learning and deep learning tools and for databases and analytics tools • BOSSIE awards 2019 • BOSSIE awards 2020 • BOSSIE awards 2021 Open Source Software Awards http://www.infoworld.com/article/2982429/open-source-tools/bossie-awards-2015-the-best-open-source-big-data-tools.html http://www.infoworld.com/article/3120856/open-source-tools/bossie-awards-2016-the-best-open-source-big-data-tools.html https://www.infoworld.com/article/3228224/machine-learning/bossie-awards-2017-the-best-machine-learning-tools.html https://www.infoworld.com/article/3228150/analytics/bossie-awards-2017-the-best-databases-and-analytics-tools.html https://www.infoworld.com/article/3444198/the-best-open-source-software-of-2019.html https://www.infoworld.com/article/3575858/the-best-open-source-software-of-2020.html https://www.infoworld.com/article/3575858/the-best-open-source-software-of-2020.html 2015: big data tools, Spark and “elastic” processing, scalable ML and databases, stream/real-time processing (ML, search, analysis, storage, time-series), security 2016: big data tools, pipelines, TensorFlow, distributed IR (Solr), NoSQL analytics, stream analytics, graph database 2017: big data and analytics tools, GPU acceleration, real-time SQL, more Spark, Solr, R, graph databases 2017: ML tools, deep learning, scalable prediction, Python, gradient boosting, TensorFlow 2021: analytics and ML tools, Orange, Apache software, distributed SQL, explainable AI Machine learning & analytics on top of big data are now mainstream! Open Source Software Awards (cont.) Let’s have a look at what all these Open Source Projects doing 1. Apache Hadoop Distributed File System (HDFS) 2. Apache Hadoop YARN 3. Apache Spark 4. Apache Cassandra (distributed NoSQL, wide-column store) 5. Apache HBase (distributed NoSQL, wide-column store) 6. Apache Hive (distributed SQL) 7. 
Apache Mahout (distributed linear algebra with GPU) 8. Apache Pig (data flow and data analysis on top of Hadoop) 9. Apache Storm (distributed real-time computation) 10. Apache Tez (dataflow for Hive and Pig) Many state-of-the-art platforms integrated into Hortonworks (now the Cloudera Data Platform). Popular Open Source Projects http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html http://spark.apache.org/ https://cassandra.apache.org/ https://hbase.apache.org/ https://hive.apache.org/ https://mahout.apache.org/ https://pig.apache.org/ http://storm.apache.org/ http://tez.apache.org/ http://hortonworks.com/hdp/whats-new/ A number of organisations run salary surveys. These are usually interesting because they also describe what tasks people do and what software they use. • O’Reilly’s Salary Survey: behind login, slides summarised ‣ 2016 Data Science Salary Survey ✓ really interesting content on software used, ... ‣ 2017 European Data Science Salary Survey ✓ really interesting content on tasks done, coding versus meetings, .. • Kaggle state of data science and machine learning ‣ really interesting content on job title, education, methods, barriers, getting started ‣ explore this one online! explore this one online, a 2018 survey, 2019 survey and 2021 survey are also available! 
Work and Salary Surveys http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp https://www.oreilly.com/ideas/2017-european-data-science-salary-survey https://www.kaggle.com/surveys/2017 https://www.kaggle.com/kaggle/kaggle-survey-2018 https://www.kaggle.com/c/kaggle-survey-2019 https://www.kaggle.com/kaggle-survey-2021 2016 Data Science Salary Survey Software Usage Survey http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp Survey: Clusters amongst the Respondents Survey: Commonly Used Software Kaggle 2021 Survey: Interactive Development Environments Kaggle | State of ML & Data Science 2021 https://www.kaggle.com/kaggle-survey-2021 https://www.kaggle.com/kaggle-survey-2021 Survey: Operating Systems Survey: Programming Languages Survey: Relational Databases Kaggle 2021: Relational Databases Kaggle | State of ML & Data Science 2021 https://www.kaggle.com/kaggle-survey-2021 https://www.kaggle.com/kaggle-survey-2021 Survey: Management and Big Data Survey: Visualization Coding versus Meetings Career Choices Tasks – Time Tasks – Salary End of Data Science Tools Week 5: Data Management & Governance Data Science Use Case Studies • Mike Olson (co-founded Cloudera in 2008) says without big data and a platform to manage big data, machine learning and artificial intelligence just don’t work. • See the machine learning renaissance starting at 60 seconds. The Machine Learning Renaissance https://www.oreilly.com/ideas/the-machine-learning-renaissance • “Visualizing the world’s Twitter data – Jer Thorp”, a TEDYouth 2012 Talk, former New York Times data artist-in-residence Jer Thorp (video, 6mins) • National Map (Youtube, 14 mins) is a website for map-based access to Australian spatial data from government agencies. The website is http://nationalmap.gov.au/. 
• “Style Stalking; The Stochastic Patterns that Drive Fashion Trends”, by Karen Moon from Strata+Hadoop World 2014 (video, 10 minutes) • Panama Papers, leaked papers (11.5M) on financial transactions, motivations for using data science, and how analysed (Wired, 2016). Case Studies http://ed.ted.com/lessons/mapping-the-world-with-twitter-jer-thorp https://www.youtube.com/watch?v=e7jQoV2pl_0 http://nationalmap.gov.au/ https://www.youtube.com/watch?v=VyV0NZX_eZ8 https://en.wikipedia.org/wiki/Panama_Papers https://blog.unbelievable-machine.com/en/blog/panama-papers-and-data-science/ http://www.wired.co.uk/article/panama-papers-data-leak-how-analysed-amount Data sources: where the data comes from Data volume: how much there is Data velocity: how it changes over time Data variety: what different kinds of data there is Data veracity: correctness problems in the data Software: software needed to do the work Analytics: broadly, what sorts of statistical analysis and visualisation needed Processing: broadly, computational requirements Capabilities: broadly, key requirements of the operational system Security/Privacy: nature of needs here Lifecycle: ongoing requirements Other: notable factors Reminder: NIST Analysis Freebase: • an example of a graph database we looked at earlier • graph can be represented in RDF which is triples of URIs • now owned by Google, and decommissioned • used by others as a knowledge-base in many text processing pipelines: ‣ e.g., using TextRazor to extract meaning from text DBpedia: • aim to extract all structured content from information in Wikipedia • open source project • effectively replaced Freebase Freebase and DBPedia http://www.freebase.com/ https://www.textrazor.com/ http://wiki.dbpedia.org/ The Unified Medical Language System (UMLS) Medical Data Dictionaries http://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_001.html ICD: the International Classification of Diseases • used to classify diseases and other health problems • 
based on health and vital records • for example: Pneumonia due to Streptococcus pneumoniae Medical Data Dictionaries (cont.) http://apps.who.int/classifications/icd10/browse/2010/en Other Medical Dictionaries: • SNOMED CT ‣ Systematized Nomenclature of Medicine Clinical Terms • Gene Ontology ‣ Concepts for describing gene function Usage of Medical Dictionaries: • controlled vocabularies • semantic data exploration • clinical surveillance • decision support Medical Data Dictionaries (cont.) http://www.ihtsdo.org/snomed-ct http://geneontology.org/ • PUBMED, we have seen before • ACM Digital Library • Global Patent Index provided by the EPO • Semantic Scholar for research article search Publishing Repositories http://dl.acm.org/ https://www.epo.org/searching-for-patents/technical/espacenet/gpi.html https://www.semanticscholar.org/ Event Registry • collect news article globally, process and organise as events • perform concept and event identification • create a document database for inspection • sometimes news stored as NewsML News and Event Registry http://eventregistry.org/ https://iptc.org/standards/newsml-g2/ • US Government’s Data.GOV • NYC Open Data • Australia’s Urban Intelligence Network (AURIN), e.g. SD Private Health Insurance • BioGrid Australia, curated for research use and usually require getting approval to use Government Data http://www.data.gov/ https://data.cityofnewyork.us/dashboard http://aurin.org.au/ https://data.aurin.org.au/dataset/tua-phidu-sd-privatehealthinsurance-sd https://www.biogrid.org.au/ Many companies are exposing their data and their website functionality as APIs for others to make use of: • Facebook API • Twitter API e.g. search tweets • LinkedIn API • Google Maps API • Youtube API e.g. 
documentation • Amazon Advertising API • TripAdvisor API • New York Times API Example Data/Information APIs https://developers.facebook.com/products/ https://dev.twitter.com/rest/public https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html http://www.programmableweb.com/api/linkedin https://developers.google.com/maps/ https://developers.google.com/youtube/ https://developers.google.com/youtube/v3/getting-started https://advertising.amazon.com/API https://developer-tripadvisor.com/content-api/ http://developer.nytimes.com/ Companies provide functionality via APIs so that others can make use of their data and services: • The Application Economy: A New Model for IT (CISCO) • ProgrammableWeb API Category: Data • Top 30 Predictive Analytics API (see #4) • 20+ Machine Learning as a Service Platforms And for something completely different: • The Sharing Economy | Bullish (on TechCrunch) ‣ these companies are huge users of data science! The API Economy https://www.youtube.com/watch?v=9Ai5TTVTyWc http://www.programmableweb.com/category/data/api http://www.predictiveanalyticstoday.com/top-predictive-analytics-software-api/ http://www.butleranalytics.com/20-machine-learning-service-platforms/ https://techcrunch.com/video/the-sharing-economy-bullish/519620665/ Some companies are exposing their tools/services as APIs or browser based tools for others to make use of: • Azure Machine Learning Studio • Figure-Eight Human in the Loop ML with crowdsourcing support • Watson REST API for semantic web, metadata, entity analysis in text • Google Cloud Prediction API ‣ is closing down in April 2018, and they will focus on cloud solutions Example Processing APIs or Web Services https://azure.microsoft.com/en-us/services/machine-learning-studio/ https://www.figure-eight.com/ https://dataplatform.ibm.com/docs/content/analyze-data/pm_service_api_spark.html https://cloud.google.com/prediction/docs/ • Email systems (Google, Microsoft Office365) • File sharing 
systems( Dropbox, Box, Microsoft One drive, Google drive ..) • Business systems (Salesforce, Servicenow, ..) SaaS Examples End of Data Science Use Case Studies Week 5: Data Management & Governance Data Management • You want the data you are using to be of sufficient quality for your purpose - Accuracy - Completeness - Consistency - Integrity - Reasonability - Timeliness - Uniqueness/deduplication - Validity § Data Management Association (DAMA) • Much of this is a data management issue - But data management is about more than just data quality! Data quality https://www.naa.gov.au/information-management/building-interoperability/interoperability-development-phases/data-governance-and-management/data-quality Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets. Data Management • See “Top 10 Mistakes in Data Management” a tutorial from Intricity (a data management company) (Youtube) • See “How to avoid a data management nightmare”, a video created by NYU Health Sciences Library (Youtube) Data Management (cont.) https://www.youtube.com/watch?v=5Pl671FH6MQ https://www.youtube.com/watch?v=nNBiCcBlwRA Examples of data management issues arising in data science projects: Medical informatics: for predicting fungal infections from nursing notes, the team needs to abide by confidentiality and security requirements. Internet advertising: what implicit and explicit data is stored about a user? Retailing: conduct market intelligence on new products; put together data from different divisions (brands) within the company. Predictive medical system: implementation may need changing standard operating procedure for staff Data Management and Data Science Science: reproducibility and credibility of scientific work, producing artifacts of knowledge, creating scientific data Business: governance, compliance, information privacy, etc. Curation: e.g. 
museums and libraries, preservation, maintenance, etc. Government: a unique legislative environment that regulates them (e.g., “transparency”), archiving, FOIs, support data infrastructure, etc. Medicine: significant privacy issues, conflicting corporate financial constraints, government regulations and furthering of medical science Contexts for Data Management End of Data Management Week 5: Data Management & Governance Data lifecyles • Collection: getting the data • Wrangling: data preprocessing, cleaning • Analysis: discovery (learning, visualisation, etc.) • Presentation: arguing that results are significant and useful • Engineering: storage and computational resources • Governance: overall management of data • Operationalisation: putting the results to work Our Standard Value Chain https://confluence.csiro.au/display/RDM/Research+Data+Management CSIRO research data lifecycle https://confluence.csiro.au/display/RDM/Research+Data+Management https://old.dataone.org/data-life-cycle DataOne model https://old.dataone.org/data-life-cycle https://www.dcc.ac.uk/guidance/curation-lifecycle-model DCC data (curation) lifecycle model https://www.dcc.ac.uk/guidance/curation-lifecycle-model End of Data Lifecycles Week 5: Data Management & Governance Data Governance • See “What is Data Governance?” by Rand Secure Data (Youtube) • See “What is Data Governance?” by Intricity (Youtube) What is Data Governance? https://www.youtube.com/watch?v=t4IOS5csv40 https://www.youtube.com/watch?v=sHPY8zIhy60 Supporting and handling: • ethics, confidentiality • security • consolidation and quality-assurance (e.g. link all customer related information together) • persistence (backups and recoverability) • regulatory compliance • organisation policy compliance • organisation business outcomes which may include handling the steps in the data science and/or big data value chain Data Governance Data governance and data management are often used to mean each other. 
Better to treat them as separate levels • Data Management is what you do to handle the data o Resources, practises, enacting policies • Data Governance is making sure that it is done appropriately o Policies, training, providing resources o Planning and understanding Governance and management • Must follow laws o Australian Privacy law o Australian medical data regulations o Australian telecommunications act o EU’s General Data Protection Regulations (GDPR) • Must meet (funding) requirements o Australian Research Council (ARC) o National Health and Medical Research Council (NHMRC) • Must be ethical o Don’t be evil! Legal and ethical responsibilities • Confidentiality • Ownership • Copyright • Intellectual property • Licensing Just because a data science project ends, the data curation shouldn’t! Other legal restrictions • Must follow laws o Australian Privacy law o Australian medical data regulations o Australian telecommunications act o EU’s General Data Protection Regulations (GDPR) • Must meet (funding) requirements o Australian Research Council (ARC) o National Health and Medical Research Council (NHMRC) • Must be ethical o Don’t be evil! Legal and ethical responsibilities • Rights for o Privacy o Access o Erasure o and more! • Work with the stakeholders • Be transparent and clear Ethics – doing what is right • Regulations devised by various government bodies: taxation, medical care, securities and investments, work health and safety, employment, corporate law. • They need to check companies for their compliance • Regulatory compliance: that organisations ensure that they are aware of and take steps to comply with relevant laws and regulations. 
• Auditing: systematic and independent examination of books, accounts, documents and vouchers of an organization to ascertain how far they present a true and fair view
• Auditing data and records are a good source for Data Science
Regulations and Compliance
Terminology
For our purposes, we define:
• Privacy as having control over how one shares oneself, e.g., closing the blinds in your living room
• Confidentiality as information privacy, how information about an individual is treated and shared, e.g., excluding others from viewing your search terms or browsing history
• Security as the protection of data, preventing it from being improperly used, e.g., preventing hackers from stealing credit card data
• Ethics as the moral handling of data, e.g., not selling on others’ private data to scammers
• Implicit data as data that is not explicitly stored but can be inferred with reasonable precision from available data, see “Private traits and attributes are predictable ...” http://www.pnas.org/content/110/15/5802
End of Data Governance
Week 5: Data Management & Governance
Stakeholders
• Stakeholders are any parties that have a relationship with a project/policy/product/data. This includes
o the data’s source
o managers
o analysts and users
o IT developers
o data scientists!
Who is responsible?
With great data comes great responsibility for all stakeholders
NIST Reference Architecture showing actors and roles in data management
Actors
End of Stakeholders
Week 5: Data Management & Governance
Data Management Planning
How do you get it all right?
• Policies and laws o rights, Australian privacy principles, EU GDPR • Procedures and practises o access, ownership, security • Planning and training o data management plans, design • Management and capability o technology, staffing • Governance o oversight & review, ethics Getting data governance right A DMP provides • Clarity • Direction • Transparency • Expectations The result is • Improvements to efficiency, protection, quality and exposure • Value • Innovation Data Management Plans - purpose • Backups • Survey of existing data • Data owners & stakeholders • File formats • Metadata • Access and security • Data organisation https://www.ands.org.au/__data/assets/pdf_file/0011/690878/Data-management-plans.pdf Data management plans - content See also DMPTool – http://dmptool.org • Bibliography • Storage • Data sharing, publishing and archiving • Destruction • Responsibilities • Budget https://www.ands.org.au/__data/assets/pdf_file/0011/690878/Data-management-plans.pdf http://dmptool.org/ • The data community has lots of tools and systems available (See also DMPTool – http://dmptool.org ) o Archives to use o Indexes o Metadata standards o Data management tools • Examples o ARDC (formerly ANDS) o Monash! Data management communities http://dmptool.org/ End of Data Management Planning Week 5: Data Management & Governance Data Management Mistakes • Don’t forget data management and governance! o Access & security o Software & hardware o Regulations, ethics & licensing o Stakeholders & transparency Your case study – Assignment 4 • Australian government wanted to double check the incomes of people being paid social welfare payments. • The Online Compliance Intervention system (aka RoboDebt) was set up in 2016 to automatically compare ATO records to Centrelink records. 
o Calculates the benefits that people are entitled to, based on assumptions about their earnings o Debt collection letters if benefits have been overpaid • Problems discovered with a lack of human-in-the-loop for doublechecking o Incorrect/inappropriate calculations o Using out-of-date data o Sending debt notices … Week 4: Statistical Modelling Truth of Fitted Models Learning Outcomes (Week 4) Compared to previous weeks, you will be exposed to more mathematics! • Understand what the maths is trying to do • Understand the concepts involved • Don’t have to remember all the formulas exactly For variables for an individual data case (e.g. a single loan application or a single heart disease patient), the “truth” can be measured directly • Across examples, the “true” model is harder to define: ‣ What is a “true” model of physics? – Newtonian physics, String Theory? • How can you measure the “true” model for the heart disease problem? ‣ collect infinite data and infer statistically ‣ but it's a dynamic problem and general population characteristics always changing • regardless, we assume some underlying “truth” is out there Truth • To evaluate the quality of results derived from learning, we need notions of value • So we will review quality and value. Quality • William Tell was forced to shoot the apple on his son’s head • If he strikes it, he gets both their freedoms William Tell’s Apple Shot • This shows “value” as a function of height • Loss varies depending on where it strikes • How do you compare loss of life versus gain of freedom? William Tell’s Apple Shot (cont.) The boy is smiling! It is hard to find a cartoon with an apple on a boy’s head. 
• May be the quality of your prediction
• May be the consequence of your actions (making a prediction is a kind of action)
• Can be measured on a positive or negative scale
Loss: positive when things are bad, negative (or zero) when they’re good
Gain: positive when things are good, negative when they’re not
Error: measure of “miss”, sometimes a distance, but not a direct measure of quality
Quality
Error measures the distance between the prediction and the actual value.
• “0” means no error, prediction was exactly right
• we can convert error to a measure of quality using a loss function, e.g.,
$\text{square-error}(x) = x \cdot x$
$\text{absolute-error}(x) = |x|$
$\text{hinge-error}(x) = \begin{cases} |x| & \text{if } |x| \le 1 \\ 1 & \text{otherwise} \end{cases}$
Quality is a Function of Error
Data Analysis Algorithms
Regression
From The Elements of Statistical Learning by T. Hastie, R. Tibshirani and J. Friedman
https://www.springer.com/gp/book/9780387848570
• Look for relationships amongst variables
• Identify the relation between salary and experience, education, role, etc.
Real World Example:
What is Regression?
Regression
Variables can be:
• Independent Variables/Inputs/Predictors, e.g., experience, education, role
• Dependent Variables/Outputs/Responses, e.g., salary of employee
Observation is a data point, row, or sample in a dataset
• e.g., an employee's salary, experience, education, role.
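Returning to the loss functions listed under “Quality is a Function of Error”, the sketch below shows one way they might be written in Python. The function names are illustrative, and the hinge-error definition follows the reading of the slide used here (|x| when |x| ≤ 1, otherwise 1), which is an assumption about the original layout.

```python
def square_error(x: float) -> float:
    """Squared error: penalises large misses much more than small ones."""
    return x * x


def absolute_error(x: float) -> float:
    """Absolute error: penalty grows linearly with the size of the miss."""
    return abs(x)


def hinge_error(x: float) -> float:
    """Hinge error as read from the slide (assumed): |x| up to 1, then capped at 1."""
    return abs(x) if abs(x) <= 1 else 1.0


# Compare the three losses on a few example errors (prediction minus actual).
errors = [-2.0, -0.5, 0.0, 0.5, 2.0]
for e in errors:
    print(e, square_error(e), absolute_error(e), hinge_error(e))
```

All three give zero loss for a perfect prediction; they differ in how harshly they treat large misses, which is why the choice of loss function is part of defining “quality”.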
Terminology
• We can measure the strength and direction of the linear relationship of two variables
• The (Pearson product-moment) correlation coefficient is the covariance of the variables divided by the product of their standard deviations
• R or Pearson’s R, when applied to a sample
o R=+1 is total positive linear correlation
o R=0 is no linear correlation
o R=−1 is total negative linear correlation
Correlation coefficient
$R_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
• To determine how multiple variables are related, e.g., determine if and to what extent experience or education impacts salaries
• To predict a value, e.g., predict electricity consumption given the outdoor temperature, time of day, and number of residents in that household
When to Use Regression
Example: Sales ~ TV, Radio, Newspaper
Simple Linear Regression (two-dimensional space):
Regression fits a very simple equation to the data:
$\hat{y}(x; \mathbf{a}) = a_0 + a_1 x$
Here $\hat{y}(x; \mathbf{a})$ is the prediction for y at the point x using the model parameters $\mathbf{a} = (a_0, a_1)$, i.e. the intercept and slope terms.
[Figure: fitted line; x-axis: Independent Variable; y-axis: Dependent Variable; legend: Predicted, Actual; residual marked as Actual − Predicted]
The aim is that the predicted response be as close as possible to the actual response.
Best Fitting Line
• Given some data pairs (x1, y1), ..., (xN, yN), we fit a model by finding the coefficient vector $\mathbf{a}$ that minimises the loss function:
$\sum_{i=1}^{N} (y_i - \hat{y}(x_i; \mathbf{a}))^2$
• Residuals = the distances between the observed values and the predicted values
• Ordinary least squares (OLS) = minimises the sum of squared residuals (SSR)
Calculating Parameters
• If a model fits the data, it should be able to represent its variation.
• Therefore, we may wish to measure how well it fits this variation.
• Explained variation ~ variation in y explained by the model
• Residual variation ~ variation in y unexplained by the model
• Total variation in y = Explained variation + Residual variation
Goodness of fit: fitting variation
Total variation in y = Explained variation + Residual variation
SST = SSE + SSR
$SST = \text{Total sum of squared variation} = \sum (y - \bar{y})^2$
$SSE = \text{Explained sum of squared variation} = \sum (\hat{y} - \bar{y})^2$
$SSR = \text{Residual sum of squared variation} = \sum (y - \hat{y})^2$
Goodness of fit: formulas
The R^2, R2 or R-squared value for a fitted model is a key goodness of fit statistic.
$R^2 = \frac{SSE \text{ (Explained)}}{SST \text{ (Total: } SSE + SSR)}$
R^2 is between 0 and 1
• 1 is good, variability in y is fully explained by the model
• 0 is bad, no variability in y is explained by the model
Goodness of fit: R-squared
[Figures: Goodness of fit: Visualisation (three slides)]
End of Truth of Fitted Models
Week 4: Statistical Modelling
Underfitting and Overfitting
• Assume a polynomial relationship between the inputs and output, e.g., a 10th order (a.k.a. degree) polynomial
Polynomial Regression
What is the best degree? 1, 2 or 3?
Question
Degree: 3
Polynomial Regression
Degree: 20
Is this fit better than previous fits?
Question
• Bayesian information criterion (BIC) includes a penalty for using more variables. The preferred model is the model with the lowest BIC.
• Other similar measures include the adjusted-R^2, which imposes a penalty on additional variables that do not have a significant effect on explaining y. The preferred model is the model with the higher adjusted-R^2.
Bayesian Information Criterion (BIC)
[Figure: Underfitting and Overfitting; panels: Overfitting, Underfitting]
The more parameters a model has, the more complicated a curve it can fit.
• If we don’t have very much data and we try to fit a complicated model to it, the model will make wild predictions.
• This phenomenon is referred to as overfitting
Overfitting
• Small polynomial; cannot fit the data well; said to have high bias
• Large polynomial; can fit the data well; fits the data too well; said to have small bias
• If there is known error in the data, then a close fit is wasted: the 25th degree polynomial does all sorts of wild contortions!
• Poor fit due to high bias is called under-fitting
• Poor fit due to low bias is called overfitting
Overfitting (cont.)
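The effect of polynomial degree on fit can be explored with a small experiment. The sketch below is illustrative only: the sine-shaped "true" function, the noise level, the sample sizes and the degrees 1, 3 and 9 are all assumptions chosen for the demonstration, and NumPy's `Polynomial.fit` stands in for whatever fitting tool is used in practice. It reports R^2 on the training data (computed as 1 − SSR/SST, which equals SSE/SST for a least-squares fit with an intercept) alongside the mean squared error on separate data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_fn(x):
    # Assumed "true" relationship, used only to generate synthetic data.
    return np.sin(2 * np.pi * x)


# Small noisy training set and a separate set held back for evaluation.
x_train = np.linspace(0.0, 1.0, 15)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, x_test.size)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training R^2, held-out MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    sst = np.sum((y_train - y_train.mean()) ** 2)    # total variation
    ssr = np.sum((y_train - model(x_train)) ** 2)    # residual variation
    r2_train = 1.0 - ssr / sst
    mse_test = np.mean((y_test - model(x_test)) ** 2)
    return r2_train, mse_test


results = {degree: fit_and_score(degree) for degree in (1, 3, 9)}
for degree, (r2_train, mse_test) in results.items():
    print(f"degree={degree}  training R^2={r2_train:.3f}  held-out MSE={mse_test:.3f}")
```

Training R^2 can only go up as the degree increases, because each higher-degree model contains the lower-degree ones. That is why a good training fit alone cannot tell us which degree is best, and why the error on data not used for fitting is the more informative number.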
• Split up the data we have into two non-overlapping parts, a training set and a test set
• Do your learning, run your algorithm, build your model using the training set
• Run the evaluation using the test set
• Don’t run the evaluation on the training set
• How big to make the test set?
Training Set and Test Set
End of Underfitting and Overfitting
Week 4: Statistical Modelling
Bias and Variance
Different data sets of size 30.
Bias: measures how much the prediction differs from the desired regression function.
Variance: measures how much the predictions for individual data sets vary around their average.
Bias and Variance
[Figures: Bias vs Variance Trade-off, Scenarios 1 to 3; fits at low, medium and high complexity; MSE (Training Data) and MSE (Testing Data) against complexity, with the optimum degree marked]
Bias vs Variance Trade-off
• Blue line is the true model that generated the data (before noise was added)
• Grey curve is the model fit to 30 data points
• Black curve is the model fit to 90 data points
In general, more data means a better fit (most of the time)
More Data Improves the Fit
MSE decreases as the amount of training data grows
• these plots are called learning curves
• different learning algorithms exhibit different behaviour (rate of decay)
Loss decreases with Training Data
Wolpert and Macready proved: If a [learning] algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.
• There is no universally good machine learning algorithm (when one has finite data)
‣ e.g. Naive Bayesian classification performs well for text classification with smaller data sets
‣ e.g.
linear Support Vector Machines perform well for text classification
No Free Lunch Theorem
End of Bias and Variance
Week 4: Statistical Modelling
Multiple Models and Ensembles
• When the data contains information about many groups, it is not uncommon to fit models of the dependent variable for each group.
‣ e.g., modelling the wind power generated from several wind turbines in a wind farm, modelling the life expectancy of every country
• Allows us to compare and analyse each group individually and as part of the whole.
Multiple models
Multiple models - example
• Suppose you wanted to fit a linear model of the life expectancy for every country in your data.
• Filtering each country one-by-one to fit 142 individual models is not a practical solution.
• Use nest() and map() in R
o Group by country
§ Nested dataframe with new column of country-specific data
o Map lm to each country
o tidy() the lm data into a tibble then unnest it
o Provides slope (gradient) and intercept coefficients for each
o Reorganise for analysis
Fitting multiple models
Clearer, but not clear enough
Multiple fitted models
Plotting coefficients
Goodness of fit - R^2
Fitted models – R^2 <= 0.45
• Given only data, we do not know the truth and can only estimate what may be the “truth”
• An ensemble is a collection of possible/reasonable models
• From this we can understand the variability and range of predictions that is realistic
Ensembles
• Generating an ensemble is a whole statistical subject in itself
• Often we average the predictions over the models in an ensemble to improve performance
Ensembles (cont.)
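As a rough illustration of averaging predictions over an ensemble, the sketch below builds a collection of straight-line models by refitting on bootstrap resamples of one synthetic data set. Bootstrap resampling is only one possible way to generate an ensemble and is an assumption made here; the data, the number of models and the prediction point are likewise made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a noisy straight line (intercept 2.0, slope 0.5 are assumed values).
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

n_models = 200
models = []
for _ in range(n_models):
    # Resample the data with replacement and fit one "reasonable" model to it.
    idx = rng.integers(0, x.size, x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    models.append((intercept, slope))
models = np.array(models)  # shape: (n_models, 2)

# Every model in the ensemble makes its own prediction at a new point.
x_new = 12.0
preds = models[:, 0] + models[:, 1] * x_new

# The average is the ensemble prediction; the spread shows a realistic range.
ensemble_mean = preds.mean()
low, high = np.percentile(preds, [2.5, 97.5])
print(f"ensemble prediction at x={x_new}: {ensemble_mean:.2f} "
      f"(95% of models between {low:.2f} and {high:.2f})")
```

The individual models disagree slightly, and that disagreement is the useful part: it gives a sense of how much the prediction would change if we had collected a different sample.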
Ensembles: Large Data
Ensembles: Small Data
Ensemble of BayesNet Models
End of Multiple Models and Ensembles
Week 4: Statistical Modelling
Segmenting Data
• Sometimes the segmenting of data is because of the context of the data
o Separate sources
o Separate collection circumstances
o Social or physical distinctions
• Sometimes we don’t have pre-determined segments, but we want segmentation
o Some of the data may be similar
o Some of the modelling would be better if it didn’t need to represent all of the data
o Better decision-making if we consider each segment independently
Segmenting data
• Customers are grouped into segments
• Marketing is then specialised to each segment
‣ leads to better marketing
• In healthcare, segments are called cohorts
‣ used for patient management and staff organisation
• But how do you segment the data?
Segmentation
Task: Identifying Customer Segments
Segmentation can be based on different types of attributes
Example segmentation: traditional segmentation in Britain uses class (from The Independent)
http://www.independent.co.uk/news/uk/home-news/britain-now-has-7-social-classes-and-working-class-is-a-dwindling-breed-8557894.html
A segmentation model is a graphical model where
• the cluster variable is unknown, called “latent”
• the cluster variable identifies the segments
• latent means the variable is never observed in the data.
Segmentation
• Relate to relationships between independent and dependent variables
• Use functional families
• Train fitted parameters on existing data to represent those relationships
Linear regression models
• A regression tree is a supervised machine learning algorithm that predicts a continuous-valued response variable by learning decision rules from the predictors (or independent variables).
o Decision tree
• divide the data into subsets of similar values
• estimate the response within each subset.
Regression trees

Rather than using a single function to represent the data, divide the data into similar segments, then make predictions in each segment

Regression trees - Example

• Binary tree structure
• Terminal nodes (or leaf nodes) are where the model prediction is made
• Paths from the root node to a terminal node represent decision rules.
• 6 splits, 7 terminal nodes

Regression trees - structure

1. Identify all possible (n-1) splits of the data.
2. Compute a metric that measures the quality of each possible split.
3. Choose the best split, break the data into two subsets.
4. Repeat steps 1 to 3 on each subset, then continue until a good stopping point is reached.

Choosing segments

• The partitioning is a top-down, greedy approach.
o Start with all data
o Once split, don’t change
• Searches every distinct value of every input predictor to find the predictor/value pair that best splits the data into two subgroups (G1 and G2).
o That is, for the population inside that node, this predictor/value pair improves the chosen criterion (e.g., ANOVA) the most.
• ANOVA criterion = SST − (SSG1 + SSG2)
• $SST = \sum_i (y_i - \bar{y})^2$, the total variation of the dependent variable.
• SSG1 & SSG2 use the SST formula but with the values for the two subgroups created by the partition.

ANOVA

• The resulting tree is easy to understand
• Visualising the tree can reveal crucial information, such as how decision rules are formed, the importance of different predictors and the effect of the splitting points in the predictors.
• It can reveal information about the relationships between variables.
• Very useful for Exploratory Data Analysis (EDA)
• Implicitly performs feature selection as some of the predictors may not be included in the tree.
• Not sensitive to the presence of missing values and outliers.
• No assumptions about the shape and the distribution of the data.
• It can be used to fit non-linear relationships.
Regression trees - Pros • The fit has a high variance meaning small changes in the data set can lead to an entirely different tree. o Overfitting is a problem for decision tree models, but we can adjust the stopping conditions and prune the tree. • Can be inefficient when performing an exhaustive search for the splitting points of continuous numerical predictors. • Greedy algorithms cannot guarantee the return of the globally optimal regression tree. Regression trees - cons • Regression tasks relate to determining quantitative numerical variables based on input variables • Classification tasks about determining a qualitative value (e.g., category or class) based on the input variables • Categorical variables • Nominal data – multiple categories but no ordering, e.g., housing type, postcode, species, countries, phone number • Ordinal data – multiple categories with an order, e.g., education level, salary level, age group Categorical data • For classification task, if we want to use a decision tree, the result is a classification tree. • Most popular split criteria are Gini and Entropy. Classification trees • Can still factor in multiple input variables https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree. jpg Classification tree End of Segmenting Data Week 4: Statistical Modelling Clustering Data • A cluster is • segmented data for analysis • segmented network nodes • segmented data storage • a group of associated computers So clustering = segmentation… sometimes Clustering tends to be associated with segmentation that allows us to recognize similar combinations of attribute values when we don’t have predefined categories. 
Unsupervised machine learning

Clustering and segmentation

• Text documents, e.g., patents, legal cases, webpages, questions and feedback
o Topic modelling
• Clients, e.g., recommendation systems
• Fault detection, e.g., fraud, network security
• Missing data
• A clustering task may require a number of different algorithms/approaches.

Uses of clustering

• Are similar in some attributes
• May consider some attributes to weigh more than others
o Not all attributes are as important as others
o Needs feature selection
• May be considered to be close to each other
o Needs distance measurements

Elements in a cluster

• Distance
o Commonly dealing with data as vectors
o Euclidean distance = vector distance between points, e.g., for A = (1,1) and B = (3,5): $d_{A,B} = \sqrt{(3-1)^2 + (5-1)^2} = \sqrt{20} \approx 4.47$
• Centroid
o Value of a data point in the centre of the cluster
o May be hypothetical and not match any known data
• Nearest neighbor
o A data point that is closest to some reference value, e.g., the centroid or the population of a cluster

Clustering terms

• k-means algorithm
1. Randomly select centroids for K clusters
2. Assign each data point to its nearest centroid to form the cluster populations
3. Find the mean values in each cluster and use them as the new centroid
4. Re-evaluate populations and centroids until stable (convergence)
• Does not work with categorical data and is susceptible to outliers
• Have to predefine a value for K
• No guarantee there are actually clusters to find

Clustering – K-means

• Clusters within clusters!
• Agglomerative (bottom-up) vs Divisive (Top-down) • Agglomerative o Treat each data point as a centroid in a cluster of population 1 o Form new clusters by merging nearby clusters o Continue until only one cluster • Various ways to calculate which clusters should be merged, often looking at (min or max) distances of the clusters’ populations to each other • The results of hierarchical clustering are usually presented in a dendrogram Clustering - Hierarchical Clustering - Hierarchical • Greedy! • Can be costly, due to having to calculate a lot of distances for each level of the tree. • But with no randomness, the same tree will be produced each time. • Can cut the tree at any level so as to get the population of a certain number of clusters. Clustering – Hierarchical • Both segmentation and clustering can help o Model, predict and classify data o Decision-making o Understand the data • But different types of input and outcome data need different types of segmentation Segmentation and clustering End of Clustering Data Week 3: Data Sources and Modelling the Truth Sharing Data • Working together on a project • Common needs, common resource • Data as a product • Data-based service • For research! - Duplication - Verification - Re-use - Promotion - Knowledge! Why sharing data? • Shared data provides opportunities - New combinations of data - New relationships in data - New visualisations of data - New understandings of data - Also creates new data! Opportunities from shared data • Collection: getting the data • Wrangling: data preprocessing, cleaning • Analysis: discovery (learning, visualisation, etc.) 
• Presentation: arguing that results are significant and useful • Engineering: storage and computational resources • Governance: overall management of data • Operationalisation: putting the results to work Our Standard Value Chain • Data that is “freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control” – Wikipedia - Free – accessible, costs nothing - Free – unrestricted usage - Free – simple, non-proprietary format • Commonly associated with open government data Open data From “the New Data Republic: Not Quite a Democracy” in MIT Sloan Review 2015 • from Hal Varian (at Google): “information that once was available to only a select few ... available to everyone” • from Robert Duffner (at Salesforce): “finally puts crucial business information in the hands of those who need it” • government and IT departments building data and infrastructure to allow sharing, e.g. USA Open Gov Initiative • analytic tools, (desktop and web-based), available to analyse it. Democratization of Data http://sloanreview.mit.edu/article/the-new-data-republic-not-quite-a-democracy/ http://www.apple.com/au/ The reports: • “Open data”: Unlocking innovation and performance with liquid information” by MGI, and • “Science as an open enterprise” by the Royal Society (UK) claim that: • open data provides new opportunities for business, new products and services, and can raise productivity • open data supports public understanding and citizen engagement • scientists need to better publicise their data (with help from universities, etc.) 
• industry sectors should work with regulators and coordinate industry collaboration • collaboration across sectors in both public and private settings, • e.g., disaster response, education Open Data Recommendations http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information https://royalsociety.org/~/media/policy/projects/sape/2012-06-20-saoe-summary.pdf The Scientific American report: “What’s Wrong with Open-Data Sites–and How We Can Fix Them” discusses: • its hard to make sense of the huge amount of government data ‣ Data.GOV has 230k datasets, and Data.GOV.AU has 30k • authors developed Data USA What’s Wrong with Open Data Sites https://blogs.scientificamerican.com/guest-blog/what-s-wrong-with-open-data-sites-and-how-we-can-fix-them/ https://datausa.io/ • Publicly available ‣ government and IT departments building data and infrastructure to allow sharing, ‣ e.g., Data.GOV has 230k datasets, and Data.GOV.AU has 30k • Machine-readable? • But.. ‣ it is not always usable ‣ people need the right skills Open Data - Summary From “the New Data Republic: Not Quite a Democracy” in MIT Sloan Review 2015 • from Hal Varian (at Google): “information that once was available to only a select few ... available to everyone” • from Robert Duffner (at Salesforce): “finally puts crucial business information in the hands of those who need it” • government and IT departments building data and infrastructure to allow sharing, e.g. USA Open Gov Initiative • analytic tools, (desktop and web-based), available to analyse it • but people need the right skills! ‣ open data is all good and well, but people need to be able to use it too! 
Democratization of Data http://sloanreview.mit.edu/article/the-new-data-republic-not-quite-a-democracy/ http://www.apple.com/au/ End of Sharing the Data Week 3: Data Sources and Modelling the Truth Utilising Data Sources Where to find and how to use data sources If we want to forecast traffic: blockages, clearing, surprising situations, alternate routes • Critical data: ‣ GPS data on traffic flow ‣ Maps ‣ incidents and events ‣ weather • Challenge: ‣ collect different sources of data image: math.tu-berlin.de We’ll now look at three examples of public data and using data? 1. NYC data 2. Traffic prediction 3. Predictive analytics for banks Three Examples of Using Data NYC embarked on a program in 2011 to make the city’s data accessible: • “How data and open government are transforming NYC”: ‣ “In God We Trust,” tweeted by New York City Mayor Mike Bloomberg, “Everyone else, bring data.” ‣ applications of the data provided: - “real-time updates on your phone based on where the buses are located using very low-cost technologies” - applying predictive analytics to building code violations and housing data to try to understand where potential fire risks might exist • Bloomberg signs NYC 'Open Data Policy' into law, plans web portal for 2018 • NYC Open Data portal • Melbourne has a similar portal: City of Melbourne’s open data platform New York City Data http://radar.oreilly.com/2011/10/data-new-york-city.html http://www.engadget.com/2012/03/12/bloomberg-signs-nyc-open-data-policy-into-law-plans-web-porta/ https://nycopendata.socrata.com/ https://data.melbourne.vic.gov.au/ “How we found the worst place to park in New York City” is examples, and a discussion of the complexities of getting data out of NYC: • Map of road speed by day+time: GPS data for NYC cabs gives; data obtained via FOIL request, then made public by recipient • Danger spots for cycles: NYPD crash data obtained by daily download of PDF files followed by (non-trivial) extraction. 
• Dirty waterways: fecal coliform measurements on waterways from Department of Environmental Protection’s website; extracted from Excel sheets per site; each in a different format • Faulty road markings: parking tickets for fire hydrants by location from NYC Open Data portal need to normalize the addresses supplied NYC Data - Using it! http://www.ted.com/talks/ben_wellington_how_we_found_the_worst_place_to_park_in_new_york_city_using_big_data/transcript?language=en http://iquantny.tumblr.com/post/93845043909/quantifying-the-best-and-worst-times-of-day-to-hit http://www.reddit.com/r/bigquery/comments/28ialf/173_million_2013_nyc_taxi_rides_shared_on_bigquery http://iquantny.tumblr.com/post/77977436883/the-terrifying-cycling-injury-map-of-nyc-2013 http://www1.nyc.gov/site/nypd/stats/traffic-data/traffic-data-collision.page http://iquantny.tumblr.com/post/97788820249/fecal-map-nyc-the-worst-places-to-swim-in-the https://data.cityofnewyork.us/Environment/Watershed-Water-Quality-Data/y43c-5n92 http://iquantny.tumblr.com/post/83696310037/meet-the-fire-hydrant-that-unfairly-nets-nyc https://nycopendata.socrata.com/ Back in 2008, Microsoft Introduced a Tool for Avoiding Traffic Jams The system was called Clearflow: • Aims to forecast traffic: blockages, clearing, surprising situations, etc. 
• and to suggest alternate routes • critical data use to build the application included: ‣ GPS data on traffic flow ‣ maps ‣ incidents and events ‣ weather • See Eric Horvitz’s discussion of system: “Data, Predictions, and Decisions in Support of People and Society” (skip to 7:40-11:06) Traffic Prediction http://www.nytimes.com/2008/04/10/technology/10maps.html?_r=0 http://videolectures.net/kdd2014_horvitz_people_society/ See this video of a seminar on “Predictive Analytics with Fine- grained Behavior Data” • by Foster Provost (Professor at NYU and author of this book) presented at Stata+Hadoop in 2013 • describes customer prediction problem for banking products He discusses about whether bigger data is “always” better. So is big data better? • His answer is that it’s not always (much) better. • But that big data can certainly be better if the data is richer and more fine-grained. Predictive Analytics for Banks https://www.youtube.com/watch?v=1jzMiAfLH2c http://conferences.oreilly.com/strata/stratany2013/public/schedule/detail/31685 http://data-science-for-biz.com/ What lessons have we learnt from these “data” examples? • NYC data • data requires work to clean up • be creative about sources • Traffic prediction • combine many sources • you might have to generate some of your own • Predictive analytics for banks • fine-grained data really helps, but is harder to use Lessons Learnt from the examples Many companies are exposing their data and their website functionality as APIs (Application Programming Interfaces) for others to make use of: • Facebook API • Twitter API e.g. search tweets • LinkedIn API • Google Maps API • Youtube API e.g. 
documentation • Amazon Advertising API • TripAdvisor API • New York Times API Example Data/Information APIs https://developers.facebook.com/products/ https://dev.twitter.com/rest/public https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html http://www.programmableweb.com/api/linkedin https://developers.google.com/maps/ https://developers.google.com/youtube/ https://developers.google.com/youtube/v3/getting-started https://advertising.amazon.com/API https://developer-tripadvisor.com/content-api/ http://developer.nytimes.com/ Twitter is the most famous microblogging platform • with big corporate use • contains lots of metadata: information about users, their follower network, locations, hashtags, emojis+emoticons, … Twitter Sample Twitter XML Data See Twitter’s developer platform • library interfaces for Java, C++, Javascript, Python, Perl, PHP, Ruby, ... • allows other applications to manage Twitter data for users • extensive developer policy • see search API doc • lots of example case studies Twitter Developer API https://dev.twitter.com/ https://developer.twitter.com/en/docs/tweets/search/overview/standard.html https://dev.twitter.com/resources/case-studies End of Utilising the Data Week 3: Data Sources and Modelling the Truth Joining data • Tabular data o Tables - Rows: information about an object - Columns: attributes of the object o Relational database • Graph data o Nodes: entities o Edges: relationships between entities o Graph database Relationships in data • For data sets to be joined, they must have something in common. 
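A small sketch of joining on a common attribute, using the Product/User and User/Contact tables from the join slides in this section. The function names are hypothetical, and plain Python lists and dicts stand in for database tables; a missing value is shown as None.

```python
# Set A: (Product, User) rows; Set B: User -> Contact
purchases = [("Pen", "Alec"), ("Book", "Huang"), ("Table", "Indira"),
             ("Pen", "Indira"), ("Chair", "Blythe"), ("Pen", "Huang")]
contacts = {"Stef": "733 486", "Indira": "989 6732", "Boris": "939 3872",
            "Frances": "345 7239", "Miguel": "125 8369", "Huang": "934 3482"}

def inner_join(purchases, contacts):
    """Keep only rows whose User appears in both sets."""
    return [(p, u, contacts[u]) for p, u in purchases if u in contacts]

def left_join(purchases, contacts):
    """Keep every row of Set A; Contact is None when Set B has no match."""
    return [(p, u, contacts.get(u)) for p, u in purchases]

def full_outer_join(purchases, contacts):
    """Keep everything from both sets, filling the gaps with None."""
    rows = left_join(purchases, contacts)
    users_in_a = {u for _, u in purchases}
    rows += [(None, u, c) for u, c in contacts.items() if u not in users_in_a]
    return rows

inner = inner_join(purchases, contacts)
left = left_join(purchases, contacts)
full = full_outer_join(purchases, contacts)
for row in full:
    print(row)
```

The common attribute here is User; without it, none of the joins would be possible.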
Joining data sets

Input tables (the same for each join below):

Set A
Product  User
Pen      Alec
Book     Huang
Table    Indira
Pen      Indira
Chair    Blythe
Pen      Huang

Set B
User     Contact
Stef     733 486
Indira   989 6732
Boris    939 3872
Frances  345 7239
Miguel   125 8369
Huang    934 3482

All data from both sets

Product  User     Contact
Pen      Alec
Book     Huang    934 3482
Table    Indira   989 6732
Pen      Indira   989 6732
Chair    Blythe
Pen      Huang    934 3482
         Stef     733 486
         Boris    939 3872
         Frances  345 7239
         Miguel   125 8369

Full (outer) join

Just records that link both sets

Product  User     Contact
Book     Huang    934 3482
Table    Indira   989 6732
Pen      Indira   989 6732
Pen      Huang    934 3482

Inner join

All data from A with linked data from B

Product  User     Contact
Pen      Alec
Book     Huang    934 3482
Table    Indira   989 6732
Pen      Indira   989 6732
Chair    Blythe
Pen      Huang    934 3482

Left (outer) join

All data from B with linked data from A

Product  User     Contact
Book     Huang    934 3482
Table    Indira   989 6732
Pen      Indira   989 6732
Pen      Huang    934 3482
         Stef     733 486
         Boris    939 3872
         Frances  345 7239
         Miguel   125 8369

Right (outer) join

• Can be temporary
- Just for the current analysis
• Can be permanent
- Store the combined data
• Can have conditions
- Can you share the combined data?
• Can be costly - Memory - Processing time & capacity ‣ joining ‣ searching ‣ analysing Joining data sets End of Joining Data Week 3: Data Sources and Modelling the Truth Standardising data • If you standardise things, you can be more efficient - Efficiency lowers costs • So how can you standardise data? • What role do data scientists and data science play in standarising things related to data? Setting the standards Geospatial Data Linked Open Data: DBpedia Linked Open Data: XML Transactional Data Twitter Data Internet of Things Data • Data is about a variety of things - (geo)spatial data - transactional data - linked (open) data - social media data - Internet of Things (IoT) • Data comes in a variety of formats - Ascii/text format (+ Unicode!) - Word or Excel or Pdf format - Comma separated values (CSV) - JSON format - HTML or XML format Data types and formats • Machine-readable data: data (or metadata) which is in a format that can be understood by a computer, e.g., XML, JSON • Markup language: system for annotating a document in a way that is syntactically distinguishable from the text e.g., Markdown, Javadoc • Digital container: file format whose specification describes how different elements of data and metadata coexist in a computer file, e.g., MPEG Data formats: key concepts End of Standardising Data Week 3: Data Sources and Modelling the Truth Metadata Metadata: structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource. Metadata is: • data about data • structured so that a computer can process & interpret it MetaData MetaData can be: • Descriptive: describes content for identification and retrieval, e.g. title, author of a book • Structural: documents relationships and links, e.g. chapters in a book, elements in XML, containers in MPEG • Administrative: helps to manage information, e.g. version number, archiving date, Digital Rights Management (DRM) MetaData (cont.) 
• Facilitate data discovery • Help users determine the applicability of the data • Enable interpretation and reuse • Clarify ownership and restrictions on reuse Why Use Metadata EXIF Metadata Book Metadata Media Metadata • IPTC Photo Metadata User Guide • USGS Metadata standards • DCC list of Metadata standards • Medical bibliographic data in XML on PubMed • Registry Interchange Format - Collections and Services (RIF-CS) Other Metadata Examples https://www.iptc.org/std/photometadata/documentation/userguide/ https://www.usgs.gov/products/data-and-tools/data-management/metadata-creation https://www.dcc.ac.uk/guidance/standards/metadata/list https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/190101/index.html https://www.ands.org.au/online-services/rif-cs-schema • Metadata helps set standards • Metadata should also be standardised - Archiving data - Sharing data - Searching data Standards and metadata End of Metadata Week 3: Data Sources and Modelling the Truth Standardising handling data Examples of standards: • Metadata standards, such as Dublin Core, examples at A Gentle Introduction to Metadata • XML formats for sharing models, e.g. PMML (see below) • Standard vocabularies for use in Medicine, e.g., ‣ health codes: disease and health problem codings ICD-10 ‣ systematized nomenclature of medicine, clinical terms, SNoMed-CT • Standards for describing the data mining/science process, such as CRISP-DM Example Standards http://dublincore.org/documents/dces http://www.language-archives.org/documents/gentle-intro.html https://en.wikipedia.org/wiki/ICD-10 http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining We’ve seen many data science processes and lifecycles: • e.g. 
our own “standard Data Science value chain”
• CRISP-DM, discussed previously, is a standardised data science process
• statisticians sometimes use the term exploratory data analysis for part of the process

Data Science Process

https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

Semi-structured data is data that is presented in XML or JSON:
• see some examples here
• Note YAML (originally “Yet Another Markup Language”), an indentation-based format that is easier to read than JSON and can represent the same data
• standard libraries for reading/writing/manipulating semi-structured data exist in Python, Perl, Java
• don’t need to know all the details of XML (and related Schema languages); there are many good online tutorials, e.g. W3schools.com
• their use in systems leads to the open world assumption about data, where we may download relevant data on the fly from APIs etc.

Semi-Structured Data

https://en.wikipedia.org/wiki/JSON
http://www.w3schools.com/

PMML: Predictive Model Markup Language

PMML provides a standard language for describing a (predictive) model that can be passed between analytic software (e.g. from R to SAS).
• PMML: An Open Standard for Sharing Models
• A list of products working with PMML is on the PMML Powered page of the DMG site.

Model Language

http://journal.r-project.org/archive/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf
http://www.dmg.org/products.html

PMML Example

End of Standardising Handling Data

Week 3: Data Sources and Modelling the Truth

Scripting

• A script is a series of commands to be performed
• A script is executable on demand
- not compiled to an executable form
- interpreted command-by-command as it is executed, like on a command line
• Examples:
- R
- Python
- Unix shell

Introduction to scripting languages

See “In data science, the R language is swallowing Python” by Matt Asay.

                        Python                                     R
Free or not?            Yes                                        Yes
Developed by whom?      Computer scientists (for general use)      Statisticians (huge support for analysis)
Characteristics         Better in integrating with other systems   Better for stand-alone analysis and exploration
Easy to learn/extend    Python > R
Scalability             Python > R

Discussion: Python vs R

https://www.infoworld.com/article/2951779/in-data-science-the-r-language-is-swallowing-python.html

• Command-line code for Unix (+ Linux & Mac OS)
• Commonly include:

– Wildcards: *, ?
e.g., ./Customer??Loc*.txt

– Piping, | : output from one command streams as input to
another
e.g., cat product*v1.txt | sort

– / in filepaths, not \
– ; to separate commands
– < and > to redirect input and output (>> appends to a file)

e.g. cat product*v1.txt > contents

Unix Shell script
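To see what the wildcard pattern above would match, here is a small sketch using Python's fnmatch module, which follows the same * and ? rules as the shell. The filenames are hypothetical, and the leading ./ from the slide's pattern is dropped because only names are matched here.

```python
from fnmatch import fnmatchcase

# ? matches exactly one character, * matches any run of characters
pattern = "Customer??Loc*.txt"

# Hypothetical filenames
filenames = [
    "Customer01LocSydney.txt",  # two characters, then Loc, then anything
    "Customer42Loc.txt",        # * may match nothing at all
    "Customer7LocPerth.txt",    # only one character where ?? needs two
    "Customer01LocSydney.csv",  # wrong extension
    "customer01LocSydney.txt",  # case differs
]

matches = [name for name in filenames if fnmatchcase(name, pattern)]
print(matches)
```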

• pwd: path of current directory
• cd DIRPATH: change directory to DIRPATH
• ls DIRPATH: output the filenames of DIRPATH
• cp FILENAME NEWFILENAME: copy FILENAME to NEWFILENAME
• mv FILENAME NEWFILENAME: rename FILENAME to NEWFILENAME

• echo "TEXT": output TEXT
• cat FILENAME: output the contents of FILENAME
• less FILENAME: output the contents of FILENAME, one screen at a time (can page up and down)

e.g., cd DATA/; cat product*v1.txt| sort > contents

Unix commands

• wc FILENAME: count the number of lines, words and characters in FILENAME

• grep "PATTERN" FILENAME: output any lines in FILENAME that match PATTERN
e.g., grep "Australia" product*v1.txt
grep "^[0-9]" product*v1.txt

• head FILENAME: output the first lines of FILENAME
• tail FILENAME: output the last lines of FILENAME
• awk: process text files in various ways, including search and replace

cf. sed, perl
• uniq: remove duplicates from the input (presumes it is sorted)

• diff: find similarities and differences between two files
cf. test

Unix commands

• man COMMAND: output user manual pages for
COMMAND

• COMMAND -?: output shorter help pages for COMMAND

• COMMAND --help: ditto

Unix help and arguments

• Piping shell commands buffers their execution

– Don’t try to do everything at once, just enough for the
next command

– Tend to work through text files line by line

– Allows different commands to be working on different
parts of the data

– Scales up well for big files!

Ø Reduces the memory overload

Shell scripts and big data
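The same line-by-line behaviour can be imitated with Python generators. This is only a sketch of the idea, not how the shell is implemented; the function names mirror the Unix commands and the in-memory "file" contents are made up.

```python
import io
from itertools import islice

def read_lines(stream):
    """Yield one line at a time, like `cat` feeding a pipe."""
    for line in stream:
        yield line.rstrip("\n")

def grep(pattern, lines):
    """Pass on only the lines containing `pattern`."""
    for line in lines:
        if pattern in line:
            yield line

def head(n, lines):
    """Stop after the first n lines."""
    return islice(lines, n)

# Stand-in for a large text file (made-up contents)
fake_file = io.StringIO(
    "pen,Australia\nbook,Japan\nchair,Australia\ntable,Australia\n"
)

# Equivalent in spirit to: cat FILE | grep Australia | head -2
pipeline = head(2, grep("Australia", read_lines(fake_file)))
result = list(pipeline)
print(result)
```

Each stage holds only the current line, so memory use stays small however large the input is.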

• Ideally, the software used for data should be
standardised, just like data should be standardised.
– Consistency
– Capabilities
– Reproducibility

• However, just like data varies, so can the needs of
software
– Published software doesn’t always meet the needs
– Rapid prototyping is what scripting languages are ideal for!

Standardising software

• Need to standardise how data is accessed
• Need to be able to reproduce

– Wrangling

– Analysis

– All other stages of the value chain!

• Scripting allows these to be recorded
• Scripting allows these to be shared

• Scripting allows these to be modified

Standardising workflow

• It is also vital to understand why certain steps are used
– Why was the wrangling done

– What was the analysis for

• The context of working with data also needs to be
recorded

Standardising processes

• So how can you standardise data?
– Access
– Format
– Value & vocabulary
– Metadata
– Software & tools
– Process & workflow

• What role do data scientists and data science play in
standardising things related to data?
– Establishing the standards
– Enacting the standards

Setting the standards

End of
Scripting

Week 3:
Data Sources and
Modelling the Truth

Temporal data

• Data indexed with time!

• Data indexed with dates!

• Data about change, transformation and occurrences!

• Time series data

Temporal data

• The temporal aspect of the data can be of different types

§ Specific

o e.g., 20 July 1969 – Man landed on the moon
o e.g., 3:00:00 am, 4 April 2021 – Daylight saving ended in Victoria

§ Relative

o To a time, e.g., 4 weeks ago – Week 1!

o To each other: Ordinal data that has a temporal
progression, e.g., Stages of an insect’s lifecycle

Temporal context

• Date

o Day of the week – Monday, Mon., M

o Day number – 1, 1st , first

o Month – January, Jan, J, 1

o Year – 2020

• Time

o Hour – 1, one, 13

o Minute – 15, quarter past, o’clock

o Second – 20

o Period – am, pm, AM, PM

Temporal phrases – form

• Date

o 20 Jan 2020, Jan 20

o 20th of January, 2020,
twentieth of January

o 20/01/2020, 1/20/2020, 20:01:2020,
2020-01-20

• Time

o 1 PM
o 13:00, 13.00

o One o’clock

Temporal phrases – syntax

• Era: AD 2020, 20 Jan 2020 CE

• Calendar: Lunar, Hebrew, Chinese, etc.

• Time zone: 1pm AEST, 1pm UTC+10:00

• Submultiples: 13:00.001

• Years do not have the same number of days

• Months have different numbers of days

• It can be difficult to identify the day of the week, day of
the month and week in the year

• Years and months start on different days

• Even specific time phrases can be very complicated to
parse!

Temporal phrases – more

• Decompose once, use often

• Not all elements may be important to you

• Numbers are easier to work with than words

• If the syntax is too complicated, decompose it first into parts that are known, e.g., 03:45 GMT+10, Monday 2 Jan. 2020

Extracting temporal elements
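A minimal sketch of "decompose once, use often", assuming English month names and day-first slashed dates. The format list and the helper names parse_date and decompose are illustrative choices, not a complete parser.

```python
from datetime import datetime

# Candidate syntaxes, tried in order. Day-first is assumed for slashed
# dates; a month-first phrase such as 1/20/2020 would need its own rule.
FORMATS = ["%d %b %Y", "%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"]

def parse_date(text):
    """Return a datetime for the first format that fits the phrase."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date syntax: {text!r}")

def decompose(dt):
    """Split a date into numeric elements that are easy to reuse."""
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "weekday": dt.isoweekday(),   # 1 = Monday ... 7 = Sunday
        "week": dt.isocalendar()[1],  # ISO week number in the year
    }

for phrase in ["20 Jan 2020", "20/01/2020", "2020-01-20", "20 January 2020"]:
    print(phrase, "->", decompose(parse_date(phrase)))
```

All four phrases describe the same day, so they decompose to the same numbers.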

• If temporal elements need to be used together

o Convert once, use often

o Be consistent

o Be regular

Standardise!

Converting temporal elements
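One way to "convert once, use often" is to store every timestamp in a single standard form, such as ISO 8601 in UTC. The sketch below assumes a fixed UTC+10:00 offset for AEST and does not handle daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for illustration only; it ignores daylight saving changes.
AEST = timezone(timedelta(hours=10), "AEST")

local = datetime(2020, 1, 20, 13, 0, tzinfo=AEST)  # 1pm AEST, 20 Jan 2020
as_utc = local.astimezone(timezone.utc)

print(local.isoformat())   # 2020-01-20T13:00:00+10:00
print(as_utc.isoformat())  # 2020-01-20T03:00:00+00:00
```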

• Time is not decimal

• Months and years have different numbers of days

• Be careful how you compare time elements

• Socially, not all time periods are the same

o weekends

o holidays

o pay periods

Context!

Counting time
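A few of these pitfalls, sketched with Python's standard library. The working_days helper is a hypothetical name, and it ignores public holidays, which would need a separate calendar.

```python
import calendar
from datetime import date, timedelta

# Time is not decimal: 1.5 hours is 1 h 30 min, not 1 h 50 min.
ninety_minutes = timedelta(hours=1.5)
print(ninety_minutes)  # 1:30:00

# Months and years have different numbers of days.
days_feb_2020 = calendar.monthrange(2020, 2)[1]  # leap year
days_feb_2021 = calendar.monthrange(2021, 2)[1]

# Compare whole dates rather than their separate parts.
gap = date(2020, 3, 1) - date(2020, 2, 1)

def working_days(start, end):
    """Count Monday-Friday dates in the half-open range [start, end)."""
    count = 0
    day = start
    while day < end:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        day += timedelta(days=1)
    return count

week = working_days(date(2020, 1, 20), date(2020, 1, 27))
print(days_feb_2020, days_feb_2021, gap.days, week)
```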

• Do you want to show

§ Distinct events?

o Irregularities

o Changes

§ Connections and trends?

o Seasonal

o Regularities

o Variance of values

Plotting temporal elements

• Distribution of plots

o Hour

o Day

o Month

o Year

Plotting temporal elements

• Line plots

o Highlight temporal continuity within and between time
periods

Plotting temporal elements

• Calendar plots

o Helps visually identify irregularities

Plotting temporal elements

• Other plots?

o Bar charts?

o Pie charts?

o Rose/polar chart?

Plotting temporal elements

End of
Temporal Data

Week 3:
Data Sources and
Modelling the Truth

Correlation vs Causation

• Models represent aspects of a scenario to help us
understand it.

• Statistical models represent the relationships between
variables

o Independent variable(s)
o Dependent variable

• A model can be used to make predictions about the dependent variable, given information about the independent variable(s)

• Rather than trying to use all data about the scenario, the
model just reduces the data set to a low dimensional
summary.

Statistical Modelling
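As a small illustration of a "low dimensional summary", the sketch below reduces five made-up observations to two numbers, an intercept and a slope, and then uses them to predict. The variable names and data are invented for the example, and a straight line is only one possible model family.

```python
def fit_linear_model(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: independent variable x, dependent variable y
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]

intercept, slope = fit_linear_model(x, y)

def predict(new_x):
    """Predict the dependent variable from the independent variable."""
    return intercept + slope * new_x

print(f"y = {intercept:.1f} + {slope:.1f} * x; prediction at x=6: {predict(6):.1f}")
```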

Variables

• Dependent variable
o Outcome variable

o Explained variable

o Response variable

o Target variable

o Predicted variable

o Regressand

• Independent variable
o Input variable

o Explanatory variable

o Control variable

o Predictor variable

o Regressor

CO2 levels, based on data recorded at the South Pole

Modelling data – example

from the BackReaction blog by Sabine Hossenfelder

Modelling

http://backreaction.blogspot.com.au/2008/04/emergence-and-reductionism.html

• “All models are wrong, but some are useful”… George Box

• “The approximate nature of the model must always be borne in mind”… George Box

• “The purpose of models is not to fit the data but to sharpen the questions”… Samuel Karlin

Do Models need to be truthful?

• Variables in a scenario have relationships

• Some variables influence the outcome of activities and thus other variables.

o Dependent vs independent

• Influence diagrams model that dependency

• It is not always easy to recognise what influences what

Influence diagrams

• Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.

– Australian Bureau of Statistics

Causation

Correlation

• Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable.

– Australian Bureau of Statistics

Correlation & Causation

• Causation implies Correlation (normally);

BUT Correlation does not imply Causation

• We can measure the correlation (next week!)
but that measurement does not tell us anything about the

causation.

• Identifying causation requires controlled experiments that
examine the data related to a situation with or without a

possible correlated variable.
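
A small simulation can show how correlation appears without causation. In this sketch, two hypothetical variables (ice_cream and sunburn, names invented for illustration) are both driven by a third variable, temperature, and never influence each other. It is an illustration on simulated data, not a controlled experiment.

```r
# Simulated data (illustrative only): temperature is a hidden common cause.
set.seed(1)
n <- 1000
temperature <- rnorm(n, mean = 25, sd = 5)
ice_cream   <- 2 * temperature + rnorm(n, sd = 5)    # depends on temperature only
sunburn     <- 0.5 * temperature + rnorm(n, sd = 2)  # depends on temperature only

# The two outcomes are clearly correlated, although neither causes the other.
r_raw <- cor(ice_cream, sunburn)

# Removing the effect of temperature from both (correlating the residuals)
# makes most of that correlation disappear.
r_adjusted <- cor(resid(lm(ice_cream ~ temperature)),
                  resid(lm(sunburn ~ temperature)))

c(raw = r_raw, adjusted = r_adjusted)
```

The raw correlation is strong, while the adjusted one is close to zero. The measured correlation alone could not tell us which situation we are in.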

• Scientific hypothesis

A causes B

• Correlation
There is a relationship between A and B, but that alone does not
show that either causes the other.

• Can we model B, using A?

• e.g., …

FIT5145
Introduction to Data Science

Dr Michael Niemann
Faculty of Information Technology

Slides credit to Prof. Wray Buntine
and Dr. Guanliang Chen

About This Unit

Why this unit?

Data Science is fast developing:

• every academic & industry community wants to claim credit
• huge community of (self proclaimed) “leading international

experts”, “highly sought-after consultants”, and “thought

leaders” to confuse you with advice, blogs, guidelines, …

• huge growth in software and services

We will try to cover the full extent of what makes Data

Science:

• background and context
• leading review articles, lectures, introductions
• academic surveys and national programmes

This is a “grand tour” unit – breadth, not depth.

Meet the Teaching Team

• Dr. Guanliang Chen: Chief Examiner & Lecturer
• Dr. Michael Niemann: Lecturer
• Dr. Saher Manaseer: Admin Tutor
• Dr. Heshan Kumarage: Admin Tutor
• Dr. Chang Joo Yun (Chris): Tutor
• Jeffery Liu: Tutor
• Dr. Jesmin Nahar: Workshop TA
• Mohit Gupta: Workshop TA
• Dr. Han Phan: Workshop TA
• Dr. Tam Vo: Workshop TA
• Yi Wei Zhong: Workshop TA

Contacts

1. Ask questions whenever you need

2. Check the Ed platform forums on Moodle

• Click the link on Moodle to enrol in Ed
• Please do NOT post your solutions to assignments

3. Attend the consultation sessions

• Consultation Times on Moodle

Prerequisites
You will need:

• high school level of mathematics and statistics
• basic programming and database skills
• a “critical mindset”:

• you will read/view a variety of materials
• different levels of quality and standards
• some sales, some educational, some journalistic

• basic exposure to information technology and internet
businesses:

• software, science, and business computing
• Amazon, Google, Twitter, …

Unit Schedule
Week | Topics | Deadlines | Comments
WEEK 1 (6-12 Dec) | Overview of Data Science | – |
WEEK 2 (13 Dec-21 Dec) | Dimensions of Data Science projects and Big Data | Case Study Proposal; Quiz |
Xmas/New Years break | – | – | University shut down
WEEK 3 (4-9 Jan) | Data sources and modelling the truth | – |
WEEK 4 (10-16 Jan) | Statistical modelling | Coding Task I – R |
WEEK 5 (17-23 Jan) | Data management & governance | Coding Task II – scripting |
WEEK 6 (24-30 Jan) | Issues with Data Science | Case Study report |

https://lms.monash.edu/course/view.php?id=132993&section=6

https://lms.monash.edu/course/view.php?id=132993&section=7

https://lms.monash.edu/course/view.php?id=132993&section=8

https://lms.monash.edu/course/view.php?id=132993&section=9

https://lms.monash.edu/course/view.php?id=132993&section=9

https://lms.monash.edu/course/view.php?id=132993&section=10

https://lms.monash.edu/course/view.php?id=132993&section=11

Weekly classes

Each week, this unit will contain three teaching sessions:
1. Pre-recorded Lectures (available late Monday/Tuesday morning)

See the link/s in the relevant week’s section on Moodle

2. 2 hour Interactive Hybrid Workshop (Wednesday 2-4pm)
See the details in the Class Streaming section. These will be
recorded.

3. 2 hour Tutorial (Thursday or Friday)
See the details in the Class Streaming section. These will not be
recorded.

Students must view the lectures before the Workshop and Tutorials
each week.

https://lms.monash.edu/course/view.php?id=132993&section=1


Instructions on setting up your own laptop/desktop are

on Moodle in the sections Unit Information and Week 0.

• You need at least access to
• R and R Studio

• These can be accessed via
• Anaconda – a programming package that also includes

Python Notebooks

• MoVE – a Monash virtual desktop environment that
simulates you being in a campus lab

Technical requirements

Resources for this Unit

• Lectures, Workshops & Tutorials
• Moodle: unit information, assessments, discussion

forum, etc.

• Alexandria: an online textbook, which contains lots
of useful exercises and resources.

• Additional textbook: The Art of Data Science by
Peng & Matsui (http://leanpub.com/artofdatascience)

• Please notice that:
• Library services available
• Special consideration policies
• Disability Support Services (DSS) available

https://www.alexandriarepository.org/syllabus/introduction-to-data-science/

http://leanpub.com/artofdatascience


Warning

• The Alexandria textbook links to a LOT of content:
• videos, blogs, articles, …
• there is way too much for you to read it all in detail
• Focus on the details when you need something for

assessment (or want for your own development)
• Very importantly, use the guide for what to read

– the double “Johnny look it up” icon
• The Microcredential Steps

• Slightly edited versions of the microcredential’s
webpages

• You don’t need to do the activities in the Steps
• Don’t forget the Other Readings and videos

Assessments
Assessment Task | Weight | Due Date | Description
Quiz | 10% | End of Week 2 (Mon 20-Tues 21 Dec) | Multiple-choice Questions, Short-answer Questions
Assignment 1: Coding task I – R | 20% | Start of Week 4 (Mon 10 Jan), 11:55 PM | Data Analysis with R
Assignments 2 & 4: Business & Data Case Study | 2% | End of Week 2 (Mon 20 Dec), 11:55 PM | Assignment 2: Propose a Data Science Project
(as above) | 18% | End of Week 6 (Mon 31 Jan), 11:55 PM | Assignment 4: Report on the Data Science Project
Assignment 3: Coding task II – Shell | 10% | End of Week 5 (Fri 21 Jan), 11:55 PM | Data Analysis with Tools and Shell Scripting
Scheduled Final Assessment | 40% | To be announced (after Week 6) | Multiple-choice Questions, Short-answer Questions, and Longer-answer Questions

Three key questions for you

1. What is the problem to be solved?

2. What data is necessary to solve the problem?

3. What Data Science techniques can be used to

make use of the data?

Getting Started

• Set up your computer
• Work through the materials in Week 1 to familiarise

yourself with R, R-Studio and R Markdown

• Each week, please
• Watch the lectures, attend the live-streamed workshop

& read background materials between classes

• Prepare for and attend the live tutorials
• Check out any additional activities for the week on

Moodle

• Complete the readings for Week 1

You’re More than a Knowledge Worker

• As a knowledge worker:
you’re applying your knowledge to do

non-routine problem solving.

• But now you also have to be a learning worker:
you’re learning new skills as you go,

continually adapting.

https://en.wikipedia.org/wiki/Knowledge_worker

https://www.forbes.com/sites/jacobmorgan/2016/06/07/say-goodbye-to-knowledge-workers-and-welcome-to-learning-workers

End of
About This Unit

Week 1:
Overview of Data Science

Data Science and
the Data Science Process

Question: Who are the Data Scientists?

Person D

Person A

Person C

Person B

Defining Data Science

What is Data Science?
• “data science is what a data scientist does”

– a circular definition!
• “data science is the technology of handling and

extracting value from data”

– less circular and a bit more useful
• “machine learning on big data”

– useful, but too narrow!

Different definitions

Source | Definition

Wikipedia | … is the extraction of knowledge from data, which is a
continuation of the fields of data mining and predictive analytics.

Pivotal | The use of statistical and machine learning techniques on big
multi-structured data in a distributed computing environment
to identify correlations and causal relationships, classify and
predict events, identify patterns and anomalies and infer
probabilities, interest and sentiment.

NIST Big Data Working Group | … is the empirical synthesis of actionable
knowledge from raw data through the complete data lifecycle process.

Journal of Data Science | … is almost everything that has something to do
with data: collecting, analysing, modelling …. yet the most important
part is its applications, all sorts of applications.

The Rise of Big Data
in Foreign Affairs, by Cukier and Mayer-Schoenberger

Data Science interest is related to the arrival of “Big Data”.

• Data collection has changed:
• lots of data, but more messy
• don’t look for perfect models – settle for finding patterns
• examples: Google’s language translation and flu trends

• Datafication:
• taking all aspects of life and turning them into data
• e.g. NYC using big data to improve public services and lower

costs

• The information society has come of age
• and data brokers have started amassing huge data about

individuals: big data could become Big Brother

https://www.foreignaffairs.com/articles/2013-04-03/rise-big-data

Defining Machine Learning

Unlike Data Science, the definition for Machine Learning

is better understood and more agreed upon:

Machine Learning is concerned with the development of

algorithms and techniques that allow computers to learn.

• concerned with building computational artifacts,
i.e., computer programs that can learn, oftentimes

with computational output

• but the underlying theory is statistics

See A Gentle Guide to Machine Learning
https://monkeylearn.com/blog/gentle-guide-to-machine-learning/


Why use Machine Learning?

Machine learning is useful when:

• Human expertise is not available,
e.g., Martian exploration

• Humans cannot explain their
expertise (as a set of rules), or
their explanation is incomplete
and needs tuning, e.g., speech
recognition

• Many solutions need to be
adapted automatically, e.g., user
personalisation.

image src: theconversation.com, medium.com, blog.prioridata.com

Why use Machine Learning?

image src: lifewire.com, clearlyexplained.com, medium.com

Machine learning is useful when:

• Situation changes over time, e.g.,
junk email

• There are large amounts of data,
e.g., discover astronomical
objects

• Humans are expensive to use for
the work, e.g., handwritten zip
code recognition

Data Science Examples: Data on Bushfires


Data Science Examples: Data on COVID-19

https://covid19.who.int/

https://covid19.who.int/explorer

https://covid19-projections.com/italy

https://dataventures.nz/assets/pdf/covid19-2020-april-20.pdf

Some famous data science projects and investigations:

• Google’s spell checker and translation engine

• Amazon.com’s recommendation engine

• Public health: “saturated fat is not bad for you after all”

• Microsoft’s predictive analytics for traffic

Data Science Examples

https://translate.google.com/

http://www.amazon.com/

http://annals.org/article.aspx?articleid=1846638

http://research.microsoft.com/en-us/projects/clearflow/

From Alexandria e-textbook, Section 1.1:

• watch Cukier’s TED talk on “Big Data”

• watch the CERN video “Big Data” from Tim Smith

• read “What is Data Science?” by Mike Loukides of

O’Reilly

Homework

http://ed.ted.com/lessons/exploration-on-the-big-data-frontier-tim-smith

http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf

Data Science Process
1. Pitching ideas 2. Collecting data 3. Monitoring 4. Integration

5. Interpretation 6. Governance 7. Engineering 8. Wrangling

9. Modelling 10. Visualisation 11. Operationalise

Our Standard Value Chain: Parts of a
Data Science Project
from Doing Data Science by Schutt and O’Neil, 2013 (available digitally through library)

Chapter 1 of the book provides the following visualisation
of the standard value chain for a data science project:

http://shop.oreilly.com/product/0636920028529.do

End of
Data Science and
the Data Science Process

Week 1:
Overview of Data Science

Data Scientists

Interpreting Roles in a Project

Following Jeff Hammerbacher’s UC Berkeley 2012 course
notes, we will interpret these four entities:

• business analyst
• programmer
• enterprise
• web company

https://en.wikipedia.org/wiki/Jeff_Hammerbacher

Interpretations: the Business Analyst

Collection: copy and paste into Excel

Engineering: use Excel to store and retrieve

Wrangling: use Excel functions, VBA

Analysis: charts

Interpretations: the Programmer

Collection: web APIs, scraping, database queries

Engineering: flat files

Wrangling: Python and Perl, etc.

Analysis: Matplotlib in Python, R

Interpretations: the Enterprise

Collection: application databases, intranet files, server logs

Engineering: Teradata, Oracle, MS SQL Server

Wrangling: Talend, Informatica

Analysis: Cognos, Business Objects, SAS, SPSS

Interpretations: the Web Company

Collection: application databases, server logs, crawl data

Engineering: Hadoop/Hive, Flume, HBase

Wrangling: Pig, Oozie

Analysis: dashboards, R

A quote from Jason Widjaja in Quora:

• Data analysts are primarily people who develop
insights with data ….

• Data scientists are primarily people who develop
data models and products, that in turn produce
insights …

• Data engineers are primarily people who manage
data infrastructure, automate data processing
and deploy models at scale …

See also
Job Comparison – Data Scientist vs Data Engineer vs Statistician
(https://www.analyticsvidhya.com/blog/2015/10/job-comparison-data-scientist-data-engineer-statistician/)

What is the Difference Between …

https://www.quora.com/Whats-the-difference-between-a-data-scientist-a-data-analyst-and-a-data-engineer

Job Comparison – Data Scientist vs Data Engineer vs Statistician

Data scientist: addresses the data science process to extract
meaning/value from data

Data scientist

Knows about

does

From Doing Data Science by Schutt and O’Neil, 2013

http://shop.oreilly.com/product/0636920028529.do

Knows about

• Chief data scientist: a form of chief scientist who addresses
data management, data engineering and data science goals.

• Chief scientist: corporate position, responsible for science
related aspects of a company/organisation

Chief data scientist

Evaluates

From Doing Data Science by Schutt and O’Neil, 2013

http://shop.oreilly.com/product/0636920028529.do

1. Communication skills are underrated.

2. The biggest challenge for a data analyst is the Collection

and Wrangling steps.

3. A data scientist is better at statistics than a software

engineer and better at software engineering than a

statistician.

4. The data industry is still nascent and the roles less well

defined so you get to interact with many parts of the

company from engineering to business intelligence to

product managers.

5. Keep a curiosity about working with data, a quality as

important as your technical abilities.

Lessons from the DA Handbook

See Udacity on data careers

Data Scientists vs. Data Engineers

https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html

Steinweg-Woods, J. (2018, May 14). What Skills Does a Data Scientist Actually Need? A Guide to the Most
Popular Data Jobs. https://www.datascience.com/blog/guide-to-popular-data-science-jobs

To become a specialist you need:
• solid machine learning and statistics
• related mathematics (1st+2nd year in many degrees)
• solid prototyping (R, Python, Java)
• perhaps Unix experience (Linux, Mac OSX)

See also:
• The infamous Metromap: Becoming a data scientist
• And Modern Data Scientist (previous slide)

This unit provides an introduction and background only.

Career as Data Scientist

http://nirvacana.com/thoughts/becoming-a-data-scientist/

http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined

End of
Data Scientists

Week 1:
Overview of Data Science

History and Impact
of Data Science

Data science is about
• technology for working with data
• processes for working with data
• getting value from data

in a way that is effective and consistent.

What is data science? (Revisiting)

So why is it regarded as something “new”?

Source: https://www.slideshare.net/slideshow/embed_code/36866068

Timeline of Data Science

Data Science emerges around 2000
• data analysis came of age 1990’s
• William Cleveland published in 2001 “Data Science: An

Action Plan for … the field of Statistics”

• data engineering came of age 2000’s (Dot.Com boom)
• (digital) data management came of age 2000’s (Dot.Com

boom)

• the data/information society
• business pressure on decision making
• “data” as a valuable asset
• Dot.Com companies show the way

See also David Donoho’s “50 years of Data Science” (PDF paper)

Evolution of Data Science …

http://onlinelibrary.wiley.com/doi/10.1002/sam.11239/abstract

http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

Hype Cycle in 2014
Can you spot Data Science?

Hype Cycle for Analytics and Business Intelligence in 2019

Relationship of Data Science to Other Disciplines

See Battle of the Data Science Venn
Diagrams for more.

http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html

Related: Data Analysis

Performing analysis and understanding results

• e.g. R, Tableau, Weka, Microsoft Azure Machine
Learning, …

• machine learning, computational statistics,
visualisation, …

Related: Data Engineering

Building scalable systems for storage, processing data

• e.g. Amazon Web Services, Teradata, Hadoop, …
• databases, distributed processing, datalakes, cloud

computing, GPUs, wrangling, …

Related: Data Management

Managing data through its lifecycle

• e.g. ANDS, Talend, Master Data Management, …
• ethics, privacy, provenance, curation, backup,

governance, …

Our personal information is increasingly stored in the cloud:
• social life (Facebook),
• career (LinkedIn),
• search history (Google, etc.),
• health and medical (Fitbit, TBD),
• music (Apple), …

This provides many, many, many advantages:
• e.g. personal agents, computerised support for health, but also

some disadvantages:
• e.g. security and privacy breaches

Your Life on the Cloud

But also some disadvantages:

• corporate leakage to government (security, tax, etc.)
• what if you don’t have rights to access/delete data?
• the department of pre-crime (e.g., having recidivism)
• corporate mergers
• “the science is settled” and government mandates

Your Life on the Cloud (cont.)

The Scientific Method

The Scientific Method

from Wikipedia Scientific method

The End of Theory

Chris Anderson’s blog in Wired 23/05/2008

Science is largely driven by laborious studies to find complex causal
models, sometimes using reductionism. The intent is to find an
explanation that can be used for future prediction.

Chris Anderson (Editor-in-chief of Wired magazine) says:
Google’s founding philosophy is that we don’t know why this page is better

than that one: If the statistics of incoming links say it is, that’s good enough. No

semantic or causal analysis is required.

Petabytes allow us to say: “Correlation is enough.” We can stop looking for
models. We can analyze the data without hypotheses about what it might show. …

The new availability of huge amounts of data […] offers a whole new way of
understanding the world. Correlation supersedes causation, …

NB. When Google is delivering an advert, it doesn’t need to be right, it
just needs a good guess, so causality, models, etc., are not important.

The End of Theory (cont.)

https://en.wikipedia.org/wiki/Reductionism

What is a model?

• A simple model of
population growth:

Logistic growth curve
From Integrating Urban Growth Models, Pearlstine, Mazzotti,
Pearlstine and Mann, 2004

• A complex model of
obesity:

Obesity Systems Map
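
The logistic growth curve mentioned above can be written down in a few lines. This is a minimal sketch using the standard logistic formula; the function name logistic_growth and the parameter values (carrying capacity K, growth rate r, initial population P0) are illustrative assumptions, not taken from the cited paper.

```r
# Logistic growth: population at time t, given a carrying capacity K,
# a growth rate r and an initial population P0.
# All parameter values below are made up for illustration.
logistic_growth <- function(t, K, r, P0) {
  K / (1 + ((K - P0) / P0) * exp(-r * t))
}

t <- 0:100
population <- logistic_growth(t, K = 1000, r = 0.1, P0 = 10)

# The curve starts at P0, grows quickly, then levels off below K.
head(population, 3)
tail(population, 3)
```

Three parameters summarise the whole curve, which is what makes this a simple model compared with the obesity systems map.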

To Understand the Issues …

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/296290/obesity-map-full-hi-res.pdf

Philosopher Massimo Pigliucci says:

But, if we stop looking for models and hypotheses, are we
still really doing science? Science, unlike advertizing, is not about
finding patterns–…–it is about finding explanations for those
patterns.


science advances only if it can provide explanations.

Data scientist Drew Conway says in some areas the data doesn’t
exist.

Statistician Andrew Gelman says:
… you’ll still have to worry about … all the … reasons why

people say things like, “correlation is not causation” and “the
future is different from the past.”

Not The End of Theory

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2711825/

https://en.wikipedia.org/wiki/Drew_Conway

http://andrewgelman.com/2008/06/26/the_end_of_theo/

• Your stomach can be instrumented to assess contents, nutrients, etc.
• Your bloodstream can be instrumented to assess insulin levels, etc.
• Your “health” dashboard can be online and shared by your GP
• Health management organisations (HMO) tying funding levels to

patient care performance

• GP/HMO will know about your ice cream/beer binge last night and you
missing your morning run

Health Care Futurology

See “Big data – 2020 vision” talk by SAP manager John
Schitka

Car Industry Evolution,
1760s – Today = Driven by Innovation + Globalization

From KPCB Internet Trends 2016

Early Innovation (1760s-1900s) = European Inventions
• 1768 = First Self-Propelled Road Vehicle (Cugnot, France)
• 1876 = First 4-stroke cycle engine (Otto, Germany)
• 1886 = First gas-powered, ‘production’ vehicle (Benz, Germany)
• 1888 = First four-wheeled electric car (Flocken, Germany)

Streamlining (1910s-1970s) = American Leadership
• 1910s = Model T / Assembly Line (Ford)
• 1920s-1930s = Car as Status Symbol… Roaring ‘20s / First Motels
• 1950s = Golden Age… Interstate Highway Act (1956)… 8 of Top 10 in
Fortune 500 in Cars or Oil (1960)

Modernization (1970s-2010s) = Going Global / Mass Market
• 1960s = Ralph Nader / Auto Safety
• 1970s = Oil Crisis / Emissions Focus
• 1980s = Japanese Auto Takeover Begins…
• 1990s – 2000s = Industry Consolidation; Asia Rising; USA Hybrid Fail (Prius Rise)
• Late 2000s = Recession / Bankruptcies / Auto Bailouts

Re-Imagining Cars (Today) = USA Rising Again?
• DARPA Challenge (2004, 2005, 2007, 2012, 2013) = Autonomy Inflection Point?

Source: KPCB Green Investing Team, Reilly Brennan (Stanford), Piero Scaruffi, Inventors.About.com, International Energy Agency, Joe DeSousa, Popular Science, Franz Haag, Harry Shipler / Utah State Historical Society, National Archives,
texasescapes.com, Federal Highway Administration, Matthew Brown, Forbes, Grossman Publishers, NY Times, Energy Transition, UVA Miller Center for Public Affairs, The Detroit Bureau, SAIC Motor Corporation, Hyundai Motor Company, Kia Motors,
Toyota Motor Corporation, DARPA, Chris Urmson / Carnegie Mellon,

Car Computing Evolution
Since Pre-1980s = Mechanical / Electrical -> Simple Processors -> Computers

From KPCB Internet Trends 2016

• Pre-1980s = Analog / Mechanical
Used switches / wiring to route feature controls to driver

• 1980s (to Present) = CAN Bus (Integrated Network)
New regulatory standards drove need to monitor emissions in real time,
hence central computer

• 1990s (to Present) = OBD (On-Board Diagnostics) II
Monitor / report engine performance; Required in all USA cars post-1996

• 1990s-2010s = Feature-Built Computing + Early Connectivity
Automatic cruise control… Infotainment… Telematics… GPS / Mapping…

• Today = Complex Computing
Up to 100 Electronic Control Units / car… Multiple bus networks per
car (CAN / LIN / FlexRay / MOST)… Drive by Wire…

• Today = Smart / Connected Cars
Embedded / tethered connectivity… Big Tech = New Tier 1 auto supplier
(CarPlay / Android Auto)…

• Tomorrow = Computers Go Mobile?…
Central hub / decentralized systems? LIDAR… Vehicle-to-Vehicle (V2V) /
Vehicle-to-Infrastructure (V2I) / 5G… Security software…

Image: “The Box” (Brooks & Bone)

Source: KPCB Green Investing Team, Darren Liccardo (DJI); Reilly Brennan (Stanford); Tom Denton, “Automobile Electrical and Electronics Systems, 3rd Edition,” Oxford, UK: Tom Denton, 2004; Samuel
DaCosta, Popular Mechanics, Techmor, US EPA, Elec-Intro.com, Autoweb, General Motors, Garmin, Evaluation Engineering, Digi-Key Electronics, Renesas, Jason Aldag and Jhaan Elker / Washington
Post, James Brooks / Richard Bone, Shareable

End of
History and Impact
of Data Science

Week 1:
Overview of Data Science

Introduction to R

R is a programming language.
• Reproducible
• Adaptable

R was originally created for statisticians.
• Functional
• Specialised

R is great in terms of …
• Large community
• Commonly used for business intelligence

R is powerful in drawing graphs.
• Practical
• Communicative

R: A Powerful Data Science Tool

RStudio is a programming environment for R.

• Helps manage the workflow
• Projects – a filing cabinet for your work!
• Libraries of packages

• Works with R scripts from files and the
command-line

RStudio, a tool for R

To maintain a reproducible workflow, you need to record
what steps you take in a process.

• This is vital when dealing with data

R Markdown is an authoring format (.Rmd files) that enables
us to combine embedded R code with formatted text, so we
can:

• Explain our thoughts and process
• Discuss the coding required
• Present the output of the processing
• Interpret the output
• Allow others to reproduce it all!

R Markdown

R Markdown format

See the activity in Week 1

• Introduction to R Markdown
# Top Heading
## Sub-heading
- List item 1
- List item 2
[Link to Monash](https://my.monash.edu)

```{r}
library(tidyverse)
smaller <- filter(diamonds, carat <= 2.5)
smaller
```

Once you finish writing the content, you can knit the R Markdown
and create the output file.

Using RStudio to knit … this

Knitting R Markdown

Visualisation with R

R uses the grammar of graphics to define how to map variables in data
with plots in a visualisation

• ggplot2: the main package used
• An aesthetic mapping (or variable mapping) tells ggplot() which
variable in your data corresponds to a particular element to be
drawn, e.g., if tb_data is data about cases of tuberculosis,

p <- ggplot(tb_data, aes(x=year, y=count, fill=sex))

Then aes tells ggplot to map
• the year to the x-axis
• the number of cases to the y-axis
• the sex will set the colour for a fill element

• But R also needs to know how to plot the data. You need to tell it
what sort of visualisation you want.
• For instance,

p <- p + geom_bar(stat="identity", position="fill")

will tell it to create a geom, a geometrical shape. The geom_bar
specifically tells it to make a bar chart with 100% fill for which the
values (identity) have already been calculated.

Plotting with R

Facets in R

• Sometimes you then want to divide the data further, mapping
multiple visualisations. For instance, what if you wanted to analyse
the TB data separately for different age groups?
• The facet creates the subplots for each category.

p <- p + facet_grid(~ age_group)

tells R how to present the multiple visualisation plots according to
the age_group in the data.

Once you combine all levels of the instructions

p <- ggplot(tb_au, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(~ age_group)

Visualising it all

• Not all data can be used straight away
• Not all data is clean and tidy
• We need to wrangle the data into shape!

Data Wrangling is the process of transforming “raw” …

Week 2:
Dimensions of Data Science projects
and Big Data

Growth Laws

Explanations about change in IT and society

• Moore’s Law
• Koomey’s Law
• Bell’s Law
• Zimmermann’s Law

Growth laws

Moore’s Law

• Number of transistors per chip doubles every 2 years (starting
from 1975)
• Transistor count translates to:
• more memory
• bigger CPUs
• faster memory, CPUs (smaller==faster)
• Pace currently slowing

Moore’s Law

By Dr Jon Koomey CC BY-SA 3.0, via Wikimedia Commons

Koomey’s Law

• Corollary of Moore’s Law
• Amount of battery needed for a fixed computing load will fall by a
factor of 100 every decade
• Leads to ubiquitous computing

Koomey’s Law

• Corollary of Moore’s Law and Koomey’s Law
• “Roughly every decade a new, lower priced computer class forms
based on a new programming platform, network, and interface
resulting in new usage and the establishment of a new industry.”

e.g., PCs -> mobile computing -> cloud -> internet-of-things

Bell’s Law
Gordon Bell, Digital Equipment Corporation (DEC), 1972

• Zimmermann is the creator of Pretty Good Privacy (PGP), an
early encryption system

• “Surveillance is constantly increasing”

• Privacy constantly decreasing

Zimmermann’s Law

Explanations about change in IT and society

• Moore’s Law – capability and size of IT

• Koomey’s Law – capability and size of IT

• Bell’s Law – purpose of IT

• Zimmermann’s Law – relationship between privacy

and IT

Growth laws
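
The rates quoted for Moore’s Law and Koomey’s Law can be compared with a little arithmetic. This is a rough sketch that takes the figures at face value (doubling every 2 years; a factor of 100 per decade); the helper name growth_factor is an assumption made for illustration.

```r
# Growth factor after `years`, for a quantity that doubles
# every `doubling_time` years.
growth_factor <- function(years, doubling_time) {
  2^(years / doubling_time)
}

# Moore's Law as stated in these slides: doubling every 2 years,
# so a factor of 2^5 = 32 over a decade.
moore_decade <- growth_factor(10, 2)

# Koomey's Law as stated in these slides: a factor of 100 per decade.
# The implied doubling time (in years) solves 2^(10 / d) = 100.
koomey_doubling <- 10 * log(2) / log(100)

c(moore_per_decade = moore_decade, koomey_doubling_years = koomey_doubling)
```

Taken at face value, the Koomey figure implies a doubling roughly every year and a half, faster than the two-year doubling quoted for transistor counts.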

End of
Growth Laws

Week 2:
Dimensions of Data Science projects
and Big Data

Big Data and the Vs

From Big data on Wikipedia:

Big data usually includes data sets with sizes beyond

the ability of commonly used software tools to capture,

curate, manage, and process data within a tolerable elapsed

time. Big data “size” is a constantly moving target, …

Big Data

https://en.wikipedia.org/wiki/Big_data

from GO-Gulf in 2017

Things that happen in 60 secs

http://www.go-gulf.com/blog/60-seconds/

Four Vs of Big Data

The Four V’s of Big Data
“The Four V’s of Big Data” by IBM (infographic)

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Big Data and “V”s

• In 2001, Doug Laney produced a report describing 3 V’s:
“3-D Data Management: Controlling Data Volume,
Velocity and Variety”

‣ These characterise bigness, adequately
• Other V’s characterise problems with analysis and

understanding
‣ Veracity: correctness, truth, i.e., lack of …
‣ Variability: change in meaning over time, e.g., natural

language

• Other V’s characterise aspirations

‣ Visualisation: one method for analysis
‣ Value: what we want to get out of the data

• What else?

• “Data Science Matters” from the [email protected] Blog

• “Intelligence by Variety – Where to Find and Access Big
Data” from Kapow Software

Infographics on Data

http://datascience101.wordpress.com/2013/11/15/data-size-matters-infographic/

http://staging.kapowsoftware.com/resources/infographics/intelligence-by-variety-where-to-find-and-access-big-data.php

Growth laws and Big Data

Each of the growth laws relates to the characteristics of

Big Data.

For instance (but not limited to!)

• Moore’s Law: Velocity, Volume

• Koomey’s Law: Variety

• Bell’s Law: Variety, Veracity

• Zimmermann’s Law: all of them

Summary

BIG DATA is ANY attribute that challenges CONSTRAINTS of a
system’s CAPABILITY or BUSINESS NEED.

End of
Big Data and the Vs

Week 2:
Dimensions of Data Science projects
and Big Data

Business Models

As information technology develops and with more data

collected,

• Businesses utilise it

• Businesses change their attitudes towards it

• Businesses incorporate it in their business models

Innovation!

Growth and business

From Wikipedia:

A business model describes the rationale of how an

organization creates, delivers, and captures value, in

economic, social, cultural or other contexts.

Examples of general classes:

• Retailer versus wholesaler

• Luxury consumer products

• Software vendor

• Service provider

What kinds of businesses do we have operating in the
Data Science world?

Business Models

https://en.wikipedia.org/wiki/Business_model

by Jm3 CC BY-SA 3.0, via Wikimedia
Commons

Bloomberg
terminal

http://creativecommons.org/licenses/by-sa/3.0

The Bloomberg Terminal:

• a computer system provided by Bloomberg L.P.

• enables professionals to monitor and analyse real-time

financial market data

• also place trades on the electronic trading platform

• is a proprietary secure network

Questions:
• Where does the data originally come from?

• Why don’t users of the terminals get their data from

the original source?
• Why wouldn’t people who sell the data to Bloomberg

set up a similar service themselves?

Bloomberg Terminal (cont.)

https://en.wikipedia.org/wiki/Bloomberg_Terminal

• Bloomberg provides an information brokering service.

• Broker: a person who buys and sells goods or assets

for others

Bloomberg Terminal (cont.)

Amazon.com

• An assembly line for the retail industry, with support for
embedded online retailers.

• Huge stock of books, DVDs, CDs, etc. easily searchable.

• Extensive customer reviews.

Amazon.com

• Information-based differentiation: satisfies customers by

providing a differentiated service:

• superior information including reviews about products

• superior range

• Information-based delivery network: they deliver

information for others; retailers in the Amazon

marketplace get:

• customers directed to them

• other retailers’ support

Amazon.com

• See LexisNexis, provides world’s largest electronic database for

legal and public-records related information.

• Information provider: business selling the data it collects

• like a traditional business model, selling data not widgets

• fastest-growing segment of the IT industry post 2000 (cited by

Evan Quinn’s blog post on Infochimps.com, April 2013, “Is Big

Data the Tail Wagging the Data Economy Dog?”, now offline)

• some call this the data economy

e.g., data brokers sell consumer data to major retailers or internet

companies

LexisNexis

https://en.wikipedia.org/wiki/LexisNexis

https://www.consumer.ftc.gov/blog/2014/05/ftc-report-examines-data-brokers

• Information brokering service: buys and sells
data/information for others.

• Information-based differentiation: satisfies
customers by providing a differentiated service built
on the data/information.

• Information-based delivery network: deliver
data/information for others.

• Information provider: business selling the
data/information it collects.

“What a Big-Data Business Model Looks Like” by Ray Wang in the

Harvard Business Review claims these are unique in the data world.

Data Business Models

http://hbr.org/2012/12/what-a-big-data-business-model

End of
Business Models

Week 2:
Dimensions of Data Science projects
and Big Data

Case Studies

• Managing data is hard

– Having the required data and technologies

• Managing Big Data is harder

– Determining/creating the required data and
technologies

• Managing data quality is hardest

– Making sure the data (and technologies) meet the
needs

Quality and complexity

• Developed by the National Institute of Standards and

Technology (NIST) and has been widely used in characterising

data science projects.

• Certain parts of the framework will be further discussed in

later weeks.

• This is the kind of analysis you need in Assignment 2 & 4.

NIST Case Studies

• Data sources: where does the data come from?

• Data volume: how much data is there?

• Data velocity: how does the data change over time?

• Data variety: what different kinds of data are there?

• Data veracity: is the data correct? what problems might it have?

• Software: what software is needed to do the work?

• Analytics: what statistical analysis & visualisation is needed?

• Processing: what are the computational requirements?

• Capabilities: what are key requirements of the operational
system?

• Security/privacy: what security/privacy requirements are there?

• Lifecycle: what ongoing requirements are there?

• Other: are there other notable factors?

NIST Analysis

Case Study: Netflix Movies

On-demand internet streaming,
and flat-rate DVD rental:

• Over 203 million subscribers
by Jan 2021

• International market

• Video recommendation!

• Established the Netflix Prize
in 2006-2009 as a
crowdsourced way of testing
out algorithms

By Ivongala (Own work) [Public
domain], via Wikimedia Commons

Netflix

https://en.wikipedia.org/wiki/Netflix_Prize

• Pareto principle, or 80/20

rule:
• Top 20% of films

watched 80% of time

• Standard video store

stocked less than 20% of

available titles in order

to make the most money
from The real meaning of 80/20

• By adopting an Amazon style business model, Netflix could afford

to rent the remaining 80%, the so-called long tail

Netflix: Background
Analysis follows the NIST Big Data WG Netflix analysis in Volume 3, Use Cases
and General Requirements, case 7 on page 8, A-24 and elsewhere

http://longtail.typepad.com/the_long_tail/2005/03/the_real_meanin.html

https://en.wikipedia.org/wiki/Long_tail

http://dx.doi.org/10.6028/NIST.SP.1500-3

• Data sources: user movie ratings, user clicks, user profiles

• Data volume: in 2012: 25 million users, 4 million ratings/day, 3
million searches/day, video cloud storage of 2 petabytes

• Data velocity: video titles change daily, rankings/ratings updated

• Data variety: user rankings, user profiles, media properties

• Software: Hadoop, Pig, Cassandra, Teradata

• Analytics: personalised recommender system

• Processing: analytic processing, streaming video

• Capabilities: ratings and search per day, content delivery

• Security/privacy: protect user data; digital rights

• Lifecycle: continued ranking and updating

• Other: mobile interface

Netflix: Analysis

http://hadoop.apache.org/

https://pig.apache.org/

http://cassandra.apache.org/

https://en.wikipedia.org/wiki/Teradata

Case Study: Electronic Medical

Records (EMR)

EMR: Clinical Data

EMR: Claims and Cost Data

• Clinical data and claims/cost data is available per patient,

per hospital

• large variety of sources of data

• systematic errors and difference in standards across

institution

• Task: segment patients into different types (“phenotypes”) to

use in subsequent cohort studies

• case study is for Indiana Network for Patient Care

Electronic Medical Records
follows NIST Big Data WG Electronic Medical Records analysis in Volume 3, Use
Cases and General Requirements, case 16 on page 14, A-45 and elsewhere

http://dx.doi.org/10.6028/NIST.SP.1500-3

• Data sources: clinical and claims data

• Data volume: 1000 centres, 12 million patients, 4 billion clinical
events

• Data velocity: approx. 1 million clinical events/day

• Data variety: free text, lab results, pathology, outpatient, etc.

• Data veracity: different standards in different places

• Software: Hadoop, Hive, Teradata, PostgreSQL, MongoDB

• Analytics: visualisation for data checking; standardisation of incoming
data; general data analysis

• Processing: analytic processing, handling the volume

• Capabilities: models to support subsequent cohort studies

• Security/privacy: privacy and confidentiality required

• Lifecycle: full data management required

EMR: Analysis

https://hive.apache.org/

Case Study: Medical Imaging (MI)

MI Task: Produce Analysis

Biomedical data for imaging is high resolution and some

is 3D:

• interpretation of images done by trained experts

• requires significant training in interpretation

• many different kinds of instruments each requiring

different interpretations

• millions produced daily in the USA

Medical Imaging
follows NIST Big Data WG Pathology Imaging in Volume 3, Use Cases and
General Requirements, case 17 on page 14, A-48 and elsewhere

http://dx.doi.org/10.6028/NIST.SP.1500-3

• Data sources: biomedical image data

• Data volume: approx. 1 million events/day nationally

• Data variety: X-rays, CT scans, microscopes, …

• Data veracity: current interpretation is often text based, so prone to

text errors

• Software: advanced image processing and machine learning systems

• Analytics: computational image processing, supervised learning from

images

• Processing: handling the large volume, distributed and high throughput

• Capabilities: produce initial analysis for experts

• Security/privacy: privacy and confidentiality required

• Lifecycle: full data management required

Medical Imaging: Analysis

Case Study: Electricity Demand

Forecasting (EDF)

from NIST Big Data WG Electricity Demand Forecasting in Volume 3, Use
Cases and General Requirements, case 51 on page 43 and A-134

Near realtime usage available thanks to smart meters
• with solar cells, consumers do energy generation too, but

it is unpredictable

• main electricity generation must be planned

• brownouts and blackouts need to be prevented

• see Australian Energy
Market Operator (AEMO)
and their electricity site

Electricity Demand Forecasting

https://www.aemo.com.au/


https://www.aemo.com.au/Energy-systems/Electricity/National-Electricity-Market-NEM/Data-NEM/Data-Dashboard-NEM

• Data sources: utilities, smart meters, weather data, grids

• Data volume: city scale: 10GB/day

• Data velocity: updates every 15 minutes

• Data variety: time series, networks, spatial data

• Data veracity: occasional dropouts

• Software: advanced timeseries processing, spatial analysis

• Analytics: forecasting models

• Processing: handling the forecasting volume

• Capabilities: produce forecasts at different scales (hourly, daily)

• Security/privacy: privacy and confidentiality required

• Lifecycle: full data management required

Electricity Demand Forecasting: Analysis

Big Data is complex.

Complex problems have complex solutions.

There is no one solution for all

… but there is a lot of opportunity for growth

for Data Science!

for Data Scientists!

Data complexity

End of
Case Studies

Week 2:
Dimensions of Data Science projects
and Big Data

Modelling Influences

from the BackReaction blog by Sabine Hossenfelder

Modelling

http://backreaction.blogspot.com.au/2008/04/emergence-and-reductionism.html

What is a model?

• A simple model of
population growth:

Logistic growth curve
From Integrating Urban Growth Models,

Pearlstine, Mazzotti, Pearlstine and Mann, 2004

• A complex model of
obesity:

Obesity Systems Map

A Slide about Model from Week 1

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/296290/obesity-map-full-hi-res.pdf

A Slide about Model from Week 1


• “All models are wrong, but some are useful”…

George Box

• “The approximate nature of the model must

always be borne in mind”… George Box

• “The purpose of models is not to fit the data but to

sharpen the questions”… Samuel Karlin

Viewpoints about Modelling

Yes No

Influence Diagrams

Influence Diagrams (a.k.a. Decision Graphs) are:

• directed graphical models with 4 types of nodes:

‣ chance nodes, known variable nodes, action/decision
nodes and objective/utility nodes

• model the “influences”, “causes”, random (“chance”)

outcomes, “actions”, “goals” involved in a decision
problem

• provide a coarse abstraction, a conceptual model

Motivating Influence Diagrams

A conceptualisation aid to get you thinking about actions,

values, and unknowns.

Chance variable Known variable Decision or Action Objective

When do we connect an arc to a node?

• Chance variable: connect node A to chance node B if changes to the
value of A can “cause” changes in B;

• Known variable: same as chance node

• Decision: connect node A to decision node B, if variable A is used
when making decision B;

• Objective: connect node A to objective node B if variable A is used
when evaluating the value of the objective (e.g. quality or cost)

Node Types

Example: Last Minute Vacation

Example: Last Minute Vacation (cont.)

Bad Arcs for Last Minute Vacation

1. Weather cannot cause its forecast!

2. The forecast cannot cause the weather!
3. Your decision to go on vacation follows in time after

you have obtained the forecast.

4. The success (failure) of the vacation follows in time
after your decision.

Example: Internet Advertising

Heart Disease

End of
Modelling Influences

Week 2:
Dimensions of Data Science projects
and Big Data

Visualising Statistics

• Descriptive Analytics: gain insight from historical data
• plot sales results by region and product category

• correlate with advertising revenue per region

• Predictive analytics: make predictions using statistical and
machine learning techniques

• predict next quarter’s sales results using economic projections

and advertising targets

• Prescriptive analytics: recommend decisions using
optimisation, simulation, etc.

• recommend which regions to advertise in given a fixed budget

Primarily a descriptive classification for general discussions.

Analytic Levels

Analytic Levels

There can be other classification schemes.

Check “Eight Levels of Analytics” by SAS

https://www.datasciencecentral.com/profiles/blogs/eight-levels-of-analytics-for-competitive-advantage

“The practice or science of collecting and analysing

numerical data in large quantities, especially for the

purpose of inferring proportions in a whole from those

in a representative sample”.

Two main statistical analytical methods:

• descriptive statistics – explaining data

• inferential statistics – finding regularities in irregular

data

What is statistics?

https://en.wikipedia.org/wiki/Descriptive_statistics

https://en.wikipedia.org/wiki/Statistical_inference

Categorical, qualitative

• Groups or categories

• Nominal – no natural ordering

• Ordinal – ordered

Quantitative

• Numerical

• Discrete – specific values, like counts

• Continuous – like temporal data

– Temporal: time and dates
– Space: locations

Different variable types
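
The variable types above map fairly directly onto R's own data types. The following is a minimal sketch in base R; all of the values and variable names are made up for illustration.

```r
# Nominal: categories with no natural ordering
colour <- factor(c("red", "blue", "red", "green"))

# Ordinal: categories with a meaningful order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Discrete: counts stored as integers
visits <- c(3L, 0L, 2L, 5L)

# Continuous: real-valued measurements
height_cm <- c(172.4, 165.0, 180.2, 158.9)

# Temporal: dates have their own class, so they sort and subtract sensibly
when <- as.Date(c("2021-03-01", "2021-03-08", "2021-03-15", "2021-03-22"))

is.ordered(colour)   # FALSE
is.ordered(size)     # TRUE
size < "large"       # order comparisons only make sense for ordinal data
diff(when)           # gaps between consecutive dates, in days
```

Declaring the type up front matters later: plotting and modelling functions generally treat factors, numbers and dates differently.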

Data can be counted in various ways, including …

• How much data/how many records

• How large is the data
• How many unique values are there

• How many instances of each value
• How many instances of a group (bucket) of values

The values don’t have to be numerical.

Counting data
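
Each of the counts listed above has a short base R equivalent. The sketch below uses two small made-up vectors (`pets` and `ages` are hypothetical), one of them non-numerical to show that counting does not need numbers.

```r
pets <- c("cat", "dog", "cat", "fish", "dog", "cat")
ages <- c(3, 17, 25, 31, 48, 52, 67)

length(pets)            # how many records: 6
object.size(pets)       # how large the data is in memory
length(unique(pets))    # how many unique values: 3

counts <- table(pets)   # how many instances of each value
counts

# How many instances fall in each group (bucket) of values
age_groups <- table(cut(ages, breaks = c(0, 18, 40, 65, Inf)))
age_groups
```

Bar charts plot counts like `counts`, while histograms plot bucketed counts like `age_groups`.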

If there is a range of values, we can also evaluate what

is the most likely value.

Mode: which value is most common, e.g.,

Data: 1, 2, 2, 3, 3, 4, 4, 4, 5 Mode = 4

The data doesn’t have to be numerical.

Median: what is the value in the middle of the data

Data: 1, 2, 2, 3, 3, 4, 4, 4, 5 Median = 3

The data must be ordered & numerical.

Mode and Median
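
The slide's example can be checked in R. Note that base R's `mode()` reports how a value is stored, not the most common value, so the sketch below defines a small helper; `stat_mode` is a hypothetical name, not a built-in function.

```r
x <- c(1, 2, 2, 3, 3, 4, 4, 4, 5)

# Hypothetical helper: returns the most common value(s) as text,
# including every value involved in a tie.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[as.vector(counts == max(counts))]
}

stat_mode(x)                          # "4"
median(x)                             # 3
stat_mode(c("red", "blue", "red"))    # "red": the mode works on non-numerical data
```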

Mean: the average value.
Data: 1, 2, 2, 3, 3, 4, 4, 4, 5 Mean = 3.111
The data must be numerical.

The mean value of $x$ is sometimes written as $\bar{x}$.

Mode, mean and median help us describe what we expect
the data to be, but not how much the data differs.

Mean

Other measurements describe how much the numerical
values vary.

• Variance is the average of the squared differences between
the values and the mean.

• Standard deviation is the square root of the variance.

Data: 2, 4, 4, 4, 5, 5, 7, 9 Mean = 5

$\sigma^2 = \frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8} = 4$

$\sigma = \sqrt{4} = 2$

Deviation and variance
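
The slide's worked example divides by the number of values (the population variance). R's built-in `var()` and `sd()` divide by n − 1 instead (the sample variance), so they give slightly larger answers on the same data. A minimal sketch of both; `pop_var` and `pop_sd` are names chosen here for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Population variance and standard deviation, as on the slide
pop_var <- mean((x - mean(x))^2)   # 4
pop_sd  <- sqrt(pop_var)           # 2

# Built-in sample versions: same squared differences, divided by n - 1
var(x)   # 32 / 7, about 4.571
sd(x)    # about 2.138
```

The difference shrinks as the number of values grows, but it is worth knowing which version a tool reports.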

Plotting counts – bar charts

Plotting counts – histograms

Scatter plot Line graph

Plotting points

The mean or median is often used on a visualisation to
show a degree of what is “normal”.

• Not necessarily a benchmark
• Not all plots can easily incorporate a mean or

median, e.g., pie charts!
• Can be used to help visualise the variance in the

data

Adding statistics

• Not all data is ideal for analysis
• Outliers are values outside of the expected parameters

for the data
– Errors
– Exceptional circumstances
– Chance

• Outliers need to be identified and decided on before the
analysis is completed
– They will influence the calculation of the mean
– So wrangle them!

Outliers
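
To see how strongly a single outlier can pull the mean, here is a small sketch. The extra value 90 is a made-up data-entry error added to otherwise ordinary values.

```r
clean        <- c(2, 4, 4, 4, 5, 5, 7, 9)
with_outlier <- c(clean, 90)    # hypothetical erroneous value

mean(clean)            # 5
mean(with_outlier)     # 130 / 9, about 14.44
median(clean)          # 4.5
median(with_outlier)   # 5
```

The mean almost triples while the median barely moves, which is one reason the median is often preferred when outliers cannot be ruled out.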

• Combine quartiles, median and outliers
• Quartile

• Divide the data into quarters based on the variable
• The upper, median and lower quartiles are the values at

the quartile boundaries, e.g., 25% of the data is less than
or equal to the lower quartile

• Interquartile range (IQR): The difference between the
lower and upper quartiles

Boxplots

• Outliers
– Below Q1 − 1.5 × IQR
– Above Q3 + 1.5 × IQR

• Whiskers
– Used on a boxplot to show how

much of the data is outside of the
IQR but not an outlier

– Box-and-whiskers plot

• R does a lot of these calculations
for you!

Boxplots
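
As a rough illustration of the calculations R does for a boxplot, the sketch below computes the quartiles, the IQR and the 1.5 × IQR outlier fences by hand, then compares the result with `boxplot.stats()`. The data values are made up.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9, 20)   # hypothetical data; 20 looks suspicious

q   <- quantile(x, c(0.25, 0.5, 0.75))   # lower quartile, median, upper quartile
iqr <- IQR(x)                            # upper quartile minus lower quartile

lower_fence <- q[["25%"]] - 1.5 * iqr    # -0.5
upper_fence <- q[["75%"]] + 1.5 * iqr    # 11.5

outliers <- x[x < lower_fence | x > upper_fence]
outliers                                 # 20

boxplot.stats(x)$out                     # what boxplot() would draw as outlier points
```

`boxplot.stats()` uses hinges rather than `quantile()`, so on other data sets the two approaches can differ slightly at the boundaries.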

• Motivation: TED talk by Hans Rosling

• Allow us to focus on the relationship between multiple
variables over time.

Temporal data – Motion Charts

• Advantages:
• Time dimension allows deeper insights & observing trends
• Good for exploratory work
• Motion allows identification of items that are out of the common “rhythm”
• “Appeal to the brain at a more instinctual intuitive level”

• Disadvantages:
• Not suited for static media
• Display can be overwhelming, and controls are complex
• Not suited for representing all types of data,

e.g., other graphics might be more suitable for business data
• “Data scientists who branch into visualisation must be aware

of the limitations of uses”

Motion Charts – pros & cons


What visualisation should I use?

https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html

• https://www.reddit.com/r/dataisbeautiful/ A mixture of
good or bad visualisations of data

• https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html

• https://365datascience.com/chart-types-and-how-to-select-the-right-one/

• FIT5147: Data exploration and visualisation!

What visualisation should I use?

https://www.reddit.com/r/dataisbeautiful/

https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html

https://365datascience.com/chart-types-and-how-to-select-the-right-one/

End of
Visualising Statistics

Week 2:
Dimensions of Data Science projects
and Big Data

Missing Data

• You want the data you are using to be of sufficient quality
for your purpose
– Accuracy
– Completeness
– Consistency
– Integrity
– Reasonability
– Timeliness
– Uniqueness/deduplication
– Validity

Data Management Association (DAMA)

• Much of this is a data management issue
– Leave this for a future week

• Some of this is the fault of the data itself
– This is what we will focus on this week

Data quality

https://www.naa.gov.au/information-management/building-interoperability/interoperability-development-phases/data-governance-and-management/data-quality

Data needs to be cleaned, so it can be (re)used.

Sometimes the quality of data is questionable

• Volume: with a lot of data, irregularities creep in

• Velocity: data can be out-of-date very quickly

• Variety: data can be in different formats and types

that don’t work well together

• Veracity: the accuracy or consistency of data from

different sources or sets or circumstances

Wrangling big data

Big data can also be incomplete

• Sensors fail

• Data collection procedures fail (staff sick, social or
legal issues)

• Data sharing fails or is temperamental

• Data isn’t significant enough (too small a sample)

Consequently, the data may not have suitable values for all
variables in all parts of the data.

Holes in the data

• Learning algorithms need the values.

• Not all statistical computing and graphics software ignore

missing values

– Bias results

– Incorrect calculations
– ggplot ignores them, but it does warn you!!
library(naniar)    # provides the oceanbuoys data
library(ggplot2)

ggplot(oceanbuoys,
       aes(x = sea_temp_c,
           y = humidity)) +
  geom_point() +
  labs(x = "Sea temperature (Celsius)",
       y = "Humidity")

## Warning: Removed 94 rows containing
## missing values (geom_point).

Consequences of missing data
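
Base R's summary functions do not silently skip missing values. A minimal sketch with a made-up vector of temperatures (`temps` is hypothetical):

```r
temps <- c(21.5, NA, 23.0, 22.5, NA)

mean(temps)                 # NA: one missing value makes the whole result missing
mean(temps, na.rm = TRUE)   # 22.33333, computed from the 3 observed values only

sum(is.na(temps))           # 2 values are missing
mean(is.na(temps))          # 0.4, i.e. 40% of the values are missing
```

Using `na.rm = TRUE` gets a number back, but it is worth reporting how much data was dropped, since the result may be biased if the values are not missing at random.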

• Need to find where data is missing

– Visualise the invisible!

• Need to decide what to do with what we don’t have

– Sometimes we actually need to wrangle values for the
missing data!

Handling missing data

• Data set was missing 60% of the data

• Looked at the context of the missing data

– Found the data was originally merged from different

sources about people and machines

• Developed new ways of exploring the missing data

– naniar and visdat packages in R allow you to see where
the data is missing

Dr Nick Tierney’s missing data
Interview with Nick Tierney, microcredential Course 2, Step 01.12

• Sometimes, a value is regarded as NaN

– Value is not empty

– Value is not a string or character

– Value is not a number

Missing value!

• This can also be the outcome of a calculation,

e.g., 0 / 0 = NaN,

So analysis can also produce NaNs

• Wrangling often needs to deal with NaNs in the data

is.nan(x)

NaN – Not a Number

NA is not NaN

• Sometimes, a value is regarded as NA

– Value is not empty

– Value is not the expected type

– Value is Not Available

Missing value!

• Wrangling often needs to deal with NAs in the data

is.na(x)

• NA is not the same as NaN!

https://statisticsglobe.com/r-na/
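
The difference between the two checks is easiest to see side by side. In the sketch below, the vector `x` is made up and contains one `NA` and one `NaN` produced by `0 / 0`.

```r
x <- c(1, NA, 0 / 0, 4)

is.na(x)                 # FALSE  TRUE  TRUE FALSE : is.na() also flags NaN
is.nan(x)                # FALSE FALSE  TRUE FALSE : is.nan() flags only NaN

is.na(x) & !is.nan(x)    # FALSE  TRUE FALSE FALSE : NA but not NaN
```

So `is.na()` is the broader test for "missing", and `is.nan()` is only needed when the result of an invalid calculation has to be told apart from a value that was never recorded.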


• Marks the location of missings in the original data table

• From naniar, bind_shadow() joins a shadow matrix to
the data

Shadow matrix in R

as_shadow(oceanbuoys)

oceanbuoys_shadow <- bind_shadow(oceanbuoys)
glimpse(oceanbuoys_shadow)

We can then use the shadow matrix to see how the missing values
relate to other variables in the table.

Shadow matrix in R

ggplot(oceanbuoys_shadow,
       aes(x = wind_ew,
           y = wind_ns,
           colour = air_temp_c_NA)) +
  geom_point(alpha = 0.7) +
  theme(aspect.ratio = 1) +
  scale_colour_brewer(palette = "Dark2") +
  labs(x = "East-West winds",
       y = "North-South winds")

• The choice for missing values like NaNs is often whether to

– omit them from the data, or

– give them a value

Wrangling missing data

• If a small fraction of cases have several missings, drop the
cases.

• If a variable or two, out of many, have a lot of missings, drop
the variables.

• If missings are small in number, but located in many cases and
variables, you need to impute these values (replace with
substituted values) to do most analyses.

Strategies for missing values

• Sometimes we need a value for every aspect of the dataset

– machine learning

– visualisation

• One method is to “impute” values we don’t know, based on
those we do.

– Often a crude approximation, so use it with caution,
e.g., calculate the mean or median of similar values.

Imputation

Missing data about air temperature in the oceanbuoys data

• Imputation from the mean

Imputation

Missing data about air temperature in the oceanbuoys data

• Imputation from the median

Imputation

Missing data about air temperature in the oceanbuoys data

• Imputation from the nearest neighbours

Imputation …
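
Mean and median imputation can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `air_temp` holds made-up readings rather than the oceanbuoys data, and `impute` is a hypothetical helper, not a function from naniar.

```r
air_temp <- c(27.1, NA, 26.4, 25.9, NA, 28.0)   # hypothetical readings

# Remember which values were missing before filling them in
# (the same idea as one column of a shadow matrix)
was_missing <- is.na(air_temp)

# Hypothetical helper: replace each NA with a summary of the observed values
impute <- function(v, fun = mean) {
  v[is.na(v)] <- fun(v, na.rm = TRUE)
  v
}

mean_filled   <- impute(air_temp, mean)     # NAs become 26.85
median_filled <- impute(air_temp, median)   # NAs become 26.75

# Imputing a constant keeps the mean but shrinks the spread
var(air_temp, na.rm = TRUE)
var(mean_filled)
```

Keeping `was_missing` alongside the filled values makes it possible to colour imputed points differently in a plot, so readers can see which values are estimates.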



