20 short-answer questions, requiring 2-3 sentences each
4 long-answer questions, requiring 4-5 sentences each
Time limit is 80 minutes
Week 6:
Issues with Data Science
Data Management Maturity
Data governance and data management are often used
interchangeably.
Better to treat them as separate levels
• Data Management is what you do to handle the data
o Resources, practices, enacting policies
• Data Governance is making sure that it is done
appropriately
o Policies, training, providing resources
o Planning and understanding
Governance and management
DCC data (curation) lifecycle model
https://www.dcc.ac.uk/guidance/curation-lifecycle-model
Capability Maturity Model
• Good management happens all through the data lifecycle
• 4 key process areas:
o Data acquisition, processing and quality assurance
Goal: Reliably capture and describe scientific data in a way that
facilitates preservation and reuse
o Data description and representation
Goal: Create quality metadata for data discovery, preservation, and
provenance functions
o Data dissemination
Goal: Design and implement interfaces for users to obtain and
interact with data
o Repository services/preservation
Goal: Preserve collected data for long-term use
• Good data governance uses a good management system
o A mature system manages data all through the data lifecycle and
throughout all projects.
K Crowston & J Qin (2011) A Capability Maturity Model for Scientific Data Management: Evidence from the Literature. Proceedings of the American Society for Information Science & Technology, V48.
https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/meet.2011.14504801036
Capability Maturity Model
• Data management and governance are not arranged just
for each project.
• They should be universal in how an organisation
thinks about and approaches data
o at all times
o in all divisions
o in all projects
o for all stakeholders
Universality
End of
Data Management Maturity
Week 6:
Issues with Data Science
Ethics of Linked Data
• Connecting elements within multiple structured data
sets
• Allows data relating to an element to be collected
from multiple data sets
• Expands the knowledge base of a single data set
• Linked Open Data (LOD) allows the links and data to
be freely shared and accessed
o Used by companies, though they tend not to contribute
their own data
Linked Data
Sir Tim Berners-Lee, the
inventor of the WWW and HTML,
wanted a semantic web, using
linked data
1. Name/Identify things with URIs
2. Use HTTP URIs so things can be
looked up
3. Standardise the format of data
about things with URIs
4. On the web, use the URIs when
mentioning things
Semantic Web
https://www.w3.org/DesignIssues/LinkedData.html
• Resource Description Framework (RDF) is a language for
representing (subject, verb, object) triples, which are used to represent
semantics. It is a core representation language for Linked Open Data
and the Semantic Web.
• RDF can be represented in different formats, for instance as XML or
simply as line-delimited lists.
Format of linked (open) data
http://www.w3.org/RDF/
http://purl.org/dc/terms/
https://www.w3.org/TR/rdf11-primer/
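The triple idea can be sketched in plain Python (plain tuples rather than an RDF library; the example URIs below are illustrative, using the well-known FOAF vocabulary):

```python
# Linked Data triples as (subject, predicate, object) tuples.
# Subjects/objects are URIs; predicates come from a shared vocabulary (FOAF).
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob"),
]

def objects(subject, predicate):
    """Return all objects matching a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Because the identifiers are URIs, triples from different data sets
# about the same subject can simply be pooled into one list.
print(objects("http://example.org/alice", "http://xmlns.com/foaf/0.1/name"))
# → ['Alice']
```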
• Ethics – the moral handling of data, e.g., not selling on
other’s private data to scammers
• People have rights
o privacy
o access
o erasure
o … etc.
• Companies have rights
o ownership of data
o intellectual property
o copyright
o confidentiality
Ethics
• Business models
o Data has become a valuable asset
o Data has become a valuable product
• Data from different services can be linked by companies
by buying out other companies or establishing new
services for other companies to use.
• Alphabet: Google, YouTube, Gmail, Android, Chrome, DoubleClick
• Facebook: Facebook, Instagram, Oculus VR, WhatsApp, Giphy
• Microsoft: Skype, Hotmail, Bing, Windows, Xbox Live/Minecraft/Bethesda, Github
Companies using linked data
• Business models
o Multiple departments have separate systems
o Departments interact, so why can’t their data
o Law enforcement needs to know what everyone
else knows!
• Problems
o Who should know what?
o How do you manage who should know what?
o What priorities do you give to the rights of people?
Governments using linked data
• What can you do?
• What should you do?
• How do you make sure the right thing is done?
Breaking it down
See: “The curly fry conundrum: Why social media ‘likes’ say more
than you might think” by Jennifer Golbeck
e.g. Target ® predicting which women are pregnant based on their
purchases
• Many things can be predicted from Facebook “likes”
• Homophily (tendency to associate with similar individuals)
is important for enabling prediction
• We often don’t own or manage corporate/internet/app
data about ourselves
• The source data is critical for advertisers, so we cannot
expect companies to be banned/excluded from using it
• So how can we manage confidentiality?
Confidentiality
• for many apps/websites, you must accept their privacy and
data-sharing policies to use their services fully;
• the interface for selecting privacy preferences should
move away from individual Internet platforms and be put
into the hands of individual consumers;
• users could have an open-source agent that brokers their
confidentiality preferences
• but would that be feasible and would businesses ever
agree?
Confidentiality (cont.)
See: “Empower consumers to control their privacy in the
Internet of Everything” by Carla Rudder (blog)
https://enterprisersproject.com/article/2015/7/empower-consumers-control-their-privacy-internet-everything
1. Corporations: want to use data for business advantage;
‣ opposing consumers
2. Security conscious: concerned with individual freedom, liberty,
mass surveillance;
‣ opposing intelligence orgs like National Security Agency
3. Open data: want open accessibility, support FOI requests
‣ opposing security experts concerned with leaks
4. Big data and civil rights: concerned about big data and citizens;
‣ opposing data brokers selling consumer data
Politics of Confidentiality
See: “Four political camps in the big data world” by Cathy O’Neil (blog)
See: Facebook Doesn’t Tell Users Everything and Facebook Privacy:
Social Network Buys Data
Facebook buys 3rd-party data (from brokers) to obtain a
user's activity, income, etc.
• keeps upwards of 52,000 features about users, many
provided to advertisers
• the bought data is used as a complement to what Facebook
collects itself
• it is public, offline data, e.g., from Oracle's Datalogix,
• but is not revealed to users
Facebook and Personal Data
https://www.propublica.org/article/facebook-doesnt-tell-users-everything-it-really-knows-about-them
http://www.ibtimes.com/facebook-privacy-social-network-buys-data-third-party-brokers-fill-user-profiles-2466651
https://en.wikipedia.org/wiki/Datalogix
See: “Can Facebook influence an election result?” by Michael Brand
(ex-Monash, opinion on ABC news via The Conversation) and also
“How Facebook could swing the election” by Caitlin Dewey (article,
Washington Post)
• implicit data: Facebook can predict who you will vote for
• their “I voted” button encourages people to vote (as they see
which of their friends have)
• studies show it significantly increased voting in the 2010 US election
• they can therefore subtly affect your voting
• could Facebook deploy “I voted” button selectively to favour
certain parties in certain areas?
Facebook and Voting
http://www.abc.net.au/news/2016-09-28/can-facebook-influence-an-election-result/7881660
https://www.washingtonpost.com/news/the-intersect/wp/2016/09/30/how-facebook-could-swing-the-election-and-who-will-benefit-if-it-does/
See “Machine logic: our lives are ruled by big tech’s decisions by data”, and
“If prejudice lurks among us, can our analytics do any better?”
Predictive models built on large populations are used to
filter/make key life decisions like release from jail, treatment
in hospital, getting a loan, news/videos you see (e.g.,
Facebook) …
• ML algorithms do the filtering
• ML algorithms can also produce prejudice (i.e., are biased)
• decisions made en masse, not personalised
• decisions are centralised (who writes the algorithms?)
• perhaps this is OK … perhaps
Population-level Prediction
https://www.theguardian.com/technology/2016/oct/08/algorithms-big-tech-data-decisions
https://www.oreilly.com/ideas/if-prejudice-lurks-among-us-can-our-analytics-do-any-better
Philip R. “Phil” Zimmermann,
• creator of the Pretty Good Privacy (PGP) email
encryption software
• Interview in 2013:
“the biggest threat to privacy was Moore’s Law
… the ability of computers to track us doubles every
eighteen months
…The natural flow of technology tends to move in the
direction of making surveillance easier”
Zimmermann’s law
https://en.wikipedia.org/wiki/Pretty_Good_Privacy
https://en.wikipedia.org/wiki/Email_encryption
https://web.archive.org/web/20130815064716/http:/gigaom.com/2013/08/11/zimmermanns-law-pgp-inventor-and-silent-circle-co-founder-phil-zimmermann-on-the-surveillance-society/
Australian govt interface:
• Australian JobSearch
• Australian Taxation Office
• Centrelink
• Child Support
• Department of Health Applications Portal
• Department of Veterans’ Affairs
• HousingVic Online Services
• Medicare
• My Aged Care
• My Health Record
• National Disability Insurance Scheme
• National Redress Scheme
• State Revenue Office Victoria
Government linked data
https://my.gov.au
• My.gov.au provides the public with access to their data
o Greater dependency on online interfaces
o Less pen and paper data processing
o More automation of processing
o Cf. RoboDebt, Census
• Less clear what access each government can have to
the data
Government data access
• “require some telecommunications service providers to
retain specific telecommunications data (the data set)
relating to the services they offer for at least 2 years”
o Who talks to whom on the phone & when
o Who emails whom & when
o The IP address
• What doesn’t it include?
o information about telecommunications content or web
browsing history
• Who has access to the data without a warrant?
o 20 intelligence agencies, criminal law enforcement agencies,
ATO, ASIC and ACCC
o Civil litigation exemption
(Australian) Data retention laws
https://www.homeaffairs.gov.au/about-us/our-portfolios/national-security/lawful-access-telecommunications/data-retention-obligations
• Rights vs functionality
• Change in responsibilities
o Change in processes and technology in response
• Where does automation and AI fit?
o Where is the responsibility and accountability?
o Snowden and the NSA surveillance
Data retention laws – issues
End of
Ethics of Linked Data
Week 6:
Issues with Data Science
AI Veracity
• Various factors can affect the “accuracy” of any
analysis
o Data quality
o Choice of analysis
o Design of analysis
o Choice of data
• It is easy for the modelling to misrepresent what the
data is supposed to reflect.
o Even statistical analysis can be biased!
Can you trust the analysis?
Chris is an excellent driver.
They have applied for new
car insurance, but a ML
system automatically
evaluates their application.
What personal data should be
considered?
a) Driving record?
b) Payment metrics?
c) Location?
Should the system reject the
application purely due to
where Chris lives?
Question
https://www.crimestatistics.vic.gov.au/crime-statistics/latest-crime-data-by-area
Google trains ML systems to recognize some
common items in pictures. What do you think it
thought was in these hands in 2020?
a) Banana
b) Gun
c) Monocular
d) Tool
Question
https://algorithmwatch.org/en/story/google-vision-racism/
• Not all bias is in the numbers
• Bias can also be in how you have designed the
research
o Are the variables appropriate for all situations being
modelled?
o Are assumptions made about the stakeholders who the
data relates to?
o Are assumptions being made about the context of the
data?
Bias of design
• Sometimes the data used to train a ML system is biased,
regardless of its volume
o Narrow
o Regional
o Undertested in varied contexts
• A biased system may discriminate in its results, for
instance by
o gender
o ethnic associations
o generalities
• A biased system may not be as accurate in its results for
unfamiliar contexts and subjects
Bias of data
• Bias like this can appear in any automated processing
o Google: Shows ads for high paying jobs to men more
than women
o Jailtime: Sees black Americans as more at risk of
reoffending than white Americans
o Student applications: ML used to recognize bias in the
decision process and to add bias to the system
• Automated systems will only be as good as the
underlying data
Not just about image recognition
https://towardsdatascience.com/bias-in-artificial-intelligence-a3239ce316c9
https://www.fastcompany.com/90342596/schools-are-quietly-turning-to-ai-to-help-pick-who-gets-in-what-could-go-wrong
• Automated systems may speed up the processes, but
humans are better at understanding the context
o Human-in-the-loop
• Need the human perspective in the design,
understanding and review of the process, how it is
utilised and its results
Great responsibilities
Great transparency
Human perspective
https://appen.com/blog/human-in-the-loop/
• Need to incorporate legal requirements into the
system
o Not discriminating by race, gender, sex, age, etc unless
allowed
• Need to respect the rights of the individual
• Privacy-by-design
o Factor the right to privacy into the design of any DS
system, not as an afterthought.
o Also factor rights and legal requirements into how any
system is used
Legality by design
https://www.oaic.gov.au/privacy/privacy-for-organisations/privacy-by-design/
End of
AI Veracity
Week 6:
Issues with Data Science
Sampling
• When collecting data for processing, it has to be
relevant
o Can you get all data relating to the scenario you are
modelling?
o Can you only get a random sample of data? The sample data
has to be representative of the population being modelled
o How large a sample do you need?
o What known variables are included in the data?
o Is the sample data distributed to match the required
strata/categories?
• Observe the population before you make any
unqualified assumptions
Sampling populations
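The sampling concerns above can be sketched in plain Python; the population and its strata below are invented toy data, and proportional stratified sampling is one simple way to match the required strata:

```python
import random
from collections import defaultdict

random.seed(1)

# Toy population with a known stratum imbalance: 80% group A, 20% group B.
population = [{"group": "A"}] * 800 + [{"group": "B"}] * 200

# Simple random sample: group proportions only match in expectation.
srs = random.sample(population, 100)

def stratified_sample(pop, key, n):
    """Proportional stratified sample: force strata to match the population."""
    strata = defaultdict(list)
    for item in pop:
        strata[item[key]].append(item)
    sample = []
    for value, items in strata.items():
        take = round(n * len(items) / len(pop))  # proportional allocation
        sample.extend(random.sample(items, take))
    return sample

strat = stratified_sample(population, "group", 100)
print(sum(1 for x in strat if x["group"] == "B"))  # → 20, exactly proportional
```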
• Blind experiments or A/B testing may be used to show whether
there is a relationship between various variables
• The experimental scenario needs to be divided into:
o A: Sample is subjected to the known variable
o B: Sample is not subjected to the known variable (the Control
set)
• The validity of the
hypothesis is based on whether
A has a different response to
B, where the response is the
target variable.
https://en.wikipedia.org/wiki/A/B_testing#/media/File:A-B_testing_example.png (CC BY-SA 4.0)
A/B testing
https://en.wikipedia.org/wiki/A/B_testing
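A minimal simulation of an A/B test; the conversion rates for the two variants are assumptions invented for illustration:

```python
import random

random.seed(42)

# Hypothetical A/B test: one conversion indicator per visitor (1 = converted).
# True underlying rates (12% vs 10%) are invented for this sketch.
a = [1 if random.random() < 0.12 else 0 for _ in range(1000)]  # A: new variant
b = [1 if random.random() < 0.10 else 0 for _ in range(1000)]  # B: control

rate_a = sum(a) / len(a)
rate_b = sum(b) / len(b)
# Whether this observed lift is meaningful is a significance question.
print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  lift: {rate_a - rate_b:+.3f}")
```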
How much of a difference in results is enough?
• Must test the statistical significance
o p-value: the chance of your “surprise” (0 to 1),
considering how likely you could get the same results
regardless of the hypothesis
• Hypothesis: Aspirin reduces heart attack
o Sample: studied 100 men for 5 years
Group HA: 50 men take aspirin daily
Group HP: 50 men take placebo daily (control)
o Results:
‣ High p: HA 4 heart attacks, HP 5 heart attacks so both
around 1 in 10 men
‣ Low p: HP 10, HA 1
so very different and significant!
Significance testing
https://en.wikipedia.org/wiki/Statistical_significance
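The low-p aspirin scenario above can be checked with a permutation test, sketched in plain Python (the group sizes and outcome counts come from the slide; the permutation approach is one of several valid significance tests):

```python
import random

random.seed(0)

# Low-p scenario from the slide: 50 men per group,
# 1 heart attack with aspirin (HA), 10 with placebo (HP).
ha = [1] * 1 + [0] * 49   # aspirin group outcomes
hp = [1] * 10 + [0] * 40  # placebo (control) group outcomes

observed = sum(hp) - sum(ha)  # observed difference: 9 extra heart attacks

# Permutation test: shuffle the group labels many times and count how
# often a difference at least as large arises by chance alone.
pooled = ha + hp
count = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[50:]) - sum(pooled[:50])
    if diff >= observed:
        count += 1

p = count / trials  # a small p-value means the difference is significant
print(f"estimated p-value: {p:.4f}")
```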
• How much difference is enough? (p<0.05?)
• More data gives a more accurate impression, but how
much is enough?
• Should you publish experimental results that challenge
previous runs of the same experiment?
o Negative results shouldn’t be forgotten
o Old experiments may be flawed
o New data may better capture the context
• Can you cross-validate your results?
o k-fold testing: experiment with k combinations of test
and training data
Significance chasing
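The k-fold idea above can be sketched in plain Python (no ML library; round-robin fold assignment is one simple choice):

```python
# k-fold cross-validation splitting: each of the k folds is used
# exactly once as the test set, with the rest used for training.

def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    print(len(train), len(test))  # → 8 2, five times
```

Fitting on each training set and scoring on the held-out fold gives k estimates of performance, which helps validate results beyond a single train/test split.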
• Is data science interested in finding patterns in data
(observation) or in experimentation (testing outcomes)?
• Both models and theories/hypotheses are research
artefacts
o Need to demonstrate how they match evidence
o Scientific method isn’t the only valid research
methodology!
• Still need to make sure any modelling or other research
outcomes are valid!
Challenging the scientific method?
• In 2009 Google claimed it worked out a correlation
between some search terms and a growth in flu cases
o Could identify the trends 2 weeks before it became a health
problem!
• But this has problems!
o Not openly sharing their methods – IP!
o Not openly sharing their data – privacy and proprietary
o Inconsistent in temporal perspectives
o Overestimates the infections!
• “greater value can be obtained by combining GFT with
other near–real time health data”
Google Flu Trends
Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of
Google Flu: Traps in Big Data Analysis.” Science 343 (6176) (March 14): 1203–1205.
https://dash.harvard.edu/handle/1/12016836
• Data science allows us to expand what we can do with
data
o Growth laws
o Dealing with the Vs
• Data science allows us to reinterpret scenarios
o New ways to approach old problems
• Data science is not standalone
o Combine with existing methods
o Human-in-the-loop
• Data science doesn’t just have to be about making better
models
o Use data science to solve real problems
Data science and society
https://www.technologyreview.com/2020/08/18/1007196/ai-research-machine-learning-applications-problems-opinion/
End of
Sampling
Week 6:
Issues with Data Science
Future of Data Science
Gartner’s hype cycle
https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020/
• Traditional technology reaches its limits
• DNA storage becomes a reality
• Expansion of electronic physical experiences
• Farms and factories face automation
• CIOs become Chief Operating Officers
• Change is driven by recording work conversations
• Increase in freelance customer service experts
• More attention to a “voice of society” metric in
organisations
• On-site childcare entices employees
• Handling malicious content becomes a priority
Gartner’s predictions for 2021+
https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-predictions-for-2021-and-beyond/
The 2021 Hype Cycle for Emerging
Technologies
https://www.gartner.com/smarterwithgartner/3-themes-surface-in-the-2021-hype-cycle-for-emerging-technologies
• Growth of big data technologies has allowed multiple types
of data to be combined
o Structured and unstructured data, e.g., sales records and
customer feedback
o Multimedia, e.g., video and textual data, image and textual
data
• Growth of IT has allowed better processing capability
(Moore’s Law)
o New ways to use multiple models relating to different data
sets (Bell’s Law), e.g., visual interpretation of gestures and
audio interpretation of speech vs world knowledge
o ML using neural networks (NN) and deep learning
Combining data
• Very much a data science process
o Gather data
o Analyse data
o Produce conclusions
o Make decisions
o Act on the decisions
• Many uses
o Manufacturing “robots”
o Robotic vacuum cleaner
o Adaptive energy systems
o Chatbot
o Stock market agent
o Independent agents in modelling, e.g., public behaviour
during pandemic
Autonomous devices
• For instance,
o Drones & other aircraft (autopilot!)
o Trucks on freeways & mining sites
o Trains
o Suburban cars
• Collect data from various sources
o Local: speed limits
o Internal: sensors, cameras, radar
o External: road maps, weather
• Actions
o Plans: routes, known objectives
o Instinct: dynamic, adaptable responses, preempting actions
of other entities
Autonomous vehicles
Gartner’s Legal Tech Hype Curve
https://www.artificiallawyer.com/2020/07/27/gartner-legal-tech-hype-curve-2020-positions/
• Digital ethics is currently of great interest
o GDPR
• Laws for autonomous devices
o Military weapons like drones and gun turrets
o Responsibility for errors by other vehicles
• Researchers looking where the legal holes are, how they
can be filled, what is possible to implement and how AI
can help the law
o Dr Campbell Wilson - AI for Law Enforcement & Community
Safety
Laws for Data Science?
https://research.monash.edu/en/projects/ai-for-law-enforcement-community-safety
• Data Science is not just about Machine Learning
• Data Science is not just about coding
• Data Science is about helping society use data better at
every stage of the data lifecycle
• Data science is not just for IT
• Data science is now recognised as having a multi-
disciplinary role in all industries
o Remember the multiple dimensions of a data scientist's
skillset!
Future for Data Science
End of
Future of Data Science
Week 6:
Issues with Data Science
Revisiting the Unit Content
• Data science
o History
o Definition
o Machine Learning
• Data scientist
o Skills
o Roles
o Job descriptions & requirements
• Data science process & value chain
Week 1 - Overview
• Data science
o History
o Impact
‣ On other disciplines
‣ On society
‣ Scientific method
‣ Futurology
• R
o R Markdown
o ggplot2: visualising and aesthetics
‣ Graphs & facets
o Data wrangling
‣ wrangling verbs
‣ Tidy data
Week 1 - Impact
• Types of analysis
• Modelling
o Influence diagrams
• Growth laws
• Business models
o SaaS
• Basic statistics
o Mean, variance
o Variable types
o Outliers and box plots
• Choosing visualisations
Week 2 – Visualising statistics
• Big data
o The Vs
o Growth laws
• NIST Case studies
o Analysis framework
• Data quality
o Wrangling
o Missing data & strategies
o Imputation
o NaN and NA
‣ Shadow matrix in R
Week 2 – Big Data
• Sharing data
o Open data
o Data sources
o Complexities of using shared data
o Getting data
• Data standards
o Formats: machine-readable, containers, markups
o Metadata
o Semi-structured data: XML, JSON
o Predictive Model Markup Language
• Combining data
o joins
Week 3: Data sources
• Scripting languages: R, python, Unix shell code
o Wildcards
o Piping
o Directing input/output
o Moving files and directories
o Analysing file contents: grep, awk
o Handling big data
• Standardisation
o software
o workflow
o processes
Week 3: Big Data and Standards
• Temporal data
o Temporal elements
o Extraction and conversion
o Visualisation
• Statistical modelling
o Variables: dependent, independent
o Causation vs correlation
o Regression modelling
‣ Model family
‣ Learning parameters/fitting a model
‣ Simple linear regression model
Week 3 – Modelling data
• Truth of data
o Error
o Correlation coefficient
• Simple linear regression
o Residuals: Mean Square Error
o Goodness of fit: fitting variation
‣ R-squared
• Polynomial regression
o Degree
o Underfitting, overfitting
o Testing and training
o Bias-Variance tradeoff
o No Free Lunch Theorem
o Multiple models & ensembles
Week 4 – Fitted modelling
• Segmentation
• Regression trees
o ANOVA
• Classification trees
• Clustering
o Centroids
o K-means
o Hierarchical trees: dendrograms
Week 4 – Grouping data
• RDBMS: SQL
o Unstructured data: NoSQL
• Distributed systems
o Hadoop
o Map-Reduce
o Spark
• NIST Big Data Reference Framework
• Data Science Tools & Services
o Open source software
o Case studies
o APIs
o SaaS
Week 5 – Data Science Tools
• Data management
• Data lifecycles
• Data governance
o Legal requirements: Privacy Act, GDPR, licenses
o Ethical requirements
o Rights
o Privacy
o Confidentiality
• Stakeholders
• Data management plans
• Data curation
Week 5 – Data management
• Data management capability maturity
• Linked data: Semantic web, RDF
o Confidentiality
o Privacy
• Surveillance
o Data retention laws
• AI veracity
o Bias
o Human-in-the-loop
o Sampling
‣ A/B testing
‣ Significance testing: p-value, k-fold testing
o Scientific method
Week 6 – Issues
• We have covered a lot of areas because data science
has a broad influence.
• Hope you’ve learnt a lot from the unit.
• Best of luck for the final exam assessment task!
The end? Or the start?
If …
If you are interested in doing a minor thesis on the topic
of applying Data Science techniques for educational
research, please feel free to send an email via:
[email protected]
Please help us improve by filling out
the SETU surveys on Moodle.
Lastly …
End of FIT5145 Lectures !!!!
Week 5:
Data Management & Governance
Storing Big Data
From Big data on Wikipedia:
Big data usually includes data sets with sizes beyond
the ability of commonly used software tools to capture,
curate, manage, and process data within a tolerable elapsed
time. Big data "size" is a constantly moving target, ...
Big Data
https://en.wikipedia.org/wiki/Big_data
Summary
BIG DATA is ANY attribute that challenges CONSTRAINTS of a
system’s CAPABILITY or BUSINESS NEED.
The Four V’s of Big Data
“The Four V’s of Big Data” by IBM (infographic)
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
New approaches are needed to handle this complexity
• Storing it
• Analysing it
• Visualising it
• Using it
➢ New software
➢ New methods
➢ New hardware
Big Data is complex
• Collection: getting the data
• Wrangling: data preprocessing, cleaning
• Analysis: discovery (learning, visualisation, etc.)
• Presentation: arguing that results are significant and useful
• Engineering: storage and computational resources
• Governance: overall management of data
• Operationalisation: putting the results to work
Our Standard Value Chain
RDBMS and SQL Review
• Relational Database Management Systems (RDBMS)
• SQL: structured query language
• Rather like a large-scale set of Excel spreadsheets with
better indexing and retrieval
• Transaction oriented with support for correctness,
distribution, ...
• Businesses function in a continuously changing
environment:
‣ fixed formats as per RDBMS not suitable
‣ usage varies, requires complex analytical queries
• Need to reach insights faster and act on them in real
time
‣ stream processing
Business Context
• Stores graph,
commonly as
triples, e.g.,
(subject, verb,
object)
• Commonly used
to store Linked
Open Data
Graph Database Example
Semi-structured data is data that is presented in XML or
JSON:
• see some examples here (links below)
• Note YAML (originally “Yet Another Markup Language”), an
indentation-based (easier-to-read) superset of JSON
• standard libraries for reading/writing/manipulating semi-
structured data exist in Python, Perl, Java
• don’t need to know all the details of XML (and related Schema
languages), there are many good online tutorials, e.g.
W3schools.com
Semi-Structured Data
https://en.wikipedia.org/wiki/JSON
http://www.w3schools.com/
• No fixed format
• Semi-structured,
key-value pairs,
hierarchical
• “Friendly”
alternative to XML
• Self-documenting
structure
JSON Example
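A minimal sketch of semi-structured, hierarchical key-value data using Python's standard json library (the record shown is invented):

```python
import json

# JSON text: self-documenting, hierarchical key-value structure.
text = '{"name": "Chris", "policies": [{"type": "car", "year": 2021}]}'

record = json.loads(text)              # parse JSON into dicts and lists
print(record["policies"][0]["type"])   # → car

# No fixed schema: new nested structure can be added freely.
record["policies"].append({"type": "home", "year": 2022})
print(json.dumps(record, indent=2))    # serialise back to (pretty) JSON
```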
REST API Terminology
API: Application Programmer Interface
• Routines providing programmatic access to an application.
REST: REpresentational State Transfer
• a stateless API usually running over HTTP
• Watch a simple introduction to REST-based APIs in this
video: REST API concepts and examples by WebConcepts
SaaS: Software as a Service
• The provisioning of software in a Web browser and/or via an
API over the Web as a subscription service.
https://www.youtube.com/watch?v=7YcW25PHnAA
• Use SQL database when:
‣ data is structured and unchanging
• Use NoSQL database when:
‣ Storing large volume of data with little to no structure
‣ Data changes rapidly
• NoSQL databases offer a rich variety beyond traditional
relational data.
SQL and Beyond SQL Databases (NoSQL)
• In-database analytics: the analytics is done within the DB
• In-memory database: the DB content resides in memory
• Cache: data stored in-memory
• Key-value: value accessible by key, e.g., hash table
• Information silo: an insular information system incapable of
reciprocal operation with other, related information systems
‣ If two big banks merge, then initially their RDBMSs will be
siloed
‣ In a big insurance company, the customer RDBMSs for auto
and home insurance may be siloed
Database Background Concepts
End of
Storing Big Data
Week 5:
Data Management & Governance
Hadoop, Spark & Map-Reduce
Interactive: bringing humans into the loop
Streaming: massive data streaming through system with little
storage
Batch: data stored and analysed in large blocks, “batches”,
easier to develop and analyse
Overview: Processing
In-memory: in RAM, i.e., not going to disk
Parallel processing: performing tasks in parallel
Distributed computing: across multiple machines
Scalability: to handle a growing amount of work; to be
enlarged to accommodate growth (not just
“big”)
Data parallel: processing can be done independently on
separate chunks of data
Yes: process all documents in a collection
to extract names
No: convert a wiring diagram into a physical
design (optimisation)
Processing Background Concepts
• Legacy systems provide powerful statistical tools on the
desktop, such as SAS, R, and Matlab, but oftentimes
without distributed or multi-processor support.
• Supporting distributed/multi-processor computation
requires special redesign of algorithms
Distributed Analytics
Simple distributed processing framework developed at
Google
• published by Dean and Ghemawat of Google in 2004
• intended to run on commodity hardware; it has
fault-tolerant infrastructure
• from a distributed systems perspective, is quite
simple
Map-Reduce
For a simple word-count task:
(1) divide data across machines
(2) map() to key-value pairs
(3) sort and merge identical keys
(4) reduce() by summing the counts per key
Map-Reduce Example
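The word-count steps above can be sketched on a single machine in plain Python (a toy corpus, not a distributed run):

```python
from itertools import groupby

documents = ["big data big ideas", "big plans"]  # toy corpus

# map(): emit (word, 1) key-value pairs from every document
pairs = [(word, 1) for doc in documents for word in doc.split()]

# sort & merge identical keys, then reduce() by summing the counts
pairs.sort(key=lambda kv: kv[0])
counts = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(counts)  # → {'big': 3, 'data': 1, 'ideas': 1, 'plans': 1}
```

In a real Map-Reduce run the map step is applied to chunks of data on separate machines, and the framework handles the sort/shuffle between the map and reduce phases.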
• requires simple data parallelism followed by some
merge (“reduce”) process
• Google stopped using it, probably in 2005
• Google now uses “Cloud Dataflow” (links below), available
commercially and as open source
Map-Reduce (cont.)
https://cloud.google.com/dataflow/
https://cloud.google.com/dataflow/what-is-google-cloud-dataflow
Open-source Java implementation of Map-Reduce
• originally developed by Doug Cutting while at Yahoo!
• architecture:
Common: Java libraries and utilities
MapReduce: core paradigm
• huge tool ecosystem
• well past the peak of the hype curve
Hadoop
https://en.wikipedia.org/wiki/Doug_Cutting
• another open-source Apache top-level project:
Apache Spark
• developed at AMPLab at UC Berkeley
• builds on Hadoop infrastructure
• interfaces in Java, Scala, Python, R
• provides in-memory analytics
• works with some of the Hadoop ecosystem
Spark
http://spark.apache.org/
https://amplab.cs.berkeley.edu/
• Hadoop provides an inexpensive and open source platform for
parallelising processing:
‣ based on a simple Map-Reduce architecture
‣ not suited to streaming (suitable for offline processing)
• Spark is a more recent development than Hadoop
‣ includes Map-Reduce capabilities
‣ provides real-time, in-memory processing
‣ much faster than Hadoop
Summary: Hadoop and Spark
End of
Hadoop, Spark & Map-Reduce
Week 5:
Data Management & Governance
Data Science Tools
Here’s how you learn about which tools are important!
BOSSIE is Best Open Source Software awards:
• BOSSIE awards 2015 for Big Data and BOSSIE awards 2016
for Big Data
• BOSSIE awards 2017 for machine learning and deep learning
tools and for databases and analytics tools
• BOSSIE awards 2019
• BOSSIE awards 2020
• BOSSIE awards 2021
Open Source Software Awards
http://www.infoworld.com/article/2982429/open-source-tools/bossie-awards-2015-the-best-open-source-big-data-tools.html
http://www.infoworld.com/article/3120856/open-source-tools/bossie-awards-2016-the-best-open-source-big-data-tools.html
https://www.infoworld.com/article/3228224/machine-learning/bossie-awards-2017-the-best-machine-learning-tools.html
https://www.infoworld.com/article/3228150/analytics/bossie-awards-2017-the-best-databases-and-analytics-tools.html
https://www.infoworld.com/article/3444198/the-best-open-source-software-of-2019.html
https://www.infoworld.com/article/3575858/the-best-open-source-software-of-2020.html
2015: big data tools, Spark and “elastic” processing, scalable ML and
databases, stream/real-time processing (ML, search, analysis,
storage, time-series), security
2016: big data tools, pipelines, TensorFlow, distributed IR (Solr),
NoSQL analytics, stream analytics, graph database
2017: big data and analytics tools, GPU acceleration, real-time SQL,
more Spark, Solr, R, graph databases
2017: ML tools, deep learning, scalable prediction, Python, gradient
boosting, TensorFlow
2021: analytics and ML tools, Orange, Apache software, distributed
SQL, explainable AI
Machine learning & analytics on top of big data are now mainstream!
Open Source Software Awards (cont.)
Let’s have a look at what all these open source projects are doing
1. Apache Hadoop Distributed File System (HDFS)
2. Apache Hadoop YARN
3. Apache Spark
4. Apache Cassandra (distributed NoSQL, wide-column store)
5. Apache HBase (distributed NoSQL, wide-column store)
6. Apache Hive (distributed SQL)
7. Apache Mahout (distributed linear algebra with GPU)
8. Apache Pig (data flow and data analysis on top of Hadoop)
9. Apache Storm (distributed real-time computation)
10. Apache Tez (dataflow for Hive and Pig)
Many state-of-the-art platforms integrated into Hortonworks
(now the Cloudera Data Platform).
Popular Open Source Projects
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
http://spark.apache.org/
https://cassandra.apache.org/
https://hbase.apache.org/
https://hive.apache.org/
https://mahout.apache.org/
https://pig.apache.org/
http://storm.apache.org/
http://tez.apache.org/
http://hortonworks.com/hdp/whats-new/
A number of organisations run salary surveys. These are usually
interesting because they also describe what tasks people do and
what software they use.
• O’Reilly’s Salary Survey: behind login, slides summarised
‣ 2016 Data Science Salary Survey
✓ really interesting content on software used, ...
‣ 2017 European Data Science Salary Survey
✓ really interesting content on tasks done, coding versus meetings, ..
• Kaggle state of data science and machine learning
‣ really interesting content on job title, education, methods,
barriers, getting started
‣ explore this one online! The 2018 survey,
2019 survey and 2021 survey are also available!
Work and Salary Surveys
http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp
https://www.oreilly.com/ideas/2017-european-data-science-salary-survey
https://www.kaggle.com/surveys/2017
https://www.kaggle.com/kaggle/kaggle-survey-2018
https://www.kaggle.com/c/kaggle-survey-2019
https://www.kaggle.com/kaggle-survey-2021
2016 Data Science Salary Survey
Software Usage Survey
http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp
Survey: Clusters amongst the Respondents
Survey: Commonly Used Software
Kaggle 2021 Survey:
Interactive Development Environments
Kaggle | State of ML & Data Science 2021
https://www.kaggle.com/kaggle-survey-2021
Survey: Operating Systems
Survey: Programming Languages
Survey: Relational Databases
Kaggle 2021: Relational Databases
Kaggle | State of ML & Data Science 2021
https://www.kaggle.com/kaggle-survey-2021
Survey: Management and Big Data
Survey: Visualization
Coding versus Meetings
Career Choices
Tasks – Time
Tasks – Salary
End of
Data Science Tools
Week 5:
Data Management & Governance
Data Science Use Case Studies
• Mike Olson (co-founded Cloudera in 2008) says without big
data and a platform to manage big data, machine learning
and artificial intelligence just don’t work.
• See the machine learning renaissance starting at 60
seconds.
The Machine Learning Renaissance
https://www.oreilly.com/ideas/the-machine-learning-renaissance
• “Visualizing the world’s Twitter data – Jer Thorp”, a TEDYouth
2012 Talk, former New York Times data artist-in-residence Jer
Thorp (video, 6mins)
• National Map (Youtube, 14 mins) is a website for map-based
access to Australian spatial data from government agencies.
The website is http://nationalmap.gov.au/.
• “Style Stalking; The Stochastic Patterns that Drive Fashion
Trends”, by Karen Moon from Strata+Hadoop World 2014
(video, 10 minutes)
• Panama Papers, leaked papers (11.5M) on financial
transactions, motivations for using data science, and how
analysed (Wired, 2016).
Case Studies
http://ed.ted.com/lessons/mapping-the-world-with-twitter-jer-thorp
https://www.youtube.com/watch?v=e7jQoV2pl_0
http://nationalmap.gov.au/
https://www.youtube.com/watch?v=VyV0NZX_eZ8
https://en.wikipedia.org/wiki/Panama_Papers
https://blog.unbelievable-machine.com/en/blog/panama-papers-and-data-science/
http://www.wired.co.uk/article/panama-papers-data-leak-how-analysed-amount
Data sources: where the data comes from
Data volume: how much there is
Data velocity: how it changes over time
Data variety: what different kinds of data there are
Data veracity: correctness problems in the data
Software: software needed to do the work
Analytics: broadly, what sorts of statistical analysis and
visualisation are needed
Processing: broadly, computational requirements
Capabilities: broadly, key requirements of the operational
system
Security/Privacy: nature of needs here
Lifecycle: ongoing requirements
Other: notable factors
Reminder: NIST Analysis
Freebase:
• an example of a graph database we looked at earlier
• graph can be represented in RDF which is triples of URIs
• now owned by Google, and decommissioned
• used by others as a knowledge-base in many text processing
pipelines:
‣ e.g., using TextRazor to extract meaning from text
DBpedia:
• aims to extract structured content from the information in
Wikipedia
• open source project
• effectively replaced Freebase
Freebase and DBPedia
http://www.freebase.com/
https://www.textrazor.com/
http://wiki.dbpedia.org/
The Unified Medical Language System (UMLS)
Medical Data Dictionaries
http://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_001.html
ICD: the International Classification of Diseases
• used to classify diseases and other health problems
• based on health and vital records
• for example: Pneumonia due to Streptococcus pneumoniae
Medical Data Dictionaries (cont.)
http://apps.who.int/classifications/icd10/browse/2010/en
Other Medical Dictionaries:
• SNOMED CT
‣ Systematized Nomenclature of Medicine Clinical Terms
• Gene Ontology
‣ Concepts for describing gene function
Usage of Medical Dictionaries:
• controlled vocabularies
• semantic data exploration
• clinical surveillance
• decision support
Medical Data Dictionaries (cont.)
http://www.ihtsdo.org/snomed-ct
http://geneontology.org/
• PUBMED, we have seen before
• ACM Digital Library
• Global Patent Index provided by the EPO
• Semantic Scholar for research article search
Publishing Repositories
http://dl.acm.org/
https://www.epo.org/searching-for-patents/technical/espacenet/gpi.html
https://www.semanticscholar.org/
Event Registry
• collects news articles globally, processing and organising them as events
• performs concept and event identification
• creates a document database for inspection
• news is sometimes stored as NewsML
News and Event Registry
http://eventregistry.org/
https://iptc.org/standards/newsml-g2/
• US Government’s Data.GOV
• NYC Open Data
• Australia’s Urban Intelligence Network (AURIN), e.g. SD
Private Health Insurance
• BioGrid Australia, curated for research use and usually
require getting approval to use
Government Data
http://www.data.gov/
https://data.cityofnewyork.us/dashboard
http://aurin.org.au/
https://data.aurin.org.au/dataset/tua-phidu-sd-privatehealthinsurance-sd
https://www.biogrid.org.au/
Many companies are exposing their data and their
website functionality as APIs for others to make use of:
• Facebook API
• Twitter API
e.g. search tweets
• LinkedIn API
• Google Maps API
• Youtube API
e.g. documentation
• Amazon Advertising API
• TripAdvisor API
• New York Times API
Example Data/Information APIs
https://developers.facebook.com/products/
https://dev.twitter.com/rest/public
https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
http://www.programmableweb.com/api/linkedin
https://developers.google.com/maps/
https://developers.google.com/youtube/
https://developers.google.com/youtube/v3/getting-started
https://advertising.amazon.com/API
https://developer-tripadvisor.com/content-api/
http://developer.nytimes.com/
Companies provide functionality via APIs so that others
can make use of their data and services:
• The Application Economy: A New Model for IT (CISCO)
• ProgrammableWeb API Category: Data
• Top 30 Predictive Analytics API (see #4)
• 20+ Machine Learning as a Service Platforms
And for something completely different:
• The Sharing Economy | Bullish (on TechCrunch)
‣ these companies are huge users of data science!
The API Economy
https://www.youtube.com/watch?v=9Ai5TTVTyWc
http://www.programmableweb.com/category/data/api
http://www.predictiveanalyticstoday.com/top-predictive-analytics-software-api/
http://www.butleranalytics.com/20-machine-learning-service-platforms/
https://techcrunch.com/video/the-sharing-economy-bullish/519620665/
Some companies are exposing their tools/services as APIs or
browser based tools for others to make use of:
• Azure Machine Learning Studio
• Figure-Eight Human in the Loop ML with crowdsourcing
support
• Watson REST API for semantic web, metadata, entity
analysis in text
• Google Cloud Prediction API
‣ shut down in April 2018; Google now focuses on its cloud
solutions
Example Processing APIs or Web Services
https://azure.microsoft.com/en-us/services/machine-learning-studio/
https://www.figure-eight.com/
https://dataplatform.ibm.com/docs/content/analyze-data/pm_service_api_spark.html
https://cloud.google.com/prediction/docs/
• Email systems (Google, Microsoft Office365)
• File sharing systems (Dropbox, Box, Microsoft OneDrive, Google
Drive, ...)
• Business systems (Salesforce, Servicenow, ..)
SaaS Examples
End of
Data Science Use Case Studies
Week 5:
Data Management & Governance
Data Management
• You want the data you are using to be of
sufficient quality for your purpose
- Accuracy
- Completeness
- Consistency
- Integrity
- Reasonability
- Timeliness
- Uniqueness/deduplication
- Validity
(from the Data Management Association, DAMA)
• Much of this is a data management issue
- But data management is about more than just data quality!
Data quality
https://www.naa.gov.au/information-management/building-interoperability/interoperability-development-phases/data-governance-and-management/data-quality
Data management is the development, execution and
supervision of plans, policies, programs and practices
that control, protect, deliver and enhance the value of
data and information assets.
Data Management
• See “Top 10 Mistakes in Data Management” a tutorial
from Intricity (a data management company) (Youtube)
• See “How to avoid a data management nightmare”, a
video created by NYU Health Sciences Library
(Youtube)
Data Management (cont.)
https://www.youtube.com/watch?v=5Pl671FH6MQ
https://www.youtube.com/watch?v=nNBiCcBlwRA
Examples of data management issues arising in data science projects:
Medical informatics: for predicting fungal infections from nursing
notes, the team needs to abide by
confidentiality and security requirements.
Internet advertising: what implicit and explicit data is stored
about a user?
Retailing: conduct market intelligence on new
products; put together data from different
divisions (brands) within the company.
Predictive medical system: implementation may need changing
standard operating procedure for staff
Data Management and Data Science
Science: reproducibility and credibility of scientific work,
producing artifacts of knowledge, creating scientific data
Business: governance, compliance, information privacy, etc.
Curation: e.g. museums and libraries, preservation, maintenance,
etc.
Government: a unique legislative environment that regulates them
(e.g., “transparency”), archiving, FOIs, support data
infrastructure, etc.
Medicine: significant privacy issues, conflicting corporate financial
constraints, government regulations and furthering of
medical science
Contexts for Data Management
End of
Data Management
Week 5:
Data Management & Governance
Data lifecyles
• Collection: getting the data
• Wrangling: data preprocessing, cleaning
• Analysis: discovery (learning, visualisation, etc.)
• Presentation: arguing that results are significant and useful
• Engineering: storage and computational resources
• Governance: overall management of data
• Operationalisation: putting the results to work
Our Standard Value Chain
https://confluence.csiro.au/display/RDM/Research+Data+Management
CSIRO research data lifecycle
https://confluence.csiro.au/display/RDM/Research+Data+Management
https://old.dataone.org/data-life-cycle
DataOne model
https://old.dataone.org/data-life-cycle
https://www.dcc.ac.uk/guidance/curation-lifecycle-model
DCC data (curation) lifecycle model
https://www.dcc.ac.uk/guidance/curation-lifecycle-model
End of
Data Lifecycles
Week 5:
Data Management & Governance
Data Governance
• See “What is Data Governance?” by Rand Secure Data
(Youtube)
• See “What is Data Governance?” by Intricity (Youtube)
What is Data Governance?
https://www.youtube.com/watch?v=t4IOS5csv40
https://www.youtube.com/watch?v=sHPY8zIhy60
Supporting and handling:
• ethics, confidentiality
• security
• consolidation and quality-assurance (e.g. link all customer
related information together)
• persistence (backups and recoverability)
• regulatory compliance
• organisation policy compliance
• organisation business outcomes
which may include handling the steps in the data science and/or
big data value chain
Data Governance
Data governance and data management are often used
interchangeably.
Better to treat them as separate levels
• Data Management is what you do to handle the data
o Resources, practices, enacting policies
• Data Governance is making sure that it is done
appropriately
o Policies, training, providing resources
o Planning and understanding
Governance and management
• Must follow laws
o Australian Privacy law
o Australian medical data regulations
o Australian telecommunications act
o EU’s General Data Protection Regulations (GDPR)
• Must meet (funding) requirements
o Australian Research Council (ARC)
o National Health and Medical Research Council (NHMRC)
• Must be ethical
o Don’t be evil!
Legal and ethical responsibilities
• Confidentiality
• Ownership
• Copyright
• Intellectual property
• Licensing
Just because a data science project ends,
the data curation shouldn’t!
Other legal restrictions
• Rights for
o Privacy
o Access
o Erasure
o and more!
• Work with the stakeholders
• Be transparent and clear
Ethics – doing what is right
• Regulations devised by various government bodies: taxation,
medical care, securities and investments, work health and safety,
employment, corporate law.
• Government bodies need to check companies for compliance
• Regulatory compliance:
organisations ensure that they are aware of and take
steps to comply with relevant laws and regulations.
• Auditing
systematic and independent examination of books, accounts,
documents and vouchers of an organization to ascertain how
far they present a true and fair view
• auditing data and records are a good data source for Data Science
Regulations and Compliance
Terminology
For our purposes, we define:
• Privacy as having control over how one shares oneself, e.g., closing
the blinds in your living room
• Confidentiality as information privacy, how information about an
individual is treated and shared, e.g., excluding others from viewing
your search terms or browsing history
• Security as the protection of data, preventing it from being
improperly used, e.g., preventing hackers from stealing credit card
data
• Ethics as the moral handling of data, e.g., not selling on other’s
private data to scammers
• Implicit data that is not explicitly stored but inferred with reasonable
precision from available data, see “Private traits and attributes are
predictable ...”
http://www.pnas.org/content/110/15/5802
End of
Data Governance
Week 5:
Data Management & Governance
Stakeholders
• Stakeholders are any parties that have a relationship with a
project/policy/product/data.
This includes
o the data’s source
o managers
o analysts and users
o IT developers
o data scientists!
Who is responsible?
With great data comes great responsibility,
for all stakeholders
NIST Reference Architecture showing actors
and roles in data management
Actors
End of
Stakeholders
Week 5:
Data Management & Governance
Data Management Planning
How do you get it all right?
• Policies and laws
o rights, Australian privacy principles, EU GDPR
• Procedures and practices
o access, ownership, security
• Planning and training
o data management plans, design
• Management and capability
o technology, staffing
• Governance
o oversight & review, ethics
Getting data governance right
A DMP provides
• Clarity
• Direction
• Transparency
• Expectations
The result is
• Improvements to efficiency, protection, quality and
exposure
• Value
• Innovation
Data Management Plans - purpose
• Backups
• Survey of existing data
• Data owners &
stakeholders
• File formats
• Metadata
• Access and security
• Data organisation
https://www.ands.org.au/__data/assets/pdf_file/0011/690878/Data-management-plans.pdf
Data management plans - content
See also DMPTool – http://dmptool.org
• Bibliography
• Storage
• Data sharing, publishing
and archiving
• Destruction
• Responsibilities
• Budget
https://www.ands.org.au/__data/assets/pdf_file/0011/690878/Data-management-plans.pdf
http://dmptool.org/
• The data community has lots of tools and systems
available (See also DMPTool – http://dmptool.org )
o Archives to use
o Indexes
o Metadata standards
o Data management tools
• Examples
o ARDC (formerly ANDS)
o Monash!
Data management communities
http://dmptool.org/
End of
Data Management Planning
Week 5:
Data Management & Governance
Data Management Mistakes
• Don’t forget data management and governance!
o Access & security
o Software & hardware
o Regulations, ethics & licensing
o Stakeholders & transparency
Your case study – Assignment 4
• Australian government wanted to double check the incomes of people
being paid social welfare payments.
• The Online Compliance Intervention system (aka RoboDebt) was set up
in 2016 to automatically compare ATO records to Centrelink records.
o Calculates the benefits that people are entitled to, based on
assumptions about their earnings
o Debt collection letters if benefits have been overpaid
• Problems discovered with a lack of human-in-the-loop for double-checking
o Incorrect/inappropriate calculations
o Using out-of-date data
o Sending debt notices …
Week 4:
Statistical Modelling
Truth of Fitted Models
Learning Outcomes (Week 4)
Compared to previous weeks, you will be exposed
to more mathematics!
• Understand what the maths is trying to do
• Understand the concepts involved
• Don’t have to remember all the formulas exactly
For the variables of an individual data case (e.g. a single loan
application or a single heart disease patient), the “truth” can be
measured directly
• Across examples, the “true” model is harder to define:
‣ What is a “true” model of physics? – Newtonian physics, String Theory?
• How can you measure the “true” model for the heart disease
problem?
‣ collect infinite data and infer statistically
‣ but it's a dynamic problem, and general population characteristics
are always changing
• regardless, we assume some underlying “truth” is out there
Truth
• To evaluate the quality of results derived from learning, we need
notions of value
• So we will review quality and value.
Quality
• William Tell was forced to shoot the apple on his son’s head
• If he strikes it, he gets both their freedoms
William Tell’s Apple Shot
• This shows “value” as a function
of height
• Loss varies depending on where
it strikes
• How do you compare loss of life
versus gain of freedom?
William Tell’s Apple Shot (cont.)
The boy is smiling! It is hard to find a
cartoon with an apple on a boy’s head.
• May be the quality of your prediction
• May be the consequence of your actions (making a
prediction is a kind of action)
• Can be measured on a positive or negative scale
Loss: positive when things are bad, negative (or zero) when
they’re good
Gain: positive when things are good, negative when they’re not
Error: measure of “miss”, sometimes a distance, but not a direct
measure of quality
Quality
Error measures the distance between the prediction and the actual value.
• “0” means no error, prediction was exactly right
• we can convert error to a measure of quality using a loss function, e.g.,
Quality is a Function of Error
square-error(x) = x ∗ x
absolute-error(x) = |x|
hinge-error(x) = |x| if |x| ≤ 1, and 1 otherwise
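These loss functions are easy to write down in code (a minimal Python sketch; the function names mirror the slide):

```python
def square_error(x):
    # penalty that grows quadratically with the size of the miss
    return x * x

def absolute_error(x):
    # penalty equal to the size of the miss
    return abs(x)

def hinge_error(x):
    # as defined above: |x| for small misses, capped at 1 for large ones
    return abs(x) if abs(x) <= 1 else 1

print(square_error(-2), absolute_error(-2), hinge_error(-2))  # 4 2 1
```

Note how the same error of −2 is penalised differently by each loss: the choice of loss function encodes how much you care about large misses.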
Data Analysis Algorithms
Regression
From The Elements of Statistical Learning
by T. Hastie, R. Tibshirani and J. Friedman
https://www.springer.com/gp/book/9780387848570
• Look for relationships amongst variables
• Identify the relation between salary and experience,
education, role, etc.
Real World Example:
What is Regression?
Regression
Variables can be:
• Independent Variables/Inputs/Predictors, e.g., experience,
education, role
• Dependent Variables/Outputs/Responses, e.g., salary of employee
Observation is a data point, row, or sample in a dataset
• e.g., an employee's salary, experience, education, role.
Terminology
• We can measure the strength and direction of the linear relationship of
two variables
• (Pearson product-moment) correlation coefficient is the covariance of
the variables divided by the product of their standard deviations
• R or Pearson’s R, when applied to a sample
o R=+1 is total positive linear correlation,
o R=0 is no linear correlation
o R=−1 is total negative linear correlation
Correlation coefficient
r(x, y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ (xᵢ − x̄)²) √(Σᵢ (yᵢ − ȳ)²) ),
summing over i = 1, ..., n
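As a sanity check on the formula, Pearson's r can be computed by hand in a few lines of Python (the data values are made up):

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # covariance term divided by the product of the standard-deviation terms
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)) * \
          math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return num / den

print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))   # 1.0  (total positive)
print(round(pearson_r([1, 2, 3], [6, 4, 2]), 6))   # -1.0 (total negative)
```

Perfectly collinear data gives r = ±1; shuffling one variable independently of the other drives r towards 0.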
• To determine how multiple variables are related, e.g., determine
if and to what extent the experience or education impact
salaries
• To predict a value, e.g., predict electricity consumption given
the outdoor temperature, time of day, and number of residents
in that household
When Use Regression
Example: Sales ~ TV, Radio, Newspaper
Simple Linear Regression (two-
dimensional space):
Regression fits a very simple equation to the data:
ŷ = a0 + a1·x
Here ŷ is the prediction for y at the point x using the model
parameters (a0, a1), i.e. the intercept and slope terms.
[Figure: scatter plot with a fitted line; x-axis: independent variable,
y-axis: dependent variable, showing predicted and actual values]
The aim is that the predicted response be as close as
possible to the actual response.
Best Fitting Line
• Given some data pairs (x1, y1), ..., (xN, yN),
we fit a model by finding the coefficient vector (a0, a1)
that minimises the loss function, e.g. the sum of squared residuals
Σᵢ (yᵢ − (a0 + a1·xᵢ))²
• Residuals = the distances between the observed values
and the predicted values
• Ordinary least squares (OLS) = minimises the sum of
squared residuals (SSR)
Calculating Parameters
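For simple linear regression the OLS coefficients have a closed form (the slope is the covariance term divided by the variance term), which can be sketched as follows (the data is illustrative):

```python
def fit_ols(xs, ys):
    # closed-form ordinary least squares for y ≈ a0 + a1 * x
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    a1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    a0 = y_bar - a1 * x_bar   # the fitted line passes through the mean point
    return a0, a1

# illustrative data lying exactly on y = 1 + 2x
a0, a1 = fit_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(a0, a1)  # 1.0 2.0
```

With noisy data the recovered (a0, a1) will only approximate the generating line; minimising the sum of squared residuals is what makes this "ordinary least squares".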
• If a model fits the data, it should be able to represent its
variation.
• Therefore, we may wish to measure how well it fits this
variation.
• Explained variation ~ variation in y explained by the model
• Residual variation ~ variation in y unexplained by the model
• Total variation in y = Explained variation
+ Residual variation
Goodness of fit: fitting variation
Total variation in y = Explained variation + Residual variation
SST = SSE + SSR
SST = Total sum of squared variation = Σ(y − ȳ)²
SSE = Explained sum of squared variation = Σ(ŷ − ȳ)²
SSR = Residual sum of squared variation = Σ(y − ŷ)²
Goodness of fit: formulas
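These sums of squares can be computed directly from the actual and fitted values (a Python sketch; the data and the fitted line y = 0.8 + 1.8x are made-up illustrations):

```python
def goodness_of_fit(ys, y_hats):
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
    sse = sum((yh - y_bar) ** 2 for yh in y_hats)           # explained variation
    ssr = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))   # residual variation
    return sst, sse, ssr

# fitted values from the OLS line y = 0.8 + 1.8x (worked out by hand)
xs = [0, 1, 2, 3]
ys = [1, 2, 5, 6]
y_hats = [0.8 + 1.8 * x for x in xs]

sst, sse, ssr = goodness_of_fit(ys, y_hats)
print(round(sst, 3), round(sse, 3), round(ssr, 3))  # 17.0 16.2 0.8, so SST = SSE + SSR
print(round(sse / sst, 3))                          # R^2 = SSE / SST ≈ 0.953
```

The decomposition SST = SSE + SSR holds here because the fitted values come from an OLS fit with an intercept.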
The R² (also written R2 or R-squared) value for a fitted model is a
key goodness of fit statistic.
R² = SSE (Explained) / SST (Total: SSE + SSR)
R² is between 0 and 1
• 1 is good: variability in y is fully explained by the model
• 0 is bad: no variability in y is explained by the model
Goodness of fit: R-squared
Goodness of fit: Visualisation
End of
Truth of Fitted Models
Week 4:
Statistical Modelling
Underfitting and Overfitting
• Assume a polynomial relationship between the inputs and the
output, e.g., a 10th-order (a.k.a. degree) polynomial
Polynomial Regression
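One way to experiment with the degree is to fit polynomials of increasing order and watch the training error (a sketch using NumPy's polyfit on synthetic data; the degrees and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # noisy "truth"

ssr_by_degree = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    ssr_by_degree[degree] = float(np.sum((y - y_hat) ** 2))  # training SSR

print(ssr_by_degree)
# the training SSR always shrinks as the degree grows, so a low training
# error alone cannot tell us which degree is "best"
```

This is exactly why the held-out evaluation and penalised criteria discussed below are needed.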
What is the best degree? 1, 2 or 3?
Question
Degree: 3
Polynomial Regression
Degree: 20
Is this fit better than previous fits?
Question
• Bayesian information criterion (BIC) includes a penalty
for using more variables. The preferred model is the
model with the lowest BIC.
• Other similar measures include the adjusted-R², which
imposes a penalty on additional variables that do not
have a significant effect on explaining y. The preferred
model is the model with the highest adjusted-R².
Bayesian Information Criterion (BIC)
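For least-squares regression, one common form of BIC is n·ln(SSR/n) + k·ln(n), where k is the number of fitted parameters; this particular formula is an assumption on our part, since the slides do not give one:

```python
import math

def bic(n, ssr, k):
    # one common BIC form for least-squares regression:
    # n * ln(SSR / n) + k * ln(n); k counts the fitted parameters
    return n * math.log(ssr / n) + k * math.log(n)

# illustrative: a bigger model reduces SSR a little but pays a penalty
print(round(bic(n=100, ssr=50.0, k=2), 2))
print(round(bic(n=100, ssr=48.0, k=10), 2))  # higher BIC despite lower SSR
```

The extra eight parameters only shave SSR from 50 to 48, so the k·ln(n) penalty dominates and the smaller model is preferred.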
Overfitting
Underfitting and Overfitting
Underfitting
The more parameters a model
has, the more complicated a
curve it can fit.
• If we don’t have very much
data and we try to fit a
complicated model to it, the
model will make wild
predictions.
• This phenomenon is referred
to as overfitting
Overfitting
• A small polynomial cannot fit the data well; it is said to have high
bias
• A large polynomial can fit the data well, or even too well; it is
said to have low bias
• If there is known error in the data, then a close fit is wasted:
the 25th-degree polynomial does all sorts of wild contortions!
• A poor fit due to high bias is called under-fitting
• A poor fit due to low bias is called overfitting
Overfitting (cont.)
• Split up the data we have into two non-overlapping parts,
a training set and a test set
• Do your learning,
run your algorithm,
build your model using the training set
• Run the evaluation using the test set
• Don’t run the evaluation on the training set
• How big to make the test set?
Training Set and Test Set
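A minimal hold-out split can be sketched in plain Python (the 20% test fraction is just a common choice, not a rule from the slides):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    # shuffle a copy, then hold out the last fraction as the test set
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]   # (training set, test set)

data = list(range(10))
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))  # 8 2
# fit the model on `train` only; report evaluation numbers from `test` only
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), taking the last rows as the test set would give a biased evaluation.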
End of
Underfitting and Overfitting
Week 4:
Statistical Modelling
Bias and Variance
Different data sets of size 30.
Bias: measures how much the prediction differs from
the desired regression function.
Variance: measures how much the predictions for individual
data sets vary around their average.
Bias and Variance
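Because we simulate the data here, the "truth" is known, so bias and variance can be estimated by refitting a deliberately simple model on many fresh data sets (all functions and numbers are illustrative):

```python
import random

random.seed(1)

def true_f(x):
    # the desired regression function (known here only because we simulate)
    return 2 * x + 1

def fit_and_predict(x_query):
    # draw a fresh data set of size 30, fit a one-parameter model
    # (just the mean of y), and predict at x_query
    xs = [random.uniform(0, 1) for _ in range(30)]
    ys = [true_f(x) + random.gauss(0, 0.5) for x in xs]
    return sum(ys) / len(ys)   # a deliberately simple (high-bias) model

preds = [fit_and_predict(0.9) for _ in range(2000)]
mean_pred = sum(preds) / len(preds)
bias = mean_pred - true_f(0.9)                       # systematic miss from the truth
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
print(round(bias, 2), round(variance, 3))            # bias near -0.8, variance small
```

The mean-only model is too simple to follow the trend, so it shows large bias but small variance; a very flexible model would show the opposite pattern.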
Scenario 1
■ Low complexity
■ Medium complexity
■ High complexity
■ MSE(Training Data)
■ MSE(Testing Data)
Bias vs Variance Trade-off
■ MSE(Training Data)
■ MSE(Testing Data)
■ Low complexity
■ Medium complexity
■ High complexity
Bias vs Variance Trade-off
Scenario 2
■ MSE(Training Data)
■ MSE(Testing Data)
■ Low complexity
■ Medium complexity
■ High complexity
Bias vs Variance Trade-off
Scenario 3
Optimum Degree
Bias vs Variance Trade-off
Scenario 1 Scenario 2 Scenario 3
Bias vs Variance Trade-off
• Blue line is true model that
generated the data (before
noise was added)
• Grey curve is model fit to 30
data points
• Black curve is model fit to 90
data points
In general, more data means
better fit (most of the time)
More Data Improves the Fit
MSE decreases as the amount of training data grows
• these plots are called learning curves
• different learning algorithms exhibit different behaviour
(rate of decay)
Loss decreases with Training Data
Wolpert and Macready proved:
If a [learning] algorithm performs well on a certain class of problems
then it necessarily pays for that with degraded performance on the
set of all remaining problems.
• There is no universally good machine learning algorithm (when one
has finite data)
‣ e.g. Naive Bayesian classification performs well for text classification
with smaller data sets
‣ e.g. linear Support Vector Machines perform well for text classification
No Free Lunch Theorem
End of
Bias and Variance
Week 4:
Statistical Modelling
Multiple Models and Ensembles
• When the data contains information about many groups,
it is not uncommon to fit models of the dependent
variable for each group.
‣ e.g., modelling the wind power generated from several
wind turbines in a wind farm, modelling the life
expectancy of every country
• Allows us to compare and analyse each group individually
and as part of the whole.
Multiple models
Multiple models - example
• Suppose you wanted to fit a linear model of the life expectancy for
every country in your data.
• Filtering each country one-by-one to fit 142 individual models is not
a practical solution.
• Use nest() and map() in R
o Group by country
§ Nested dataframe with new column of country-specific data
o Map lm to each country
o tidy() the lm data into a tibble, then unnest it
o Provides slope (gradient) and intercept coefficients for each
o Reorganise for analysis
Fitting multiple models
Clearer, but not clear enough
Multiple fitted models
Plotting coefficients
Goodness of fit - R^2
Fitted models – R^2 <= 0.45
• Given only data, we do not
know the truth and can only
estimate what may be the
“truth”
• An ensemble is a collection
of possible/reasonable
models
• From this we can
understand the variability
and range of predictions
that is realistic
Ensembles
• Generating an ensemble is a whole statistical subject in itself
• Often we average the predictions over the models in an
ensemble to improve performance
Ensembles (cont.)
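Averaging over an ensemble, and reading off the range of its predictions, can be sketched as follows (the three "models" are hypothetical hand-written lines):

```python
# three hypothetical fitted models (each maps x to a prediction)
models = [
    lambda x: 2.0 * x + 1.0,
    lambda x: 1.8 * x + 1.3,
    lambda x: 2.2 * x + 0.8,
]

def ensemble_predict(x):
    # average the predictions over the models in the ensemble,
    # and report the spread as a crude measure of variability
    preds = [m(x) for m in models]
    return sum(preds) / len(preds), min(preds), max(preds)

avg, low, high = ensemble_predict(2.0)
print(round(avg, 2), round(low, 2), round(high, 2))
```

The average is the ensemble's prediction; the [low, high] spread gives a feel for the range of predictions that is realistic given the model uncertainty.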
Ensembles: Large Data
Ensembles: Small Data
Ensemble of BayesNet Models
End of
Multiple Models and Ensembles
Week 4:
Statistical Modelling
Segmenting Data
• Sometimes the segmenting of data is because of the
context of the data
o Separate sources
o Separate collection circumstances
o Social or physical distinctions
• Sometimes we don’t have pre-determined segments, but
we want segmentation
o Some of the data may be similar
o Some of the modelling would be better if it didn’t need to
represent all of the data
o Better decision-making if we consider each segment
independently
Segmenting data
• Customers are grouped into segments
• Marketing is then specialised to each segment
‣ leads to better marketing
• In healthcare, segments are called cohorts
‣ used for patient management and staff organisation
• But how do you segment the data?
Segmentation Task:
Identifying Customer Segments
Segmentation can be
based on different
types of attributes
Example segmentation: traditional segmentation
in Britain uses class (from The Independent)
http://www.independent.co.uk/news/uk/home-news/britain-now-has-7-social-classes-and-working-class-is-a-dwindling-breed-8557894.html
A segmentation model is a graphical model where
• the cluster variable is unknown, called “latent”
• the cluster variable identifies the segments
• latent means the variable is never observed in the data.
Segmentation
• Relate to relationships between independent and
dependent variables
• Use functional families
• Train fitted parameters on existing data to represent
those relationships
Linear regression models
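Fitting parameters on existing data can be done in closed form for simple linear regression. A minimal ordinary-least-squares sketch (the x/y values are made up for illustration):

```python
# Ordinary least squares for y = a + b*x (closed-form solution)
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance of x and y divided by the variance of x
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
# Intercept: chosen so the line passes through the means
a = my - b * mx
print(round(a, 2), round(b, 2))  # ≈ 0.06 2.0
```

The fitted parameters (a, b) then represent the relationship between the independent and dependent variables for prediction.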
• A regression tree is a supervised machine learning
algorithm that predicts a continuous-valued
response variable by learning decision rules from the
predictors (or independent variables).
o Decision tree
• divide the data into subsets of similar values
• estimate the response within each subset.
Regression trees
Rather than using a
single function to
represent the data, …
… divide the data into similar
segments, then make
predictions in each segment
Regression trees - Example
• Binary tree structure
• Terminal nodes (or leaf nodes) are where the model
prediction is made
• Paths from the root node to the terminal node represents
decision rules.
6 splits
7 terminal nodes
Regression trees - structure
1. Identify all possible (n-1) splits of the data.
2. Compute a metric that measures the quality of each
possible split.
3. Choose the best split, break the data into two
subsets.
4. Repeat steps 1 to 3 on each subset, then continue
until a good stopping point is reached.
Choosing segments
• The partitioning is a top-down, greedy approach.
o Start with all data
o Once split, don’t change
• Searches every distinct value of every input predictor to find the
predictor/value pair that best splits the data into two subgroups (G1
and G2).
o That is, for the population inside that node, this predictor/value
pair improves the chosen criterion (e.g., ANOVA) the most.
• ANOVA criterion = SST − (SSG1 + SSG2)
• SST = Σ(yᵢ − ȳ)², the total variation of the dependent variable.
• SSG1 & SSG2 use the SST formula but with the values for the two
subgroups created by the partition.
ANOVA
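The split search from the steps above can be sketched directly from the ANOVA criterion: try every distinct predictor value as a threshold and keep the one with the largest reduction in sum of squares. The toy data is made up, with an obvious break between x = 3 and x = 10:

```python
def sst(values):
    # Total sum of squares: sum((y - mean)^2)
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)

def best_split(xs, ys):
    # Try every distinct predictor value as a threshold and keep the one
    # that maximises the ANOVA criterion SST - (SSG1 + SSG2)
    best_threshold, best_gain = None, -1.0
    for threshold in sorted(set(xs))[1:]:
        g1 = [y for x, y in zip(xs, ys) if x < threshold]
        g2 = [y for x, y in zip(xs, ys) if x >= threshold]
        gain = sst(ys) - (sst(g1) + sst(g2))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Toy data with an obvious break between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
print(best_split(xs, ys))  # the best split is at x >= 10
```

A regression tree applies this search recursively to each resulting subgroup.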
• The resulting tree is easy to understand
• Visualising the tree can reveal crucial information, such as how
decision rules are formed, the importance of different predictors
and the effect of the splitting points in the predictors.
• It can reveal information about the relationships between
variables.
• Very useful for Exploratory Data Analysis (EDA)
• Implicitly performs feature selection as some of the predictors
may not be included in the tree.
• Not sensitive to the presence of missing values and outliers.
• No assumptions about the shape and the distribution of the data.
• It can be used to fit non-linear relationships.
Regression trees - Pros
• The fit has a high variance meaning small changes in
the data set can lead to an entirely different tree.
o Overfitting is a problem for decision tree models, but
we can adjust the stopping conditions and prune the
tree.
• Can be inefficient when performing an exhaustive
search for the splitting points of continuous numerical
predictors.
• Greedy algorithms cannot guarantee the return of the
globally optimal regression tree.
Regression trees - cons
• Regression tasks relate to determining quantitative
numerical variables based on input variables
• Classification tasks are about determining a qualitative
value (e.g., category or class) based on the input
variables
• Categorical variables
• Nominal data – multiple categories but no ordering,
e.g., housing type, postcode, species, countries, phone
number
• Ordinal data – multiple categories with an order, e.g.,
education level, salary level, age group
Categorical data
• For classification task, if we want to use a decision
tree, the result is a classification tree.
• Most popular split criteria are Gini and Entropy.
Classification trees
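Both criteria are computed from the class proportions at a node; a minimal sketch (the label list is illustrative):

```python
import math
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy in bits: -sum(p * log2(p)) over class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = ["a", "a", "b", "b"]
print(gini(labels), entropy(labels))  # 0.5 1.0
```

A classification tree picks the split that most reduces the chosen impurity measure, just as a regression tree picks the split that most reduces the sum of squares.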
• Can still factor in multiple input variables
https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg
Classification tree
End of
Segmenting Data
Week 4:
Statistical Modelling
Clustering Data
• A cluster is
• segmented data for analysis
• segmented network nodes
• segmented data storage
• a group of associated computers
So clustering = segmentation… sometimes
Clustering tends to be associated with segmentation that
allows us to recognize similar combinations of attribute
values when we don’t have predefined categories.
Unsupervised machine learning
Clustering and segmentation
• Text documents, e.g., patents, legal cases,
webpages, questions and feedback
o Topic modelling
• Clients, e.g., recommendation systems
• Fault detection, e.g., fraud, network security
• Missing data
• A clustering task may require a number of different
algorithms/approaches.
Uses of clustering
• Are similar in some attributes
• May weight some attributes more than others
o Not all attributes are as important as others
o Needs feature selection
• May be considered to be close to each other
o Needs distance measurements
Elements in a cluster
• Distance
o Commonly dealing with data as vectors
o Euclidean distance = vector distance between points,
e.g., A = (1,1), B = (3,5)
d(A,B) = √((3 − 1)² + (5 − 1)²) = √20
• Centroid
o Value of a data point in the centre of the cluster
o May be hypothetical and not match any known data
• Nearest neighbor
o A data point that is closest to some reference value,
e.g., the centroid or the population of a cluster
Clustering terms
• k-means algorithm
1. Randomly select centroids for K clusters
2. Select nearest data points as cluster population
3. Find mean values in each cluster and use that as new
centroid
4. Re-evaluate populations and centroids until
stable/convergence
• Does not work with categorical data and is susceptible
to outliers
• Have to predefine a value for K
• No guarantee there are actually clusters to find
Clustering – K-means
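The four k-means steps above can be sketched in plain Python. The points are made up, and a fixed iteration count stands in for a proper convergence test:

```python
import random

def kmeans(points, k, iters=20):
    # 1. Randomly select centroids for k clusters
    centroids = random.sample(points, k)
    for _ in range(iters):
        # 2. Assign each point to its nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # 3. New centroid = mean of each cluster's population
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # 4. Repeating a fixed number of times approximates "until stable"
    return centroids, clusters

random.seed(0)  # for reproducibility
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # one centroid near each group of points
```

Note the centroids converge to the means of the two obvious groups, and that neither centroid need coincide with an actual data point.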
• Clusters within clusters!
• Agglomerative (bottom-up) vs Divisive (Top-down)
• Agglomerative
o Treat each data point as a centroid in a cluster of population 1
o Form new clusters by merging nearby clusters
o Continue until only one cluster
• Various ways to calculate which clusters should be merged, often
looking at (min or max) distances of the clusters’ populations to
each other
• The results of hierarchical clustering are usually presented in a
dendrogram
Clustering - Hierarchical
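The agglomerative procedure above can be sketched with single linkage (cluster distance = minimum pairwise distance); the points and the use of squared Euclidean distance are illustrative choices:

```python
def sqdist(a, b):
    # Squared Euclidean distance between two points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def agglomerate(points):
    # Each point starts as its own cluster of population 1
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Single linkage: distance between clusters = min pairwise distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sqdist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Merge the two nearest clusters and continue until one remains
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merges[-1][0] + merges[-1][1])
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The recorded merge order and distances are exactly what a dendrogram visualises, and cutting at a distance threshold recovers a chosen number of clusters.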
• Greedy!
• Can be costly, due to having to calculate a lot of
distances for each level of the tree.
• But with no randomness, the same tree will be
produced each time.
• Can cut the tree at any level so as to get the
population of a certain number of clusters.
Clustering – Hierarchical
• Both segmentation and clustering can help
o Model, predict and classify data
o Decision-making
o Understand the data
• But different types of input and outcome data need
different types of segmentation
Segmentation and clustering
End of
Clustering Data
Week 3:
Data Sources and
Modelling the Truth
Sharing Data
• Working together on a project
• Common needs, common resource
• Data as a product
• Data-based service
• For research!
- Duplication
- Verification
- Re-use
- Promotion
- Knowledge!
Why share data?
• Shared data provides opportunities
- New combinations of data
- New relationships in data
- New visualisations of data
- New understandings of data
- Also creates new data!
Opportunities from shared data
• Collection: getting the data
• Wrangling: data preprocessing, cleaning
• Analysis: discovery (learning, visualisation, etc.)
• Presentation: arguing that results are significant and useful
• Engineering: storage and computational resources
• Governance: overall management of data
• Operationalisation: putting the results to work
Our Standard Value Chain
• Data that is “freely available to everyone to use and
republish as they wish, without restrictions from copyright,
patents or other mechanisms of control” – Wikipedia
- Free – accessible, costs nothing
- Free – unrestricted usage
- Free – simple, non-proprietary format
• Commonly associated with open government data
Open data
From “the New Data Republic: Not Quite a Democracy” in MIT
Sloan Review 2015
• from Hal Varian (at Google): “information that once was
available to only a select few ... available to everyone”
• from Robert Duffner (at Salesforce): “finally puts crucial
business information in the hands of those who need it”
• government and IT departments building data and
infrastructure to allow sharing, e.g. USA Open Gov Initiative
• analytic tools, (desktop and web-based), available to analyse
it.
Democratization of Data
http://sloanreview.mit.edu/article/the-new-data-republic-not-quite-a-democracy/
http://www.apple.com/au/
The reports:
• “Open data”: Unlocking innovation and performance with liquid
information” by MGI, and
• “Science as an open enterprise” by the Royal Society (UK)
claim that:
• open data provides new opportunities for business, new products
and services, and can raise productivity
• open data supports public understanding and citizen engagement
• scientists need to better publicise their data (with help from
universities, etc.)
• industry sectors should work with regulators and coordinate
industry collaboration
• collaboration across sectors in both public and private settings,
• e.g., disaster response, education
Open Data Recommendations
http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information
https://royalsociety.org/~/media/policy/projects/sape/2012-06-20-saoe-summary.pdf
The Scientific American report:
“What’s Wrong with Open-Data Sites–and How We Can Fix Them”
discusses:
• it’s hard to make sense of the huge amount of government data
‣ Data.GOV has 230k datasets, and Data.GOV.AU has 30k
• authors developed Data USA
What’s Wrong with Open Data Sites
https://blogs.scientificamerican.com/guest-blog/what-s-wrong-with-open-data-sites-and-how-we-can-fix-them/
https://datausa.io/
• Publicly available
‣ government and IT departments building data and
infrastructure to allow sharing,
‣ e.g., Data.GOV has 230k datasets, and Data.GOV.AU has
30k
• Machine-readable?
• But..
‣ it is not always usable
‣ people need the right skills
Open Data - Summary
From “the New Data Republic: Not Quite a Democracy” in MIT
Sloan Review 2015
• from Hal Varian (at Google): “information that once was
available to only a select few ... available to everyone”
• from Robert Duffner (at Salesforce): “finally puts crucial
business information in the hands of those who need it”
• government and IT departments building data and
infrastructure to allow sharing, e.g. USA Open Gov Initiative
• analytic tools, (desktop and web-based), available to analyse it
• but people need the right skills!
‣ open data is all good and well, but people need to be able
to use it too!
Democratization of Data
http://sloanreview.mit.edu/article/the-new-data-republic-not-quite-a-democracy/
http://www.apple.com/au/
End of
Sharing the Data
Week 3:
Data Sources and
Modelling the Truth
Utilising Data Sources
Where to find and how to use data
sources
If we want to forecast traffic:
blockages, clearing, surprising
situations, alternate routes
• Critical data:
‣ GPS data on traffic flow
‣ Maps
‣ incidents and events
‣ weather
• Challenge:
‣ collect different sources
of data
image: math.tu-berlin.de
We’ll now look at three examples of public data and its use:
1. NYC data
2. Traffic prediction
3. Predictive analytics for banks
Three Examples of Using Data
NYC embarked on a program in 2011 to make the city’s data accessible:
• “How data and open government are transforming NYC”:
‣ “In God We Trust,” New York City Mayor Mike Bloomberg tweeted,
“Everyone else, bring data.”
‣ applications of the data provided:
- “real-time updates on your phone based on where the buses are located
using very low-cost technologies”
- applying predictive analytics to building code violations and housing data
to try to understand where potential fire risks might exist
• Bloomberg signs NYC 'Open Data Policy' into law, plans web portal for
2018
• NYC Open Data portal
• Melbourne has a similar portal: City of Melbourne’s open data platform
New York City Data
http://radar.oreilly.com/2011/10/data-new-york-city.html
http://www.engadget.com/2012/03/12/bloomberg-signs-nyc-open-data-policy-into-law-plans-web-porta/
https://nycopendata.socrata.com/
https://data.melbourne.vic.gov.au/
“How we found the worst place to park in New York City” is examples,
and a discussion of the complexities of getting data out of NYC:
• Map of road speed by day+time: from GPS data for NYC cabs; data
obtained via FOIL request, then made public by the recipient
• Danger spots for cycles: NYPD crash data obtained by daily
download of PDF files followed by (non-trivial) extraction.
• Dirty waterways: fecal coliform measurements on waterways from
Department of Environmental Protection’s website; extracted
from Excel sheets per site; each in a different format
• Faulty road markings: parking tickets for fire hydrants by location
from NYC Open Data portal need to normalize the addresses
supplied
NYC Data - Using it!
http://www.ted.com/talks/ben_wellington_how_we_found_the_worst_place_to_park_in_new_york_city_using_big_data/transcript?language=en
http://iquantny.tumblr.com/post/93845043909/quantifying-the-best-and-worst-times-of-day-to-hit
http://www.reddit.com/r/bigquery/comments/28ialf/173_million_2013_nyc_taxi_rides_shared_on_bigquery
http://iquantny.tumblr.com/post/77977436883/the-terrifying-cycling-injury-map-of-nyc-2013
http://www1.nyc.gov/site/nypd/stats/traffic-data/traffic-data-collision.page
http://iquantny.tumblr.com/post/97788820249/fecal-map-nyc-the-worst-places-to-swim-in-the
https://data.cityofnewyork.us/Environment/Watershed-Water-Quality-Data/y43c-5n92
http://iquantny.tumblr.com/post/83696310037/meet-the-fire-hydrant-that-unfairly-nets-nyc
https://nycopendata.socrata.com/
Back in 2008, Microsoft introduced a tool for avoiding traffic jams.
The system was called Clearflow:
• Aims to forecast traffic: blockages, clearing, surprising situations,
etc.
• and to suggest alternate routes
• critical data use to build the application included:
‣ GPS data on traffic flow
‣ maps
‣ incidents and events
‣ weather
• See Eric Horvitz’s discussion of system: “Data, Predictions, and
Decisions in Support of People and Society” (skip to 7:40-11:06)
Traffic Prediction
http://www.nytimes.com/2008/04/10/technology/10maps.html?_r=0
http://videolectures.net/kdd2014_horvitz_people_society/
See this video of a seminar on “Predictive Analytics with Fine-grained
Behavior Data”
• by Foster Provost (Professor at NYU and author of this book)
presented at Stata+Hadoop in 2013
• describes customer prediction problem for banking products
He discusses whether bigger data is “always” better. So
is big data better?
• His answer is that it’s not always (much) better.
• But that big data can certainly be better if the data is richer
and more fine-grained.
Predictive Analytics for Banks
https://www.youtube.com/watch?v=1jzMiAfLH2c
http://conferences.oreilly.com/strata/stratany2013/public/schedule/detail/31685
http://data-science-for-biz.com/
What lessons have we learnt from these “data” examples?
• NYC data
• data requires work to clean up
• be creative about sources
• Traffic prediction
• combine many sources
• you might have to generate some of your own
• Predictive analytics for banks
• fine-grained data really helps, but is harder to use
Lessons Learnt from the examples
Many companies are exposing their data and their website
functionality as APIs (Application Programming Interfaces) for
others to make use of:
• Facebook API
• Twitter API
e.g. search tweets
• LinkedIn API
• Google Maps API
• Youtube API
e.g. documentation
• Amazon Advertising API
• TripAdvisor API
• New York Times API
Example Data/Information APIs
https://developers.facebook.com/products/
https://dev.twitter.com/rest/public
https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
http://www.programmableweb.com/api/linkedin
https://developers.google.com/maps/
https://developers.google.com/youtube/
https://developers.google.com/youtube/v3/getting-started
https://advertising.amazon.com/API
https://developer-tripadvisor.com/content-api/
http://developer.nytimes.com/
Twitter is the most famous microblogging platform
• with big corporate use
• contains lots of metadata: information about users, their follower
network, locations, hashtags, emojis+emoticons, …
Twitter
Sample Twitter XML Data
See Twitter’s developer platform
• library interfaces for Java, C++, Javascript, Python, Perl, PHP, Ruby,
...
• allows other applications to manage Twitter data for users
• extensive developer policy
• see search API doc
• lots of example case studies
Twitter Developer API
https://dev.twitter.com/
https://developer.twitter.com/en/docs/tweets/search/overview/standard.html
https://dev.twitter.com/resources/case-studies
End of
Utilising the Data
Week 3:
Data Sources and
Modelling the Truth
Joining data
• Tabular data
o Tables
- Rows: information about an object
- Columns: attributes of the object
o Relational database
• Graph data
o Nodes: entities
o Edges: relationships between entities
o Graph database
Relationships in data
• For data sets to be joined, they must have something
in common.
Joining data sets
Set A:
Product | User
Pen     | Alec
Book    | Huang
Table   | Indira
Pen     | Indira
Chair   | Blythe
Pen     | Huang

Set B:
User    | Contact
Stef    | 733 486
Indira  | 989 6732
Boris   | 939 3872
Frances | 345 7239
Miguel  | 125 8369
Huang   | 934 3482

All data from both sets:
Product | User    | Contact
Pen     | Alec    |
Book    | Huang   | 934 3482
Table   | Indira  | 989 6732
Pen     | Indira  | 989 6732
Chair   | Blythe  |
Pen     | Huang   | 934 3482
        | Stef    | 733 486
        | Boris   | 939 3872
        | Frances | 345 7239
        | Miguel  | 125 8369
Full (outer) join
Just the records that link both sets:
Product | User   | Contact
Book    | Huang  | 934 3482
Table   | Indira | 989 6732
Pen     | Indira | 989 6732
Pen     | Huang  | 934 3482
Inner join
All data from A with linked data from B:
Product | User   | Contact
Pen     | Alec   |
Book    | Huang  | 934 3482
Table   | Indira | 989 6732
Pen     | Indira | 989 6732
Chair   | Blythe |
Pen     | Huang  | 934 3482
Left (outer) join
All data from B with linked data from A:
Product | User    | Contact
Book    | Huang   | 934 3482
Table   | Indira  | 989 6732
Pen     | Indira  | 989 6732
Pen     | Huang   | 934 3482
        | Stef    | 733 486
        | Boris   | 939 3872
        | Frances | 345 7239
        | Miguel  | 125 8369
Right (outer) join
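The four joins can be sketched in plain Python over the Set A and Set B examples; in practice pandas' `DataFrame.merge` does the same via its `how="inner"/"left"/"right"/"outer"` parameter:

```python
# Set A (product, user) and Set B (user -> contact) from the slides
set_a = [("Pen", "Alec"), ("Book", "Huang"), ("Table", "Indira"),
         ("Pen", "Indira"), ("Chair", "Blythe"), ("Pen", "Huang")]
set_b = {"Stef": "733 486", "Indira": "989 6732", "Boris": "939 3872",
         "Frances": "345 7239", "Miguel": "125 8369", "Huang": "934 3482"}

# Left join: all of Set A, with a contact wherever the user matches
left = [(product, user, set_b.get(user)) for product, user in set_a]

# Inner join: only the rows where the user appears in both sets
inner = [row for row in left if row[2] is not None]

# Full outer join: the left join plus Set B users unmatched in Set A
users_in_a = {user for _, user in set_a}
full = left + [(None, user, contact) for user, contact in set_b.items()
               if user not in users_in_a]

print(len(left), len(inner), len(full))  # 6 4 10
```

The row counts match the tables above: six rows in the left join, four in the inner join, and ten in the full outer join.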
• Can be temporary
- Just for the current analysis
• Can be permanent
- Store the combined data
• Can have conditions
- Can you share the combined data?
• Can be costly
- Memory
- Processing time & capacity
‣ joining
‣ searching
‣ analysing
Joining data sets
End of
Joining Data
Week 3:
Data Sources and
Modelling the Truth
Standardising data
• If you standardise things, you can be more efficient
- Efficiency lowers costs
• So how can you standardise data?
• What role do data scientists and data science play in
standardising things related to data?
Setting the standards
Geospatial Data
Linked Open Data: DBpedia
Linked Open Data: XML
Transactional Data
Twitter Data
Internet of Things Data
• Data is about a variety of things
- (geo)spatial data
- transactional data
- linked (open) data
- social media data
- Internet of Things (IoT)
• Data comes in a variety of formats
- ASCII/text format (+ Unicode!)
- Word, Excel or PDF format
- Comma separated values (CSV)
- JSON format
- HTML or XML format
Data types and formats
• Machine-readable data: data (or metadata) which is in a
format that can be understood by a computer,
e.g., XML, JSON
• Markup language: system for annotating a document in a
way that is syntactically distinguishable from the text
e.g., Markdown, Javadoc
• Digital container: file format whose specification describes
how different elements of data and metadata coexist in a
computer file, e.g., MPEG
Data formats: key concepts
End of
Standardising Data
Week 3:
Data Sources and
Modelling the Truth
Metadata
Metadata: structured information that describes, explains,
locates, or otherwise makes it easier to retrieve, use or
manage an information resource.
Metadata is:
• data about data
• structured so that a computer can process & interpret it
MetaData
MetaData can be:
• Descriptive: describes content for identification and
retrieval, e.g. title, author of a book
• Structural: documents relationships and links, e.g.
chapters in a book, elements in XML, containers in MPEG
• Administrative: helps to manage information, e.g. version
number, archiving date, Digital Rights Management (DRM)
MetaData (cont.)
• Facilitate data discovery
• Help users determine the applicability of the data
• Enable interpretation and reuse
• Clarify ownership and restrictions on reuse
Why Use Metadata
EXIF Metadata
Book Metadata
Media Metadata
• IPTC Photo Metadata User Guide
• USGS Metadata standards
• DCC list of Metadata standards
• Medical bibliographic data in XML on PubMed
• Registry Interchange Format - Collections and
Services (RIF-CS)
Other Metadata Examples
https://www.iptc.org/std/photometadata/documentation/userguide/
https://www.usgs.gov/products/data-and-tools/data-management/metadata-creation
https://www.dcc.ac.uk/guidance/standards/metadata/list
https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/190101/index.html
https://www.ands.org.au/online-services/rif-cs-schema
• Metadata helps set standards
• Metadata should also be standardised
- Archiving data
- Sharing data
- Searching data
Standards and metadata
End of
Metadata
Week 3:
Data Sources and
Modelling the Truth
Standardising handling data
Examples of standards:
• Metadata standards, such as Dublin Core, examples at A
Gentle Introduction to Metadata
• XML formats for sharing models, e.g. PMML (see below)
• Standard vocabularies for use in Medicine, e.g.,
‣ health codes: disease and health problem codings ICD-10
‣ systematized nomenclature of medicine, clinical terms,
SNOMED CT
• Standards for describing the data mining/science process,
such as CRISP-DM
Example Standards
http://dublincore.org/documents/dces
http://www.language-archives.org/documents/gentle-intro.html
https://en.wikipedia.org/wiki/ICD-10
http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
We’ve seen many data
science processes and
lifecycles:
• e.g. our own “standard
Data Science value chain”
• CRISP-DM discussed
previously, is a
standardised data science
process
• statisticians sometimes
use the term exploratory
data analysis for part of
the process
Data Science Process
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Semi-structured data is data that is presented in XML or
JSON:
• see some examples here
• Note YAML (“YAML Ain’t Markup Language”), an indentation-based,
easier-to-read superset of JSON
• standard libraries for reading/writing/manipulating semi-
structured data exist in Python, Perl, Java
• don’t need to know all the details of XML (and related Schema
languages), there are many good online tutorials, e.g.
W3schools.com
• their use in systems leads to the open world assumption about
data, where we may download relevant data on the fly from
APIs etc.
Semi-Structured Data
https://en.wikipedia.org/wiki/JSON
http://www.w3schools.com/
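Reading semi-structured data with the standard libraries is brief; a minimal sketch using the same (hypothetical) book record in JSON and XML:

```python
import json
import xml.etree.ElementTree as ET

# The same (hypothetical) book record in JSON and in XML
record_json = '{"title": "Data Science", "year": 2020, "authors": ["A", "B"]}'
record_xml = '<book year="2020"><title>Data Science</title></book>'

book = json.loads(record_json)   # parsed into dicts, lists and numbers
print(book["title"], book["year"])

root = ET.fromstring(record_xml)  # parsed into an element tree
print(root.findtext("title"), root.get("year"))  # XML attributes stay strings
```

Note that JSON carries types (the year parses as an integer), while XML attribute values arrive as strings and need explicit conversion.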
PMML: Predictive Model Markup Language
PMML provides a standard language for describing a (predictive)
model that can be passed between analytic software (e.g. from
R to SAS).
• PMML: An Open Standard for Sharing Models
• A list of products working with PMML is the PMML Powered page
on DMG site.
Model Language
http://journal.r-project.org/archive/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf
http://www.dmg.org/products.html
PMML Example
End of
Standardising Handling Data
Week 3:
Data Sources and
Modelling the Truth
Scripting
• A script is a series of commands to be performed
• A script is executable on demand
- not compiled to an executable form
- interpreted command-by-command as it is executed, like
on a command line
• Examples:
- R
- Python
- Unix shell
Introduction to scripting languages
See In data science, the R language is swallowing Python by Matt Asay.
• Free or not? Python: yes; R: yes
• Developed by whom? Python: computer scientists (for general use);
R: statisticians (huge support for analysis)
• Characteristics: Python is better at integrating with other systems;
R is better for stand-alone analysis and exploration
• Easy to learn/extend: Python > R
• Scalability: Python > R
Discussion: Python vs R
https://www.infoworld.com/article/2951779/in-data-science-the-r-language-is-swallowing-python.html
• Command-line code for Unix (+ Linux & Mac OS)
• Commonly include:
– Wildcards: *, ?
e.g., ./Customer??Loc*.txt
– Piping, | : output from one command streams as input to
another
e.g., cat product*v1.txt | sort
– / in filepaths, not \
– ; to separate commands
– > and < to indicate output and input (>> for appends)
e.g. cat product*v1.txt > contents
Unix Shell script
• pwd: path of current directory
• cd DIRPATH: change directory to DIRPATH
• ls DIRPATH: output the filenames of DIRPATH
• cp FILENAME NEWFILENAME: copy FILENAME to NEWFILENAME
• mv FILENAME NEWFILENAME: rename FILENAME to NEWFILENAME
• echo “TEXT”: output TEXT
• cat FILENAME: output the contents of FILENAME
• less FILENAME: output the contents of FILENAME, one screen at a
time (can page up and down)
e.g., cd DATA/; cat product*v1.txt| sort > contents
Unix commands
• wc FILENAME: count the number of lines, words and characters in
FILENAME
• grep “PATTERN” FILENAME: output any lines in FILENAME that
match PATTERN
e.g., grep “Australia” product*v1.txt
grep “^[0-9]” product*v1.txt
• head FILENAME: output the first lines of FILENAME
• tail FILENAME: output the last lines of FILENAME
• awk: process text files in various ways, including search and replace
cf. sed, perl
• uniq: remove duplicates from the input (presumes it is sorted)
• diff: find similarities and differences between two files
cf. test
Unix commands
• man COMMAND: output user manual pages for
COMMAND
• COMMAND -?: output shorter help pages for COMMAND
• COMMAND --help: ditto
Unix help and arguments
• Piping shell commands buffers their execution
– Don’t try to do everything at once, just enough for the
next command
– Tend to work through text files line by line
– Allows different commands to be working on different
parts of the data
– Scales up well for big files!
→ Reduces the memory overhead
Shell scripts and big data
• Ideally, the software used for data should be
standardised, just like data should be standardised.
– Consistency
– Capabilities
– Reproducibility
• However, just like data varies, so can the needs of
software
– Published software doesn’t always meet the needs
– Rapid prototyping is what scripting languages are ideal
for!
Standardising software
• Need to standardise how data is accessed
• Need to be able to reproduce
– Wrangling
– Analysis
– All other stages of the value chain!
• Scripting allows these to be recorded
• Scripting allows these to be shared
• Scripting allows these to be modified
Standardising workflow
• It is also vital to understand why certain steps are used
– Why was the wrangling done
– What was the analysis for
• The context of working with data also needs to be
recorded
Standardising processes
• So how can you standardise data?
– Access
– Format
– Value & vocabulary
– Metadata
– Software & tools
– Process & workflow
• What role do data scientists and data science play in
standardising things related to data?
– Establishing the standards
– Enacting the standards
Setting the standards
End of
Scripting
Week 3:
Data Sources and
Modelling the Truth
Temporal data
• Data indexed with time!
• Data indexed with dates!
• Data about change, transformation and occurrences!
• Time series data
Temporal data
• The temporal aspect of the data can be of different types
§ Specific
o e.g., 20 July 1969 – Man landed on the moon
o e.g., 3:00:00 am, 4 April 2021 – Daylight saving
ended in Victoria
§ Relative
o To a time, e.g., 4 weeks ago – Week 1!
o To each other: Ordinal data that has a temporal
progression, e.g., Stages of an insect’s lifecycle
Temporal context
• Date
o Day of the week – Monday, Mon., M
o Day number – 1, 1st , first
o Month – January, Jan, J, 1
o Year – 2020
• Time
o Hour – 1, one, 13
o Minute – 15, quarter past, o’clock
o Second – 20
o Period – am, pm, AM, PM
Temporal phrases – form
• Date
o 20 Jan 2020, Jan 20
o 20th of January, 2020,
twentieth of January
o 20/01/2020, 1/20/2020, 20:01:2020,
2020-01-20
• Time
o 1 PM
o 13:00, 13.00
o One o’clock
Temporal phrases – syntax
• Era: AD 2020, 20 Jan 2020 CE
• Calendar: Lunar, Hebrew, Chinese, etc.
• Time zone: 1pm AEST, 1pm UTC+10:00
• Submultiples: 13:00.001
• Years do not have the same number of days
• Months have different numbers of days
• It can be difficult to identify the day of the week, day of
the month and week in the year
• Years and months start on different days
• Even specific time phrases can be very complicated to
parse!
Temporal phrases – more
• Decompose once, use often
• Not all elements may be important to you
• Numbers are easier to work with than words
• If the syntax is too complicated, decompose it first into
parts that are known, e.g., 03:45 GMT+10, Monday 2 Jan.
2020
Extracting temporal elements
• If temporal elements need to be used together
o Convert once, use often
o Be consistent
o Be regular
Standardise!
Converting temporal elements
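The "decompose once, convert once, use often" idea maps directly onto Python's datetime module; a minimal sketch (the raw phrase and the AEST zone are illustrative):

```python
from datetime import datetime, timezone, timedelta

# Decompose once: parse the phrase into a standard representation
raw = "20/01/2020 13:00"
dt = datetime.strptime(raw, "%d/%m/%Y %H:%M")

print(dt.isoformat())     # standardised form: 2020-01-20T13:00:00
print(dt.strftime("%A"))  # day of the week: Monday

# Attach a time zone (here AEST = UTC+10) so comparisons are unambiguous
aest = timezone(timedelta(hours=10))
print(dt.replace(tzinfo=aest).astimezone(timezone.utc).isoformat())
```

Once parsed, the elements (day of week, month, time zone offset) can be reused consistently instead of re-parsing the many possible phrase syntaxes.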
• Time is not decimal
• Months and years have different numbers of days
• Be careful how you compare time elements
• Socially, not all time periods are the same
o weekends
o holidays
o pay periods
Context!
Counting time
• Do you want to show
§ Distinct events?
o Irregularities
o Changes
§ Connections and trends?
o Seasonal
o Regularities
o Variance of values
Plotting temporal elements
• Distribution of plots
o Hour
o Day
o Month
o Year
Plotting temporal elements
• Line plots
o Highlight temporal continuity within and between time
periods
Plotting temporal elements
• Calendar plots
o Helps visually identify irregularities
Plotting temporal elements
• Other plots?
o Bar charts?
o Pie charts?
o Rose/polar
chart?
Plotting temporal elements
End of
Temporal Data
Week 3:
Data Sources and
Modelling the Truth
Correlation vs Causation
• Models represent aspects of a scenario to help us
understand it.
• Statistical models represent the relationships between
variables
o Independent variable(s)
o Dependent variable
• A model can be used to predict about the dependent
variable, given information about the independent
variable(s)
• Rather than trying to use all data about the scenario, the
model just reduces the data set to a low dimensional
summary.
Statistical Modelling
Variables
• Dependent variable
o Outcome variable
o Explained variable
o Response variable
o Target variable
o Predicted variable
o Regressand
• Independent variable
o Input variable
o Explanatory variable
o Control variable
o Predictor variable
o Regressor
CO2 levels, based on data recorded at the South Pole
Modelling data – example
from the BackReaction blog by Sabine Hossenfelder
Modelling
http://backreaction.blogspot.com.au/2008/04/emergence-and-reductionism.html
• “All models are wrong, but some are useful”…
George Box
• “The approximate nature of the model must
always be borne in mind”… George Box
• “The purpose of models is not to fit the data but to
sharpen the questions”… Samuel Karlin
Do Models need to be truthful?
• Variables in a scenario have
relationships
• Some variables influence the
outcome of activities and thus
other variables.
o Dependent vs independent
• Influence diagrams model that
dependency
• It is not always easy to recognise
what influences what
Influence diagrams
• Causation indicates that one event is the result of the
occurrence of the other event; i.e. there is a causal
relationship between the two events. This is also referred
to as cause and effect.
– Australian Bureau of Statistics
Causation
Correlation
• Correlation is a statistical measure (expressed as a
number) that describes the size and direction of a
relationship between two or more variables. A
correlation between variables, however, does not
automatically mean that the change in one variable is
the cause of the change in the values of the other
variable.
– Australian Bureau of Statistics
Correlation & Causation
• Causation implies Correlation (normally);
BUT Correlation does not imply Causation
• We can measure the correlation (next week!)
but that measurement does not tell us anything about the
causation.
• Identifying causation requires controlled experiments that
examine the data related to a situation with or without a
possible correlated variable.
• Scientific hypothesis
A causes B
• Correlation
There is a relationship between A and B, but neither
necessarily causes the other.
• Can we model B, using A?
• e.g., …
FIT5145
Introduction to Data Science
Dr Michael Niemann
Faculty of Information Technology
Slides credit to Prof. Wray Buntine
and Dr. Guanliang Chen
About This Unit
Why this unit?
Data Science is fast developing:
• every academic & industry community wants to claim credit
• huge community of (self proclaimed) “leading international
experts”, “highly sought-after consultants”, and “thought
leaders” to confuse you with advice, blogs, guidelines, …
• huge growth in software and services
We will try to cover the full extent of what makes Data
Science:
• background and context
• leading review articles, lectures, introductions
• academic surveys and national programmes
This is a “grand tour” unit – breadth, not depth.
Meet the Teaching Team
Dr. Guanliang Chen
Chief Examiner
& Lecturer
Dr. Jesmin Nahar Workshop TA
Dr. Michael Niemann Lecturer Mohit Gupta Workshop TA
Dr. Saher Manaseer Admin Tutor Dr. Han Phan Workshop TA
Dr. Heshan Kumarage Admin Tutor Dr. Tam Vo Workshop TA
Dr. Chang Joo Yun
(Chris) Tutor Yi Wei Zhong Workshop TA
Jeffery Liu Tutor
Contacts
1. Ask questions whenever you need
2. Check the Ed platform forums on Moodle
• Click the link on Moodle to enrol in Ed
• Please do NOT post your solutions to assignments
3. Attend the consultation sessions
• Consultation Times on Moodle
Prerequisites
You will need:
• high school level of mathematics and statistics
• basic programming and database skills
• a “critical mindset”:
• you will read/view a variety of materials
• different levels of quality and standards
• some sales, some educational, some journalistic
• basic exposure to information technology and internet
businesses:
• software, science, and business computing
• Amazon, Google, Twitter, …
Unit Schedule
Week Topics Deadlines Comments
WEEK 1
(6-12 Dec)
Overview of Data Science –
WEEK 2
(13 Dec-21 Dec)
Dimensions of Data Science
projects and Big Data
• Case Study Proposal
• Quiz
Xmas/New
Years break
– – University shut down
WEEK 3
(4-9 Jan)
Data sources and modelling
the truth –
WEEK 4
(10-16 Jan)
Statistical modelling • Coding Task I – R
WEEK 5
(17-23 Jan)
Data management &
governance • Coding Task II – scripting
WEEK 6
(24-30 Jan)
Issues with Data Science • Case Study report
https://lms.monash.edu/course/view.php?id=132993§ion=6
https://lms.monash.edu/course/view.php?id=132993§ion=7
https://lms.monash.edu/course/view.php?id=132993§ion=8
https://lms.monash.edu/course/view.php?id=132993§ion=9
https://lms.monash.edu/course/view.php?id=132993§ion=10
https://lms.monash.edu/course/view.php?id=132993§ion=11
Weekly classes
Each week, this unit will contain three teaching sessions:
1. Pre-recorded Lectures (available late Monday/Tuesday morning)
See the link/s in the relevant week’s section on Moodle
2. 2 hour Interactive Hybrid Workshop (Wednesday 2-4pm)
See the details in the Class Streaming section. These will be
recorded.
3. 2 hour Tutorial (Thursday or Friday)
See the details in the Class Streaming section. These will not be
recorded.
Students must view the lectures before the Workshop and Tutorials
each week.
https://lms.monash.edu/course/view.php?id=132993§ion=1
Instructions on setting up your own laptop/desktop are
on Moodle in the sections Unit Information and Week 0.
• You need at least access to
• R and R Studio
• These can be accessed via
• Anaconda – a programming package that also includes
Python Notebooks
• MoVE – a Monash virtual desktop environment that
simulates you being in a campus lab
Technical requirements
Resources for this Unit
• Lectures, Workshops & Tutorials
• Moodle: unit information, assessments, discussion
forum, etc.
• Alexandria: an online textbook, which contains lots
of useful exercises and resources.
• Additional textbook: The Art of Data Science by
Peng & Matsui (http://leanpub.com/artofdatascience)
• Please notice that:
• Library services available
• Special consideration policies
• Disability Support Services (DSS) available
https://www.alexandriarepository.org/syllabus/introduction-to-data-science/
http://leanpub.com/artofdatascience
Warning
• The Alexandria textbook links to a LOT of content:
• videos, blogs, articles, …
• there is way too much for you to read it all in detail
• Focus on the details when you need something for
assessment (or want for your own development)
• Very importantly, use the guide for what to read
– the double “Johnny look it up” icon
• The Microcredential Steps
• Slightly edited versions of the microcredential’s
webpages
• You don’t need to do the activities in the Steps
• Don’t forget the Other Readings and videos
Assessments
Assessment
Task Weight Due Date Description
Quiz 10%
End of Week 2
(Mon 20-Tues 21 Dec) Multiple-choice Questions, Short-answer Questions
Assignment 1:
Coding task I – R 20%
Start of Week 4
(Mon 10 Jan), 11:55 PM Data Analysis with R
Assignments 2 & 4:
Business & Data
Case Study
2% End of Week 2 (Mon 20 Dec) 11:55 PM
Assignment 2: Propose a Data
Science Project
18% End of Week 6 (Mon 31 Jan) 11:55 PM
Assignment 4: Report on the Data
Science Project
Assignment 3:
Coding task II –
Shell
10% End of Week 5 (Fri 21 Jan) 11:55 PM
Data Analysis with Tools and Shell
Scripting
Scheduled Final
Assessment 40% To be announced (after Week 6).
Multiple-choice Questions, Short-
answer Questions, and Longer-
answer Questions
Three key questions for you
1. What is the problem to be solved?
2. What data is necessary to solve the problem?
3. What Data Science techniques can be used to
make use of the data?
Getting Started
• Set up your computer
• Work through the materials in Week 1 to familiarise
yourself with R, R-Studio and R Markdown
• Each week, please
• Watch the lectures, attend the live-streamed workshop
& read background materials between classes
• Prepare for and attend the live tutorials
• Check out any additional activities for the week on
Moodle
• Complete the readings for Week 1
You’re More than a Knowledge Worker
• As a knowledge worker:
you’re applying your knowledge to do
non-routine problem solving.
• But now you also have to be a learning worker:
you’re learning new skills as you go,
continually adapting.
https://en.wikipedia.org/wiki/Knowledge_worker
https://www.forbes.com/sites/jacobmorgan/2016/06/07/say-goodbye-to-knowledge-workers-and-welcome-to-learning-workers
End of
About This Unit
Week 1:
Overview of Data Science
Data Science and
the Data Science Process
Question: Who are the Data Scientists?
Person D
Person A
Person C
Person B
Defining Data Science
What is Data Science?
• “data science is what a data scientist does”
– a circular definition!
• “data science is the technology of handling and
extracting value from data”
– less circular and a bit more useful
• “machine learning on big data”
– useful, but too narrow!
Different definitions
Source Definition
Wikipedia
… is the extraction of knowledge from data, which is a
continuation of the field data mining and predictive
analytics.
Pivotal
The use of statistical and machine learning techniques on big
multi-structured data in a distributed computing environment
to identify correlations and causal relationships, classify and
predict events, identify patterns and anomalies and infer
probabilities, interest and sentiment.
NIST Big
Data
Working
Group
… is the empirical synthesis of actionable knowledge from
raw data through the complete data lifecycle process.
Journal of
Data
Science
… is almost everything that has something to do with data:
collecting, analysing, modelling …. yet the most important
part is its applications — all sorts of applications.
The Rise of Big Data
in Foreign Affairs, by Cukier and Mayer-Schoenberger
Data Science interest is related to the arrival of “Big Data”.
• Data collection has changed:
• lots of data, but more messy
• don’t look for perfect models – settle for finding patterns
• examples: Google’s language translation and flu trends
• Datafication:
• taking all aspects of life and turning them into data
• e.g. NYC using big data to improve public services and lower
costs
• The information society has come of age
• and data brokers have started amassing huge data about
individuals: big data could become Big Brother
https://www.foreignaffairs.com/articles/2013-04-03/rise-big-data
Defining Machine Learning
Unlike Data Science, the definition for Machine Learning
is better understood and more agreed upon:
Machine Learning is concerned with the development of
algorithms and techniques that allow computers to learn.
• concerned with building computational artifacts,
i.e., computer programs that can learn, oftentimes
with computational output
• but the underlying theory is statistics
See A Gentle Guide to Machine Learning
https://monkeylearn.com/blog/gentle-guide-to-machine-learning/
Why use Machine Learning?
Machine learning is useful when:
• Human expertise is not available,
e.g., Martian exploration
• Humans cannot explain their
expertise (as a set of rules), or
their explanation is incomplete
and needs tuning, e.g., speech
recognition
• Many solutions need to be
adapted automatically, e.g., user
personalisation.
image src: theconversation.com, medium.com, blog.prioridata.com
Why use Machine Learning?
image src: lifewire.com, clearlyexplained.com, medium.com
Machine learning is useful when:
• Situation changes over time, e.g.,
junk email
• There are large amounts of data,
e.g., discover astronomical
objects
• Humans are expensive to use for
the work, e.g., handwritten zip
code recognition
Data Science Examples: Data on Bushfires
Data Science Examples: Data on COVID-19
https://covid19.who.int/
https://covid19.who.int/explorer
https://covid19-projections.com/italy
https://dataventures.nz/assets/pdf/covid19-2020-april-20.pdf
Some famous data science projects and investigations:
• Google’s spell checker and translation engine
• Amazon.com’s recommendation engine
• Public health: “saturated fat is not bad for you after all”
• Microsoft’s predictive analytics for traffic
Data Science Examples
https://translate.google.com/
http://www.amazon.com/
http://annals.org/article.aspx?articleid=1846638
http://research.microsoft.com/en-us/projects/clearflow/
From Alexandria e-textbook, Section 1.1:
• watch Cukier’s TED talk on “Big Data”
• watch the CERN video “Big Data” from Tim Smith
• read “What is Data Science?” by Mike Loukides of
O’Reilly
Homework
http://ed.ted.com/lessons/exploration-on-the-big-data-frontier-tim-smith
http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf
Data Science Process
1. Pitching ideas 2. Collecting data 3. Monitoring 4. Integration
5. Interpretation 6. Governance 7. Engineering 8. Wrangling
9. Modelling 10. Visualisation 11. Operationalise
Our Standard Value Chain: Parts of a
Data Science Project
from Doing Data Science by Schutt and O’Neil, 2013 (available digitally through library)
Chapter 1 of the book provides the following visualisation
of the standard value chain for a data science project:
http://shop.oreilly.com/product/0636920028529.do
End of
Data Science and
the Data Science Process
Week 1:
Overview of Data Science
Data Scientists
Interpreting Roles in a Project
Following Jeff Hammerbacher’s UC Berkeley 2012 course
notes, we will interpret these four entities:
• business analyst
• programmer
• enterprise
• web company
https://en.wikipedia.org/wiki/Jeff_Hammerbacher
Interpretations: the Business Analyst
Collection: copy and paste into Excel
Engineering: use Excel to store and retrieve
Wrangling: use Excel functions, VBA
Analysis: charts
Interpretations: the Programmer
Collection: web APIs, scraping, database queries
Engineering: flat files
Wrangling: Python and Perl, etc.
Analysis: Matplotlib in Python, R
Interpretations: the Enterprise
Collection: application databases, intranet files, server logs
Engineering: Teradata, Oracle, MS SQL Server
Wrangling: Talend, Informatica
Analysis: Cognos, Business Objects, SAS, SPSS
Interpretations: the Web Company
Collection: application databases, server logs, crawl data
Engineering: Hadoop/Hive, Flume, HBase
Wrangling: Pig, Oozie
Analysis: dashboards, R
A quote from Jason Widjaja in Quora:
• Data analysts are primarily people who develop
insights with data ….
• Data scientists are primarily people who develop
data models and products, that in turn produce
insights …
• Data engineers are primarily people who manage
data infrastructure, automate data processing
and deploy models at scale …
See also
Job Comparison – Data Scientist vs Data Engineer vs Statistician
(https://www.analyticsvidhya.com/blog/2015/10/job-comparison-data-scientist-data-engineer-statistician/)
What is the Difference Between …
https://www.quora.com/Whats-the-difference-between-a-data-scientist-a-data-analyst-and-a-data-engineer
Data scientist: addresses the data science process to extract
meaning/value from data
Data scientist
From Doing Data Science
From Doing Data Science by Schutt and O’Neil, 2013
http://shop.oreilly.com/product/0636920028529.do
• Chief data scientist: a form of chief scientist who addresses
data management, data engineering and data science goals.
• Chief scientist: corporate position, responsible for science
related aspects of a company/organisation
Chief data scientist
From Doing Data Science
From Doing Data Science by Schutt and O’Neil, 2013
http://shop.oreilly.com/product/0636920028529.do
1. Communication skills are underrated.
2. The biggest challenge for a data analyst is the Collection
and Wrangling steps.
3. A data scientist is better at statistics than a software
engineer and better at software engineering than a
statistician.
4. The data industry is still nascent and the roles less well
defined so you get to interact with many parts of the
company from engineering to business intelligence to
product managers.
5. Keep a curiosity about working with data, a quality as
important as your technical abilities.
Lessons from the DA Handbook
See Udacity on data careers
Data Scientists vs. Data Engineers
https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html
Steinwig-Woods, J. (2018, May 14). What Skills Does a Data Scientist Actually Need? A Guide to the Most
Popular Data Jobs. https://www.datascience.com/blog/guide-to-popular-data-science-jobs
To become a specialist you need:
• solid machine learning and statistics
• related mathematics (1st+2nd year in many degrees)
• solid prototyping (R, Python, Java)
• perhaps Unix experience (Linux, Mac OSX)
See also:
• The infamous Metromap: Becoming a data scientist
• And Modern Data Scientist (previous slide)
This unit provides an introduction and background only.
Career as Data Scientist
http://nirvacana.com/thoughts/becoming-a-data-scientist/
http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined
End of
Data Scientists
Week 1:
Overview of Data Science
History and Impact
of Data Science
Data science is about
• technology for working with data
• processes for working with data
• getting value from data
in a way that is effective and consistent.
What is data science? (Revisiting)
So why is it regarded as something “new”?
Source: https://www.slideshare.net/slideshow/embed_code/36866068
Timeline of Data Science
Data Science emerges around 2000
• data analysis came of age 1990’s
• William Cleveland published in 2001 “Data Science: An
Action Plan for … the field of Statistics”
• data engineering came of age 2000’s (Dot.Com boom)
• (digital) data management came of age 2000’s (Dot.Com
boom)
• the data/information society
• business pressure on decision making
• “data” as a valuable asset
• Dot.Com companies show the way
See also David Donoho’s “50 years of Data Science” (PDF paper)
Evolution of Data Science …
http://onlinelibrary.wiley.com/doi/10.1002/sam.11239/abstract
http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
Hype Cycle in 2014
Can you
spot Data
Science?
Hype Cycle for Analytics and Business Intelligence in 2019
Relationship of Data Science to Other Disciplines
See Battle of the Data Science Venn
Diagrams for more.
http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html
Related: Data Analysis
Performing analysis and understanding results
• e.g. R, Tableau, Weka, Microsoft Azure Machine
Learning, …
• machine learning, computational statistics,
visualisation, …
Related: Data Engineering
Building scalable systems for storage, processing data
• e.g. Amazon Web Services, Teradata, Hadoop, …
• databases, distributed processing, datalakes, cloud
computing, GPUs, wrangling, …
Related: Data Management
Managing data through its lifecycle
• e.g. ANDS, Talend, Master Data Management, …
• ethics, privacy, provenance, curation, backup,
governance, …
Our personal information is increasingly stored in the cloud:
• social life (Facebook),
• career (LinkedIn),
• search history (Google, etc.),
• health and medical (Fitbit, TBD),
• music (Apple), …
This provides many, many, many advantages:
• e.g. personal agents, computerised support for health, but also
some disadvantages:
• e.g. security and privacy breaches
Your Life on the Cloud
But also some disadvantages:
• corporate leakage to government (security, tax, etc.)
• what if you don’t have rights to access/delete data?
• the department of pre-crime (e.g., predicting recidivism)
• corporate mergers
• “the science is settled” and government mandates
Your Life on the Cloud (cont.)
The Scientific Method
The Scientific Method
from Wikipedia Scientific method
The End of Theory
Chris Anderson’s blog in Wired 23/05/2008
Science is largely driven by laborious studies to find complex causal
models, sometimes using reductionism. The intent is to find an
explanation that can be used for future prediction.
Chris Anderson (Editor-in-chief of Wired magazine) says:
Google’s founding philosophy is that we don’t know why this page is better
than that one: If the statistics of incoming links say it is, that’s good enough. No
semantic or causal analysis is required.
…
Petabytes allow us to say: “Correlation is enough.” We can stop looking for
models. We can analyze the data without hypotheses about what it might show. …
…
The new availability of huge amounts of data […] offers a whole new way of
understanding the world. Correlation supersedes causation, …
NB. When Google is delivering an advert, it doesn’t need to be right, it
just needs a good guess, so causality, models, etc., are not important.
The End of Theory (cont.)
https://en.wikipedia.org/wiki/Reductionism
What is a model?
• A simple model of
population growth:
Logistic growth curve
From Integrating Urban Growth
Models,Pearlstine, Mazzotti, Pearlstine and Mann,
2004
• A complex model of
obesity:
Obesity Systems Map
To Understand the Issues …
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/296290/obesity-map-full-hi-res.pdf
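The simple population model mentioned above, the logistic growth curve, has a closed form that can be computed directly; a minimal sketch (in Python, for illustration), where the parameter values are illustrative assumptions only:

```python
# Logistic growth curve: slow start, fastest growth in the middle,
# then flattening out as the population approaches the carrying capacity.
import math

def logistic(t, carrying_capacity, p0, rate):
    """Population at time t, starting at p0, growing at the given
    rate towards the carrying capacity."""
    a = (carrying_capacity - p0) / p0
    return carrying_capacity / (1 + a * math.exp(-rate * t))

# Illustrative parameters: capacity 1000, initial population 10, rate 0.5
pops = [logistic(t, 1000, 10, 0.5) for t in range(0, 30, 5)]
print([round(p) for p in pops])
```

Even this one-line formula is a model in the sense above: a deliberate simplification that is "wrong" in detail but useful for understanding growth.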
Philosopher Massimo Pigliucci says:
But, if we stop looking for models and hypotheses, are we
still really doing science? Science, unlike advertizing, is not about
finding patterns … it is about finding explanations for those
patterns.
…
science advances only if it can provide explanations.
Data scientist Drew Conway says in some areas the data doesn’t
exist.
Statistician Andrew Gelman says:
… you’ll still have to worry about … all the … reasons why
people say things like, “correlation is not causation” and “the
future is different from the past.”
Not The End of Theory
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2711825/
https://en.wikipedia.org/wiki/Drew_Conway
http://andrewgelman.com/2008/06/26/the_end_of_theo/
• Your stomach can be instrumented to assess contents, nutrients, etc.
• Your bloodstream can be instrumented to assess insulin levels, etc.
• Your “health” dashboard can be online and shared by your GP
• Health management organisations (HMO) tying funding levels to
patient care performance
• GP/HMO will know about your ice cream/beer binge last night and
that you missed your morning run
Health Care Futurology
See “Big data – 2020 vision” talk by SAP manager John
Schitka
Car Industry Evolution,
1760s – Today = Driven by Innovation + Globalization
(KPCB Internet Trends 2016, p. 143)
Source: KPCB Green Investing Team, Reilly Brennan (Stanford), Piero Scaruffi, Inventors.About.com, International Energy Agency, Joe DeSousa, Popular Science, Franz Haag, Harry Shipler / Utah State Historical Society, National Archives,
texasescapes.com, Federal Highway Administration, Matthew Brown, Forbes, Grossman Publishers, NY Times, Energy Transition, UVA Miller Center for Public Affairs, The Detroit Bureau, SAIC Motor Corporation, Hyundai Motor Company, Kia Motors,
Toyota Motor Corporation, DARPA, Chris Urmson / Carnegie Mellon,
Early Innovation (1760s-1900s) = European Inventions
• 1768 = First self-propelled road vehicle (Cugnot, France)
• 1876 = First 4-stroke-cycle engine (Otto, Germany)
• 1886 = First gas-powered ‘production’ vehicle (Benz, Germany)
• 1888 = First four-wheeled electric car (Flocken, Germany)
Streamlining (1910s-1970s) = American Leadership
• 1910s = Model T / assembly line (Ford)
• 1920s-1930s = Car as status symbol… Roaring ‘20s / first motels
• 1950s = Golden Age… Interstate Highway Act (1956)… 8 of top 10 in Fortune 500 in cars or oil (1960)
Modernization (1970s-2010s) = Going Global / Mass Market
• 1960s = Ralph Nader / auto safety
• 1970s = Oil crisis / emissions focus
• 1980s = Japanese auto takeover begins
• 1990s-2000s = Industry consolidation; Asia rising; USA hybrid fail (Prius rise)
• Late 2000s = Recession / bankruptcies / auto bailouts
Re-Imagining Cars (Today) = USA Rising Again?
• DARPA Challenge (2004, 2005, 2007, 2012, 2013) = Autonomy inflection point?
Source: KPCB Green Investing Team; Darren Liccardo (DJI); Reilly Brennan (Stanford); Tom Denton, “Automobile Electrical and Electronics Systems, 3rd Edition,” Oxford, UK: Tom Denton, 2004; Samuel DaCosta; Popular Mechanics; Techmor; US EPA; Elec-Intro.com; Autoweb; General Motors; Garmin; Evaluation Engineering; Digi-Key Electronics; Renesas; Jason Aldag and Jhaan Elker / Washington Post; James Brooks / Richard Bone; Shareable (KPCB Internet Trends 2016, p. 138)
Car Computing Evolution
Since Pre-1980s = Mechanical / Electrical -> Simple Processors -> Computers
• Pre-1980s = Analog / Mechanical: used switches / wiring to route
feature controls to the driver
• 1980s (to present) = CAN bus (integrated network): new regulatory
standards drove the need to monitor emissions in real time, hence a
central computer
• 1990s (to present) = OBD-II (On-Board Diagnostics): monitor / report
engine performance; required in all USA cars post-1996
• 1990s-2010s = Feature-built computing + early connectivity: automatic
cruise control… infotainment… telematics… GPS / mapping…
• Today = Smart / connected cars: embedded / tethered connectivity…
Big Tech = new Tier 1 auto supplier (CarPlay / Android Auto)…
• Today = Complex computing: up to 100 Electronic Control Units per car…
multiple bus networks per car (CAN / LIN / FlexRay / MOST)…
drive by wire… “The Box” (Brooks & Bone)
• Tomorrow = Computers go mobile?… central hub / decentralised systems?…
LIDAR… Vehicle-to-Vehicle (V2V) / Vehicle-to-Infrastructure (V2I) / 5G…
security software…
End of
History and Impact
of Data Science
Week 1:
Overview of Data Science
Introduction to R
R is a programming language.
• Reproducible
• Adaptable
R was originally created for statisticians.
• Functional
• Specialised
R is great in terms of …
• Large community
• Commonly used for business intelligence
R is powerful in drawing graphs.
• Practical
• Communicative
R: A Powerful Data Science Tool
RStudio is a programming environment for R.
• Helps manage the workflow
• Projects – a filing cabinet for your work!
• Libraries of packages
• Works with R scripts from files and the
command-line
RStudio, a tool for R
To maintain a reproducible workflow, you need to record
what steps you take in a process.
• This is vital when dealing with data
R Markdown is an authoring format (.Rmd files) that enables
us to combine embedded R code with formatted text, so we
can:
• Explain our thoughts and process
• Discuss the coding required
• Present the output of the processing
• Interpret the output
• Allow others to reproduce it all!
R Markdown
R Markdown format
See the activity in Week 1
• Introduction to R Markdown
# Top Heading
## Sub-heading
– List item 1
– List item 2
[Link to Monash](https://my.monash.edu)
```{r}
library(tidyverse)
smaller <- filter(diamonds, carat <= 2.5)
smaller
```
https://my.monash.edu/
Once you finish writing the content, you can knit the R
Markdown and create the output file.
Using RStudio to knit … this
Knitting R Markdown
Visualisation with R
R uses the grammar of graphics to define how to map
variables in data with plots in a visualisation
• ggplot2: the main package used
• An aesthetic mapping (or variable mapping) tells
ggplot() which variable in your data corresponds to a
particular element to be drawn, e.g., if tb_data is data
about cases of tuberculosis,
p <- ggplot(tb_data, aes(x=year, y=count, fill=sex))
Then aes tells ggplot to map
• the year to the x-axis
• the number of cases to the y-axis
• the sex will set the colour for a fill element
• But R also needs to know how to plot the data. You need
to tell it what sort of visualisation you want.
• For instance,
p <- p + geom_bar(stat="identity", position="fill")
will tell it to create a geom, a
geometrical shape.
The geom_bar specifically tells
it to make a bar chart with 100%
fill, for which the values (identity)
have already been calculated.
Plotting with R
Facets in R
• Sometimes you then want to divide the data further,
mapping multiple visualisations. For instance, what if you
wanted to analyse the TB data separately for different
age groups.
• The facet creates the subplots for each category.
p <- p + facet_grid(~ age_group)
tells R how to present the multiple visualisation
plots according to the age_group in the data.
Once you combine all levels of the instructions
p <- ggplot(tb_data, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(~ age_group)
Visualising it all
• Not all data can be used straight away
• Not all data is clean and tidy
• We need to wrangle the data into shape!
Data Wrangling
is the process of transforming “raw” …
Week 2:
Dimensions of Data Science projects
and Big Data
Growth Laws
Explanations about change in IT and society
• Moore’s Law
• Koomey’s Law
• Bell’s Law
• Zimmerman’s Law
Growth laws
Moore’s Law
• Number of transistors per chip doubles every 2 years
(starting from 1975)
• Transistor count translates to:
• more memory
• bigger CPUs
• faster memory, CPUs (smaller==faster)
• Pace currently slowing
Moore’s Law
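Taken literally, doubling every two years compounds very quickly; a back-of-the-envelope sketch (in Python, for illustration; the baseline transistor count is a made-up number):

```python
# Transistor count under an idealised Moore's Law: doubling every 2 years
# from a 1975 baseline. The starting count is hypothetical, chosen only
# to show the compounding.
def transistors(year, base_year=1975, base_count=10_000):
    return base_count * 2 ** ((year - base_year) / 2)

# Over 40 years there are 20 doublings: a factor of about a million
growth_factor = transistors(2015) / transistors(1975)
print(int(growth_factor))  # 2**20 = 1048576
```

The same exponential form is why even a modest slowdown in the doubling period matters so much.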
By Dr Jon Koomey CC BY-SA 3.0,
via Wikimedia Commons
Koomey’s Law
• Corollary of Moore’s Law
• The amount of battery needed for a given amount of
computing will fall by a factor of 100 every decade
• Leads to ubiquitous computing
Koomey’s Law
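A similar back-of-the-envelope sketch (in Python, for illustration), assuming the factor-of-100-per-decade figure above:

```python
# Koomey's Law, idealised: the energy needed for a fixed amount of
# computing falls by a factor of 100 per decade. The absolute energy
# unit is left abstract; only the relative change matters here.
def relative_energy(years_elapsed):
    return 100 ** (-years_elapsed / 10)

# After two decades, the same computation needs 10,000x less energy
print(round(1 / relative_energy(20)))
```

This compounding in efficiency, rather than raw speed, is what makes battery-powered ubiquitous computing feasible.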
• Corollary of Moore’s Law and Koomey’s Law
• “Roughly every decade a new, lower priced computer
class forms based on a new programming platform,
network, and interface resulting in new usage and the
establishment of a new industry.”
e.g., PCs -> mobile computing -> cloud -> internet-of-things
Bell’s Law
Gordon Bell, Digital Equipment Corporation (DEC), 1972
• Zimmerman is the creator of Pretty Good Privacy (PGP), an
early encryption system
• “Surveillance is constantly increasing”
• Privacy constantly decreasing
Zimmerman’s Law
Explanations about change in IT and society
• Moore’s Law – capability and size of IT
• Koomey’s Law – capability and size of IT
• Bell’s Law – purpose of IT
• Zimmerman’s Law – relationship between privacy
and IT
Growth laws
End of
Growth Laws
Week 2:
Dimensions of Data Science projects
and Big Data
Big Data and the Vs
From Big data on Wikipedia:
Big data usually includes data sets with sizes beyond
the ability of commonly used software tools to capture,
curate, manage, and process data within a tolerable elapsed
time. Big data “size” is a constantly moving target, …
Big Data
https://en.wikipedia.org/wiki/Big_data
from GO-Gulf in 2017
Things that happen in 60 secs
http://www.go-gulf.com/blog/60-seconds/
Four Vs of Big Data
The Four V’s of Big Data
“The Four V’s of Big Data” by IBM (infographic)
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data and “V”s
• In 2001, Doug Laney produced a report describing 3 V’s:
“3-D Data Management: Controlling Data Volume,
Velocity and Variety”
‣ These adequately characterise “bigness”
• Other V’s characterise problems with analysis and
understanding
‣ Veracity: correctness, truth, i.e., lack of …
‣ Variability: change in meaning over time, e.g., natural
language
• Other V’s characterise aspirations
‣ Visualisation: one method for analysis
‣ Value: what we want to get out of the data
• What else?
• “Data Science Matters” from the [email protected] Blog
• “Intelligence by Variety – Where to Find and Access Big
Data” from Kapow Software
Infographics on Data
http://datascience101.wordpress.com/2013/11/15/data-size-matters-infographic/
http://staging.kapowsoftware.com/resources/infographics/intelligence-by-variety-where-to-find-and-access-big-data.php
Growth laws and Big Data
Each of the growth laws relates to the characteristics of
Big Data.
For instance (but not limited to!)
• Moore’s Law: Velocity, Volume
• Koomey’s Law: Variety
• Bell’s Law: Variety, Veracity
• Zimmerman’s Law: all of them
Summary
BIG DATA is ANY attribute that challenges CONSTRAINTS of a
system’s CAPABILITY or BUSINESS NEED.
End of
Big Data and the Vs
Week 2:
Dimensions of Data Science projects
and Big Data
Business Models
As information technology develops and with more data
collected,
• Businesses utilise it
• Businesses change their attitudes towards it
• Businesses incorporate it in their business models
Innovation!
Growth and business
From Wikipedia:
A business model describes the rationale of how an
organization creates, delivers, and captures value, in
economic, social, cultural or other contexts.
Examples of general classes:
• Retailer versus wholesaler
• Luxury consumer products
• Software vendor
• Service provider
What kinds of businesses do we have operating in the
Data Science world?
Business Models
https://en.wikipedia.org/wiki/Business_model
by Jm3 CC BY-SA 3.0, via Wikimedia
Commons
Bloomberg
terminal
http://creativecommons.org/licenses/by-sa/3.0
The Bloomberg Terminal:
• a computer system provided by Bloomberg L.P.
• enables professionals to monitor and analyse real-time
financial market data
• also place trades on the electronic trading platform
• is a proprietary secure network
Questions:
• Where does the data originally come from?
• Why don’t users of the terminals get their data from
the original source?
• Why wouldn’t people who sell the data to Bloomberg
set up a similar service themselves?
Bloomberg Terminal (cont.)
https://en.wikipedia.org/wiki/Bloomberg_Terminal
• Bloomberg provides an information brokering service.
• Broker: a person who buys and sells goods or assets
for others
Bloomberg Terminal (cont.)
Amazon.com
• An assembly line for the retail industry, with support for
embedded online retailers.
• Huge stock of books, DVDs, CDs, etc. easily searchable.
• Extensive customer reviews.
Amazon.com
• Information-based differentiation: satisfies customers by
providing a differentiated service:
• superior information including reviews about products
• superior range
• Information-based delivery network: they deliver
information for others; retailers in the Amazon
marketplace get:
• customers directed to them
• other retailers’ support
Amazon.com
• See LexisNexis, which provides the world’s largest electronic database
for legal and public-records related information.
• Information provider: a business selling the data it collects
• like a traditional business model, selling data, not widgets
• fastest growing segment of the IT industry post 2000 (cited by
Evan Quinn’s blog post on Infochimps.com, April 2013, “Is Big
Data the Tail Wagging the Data Economy Dog?”, now offline)
• some call this the data economy
e.g., data brokers sell consumer data to major retailers or internet
companies
LexisNexis
https://en.wikipedia.org/wiki/LexisNexis
https://www.consumer.ftc.gov/blog/2014/05/ftc-report-examines-data-brokers
• Information brokering service: buys and sells
data/information for others.
• Information-based differentiation: satisfies
customers by providing a differentiated service built
on the data/information.
• Information-based delivery network: deliver
data/information for others.
• Information provider: business selling the
data/information it collects.
“What a Big-Data Business Model Looks Like” by Ray Wang in the
Harvard Business Review claims these are unique in the data world.
Data Business Models
http://hbr.org/2012/12/what-a-big-data-business-model
End of
Business Models
Week 2:
Dimensions of Data Science projects
and Big Data
Case Studies
• Managing data is hard
– Having the required data and technologies
• Managing Big Data is harder
– Determining/creating the required data and
technologies
• Managing data quality is hardest
– Making sure the data (and technologies) meet the
needs
Quality and complexity
• Developed by the National Institute of Standards and
Technology (NIST), this framework has been widely used in
characterising data science projects.
• Certain parts of the framework will be further discussed in
later weeks.
• This is the kind of analysis you need in Assignment 2 & 4.
NIST Case Studies
• Data sources: where does the data come from?
• Data volume: how much data is there?
• Data velocity: how does the data change over time?
• Data variety: what different kinds of data are there?
• Data veracity: is the data correct? what problems might it have?
• Software: what software is needed to do the work?
• Analytics: what statistical analysis & visualisation is needed?
• Processing: what are the computational requirements?
• Capabilities: what are key requirements of the operational
system?
• Security/privacy: what security/privacy requirements are there?
• Lifecycle: what ongoing requirements are there?
• Other: are there other notable factors?
NIST Analysis
Case Study: Netflix Movies
On-demand internet streaming,
and flat-rate DVD rental:
• Over 203 million subscribers
by Jan 2021
• International market
• Video recommendation!
• Established the Netflix Prize
in 2006-2009 as a
crowdsourced way of testing
out algorithms
By Ivongala (Own work) [Public
domain], via Wikimedia Commons
Netflix
https://en.wikipedia.org/wiki/Netflix_Prize
• Pareto principle, or 80/20
rule:
• Top 20% of films
watched 80% of time
• Standard video store stocked less than 20% of
available titles in order to make the most money
“The real meaning of 80/20”
• By adopting an Amazon style business model, Netflix could afford
to rent the remaining 80%, the so-called long tail
Netflix: Background
Analysis follows the NIST Big Data WG Netflix analysis in Volume 3, Use Cases
and General Requirements, case 7 on page 8, A-24 and elsewhere
http://longtail.typepad.com/the_long_tail/2005/03/the_real_meanin.html
https://en.wikipedia.org/wiki/Long_tail
http://dx.doi.org/10.6028/NIST.SP.1500-3
• Data sources: user movie ratings, user clicks, user profiles
• Data volume: in 2012: 25 million users, 4 million ratings/day, 3
million searches/day, video cloud storage of 2 petabytes
• Data velocity: video titles change daily, rankings/ratings updated
• Data variety: user rankings, user profiles, media properties
• Software: Hadoop, Pig, Cassandra, Teradata
• Analytics: personalised recommender system
• Processing: analytic processing, streaming video
• Capabilities: ratings and search per day, content delivery
• Security/privacy: protect user data; digital rights
• Lifecycle: continued ranking and updating
• Other: mobile interface
Netflix: Analysis
http://hadoop.apache.org/
https://pig.apache.org/
http://cassandra.apache.org/
https://en.wikipedia.org/wiki/Teradata
Case Study: Electronic Medical
Records (EMR)
EMR: Clinical Data
EMR: Claims and Cost Data
• Clinical data and claims/cost data are available per patient,
per hospital
• large variety of sources of data
• systematic errors and difference in standards across
institution
• Task: segment patients into different types (“phenotypes”) to
use in subsequent cohort studies
• case study is for Indiana Network for Patient Care
Electronic Medical Records
follows NIST Big Data WG Electronic Medical Records analysis in Volume 3, Use
Cases and General Requirements, case 16 on page 14, A-45 and elsewhere
http://dx.doi.org/10.6028/NIST.SP.1500-3
• Data sources: clinical and claims data
• Data volume: 1000 centres, 12 million patients, 4 billion clinical
events
• Data velocity: approx. 1 million clinical events/day
• Data variety: free text, lab results, pathology, outpatient, etc.
• Data veracity: different standards in different places
• Software: Hadoop, Hive, Teradata, PostgreSQL, MongoDB
• Analytics: visualisation for data checking; standardisation of incoming
data; general data analysis
• Processing: analytic processing, handling the volume
• Capabilities: models to support subsequent cohort studies
• Security/privacy: privacy and confidentiality required
• Lifecycle: full data management required
EMR: Analysis
https://hive.apache.org/
Case Study: Medical Imaging (MI)
MI Task: Produce Analysis
Biomedical data for imaging is high resolution and some
is 3D:
• interpretation of images done by trained experts
• requires significant training in interpretation
• many different kinds of instruments each requiring
different interpretations
• millions produced daily in the USA
Medical Imaging
follows the NIST Big Data WG Pathology Imaging analysis in Volume 3, Use Cases
and General Requirements, case 17 on page 14, A-48 and elsewhere
http://dx.doi.org/10.6028/NIST.SP.1500-3
• Data sources: biomedical image data
• Data volume: approx. 1 million events/day nationally
• Data variety: X-rays, CT scans, microscopes, …
• Data veracity: current interpretation is often text based, so prone to
text errors
• Software: advanced image processing and machine learning systems
• Analytics: computational image processing, supervised learning from
images
• Processing: handling the large volume, distributed and high throughput
• Capabilities: produce initial analysis for experts
• Security/privacy: privacy and confidentiality required
• Lifecycle: full data management required
Medical Imaging: Analysis
Case Study: Electricity Demand
Forecasting (EDF)
from NIST Big Data WG Electricity Demand Forecasting in Volume 3, Use
Cases and General Requirements, case 51 on page 43 and A-134
Near realtime usage available thanks to smart meters
• with solar cells, consumers generate energy too, but
unpredictably
• main electricity generation must be planned
• brownouts and blackouts need to be prevented
• see Australian Energy
Market Operator (AEMO)
and their electricity site
Electricity Demand Forecasting
https://www.aemo.com.au/
https://www.aemo.com.au/
https://www.aemo.com.au/Energy-systems/Electricity/National-Electricity-Market-NEM/Data-NEM/Data-Dashboard-NEM
• Data sources: utilities, smart meters, weather data, grids
• Data volume: city scale: 10GB/day
• Data velocity: updates every 15 minutes
• Data variety: time series, networks, spatial data
• Data veracity: occasional dropouts
• Software: advanced timeseries processing, spatial analysis
• Analytics: forecasting models
• Processing: handling the forecasting volume
• Capabilities: produce forecasts at different scales (hourly, daily)
• Security/privacy: privacy and confidentiality required
• Lifecycle: full data management required
Electricity Demand Forecasting: Analysis
Big Data is complex.
Complex problems have complex solutions.
There is no one solution for all
… but there is a lot of opportunity for growth
for Data Science!
for Data Scientists!
Data complexity
End of
Case Studies
Week 2:
Dimensions of Data Science projects
and Big Data
Modelling Influences
from the BackReaction blog by Sabine Hossenfelder
Modelling
http://backreaction.blogspot.com.au/2008/04/emergence-and-reductionism.html
What is a model?
• A simple model of
population growth:
Logistic growth curve
From Integrating Urban Growth Models,
Pearlstine, Mazzotti, Pearlstine and Mann, 2004
• A complex model of
obesity:
Obesity Systems Map
A Slide about Model from Week 1
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/296290/obesity-map-full-hi-res.pdf
A Slide about Model from Week 1
• “All models are wrong, but some are useful”…
George Box
• “The approximate nature of the model must
always be borne in mind”… George Box
• “The purpose of models is not to fit the data but to
sharpen the questions”… Samuel Karlin
Viewpoints about Modelling
Influence Diagrams
Influence Diagrams (a.k.a Decision Graphs) are:
• directed graphical model with 4 types of nodes:
‣ chance nodes, known variable nodes, action/decision
nodes and objective/utility nodes
• model the “influences”, “causes”, random (“chance”)
outcomes, “actions”, “goals” involved in a decision
problem
• provide a coarse abstraction, a conceptual model
Motivating Influence Diagrams
A conceptualisation aid to get you thinking about actions,
values, and unknowns.
Node types: Chance variable, Known variable, Decision or Action, Objective
When do we connect an arc to a node?
• Chance variable: connect node A to chance node B if changes to the
value of A can “cause” changes in B;
• Known variable: same as chance node
• Decision: connect node A to decision node B, if variable A is used
when making decision B;
• Objective: connect node A to objective node B if variable A is used
when evaluating the value of the objective (e.g. quality or cost)
Node Types
Example: Last Minute Vacation
Example: Last Minute Vacation (cont.)
Bad Arcs for Last Minute Vacation
1. Weather cannot cause its forecast!
2. The forecast cannot cause the weather!
3. Your decision to go on vacation follows in time after
you have obtained forecast.
4. The success (failure) of the vacation follows in time
after your decision.
Example: Internet Advertising
Heart Disease
End of
Modelling Influences
Week 2:
Dimensions of Data Science projects
and Big Data
Visualising Statistics
• Descriptive Analytics: gain insight from historical data
• plot sales results by region and product category
• correlate with advertising revenue per region
• Predictive analytics: make prediction using statistical and
machine learning techniques
• predict next quarter’s sales results using economic projections
and advertising targets
• Prescriptive analytics: recommend decisions using
optimisation, simulation, etc.
• recommend which regions to advertise in given a fixed budget
Primarily a descriptive classification for general discussions.
Analytic Levels
Analytic Levels
There can be other classification schemes.
Check “Eight Levels of Analytics” by SAS
https://www.datasciencecentral.com/profiles/blogs/eight-levels-of-analytics-for-competitive-advantage
“The practice or science of collecting and analysing
numerical data in large quantities, especially for the
purpose of inferring proportions in a whole from those
in a representative sample”.
Two main statistical analytical methods:
• descriptive statistics – explaining data
• inferential statistics – finding regularities in irregular
data
What is statistics?
https://en.wikipedia.org/wiki/Descriptive_statistics
https://en.wikipedia.org/wiki/Statistical_inference
Categorical, qualitative
• Groups or categories
• Nominal – no natural ordering
• Ordinal – ordered
Quantitative
• Numerical
• Discrete – specific values, like counts
• Continuous – like temporal data
– Temporal: time and dates
– Space: locations
Different variable types
Data can be counted in various ways, including …
• How much data/how many records
• How large is the data
• How many unique values are there
• How many instances of each value
• How many instances of a group (bucket) of values
The values don’t have to be numerical.
Counting data
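These ways of counting can be sketched in Python (the unit uses R, but the idea is the same); the `records` list here is made up for illustration:

```python
from collections import Counter

# Hypothetical records: note the values are not numerical.
records = ["red", "blue", "red", "green", "red", "blue"]

n_records = len(records)          # how much data / how many records
unique_values = set(records)      # how many unique values
counts = Counter(records)         # how many instances of each value

print(n_records, len(unique_values), counts["red"])  # 6 3 3
```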
If there is a range of values, we can also evaluate what
is the most likely value.
Mode: which value is most common, e.g.,
Data: 1, 2, 2, 3, 3, 4, 4, 4, 5 Mode = 4
The data doesn’t have to be numerical.
Median: what is the value in the middle of the data
Data: 1, 2, 2, 3, 3, 4, 4, 4, 5 Median = 3
The data must be ordered & numerical.
Mode and Median
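A quick sketch in Python (the unit uses R, but the calculations are identical), reproducing the slide’s numbers:

```python
import statistics

data = [1, 2, 2, 3, 3, 4, 4, 4, 5]

mode = statistics.mode(data)      # most common value -> 4
median = statistics.median(data)  # middle of the ordered data -> 3

print(mode, median)  # 4 3
```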
Mean: the average value.
Data: 1, 2, 2, 3, 3, 4, 4, 4, 5 Mean = 3.111
The data must be numerical.
The mean value of x is sometimes written as x̄.
Mode, mean and median help us describe what we expect
the data to be, but not how much the data differs.
Mean
Other measurements describe how much the numerical
values vary.
• Variance is the average of the squared differences between the
values and the mean.
• Standard deviation is the square root of the variance.
Data: 2, 4, 4, 4, 5, 5, 7, 9 Mean = 5
σ² = (9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8 = 4
σ = √4 = 2
Deviation and variance
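The variance and standard deviation example can be reproduced in Python (the unit uses R; this is just the arithmetic from the slide):

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)  # 40 / 8 = 5
# Population variance: the mean of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5     # standard deviation = sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```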
Plotting counts – bar charts
Plotting counts – histograms
Scatter plot Line graph
Plotting points
The mean or median is often used on a visualisation to
show a degree of what is “normal”.
• Not necessarily a benchmark
• Not all plots can easily incorporate a mean or
median, e.g., pie charts!
• Can be used to help visualise the variance in the
data
Adding statistics
• Not all data is ideal for analysis
• Outliers are values outside of the expected parameters
for the data
– Errors
– Exceptional circumstances
– Chance
• Outliers need to be identified and decided on before the
analysis is completed
– They will influence the calculation of the mean
– So wrangle them!
Outliers
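How an outlier drags the mean (while barely moving the median) can be seen in a small Python sketch; the readings are made up for illustration:

```python
import statistics

clean = [10, 11, 9, 10, 12, 10, 11]   # hypothetical sensor readings
with_outlier = clean + [100]          # one erroneous value slips in

print(statistics.mean(clean))           # about 10.43
print(statistics.mean(with_outlier))    # 21.625 -- dragged up by one value
print(statistics.median(clean))         # 10
print(statistics.median(with_outlier))  # 10.5 -- barely moves
```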
• Combine quartiles, median and outliers
• Quartile
• Divide the data into quarters based on the variable
• The upper, median and lower quartiles are the value at
the quartile boundaries, i.e., 25% of the data is less than
or equal to the lower quartile
• Interquartile range (IQR): The difference between the
lower and upper quartiles
Boxplots
• Outliers
– Below Q1 – 1.5 IQR
– Above Q3 + 1.5 IQR
• Whiskers
– Used on a boxplot to show how
much of the data is outside of the
IQR but not an outlier
– Box-and-whiskers plot
• R does a lot of these calculations
for you!
Boxplots
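The quartile, IQR and outlier-fence calculations behind a boxplot can be sketched in Python (R’s boxplot() does all of this for you; the data here is made up):

```python
import statistics

data = [1, 3, 4, 5, 5, 6, 7, 8, 30]   # hypothetical; 30 looks suspicious

# The three quartile cut points; method="inclusive" treats the data
# as the whole population rather than a sample.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # interquartile range
lower_fence = q1 - 1.5 * iqr         # below Q1 - 1.5 IQR -> outlier
upper_fence = q3 + 1.5 * iqr         # above Q3 + 1.5 IQR -> outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)   # 4.0 5.0 7.0 3.0
print(outliers)          # [30]
```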
• Motivation: TED talk by Hans Rosling
• Allow us to focus on the relationship between multiple
variables over time.
Temporal data – Motion Charts
• Advantages:
• Time dimension allows deeper insights & observing trends
• Good for exploratory work
• Motion allows identification of items out of the common “rhythm”
• “Appeal to the brain at a more instinctual intuitive level”
• Disadvantages:
• Not suited for static media
• Display can be overwhelming, and controls are complex
• Not suited for representing all types of data,
e.g. other graphics might be suitable for business data
• “Data scientists who branch into visualisation must be aware
of the limitations of uses”
Motion Charts – pros & cons
https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html
What visualisation should I use?
https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html
• https://www.reddit.com/r/dataisbeautiful/ A mixture of
good or bad visualisations of data
• https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html
• https://365datascience.com/chart-types-and-how-to-select-the-right-one/
• FIT5147: Data exploration and visualisation!
What visualisation should I use?
https://www.reddit.com/r/dataisbeautiful/
https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html
https://365datascience.com/chart-types-and-how-to-select-the-right-one/
End of
Visualising Statistics
Week 2:
Dimensions of Data Science projects
and Big Data
Missing Data
• You want the data you are using to be of sufficient quality
for your purpose
– Accuracy
– Completeness
– Consistency
– Integrity
– Reasonability
– Timeliness
– Uniqueness/deduplication
– Validity
Data Management Association (DAMA)
• Much of this is a data management issue
– Leave this for a future week
• Some of this is the fault of the data itself
– This is what we will focus on this week
Data quality
https://www.naa.gov.au/information-management/building-interoperability/interoperability-development-phases/data-governance-and-management/data-quality
Data needs to be cleaned, so it can be (re)used.
Sometimes the quality of data is questionable
• Volume: with a lot of data, irregularities creep in
• Velocity: data can be out-of-date very quickly
• Variety: data can be in different formats and types
that don’t work well together
• Veracity: the accuracy or consistency of data from
different sources or sets or circumstances
Wrangling big data
Big data can also be incomplete
• Sensors fail
• Data collection procedures fail (staff sick, social or
legal issues)
• Data sharing fails or is temperamental
• Data isn’t significant enough (too small a sample)
Consequently, the data may not have suitable values for all
variables in all parts of the data.
Holes in the data
• Learning algorithms need the values.
• Not all statistical computing and graphics software ignores
missing values
– Bias results
– Incorrect calculations
– ggplot ignores them, but it does warn you!!
ggplot(oceanbuoys,
aes(x = sea_temp_c,
y = humidity)) +
geom_point() +
labs(x = "Sea temperature (celsius)",
     y = "Humidity")
## Warning: Removed 94 rows containing
missing values (geom_point).
Consequences of missing data
• Need to find where data is missing
– Visualise the invisible!
• Need to decide what to do with what we don’t have
– Sometimes we actually need to wrangle values for the
missing data!
Handling missing data
• Data set was missing 60% of the data
• Looked at the context of the missing data
– Found the data was originally merged from different
sources about people and machines
• Developed new ways of exploring the missing data
– naniar and visdat packages in R allow you to see where
the data is missing
Dr Nick Tierney’s missing data
Interview with Nick Tierney, microcredential Course 2, Step 01.12
• Sometimes, a value is regarded as NaN
– Value is not empty
– Value is not a string or character
– Value is not a number
Missing value!
• This can also be the outcome of a calculation,
e.g., 0 / 0 = NaN,
So analysis can also produce NaNs
• Wrangling often needs to deal with NaNs in the data
is.nan(x)
NaN – Not a Number
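The same behaviour can be seen in Python, where math.isnan() is the analogue of R’s is.nan(); note that in Python 0 / 0 raises an error rather than producing NaN, so the NaN is built explicitly here:

```python
import math

nan = float("nan")        # build a NaN value explicitly

print(math.isnan(nan))    # True -- analogue of R's is.nan(x)
print(nan + 1)            # nan -- NaN propagates through calculations
print(nan == nan)         # False -- NaN never equals anything, even itself
```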
NA is not NaN
• Sometimes, a value is regarded as NA
– Value is not empty
– Value is not the expected type
– Value is Not Available
Missing value!
• Wrangling often needs to deal with NAs in the data
is.na(x)
• NA is not the same as NaN!
https://statisticsglobe.com/r-na/
https://statisticsglobe.com/r-na/
• Marks the location of missings in the original data table
• From naniar, bind_shadow() joins a shadow matrix to
the data
Shadow matrix in R
as_shadow(oceanbuoys)
oceanbuoys_shadow <- bind_shadow(oceanbuoys)
glimpse(oceanbuoys_shadow)
We can then use the shadow matrix to see how the missing
values relate to other variables in the table.
Shadow matrix in R
ggplot(oceanbuoys_shadow,
       aes(x = wind_ew,
           y = wind_ns,
           colour = air_temp_c_NA)) +
  geom_point(alpha = 0.7) +
  theme(aspect.ratio = 1) +
  scale_colour_brewer(palette = "Dark2") +
  labs(x = "East-West Winds",
       y = "North-South winds")
• The choice for missing values like NaNs is often whether to
  - omit them from the data, or
  - give them a value
Wrangling missing data
• If a small fraction of cases have several missings, drop the cases.
• If a variable or two, out of many, have a lot of missings, drop
the variables.
• If missings are small in number, but located in many cases and
variables, you need to impute these values (replace with
substituted values) to do most analyses.
Strategies for missing values
• Sometimes we need a value for every aspect of the dataset
  - machine learning
  - visualisation
• One method is to “impute” values we don’t know, based on those
we do.
  - Often a crude approximation, so use it with caution,
    e.g., calculate the mean or median of similar values.
Imputation
Missing data about air temperature in the oceanbuoys data
• Imputation from the mean
Imputation
Missing data about air temperature in the oceanbuoys data
• Imputation from the median
Imputation
Missing data about air temperature in the oceanbuoys data
• Imputation from the nearest neighbours
Imputation
…
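Mean and median imputation as described above can be sketched in Python (the unit uses R and naniar; the air_temp list is made up to stand in for the air-temperature column):

```python
import statistics

# Hypothetical column with missing values marked as None.
air_temp = [24.1, None, 23.8, 25.0, None, 24.5]

observed = [x for x in air_temp if x is not None]
mean_val = statistics.mean(observed)      # crude central estimate
median_val = statistics.median(observed)

# Impute: replace each missing entry with the mean (or median).
imputed = [mean_val if x is None else x for x in air_temp]
```

As the slides caution, this is a crude approximation that shrinks the variable’s variance, so use it with care.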