29 Apr Big Data Spark Problem Solving (PySpark) Sample
At Programming Homework Tutors, we believe in providing our students with practical, real-world examples of how to apply the concepts they learn in class. That’s why we’ve developed a variety of sample projects to help you see how our courses can be used to create impactful solutions in your field of study.
Instructions
The Office of Foreign Assets Control (OFAC) publishes a list of sanctioned companies, individuals, and vessels that are prohibited from conducting any business with counterparties in the United States. If an American company violates this sanctions regime, it is likely to be heavily fined by the US government.
There are a number of other sanctions lists produced by foreign governments. The UK Treasury publishes a consolidated list of “asset freeze targets” that’s somewhat similar to OFAC’s list.
Processing semi-structured data like entries in these sanctions lists is an important part of what we do.
Task
For this assignment we’re also providing somewhat normalized versions of these datasets. The OFAC list corresponds to “ofac.jsonl.gz” and the UK Treasury list corresponds to “gbr.jsonl.gz”. These files are in gzipped JSON Lines format.
Although these files have mostly the same schema, there are some differences to watch out for (for example in how dates are formatted).
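One way to deal with the date-format differences is to convert every date to a single representation before comparing records. The sketch below is only an illustration: the helper name normalize_date and the formats listed in CANDIDATE_FORMATS are assumptions, not formats confirmed for either file, so check them against the actual data first.

```python
from datetime import datetime
from typing import Optional

# Hypothetical formats; inspect the real files to see which ones actually occur.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def normalize_date(raw: Optional[str]) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no candidate format fits."""
    if not raw:
        return None
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None for unparseable values keeps bad dates from producing accidental matches later on; a helper like this could be wrapped in a Spark UDF and applied to both lists.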
Your task:
Download the normalized versions of both lists. Take a look at the available fields.
Load each list into Spark.
Using Spark, find entities (people, companies, vessels, etc) that are present in both lists, and store these entities in an output file.
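The three steps above could be sketched roughly as follows. This is a minimal starting point, not a complete solution: it assumes each record has an id field and a name field (the name field in particular is a guess, so check the real schema), and it only finds entities whose names are identical after simple normalization. The function names are hypothetical.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    if name is None:
        return None
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def find_exact_name_matches(spark, ofac_path, uk_path):
    """Join the two lists on a normalized name (assumed fields: id, name)."""
    # Imported here so the pure-Python helper above stays usable without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    normalize_udf = F.udf(normalize_name, StringType())

    def load(path, id_alias):
        # spark.read.json expects one JSON object per line by default and
        # decompresses .gz files based on the file extension.
        return (
            spark.read.json(path)
            .select(
                F.col("id").alias(id_alias),
                normalize_udf(F.col("name")).alias("name_key"),
            )
            .filter(F.col("name_key") != "")  # also drops null keys
        )

    ofac = load(ofac_path, "ofac_id")
    uk = load(uk_path, "uk_id")
    return ofac.join(uk, on="name_key", how="inner")
```

With a local session created through SparkSession.builder.master("local[*]").getOrCreate(), calling find_exact_name_matches(spark, "ofac.jsonl.gz", "gbr.jsonl.gz") would return a DataFrame with name_key, ofac_id, and uk_id columns. Exact joins like this tend to favor precision over recall, so they are usually only a first pass.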
Some details about what we’d like to see:
For the output format, we’d like a dataset where each row represents a single entity found in both input datasets. There should be at least two columns that indicate the matched entity from the original datasets. The first column should be the id from OFAC’s list under the name ofac_id, and the second column should be the id from the UK list under the name uk_id.
In addition to these ID fields, please provide some kind of information about why the two entities are the same. You can use whatever schema you find easiest to convey this information. Just give us something that indicates why an entity from the OFAC list was matched with an entity from the UK list.
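As one possible shape for such an output row, the sketch below keeps the two required ID columns and adds a list of the fields that agreed, along with the value from each side. The column name match_evidence, the helper build_output_row, and the sample records are all hypothetical; any schema that explains the match would do.

```python
import json


def build_output_row(ofac, uk, shared_fields):
    """Build one output row recording which fields supported the match."""
    return {
        "ofac_id": ofac["id"],
        "uk_id": uk["id"],
        "match_evidence": [
            {"field": f, "ofac_value": ofac.get(f), "uk_value": uk.get(f)}
            for f in shared_fields
        ],
    }


# Made-up records, for illustration only.
ofac_record = {"id": 101, "name": "acme shipping"}
uk_record = {"id": 202, "name": "acme shipping"}
row = build_output_row(ofac_record, uk_record, ["name"])
print(json.dumps(row))
```

Keeping both original values next to each matched field makes it easy for a reviewer to spot-check a match without going back to the input files.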
In addition to your Spark application, please send us your final output file.
For the Spark application, we’ll accept a PySpark (.py) file.
We’ll be testing your application in local mode with something like the following command:
Please give us a list of third-party dependencies that you used. For Python dependencies we’d like a txt file. For Spark, a list of packages to pass under the --packages option would be great.
Let us know the exact Spark command we should use to run your application.
You are encouraged to use whatever third-party libraries you like, as long as we can easily reproduce your solution. No need to reinvent the wheel.
Final note: Finding entities in both datasets is meant to be challenging, since there is no perfect shared unique identifier in both datasets. There’s not a perfect answer for this task, but we are interested in seeing a sensible approach to the problem with a good balance between quality and quantity of matches. While we aren’t expecting a production-ready unit-tested repository, we will be looking for fundamentally good code quality and understandable code.
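The balance between quality and quantity of matches can be made concrete with a similarity threshold. The sketch below uses SequenceMatcher from Python's standard difflib module as a stand-in for whatever string-similarity measure you prefer; the function names and the default threshold of 0.9 are assumptions chosen for illustration, not recommended values.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names (ideally already normalized)."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two names as the same entity when they are similar enough."""
    return name_similarity(a, b) >= threshold
```

Raising the threshold generally yields fewer but more reliable matches, while lowering it yields more matches with more false positives. Comparing every OFAC entry against every UK entry this way grows with the product of the list sizes, so in Spark it may help to first restrict candidate pairs, for example to records that share an entity type or a name token.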
Disclaimer
The sample projects provided on our website are intended to be used as a guide and reference for educational purposes only. While we have made every effort to ensure that the projects are accurate and up-to-date, we do not guarantee their accuracy or completeness. The projects should be used at your own discretion, and we are not responsible for any loss or damage that may result from their use.