Big Data Spark Problem Solving (PySpark) Sample

At Programming Homework Tutors, we believe in providing our students with practical, real-world examples of how to apply the concepts they learn in class. That’s why we’ve developed a variety of sample projects to help you see how our courses can be used to create impactful solutions in your field of study.

Instructions

The Office of Foreign Assets Control (OFAC) publishes a list of sanctioned companies, individuals, and vessels that are prohibited from conducting any business with counterparties in the United States. An American company that violates this sanctions regime is likely to be heavily fined by the US government.

There are a number of other sanctions lists produced by foreign governments. The UK Treasury publishes a consolidated list of “asset freeze targets” that’s somewhat similar to OFAC’s list.

Processing semi-structured data like entries in these sanctions lists is an important part of what we do.

Task

For this assignment we’re also providing somewhat normalized versions of these datasets. The OFAC list corresponds to “ofac.jsonl.gz” and the UK Treasury list corresponds to “gbr.jsonl.gz”. These files are in gzipped JSON Lines format.
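Before involving Spark, it can help to peek at a few records with the Python standard library. The following is a minimal sketch, assuming each line of a file holds one JSON object; field_counts is a hypothetical helper name, not something the assignment provides.

```python
import gzip
import json
from collections import Counter


def field_counts(path, limit=1000):
    """Count how often each top-level field appears in the first `limit` lines."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= limit:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            # Assumption: every non-blank line is a JSON object (a dict).
            counts.update(json.loads(line).keys())
    return counts


# Example usage (file name taken from the assignment text):
# print(field_counts("ofac.jsonl.gz").most_common())
```

Fields that appear in only a fraction of the records are worth noting early, since they are weak candidates for matching.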

Although these files have mostly the same schema, there are some differences to watch out for (for example in how dates are formatted).
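One way to handle the date differences is to convert every date to a single canonical form before comparing. This is only a sketch: the candidate formats below are assumptions for illustration, and the real formats need to be confirmed by looking at the data.

```python
from datetime import datetime

# Assumed candidate formats; replace with the ones actually seen in the files.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def normalize_date(value):
    """Return the date as ISO 'YYYY-MM-DD', or None if no format matches."""
    if not value:
        return None
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Note that a value such as 03/04/1975 is ambiguous between day-first and month-first order, so the order of the formats in DATE_FORMATS is itself a decision that should be checked against each source.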

Your task:

Download the normalized versions of both lists. Take a look at the available fields.

Load each list into Spark.

Using Spark, find entities (people, companies, vessels, etc) that are present in both lists, and store these entities in an output file.
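The steps above can be sketched roughly as follows. This is a baseline under stated assumptions, not a complete solution: the field names id and name are guesses about the schema, normalize_name and match_lists are hypothetical helper names, and an exact join on a normalized name will miss spelling variants and aliases.

```python
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    if name is None:
        return None
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def match_lists(spark, ofac_path, uk_path):
    """Join the two lists on a normalized name key (assumed fields: id, name)."""
    # Imported here so normalize_name can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    to_key = F.udf(normalize_name, StringType())

    def prepare(path, id_alias):
        # spark.read.json reads JSON Lines and decompresses .gz files itself.
        df = spark.read.json(path)
        return df.select(
            F.col("id").alias(id_alias),
            to_key(F.col("name")).alias("name_key"),
        ).where(F.col("name_key").isNotNull() & (F.col("name_key") != ""))

    ofac = prepare(ofac_path, "ofac_id")
    uk = prepare(uk_path, "uk_id")
    return ofac.join(uk, on="name_key", how="inner").select(
        "ofac_id", "uk_id", "name_key"
    )


# Example usage with an existing SparkSession named `spark`:
# match_lists(spark, "ofac.jsonl.gz", "gbr.jsonl.gz").show(truncate=False)
```

A Python UDF keeps the normalization logic in one testable function at the cost of some speed; the same steps could be written with built-in column functions if performance matters. Calling printSchema() on each loaded DataFrame is a quick way to confirm the real field names before relying on the assumed ones.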

Some details about what we’d like to see:

For the output format, we’d like a dataset where each row represents a single entity found in both input datasets. There should be at least two columns that indicate the matched entity from the original datasets. The first column should be the id from OFAC’s list under the name ofac_id, and the second column should be the id from the UK list under the name uk_id.

In addition to these ID fields, please provide some kind of information about why the two entities are the same. You can use whatever schema you find easiest to convey this information. Just give us something that indicates why an entity from the OFAC list was matched with an entity from the UK list.
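As one possible shape for this output, each row could carry the two IDs plus a list of the fields that agreed. The sketch below is illustrative only: the field names (id, name, date_of_birth, nationality) and the match_reasons column are assumptions, and explain_match and to_output_row are hypothetical helpers.

```python
def explain_match(ofac_record, uk_record,
                  fields=("name", "date_of_birth", "nationality")):
    """Return the names of fields whose non-empty values agree, ignoring case."""
    reasons = []
    for field in fields:
        left = ofac_record.get(field)
        right = uk_record.get(field)
        if left and right:
            if str(left).strip().lower() == str(right).strip().lower():
                reasons.append(field)
    return reasons


def to_output_row(ofac_record, uk_record):
    """Build one output row: both IDs plus the reasons for the match."""
    return {
        "ofac_id": ofac_record["id"],
        "uk_id": uk_record["id"],
        "match_reasons": explain_match(ofac_record, uk_record),
    }
```

Listing the agreeing fields makes each match easy to audit by hand, and the length of the list gives a rough confidence signal.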

In addition to your Spark application, please send us your final output file.

For the Spark application, we’ll accept a PySpark (.py) file.

We’ll be testing your application in local mode with something like the following command:


Please give us a list of third-party dependencies that you used. For Python dependencies we’d like a txt file. For Spark a list of packages to pass under the --packages option would be great.

Let us know the exact Spark command we should use to run your application.

You are encouraged to use whatever third-party libraries you like, as long as we can easily reproduce your solution. No need to reinvent the wheel.

Final note: Finding entities in both datasets is meant to be challenging, since there is no perfect shared unique identifier in both datasets. There’s not a perfect answer for this task, but we are interested in seeing a sensible approach to the problem with a good balance between quality and quantity of matches. While we aren’t expecting a production-ready unit-tested repository, we will be looking for fundamentally good code quality and understandable code.
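To illustrate the trade-off between quality and quantity of matches, here is a small sketch of approximate name comparison using only the standard library. It is one option among many, not a recommended method: the function names are hypothetical, and the 0.85 threshold is an arbitrary starting point that would need tuning against real pairs.

```python
from difflib import SequenceMatcher


def token_sort_key(name):
    """Lowercase the name and sort its tokens so word order does not matter."""
    return " ".join(sorted(name.lower().split()))


def name_similarity(left, right):
    """Similarity in [0, 1] between two names, ignoring case and token order."""
    return SequenceMatcher(
        None, token_sort_key(left), token_sort_key(right)
    ).ratio()


def is_match(left, right, threshold=0.85):
    """Raising the threshold gives fewer, safer matches; lowering it gives more."""
    return name_similarity(left, right) >= threshold
```

Comparing every OFAC entry against every UK entry this way is quadratic, so in Spark it would usually be applied only to candidate pairs that already share a cheap key, such as a birth year or the first token of the name.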

Disclaimer

The sample projects provided on our website are intended to be used as a guide and reference for educational purposes only. While we have made every effort to ensure that the projects are accurate and up-to-date, we do not guarantee their accuracy or completeness. The projects should be used at your own discretion, and we are not responsible for any loss or damage that may result from their use.
