Bookmarks for April 3rd through April 4th

These are my links for April 3rd through April 4th:

  • Python for Business: Identifying Duplicate Data – 33 Sticks – Data Preparation is one of those critical tasks that most digital analysts take for granted, because many of the analytics platforms we use take care of it for us, or at least we like to believe they do. That said, Data Preparation should be a task that every good analyst completes as part of any data investigation.
    Wes McKinney, author of Python for Data Analysis, defines Data Preparation as “cleaning, munging, combining, normalizing, reshaping, slicing, dicing, and transforming data for analysis.”
    In this post, I am going to walk you through a real-world example, focusing on Data Preparation, of how Python can be a very powerful tool for business-focused data analysis.
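
As a rough illustration of the duplicate-identification idea in the excerpt above, here is a minimal sketch using only the standard library. It is not the linked post's code: the function name `find_duplicates`, the sample records, and the normalization rule (trim whitespace, lowercase) are all assumptions made for the example.

```python
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Return groups of records whose normalized key fields are identical."""
    groups = defaultdict(list)
    for record in records:
        # Assumed normalization: trim whitespace and lowercase, so trivial
        # formatting differences do not hide a duplicate.
        key = tuple(str(record[field]).strip().lower() for field in key_fields)
        groups[key].append(record)
    # Only groups with more than one record are duplicates.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data, invented for illustration.
customers = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

duplicates = find_duplicates(customers, ["name", "email"])
for group in duplicates:
    print(group)
```

Which fields make up the key, and how aggressively to normalize them, depends on the data set; real-world data usually needs more care than this.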
  • Data Mining: Finding Similar Items and Users – To find items similar to a given item, you first have to define what it means for two items to be similar, and this depends on the problem you're trying to solve:

      • On a blog, you may want to suggest similar articles that share the same tags, or that have been viewed by the same people who viewed the item you want to compare with.
      • Amazon has a section called "customers that bought this item also bought", which is self-explanatory.
      • A service like IMDB could use your ratings to find users similar to you, that is, users who liked or hated approximately the same movies you did, and then suggest movies you'd like to watch in the future.
    In each case you need a way to classify these items you're comparing, whether it is tags, or items purchased, or movies reviewed. We'll be using tags, as it is simpler, but the formula holds for more complicated instances.
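
The excerpt does not say which formula the linked article uses for tags. One common choice for comparing tag sets is the Jaccard index (shared tags divided by all distinct tags), so the sketch below assumes that measure. The article names and tags are invented for illustration.

```python
def jaccard_similarity(tags_a, tags_b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # Assumed convention: two untagged items are not similar.
    return len(a & b) / len(a | b)


def most_similar(target, articles):
    """Rank every other article by tag similarity to the target, best first."""
    scores = [
        (name, jaccard_similarity(articles[target], tags))
        for name, tags in articles.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical blog articles and their tags.
articles = {
    "intro-to-pandas": {"python", "pandas", "data"},
    "cleaning-data": {"python", "data", "cleaning"},
    "baking-bread": {"food", "baking"},
}

ranked = most_similar("intro-to-pandas", articles)
print(ranked)
```

The same function works for purchased items or reviewed movies, as long as each item can be described as a set.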

  • Implementing the Five Most Popular Similarity Measures in Python – Dataconomy – Similarity is the measure of how much alike two data objects are. Similarity in a data mining context is usually described as a distance with dimensions representing features of the objects. If this distance is small, there will be a high degree of similarity; if the distance is large, there will be a low degree of similarity. Similarity is subjective and is highly dependent on the domain and application. For example, two fruits may be considered similar because of color, size, or taste. Care should be taken when calculating distance across dimensions/features that are unrelated. The relative values of each feature must be normalized, or one feature could end up dominating the distance.
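
To make the normalization point concrete, here is a small sketch that compares Euclidean distance before and after min-max scaling. Euclidean distance and min-max scaling are choices made for this example, not necessarily the ones the linked article uses, and the fruit measurements are invented.

```python
import math


def min_max_normalize(rows):
    """Rescale each feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high != low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical fruits: [weight in grams, sweetness on a 1-10 scale].
fruits = [[150.0, 6.0], [170.0, 2.0], [1200.0, 7.0]]

# On raw values the weight difference (20 g) outweighs the sweetness
# difference (4 points) simply because grams are larger numbers.
raw_distance = euclidean_distance(fruits[0], fruits[1])

# After scaling, both features contribute on a comparable footing.
scaled = min_max_normalize(fruits)
scaled_distance = euclidean_distance(scaled[0], scaled[1])

print(raw_distance, scaled_distance)
```
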
  • Cosine Similarity Part 1: The Basics – Algorithms for Big Data – The business use case for cosine similarity involves comparing customer profiles, product profiles or text documents. The algorithmic question is whether two customer profiles are similar or not. Cosine similarity is perhaps the simplest way to determine this.

    If one can compare whether any two objects are similar, one can use the similarity as a building block to achieve more complex tasks, such as:

      • search: find the most similar document to a given one
      • classification: is some customer likely to buy that product
      • clustering: are there natural groups of similar documents
      • product recommendations: which products are similar to the customer’s past purchases
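
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch follows; the customer profiles (purchase counts per product category) are hypothetical, and returning 0.0 for an all-zero vector is an assumed convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Assumed convention for empty profiles.
    return dot / (norm_a * norm_b)


# Hypothetical customer profiles: purchase counts in three categories.
alice = [3, 0, 1]
bob = [6, 0, 2]    # Same proportions as alice, twice the volume.
carol = [0, 4, 0]  # No categories in common with alice.

alice_bob = cosine_similarity(alice, bob)
alice_carol = cosine_similarity(alice, carol)
print(alice_bob, alice_carol)
```

Because the measure depends only on direction, alice and bob come out as essentially identical even though bob buys twice as much, while alice and carol score zero.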

  • Harry Potter and the Methods of Rationality – Petunia married a professor, and Harry grew up reading science and science fiction.