Tools we love – OpenRefine


OpenRefine is an awesome open-source tool for working with messy data: understanding it, cleaning it up, transforming it and reconciling it into a ready-to-use form. It is currently maintained by Code for Science & Society. It’s a Java-based power tool that you use through your web browser, from the comfort and privacy of your own computer.

OpenRefine imports tabular data in a variety of formats, such as CSV, TSV, Excel, JSON, XML and RDF. The data can be loaded from a local file, a web address, the clipboard, a database and so on.

The OpenRefine web interface helps you explore and analyse the imported data and gain valuable insights from it. It helps you spot and remove inconsistencies in a dataset, understand anomalies in the data and zoom in on the parts you care about.
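
To make "removing inconsistencies" concrete, here is a minimal sketch in plain Python of the key-collision idea: spellings that normalize to the same key are grouped as likely duplicates. It is written in the spirit of the fingerprint clustering that OpenRefine offers, but it is not OpenRefine's own code. The function names and the sample values are assumptions made up for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-duplicate spellings collide."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, remove duplicate tokens, sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint; keep only groups with several spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Illustrative sample data (not from the article).
names = ["Acme Inc.", "acme inc", " ACME, Inc ", "Café Müller", "Cafe Muller", "Globex"]
clusters = cluster(names)

for key, members in clusters.items():
    print(key, "->", members)
```

Each group is a candidate for merging into a single spelling. A person should still review the groups, since two different things can share a key.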

Another useful feature is how OpenRefine helps in cleaning datasets into a consistent format and removing unwanted garbage data. It can split data into separate chunks and handle datasets with millions of rows, as long as your computer has enough memory for them. It can also transform data from one form to another.
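
The kind of clean-up and conversion described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how OpenRefine does it internally: it trims stray whitespace, drops rows with no name, splits a multi-valued cell into a list and writes the result out as JSON. The column names (`name`, `tags`), the `;` separator and the sample rows are assumptions.

```python
import csv
import io
import json

# Illustrative input: padded values, an empty row, and a multi-valued "tags" cell.
RAW_CSV = """name,tags
 Alice ,python;data
,
Bob,java
"""


def clean_rows(text):
    """Yield cleaned records one at a time, skipping rows without a name."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        name = (row.get("name") or "").strip()
        if not name:
            # Treat rows with no name as garbage and skip them.
            continue
        # Split the multi-valued cell into separate, trimmed values.
        tags = [t.strip() for t in (row.get("tags") or "").split(";") if t.strip()]
        yield {"name": name, "tags": tags}


records = list(clean_rows(RAW_CSV))

# Transform from one form (CSV) to another (JSON).
as_json = json.dumps(records, indent=2)
print(as_json)
```

Because `clean_rows` is a generator, rows are processed one at a time. Only the final `list(...)` holds everything in memory, which is the point where the size of the dataset starts to matter.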

OpenRefine can also be used to link and extend your dataset with various web services. This is called reconciling data, and it takes some thought before you apply it. Consider a dataset in which the input text comes in multiple languages. We can use Google Translate to detect which language each input text is written in and map it to a language code.
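
The language example boils down to "look up each distinct value once, then map the answer back onto every row". Below is a minimal sketch of that pattern in plain Python. It does not call Google Translate or any real service: `toy_detect` is a hypothetical stand-in with a tiny lookup table, and `map_languages` is an illustrative helper, not an OpenRefine function. In practice you would pass in a function that wraps whichever detection service you use.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def map_languages(
    texts: Iterable[str], detect: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Pair each text with a language code, calling `detect` once per distinct text."""
    cache: Dict[str, str] = {}
    result = []
    for text in texts:
        if text not in cache:
            # Only unseen values trigger a (possibly slow or rate-limited) lookup.
            cache[text] = detect(text)
        result.append((text, cache[text]))
    return result


def toy_detect(text: str) -> str:
    """Hypothetical stand-in for a real language detection service."""
    known = {"hello": "en", "bonjour": "fr", "hola": "es"}
    # "und" is the ISO 639 code for an undetermined language.
    return known.get(text.strip().lower(), "und")


# Illustrative sample column (not from the article).
column = ["Hello", "Bonjour", "Hello", "Hola", "???"]
mapped = map_languages(column, toy_detect)

for text, code in mapped:
    print(f"{text!r} -> {code}")
```

Caching by distinct value matters with web services, because a column with millions of rows often contains far fewer unique values than rows.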

OpenRefine is used by a lot of professional organisations and users. It’s open-sourced under the BSD license and loved by the community, with more than 7,500 GitHub stars.

So the next time you need to clean up or make sense of a messy dataset, do check out OpenRefine.

Thanks everyone, hope you’re staying safe and making cool things!


About the author:

Kurian Benoy is an SE-Data scientist at AOT Technologies. He is a Kaggle expert, with an interest in working on data science problems. He was a Google Code-in mentor for TensorFlow in 2019, and has worked with open-source organizations like Keras, DVC and Swathanthra Malayalam Computing.
In his free time, he likes bird watching and takes an interest in learning more about world history.