An Experiment in Natural Language Processing, Machine Learning, and Islamic Law :: Part 2 ::

By Mairaj Syed


I initially decided to divide the Testimony chapter into 7-gram word fragments, because the original evidence canon consists of seven Arabic words. This created a list of 38,683 7-gram fragments. Being ambitious and hoping to be lucky, I decided to send the Google service 1,000 texts at a time. It worked once or twice, but timed out most times. So I kept reducing the batch size until I could reliably send about 500 at a time. It took several hours to process all of them. Below is a selection of the 19 7-gram texts that ranked most similar to the evidence canon:

The results are promising, and all 38,683 scores can be found here. Note that a standard textual search would not have found any texts, because the text searched for, “البينة على المدعي واليمين على من أنكر”, does not exist in exactly this form in Qarāfī’s work. You can also see that rows 1-4, which have the highest similarity scores, may refer to the exact same passage in Qarāfī’s work. What is unclear is why row 5, with a similarity score of 0.7152, would rank higher than row 8.[1] These results need further interpretation, but I think they were promising enough to move forward with the search for the doubt canon in the criminal law chapter without drastic modification of my technique.
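The fragmenting step used for this search can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebook: the function name `make_ngrams` and the `stride` parameter are hypothetical, and it assumes that fragments are overlapping windows over whitespace-separated words (a stride of 1), which the post does not state explicitly.

```python
def make_ngrams(text, n=7, stride=1):
    """Split text on whitespace and return word n-grams as strings.

    stride=1 yields overlapping windows (an assumption, not confirmed
    by the original experiment); stride=n yields non-overlapping ones.
    """
    words = text.split()
    if len(words) < n:
        return []  # text too short to form a single n-gram
    last_start = len(words) - n
    return [" ".join(words[i:i + n]) for i in range(0, last_start + 1, stride)]


# Toy input: nine placeholder "words" produce three overlapping 7-grams.
sample = "w1 w2 w3 w4 w5 w6 w7 w8 w9"
fragments = make_ngrams(sample, n=7)
for fragment in fragments:
    print(fragment)
```

Changing `n` to 20 gives the larger fragments used for the criminal law chapter; real Arabic text would likely need cleaning (diacritics, punctuation) before splitting.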

When I saw the results on the evidence canon, I decided that the 7-gram fragments may have been too small. I divided the criminal law book into 20-gram fragments, of which there were 43,341. I found that I could send only about 300 texts at a time without regularly timing out, perhaps because the fragments are larger. Ultimately, it took about 2 hours to get similarity scores for all of them, and the results are even more promising than in the previous case:

The results for all 43,341 fragments can be found here. The form of the canon that I asked GSSS to return similarity scores for was tudraʾ ḥudūd bi ʾl-shubuhat. The top five results definitely contain this idea. What inspires great confidence in the method is the fact that the top results include fragments in which Qarāfī uses the verb yusqaṭ as a synonym entirely replacing the verb tudraʾ! Reading through the rest of the results, one definitely gets the sense that GSSS is returning fragments that in some sense talk about exceptional circumstances that either lessen or suspend punishment, even if doubt is not the driving consideration. For example, results 11, 13, 16, 18, and 19 (all probably from the same passage) talk about how individuals compelled by necessity or coercion are not to be given the ḥudūd punishments.
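The batching used in both searches (a few hundred fragments per request, with the batch size reduced whenever the service timed out) was tuned by hand, but the same idea can be automated. The sketch below is hypothetical: `score_in_batches` and `fake_service` are invented names, the real GSSS request code lives in the notebook and is not reproduced here, and the stand-in service simply times out on batches larger than 300.

```python
def score_in_batches(canon, fragments, score_batch, batch_size=1000, min_batch=1):
    """Score every fragment against the canon, halving the batch on timeouts.

    score_batch(canon, batch) is a caller-supplied function (hypothetical
    here) that returns one score per fragment or raises TimeoutError.
    """
    scores = []
    i = 0
    while i < len(fragments):
        batch = fragments[i:i + batch_size]
        try:
            scores.extend(score_batch(canon, batch))
            i += len(batch)  # advance only after a successful request
        except TimeoutError:
            if batch_size <= min_batch:
                raise  # cannot shrink further; give up
            batch_size = max(min_batch, batch_size // 2)
    return scores


def fake_service(canon, batch):
    """Stand-in for the remote service: times out on large batches."""
    if len(batch) > 300:
        raise TimeoutError("simulated timeout")
    return [0.0 for _ in batch]  # placeholder scores, not real results


toy_fragments = [f"fragment {k}" for k in range(1234)]
toy_scores = score_in_batches("canon text", toy_fragments, fake_service)
print(len(toy_scores))
```

A real run would also want to save partial results to disk, so that a failure several hours in does not lose the scores already obtained.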

The other thing that may be noticed in comparing the absolute similarity values for the two canons is that the results for the second canon score about 30 points lower than those for the first. Why is this the case? I suspect that it is a function of the larger n-grams supplied in the case of the second canon compared to the first (20 vs. 7).[2] For our purposes, though, the absolute values do not matter. We just need relative values that provide a ranking of similarity across all of the searched-against texts we feed GSSS. The few that rise to the top should definitely contain the canon we are looking for, and the rest of the fragments should not.
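The point that only the ranking matters can be illustrated with a small sketch. The function name `top_k` and the toy scores are made up for illustration. Subtracting a constant from every score, as if a whole run had scored 30 points lower, leaves the order of the top fragments unchanged.

```python
import heapq


def top_k(fragments, scores, k=19):
    """Return the k (score, fragment) pairs with the highest scores."""
    return heapq.nlargest(k, zip(scores, fragments), key=lambda pair: pair[0])


# Toy data, not results from the experiment.
toy_fragments = ["a", "b", "c", "d"]
toy_scores = [0.90, 0.50, 0.70, 0.60]

ranked = top_k(toy_fragments, toy_scores, k=2)

# Shift every score down by the same amount: the ranking is unaffected.
shifted = [s - 0.30 for s in toy_scores]
ranked_shifted = top_k(toy_fragments, shifted, k=2)

print([frag for _, frag in ranked])
print([frag for _, frag in ranked_shifted])
```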

For a search based on semantic similarity scores to be viable, we must be confident that the highest scores do in fact cluster together and are dramatically different from the rest of the scores. The assumption is that, of the tens of thousands of fragments that make up each chapter, only a handful come anywhere close to being semantically similar in a meaningful way to the canon. It is difficult to investigate that by looking at just 19 rows of results. But we can plot a histogram that shows the distribution of the similarity scores, and we should find that the top range of scores contains very few fragments in comparison to similarly sized ranges elsewhere in the distribution. This is what we find:

The 19 top results for the evidence canon, which range from 0.78 down to 0.61, are mostly contained in the last three bins (far right) of the histogram depicting the distribution of the scores on the evidence canon. In contrast, the most populated bin (between 0.16 and 0.22) holds 9,757 values. A similar picture emerges when looking at the distribution of the similarity values of the doubt canon. The histograms confirm that the fragments with the highest similarity scores are few in number and clustered together within a range that is quite distant from all other scores.
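The histogram check can be reproduced without any plotting library by counting scores per bin. The following is a minimal sketch with toy data, not the experiment's scores. The function name `histogram`, the choice of ten equal-width bins over [0, 1], and the "last three bins" cutoff are all assumptions made for illustration.

```python
def histogram(scores, n_bins=10, lo=0.0, hi=1.0):
    """Count scores in n_bins equal-width bins over [lo, hi].

    Values outside the range are clamped into the first or last bin.
    A value lying exactly on a bin edge may land in either neighbour
    because of floating-point rounding.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        idx = int((s - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy distribution: a large low-scoring bulk and a tiny high-scoring tail.
toy_scores = [0.18] * 50 + [0.25] * 30 + [0.45] * 15 + [0.72, 0.75, 0.78]

counts = histogram(toy_scores)
tail = sum(counts[-3:])  # fragments in the three right-most bins
tail_share = tail / len(toy_scores)

print(counts)
print(tail, round(tail_share, 3))
```

If the method is working, `tail_share` should be very small for the real score lists as well, with the known top fragments sitting in that tail.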

Future Exploration

At a minimum, the results suggest that a semantic similarity approach is a fruitful avenue for further efforts in developing a tool that automatically discovers the use of explicit canons in the fiqh corpus. But I’m not sure that using the Google service, at least in its current form, is the way to go, due to issues related to its performance, potential costs, and the accuracy of results. We supplied 38,683 and 43,341 fragments from two chapters of a single work of positive law and sought to discover two canons. Given that there are hundreds of works of Islamic law, the number of fragments they would generate in total would be orders of magnitude greater than what we handled in this experiment. Let’s assume that there are 10 million 20-gram fragments in Islamic legal literature. One author of a modern encyclopedia of canons has identified just over 900 of them. Given that we would ask GSSS to compare each canon against all 10 million 20-gram fragments, we would want it to give us similarity scores for 9 billion text-pairs. This number is large, but not insurmountable, especially because once similarity scores have been computed, they need not be computed again. GSSS right now is experimental, and for that reason it was free. I would also surmise that Google has not devoted many computational resources to a product that it is currently just testing out. Presumably, in the near future it will be offered as a commercial service. One may then simply purchase more computing power to generate the scores in a smaller amount of time. But I have no idea how much that would ultimately cost.
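To put the 9 billion figure in perspective, here is a back-of-the-envelope calculation. It uses only numbers from this post (900 canons, an assumed 10 million fragments, and roughly 43,341 pairs scored in about 2 hours) and assumes, purely for illustration, that throughput would stay at the rate observed with the free experimental service.

```python
canons = 900                 # canons identified in the encyclopedia (approx.)
fragments = 10_000_000       # assumed number of 20-gram fragments
pairs = canons * fragments   # canon-fragment pairs to score

# Throughput observed in the 20-gram run: about 43,341 pairs in about 2 hours.
observed_pairs = 43_341
observed_hours = 2
pairs_per_hour = observed_pairs / observed_hours

hours_needed = pairs / pairs_per_hour
years_needed = hours_needed / (24 * 365)

print(f"{pairs:,} pairs")
print(f"{hours_needed:,.0f} hours at the observed rate")
print(f"about {years_needed:.0f} years of continuous requests")
```

At that rate the job would take hundreds of thousands of hours, which is why a paid service with far more capacity, or locally computed embeddings, would be needed to make the full search practical.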

While GSSS is certainly one method through which we may generate similarity scores, it is not the only method, nor is it necessarily the most accurate one. First, we don’t quite know how Google is generating the similarity scores; their documentation on this issue was virtually non-existent or extremely difficult to find. More specifically, we don’t know how they generate the numerical representations of texts (called embeddings in NLP and ML), which they surely rely on in order to compute the similarity between the texts. Numerical representations of texts allow us to perform mathematical operations on them, such as assigning phrases scores that capture the extent of their similarity to some other phrase. Generating embeddings involves many parameters and decisions, and the accuracy of embeddings can be experimentally tested to determine the best set of parameters for a given task. While there has been much research on which of those decisions yield better or worse embeddings in English, this is not so much the case for other languages. In fact, much recent research in Arabic NLP has emerged to tackle this precise problem, and has shown that, given the structural differences between Arabic and English, embedding techniques that take cognizance of these differences and modify their algorithms accordingly perform better.[3] We don’t know the extent to which Google’s service relies on this recent research, so we don’t know if we could get better results if we just generated the word embeddings ourselves.
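To make the idea of computing similarity from embeddings concrete, here is a minimal sketch of one common approach: average the word vectors of each phrase and compare the averages with cosine similarity. This is not a description of how GSSS works, which is unknown. The function names and the tiny three-dimensional vectors are invented for illustration, and real embeddings have hundreds of dimensions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors, component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up word vectors; in practice these would come from a trained model.
toy_embeddings = {
    "tudra": [0.9, 0.1, 0.0],
    "yusqat": [0.8, 0.2, 0.1],
    "hudud": [0.1, 0.9, 0.2],
    "kitab": [0.0, 0.1, 0.9],
}

canon_vec = mean_vector([toy_embeddings["tudra"], toy_embeddings["hudud"]])
near_vec = mean_vector([toy_embeddings["yusqat"], toy_embeddings["hudud"]])
far_vec = mean_vector([toy_embeddings["kitab"], toy_embeddings["kitab"]])

sim_near = cosine_similarity(canon_vec, near_vec)
sim_far = cosine_similarity(canon_vec, far_vec)
print(sim_near > sim_far)
```

In this toy setup a phrase that swaps in a near-synonym stays close to the canon, while an unrelated phrase does not, which is the behavior one would hope for from embeddings trained on the relevant corpus.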

There is another potential problem with even the word embeddings generated by recent Arabic NLP approaches. They rely largely on modern standard Arabic (either Arabic websites or Arabic Wikipedia) or Arabic Twitter. Of course, embeddings generated from the former are much more appropriate, but I wonder whether word embeddings that give greater weight to premodern classical Arabic texts might produce better results in tasks such as semantic similarity. Furthermore, given that Islamic legal literature is a technical discourse, in which words often mean something very different from, and more precise than, their conventional uses outside of it, generating word embeddings by giving greater weight to Islamic legal texts might also lead to better results for our research purposes.

The question about which method to use to locate canons in Islamic legal literature is one that can be answered through collaborative, experimental research. We can test to see which method gives us the best results in the most efficient manner possible for the research purposes at hand.

Of course, nothing that I have said is limited to canons or Islamic law, per se. The applications can be extended to other systematic Islamic discourses, such as theology, philosophy, literature, poetry, or mysticism. Nor is the technique, at the general level, usable only on Islamic material in Arabic. Nor must it be limited to simply discovering semantically similar phrases in a corpus. Any time a scholar wonders whether there exists an idea similar to the one she is considering in some other text, semantic similarity search is a potential technique for discovering the answer. The recent past has seen two promising developments: an increase in research on Arabic NLP/ML and the steady introduction of computational techniques to the broader public at relatively low cost. The Islamic studies community can capitalize on these two developments and build tools that may be used to ask entirely new research questions, and just maybe even settle some enduring ones as well. But given the complexity of the task and the experimental nature of this project, it will have to be undertaken through collaboration between many people who have different skill sets: Arabic NLP and machine learning experts will need to work hand in hand with those who have in-depth knowledge of the different domains of Islamic studies. If you are interested in such a project, reach out to me. Again, you can find the Jupyter Python notebook here, which shows how I obtained and cleaned the texts, managed the process of sending GSSS the canon and the fragments, and recorded the results. For the accompanying video walk-through see this.


[1] I suspect that, for whatever reason, GSSS is giving too much importance to the word ‘قال/qāla’ in row eight. Given the fact that in most cases it is akin to an opening quotation mark in premodern Arabic, and therefore often ubiquitous, it should not be given much weight in making determinations of semantic similarity, especially for the specific purposes of discovering the explicit citation of canons, which are often going to be preceded by the word ‘قال/qāla’.

[2] This suspicion is partly confirmed by further experiment. I re-ran the evidence canon against the Evidence/Testimony chapter using 20-grams instead of 7-grams and GSSS returned much lower absolute values.

[3] See MADAMIRA and Farasa, two Arabic NLP packages that perform much of the necessary preparatory work on a corpus before embeddings may be obtained from it. For a pre-trained set of embeddings obtained from Arabic Wikipedia, Tweets, and Arabic web pages, see AraVec.
